Using Python's finditer to highlight search items
I am trying to search through various text and highlight certain search terms within that text using HTML markup. As an example, if I take a paragraph of text from Paul Prescod's essay, I would like to highlight the search terms "lisp", "python", "perl", "java", and "C" each in different colors. My first attempt at this problem looked something like:
for sentence in re.split(r"[?.]\s+", text):
    match = re.search(r"\blisp\b", sentence, re.I)
    if match:
        color = 'red'
    else:
        match = re.search(r"\bpython\b", sentence, re.I)
        if match:
            color = 'blue'
        else:
            match = re.search(r"\bperl\b", sentence, re.I)
            if match:
                color = 'orange'
I didn't finish it because, not only is it ugly and verbose, it doesn't do what I want. Instead of matching all the search terms, it only matches the first one in each sentence. Fortunately, I took some time to rethink the problem (i.e. search the internet (this thread on the Python mailing list was helpful (I guess my Perl background is still showing) as was this article which I previously referenced. (hmmm, this is starting to look like Lisp.))) and made a prettier (and correct) version using my new favorite regular expression method, finditer, and the MatchObject's lastindex attribute. Here is the working example:
import re
COLOR = ['red', 'blue', 'orange', 'violet', 'green']
text = """Graham says that Perl is cooler than Java and Python than Perl. In some circles, maybe. Graham uses the example of Slashdot, written in Perl. But what about Advogato, written in C? What about all of the cool P2P stuff being written in all three of the languages? Considering that Perl is older than Java, and was at one time the Next Big Language, I think you would have a hard time getting statistical evidence that programmers consider Perl "cooler" than Java, except perhaps by virtue of the fact that Java has spent a few years as the "industry standard" (and is thus uncool for the same reason that the Spice Girls are uncool) and Perl is still "underground" (and thus cool, for the same reason that ambient is cool). Python is even more "underground" than Perl (and thus cooler?). Maybe all Graham has demonstrated is that proximity to Lisp drives a language underground. Except that he's got the proximity to Lisp argument backwards too."""
regex = re.compile(r"(\blisp\b)|(\bpython\b)|(\bperl\b)|(\bjava\b)|(\bc\b)", re.I)
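i = 0
output = "<html>"
for m in regex.finditer(text):
    # Append the non-matching text since the previous match, then the
    # match itself wrapped in a colored <span>.
    output += "".join([text[i:m.start()],
                       "<strong><span style='color:%s'>" % COLOR[m.lastindex-1],
                       text[m.start():m.end()],
                       "</span></strong>"])
    i = m.end()
# Append the text after the last match (using i also works when nothing matched).
print("".join([output, text[i:], "</html>"]))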
The loop iterates over every match returned by finditer. For each match, the non-matching text and the matching text surrounded with the HTML <span> tag are appended to the output string. start() and end() return the indices of the start and end positions of the matching text. The color of the text is determined by using lastindex to index into a list of colors. lastindex is the index of the last capturing group that matched. So, it is 1 if "lisp" is matched, 2 if "python" is matched, 3 if "perl" is matched, and so on. I need to subtract 1 because list indexing starts at 0. The last line adds on the rest of the non-matching text and prints it. When viewed in a browser, it looks something like this:

Graham says that Perl is cooler than Java and Python than Perl. In some circles, maybe. Graham uses the example of Slashdot, written in Perl. But what about Advogato, written in C? What about all of the cool P2P stuff being written in all three of the languages? Considering that Perl is older than Java, and was at one time the Next Big Language, I think you would have a hard time getting statistical evidence that programmers consider Perl "cooler" than Java, except perhaps by virtue of the fact that Java has spent a few years as the "industry standard" (and is thus uncool for the same reason that the Spice Girls are uncool) and Perl is still "underground" (and thus cool, for the same reason that ambient is cool). Python is even more "underground" than Perl (and thus cooler?). Maybe all Graham has demonstrated is that proximity to Lisp drives a language underground. Except that he's got the proximity to Lisp argument backwards too.
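The same idea can be wrapped into a reusable function. The following is only a sketch, not part of the original example: the function name highlight, its parameters, and the use of html.escape (Python 3) to escape the surrounding text are my own assumptions. It also assumes the search terms are plain words, since \b only behaves as expected next to word characters.

```python
import re
from html import escape  # Python 3 only; an addition not in the original example


def highlight(text, terms, colors):
    """Wrap each whole-word occurrence of a term in a colored <span>.

    Hypothetical helper: the name and signature are illustrative.
    """
    # One capturing group per term, so m.lastindex tells us which term matched.
    pattern = "|".join(r"(\b%s\b)" % re.escape(term) for term in terms)
    regex = re.compile(pattern, re.I)

    pieces = []
    pos = 0
    for m in regex.finditer(text):
        # Non-matching text since the previous match, escaped for HTML.
        pieces.append(escape(text[pos:m.start()], quote=False))
        # lastindex is 1-based; wrap around if there are more terms than colors.
        color = colors[(m.lastindex - 1) % len(colors)]
        pieces.append("<strong><span style='color:%s'>%s</span></strong>"
                      % (color, escape(m.group(), quote=False)))
        pos = m.end()
    # Remaining text after the last match (the whole text if nothing matched).
    pieces.append(escape(text[pos:], quote=False))
    return "".join(pieces)


if __name__ == "__main__":
    print(highlight("Perl and C < Lisp",
                    ["lisp", "perl", "c"],
                    ["red", "blue", "green"]))
```

Building the pattern from a list keeps the group order and the color order in sync, so adding a search term only means adding one entry to each list.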
Comments
Good advice on ORing the different terms into one regex. I started something like this as you did; with multiple searches. This is much cleaner. In Java I don't see an equivalent of lastindex, but a dict/map of terms to colors works fine.
cool, i was seeking for a text highlight :)