Python finditer regular expression example
I often process text line by line using the splitlines() method with a for loop. This works great most of the time, but sometimes the text is not neatly divisible into lines, or I need to match multiple items per line. This is where the re module's finditer function can help. finditer returns an iterator over all non-overlapping matches for the regular expression pattern in the string. (See docs.) It is a powerful tool for text processing and one that I don't use often enough.
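As a minimal sketch of that difference (written for Python 3; the sample text and the variable names are made up for illustration), the snippet below runs finditer over a whole string in which one phrase straddles a line break, then repeats the search line by line for comparison:

```python
import re

# Hypothetical sample text: "the quick" is split across a line break.
text = "See the\nquick fox and the lazy dog and the end"

pattern = re.compile(r"\bthe\s+\w+")

# finditer scans the whole string, so \s+ can match the newline too.
matches = [(m.start(), m.group(0)) for m in pattern.finditer(text)]

# Searching each line separately misses the phrase that spans two lines.
line_matches = [m.group(0)
                for line in text.splitlines()
                for m in pattern.finditer(line)]

for start, phrase in matches:
    print("%d: %r" % (start, phrase))
print("line by line: %r" % line_matches)
```

The whole-string search finds three phrases, while the line-by-line search finds only the two that sit entirely on one line.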
Here is a simple example that demonstrates the use of finditer. It reads in a page of HTML text, finds all the occurrences of the word "the", and prints "the" and the following word. It also prints the character position of each match using the MatchObject's start() method. (See docs.) Note that, for simplicity, I didn't mess with the HTML tags at all. I just pretended it was plain text. Oh, and the example text is taken from Steve Yegge's article: How To Make a Funny Talk Title Without Using The Word "Weasel"
Python code:
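# Python 2 (uses urllib2 and the print statement)
import re
import urllib2

# Download the page and treat the raw HTML as plain text
html = urllib2.urlopen('http://steve-yegge.blogspot.com/2007/08/how-to-make-funny-talk-title-without.html').read()

# "the", the following word, then at least one whitespace character
pattern = r'\b(the\s+\w+)\s+'
regex = re.compile(pattern, re.IGNORECASE)

# Print the character position and the matched text of each match
for match in regex.finditer(html):
    print "%s: %s" % (match.start(), match.group(1))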
Results:
1301: The Word
12291: The Word
13367: the cut
14025: the car
15050: the free
15513: the third
15558: the sessions
15617: the ONLY
15684: the ground
15911: the OSI
15933: The Attack
16051: The gist
16115: the term
16178: the creator
16741: the thing
16850: the same
16877: the thing
16942: the next
17131: the talk
17374: the room
17727: the hell
17782: the term
17830: the 1980s
18083: the whole
18158: the same
18230: the mountain
18305: the seat
18537: The pro
18718: the banner
18928: the poor
19006: the midst
19223: the buzzwagon
19326: the source
19437: the OSI
19855: the OSI
19927: the other
20055: the Ten
20404: The 22
20517: the OSI
20616: the book
21098: the collective
21553: the proposed
21681: the Five
21932: the nearest
22690: The rest
22858: the entertaining
23255: the crap
23561: the next
23661: the registration
23963: the registration
24114: the restaurant
24289: the people
24456: the second
24597: the current
24871: The Style
24929: the front
25047: the curtain
25132: the movie
25159: The hospital
25249: the night
25881: the way
25892: the rear
25927: the crowd
26194: the podium
26262: the front
26521: the door
26593: the front
26622: The economist
27128: the thing
27228: The next
27290: the Pirate
27409: the material
27461: the crowd
27621: the next
27916: The technician
28084: the way
28487: the technician
28735: the exciting
35709: The Next
36587: The Pinocchio
45436: the Kingdom
45679: The Truth
51623: the same
52526: The Word
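The script above targets Python 2 (urllib2 and the print statement). A rough Python 3 sketch of the same idea follows. The helper name find_the_phrases and the sample string are hypothetical and not part of the original script. The commented-out download uses urllib.request.urlopen, the Python 3 counterpart of urllib2.urlopen, and the utf-8 decoding is an assumption about the page:

```python
import re

# Same pattern as the original script: "the", the following word,
# and at least one trailing whitespace character.
REGEX = re.compile(r'\b(the\s+\w+)\s+', re.IGNORECASE)


def find_the_phrases(text):
    """Return (start, phrase) pairs for each match. Hypothetical helper."""
    return [(m.start(), m.group(1)) for m in REGEX.finditer(text)]


# To run against a live page (needs network access), something like:
#   from urllib.request import urlopen
#   html = urlopen(url).read().decode('utf-8', errors='replace')
# Here a made-up sample string stands in for the downloaded page.
sample = "The gist of the talk is the same as the rest."
for start, phrase in find_the_phrases(sample):
    print("%s: %s" % (start, phrase))
```

Note that "the rest." at the end of the sample is not reported, because the trailing \s+ in the pattern requires whitespace after the second word.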
Comments
Yes, regexes are best for this kind of thing. You can optimize your code:
Instead of r'\b(the\s+\w+)\s+' use r'\b([Tt][Hh][Ee]\s+\w+)\s+'
Instead of re.compile(pattern, re.IGNORECASE) use regex = re.compile(pattern)
Instead of print "%s: %s" % (match.start(), match.group(1)) use print match.start(), match.group(1)
Use
it = regex.finditer(html)
for match in it:
I'm facing a problem: https://stackoverflow.com/q...