Using Python's finditer for Lexical Analysis
Fredrik Lundh wrote a good article called Using Regular Expressions for Lexical Analysis, which explains how to use Python regular expressions to read an input string and group characters into lexical units, or tokens. The author's first group of examples read in a simple expression, "b = 2 + a*10", and output strings classified as one of three token types: symbols (e.g. a and b), integer literals (e.g. 2 and 10), and operators (e.g. =, +, and *). His first three examples use the findall method and his fourth example uses the undocumented scanner method from the re module. Here is the example code from the fourth example. Note that the "1" in the first column of the results corresponds to the integer literals token group, "2" corresponds to the symbols group, and "3" to the operators group.
import re

expr = "b = 2 + a*10"
pos = 0
pattern = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")
scan = pattern.scanner(expr)
while 1:
    m = scan.match()
    if not m:
        break
    print m.lastindex, repr(m.group(m.lastindex))
2 'b'
3 '='
1 '2'
3 '+'
2 'a'
3 '*'
1 '10'
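To make the role of m.lastindex clearer, here is a minimal sketch of the same loop written for Python 3 with only documented calls. It assumes that calling the pattern's match method with an explicit start position behaves like the scanner's match for this input; the tokenize function name is my own and does not appear in Fredrik's article.

```python
import re

# Same pattern as in the article: group 1 = integer literal,
# group 2 = symbol, group 3 = operator.
TOKEN = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")


def tokenize(text):
    """Return (group_number, token_text) pairs for text.

    tokenize is a hypothetical helper name used only for this sketch.
    """
    tokens = []
    pos = 0
    while True:
        # Anchor the match at pos, then advance past whatever was consumed.
        m = TOKEN.match(text, pos)
        if m is None:
            break
        # lastindex is the number of the capturing group that matched.
        tokens.append((m.lastindex, m.group(m.lastindex)))
        pos = m.end()
    return tokens


result = tokenize("b = 2 + a*10")
for group_number, token_text in result:
    print(group_number, repr(token_text))
```

Each successful match consumes at least one character, so the loop always moves forward and stops when nothing is left to match.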
Since this article was dated 2002, and the author was using Python 2.0, I wondered if this was the most current approach. The author notes that recent versions (i.e. version 2.2 or later) of Python allow you to use the finditer method which uses an internal scanner object. Using finditer makes the example code much simpler. Here is Fredrik's example using finditer:
import re

expr = "b = 2 + a*10"
regex = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")
for m in regex.finditer(expr):
    print m.lastindex, repr(m.group(m.lastindex))
Running it produces the same results as the original.
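As a possible variation, the numeric group indexes can be replaced with names. The sketch below is written for Python 3 and is not from Fredrik's article: the group names integer, symbol, and operator are labels I chose for illustration. It relies on the match object's lastgroup attribute, which gives the name of the capturing group that matched.

```python
import re

expr = "b = 2 + a*10"

# Same alternatives as the original pattern, but each capturing group
# carries a name (the names are assumptions made for this sketch).
regex = re.compile(
    r"\s*(?:(?P<integer>\d+)|(?P<symbol>\w+)|(?P<operator>.))"
)

# lastgroup is the name of the group that matched; lastindex is its number.
tokens = [(m.lastgroup, m.group(m.lastindex)) for m in regex.finditer(expr)]

for kind, text in tokens:
    print(kind, repr(text))
```

The token order is the same as in the numbered version; only the first column changes from a group number to a group name.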