Find the N longest lines in a file with Python
Here's a Python problem I attempted recently:
Write a program that reads a multi-line text file and writes the N longest lines to a new file, where N and the file to be read are specified on the command line. Optimization is important.
Here's my solution:
import sys

def main(filename=sys.argv[1],
         N=int(sys.argv[2])):
    """Find the N longest lines in filename and write to filename + ".new"
    """
    lines = open(filename).readlines()
    lines.sort(cmp=lambda x, y: cmp(len(y), len(x)))
    open(filename+".new", "w").write("".join(lines[:N]))

if __name__ == '__main__':
    main()
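As an aside, the cmp= argument to list.sort only exists in Python 2. A sketch of the same sort for Python 3 (the helper name n_longest is mine) uses key= with reverse=True, which also avoids calling a lambda on every comparison:

```python
def n_longest(filename, n):
    """Return the n longest lines of filename, longest first.

    key=len replaces the Python 2 cmp= comparator and is evaluated
    once per line rather than once per comparison.
    """
    with open(filename) as f:
        lines = f.readlines()
    lines.sort(key=len, reverse=True)  # sort by line length, descending
    return lines[:n]
```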
What do you think? Is there a faster way?
Comments
Using a heap is faster.
import sys, heapq

def main(filename=sys.argv[1],
         N=int(sys.argv[2])):
    """Finds the N longest lines in filename and writes to filename + ".new"
    """
    lines = open(filename).readlines()
    heap = heapq.nlargest(N, lines, len)
    open(filename+".new", "w").write("".join(heap))

if __name__ == '__main__':
    main()
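The win comes from how nlargest works: it scans the input once while keeping a heap of only N items, so for M lines the cost is roughly O(M log N) comparisons instead of the O(M log M) of a full sort. A small illustrative example (the word list is made up):

```python
import heapq

words = ["pear", "fig", "watermelon", "apple", "kiwi"]
# nlargest makes a single pass, keeping only a 2-element heap of the
# longest strings seen so far; the result comes back longest first
top2 = heapq.nlargest(2, words, key=len)
```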
Charlie,
Thank you very much for your solution. I profiled both methods and your method is much faster. For my 60MB test file, my algorithm using "sort" ran in 12.6 CPU seconds and your heap algorithm ran in 0.4 CPU seconds.
For those interested, here is my test:
import heapq
import sys
import cProfile

filename = "saltycrane.com.access.log"
N = 20

def main():
    print "running saltycrane"
    cProfile.run('saltycrane(filename, N)')
    print "running charlie"
    cProfile.run('charlie(filename, N)')

def saltycrane(filename, N):
    """Finds the N longest lines in filename and writes to filename + ".new"
    """
    lines = open(filename).readlines()
    lines.sort(cmp=lambda x, y: cmp(len(y), len(x)))
    open(filename+".new", "w").write("".join(lines[:N]))

def charlie(filename, N):
    """Finds the N longest lines in filename and writes to filename + ".new"
    """
    lines = open(filename).readlines()
    heap = heapq.nlargest(N, lines, len)
    open(filename+".new", "w").write("".join(heap))

if __name__ == '__main__':
    main()
And the results:
running saltycrane
10981328 function calls in 12.584 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.089 0.089 12.584 12.584 <string>:1(<module>)
1 2.288 2.288 12.494 12.494 n_longest.py:14(saltycrane)
2745330 6.273 0.000 9.955 0.000 n_longest.py:18(<lambda>)
2745330 1.381 0.000 1.381 0.000 {cmp}
5490660 2.301 0.000 2.301 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.252 0.252 0.252 0.252 {method 'readlines' of 'file' objects}
1 0.000 0.000 0.000 0.000 {method 'write' of 'file' objects}
2 0.000 0.000 0.000 0.000 {open}
running charlie
12 function calls in 0.379 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.035 0.035 0.379 0.379 <string>:1(<module>)
1 0.000 0.000 0.170 0.170 heapq.py:324(nlargest)
1 0.000 0.000 0.343 0.343 n_longest.py:21(charlie)
1 0.170 0.170 0.170 0.170 {_heapq.nlargest}
1 0.000 0.000 0.000 0.000 {itertools.tee}
1 0.000 0.000 0.000 0.000 {map}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.173 0.173 0.173 0.173 {method 'readlines' of 'file' objects}
1 0.000 0.000 0.000 0.000 {method 'write' of 'file' objects}
2 0.000 0.000 0.000 0.000 {open}
Here is the documentation for heapq: http://docs.python.org/library/heapq.html
Using a priority queue for this sort of thing is of course optimal, but if I'm not mistaken, the result of "open(filename)" is iterable. You can save a massive amount of memory, and a noticeable amount of time, by passing the result of "open(filename)" directly to heapq.nlargest instead of reading all the lines first. I get 2.5s vs 3s on a 135MB file.
Boris,
Thank you for the improvement. I ran the following code:
def boris(filename, N):
    heap = heapq.nlargest(N, open(filename), len)
    open(filename+".new", "w").write("".join(heap))
and got the following results for my 102MB file:
saltycrane: 22.3 CPU seconds
charlie: 0.76 CPU seconds
boris: 0.55 CPU seconds
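Putting Boris's streaming idea together with explicit file closing, a Python 3-style sketch (the function name n_longest_to_file is mine) might look like this:

```python
import heapq

def n_longest_to_file(src, dst, n):
    """Stream src line by line and write its n longest lines to dst.

    Because the file object is consumed lazily, heapq.nlargest only
    ever holds n lines in memory instead of the whole file.
    """
    with open(src) as f:
        longest = heapq.nlargest(n, f, key=len)
    with open(dst, "w") as f:
        f.write("".join(longest))
```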