SaltyCrane Blog — Notes on JavaScript and web development

Python MongoDB notes

MongoDB is a popular new schemaless, document-oriented NoSQL database. It is useful for logging and real-time analytics. I'm working on a tool that stores log files from multiple remote hosts in MongoDB, then analyzes them in real time and prints pretty plots. My work in progress is located on GitHub.

Here are my first steps using PyMongo. I store an Apache access log in MongoDB and then query it for the number of requests in the last minute. I am running on 32-bit Ubuntu Karmic (though I think MongoDB really wants to run on 64-bit).

Install and run MongoDB

  • Download and install MongoDB
    cd ~/lib
    curl http://downloads.mongodb.org/linux/mongodb-linux-i686-latest.tgz | tar zx
    ln -s mongodb-linux-i686-2010-02-22 mongodb
  • Create data directory
    mkdir -p ~/var/mongodb/db
  • Run MongoDB
    ~/lib/mongodb/bin/mongod --dbpath ~/var/mongodb/db
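
Before moving on, it can help to confirm that mongod is actually accepting connections. The following is a minimal sketch, assuming the server runs on localhost and uses MongoDB's default port 27017. The helper name mongod_is_listening is made up for illustration and is not part of MongoDB or PyMongo. It only checks that something is listening on the port, not that it is really MongoDB.

```python
import socket

def mongod_is_listening(host='localhost', port=27017, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # connection refused, timed out, or host not reachable
        return False
    sock.close()
    return True

if __name__ == '__main__':
    print('mongod reachable:', mongod_is_listening())
```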

Install PyMongo

Simple Example

writer.py:

import re
from datetime import datetime
from subprocess import Popen, PIPE, STDOUT
from pymongo import Connection
from pymongo.errors import CollectionInvalid

HOST = 'us-apa1'
LOG_PATH = '/var/log/apache2/http-mydomain.com-access.log'
DB_NAME = 'mydb'
COLLECTION_NAME = 'apache_access'
MAX_COLLECTION_SIZE = 5 # in megabytes

def main():
    # connect to mongodb
    mongo_conn = Connection()
    mongo_db = mongo_conn[DB_NAME]
    try:
        mongo_coll = mongo_db.create_collection(COLLECTION_NAME,
                                                capped=True,
                                                size=MAX_COLLECTION_SIZE*1048576)
    except CollectionInvalid:
        mongo_coll = mongo_db[COLLECTION_NAME]

    # open remote log file
    cmd = 'ssh -f %s tail -f %s' % (HOST, LOG_PATH)
    p = Popen(cmd, shell=True, stdout=PIPE, stderr=STDOUT)

    # parse and store data
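    while True:
        line = p.stdout.readline()
        if not line:
            # tail exited or the ssh connection was closed
            break
        data = parse_line(line)
        if not data:
            # skip lines that don't match the expected log format
            continue
        data['time'] = convert_time(data['time'])
        mongo_coll.insert(data)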

def parse_line(line):
    """Apache combined log format
    %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"
    """
    m = re.search(' '.join([
                r'(?P<host>(\d+\.){3}\d+)',
                r'.*',
                r'\[(?P<time>[^\]]+)\]',
                r'"\S+ (?P<url>\S+)',
                ]), line)
    if m:
        return m.groupdict()
    else:
        return {}

def convert_time(time_str):
    # strip the UTC offset (e.g. " -0800" or " +0100") before parsing
    time_str = re.sub(r' [+-]\d{4}', '', time_str)
    return datetime.strptime(time_str, "%d/%b/%Y:%H:%M:%S")

if __name__ == '__main__':
    main()
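
To see what ends up in each stored document, the parsing step can be tried on its own without MongoDB or ssh. This is only a sketch: it folds parse_line and convert_time from writer.py into one function, and the sample log line is made up.

```python
import re
from datetime import datetime

# Same pattern as parse_line() in writer.py, compiled once.
LOG_RE = re.compile(' '.join([
    r'(?P<host>(\d+\.){3}\d+)',
    r'.*',
    r'\[(?P<time>[^\]]+)\]',
    r'"\S+ (?P<url>\S+)',
]))

def parse_line(line):
    """Return a dict with host, time (datetime) and url, or {} if no match."""
    m = LOG_RE.search(line)
    if not m:
        return {}
    data = m.groupdict()
    # Drop the UTC offset, like convert_time() in writer.py, and parse the rest.
    time_str = re.sub(r' [+-]\d{4}$', '', data['time'])
    data['time'] = datetime.strptime(time_str, '%d/%b/%Y:%H:%M:%S')
    return data

# Made-up sample line in Apache combined log format.
SAMPLE = ('127.0.0.1 - - [22/Feb/2010:10:15:30 -0800] '
          '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')

if __name__ == '__main__':
    print(parse_line(SAMPLE))
    print(parse_line('not a log line'))
```

Only the named groups (host, time, url) appear in the result, so those three keys are what each MongoDB document contains.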

reader.py:

import time
from datetime import datetime, timedelta
from pymongo import Connection

DB_NAME = 'mydb'
COLLECTION_NAME = 'apache_access'

def main():
    # connect to mongodb
    mongo_conn = Connection()
    mongo_db = mongo_conn[DB_NAME]
    mongo_coll = mongo_db[COLLECTION_NAME]

    # find the number of requests in the last minute
    while True:
        d = datetime.now() - timedelta(seconds=60)
        N_requests = mongo_coll.find({'time': {'$gt': d}}).count()
        print 'Requests in the last minute:',  N_requests
        time.sleep(2)

if __name__ == '__main__':
    main()
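
The query in reader.py is just a dictionary with a $gt condition on the time field. The sketch below builds the same filter as a standalone function and counts with count_documents, which assumes PyMongo 3.7 or later (where Connection has been replaced by MongoClient and find(...).count() is deprecated). The function names recent_query and count_recent are made up for illustration.

```python
from datetime import datetime, timedelta

def recent_query(now=None, window_seconds=60):
    """Build the filter for documents whose 'time' falls inside the window."""
    if now is None:
        now = datetime.now()
    return {'time': {'$gt': now - timedelta(seconds=window_seconds)}}

def count_recent(coll, now=None, window_seconds=60):
    """Count matching documents; coll is expected to be a PyMongo collection."""
    # Assumption: PyMongo 3.7+, where count_documents() is available.
    return coll.count_documents(recent_query(now, window_seconds))
```

With a newer PyMongo installed, coll would come from MongoClient()['mydb']['apache_access'], and the loop in reader.py would call count_recent(coll) every two seconds.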

Running python writer.py in one terminal and python reader.py in another terminal, I get the following results:

Requests in the last minute: 13
Requests in the last minute: 14
Requests in the last minute: 14
Requests in the last minute: 14
Requests in the last minute: 13
Requests in the last minute: 14
Requests in the last minute: 15
...

Related Documentation

Comments


#1 Tiago Almeida commented:

Nice! Thanks :)


#2 Mike Dirolf commented:

Thanks for the post! Glad you're enjoying PyMongo - feel free to let me know if you have any questions.


#3 Eliot commented:

Mike: thank you for your support! (and for the great software and documentation!)


#4 Christopher commented:

You might find http://Graylog2.org really interesting. The server side is a java app that listens for syslog messages over UDP 514, and puts them in a capped MongoDB collection. There's also a web front end (Ruby) that lets you view, search and sort the messages. I am using it to capture log info, and am working on something similar to your project to analyze what's in the DB.

Also, this post from Grig Gheorghiu is relevant: http://agiletesting.blogspot.com/2010/07/tracking-and-visualizing-mail-logs-with.html

Good luck with the project!


#5 Eliot commented:

Thanks for the links. It looks like some good stuff. I have enjoyed a lot of Grig Gheorghiu's articles, but I hadn't seen this one before.

Good luck on your project as well!


#6 guanqun commented:

Thanks, this helps during my setup with mongodb.


#7 Francis commented:

hi,

I would like to know how to connect to a MongoDB instance that is running remotely. Can you please send me some code?

Regards, Francis