SaltyCrane Blog — Notes on JavaScript and web development

Python urlparse example

Here is an example of how to parse a URL using Python's urlparse module. See the urlparse module documentation for more information.

from urlparse import urlparse

url = 'http://www.gurlge.com:80/path/file.html;params?a=1#fragment'
o = urlparse(url)
print o.scheme
print o.netloc
print o.hostname
print o.port
print o.path
print o.params
print o.query
print o.fragment
print o.username
print o.password

Results:

http
www.gurlge.com:80
www.gurlge.com
80
/path/file.html
params
a=1
fragment
None
None
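
The object returned by urlparse is a ParseResult which (in Python 2.5 and later) is a subclass of tuple, so it can also be indexed positionally, and its geturl() method reassembles the URL. (In Python 3, the module was renamed urllib.parse.) Here is a short sketch using the same URL as above:

from urlparse import urlparse, urlunparse

url = 'http://www.gurlge.com:80/path/file.html;params?a=1#fragment'
o = urlparse(url)

# the result is also a 6-tuple: (scheme, netloc, path, params, query, fragment)
print o[0]           # http
print o[1]           # www.gurlge.com:80

# reassemble the URL from its parts
print o.geturl()     # prints the original URL
print urlunparse(o)  # same result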

How to get stdout and stderr using Python's subprocess module

I wrote previously about how to get stdout and stderr using os.popen4. However, per the Python documentation, using the subprocess module is preferred:

The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. This module intends to replace several other, older modules and functions, such as:

os.system
os.spawn*
os.popen*
popen2.*
commands.*

See the subprocess module documentation for more information.

Here is how to get stdout and stderr from a program using the subprocess module:

from subprocess import Popen, PIPE, STDOUT

cmd = 'ls /etc/fstab /etc/non-existent-file'
p = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=STDOUT, close_fds=True)
output = p.stdout.read()
print output

Results:

ls: cannot access /etc/non-existent-file: No such file or directory
/etc/fstab
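
If you want stdout and stderr separately, or want to avoid the possibility of deadlock when reading directly from the pipes, the subprocess documentation recommends communicate(). Here is a minimal sketch of the same command with stderr kept separate:

from subprocess import Popen, PIPE

cmd = 'ls /etc/fstab /etc/non-existent-file'
p = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
out, err = p.communicate()  # waits for the process to terminate
print "return code:", p.returncode
print "stdout:", out
print "stderr:", err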

How to monitor an Apache web server using Monit

Monit is a tool that can monitor your Apache web server, MySQL database, or other daemon process. It can restart the service based on configurable conditions such as CPU usage, memory usage, number of children, etc. It can log status to a file, email status, and it has a web interface for monitoring or restarting the service. Here are the steps I took to install and configure Monit on Ubuntu Hardy. My configuration simply monitors the status of my Apache web server and restarts it if it stops. It also checks whether the memory used by Apache is greater than 1 MB and logs an alert in /var/log/monit.log. For more configuration options, see the examples in the default /etc/monit/monitrc file or the configuration examples in the monit documentation. I also found Ubuntu Geek's guide to be very helpful.

  • Install monit
    $ sudo apt-get install monit
  • Edit the config file
    $ sudo nano /etc/monit/monitrc
    Insert the following:
    # check services every 2 minutes
    set daemon 120
    
    # logging 
    set logfile /var/log/monit.log
    
    # web interface
    set httpd port 2812 and
        use address localhost # only accept connection from localhost
        allow localhost       # allow localhost to connect to the server    
        allow admin:monit     # require user 'admin' with password 'monit'
    
    # monitor apache
    check process apache2 with pidfile /var/run/apache2.pid
        start program = "/etc/init.d/apache2 start"
        if totalmem > 1.0 MB for 2 cycles then alert
  • Check the file syntax
    $ sudo monit -t
  • Enable the service
    $ sudo nano /etc/default/monit
    Change the following line:
    startup=1
  • Start monit
    $ sudo /etc/init.d/monit start
  • Point your browser at http://localhost:2812 and log in using the user "admin" and the password "monit".
  • Click on "apache2" and you can see information about the Apache process.
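
For reference, the sample configurations in the monit documentation also show restart rules. Something like the following (untested in my setup; adapted from the monitrc examples) could be used in place of the check process block above to cover the CPU and children conditions mentioned earlier:

    check process apache2 with pidfile /var/run/apache2.pid
        start program = "/etc/init.d/apache2 start"
        stop program  = "/etc/init.d/apache2 stop"
        if cpu > 60% for 2 cycles then alert
        if children > 250 then restart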

Simple cron example

Simple cron example (tested on Ubuntu):

  • Edit your (user) crontab file
    $ crontab -e
    This will bring up your editor (nano by default in Ubuntu).

  • Enter the following inside. This will append the current date to a log file every minute. The 6 fields of the crontab file are: minute, hour, day of month, month, day of week, command. (Another example of the field layout follows this list.)
    * * * * * /bin/date >> /tmp/cron_output
    
    Be sure to put a blank line at the end of the file.
    (NOTE 1: >> only redirects STDOUT to a file. To redirect both STDOUT and STDERR, use something like /bin/date >> /tmp/cron_output 2>&1)
    (NOTE 2: If output is not redirected, cron will try to email the output to you. To do this, a mail transfer agent such as sendmail or postfix must be installed.)
    (NOTE 3 (added 2015-06-24): When I created my cron script in /etc/cron.d with Emacs using sudo::, cron didn't pick up my script. When I created it with nano, cron picked it up. It seems the cause is the permissions of the cron script: Emacs created the script with 664 permissions while nano created it with 644 permissions. When I changed the permissions to 644, it started working. I am running Ubuntu 15.04. This Ask Ubuntu answer confirms that a 664 permission is problematic because group-writable cron scripts are considered insecure. See /var/log/syslog for cron messages. The Ask Ubuntu page has a lot of other good tips: Reasons why crontab does not work)

  • Exit the editor. It should output:
    crontab: installing new crontab
  • Check that it is working:
    tail -f /tmp/cron_output
    You should see the date updated every minute on the minute (or close to it):
    Tue Sep 16 23:58:01 PDT 2008
    Tue Sep 16 23:59:01 PDT 2008
    Wed Sep 17 00:00:01 PDT 2008
    Wed Sep 17 00:01:01 PDT 2008
    ...
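
Here is another example of the field layout, using a hypothetical backup script: run it at 2:30am every Monday and redirect both STDOUT and STDERR to a log file:

    # hypothetical: run a backup script at 2:30am every Monday
    30 2 * * 1 /home/myuser/bin/backup.sh >> /tmp/backup_log 2>&1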
    

See also my post: Postgres backup with cron

Django Blog Project #16: Adding URL redirects using the Blogger API

I wanted to insert URL redirects on my old Blogger posts pointing to my new blog articles. A comment on my Migrating Blogger Posts post suggested that I use the (Python) Blogger API. This was a great suggestion. The Blogger API was well documented and easy to use. Here is the script I used to insert the URL redirects on each of my old Blogger posts.

from gdata import service
import re
import gdata
import atom

NEW_HTML = """
<script language="javascript">
  setTimeout('location.href="%s"', 2000);
</script>
<br /><br />
  <p><b>This is my OLD blog. I've copied this post over to my NEW blog at:</b></p>
  <p><a href="%s">%s</a></p>
  <p>You should be redirected in 2 seconds.</p>

<br /><br />
"""

# authenticate
blogger_service = service.GDataService('[email protected]', 'mypassword')
blogger_service.service = 'blogger'
blogger_service.account_type = 'GOOGLE'
blogger_service.server = 'www.blogger.com'
blogger_service.ProgrammaticLogin()

# get list of blogs
query = service.Query()
query.feed = '/feeds/default/blogs'
feed = blogger_service.Get(query.ToUri())

# get blog id
blog_id = feed.entry[0].GetSelfLink().href.split("/")[-1]

# get all posts
query = service.Query()
query.feed = '/feeds/%s/posts/default' % blog_id
query.published_min = '2000-01-01'
query.published_max = '2009-01-01'
query.max_results = 1000
feed = blogger_service.Get(query.ToUri())
print feed.title.text

for entry in feed.entry:
    # create link to article on new blog
    new_link = re.sub(r'http://iwiwdsmi\.blogspot\.com/(.*)\.html',
                      r'http://www.saltycrane.com/blog/\1/',
                      entry.link[0].href)
    print new_link

    # update post
    to_add = NEW_HTML % (new_link, new_link, new_link)
    entry.content.text = to_add + entry.content.text
    blogger_service.Put(entry, entry.GetEditLink().href)
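
As a sanity check, the URL rewrite can be tested on a single made-up post URL before running it against the real feed:

import re

old_url = 'http://iwiwdsmi.blogspot.com/2007/05/some-old-post.html'  # hypothetical post URL
new_url = re.sub(r'http://iwiwdsmi\.blogspot\.com/(.*)\.html',
                 r'http://www.saltycrane.com/blog/\1/',
                 old_url)
print new_url  # http://www.saltycrane.com/blog/2007/05/some-old-post/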

Notes on parallel processing with Python and Twisted

Twisted is a networking engine written in Python that, among many other things, can be used to do parallel processing. It is very big, though, so I had a hard time finding what I needed. I browsed through the Twisted Documentation and the Twisted O'Reilly book. There is also a recipe in the Python Cookbook. However, I found Bruce Eckel's article, Concurrency with Python, Twisted, and Flex, to be the most helpful. (See also Bruce Eckel's initial article on Twisted: Grokking Twisted)

Here are my notes on running Bruce Eckel's example. I removed the Flex part because I didn't need it or know anything about it. This example runs a Controller which starts a number of separate parallel processes running Solvers (a.k.a. workers). It also allows for communication between the Controller and the Solvers. Though this example only runs on one machine, the article says extending it to multiple machines is not difficult. For a good explanation of how this works, please see the original article.

Here is solver.py which is copied from the original article. The actual "work" is done in the step method. I only added some debugging print statements for myself.

"""
solver.py
Original version by Bruce Eckel
Solves one portion of a problem, in a separate process on a separate CPU
"""
import sys, random, math
from twisted.spread import pb
from twisted.internet import reactor

class Solver(pb.Root):

    def __init__(self, id):
        print "solver.py %s: solver init" % id
        self.id = id

    def __str__(self): # String representation
        return "Solver %s" % self.id

    def remote_initialize(self, initArg):
        return "%s initialized" % self

    def step(self, arg):
        "Simulate work and return result"
        print "solver.py %s: solver step" % self.id
        result = 0
        for i in range(random.randint(1000000, 3000000)):
            angle = math.radians(random.randint(0, 45))
            result += math.tanh(angle)/math.cosh(angle)
        return "%s, %s, result: %.2f" % (self, str(arg), result)

    # Alias methods, for demonstration version:
    remote_step1 = step
    remote_step2 = step
    remote_step3 = step

    def remote_status(self):
        print "solver.py %s: remote_status" % self.id
        return "%s operational" % self

    def remote_terminate(self):
        print "solver.py %s: remote_terminate" % self.id
        reactor.callLater(0.5, reactor.stop)
        return "%s terminating..." % self

if __name__ == "__main__":
    port = int(sys.argv[1])
    reactor.listenTCP(port, pb.PBServerFactory(Solver(sys.argv[1])))
    reactor.run()

Here is controller.py. This is also copied from the original article but I removed the Flex interface and created calls to start and terminate in the Controller class. I'm not sure if this makes sense, but at least this allowed me to run the example. I also moved the terminate method from the FlexInterface to the Controller.

"""
Controller.py
Original version by Bruce Eckel
Starts and manages solvers in separate processes for parallel processing.
"""
import sys
from subprocess import Popen
from twisted.spread import pb
from twisted.internet import reactor, defer

START_PORT = 5566
MAX_PROCESSES = 2

class Controller(object):

    def broadcastCommand(self, remoteMethodName, arguments, nextStep, failureMessage):
        print "controller.py: broadcasting..."
        deferreds = [solver.callRemote(remoteMethodName, arguments) 
                     for solver in self.solvers.values()]
        print "controller.py: broadcasted"
        reactor.callLater(3, self.checkStatus)

        defer.DeferredList(deferreds, consumeErrors=True).addCallbacks(
            nextStep, self.failed, errbackArgs=(failureMessage,))
    
    def checkStatus(self):
        print "controller.py: checkStatus"
        for solver in self.solvers.values():
            solver.callRemote("status").addCallbacks(
                lambda r: sys.stdout.write(r + "\n"), self.failed,
                errbackArgs=("Status Check Failed",))
                                                     
    def failed(self, results, failureMessage="Call Failed"):
        print "controller.py: failed"
        for (success, returnValue), (address, port) in zip(results, self.solvers):
            if not success:
                raise Exception("address: %s port: %d %s" % (address, port, failureMessage))

    def __init__(self):
        print "controller.py: init"
        self.solvers = dict.fromkeys(
            [("localhost", i) for i in range(START_PORT, START_PORT+MAX_PROCESSES)])
        self.pids = [Popen(["python", "solver.py", str(port)]).pid
                     for ip, port in self.solvers]
        print "PIDS: ", self.pids
        self.connected = False
        reactor.callLater(1, self.connect)

    def connect(self):
        print "controller.py: connect"
        connections = []
        for address, port in self.solvers:
            factory = pb.PBClientFactory()
            reactor.connectTCP(address, port, factory)
            connections.append(factory.getRootObject())
        defer.DeferredList(connections, consumeErrors=True).addCallbacks(
            self.storeConnections, self.failed, errbackArgs=("Failed to Connect",))

        print "controller.py: starting parallel jobs"
        self.start()

    def storeConnections(self, results):
        print "controller.py: storeconnections"
        for (success, solver), (address, port) in zip(results, self.solvers):
            self.solvers[address, port] = solver
        print "controller.py: Connected; self.solvers:", self.solvers
        self.connected = True

    def start(self):
        "controller.py: Begin the solving process"
        if not self.connected:
            return reactor.callLater(0.5, self.start)
        self.broadcastCommand("step1", ("step 1"), self.step2, "Failed Step 1")

    def step2(self, results):
        print "controller.py: step 1 results:", results
        self.broadcastCommand("step2", ("step 2"), self.step3, "Failed Step 2")

    def step3(self, results):
        print "controller.py: step 2 results:", results
        self.broadcastCommand("step3", ("step 3"), self.collectResults, "Failed Step 3")

    def collectResults(self, results):
        print "controller.py: step 3 results:", results
        self.terminate()
        
    def terminate(self):
        print "controller.py: terminate"
        for solver in self.solvers.values():
            solver.callRemote("terminate").addErrback(self.failed, "Termination Failed")
        reactor.callLater(1, reactor.stop)
        return "Terminating remote solvers"

if __name__ == "__main__":
    controller = Controller()
    reactor.run()

To run it, put the two files in the same directory and run python controller.py. You should see 2 CPUs (if you have 2) go up to 100% usage. And here is the screen output:

controller.py: init
PIDS:  [12173, 12174]
solver.py 5567: solver init
solver.py 5566: solver init
controller.py: connect
controller.py: starting parallel jobs
controller.py: storeconnections
controller.py: Connected; self.solvers: {('localhost', 5567): <twisted.spread.pb.RemoteReference instance at 0x...>, ('localhost', 5566): <twisted.spread.pb.RemoteReference instance at 0x...>}
controller.py: broadcasting...
controller.py: broadcasted
solver.py 5566: solver step
solver.py 5567: solver step
controller.py: checkStatus
solver.py 5566: remote_status
Solver 5566 operational
solver.py 5567: remote_status
controller.py: step 1 results: [(True, 'Solver 5567, step 1, result: 683825.75'), (True, 'Solver 5566, step 1, result: 543177.17')]
controller.py: broadcasting...
controller.py: broadcasted
Solver 5567 operational
solver.py 5566: solver step
solver.py 5567: solver step
controller.py: checkStatus
solver.py 5566: remote_status
Solver 5566 operational
solver.py 5567: remote_status
controller.py: step 2 results: [(True, 'Solver 5567, step 2, result: 636793.90'), (True, 'Solver 5566, step 2, result: 335358.16')]
controller.py: broadcasting...
controller.py: broadcasted
Solver 5567 operational
solver.py 5566: solver step
solver.py 5567: solver step
controller.py: checkStatus
solver.py 5566: remote_status
Solver 5566 operational
solver.py 5567: remote_status
controller.py: step 3 results: [(True, 'Solver 5567, step 3, result: 847386.43'), (True, 'Solver 5566, step 3, result: 512120.15')]
controller.py: terminate
Solver 5567 operational
solver.py 5566: remote_terminate
solver.py 5567: remote_terminate

Notes on starting processes in Python

Using os.fork()

Here is an example using os.fork() to spawn 5 processes, each running the Python function myfunc. Don't forget the os._exit() at the end. Per the docs, sys.exit() is normally used, but os._exit() should be used in child processes after a fork; it exits without calling cleanup handlers, flushing stdio buffers, etc.

import os
import time

def myfunc(i):
    print "sleeping 5 seconds from process %s" % i
    time.sleep(5)
    print "finished sleeping from process %s" % i

for i in range(5):
    pid = os.fork()
    if pid == 0:
        myfunc(i)
        os._exit(0)

Results:

sleeping 5 seconds from process 0
sleeping 5 seconds from process 1
sleeping 5 seconds from process 2
sleeping 5 seconds from process 3
sleeping 5 seconds from process 4

And 5 seconds later...

finished sleeping from process 0
finished sleeping from process 1
finished sleeping from process 2
finished sleeping from process 3
finished sleeping from process 4
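
Note that the parent in the example above does not wait for its children. If you need the parent to block until all of the forked processes finish, collect the child PIDs and reap them with os.waitpid(). Here is a sketch of the same example with waiting added:

import os
import time

def myfunc(i):
    print "sleeping 5 seconds from process %s" % i
    time.sleep(5)
    print "finished sleeping from process %s" % i

pids = []
for i in range(5):
    pid = os.fork()
    if pid == 0:
        # child process: do the work, then exit without cleanup handlers
        myfunc(i)
        os._exit(0)
    pids.append(pid)

# parent process: block until every child has exited
for pid in pids:
    os.waitpid(pid, 0)
print "all children finished"
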
Running an external script in subprocesses

Alternatively, if you want to run an external script in multiple processes, you can use the Popen class in the subprocess module. For example, to run the following script, called "myscript.py":

"myscript.py"
import sys
import time

def myfunc(i):
    print "sleeping 5 seconds from process %s" % i
    time.sleep(5)
    print "finished sleeping from process %s" % i

if __name__ == '__main__':
    myfunc(sys.argv[1])

use the following Python code stored in the same directory:

"popen_ex.py"
from subprocess import Popen

for i in range(5):
    Popen(['python', './myscript.py', str(i)])

The screen output is the same as in the previous example. What's the difference? fork() copies the process's memory space, including open file descriptors, to the child process. In the second example, since I am executing a new Python interpreter from scratch, I get a "cleaner" start, but probably more overhead as well.
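
Similarly, Popen does not wait for the child by default. To block until all of the external scripts finish, keep the Popen objects and call wait() on each (a minimal variation on popen_ex.py above):

from subprocess import Popen

processes = [Popen(['python', './myscript.py', str(i)]) for i in range(5)]
for p in processes:
    p.wait()  # blocks until this child process exits
print "all scripts finished"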

Django Blog Project #15: New site logo

I now have a new site logo design drawn by my wife, Angela! Doesn't it look great? My previous logo was a crane picture I had just pulled from the web somewhere. So it is nice to have a custom logo done for me. Luckily my wife is artistic and didn't mind drawing it for me. I also made some minor changes to the title block to make things look a little better up there. Now to figure out how to style the rest of the page.

I also got a "memory over limit" warning from Webfaction this week. Over the weekend, I had redirected all my old Blogger posts to this blog, so apparently the small increase in traffic brought to light some of my inefficient code. To help solve the problem, I switched over to django-tagging. This eliminated a bunch of my inefficient code, and I appear to be within the memory limits now. There is still another section of code I need to rework, but this solves the problem for now. Django-tagging is pretty cool; I haven't quite got everything working correctly, but I will be sure to write some notes on it when I get the time.

Simplistic Python Thread example

Here is a simple Python example using the Thread object in the threading module.

import time
from threading import Thread

def myfunc(i):
    print "sleeping 5 sec from thread %d" % i
    time.sleep(5)
    print "finished sleeping from thread %d" % i

for i in range(10):
    t = Thread(target=myfunc, args=(i,))
    t.start()

Results:

sleeping 5 sec from thread 0
sleeping 5 sec from thread 1
sleeping 5 sec from thread 2
sleeping 5 sec from thread 3
sleeping 5 sec from thread 4
sleeping 5 sec from thread 5
sleeping 5 sec from thread 6
sleeping 5 sec from thread 7
sleeping 5 sec from thread 8
sleeping 5 sec from thread 9

...and 5 seconds later:

finished sleeping from thread 0
finished sleeping from thread 1
finished sleeping from thread 2
finished sleeping from thread 3
finished sleeping from thread 4
finished sleeping from thread 5
finished sleeping from thread 6
finished sleeping from thread 7
finished sleeping from thread 8
finished sleeping from thread 9
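
The script above does not wait for the threads before the main thread exits. To block until all worker threads finish, keep references to the Thread objects and call join() on each:

import time
from threading import Thread

def myfunc(i):
    print "sleeping 5 sec from thread %d" % i
    time.sleep(5)
    print "finished sleeping from thread %d" % i

threads = [Thread(target=myfunc, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until this thread terminates
print "all threads finished"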

How to iterate over an instance object's data attributes in Python

To list the attributes of a Python instance object, I could use the built-in dir() function; however, that returns the instance object's methods as well as its data attributes. To get just the data attributes, I can use the instance object's __dict__ attribute:

class A(object):
    def __init__(self):
        self.myinstatt1 = 'one'
        self.myinstatt2 = 'two'
    def mymethod(self):
        pass

a = A()
for attr, value in a.__dict__.iteritems():
    print attr, value

Results:

myinstatt2 two
myinstatt1 one
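
The built-in vars() function is an equivalent spelling: vars(a) returns a.__dict__, so the loop can also be written like this (using the class A defined above):

a = A()
for attr, value in vars(a).iteritems():
    print attr, value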