Django Blog Project #9: Migrating Blogger posts with Beautiful Soup
Last post, I talked about adding comments to my new sample blog application. That was about the last basic feature I needed before I could start using it for real. Of course there are still a number of features I'd like to add, such as automatic syntax highlighting with Pygments, django-tagging, some more interesting views, and comment moderation. But I think those will have to wait: I want to start using my new blog for real sometime.
So for the past few days, I've been working on my Beautiful Soup screen scraper script to copy all my Blogger posts over to my new Django blog. Initial results came quickly (it's pretty cool to see such a huge data dump after only a few lines of Beautiful Soup'ing) but the details (especially with the comments) kind of slowed me down. I've finally got everything copied over to my satisfaction. Below is the script I used to do it. Note that I realize it's not pretty; it's just a one-time-use hack. But hopefully someone else doing the same thing might find it useful.
#!/usr/bin/env python

import datetime
import os
import re
import urllib2
from BeautifulSoup import BeautifulSoup
from myblogapp.models import Post, LegacyComment
from django.contrib.comments.models import FreeComment

URL = ''.join([
    'http://iwiwdsmi.blogspot.com/search?',
    'updated-min=2006-01-01T00%3A00%3A00-08%3A00&',
    'updated-max=2009-01-01T00%3A00%3A00-08%3A00&',
    'max-results=1000',
])

html = urllib2.urlopen(URL).read()
soup = BeautifulSoup(html)

# running tally of posts that contain highlighted code sections
hl_count = 0
hl_list = []

for post in soup.html.body.findAll('div', {'class': 'post'}):
    print
    print '--------------------------------------------------------------'

    # save the post title and permalink
    h3 = post.find('h3', {'class': 'post-title'})
    post_href = h3.find('a')['href']
    post_title = h3.find('a').string
    # splitext drops the '.html' extension; rstrip('.html') would strip any
    # trailing '.', 'h', 't', 'm' or 'l' characters from the slug itself
    post_slug = os.path.splitext(os.path.basename(post_href))[0]
    print post_slug
    print post_href
    print post_title

    # save the post body, minus the Blogger-specific markup
    div = post.find('div', {'class': 'post-body'})
    for toremove in div.findAll('script'):
        toremove.extract()
    for toremove in div.findAll('span', {'id': 'showlink'}):
        toremove.extract()
    for toremove in div.findAll('div', {'style': 'clear: both;'}):
        toremove.extract()
    for toremove in div.findAll(text='#fullpost{display:none;}'):
        toremove.parent.extract()
    post_body = ''.join([str(item) for item in div.contents]).rstrip()
    # point links to my old Blogger posts at the new blog
    post_body = re.sub(r"iwiwdsmi\.blogspot\.com/(\d{4}/\d{2}/[\w\-]+)\.html",
                       r"www.saltycrane.com/blog/\1/",
                       post_body)

    # count number of highlighted code sections
    highlight = div.findAll('div', {'class': 'highlight'})
    if highlight:
        hl_count += len(highlight)
        hl_list.append(post_title)

    # save the timestamp
    a = post.find('a', {'class': 'timestamp-link'})
    try:
        post_timestamp = a.string
    except AttributeError:
        # no timestamp link found (a is None): fall back to the year and
        # month in the permalink
        match = re.search(r"\.com/(\d{4})/(\d{2})/", post_href)
        if match:
            year = match.group(1)
            month = match.group(2)
            post_timestamp = "%s/01/%s 11:11:11 AM" % (month, year)
    print post_timestamp

    # save the tags (this is ugly, i know)
    if 'error' in post_title.lower():
        post_tags = ['error']
    else:
        post_tags = []
    span = post.find('span', {'class': 'post-labels'})
    if span:
        a = span.findAll('a', {'rel': 'tag'})
    else:
        a = post.findAll('a', {'rel': 'tag'})
    post_tags = ' '.join([tag.string for tag in a] + post_tags)
    if not post_tags:
        post_tags = 'untagged'
    print post_tags

    # add Post object to new blog
    if True:
        p = Post()
        p.title = post_title
        p.body = post_body
        p.date_created = datetime.datetime.strptime(post_timestamp, "%m/%d/%Y %I:%M:%S %p")
        p.date_modified = p.date_created
        p.tags = post_tags
        p.slug = post_slug
        p.save()

    # check if there are comments
    a = post.find('a', {'class': 'comment-link'})
    if a:
        comm_string = a.string.strip()
    else:
        comm_string = "0"
    if comm_string[0] != "0":
        print
        print "COMMENTS:"

        # get the page with comments
        html_single = urllib2.urlopen(post_href).read()
        soup_single = BeautifulSoup(html_single)

        # get comments
        comments = soup_single.html.body.find('div', {'class': 'comments'})
        cauth_list = comments.findAll('dt')
        cbody_list = comments.findAll('dd', {'class': 'comment-body'})
        cdate_list = comments.findAll('span', {'class': 'comment-timestamp'})
        if not len(cauth_list) == len(cbody_list) == len(cdate_list):
            raise ValueError("didn't get all comment data")

        for auth, body, date in zip(cauth_list, cbody_list, cdate_list):
            # create comment in database
            lc = LegacyComment()
            lc.body = str(body.p)

            # find author
            lc.author = "Anonymous"
            auth_a = auth.findAll('a')[-1]
            auth_no_a = auth.contents[2]
            if auth_a.string:
                lc.author = auth_a.string
            elif auth_no_a:
                match = re.search(r"\s*([\w\s]*\w)\s+said", str(auth_no_a))
                if match:
                    lc.author = match.group(1)
            print lc.author

            # find website
            try:
                lc.website = auth_a['href']
            except KeyError:
                lc.website = ''
            print lc.website

            # other info
            lc.date_created = datetime.datetime.strptime(
                date.a.string.strip(), "%m/%d/%Y %I:%M %p")
            print lc.date_created
            lc.date_modified = lc.date_created
            lc.post_id = p.id
            lc.save()
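As an aside, the slug, link-rewriting, and timestamp steps of the script can be tried on their own, without Blogger or Django. Here is a minimal sketch, assuming Python 3 (the script itself is Python 2). The function names and the example permalink are made up for illustration; only the regular expression and the date format come from the script.

```python
import datetime
import os
import re


def slug_from_permalink(href):
    """Return the last path component of a permalink without its extension."""
    # os.path.splitext removes only the final extension. By contrast,
    # str.rstrip('.html') strips any trailing '.', 'h', 't', 'm', 'l'
    # characters, which can eat into the slug itself.
    return os.path.splitext(os.path.basename(href))[0]


def rewrite_links(body):
    """Point old Blogger permalinks in a post body at the new blog URLs."""
    return re.sub(
        r"iwiwdsmi\.blogspot\.com/(\d{4}/\d{2}/[\w\-]+)\.html",
        r"www.saltycrane.com/blog/\1/",
        body)


def parse_post_timestamp(text):
    """Parse a timestamp in the month/day/year 12-hour format the script expects."""
    return datetime.datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")


# Hypothetical permalink, chosen so the slug ends in characters from '.html'.
href = "http://iwiwdsmi.blogspot.com/2008/06/example-post-html.html"
print(slug_from_permalink(href))
print(os.path.basename(href).rstrip(".html"))  # shows the over-stripping
print(rewrite_links('<a href="http://iwiwdsmi.blogspot.com/2008/06/example-post.html">x</a>'))
print(parse_post_timestamp("06/15/2008 09:30:00 PM"))
```

The second print shows why stripping the extension with rstrip is risky: it removes characters, not a suffix, so a slug that happens to end in one of those letters gets shortened.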
I also made some changes to my Django blog code as I migrated my Blogger posts. The main addition was a LegacyComment model along with the associated views and templates. My Blogger comments consisted of HTML markup, but I didn't want to allow arbitrary HTML in my new comments for fear of cross-site scripting. So I separated my legacy Blogger comments from my new Django site comments.
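To illustrate the concern, here is a minimal sketch of what escaping does to untrusted comment text. It uses Python 3's standard html module as a stand-in for whatever escaping a comment framework applies, which is an assumption on my part and not code from the blog. The submitted text is invented.

```python
import html

# Hypothetical comment text submitted by a visitor (not from the real blog).
submitted = '<script>alert("xss")</script> Nice post & thanks!'

# Escaping turns the markup characters into entities, so a browser displays
# the text literally instead of interpreting it as HTML or running it.
escaped = html.escape(submitted)
print(escaped)
```

Legacy comments that already contain trusted markup cannot go through this step without showing their tags literally, which is one reason to store and render them separately.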
models.py
Here are my model changes. I added a LegacyComment class which contains the pertinent comment attributes and a ForeignKey to the post that it belongs to. I also added an lc_count (for legacy comment count) field to the Post class which stores the number of legacy comments for the post. It is incremented by the save() method in the LegacyComment class every time a comment is saved. Hmmm, I just realized the count will be wrong if I ever edit these comments, because re-saving an existing comment increments it again. Well, since these are legacy comments, hopefully I won't have to edit them.
~/src/django/myblogsite/myblogapp/models.py:
import re
from django.db import models

class Post(models.Model):
    title = models.CharField(maxlength=200)
    slug = models.SlugField(maxlength=100)
    date_created = models.DateTimeField() #auto_now_add=True)
    date_modified = models.DateTimeField()
    tags = models.CharField(maxlength=200)
    body = models.TextField()
    body_html = models.TextField(editable=False, blank=True)
    lc_count = models.IntegerField(default=0, editable=False)

    def get_tag_list(self):
        return re.split(" ", self.tags)

    def get_absolute_url(self):
        return "/blog/%d/%02d/%s/" % (self.date_created.year,
                                      self.date_created.month,
                                      self.slug)

    def __str__(self):
        return self.title

    class Meta:
        ordering = ["-date_created"]

    class Admin:
        pass

class LegacyComment(models.Model):
    author = models.CharField(maxlength=60)
    website = models.URLField(core=False)
    date_created = models.DateTimeField()
    date_modified = models.DateTimeField()
    body = models.TextField()
    post = models.ForeignKey(Post)

    def save(self):
        p = Post.objects.get(id=self.post.id)
        p.lc_count += 1
        p.save()
        super(LegacyComment, self).save()

    class Meta:
        ordering = ["date_created"]

    class Admin:
        pass
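One possible way around the over-counting on edits is to bump the counter only the first time a comment is saved, that is, while it has no primary key yet. The following is a minimal sketch of that idea in Python 3. It uses plain in-memory classes as stand-ins for the Django models so that it runs on its own. These stand-ins, their constructors, and the id bookkeeping are assumptions for illustration, not the blog's actual code.

```python
class Post:
    """In-memory stand-in for the Post model (hypothetical)."""

    def __init__(self, title):
        self.title = title
        self.lc_count = 0


class LegacyComment:
    """In-memory stand-in for the LegacyComment model (hypothetical)."""

    _next_id = 1  # simulates the database handing out primary keys

    def __init__(self, post, body):
        self.id = None  # unset until the first save
        self.post = post
        self.body = body

    def save(self):
        # Only a comment that has never been saved counts as new.
        if self.id is None:
            self.id = LegacyComment._next_id
            LegacyComment._next_id += 1
            self.post.lc_count += 1


post = Post("Example post")
comment = LegacyComment(post, "<p>First!</p>")
comment.save()                      # first save: counter goes to 1
comment.body = "<p>First! (edited)</p>"
comment.save()                      # edit: counter stays at 1
print(post.lc_count)
```

Another option would be to drop the stored counter and count the related comments when needed, at the cost of an extra query per post on the list pages.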
views.py
Here is an excerpt from my views.py file showing the changes:
~/src/django/myblogsite/myblogapp/views.py:
import re
from datetime import datetime
from django.shortcuts import render_to_response
from myblogsite.myblogapp.models import Post, LegacyComment

MONTH_NAMES = ('', 'January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December')
MAIN_TITLE = "Sofeng's Blog 0.0.7"

def frontpage(request):
    posts, pagedata = init()
    posts = posts[:5]
    pagedata.update({'post_list': posts,
                     'subtitle': '',})
    return render_to_response('listpage.html', pagedata)

def singlepost(request, year, month, slug2):
    posts, pagedata = init()
    post = posts.get(date_created__year=year,
                     date_created__month=int(month),
                     slug=slug2,)
    legacy_comments = LegacyComment.objects.filter(post=post.id)
    pagedata.update({'post': post,
                     'lc_list': legacy_comments,})
    return render_to_response('singlepost.html', pagedata)
Templates
In the list page template I used the truncatewords_html template filter to show a 50-word post summary on the list pages instead of the full post. I also added the legacy comment count to the Django free comment count to display the total number of comments.
~/src/django/myblogsite/templates/listpage.html:
{% block main %}
  <br>
  {% for post in post_list %}
    <h4><a href="/blog/{{ post.date_created|date:"Y/m" }}/{{ post.slug }}/">
        {{ post.title }}</a>
    </h4>
    {{ post.body|truncatewords_html:"50" }}
    <a href="{{ post.get_absolute_url }}">Read more...</a><br>
    <br>
    <hr>
    <div class="post_footer">
      {% ifnotequal post.date_modified.date post.date_created.date %}
        Last modified: {{ post.date_modified.date }}<br>
      {% endifnotequal %}
      Date created: {{ post.date_created.date }}<br>
      Tags:
      {% for tag in post.get_tag_list %}
        <a href="/blog/tag/{{ tag }}/">{{ tag }}</a>{% if not forloop.last %}, {% endif %}
      {% endfor %}
      <br>
      {% get_free_comment_count for myblogapp.post post.id as comment_count %}
      <a href="{{ post.get_absolute_url }}#comments">
        {{ comment_count|add:post.lc_count }}
        Comment{{ comment_count|add:post.lc_count|pluralize}}</a>
    </div>
    <br>
  {% endfor %}
{% endblock %}
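For reference, the comment link text in this template amounts to adding the two counts and pluralizing the result. Here is a rough Python 3 equivalent. The function name comment_label is made up for illustration, and the suffix rule assumes the pluralize filter is used without an argument, as it is in the template.

```python
def comment_label(free_count, legacy_count):
    """Build the 'N Comment(s)' link text from the two comment counts."""
    total = free_count + legacy_count
    # Same rule as pluralize with no argument: add 's' unless the total is 1.
    suffix = "" if total == 1 else "s"
    return "%d Comment%s" % (total, suffix)


print(comment_label(0, 1))  # 1 Comment
print(comment_label(2, 3))  # 5 Comments
print(comment_label(0, 0))  # 0 Comments
```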
In the single post template, I added the display of the Legacy comments in addition to the Django free comments.
Excerpt from ~/src/django/myblogsite/templates/singlepost.html:
<a name="comments"></a>
{% if lc_list %}
  <h4>{{ lc_list|length }} Legacy Comment{{lc_list|length|pluralize}}</h4>
{% endif %}
{% for legacy_comment in lc_list %}
  <br>
  <a name="lc{{ legacy_comment.id }}" href="#lc{{ legacy_comment.id }}">
    #{{ forloop.counter }}</a>
  {% if legacy_comment.website %}
    <a href="{{ legacy_comment.website }}">
      <b>{{ legacy_comment.author|escape }}</b></a>
  {% else %}
    <b>{{ legacy_comment.author|escape }}</b>
  {% endif %}
  commented,
  on {{ legacy_comment.date_created|date:"F j, Y" }}
  at {{ legacy_comment.date_created|date:"P" }}:
  {{ legacy_comment.body }}
{% endfor %}
<br>
That's it. Hopefully, I can start using my new blog soon. Please browse around on the new Django site and let me know if you run across any problems. When everything looks to be OK, I'll start posting only on my new Django site.
Here is a screenshot of version 0.0.8:
The live site can be viewed at: http://saltycrane.com/blog
Related posts:
Django Blog Project #1: Creating a basic blog
Django Blog Project #2: Deploying at Webfaction
Django Blog Project #3: Using CSS and Template Inheritance
Django Blog Project #4: Adding post metadata
Django Blog Project #5: YUI CSS and serving static media
Django Blog Project #6: Creating standard blog views
Django Blog Project #7: Adding a simple Atom feed
Django Blog Project #8: Adding basic comment functionality
Comments
Why not use the Blogger API ?
I migrated from Blogger to my own blog that i wrote in Django and i just wrote a script that used the Blogger API to get all my posts and save them as Post objects.
bulkan-savun evcimen,
wow, i haven't seen that Blogger API before. it seems i've done things the hard way again. this may be useful for me to update my Blogger posts to point to my new Django blog. do you mind sharing your script? thanks for the tip. btw, you have a great looking Django blog!