Monthly Archives: December 2006

Recursively checking for missing alt attributes on img tags with 38 lines of Python code.

Today was the first time I’ve used my newfound Python skills in a professional capacity. As part of a larger project, the client wanted me to check each HTML document for image tags missing their “alt” attributes. Knowing how tedious and imprecise it would be to do this manually, I decided that writing a script was the only logical choice. It would also be handy to run the check both locally on their Windows-based workstations before the files were uploaded and on the Linux-based Apache web server. Since I had already worked with the HTMLParser library, I figured this would be a perfect opportunity for Python.

Here is the entire source. Notice that I don’t have to contend with any regular expression torture, nor do I have to implement my own function to recursively search the directory structure. Python can do both these things automatically.

import urllib
import htmllib
import formatter
import os
from os.path import join
import sys
filename = ''

class ImgParser(htmllib.HTMLParser):
    global filename

    def __init__(self, formatter):
        htmllib.HTMLParser.__init__(self, formatter)
        self.imgs = []

    def start_img(self, attrs):
        tag = dict(attrs)
        if 'alt' not in tag:
            self.imgs.append( (filename, tag.get('src', '(no src attribute)')) )

    def get_imgs(self):
        return self.imgs

format = formatter.NullFormatter()
htmlparser = ImgParser(format)

search_root = sys.argv[1]

for root, dirs, files in os.walk(search_root):
    for name in files:
        if name.endswith('.html'):
            filename = join(root, name)
            f = open(filename, 'r')
            htmlparser.feed(f.read())
            f.close()
htmlparser.close()

imgs = htmlparser.get_imgs()
current_file = ''
print "\n\nThe following files contain images with missing alt attributes:\n"
for filename, src in imgs:
    if current_file != filename:
        current_file = filename
        print filename
    print "   " + src

I’m normally pretty careful to comment code that other people are likely to read, but this is so simple that I didn’t bother. The “os.walk” function automatically handles the recursive directory walking, and if a filename ends in “.html”, the file is opened and run through an instance of the “ImgParser” class. ImgParser is a new class that inherits from the HTMLParser class in the “htmllib” module (library). The “start_img” method overrides the handler that is called when the parser encounters the start of an “img” tag. If an img tag is encountered without an alt attribute, the filename and the img’s src attribute are added to the list. Once all of the files are processed, the script loops through the list and outputs the incomplete img tags and the files in which they are contained.

It’s neither sexy nor full of features, but it only took me (a Python noob) 10 minutes to write. The path to search is read from the first command-line argument (sys.argv[1]), so the script has to be executed as python scan_img_tags.py /var/html or python scan_img_tags.py c:/myhtmlstuff. The script did exactly what I needed it to do, so I’m not likely to polish it up. But if I were, I would probably add line number reporting, a little error trapping, and the ability to specify additional file extensions from the command line.
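For anyone curious, here is a rough sketch of what those three improvements might look like. It is not the script above. It assumes Python 3, where htmllib no longer exists, so it swaps in html.parser.HTMLParser and uses its getpos() method for line numbers. The names ImgAltParser, scan_file, scan_tree, and main are made up for this illustration.

```python
import os
import sys
from html.parser import HTMLParser


class ImgAltParser(HTMLParser):
    """Collect (line number, src) for every img tag without an alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> tags, because the default
        # handle_startendtag() delegates to handle_starttag().
        if tag == "img":
            attributes = dict(attrs)
            if "alt" not in attributes:
                line, _column = self.getpos()
                self.missing.append((line, attributes.get("src", "(no src)")))


def scan_file(path):
    """Parse one file with a fresh parser and return its missing-alt list."""
    parser = ImgAltParser()
    with open(path, encoding="utf-8", errors="replace") as handle:
        parser.feed(handle.read())
    parser.close()
    return parser.missing


def scan_tree(search_root, extensions=(".html",)):
    """Walk search_root and return {path: [(line, src), ...]} for offenders."""
    results = {}
    for root, _dirs, files in os.walk(search_root):
        for name in files:
            if name.lower().endswith(tuple(extensions)):
                path = os.path.join(root, name)
                try:
                    missing = scan_file(path)
                except OSError as error:
                    # A little error trapping: report unreadable files, keep going.
                    print("Skipping %s: %s" % (path, error), file=sys.stderr)
                    continue
                if missing:
                    results[path] = missing
    return results


def main(argv):
    # argv[0] is the directory; any further arguments are extra extensions.
    extensions = tuple(argv[1:]) or (".html",)
    results = scan_tree(argv[0], extensions)
    for path in sorted(results):
        print(path)
        for line, src in results[path]:
            print("   line %d: %s" % (line, src))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Under those assumptions it would be run as python scan_img_tags.py /var/html .html .htm, with the extensions being optional. An empty alt="" counts as present here, which seems reasonable for decorative images.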

Hopefully someone else can find some use for this snippet.


A PHP Developer Spends a Week with Python

Last week, I wrote about my discontent with PHP and why I chose to give Python a shot at becoming my new tool of choice. Since then, Python and I have spent a lot of time together getting acquainted. I’m finally becoming accustomed to the lack of semicolons and curly braces and have accepted Python’s whitespace neurosis. It hasn’t completely dethroned PHP yet because I am still evaluating which web framework I want to invest the time into learning first (Django, TurboGears, Pylons) and then picking my templating system (Kid, Myghty, Cheetah, Genshi, Clearsilver, etc.). But the more I use Python, the bigger a fan I become. After only one week, I’m fluent enough to be writing useful stuff and I’m having more fun doing it. This is how I got started…

Snakes on my brain

First, I highly recommend picking up two books: Learning Python (ISBN: 0596002815) and Beginning Python (ISBN: 159059519X). Learning Python reads like a dry tome that would be right at home in an introductory CS class. It is very heavy on the theory (you don’t actually get to start doing anything until about page 60) and, having been published almost five years ago, is somewhat outdated. However, it explains crucial details about how Python works under the hood. For example, unlike any language I’ve ever used, variables are simply references to objects. This might sound insignificant, but consider the following code and its output in a Python interactive prompt:

>>> languages = ["Pascal", "Python", "Perl", "PHP"]
>>> thepeas = languages
>>> print thepeas
['Pascal', 'Python', 'Perl', 'PHP']
>>> print languages.pop()
PHP
>>> print thepeas
['Pascal', 'Python', 'Perl']
>>> print thepeas.pop()
Perl
>>> print languages
['Pascal', 'Python']

As you can see, languages and thepeas are two references to the same object, so changing one affects the other. This would have been a very frustrating concept to initially encounter if I hadn’t read Learning Python first.
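When two independent lists are actually wanted, the usual way around this is to copy the list instead of assigning it. The following is a small sketch in Python 3 syntax (print is a function there, unlike in the session above). The names alias and snapshot are invented for the example.

```python
languages = ["Pascal", "Python", "Perl", "PHP"]
alias = languages            # a second name for the same list object
snapshot = list(languages)   # a new list holding the same items (shallow copy)

languages.pop()              # removes "PHP" from the shared list

print(alias is languages)     # True: both names point at one object
print(snapshot is languages)  # False: the copy is a separate object
print(alias)                  # the pop is visible through the alias
print(snapshot)               # the copy still has all four items
```

The copy is shallow: it gets its own list, but any mutable items inside it would still be shared.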

On the other hand, Beginning Python was published only in 2005 and seems to be better aimed at the self-instruction crowd. It moves along quite a bit faster and includes a greater number of examples. Plus, I find the author’s style of writing more approachable. However, pace and practicality come at the cost of glossing over some rather important theory. I’ve been reading both books concurrently and find that this approach allows me to soak up more information and helps keep my attention. If you have to choose one book though, get Beginning Python.

Tale of two committees

Finally getting into the mix (writing code, reading the online docs and my books, and browsing tutorials), I am immediately struck by a key difference in philosophy: PHP was created solely with web development in mind, while Python is a Swiss Army Knife. When using PHP, I just fire up my text editor, add a “<?php” and get cranking. Testing normally consists of FTPing or SCPing the .php file to my server and hitting “Reload” in my browser. Form data is easily accessible with $_POST['stuff'] and MySQL extensions are most likely installed too. PHP is preconfigured on almost every Apache web server and everything Just Works™, which is why I (and everyone else) learned PHP in the first place.

Python, however, doesn’t know what you want to do with it out of the box. It’s a very modular tool that feels right at home either crunching scientific data or processing HTML forms. Fitting with Python’s “batteries included” mantra, you can do almost all of these things without having to download any additional libraries (and if you can’t find a built-in library that does what you want, just head over to the Cheese Shop). However, because it would be hugely wasteful to automatically load every library into memory, Python does require this be done manually with the import statement: “import re” will get you regular expressions, “from reportlab.pdfgen import canvas” will allow you to generate PDFs (once the third-party ReportLab package is installed), “import this” will give you geek poetry (yes, really), etc. Using Python for web development will most likely entail installing and configuring mod_python on Apache, or using the CGI handler. Database interaction modules also have to be manually imported. Doing all of this isn’t difficult, or even time-consuming, but it does require adapting to the Python way of thinking: “Batteries included… Some assembly required.”
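To make the “some assembly required” part concrete, here is a tiny sketch in Python 3 syntax that uses only the standard library. The query string and the sample sentence are invented for the illustration. parse_qs lives in urllib.parse in Python 3, and it is only a rough stand-in for what $_POST gives you in PHP, not a complete form handler.

```python
import re                            # regular expressions
from urllib.parse import parse_qs    # decoding of form-encoded data

# Roughly what a submitted form body looks like on the wire.
form = parse_qs("stuff=hello+world&lang=Python")
print(form["stuff"][0])   # hello world

# Nothing from the re module is available until it has been imported.
text = "Learning Python (ISBN: 0596002815) and Beginning Python (ISBN: 159059519X)"
isbns = re.findall(r"ISBN: (\w+)", text)
print(isbns)              # ['0596002815', '159059519X']
```

Each value in the parse_qs result is a list, because a form field can be submitted more than once.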

Flexibility, thy name is Python!

If you have the patience to relearn a few things, Python can be very rewarding. I learn best by building stuff that I need, so I decided that my first project would be an offline web-browser / site cacher. It’s not the simplest thing to start with, but it would help me evaluate Python in a real-world application. The program would connect to a site, download the HTML document and its associated images and stylesheets, convert absolute URLs to relative URLs, and save everything to disk with the same directory structure. Having never used Python before, I needed two days to complete this project from start to finish. “Bah!” I can hear the readers now, “I could have done the same thing in PHP in a quarter of the time and I wouldn’t have had to learn anything new!” While that may be true, four hours after finishing the initial version, I had also converted the crawler into a multithreaded application that improved its performance by a factor of ten. Try that with PHP.
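The crawler itself isn’t shown here, but the general idea behind the multithreaded conversion can be sketched. This is an illustration only, not the original code. It assumes Python 3 and uses concurrent.futures.ThreadPoolExecutor, which did not exist when this post was written (the threading module would have been the tool back then). The names fetch and fetch_all are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=10):
    """Fetch many URLs concurrently.

    Returns a dict mapping each URL to its bytes, or to the exception
    that was raised while fetching it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the downloads overlap.
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as error:  # one bad URL should not stop the rest
                results[url] = error
    return results
```

Threads help here because a crawler spends most of its time waiting on the network, so several downloads can be in flight at once. How much faster that turns out to be depends on the site and the connection.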


From PHP to Python

Before I wrote my “Dear John” letter to PHP the other day, I had largely decided to move to Python as my primary tool. I had been on the fence for months about choosing to pursue Ruby or to follow the Python path. I purchased a book about each and did quite a bit of other research as well. The fact that there are so many passionate flamewars about which is “better” was actually quite helpful. It proved a suspicion that I had formed during my research: it didn’t really matter which one I chose. There would be nothing to argue about if one were clearly the best language to learn. There isn’t a whole lot of debate about which is better for web development: Python or Lisp, Ruby or Delphi? Since Python and Ruby are both well suited to web development, it really boils down to personal preference. Having experimented with both, here is why I chose Python. Your mileage may vary.

Similar purpose, different philosophy

I primarily dinged PHP for not having what I call a guiding philosophy. There seems to be a disconnect between Zend Technologies and the PHP community, and that lack of focus is crippling PHP. In some ways it reminds me of the relationship Red Hat has with the OSS community. Neither seems to completely trust the other and there’s friction when deciding on how to move a project forward. Almost from Day 1 (well, make that PHP 3.0) Zend seems to have focused more on building a language around a business rather than the other way around. Trying to be everything to everyone, focusing more on the quantity of installations than on security, and borrowing features rather than innovating isn’t a recipe for success in the world of open source. That sounds more like how commercial software is churned out.

I find both Python and Ruby are far more community-driven and focused. However, the communities have distinct, and dissimilar, philosophies. Ruby is the kindergarten teacher who gives each of her students the big, 64-color box of Crayola crayons. The children are given the freedom to draw a dog in whatever manner they choose. In Python school, students are assigned only four crayons (red, yellow, blue, and black) and are given step-by-step instructions on how to draw the best dog possible. Now, I’m not saying that Python has fewer features than Ruby. The Python community simply has a more formal structure and a greater focus on the nuts-and-bolts of improving the language as a whole. Ruby is less inclined to define specific guidelines and prefers to give programmers several options, allowing them to pick and choose.

Ruby’s community more closely maps to my personality (hey, my company is named RightBrain Networks). However, I am finding myself involved with increasingly larger web projects, both solo and with other developers. When developing a large project, even solo, consistency is paramount. The less time I spend figuring out what the other guy’s code is supposed to do (or what I wrote last night at 3am), the better. It also saves me from having to research which one of the choices is best in my current situation. Ruby subscribes to the “Principle of Least Surprise”, meaning that the language ought to bend to the programmer. However, I find myself preferring Python’s approach, which is that if the language already knows the best approach, then I should bend to it. If we all agree that the “Python Way” is best, then we don’t need to adapt to each other’s individual styles. Hence, larger projects are both more efficient and more fun because I can spend more time writing code than deciphering/researching it.

Is Ruby a one-trick-pony?

I’ve met more than one person who’s wanted to learn (or wanted to hire someone who learned) “Ruby on Rails”, not knowing that Ruby existed a full decade before being attached at the hip to Rails. I’ve played with Rails and am quite impressed with its capabilities. However, what happens once the honeymoon is over? It seems like a risky career move to jump from buzzword to buzzword. I know that Ruby exists outside of Rails, but I’m not sure how practical it would be to learn Ruby if I weren’t jumping on the Rails bandwagon.

Both Python and Ruby have been around since the early/mid ’90s. Which has been better adopted by organizations that may wish to hire me? On Monster.com there are currently 347 job listings with “Ruby” contained as a keyword. Of those 347, only 95 don’t specifically mention “Ruby on Rails”. There are also 923 listings looking for someone with Python experience. While not strictly scientific, I feel that indicates that Python stands on its own merits, apart from its web frameworks, more so than Ruby. Again though, I’m not talking about technical points, on which I feel the two are very comparable. However, I would argue that employability should be given more weight than technical minutiae when planning a career.

If the Rails framework started to take a direction I didn’t care for, I wouldn’t really have many options. My skills wouldn’t be very portable. However, by learning Python, I can pick from several different, well-known frameworks: Django, TurboGears, Pylons, or web.py. Or if I ever wanted to get out of web development, Python is used in many other applications.

Conclusion

While I like certain aspects of Ruby more than Python (as a newbie, the whitespace thing is still driving me nuts!), I feel that Python is overall a better fit for me. Its greater emphasis on structure will help make developing and maintaining large projects easier. And because both the language and the jobs aren’t so intimately tied to one framework, my skills are more portable and future-proof. I’m sure there are people out there who will disagree with my analysis, so if you’re trying to decide which path to take, please take the time to play with both of them and ask advice from both camps. Thanks for reading.
