Thursday, September 13, 2007

Counting Words in Files With HTML Markup

I write blog posts with HTML markup, and I sometimes want a fairly accurate word count of my posts. By accurate I mean that HTML tags themselves, as well as quoted attribute values, are not counted as words. There are lots of utilities and scripts that do word counting, from the venerable Unix 'wc' to an elisp subroutine in the FSF's An Introduction to Programming in Emacs Lisp. The ones I looked at all suffered from the same problem - they counted markup as 'words'. If there were some way to strip out or ignore markup, the various methods of word counting would work.
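
A quick illustration of the problem (just a made-up one-liner):

echo '<p class="intro">Hello, world.</p>' | wc -w

'wc' reports 3 words here, even though the rendered text contains only two.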

First I tried a few ready-made utilities. The Unix text-mode browser lynx has a 'dump' option that outputs the formatted text content of a given HTML file (lynx -dump -nolist foo.html), but some of that formatting is itself counted as words by 'wc'. w3m produces similar output, so it has the same problem. I found a Debian package called unhtml that seemed to do what I wanted, but after experimenting with it a bit, I found that it could not handle multiple opening and closing tags on the same line (it counted them as one tag, meaning any real words on that line were skipped). Thinking I might have to write my own utility, I set out not to reinvent the wheel and did a CPAN search - and had success on the first hit. After a few tests I found that HTML::Strip did indeed handle multiple tags on a line, as well as HTML comments and quoted values, properly.
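
In case you haven't used it, the basic HTML::Strip interface is about as simple as it gets. Here is a minimal sketch (not the wrapper script itself, just the module in action):

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Strip;

# Two tags and a comment on a single line - the case unhtml choked on.
my $html = '<p>one <b>two</b> three</p> <!-- not a word -->';

my $hs   = HTML::Strip->new();
my $text = $hs->parse($html);   # returns the text with the markup removed
$hs->eof;                       # reset the parser's internal state

print "$text\n";                # roughly "one two three" (spacing may vary)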

The next step was to write a wrapper around HTML::Strip for command line use. After a bit of hacking, I came up with unhtml.pl. From the script header:
Script that strips HTML tags from text. It uses HTML::Strip to do the real work; this is a wrapper around that module that allows you to specify command line arguments - standard input/output is assumed if no args are given. If only one arg is given, it is assumed to be the input pathname.

Requires HTML::Strip (perl -MCPAN -e 'install HTML::Strip' as root on any Unix-based OS will work).

Examples (the following have equivalent results):

unhtml.pl < foo.html > foo.txt
unhtml.pl foo.html > foo.txt
unhtml.pl foo.html foo.txt
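
The script's header is quoted above but not the script itself, and the wrapper logic amounts to very little code. Here is a rough sketch of a wrapper along those lines (not the actual unhtml.pl; the error messages and argument handling details are my own):

#!/usr/bin/perl
# Sketch of a command line wrapper around HTML::Strip: read HTML from a
# file or standard input, strip the markup, and write the text to a file
# or standard output, as in the examples above.
use strict;
use warnings;
use HTML::Strip;

my ($in_fh, $out_fh);
if (@ARGV >= 1) {
    open($in_fh, '<', $ARGV[0]) or die "Can't read $ARGV[0]: $!\n";
} else {
    $in_fh = \*STDIN;
}
if (@ARGV >= 2) {
    open($out_fh, '>', $ARGV[1]) or die "Can't write $ARGV[1]: $!\n";
} else {
    $out_fh = \*STDOUT;
}

my $html = do { local $/; <$in_fh> };   # slurp the whole input at once
my $hs   = HTML::Strip->new();
print {$out_fh} $hs->parse($html) if defined $html;
$hs->eof;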


I also needed a way to integrate this into Emacs. Here is an elisp snippet you can put in your .emacs (don't forget to modify the path to the script):
(defun word-count nil
  "Count words in region"
  (interactive)
  (shell-command-on-region (point) (mark)
                           "/home/dmaxwell/bin/unhtml.pl | wc -w"))
(global-set-key "\C-c=" 'word-count)

As a bonus, it also handles XML and SGML properly. To use it while editing, just type C-c = to get a word count of the current region (use C-x h to make the region the entire buffer), minus HTML tags.
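
The same count is available straight from the shell, since the Emacs binding just pipes the region through the script and 'wc':

unhtml.pl foo.html | wc -w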
