tadhg.com

Improving a Python Word Counting Function

13:33 Thu 29 Apr 2010

This post could be summarized as “regular expressions are a lot faster than naive for loops”.

I’ve been working on improving the script I use for live wordcount in Vim, partly for performance and partly so that I can package it up as a plugin and share it with other people. Along the way I’ve improved the speed of the script rather significantly, and will go through the key part of that change here.

The core part of the script, unsurprisingly, is the count_words() function. As is common, it treats groups of letters surrounded by spaces as “words”. New lines also mark the end of a word. So do a variety of other characters, such as em dashes—in fact, one of the motivations for writing my own wordcount was that I wanted em dashes to be recognized as word separators.

In the following code, self.WORD_SEPS and self.LINE_SEPS are lists containing characters that separate words and lines. self.REPEATER_SEPS contains characters like “-”, which in my opinion does not separate words when used singly (e.g. “twenty-three” is one word) but does when used consecutively; “--” is a simple way to write an em dash when you’re restricted to ASCII. self.IGNORE is a list of characters that cannot form words on their own but aren’t necessarily separators, either.
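The post doesn’t show the actual contents of these lists, but to make the discussion concrete, they might look something like this (illustrative values only, not the script’s real lists):

```python
# Hypothetical example values for the four lists; the post does not
# give the author's actual lists, so these are illustrative guesses.
WORD_SEPS = [u" ", u"\t", u"\u2014"]    # space, tab, em dash
LINE_SEPS = [u"\n", u"\r"]              # newline characters
REPEATER_SEPS = [u"-"]                  # "-" separates only when doubled
IGNORE = [u".", u",", u"'", u'"']       # punctuation that forms no word
```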

The code is quite naive in its approach. It goes through the text letter by letter, checks the letter to see whether or not it’s in one of those lists, adjusts the count accordingly, and also stores previous_char in order to be able to handle the repeater separators.

def count_words(self, text):
    words, lines = 0, 1
    word, previous_char = 0, None
    lineword_separators = self.LINE_SEPS + self.WORD_SEPS
    repeater_test = lambda c, pc: bool(
        c in self.REPEATER_SEPS and pc in self.REPEATER_SEPS)
    separator_test = lambda c, pc: bool(
        c in lineword_separators or repeater_test(c, pc))
    for char in text:
        if separator_test(char, previous_char):
            word = 0
            if char in self.LINE_SEPS:
                lines = lines + 1
        elif char in self.IGNORE:
            pass
        else:
            # It's part of a word.
            if not word:
                words = words + 1
                word = 1
        previous_char = char

    return (words, lines)

With that code as the central function, the script takes about three seconds to process a text file that’s approximately 400K/68,000 words.

The new version eliminates the need for a separate list of repeater separators, is fewer lines of code, and is faster:

def count_words(self, text):
    # Join a list of characters into an alternation pattern, e.g. "a|b|c".
    def ors(l): return r"|".join([re.escape(c) for c in l])
    def retext(text, chars, sub):
        return re.compile(ors(chars)).sub(sub, text)

    lines = text and len(re.compile(ors(self.LINE_SEPS)).split(text)) or 0

    # Replace every separator with a space, strip out ignorable
    # characters, then count the runs of text that remain.
    text = retext(text, self.WORD_SEPS + self.LINE_SEPS, u" ")
    text = retext(text.strip(), self.IGNORE, u"")
    words = text and len(re.compile(r"[ ]+").split(text)) or 0

    return (words, lines)

The basic approach here is different. Rather than evaluating the text character by character, it replaces all of the characters in the separator lists with spaces, removes all of the characters in the ignore list, uses split() to get a list of all the words in the text, and then counts the length of that list.
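Pulled out of the class, the substitute-then-split approach can be sketched as a standalone function (using made-up separator lists, since the script’s own are not shown here; this sketch also skips the repeater-separator handling):

```python
import re

# Illustrative separator lists; the script's actual lists may differ.
WORD_SEPS = [u" ", u"\t", u"\u2014"]
LINE_SEPS = [u"\n"]
IGNORE = [u".", u","]

def ors(chars):
    # Join a list of characters into an alternation pattern, e.g. "a|b|c".
    return u"|".join([re.escape(c) for c in chars])

def count_words(text):
    # A line count is one more than the number of line separators.
    lines = text and len(re.split(ors(LINE_SEPS), text)) or 0
    # Turn every separator into a space, drop ignorable characters,
    # then count the runs of text that remain.
    text = re.sub(ors(WORD_SEPS + LINE_SEPS), u" ", text)
    text = re.sub(ors(IGNORE), u"", text.strip())
    words = text and len(re.split(u"[ ]+", text)) or 0
    return (words, lines)

count_words(u"twenty-three words\u2014here.\nSecond line")  # → (5, 2)
```

Note that the single hyphen in “twenty-three” is not in any list, so it correctly counts as one word.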

The most awkward part of it is having to construct the regular expressions out of the lists of characters, which is what the ors() function does. I could have replaced the lists with regular expressions, but I want it to be easy for users to alter the lists to suit their own preferences, and so left them as plain lists.

(Note that I’m restricted to Python 2.3 here, since this needs to run in MacVim; hence the old-style “x and y or z” idiom instead of the conditional expressions introduced in Python 2.5.)

Using that code instead of the previous version, and operating on the same 400K/68,000-word file, the script takes about 0.12 seconds—a 25-fold speed increase. Since this is a “live” wordcount, one that has to recount the entire file whenever you do something other than typing or adding a single new line, that’s an important difference, and it makes it much more usable when working with large files.
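A rough way to reproduce this kind of comparison yourself is with the timeit module (a modern-Python aside, since timeit accepting a callable postdates 2.3); the two minimal functions below are stand-ins for the two versions, not the post’s actual code:

```python
import re
import timeit

# Simplified stand-ins for the two approaches: a character-by-character
# loop versus a single regex split over runs of separator characters.
SEPS = u" \t\n\u2014"

def naive_count(text):
    words, in_word = 0, False
    for ch in text:
        if ch in SEPS:
            in_word = False
        elif not in_word:
            words, in_word = words + 1, True
    return words

SEP_RE = re.compile(u"[%s]+" % re.escape(SEPS))

def regex_count(text):
    text = text.strip()
    return text and len(SEP_RE.split(text)) or 0

sample = u"one two\u2014three\nfour " * 10000
assert naive_count(sample) == regex_count(sample)

t_naive = timeit.timeit(lambda: naive_count(sample), number=5)
t_regex = timeit.timeit(lambda: regex_count(sample), number=5)
```

On most machines the regex version should come out well ahead, for the reason given above: the scanning happens in optimized C rather than in a Python-level loop.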

I had no illusions that my original version was anything close to optimal when I first wrote it, but I didn’t realize that the speed gains from doing it better would be so significant. Python’s regular expression handling is done by a C module, and I suspect that C module is quite highly optimized indeed. So, if speed is important, let highly-optimized C do the work, instead of doing it in your own code.

One Response to “Improving a Python Word Counting Function”

  1. Phil Gyford Says:

    Thanks for this – very interesting. It would be useful to know what you have in your LINE_SEPS, WORD_SEPS, REPEATER_SEPS and IGNORE lists. I can make some up but I’m sure that over time you’ve compiled more useful lists than I can off the top of my head!

Leave a Reply