tadhg.com

Better Word Count in Vim

23:40 Sun 17 Jan 2010

I’m currently trying out Vim (again), and have made more progress this time, mainly due to Seth’s help. The key things that have made it better:

  • :set hidden. Absolutely critical, this. Stops Vim from complaining when you try to switch buffers and your current buffer has unsaved changes.
  • bufexplorer. Makes switching buffers a lot easier.
  • A better Python syntax file. I didn’t like the defaults.
  • My own indentation and syntax files for reStructuredText.

Really, though, the most important one was :set hidden. Before that I felt that I had completely misunderstood Vim’s file management model.

Once I got the syntax highlighting to a reasonable state, I ported my word count macro over to Vim. This wasn’t too hard once I got past the inevitable character encoding problems. When reading buffer text from Python inside a Vim script, I strongly suggest decoding each line explicitly:

ulines = [unicode(line, "utf-8") for line in vim.current.buffer]

(Assuming your Vim encoding is set to UTF-8, of course.)
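
The one-liner above is Python 2: unicode() does not exist in Python 3. As a minimal sketch, assuming a Python 3 environment (where, as far as I know, Vim's Python interface already returns buffer lines as text), a small helper can decode only when it is handed bytes. The helper name to_text and the sample lines are hypothetical, and a plain list stands in for vim.current.buffer so the snippet runs outside Vim:

```python
def to_text(line, encoding="utf-8"):
    """Return `line` as text, decoding it only when it is a byte string."""
    if isinstance(line, bytes):
        return line.decode(encoding)
    return line


# A plain list stands in for vim.current.buffer (assumption for this sketch).
buffer_lines = ["caf\u00e9 au lait".encode("utf-8"), "plain ascii"]
ulines = [to_text(line) for line in buffer_lines]

# Character counts are taken on the decoded text, not on the raw bytes.
char_count = len(ulines[0])
byte_count = len(buffer_lines[0])
```

Counting on the decoded text matters for the chars figure: the accented character is one character but two bytes in UTF-8.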

The Vim version of my script isn’t as versatile yet as the jEdit version, because I don’t know how to make it work with only selected lines (something that’s easy in jEdit). Apart from that, though, it seems to work quite well; the next version of it might do “live” word count in the status bar.
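
One possible way to handle selected lines, sketched here as an assumption and not something the script below does: pass a first and last line number into Python and slice the lines before counting. Vim ranges are 1-based and inclusive, so the slice needs an offset. The function select_lines and its parameters are hypothetical, how the line numbers get from Vim to Python is left open, and a plain list stands in for vim.current.buffer:

```python
def select_lines(lines, first, last):
    """Return lines first..last, both 1-based and inclusive, like a Vim range."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return lines[first - 1:last]


# A plain list stands in for vim.current.buffer (assumption for this sketch).
buffer_lines = ["one two", "three", "four five six", "seven"]
selected = select_lines(buffer_lines, 2, 3)

# Naive whitespace split, used only as a stand-in for the real word counter.
word_total = sum(len(line.split()) for line in selected)
```

The selected lines could then be joined with newlines and handed to the same counting code that the whole-buffer version uses.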

I’m writing this post in Vim, more or less: the actual writing happens in Vim, but creating the file and the template, the automated expansion of various reStructuredText entities, output to reStructuredText, and publication all still happen in jEdit, as I haven’t ported those over yet.

The script:

function! WordCount()
python << EOF
import re
import vim


class WordCounter(object):
    """
    Vim script for better word count.
    """

    LINE_SEPARATORS = (
        "\r",
        "\n"
    )

    WORD_SEPARATORS = (
        " ",        # space
        "\t",       # tab
        "/",        # slash
        "&",        # ampersand
        '"',        # double quotation mark, straight
        u"\u201C",  # double quotation mark, left
        u"\u201D",  # double quotation mark, right
        u"\u2018",  # single quotation mark, left
        u"\u2013",  # en dash
        u"\u2014",  # em dash
        ">",        # greater than symbol
        "<",        # less than symbol
        "+",        # plus
        "=",        # equals
    )

    REPEATER_SEPARATORS = (
    #These are only separators if they're present consecutively, e.g. -- or ..
        "-",
        "."
    )

    IGNORE = (
    #Not separators per se, but should not be treated as word content
        "'",        # single quotation mark, straight
        u"\u2019",  # single quotation mark, right
        "(",        # left parenthesis
        ")",        # right parenthesis
        "[",        # left bracket
        "]",        # right bracket
        "{",        # left curly bracket
        "}",        # right curly bracket
        "|",        # bar
        "-",        # hyphen
        "#",        # hash mark
        ".",        # period
        "_",        # underscore
        "`",        # backtick
        "\\",        # backslash
    )

    def word_count(self):

        ulines = [unicode(line, "utf-8") for line in vim.current.buffer]
        text = u"\n".join(ulines)

        chars = len(text) #Pretty sure I want the actual char count, not the adjusted char count.

        text = self.remove_directives(text)
        text = self.adjust_for_rest(text)
        words, lines = self.count_words(text)

        print "chars: %s, words: %s, lines: %s" % (chars, words, lines)

    def remove_directives(self, text):
        textlines = text.split("\n")
        newlines = []
        comment = re.compile(r"[ ]*\.\. [a-zA-Z0-9_\|]")
        argument = re.compile(r"    :[^\:]*:")
        for line in textlines:
            if not comment.match(line) and not argument.match(line):
                newlines.append(line)
        return "\n".join(newlines)

    def adjust_for_rest(self, text):
        """
            Go through each of the special cases for reST.
        """
        text = self.rest_adjust_pipe_space(text)

        return text

    def rest_adjust_pipe_space(self, text):
        """
            Special-case "|\ " to make sure e.g. "|Hypnotic Specter|\ s"
            doesn't get counted as three words.

            |Incinerate|\ s |Hypnotic Specter|\ —|Hypnotic Specter|\ s
            The above line should be counted as five words.
        """
        # Match "|\ " followed by one non-space character and drop the "\ ".
        spacere = re.compile(r"\|\\ ([^ ])")
        text = spacere.sub(r"|\g<1>", text)
        return text

    def count_words(self, text):
        words, lines = 0, 1
        #go through the text character by character:
        word, previous_character = 0, None
        for character in text:
            is_separator = character in (self.LINE_SEPARATORS + self.WORD_SEPARATORS)
            is_repeat = (character in self.REPEATER_SEPARATORS
                         and previous_character in self.REPEATER_SEPARATORS)
            if is_separator or is_repeat:
                #it's a separator
                word = 0
                if character in (self.LINE_SEPARATORS):
                    lines = lines + 1
            elif character in (self.IGNORE):
                pass
            else:
                #it's part of a word.
                if not word:
                    words = words + 1
                    word = 1
            previous_character = character

        return (words, lines)

WordCounter().word_count()
EOF
endfunction

if !exists(":WW")
  command! WW  :call WordCount()
endif
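
To see the counting rules in action outside Vim, here is a condensed sketch of the same logic, assuming Python 3 and leaving out remove_directives. The names mirror the script, but the separator tuples are rewritten as module-level sets and the functions are standalone, so this is an illustration and not a drop-in replacement. It runs the example line from the rest_adjust_pipe_space docstring with and without the reST adjustment:

```python
import re

# Separator sets mirroring the script's tuples (rewritten as sets for brevity).
LINE_SEPARATORS = set("\r\n")
WORD_SEPARATORS = set(" \t/&\"\u201c\u201d\u2018\u2013\u2014><+=")
REPEATER_SEPARATORS = set("-.")
IGNORE = set("'\u2019()[]{}|-#._`\\")


def rest_adjust_pipe_space(text):
    """Join a reST substitution to the character after its escaped space."""
    return re.sub(r"\|\\ ([^ ])", r"|\g<1>", text)


def count_words(text):
    """Return (words, lines) using the same character-by-character rules."""
    words, lines = 0, 1
    in_word, previous = False, None
    for ch in text:
        # "-" and "." only separate words when two of them are adjacent.
        repeated = ch in REPEATER_SEPARATORS and previous in REPEATER_SEPARATORS
        if ch in LINE_SEPARATORS or ch in WORD_SEPARATORS or repeated:
            in_word = False
            if ch in LINE_SEPARATORS:
                lines += 1
        elif ch in IGNORE:
            pass  # neither a separator nor word content
        elif not in_word:
            words += 1
            in_word = True
        previous = ch
    return words, lines


sample = "|Incinerate|\\ s |Hypnotic Specter|\\ \u2014|Hypnotic Specter|\\ s"
unadjusted = count_words(sample)                         # (7, 1)
adjusted = count_words(rest_adjust_pipe_space(sample))   # (5, 1)
hyphens = count_words("well-known wait--what")           # (3, 1)
```

Without the adjustment, each trailing "s" after the escaped space is counted as its own word, which is why the raw line comes out at seven words and the adjusted line at the intended five. The last call shows the repeater rule: a single hyphen keeps "well-known" as one word, while the doubled hyphen splits "wait--what" in two.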

One Response to “Better Word Count in Vim”

  1. Seth Milliken Says:

    :help :user-commands since this should be a command, not a function. Because…
    :help :command-range gives you access to a range.
    :help skeleton for templating, or check out one of the snippet plugins on vim.org.

    Also, for completeness, don’t forget the importance of having ‘set nocompatible’ and ‘filetype plugin indent on’ in your .vimrc.
