RTF/Word–reStructuredText Toolchain

12:05 Thu 24 Sep 2009. Updated: 14:19 06 Oct 2009

It took me a while to get there, but I now have a working toolchain that automates going from an RTF file (or a Word document) to reStructuredText. The final link took the longest to find, and turned out to have been right there all along (no, I’m not going to turn this into a retelling of The Alchemist). If you want to know how to get from Word to a sane format (like reStructuredText), read on.

What you need:

  • textutil. If you’re not running OS X, I discuss some alternatives below.
  • pandoc. It’s written in Haskell, so you need to install the Haskell Platform for your system and then run sudo cabal install pandoc.

Once you have those, it becomes pretty simple:

textutil -convert html -stdout {source_file} | pandoc -f html -t rst --no-wrap > {output_file}
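Since I already use Python elsewhere in this process, here is a minimal sketch of how the same pipeline could be driven from a script, for example to convert a whole directory of old files. The function names (build_commands, convert, convert_tree) are made up for illustration, and it assumes textutil and pandoc are both on your PATH. As far as I know, newer pandoc releases spell the no-wrap option --wrap=none, so the flag is a parameter rather than hard-coded.

```python
import subprocess
from pathlib import Path


def build_commands(source, wrap_flag="--no-wrap"):
    """Return the argv lists for the two stages of the pipeline."""
    # Stage 1: textutil writes HTML to stdout.
    textutil_cmd = ["textutil", "-convert", "html", "-stdout", str(source)]
    # Stage 2: pandoc reads HTML on stdin and writes reStructuredText to stdout.
    pandoc_cmd = ["pandoc", "-f", "html", "-t", "rst", wrap_flag]
    return textutil_cmd, pandoc_cmd


def convert(source, output, wrap_flag="--no-wrap"):
    """Pipe textutil's HTML into pandoc and write the result to `output`."""
    textutil_cmd, pandoc_cmd = build_commands(source, wrap_flag)
    html = subprocess.run(textutil_cmd, check=True, capture_output=True).stdout
    rst = subprocess.run(
        pandoc_cmd, input=html, check=True, capture_output=True
    ).stdout
    Path(output).write_bytes(rst)


def convert_tree(root, pattern="*.rtf"):
    """Convert every matching file under `root`, writing a .rst file beside it."""
    for source in sorted(Path(root).rglob(pattern)):
        convert(source, source.with_suffix(".rst"))
```

Calling convert_tree("old-documents") would then leave a .rst file next to each .rtf file under that (hypothetical) directory. Passing argument lists instead of a shell string means file names with spaces need no quoting.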

If you don’t have access to textutil, you could try UnRTF, some Perl scripts, rtf2latex2e (if you don’t mind the lack of Unicode support), or rtf2html. If your source files are Word documents rather than RTF documents, you could use OpenOffice.org’s batch conversion utility to get them into RTF and proceed from there. (AbiWord is also an option.)

This is a rather powerful set of tools. It’s extremely useful for me to be able to automate the conversion of my older files, and once they’re in reStructuredText it’s not that hard to get them to other useful formats, including LaTeX and PDF. I have some work yet to do (mainly on styling), but soon I should, for example, be able to have a single reStructuredText file for my résumé that a script can automatically convert to HTML, PDF, HTML suitable for my blog, RTF, Word doc, OpenDocument Text, and plain text. The same should apply for any of my other documents.
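To make the single-source idea a little more concrete, here is a rough sketch of the kind of script I have in mind. It only builds the pandoc command lines for each target format and leaves running them to the caller. The TARGETS table and the export_commands name are my own inventions, the writer names are the ones I believe current pandoc releases accept (older releases may lack some of them, docx in particular), and PDF is left out because it needs a LaTeX installation on top of pandoc.

```python
from pathlib import Path

# Hypothetical mapping from output file extension to pandoc writer name.
TARGETS = {
    ".html": "html",
    ".tex": "latex",
    ".rtf": "rtf",
    ".odt": "odt",
    ".docx": "docx",
    ".txt": "plain",
}


def export_commands(source, out_dir="."):
    """Return one pandoc argv list per target format for an rst source file."""
    source = Path(source)
    commands = []
    for suffix, writer in TARGETS.items():
        target = Path(out_dir) / (source.stem + suffix)
        # -s asks pandoc for a standalone document rather than a fragment.
        commands.append(
            ["pandoc", "-f", "rst", "-t", writer, "-s",
             "-o", str(target), str(source)]
        )
    return commands
```

Each returned list can be handed to subprocess.run(cmd, check=True). Styling the output (the part I still have to do) would mean adding templates or stylesheets to these commands.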

It took me quite some time to find the right pieces for this process, which I’ll describe below in case it’s of use to others.

A while ago I came across pandoc, an extremely useful tool that translates between quite a few different formats, including reStructuredText. Getting it working took a while: it’s written in Haskell, and the MacPorts version of Haskell just wouldn’t install for me, so I eventually switched to the standard Haskell Platform release for OS X. Having access to pandoc meant that I wouldn’t have to write my own converter from some sane format to reStructuredText.

That did still leave me with the problem of getting from RTF to that unspecified sane format. rtf2latex2e looked extremely promising, but it turned out not to support Unicode in RTF. I found UnRTF, but it also had Unicode issues and furthermore would segfault whenever I ran it. rtf2latex had some other issues. I found a set of Perl scripts aimed at the problem, but they gave me odd errors that I couldn’t fix. Finally I came across rtf2txt, which had a critical reference to textutil.

textutil is built into OS X, at least in versions from 10.4 onward, and handles formats including text, RTF, HTML… and even Word documents. Furthermore, the HTML it produces is quite sane, and in many respects better than the OpenOffice.org or AbiWord HTML output, particularly for simple documents. For example, italics are wrapped in <i> tags rather than <span> tags with some difficult-to-parse class defined elsewhere in the document.

I do some additional work in Python to infer document structure from my files and insert that into the reStructuredText, but that’s fairly trivial.
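What that inference looks like depends entirely on how the source files were written, so the following is only an illustration and not my actual script. It assumes a made-up heuristic: a short line with blank lines (or the start or end of the file) on both sides and no closing punctuation is a heading, and gets the underline that reStructuredText uses to mark a section title.

```python
def add_headings(text, max_len=60, underline="="):
    """Underline lines that look like headings (assumed heuristic, see above)."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        before_blank = i == 0 or not lines[i - 1].strip()
        after_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        is_heading = (
            bool(stripped)
            and before_blank
            and after_blank
            and len(stripped) <= max_len
            and stripped[-1] not in ".,;:!?"
        )
        if is_heading:
            # Section titles must start at the left margin, and the
            # underline must be at least as long as the title text.
            out.append(stripped)
            out.append(underline * len(stripped))
        else:
            out.append(line)
    return "\n".join(out)
```

A one-line paragraph without a full stop would be mistaken for a heading, so a real script needs rules tuned to the documents at hand.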

One Response to “RTF/Word–reStructuredText Toolchain”

  1. Darren Griffith Says:

    Thanks for the textutil tip! I’ve been helping friends publish ePubs lately, and Pandoc has been a major part of my toolkit. I’ve been using OpenOffice’s DOC to HTML conversion, but as you’ve said, textutil produces cleaner HTML markup.

