Document Conversion

03:08 Mon 19 Mar 2007. Updated: 23:18 19 Mar 2007

I spent a significant chunk of the weekend getting files into my Subversion repository. As I was trying to recreate historical versioning from a bunch of files that weren’t in version control, it took rather a lot of time. I’m rather happy to finally have all my personal stuff under version control. However, there’s a lot of cleaning up left to do.

There are duplicate files, and, sadly, a bunch of corrupt files, so I need to go through every single file in the repository to check it. Time-consuming, but once that’s done, I can rest a lot easier about data integrity.

Another project of mine is to get my files into better formats. I wrote about document formats a while ago, but that was referring primarily to documents I was in the process of editing. For documents that I’m unlikely to edit, (X)HTML seems like a good document format. I like the idea of having all of my stuff in a plain-text format and theoretically available via a browser.

Friends have argued for LaTeX, but I prefer XHTML for now. Naturally, I want valid, semantically-rich, well-structured, clean XHTML, which means that my conversion process isn’t quick. Especially for old Word documents…

(I used to use Word a lot, more or less before I learned better, so tons of my stuff is still in that format. Getting it out of that format and into XHTML will be a good thing.)

RTF files, which are what I still write in, aren’t that much easier to convert to XHTML. Finding acceptable converters for .doc and RTF isn’t that easy, because a lot of them focus (understandably) on trying to preserve formatting details that I simply don’t care about. For the most part, I don’t care what font text was in, or what its margins were. I just care about what the formatting says about that text: was it a heading? A quotation?

So, I’m trying to get from .doc or RTF to clean markup. This will require manual steps; I don’t think I can avoid that. The current process is to open the document in OpenOffice, check to see that it’s readable there, then save it as RTF (if it’s in Word format). Then I use AbiWord at the command line to convert the file to XHTML:

C:\usr\office\abiword\AbiWord\bin\AbiWord.exe --to=html sample_file.rtf

AbiWord’s XHTML is verbose, but for my purposes it is still the best of the converters I’ve found (others might be better if you’re trying to preserve tables or other layout).
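
The AbiWord step could be scripted for a batch of files. What follows is only a rough sketch in Java, not what I actually run: the class name `AbiWordBatch` and its methods are made up for illustration, and the only AbiWord option it relies on is the `--to=html` flag from the command above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class AbiWordBatch {
    // Install path as given in the post; adjust for your own machine.
    static final String ABIWORD = "C:\\usr\\office\\abiword\\AbiWord\\bin\\AbiWord.exe";

    /** Builds the argument list for converting one RTF file. */
    public static List<String> buildCommand(String abiwordExe, Path rtfFile) {
        List<String> command = new ArrayList<>();
        command.add(abiwordExe);
        command.add("--to=html");
        command.add(rtfFile.toString());
        return command;
    }

    /** Runs AbiWord on one file and returns the process exit code. */
    public static int convert(String abiwordExe, Path rtfFile) throws Exception {
        Process process = new ProcessBuilder(buildCommand(abiwordExe, rtfFile))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Each command-line argument is treated as the path of an RTF file.
        for (String name : args) {
            int exitCode = convert(ABIWORD, Paths.get(name));
            System.out.println(name + " -> exit code " + exitCode);
        }
    }
}
```

Keeping `buildCommand` separate from `convert` means the command line can be inspected without AbiWord being installed.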

At that point, I have an XHTML file that’s full of cruft. I started out cleaning it with a series of regular expressions in a jEdit macro (since I have to look at the file to make sure it’s okay anyway, and hence will be opening it in jEdit, I decided I might as well write the cleaning script in jEdit too). I made a lot of regular expressions before realizing that I would need a DOM parser to do it right… so now I’m exploring jTidy with jEdit.
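
To show why a DOM parser helps: a pattern like `<span[^>]*>([^<]*)</span>` cannot match a span that contains another element, whereas a tree walk handles nesting naturally. The sketch below is an illustration only. It uses the XML parser built into the JDK rather than jTidy, the class name `DomCleaner` is hypothetical, and it assumes the input is a well-formed fragment with no DOCTYPE (with one, the parser would by default try to load the DTD, which would need handling).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomCleaner {
    // Attributes that the regular expressions in this post also strip.
    private static final String[] UNWANTED = {"dir", "awml:style", "xml:lang", "lang"};

    /** Parses a well-formed fragment, cleans it, and serializes it again. */
    public static String clean(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            cleanElement(doc.getDocumentElement());

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or serialize the input", e);
        }
    }

    private static void cleanElement(Element element) {
        for (String name : UNWANTED) {
            element.removeAttribute(name); // no-op if the attribute is absent
        }

        // Clean children first; remember the next sibling because a child
        // span may remove itself from the tree while being cleaned.
        Node child = element.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child instanceof Element) {
                cleanElement((Element) child);
            }
            child = next;
        }

        // A span with no attributes left carries no meaning: replace it
        // with its own children. The root element is never unwrapped.
        Node parent = element.getParentNode();
        if ("span".equals(element.getTagName())
                && element.getAttributes().getLength() == 0
                && parent instanceof Element) {
            while (element.getFirstChild() != null) {
                parent.insertBefore(element.getFirstChild(), element);
            }
            parent.removeChild(element);
        }
    }

    public static void main(String[] args) {
        System.out.println(clean(
                "<p dir=\"ltr\"><span lang=\"en-US\">Hello <span>nested</span> text</span></p>"));
    }
}
```

The nested span in the `main` example is exactly the case the `[^<]*` patterns miss, since the outer span’s content contains a `<`.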

Eventually I’ll have something that converts the files to my liking (or very close), and then I’ll start going through them, a process which is likely to take a while. However, I can get through that working on a few files a day, and that’s not going to be too much hassle. In the end, I’ll have files that are much more easily searched, indexed, diffed, and read.

Some of the regular expressions I was using (and might still use) follow, just in case anyone out there might find them interesting.

// Replace useless empty spans
SearchAndReplace.setSearchString("<span xml:lang=\"en-US\" lang=\"en-US\">([^<]*)</span>");
// Remove style element
SearchAndReplace.setSearchString("<style type=\"text/css\">\\n[^<]*<[^<]*</style>");
// Remove AWML:Style attributes
SearchAndReplace.setSearchString(" awml:style=\"[^\"]*\"");
// Remove left-to-right directional attributes
SearchAndReplace.setSearchString(" dir=\"ltr\"");
// Replace inline text-style blockquotes
SearchAndReplace.setSearchString(" style=\"[^\"]*margin-left:[1-9]*[^\"]*\"");
SearchAndReplace.setReplaceString(" class=\"blockquote\"");
// Replace inline text-align styles with classes
SearchAndReplace.setSearchString(" style=\"[^\"]*text-align:([a-z]+)[^\"]*\"");
SearchAndReplace.setReplaceString(" class=\"align-$1\"");
// Remove body_text classes
SearchAndReplace.setSearchString(" class=\"body_text\"");
// Remove elements containing only breaks
SearchAndReplace.setSearchString("<[^>]*><br /></[^>]*>(\n)?");
// Replace multiple class attributes with one, space-delimited
SearchAndReplace.setSearchString("class=\"([^\"]*)\" class=\"([^\"]*)\"");
SearchAndReplace.setReplaceString("class=\"$1 $2\"");
// Remove align-left (it should be the default)
SearchAndReplace.setSearchString(" class=\"align-left\"");
SearchAndReplace.setReplaceString("\"class=\\\"\" + _1.replaceAll(\" align-left\\\"\", \"\\\"\")");
SearchAndReplace.setSearchString(" class=\"([^ ]*) align-left\"");
SearchAndReplace.setReplaceString(" class=\"$1\"");
SearchAndReplace.setSearchString(" class=\"align-left ([^\"]*)\"");
SearchAndReplace.setReplaceString(" class=\"$1\"");
SearchAndReplace.setReplaceString("\"class=\\\"\" + _1.replaceAll(\" align-left \", \" \")");
// Add extra line break after paragraphs
// Remove empty paragraphs
SearchAndReplace.setSearchString("<p[^>]*>[ ]*</p>\n");
// Replace blockquotes
SearchAndReplace.setSearchString("<p class=\"blockquote\">([^<]*)</p>");
// Remove empty spans
SearchAndReplace.setSearchString("<span[^>]*>[ ]*</span>");
SearchAndReplace.setReplaceString(" ");
// Remove xml:lang and lang
SearchAndReplace.setSearchString(" xml:lang=\"[^\"]*\"");
SearchAndReplace.setSearchString(" lang=\"[^\"]*\"");
// Remove font styling
SearchAndReplace.setSearchString(" style=\"font-size:[0-9]*pt[;]*\"");
// Remove spans with font styling
SearchAndReplace.setSearchString("<span style=\"font-size:[^\"]*\">([^<]*)</span>");
// Remove useless spans
// Remove font styling
// Replace italics
SearchAndReplace.setSearchString("<span style=\"font-style:italic[;]*\">([^<]*)</span>");
// Replace underline
SearchAndReplace.setSearchString("<span style=\"text-decoration:underline[;]*\">([^<]*)</span>");
SearchAndReplace.setReplaceString("<em class=\"underline\">$1</em>");
// Replace bold
SearchAndReplace.setSearchString("<span style=\"font-weight:bold[;]*\">([^<]*)</span>");
// re-do footnotes
SearchAndReplace.setSearchString("<span class=\"ABI_FIELD_footnote_ref\" (id=\"footnote_ref-[0-9]+\")><a (href=\"#footnote_anchor-[0-9]+\")>([0-9]+)</a></span>");
SearchAndReplace.setReplaceString("<a class=\"footnote_anchor\" $1 $2>$3</a>");
SearchAndReplace.setSearchString("<span class=\"ABI_FIELD_footnote_anchor\" (id=\"footnote_anchor-[0-9]+\")><a (href=\"#footnote_ref-[0-9]+\")>([0-9]+)</a></span>");
SearchAndReplace.setReplaceString("<a class=\"footnote_ref\" $1 $2>$3</a>");
// Make headings
SearchAndReplace.setSearchString("<p class=\"_heading_([0-9]+)( heading)\"([^>]*)>([^<]*)</p>");
// Remove useless space after p tags
SearchAndReplace.setSearchString("(<p[^>]*>)[ \t]*");
// Replace XML-style ids
// Replace extra amp; for em dashes
// Remove empty lines
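
For anyone who wants to try a few of these outside jEdit, here is a minimal sketch that applies an ordered list of search and replace pairs using Java’s built-in regular expressions. The class name `RegexCleaner` is made up, only four of the patterns above are included, and the empty replacement for the two "remove" rules is my assumption about what those rules do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RegexCleaner {
    // Ordered search -> replace pairs; a LinkedHashMap keeps insertion order,
    // which matters because later rules depend on what earlier ones produce.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        // Remove left-to-right directional attributes (empty replacement assumed)
        RULES.put(" dir=\"ltr\"", "");
        // Replace inline text-align styles with classes
        RULES.put(" style=\"[^\"]*text-align:([a-z]+)[^\"]*\"", " class=\"align-$1\"");
        // Replace multiple class attributes with one, space-delimited
        RULES.put("class=\"([^\"]*)\" class=\"([^\"]*)\"", "class=\"$1 $2\"");
        // Remove align-left, since it should be the default (empty replacement assumed)
        RULES.put(" class=\"align-left\"", "");
    }

    /** Applies every rule, in order, to the whole input string. */
    public static String clean(String html) {
        String result = html;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replaceAll(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(clean("<p dir=\"ltr\" style=\"text-align:center\">Title</p>"));
    }
}
```

The order matters: the text-align rule can create a second class attribute, which the merge rule then folds into the first.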
