### GPS, URLs, Math, Python, Featuritis

.Earlier this evening Gever suggested a service dedicated to shortening URLs that had geolocation data in them. My immediate responses were that a) this was a great idea, and b) that I wanted the shortened URLs to still be human-readable in some sense—specifically, I wanted a person to be able to look at two URLs returned by this service and have some idea of how close to each other they were.

Gever hadn’t had that feature in mind at all, and so I had added a major feature as soon as I encountered this idea for a service.

We were at PyWebSF, so there were a number of programmers present, all of whom liked the idea, and all of whom appeared to have entirely different approaches to implementing it. Quadtrees and octrees were mentioned by the more CS-minded, but my approach was more basic (no pun intended): I thought it would be possible simply to encode the numbers in some higher base system and effectively shorten them that way.

I started with base 64, and looked at Base64 and at the relatively easy Python implementation of an encoding system using it, but found that I didn’t like it, for a few reasons. The main one was that the encoding just wasn’t that friendly: incrementing by one wasn’t that trivial, for example, and to me that made comparison of numbers difficult. I decided that the Base64 approach was clearly not designed with human readability in mind, and that one that was would map to base 10 for the numbers 0–9, then go to lowercase letters a–z, with a being 10, b 11 … z 35, and then uppercase letters A-Z, with A being 36, B 37 … Z 61. Then I tacked on _ for 62 and * for 63, for no particular reason. Apart from the last two, my choices were based on what I felt most people would grasp easily: numbers were their own values, uppercase letters were more than lowercase letter, and letters progress in alphabetical order.

Of course, that meant implementing my own system for the conversion, which I did in likely inefficient manner using Python `dict`s and some math for converting base systems in rather hacky fashion.

Given that I wanted the coordinates to remain somewhat readable, I decided to leave the integer parts of them alone. So the coordinates `53.10680448323265, -9.580078125` would still have 53 and 9—those numbers are always small enough that there’s not much point changing them to something else. That just left me with the task of compressing the numerals after the decimal point.

I had to account for the possibility of a string of zeroes, and chose a rude hack for this: the first numeral in my compressed number would always represent how many zeroes there were after the decimal point. (I intend to only deal with about 18 digits after the point, so there’s no chance of there being more than 63 zeroes in any data I’m planning to accept.) This means an extra digit in all cases where there are no zeroes, which is unfortunate.

After the zeroes are handled, I convert the rest to a base 64 number—which is really a string composed of the characters `[0-9a-zA-Z_*]`—using the method of storing the modulo 64 of the number and then dividing by 64 and repeating. I had to look up this method because I’d entirely forgotten how to convert from different base systems.

I got this working, and it converts e.g.

53.10680448323265, -9.580078125

to:

53.02rqYl0X1, -9.0yAQEJ

Unfortunately, I’m not that impressed with those results. Yes, we get from 17 to 12 and 12 to 9 characters, but I’m not convinced that’s sufficient progress to make the whole thing worth it.

At this point a certain degree of madness crept into the proceedings.

Clearly, the problem wasn’t with the overall approach, but with the disgraceful paucity of symbols. 64 just wasn’t a large enough base. Using a larger base would ensure fewer characters to represent large numbers. The only problem here is that while progressing from zero to 63 with `[0-9a-zA-Z_*]` makes at least some logical sense to most people familiar with the English alphabet, how to go beyond that?

I started looking at larger character sets. Unicode can be supported in URLs (e.g. http://?.ws/??), so why not start using whatever Unicode symbols make sense in a higher base system?

I thought that base 128 would be ideal, but ran into immediate problems. Unfortunately, there doesn’t seem to be any single additional feature or accent that’s available for all of the letters of the English alphabet—I looked for dots, macrons, umlauts, overlines, underlines… I even looked at the possibility of upside-down letters, but not all of them are supported. If they were, I might have pursued that, but there’s an additional problem: they don’t aid experimentation.

I wanted the user of this shortening system to be able to change the characters with relative ease, and once you get into additional character sets, that’s tricky. Even if the upside-down letters had all been available, and the user realized that the progression was something like zero to nine in numerals, ten to thirty-five in lowercase letters, thirty-six to sixty-one in uppercase letters, sixty-two to eighty-seven in upside-down lowercase letters, eighty-eight to one hundred and thirteen in upside-down uppercase letters, and the remaining fourteen numbers in some pseudo-logical progression through (URL-safe) other symbols, how could they apply this knowledge to manipulate the strings? I don’t know of any OS that supports easily producing any of those upside-down letters. The same goes for any of the other alterations I considered, such as dots, umlauts, etc.

Having spent plenty of time exploring this, I eventually backed off it, and decided that my implementation took my approach about as far as it would go. We’ll see if any of the others end up with a better solution.

Incidentally, one positive aspect of my way of doing it is that unlike with traditional URL shorteners, no storage would be necessary, just a translation service, which could be trivially written in just about any language.

09 Dec 2009 at 03:05

One way of removing one more character would be to alway use a letter to represent the number of zeroes. Then you don’t need the dot. The first letter is both the dot and the number of zeroes.