Safely Base-36-Encoding Integers

I’m working on a project at Ancestry.com which involves URLs with fairly long integers—that is, integers up to eight digits or more in length. To shorten these IDs, we’ve decided to base-36-encode them. So, for example, 17,012,00010 = A4MJK36. Now, the number of IDs that will be encoded is on the order of 108. What is the likelihood that at least one of those IDs encodes to something vulgar in English? See, for example, what 1,329,077 encodes to.

I cannot take the risk that this may happen. What, then, is to be done? Note that the digits in “normal” base-36 encoding are 0, 1, 2, …, 9, a, b, c, …, z. But suppose we used a different alphabet? What if we just removed the vowels, say? That would leave us with an alphabet of 31, not 36 characters, so we would then be base-31-encoding the integers instead of base-36-encoding them. Is that OK? Well, for my purposes, it is. Note that any base-31 representation is a valid base-36 one, too (for a different number), just as 10 is both a valid base-2 and base-10 representation—meaning 2 in base-2, and 10 in base-10.

In Python, the encoding and decoding algorithms are as follows:

ALPHABET = "0123456789bcdfghjklmnpqrstvwxyz"

def encode(n):
    if n == 0:
        return ALPHABET[0]

    # We're only dealing with nonnegative integers.
    if n < 0:
        raise Exception() # Raise a better exception than this in real life.

    result = ""

    while (n > 0):
        result = ALPHABET[n % len(ALPHABET)] + result
        n /= len(ALPHABET)

    return result

def decode(encoded):
    result = 0

    for i in range(len(encoded)):
        place_value = ALPHABET.index(encoded[i])
        result += place_value * (len(ALPHABET) ** (len(encoded) - i - 1))

    return result

This can never give vulgar English words, since all English words have vowels in them. At least, it’s good enough; I suppose dirty acronyms could still be made, but that’s obscure enough for me.

Note that the code above will encode with any alphabet of two or more symbols, as long as the symbols are distinct. So, an alphabet of "01" will give you the binary encoding, "012" the ternary, "0123456789abcdef" the hexadecimal, and so on. Of course, "-+" gives a binary encoding, just with different symbols than those normally used. Since capital letters are distinct from lower-case letters in ASCII, you can base-62-encode a number with "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" if you wanted to.

blog comments powered by Disqus