
Dirty Phrasebook – Part 1

On 1st April 2015 I published a joke app to Google Play named Dirty Phrasebook which is based upon Monty Python’s Dirty Hungarian Phrasebook sketch. In this series of articles we’ll take a look into the code (which will be open-sourced along with the final article). In this first article we’ll begin to look at the translation mechanism used to (hopefully) reliably and repeatably translate the user’s input into one of the phrases from the sketch.

For those unfamiliar with the Dirty Hungarian Phrasebook sketch, the basic premise is a phrasebook which has incorrect, but rather amusing, translations. In an example given during the trial of the publisher, the phrase “Can you direct me to the station?” is actually translated to “Please fondle my bum”. I wanted to create an app which would take any string that the user entered and translate it to one of the phrases from the sketch. I then enlisted the help of some volunteer translators (details of whom can be found at the end of this post) to translate the target phrases into as many languages as possible, so that the user can choose which language they wish to ‘translate’ their string to.

The first thing I did was extract nine target translation strings from the sketch (in English) – these will be the phrases that will later get translated into as many languages as possible.

One key thing that I wanted to achieve was that if the user typed in the same string multiple times, it should always return the same translation. The basic mechanism I decided to use was to take the hash code of the user’s string and compute the modulus of the hash code divided by the number of target strings. This value, which will always be between 0 and 8 (because there are nine target strings), would be used to index the target translation string which would be presented back to the user as the translation.
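This mechanism can be sketched in a few lines of Java. Note that the target phrases below are placeholders (the real nine come from the sketch), and that `Math.floorMod` is used defensively because `String.hashCode()` can be negative – the class and method names here are my assumptions, not necessarily the app’s actual code:

```java
public class PhrasePicker {

    // Placeholder targets – the real app uses nine phrases from the sketch
    private static final String[] TARGETS = {
            "Phrase 0", "Phrase 1", "Phrase 2",
            "Phrase 3", "Phrase 4", "Phrase 5",
            "Phrase 6", "Phrase 7", "Phrase 8"
    };

    // Maps any input to a stable index in [0, 8] – the same input
    // always selects the same target phrase
    public static String translate(String input) {
        // floorMod guards against a negative hashCode(), which a
        // plain % operator would turn into a negative index
        int index = Math.floorMod(input.hashCode(), TARGETS.length);
        return TARGETS[index];
    }

    public static void main(String[] args) {
        System.out.println(translate("Can you direct me to the station?"));
    }
}
```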

One complication was that I wanted subtle differences in what the user entered to be handled. Some obvious examples are changes to the capitalisation, punctuation, or number of spaces in the string. A less obvious example is if one string contains diacritics (for example the Italian ‘comprerò’) and the other does not (‘comprero’).

To achieve this I created a utility class named StringSanitiser which would perform some basic transformations on a string to, as far as possible, remove any of these subtle yet irrelevant (for my purposes, anyway) differences:
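The StringSanitiser source will be published with the final article; in the meantime, here is a sketch consistent with the transformations described – the method names and their ordering are my assumptions rather than the app’s actual code:

```java
import java.text.Normalizer;
import java.util.Locale;

public final class StringSanitiser {

    private StringSanitiser() {
        // Utility class – no instances
    }

    public static String sanitise(String input) {
        String s = input.toLowerCase(Locale.ROOT);
        s = removeDiacritics(s);
        s = s.replaceAll("\\p{Punct}", "");  // strip punctuation
        s = s.replaceAll("\\s{2,}", " ");    // collapse repeated whitespace
        return s.trim();
    }

    static String removeDiacritics(String input) {
        // Decompose precomposed characters (Normalisation Form D),
        // then strip the combining marks which \p{M} matches
        return Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitise("Comprer\u00F2,   domani!"));  // comprero domani
    }
}
```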

Most of these transformations are pretty simple. For example we convert the string to lower case, and use regular expressions to strip out any punctuation – the regex \p{Punct} will match all punctuation marks, and we simply replace them with an empty string. Similarly, the regex \s{2,} will match two or more whitespace characters, and we replace each match with a single space – thus collapsing any runs of multiple spaces.
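A quick illustration of those two replacements – this is standard java.lang.String.replaceAll, nothing app-specific:

```java
public class RegexDemo {
    public static void main(String[] args) {
        // \p{Punct} matches punctuation; each match is replaced
        // with an empty string
        System.out.println("hello, world!".replaceAll("\\p{Punct}", ""));
        // prints: hello world

        // \s{2,} matches runs of two or more whitespace characters;
        // each run is replaced with a single space
        System.out.println("hello    world".replaceAll("\\s{2,}", " "));
        // prints: hello world
    }
}
```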

What is worthy of a little bit of explanation is how we strip out the diacritics in the removeDiacritics() method. There’s not an awful lot of code here, but what’s there is pretty powerful stuff. To understand how it works we’ll need a little bit of explanation of how Unicode character encoding handles diacritics.

UTF-8 is a variable-width encoding, meaning that the character units are 8 bits each, but multiple units can be combined based upon the value of the first unit in the sequence together with the most significant bits of the following units. Values up to 0x7F provide ASCII compatibility, but higher values indicate multi-unit sequences which are combined in order to address code points higher than those which can be addressed with 8 bits.

Unicode actually supports diacritics in two distinct ways:

Firstly you can use a precomposed character – a single character code referencing a glyph which contains both the letter and the diacritic mark. For example ‘ò’ as a precomposed character is U+00F2 (LATIN SMALL LETTER O WITH GRAVE), encoded in UTF-8 as 0xC3 0xB2 – note the two-unit encoding that was mentioned earlier.

The second way is to use a standard letter character followed by a combining character, which modifies the preceding character – effectively two glyphs are rendered: the letter first, then the combining character on top. For example ‘ò’ would be 0x6F (LATIN SMALL LETTER O) followed by the UTF-8 sequence 0xCC 0x80 (U+0300, COMBINING GRAVE ACCENT).

It’s important to understand the distinction between the variable-width encoding (which addresses a single character using multiple character units) and a combining character, which is a separate character from the one which precedes it and may itself be of variable width.
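The two representations can be seen directly from Java, using Unicode escapes in string literals:

```java
public class DiacriticForms {
    public static void main(String[] args) {
        String precomposed = "\u00F2";   // U+00F2 LATIN SMALL LETTER O WITH GRAVE
        String decomposed = "o\u0300";   // 'o' followed by U+0300 COMBINING GRAVE ACCENT
        // Both render as 'ò', but they are different character sequences
        System.out.println(precomposed.length());           // 1
        System.out.println(decomposed.length());            // 2
        System.out.println(precomposed.equals(decomposed)); // false
    }
}
```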

So with that explanation it may be a little clearer how we can really easily strip out the diacritics: we need to convert all of the diacritic characters to the second form, and then we can strip out the combining characters to transform ‘ò’ to ‘o’. The first part of this can be achieved thanks to the java.text.Normalizer class which converts between these forms. In our case we want the decomposed form, so we normalise to Normalisation Form D. Once we have this we just need to strip out the combining characters (which is done using the regex \p{M}):
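In isolation, the decompose-then-strip step looks like this:

```java
import java.text.Normalizer;

public class RemoveDiacriticsDemo {
    public static void main(String[] args) {
        String input = "comprer\u00F2";  // comprerò
        // Normalisation Form D splits 'ò' into 'o' + COMBINING GRAVE ACCENT
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, so stripping them
        // leaves only the plain letters behind
        String stripped = decomposed.replaceAll("\\p{M}", "");
        System.out.println(stripped);  // prints: comprero
    }
}
```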

I added some unit tests to help me with testing that this was behaving as I expected, and to provide regression tests to ensure that changes didn’t break anything.

It’s worth mentioning that this technique is a general Java one, and is not specific to Android.

So running a string through StringSanitiser.sanitise() will perform some simplifications and standardisations which will help to smooth over any small changes in what the user types in. What this won’t handle is changes in wording which have the same meaning – for example “Can you direct me to the station?” and “Could you please direct me to the station?” will translate to different things, because the sanitised versions are different and have different hash codes – but I can live with that.

In the next article we’ll look at the actual translation mechanism itself and explore some additional translation requirements I wanted to include.

The source code for this series is available here.

I am deeply indebted to my fantastic team of volunteer translators who generously gave their time and language skills to make this project sooo much better. They are Sebastiano Poggi (Italian), Zvonko Grujić (Croatian), Conor O’Donnell (Gaelic), Stefan Hoth (German), Hans Petter Eide (Norwegian), Wiebe Elsinga (Dutch), Imanol Pérez Iriarte (Spanish), Adam Graves (Malay), Teo Ramone (Greek), Mattias Isegran Bergander (Swedish), Morten Grouleff (Danish), George Medve (Hungarian), Anup Cowkur (Hindi), Draško Sarić (Serbian), Polson Keeratibumrungpong (Thai), Benoit Duffez (French), and Vasily Sochinsky (Russian). Thank you so much guys – you rock!

© 2015, Mark Allison. All rights reserved.

CC BY-NC-SA 4.0 Dirty Phrasebook – Part 1 by Styling Android is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at http://blog.stylingandroid.com/license-information.

2 Comments

  1. Hello, I love Monty Python very much, and the app is interesting too.
    Maybe I can be a volunteer translator – Chinese.
    I wanted to send an email to you, but I can’t find your email here,
    so I’m commenting here.
    If you want to translate it to Chinese, send an e-mail to me.

    1. Thanks for your kind comments and your extremely generous offer of a Chinese translation. What I intend to do is open source the app with the last article in this series. There will be full instructions of how to provide additional translations via Pull Requests when the source is published.
