To solve a few tokenization problems and to delimit sentences properly, I decided to stop fighting with tokenization myself and to use a dedicated script instead. Some taggers integrate a tokenization step of their own, but that is precisely why I need an independent one: it lets the several taggers downstream work from the same input.
I found an interesting script written by Stefanie Dipper of the University of Bochum, Germany. It is freely available here: Rule-based Tokenizer for German.
- It’s written in Perl.
- It performs tokenization and sentence boundary detection.
- It can output the result in plain text as well as in XML format, including a detailed variant in which every type of whitespace is annotated.
- It was created to perform well on German.
- It comes with an abbreviation list that fits German conventions (e.g. street names like Hauptstr.).
- It tries to address the problem of dates in German, which are often written with dots (e.g. 01.01.12), using a “hard-wired list of German date expressions” according to its author.
- The code is clear and well documented, which is useful if you want to adapt it to your needs.
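To illustrate the date problem mentioned above: a naive splitter that breaks on every dot would cut 01.01.12 into pieces. The tokenizer itself relies on its hard-wired list of German date expressions; the sketch below is only a minimal, hypothetical illustration of the underlying idea, not the script's actual rule.

```python
import re

# Hypothetical sketch (not the tokenizer's actual code): protect dotted
# German dates such as 01.01.12 before naive sentence splitting, so their
# dots are not mistaken for sentence boundaries.
DATE = re.compile(r"\b\d{1,2}\.\d{1,2}\.(?:\d{4}|\d{2})\b")

def split_sentences(text):
    # Temporarily replace the dots inside dates with a placeholder byte.
    protected = DATE.sub(lambda m: m.group(0).replace(".", "\x00"), text)
    # Split after sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", protected)
    # Restore the original dots.
    return [s.replace("\x00", ".") for s in sentences]

print(split_sentences("Am 01.01.12 war Feiertag. Danach nicht mehr."))
# → ['Am 01.01.12 war Feiertag.', 'Danach nicht mehr.']
```

The real script goes further, since date expressions in running German text are more varied than this single pattern.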
I am quite happy with the sentence detection: so far my tests gave satisfying results, especially for the dates. The only systematic problem I found concerns names written like Arnold A. Schwarzenegger, where the A. is treated as the end of a sentence. It may be difficult to address, although sentences ending in [A-Z]\. are odd and, in my opinion, quite rare.
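Since genuine sentence endings of the form [A-Z]\. are rare, one hypothetical workaround (my own sketch, not a feature of the script) is to refuse to split after a period that directly follows a lone capital letter:

```python
import re

# Hypothetical heuristic, not part of the original tokenizer: a period
# immediately after a single capital letter is almost always a middle
# initial (Arnold A. Schwarzenegger), so do not split there.
SPLIT = re.compile(r"(?<!\b[A-Z]\.)(?<=[.!?])\s+")

def split_ignoring_initials(text):
    return SPLIT.split(text)

print(split_ignoring_initials("Arnold A. Schwarzenegger war da. Er ging."))
# → ['Arnold A. Schwarzenegger war da.', 'Er ging.']
```

The trade-off is exactly the one stated above: the rare sentence that legitimately ends in a bare initial will be merged with the next one.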
I had a problem concerning the output: there is an “xmlIn” option that prevents the tokenizer from analyzing XML tags, but when it is combined with the “xmlOut” option, it no longer seems to work. Maybe I did not pay enough attention; in any case I found an easy workaround in my workflow that also simplifies whitespace management (because in the end you don't want the tagger to handle words with markup still attached).
I also had problems with apostrophes, which are sometimes kept as part of a token and sometimes split off, depending on the encoding (‘ in the first case, ´ in the second).
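One way around the apostrophe inconsistency is to normalize the input before tokenizing. The sketch below is my own assumption about a reasonable pre-processing step, not something the script provides: it maps the common apostrophe look-alikes to a single plain character.

```python
# Hypothetical pre-processing sketch: map apostrophe variants (typographic
# quotes, acute accent, backtick) to the plain ASCII apostrophe so the
# tokenizer always sees the same character.
APOSTROPHE_VARIANTS = {
    "\u2018": "'",  # ‘ left single quotation mark
    "\u2019": "'",  # ’ right single quotation mark
    "\u00b4": "'",  # ´ acute accent
    "\u0060": "'",  # ` grave accent
}

def normalize_apostrophes(text):
    for variant, plain in APOSTROPHE_VARIANTS.items():
        text = text.replace(variant, plain)
    return text

print(normalize_apostrophes("geht\u2019s oder geht\u00b4s"))
# → "geht's oder geht's"
```

Whether to normalize toward the ASCII apostrophe or toward U+2019 is a matter of taste; the point is simply to feed the tokenizer one consistent character.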
But all in all I recommend this tokenizer.