Saturday, August 16, 2008

Sun's or IBM's BreakIterator

Sun's java.text.BreakIterator is supposed to be synchronized with IBM's ICU4J BreakIterator. Apparently not... The code below shows a puzzling difference in behavior between the sentence iterators included in Java 6 and ICU4J 4.0.
public class Sentences {

    public static void main(String[] args) {
        String testText = "Elle courut à son père et l'embrassa, en l'étreignant.\n - Eh bien, partons-nous? dit-elle.";

        // Sentence iterators for the default locale, one from the JDK and one from ICU4J.
        java.text.BreakIterator sunSentenceTokenizer = java.text.BreakIterator.getSentenceInstance();
        com.ibm.icu.text.BreakIterator icuSentenceTokenizer = com.ibm.icu.text.BreakIterator.getSentenceInstance();

        sunSentenceTokenizer.setText(testText);
        icuSentenceTokenizer.setText(testText);

        int sentenceStart = 0;
        int sentenceOffset = 0;
        int sentenceCounter = 0;
        System.out.println("sun");
        // next() returns the offset of the next boundary, or DONE at the end of the text.
        while ((sentenceOffset = sunSentenceTokenizer.next()) != java.text.BreakIterator.DONE) {
            System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
            sentenceStart = sentenceOffset;
            sentenceCounter++;
        }

        // Same loop again, this time with the ICU4J iterator.
        sentenceStart = 0;
        sentenceOffset = 0;
        sentenceCounter = 0;
        System.out.println("icu");
        while ((sentenceOffset = icuSentenceTokenizer.next()) != com.ibm.icu.text.BreakIterator.DONE) {
            System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
            sentenceStart = sentenceOffset;
            sentenceCounter++;
        }
    }
}

This is what I get:
sun
0 Elle courut à son père et l'embrassa, en l'étreignant
1 .
2
- Eh bien, partons-nous?
3 dit-elle.
icu
0 Elle courut à son père et l'embrassa, en l'étreignant.

1 - Eh bien, partons-nous?
2 dit-elle.


I did try it with the French locale, and with other languages and their corresponding locales. In the test text the period is followed by '\n' and then a space, and that sequence causes Sun's break iterator to split the preceding period off as a sentence of its own. I decided to use IBM's ICU4J.

Also, '\n' is left at the beginning of the following sentence when using Sun's iterator.
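
For trying other locales, the loop above can be folded into a small helper. This is only a sketch using the JDK's java.text.BreakIterator with an explicit locale; the class name SentenceSplitter and the method split are made up for illustration, and the segments it returns may differ between Java versions and locales.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the JDK or ICU4J.
class SentenceSplitter {

    // Returns the raw segments between consecutive sentence boundaries,
    // whitespace and line breaks included, for the given locale.
    static List<String> split(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> sentences = new ArrayList<String>();
        int start = iterator.first();
        // next() returns the next boundary offset, or DONE at the end of the text.
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String testText = "Elle courut à son père et l'embrassa, en l'étreignant.\n - Eh bien, partons-nous? dit-elle.";
        List<String> sentences = split(testText, Locale.FRENCH);
        for (int i = 0; i < sentences.size(); i++) {
            // trim() only affects how the segment is printed.
            System.out.println(i + " [" + sentences.get(i).trim() + "]");
        }
    }
}
```

Calling trim() on each segment, as main does here, hides the leading '\n' when printing, but it does not change where the iterator places the breaks.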

Thursday, August 07, 2008

Google Translation Center: The World’s Largest Translation Memory - GigaOM

Google Translation Center: The World’s Largest Translation Memory - GigaOM: "Google has been investing significant resources in a multi-year effort to develop its statistical machine translation technology. Statistical MT works by comparing large numbers of parallel texts that have been translated between languages and from these learns which words and phrases usually map to others — similar to the way humans acquire language. The problem with statistical MT is that it requires a large number of directly translated sentences. These are hard to find, and because of this SMT systems use sources like the proceedings from the European Parliament, United Nations, etc. Which are fine if you’re writing in bureaucrat-speak, but aren’t so great for other texts. Google Translation Center is a straightforward and very clever way to gather a large corpus of parallel texts to train its machine translation systems.

Part machine translator and part translation memory (a sort of search engine for translation that helps translators to recall translations), GTC will help translators by providing a free, global translation memory, and in turn drive costs down by reducing the amount of work needed to complete a text. It will help Google by providing an excellent source of high quality parallel texts that can be fed back into the statistical translation systems."