Saturday, August 16, 2008

Sun's or IBM's BreakIterator

Sun's java.text.BreakIterator is supposed to be in sync with IBM's ICU4J BreakIterator. Apparently it is not: the code below shows a puzzling difference in behavior between the sentence iterators included in Java 6 and ICU4J 4.0.
public class Sentences {

    public static void main(String[] args) {
        String testText = "Elle courut à son père et l'embrassa, en l'étreignant.\n - Eh bien, partons-nous? dit-elle.";

        // Both iterators use the default locale here.
        java.text.BreakIterator sunSentenceTokenizer = java.text.BreakIterator.getSentenceInstance();
        com.ibm.icu.text.BreakIterator icuSentenceTokenizer = com.ibm.icu.text.BreakIterator.getSentenceInstance();

        sunSentenceTokenizer.setText(testText);
        icuSentenceTokenizer.setText(testText);

        // Print each sentence found by Sun's iterator.
        int sentenceStart = 0;
        int sentenceOffset = 0;
        int sentenceCounter = 0;
        System.out.println("sun");
        while ((sentenceOffset = sunSentenceTokenizer.next()) != java.text.BreakIterator.DONE) {
            System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
            sentenceStart = sentenceOffset;
            sentenceCounter++;
        }

        // Print each sentence found by ICU4J's iterator.
        sentenceStart = 0;
        sentenceOffset = 0;
        sentenceCounter = 0;
        System.out.println("icu");
        while ((sentenceOffset = icuSentenceTokenizer.next()) != com.ibm.icu.text.BreakIterator.DONE) {
            System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
            sentenceStart = sentenceOffset;
            sentenceCounter++;
        }
    }
}

This is what I get:
sun
0 Elle courut à son père et l'embrassa, en l'étreignant
1 .
2
- Eh bien, partons-nous?
3 dit-elle.
icu
0 Elle courut à son père et l'embrassa, en l'étreignant.

1 - Eh bien, partons-nous?
2 dit-elle.


I did try it with the French locale, and with other languages and their corresponding locales. The presence of a newline followed by a space seems to cause Sun's break iterator to split the preceding period off as a separate sentence. I decided to use IBM's ICU4J.
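For anyone who wants to repeat the locale experiment, here is a minimal sketch using only java.text. The class name LocaleSentences and the helper split are my own names for illustration, not part of either library, and the three locales are just examples. I give no expected output, since the boundaries may differ between JDK versions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LocaleSentences {

    // Hypothetical helper: cuts the text at every boundary the
    // locale-specific sentence iterator reports.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String testText = "Elle courut à son père et l'embrassa, en l'étreignant.\n - Eh bien, partons-nous? dit-elle.";
        // Example locales only; substitute whichever ones you need to check.
        for (Locale locale : new Locale[] { Locale.FRENCH, Locale.ENGLISH, Locale.GERMAN }) {
            System.out.println(locale);
            List<String> sentences = split(testText, locale);
            for (int i = 0; i < sentences.size(); i++) {
                System.out.println(i + " " + sentences.get(i));
            }
        }
    }
}
```

The same loop should work with ICU4J by swapping in com.ibm.icu.text.BreakIterator, which also offers getSentenceInstance(Locale).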

'\n' is left at the beginning of the sentence when using Sun's iterator.
