This is what I get:

public class Sentences {
    public static void main(String[] args) {
        String testText = "Elle courut à son père et l'embrassa, en l'étreignant.\n - Eh bien, partons-nous? dit-elle.";

        // Sentence iterators from the JDK (Sun) and from ICU4J, both using the default locale.
        java.text.BreakIterator sunSentenceTokenizer = java.text.BreakIterator.getSentenceInstance();
        com.ibm.icu.text.BreakIterator icuSentenceTokenizer = com.ibm.icu.text.BreakIterator.getSentenceInstance();
        sunSentenceTokenizer.setText(testText);
        icuSentenceTokenizer.setText(testText);

        int sentenceStart = 0;
        int sentenceOffset = 0;
        int sentenceCounter = 0;

        // Print each segment found by the JDK iterator.
        System.out.println("sun");
        while ((sentenceOffset = sunSentenceTokenizer.next()) != java.text.BreakIterator.DONE) {
            System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
            sentenceStart = sentenceOffset;
            sentenceCounter++;
        }

        sentenceStart = 0;
        sentenceOffset = 0;
        sentenceCounter = 0;

        // Print each segment found by the ICU iterator.
        System.out.println("icu");
        while ((sentenceOffset = icuSentenceTokenizer.next()) != com.ibm.icu.text.BreakIterator.DONE) {
            System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
            sentenceStart = sentenceOffset;
            sentenceCounter++;
        }
    }
}
sun
0 Elle courut à son père et l'embrassa, en l'étreignant
1 .
2
- Eh bien, partons-nous?
3 dit-elle.
icu
0 Elle courut à son père et l'embrassa, en l'étreignant.
1 - Eh bien, partons-nous?
2 dit-elle.
I tried it with the French locale, and with other languages and their corresponding locales. With
Sun's iterator, the '\n' is always left at the beginning of the following sentence.
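
If the stray '\n' is the only problem, one possible workaround is to pass the locale explicitly and trim each segment after splitting. The following is only a sketch that assumes losing leading and trailing whitespace is acceptable. The class name TrimmedSentences and its two helper methods are made up for illustration, and it does not change where the JDK iterator places its boundaries.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the JDK or ICU.
final class TrimmedSentences {

    // Returns the segments exactly as the JDK sentence iterator delimits them.
    public static List<String> rawSegments(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> segments = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    // Same segments, with surrounding whitespace (including '\n') removed
    // and segments that were whitespace only dropped.
    public static List<String> sentences(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        for (String segment : rawSegments(text, locale)) {
            String trimmed = segment.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String testText = "Elle courut à son père et l'embrassa, en l'étreignant.\n - Eh bien, partons-nous? dit-elle.";
        int counter = 0;
        for (String sentence : sentences(testText, Locale.FRENCH)) {
            System.out.println(counter++ + " " + sentence);
        }
    }
}
```

Note that trimming throws away the original offsets, so if the positions in the source text matter, it is better to keep the start and end indices and skip the whitespace when reading them.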