This is what I get:
public class Sentences {
    public static void main(String[] args) {
        String testText = "Elle courut à son père et l'embrassa, en l'étreignant.\n - Eh bien, partons-nous? dit-elle.";
        // Both iterators are created for the default locale.
        java.text.BreakIterator sunSentenceTokenizer = java.text.BreakIterator.getSentenceInstance();
        com.ibm.icu.text.BreakIterator icuSentenceTokenizer = com.ibm.icu.text.BreakIterator.getSentenceInstance();
        sunSentenceTokenizer.setText(testText);
        icuSentenceTokenizer.setText(testText);

        int sentenceStart = 0;
        int sentenceOffset = 0;
        int sentenceCounter = 0;
        System.out.println("sun");
        // next() returns the following boundary, or DONE once the text is exhausted.
        while ((sentenceOffset = sunSentenceTokenizer.next()) != java.text.BreakIterator.DONE) {
            System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
            sentenceStart = sentenceOffset;
            sentenceCounter++;
        }

        // Reset the counters and repeat with the ICU iterator.
        sentenceStart = 0;
        sentenceOffset = 0;
        sentenceCounter = 0;
        System.out.println("icu");
        while ((sentenceOffset = icuSentenceTokenizer.next()) != com.ibm.icu.text.BreakIterator.DONE) {
            System.out.println(sentenceCounter + " " + testText.substring(sentenceStart, sentenceOffset));
            sentenceStart = sentenceOffset;
            sentenceCounter++;
        }
    }
}
sun
0 Elle courut à son père et l'embrassa, en l'étreignant
1 .
2
- Eh bien, partons-nous?
3 dit-elle.
icu
0 Elle courut à son père et l'embrassa, en l'étreignant.
1 - Eh bien, partons-nous?
2 dit-elle.
I did try it with the French locale, and with other languages and their corresponding locales. With
Sun's iterator, the '\n' is left at the beginning of the following sentence.
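
If the stray '\n' is the only problem, one possible workaround is to trim each segment after splitting. This is only a sketch under that assumption: it uses java.text.BreakIterator alone, the class name TrimmedSentences and the method split are made up for illustration, and it does nothing about where the iterator actually places the boundaries.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the JDK or ICU.
final class TrimmedSentences {

    // Splits text at the iterator's sentence boundaries and trims each piece,
    // so a '\n' left at the start of a segment is dropped.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            // Skip segments that contained only whitespace.
            if (!piece.isEmpty()) {
                sentences.add(piece);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String testText = "Elle courut à son père et l'embrassa, en l'étreignant.\n - Eh bien, partons-nous? dit-elle.";
        List<String> sentences = split(testText, Locale.FRENCH);
        for (int i = 0; i < sentences.size(); i++) {
            System.out.println(i + " " + sentences.get(i));
        }
    }
}
```

Trimming changes the text of each segment, so the offsets returned by the iterator no longer match the printed strings exactly. If the offsets matter, it may be better to keep them and trim only for display.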