Recently I read an article that looked at the length of several authors' sentences. It was a stylistic study of their works.
How do I read a text (with several paragraphs) and extract its sentences? Preferably in Java.
Starting with the simplest example, we assume that a sentence ends in a dot, followed by a space (or line break):
She is from Rio. He, Paulista.
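It would suffice to split the String with split(), using the dot followed by any whitespace character as the delimiter, remembering to escape the special characters (in a Java string literal the backslashes themselves must be doubled):
s.split("\\.\\s+");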
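As a minimal sketch of this first step (the class name `DotSplitter` is only illustrative), the snippet below runs the split on the example sentence. Note that the matched dot is consumed by the delimiter, so every sentence except the last one loses its final dot.

```java
class DotSplitter {
    // Split on a literal dot followed by one or more whitespace characters.
    // The dot is part of the delimiter, so it is removed from the pieces.
    static String[] split(String text) {
        return text.trim().split("\\.\\s+");
    }

    public static void main(String[] args) {
        for (String sentence : split("She is from Rio. He, Paulista.")) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

The last piece keeps its dot because no whitespace follows it, so the delimiter does not match there.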
But we also have to consider exclamation points and question marks:
Where have you been? I was worried!
For this we will use a regex positive lookbehind, which checks the punctuation without consuming it, so each sentence keeps its final mark:
s.split("(?<=[.!?])\\s+");
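A small runnable sketch of the lookbehind version follows (the class name `PunctuationSplitter` is my own choice). Because the lookbehind is zero-width, only the whitespace is removed and the punctuation stays attached to each sentence.

```java
class PunctuationSplitter {
    // Split on whitespace that is preceded by '.', '!' or '?'.
    // The lookbehind does not consume the punctuation mark.
    static String[] split(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String text = "Where have you been? I was worried!\nShe is from Rio. He, Paulista.";
        for (String sentence : split(text)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Since `\s+` also matches line breaks, the same call works across paragraph boundaries.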
But we have to consider that some sentences can be in single or double quotation marks, in case of dialogs.
"Am I old today?" Said my father.
To handle this we will add these characters to the pattern, remembering that the quotation mark is a character that can be removed or kept (as the programmer prefers):
s.split("(?<=[.!?]|[.!?][\'\"])\\s+");
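Here is a hedged sketch of the quote-aware pattern in a complete program (the class name `QuoteAwareSplitter` is illustrative). It only covers the straight quotes `'` and `"`; typographic quotes would have to be added to the character class.

```java
class QuoteAwareSplitter {
    // Split on whitespace preceded by '.', '!' or '?', optionally followed
    // by a closing single or double quote. The quote is kept in the sentence.
    static String[] split(String text) {
        return text.trim().split("(?<=[.!?]|[.!?]['\"])\\s+");
    }

    public static void main(String[] args) {
        String text = "\"Am I old today?\" Said my father.";
        for (String sentence : split(text)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Note that this treats the quoted question and the part after it as two separate sentences, which may or may not be what a stylistic study wants.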
But we still have the abbreviations. What do we do when a dot followed by a space does not mark the end of a sentence, but rather an abbreviation (Sra. for Senhora, Sr. for Senhor, Dr. for Doutor, etc.) or an initial?
Sra. Pereira met George W. Bush.
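So we add a regex negative lookbehind that rejects the split when the whitespace comes right after a known abbreviation (the alternation must not end with an empty branch, otherwise the negative lookbehind would reject every split):
String pattern = "(?<=[.!?]|[.!?][\'\"])(?<!Sr\\.|Sra\\.|Dr\\.|Dra\\.|W\\.)\\s+";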
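To make this step concrete, here is a minimal sketch that applies the abbreviation-aware pattern (the class name `AbbreviationRegexSplitter` and the sample text are my own). The backslashes are doubled for the Java string literal, and the negative lookbehind lists only non-empty alternatives.

```java
class AbbreviationRegexSplitter {
    // Positive lookbehind: sentence-ending punctuation, optionally quoted.
    // Negative lookbehind: do not split right after a listed abbreviation.
    private static final String PATTERN =
        "(?<=[.!?]|[.!?]['\"])(?<!Sr\\.|Sra\\.|Dr\\.|Dra\\.|W\\.)\\s+";

    static String[] split(String text) {
        return text.trim().split(PATTERN);
    }

    public static void main(String[] args) {
        String text = "Sra. Pereira met George W. Bush. Dr. Silva was there too.";
        for (String sentence : split(text)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

One known weakness of this sketch: a sentence that really does end with one of the listed abbreviations will be glued to the next one.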
Note that the regex is already getting complicated, and it is better to keep the abbreviations in a separate structure and check them one by one. More complex cases (e.g. U.K.) that need special handling may also arise.
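One possible way to do that, as a rough sketch rather than a definitive implementation: split on every candidate boundary first, then merge a chunk with the next one when its last word is in a set of abbreviations. The class name `AbbreviationListSplitter` and the contents of `ABBREVIATIONS` are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AbbreviationListSplitter {
    // Hypothetical abbreviation list; extend it for your own texts.
    private static final Set<String> ABBREVIATIONS = new HashSet<>(
        Arrays.asList("Sr.", "Sra.", "Dr.", "Dra.", "W.", "U.K."));

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        if (text.trim().isEmpty()) {
            return sentences;
        }
        StringBuilder current = new StringBuilder();
        // Candidate boundaries: whitespace after . ! ? (optionally quoted).
        for (String chunk : text.trim().split("(?<=[.!?]|[.!?]['\"])\\s+")) {
            // Chunks are rejoined with a single space (whitespace is normalized).
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(chunk);
            String[] tokens = chunk.split("\\s+");
            String lastToken = tokens[tokens.length - 1];
            // If the chunk ends in a known abbreviation, keep accumulating.
            if (!ABBREVIATIONS.contains(lastToken)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // text ended on an abbreviation
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Sra. Pereira met George W. Bush. She visited the U.K. last year.";
        split(text).forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

Keeping the list outside the pattern means new abbreviations can be added (or loaded from a file) without touching the regex, at the cost of still mishandling sentences that genuinely end with an abbreviation.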
In summary, you can keep refining your code, but treat this as a Natural Language Processing problem, for which there is still no perfect solution. The best algorithms range from 90% to 99% accuracy depending on the text.
If you need a more robust and accurate solution, I suggest looking into the Stanford NLP Parser, which has Java implementations for this.