Split a paragraph into sentences

I need to split a paragraph into a set of sentences.

For example:

var paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
var frases = paragrafo.split('.')

But this also splits at the abbreviation "Dr." (and it drops the periods and ignores the other punctuation marks):

var array = [
      "Sou Dr",
      " José",
      " Meu passatempo é assistir séries",
      " Adoro animais!! E você?"
    ];

What I expect to return is:

var array = [
  "Sou Dr. José.",
  "Meu passatempo é assistir séries.",
  "Adoro animais!!",
  "E você?"
];
    
asked by anonymous 05.02.2018 / 16:18

1 answer

Solution (ECMAScript 2018 / ES9):

.*?[.!?](?![.!?])(?<!\b\w\w\.)

Demo:

var paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
var frases = paragrafo.match(/.*?[.!?](?![.!?])(?<!\b\w\w\.)/g);
console.log(frases);

Explanation:

First of all, note that this regex has a limitation: it only exempts abbreviations of exactly two letters (e.g. "Dr.", "Mr.", "Fr.", etc.).

  • .*?[.!?] - Here we capture any text that ends in a period, exclamation mark, or question mark. I use a lazy quantifier so that each sentence is captured separately.
  • (?![.!?]) - This is a negative lookahead. It rejects a match that is immediately followed by another of these punctuation marks (I use it so that repeated punctuation is captured together, as in the Adoro animais!! excerpt).
  • (?<!\b\w\w\.) - This is a negative lookbehind. It rejects a match that ends with a \b (a word boundary) followed by two \w characters (\w is equivalent to [a-zA-Z0-9_]) and a period. As a result, text such as Dr. José stays within the same sentence, but the text is still split if something like Dra. Maria occurs.
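To make the two-letter rule concrete, here is a small sketch that runs the pattern on three short inputs. The third input is an observation that is not in the original answer: since the lookbehind only checks for a two-letter word followed by a period, a sentence that ends in a two-letter word is treated like an abbreviation as well. The variable names are illustrative.

```javascript
// Pattern from the answer: stop after ., ! or ?, unless the match
// ends in a two-letter word followed by a period.
const sentencePattern = /.*?[.!?](?![.!?])(?<!\b\w\w\.)/g;

// "Dr." has two letters, so the sentence stays in one piece.
const withDr = "Sou Dr. José.".match(sentencePattern);
console.log(withDr); // [ 'Sou Dr. José.' ]

// "Dra." has three letters, so the text is split right after it.
const withDra = "Sou Dra. Maria.".match(sentencePattern);
console.log(withDra); // [ 'Sou Dra.', ' Maria.' ]

// "vi." looks like a two-letter abbreviation, so no split happens there.
const shortEnding = "Eu vi. E você?".match(sentencePattern);
console.log(shortEnding); // [ 'Eu vi. E você?' ]
```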

That's the idea behind this regular expression. However, if we want to improve it, for example by removing the spaces left at the beginning of each sentence, we can add a negative lookahead at the start so that a match cannot begin with a space:

(?! ).*?[.!?](?![.!?])(?<!\b\w\w\.)
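As a quick comparison, this sketch runs the pattern with and without the leading (?! ) on the paragraph from the question. The last line shows a possible alternative that is not part of the original answer: keep the simpler pattern and call trim() on each sentence. The variable names are illustrative.

```javascript
const paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";

// Without (?! ), every sentence after the first keeps its leading space.
const comEspacos = paragrafo.match(/.*?[.!?](?![.!?])(?<!\b\w\w\.)/g);
console.log(comEspacos);
// [ 'Sou Dr. José.', ' Meu passatempo é assistir séries.', ' Adoro animais!!', ' E você?' ]

// With (?! ), a match cannot begin on a space, so the space is skipped.
const semEspacos = paragrafo.match(/(?! ).*?[.!?](?![.!?])(?<!\b\w\w\.)/g);
console.log(semEspacos);
// [ 'Sou Dr. José.', 'Meu passatempo é assistir séries.', 'Adoro animais!!', 'E você?' ]

// Alternative: trim each sentence after matching (also removes tabs and other whitespace).
const aparadas = comEspacos.map((frase) => frase.trim());
console.log(aparadas);
```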

And instead of trying to generalize over all abbreviations, you might prefer to list each specific case in the negative lookbehind from before:

(?! ).*?[.!?](?![.!?])(?<!\b(?:Dr|Dra|Srs|Sras)\.)

End result:

var paragrafo = "Sras. e Srs., eu sou Dr. José. Minha esposa é a Dra. Maria. Meu passatempo é assistir séries. Adoro animais!! E vocês?";
// The \b and the trailing \. now apply to every abbreviation in the group, not only to "Dr".
var frases = paragrafo.match(/(?! ).*?[.!?](?![.!?])(?<!\b(?:Dr|Dra|Srs|Sras)\.)/g);
console.log(frases);
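If the list of abbreviations is likely to grow, one option is to build the pattern from an array instead of editing the regex by hand. The sketch below is only an illustration: splitSentences and its abbreviations parameter are hypothetical names that do not appear in the original answer, and it assumes an engine with lookbehind support.

```javascript
// Hypothetical helper: builds the sentence pattern from a list of abbreviations
// (written without the final period). Assumes the list is not empty.
function splitSentences(text, abbreviations) {
  // Escape regex metacharacters so each abbreviation is matched literally.
  const escaped = abbreviations.map((abbr) =>
    abbr.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
  );
  const pattern = new RegExp(
    "(?! ).*?[.!?](?![.!?])(?<!\\b(?:" + escaped.join("|") + ")\\.)",
    "g"
  );
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(pattern) || [];
}

const texto = "Sras. e Srs., eu sou Dr. José. Minha esposa é a Dra. Maria. Meu passatempo é assistir séries. Adoro animais!! E vocês?";
const resultado = splitSentences(texto, ["Dr", "Dra", "Srs", "Sras"]);
console.log(resultado);
// [ 'Sras. e Srs., eu sou Dr. José.', 'Minha esposa é a Dra. Maria.',
//   'Meu passatempo é assistir séries.', 'Adoro animais!!', 'E vocês?' ]
```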

I hope this helps.

Update:

The solution presented above uses lookbehind, which was added in ES9 (ECMAScript 2018). Since the OP said in a comment below that they are using a browser that does not yet support it, here is also a solution that does not use lookbehind:

(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])

Explanation:

  • (?! ) - A negative lookahead that I use so that the spaces between sentences are not captured at the start of a match.
  • (.*?(\b\w\w\.))* - Here I capture any characters up to one of the exceptions. I use the same pattern explained above (\b\w\w\.), but you also have the option of listing the exceptions individually, as in the lookbehind example. This pattern is placed in a capturing group followed by a * quantifier, so it can repeat zero or more times.
  • .*?[.?!] - Here all characters are captured with a lazy quantifier until a period, exclamation mark, or question mark is reached.
  • (?![.?!]) - This is a negative lookahead. It rejects any match that is followed by another punctuation mark. I use it to capture sentences like Adoro animais!! in full.
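One caveat seems worth checking with your own text. The .*? inside the group (.*?(\b\w\w\.))* can also consume periods, so when the abbreviation appears in a later sentence, the group may swallow the earlier sentences on its way there. The sketch below shows this on a short input and tries a possible variant, which is an assumption of mine and not part of the original answer: inside the group, [^.?!] replaces the dot so the group cannot cross a sentence terminator.

```javascript
const texto = "Sou José. Ele é Dr. Paulo.";

// Pattern from the answer: the group may run past "José." to reach "Dr.".
const original = /(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g;
const comOriginal = texto.match(original);
console.log(comOriginal); // [ 'Sou José. Ele é Dr. Paulo.' ]

// Hypothetical variant: the group only matches characters that are not ., ? or !
// before the abbreviation, so it stays inside the current sentence.
const restrito = /(?! )(?:[^.?!]*?\b\w\w\.)*.*?[.?!](?![.?!])/g;
const comRestrito = texto.match(restrito);
console.log(comRestrito); // [ 'Sou José.', 'Ele é Dr. Paulo.' ]
```

The variant keeps the two-letter limitation described earlier, so it is only a starting point.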

Demo:

var paragrafo = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
var frases = paragrafo.match(/(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g);
console.log(frases);
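If the same script has to run in browsers with and without lookbehind, one way to choose between the two patterns is to try compiling the lookbehind version at run time. A regex literal containing (?<!...) is a syntax error for the whole script in an engine that does not support it, whereas new RegExp throws a SyntaxError that can be caught. This is a minimal sketch, and buildSentencePattern is a hypothetical name:

```javascript
// Hypothetical helper: returns the lookbehind pattern when the engine supports it,
// otherwise the pattern without lookbehind.
function buildSentencePattern() {
  try {
    // Compiled from a string so that an unsupported lookbehind throws here
    // instead of breaking the parsing of the whole script.
    return new RegExp("(?! ).*?[.!?](?![.!?])(?<!\\b\\w\\w\\.)", "g");
  } catch (e) {
    return /(?! )(.*?(\b\w\w\.))*.*?[.?!](?![.?!])/g;
  }
}

const texto = "Sou Dr. José. Meu passatempo é assistir séries. Adoro animais!! E você?";
const sentencas = texto.match(buildSentencePattern());
console.log(sentencas);
// [ 'Sou Dr. José.', 'Meu passatempo é assistir séries.', 'Adoro animais!!', 'E você?' ]
```

Both patterns give the same result for this paragraph, but they are not equivalent on every input, so it is worth testing the fallback separately.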
    
05.02.2018 / 17:37