Find quotation marks with whitespace with RegEx

8

I need to find errors where the text inside quotation marks has a space at the beginning or at the end.

Examples of errors:

  • The news was given by " Jornal do Brasil".
  • Paris is considered the "City of Light ".

    Note that in the first case the text inside the quotation marks starts with a blank space, and in the second case it ends with one.

    I want to remove those unnecessary spaces, using regular expressions to locate the errors.

    I used two regexes for this:

    " .*?"
    ".*? "
    

    With the first, I can find quoted text that starts with a space; with the second, quoted text that ends with one.

    It turns out that there is a problem with these expressions.

    Example:

  • I like the colors "blue" and "black".
    Note that there is no error in this sentence. The two words "blue" and "black" do not begin or end with whitespace, yet the first regular expression above reports a false positive on the text between them, " and " .
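
    A minimal sketch that reproduces the problem, assuming JavaScript as the host language (the variable names are only illustrative). It runs the two patterns above against the sentence and prints what each one matches:

```javascript
// Reproduces the false positives with the two original patterns.
const sentence = 'I like the colors "blue" and "black".';

// Pattern 1: a quote followed by a space, then the shortest run up to the next quote.
const leadingMatches = sentence.match(/" .*?"/g);

// Pattern 2: a quote, then the shortest run that ends in a space followed by a quote.
const trailingMatches = sentence.match(/".*? "/g);

console.log(leadingMatches);  // one match: '" and "'
console.log(trailingMatches); // one match: '"blue" and "'
```

    Neither pattern knows which quotation mark opens a quoted segment and which one closes it, so both happily start or end a match on the wrong one.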

    I have tried many approaches, but my knowledge of regular expressions is still very limited and I have not been able to fix this.

    Which regex should I use in this case?

    Thank you!

        
    asked by anonymous 26.08.2014 / 22:37

    3 answers

    4

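    The best I can suggest is a regex that matches the string as a whole, because the problem here is that parsing locally can give a different result from parsing globally: a quotation mark only counts as opening or closing once every quotation mark before it has been paired.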

    My attempt at a solution would be:

    ^[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*$
    

    Example in Rubular. Explanation:

    • ^ - start of string
    • [^"]* - followed by zero or more characters that are not quotation marks (text outside the quotation marks)
    • (?:...)* - followed by zero or more of:
      • " - opening quotation mark
      • (?:...|...)? - optionally followed by either:
        • [^"\s] - a single character that is neither a quotation mark nor whitespace; or:
        • [^"\s] - a character that is neither a quotation mark nor whitespace, followed by
        • [^"]* - zero or more characters that are not quotation marks, followed by
        • [^"\s] - a character that is neither a quotation mark nor whitespace
      • " - closing quotation mark
      • [^"]* - zero or more characters that are not quotation marks (text outside the quotation marks)
    • $ - end of string

    In plain language, it takes a segment outside quotation marks, then a segment inside them, then another outside, another inside, and so on. The segments inside quotation marks can be of three kinds: a) empty - "" ; b) a single character - "a" ; c) a non-space first and last character, with anything in between - "a...b" .

    Note that all this regex tells you is whether the string is valid or invalid: it cannot tell you at which character the error is.
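
    As a rough sketch of how this validator could be used, assuming JavaScript (the function name isValid and the sample sentences are only illustrative):

```javascript
// Whole-string validator: true when no quoted segment starts or ends with whitespace.
const VALID = /^[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*$/;

function isValid(text) {
  return VALID.test(text);
}

const samples = [
  'I like the colors "blue" and "black".',      // no padding inside the quotes
  'The news was given by " Jornal do Brasil".', // leading space inside the quotes
  'Paris is considered the "City of Light ".',  // trailing space inside the quotes
];

for (const s of samples) {
  console.log(isValid(s), s);
}
```

    The first sample is accepted and the other two are rejected, but the result is only a yes/no answer for the whole string.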

    Update: If what you want is a regex that matches strings with an error - and tells you where the error is - this was the best I could do:

    ^[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*("(?:\s[^"]*|[^"]*\s)")[^"]*(?:"(?:[^"\s]|[^"\s][^"]*[^"\s])?"[^"]*)*$
    

    Example in jsFiddle . This "monstrosity" boils down to:

    ^ regex_original ("(?:\s[^"]*|[^"]*\s)") regex_original $
    

    That is, "match something that is correct, followed by something that is incorrect, followed by something that is correct." It will detect one and only one error of that type - if the string has two or more errors, or a quotation mark that opens but never closes, etc., the regex will not catch it.

    I think with a little more effort you can improve this a bit, but we are getting to the point where regex is no longer the most suitable tool for the job ...
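
    To make the "correct, incorrect, correct" structure visible, here is a sketch in JavaScript that assembles the same regex from its two pieces and reads the offending segment from the capture group. The names OK_PART, BAD_PART and findSingleError are hypothetical, chosen only for this illustration:

```javascript
// "Correct" stretch: text outside quotes plus any number of well-formed quoted segments.
const OK_PART = '[^"]*(?:"(?:[^"\\s]|[^"\\s][^"]*[^"\\s])?"[^"]*)*';
// "Incorrect" quoted segment (captured): starts or ends with whitespace.
const BAD_PART = '("(?:\\s[^"]*|[^"]*\\s)")';

const ONE_ERROR = new RegExp('^' + OK_PART + BAD_PART + OK_PART + '$');

// Returns the offending quoted segment, or null when the regex does not match
// (no error at all, more than one error, or unbalanced quotation marks).
function findSingleError(text) {
  const m = ONE_ERROR.exec(text);
  return m ? m[1] : null;
}

console.log(findSingleError('The news was given by " Jornal do Brasil".'));
console.log(findSingleError('I like the colors "blue" and "black".'));
console.log(findSingleError('" a" and "b "'));
```

    The first call returns the padded segment, while the other two return null: one because there is nothing wrong, the other because there are two errors, which is exactly the limitation described above.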

        
    26.08.2014 / 23:14
    3

    Here is a suggestion, which works in the text I've tried:

    var texto = '" Jornal do Brasil". Paris é considerada a "Cidade Luz " Gosto das cores "azul" e "preto".';
    
    var textoLimpo = texto.replace(/"([^"]*)"/g, function (match, r) {
        return '"' + r.replace(/^\s+|\s+$/g, '') + '"';
    });
    console.log(textoLimpo); // "Jornal do Brasil". Paris é considerada a "Cidade Luz" Gosto das cores "azul" e "preto".
    

    Demo: link

    Basically, I split the process into two parts: first isolating the pieces that start and end with " (quotation marks), and then cleaning them one by one with r.replace(/^\s+|\s+$/g, '') .

    The first part, /"([^"]*)"/g , captures everything between two quotation marks: [^"]* matches any run of characters that are not quotation marks, and the regex is then closed with " .

    The second part uses the start-of-string and end-of-string anchors ( ^ and $ , respectively) with the | alternation between them, so whitespace is removed at either end.
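
    The same two-step idea can also report the errors instead of fixing them. This is only a sketch under that assumption, with a hypothetical helper named findPaddedQuotes; it pairs the quotation marks with the first regex and then tests each inner text for whitespace at either end:

```javascript
// Lists every quoted segment whose inner text starts or ends with whitespace.
function findPaddedQuotes(text) {
  const errors = [];
  // Pair quotation marks left to right, capturing the text between each pair.
  for (const m of text.matchAll(/"([^"]*)"/g)) {
    // ^\s catches a leading space, \s$ a trailing one.
    if (/^\s|\s$/.test(m[1])) {
      errors.push({ index: m.index, segment: m[0] });
    }
  }
  return errors;
}

const report = findPaddedQuotes(
  'The news was given by " Jornal do Brasil". ' +
  'Paris is considered the "City of Light ". ' +
  'I like the colors "blue" and "black".'
);
console.log(report);
```

    Because the quotation marks are consumed in pairs, the text between "blue" and "black" is never treated as a quoted segment, and each entry carries the position of the opening quotation mark.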

        
    26.08.2014 / 23:13
    1

    This expression will get anything inside quotation marks (including quotation marks). Ex: "This is not done"

    "(.*?)(\w+)\b"
    
        
    22.07.2015 / 20:28