REGEX JAVA Recognize different character group

5

I've been racking my brain over this: link

I'm trying to write a regex that validates some .txt files, accepting only lines that do not end with the characters JJ or M3.

For example, I have three .txt files with the lines below:

.txt 1

  • 4481; 77831853; 4461; 60; CAD; VCP; M3
  • 4647; 86940830; 4847; 35; FRA; VCP; M3

.txt 2

  • 3287; 69872804; 3297; 37; ANT; VCP; JJ
  • 3827; 72247849; 3857; 38; DEC; VCP; JJ

.txt 3

  • 5634; 7082850; 5634; 40; MAR; VCP; PZ
  • 4362; 3882867; 4382; 41; PAU; VCP; PZ

I need a regex that rejects .txt 1 and .txt 2 and accepts only .txt 3, because the last two characters of its lines are neither JJ nor M3.

        
    asked by anonymous 25.10.2017 / 21:22

    4 answers

    6

    Anyone who follows me here on SOpt knows that I rarely use features beyond those available for regular languages.

    The answer provided by @nunks.lol uses negative lookbehind, which is not regular in the mathematical sense. It is certainly a great solution, though.

    But I can do it without lookbehind!

    Expression of words that do not end with JJ

    The fact that the line must not contain these two letters specifically at the end makes the question easier. Just look at the linked answer to see how much work it takes to reject a subword anywhere in the string.

    For a line not to end with JJ, there are 4 alternatives:

  • The line is blank so it matches with ^$
  • The line contains exactly one character, so ^.$
  • The last character is not J, ^.*[^J]$
  • The penultimate character is not J, ^.*[^J].$

    So the following expression covers all of them:

    ^$|^.$|^.*[^J]$|^.*[^J].$
    

    Ugly, isn't it? Fortunately, it can be simplified:

    ^(.|.*([^J]|[^J].))?$
    
      

    I could have simplified [^J]|[^J]. even further, but then I would lose the shape that is reused in the next expression.
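To check that the written-out form and the simplified form agree, here is a minimal sketch in Java. The class name, method names, and sample inputs are illustrative assumptions, not part of the answer. It uses Matcher.matches(), so each expression has to cover the whole line.

```java
import java.util.regex.Pattern;

final class NotEndingWithJJ {
    // The four alternatives written out, as in the answer.
    static final Pattern LONG_FORM = Pattern.compile("^$|^.$|^.*[^J]$|^.*[^J].$");
    // The simplified form from the answer.
    static final Pattern SHORT_FORM = Pattern.compile("^(.|.*([^J]|[^J].))?$");

    static boolean longForm(String line) {
        return LONG_FORM.matcher(line).matches();
    }

    static boolean shortForm(String line) {
        return SHORT_FORM.matcher(line).matches();
    }

    public static void main(String[] args) {
        // Illustrative inputs (assumption): empty, one character, and two-character endings.
        String[] samples = {"", "J", "JJ", "AJ", "JA", "VCP; JJ", "VCP; PZ"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" long=" + longForm(s) + " short=" + shortForm(s));
        }
    }
}
```

Both forms should report the same result for every sample; only "JJ" and "VCP; JJ" are rejected.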

    Expression of words that do not end with M3

    For a line not to end with M3, there are 4 alternatives:

  • The line is blank so it matches with ^$
  • The line contains exactly one character, so ^.$
  • The last character is not 3, ^.*[^3]$
  • The penultimate character is not M, ^.*[^M].$

    I could show the ugly version and then simplify it, but I can also jump straight to the short form:

    ^(.|.*([^3]|[^M].))?$
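The expressions for JJ and M3 share one shape: for a two-character suffix ab it is ^(.|.*([^b]|[^a].))?$. As a sketch only, a hypothetical helper (notEndingWith is an assumed name, not something from the answer) can build it for any pair of letters or digits:

```java
import java.util.regex.Pattern;

final class NotEndingWithPair {
    // Hypothetical helper: builds ^(.|.*([^b]|[^a].))?$ for the suffix "ab".
    // Restricted to letters and digits so nothing needs escaping inside [...].
    static Pattern notEndingWith(char a, char b) {
        if (!Character.isLetterOrDigit(a) || !Character.isLetterOrDigit(b)) {
            throw new IllegalArgumentException("only letters and digits are supported");
        }
        return Pattern.compile("^(.|.*([^" + b + "]|[^" + a + "].))?$");
    }

    public static void main(String[] args) {
        Pattern notM3 = notEndingWith('M', '3');
        // Prints the generated expression, which equals the one in the answer.
        System.out.println(notM3.pattern());
        // Lines taken from the question: the first ends in M3, the second in PZ.
        System.out.println(notM3.matcher("4481; 77831853; 4461; 60; CAD; VCP; M3").matches());
        System.out.println(notM3.matcher("5634; 7082850; 5634; 40; MAR; VCP; PZ").matches());
    }
}
```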
    

    Putting it all together

    To put everything together, there are some special cases to consider:

  • The line can end in J if the penultimate character is M
  • The line can end in 3 if the penultimate character is J

    Beyond that, merging the negated character classes does the job. These two are the only cases not handled by the previous abstraction.

    ^(.|.*([^J3]|[^JM].|J3|MJ))?$
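A short Java sketch that applies the combined expression to the lines from the question. The class and method names are illustrative assumptions; the expression itself is the one above.

```java
import java.util.regex.Pattern;

final class CombinedNoLookbehind {
    // Combined expression from the answer: the line ends in neither JJ nor M3.
    static final Pattern VALID = Pattern.compile("^(.|.*([^J3]|[^JM].|J3|MJ))?$");

    static boolean isValid(String line) {
        return VALID.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] lines = {
            "4481; 77831853; 4461; 60; CAD; VCP; M3",
            "3287; 69872804; 3297; 37; ANT; VCP; JJ",
            "5634; 7082850; 5634; 40; MAR; VCP; PZ",
            "VCP; J3", // special case: ends in 3, but the penultimate character is J
            "VCP; MJ"  // special case: ends in J, but the penultimate character is M
        };
        for (String line : lines) {
            System.out.println(isValid(line) + "  " + line);
        }
    }
}
```

The first two lines are rejected; the PZ line and the two special cases are accepted.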
    
        
    26.10.2017 / 04:40
    5

    Use a negative lookbehind to ensure that the string does not end with the patterns you specified, that is, require a $ that is not preceded by JJ or M3. Your regular expression then becomes:

    ^.*(?<!JJ|M3)$
    

    Detailing:

    ^          # start of line
     .*        # any character, zero or more times
       (?<!    # opening of the negative lookbehind
           JJ  # literal sequence "JJ"
            |  # alternation ("or")
           M3  # literal sequence "M3"
       )       # closing of the negative lookbehind
    $          # end of line
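In Java the lookbehind expression can be used as-is, since it contains no backslashes to escape. A minimal sketch, assuming the lines have already been read into a list (for example with Files.readAllLines); the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class LookbehindFilter {
    // Expression from the answer: whole line, with no JJ or M3 right before the end.
    static final Pattern NOT_JJ_OR_M3 = Pattern.compile("^.*(?<!JJ|M3)$");

    // Returns only the lines that the expression accepts, in their original order.
    static List<String> validLines(List<String> lines) {
        List<String> valid = new ArrayList<>();
        for (String line : lines) {
            if (NOT_JJ_OR_M3.matcher(line).matches()) {
                valid.add(line);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // One line from each of the three files in the question.
        List<String> lines = Arrays.asList(
            "4481; 77831853; 4461; 60; CAD; VCP; M3",
            "3287; 69872804; 3297; 37; ANT; VCP; JJ",
            "5634; 7082850; 5634; 40; MAR; VCP; PZ");
        // Only the line ending in PZ remains.
        System.out.println(validLines(lines));
    }
}
```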
    

    Example on regex101.com: link

    Interesting explanation of lookaround in regular expressions: link

    Addendum: one thing to take into account when using lookahead and lookbehind (lookaround) is performance. Lookaround tends to use a bit more CPU than matching "traditional" regular expressions. If you want to apply this expression at scale, with many calls per second, it may be advantageous to use a more "verbose" approach, such as the answer given by @JeffersonQuesado.
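Independently of lookaround, when the check runs many times per second it usually helps to compile the expression once and reuse it. The sketch below only contrasts the two call styles and measures nothing; the class and method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

final class PrecompiledCheck {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern NOT_JJ_OR_M3 = Pattern.compile("^.*(?<!JJ|M3)$");

    static boolean acceptsPrecompiled(String line) {
        return NOT_JJ_OR_M3.matcher(line).matches();
    }

    // Convenient, but String.matches compiles the expression on every call.
    static boolean acceptsRecompiling(String line) {
        return line.matches("^.*(?<!JJ|M3)$");
    }

    public static void main(String[] args) {
        String line = "5634; 7082850; 5634; 40; MAR; VCP; PZ";
        // Both styles give the same answer; only the amount of work per call differs.
        System.out.println(acceptsPrecompiled(line) + " " + acceptsRecompiling(line));
    }
}
```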

        
    26.10.2017 / 02:49
    1

    I assume you are reading the file line by line, so this should do it. The simplest regex I could come up with was (.*(?!(M3|JJ))..|^.)$ .

    • .* : Accepts the first characters of the line
    • (?!(M3|JJ)) : Negative lookahead; fails if the next two characters are M3 or JJ
    • .. : Ensures that there will be two characters at the end of the line, otherwise M3 and JJ would pass
    • ^. : Allows a line with only one character
    • $ : End of entry, to ensure that the last accepted characters are the last of the line
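A minimal Java sketch of this lookahead variant, applied with Matcher.matches() to one line at a time. The class and method names are illustrative assumptions. One difference from the other answers: because the expression requires either two final characters or exactly one character, an empty line is not accepted.

```java
import java.util.regex.Pattern;

final class LookaheadCheck {
    // Expression from the answer: the last two characters are not M3 or JJ,
    // or the line consists of a single character.
    static final Pattern ACCEPTED = Pattern.compile("(.*(?!(M3|JJ))..|^.)$");

    static boolean accepts(String line) {
        return ACCEPTED.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] lines = {
            "4647; 86940830; 4847; 35; FRA; VCP; M3",
            "3827; 72247849; 3857; 38; DEC; VCP; JJ",
            "4362; 3882867; 4382; 41; PAU; VCP; PZ",
            "J", // a single character is accepted by the ^. branch
            ""   // an empty line is rejected by this expression
        };
        for (String line : lines) {
            System.out.println(accepts(line) + "  \"" + line + "\"");
        }
    }
}
```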
    25.10.2017 / 21:59
    1

    A different approach is to search for the negative case:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    ...
    
    // Java regex strings take no /.../ delimiters
    String pattern = "(JJ|M3)";
    Pattern regexp = Pattern.compile(pattern);
    Matcher m = regexp.matcher(linhaLidaDoArquivo);
    
    // find() returns true if the pattern occurs anywhere in the line;
    // it can also be called repeatedly to iterate over the matches
    // and count how many occurrences of the pattern were found
    if (m.find()) {
        System.out.println("Your file is INVALID");
    } else {
        System.out.println("Your file is VALID");
    }
    

    In other words: you know and control what you do not want to happen. The code above looks for the (uppercase) sequences "JJ" and "M3" at any position in the string. If they occur in the line that was read, the Pattern and Matcher classes will detect the occurrence. If one is found, the if condition evaluates to true and the invalid case can be handled. Otherwise, handle the line as the expected, valid scenario. To check only the end of the line, as the question asks, anchor the pattern with $: (JJ|M3)$
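A sketch of this negative approach with the pattern anchored at the end of the line, applied to a whole file. It assumes that a file counts as valid only when none of its lines ends in JJ or M3; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class NegativeCheck {
    // Anchored variant: JJ or M3 immediately before the end of the line.
    private static final Pattern FORBIDDEN_ENDING = Pattern.compile("(JJ|M3)$");

    static boolean lineIsInvalid(String line) {
        // find() looks for the pattern anywhere; the $ pins it to the end.
        return FORBIDDEN_ENDING.matcher(line).find();
    }

    // Assumption: a file is valid only if none of its lines is invalid.
    static boolean fileIsValid(List<String> lines) {
        for (String line : lines) {
            if (lineIsInvalid(line)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Contents of .txt 1 and .txt 3 from the question.
        List<String> txt1 = Arrays.asList(
            "4481; 77831853; 4461; 60; CAD; VCP; M3",
            "4647; 86940830; 4847; 35; FRA; VCP; M3");
        List<String> txt3 = Arrays.asList(
            "5634; 7082850; 5634; 40; MAR; VCP; PZ",
            "4362; 3882867; 4382; 41; PAU; VCP; PZ");
        System.out.println(".txt 1 valid: " + fileIsValid(txt1));
        System.out.println(".txt 3 valid: " + fileIsValid(txt3));
    }
}
```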

        
    23.04.2018 / 18:47