How to capture only the first part of a text that fits the regex?

4
<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399;

Above is a sample text, and below is the regex I use to extract information from it. One thing to keep in mind is that the source text is semi-structured and contains some repetition. For context, the regex is meant to capture addresses.

  

, (established | localized | localized) (in | no | em) ([^ (Municipality | State)] ([0-9A-Za-zçãàáâéêíóôõúÂÃÁÀÉÊÍÓÔÕÚÇ \ Q () < …

I want to capture each address in the document and wrap each one in <END> and </END> tags. Only the part delimited by the following counts as an address:

  

>

That is, the rest is "normal text": it should not be captured, but it must not be discarded either. For the example above, the result should look like this:

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na <END>Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE</END>, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na <END>Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05 e Inscrição Estadual nº 110.250.399;

However, as the first text shows, my regex captures all the addresses at once, as a single match. I chose regex because that is how I was already capturing other things, but if there is another way to solve this, that is fine too.

    
asked by anonymous 22.01.2017 / 03:06

1 answer

3

I see that the addresses in your text are separated by semicolons, which makes the task quite simple:

import java.util.Arrays;
import java.util.stream.Collectors;

public class Enderecos {

    // Inserts "<END>" right after the first phrase combination that occurs in s
    // (e.g. "localizada na "). If none occurs, the address is assumed to start
    // at the beginning of the entry.
    private static String localizarInicio(String s) {
        String[] loc = {"estabelecida ", "estabelecido ", "localizada ", "localizado "};
        String[] cj = {"em ", "na ", "no "};
        for (String a : loc) {
            for (String b : cj) {
                if (s.contains(a + b)) return s.replace(a + b, a + b + "<END>");
            }
        }
        return "<END>" + s;
    }

    // Inserts "</END>" just before ", com CNPJ". If that text is absent, the
    // address is assumed to run to the end of the entry.
    private static String localizarFim(String s) {
        String busca = ", com CNPJ";
        if (s.contains(busca)) return s.replace(busca, "</END>" + busca);
        return s + "</END>";
    }

    public static String formatarListaEnderecos(String malformatado) {
        return Arrays
                .stream(malformatado.split(";"))
                // Drop any tags already present, then trim surrounding spaces.
                .map(t -> t.replace("<END>", "").replace("</END>", "").trim())
                // Skip entries that are empty after the cleanup.
                .filter(t -> !t.isEmpty())
                .map(Enderecos::localizarInicio)
                .map(Enderecos::localizarFim)
                // Joins without a separator: the ";" consumed by split is not restored.
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String texto = "<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB, com CNPJ nº 07.526.557/0013-43e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399;<END>";
        String formatado = formatarListaEnderecos(texto);
        System.out.println(formatado);
    }
}

The method that does what you want is formatarListaEnderecos(String). It works as follows:

  • Splits the text at the semicolons, producing an array of entries, which is then turned into a Stream.

  • Removes the "<END>" and "</END>" tags already present in each entry, since they are not correctly placed in the input (they are added back in the right positions later).

  • Removes spaces at the beginning and end of each entry with trim().

  • Discards "addresses" that are reduced to empty strings.

  • Finds where "<END>" belongs in each entry and inserts it.

  • Finds where "</END>" belongs in each entry and inserts it.

  • Joins everything into a single string and returns the result (the semicolons removed by the split are not restored).

The place where "<END>" goes is determined by the localizarInicio(String) method. It searches for "(estabelecid|localizad)(o|a) (em|na|no) " and places <END> right after the match. If nothing is found, the tag goes at the beginning.

The "</END>" tag is placed before the text ", com CNPJ". If that text is not found, the tag goes at the end.
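The two placement rules can also be written with the pattern quoted above instead of nested loops. This is only a sketch under assumptions: the class name MarcadoresComRegex and the methods marcarInicio and marcarFim are hypothetical and not part of the answer's code, and the sample string is an abbreviated, ASCII-only variant of the second entry. Unlike the String.replace calls in the answer, it tags only the first occurrence in each entry.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarcadoresComRegex {

    // The phrase alternatives described in the answer, as a single pattern.
    private static final Pattern INICIO =
            Pattern.compile("(estabelecid|localizad)(o|a) (em|na|no) ");

    // Literal text that follows each address in the sample.
    private static final String FIM = ", com CNPJ";

    // Puts <END> right after the first introducing phrase, or at the start if none.
    public static String marcarInicio(String s) {
        Matcher m = INICIO.matcher(s);
        // $0 is the whole match, so the phrase itself is kept unchanged.
        return m.find() ? m.replaceFirst("$0<END>") : "<END>" + s;
    }

    // Puts </END> right before the first ", com CNPJ", or at the end if absent.
    public static String marcarFim(String s) {
        int i = s.indexOf(FIM);
        return i >= 0 ? s.substring(0, i) + "</END>" + s.substring(i) : s + "</END>";
    }

    public static void main(String[] args) {
        String entrada = "(NR) II - Sergipe, localizada na Rodovia BR-101, km 133, "
                + "Distrito Industrial, com CNPJ 07.526.577/0012-62";
        System.out.println(marcarFim(marcarInicio(entrada)));
    }
}
```

A precompiled Pattern keeps all the alternatives in one place, at the cost of slightly different behavior when a phrase occurs more than once in the same entry.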

The main(String[]) method of the Enderecos class is there so you can test formatarListaEnderecos. Running it prints:

    <END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>, com CNPJ nº 07.526.557/0013-43e Inscrição Estadual nº 16.218.7157(NR) II - Sergipe, localizada na <END>Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE</END>, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5(NR) III - Camaçari, localizada na <END>Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399
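The output above no longer contains the semicolons, because split(";") consumes them and Collectors.joining() is called without a separator. If they need to survive, as in the result the question asks for, one option is to pass a delimiter and a suffix to joining. The sketch below assumes that every entry should be followed by ";" and that entries are separated by "; ". The class EnderecosComSeparador and its marcar helper are hypothetical, simplified stand-ins for the answer's methods (only "localizada na " is handled), and the test data in main is made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class EnderecosComSeparador {

    // Simplified stand-in for localizarInicio/localizarFim: handles only the
    // phrases that occur in the sample ("localizada na " and ", com CNPJ").
    public static String marcar(String s) {
        String inicio = "localizada na ";
        String fim = ", com CNPJ";
        int i = s.indexOf(inicio);
        s = i >= 0
                ? s.substring(0, i + inicio.length()) + "<END>" + s.substring(i + inicio.length())
                : "<END>" + s;
        int f = s.indexOf(fim);
        return f >= 0 ? s.substring(0, f) + "</END>" + s.substring(f) : s + "</END>";
    }

    public static String formatar(String texto) {
        return Arrays.stream(texto.split(";"))
                .map(t -> t.replace("<END>", "").replace("</END>", "").trim())
                .filter(t -> !t.isEmpty())
                .map(EnderecosComSeparador::marcar)
                // Re-insert "; " between entries and close the last one with ";".
                .collect(Collectors.joining("; ", "", ";"));
    }

    public static void main(String[] args) {
        // Hypothetical, shortened input in the same shape as the sample text.
        System.out.println(formatar(
                "Rua A, 10, com CNPJ 1; Filial, localizada na Rua B, 20, com CNPJ 2;"));
    }
}
```

This normalizes the spacing around each semicolon and always appends a final ";", even when the input had none, so it is only appropriate if that convention matches the documents being processed.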
    

Regarding the use of regex, I think this is an example of an XY problem. That is, I think you are focusing on a tool that may not be the best one for solving this problem.
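For comparison, if a regex is preferred anyway, the symptom described in the question (all addresses captured at once) is what a greedy quantifier such as .+ typically produces, since it extends to the last place where the rest of the pattern can still match. A reluctant quantifier (.+?) stops at the first one. The sketch below illustrates the difference under assumptions: the class EnderecosReluctant, both patterns, and the test data are hypothetical and are not the regex from the question, the input is expected to have no <END> tags yet, and each address is assumed to end right before ", com CNPJ".

```java
import java.util.regex.Pattern;

class EnderecosReluctant {

    // Group 1: an introducing phrase, or the very start of the text.
    private static final String INICIO =
            "((?:estabelecid|localizad)[oa] (?:em|na|no) |^)";

    // Lookahead: the address ends right before ", com CNPJ" (not consumed).
    private static final String FIM = "(?=, com CNPJ)";

    // Greedy: .+ runs to the LAST ", com CNPJ" in the text.
    private static final Pattern GREEDY = Pattern.compile(INICIO + "(.+)" + FIM);

    // Reluctant: .+? stops at the FIRST ", com CNPJ" after the start.
    private static final Pattern RELUCTANT = Pattern.compile(INICIO + "(.+?)" + FIM);

    public static String marcarGreedy(String texto) {
        return GREEDY.matcher(texto).replaceAll("$1<END>$2</END>");
    }

    public static String marcarReluctant(String texto) {
        return RELUCTANT.matcher(texto).replaceAll("$1<END>$2</END>");
    }

    public static void main(String[] args) {
        // Hypothetical, shortened input in the same shape as the sample text.
        String texto = "Rua A, 10, com CNPJ 1; Filial, localizada na Rua B, 20, com CNPJ 2;";
        System.out.println(marcarGreedy(texto));     // one tag pair around everything
        System.out.println(marcarReluctant(texto));  // one tag pair per address
    }
}
```

Because nothing is split, the semicolons and the rest of the "normal text" stay in place. The pattern is still tied to the exact phrases of this document, and . does not match line breaks by default, which is part of why the plain string approach may be easier to maintain.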

        
    22.01.2017 / 03:34