How do I capture only the first part of a text that matches the regex?

Question

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399;

Above is a sample of the text, and below is the regex I use to extract information from it. One thing to keep in mind is that the source text is semi-structured and contains repetitions. For context, this regex is meant to capture addresses.

, (established | localized | localized) (in | no | em) ([^ (Municipality | State)] ([0-9A-Za-zçãàáâéêíóôõúÂÃÁÀÉÊÍÓÔÕÚÇ \ Q () < …

I want to capture each address in the document and wrap each one in <END> and </END> tags. Only the part delimited by those tags counts as an address.

The rest is "normal text": it should not be captured, but it should not be discarded either. For the given example, the result should look like this:

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>, com CNPJ nº 07.526.557/0013-43 e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na <END>Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE</END>, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na <END>Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05 e Inscrição Estadual nº 110.250.399;

However, as the sample at the top shows, I get all the addresses at once, as a single capture. I chose regex because that is how I was capturing other things, but any other approach that solves the problem is fine.

java regex

asked by anonymous 22.01.2017 / 03:06

1 answer

score 3 · Accepted Answer

I see that in your text the different addresses are separated by semicolons. This makes the task very simple:

import java.util.Arrays;
import java.util.stream.Collectors;

public class Enderecos {

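    // Inserts <END> right after the introductory phrase (e.g. "localizada na "), or at the start if none is found.
    private static String localizarInicio(String s) {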
        String[] loc = {"estabelecida ", "estabelecido ", "localizada ", "localizado "};
        String[] cj = {"em ", "na ", "no "};
        for (String a : loc) {
            for (String b : cj) {
                if (s.contains(a + b)) return s.replace(a + b, a + b + "<END>");
            }
        }
        return "<END>" + s;
    }

    // Inserts </END> right before ", com CNPJ", or at the end if that text is absent.
    private static String localizarFim(String s) {
        String busca = ", com CNPJ";
        if (s.contains(busca)) return s.replace(busca, "</END>" + busca);
        return s + "</END>";
    }

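    // Splits at ';', strips the old tags, drops empty entries, re-tags each entry and joins the results.
    public static String formatarListaEnderecos(String malformatado) {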
        return Arrays
                .asList(malformatado.split(";"))
                .stream()
                .map(t -> t.replace("<END>", "").replace("</END>", "").trim())
                .filter(t -> !t.isEmpty())
                .map(Enderecos::localizarInicio)
                .map(Enderecos::localizarFim)
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String texto = "<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB, com CNPJ nº 07.526.557/0013-43e Inscrição Estadual nº 16.218.7157; (NR) II - Sergipe, localizada na Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5; (NR) III - Camaçari, localizada na Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399;</END>";
        String formatado = formatarListaEnderecos(texto);
        System.out.println(formatado);
    }
}

The method that does what you want is formatarListaEnderecos(String). It works as follows:

Splits the text at the semicolons, producing an array of entries, which is then converted into a list and then into a Stream.

Removes the "<END>" and "</END>" tags already present in each entry, since they are not placed correctly in the input (they are re-added later).

Removes spaces at the beginning and end of each entry with trim().

Discards entries that are reduced to empty strings.

Finds where "<END>" belongs in each entry and inserts it.

Finds where "</END>" belongs in each entry and inserts it.

Joins everything into a single string and returns the result.

The position of "<END>" is determined by the localizarInicio(String) method. It searches for "(estabelecid|localizad)(o|a) (em|na|no) " and places <END> right after it. If it finds no such phrase, it puts the tag at the beginning.

The position of "</END>" is determined by the localizarFim(String) method, which places the tag right before the text ", com CNPJ". If that text is not found, the tag goes at the end.
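
The search for "(estabelecid|localizad)(o|a) (em|na|no) " can also be written with an actual compiled pattern instead of nested loops. The following is only a sketch, not the code from the answer: the class name MarcadorInicio and the constant INICIO are hypothetical, and unlike the original, which uses String.replace and therefore tags every occurrence of the phrase, this version tags only the first occurrence in the entry.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarcadorInicio {

    // Hypothetical constant: the same alternatives the nested loops try,
    // (estabelecid|localizad)(o|a) (em|na|no), followed by a space.
    private static final Pattern INICIO =
            Pattern.compile("(?:estabelecid|localizad)[oa] (?:em|na|no) ");

    // Inserts <END> right after the first introductory phrase,
    // or at the very beginning if no such phrase is found.
    static String localizarInicio(String s) {
        Matcher m = INICIO.matcher(s);
        if (m.find()) {
            return s.substring(0, m.end()) + "<END>" + s.substring(m.end());
        }
        return "<END>" + s;
    }

    public static void main(String[] args) {
        System.out.println(localizarInicio("(NR) II - Sergipe, localizada na Rodovia BR-101, km 133"));
        System.out.println(localizarInicio("Av. Dr. Walter Belian, 2.230"));
    }
}
```

The first call places the tag after "localizada na "; the second has no introductory phrase, so the tag goes at the start of the string.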

The main(String[]) method is there so you can test it. Running it produces this output:

<END>Av. Dr. Walter Belian, nº 2.230, Distrito Industrial, João Pessoa-PB</END>, com CNPJ nº 07.526.557/0013-43e Inscrição Estadual nº 16.218.7157(NR) II - Sergipe, localizada na <END>Rodovia BR-101, s/nº, km 133, Distrito Industrial, Estância-SE</END>, com CNPJ nº 07.526.577/0012-62 e Inscrição Estadual nº 27.142.202-5(NR) III - Camaçari, localizada na <END>Rua João Úrsulo, nº 1.620, Polo Petroquímico, Camaçari-BA</END>, com CNPJ nº 07.526.557/0015-05e Inscrição Estadual nº 110.250.399
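
Note that this output no longer contains the semicolons: split(";") consumes them and Collectors.joining() puts nothing back. Since the question asks that the remaining text not be discarded, a variant that restores them might look like the sketch below. The class name EnderecosComSeparador and the methods marcarInicio, marcarFim and formatar are hypothetical; the two helpers copy the logic of localizarInicio and localizarFim. It assumes every entry in the input ends with a semicolon, so that "; " between entries plus a final ";" reproduces the original punctuation.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class EnderecosComSeparador {

    // Same rule as localizarInicio: the tag goes after the introductory phrase.
    static String marcarInicio(String s) {
        String[] loc = {"estabelecida ", "estabelecido ", "localizada ", "localizado "};
        String[] cj = {"em ", "na ", "no "};
        for (String a : loc) {
            for (String b : cj) {
                if (s.contains(a + b)) return s.replace(a + b, a + b + "<END>");
            }
        }
        return "<END>" + s;
    }

    // Same rule as localizarFim: the tag goes right before ", com CNPJ".
    static String marcarFim(String s) {
        String busca = ", com CNPJ";
        if (s.contains(busca)) return s.replace(busca, "</END>" + busca);
        return s + "</END>";
    }

    static String formatar(String texto) {
        return Arrays.stream(texto.split(";"))
                .map(t -> t.replace("<END>", "").replace("</END>", "").trim())
                .filter(t -> !t.isEmpty())
                .map(EnderecosComSeparador::marcarInicio)
                .map(EnderecosComSeparador::marcarFim)
                // Assumption: entries were separated by "; " and the text ended with ";".
                .collect(Collectors.joining("; ", "", ";"));
    }

    public static void main(String[] args) {
        String texto = "<END>Rua A, 1, com CNPJ 1; (NR) II - X, localizada na Rua B, 2, com CNPJ 2;</END>";
        System.out.println(formatar(texto));
    }
}
```

One caveat of this design: with an input that has no non-empty entries, the suffix alone is returned, so the result is ";" rather than an empty string.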

As for using regex, I think that idea is an example of the XY problem. That is, I think you are reaching for a tool that may not be the best one for this job.
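
For readers who still want a regex, one common reason a pattern captures several items at once is a greedy quantifier. Whether that is what happens in the regex from the question cannot be confirmed here, so the sketch below only illustrates the general effect. The class GulosoVersusPreguicoso, the method capturar, the two simplified patterns and the sample text are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GulosoVersusPreguicoso {

    // Returns group 1 of every match of the given pattern in the text.
    static List<String> capturar(String regex, String texto) {
        List<String> achados = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(texto);
        while (m.find()) {
            achados.add(m.group(1));
        }
        return achados;
    }

    public static void main(String[] args) {
        String texto = "localizada na Rua A, 10, com CNPJ 1; localizada na Rua B, 20, com CNPJ 2;";

        // Greedy: .+ extends to the LAST ", com CNPJ", so both addresses end up in one match.
        System.out.println(capturar("localizad[oa] (?:em|na|no) (.+), com CNPJ", texto));

        // Reluctant: .+? stops at the FIRST ", com CNPJ", giving one match per address.
        System.out.println(capturar("localizad[oa] (?:em|na|no) (.+?), com CNPJ", texto));
    }
}
```

The greedy pattern yields a single capture spanning both entries, while the reluctant one yields "Rua A, 10" and "Rua B, 20" separately. Even so, the split-based method above remains easier to follow for this input.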