Regex in Python to find several possible names

-1

I need to find the judge's name in a labor court file, but first I need to know whether the person is a juiz/juíza (judge), a relator/relatora (rapporteur), or a desembargador/desembargadora (appellate judge).

I'm using the following Regex:

f_regex = re.compile(r'ju(iz|íza) relato(r|a) | ju(iz|íza) | relato(r|a) | desembargado(r|ra)')

But it is not working.

EDIT:

Problem solved. The problem was not in the regex but in a function of mine. Sorry for the inconvenience, I did not know what was happening either. I thank all who have given up their time to help me, indeed.

    
asked by anonymous 20.12.2017 / 18:56

4 answers

0

Hello, I created a version below with minor changes to your regex:

import re

f_regex = re.compile(r'^(\b|.+\s)ju(iz|íza) |^(\b|.+\s)relato(r|a) |^(\b|.+\s)desembargado(r|ra) ', re.IGNORECASE)


success = ["juíza Fulana", "Juiz Fulano", "desembargador Fulano", "Desembargadora Fulana", "Sr. Juiz de Tal"]
fail = ["Fulana de Tal", "Fulano de Tal", "Fulano de Tal", "Juízane Fulana de Tal", "Juizo de Tal", "Dajuiz de Tal"]

print("\nShould match:")
for string in success:
    result = f_regex.match(string)
    print(string, '- matched?', result is not None)

print("\nShould not match:")
for string in fail:
    result = f_regex.match(string)
    print(string, '- matched?', result is not None)

I've created some strings for testing too. Hope it helps.

    
20.12.2017 / 19:54
0

Following the text posted in the question comment: link

You can capture this information with regex \b(.+)(?:\n.*)(?:relatora?|desembargadora?|ju[íi]za?)

Explanation:

  

\b(.+) => Here we capture a pattern starting at a word boundary. Since we use .+ , it grabs all content up to the line break (next item).

(?:\n.*) => Here we match the line break and the text on the next line, up to the title.

(?:relatora?|desembargadora?|ju[íi]za?) => Here we filter on a few words. The ? says that the character right before it is optional, so relatora? matches both relator and relatora.

?: => We use this option to prevent the group from being captured; we only want the group to be validated.

Regex running
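As a minimal sketch of how this pattern behaves in Python, the snippet below runs it on a made-up excerpt in which the name sits on the line right before the title. The sample text, the `extract_name` helper and the use of `re.IGNORECASE` are assumptions for illustration; they are not taken from the linked file.

```python
import re

# Same pattern as above; IGNORECASE is an assumption so that "Juiz" and "JUIZ" both match.
NAME_RE = re.compile(
    r'\b(.+)(?:\n.*)(?:relatora?|desembargadora?|ju[íi]za?)',
    re.IGNORECASE,
)

def extract_name(text):
    """Return the line that precedes a line containing a title, or None."""
    m = NAME_RE.search(text)
    return m.group(1) if m else None

# Hypothetical excerpt from the end of a document.
sample = "Brasília, 20 de dezembro de 2017.\nFULANO DE TAL\nJuiz Relator\n"

print(extract_name(sample))
print(extract_name("nothing here\nat all"))
```

Group 1 holds the whole line before the title, so this only works when the name and the title are on consecutive lines.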

    
22.12.2017 / 18:56
0

Well, a lot of the problems were because of the spaces.

Folder with some txts: link

The following function catches most of the cases (juiz, relator, desembargador), but it is not catching "juiz relator". Maybe the problem is in the masculine|feminine alternation:

import re
from nltk.tokenize import line_tokenize

def find_juiz(file):
    # Keep only the last 10 lines of the text, last line first
    file_lines = list(reversed(line_tokenize(file)))[:10]
    file_chunked = str(file_lines)
    name_juiz = ''
    search = re.search(r'ju(iz|íza)\s*relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra)', file_chunked)
    if search is not None:
        for i, line in enumerate(file_lines):
            if line.strip() in search.group():
                # while file_lines[i+1] is not None:
                #     j=i
                #     name_juiz += file_lines[j+1]
                #     j+=1
                # return name_juiz
                return i, line.strip()
    else:
        return

PS: line_tokenize comes from the nltk package (Natural Language Toolkit), a package for NLP (Natural Language Processing) in Python. It receives a text and splits it into a list of lines, where each position is one line. Since the judges' names usually appear at the end of the file, I reversed this list with reversed and took its first 10 lines, which are the last 10 lines of the file ( list(reversed(line_tokenize(file)))[:10] ).
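One possible variation, sketched under the assumption that the title appears on a single line among the last lines of the document: search each of those lines separately, try the two-word title first, ignore case, and use word boundaries. `find_title_line` and the sample text are hypothetical, and `str.splitlines` stands in for `line_tokenize` only so that the sketch runs without nltk.

```python
import re

# The two-word alternative comes first so that "juiz relator" wins over plain "juiz".
# \b keeps the pattern from matching inside longer words such as "prejuízo".
TITLE_RE = re.compile(
    r'\b(?:ju[ií]za?\s+relatora?|ju[ií]za?|relatora?|desembargadora?)\b',
    re.IGNORECASE,
)

def find_title_line(text, last_n=10):
    """Return (index counted from the end, matched title) or None."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    tail = list(reversed(lines))[:last_n]  # last line first
    for i, line in enumerate(tail):
        m = TITLE_RE.search(line)
        if m:
            return i, m.group()
    return None

# Hypothetical document ending.
doc = "linha 1\nlinha 2\nFULANA DE TAL\nJuíza Relatora\nAssinado digitalmente\n"
print(find_title_line(doc))
```

Searching line by line avoids matching against the `str()` representation of a list, where brackets, quotes and escape sequences become part of the searched text.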

    
22.12.2017 / 18:19
0

First problem:

Your regex will not work with .match , because .match only matches at the beginning of the string. Use .search to find the pattern anywhere in the string.
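A small illustration of the difference, using a made-up sentence and a simplified pattern (both are assumptions for the example):

```python
import re

title = re.compile(r'desembargadora?', re.IGNORECASE)
text = "Assinado por Fulano de Tal, Desembargador"

# match() only tries at position 0, so a title later in the string is missed.
print(title.match(text))

# search() scans the whole string and finds the title wherever it is.
print(title.search(text))
```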

Second problem:

Another thing: your .txt file may be in UTF-8, and if it is read with the wrong encoding the accented characters will not be recognized. So if you are using urllib (perhaps to read the files remotely), add .decode('utf-8') to the result of the handler's read() .

If your document is in ASCII, windows-1252 or iso-8859-1, pass the encoding parameter to open() .

See the examples at the end of the answer
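To see why the encoding matters for the regex, here is a minimal sketch that decodes the same bytes correctly and incorrectly. The sample string and the variable names are made up for the illustration.

```python
import re

raw = "juíza relatora".encode("utf-8")  # bytes, as read() would return them

right = raw.decode("utf-8")   # correct decoding keeps the accented "í"
wrong = raw.decode("latin1")  # wrong decoding turns "í" into two other characters

pattern = re.compile(r'ju(iz|íza)')
print(pattern.search(right))  # finds "juíza"
print(pattern.search(wrong))  # finds nothing
```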

Third problem:

Your regex only matches words that have a space before and after them. Remember that phrases can end in punctuation such as . , ! , ? , etc., that words can be separated by , , ; or : , and that they can even be enclosed in quotation marks "

Your regex should look something like:

r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'
  • The (\s|^) at the beginning indicates that the word can be preceded by a space, line break or tab, or be at the start of the string
  • [!?",;:.\s] indicates that there may be end-of-word punctuation, and the \s inside it indicates that there can be spaces, line breaks, or tabs at the end of the word.
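A quick way to check a pattern like this is to run it over a few made-up phrases. The pattern below mirrors the one above (written with relato(r|ra) so that "relatora" is covered); the sample phrases and the `find_title` helper are assumptions for illustration.

```python
import re

pattern = re.compile(
    r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'
)

samples = [
    "fulano de tal, juiz relator.",     # title followed by a period
    "juíza, presente à sessão",         # title at the start, followed by a comma
    "a desembargadora relatou o caso",  # title followed by a space
    "o juizado especial",               # "juiz" inside a longer word
]

def find_title(text):
    """Return the matched title (group 2) or None."""
    m = pattern.search(text)
    return m.group(2) if m else None

for s in samples:
    print(repr(s), '->', find_title(s))
```

Because the final character class must consume one character, a title at the very end of the string with nothing after it is not matched by this pattern, and without `re.IGNORECASE` only lowercase titles are found.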

Example if downloading from a URL

If you are reading from the URL do so:

# -*- coding: utf-8 -*-

import re             # regular expressions
import urllib.request # HTTP download

url = "http://m.uploadedit.com/bbtc/1513873742547.txt"
pattern = r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

with urllib.request.urlopen(url) as f:
    # Decode the downloaded bytes so the accented characters are preserved
    data = f.read().decode('utf-8')

    p = re.compile(pattern)
    resultado = p.search(data)

    print(resultado)

If the remote file is in windows-1252 or iso-8859-1, use the following:

data = f.read().decode('latin1')

Example if you are reading a file on the machine

If the .txt file is in utf-8 use encoding='utf-8' ; if it is windows-1252 or iso-8859-1 use encoding='latin1'

import re

arquivo = '1513873742547.txt'
pattern = r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

# Open the file with an explicit encoding so the accents are read correctly
with open(arquivo, encoding='utf-8') as f:
    data = f.read()

    p = re.compile(pattern)
    resultado = p.search(data)

    print(resultado)
    
22.12.2017 / 21:00