First problem:
Your regex will not work with .match, because .match only tries to match at the very beginning of the string. To find the expression anywhere in the text, use .search instead.
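A minimal sketch of the difference, using a made-up sample sentence and a shortened pattern (both are assumptions for illustration, not the data from the question):

```python
import re

# Hypothetical sample sentence, not taken from the original file.
text = "O voto do relator foi lido."
pattern = r"relato(r|ra)"

# re.match only tries to match at position 0 of the string.
at_start = re.match(pattern, text)

# re.search scans the whole string and returns the first match.
anywhere = re.search(pattern, text)

print(at_start)           # None
print(anywhere.group(0))  # relator
```

Since the word is not at the start of the sentence, only .search finds it.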
Second problem:
Another thing: your .txt file may be in UTF-8, in which case the accents may not be recognized. If you are using urllib (perhaps to read the files remotely), read() on the handler returns bytes, so add .decode('utf-8') to it.
If your document is in ASCII, windows-1252 or iso-8859-1, add the encoding parameter to open().
See the examples at the end of the answer.
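To see why the encoding matters, here is a small sketch that needs no file or network. The word "juíza" is only an illustrative value; the bytes stand in for what would be read from a UTF-8 file:

```python
import re

# Bytes as they would be stored in a UTF-8 encoded file.
raw = "juíza".encode("utf-8")

right = raw.decode("utf-8")   # decoded with the matching codec
wrong = raw.decode("latin1")  # decoded with the wrong codec

print(len(right), len(wrong))      # 5 6
print(re.search("juíza", right))   # a match object
print(re.search("juíza", wrong))   # None
```

The accented letter takes two bytes in UTF-8, so decoding with the wrong codec turns it into two different characters and the regex no longer finds the word.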
Third problem:
Your regex looks for anything that has spaces before and after it. Remember that phrases can end in punctuation such as ., !, ?, etc., can be separated by ,, ; or :, and can even be enclosed in quotation marks ".
Your regex should look something like:
r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'
- The (\s|^) at the beginning indicates that the word can be preceded by a space, line break or tab, or be at the very start of the text.
- In the [!?",;:.\s] at the end, the characters !?",;:. indicate that there may be end-of-word punctuation, and \s indicates that there can be spaces, line breaks or tabs at the end of the word.
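As a rough check of how the pattern behaves, the sketch below runs it against a few invented sentences (the sentences are assumptions, not lines from the real file). The feminine form is written as relato(r|ra) here, since relato(r|a) would accept "relator" but not "relatora":

```python
import re

pattern = re.compile(
    r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'
)

# Hypothetical sample sentences for illustration only.
samples = [
    "A juíza relatora, votou.",  # followed by a comma
    "Quem é o relator?",         # followed by a question mark
    "O juizado decidiu.",        # "juiz" is only part of a longer word
]

found = []
for sentence in samples:
    m = pattern.search(sentence)
    # Group 2 holds the matched expression without the surrounding characters.
    found.append(m.group(2) if m else None)

print(found)  # ['juíza relatora', 'relator', None]
```

The last sentence gives no match because the character right after "juiz" is a letter, not punctuation or whitespace.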
Example if downloading from a URL
If you are reading from a URL, do it like this:
# -*- coding: utf-8 -*-
import re  # regular expressions module
import urllib.request  # module used to open the URL

url = "http://m.uploadedit.com/bbtc/1513873742547.txt"
pattern = r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

with urllib.request.urlopen(url) as f:
    # read() returns bytes, so decode them into text
    data = f.read().decode('utf-8')

p = re.compile(pattern)
resultado = p.search(data)
print(resultado)
If the remote file is in windows-1252 or iso-8859-1, use the following:
data = f.read().decode('latin1')
Example if you are reading a file on the machine
If the .txt file is in UTF-8, use encoding='utf-8'; if it is windows-1252 or iso-8859-1, use encoding='latin1'.
import re

arquivo = '1513873742547.txt'
pattern = r'(\s|^)(ju(iz|íza) relato(r|ra)|ju(iz|íza)|relato(r|ra)|desembargado(r|ra))[!?",;:.\s]'

# open() decodes the file contents using the given encoding
with open(arquivo, encoding='utf-8') as f:
    data = f.read()

p = re.compile(pattern)
resultado = p.search(data)
print(resultado)