I always have a lot of trouble with replace and re.sub. I understand how they work in principle, but I never get the result I expect. I have a list of words that I am trying to remove from a text:
Text:
Brazil, officially the Federative Republic of Brazil is the largest country in South America and the Latin American region, being the fifth largest in the world in territorial area (equivalent to 47% of South American territory) and sixth in population ( with more than 200 million inhabitants). It is the only country in America where most of the Portuguese language is spoken and the largest Lusophone country on the planet, as well as being one of the most multicultural and ethnically diverse nations due to the strong immigration from various places in the world. Its current Constitution, formulated in 1988, defines Brazil as a presidential federative republic, formed by the union of the Federal District, the 26 states and 5 570 municipalities.
List:
is
o
da
and
do
in
na
if
de
Script:
import re
import csv
import itertools
with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read().split('\n')
    stopwords = csv.reader(stop)
    for palavras in fichier:
        palavras = palavras.lower()
        for word in stopwords:
            merged_stopwords = list(itertools.chain(*stopwords))
            stopwords_regex = re.compile('|'.join(map(re.escape, merged_stopwords)))
            replace_stopwords = stopwords_regex.sub('', palavras)
            print(replace_stopwords)
The problem is that my script also removes these strings when they occur inside other words, so letters disappear from the middle of words:
Output:
brazil, ficialmnt rpública frativa brazil is mair country america south rgiã america latin, n quint mair mun m trritrial (quivalnt to 47% trritóri sul-amrican) xt m ppulaçã (cm plus 200 million habitants). It is the only country in America speaking majritariamnt the prtugusa language mair country lusófn plant, in addition to a more multicultural sibling tnicamnt divrsas, m crrência frt inmigraçã riun several lcais mun. its current institution, formula m 1988, fin brasil in a prsincialista fp prpublic republic, frma pla uniã distrit fral, s 26 stas s 5 570 municipalities.
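The effect can be reproduced in isolation. The snippet below is only a minimal sketch: the sample sentence is made up for illustration, and the stopword list is typed inline instead of being read from lista.csv. It lists every match of the plain alternation together with the word it was found in, which shows that a one-letter stopword such as "o" is also matched inside longer words.

```python
import re

# Stopwords from the list above, written inline (the post reads them from lista.csv).
stopwords = ["is", "o", "da", "and", "do", "in", "na", "if", "de"]
# Hypothetical sample sentence, not the full text from the post.
sample = "brazil is in south america and the world"

# Plain alternation: every stopword may match anywhere, even inside a word.
pattern = re.compile("|".join(map(re.escape, stopwords)))

# Pair each match with the word that contains it.
hits = [(m.group(), w) for w in sample.split() for m in pattern.finditer(w)]
print(hits)
# [('is', 'is'), ('in', 'in'), ('o', 'south'), ('and', 'and'), ('o', 'world')]

stripped = pattern.sub("", sample)
print(stripped.split())
# ['brazil', 'suth', 'america', 'the', 'wrld']
```

The matches inside "south" and "world" are the ones that damage the text; the pattern has nothing that restricts a match to a whole word.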
EDITED
Solution found thanks to Isac and RickADT
Script:
import re
import csv
import itertools
with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read().split('\n')
    # csv.reader can only be iterated once, so flatten its rows into a list up front
    merged_stopwords = list(itertools.chain(*csv.reader(stop)))
    # the fix is here: so that each word in merged_stopwords only matches as a whole word, wrap it in word boundaries (\b)
    stopwords_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, merged_stopwords)))
    for palavras in fichier:
        palavras = palavras.lower()
        replace_stopwords = stopwords_regex.sub('', palavras)
        print(replace_stopwords)
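For anyone who wants to try the word-boundary approach without the two files, here is a small self-contained sketch. It is not the exact script above: the helper names build_stopword_regex and remove_stopwords are made up for this example, io.StringIO objects stand in for lista.csv and texto.txt, and the sample text is hypothetical. It also groups the alternatives as \b(?:...)\b, which should be equivalent to joining with \b|\b, and collapses the double spaces that are left behind once a word is removed.

```python
import csv
import io
import itertools
import re


def build_stopword_regex(words):
    """Compile one pattern that matches any of `words` as a whole word only."""
    alternatives = "|".join(re.escape(w) for w in words if w)
    return re.compile(r"\b(?:%s)\b" % alternatives)


def remove_stopwords(line, regex):
    """Lowercase the line, drop the stopwords, then collapse leftover spaces."""
    return " ".join(regex.sub("", line.lower()).split())


# Stand-ins for lista.csv and texto.txt (hypothetical sample data).
stop_file = io.StringIO("is\no\nda\nand\ndo\nin\nna\nif\nde\n")
text_file = io.StringIO("Brazil is in South America\nand it is the largest country there\n")

# Flatten the CSV rows into one flat list of words, read exactly once.
words = list(itertools.chain.from_iterable(csv.reader(stop_file)))
regex = build_stopword_regex(words)

cleaned = [remove_stopwords(line, regex) for line in text_file.read().splitlines()]
print(cleaned)
# ['brazil south america', 'it the largest country there']
```

Because of the \b anchors, the "o" in "south" and "country" is left alone, while the standalone "is", "in" and "and" are removed.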