Replace list of words in a text

1

I always have a lot of trouble with replace and sub. I know how they work, but it never works for me. I have a list of words and I am trying to replace these words in a text:

Text:

  

Brazil, officially the Federative Republic of Brazil is the largest country in South America and the Latin American region, being the fifth largest in the world in territorial area (equivalent to 47% of South American territory) and sixth in population ( with more than 200 million inhabitants). It is the only country in America where most of the Portuguese language is spoken and the largest Lusophone country on the planet, as well as being one of the most multicultural and ethnically diverse nations due to the strong immigration from various places in the world. Its current Constitution, formulated in 1988, defines Brazil as a presidential federative republic, formed by the union of the Federal District, the 26 states and 5 570 municipalities.

List:

é
o
da
e
do
em
na
se
de

Script:

import re
import csv
import itertools

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read().split('\n')
    stopwords = csv.reader(stop)

    for palavras in fichier:
        palavras = palavras.lower()

        for word in stopwords:
            merged_stopwords = list(itertools.chain(*stopwords))
            stopwords_regex = re.compile('|'.join(map(re.escape, merged_stopwords)))
            replace_stopwords = stopwords_regex.sub('', palavras)

            print(replace_stopwords)

The problem is that my script also removes those letters from inside other words:

output:

  

brazil, ficialmnt rpública frativa brazil is mair country america south rgiã america latin, n quint mair mun m trritrial (quivalnt to 47% trritóri sul-amrican) xt m ppulaçã (cm plus 200 million habitants). It is the only country in America speaking majritariamnt the prtugusa language mair country lusófn plant, in addition to a more multicultural sibling tnicamnt divrsas, m crrência frt inmigraçã riun several lcais mun. its current institution, formula m 1988, fin brasil in a prsincialista fp prpublic republic, frma pla uniã distrit fral, s 26 stas s 5 570 municipalities.
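
A minimal sketch of why this happens, using a made-up sentence and a short stopword list (both are assumptions for illustration, not the contents of texto.txt or lista.csv). Without word boundaries, each alternative in the pattern can match anywhere, including inside longer words:

```python
import re

# Hypothetical sample data, chosen only to illustrate the behaviour.
stopwords = ["o", "de", "e"]
text = "o povo de portugal"

# Plain alternation: "o|de|e" matches these letters wherever they occur,
# not only when they form a complete word.
unbounded = re.compile("|".join(map(re.escape, stopwords)))
result = unbounded.sub("", text)
print(repr(result))
```

Every "o" inside "povo" and "portugal" is removed along with the standalone words, which matches the kind of damage seen in the output above.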

EDITED

Solution found thanks to Isac and RickADT

Script:

import re
import csv
import itertools

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read().split('\n')
    stopwords = csv.reader(stop)

    for palavras in fichier:
        palavras = palavras.lower()

        for word in stopwords:
            merged_stopwords = list(itertools.chain(*stopwords))
            # the fix is here: each word in merged_stopwords must be wrapped in word boundaries (\b) so that only whole words match
            stopwords_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, merged_stopwords)))
            replace_stopwords = stopwords_regex.sub('', palavras)

            print(replace_stopwords)
    
asked by anonymous 17.07.2018 / 14:52

3 answers

3

For clarity, and because the solution you added to the question is not quite what I had suggested, here is my suggestion in full.

The idea is to apply a single regex to the whole text, replacing only whole words by using \b, the regex syntax for a word boundary. This means there is no need to iterate over either the words of the text or the words to exclude.

Assuming you read the text and words to remove with:

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    fichier = file.read()
    csvstop = csv.reader(stop)
    stopwords = list(itertools.chain(*csvstop))

Building and applying the regex then takes just two more lines:

    # flags belong in re.compile: the 4th positional argument of re.sub is count, not flags
    regex = re.compile(r'\b' + r'\b|\b'.join(stopwords) + r'\b', re.IGNORECASE)
    replacedtext = regex.sub('', fichier)

The regex is built as \bword\b for each stopword, with | between them. The re.IGNORECASE flag makes the match case-insensitive, so no lower() call is needed. Inspecting the regex built for the given stopwords shows the following:

\bé\b|\bo\b|\bda\b|\be\b|\bdo\b|\bem\b|\bna\b|\bse\b|\bde\b

The | makes the words alternatives of one another, and \b ensures that only whole words are matched, never fragments inside other words.
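
As a hedged sketch, the same idea can be wrapped in a small helper. The function name remove_stopwords and the sample strings are assumptions made for illustration. The pattern here groups the alternatives as \b(?:a|b)\b, which should behave the same as repeating \b around every word, and it adds re.escape in case a stopword contains regex metacharacters:

```python
import re

def remove_stopwords(text, stopwords):
    """Remove whole-word, case-insensitive occurrences of the stopwords."""
    # re.escape guards against stopwords containing regex metacharacters.
    alternatives = "|".join(map(re.escape, stopwords))
    # Flags are given at compile time; the word boundaries on each side
    # restrict matches to complete words.
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub("", text)

cleaned = remove_stopwords("O povo de Portugal", ["o", "de"])
print(repr(cleaned))
```

The "o" at the end of "povo" is left alone because there is no word boundary between "v" and "o".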

It is also worth remembering that removing a whole word from the middle of a sentence leaves two spaces in a row. Depending on what you are going to do with the text, you may not want these spaces. You can remove them easily with another regex:

replacedtext = re.sub(r'\s{2,}', ' ', replacedtext)

This replaces any run of two or more whitespace characters with a single space. Note that \s also matches tabs and newlines.
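
One thing to watch: since \s covers newlines too, a blank line between paragraphs is collapsed as well. A small sketch of the difference, assuming the line structure of the text should be kept (the sample string is made up):

```python
import re

# Hypothetical sample: a double space inside a line and a blank line after it.
text = "one  two\n\nthree"

# \s{2,} also matches the two consecutive newlines.
collapsed_all = re.sub(r"\s{2,}", " ", text)
# [ \t]{2,} only touches runs of spaces and tabs, so newlines survive.
collapsed_spaces = re.sub(r"[ \t]{2,}", " ", text)

print(repr(collapsed_all))
print(repr(collapsed_spaces))
```

Which of the two is preferable depends on whether the output is meant to stay line-oriented.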

    
17.07.2018 / 19:49
3
Here's a Python 3 variant:

import re

with open('texto') as f:
    txt = f.read()
with open('lista') as f:
    lista = f.read()

# every run of word characters in the list file is a stopword
sw = set(re.findall(r'\w+', lista))
# keep each word of the text unless its lowercase form is a stopword
print(re.sub(r'\w+', lambda x: '' if x[0].lower() in sw else x[0], txt))

  • re.findall(r'\w+', lista) extracts the stopwords.
  • re.sub(r'\w+', ..., txt) replaces each word in the text with the result of
  • lambda x: '' if x[0].lower() in sw else x[0], that is:
    • '' if the word belongs to sw
    • the word itself if it does not
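
The callback approach described above can be sketched as a reusable function. The name drop_stopwords and the sample input are assumptions for illustration; the stopwords are lowercased into a set so that each lookup is a cheap membership test:

```python
import re

def drop_stopwords(text, stopwords):
    """Replace each word with '' when its lowercase form is a stopword."""
    sw = {w.lower() for w in stopwords}

    def keep_or_drop(match):
        word = match.group(0)
        return "" if word.lower() in sw else word

    # \w+ matches whole runs of word characters, so fragments inside
    # longer words are never tested on their own.
    return re.sub(r"\w+", keep_or_drop, text)

dropped = drop_stopwords("O povo de Portugal", ["o", "de"])
print(repr(dropped))
```

Because the pattern stays the same regardless of how many stopwords there are, this may scale better than one large alternation when the list is long.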
17.07.2018 / 18:16
2

I think the simplest approach is to split each line into words with the split method and check whether each word is a stopword.

import csv
import itertools

with open('texto.txt', 'r') as file, open('lista.csv', 'r') as stop:
    lines = file.read().split('\n')
    csvstop = csv.reader(stop)
    stopwords = list(itertools.chain(*csvstop))

    for line in lines:
        palavras = line.lower().split()
        # keep only the words that are not stopwords
        palavras = [palavra for palavra in palavras if palavra not in stopwords]

        print(" ".join(palavras))
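
One caveat with split(): punctuation stays attached to the tokens, so a token such as "de," would not compare equal to the stopword "de". A possible refinement, sketched under the assumption that ASCII punctuation is all that needs stripping (filter_line and the sample sentence are hypothetical):

```python
import string

def filter_line(line, stopwords):
    """Drop tokens whose lowercase, punctuation-stripped form is a stopword."""
    kept = []
    for token in line.split():
        # Compare on the bare word, but keep the original token in the output.
        core = token.strip(string.punctuation).lower()
        if core not in stopwords:
            kept.append(token)
    return " ".join(kept)

filtered = filter_line("Brazil is big, and it is in America.", {"is", "and", "in"})
print(filtered)
```

This version also preserves the original capitalization, since only the comparison is done in lowercase.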
    
17.07.2018 / 18:06