Remove stopwords listed in a txt file from another txt file

1

Good evening, everyone. I need help. I am preprocessing text, and for this I need to remove from a book in .txt format all the stopwords listed in another text file, "stopwords_br.txt". I found a program that I think is close to what I'm looking for. However, it is in C++ and I do not understand the commands.

Help me if possible. Thank you.

string line, deleteline;
ifstream stopword;
stopword.open("example.txt");
if (stopword.is_open())
{
    while ( getline (stopword,line) )
    {
        cout << line << endl;
    }
    stopword.close();
}    
else cout << "Unable to open file";

ofstream temp;
temp.open("temp.txt");

cout << "Please input the stop-word you want to delete..\n ";
cin >> deleteline;

while (getline(stopword,line))
{
    if (line != deleteline)
    {
        temp << line << endl;
    }
}
temp.close();
stopword.close();
remove("example.txt");
rename("temp.txt","example.txt");
cout <<endl<<endl<<endl;
system("pause");
return 0;
    
asked by anonymous 29.12.2016 / 00:11

2 answers

1

What is the format of the "stopwords_br.txt" file?

The code below, based on the code you posted, reads the file, asks for a stop-word, and removes it from every line. It writes the result to a new file and then replaces the original file with it.

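#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string line, stopword;
    ifstream text_file;

    // Print the current contents of the file.
    text_file.open("c:\\temp\\exemplo.txt");
    if (text_file.is_open()) {
        while (getline(text_file, line)) {
            cout << line << endl;
        }
        text_file.close();
    } else {
        cout << "Unable to open file";
    }

    cout << "\nPlease input the stop-word you want to delete." << endl;
    cin >> stopword;

    // Reopen the file and copy it to temp.txt without the stop-word.
    text_file.clear();
    text_file.open("c:\\temp\\exemplo.txt");
    ofstream temp;
    temp.open("c:\\temp\\temp.txt");
    if (text_file.is_open()) {
        while (getline(text_file, line)) {
            // Erase every occurrence of the stop-word in this line.
            // find() matches substrings, so it also matches inside longer words.
            size_t pos;
            while ((pos = line.find(stopword)) != string::npos) {
                line.erase(pos, stopword.length());
            }
            temp << line << endl;
        }
    }
    temp.close();
    text_file.close();

    // Replace the original file with the filtered copy.
    remove("c:\\temp\\exemplo.txt");
    rename("c:\\temp\\temp.txt", "c:\\temp\\exemplo.txt");
    cout << endl << endl << endl;
    system("pause");
    return 0;
}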
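
The question asks for every word listed in "stopwords_br.txt" to be removed from the book, not a single word typed by the user. A minimal sketch of that, assuming the stop-word file contains whitespace-separated words (for example one per line): load the words into a set, then copy the book word by word, skipping the ones in the set. The function names load_stopwords and remove_stopwords are made up for this illustration. The comparison is exact, so "De" or "que," (with punctuation attached) would not be removed, and runs of spaces inside a line are collapsed to one.

```cpp
#include <istream>
#include <ostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: read whitespace-separated stop words into a set.
std::set<std::string> load_stopwords(std::istream& in) {
    std::set<std::string> words;
    std::string w;
    while (in >> w) {
        words.insert(w);
    }
    return words;
}

// Hypothetical helper: copy `book` to `out` line by line, dropping every
// whitespace-delimited word that appears in `stopwords`. The remaining
// words of each line are joined with a single space.
void remove_stopwords(std::istream& book, std::ostream& out,
                      const std::set<std::string>& stopwords) {
    std::string line;
    while (std::getline(book, line)) {
        std::istringstream tokens(line);
        std::string word;
        bool first = true;
        while (tokens >> word) {
            if (stopwords.count(word) > 0) {
                continue;  // exact match with a stop word: skip it
            }
            if (!first) {
                out << ' ';
            }
            out << word;
            first = false;
        }
        out << '\n';
    }
}
```

To use it on files, open "stopwords_br.txt" and the book with std::ifstream, open the output with std::ofstream, and pass the streams to the two functions.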

    

05.01.2017 / 16:24

0

If you are in a Linux environment and can use sed and Perl, I suggest:

sed -rf <(perl -00nE 'say "s/\<(",join("|",split),")\>//g"' stopw.txt) l.txt

Example:

$ cat stopwords 
a
de
que
para
em
é

$ cat livro 
a minha tia de Braga é que em breve me vem visitar.

$ sed -rf <(perl -00nE 'say "s/\<(",join("|",split),")\>//g"' stopwords) livro 
 minha tia  Braga    breve me vem visitar.

where:

  • perl -00nE 'say "s/\<(",join("|",split),")\>//g"' stopwords prints s/\<(a|de|que|para|em|é)\>//g , i.e. it builds a sed substitution command from the stop-word list;
  • that command is then applied to the book ( sed -rf prog livro ).
05.01.2017 / 17:29