Error "invalid input '..' in utf8towcs" with "read.csv"


I have a .csv database that gathers posts from both Facebook and Twitter. To read the database into R, I used the following code:

 bancodedados <- read.csv("nomedobanco.csv", sep=";", encoding="UTF-8")

The code reads almost the entire database, but then an error interrupts the read:

  

invalid input 'RT @ jmlara02: @LizCorreaa Comrade define multicenter comrade. ðŸ '‰ @ 90Javier @NicolasMaduro' in 'utf8towcs'

Searching the internet, I saw that the problem is fairly common. It is caused by characters that are not recognized under the encoding declared in my code (UTF-8), which in this case are "ðŸ '‰".

Some solutions proposed on the internet:

  • Manually remove the characters from the original file. I discarded this option because the database is very large and my computer does not have much RAM.
  • Use tryCatch(), R's error-handling function, to ignore the error and continue reading. I thought this was the most promising option, but the function is quite unfriendly to use. I also tried the debug package from CRAN and did not find it much better than the default tools.
  • Load the data into a VCorpus via the "tm" package from CRAN. I did manage to load the data this way, but it did not come in as a data frame; it was just the raw csv text.

So the question that remains is:

Is the second option the best solution? If so, how do I combine tryCatch with read.csv to ignore the error and finish reading the database?

If someone has an error-handling guide in Portuguese, that would help too.


    
asked by anonymous 28.01.2015 / 14:12

1 answer


Let me bump the thread with a possible solution.

Try the following:

install.packages('stringr')
library(stringr)
# conteudo_do_tweet: character vector holding the tweet text
txt.tmp <- str_replace_all(conteudo_do_tweet, "[^[:graph:]]", " ")

The call above replaces every character outside the [:graph:] class (whitespace, control characters and other non-printing characters) with a space.
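If the file cannot be read in the first place, one possible approach is to clean the lines before parsing them. The following is only a sketch, assuming the offending input is bytes that are not valid UTF-8; valid characters such as emoji pass through unchanged, so it may not cover every case. The sample file, its column names (id, texto) and the variable names linhas and linhas_limpas are hypothetical, made up for illustration.

```r
# Hypothetical sample file: the second data row contains a byte (0xFF)
# that is not valid UTF-8.
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(c(charToRaw("id;texto\n1;ok\n2;bad"),
           as.raw(0xff),
           charToRaw("byte\n")), con)
close(con)

# Read the file line by line without parsing it as csv yet.
linhas <- readLines(path, warn = FALSE)

# Drop every byte that cannot be converted from UTF-8 (sub = "").
linhas_limpas <- iconv(linhas, from = "UTF-8", to = "UTF-8", sub = "")

# Parse the cleaned lines into a data frame.
bancodedados <- read.csv(text = linhas_limpas, sep = ";",
                         stringsAsFactors = FALSE)
```

With the real file, only the readLines, iconv and read.csv steps are needed, with path pointing to "nomedobanco.csv".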

    
23.03.2016 / 20:58