Pre-process large text files in R

10

I am writing a script, which I will make public, to open the RAIS microdata (de-identified, available here) in R using MonetDB. However, the database does not accept a comma (,) as the decimal separator. Each RAIS UFano.txt file is quite large (up to 7GB), so the solution cannot require loading the whole file into RAM. Two alternatives:

a) import everything into the database as strings and then, within SQL, run an UPDATE that creates new columns for the numeric variables, replacing "," with "." (a rough sketch is shown after these two alternatives);

b) preprocess the file, replacing the commas with periods directly in the .txt.
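
For context, a rough sketch of alternative (a) is below. It is only an illustration, not tested code: it assumes a DBI connection con to the MonetDB database, a staging table named rais in which all fields were imported as text, a hypothetical text column remuneracao, and that the database provides a REPLACE() string function.

library(DBI)

# add a numeric column and fill it by swapping "," for "." before casting;
# the table and column names here are placeholders
dbExecute(con, "ALTER TABLE rais ADD COLUMN remuneracao_num DOUBLE")
dbExecute(con, "UPDATE rais SET remuneracao_num = CAST(REPLACE(remuneracao, ',', '.') AS DOUBLE)")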

This question is about alternative (b).

Is there an efficient way to do this replacement? AjDamico suggests a slow approach, replacing line by line, here.

As an example, we can start with the Acre file for 2012 (AC2012.txt), which I have made available at this link.

Since this is to be packaged as an R command, the solution cannot depend on the OS or require installing anything outside of R.

    
asked by anonymous 09.10.2014 / 22:52

3 answers

4

I put together a partial solution (probably not the most efficient) that reads the file in chunks and saves it to a CSV, already with a period as the decimal separator instead of a comma. You just need to set a chunk size that fits in your RAM. It worked with the Acre test file and, in theory, should work with larger files.

chunks <- 40000  # rows per chunk; set this to whatever fits in your RAM

# first chunk: read.csv2 parses "," decimals, write.table writes them back with "."
acre <- read.csv2("AC2012.txt", fileEncoding="latin1", nrows=chunks, stringsAsFactors=FALSE)
write.table(acre, "teste.csv", sep=",")
oldrows <- chunks

# remaining chunks: skip the rows already processed and append to the output;
# header=TRUE (the default) consumes one line already written in the previous
# chunk as column names, so no rows are lost
while(nrow(acre) > 0){
  acre <- read.csv2("AC2012.txt", fileEncoding="latin1", nrows=chunks, skip=oldrows,
                    stringsAsFactors=FALSE)
  cat("Read: rows", oldrows, "to", oldrows + nrow(acre), "\n")
  oldrows <- oldrows + nrow(acre)
  write.table(acre, "teste.csv", col.names=FALSE, append=TRUE, sep=",")
}

# testing: compare with the original; write.table added a row-names column, hence novo[-1]
original <- read.csv2("AC2012.txt", fileEncoding="latin1",stringsAsFactors=FALSE)
novo <- read.csv("teste.csv", header=TRUE,sep=",", stringsAsFactors=FALSE, row.names=NULL)
all.equal(novo[-1], original)
TRUE

I have also tested the solution proposed in Damico's question, using chunks, and it is faster, at least in this example:

file.create("acre.txt")
outcon <- file( "acre.txt" , "w" )
incon <- file("AC2012.txt" , "r" )
while( length( one.line <- readLines( incon , 40000 , encoding="latin1") ) > 0 ){
  one.line <- gsub( ',' , '.' , one.line )
  writeLines( one.line , outcon )
}
    
10.10.2014 / 04:42
3

Thanks to @CarlosCinelli and @FlavioBarros for the great answers. I settled on a third alternative to my question above: using the "ff" package to import the data into R as an "ffdf" and then export it to a traditional CSV:

library(ff)

# read.csv2.ffdf keeps the data on disk, parsing "," decimals and ";" separators
dat <- read.csv2.ffdf(file="AC2012.txt", strip.white=TRUE,
                      na.strings = c('{ñclass}','{ñ','{ñ class}','{ñc','000-1'))
write.csv.ffdf(dat, "AC2012.csv")  # the exported CSV already uses "." as decimal

I measured the total elapsed time. For Acre it was 4.65s to read and 5.17s to save, on an SSD. For all the 2002 RAIS state files, whose .txt files take up 11.8GB, it took 29min in total to import and 35min to export.
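
For reference, these timings can be reproduced by wrapping the two calls above in system.time(); a minimal sketch, using the same file names as above:

library(ff)

# time the import and the export separately (same calls as above)
t_read  <- system.time(
  dat <- read.csv2.ffdf(file="AC2012.txt", strip.white=TRUE,
                        na.strings = c('{ñclass}','{ñ','{ñ class}','{ñc','000-1'))
)
t_write <- system.time(write.csv.ffdf(dat, "AC2012.csv"))
t_read["elapsed"]
t_write["elapsed"]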

Using the ff package has some advantages:

  • The ff package decides how to balance disk and RAM when importing the data. The user does not have to think about it and uses a function with the usual syntax (read.csv2).
  • As with read.csv2, the read.csv2.ffdf function assumes the .csv is European/Latin, using "," as the decimal separator and ";" as the field separator.
  • The function also handles other quirks of the current version of the RAIS files, such as fields with spaces before the data (e.g. " 999") and the strings used to flag missing values ("{ñclass}", "{ñ", "{ñ class}", "{ñc", "000-1"), which become NA (see the quick check right after this list). After fixing these issues, the 2002 files shrink to 4.7GB, less than half.
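
As a quick sanity check (not part of the original workflow), one can read a few rows of the exported CSV back with base R and confirm that the numeric columns now use "." and that the missing-value markers became NA:

# inspect the first rows of the exported file; column types are guessed by read.csv
chk <- read.csv("AC2012.csv", nrows = 5, stringsAsFactors = FALSE)
str(chk)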

One possible drawback is that the ff package does not handle extremely large datasets, whereas the other solutions in this post scale without major problems.

    
13.10.2014 / 15:10
3

Very nice solution, @CarlosCinelli. An alternative is to use the iterators package. The change_dot() function basically reads a line, changes ',' to '.', and writes the line out to a text file.

library(iterators)

change_dot <- function(file, saida='teste.txt', chunk=1) {
  con1 <- file(file, 'r')
  con2 <- file(saida, open = 'w')
  linha <- 0
  it <- ireadLines(con1, n=chunk)   # iterator that yields `chunk` lines per call

  # write the first block; nextElem() throws an error when the input is exhausted,
  # so tryCatch is used to detect the end of the file
  out <- tryCatch(expr=write(x = gsub(pattern = ',', replacement = '.', x = nextElem(it)), con2),
                  error=function(e) e)

  while(!any(class(out) == "error")) {
    linha <- linha + 1
    print(paste('Written chunk', linha))
    out <- tryCatch(expr=write(x = gsub(pattern = ',', replacement = '.', x = nextElem(it)), con2, append = T),
                    error=function(e) e)
  }

  close(con1)
  close(con2)
}

system.time(change_dot(file = 'AC2012.txt', saida = 'saida.csv'))



 user  system elapsed 
  48.65    4.70   53.04

For this file, AC2012.txt, the procedure took about 48 seconds (user time) on my machine.

Note that it is possible to increase the chunk size to values greater than 1. For example, with a chunk of 40000 lines I get the following times with this solution and with the Damico solution:

change_ponto <- function() {
  # Damico's approach: stream 40000 lines at a time, replacing "," with "."
  file.create("acre.txt")
  outcon <- file("acre.txt", "w")
  incon  <- file("AC2012.txt", "r")
  while(length(one.line <- readLines(incon, 40000, encoding="latin1")) > 0){
    one.line <- gsub(',', '.', one.line)
    writeLines(one.line, outcon)
  }
  close(incon)
  close(outcon)
}

system.time(change_ponto())

 user  system elapsed 
   6.53    0.82    7.36

system.time(change_dot(file = 'AC2012.txt', saida = 'teste4.csv', chunk = 40000))

 user  system elapsed 
   6.71    3.12    9.92 

And now testing if the files are the same:

teste2 <- read.csv("acre.txt", header=F, sep=";", stringsAsFactors=FALSE, row.names=NULL)
teste4 <- read.csv("teste4.csv", header=F, sep=";", stringsAsFactors=FALSE, row.names=NULL)
all.equal(teste2, teste4)

[1] TRUE

I wrote a post about iterators on my blog a while ago: link

    
11.10.2014 / 18:08