Filter 5GB CSV file in R

6

I've been trying everything to work with a 5GB file for my monograph.

The code I'm trying to run in R is as follows:

> write.csv(subset(read.csv("enem.csv", header=TRUE), UF_ESC=="SP"), "filtro.csv", row.names=FALSE)

I tested it with a sample file and everything worked fine, but not with the original database. I guess it's because of the size; I've been thinking about it and have gotten nowhere.

If someone has another idea, it would be a huge help.

    
asked by anonymous 17.01.2016 / 01:53

2 answers

6

Here are some tips to solve your problem:

1 - Read the file using another function:

> library(microbenchmark)
> microbenchmark(
+   base = read.csv(file = "df-write-csv.csv", header = T),
+   readr = readr::read_csv("df-write-csv.csv"),
+   data.table = data.table::fread("df-write-csv.csv"),
+   rio = rio::import("df-write-csv.csv", format = "csv")
+ )
Unit: microseconds
       expr      min        lq      mean    median        uq      max neval
       base 1836.230 1912.1815 2253.6071 1980.3995 2282.1675 4148.787   100
      readr  823.960  881.3625 1072.4790  921.6605 1120.2365 3538.359   100
 data.table  327.759  364.4810  442.5933  402.3295  458.7895  920.436   100
        rio  312.317  351.2260  444.1087  382.9325  439.7960 2938.490   100

Note that reading the file with the fread function from data.table or the import function from rio is about 4x faster than with R's native function. First make sure you can actually read the file at all.
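
For instance, a minimal sketch of that check, assuming the file is enem.csv as in the question (the object names here are only illustrative):

# quick sanity check: read just the first rows to see whether the file parses
amostra <- data.table::fread("enem.csv", nrows = 1000)
str(amostra)
# if that works, attempt the full read, which is usually much faster than read.csv
dados <- data.table::fread("enem.csv")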

2 - Verify that you actually managed to filter your database. First save the result of subset to an auxiliary object. If that is where things fail, try filtering with functions from other packages, such as dplyr or data.table (there is a sketch after the benchmark below).

For large operations like this, data.table can be much faster.

> df <- data.frame(x = 1:100000, y = 1:100000, l = sample(letters, size = 100, replace = T))
> microbenchmark(
+   base = subset(df, l == "a"),
+   dplyr = dplyr::filter(df, l == "a"),
+   data.table = data.table::data.table(df)[l == "a",]
+ )
Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval
       base 10.329514 12.467143 14.962479 13.976907 17.171858  24.3761   100
      dplyr  7.331626  8.624356 10.063947  8.853807 11.140871  16.8939   100
 data.table  2.986519  4.580536  6.774548  4.824227  5.957255 119.9709   100
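
Applied to the question, a hedged sketch (it assumes the file was already read into dados as above and that the column UF_ESC exists, as in the question):

# filter into an auxiliary object and check that it actually returned rows
filtro <- dados[UF_ESC == "SP"]                    # data.table syntax
# filtro <- dplyr::filter(dados, UF_ESC == "SP")   # dplyr alternative
nrow(filtro)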

3 - Use the write_csv function from the readr package: it is roughly 2x faster than R's native write.csv.

microbenchmark(
  base = write.csv(df, file = "df-write-csv.csv", row.names = F),
  rio = rio::export(df, file = "df-rio.csv", format = "csv"),
  readr = readr::write_csv(df, path = "df-readr.csv")
)

Unit: microseconds
  expr     min       lq     mean    median       uq      max neval
  base 713.564 1097.534 2025.377 1467.4980 2996.136 4168.352   100
   rio 718.141 1156.998 2243.143 2011.5310 3106.479 7368.046   100
 readr 366.306  594.629 1265.297  734.0445 1793.405 5852.142   100

Anyway, if you managed to read the 5GB file, it is very likely that you can also write it, since it will already be in your computer's RAM.
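
Just to close the loop with the objects from the sketches above (the output file name comes from the question; everything else is illustrative):

# write the filtered object, which is much smaller than the 5GB original
readr::write_csv(filtro, path = "filtro.csv")
# or: rio::export(filtro, file = "filtro.csv", format = "csv")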

    
17.01.2016 / 13:53
1

Andrew, since you are only going to filter your database, and afterwards it will be much smaller, you can read it in chunks. To do this, you can proceed as follows:

Just for testing, I created the following "big" file:

library(readr)
library(dplyr)
x <- data.frame(x = runif(3e6), y = 1:3e6)
write_csv(x, path = "test.csv")

The following code snippet reads the database in small chunks (tam_chunk rows at a time), filters each chunk and then appends the result to a file named filtrado.csv.

See if it works this way. It will be time-consuming, but at least you get around the memory problem:

# open a connection to the large file
arq_grande <- file("test.csv", "r")
tam_chunk <- 1e5 # chunk size (rows per read)
# read the first 100 rows to get the column names and create
# the filtered output file (with header)
df1 <- read.csv(arq_grande, nrows = 100, header = T)
df_filtrado <- df1 %>% filter(x <= 0.5)
write.table(df_filtrado, "filtrado.csv", row.names = F, sep = ",", dec = ".")
# read the rest of the file in chunks until it is exhausted
repeat {
  # read the next chunk from the open connection; an exhausted
  # connection raises an error, so treat that as "nothing left to read"
  df <- tryCatch(
    read.csv(arq_grande, header = FALSE, col.names = names(df1), nrows = tam_chunk),
    error = function(e) NULL
  )
  if (is.null(df) || nrow(df) == 0)      # done yet?
    break
  cat("Read", nrow(df), "rows\n")
  df_filtrado <- df %>% filter(x <= 0.5) # filter the chunk
  # append the filtered chunk to the output file (no header)
  write.table(df_filtrado, "filtrado.csv", append = T,
              col.names = F, row.names = F, sep = ",", dec = ".")
}
close(arq_grande)
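
A hypothetical adaptation of the same loop to the file and column from the question (the object names and chunk size are only illustrative; it assumes enem.csv has a UF_ESC column, as in the question):

arq <- file("enem.csv", "r")
# header plus the first rows, to get the column names
cab <- read.csv(arq, nrows = 100, header = TRUE)
write.table(filter(cab, UF_ESC == "SP"), "filtro.csv", row.names = FALSE, sep = ",")
repeat {
  df <- tryCatch(read.csv(arq, header = FALSE, col.names = names(cab), nrows = 1e5),
                 error = function(e) NULL)
  if (is.null(df) || nrow(df) == 0) break   # nothing left to read
  write.table(filter(df, UF_ESC == "SP"), "filtro.csv", append = TRUE,
              col.names = FALSE, row.names = FALSE, sep = ",")
}
close(arq)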

This answer was heavily inspired by this answer from Stack Overflow in English.

    
19.01.2016 / 10:37