In R, a function that reads only a few columns of a dataframe in Rda format

5

I have 27 txt files. Each file represents a state of Brazil. In fact, every file is dataframe about job market. The 27 files can add up to 20 gigs. The first thing I did to reduce this was to save each file in Rda format. With this, for example, 20 gigabytes of memory can be reduced to approximately 3 gigabytes. This is very good, but the big problem is that I often do not need to read all the dataframe variables (approximately 40 variables in total). For example, in the case of txt, I can use the fread function to read only 3 variables:

fread("data.txt", select = c("var1","var2","var3") )

Unfortunately, I did not find a version for the rda case. I then decided to create a function that allows me to read only a few columns. Take an example of a 27: RJ.txt file. The idea is to break this dataframe by columns, save each column in Rda format and save everything to a folder. I then created a function to do this:

df <- fread ( "RJ.txt") # Leio o arquivo original
arquivo_pasta<- "C:/Meu diretorio/pastaRJ" # Esta é a minha pasta onde vou guardar todas as variáveis.

# Esta é a minha função para salvar
save2<- function(df , arquivo_pasta )
{
dfl <- as.list(df) # nossa matrix agora é uma lista
remove(df)
setwd(arquivo_pasta)
for( i in 1:length(dfl))
{
  v <- dfl[[i]]      
  save(  v , file = paste0( names(dfl)[i], ".Rda" )   )  #salvamos
}
}

In this way, I have a folder with 40 columns of RJ.txt, each in RDF format. Now I create a function to read just a few columns

read2 <- function(arquivo_pasta , colunas)
{
  setwd(arquivo_pasta)

  # Vamos criar uma matriz, com uma variável auxiliar para poder selecionar apenas as variáveis que queremos
  (path<- list.files(path = arquivo_pasta, all.files = T ,  full.names = T ))
  path<- as.data.frame(path)

  # Criamos a variável auxiliar com apenas o nome da variável
  path$aux<- gsub(arquivo_pasta, "" , path$path)
  path$aux<- gsub("/", "" , path$aux)
  path$aux<- gsub(".Rda", "" , path$aux)

  # Finalmente, selecionamos as colunas
  path <- subset(path , aux %in% colunas )

  # Criamos uma variável auxiliar para poder iniciar o empilhamento
  df_ret <- 1

  for(i in 1:nrow(path))
  {
    load(as.character(path$path[i]))
    dfaux<- data.table(v)
    names(dfaux) <- as.character(path$aux[i])
    df_ret<- cbind(df_ret, dfaux)
  }

  # Excluímos a variável auxiliar 
  df_ret<- df_ret[,df_ret:=NULL]
  return(df_ret)

}

As you can imagine, I'm doing this because I want to get rid of all txt files. The problem is that I want to do this a little more efficiently and faster. I wonder if you have any idea how best this, especially in a matter of time.

    
asked by anonymous 16.11.2017 / 15:33

1 answer

5

A good solution is to use the fst package. Note that it is not ideal for short term storage since it is still under intense development.

According to the README, it compresses as well as saveRDS , is faster to read and write, and allows you to read only a few columns.

Example:

# Generate a random data frame with 10 million rows and various column types
nrOfRows <- 1e7

x <- data.frame(
  Integers = 1:nrOfRows,  # integer
  Logicals = sample(c(TRUE, FALSE, NA), nrOfRows, replace = TRUE),  # logical
  Text = factor(sample(state.name, nrOfRows, replace = TRUE)),  # text
  Numericals = runif(nrOfRows, 0.0, 100),  # numericals
  stringsAsFactors = FALSE)

# Store it
  write.fst(x, "dataset.fst")

# Retrieve it
  y <- read.fst("dataset.fst")

# Ler só algumas linhas e colunas
 read.fst("dataset.fst", c("Logicals", "Text"), 2000, 4990) # subset rows and columns
    
16.11.2017 / 17:32