I have 27 txt files, one per Brazilian state; each file is a data frame about the job market. Together, the 27 files add up to about 20 GB. The first thing I did to reduce this was to save each file in .Rda format, which shrinks the 20 GB to roughly 3 GB. That is very good, but the big problem is that I often do not need to read all of the data frame's variables (about 40 variables in total). With the txt files, for example, I can use the fread function to read only 3 variables:
fread("data.txt", select = c("var1", "var2", "var3"))
Unfortunately, I could not find an equivalent for the .Rda case, so I decided to write a function that lets me read only a few columns. Take one of the 27 files, RJ.txt, as an example. The idea is to split this data frame by columns, save each column in .Rda format, and store everything in a folder. I created a function to do this:
library(data.table)

df <- fread("RJ.txt") # read the original file
arquivo_pasta <- "C:/Meu diretorio/pastaRJ" # this is the folder where I store all the variables
# This is my saving function
save2 <- function(df, arquivo_pasta)
{
  dfl <- as.list(df) # the data frame is now a list of columns
  remove(df)
  setwd(arquivo_pasta)
  for (i in 1:length(dfl))
  {
    v <- dfl[[i]]
    save(v, file = paste0(names(dfl)[i], ".Rda")) # save one column per file
  }
}
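For reference, here is how save2 can be exercised on a small toy data frame; the function definition is repeated so the sketch runs on its own, and the column names and the temporary folder are illustrative stand-ins for RJ.txt and "C:/Meu diretorio/pastaRJ":

```r
library(data.table)

# save2 as defined in the question, repeated so this sketch is self-contained
save2 <- function(df, arquivo_pasta)
{
  dfl <- as.list(df) # the data frame is now a list of columns
  remove(df)
  setwd(arquivo_pasta)
  for (i in 1:length(dfl)) {
    v <- dfl[[i]]
    save(v, file = paste0(names(dfl)[i], ".Rda")) # save one column per file
  }
}

# Toy data frame standing in for RJ.txt (names are illustrative)
df <- data.table(var1 = 1:3, var2 = c("a", "b", "c"))

# A temporary folder standing in for "C:/Meu diretorio/pastaRJ"
arquivo_pasta <- file.path(tempdir(), "pastaRJ_demo")
dir.create(arquivo_pasta, showWarnings = FALSE)

save2(df, arquivo_pasta)
list.files(arquivo_pasta) # one .Rda file per column
```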
In this way, I end up with a folder containing the 40 columns of RJ.txt, each saved as an .Rda file. Next, I created a function to read back just a few columns:
read2 <- function(arquivo_pasta, colunas)
{
  setwd(arquivo_pasta)
  # List the files and build an auxiliary column so that we can select only the variables we want
  path <- list.files(path = arquivo_pasta, all.files = TRUE, full.names = TRUE)
  path <- as.data.frame(path)
  # Create the auxiliary column holding just the variable name
  path$aux <- gsub(arquivo_pasta, "", path$path)
  path$aux <- gsub("/", "", path$aux)
  path$aux <- gsub(".Rda", "", path$aux, fixed = TRUE) # fixed = TRUE so "." is not treated as a regex wildcard
  # Finally, select the requested columns
  path <- subset(path, aux %in% colunas)
  # Auxiliary placeholder so that we can start binding the columns
  df_ret <- 1
  for (i in 1:nrow(path))
  {
    load(as.character(path$path[i])) # loads the object `v` saved by save2
    dfaux <- data.table(v)
    names(dfaux) <- as.character(path$aux[i])
    df_ret <- cbind(df_ret, dfaux)
  }
  # Drop the placeholder column
  df_ret <- df_ret[, df_ret := NULL]
  return(df_ret)
}
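For completeness, read2 can be tried on a small toy folder; the function definition from the question is repeated (with `fixed = TRUE` added to the `.Rda` gsub) so the sketch runs on its own, and all file and column names are illustrative:

```r
library(data.table)

# read2 as in the question; `fixed = TRUE` keeps "." literal in the pattern
read2 <- function(arquivo_pasta, colunas)
{
  setwd(arquivo_pasta)
  path <- list.files(path = arquivo_pasta, all.files = TRUE, full.names = TRUE)
  path <- as.data.frame(path)
  path$aux <- gsub(arquivo_pasta, "", path$path)
  path$aux <- gsub("/", "", path$aux)
  path$aux <- gsub(".Rda", "", path$aux, fixed = TRUE)
  path <- subset(path, aux %in% colunas)
  df_ret <- 1 # placeholder to start the column binding
  for (i in 1:nrow(path)) {
    load(as.character(path$path[i])) # loads the object `v`
    dfaux <- data.table(v)
    names(dfaux) <- as.character(path$aux[i])
    df_ret <- cbind(df_ret, dfaux)
  }
  df_ret[, df_ret := NULL] # drop the placeholder column
  return(df_ret)
}

# Build a toy folder with one .Rda file per column (mimicking save2's output)
arquivo_pasta <- file.path(tempdir(), "pastaRJ_read_demo")
dir.create(arquivo_pasta, showWarnings = FALSE)
for (nm in c("var1", "var2", "var3")) {
  v <- seq_len(5)
  save(v, file = file.path(arquivo_pasta, paste0(nm, ".Rda")))
}

# Read back only two of the three columns
df <- read2(arquivo_pasta, c("var1", "var3"))
names(df)
```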
As you can imagine, I'm doing this because I want to get rid of the txt files altogether. The problem is that I would like to do it more efficiently. Do you have any ideas on how to improve this, especially in terms of speed?