Hello,
I'm working with a large data frame - 1000 variables and 60,000 rows - and I need to calculate the percentage of NA and whitespace for each of the variables separately.
What is the best way to do this in R?
To count the NA values per column, you can use the colSums() function:
# total number of rows
n <- nrow(df)
# percentage of NA per column
round(colSums(is.na(df)) * 100 / n, 2)
Or you can also use the apply() function:
# function to count NAs
sum_NA <- function(dados) {
  sum(is.na(dados))
}
# total number of rows
n <- nrow(df)
# apply the function to each column
round(apply(df, 2, sum_NA) * 100 / n, 2)
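As a variation on the two approaches above, here is a minimal sketch using sapply(), which iterates over the columns directly instead of first converting the data frame to a matrix as apply() does. The small data frame and the name pct_na are made up for illustration; substitute your own data.

```r
# Hypothetical example data; replace with your own data frame
df <- data.frame(x = c(1, NA, 3, NA), y = c("a", "b", NA, "d"))

# mean() of a logical vector is the proportion of TRUE values,
# so this is the percentage of NA in each column
pct_na <- round(sapply(df, function(col) mean(is.na(col)) * 100), 2)
print(pct_na)
```

The result is a named numeric vector, so with 1000 variables you can still look up any column by name, for example pct_na["x"].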
One way to do this is to write a loop that goes through your data frame column by column.
I created a data frame to illustrate:
df <- data.frame(A=c(NA,2,'',1),B=c('',4,4,2),C=c(5,'','',''),D=c(7,7,5,4),E=c('','',NA,NA),F=c(9,9,0,6))
Notice that some of the columns contain blank and NA values.
for (i in seq_along(df)) {
  # percentage of values in column i that are NA or an empty string
  col <- df[[i]]
  print(sum(is.na(col) | col == "") / length(col) * 100)
}
This loop walks through each column and calculates the percentage you need. With my data frame, this for loop prints the following results:
[1] 50
[1] 25
[1] 75
[1] 0
[1] 100
[1] 0
Do you want something simpler and maybe faster? Try:
print(colMeans(is.na(df) | df == "")*100)
This gives the following output:
  A   B   C   D   E   F
 50  25  75   0 100   0
Note that is.na is an R function that finds all the NAs. I combined it with an OR (|) and the comparison == "" to find all the empty values as well. I think this last option is faster because it only uses vectorized functions that are compiled natively in R.
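The comparison == "" only matches strings that are exactly empty. Since the question also mentions whitespace, here is a minimal sketch that treats strings made up only of spaces or tabs as blank too. It assumes trimws() is available (base R 3.2.0 or later); the example data and the helper name is_blank are made up for illustration.

```r
# Hypothetical example data with whitespace-only entries
df <- data.frame(
  A = c(NA, "2", " ", "1"),
  B = c("", "4", "\t", "2"),
  D = c(7, 7, 5, 4)
)

# TRUE where a value is NA, empty, or contains only whitespace
is_blank <- function(col) {
  is.na(col) | trimws(as.character(col)) == ""
}

# percentage of blank values per column, as a named vector
pct_blank <- sapply(df, function(col) mean(is_blank(col)) * 100)
print(pct_blank)
```

With 1000 variables, it may be easier to read the result sorted, for example with sort(pct_blank, decreasing = TRUE).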