I have a data frame with 275 variables and would like to remove the variables that do not contribute significantly (those that have a non-zero value fewer than 10 times). Can anyone help me?
One possible way to do this is to use the select_if function from the dplyr package.
First define a function that counts the number of zeros:
# Count how many entries of a column are exactly zero
contar_zeros <- function(x) {
  sum(x == 0)
}
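As a quick sanity check, the helper can be tried on a few plain vectors before applying it to a whole data frame. This is only a sketch: `contar_nao_zeros` is a hypothetical companion helper (not part of the original answer) that counts non-zero entries instead and ignores missing values, which may be closer to what the question asks for.

```r
# Same helper as above: counts entries that are exactly zero
contar_zeros <- function(x) {
  sum(x == 0)
}

# Hypothetical companion: counts non-zero entries, skipping NAs
contar_nao_zeros <- function(x) {
  sum(x != 0, na.rm = TRUE)
}

contar_zeros(c(0, 3, 0, 7))       # two zeros
contar_zeros(c(0, NA, 5))         # NA, because the comparison with NA is NA
contar_nao_zeros(c(0, NA, 5))     # one non-zero value, the NA is ignored
```

If the real data contains missing values, the `na.rm = TRUE` variant is probably the safer choice, since an `NA` count would otherwise make the column test fail.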
Now consider this data frame (tibble() is used here because data_frame() is deprecated):
library(dplyr)

df <- tibble(
  x = 0,
  y = 1:10,
  z = c(rep(0, 5), 6:10)
)
df
# A tibble: 10 × 3
       x     y     z
   <dbl> <int> <dbl>
 1     0     1     0
 2     0     2     0
 3     0     3     0
 4     0     4     0
 5     0     5     0
 6     0     6     6
 7     0     7     7
 8     0     8     8
 9     0     9     9
10     0    10    10
Using select_if:
df_sem_colunas <- select_if(df, function(col) contar_zeros(col) < 10)
df_sem_colunas
# A tibble: 10 × 2
       y     z
   <int> <dbl>
 1     1     0
 2     2     0
 3     3     0
 4     4     0
 5     5     0
 6     6     6
 7     7     7
 8     8     8
 9     9     9
10    10    10
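Note that the filter above counts zeros, so with a 10-row example it only drops columns that are entirely zero. If the goal is literally the one stated in the question (drop variables that have a non-zero value fewer than 10 times), a sketch along the following lines might be closer. It assumes dplyr 1.0.0 or later, where select(where(...)) is the suggested replacement for select_if; the names `min_nao_zeros` and `df_filtrado` are made up for this illustration.

```r
library(dplyr)

# Same toy data as above
df <- tibble(
  x = 0,
  y = 1:10,
  z = c(rep(0, 5), 6:10)
)

# Assumed threshold: keep a column only if it has at least this many non-zero values
min_nao_zeros <- 10

# Count non-zero entries per column (ignoring NAs) and keep the columns that pass
df_filtrado <- select(
  df,
  where(function(col) sum(col != 0, na.rm = TRUE) >= min_nao_zeros)
)

names(df_filtrado)
```

With this toy data only y has 10 non-zero values, so x (none) and z (five) would both be dropped. For the real 275-variable data frame, the threshold can be adjusted through `min_nao_zeros`.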