Compare contents of a Column

5

I'm a beginner in R and need help comparing the contents of a column.

First I ordered my table based on a specific column. For this I used the following function:

 library(data.table)  # provides fread()
 x = fread("x.txt", sep=";")
 x_ordenado = x[order(x$V3),]

I'm working with files that have about 5 million lines, and I need to reduce this number. One way would be to delete the rows whose value matches an item in a list of 10450 items. That is, among these 5 million rows, one column contains both values that appear in this list and values that do not.

Any idea what I can do?

Thank you

    
asked by anonymous 20.05.2015 / 22:26

2 answers

1

There is more than one way to do this in R. The simplest is to use %in% to check which of your values are not in the list of values you want to remove. For example:

> todos <- 1:10 # Your data: numbers from 1 to 10
> excluir <- c(2,3,5,7) # Values to be removed
> todos[!todos %in% excluir] # Subset of the values not contained in excluir
[1]  1  4  6  8  9 10
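
Applied to a table like the one in the question, the same idea could be sketched as follows. The small data frame, the choice of V3 as the column being compared, and the contents of lista are all made up for illustration; the real column and list would come from your own files.

```r
# Hypothetical stand-in for the 5-million-row table; V3 is ASSUMED
# to be the column that is compared against the list
x <- data.frame(V1 = 1:6,
                V3 = c("a", "b", "c", "a", "d", "e"),
                stringsAsFactors = FALSE)

# Hypothetical stand-in for the list of 10450 items
lista <- c("a", "d")

# Keep only the rows whose V3 value is NOT in the list
x_reduzido <- x[!x$V3 %in% lista, ]
x_reduzido
```

The comma after the condition matters: it selects whole rows and keeps all columns.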

This approach should not be slow even for this amount of data, but another option is filter() from dplyr, which would look like this:

> library(dplyr)
> df <- data.frame(todos) # Convert to a data frame
> df %>% filter(! todos %in% excluir)
  todos
1     1
2     4
3     6
4     8
5     9
6    10

If you are going to chain other commands, dplyr may be a good alternative; otherwise there is no need to load the package just for this.
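
As a rough sketch of what chaining might look like, the filter can be followed by arrange() to sort the result, similar to the ordering step in the question. The values here are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
df <- data.frame(todos = c(9, 2, 4, 7, 1))
excluir <- c(2, 7)

# Remove the unwanted values, then sort what is left
resultado <- df %>%
  filter(!todos %in% excluir) %>%
  arrange(todos)
resultado
```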

This would remove the unwanted values, but I don't think it would noticeably speed up your data manipulation, since you would remove only about 0.2% of the rows (10450 out of 5 million, if each listed value matches roughly one row). It may be more effective to improve the steps of the code that are actually slow, rather than reducing the size of the data.
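
Since a listed value can occur in more than one row, the real share removed may differ from that estimate. One way to check before filtering is to count the matches. This is a minimal sketch on simulated data; valores and lista are hypothetical stand-ins for your column and your list.

```r
set.seed(1)  # only to make the simulated data reproducible

# Simulated column: 1000 values drawn from 1..500 (made up)
valores <- sample(1:500, 1000, replace = TRUE)
lista <- 1:10  # hypothetical list of values to remove

coincide <- valores %in% lista  # TRUE where the value is in the list
sum(coincide)                   # number of rows that would be removed
mean(coincide)                  # fraction of rows that would be removed
```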

    
20.05.2015 / 22:48
1

Creating a sample data.frame:

dados <- data.frame(x = rnorm(30), y = c("a","b","c"))

To delete rows, you use a logical set operation: you select the elements that are not in a given set.

Let's create the vector that has the categories of y you want to remove:

excluir <- c("a", "b")

Now we can select only the rows where y is not in the vector excluir (the ! negates the condition):

dados[!dados$y %in% excluir, ]
           x y
3   0.1003638 c
6   1.4888718 c
9   0.3561347 c
12 -0.4532080 c
15  0.3552320 c
18  0.6220573 c
21 -1.0136110 c
24 -0.4445456 c
27 -0.6974983 c
30  1.0516000 c
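
One detail about the set logic: base R also has setdiff(), but it returns unique values only, so it is not a drop-in replacement for the %in% filter when values repeat. A small sketch with made-up values:

```r
valores <- c("a", "c", "c", "b", "c")
excluir <- c("a", "b")

# The %in% filter keeps every element that is not in excluir
mantidos <- valores[!valores %in% excluir]
mantidos    # "c" "c" "c"

# setdiff() treats its arguments as sets and drops duplicates
diferenca <- setdiff(valores, excluir)
diferenca   # "c"
```

For filtering rows of a table, the %in% form is the one you want, because each row must be kept or dropped individually.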

Since you say your data set might be large, in addition to dplyr, which Molx mentioned, another interesting package is data.table. With data.table it would look like this:

library(data.table)
dados <- data.table(dados)
dados[! y %in% excluir,]
             x y
 1:  0.1003638 c
 2:  1.4888718 c
 3:  0.3561347 c
 4: -0.4532080 c
 5:  0.3552320 c
 6:  0.6220573 c
 7: -1.0136110 c
 8: -0.4445456 c
 9: -0.6974983 c
10:  1.0516000 c
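
data.table can also express the same filter as an anti-join using the on argument, which may be worth trying on a large table. This is only a sketch with made-up data, and whether it is faster than %in% for your files is something you would need to measure.

```r
library(data.table)

# Made-up example data with the same columns as above
dados <- data.table(x = 1:6, y = rep(c("a", "b", "c"), 2))
excluir <- c("a", "b")

# Anti-join: keep the rows of dados that have no match in the lookup table
resultado <- dados[!data.table(y = excluir), on = "y"]
resultado
```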
    
22.05.2015 / 16:01