Separate content from one column into other columns

8

I have a table-shaped file. I need to separate the contents of a specific column into other columns, the data is separated by ; .

Well I have managed to do this, but the problem is that the content will result for each row in a different number of columns. And by this code it replicates to the content until it reaches the number of columns. I wanted him to exhibit NA .

From my code I get this:

  v1      v2      v3    p/sep  sep sep  sep sep  
dados1  dados2  dados3  a;b;c   a   b   c   a  
dados1  dados2  dados3  a;c     a   c   a   c  
dados1  dados2  dados3  a       a   a   a   a  
dados1  dados2  dados3  a;b     a   b   a   b  
dados1  dados2  dados3  a;b;c;d a   b   c   d  
dados1  dados2  dados3  a;b     a   b   a   b  

But the result I hope to have is:

  v1      v2      v3 separar  sep sep   sep sep  
dados1  dados2  dados3  a;b;c   a   b   c   NA  
dados1  dados2  dados3  a;c     a   c   a   c  
dados1  dados2  dados3  a       a   NA  NA  NA  
dados1  dados2  dados3  a;b     a   b   NA  NA  
dados1  dados2  dados3  a;b;c;d a   b   c   d  
dados1  dados2  dados3  a;b     a   b   NA  NA  

library(reshape)        
file_split = data.frame(file,colsplit(file$separar,split=";",names="buffer",))
    
asked by anonymous 23.05.2015 / 16:36

2 answers

3

The biggest difficulty is to provide names for the new columns. I'm going to adopt the same variable file and the same schema for the column names that Molx used.

If you know in advance the number of columns contained in separar . (4, in this case)

colunas <- paste0("sep", 1:4)

The number of columns can be obtained using the str_count function of the stringr package.

library(stringr);
colunas <- paste0("sep", 1:(max(str_count(file$separar, ';')) + 1));

It is now possible to split the separar column using the separate function of the tidry package.

library(tidyr);
separate(file, separar, into = colunas, sep = ';', remove = FALSE,
         extra = "merge");
    
24.05.2015 / 17:42
4

You've been very close to the solution. If instead of reshape use reshape2 , using the same function (changing the argument split to pattern ), you get close to what you would like:

> file <- read.table(text="  v1      v2      v3    separar  sep sep  sep sep  
dados1  dados2  dados3  a;b;c   a   b   c   a  
dados1  dados2  dados3  a;c     a   c   a   c  
dados1  dados2  dados3  a       a   a   a   a  
dados1  dados2  dados3  a;b     a   b   a   b  
dados1  dados2  dados3  a;b;c;d a   b   c   d  
dados1  dados2  dados3  a;b     a   b   a   b", header=T, as.is=T)[,1:4]
> library(reshape2)        
> file_split = data.frame(file, colsplit(file$separar, pattern=";", names=paste0("sep", 1:4)))

Resulting

> file_split
      v1     v2     v3 separar sep1 sep2 sep3 sep4
1 dados1 dados2 dados3   a;b;c    a    b    c     
2 dados1 dados2 dados3     a;c    a    c          
3 dados1 dados2 dados3       a    a               
4 dados1 dados2 dados3     a;b    a    b          
5 dados1 dados2 dados3 a;b;c;d    a    b    c    d
6 dados1 dados2 dados3     a;b    a    b          

This is a standard definition of the function. In both packages, it is not possible to control behavior for different number of columns, but reshape uses recycle while reshape2 is complete with empty strings. To replace empty spaces with NA, you can do, for example:

> file_split[file_split == ""] <- NA
> file_split
      v1     v2     v3 separar sep1 sep2 sep3 sep4
1 dados1 dados2 dados3   a;b;c    a    b    c <NA>
2 dados1 dados2 dados3     a;c    a    c <NA> <NA>
3 dados1 dados2 dados3       a    a <NA> <NA> <NA>
4 dados1 dados2 dados3     a;b    a    b <NA> <NA>
5 dados1 dados2 dados3 a;b;c;d    a    b    c    d
6 dados1 dados2 dados3     a;b    a    b <NA> <NA>
It is only important to be careful here because this substitution changes all% values of the table, perhaps it is safer to use the column names / indexes so as not to risk changing other data that should remain empty and not NA (if any).

Before finding this difference of "" I made an alternative using base and following this tip . It is not so succinct, but since it is ready, it follows below:

> file.split <- strsplit(file$separar, ";")
> n.obs <- sapply(file.split, length)
> seq.max <- seq_len(max(n.obs))
> resultado <- cbind(file, t(sapply(file.split, "[", i = seq.max)))
> resultado
      v1     v2     v3 separar 1    2    3    4
1 dados1 dados2 dados3   a;b;c a    b    c <NA>
2 dados1 dados2 dados3     a;c a    c <NA> <NA>
3 dados1 dados2 dados3       a a <NA> <NA> <NA>
4 dados1 dados2 dados3     a;b a    b <NA> <NA>
5 dados1 dados2 dados3 a;b;c;d a    b    c    d
6 dados1 dados2 dados3     a;b a    b <NA> <NA>
    
24.05.2015 / 05:49