How to separate a string from a certain row in a data.frame and at the same time create more rows?

7

I have a data.frame with a column with strings of type "123-235-203".

For example, the line:

  

string column1 column2
  123-235-203 x and

I want to separate this string, so that the line containing it will multiply preserving the values of the other columns:

strsplit(string, "-")

Turning the line into 3 lines.

  

string column1 column2
  123 x and
  235 x and
  203 x and

Is there a way to do this using dplyr or tidyr?

    
asked by anonymous 30.04.2015 / 02:00

3 answers

5

Hadley himself offers us the answer in the vignette of the tidyr.

library(tidyr)
library(dplyr)
df <- data.frame(
x = 1:3,
y = c("a", "d,e,f", "g,h"),
stringsAsFactors = FALSE
)
df %>%
transform(y = strsplit(y, ",")) %>%
unnest(y)
    
30.04.2015 / 03:44
4

Create a test base:

b <- data_frame(x = 1:10,y = 1:10, string = rep("123-235-203", 10))

With the following code you can get what you wanted:

b %>% 
  separate(string, into = c('s1', 's2', 's3'), sep = '-') %>%
  gather(string, valor, -x, -y) %>%
  select(-string) %>%
  arrange(x)
  • The function separate of tidyr transforms your string into three columns (s1, s2 and s3).
  • The function gather of tidyr multiplies the lines.
  • The function select of dplyr removes the new string column that is no longer needed.
  • arrange is just to be easier to understand.

The results are below:

# Source: local data frame [30 x 3]
# 
#    x y valor
# 1  1 1   123
# 2  1 1   235
# 3  1 1   203
# 4  2 2   123
# 5  2 2   235
# 6  2 2   203
# 7  3 3   123
# 8  3 3   235
# 9  3 3   203
# 10 4 4   123
# .. . .   ...

In case the strings can have variable sizes, but there is a maximum of "-" 's you can do as follows:

b <- data_frame(x = 1:10,y = 1:10, string = rep(c("123-203", "123-203-555"), length.out = 10))

b %>% 
  separate(string, sep = '-', into = c("s1", "s2", "s3"), extra = "merge") %>%
  gather(string, valor, -x, -y, na.rm = T) %>%
  select(-string) %>%
  arrange(x) 
  • I added the argument extra = "merge" to separate so that it does not return an error.
  • I added the argument na.rm = T so it does not create lines with NA .
30.04.2015 / 03:17
3

Here is a solution using the data.table package

> library(data.table)
> dTbl <- data.table(x = 1:3, y = c("a", "d,e,f", "g,h"))
> dTbl[, .(y=unlist(strsplit(y, ','))), by=x]
   x y
1: 1 a
2: 2 d
3: 2 e
4: 2 f
5: 3 g
6: 3 h
>
    
01.05.2015 / 03:38