Cluster Analysis by Groups

I'm trying to run a cluster analysis separately for each of several groups within a data frame, and then collect the results of that analysis (e.g. the resulting cluster assignments) into a data frame using the tidy function from broom.

dput

dataset=structure(list(a = c(28L, 19L, 92L, 35L, 42L, 82L, 91L, 98L, 
58L, 58L, 92L, 61L, 67L, 73L, 4L, 35L, 9L, 17L, 7L, 82L, 24L,   
51L, 45L, 1L, 97L, 97L, 99L, 5L, 67L, 97L, 95L, 77L, 56L, 67L, 
80L, 22L, 87L, 31L, 97L, 15L, 12L, 94L, 18L, 86L, 1L, 99L, 2L, 
88L, 84L, 65L, 59L, 38L, 8L, 46L, 66L, 30L, 32L, 36L, 17L, 35L, 
40L, 16L, 60L, 28L, 47L, 56L, 82L, 88L, 76L, 38L, 88L, 61L, 26L, 
64L, 24L, 48L, 30L, 68L, 88L, 42L, 62L, 12L, 76L, 37L, 25L, 91L, 
18L, 76L, 13L, 24L, 49L, 89L, 35L, 88L, 19L, 24L, 62L, 91L, 99L,  
18L), b = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("group1", 
"group2", "group3", "group4"), class = "factor"), c = c(61L, 
28L, 82L, 38L, 22L, 79L, 7L, 12L, 73L, 78L, 17L, 28L, 30L, 11L, 
99L, 47L, 42L, 51L, 13L, 16L, 35L, 51L, 92L, 41L, 45L, 27L, 17L, 
37L, 27L, 53L, 23L, 50L, 81L, 25L, 93L, 11L, 80L, 35L, 32L, 9L, 
56L, 18L, 17L, 63L, 49L, 11L, 26L, 93L, 45L, 7L, 43L, 90L, 31L, 
80L, 53L, 66L, 62L, 13L, 54L, 7L, 20L, 37L, 79L, 52L, 35L, 8L, 
6L, 46L, 35L, 3L, 18L, 82L, 92L, 80L, 8L, 87L, 89L, 20L, 26L, 
86L, 29L, 55L, 46L, 83L, 66L, 25L, 17L, 68L, 21L, 83L, 26L, 97L, 
54L, 71L, 19L, 6L, 20L, 86L, 83L, 8L)), class = "data.frame", row.names = c(NA, 
-100L))

I tried this:

library(dplyr)
library(broom)

res1<-dataset%>%
group_by(b)%>%
do(cluster= 
       kmeans(dataset[,c(1,3)],centers=3))

res2<-tidy(res1,cluster)

But I cannot get what I want: the resulting data frame should have 100 rows, each with its own cluster derived from the analysis. Is there an error in my code, or is this function not suitable for this task?

    
asked by anonymous 21.11.2018 / 19:11

1 answer

This function is not suitable for this action, at least not the way it is being used here. The trick is to use the nest function of the tidyr package:

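library(dplyr)
library(tidyr)
library(purrr)  # map()
library(broom)  # tidy()

cluster <- dataset %>%
  nest(data = c(a, c)) %>%   # tidyr >= 1.0 syntax; older versions used nest(a, c)
  mutate(model = map(data, kmeans, 3),   # one k-means fit (3 centers) per level of b
         centers = map(model, tidy))     # one row of summary statistics per cluster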

With nest, I tell R which columns to pack into a list column for clustering (here a and c). The result has one list element per level of b, that is, length(levels(dataset$b)) = 4 elements. I then use mutate with map to fit k-means to each element, and tidy to extract the cluster centers.

See that the result is as expected:

cluster %>%
  select(b, centers) %>%   # drop the data and model list columns before unnesting
  unnest(centers)

        b       x1       x2 size withinss cluster
1  group1 82.62500 20.75000    8 2529.375       1
2  group1 56.50000 83.83333    6 5278.333       2
3  group1 24.36364 39.00000   11 4378.545       3
4  group2 86.40000 23.20000   10 2952.000       1
5  group2 13.25000 30.00000    8 2861.500       2
6  group2 81.57143 73.28571    7 2947.143       3
7  group3 40.20000 73.70000   10 4321.700       1
8  group3 76.50000 33.50000    6 2337.000       2
9  group3 33.33333 18.00000    9 3200.000       3
10 group4 87.25000 62.75000    8 5699.000       1
11 group4 31.62500 75.37500    8 2415.750       2
12 group4 37.00000 18.44444    9 4592.222       3

The problem is that this does not yet give what you really care about, which is the cluster to which each observation belongs. We do have the centers of each cluster, though, so we can predict each observation's cluster by assigning it to the nearest center in Euclidean distance. For this, we will use the cl_predict function from the clue package:

library(clue)  # cl_predict()

dataset %>%
  filter(b == "group1") %>%
  select(-b) %>%
  cl_predict(cluster$model[[1]], .)   # model fitted on group1 only

Class ids:
 [1] 3 3 2 3 3 2 1 1 2 2 1 1 1 1 2 3 3 3 3 1 3 3 2 3 1
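
To make the "nearest center in Euclidean distance" step concrete, here is a minimal base R sketch that assigns points to centers by hand. The data frame pts is a made-up stand-in for one group of the question's data (it is not the original data), and the names d2 and nearest are only illustrative.

```r
set.seed(1)

# Toy stand-in for one group: 25 observations of columns a and c (assumption)
pts <- data.frame(a = sample(1:99, 25), c = sample(1:99, 25))
km  <- kmeans(pts, centers = 3)

# Squared Euclidean distance from every observation to every center:
# rows = observations, columns = clusters
d2 <- sapply(seq_len(nrow(km$centers)), function(k) {
  rowSums(sweep(as.matrix(pts), 2, km$centers[k, ])^2)
})

# For each observation, the index of the closest center
nearest <- max.col(-d2, ties.method = "first")
```

The vector nearest has one cluster id per row of pts, which is the same kind of result that cl_predict reports as class ids.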

I could not make this prediction for all models at the same time. To get all 100 predictions, you would have to make cluster$model[[1]] vary across the groups somehow, either in a tidy way or with a for loop.
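
One possible way to get an assignment for every row at once is sketched below. It relies on the fact that a fitted kmeans object already stores the training assignments in its cluster component, and that broom's augment method for kmeans attaches them to the data as a .cluster column. The data frame toy is a randomly generated stand-in with the same layout as the question's dataset (an assumption made so the example runs on its own); with the real data, replace toy by dataset.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(42)

# Toy stand-in with the same columns as the question's data (assumption)
toy <- data.frame(
  a = sample(1:99, 100, replace = TRUE),
  b = factor(rep(paste0("group", 1:4), each = 25)),
  c = sample(1:99, 100, replace = TRUE)
)

assignments <- toy %>%
  nest(data = c(a, c)) %>%                           # one row per level of b
  mutate(model     = map(data, kmeans, centers = 3), # one fit per group
         augmented = map2(model, data, augment)) %>% # data + .cluster column
  select(b, augmented) %>%
  unnest(augmented)                                  # back to one row per observation
```

The result has one row per original observation, with columns b, a, c and .cluster. Note that cluster ids are only meaningful within a group: cluster 1 of group1 is unrelated to cluster 1 of group2.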

Another thing I do not know how to do is to cluster the data with a different number of clusters per group. Here, 3 clusters were fitted in each of the 4 groups of the variable b. I do not know whether that would be a reasonable thing to do in practice.
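
As a starting point, the number of clusters could be stored next to each group and passed to kmeans with map2. This is only a sketch: the vector k_per_group and its values are hypothetical and chosen for illustration, and toy is again a randomly generated stand-in for the question's dataset.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(42)

# Toy stand-in with the same columns as the question's data (assumption)
toy <- data.frame(
  a = sample(1:99, 100, replace = TRUE),
  b = factor(rep(paste0("group", 1:4), each = 25)),
  c = sample(1:99, 100, replace = TRUE)
)

# Hypothetical number of clusters for each group
k_per_group <- c(group1 = 2, group2 = 3, group3 = 4, group4 = 3)

centers <- toy %>%
  nest(data = c(a, c)) %>%
  mutate(k       = unname(k_per_group[as.character(b)]),     # look up k for each group
         model   = map2(data, k, ~ kmeans(.x, centers = .y)), # fit with group-specific k
         centers = map(model, tidy)) %>%
  select(b, k, centers) %>%
  unnest(centers)                                             # one row per fitted cluster
```

With these values, centers has 2 + 3 + 4 + 3 = 12 rows, one per cluster. How to choose k for each group is a separate question that this sketch does not address.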

But I will leave these two tasks to the reader :)

    
22.11.2018 / 11:42