Less data in a data set

8

I have a dataset and would like to select only the smallest date between each primary key ( MATRICULA column). Here is the example of my DF:

MATRICULA <- c(1,1,3,3,3,4,5,5,5,5,6)
DATA <- c('15/01/2018', '10/12/2017', '20/11/2017', '01/01/2015', 
  '25/10/2018', '02/07/2016', '03/12/2016','17/08/2017', '22/03/2018', 
  '12/06/2018', '13/04/2014')
DADOS <- data.frame(MATRICULA, DATA)

I already use the abv_data = c(as.Date(DADOS$DATA,"%d/%m/%Y")) function to transform the date format.

Therefore, I would like the result to appear only on a smaller date from the MATRICULA column. The result I hope should be:

MATRICULA <- c(1,3,4,5,6)
DATA <- c('10/12/2017', '01/01/2015', '02/07/2016', '03/12/2016', '13/04/2014')
DADOS <- data.frame(MATRICULA,DATA)
    
asked by anonymous 19.12.2018 / 13:44

3 answers

7

This problem is very easy to solve using the dplyr package. The first thing to do is to transform the DATA column into date so that R can establish a relation of order to it. I'll just copy your original code, which was correct, and overwrite the column DATA within DADOS :

DADOS$DATA = c(as.Date(DADOS$DATA,"%d/%m/%Y"))

With this done, the package dplyr will serve to group your data by MATRICULA and then look for the lowest value of DATA for each of them:

DADOS %>% 
  group_by(MATRICULA) %>% 
  slice(which.min(DATA))
# A tibble: 5 x 2
# Groups:   MATRICULA [5]
  MATRICULA DATA      
      <dbl> <date>    
1         1 2017-12-10
2         3 2015-01-01
3         4 2016-07-02
4         5 2016-12-03
5         6 2014-04-13

And that's it. This is the result. It will serve any number of registrations you have in your data set, and it will not matter how many dates there are within each of these registrations.

    
19.12.2018 / 14:05
7

Two ways with base R.

With aggregate .

aggregate(abv_data ~ MATRICULA, DADOS, min)
#  MATRICULA   abv_data
#1         1 2017-12-10
#2         3 2015-01-01
#3         4 2016-07-02
#4         5 2016-12-03
#5         6 2014-04-13

With tapply .

as.Date(tapply(abv_data, DADOS$MATRICULA, FUN = min), origin = "1970-01-01")
#           1            3            4            5            6 
#"2017-12-10" "2015-01-01" "2016-07-02" "2016-12-03" "2014-04-13"
    
19.12.2018 / 14:09
6

Another solution with dplyr is using filter instead of slice .

library(tidyverse)
library(lubridate)

DADOS %>% 
  mutate(DATA = dmy(as.character(DATA))) %>% 
  group_by(MATRICULA) %>% 
  filter(DATA == min(DATA))

# A tibble: 5 x 2
# Groups:   MATRICULA [5]
  MATRICULA DATA      
      <dbl> <date>    
1         1 2017-12-10
2         3 2015-01-01
3         4 2016-07-02
4         5 2016-12-03
5         6 2014-04-13
    
19.12.2018 / 21:42