Remove duplicate cases and keep specific values from another variable


Consider the following situation:

I have a dataset with two variables. The first contains duplicate values (e.g. CPF xxx.xxx.xxx-xx appears 14 times, another CPF xxx.xxx.xxx-xx appears 18 times, and so on). The second holds the event occurrence dates (e.g. 2017-01-18, 2017-01-19, ...) associated with each CPF.

I use the following function to remove duplicate cases:

new <- dataset[!duplicated(dataset[c("CPFs")]), ]

This removes the duplicate rows.

My goal: remove the duplicates in CPFs, but keep, in the other variable ( date ), the most recent date (or the oldest one) for each CPF. That is, I need to impose an ordering when the function runs.

So if a CPF has the dates ( 2018-01-20; 2017-02-22 ) attached to it and I want the oldest, the date kept for it would be 2017-02-22 .

Dummy data to help with the answer:

dataset <- structure(list(CPFs = c(1234, 2345, 1234, 2345, 1234, 2345, 1234,
2345), date = c(1998, 1997, 1993, 1992, 1998, 1998, 1992, 1989
)), class = "data.frame", row.names = c(NA, -8L))

Desired result:

CPFs date
1234 1992
2345 1989
    
asked by anonymous 10.10.2018 / 21:49

1 answer


A simple way to solve this is to use dplyr , from the tidyverse :

  library(dplyr)

  # Sort by date (oldest first), then keep the first row found for each CPF
  new_dataset <- dataset %>% 
    arrange(date) %>% 
    distinct(CPFs, .keep_all = TRUE)

Note that the dates should be stored as Date , not as character strings; otherwise sorting may not work properly (for example, strings in dd/mm/yyyy form sort by day rather than chronologically).
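As a minimal sketch of that conversion, assuming the dates arrive as text in day/month/year form (the `raw` vector and its values are made up for illustration and are not from the question's data), `as.Date()` with an explicit `format` gives values that sort chronologically:

```r
# Hypothetical example values, not taken from the question's data
raw <- c("20/01/2018", "22/02/2017", "05/03/2017")

# Sorting the strings compares characters, so the day drives the order
sorted_strings <- sort(raw)

# Converting to Date first makes the sort chronological
dates <- as.Date(raw, format = "%d/%m/%Y")
sorted_dates <- sort(dates)
```

If the real column uses a different layout, the `format` string would need to be adjusted to match it.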

If you want to keep the most recent observation instead, just use arrange(desc(date)) , that is, sort in descending order.
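For comparison, the same sort-then-deduplicate idea can be sketched in base R with order() and duplicated(), applied to the dummy data from the question. This is an alternative to the dplyr approach, and the names `sorted_asc`, `sorted_desc`, `oldest` and `newest` are only illustrative:

```r
# Dummy data from the question
dataset <- data.frame(
  CPFs = c(1234, 2345, 1234, 2345, 1234, 2345, 1234, 2345),
  date = c(1998, 1997, 1993, 1992, 1998, 1998, 1992, 1989)
)

# Oldest date per CPF: sort ascending, then keep the first row of each CPF
sorted_asc <- dataset[order(dataset$date), ]
oldest <- sorted_asc[!duplicated(sorted_asc$CPFs), ]

# Most recent date per CPF: sort descending instead
sorted_desc <- dataset[order(dataset$date, decreasing = TRUE), ]
newest <- sorted_desc[!duplicated(sorted_desc$CPFs), ]
```

With this data, `oldest` keeps 1992 for CPF 1234 and 1989 for CPF 2345, which matches the desired result in the question, although the rows come out in date order rather than CPF order.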

    
10.10.2018 / 22:42