Remove duplicate names with regular expression

7

Suppose I have the following vector, with the names of presidents of the republic:

presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca", 
"Peixoto, FlorianoFloriano Peixoto", "de Morais, PrudentePrudente de Morais", 
"Sales, CamposCampos Sales")

I would like to format this vector so that you can read the name of each president directly:

"Deodoro da Fonseca" "Floriano Peixoto" "Prudente de Morais" "Campos Sales"      

I imagine there is some regular expression that does this, but I cannot work out how to build it.

    
asked by anonymous 30.08.2016 / 20:58

4 answers

5

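It's not pretty, but it works:

library(stringr)
# Match "Surname, Firstname" plus the capital letter that starts the repeated first name
rex <- ".*, [:alpha:]{1,}[A-Z]{1}"
# Extract the match and drop its last character (that extra capital letter)
nomes_invertidos <- str_extract_all(presidentes, rex) %>% unlist() %>% str_sub(end = -2)
# Remove the inverted part, leaving only the name in natural order
str_replace_all(presidentes, nomes_invertidos, replacement = "")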

[1] "Deodoro da Fonseca" "Floriano Peixoto"   "Prudente de Morais" "Campos Sales"   

The regex matches:

  • anything up to the comma ( .*, ),
  • the comma,
  • a space,
  • one or more letters followed by the capital letter that starts the repeated name ( [:alpha:]{1,}[A-Z]{1} ).
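The same breakdown can also be tried without stringr. This is only a sketch, not the code from the answer: it assumes base R's sub() with perl = TRUE, uses the POSIX classes [[:alpha:]] and [[:upper:]] in place of [:alpha:] and [A-Z], and puts the final capital letter in a capture group so it can be written back instead of being trimmed with str_sub(). The name resultado is just illustrative.

```r
presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca",
                 "Peixoto, FlorianoFloriano Peixoto",
                 "de Morais, PrudentePrudente de Morais",
                 "Sales, CamposCampos Sales")

# ^.*,           anything up to the comma, then the comma and a space
# [[:alpha:]]+   the first copy of the given name
# ([[:upper:]])  the capital letter that starts the second copy (group 1)
resultado <- sub("^.*, [[:alpha:]]+([[:upper:]])", "\\1", presidentes, perl = TRUE)
print(resultado)
```

Because the capital letter is captured and reinserted, there is no need for the intermediate vector of inverted names.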
30.08.2016 / 21:15
3

The solution may vary. I used the following:

Regex -> ^.+?,\s*(\w+)\1(.+?)$

Replacement -> $1$2

I don't know whether this works in R, but the (\w+)\1 part, which uses a backreference, matches the name that appears twice in a row and captures a single copy of it; the replacement then joins that copy with the rest of the string.

I tested it in Notepad++ and it worked.

In R, this expression can be used as follows:

gsub("^.+?,\\s*(\\w+)\\1(.+?)$", "\\1\\2", presidentes, perl = TRUE)
#[1] "Deodoro da Fonseca" "Floriano Peixoto"   "Prudente de Morais" "Campos Sales"
    
30.08.2016 / 22:07
1

I don't know R, but as I have already explained here, you can do a simple search for a sequence that is immediately repeated and replace it with a single copy.

  • pattern: ([a-z]+)\1
  • replace: $1
  • flags: i, plus g if the language needs an explicit "replace all".

REGEX in JS: str.replace(/([a-z]+)\1/gi, '$1')

See working at REGEX101

Explanation

  • ([a-z]+) - Group 1
  • [a-z] - This is limited to letters, and because the i flag is set, it accepts both uppercase and lowercase.
  • \1 - Repeats the text captured by group 1, so the search only matches duplicated parts.
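A rough R counterpart of the JS call might look like the sketch below. It assumes gsub() with ignore.case = TRUE and perl = TRUE as stand-ins for the i and g flags, and the names sem_repeticao and nomes are hypothetical. Note that collapsing the duplicate still leaves the "Surname, " prefix in place, so a second step, which is an addition here and not part of the pattern above, strips it.

```r
presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca",
                 "Peixoto, FlorianoFloriano Peixoto",
                 "de Morais, PrudentePrudente de Morais",
                 "Sales, CamposCampos Sales")

# Step 1: collapse any run of letters that is immediately repeated
# (same idea as /([a-z]+)\1/gi in the JS version)
sem_repeticao <- gsub("([a-z]+)\\1", "\\1", presidentes,
                      ignore.case = TRUE, perl = TRUE)
print(sem_repeticao)

# Step 2 (assumption, not in the original pattern): drop the leading "Surname, "
nomes <- sub("^[^,]*, ", "", sem_repeticao)
print(nomes)
```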
31.08.2016 / 15:07
0
gsub(".*[a-z]([A-Z])", "\\1", presidentes)

that is:

de Morais, PrudentePrudente de Morais
..................eP
                   ↓
                   Prudente de Morais
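The boundary shown in the diagram can also be located explicitly. The following is a minimal sketch, assuming base R's regexpr() and substring(); fronteira and nomes are illustrative names. It looks for the first lowercase letter directly followed by an uppercase one, whereas the gsub() call above keys on the last such pair because .* is greedy. For these four strings there is only one such pair, so both approaches agree.

```r
presidentes <- c("da Fonseca, DeodoroDeodoro da Fonseca",
                 "Peixoto, FlorianoFloriano Peixoto",
                 "de Morais, PrudentePrudente de Morais",
                 "Sales, CamposCampos Sales")

# Position of the first lowercase letter directly followed by an uppercase one
fronteira <- as.integer(regexpr("[a-z][A-Z]", presidentes, perl = TRUE))

# Keep everything from the uppercase letter onwards
nomes <- substring(presidentes, fronteira + 1)
print(nomes)
```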
    
27.01.2017 / 12:16