Selecting / Cleaning information in a column

I have a database with thousands of rows, but in one of the columns the data looks like this:

XLOCAL
Estirão do Equador, Rio Javari (04°27'S;71°30'W)
Alto Rio Paru de Oeste, Posto Tiriós (02°15'N;55°59'W)
Ipixuna do Pará, Rodovia Belém-Brasília km 92/93 (02°26'S;47°30'W)
Aurora do Pará, Rodovia Belém-Brasília km 86 (02°04'S;47°33'W)

I would like to keep only the coordinates, removing all the text, parentheses and semicolons. The result should look like this:

 XLOCAL
04°27'S 71°30'W
02°15'N 55°59'W

I tried using string functions and gsub, without success. Here is an example of what I tried:

df <- c("sdasdad (04°27'S;71°30'W)", "zxczxczcxz (01°40'N;51°23'W)")
grep("^([[:punct:]])", df, value=TRUE)
pattern <- "[[:alpha:]]"
gsub(pattern, "", df)

Result:

[1] " (04°27';71°30')" " (01°40';51°23')" # Note that it also removed "N", "S", "W" from the coordinates.

The database belongs to a museum and is not available online yet; it has to be organized before it can be published. Please help, otherwise there are thousands of rows to clean by hand. Thank you very much in advance.

    
asked by anonymous 24.10.2018 / 00:36

2 answers

I think the question overcomplicates the regex. Consider the following.
First, keep only what is between ( and ) . Since these are special characters in a regex, they must be escaped, which in an R string is written \\( and \\) . This is what sub does.
Then replace the semicolon ; with a space. I used gsub for this, but since there is only one ; per string, sub works as well.

gsub(";", " ", sub("^.*\\((.*)\\)", "\\1", XLOCAL))
#[1] "04°27'S 71°30'W" "02°15'N 55°59'W" "02°26'S 47°30'W"
#[4] "02°04'S 47°33'W"

This is exactly equivalent to the following, split into two statements to make it more readable.

tmp <- sub("^.*\\((.*)\\)", "\\1", XLOCAL)
XLOCAL <- gsub(";", " ", tmp)
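One thing to keep in mind with thousands of rows: when the pattern does not match, sub returns the input unchanged, so a row without a parenthesized coordinate pair would keep its full text. The following is only a sketch of one way to mark such rows as NA instead. The helper name extract_coords and the sample vector x are hypothetical and not part of the answer above.

```r
# Hypothetical helper (an assumption, not from the original answer):
# returns the text inside the final "(...)" with ";" replaced by a space,
# or NA when the value does not end in a parenthesized "a;b" pair.
extract_coords <- function(x) {
  has_coords <- grepl("\\([^()]*;[^()]*\\)[[:space:]]*$", x)
  inner <- sub("^.*\\(([^()]*)\\)[[:space:]]*$", "\\1", x)
  out <- gsub(";", " ", inner, fixed = TRUE)
  out[!has_coords] <- NA_character_
  out
}

# Made-up sample: the second value has no coordinates
x <- c("Estirão do Equador, Rio Javari (04°27'S;71°30'W)",
       "Locality without coordinates")
res <- extract_coords(x)
print(res)
```

Rows that come back as NA can then be checked by hand instead of being passed through silently.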

Data in dput format.

XLOCAL <-
c("Estirão do Equador, Rio Javari (04°27'S;71°30'W)", 
"Alto Rio Paru de Oeste, Posto Tiriós (02°15'N;55°59'W)", 
"Ipixuna do Pará, Rodovia Belém-Brasília km 92/93 (02°26'S;47°30'W)", 
"Aurora do Pará, Rodovia Belém-Brasília km 86 (02°04'S;47°33'W)")

This statement creates a vector. If you want a data frame, run the following after the statement above:

dados <- data.frame(XLOCAL)

Then, wherever the code above uses just XLOCAL, use dados$XLOCAL instead.
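As a rough illustration of the data frame variant, the sketch below writes the cleaned values to a separate column so the original locality text is kept. The column name COORD is an assumption made for this example; assigning back to dados$XLOCAL would overwrite the column instead.

```r
# Two of the sample rows from the question, stored in a data frame
dados <- data.frame(
  XLOCAL = c("Estirão do Equador, Rio Javari (04°27'S;71°30'W)",
             "Alto Rio Paru de Oeste, Posto Tiriós (02°15'N;55°59'W)"),
  stringsAsFactors = FALSE
)

# Same two steps as above, applied to the column.
# COORD is a hypothetical column name chosen for this sketch.
tmp <- sub("^.*\\((.*)\\)", "\\1", dados$XLOCAL)
dados$COORD <- gsub(";", " ", tmp)

print(dados$COORD)
```

If the column was read in as a factor (the default for data.frame before R 4.0), sub and gsub still work because they coerce their input to character.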

    
24.10.2018 / 19:37

Try this:

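# Read the file, chosen interactively
data = read.delim(file.choose(), header = TRUE)

library("stringr")

# Keep the last 16 characters: the coordinates plus the closing parenthesis
new_string = str_sub(data$XLOCAL, start = -16)

# Drop the closing parenthesis
str_sub(new_string, start = 1, end = 15)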
#[1] "04°27'S;71°30'W" "02°15'N;55°59'W" "02°26'S;47°30'W" "02°04'S;47°33'W"
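
Note that the output above still contains the semicolon, and that the fixed offsets only work if every value ends in a parenthesis, exactly 15 characters, and a closing parenthesis. The sketch below shows one possible way to check that assumption and replace the semicolon. It uses base R (substring and nchar) in place of str_sub, and the sample vector x, including its third value, is made up for illustration.

```r
# Made-up sample values; "\u00b0" is the degree sign written as an escape
x <- c("sdasdad (04\u00b027'S;71\u00b030'W)",
       "zxczxczcxz (01\u00b040'N;51\u00b023'W)",
       "no coordinates here")

# TRUE where the value ends in "(" + 15 characters + ")"
fits <- grepl("\\(.{15}\\)$", x)

# Same slice as the two str_sub() calls: drop ")" and keep the 15 characters before it
n <- nchar(x)
coords <- substring(x, n - 15, n - 1)

# Replace the semicolon with a space, as asked in the question
coords <- sub(";", " ", coords, fixed = TRUE)

# Values that do not fit the fixed-width assumption become NA
coords[!fits] <- NA_character_
print(coords)
```

Coordinates written with a different width, for example three-digit longitudes, would fail the check and show up as NA rather than being cut at the wrong position.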
    
24.10.2018 / 01:09