How do I write this regular expression in Python 3.6?


I need to write a regular expression that extracts the links from this string:

links = ('href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=70>ADMINISTRAÇÃO - GOVERNADOR VALADARES - DIURNO - SISU - GRUPO A</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=71>ADMINISTRAÇÃO - GOVERNADOR VALADARES - DIURNO - SISU - GRUPO B</a></li>')

The actual string is much longer; I included only part of it because the rest follows the same pattern. Here is what I have tried:

campus1 = re.findall("href", links)
campus2 = re.findall("http", links)
campus3 = re.findall("href=http", links)
campus4 = re.findall("hre", links)
campus5 = re.findall("a", links)
campus6 = re.findall("<a> <\a>", links)

When I print the results, I get either isolated letters or the links together with the course names (later I will also need an expression that extracts only those names). Does anyone have any ideas? For example, when I run campus1 = re.findall("href", links), the output is:

'href', 'href', 'href', 'href', 'href', 'href', 'href', 'href', 'href', 'href', 'href', 'href',

That is, it returns every occurrence of href in the string. I would like to extract only the links, for example:

http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=70

I want all the links, exactly as they appear in this string.

    
asked by anonymous 26.02.2017 / 15:32

1 answer


Do this:

import re
# Two entries only; the real string repeats the same structure.
s = "<li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=70>ADMINISTRAÇÃO - GOVERNADOR VALADARES - DIURNO - SISU - GRUPO A</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=71>ADMINISTRAÇÃO - GOVERNADOR VALADARES - DIURNO - SISU - GRUPO B</a></li>"
# The capturing group keeps only the URL that follows href=, quoted or not.
print(re.findall(r'href=[\'"]?([^\'" >]+)', s))
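The attempts in the question returned only 'href' because re.findall returns the whole match when the pattern has no capturing group, and only the group's contents when it has exactly one. Here is a minimal sketch of that difference. The sample string is a shortened, hypothetical version of the one in the question, and the variable names are illustrative:

```python
import re

# Hypothetical, shortened sample with the same shape as the question's string.
sample = (
    "<li><a href=http://www.ufjf.br/cdara/?id_curso=01GV&id_grupo=70>GRUPO A</a></li>"
    "<li><a href=http://www.ufjf.br/cdara/?id_curso=01GV&id_grupo=71>GRUPO B</a></li>"
)

# No capturing group: each result is the text matched by the whole pattern.
whole_match = re.findall(r"href", sample)
print(whole_match)

# One capturing group: each result is only what the parentheses matched.
only_urls = re.findall(r"href=([^ >]+)", sample)
print(only_urls)
```

The first call prints the literal text href once per link, while the second prints the two URLs.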


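Explanation of the regex: href= matches the attribute name literally; ['"]? allows an optional opening quote; the group ([^'" >]+) captures every character up to the next quote, space, or >, which is the URL itself.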
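The question also mentions needing the names later. One possible extension is to add a second capturing group, so that re.findall returns (link, name) tuples. This is a sketch, not a general HTML solution: it assumes href is the last attribute in the tag and that each name is the text between the closing > and the next <. The names pattern and pairs are illustrative:

```python
import re

s = (
    "<li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/"
    "lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=70>"
    "ADMINISTRAÇÃO - GOVERNADOR VALADARES - DIURNO - SISU - GRUPO A</a></li>"
    "<li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/"
    "lista-de-espera-sisu-3/?id_curso=01GV&id_grupo=71>"
    "ADMINISTRAÇÃO - GOVERNADOR VALADARES - DIURNO - SISU - GRUPO B</a></li>"
)

# Group 1: the URL (as in the answer above).
# Group 2: the link text, assumed to run from the tag's closing '>' to the next '<'.
pattern = r'href=[\'"]?([^\'" >]+)[\'"]?>([^<]+)'

# With two groups, findall returns a list of (link, name) tuples.
pairs = re.findall(pattern, s)
for link, name in pairs:
    print(name, "->", link)
```

If the real page has extra attributes after href or nested tags inside the link text, this pattern will miss entries. An HTML parser such as the standard library's html.parser is likely to be more robust in that case.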

    
26.02.2017 / 16:45