Regular expression in Python 3.6 for extracting a whole phrase

1

I need to extract only the phrases like ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B, for example. That is, I need to get only the course name, city, shift, SISU, and the group name from the following string:

string = </li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=70>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=71>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=72>

The string is huge; this is just a piece of it. I managed to write an expression, but it returns truncated results, and it also does not match accented letters, such as the accented "O" in HISTÓRIA. The expression I wrote was

cursos = re.findall(([A-Z])\w+g)

I need to get this out:

ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A

But it returns me this:

GEOGRAFIA - JUIZ DE FORA - DIURNO - SISU - GRUPO (it is not capturing which group it is)

and in HISTÓRIA, for example, it does not capture the accented "Ó".

    
asked by anonymous 28.02.2017 / 17:33

3 answers

2

I was waiting for someone who really knows regex to answer, but I'll give you a different solution (and in many cases a better one):

from bs4 import BeautifulSoup as bs

string = '</li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=70>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=71>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=72>'

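# Parse the fragment with Python's built-in HTML parser
soup = bs(string, 'html.parser')
# Collect every <a> element (find_all is the current name for findAll)
aEles = soup.find_all('a')
# Join the non-empty link texts, one per line
texts = '\n'.join(i.text for i in aEles if i.text != '')
print(texts)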

This will print:

  

ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A
ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B

    
28.02.2017 / 21:02
1

(1) Regular expressions are not the best tool for extracting HTML content. It is better to use an HTML parser for this, such as the BeautifulSoup shown in Miguel's answer, or the html.parser module from Python's own standard library. (In Python 2 the module is called HTMLParser instead of html.parser, but, I insist, you should not be using Python 2: it leaves you 10 years behind in features and conveniences, including the handling of accented characters.)
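For the standard-library route mentioned in (1), here is a minimal sketch using html.parser. The class name AnchorTextCollector and the variable sample_html are made up for this illustration, and the sample is a shortened copy of the string from the question; treat it as one possible approach rather than the canonical one.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the text found between <a ...> and </a> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside_anchor = False
        self._chunks = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Start buffering text whenever an <a> tag opens
        if tag == "a":
            self._inside_anchor = True
            self._chunks = []

    def handle_endtag(self, tag):
        # On </a>, store the buffered text if it is not empty
        if tag == "a" and self._inside_anchor:
            text = "".join(self._chunks).strip()
            if text:
                self.texts.append(text)
            self._inside_anchor = False

    def handle_data(self, data):
        if self._inside_anchor:
            self._chunks.append(data)


# Shortened copy of the string from the question (assumption: same layout)
sample_html = (
    "</li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/"
    "lista-de-espera-sisu-3/?id_curso=46A&id_grupo=70>"
    "ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li>"
    "<li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/"
    "lista-de-espera-sisu-3/?id_curso=46A&id_grupo=71>"
    "ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li>"
)

collector = AnchorTextCollector()
collector.feed(sample_html)
collector.close()
print(collector.texts)
```

Links that never reach a closing </a>, like the cut-off one at the end of the question's string, simply contribute nothing to the list.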

(2) That said, the problem with your regular expression is that it focuses on the wrong thing: instead of looking for the phrases themselves, which can vary a lot, it is much easier to search for what surrounds the phrase, which is fixed (the <a> and </a> tags). If there are more links than the ones you care about, you can start complicating the regular expression to capture only the content of the links of interest (and you will understand why the recommendation is NOT to use regular expressions for this), or, after extracting the content of all the <a> tags, filter it in plain Python with "for" and "if" to keep only what interests you (this may be more readable and easier than a complex regexp).

With all of this being said, the regular expression to retrieve everything inside the <a> tags, which you can use with re.findall, is:

re.findall(r"<a.*?>(.*?)</a", string)

The output I get for the HTML snippet you pasted is:

['ADMINISTRA\xc3\x87\xc3\x83O - JUIZ DE FORA - NOTURNO - SISU - GRUPO A',
 'ADMINISTRA\xc3\x87\xc3\x83O - JUIZ DE FORA - NOTURNO - SISU - GRUPO B']

(That is in Python 2.7; in Python 3, the accented characters in this excerpt already display correctly in the representation.)
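The "for" and "if" filter suggested in (2) could look like the sketch below, run in Python 3. The second link (example.com, "PÁGINA INICIAL") is invented here only to have something to filter out, and the " - SISU - " test is an assumed criterion; adapt it to whatever distinguishes the links you want.

```python
import re

# One real entry from the question plus one made-up, unrelated link
sample_html = (
    "<li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/"
    "lista-de-espera-sisu-3/?id_curso=46A&id_grupo=70>"
    "ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li>"
    "<li><a href=http://example.com/>PÁGINA INICIAL</a></li>"  # hypothetical link
)

# Step 1: grab the text of every <a> tag, interesting or not
link_texts = re.findall(r"<a.*?>(.*?)</a>", sample_html)

# Step 2: keep only the entries that look like course listings
cursos = []
for text in link_texts:
    if " - SISU - " in text:  # assumed filter criterion
        cursos.append(text)

print(cursos)
```

Keeping the regex simple and doing the selection in ordinary Python means the filter can change without touching the pattern.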

    
01.03.2017 / 14:27
-1

Using regex for HTML formatted this way, you can do this:

regex = r'(?<=<a href=http://www.ufjf.br/cdara/sisu-2/sisu-20\d{2}-\da-edicao/lista-de-espera-sisu-\d)(?:[\s\S]*?>)([-\x41-\x5A\xC0-\xDC\s]*?)(?:</a>)'
cursos = re.findall(regex, string)
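As a rough illustration of how that pattern behaves, the sketch below splits it into commented pieces (the concatenated raw strings form the same pattern) and runs it in Python 3 on a shortened copy of the question's string. The variable sample_html is introduced only for this example.

```python
import re

regex = (
    # Lookbehind: the match must come right after this fixed-width link prefix
    r'(?<=<a href=http://www.ufjf.br/cdara/sisu-2/sisu-20\d{2}-\da-edicao/lista-de-espera-sisu-\d)'
    # Skip the rest of the opening tag, up to and including the first ">"
    r'(?:[\s\S]*?>)'
    # Capture hyphens, A-Z (\x41-\x5A), accented capitals (\xC0-\xDC) and whitespace
    r'([-\x41-\x5A\xC0-\xDC\s]*?)'
    # Stop at the closing tag
    r'(?:</a>)'
)

# Shortened copy of the string from the question (assumption: same layout)
sample_html = (
    "</li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/"
    "lista-de-espera-sisu-3/?id_curso=46A&id_grupo=70>"
    "ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li>"
    "<li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/"
    "lista-de-espera-sisu-3/?id_curso=46A&id_grupo=71>"
    "ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li>"
)

cursos = re.findall(regex, sample_html)
print(cursos)
```

Note that the capturing class only accepts capital letters, hyphens and whitespace, so a link text containing digits or lowercase letters would not match.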
    
01.03.2017 / 15:33