Regexp extract value

Question

score 3

I have the following strings:

"The.Office.US.S{SE}E{EP}.The.Dundies.720p.srt"

"The Office [{SE}.{EP}] The Fight.srt"

Each string is a template for a file name. The actual files will have the following form:

"The.Office.US.S01E06.The.Dundies.720p.srt"

"The Office [01.06] The Fight.srt"

I need to extract the values 01 and 06 from these strings using Python, but I can't put together a regex that works for my case.

#encoding: utf-8
import re
template = "The.Office.US.S{SE}E{EP}.The.Dundies.720p.srt"
arq = "The.Office.US.S01E06.The.Dundies.720p.srt"

# This line is where I'm stuck
pat = re.compile(r'\{.*?\}')

season, episode = re.findall(pat, arq)
print("Season: ", season)
print("Episode: ", episode)

python regex

asked by anonymous 12.06.2017 / 01:29

2 answers

score 4

Edited, final version
This handles both templates with a single regex. The result is a tuple of digit strings. Because the regex serves both templates, groups() always returns 4 items, 2 of them None, since only one of the two alternatives takes part in a match. You can also read the groups individually: for the bracketed form ("The Office [01.06] ..."), use the groups 'Colchete' and 'Ponto'; for the S01E06 form, use the groups 'S' and 'E'. Note: the code below uses a single regular expression that is really two alternatives separated by a pipe (|), so you can also build a more granular version with one regex per template, as shown further below.

import re

s1 = "The Office [01.06] The Fight.srt"
s2 = 'The.Office.US.S01E06.The.Dundies.720p.srt'
# Raw string so the backslashes reach the regex engine unchanged
padrao = r'(?P<Colchete>\d{2})\.(?P<Ponto>\d{2})|(?P<S>\d{2})E(?P<E>\d{2})'
re1 = re.compile(padrao)

print('## Result for s1 ##')
print('Groups: ', re1.search(s1).groups())
print('Colchete: ', re1.search(s1).group('Colchete'))
print('Ponto: ', re1.search(s1).group('Ponto'), '\n')

## Result for s1 ##
Groups:  ('01', '06', None, None)
Colchete:  01
Ponto:  06

print('## Result for s2 ##')
print('Groups: ', re1.search(s2).groups())
print('S: ', re1.search(s2).group('S'))
print('E: ', re1.search(s2).group('E'))

## Result for s2 ##
Groups:  (None, None, '01', '06')
S:  01
E:  06
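Since only one alternative takes part in a match, the four-item tuple can be collapsed into a plain (season, episode) pair by dropping the None entries. The following is a minimal sketch that assumes the same combined pattern; the helper name extract_pair is hypothetical and not part of the answer above, and it returns None when neither form is found.

```python
import re

# Same combined pattern as in the answer, written as a raw string.
padrao = r'(?P<Colchete>\d{2})\.(?P<Ponto>\d{2})|(?P<S>\d{2})E(?P<E>\d{2})'
re1 = re.compile(padrao)

def extract_pair(name):
    """Hypothetical helper: return (season, episode) as strings, or None."""
    m = re1.search(name)
    if m is None:
        return None
    # Only one alternative participates in a match, so exactly two
    # of the four groups are not None.
    return tuple(g for g in m.groups() if g is not None)

print(extract_pair("The Office [01.06] The Fight.srt"))           # ('01', '06')
print(extract_pair("The.Office.US.S01E06.The.Dundies.720p.srt"))  # ('01', '06')
print(extract_pair("no numbers here.srt"))                        # None
```

This keeps the calling code independent of which template the file name followed.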

You can even make a more granular version by breaking the regex into 2 and working separately with the templates, something like this:

padrao1 = r'(?P<Colchete>\d{2})\.(?P<Ponto>\d{2})'
padrao2 = r'(?P<S>\d{2})E(?P<E>\d{2})'

re_p1 = re.compile(padrao1)
re_p2 = re.compile(padrao2)

print('## Results for the granular version ##')

print('## For s1 ##')
print('Groups: ', re_p1.search(s1).groups())
print('Colchete: ', re_p1.search(s1).group('Colchete'))
print('Ponto: ', re_p1.search(s1).group('Ponto'), '\n')

## Results for the granular version ##
## For s1 ##
Groups:  ('01', '06')
Colchete:  01
Ponto:  06

print('## For s2 ##')
print('Groups: ', re_p2.search(s2).groups())
print('S: ', re_p2.search(s2).group('S'))
print('E: ', re_p2.search(s2).group('E'))

## For s2 ##
Groups:  ('01', '06')
S:  01
E:  06
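One caveat with the granular version: search() returns None when a pattern is applied to a file name of the other form, so calling .groups() directly on the result would raise AttributeError. A possible way around this is to try each pattern in turn and stop at the first match. The sketch below rests on several assumptions: the PATTERNS list, the season_episode name, the shared group names, the literal brackets and leading "S" added as extra context, and the conversion to int are all illustrative choices, not part of the answer above.

```python
import re

# Hypothetical per-template patterns. The literal brackets and the leading "S"
# are extra context added here; the patterns in the answer do not require them.
PATTERNS = [
    re.compile(r'\[(?P<season>\d{2})\.(?P<episode>\d{2})\]'),
    re.compile(r'S(?P<season>\d{2})E(?P<episode>\d{2})'),
]

def season_episode(name):
    """Try each pattern in order; return (season, episode) as ints, or None."""
    for pattern in PATTERNS:
        m = pattern.search(name)
        if m:  # search() returns None when this pattern does not apply
            return int(m.group('season')), int(m.group('episode'))
    return None

print(season_episode("The Office [01.06] The Fight.srt"))           # (1, 6)
print(season_episode("The.Office.US.S01E06.The.Dundies.720p.srt"))  # (1, 6)
print(season_episode("The Office - The Fight.srt"))                 # None
```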

12.06.2017 / 02:25

score 2 · Accepted Answer

You have to turn your template into a regex. Since it contains several characters that are special in a regular expression, you first need to escape them. Then just replace the escaped {SE} and {EP} with a named group matching [0-9]+ and that's it. The code below does this:

import re

def template2regex(template):
    # Escape every character that is special in a regular expression
    template = re.escape(template)

    # re.escape turned {SE} and {EP} into \{SE\} and \{EP\}; swap them for named groups
    regex = template.replace(r'\{SE\}', '(?P<season>[0-9]+)')
    regex = regex.replace(r'\{EP\}', '(?P<episode>[0-9]+)')

    return re.compile(regex)

template = "The.Office.US.S{SE}E{EP}.The.Dundies.720p.srt"
regex = template2regex(template)
print(regex.search('The.Office.US.S01E06.The.Dundies.720p.srt').groups())
# ('01', '06')

template = 'The Office [{SE}.{EP}] The Fight.srt'
regex = template2regex(template)
print(regex.search('The Office [01.06] The Fight.srt').groups())
# ('01', '06')
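The same idea can be stretched a little further. The sketch below is one possible generalization, not the answer's own code: the names template_to_regex and extract, the fields parameter, the use of fullmatch() to require the whole file name to fit the template, and the conversion to int are all assumptions made for illustration. The escaped placeholder is built with re.escape as well, so the code does not rely on knowing exactly how braces are escaped.

```python
import re

def template_to_regex(template, fields=('SE', 'EP')):
    """Hypothetical generalization: each {NAME} placeholder becomes a named group."""
    regex = re.escape(template)
    for field in fields:
        # Escape the placeholder the same way the template was escaped,
        # then replace it with a named group that matches one or more digits.
        placeholder = re.escape('{%s}' % field)
        regex = regex.replace(placeholder, '(?P<%s>[0-9]+)' % field)
    return re.compile(regex)

def extract(template, filename):
    """Return {'SE': int, 'EP': int}, or None if the name does not fit the template."""
    m = template_to_regex(template).fullmatch(filename)
    if m is None:
        return None
    return {name: int(value) for name, value in m.groupdict().items()}

print(extract("The.Office.US.S{SE}E{EP}.The.Dundies.720p.srt",
              "The.Office.US.S01E06.The.Dundies.720p.srt"))  # {'SE': 1, 'EP': 6}
print(extract("The Office [{SE}.{EP}] The Fight.srt",
              "The Office [01.06] The Fight.srt"))           # {'SE': 1, 'EP': 6}
print(extract("The Office [{SE}.{EP}] The Fight.srt",
              "Some other file.srt"))                        # None
```

Using fullmatch() instead of search() is a design choice: it rejects file names that merely contain the template somewhere inside them.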