Organize data flow by string pattern

1

Friends, I'm working on a scraping project. At some point I get a table from the screen as one giant string, something like this:

lista = '0004434-48.2010 \n UNIÃO \n (30 dias úteis) 03/07/2017 \n 13/07/2017 \n 0008767-77.2013 \n 2017 \n (10 dias úteis) 03/07/2017 \n 13/07/2017'

I handled this by calling split with "\n" as the separator, which made the list look like this:

lista = ['0004434-48.2010', 'UNIÃO', '(30 dias úteis) 03/07/2017', '13/07/2017', '0008767-77.2013', '2017', '(10 dias úteis) 03/07/2017', '13/07/2017']

Now my difficulty is this: the first item in the list is the reference number of a table row. It identifies a particular contract, which extends up to the item that contains the second date. After that comes ANOTHER row (another contract), and the subsequent items belong to that second contract.

Question: how can I separate this? I still have to process the dates, because the contracts will only be "clicked" under certain conditions. I tried to write a loop like this:

organizaProcessos = []  # collected reference numbers
for item in lista:
    if len(item) == 15:  # identify the case number (15 characters long)
        organizaProcessos.append(item)

    
asked by anonymous 17.07.2017 / 18:16

2 answers

2

(TL;DR) If I understand what you want to do:

lista = ['0004434-48.2010',
 'UNIÃO',
 '(30 dias úteis) 03/07/2017',
 '13/07/2017',
 '0008767-77.2013',
 '2017',
 '(10 dias úteis) 03/07/2017',
 '13/07/2017']

def chunks(_list, parts):
    # Yield consecutive slices of `parts` items each.
    for i in range(0, len(_list), parts):
        yield _list[i:i+parts]

# Collect the chunks in a list instead of creating variables through locals().
partes = list(chunks(lista, 4))

print ('Primeira parte: ',partes[0])
print ('\nSegunda parte: ',partes[1])

Output:

Primeira parte:  ['0004434-48.2010', 'UNIÃO', '(30 dias úteis) 03/07/2017', '13/07/2017']

Segunda parte:  ['0008767-77.2013', '2017', '(10 dias úteis) 03/07/2017', '13/07/2017']

That is, you get n lists of 4 elements each (n depends on how many contracts the string contains). Each list represents one contract, and its first element is the contract's identifier.
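Since the first element of each group identifies the contract, the groups can also be keyed by that identifier. The following is only a sketch under the same assumption as above (every contract occupies exactly 4 items); the function name group_contracts and the variable contracts are made up for illustration.

```python
lista = ['0004434-48.2010', 'UNIÃO', '(30 dias úteis) 03/07/2017', '13/07/2017',
         '0008767-77.2013', '2017', '(10 dias úteis) 03/07/2017', '13/07/2017']

def group_contracts(items, size=4):
    """Map each contract identifier to the fields that follow it.

    Assumes every contract takes exactly `size` items (hypothetical helper).
    """
    if len(items) % size != 0:
        raise ValueError("the list does not divide evenly into contracts")
    contracts = {}
    for start in range(0, len(items), size):
        # The first item of the slice is the identifier, the rest are its fields.
        identifier, *fields = items[start:start + size]
        contracts[identifier] = fields
    return contracts

contracts = group_contracts(lista)
print(contracts['0008767-77.2013'])
```

With a dictionary, the dates of a given contract can be looked up directly by its reference number when deciding whether to "click" it.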

See working on repl.it.

    
17.07.2017 / 19:30
1

Use parse() from dateutil.parser. It raises an exception when a string cannot be parsed as a date, so you can use it to test whether a string is a date or not.

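#!/usr/bin/python
# -*- coding: utf-8 -*-
from dateutil.parser import parse

def chunks(string):
    # Return True only if the string parses as a date.
    try:
        int(string)  # plain integers such as '2017' are not counted as dates
        return False
    except ValueError:
        try:
            parse(string)
            return True
        except (ValueError, OverflowError):
            return False

def split(string,num):
    # Split the tokens right after the num-th date is found.
    c = 0  # dates seen so far
    i = 0  # number of tokens consumed
    tokens = string.split(' ')
    for token in tokens:
        c += chunks(token)
        i += 1
        if c == num: break
    # Skip the token right after the date (the '\n' separator in this input).
    return tokens[0:i],tokens[i+1:]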

string = '0004434-48.2010 \n UNIÃO \n (30 dias úteis) 03/07/2017 \n 13/07/2017 \n 0008767-77.2013 \n 2017 \n (10 dias úteis) 03/07/2017 \n 13/07/2017'
a,b = split(string,2)
print(a)
print(b)

This will be the output.

['0004434-48.2010', '\n', 'UNIÃO', '\n', '(30', 'dias', 'úteis)', '03/07/2017', '\n', '13/07/2017']
['0008767-77.2013', '\n', '2017', '\n', '(10', 'dias', 'úteis)', '03/07/2017', '\n', '13/07/2017']

Note that I can even work with a variable number of dates per line. Suppose I want to separate the lines after the third date, rather than after the second.

Just replace

a,b = split(string,2)

with

a,b = split(string,3)

and the result will be

['0004434-48.2010', '\n', 'UNIÃO', '\n', '(30', 'dias', 'úteis)', '03/07/2017', '\n', '13/07/2017', '\n', '0008767-77.2013', '\n', '2017', '\n', '(10', 'dias', 'úteis)', '03/07/2017']
['13/07/2017']
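
Another option is to key on the reference number itself, so that every row is separated in one pass regardless of how many fields or dates it has. This is only a sketch using the standard library; the pattern (7 digits, hyphen, 2 digits, dot, 4 digits) is an assumption inferred from the two sample numbers in the question, and the names CASE_NUMBER and split_rows are made up for illustration.

```python
import re

# Assumed layout of a reference number: 7 digits, hyphen, 2 digits, dot, 4 digits.
CASE_NUMBER = re.compile(r"\d{7}-\d{2}\.\d{4}")

def split_rows(text):
    """Group newline-separated fields into rows, one row per reference number."""
    rows = []
    for field in text.split("\n"):
        field = field.strip()
        if not field:
            continue  # ignore empty fields
        if CASE_NUMBER.fullmatch(field):
            rows.append([field])    # a reference number opens a new row
        elif rows:
            rows[-1].append(field)  # anything else belongs to the current row
        # fields that appear before the first reference number are dropped
    return rows

string = '0004434-48.2010 \n UNIÃO \n (30 dias úteis) 03/07/2017 \n 13/07/2017 \n 0008767-77.2013 \n 2017 \n (10 dias úteis) 03/07/2017 \n 13/07/2017'
rows = split_rows(string)
for row in rows:
    print(row)
```

Unlike the length check in the question, the pattern does not mistake another 15-character field for a reference number, but it would need adjusting if the real numbers follow a different layout.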
    
21.07.2017 / 21:10