How to extract data from non-standardized plain-text files?

I would like to extract fields from text files into a database. However, the fields are positioned and labelled differently in each file, so the values are hard to obtain with common methods. For example:

file 1:

PROVA: 2º Corta Mato    

LOCAL:  Pinhal da Paz
ORGANIZAÇÃO:    
AAP 

ESTADO TEMPO: Bom   
DATA:   28-01-2007  

file 2:

PROVA: MEGA SPRINTER         LOCAL: E.B.I. DE ARRIFES
ASSOCIACAO: AASM/SDSM
TEMPO: Nublado c/ vento
DIA: 22 de Março de 2006

file 3:

AASM
ESTADO TEMPO: Nublado/Ventoso c/ alguma chuva
DATA: 19 de Novembro de 2005
1º Triatlo Técnico + P. de Preparação
C. D. DAS LARANJEIRAS

There are thousands of files, with multiple fields per file, and each field can have one or more values per file, so extracting the data by hand is out of the question.

    
asked by anonymous 09.12.2015 / 15:22

1 answer

For this purpose I created the package MassTextExtractor. To use it, install it with pip from the command line:

sudo pip install MassTextExtractor

An example of its use for the LOCAL (venue) and PROVA (event) fields of the sample files would be:

from MassTextExtractor import TextsParser

# flag the lines that hold the PROVA (event) field
file_dirs = ["./ficheiro_1.txt", "./ficheiro_2.txt", "./ficheiro_3.txt"]
flags = ["Triatlo", "PROVA:"]
prova = TextsParser(file_dirs, flags)

# strip the label from the line
prova.switchers = [("PROVA:", "")]
prova.switch_texts_field_lines()

# split the line and keep one part
prova.breakers = [("LOCAL", 0)]
prova.break_texts_field_lines()


# flag the lines that hold the LOCAL (venue) field
file_dirs = ["./ficheiro_1.txt", "./ficheiro_2.txt", "./ficheiro_3.txt"]
flags = ["LARANJEIRAS", "LOCAL:"]
local = TextsParser(file_dirs, flags)

# split the line and keep one part
local.breakers = [("LOCAL:", 1)]
local.break_texts_field_lines()

# strip the label from the line
local.switchers = [("LOCAL:", "")]
local.switch_texts_field_lines()


# the parenthesized form works as a statement in Python 2 and a call in Python 3
print(prova.return_texts_field_lines())
print(local.return_texts_field_lines())

It may seem overly fussy, but I believe it can be quite useful as a last resort for getting data out of large amounts of semi-structured text.
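
To make the flag, break and switch steps concrete without the package, here is a minimal standard-library sketch of the same idea, applied to the texts of file 2 and file 3 from the question. It is not MassTextExtractor's actual implementation: the function names (flag_lines, break_line, switch_line, extract) are hypothetical, and the meaning given to the breaker index (0 keeps the text before the marker, 1 keeps the marker and what follows) is an assumption.

```python
# Standard-library sketch of "flag, break, switch" field extraction.
# All function names are hypothetical; this is NOT the MassTextExtractor API.

FILE_2 = (
    "PROVA: MEGA SPRINTER         LOCAL: E.B.I. DE ARRIFES\n"
    "ASSOCIACAO: AASM/SDSM\n"
    "TEMPO: Nublado c/ vento\n"
    "DIA: 22 de Março de 2006\n"
)
FILE_3 = (
    "AASM\n"
    "ESTADO TEMPO: Nublado/Ventoso c/ alguma chuva\n"
    "DATA: 19 de Novembro de 2005\n"
    "1º Triatlo Técnico + P. de Preparação\n"
    "C. D. DAS LARANJEIRAS\n"
)


def flag_lines(text, flags):
    """Return the lines of text that contain at least one flag string."""
    return [ln for ln in text.splitlines() if any(f in ln for f in flags)]


def break_line(line, marker, keep):
    """Cut line at the first occurrence of marker.

    Assumed semantics: keep == 0 returns the text before the marker,
    keep == 1 returns the marker and everything after it.
    Lines without the marker are returned unchanged.
    """
    idx = line.find(marker)
    if idx == -1:
        return line
    return line[:idx] if keep == 0 else line[idx:]


def switch_line(line, switchers):
    """Apply each (old, new) replacement, then trim surrounding whitespace."""
    for old, new in switchers:
        line = line.replace(old, new)
    return line.strip()


def extract(text, flags, breakers=(), switchers=()):
    """Flag the candidate lines, cut them, then clean up what is left."""
    values = []
    for line in flag_lines(text, flags):
        for marker, keep in breakers:
            line = break_line(line, marker, keep)
        values.append(switch_line(line, switchers))
    return values


texts = [FILE_2, FILE_3]
prova = [extract(t, ["Triatlo", "PROVA:"], [("LOCAL", 0)], [("PROVA:", "")])
         for t in texts]
local = [extract(t, ["LARANJEIRAS", "LOCAL:"], [("LOCAL:", 1)], [("LOCAL:", "")])
         for t in texts]

print(prova)
print(local)
```

Each field gets its own list of flags because some files label the value (PROVA:, LOCAL:) while others only contain it as bare text, so a word known to occur in the value (Triatlo, LARANJEIRAS) has to serve as the flag instead.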

    
09.12.2015 / 15:22