NLP parsing with external word lists


Parsing: an input text is passed through the grammar, and the output is every entry the grammar finds in the text. The problem is that the words for my non-terminals are stored in external list files, and I cannot see how to make that work.

Example in pseudo-code:

1) Open a text

2) Pass the grammar (just an example):

grammar("""
S -> NP VP
NP -> DET N
VP -> V N
DET -> tt_list
N -> nt.txt
V -> list.txt
""")

3) Print the parts of the text that obey the grammar

For example:

import nltk

with open("corpus_risque.txt", "r") as f:
    texte = f.read()

    grammar = nltk.parse_cfg("""
    S -> NP VP
    NP -> DET N
    VP -> V N
    DET -> lista_det.txt
    N -> lista_n.txt
    V -> lista.txt""")

    parser = nltk.ChartParser(grammar)
    parsed = parser.parse(texte)

Normally, grammars are written out in full, with the words already inside them:

grammar = nltk.parse_cfg("""
S -> NP VP
NNP -> 'Python'
VBZ -> 'is'
DT -> 'a'
JJ -> 'good'
NN -> 'programming' | 'language' | 'research'
IN -> 'for'
""")

Is it possible to do the same with external lists?

asked by anonymous 31.08.2017 / 16:40

1 answer


As written, what you want cannot work. A rule such as "DET -> lista_det.txt" creates a terminal node: the parser will look for the literal token lista_det.txt in the input, not for the words listed in that file. Try creating a .cfg or .fcfg file that holds the grammar rules and the lexical entries in separate sections, and load it from a script; that will be easier.

For example, I put some grammar rules and lexical items annotated with a few features in a file called tester.fcfg, and wrote a script to load it.

The script contains:

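from nltk import parse

# Read one sentence (or a single word) from the user
sent = input('Type a sentence or word: ')

# Load the grammar file; trace=2 prints the parsing steps
cp = parse.load_parser('tester.fcfg', trace=2)

# Whitespace tokenization is enough for this small lexicon
tokens = sent.split()

# Print every tree the grammar accepts for the input
trees = cp.parse(tokens)
for tree in trees:
    print(tree)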
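The script parses a single sentence typed at the prompt. To run a parser over a whole text file, as the question asks, something like the sketch below might work. It assumes a naive sentence split on ".", "!" and "?" and whitespace tokenization; split_sentences and parse_text are illustrative names, not NLTK functions. It also relies on NLTK chart parsers raising ValueError when a sentence contains a word the grammar does not cover.

```python
import re


def split_sentences(text):
    """Naively split text on '.', '!' or '?' (an assumption, not NLTK's tokenizer)."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def parse_text(parser, text):
    """Return (sentence, trees) pairs for the sentences the grammar accepts.

    `parser` is any object with a parse(tokens) method that yields trees,
    such as the object returned by nltk.parse.load_parser.
    """
    results = []
    for sentence in split_sentences(text):
        tokens = sentence.split()
        try:
            trees = list(parser.parse(tokens))
        except ValueError:
            # Raised by NLTK chart parsers when a token is missing from
            # the grammar's lexicon; such sentences are skipped here.
            continue
        if trees:
            results.append((sentence, trees))
    return results


# Possible usage (needs NLTK and a grammar file such as tester.fcfg):
# from nltk import parse
# cp = parse.load_parser('tester.fcfg')
# with open('corpus_risque.txt', encoding='utf-8') as f:
#     for sentence, trees in parse_text(cp, f.read()):
#         print(sentence, trees[0])
```

Sentences with no parse are dropped silently in this sketch; logging them instead may be more useful while the lexicon is still growing.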


And in the tester.fcfg file:

## Grammar rules ##

Sentence -> SD[AGR=?a] SV[AGR=?a]
Sentence -> SD[AGR=?a]
Sentence -> SV[AGR=?a]
Sentence -> Nome
Sentence -> Verbo
Sentence -> PP[AGR=?a]
Sentence -> Pro[AGR=?a] 
Sentence -> Pro[AGR=?a] SV[AGR=?a]
Sentence -> P[AGR=?a]
Sentence -> P[AGR=?a] N[AGR=?a] | P N
Sentence -> VBar
Sentence -> SD SV

SN[AGR=?a] -> SD[AGR=?a] | N[AGR=?a] | SD[AGR=?a] PP[AGR=?a] | N[AGR=?a]

SD[AGR=?a] -> Det[AGR=?a] N[AGR=?a] | Det[AGR=?a] | PP[AGR=?a] N[AGR=?a] | Det N

PP[AGR=?a] -> P[AGR=?a] SN[AGR=?a]

SV[AGR=?a] -> V[AGR=?a] SN[AGR=?a] | V[AGR=?a] PP[AGR=?a] SN[AGR=?a] | VBar

VBar -> Pro[AGR=?a] SV[AGR=?a] | Pro[AGR=?a] V[AGR=?a]

Nome -> N

Verbo -> V

## Lexical features ##

Det[AGR=[NUM='sg', GND='f'],CAT =[Cat='Artigo']] -> 'a' | 'da' | 'na'

Det[AGR=[NUM='pl', GND='f'], CAT =[Cat='Artigo']] -> 'as' | 'nas'

Det[AGR=[NUM='sg', GND='m'], CAT =[Cat='Artigo']]-> 'o' | 'de' | 'no' | 'um'

Det[AGR=[NUM='pl', GND='m'], CAT =[Cat='Artigo']]-> 'os' | 'nos'

Pro[AGR=[NUM='sg', GND='m', PERS='3']]-> 'ele'

Pro[AGR=[NUM='sg', GND='m', PERS='1']]-> 'eu'

P[AGR=[NUM='sg', GND='m', PERS='3'], CAT =[Cat= 'Pronome', SubCat= Demonstrativo]] -> 'este' | 'aquele' | 'esse'

P[AGR=[NUM='pl', GND='m', PERS='3']] -> 'estes' | 'aqueles' | 'esses'

P[AGR=[NUM='sg', GND='f', PERS='3']] -> 'esta' | 'aquela' | 'essa'

P[AGR=[NUM='pl', GND='f', PERS='3']] -> 'estas' | 'aquelas' | 'essas'

N[AGR=[NUM='sg', GND='f'], CAT =[Cat='Substantivo', SubCAT='Comum']] -> 'biblioteca' | 'doutora' | 'leoa' | 'livraria' | 'professora' | 'lavadeira' | 'aluna' | 'madre' | 'menina' | 'mae' | 'mulher' | 'dentista' | 'juiza'

N[AGR=[NUM='pl', GND='f'], CAT =[Cat='Substantivo', SubCAT='Comum']]-> 'doutoras' |  'meninas' | 'mulheres' | 'juizas' | 'bola' | 'pata'

N[AGR=[NUM='sg', GND='m'],CAT =[Cat='Substantivo', SubCAT='Comum']] -> 'menino' | 'homem' | 'juiz' | 'doutor' | 'professor' | 'livro' | 'carro' | 'jogador'

N[AGR=[NUM='sg', GND='m'], SEMANTICA=[ ANI='animal']]-> 'pato' | 'cachorro' | 'gato'

N[AGR=[NUM='sg', GND='m'],CAT =['Substantivo Proprio'], SEMANTICA=[ ANI='humano']]-> 'Pedro' | 'Carlos' | 'Henrique'

N[AGR=[NUM='sg', GND='f'], CAT =['Substantivo Proprio'], SEMANTICA=[ ANI='humano']]-> 'Maria' | 'Veronica' | 'Lara' | 'Carla'

N[AGR=[NUM='pl', GND='m']] ->  'meninos' | 'homens' | 'livros' | 'carros'

N[AGR=[NUM='sg', GND='n']] ->  'estudante' | 'piloto' | 'presidente' | 'jornalista' | 'jogadora' | 'jornal'

N[AGR=[NUM='pl', GND='n']] -> 'estudantes' | 'pilotos' | 'presidentes' | 'jornalistas'

V[AGR=[NUM='sg'], CAT =['Verbo'], CP=['presente do indicativo']] -> 'comprar' | 'compra' | 'comprou' | 'pegar' | 'pegou' | 'ler' | 'leu' | 'ama' | 'amo' | 'amar' | 'jogar' | 'entrou' | 'amor'

V[AGR=[NUM='sg'], CAT =[Cat='Verbo', SubCat = ' Ligacao e adicao'], CP=['presente do indicativo']] -> 'e'  


Note that the script loads both the lexical items and the grammar rules from the same file. The real question is which linguistic model you are following (here, features organized as an AVM, an attribute-value matrix) and what kind of computational implementation you want.
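If the word lists have to stay in separate files, one workaround that this answer does not cover is to read each list and expand it into quoted terminals before handing the grammar text to NLTK. The sketch below assumes one word per line in each file and no single quotes inside the words. The names lexical_rule, read_words, build_grammar_text and make_parser are hypothetical helpers; nltk.CFG.fromstring and nltk.ChartParser are the NLTK 3 calls used to load the result.

```python
from pathlib import Path

# Structural rules from the question; lexical rules are generated below.
RULES = """\
S -> NP VP
NP -> DET N
VP -> V N
"""


def lexical_rule(symbol, words):
    """Build one rule such as: DET -> 'o' | 'a' (words must not contain ')."""
    alternatives = " | ".join("'{}'".format(w) for w in words)
    return "{} -> {}".format(symbol, alternatives)


def read_words(path):
    """Read one word per line, skipping blank lines (an assumed file layout)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_grammar_text(rules, word_files):
    """Append one lexical rule per (symbol, file path) pair to the rules."""
    lexical = [lexical_rule(sym, read_words(p)) for sym, p in word_files.items()]
    return rules + "\n".join(lexical) + "\n"


def make_parser(grammar_text):
    """Turn the generated text into an NLTK chart parser (needs NLTK 3)."""
    import nltk  # imported here so the helpers above work without NLTK
    return nltk.ChartParser(nltk.CFG.fromstring(grammar_text))
```

With the file names from the question, build_grammar_text(RULES, {"DET": "lista_det.txt", "N": "lista_n.txt", "V": "lista.txt"}) would produce text that make_parser can load. The same idea could be used to write the lexical section of a .fcfg file, although the feature annotations would then have to come from somewhere other than a plain word list.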

I am not sure this is exactly what you need, but from what I can see you are trying to do several things at once: build more than one corpus, tag word forms, and parse. See the NLTK documentation and a few books on the subject for more help.

13.10.2017 / 23:03