Count elements of a column


How can I count the number of occurrences of each value in each column?

File:

luz NC  luz
mas ADV más
blanquita   ADJ blanco
que CQUE    que
las ART el
que CQUE    que
traía   VLfin   traer
de  PREP    de
serie   NC  serie
mi  PPO mi|mío
coche   NC  coche

Script:

from collections import Counter

with open ("corpus_TreeTagger.txt", "r") as f:
    texte = f.read()
    colunas = texte.split("\n")

    def frequencia(colunas):
        for linhas in colunas:
            lexema = linhas.split('\t')[0]
            pos = linhas.split('\t')[1]
            lema = linhas.split('\t')[2]

        return Counter(lexema)
        return Counter(pos)
        return Counter(lema)

print(frequencia(colunas))

Error:

Traceback (most recent call last):
  File "FINALV2.py", line 72, in <module>
    print(frequencia(colunas))
  File "FINALV2.py", line 23, in frequencia
    pos = linhas.split('\t')[1]
IndexError: list index out of range

Could anyone help me?

    
asked by anonymous 30.08.2017 / 15:40

1 answer


[TL; DR]

Pandas

Now that I understand the file format (although I am not sure I fully understood the goal), here is a pandas-based version that counts the occurrences of each value in each column.

First, let's simulate the file. To make things easier, I added a header line that names the columns; this is easy to do in a production system as well.

import io 
import pandas as pd

# Simulating a tab-separated txt file
s = '''
Palavra\tEtiqueta\tLema
luz\tNC\tluz
mas\tADV\tmás
blanquita\tADJ\tblanco
que\tCQUE\tque
las\tART\tel
que\tCQUE\tque
traía\tVLfin\ttraer
de\tPREP\tde
serie\tNC\tserie
mi\tPPO\tmi|mío
coche\tNC\tcoche
'''

Now let's read the file into a pandas DataFrame:

# reading the file into a dataframe
df = pd.read_csv(io.StringIO(s), sep='\t')
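
If adding a header line to the real file is not convenient, the column names can be supplied at read time instead. This is a minimal sketch, assuming the corpus has exactly three tab-separated fields per line; the two sample rows and the variable names `raw` and `df_raw` are only for illustration, and the column labels are the ones used in the simulated string above.

```python
import io
import pandas as pd

# Two sample rows in the original headerless layout (tab-separated).
raw = "luz\tNC\tluz\nmas\tADV\tmás\n"

# header=None stops pandas from treating the first row as column names;
# names= supplies the labels instead.
df_raw = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["Palavra", "Etiqueta", "Lema"],
)
print(df_raw)
```

With the real corpus, the `io.StringIO(raw)` argument would be replaced by the file path (for example "corpus_TreeTagger.txt" from the question).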

Displaying the dataframe:

df
Out[15]: 
      Palavra Etiqueta    Lema
0         luz       NC     luz
1         mas      ADV     más
2   blanquita      ADJ  blanco
3         que     CQUE     que
4         las      ART      el
5         que     CQUE     que
6       traía    VLfin   traer
7          de     PREP      de
8       serie       NC   serie
9          mi      PPO  mi|mío
10      coche       NC   coche

Now let's group by the Palavra column and display how many times each word occurs in that column across the whole table:

df.groupby('Palavra').count()

           Etiqueta  Lema
Palavra                  
blanquita         1     1
coche             1     1
de                1     1
las               1     1
luz               1     1
mas               1     1
mi                1     1
que               2     2
serie             1     1
traía             1     1
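
When only the counts for a single column are needed, `Series.value_counts()` is a possible alternative to `groupby(...).count()`: it returns one count per distinct value, sorted from most to least frequent. The small dataframe `df_demo` below is a made-up subset of the sample data, used only to keep the sketch self-contained.

```python
import pandas as pd

# Made-up subset of the sample data, for illustration only.
df_demo = pd.DataFrame({
    "Palavra": ["luz", "que", "las", "que"],
    "Etiqueta": ["NC", "CQUE", "ART", "CQUE"],
})

# One count per distinct word, most frequent first.
counts = df_demo["Palavra"].value_counts()
print(counts)
```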

Grouping by the Etiqueta column and displaying how many times each tag occurs in that column:

df.groupby('Etiqueta').count()

          Palavra  Lema
Etiqueta               
ADJ             1     1
ADV             1     1
ART             1     1
CQUE            2     2
NC              3     3
PPO             1     1
PREP            1     1
VLfin           1     1

Finally, grouping by the Lema column and displaying how many times each lemma occurs in the whole table:

df.groupby('Lema').count()

        Palavra  Etiqueta
Lema                     
blanco        1         1
coche         1         1
de            1         1
el            1         1
luz           1         1
mi|mío        1         1
más           1         1
que           2         2
serie         1         1
traer         1         1
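
For completeness, here is a sketch of how the original Counter-based script could be repaired without pandas. The IndexError most likely comes from a line with no tab in it, such as the empty string that split("\n") produces after the final newline, or a blank line in the file. Two other details in the script also matter: only the first of several consecutive return statements ever runs, and Counter(lexema) on a single string counts its characters, not words. The function name `count_columns` and the sample data are hypothetical.

```python
from collections import Counter

def count_columns(lines):
    """Return one Counter per column: (words, tags, lemmas)."""
    words, tags, lemmas = Counter(), Counter(), Counter()
    for line in lines:
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines instead of crashing
        word, tag, lemma = fields
        words[word] += 1
        tags[tag] += 1
        lemmas[lemma] += 1
    # A single return carrying all three counters.
    return words, tags, lemmas

# Sample data with a trailing newline, which yields an empty last line.
sample = "luz\tNC\tluz\nque\tCQUE\tque\nque\tCQUE\tque\n"
words, tags, lemmas = count_columns(sample.split("\n"))
print(words.most_common())
print(tags.most_common())
print(lemmas.most_common())
```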


    
30.08.2017 / 18:42