Pandas: Create Multiple Columns in the DataFrame

1

I have a spreadsheet with the following information:

    NUM_LEGISLACAO  DSC_URL    ...      COD_TIPO   DSC_TIPO
0         323/1895      NaN    ...           2.0   CONCRETO
1       2.269/1896      NaN    ...           2.0   CONCRETO
2       2.397/1896      NaN    ...           2.0   CONCRETO
3       2.380/1896      NaN    ...           2.0   CONCRETO
4       2.489/1897      NaN    ...           2.0   CONCRETO
5       2.997/1898      NaN    ...           2.0   CONCRETO

And I developed this code, with the goal of creating two more columns with separate numbers and years:

def truncus01():
    xlsx = pd.ExcelFile(file)
    df = xlsx.parse(xlsx.sheet_names[0])

    print(len(df.NUM_LEGISLACAO))
    print(df.columns)
#    print(df[df['NUM_LEGISLACAO']])

    #df['NUM'], df['ANO'] = (df.NUM_LEGISLACAO)
    for i in df.NUM_LEGISLACAO:
        df['NUM'] = i.split('/')[0]
        df['ANO'] = i.split('/')[1]
    print(df)

But I did not succeed, year and number come back, as shown below:

    NUM_LEGISLACAO  DSC_URL  COD_SITUACAO  ...    DSC_TIPO   NUM   ANO
0         323/1895      NaN          11.0  ...    CONCRETO  5475  1905
1       2.269/1896      NaN           2.0  ...    CONCRETO  5475  1905
2       2.397/1896      NaN           2.0  ...    CONCRETO  5475  1905
265    90.396/1984      NaN          11.0  ...   NORMATIVO  5475  1905
266    90.804/1985      NaN           2.0  ...    CONCRETO  5475  1905
267     81195/1978      NaN           NaN  ...         NaN  5475  1905
268      5475/1905      NaN           NaN  ...         NaN  5475  1905

[269 rows x 8 columns]

What's the right way to do it?

    
asked by anonymous 27.05.2018 / 00:23

1 answer

1

The problem of repetition happens because, within looping :

for i in df.NUM_LEGISLACAO:
        df['NUM'] = i.split('/')[0]
        df['ANO'] = i.split('/')[1]

At each iteration, the NUM and ANO columns are filled (fully) with the current cycle value ( i ), "killing" the previous update, so at the end of execution these columns will contain only the last values read.

One possible way to create the two columns is through the command pandas.Series.str.split() , which will split the data NUM_LEGISLACAO through the / separator and store the results in the NUM and ANO columns, respectively:

dt['NUM'], dt['ANO'] = dt.NUM_LEGISLACAO.str.split('/').str

Example (with question data):

dt
Out[30]:
  NUM_LEGISLACAO  COD_TIPO  DSC_TIPO
0       323/1895       2.0  CONCRETO
1     2.269/1896       2.0  CONCRETO
2     2.397/1896       2.0  CONCRETO
3     2.380/1896       2.0  CONCRETO
4     2.489/1897       2.0  CONCRETO
5     2.997/1898       2.0  CONCRETO

# AQUI: Cria as 2 colunas
dt['NUM'], dt['ANO'] = dt.NUM_LEGISLACAO.str.split('/').str

dt
Out[32]:
  NUM_LEGISLACAO  COD_TIPO  DSC_TIPO    NUM   ANO
0       323/1895       2.0  CONCRETO    323  1895
1     2.269/1896       2.0  CONCRETO  2.269  1896
2     2.397/1896       2.0  CONCRETO  2.397  1896
3     2.380/1896       2.0  CONCRETO  2.380  1896
4     2.489/1897       2.0  CONCRETO  2.489  1897
5     2.997/1898       2.0  CONCRETO  2.997  1898
    
27.05.2018 / 03:25