How to sum values from a csv using Python?

I have a CSV file similar to this one, with information on all municipalities in Brazil (shortened here for brevity):

ESTADO,MUNICIPIO,HABITANTES,AREA
AC,ACRELÂNDIA,12538,1807.92
AC,ASSIS BRASIL,6072,4974.18
AC,BRASILÉIA,21398,3916.5
AC,BUJARI,8471,3034.87
AL,BATALHA,17076,320.92
AL,BELÉM,4551,48.63
AM,BARCELOS,25718,122476.12
AM,BARREIRINHA,27355,5750.57
AM,BENJAMIN CONSTANT,33411,8793.42

I'm trying to sum only the inhabitants of the northern region, in this case the states AC and AM. For this I used the code below (Python 3.6.5):

import csv

populacao = 0
arquivo = open('brasil.csv', encoding='utf8')
for registro in csv.reader(arquivo):
    habitantes = registro[2]
    estado = registro[0]
    if habitantes != 'habitantes':
        if estado != 'estado':
            regiao_norte = ['AC', 'AM']
            for estado in regiao_norte:
                populacao += int(habitantes)
print(populacao)

I get 381511598 as the sum, which is clearly incorrect. I thought looping over the list would act as a filter for the states I wanted to sum. I can't see what I'm missing. How can I compute this sum correctly?

python python-3.x csv

asked by anonymous 03.10.2018 / 23:43

1 answer

score 7 · Accepted Answer

Your mistake, or at least the one that stands out most, is that for each record you loop over every northern state (there are two), so you add that row's inhabitants twice. And this happens on every row (northern or not), because the loop never compares the row's state against the list, so nothing is filtered.
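To see the effect concretely, here is a small sketch that runs the same inner loop over the sample rows from the question. The `amostra` string and the use of `io.StringIO` in place of `brasil.csv` are assumptions made only so the snippet runs on its own:

```python
import csv
import io

# Sample rows copied from the question; io.StringIO stands in for the real file.
amostra = """ESTADO,MUNICIPIO,HABITANTES,AREA
AC,ACRELÂNDIA,12538,1807.92
AC,ASSIS BRASIL,6072,4974.18
AC,BRASILÉIA,21398,3916.5
AC,BUJARI,8471,3034.87
AL,BATALHA,17076,320.92
AL,BELÉM,4551,48.63
AM,BARCELOS,25718,122476.12
AM,BARREIRINHA,27355,5750.57
AM,BENJAMIN CONSTANT,33411,8793.42
"""

regiao_norte = ['AC', 'AM']
leitor = csv.reader(io.StringIO(amostra))
next(leitor)  # skip the header row

soma_total = 0    # plain sum over every row, no filtering
soma_com_bug = 0  # sum produced by the loop from the question
for registro in leitor:
    habitantes = int(registro[2])
    soma_total += habitantes
    # Same shape as the question's loop: it walks the list of states
    # but never checks the row's own state against it.
    for estado in regiao_norte:
        soma_com_bug += habitantes

print(soma_total)    # 156590 (all nine rows)
print(soma_com_bug)  # 313180 (exactly twice soma_total)
```

The buggy total is simply the whole file's population multiplied by the length of `regiao_norte`, which is why the result in the question is so large.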

In most of the cases I see, the csv module is unnecessary, and this is one of them. You can simply do:

regiao_norte = {'AC', 'AM'}
populacao = 0
with open('brasil.csv', encoding='utf8') as f:
    f.readline()  # skip the header row so it is not processed below
    for l in f:  # iterate over each line of the file
        vals = l.rstrip('\n').split(',')  # strip the line break and split on commas
        if vals[0] in regiao_norte:
            populacao += int(vals[2])
print(populacao)  # 134963 for the sample data
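One caveat with splitting by hand: it only works while no field contains a comma. The sketch below uses a made-up row (the quoted municipality name is hypothetical, not taken from the dataset) to show how `str.split` and `csv.reader` differ when a field is quoted:

```python
import csv
import io

# Hypothetical row: the second field contains a comma, so it is quoted.
linha = 'AC,"EXEMPLO, CIDADE",100,1.5\n'

campos_split = linha.rstrip('\n').split(',')
campos_csv = next(csv.reader(io.StringIO(linha)))

print(len(campos_split))  # 5: the quoted field was cut in two
print(len(campos_csv))    # 4: the quoted field was kept whole
print(campos_csv[1])      # EXEMPLO, CIDADE
```

With the manual split, the inhabitants column would no longer be at index 2 for such a row. The sample data in the question has no quoted fields, so both approaches give the same result there.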

With the csv module:

import csv

regiao_norte = {'AC', 'AM'}
with open('brasil.csv', encoding='utf8') as f:
    # the header row is skipped naturally: 'ESTADO' is not in regiao_norte
    populacao = sum(int(vals[2]) for vals in csv.reader(f) if vals[0] in regiao_norte)
print(populacao)  # 134963 for the sample data
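If you would rather refer to columns by name than by position, `csv.DictReader` is another option. A minimal sketch, assuming the header row is exactly as in the sample (`ESTADO`, `HABITANTES`); the `amostra` string replaces the file here only so the snippet runs on its own:

```python
import csv
import io

# Sample rows copied from the question; io.StringIO stands in for the real file.
amostra = """ESTADO,MUNICIPIO,HABITANTES,AREA
AC,ACRELÂNDIA,12538,1807.92
AC,ASSIS BRASIL,6072,4974.18
AC,BRASILÉIA,21398,3916.5
AC,BUJARI,8471,3034.87
AL,BATALHA,17076,320.92
AL,BELÉM,4551,48.63
AM,BARCELOS,25718,122476.12
AM,BARREIRINHA,27355,5750.57
AM,BENJAMIN CONSTANT,33411,8793.42
"""

regiao_norte = {'AC', 'AM'}
# DictReader uses the first row as keys, so each row is a dict by column name.
populacao = sum(
    int(linha['HABITANTES'])
    for linha in csv.DictReader(io.StringIO(amostra))
    if linha['ESTADO'] in regiao_norte
)
print(populacao)  # 134963 for the sample rows
```

With the real file you would pass the open file object to `csv.DictReader` instead of the `io.StringIO` wrapper.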

For heavier work there is also the widely used pandas library. I don't think it is worth it for this case, but here is how it would look if you want to go further with this dataset:

import pandas as pd

df = pd.read_csv('brasil.csv')
df_norte = df.loc[(df['ESTADO'] == 'AC') | (df['ESTADO'] == 'AM')]  # rows where the state is 'AM' or 'AC'
populacao = df_norte['HABITANTES'].sum()  # 134963 for the sample data