Problems with incorrect data in a dataset (using Pandas)

Question

Problems with incorrect data in a dataset (using Pandas)

Navigation

#1 by (2 votes)

3

I have a dataset called Auto.csv, which has the form:

mpg cylinders   displacement    horsepower  weight  acceleration    year    origin  name
18  8   307 130 3504    12  70  1   chevrolet chevelle malibu
15  8   350 165 3693    11.5    70  1   buick skylark 320
18  8   318 150 3436    11  70  1   plymouth satellite
16  8   304 150 3433    12  70  1   amc rebel sst
17  8   302 140 3449    10.5    70  1   ford torino
15  8   429 198 4341    10  70  1   ford galaxie 500
14  8   454 220 4354    9   70  1   chevrolet impala
14  8   440 215 4312    8.5 70  1   plymouth fury iii
14  8   455 225 4425    10  70  1   pontiac catalina
?   8   390 190 3850    8.5 70  1   amc ambassador dpl
15  8   383 170 3563    10  70  1   dodge challenger se
14  8   340 160 3609    8   70  1   plymouth 'cuda 340
15  8   400 150 3761    9.5 70  1   chevrolet monte carlo
14  8   455 225 3086    10  70  1   buick estate wagon (sw)
24  4   113 95  2372    15  70  3   toyota corona mark ii
22  6   198 95  2833    15.5    70  1   plymouth duster
18  6   199 97  2774    15.5    70  1   amc hornet
21  6   200 85  2587    16  70  1   ford maverick

Imagine that the dataset has 1million lines, and possibly somewhere there is a character in place of a number, that is, by an error, somewhere in the column mpg that should contain only numbers there is a strange character like '?'

import pandas as pd
import numpy as np

carros = pd.read_csv('Auto.csv',sep =',')
print(carros['mpg'])

How can I test if the mpg column contains only numbers, and if not, return a new Pandas dataframe with the row removed?

Dataset source: link

python-3.x pandas

asked by anonymous 28.05.2018 / 00:41

1 answer

What is the difference between the prefer-dist and prefer-source options in Composer? Place If no Where

score 2 · Accepted Answer

A secure way to test whether the value of the line is numeric (for question data) is to define a function that attempts to convert to int and checks for error:

def e_numero(x):
    try:
        # tenta converter para inteiro
        int(x)
        # retorna verdadeiro se conseguir...
        return True
    except ValueError:
        # ... ou falso, caso não consiga
        return False

After defining the function, to test for rows with incorrect value in the column:

dt[dt['mpg'].apply(lambda x: not(e_numero(x)))]

And to generate the new DataFrame without the incorrect rows:

dt_novo = dt[dt['mpg'].apply(lambda x: e_numero(x))]

Here is an example session:

# Dados
dt
Out[54]:
   mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                       name
0   18          8           307         130    3504          12.0    70       1  chevrolet chevelle malibu
1   15          8           350         165    3693          11.5    70       1          buick skylark 320
2   18          8           318         150    3436          11.0    70       1         plymouth satellite
3   16          8           304         150    3433          12.0    70       1              amc rebel sst
4   17          8           302         140    3449          10.5    70       1                ford torino
5   15          8           429         198    4341          10.0    70       1           ford galaxie 500
6   14          8           454         220    4354           9.0    70       1           chevrolet impala
7   14          8           440         215    4312           8.5    70       1          plymouth fury iii
8   14          8           455         225    4425          10.0    70       1           pontiac catalina
9    ?          8           390         190    3850           8.5    70       1         amc ambassador dpl
10  15          8           383         170    3563          10.0    70       1        dodge challenger se
11  14          8           340         160    3609           8.0    70       1         plymouth 'cuda 340
12  15          8           400         150    3761           9.5    70       1      chevrolet monte carlo
13  14          8           455         225    3086          10.0    70       1    buick estate wagon (sw)
14  24          4           113          95    2372          15.0    70       3      toyota corona mark ii
15  22          6           198          95    2833          15.5    70       1            plymouth duster
16  18          6           199          97    2774          15.5    70       1                 amc hornet
17  21          6           200          85    2587          16.0    70       1              ford maverick

# AQUI => Lista as linhas com valores não numéricos na coluna mpg
dt[dt['mpg'].apply(lambda x: not(e_numero(x)))]
Out[55]:
  mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                name
9   ?          8           390         190    3850           8.5    70       1  amc ambassador dpl

# AQUI => Cria o DataFrame dt_novo sem as linhas com valores incorretos
dt_novo = dt[dt['mpg'].apply(lambda x: e_numero(x))]

dt_novo
Out[57]:
   mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin                       name
0   18          8           307         130    3504          12.0    70       1  chevrolet chevelle malibu
1   15          8           350         165    3693          11.5    70       1          buick skylark 320
2   18          8           318         150    3436          11.0    70       1         plymouth satellite
3   16          8           304         150    3433          12.0    70       1              amc rebel sst
4   17          8           302         140    3449          10.5    70       1                ford torino
5   15          8           429         198    4341          10.0    70       1           ford galaxie 500
6   14          8           454         220    4354           9.0    70       1           chevrolet impala
7   14          8           440         215    4312           8.5    70       1          plymouth fury iii
8   14          8           455         225    4425          10.0    70       1           pontiac catalina
10  15          8           383         170    3563          10.0    70       1        dodge challenger se
11  14          8           340         160    3609           8.0    70       1         plymouth 'cuda 340
12  15          8           400         150    3761           9.5    70       1      chevrolet monte carlo
13  14          8           455         225    3086          10.0    70       1    buick estate wagon (sw)
14  24          4           113          95    2372          15.0    70       3      toyota corona mark ii
15  22          6           198          95    2833          15.5    70       1            plymouth duster
16  18          6           199          97    2774          15.5    70       1                 amc hornet
17  21          6           200          85    2587          16.0    70       1              ford maverick