Filling data gaps (NaN) in Python


Good morning. I have a data frame with air temperature (Tair), global radiation (Rg) and CO2. The CO2 column contains NaN values, and I need to find similar rows elsewhere in the data and use their CO2 values to fill the NaN.

import numpy as np
import pandas as pd

df = pd.read_hdf('./dados.hd5')

df.head()

Year_DoY_Hour          Tair        Rg       CO2
2016-01-01 00:00:00    22.651600   0.000    NaN
2016-01-01 00:30:00    22.445700   0.000    6.43
2016-01-01 01:00:00    22.388300   0.000    5.03
2016-01-01 01:30:00    22.400000   0.000    3.05
2016-01-01 02:00:00    22.257099   0.000    NaN
2016-01-01 02:30:00    22.133900   0.000    2.50
2016-01-01 03:00:00    21.948999   0.000    1.58
2016-01-01 03:30:00    21.787901   0.000    0.89
2016-01-01 04:00:00    21.610300   0.000    1.58
2016-01-01 04:30:00    21.619400   0.000    NaN
    
asked by anonymous 13.03.2017 / 14:16

1 answer


From the description of your problem, it seems to me that you are dealing with a prediction problem: more precisely, repairing the missing values of a data set using the information the data set already contains. This is a common and well-known problem in the data science literature. The usual suggestion is to treat it as an ordinary classification or regression problem in which the target variables are the variables with missing values that you want to fill in.
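
As a rough illustration of the regression framing described above, here is a minimal sketch that treats CO2 as the target and Tair and Rg as predictors. The helper name `fill_by_regression`, the choice of `LinearRegression` and the toy values are all assumptions made for this example (the toy CO2 column is deliberately an exact linear function of the other two); it also assumes the predictor columns themselves have no NaN.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fill_by_regression(frame, target, features):
    """Return a copy of `frame` with NaNs in `target` predicted from `features`.

    Hypothetical helper: assumes the feature columns contain no NaN.
    """
    filled = frame.copy()
    missing = filled[target].isna()
    if missing.any() and (~missing).any():
        # Fit only on rows where the target was actually observed
        model = LinearRegression()
        model.fit(filled.loc[~missing, features], filled.loc[~missing, target])
        # Predict the target for the rows where it is missing
        filled.loc[missing, target] = model.predict(filled.loc[missing, features])
    return filled


# Made-up toy data using the question's column names (not real measurements).
# Here CO2 = 0.5 * Tair - 0.01 * Rg, with two values removed.
df = pd.DataFrame({
    "Tair": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0],
    "Rg": [0.0, 100.0, 50.0, 300.0, 200.0, 10.0],
    "CO2": [10.0, np.nan, 10.5, 8.5, np.nan, 12.4],
})

filled = fill_by_regression(df, target="CO2", features=["Tair", "Rg"])
print(filled)
```

On real flux data the relationship is unlikely to be linear, so the regressor would need to be chosen and validated for the data at hand; the structure (fit on complete rows, predict the incomplete ones) stays the same.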

The literature describes other ways to handle missing values, for example the techniques summarized here . However, since you have already decided to fill the missing values by similarity, this link provides an easy example using the machine learning library Scikit-Learn : it fills the missing values with the column means and then evaluates a Linear Discriminant Analysis model on the completed data. I transcribe the specific part of the code below:

from pandas import read_csv
import numpy
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN (numpy.NaN was removed in NumPy 2.0)
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# fill missing values with mean column values
imputer = SimpleImputer(strategy='mean')
transformed_X = imputer.fit_transform(X)
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
# random_state only has an effect (and is only accepted) when shuffle=True
kfold = KFold(n_splits=3, shuffle=True, random_state=7)
result = cross_val_score(model, transformed_X, y, cv=kfold, scoring='accuracy')
print(result.mean())
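
The transcribed example fills gaps with column means, which does not use row similarity. For a data frame like the one in the question, a similarity-based fill could instead be sketched with scikit-learn's `KNNImputer` (available from scikit-learn 0.22). This is a swapped-in technique, not part of the linked example. The helper name `fill_by_neighbours`, the choice of two neighbours and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def fill_by_neighbours(frame, target, features, n_neighbors=2):
    """Fill NaNs in `target` with the mean target of the most similar rows.

    Hypothetical helper: similarity is the distance over the scaled `features`.
    """
    # Put the features on a common scale so no single column dominates the distance
    scaled = StandardScaler().fit_transform(frame[features])
    matrix = np.column_stack([scaled, frame[target].to_numpy()])
    # For each missing target, average the target of the nearest rows that have one
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(matrix)
    filled = frame.copy()
    filled[target] = imputed[:, -1]
    return filled


# Made-up toy data using the question's column names (not real measurements)
toy = pd.DataFrame({
    "Tair": [22.0, 22.1, 30.0, 30.1, 22.05, 30.05],
    "Rg": [0.0, 0.0, 500.0, 510.0, 0.0, 505.0],
    "CO2": [6.0, 4.0, 1.0, 3.0, np.nan, np.nan],
})

result = fill_by_neighbours(toy, target="CO2", features=["Tair", "Rg"])
print(result)
```

In this toy frame the fifth row resembles the first two (cool, no radiation) and the sixth resembles the third and fourth (warm, high radiation), so each gap is filled with the mean CO2 of its two closest rows. With the real `df`, the same call would be `fill_by_neighbours(df, "CO2", ["Tair", "Rg"])`, and the number of neighbours would need tuning.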
    
17.05.2017 / 08:53