Calculate how many NaN in each Python time

1

Good morning, I have a DF where you have NaN and I need to count how many NaN I have in each time. My df has a period of 31 days. I think I should use Groupby, but I'm not finding a way.

import numpy as np
import pandas as pd

df = pd.read_hdf('./dados.hd5')

df.head()

Year_DoY_Hour          Temperatura
2016-01-01 00:00:00    NaN
2016-01-01 01:00:00    22.445700
2016-01-01 02:00:00    22.388300
2016-01-01 03:00:00    22.400000
2016-01-01 04:00:00    NaN
2016-01-01 05:00:00    22.133900
2016-01-01 06:00:00    21.948999
2016-01-01 07:00:00    21.787901
2016-01-01 08:00:00    21.610300
2016-01-01 09:00:00    NaN
.
.
.
2016-31-01 00:00:00    NaN
2016-31-01 01:00:00    21.310800
2016-31-01 02:00:00    22.910700
2016-31-01 03:00:00    23.810600
2016-31-01 04:00:00    NaN

We can see that in time 00:00:00 we have 2 NaN and in 03:00:00 we have 1 NaN and in 04:00:00 we have 2 NaN

    
asked by anonymous 17.03.2017 / 15:34

2 answers

4

The answer you have is correct, but do not use the Pandas (which is what you use and asked). Here's one that uses the Pandas then. :)

Basically the idea is to do the grouping by setting labels that use the date (year, month and day) and just the time . The rest (minutes and seconds) you ignore. In this way, each grouping will have all the entries of the same time, and then just count the number of nulls ( NaN ). An important issue is that the collation function expects a function (or lambda, as in the case of the example) that will receive the table index. So, it makes sense to convert the date / time column to DateTime first, and then convert it to the table index. So you can process this value directly in the / lambda function.

Here is a sample code (documented step-by-step):

import pandas as pd

# Somente para a leitura dos dados de teste
df = pd.read_csv('teste.csv', sep=',')

# Converte os dados da coluna data/hora para o formato DateTime
df['Year_DoY_Hour'] = pd.to_datetime(df['Year_DoY_Hour'], format='%Y-%d-%m %H:%M:%S')

# Converte a coluna da data/hora para o índice da tabela
df = df.set_index('Year_DoY_Hour')

# Imprime os dados originais
print('Original:')
print('-' * 40)
print(df)

# Agrupa os dados pelo índice em rótulos do tipo "ano-dia-mês hora:00:00"
grouped = df.groupby(
                     lambda index: '{:04d}-{:02d}-{:02d} {:02d}:00:00' \
                     .format(index.year, index.day, index.month, index.hour)
                    )

# Faz a contagem para cada agrupamento e imprime
print('\n')
print('Contagem de NaNs por hora:')
print('-' * 40)
for label, group in grouped:
    # Conta o número de NaNs (nulos) na coluna 'Temperatura' do grupo atual
    numNaNs = len(group[group['Temperatura'].isnull()])
    # Imprime a contagem do grupo (com o respectivo rótulo)
    print('{}: {}'.format(label, numNaNs))

Using as input data a CSV with:

Year_DoY_Hour,Temperatura
2016-01-01 00:00:00,NaN
2016-01-01 00:01:00,NaN
2016-01-01 00:18:22,22.388300
2016-01-01 00:55:00,NaN
2016-01-01 01:00:00,22.445700
2016-01-01 02:00:00,22.388300
2016-01-01 03:00:00,22.400000
2016-01-01 04:00:00,NaN
2016-01-01 05:00:00,22.133900
2016-01-01 06:00:00,21.948999
2016-01-01 07:00:00,21.787901
2016-01-01 08:00:00,21.610300
2016-01-01 09:00:00,NaN
2016-01-01 09:07:18,NaN
2016-01-01 09:44:00,21.310800
2016-31-01 02:00:00,22.910700
2016-31-01 03:00:00,23.810600
2016-31-01 04:00:00,NaN

The result is:

Original:
----------------------------------------
                     Temperatura
Year_DoY_Hour
2016-01-01 00:00:00          NaN
2016-01-01 00:01:00          NaN
2016-01-01 00:18:22    22.388300
2016-01-01 00:55:00          NaN
2016-01-01 01:00:00    22.445700
2016-01-01 02:00:00    22.388300
2016-01-01 03:00:00    22.400000
2016-01-01 04:00:00          NaN
2016-01-01 05:00:00    22.133900
2016-01-01 06:00:00    21.948999
2016-01-01 07:00:00    21.787901
2016-01-01 08:00:00    21.610300
2016-01-01 09:00:00          NaN
2016-01-01 09:07:18          NaN
2016-01-01 09:44:00    21.310800
2016-01-31 02:00:00    22.910700
2016-01-31 03:00:00    23.810600
2016-01-31 04:00:00          NaN


Contagem de NaNs por hora:
----------------------------------------
2016-01-01 00:00:00: 3
2016-01-01 01:00:00: 0
2016-01-01 02:00:00: 0
2016-01-01 03:00:00: 0
2016-01-01 04:00:00: 1
2016-01-01 05:00:00: 0
2016-01-01 06:00:00: 0
2016-01-01 07:00:00: 0
2016-01-01 08:00:00: 0
2016-01-01 09:00:00: 2
2016-31-01 02:00:00: 0
2016-31-01 03:00:00: 0
2016-31-01 04:00:00: 1
    
17.03.2017 / 17:54
2

I do not know what the format really is in hd5 (I researched but I did not understand), if it is what you put instead of doing ...split(',') as I do in the examples below it does ....split(' ') (4 spaces ). The csv format I used for testing is:

2016-01-01 00:00:00,    NaN
2016-01-01 01:00:00,    22.445700
2016-01-01 02:00:00,    22.388300
2016-01-01 03:00:00,    22.400000
2016-01-01 04:00:00,    NaN
2016-01-01 05:00:00,    22.133900
2016-01-01 06:00:00,    21.948999
2016-01-01 07:00:00,    21.787901
...

With groupby you can do this:

from itertools import groupby

with open('tests.csv', 'r') as f:
    dados = [(l.split(',')[0], l.split(',')[1].strip()) for l in f]
print(dados) # [('2016-01-01 00:00:00', 'NaN'), ('2016-01-01 01:00:00', '22.445700'), ('2016-01-01 02:00:00', '22.388300'), ('2016-01-01 03:00:00', '22.400000'), ...]
dados_sort = sorted((k.split()[1], v) for k, v in dados) # importante
for hora, group in groupby(dados_sort, key=lambda x: x[0]):
    group = list(group)
    if any(v == 'NaN' for k, v in group):
        print('Existem {} NaN na hora {}'.format(len(group), hora))

Program output for the data you gave:

  

There are 2 NaN in the hour 00:00:00
There are 2 NaN in the hour 04:00:00
There are   1 NaN at 09:00 AM


But honestly I would not do it this way (unless I really needed it), it would do so:

from collections import Counter

dados = {}
with open('tests.csv', 'r') as f:
    for l in f:
        hora, val = l.split(',') # hora e temperatura, deves ja ter isto devidido por linha no teu caso
        dados.setdefault(val.strip(), []).append(hora.split(' ')[1])
print(dados) # {'22.388300': ['02:00:00'], '23.810600': ['03:00:00'], '21.610300': ['08:00:00'], '22.400000': ['03:00:00'], '21.948999': ['06:00:00'], 'NaN': ['00:00:00', '04:00:00', '09:00:00', '00:00:00', '04:00:00'], '22.910700': ['02:00:00'], '22.445700': ['01:00:00'], '21.787901': ['07:00:00'], '22.133900': ['05:00:00'], '21.310800': ['01:00:00']}
print(Counter(dados['NaN']))
  

{'00:00:00': 2, '04: 00: 00 ': 2, '09: 00: 00': 1}

Or, if you do not need to store the values, you can only:

from collections import Counter

list_NaN = []
with open('tests.csv', 'r') as f:
    for l in f:
        hora, val = l.split(',')
        if val.strip() == 'NaN':
            list_NaN.append(hora.split(' ')[1])
print(Counter(list_NaN))
  

{'00:00:00': 2, '04: 00: 00 ': 2, '09: 00: 00': 1}

    
17.03.2017 / 16:50