How to calculate the median of a large number of values?

2

I have a list with approximately 1.5 million values and I need to define the median of this set. The values are stored in a file, such as string , enclosed in single quotation marks (ex. '155' ).

How can I calculate the median of this amount of values?

Note: I can not use the ready functions, such as min , max , etc.

    
asked by anonymous 09.08.2018 / 20:18

3 answers

1

In this type of problem we have to be very careful with runtime and memory. Working with lists can be a problem. For now I suggest this solution.

1) Read the file line by line and, as you read, already add numbers to a list in an orderly manner

For this, we will use this function that places a number in a list in the ordered position.

def adiciona_na_ordem(lista, tamanho_lista, numero):
    for i in range(0, tamanho_lista):
        if numero < lista[i]:
            break
    else:
        i+=1
    return lista[:i]+[numero]+lista[i:]

>>> lista = [0,1,2,3]
>>> print (adiciona_na_ordem(lista, len(lista), 10))
[0, 1, 2, 3, 10]

And read the file line by line:

entrada = open('dados.txt', 'r')

lista_ordenada = [int(entrada.readline())] #Para não inicializar a lista vazia
num_lidos = 1 #Evita usar len(lista_ordenada)
for linha in entrada:
    numero = int(linha)
    lista_ordenada = adiciona_na_ordem(lista_ordenada, num_lidos, numero)
    num_lidos += 1

entrada.close()

An example would be:

para uma entrada:
7
7
1
4
4
5
6
7

>>> print (lista_ordenada)
[1, 4, 4, 5, 6, 7, 7, 7]

2) Paste the median

if num_lidos % 2 == 1:
    mediana = lista_ordenada[num_lidos//2]
else:
    mediana = (lista_ordenada[num_lidos//2 -1]+lista_ordenada[num_lidos//2]) / 2
    
10.08.2018 / 05:29
1

Well, the information is a bit vague and I do not know the format of the file where that data is, so I can give you just an idea of what to do.

1) Read the file and put the values in a list (if the file is .csv will make it much easier).

2) Transforms them into intergers using int () .

3) Use the sorted () function to sort the list, example here .

4) Calculate the total size of the list using the len () function.

5) Get the full size of the list and calculate the rest of the division of that number by 2 using the % sign.

6) If you are even, you average the core elements, if the median is the center element of your list.

    
10.08.2018 / 00:58
1

I would like to suggest another approach. If the numbers are integers or with a few decimal places, many of them can be repeated, so we can count the number of occurrences of each number one with a dictionary. So we do not use that much memory.

1) We read the file line by line and make a dictionary with the occurrences while maintaining a list with the ordered keys (we will use it in the future)

Function that places a number in a list in the sorted position:

def adiciona_na_ordem(lista, tamanho_lista, numero):
    for i in range(0, tamanho_lista):
        if numero < lista[i]:
            break
    else:
        i+=1
    return lista[:i]+[numero]+lista[i:]

We read the line by line:

entrada = open('dados.txt', 'r')

primeiro_valor = int(entrada.readline())
ocorrencias = {primeiro_valor:1}
lista_chaves_ordenadas = [primeiro_valor]
num_linhas = 1

for linha in entrada:
    numero = int(linha)
    if numero in ocorrencias: #Se o numero ja esta no dicionario
        ocorrencias[numero] += 1
    else: #Se não está, adiciono em ocorrencias
        ocorrencias[numero] = 1
        #OU ocorrencias.update({numero:1})
        lista_chaves_ordenadas = adiciona_na_ordem(lista_chaves_ordenadas, len(lista_chaves_ordenadas), numero)

    num_linhas += 1

entrada.close()

Example of what the variables would look like:

para uma entrada:
7
7
1
4
4
5
6
7

>>> print (ocorrencias)
{7: 3, 1: 1, 4: 2, 5: 1, 6: 1}
>>> print (lista_chaves_ordenadas)
[1, 4, 5, 6, 7]

2) Now we calculate the median by traversing the dictionary in the ordered order of the keys until the sum of the occurrences reaches half

if num_linhas % 2 == 1: #Buscamos o elemento central
    num_elementos = 0
    for key in lista_chaves_ordenadas:
        num_elementos += ocorrencias[key]
        if num_elementos >= num_linhas/2:
            print (key)
            break

else: #media dos dois elementos centrais
    num_elementos = 0
    mediana = None
    for key in lista_chaves_ordenadas:
        num_elementos += ocorrencias[key]
        #Prox 2 ifs se os valores medianos forem chaves diferentes, ex: 5 e 6
        if num_elementos == num_linhas/2:
            mediana = key
        if num_elementos > num_linhas/2 and mediana != None:
            mediana += key
            print(mediana/2)
            break
        #Prox if se os valores medianos forem a mesma chave, ex: 6 e 6
        if num_elementos > num_linhas/2 and mediana == None:
            mediana = key
            print(key)
            break

Edit: If you use pytho2.7, I think the dictionary is already sorted, so it does not need the list_conditions, just do for key, value in ocorrencias.iteritems(): instead of for key in lista_chaves_ordenadas: . But be careful with the divisions. Make /2.0 not round to integer

    
10.08.2018 / 05:40