Merging function calculating wrongly

3

I'm trying to write a function in python that takes two vectors, deletes duplicates in the first and adds the corresponding data in the second.

I mean trying to emulate the following MySQL query:

SELECT 'Price', sum('Volume') FROM 'Dataset' GROUP BY 'Price'

My code looks like this:

def Group(X, Y):
  temp = dict.fromkeys(X, [0, 0])
  L = range(len(X))

  for i in L:
    print('BEFORE: i=%s, temp=%s, temp[X[i]]=%s, X=%s, X[i]=%s, Y=%s, Y[i]=%s' %(i, temp, temp[X[i]], X, X[i], Y, Y[i]))

    temp[X[i]][0] += Y[i]
    temp[X[i]][1] += 1

    print('AFTER: i=%s, temp=%s, temp[X[i]]=%s, X=%s, X[i]=%s, Y=%s, Y[i]=%s' %(i, temp, temp[X[i]], X, X[i], Y, Y[i]))

  return temp

When calling the function, it should respond like this:

>>> a = [1, 1, 2]
>>> b = [3, 4, 5]
>>> Group(a, b)
BEFORE: i=0, temp={1: [0, 0], 2: [0, 0]}, temp[X[i]]=[0, 0], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=3
AFTER: i=0, temp={1: [3, 1], 2: [0, 0]}, temp[X[i]]=[3, 1], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=3
BEFORE: i=1, temp={1: [3, 1], 2: [0, 0]}, temp[X[i]]=[3, 1], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=4
AFTER: i=1, temp={1: [7, 2], 2: [0, 0]}, temp[X[i]]=[7, 2], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=4
BEFORE: i=2, temp={1: [7, 2], 2: [5, 1]}, temp[X[i]]=[5, 1], X=[1, 1, 2], X[i]=2, Y=[3, 4, 5], Y[i]=5
AFTER: i=2, temp={1: [7, 2], 2: [5, 1]}, temp[X[i]]=[5, 1], X=[1, 1, 2], X[i]=2, Y=[3, 4, 5], Y[i]=5
{1: [7, 2], 2: [5, 1]}

But it's responding like this:

>>> a = [1, 1, 2]
>>> b = [3, 4, 5]
>>> Group(a, b)
BEFORE: i=0, temp={1: [0, 0], 2: [0, 0]}, temp[X[i]]=[0, 0], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=3
AFTER: i=0, temp={1: [3, 1], 2: [3, 1]}, temp[X[i]]=[3, 1], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=3
BEFORE: i=1, temp={1: [3, 1], 2: [3, 1]}, temp[X[i]]=[3, 1], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=4
AFTER: i=1, temp={1: [7, 2], 2: [7, 2]}, temp[X[i]]=[7, 2], X=[1, 1, 2], X[i]=1, Y=[3, 4, 5], Y[i]=4
BEFORE: i=2, temp={1: [7, 2], 2: [7, 2]}, temp[X[i]]=[7, 2], X=[1, 1, 2], X[i]=2, Y=[3, 4, 5], Y[i]=5
AFTER: i=2, temp={1: [12, 3], 2: [12, 3]}, temp[X[i]]=[12, 3], X=[1, 1, 2], X[i]=2, Y=[3, 4, 5], Y[i]=5
{1: [12, 3], 2: [12, 3]}

In the first execution of the loop the function errs, because I want it to add the value in Y[0] only in temp[1][0] , not temp[2][0] also.

Can anyone see what I'm doing wrong?

Thank you in advance!

[RESOLVED]

I want to thank you again for the valuable tips and answers! The solution I chose / found was:

def Group(X, Y, isSum):
    temp = dict.fromkeys(X, (0, 0))
    for i in range(len(X)):
        temp[X[i]] = (temp[X[i]][0] + Y[i], temp[X[i]][1] + 1)
    if not isSum:
        for i in temp:
            temp[i] = temp[i][0] / temp[i][1]
    return temp

And the machine response was:

>>> a = [1, 1, 2, 3, 4]
>>> b = [5, 6, 7, 7, 8]
>>> 
>>> Group(a, b, True)
{1: (11, 2), 2: (7, 1), 3: (7, 1), 4: (8, 1)}
>>> Group(b, a, False)
{8: 4.0, 5: 1.0, 6: 1.0, 7: 2.5}
>>> 

I actually wanted two functions in one, to emulate these two queries:

SELECT 'Price', SUM('Volume') FROM 'Dataset' GROUP BY 'Price' ;
SELECT 'Volume', AVG('Price') FROM 'Dataset' GROUP BY 'Volume' ;

I apologize for having presented only half the problem.

Hugs to all!

[RESOLVED 2]

Here's an add-on, with real problem data:

Dataset = [
    # Tempo, Volume e Preço do par BTCUSD em [https://www.bitfinex.com/]
    (1426502898, 0.01, 292.76),
    (1426503160, 0.01, 292.77),
    (1426503169, 0.0224, 292.77),
    (1426503180, 0.304111, 292.77),
    (1426503188, 0.026, 292.77),
    (1426503194, 0.388, 292.77),
    (1426503287, 0.01, 292.76),
    (1426503287.14286, 0.202, 292.7),
    (1426503287.28571, 0.7082, 292.77),
    (1426503287.42857, 1.12, 292.75),
    (1426503287.57143, 2.57392822, 292.76),
    (1426503287.71429, 3.095, 292.77),
    (1426503287.85714, 7, 292.77),
    (1426503292, 0.22, 292.7),
    (1426503320, 0.5, 292.78),
    (1426503340, 0.5, 292.78),
    (1426503358, 0.5, 292.78),
    (1426503377, 0.01, 292.76),
    (1426503377.5, 0.49, 292.78),
    (1426503378, 0.01, 292.75),
    (1426503384, 0.01, 292.76),
    (1426503385, 0.1, 292.78),
    (1426503394, 0.01, 292.76),
    (1426503395, 0.5, 292.78),
    (1426503399, 0.01, 292.76),
    (1426503402, 0.01, 292.76),
    (1426503407, 0.01, 292.76),
    (1426503414, 0.181549, 292.78),
    (1426503414.5, 0.318451, 292.78),
    (1426503415, 0.01, 292.76)
]

partition = lambda data, pos: [i[pos] for i in data]

def Group(X, Y, isSum):
    temp = dict.fromkeys(X, (0, 0))
    for i in range(len(X)):
        temp[X[i]] = (temp[X[i]][0] + Y[i], temp[X[i]][1] + 1)
    if not isSum:
        for i in temp:
            temp[i] = temp[i][0] / temp[i][1]
    return temp

And the machine response was:

>>> 
>>> Group(partition(Dataset, 2), partition(Dataset, 1), True)
{292.77: (11.553711, 8), 292.78: (3.09, 8), 292.76: (2.6639282199999985, 10), 292.75: (1.1300000000000001, 2), 292.7: (0.42200000000000004, 2)}
>>> Group(partition(Dataset, 1), partition(Dataset, 2), False)
{0.304111: 292.77, 1.12: 292.75, 2.57392822: 292.76, 3.095: 292.77, 0.181549: 292.78, 0.5: 292.78, 7: 292.77, 0.026: 292.77, 0.202: 292.7, 0.01: 292.76000000000005, 0.1: 292.78, 0.388: 292.77, 0.318451: 292.78, 0.49: 292.78, 0.0224: 292.77, 0.22: 292.7, 0.7082: 292.77}
>>> 
    
asked by anonymous 19.03.2015 / 19:23

1 answer

2

The problem with your code is that you are using a changeable object as the default value for your dictionary. When you do:

temp = dict.fromkeys(X, [0, 0])

You expect it to create a [0, 0] different list for each key in X , right? But that's not what happens, it's using the same list for all elements! There, of course, when modifying one of them modifies the others also ...

One way of solving it is by using a tuple, which is immutable:

temp = dict.fromkeys(X, (0, 0))

...

temp[X[i]] = (temp[X[i]][0] + Y[i], temp[X[i]][1] + 1)

Anyway, why not instead use the groupby method instead. of Python?

>>> from itertools import groupby
>>>
>>> a = [1, 1, 2]
>>> b = [3, 4, 5]
>>>
>>> temp = {}
>>> for g,e in groupby(sorted(zip(a,b)), lambda x: x[0]):
...   e = list(e)
...   temp[g] = [sum(x[1] for x in e), len(e)]
...
>>> temp
{1: [7, 2], 2: [5, 1]}

Explaining:

  • The zip(a,b) lists each element of a with its corresponding element of b , i.e. [(1,3), (1,4), (2,5)] ;

    • sorted is required because groupby only works correctly if the list is sorted by the collation key. The default ordering is sufficient because it first compares the first element (the key) and then the second element.
  • The groupby groups this list by the first element; g then is the element of list a , and e is an iterator on pairs grouped;

  • For each grouping, I compute the sum of the items in the b list list in that grouping and the number of those items, associating them with their element g .

19.03.2015 / 20:37