Operands could not be broadcast together with shapes that appear to be the same


I tried to calculate the cosine similarity between two columns of a dataframe, following this tutorial:

train["diff"] = (train["quest_emb"] - train["sent_emb"])**2
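For reference, a minimal sketch (with made-up 3-dimensional vectors, not the real embeddings) of what this line computes per pair of rows: the element-wise squared differences, whose sum is the squared Euclidean distance between the two vectors.

```python
import numpy as np

# Element-wise squared difference between two small stand-in vectors;
# summing it gives the squared Euclidean distance between the embeddings.
q = np.array([1.0, 2.0, 3.0])  # stands in for one quest_emb vector
s = np.array([0.0, 2.0, 5.0])  # stands in for one sent_emb vector
diff = (q - s) ** 2
print(diff)          # [1. 0. 4.]
print(np.sum(diff))  # 5.0
```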

However, when computing it, I get a dimension error involving the embedding vectors that come from GloVe.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
   1032             with np.errstate(all='ignore'):
-> 1033                 return na_op(lvalues, rvalues)
   1034         except Exception:

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in na_op(x, y)
   1011         try:
-> 1012             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
   1013         except TypeError:

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    204     if use_numexpr:
--> 205         return _evaluate(op, op_str, a, b, **eval_kwargs)
    206     return _evaluate_standard(op, op_str, a, b)

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     64     with np.errstate(all='ignore'):
---> 65         return op(a, b)
     66 

ValueError: operands could not be broadcast together with shapes (1,4096) (7,) 

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-81-af28fc11a9d3> in <module>()
----> 1 predicted = predictions(train)

<ipython-input-80-1699cf33d87c> in predictions(train)
      2 
      3     train["cosine_sim"] = train.apply(cosine_sim, axis = 1)
----> 4     train["diff"] = (train["quest_emb"] - train["sent_emb"])**2
      5     train["euclidean_dis"] = train["diff"].apply(lambda x: list(np.sum(x, axis = 1)))
      6     del train["diff"]

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in wrapper(left, right)
   1067             rvalues = rvalues.values
   1068 
-> 1069         result = safe_na_op(lvalues, rvalues)
   1070         return construct_result(left, result,
   1071                                 index=left.index, name=res_name, dtype=None)

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
   1035             if is_object_dtype(lvalues):
   1036                 return libalgos.arrmap_object(lvalues,
-> 1037                                               lambda x: op(x, rvalues))
   1038             raise
   1039 

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.arrmap_object()

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in <lambda>(x)
   1035             if is_object_dtype(lvalues):
   1036                 return libalgos.arrmap_object(lvalues,
-> 1037                                               lambda x: op(x, rvalues))
   1038             raise
   1039 

    ValueError: operands could not be broadcast together with shapes (1,4096) (130318,) 

4096 is the dimension of the vectors produced by the encoder function, and 130318 is the number of rows in the dataframe.
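The first of the two shape pairs can be reproduced with plain NumPy (a sketch with dummy arrays, nothing from the real data): broadcasting cannot align a trailing dimension of 4096 with one of 7.

```python
import numpy as np

# Dummy arrays with the shapes from the traceback: a (1, 4096) embedding
# against a 1-D object array of length 7. NumPy cannot align 4096 with 7.
a = np.zeros((1, 4096))
b = np.empty(7, dtype=object)
try:
    a - b
except ValueError as e:
    print(e)  # operands could not be broadcast together with shapes (1,4096) (7,)
```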

So I get this ValueError, twice, because the operands could not be broadcast together with their shapes. However, the lengths look the same:

print("len(train[\"quest_emb\"])",len(train["quest_emb"]))
print("len(train[\"sent_emb\"])",len(train["sent_emb"]))

len(train["quest_emb"]) 130318
len(train["sent_emb"]) 130318
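len() only counts rows, though; it says nothing about the shape of the object each row holds. A sketch with small made-up data (the column names are the ones from the question, the shapes are scaled down) shows how to surface the per-row shapes instead:

```python
import numpy as np
import pandas as pd

# Two columns of equal length but with differently shaped elements:
# each quest_emb row is a (1, 5) array, each sent_emb row is a list of
# four 1-D arrays, which np.shape reports as (4, 5).
df = pd.DataFrame({
    "quest_emb": [np.zeros((1, 5)) for _ in range(3)],
    "sent_emb": [[np.zeros(5)] * 4 for _ in range(3)],
})
print(len(df["quest_emb"]), len(df["sent_emb"]))  # 3 3 -- same lengths
print(df["quest_emb"].map(np.shape).iloc[0])      # (1, 5)
print(df["sent_emb"].map(np.shape).iloc[0])       # (4, 5) -- mismatch
```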

The two columns look like this:

    sent_emb                                            quest_emb
0   [[0.030376578, 0.044331014, 0.081356354, 0.062...   [[0.01491953, 0.021973763, 0.021364095, 0.0393...
1   [[0.030376578, 0.044331014, 0.081356354, 0.062...   [[0.04444952, 0.028005758, 0.030357722, 0.0375...
2   [[0.030376578, 0.044331014, 0.081356354, 0.062...   [[0.03949683, 0.04509903, 0.018089347, 0.07667...
   ...

Having verified that the two columns are the same size, and without knowing where the other numbers come from: **why does the dimension of the embedding vector come into play when I subtract the vectors representing two texts?**

The goal is to do unsupervised learning. The complete (though not up-to-date) code is on GitHub.

Update:

When I inspect a row of each column, I realize that their structures are not the same:

>>> print(train["quest_emb"][0])
[[0.01491953 0.02197376 0.02136409 ... 0.01360919 0.03114151 0.03259924]]

>>> print(train["sent_emb"][0])
[array([0.03037658, 0.04433101, 0.08135635, ..., 0.06764812, 0.04971079,
   0.02240689], dtype=float32), array([0.05260669, 0.04548098, 0.0382337 , ..., 0.04823414, 0.07656007,
   0.03501297], dtype=float32), array([0.0502927 , 0.04480611, 0.02038252, ..., 0.03942193, 0.03132772,
   0.04595207], dtype=float32), array([0.06769167, 0.03393815, 0.0625218 , ..., 0.05555448, 0.03059104,
   0.03422254], dtype=float32)]
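One way out, sketched on small stand-in data (none of these numbers or column values are the real embeddings): give both columns the same structure, a plain 2-D array per row, and subtract row by row, so broadcasting pairs (1, d) with (n_sentences, d) instead of pairing one array against the whole 130318-row Series.

```python
import numpy as np
import pandas as pd

# Stand-in data with the same *structure* as the question's columns:
# quest_emb rows are (1, 5) arrays, sent_emb rows are lists of 1-D arrays.
df = pd.DataFrame({
    "quest_emb": [np.random.rand(1, 5) for _ in range(3)],
    "sent_emb": [[np.random.rand(5) for _ in range(4)] for _ in range(3)],
})
# Stack each list of 1-D arrays into one (4, 5) array so both columns match.
df["sent_emb"] = df["sent_emb"].map(np.vstack)
# Subtract row by row: (1, 5) - (4, 5) broadcasts cleanly to (4, 5).
df["diff"] = pd.Series(
    [(q - s) ** 2 for q, s in zip(df["quest_emb"], df["sent_emb"])],
    index=df.index,
)
df["euclidean_dis"] = df["diff"].map(lambda x: np.sum(x, axis=1))
print(df["diff"].iloc[0].shape)           # (4, 5)
print(df["euclidean_dis"].iloc[0].shape)  # (4,) -- one distance per sentence
```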
    
asked by anonymous 15.08.2018 / 23:31

0 answers