I tried to calculate the cosine similarity between two columns of a DataFrame, following this tutorial:
train["diff"] = (train["quest_emb"] - train["sent_emb"])**2
However, when computing it, it looks like I have a dimension error with the embedding vector that comes from GloVe.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
1032 with np.errstate(all='ignore'):
-> 1033 return na_op(lvalues, rvalues)
1034 except Exception:
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in na_op(x, y)
1011 try:
-> 1012 result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
1013 except TypeError:
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/computation/expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
204 if use_numexpr:
--> 205 return _evaluate(op, op_str, a, b, **eval_kwargs)
206 return _evaluate_standard(op, op_str, a, b)
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/computation/expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
64 with np.errstate(all='ignore'):
---> 65 return op(a, b)
66
ValueError: operands could not be broadcast together with shapes (1,4096) (7,)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-81-af28fc11a9d3> in <module>()
----> 1 predicted = predictions(train)
<ipython-input-80-1699cf33d87c> in predictions(train)
2
3 train["cosine_sim"] = train.apply(cosine_sim, axis = 1)
----> 4 train["diff"] = (train["quest_emb"] - train["sent_emb"])**2
5 train["euclidean_dis"] = train["diff"].apply(lambda x: list(np.sum(x, axis = 1)))
6 del train["diff"]
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in wrapper(left, right)
1067 rvalues = rvalues.values
1068
-> 1069 result = safe_na_op(lvalues, rvalues)
1070 return construct_result(left, result,
1071 index=left.index, name=res_name, dtype=None)
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
1035 if is_object_dtype(lvalues):
1036 return libalgos.arrmap_object(lvalues,
-> 1037 lambda x: op(x, rvalues))
1038 raise
1039
pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.arrmap_object()
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/ops.py in <lambda>(x)
1035 if is_object_dtype(lvalues):
1036 return libalgos.arrmap_object(lvalues,
-> 1037 lambda x: op(x, rvalues))
1038 raise
1039
ValueError: operands could not be broadcast together with shapes (1,4096) (130318,)
4096 corresponds to the dimension of the vectors constructed by the encoder function, and 130318 corresponds to the number of rows in the DataFrame.
So I get this ValueError (twice, with different shapes) because the operands cannot be broadcast together. However, the two columns seem to have the same length:
print("len(train[\"quest_emb\"])",len(train["quest_emb"]))
print("len(train[\"sent_emb\"])",len(train["sent_emb"]))
len(train["quest_emb"]) 130318
len(train["sent_emb"]) 130318
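For reference, this is how I relate the numbers in the error message to my data (just a quick check; np is numpy, and the column names are the ones shown above):
import numpy as np
print(np.shape(train["quest_emb"][0]))  # one cell of the embedding column, e.g. (1, 4096)
print(train.shape[0])                   # number of rows in the DataFrame, 130318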
The two columns look like this:
sent_emb quest_emb
0 [[0.030376578, 0.044331014, 0.081356354, 0.062... [[0.01491953, 0.021973763, 0.021364095, 0.0393...
1 [[0.030376578, 0.044331014, 0.081356354, 0.062... [[0.04444952, 0.028005758, 0.030357722, 0.0375...
2 [[0.030376578, 0.044331014, 0.081356354, 0.062... [[0.03949683, 0.04509903, 0.018089347, 0.07667...
...
Having verified that the two columns have the same length, and without really knowing where the other numbers come from, **why does the dimension of the embedding vector come into play when I subtract the vectors representing the two texts?**
The goal is to do unsupervised learning. The complete code (not up to date) is on GitHub.
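To be clear about what I expect the subtraction to do per row, here is a toy sketch with random arrays standing in for my real embeddings (the 4096 dimension is taken from the error message; nothing else here is my actual data):
import numpy as np
# toy stand-ins for one row: a question embedding and a sentence embedding of the same dimension
q = np.random.rand(1, 4096).astype(np.float32)
s = np.random.rand(1, 4096).astype(np.float32)
diff = (q - s) ** 2            # element-wise squared difference, shape (1, 4096)
dist = np.sum(diff, axis=1)    # what the euclidean_dis step then sums, shape (1,)
print(diff.shape, dist.shape)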
Update:
When I look at what a row of each column actually contains, I realize that the structure is not the same:
>>> print(train["quest_emb"][0])
[[0.01491953 0.02197376 0.02136409 ... 0.01360919 0.03114151 0.03259924]]
>>> print(train["sent_emb"][0])
[array([0.03037658, 0.04433101, 0.08135635, ..., 0.06764812, 0.04971079,
0.02240689], dtype=float32), array([0.05260669, 0.04548098, 0.0382337 , ..., 0.04823414, 0.07656007,
0.03501297], dtype=float32), array([0.0502927 , 0.04480611, 0.02038252, ..., 0.03942193, 0.03132772,
0.04595207], dtype=float32), array([0.06769167, 0.03393815, 0.0625218 , ..., 0.05555448, 0.03059104,
0.03422254], dtype=float32)]
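So, to make the mismatch explicit, a quick check of the types and shapes in the first row (np is numpy):
import numpy as np
print(type(train["quest_emb"][0]), np.shape(train["quest_emb"][0]))
# a numpy array of shape (1, 4096)
print(type(train["sent_emb"][0]), len(train["sent_emb"][0]), train["sent_emb"][0][0].shape)
# a plain Python list holding several (4096,)-shaped float32 arrays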