I'm trying to apply a sentence embedding algorithm to a dataset, but one column of the output contains only zeros. This is done in the context of reproducing this text understanding tutorial. I have a dataset of questions in train['question']
and sentences in train['sentences']
. When I apply the sentence embedding algorithm to them to get a 1D vector, every entry of train['sentences']
seems to be transformed into zeros:
>>> train.head(3)
answer_start context question text sentences target sent_emb quest_emb
0 269 Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... When did Beyonce start becoming popular? in the late 1990s [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ... 1 [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... [[0.0045933574, 0.045667633, 0.052930944, 0.02...
1 207 Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... What areas did Beyonce compete in when she was... singing and dancing [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ... 1 [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... [[-5.123895e-05, 0.062714845, 0.055081595, 0.0...
2 526 Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... When did Beyonce leave Destiny's Child and bec... 2003 [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ... 3 [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... [[0.0045933574, 0.060728118, 0.059565082, 0.03...
In fact, as you can see on the right-hand side of the dataframe, sent_emb
is filled with zeros, while quest_emb
seems to have been computed correctly.
The embedding was done with this code:
def process_data(train):
    print("step 1")
    train['sentences'] = train['context'].apply(
        lambda x: [item.raw for item in TextBlob(x).sentences])
    print("step 2")
    train["target"] = train.apply(get_target, axis=1)
    print("step 3")
    train['sent_emb'] = train['sentences'].apply(
        lambda x: [dict_emb[item][0] if item in dict_emb
                   else np.zeros(4096) for item in x])
    print("step 4")
    train['quest_emb'] = train['question'].apply(
        lambda x: dict_emb[x] if x in dict_emb else np.zeros(4096))
    return train
process_data(train)
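To see where the lookups go wrong, a quick coverage check helps. This is a minimal sketch with a toy stand-in for dict_emb (the helper name `coverage` is mine, not from the tutorial):

```python
import numpy as np

# Toy stand-in for dict_emb: exact sentence string -> embedding array.
toy_emb = {"The cat sat.": np.ones((1, 4096), dtype=np.float32)}

def coverage(sentences, emb_dict):
    # Count how many sentences are found as exact keys in the dictionary.
    hits = sum(1 for s in sentences if s in emb_dict)
    return hits, len(sentences)

# An exact match hits; the same sentence with a trailing space misses.
print(coverage(["The cat sat.", "The cat sat. "], toy_emb))  # (1, 2)
```

Running this kind of check on the real train['sentences'] column would show whether any sentence at all is found in dict_emb.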
As you can imagine, it looks as if dict_emb[item][0] if item in dict_emb
was applied to train['quest_emb']
while else np.zeros(4096)
was applied to train['sent_emb']
, as if x was never in dict_emb
. dict_emb
is the embedding dictionary that I built with:
with open("data/dict_embeddings1.pickle", "rb") as f:
d1 = pickle.load(f)
with open("data/dict_embeddings2.pickle", "rb") as f:
d2 = pickle.load(f)
dict_emb = dict(d1)
dict_emb.update(d2)
And these pickles were created from this notebook create_emb.ipynb to build the embeddings dictionary.
I used less data than in the full dataset because otherwise it crashes. Could the problem come from there, or from the creation of the embeddings?
Update
The problem comes from the third step, the sentence embedding:
train['sent_emb'] = train['sentences'].apply(
lambda x: [dict_emb[item][0] if item in dict_emb
else np.zeros(4096) for item in x])
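The all-zeros column is exactly what that comprehension produces when no sentence matches a key, which is easy to reproduce with a toy dictionary (a minimal sketch; the names are illustrative):

```python
import numpy as np

# Minimal reproduction: when no sentence is an exact key of the
# dictionary, every entry falls through to the np.zeros fallback.
toy_emb = {"known sentence": np.array([[1.0, 2.0]])}

sentences = ["unknown one", "unknown two"]
sent_emb = [toy_emb[s][0] if s in toy_emb else np.zeros(2)
            for s in sentences]

print(all((v == 0).all() for v in sent_emb))  # True: every row is zeros
```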
No item
is ever found in dict_emb
. Here is an excerpt from dict_emb
:
{'What event was Frédéric a part of when he arrived in Paris during the later part of September in 1831?': array([[0.00812027, 0.0661487 , 0.05848939, ..., 0.02172186, 0.085614 ,
0.04505331]], dtype=float32), 'To whom did Beyonce credit as her major influence on her music?': array([[ 0.01196026, 0.07206462, 0.0604387 , ..., -0.00673536,
0.08809125, 0.04786895]], dtype=float32), 'Who was the first female to achieve the International Artist Award at the American Music Awards?': array([[0.00737114, 0.05858064, 0.04078764, ..., 0.02477051, 0.06046902,
0.06636532]], dtype=float32),...
and here is the first item tested:
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.
As for item
, the lookup seems to have worked fine for the question embeddings. I suspect that the embedding dictionary itself is defective.
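One thing worth ruling out (an assumption on my part, not something I have confirmed): the lookup uses exact string keys, so any whitespace or Unicode-normalization difference between the TextBlob sentences and the dictionary keys makes `item in dict_emb` return False even when the text looks identical:

```python
import unicodedata

# Exact-string lookup is fragile: the same visible text can be stored
# in composed (NFC) or decomposed (NFD) Unicode form.
key = "Beyoncé is a singer."              # 'é' as one code point (U+00E9)
candidate = "Beyonce\u0301 is a singer."  # 'e' + combining acute accent

print(candidate == key)                                # False
print(unicodedata.normalize("NFC", candidate) == key)  # True
```

If this is the cause, normalizing both the dictionary keys and the sentences with the same form before the lookup should fix the zeros.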