Why does my sentence embedding function fill one of the output columns with zeros?


I'm trying to apply a sentence-embedding algorithm to a dataset, but one of the output columns contains only zeros. This is part of reproducing this text understanding tutorial. I have a dataset of questions in train['question'] and sentences in train['sentences']. When I apply the embedding algorithm to them to get a 1D vector, every entry of train['sentences'] seems to be transformed into 0:

>>> train.head(3)
    answer_start            context                             question                                    text sentences                                                          target   sent_emb                                        quest_emb
0   269     Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...   When did Beyonce start becoming popular?    in the late 1990s   [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ...   1   [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...   [[0.0045933574, 0.045667633, 0.052930944, 0.02...
1   207     Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...   What areas did Beyonce compete in when she was...   singing and dancing     [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ...   1   [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...   [[-5.123895e-05, 0.062714845, 0.055081595, 0.0...
2   526     Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...   When did Beyonce leave Destiny's Child and bec...   2003    [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ...   3   [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...   [[0.0045933574, 0.060728118, 0.059565082, 0.03...

In fact, as you can see in the last two columns of the dataframe, sent_emb is all zeros, while quest_emb seems to have been computed correctly.

The embedding was done with this code:

def process_data(train):

    # Split each context into its sentences with TextBlob
    print("step 1")
    train['sentences'] = train['context'].apply(lambda x: [item.raw for item in TextBlob(x).sentences])

    print("step 2")
    train["target"] = train.apply(get_target, axis=1)

    # Look up each sentence in dict_emb; fall back to a zero vector if missing
    print("step 3")
    train['sent_emb'] = train['sentences'].apply(lambda x: [dict_emb[item][0] if item in dict_emb
                                                            else np.zeros(4096) for item in x])
    # Same lookup for each question
    print("step 4")
    train['quest_emb'] = train['question'].apply(lambda x: dict_emb[x] if x in dict_emb else np.zeros(4096))

    return train

process_data(train)
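To narrow this down, it helps to measure how many lookups actually hit the dictionary before assigning the columns. A minimal sketch (the `coverage` helper and the toy data are mine, not from the tutorial):

```python
import numpy as np

def coverage(list_of_lists, emb_dict):
    """Count how many items across all sentence lists are keys of emb_dict."""
    hits = total = 0
    for items in list_of_lists:
        for item in items:
            total += 1
            hits += item in emb_dict
    return hits, total

# Toy illustration, not the real dict_emb: note the trailing space
emb = {"A sentence.": np.zeros((1, 4096))}
sentences = [["A sentence.", "A sentence. "]]
print(coverage(sentences, emb))  # (1, 2): the padded copy misses the key
```

If this reports zero hits for `train['sentences']` but full coverage for `train['question']`, the problem is key mismatch rather than the lookup logic itself.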

As you can imagine, it seems that dict_emb[item][0] if item in dict_emb was applied for train['quest_emb'], while the else np.zeros(4096) branch was taken for train['sent_emb'], as if no item were ever part of dict_emb.

dict_emb is the embedding dictionary that I built with:

with open("data/dict_embeddings1.pickle", "rb") as f:
    d1 = pickle.load(f)
with open("data/dict_embeddings2.pickle", "rb") as f:
    d2 = pickle.load(f)
dict_emb = dict(d1)
dict_emb.update(d2)
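One thing worth checking after this merge is whether the two pickles actually cover different keys, and which half (questions or sentences) made it into the final dictionary. A hedged sketch with toy dictionaries (the helper name is mine):

```python
def merge_and_report(d1, d2):
    """Merge two embedding dicts (d2 wins on conflicts) and count overlap."""
    merged = dict(d1)
    merged.update(d2)
    overlap = len(set(d1) & set(d2))
    return merged, overlap

# Toy example, not the real pickles:
a = {"q1": [0.1], "q2": [0.2]}
b = {"s1": [0.3], "q2": [0.9]}  # "q2" appears in both; d2's value wins
merged, n_dup = merge_and_report(a, b)
print(len(merged), n_dup)  # 3 1
print(merged["q2"])        # [0.9]
```

If one of the two pickles was supposed to hold the sentence embeddings and its keys never show up in the merged dictionary, that would explain the all-zero sent_emb column.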

And these pickles were created with this notebook, create_emb.ipynb, which builds the embeddings dictionary.

I used less data than the full dataset, because otherwise it crashes. Could the problem come from that? Or from the creation of the embeddings?

Update

The problem comes from the third step, the sentence embedding:

train['sent_emb'] = train['sentences'].apply(
    lambda x: [dict_emb[item][0] if item in dict_emb 
               else np.zeros(4096) for item in x])

No item is ever found in dict_emb. Here is an excerpt from dict_emb:

{'What event was Frédéric a part of when he arrived in Paris during the later part of September in 1831?': array([[0.00812027, 0.0661487 , 0.05848939, ..., 0.02172186, 0.085614  ,
        0.04505331]], dtype=float32), 'To whom did Beyonce credit as her major influence on her music?': array([[ 0.01196026,  0.07206462,  0.0604387 , ..., -0.00673536,
         0.08809125,  0.04786895]], dtype=float32), 'Who was the first female to achieve the International Artist Award at the American Music Awards?': array([[0.00737114, 0.05858064, 0.04078764, ..., 0.02477051, 0.06046902,
        0.06636532]], dtype=float32),...

and here is the first item tested:

Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.

As for the lookup itself, it worked fine for the question embeddings, so I suspect the embedding dictionary is defective.
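One hypothesis worth testing is that the sentences are in dict_emb, but stored under slightly different strings (trailing whitespace, or a different unicode normalization of accented characters such as "Beyoncé"). A sketch that looks for near-matching keys, using only the standard library (the `near_keys` helper and toy data are mine):

```python
import difflib
import unicodedata

def near_keys(sentence, emb_dict, n=3):
    """Return dict keys that nearly match `sentence` once both sides are
    NFC-normalized and stripped of surrounding whitespace."""
    norm = lambda s: unicodedata.normalize("NFC", s).strip()
    target = norm(sentence)
    normalized = {norm(k): k for k in emb_dict}
    return [normalized[m] for m in difflib.get_close_matches(target, normalized, n=n)]

# Toy key stored with a trailing space and decomposed (NFD) accent:
emb = {"Beyonce\u0301 is a singer. ": [0.0]}
print(near_keys("Beyoncé is a singer.", emb))  # finds the near-identical key
```

If this returns matches for the failing sentence, the fix is to normalize the keys when building dict_emb (or to normalize the sentences before the lookup), rather than to regenerate the embeddings.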

    
asked by anonymous 14.08.2018 / 11:02

0 answers