How can I speed up an encoder function?

I am trying to apply a sentence-embedding method, [InferSent][1], which provides semantic representations of sentences. It is trained on natural language inference data and generalizes well to many different tasks.

The process looks like this:

Create a vocabulary from the training data and use this vocabulary to train the model. Once the model is trained, give a sentence as input to the encoder function, which returns a 4096-dimensional vector regardless of the number of words in the sentence.

However, in my case there are 130,319 questions to encode, so the encoder function takes a very long time. I wonder if there is any way to speed it up.

    # Load the pre-trained InferSent model
    import torch
    from models import InferSent

    MODEL_PATH = 'InferSent/encoder/infersent1.pkl'
    params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                    'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}
    model = InferSent(params_model)
    model.load_state_dict(torch.load(MODEL_PATH))

    # Point the model at the GloVe word vectors
    W2V_PATH = 'InferSent/dataset/GloVe/glove.840B.300d.txt'
    model.set_w2v_path(W2V_PATH)

    # Build the vocabulary from the 100k most frequent words
    model.build_vocab_k_words(K=100000)

    # Encode the questions (df is built in the reproducible example below)
    questions = list(df["question"])
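
For reference, a single call on this setup returns a (1, 4096) array (a quick sanity check; the example sentence is my own):

    # One sentence in -> one 4096-dimensional vector out
    emb = model.encode(["What is the capital of France?"], tokenize=True)
    print(emb.shape)  # (1, 4096)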

This is where I encode. I added a timer to track progress:

    import time

    dict_embeddings = {}
    t0 = time.time()
    for i in range(len(questions)):
        if i % 1000 == 0:
            t1 = time.time()
            total = t1 - t0
            print("encoding number ", i, " time since beginning:", total)
        # Encode one question at a time (returns a (1, 4096) array)
        dict_embeddings[questions[i]] = model.encode([questions[i]], tokenize=True)

So, in the first 1000 iterations, this gives me:

    encoding number  0  time since beginning: 0.00016880035400390625
    encoding number  1000  time since beginning: 228.6366264820099

At roughly 228.6 seconds per 1000 questions, that is about 0.23 seconds per question, so with around 130,000 iterations I estimate a total of about 8 hours and 14 minutes. That is far too long, which is why I am looking to optimize this loop.
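
One direction I am considering: the InferSent README shows that encode accepts a whole list of sentences and batches them internally via its bsize parameter, so a single batched call should avoid most of the per-call overhead of the loop above (a sketch, not tested at this scale; bsize=128 and the GPU move are assumptions to tune):

    import torch

    # InferSent is a PyTorch module, so it can be moved to a GPU if available
    if torch.cuda.is_available():
        model = model.cuda()

    # One call over the whole list; InferSent batches internally via bsize
    embeddings = model.encode(questions, bsize=128, tokenize=True, verbose=True)

    # embeddings has shape (len(questions), 4096). Rebuild the dict; note that
    # the values here are 1-D (4096,) arrays, while the loop stored (1, 4096)
    dict_embeddings = {q: emb for q, emb in zip(questions, embeddings)}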

Reproducible Example

To make a reproducible example, here is how I obtained the sentences I use:

    # Load the SQuAD 2.0 training data
    import pandas as pd
    from textblob import TextBlob

    train = pd.read_json("data/train-v2.0.json")

    # Flatten the nested JSON into parallel lists
    contexts = []
    questions = []
    answers_text = []
    answers_start = []
    for i in range(train.shape[0]):
        topic = train.iloc[i, 0]['paragraphs']
        for sub_para in topic:
            for q_a in sub_para['qas']:
                questions.append(q_a['question'])
                if q_a['answers']:
                    answers_start.append(q_a['answers'][0]['answer_start'])
                    answers_text.append(q_a['answers'][0]['text'])
                elif q_a['plausible_answers']:
                    answers_start.append(q_a['plausible_answers'][0]['answer_start'])
                    answers_text.append(q_a['plausible_answers'][0]['text'])
                contexts.append(sub_para['context'])

    df = pd.DataFrame({"context": contexts, "question": questions,
                       "answer_start": answers_start, "text": answers_text})
    # Save the flattened data to a CSV file
    df.to_csv("data/train.csv", index=None)

    # Split the unique contexts into sentences for embedding
    paras = list(df["context"].drop_duplicates().reset_index(drop=True))
    blob = TextBlob(" ".join(paras))
    sentences = [item.raw for item in blob.sentences]
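
As a quick check on the result (the question count matches the figure mentioned above):

    print(len(questions))   # 130319 questions in the SQuAD 2.0 train split
    print(len(sentences))   # number of context sentences to embed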

You can find the data in the [GitHub repository of the Stanford Question Answering Dataset][2].

[1]: link
[2]: link