I am trying to apply a sentence-embedding method, [InferSent][1], which provides semantic representations of sentences. It is trained on natural language inference data and generalizes well to many different tasks.
The process looks like this:
Create a vocabulary from the training data and use this vocabulary to train the model. Once the model is trained, pass a sentence to the encoder function, which returns a 4096-dimensional vector regardless of the number of words in the sentence.
However, in my case there are 130,319 questions to encode, so the encoder function takes a very long time. I wonder if there is any way to speed it up.
# Load model
import torch
from models import InferSent

MODEL_PATH = '../new_models/infersent1.pkl'
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))
W2V_PATH = 'InferSent/dataset/GloVe/glove.840B.300d.txt'
model.set_w2v_path(W2V_PATH)
model.build_vocab_k_words(K=100000)
# encode questions (df is built in the "Reproducible Example" section below)
questions = list(df["question"])
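As a quick sanity check of the setup above (this snippet is only illustrative, it just reuses the first question), encoding a single sentence should return one 4096-dimensional vector, as described earlier:

# Encode a single question and check the output shape
emb = model.encode([questions[0]], tokenize=True)
print(emb.shape)  # expected: (1, 4096), i.e. one 4096-dimensional vector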
This is where I encode the questions. I added a timer to keep track of the time taken:
import time

dict_embeddings = {}
t0 = time.time()
for i in range(len(questions)):
    if i % 1000 == 0:
        t1 = time.time()
        total = t1 - t0
        print("encoding number ", i, " time since beginning:", total)
    dict_embeddings[questions[i]] = model.encode([questions[i]], tokenize=True)
So, for the first 1000 iterations, this gives me:
encoding number 0 time since beginning: 0.00016880035400390625
encoding number 1000 time since beginning: 228.6366264820099
That is roughly 0.23 seconds per question, so with about 130,000 questions I estimate around 8 hours and 14 minutes in total, which is far too long. That is why I am looking to optimize this loop.
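For reference, the alternative I am considering is a single batched call instead of the Python loop; the InferSent README shows encode taking a whole list of sentences together with a bsize argument, so a sketch (the bsize value is chosen arbitrarily) would be:

# Sketch: encode all questions in one batched call rather than one at a time.
# Assumes model.encode accepts a list plus bsize/verbose arguments, as shown
# in the InferSent README; it returns one embedding row per input sentence.
embeddings = model.encode(questions, bsize=128, tokenize=True, verbose=True)
# Note: each value stored here is a 1-D vector of length 4096, whereas the
# loop above stored arrays of shape (1, 4096).
dict_embeddings = {q: emb for q, emb in zip(questions, embeddings)}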
Reproducible Example
To make a reproducible example, here is how I obtained the sentences I use:
# load json file
import pandas as pd
from textblob import TextBlob

train = pd.read_json("data/train-v2.0.json")

# add data to a dataframe
contexts = []
questions = []
answers_text = []
answers_start = []
for i in range(train.shape[0]):
    topic = train.iloc[i, 0]['paragraphs']
    for sub_para in topic:
        for q_a in sub_para['qas']:
            questions.append(q_a['question'])
            if q_a['answers']:
                answers_start.append(q_a['answers'][0]['answer_start'])
                answers_text.append(q_a['answers'][0]['text'])
            elif q_a['plausible_answers']:
                answers_start.append(q_a['plausible_answers'][0]['answer_start'])
                answers_text.append(q_a['plausible_answers'][0]['text'])
            contexts.append(sub_para['context'])
df = pd.DataFrame({"context": contexts, "question": questions,
                   "answer_start": answers_start, "text": answers_text})

# save the data to a csv file
df.to_csv("data/train.csv", index=None)

# collect the unique contexts and split them into sentences
paras = list(df["context"].drop_duplicates().reset_index(drop=True))
blob = TextBlob(" ".join(paras))
sentences = [item.raw for item in blob.sentences]
You can find the data in the [GitHub repository of the Stanford Question Answering Dataset][2].
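For completeness, a sketch of how these sentences could feed the encoder vocabulary directly, instead of the top 100000 GloVe words used above; this assumes build_vocab is available on the InferSent model (it appears alongside build_vocab_k_words in the facebookresearch/InferSent repository):

# Sketch (assumption): build the word vocabulary from the SQuAD sentences and
# questions instead of calling build_vocab_k_words(K=100000) as done earlier.
model.build_vocab(sentences + questions, tokenize=True)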