Explanation
The bag-of-words model is a simplified representation used in natural language processing and information retrieval (IR). In this model, a text (such as a phrase or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order, but keeping multiplicity.
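To make the multiset idea concrete, here is a minimal sketch in Python. The helper name `bag_of_words`, the lowercasing step, and the two sample sentences are assumptions made for illustration only; they are not part of the model's definition.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return the multiset (word -> count) of the word tokens in text."""
    # Assumption: tokens are runs of word characters, compared case-insensitively.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

# Two sentences with the same words in a different order (hypothetical examples).
bag_a = bag_of_words("the cat chased the dog")
bag_b = bag_of_words("the dog chased the cat")

# Word order is discarded, so both bags compare equal,
# while the multiplicity of "the" is preserved.
same_bag = bag_a == bag_b
count_the = bag_a["the"]
```

The two sentences mean different things, yet they produce the same bag. This is the information the model gives up in exchange for simplicity.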
Implementation Example
The following example models text documents using bag-of-words.
Here are two simple text documents:
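(1) John gosta de assistir filmes. Mary também gosta de filmes. (In English: "John likes to watch movies. Mary also likes movies.")
(2) John também gosta de assistir jogos de futebol. (In English: "John also likes to watch football games.")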
Based on these two text documents, a list is constructed as follows:
[
"John",
"gosta",
"de",
"assistir",
"filmes",
"Mary",
"também",
"futebol",
"jogos"
]
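Given such a list, each document can be turned into a vector of word counts, with one position per list entry. The sketch below is one possible way to do this. The names `tokenize` and `to_vector` are hypothetical, and splitting on word characters is an assumed tokenization rule.

```python
from collections import Counter
import re

documents = [
    "John gosta de assistir filmes. Mary também gosta de filmes.",
    "John também gosta de assistir jogos de futebol.",
]

# The word list from the example, in the same order.
vocabulary = ["John", "gosta", "de", "assistir", "filmes",
              "Mary", "também", "futebol", "jogos"]

def tokenize(text):
    # Assumption: a token is a run of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)

def to_vector(text, vocab):
    """Count how often each vocabulary word occurs in text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vectors = [to_vector(doc, vocabulary) for doc in documents]
# vectors[0] → [1, 2, 2, 1, 2, 1, 1, 0, 0]
# vectors[1] → [1, 1, 2, 1, 0, 0, 1, 1, 1]
```

Each position holds the multiplicity of the corresponding word, so "gosta", "de" and "filmes" each get a count of 2 in the first document.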
It is also common to calculate the frequency of appearance of words:
$$\mathrm{linear}(t_j) = 1 - \frac{d(t_j)}{N}$$
where $t_j$ is the word whose frequency is being computed, $d(t_j)$ is the number of times the word appears, and $N$ is the number of documents or phrases.
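As a rough illustration, the formula can be evaluated on the two example documents. This sketch assumes that d(t_j) is counted as the number of documents in which the word appears. That is only one reading of the definition, chosen here because it keeps the result between 0 and 1. The function name `linear_weight` is hypothetical.

```python
import re

documents = [
    "John gosta de assistir filmes. Mary também gosta de filmes.",
    "John também gosta de assistir jogos de futebol.",
]

def linear_weight(term, docs):
    """Compute 1 - d(t_j)/N for one term over a list of documents."""
    n = len(docs)
    # Assumption: d(t_j) is the number of documents that contain the term.
    d = sum(1 for doc in docs if term in set(re.findall(r"\w+", doc)))
    return 1 - d / n

weights = {term: linear_weight(term, documents)
           for term in ["John", "Mary", "filmes", "de"]}
# weights → {"John": 0.0, "Mary": 0.5, "filmes": 0.5, "de": 0.0}
```

Under this reading, words that occur in every document, such as "John" and "de", get a weight of 0, and words found in only one of the two documents get 0.5.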
Conclusion
Simply put, bag-of-words is a form of text representation. It is commonly used in machine learning, sentiment analysis, chatbots, and topic modeling.
Source: Wikipedia