NLP: Bag of Words Model
This article introduces the Bag of Words (BoW) model, a common model in NLP, and shows how to use it together with cosine similarity to calculate the similarity between sentences.
First, let’s look at what the Bag of Words model is. Consider two simple sentences as an example:
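As a working assumption, take two short English sentences like the following (chosen to be consistent with the token counts and vectors worked out later in this article):

```python
sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."
```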
In NLP, it is rarely practical to work with whole paragraphs or sentences at once, so the first step is usually to split the text into sentences and then tokenize each sentence into words. Since we already start from individual sentences here, only tokenization is needed. For English text, the word_tokenize function from NLTK can be used; for Chinese text, the jieba module can be used instead. The tokenization code is as follows:
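A minimal sketch using NLTK's word_tokenize, assuming the example sentences above and that NLTK and its tokenizer data are installed:

```python
from nltk import word_tokenize

sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

# Split each sentence into word and punctuation tokens
tokens1 = word_tokenize(sent1)
tokens2 = word_tokenize(sent2)

print(tokens1)
print(tokens2)
```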
The output would be:
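With the sentences assumed above, word_tokenize separates the words from the commas and periods:

```
['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.']
['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']
```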
Tokenization is complete. The next step involves constructing a corpus, which consists of all words and punctuation appearing in the sentences. The code for creating the corpus is as follows:
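One way to build it, assuming the token lists tokens1 and tokens2 from the sketch above, is to take the union of the two token sets:

```python
# The corpus: every distinct word and punctuation mark in the two sentences
corpus = set(tokens1).union(set(tokens2))
print(corpus)
```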
The output would be a set containing all unique words and punctuation:
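One possible result (Python sets are unordered, so the element order may differ):

```
{'love', 'like', 'running', 'sky', ',', 'I', 'reading', 'sea', '.'}
```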
The next step is to assign a number to each word and punctuation mark in the corpus; this numbering is what the vector representation of the sentences will be built on. The code for creating the mapping is as follows:
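A minimal sketch, assuming the corpus set built above; because it enumerates a set, the exact number assigned to each token can vary from run to run:

```python
# Assign each token in the corpus an integer index
corpus_dict = dict(zip(corpus, range(len(corpus))))
print(corpus_dict)
```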
The output would be a dictionary mapping each unique word or punctuation to a numerical value:
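One possible mapping, chosen here to be consistent with the indices used in the rest of this article (your numbers may differ):

```
{'love': 0, 'like': 1, 'running': 2, 'sky': 3, ',': 4, 'I': 5, 'reading': 6, 'sea': 7, '.': 8}
```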
Although the words and punctuation are not numbered in their order of appearance, this has no effect on the vector representation of the sentences or on the similarity computed later.
The next crucial step in the Bag of Words model is to build a vector representation for each sentence. Rather than recording only a 0 or 1 for whether a word or punctuation mark is present, it uses the number of times each token appears as its value. Using the corpus dictionary created above, the code for the vector representation of the sentences is as follows:
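A minimal sketch of such a vectorizer, assuming the token lists and corpus_dict from above; each sentence becomes a list of (index, count) pairs, one pair per corpus token:

```python
def vector_rep(tokens, corpus_dict):
    """Represent a tokenized sentence as (index, count) pairs over the corpus."""
    vec = []
    for key in corpus_dict.keys():
        # How often does this corpus token occur in the sentence?
        vec.append((corpus_dict[key], tokens.count(key)))
    # Sort by index so both sentences share the same ordering
    return sorted(vec, key=lambda x: x[0])

vec1 = vector_rep(tokens1, corpus_dict)
vec2 = vector_rep(tokens2, corpus_dict)
print(vec1)
print(vec2)
```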
The output gives, for each corpus index, the number of times the corresponding token appears in the sentence:
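With the mapping assumed above:

```
[(0, 2), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 0), (7, 1), (8, 1)]
[(0, 1), (1, 1), (2, 1), (3, 0), (4, 1), (5, 2), (6, 1), (7, 0), (8, 1)]
```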
Let’s pause for a moment and look at this output. In the first sentence, “I” appears twice, and in the corpus dictionary “I” is mapped to the number 5, so the list for the first sentence contains the tuple (5, 2): the token with index 5 (“I”) occurs twice. Reading off just the counts, the true vector representations of the two sentences are:
[2, 0, 0, 1, 1, 2, 0, 1, 1]
[1, 1, 1, 0, 1, 2, 1, 0, 1]
Now, the Bag of Words model is complete. Next, we’ll utilize this model, specifically the vector representations of the two sentences, to calculate their similarity.
In NLP, when two sentences are represented as vectors, cosine similarity is often chosen as the measure of similarity. The cosine similarity of two vectors is the cosine of the angle between them, which equals their dot product divided by the product of their lengths. The Python code for calculating cosine similarity is as follows:
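A minimal sketch, assuming the (index, count) vectors vec1 and vec2 produced above; it divides the dot product of the counts by the product of the vector lengths:

```python
from math import sqrt

def cosine_similarity(vec1, vec2):
    """Cosine similarity of two sentence vectors given as (index, count) pairs."""
    dot_product = 0
    squared_length1 = 0
    squared_length2 = 0
    for (_, count1), (_, count2) in zip(vec1, vec2):
        dot_product += count1 * count2
        squared_length1 += count1 ** 2
        squared_length2 += count2 ** 2
    return dot_product / sqrt(squared_length1 * squared_length2)

print("The cosine similarity of the two sentences is: %.4f" % cosine_similarity(vec1, vec2))
```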
The output would be:
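For the vectors above, the dot product is 8 and the squared lengths are 12 and 10, so the result is 8 / sqrt(120):

```
The cosine similarity of the two sentences is: 0.7303
```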
Thus, we’ve obtained the similarity between sentences using the Bag of Words model.
However, in practical NLP projects, sentence similarity is usually computed with an existing library such as the gensim module. Below is code that uses gensim to calculate the similarity between the two sentences:
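A minimal sketch, assuming the token lists tokens1 and tokens2 from above; it uses gensim's corpora.Dictionary for the token-to-id mapping and a MatrixSimilarity index for the cosine similarity, which is one reasonable way to do it with gensim:

```python
from gensim import corpora, similarities

texts = [tokens1, tokens2]

# Build the token -> id mapping and the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Build a cosine-similarity index over the corpus
index = similarities.MatrixSimilarity(corpus, num_features=len(dictionary))

# Query with the first sentence; entry 1 is its similarity to the second sentence
sims = index[dictionary.doc2bow(tokens1)]
print("The cosine similarity of the two sentences is: %.4f" % sims[1])
```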
The output would be:
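Up to floating-point precision, this agrees with the hand-written version:

```
The cosine similarity of the two sentences is: 0.7303
```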
Thank you for reading!