NLP: Bag of Words Model

This article introduces the common Bag of Words (BoW) model in NLP and shows how to use it, together with cosine similarity, to calculate the similarity between sentences.

First, let’s look at what the Bag of Words model is. The model represents a piece of text simply by the counts of the words (and punctuation) it contains, ignoring grammar and word order. Consider two simple sentences as an example:

sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

In NLP, it is often difficult to process complete paragraphs or sentences at once, so the first step is typically tokenization. For English sentences, the word_tokenize function from NLTK can be used, while for Chinese sentences the jieba module handles word segmentation (a short jieba sketch follows the English example below). Since we only have two short English sentences here, we tokenize them as follows:

from nltk import word_tokenize

sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]

The output would be:

[['I', 'love', 'sky', ',', 'I', 'love', 'sea', '.'], ['I', 'like', 'running', ',', 'I', 'love', 'reading', '.']]
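
For Chinese text, jieba plays the same role (a minimal sketch, assuming the jieba package is installed; the sample sentence and its segmentation are for illustration only):

import jieba

# jieba.lcut returns the segmented tokens as a list
print(jieba.lcut("我爱蓝天，我爱大海。"))
# Roughly: ['我', '爱', '蓝天', '，', '我', '爱', '大海', '。']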

Tokenization is complete. The next step involves constructing a corpus, which consists of all words and punctuation appearing in the sentences. The code for creating the corpus is as follows:

all_list = []
for text in texts:
    all_list += text

corpus = set(all_list)
print(corpus)

The output would be a set containing all unique words and punctuation:

{'love', 'running', 'reading', 'sky', '.', 'I', 'like', 'sea', ','}
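
Incidentally, the same corpus can be built in one line with the standard library's itertools (an equivalent sketch):

from itertools import chain

# Flatten the token lists and deduplicate in one step
corpus = set(chain.from_iterable(texts))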

The following step is to establish a numerical mapping for the words and punctuation in the corpus. This aids in the subsequent vector representation of sentences. The code for creating the mapping is as follows:

corpus_dict = dict(zip(corpus, range(len(corpus))))
print(corpus_dict)

The output would be a dictionary mapping each unique word or punctuation to a numerical value:

{'running': 1, 'reading': 2, 'love': 0, 'sky': 3, '.': 4, 'I': 5, 'like': 6, 'sea': 7, ',': 8}

Note that the words and punctuation are not mapped in their order of appearance, because iterating over a Python set has no guaranteed order. This does not affect the vector representation of the sentences or the similarity computed from it, since any fixed assignment of indices works.
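
If reproducible indices are preferred, the mapping can instead be built from a sorted corpus (a minimal variant of the code above):

# Sorting first makes the word-to-index assignment deterministic across runs
corpus_dict = dict(zip(sorted(corpus), range(len(corpus))))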

The next crucial step in the Bag of Words model is to build vector representations of the sentences. The representation is not merely a 0/1 indicator of whether a word or punctuation mark is present; instead, each position holds the number of times the corresponding token appears in the sentence. Combining this with the corpus dictionary created above, the code for the vector representation of sentences is as follows:

# Establishing vector representation for sentences
def vector_rep(text, corpus_dict):
    vec = []
    for key in corpus_dict.keys():
        if key in text:
            vec.append((corpus_dict[key], text.count(key)))
        else:
            vec.append((corpus_dict[key], 0))

    # Sort by corpus index, then keep only the counts
    vec = sorted(vec, key=lambda x: x[0])
    return [item[1] for item in vec]

vec1 = vector_rep(texts[0], corpus_dict)
vec2 = vector_rep(texts[1], corpus_dict)
print(vec1)
print(vec2)

The output gives the vector representations of the two sentences:

[2, 0, 0, 1, 1, 2, 0, 1, 1]
[1, 1, 1, 0, 1, 2, 1, 0, 1]

Let’s pause for a moment and examine these vectors. In the first sentence, “I” appears twice, and in the corpus dictionary “I” maps to index 5, so the intermediate list contains the tuple (5, 2); after sorting by index and extracting the counts, position 5 of the first vector holds the value 2. Every other position is filled in the same way.
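
We can check this directly with the objects defined above (a quick sanity check):

# The entry at the index assigned to 'I' equals its count in the first sentence
assert vec1[corpus_dict['I']] == texts[0].count('I') == 2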

Now, the Bag of Words model is complete. Next, we’ll utilize this model, specifically the vector representations of the two sentences, to calculate their similarity.

In NLP, when two sentences are represented as vectors, cosine similarity is often chosen as the measure of their similarity: it is the cosine of the angle between the two vectors, equal to 1 when the vectors point in the same direction and 0 when the (non-negative) count vectors share no terms.
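
Concretely, for count vectors $\vec{a}$ and $\vec{b}$:

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\,\|\vec{b}\|} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}$$

The Python code for calculating it is as follows: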

from math import sqrt

def similarity_with_2_sents(vec1, vec2):
    inner_product = 0
    square_length_vec1 = 0
    square_length_vec2 = 0
    for x, y in zip(vec1, vec2):
        inner_product += x * y
        square_length_vec1 += x ** 2
        square_length_vec2 += y ** 2

    # Cosine similarity: dot product divided by the product of the vector lengths
    return inner_product / sqrt(square_length_vec1 * square_length_vec2)

cosine_sim = similarity_with_2_sents(vec1, vec2)
print('The cosine similarity between the two sentences is: %.4f.' % cosine_sim)

The output would be:

The cosine similarity between the two sentences is: 0.7303.
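
We can confirm this by hand: the inner product of the two vectors is 2·1 + 1·1 + 2·2 + 1·1 = 8, their squared lengths are 12 and 10, and 8 / √(12 × 10) ≈ 0.7303, which matches the output.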

Thus, we’ve obtained the similarity between sentences using the Bag of Words model.

However, in practical NLP projects, sentence similarity is usually computed with an off-the-shelf library such as gensim. Below is the code using gensim to calculate the similarity between the two sentences:

from nltk import word_tokenize
from gensim import corpora
from gensim.similarities import Similarity

sent1 = "I love sky, I love sea."
sent2 = "I like running, I love reading."

sents = [sent1, sent2]
texts = [[word for word in word_tokenize(sent)] for sent in sents]

# Building the dictionary that maps each token to an integer id
dictionary = corpora.Dictionary(texts)

# Using doc2bow as the Bag of Words model
corpus = [dictionary.doc2bow(text) for text in texts]
similarity = Similarity('-Similarity-index', corpus, num_features=len(dictionary))

# Querying with sent1 returns its similarity to every indexed sentence;
# index 1 is the second sentence
new_sentence = sent1
test_corpus_1 = dictionary.doc2bow(word_tokenize(new_sentence))

cosine_sim = similarity[test_corpus_1][1]
print("Similarity between the two sentences using gensim: %.4f." % cosine_sim)

The output would be:

Similarity between the two sentences using gensim: 0.7303.
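
As another option, scikit-learn covers the same pipeline in a few lines (a sketch, assuming scikit-learn is installed; note that CountVectorizer's default tokenizer lowercases and drops punctuation and single-character tokens such as "I", so its score will differ from the 0.7303 above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build bag-of-words count vectors for the two sentences
vectors = CountVectorizer().fit_transform([sent1, sent2])

# Cosine similarity between the two count vectors
print(cosine_similarity(vectors[0], vectors[1]))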

Thank you for reading!

