When it comes to natural language processing (NLP), word embedding is the key. Since most models cannot process raw strings, we need to convert text into numbers, and this mapping from text to numbers is what we call a word embedding.
Bag-of-words and Tf-Idf are two popular word embedding choices, both based on word occurrence. Bag-of-words builds a document-term matrix which stores how many times each word appears in each document. Tf-Idf is slightly different: it adjusts for the fact that some words appear frequently across the corpus. Words that appear in many documents are less helpful for distinguishing documents, so they should receive lower weights.
Tf-Idf is defined as the product of term frequency and inverse document frequency. Term frequency is the frequency of each term in each document, i.e. \[tf = f_{t,d}\] where \(f_{t,d}\) denotes how many times term \(t\) appears in document \(d\); these counts can be read off the document-term matrix. Inverse document frequency is defined as follows: \[idf = \log(\frac{N}{n_t})\] where \(N\) denotes the number of documents and \(n_t\) denotes how many documents contain term \(t\).
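To make these definitions concrete, here is a minimal base-R sketch (the toy counts and names are purely illustrative) that computes tf, idf, and their product from a small count matrix:
# toy count matrix: rows are documents, columns are terms
counts = matrix(c(2, 1, 0,
                  1, 0, 3),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("d1", "d2"), c("t1", "t2", "t3")))
tf  = counts                      # tf = f_{t,d}, the raw counts
n_t = colSums(counts > 0)         # in how many documents each term appears
idf = log(nrow(counts) / n_t)     # idf = log(N / n_t)
tfidf = sweep(tf, 2, idf, "*")    # tf-idf = tf * idf, term by term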
text2vec is a powerful R package for text analysis and NLP. Here, I am going to use a simple example to illustrate how we can measure text similarity with the Tf-Idf model from text2vec. In particular, we will see how important it is to choose an appropriate Idf function.
Suppose we have a corpus of only two sentences:
First, let us convert the text to lowercase letters and remove non-alphanumeric characters.
require(stringr)
require(text2vec)
require(magrittr)  # provides the %>% pipe used below

text = c("I love apples.",
         "I love apples too.")

prep_fun = function(x) {
  x %>%
    # convert text to lower case
    str_to_lower %>%
    # remove non-alphanumeric characters
    str_replace_all("[^[:alnum:]]", " ")
}

clean_text = prep_fun(text)
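After preprocessing, the two sentences should look roughly like this (lowercased, with the periods turned into spaces):
clean_text
## [1] "i love apples "     "i love apples too "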
Next, we can create a vocabulary and build a document-term matrix for the corpus. The vocabulary consists of the unique terms in the corpus.
# build the document term matrix
it = itoken(clean_text, progressbar = FALSE)  # iterator over tokens
v = create_vocabulary(it)                     # collect the unique terms
vectorizer = vocab_vectorizer(v)              # map terms to matrix columns
dtm = create_dtm(it, vectorizer)              # documents x terms count matrix
We can get the following document term matrix:
document | too | apples | i | love |
---|---|---|---|---|
1 | | 1 | 1 | 1 |
2 | 1 | 1 | 1 | 1 |
Then we can define a Tf-Idf model. By default, text2vec sets smooth_idf = TRUE, which adds 1 to the document frequency to prevent division by zero: \[smooth\_idf = \log(\frac{N}{n_t + 1})\]
# Tf-Idf weights
tfidf = TfIdf$new(smooth_idf = TRUE)
dtm_tfidf = fit_transform(dtm, tfidf)
The model yields the following Tf-Idf weight matrix:
document | too | apples | i | love |
---|---|---|---|---|
1 | | -0.135 | -0.135 | -0.135 |
2 | 0 | -0.101 | -0.101 | -0.101 |
Tf-Idf decreases the weight for common words and increases the weight for rare words. text2vec also normalizes the term frequencies within each document (l1 normalization by default), which is where the factors 1/3 and 1/4 come from. For example, the weight for the term apples in the first sentence is \(\frac{1}{3}\cdot \log{\frac{2}{2+1}} = -0.135\), and the weight for the term too in the second sentence is \(\frac{1}{4}\cdot \log{\frac{2}{1+1}} = 0\).
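As a quick sanity check, we can reproduce these weights by hand (assuming the l1 term-frequency normalization and the smoothed Idf above, with the natural logarithm):
(1/3) * log(2 / (2 + 1))   # apples in sentence 1: about -0.135
(1/4) * log(2 / (1 + 1))   # too in sentence 2: exactly 0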
With the Tf-Idf weight matrix, we can then measure the cosine similarity between the two sentences.
tfidf_cos_sim = sim2(dtm_tfidf, method="cosine", norm="l2")
print(tfidf_cos_sim)
## 2 x 2 sparse Matrix of class "dsCMatrix"
## 1 2
## 1 1 1
## 2 1 1
The result shows a similarity of 1 between the two sentences, which would mean they are identical. Clearly they are not: the second sentence contains one extra word, i.e. too.
We run into this problem because of the Idf function. Whenever a term appears in all documents but one, the default smoothed Idf in text2vec gives it zero weight, since \(n_t = N - 1\) and \(\log(\frac{N}{(N-1)+1}) = 0\). The example here is admittedly a corner case, with only two sentences that differ by a single word, but we can avoid such problems by choosing an appropriate Idf function.
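To see this concretely, here are the default smoothed Idf values for our two-document corpus (a quick sketch with the document frequencies filled in by hand):
N   = 2                                          # number of documents
n_t = c(too = 1, apples = 2, i = 2, love = 2)    # document frequencies
log(N / (n_t + 1))   # too gets 0; apples, i and love get about -0.405
The only word that distinguishes the two sentences, too, is exactly the one that receives zero weight.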
We can use other variants of Idf instead of the default. For example, one popular choice is \[idf = \log(\frac{N}{n_t} + 1)\]
# override the private Idf method of the TfIdf class generator;
# instances created after this point will use the new Idf function
TfIdf$private_methods$get_idf = function(x) {
  # document frequency: number of documents containing each term
  cs = colSums(abs(sign(x)))
  if (private$smooth_idf)
    idf = log(nrow(x) / cs + 1)
  else
    idf = log(nrow(x) / cs)
  Diagonal(x = idf)
}
Using this new Idf function, let us compute the cosine similarity again.
tfidf = TfIdf$new(smooth_idf = TRUE)
dtm_tfidf = fit_transform(dtm, tfidf)
tfidf_cos_sim = sim2(dtm_tfidf, method="cosine", norm="l2")
print(tfidf_cos_sim)
## 2 x 2 sparse Matrix of class "dsCMatrix"
## 1 2
## 1 1.0000000 0.7377375
## 2 0.7377375 1.0000000
The two sentences now have a similarity score of 0.74, which makes more sense to me.