Creating a simple recommendation system based on TF-IDF and Count Vectorizer
I recently started studying Natural Language Processing (NLP). As a beginner, I found that even simple text-analysis techniques can be useful. So, this article will dive into several basic concepts of text analysis and show how to create a simple recommendation system based on those concepts; building the recommendation system will deepen your understanding of them. The topics discussed below are:
What is Count Vectorizer?
What is TF-IDF?
How does TF-IDF improve on Count Vectorizer?
Creating a simple recommendation system with TF-IDF
For those who have been exposed to NLP, one of the most widely recognized methods of processing text is converting it into vectors that a model can be trained on. Count Vectorizer is one such method. A document (such as a sentence) in your data set is denoted as a 1×V vector, where V is the vocabulary size; if there are N documents in your data set, the N documents are denoted as an N×V matrix. For example, the Kaggle data set used to create the recommendation system has a genres column, and each movie's genres (a row) correspond to a document (Figure 1).
How do we convert a document to a vector? A toy example intuitively explains the Count Vectorizer. The document ‘I like apple’ can be represented by [1,1,1,0,0,0,0,0,0]; ‘I like avocado’ can be denoted as [1,1,0,1,0,0,0,0,0]; ‘I do not like avocado, but avocado is healthy’ is converted to [1,1,0,2,1,1,1,1,1] (avocado appears twice, so its count is 2). The three documents are turned into a 3×9 matrix, where 9 is the vocabulary size.
Code for Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I like apple',
    'I like avocado',
    'I do not like avocado, but avocado is healthy'
]
c2v = CountVectorizer()
x = c2v.fit_transform(corpus)
# note: sklearn's default tokenizer drops one-character tokens such as 'I',
# so the learned vocabulary is slightly smaller than in the toy example above
print(c2v.get_feature_names_out())
print(x.toarray())
TF-IDF is a technique that can be used to improve this basic Count Vectorizer. At this point, you might be wondering what needs to be improved: stop words are the culprit. Stop words are words that appear everywhere but don’t necessarily carry much information, such as ‘are’, ‘is’, ‘I’, ‘a’, ‘have’, ‘so’, ‘the’, etc. For example, if we are doing spam detection, both spam and non-spam emails will contain these words, so they are of little help in distinguishing texts. Worse, the impact of stop words might outweigh that of other, more meaningful words. Moreover, stop words increase the dimensionality of the vectors, and higher dimensionality means more computation, which increases the cost.
How do we define stop words in texts? Stop words are application-specific and change with context. For example, if we are classifying articles about artificial intelligence versus physics, the term machine learning can be useful. However, if the texts are all about AI, machine learning is likely a stop word. Is there a way to automatically detect the stop words in texts? TF-IDF can crack this problem. The essential idea of TF-IDF is to scale down the counts of stop words while scaling up the counts of non-stop words.
tf(t,d) is the term frequency: the count of term t in document d, i.e. the output (matrix) of CountVectorizer in sklearn. idf(t) is the inverse of the proportion of documents that contain term t, passed through a log: idf(t) = log(N / N(t)), where N is the total number of documents and N(t) is the number of documents containing term t. As N/N(t) gets larger, so does its log, so idf(t) rises when term t appears in fewer documents. The TF-IDF weight of a term is the product tf(t,d) × idf(t).
An example clearly shows how idf(t) = log(N/N(t)) shrinks or enlarges the term frequency. Intuitively, the larger the proportion of documents that include term t, the smaller its weighted frequency becomes compared to its original count. This is how TF-IDF automatically identifies stop words in the documents.
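A minimal sketch of this weighting on the toy corpus, using the raw definition idf(t) = log(N/N(t)) above (sklearn's TfidfVectorizer uses a smoothed variant by default, so its numbers differ slightly):

```python
import math

# toy corpus from the Count Vectorizer example above (lowercased)
corpus = [
    'i like apple',
    'i like avocado',
    'i do not like avocado but avocado is healthy',
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def idf(term):
    # idf(t) = log(N / N(t)), where N(t) = number of documents containing t
    n_t = sum(term in doc for doc in docs)
    return math.log(N / n_t)

# 'like' appears in every document, so its count is scaled down to zero
print(round(idf('like'), 3))   # → 0.0
# 'apple' appears in only one document, so its count is scaled up
print(round(idf('apple'), 3))  # → 1.099
```

A stop-word-like term that occurs in every document gets idf = log(1) = 0, wiping out its contribution, while a rare term is boosted by log(N).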
Recommendation system
NLP is one of the most useful tools for creating a recommendation system (RS). Although the NLP concepts introduced above are among the simplest, they can be used to build a recommendation system. Let me show you how. The data set used for creating the recommendation system is from Kaggle.
Imagine a scenario where a user watched a movie and gave it a high rating: which six other movies would you recommend to him based on the movie he watched? The six movies with the most similar tags and genres are good choices. In preliminary data processing, I convert the genres and tags that describe each movie into a vector using TF-IDF, and movies with a similar taste can then be found through the vectors’ cosine similarities.
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv')
# combine each movie's genres and tags into a single text description
df['description'] = df.apply(create_description, axis=1)
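The helper create_description is not defined in the snippet above. A minimal sketch of what it might look like, assuming the genres and keywords columns of the Kaggle tmdb_5000_movies.csv file hold JSON-encoded lists of {"id", "name"} objects:

```python
import json

def create_description(row):
    # each column holds a JSON list such as '[{"id": 28, "name": "Action"}]'
    genres = ' '.join(g['name'] for g in json.loads(row['genres']))
    keywords = ' '.join(k['name'] for k in json.loads(row['keywords']))
    return genres + ' ' + keywords
```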
Convert the description of each movie to the corresponding vector. max_features is the parameter that lets me limit the number of features (words) kept in the TF-IDF matrix; the selection criterion is term frequency across the corpus. In this case, I keep the top 5000 words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer(max_features=5000)
x = tfidf.fit_transform(df['description'])
print(x)
Calculate the cosine similarities and select the six most similar movies.
movie = pd.Series(df.index, index=df['title'])

def Recommendation(title):
    index = movie[title]
    # if several movies share the same title, take the first match
    if type(index) == pd.Series:
        index = index.iloc[0]
    query = x[index]
    distances = cosine_similarity(query, x)
    distances = distances.flatten()
    # sort movies by similarity in descending order; position 0 is the movie itself
    recommendation_idx = (-distances).argsort()[1:7]
    return df['title'].iloc[recommendation_idx]
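As a quick sanity check of the metric (on toy vectors, not the movie data), cosine_similarity returns values near 1 for vectors pointing in the same direction and smaller values as the angle between them grows:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1, 1, 0]])
b = np.array([[1, 1, 1]])
# cos = (1*1 + 1*1 + 0*1) / (sqrt(2) * sqrt(3)) ≈ 0.8165
print(cosine_similarity(a, b)[0, 0])
# a vector compared with itself gives ≈ 1.0, the maximum similarity
print(cosine_similarity(a, a)[0, 0])
```

This is why argsort on the negated similarities puts the movie itself first and the closest matches right after it.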
Extract the top six movies that resemble Avatar in taste. The results seem good.
print(Recommendation('Avatar'))
Extract the top six movies similar in taste to The Dark Knight Rises.
print(Recommendation('The Dark Knight Rises'))
Thanks for reading. If you like the blog, please give me a clap!