Building a Joke Recommender System
With data growing day by day, we have an enormous amount of it at our disposal, and it becomes increasingly difficult for a user to find relevant information. This is where recommendation systems come in handy. Recommender systems have become so important that big companies like YouTube, Netflix, and Amazon derive a large share of their revenue from them.
Amazon reportedly earns $98 billion of annual revenue from recommendations, and on Netflix about 80% of the content consumed by customers comes from recommendations. Small wonder these companies invest millions of dollars in improving their recommenders.
A good recommendation system helps a user find relevant information based on their interests or history. This saves the user a lot of effort, enhancing the user experience and increasing loyalty.
Broadly, recommendation systems are divided into two types: content-based and collaborative filtering based systems.
A content-based RS typically uses user and item metadata and item features to recommend new items. A collaborative filtering RS, on the other hand, treats items as black boxes and recommends based on user-item interactions, e.g. ratings. With this short introduction, let's jump straight to the business problem we will try to solve today.
Contents
- Business Problem
- ML Formulation
- Data Analysis
- Performance Metrics
- EDA
- Existing Approaches
- Approaches Tried
- Results
- Conclusions And Future works
- References
Business Problem
Imagine sitting bored and tired at your laptop after all the binge watching and social media. What if I told you there is a system that can put a smile on your face, tuned to your sense of humour? Yes, we are talking about a joke reader system containing a variety of jokes for every user to have fun with. At first glance it may not look that interesting in a world full of online streaming services, but another way to look at it is as an exercise for your humour. Humour is key to appreciating any joke: depending on yours, you will find one joke funny and another not so much. And since humour plays a part in every interaction we have with others, improving it can improve those interactions too.
Even if people are willing to try such a system, they would not want to search for jokes themselves, as that demands extra effort. This is where our system recommends jokes to a user based on their likes and humour. As with any recommendation system, the more user-item interactions we capture, the better the system becomes. For this reason we ask the user to rate each joke after reading it, on a scale from -10 (not funny) to +10 (very funny).
In short, we want to build a system that serves users jokes to read based on their humour and past interactions, and records the rating each user gives to each joke.
ML Formulation
Given a user-joke interaction table, where each cell represents a rating given to a joke by a user, we would like to build a recommender system which would recommend jokes to a user based on the previous interactions.
Data Analysis
We will be using the Jester dataset, which can be downloaded from here. The dataset was built by a team from the University of California.
The dataset contains two files. The first is a CSV file with two columns, joke_id and joke. joke_id is an integer identifying each joke, and joke contains the actual joke text. There are a total of 150 jokes in our dataset.
The second file contains three columns: user_id, joke_id, and rating. user_id is an integer identifying each user; there are 59132 unique users in total. joke_id identifies the joke the user rated. rating holds the actual rating the user gave the joke, a real value between -10 and +10, where -10 means NOT FUNNY and +10 means VERY FUNNY.
There are a few jokes in the system that are rated by almost all users: jokes 5, 7, 8, 13, 15, 16, 17, 18, 19 and 20.
Performance Metrics
Selecting a good metric is a very important step in building a good recommender system. Depending on how we want to build our system, different metrics apply. If we want the system to predict ratings, MAE (mean absolute error) or RMSE (root mean squared error) is a good choice.
If we want to measure the quality of our recommendations, we can use precision and recall.
If we want to measure how much our recommendations differ across users, we can measure personalization.
This blog, Performance Metrics for Recommender Systems, gives a good overview of the various metrics that can be used.
For our purposes we will use RMSE and MAP@K as performance metrics, since we both predict ratings and want to measure the quality of our recommendations. More about MAP@K can be found here.
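For reference, here is the standard definition of MAP@K in the linked blog's notation, lightly adapted: with $m$ the number of a user's relevant jokes, $P(i)$ the precision over the top $i$ recommendations, and $\mathrm{rel}(i)$ an indicator that the $i$-th recommendation is relevant,

$$\mathrm{AP@K} = \frac{1}{m}\sum_{i=1}^{K} P(i)\,\mathrm{rel}(i), \qquad \mathrm{MAP@K} = \frac{1}{|U|}\sum_{u \in U}\mathrm{AP@K}(u)$$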
EDA
Before solving any ML/DL problem it is absolutely necessary to perform EDA to make sense of our data.
Let's start by loading our data with this simple code snippet:
import pandas as pd

jokes = pd.read_csv('jester_items.csv')
ratings = pd.read_csv('jester_ratings.csv')
Distribution of joke ratings
We can generate the following plot using this snippet of code:
import matplotlib.pyplot as plt

key, sort_by = 'jokeId', 'userId'   # swap these to plot per-user counts instead
'''Counting the number of ratings per joke'''
joke_groups = ratings[['userId','jokeId']].groupby(key, as_index=False).count()
'''Sorting the DF by number of ratings'''
joke_groups = joke_groups.sort_values(by=sort_by, ascending=False)
'''Plotting the counts'''
plt.plot(joke_groups[sort_by].values)
plt.title('Number of ' + key + ' VS Number of Ratings')
plt.xlabel('Number of ' + key)
plt.ylabel('Number of Ratings')
plt.show()
The code above generates the plots below just by changing the key and sort_by variables.
The next plot can be generated with the following snippet of code:
import seaborn as sns

no_of_rated_jokes_per_user = ratings[['userId','jokeId']].groupby(by='userId').count().sort_values(
    by='jokeId', ascending=False)['jokeId'].values
fig = plt.figure(figsize=plt.figaspect(.5))
ax1 = plt.subplot(121)
sns.kdeplot(no_of_rated_jokes_per_user, shade=True, ax=ax1)
plt.xlabel('No of ratings by user')
plt.title("PDF")
ax2 = plt.subplot(122)
sns.kdeplot(no_of_rated_jokes_per_user, shade=True, cumulative=True,ax=ax2)
plt.xlabel('No of ratings by user')
plt.title('CDF')
plt.show()
We can see from the plot that some jokes in our dataset receive a much higher number of ratings (popular jokes) than others. The data approximately follows a power law distribution, which makes sense: on any streaming platform, only a few items are watched by many users. The PDF and CDF tell a similar story.
Cold start problem
The cold start problem is very common in recommender systems. It arises when we have to recommend items to a user who has just joined the system: if a user has not rated any items, how can the system recommend items to them?
In a content-based system this is generally not an issue, because we can recommend based on the user's metadata alone. In a collaborative setting, however, it is a problem. It can be solved in various ways, the simplest of which is to recommend the most popular items in the system. We take this approach and recommend the top-rated jokes to such users.
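As a concrete illustration, a minimal popularity fallback might look like this (a sketch; the column names follow the ratings DataFrame loaded earlier):

# top-rated jokes by average rating, served to users with no history
top_jokes = (ratings.groupby('jokeId')['rating']
                    .mean()
                    .sort_values(ascending=False)
                    .head(10)
                    .index
                    .tolist())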
Using a small snippet of code we can check whether this problem exists in our system (here all_users is the list of all user ids):

#users whose id never appears in the ratings data have rated nothing
non_rating_users = list(filter(lambda x: x not in ratings['userId'].values, all_users))

The result is that we actually have 10 such users in our system who have not rated any jokes.
Distribution of ratings given per user
The following plot shows the distribution of the number of ratings given by each user in our data.
From the plots we can see that some users have rated a large number of jokes compared to others. This distribution is again close to a power law and depicts common user behaviour across internet platforms.
Distribution of Ratings
Here we look at how the actual ratings are distributed across the data.
As we can see, ratings lie between -10 and +10, and the most common rating is approximately 3. People seem to like the jokes on balance, since there are more high ratings than low ones. Another important observation is that the dataset contains a wide variety of jokes, from bad ones to good ones; this is good, because the model will be trained on all sorts of data, which should help it generalize well.
Distribution of Average ratings for each joke
The plot below shows the distribution of average ratings received by each joke in our dataset.
The average rating received by a joke lies between -4 and +4, with most jokes averaging approximately 2.2. A high average rating can mean two things: either the jokes were actually very funny, or there are certain biases associated with them. If biases are present, we might have to remove them before giving our final predictions.
Distribution of average ratings by each user
The plot below shows the distribution of average ratings given by each user.
This looks like a normal distribution, telling us there are all sorts of users in the dataset: highly critical users as well as overly lenient ones. This is good, because the system can capture interactions from every kind of user and should perform well in general.
Distribution of length of jokes (in words)
Here we check the distribution of joke lengths.
From the plot, most of our jokes are between 90 and 100 words long. Shorter jokes make sense because they consume less of the user's time, which enhances the user experience.
Relationship between joke length and ratings
In this plot we check whether joke length is related to user ratings in any way. The X axis shows joke length and the Y axis the rating for that joke.
Initially I expected some relationship between joke length and ratings, with shorter jokes getting better ratings because they might have a better chance of landing with the user. But as you can see, there is hardly any relationship: the plot is essentially random. Ratings do not depend on joke length in any obvious way, which means it must be the humour and funniness of the jokes that actually drives the ratings.
One interesting observation, though: if we focus on the region with joke length < 50, the density of the plot is very high. Zooming in, jokes shorter than 25 words are rated between -2.5 and +2.5 on average, while jokes longer than 25 words get noticeably higher ratings. So medium-length jokes tend to perform better than very short or very long jokes on average. The plot is quite jagged, however, so this is just an observation, not ground truth.
That is all for our EDA. Next we will build some models for our problem.
Existing Approaches
To be fair, there are not many solutions out there that directly address joke recommendation (two at most), but there is plenty of literature on recommender systems in general to draw ideas from.
One direct solution comes from the same University of California team that built the Jester dataset. The paper can be found here. At a high level, the approach clusters similar users based on the ratings they gave to the popular items in the system. We will discuss it in detail later.
Another influential work, published in 2009 by a winner of the Netflix Prize competition, is among the finest in the field. The aim there is to fill the empty cells of the ratings matrix, either via matrix factorization techniques or by solving a simple optimization problem. Details can be found in this paper.
With the advances in deep learning, I was curious whether there are deep learning based techniques that can be applied to our problem, and luckily there are many. Among them, this one is one of the simplest. The paper focuses on capturing user-item interactions through a simple MLP-based model. Previously people mostly relied on matrix factorization to capture these interactions, but one problem with MF is that it can only capture linear interactions; if the interactions actually follow a non-linear function, a deep learning model can help. The paper also discusses techniques to combine MF-based and DL-based interactions.
The figure above shows a very simple model from this paper that maps users and items into a latent space. We first represent each user and item in a classical one-hot encoding and feed them into separate embedding layers. A stack of MLP layers then connects to an output layer that predicts the rating. Using the actual ratings we compute the loss and backpropagate it so that the network learns the embedding weights. More about embedding layers in DL can be found in this blog.
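To make the architecture concrete, here is a minimal Keras sketch of such an MLP model; the layer sizes and latent dimension are illustrative choices of mine, not values from the paper:

from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model

n_users, n_jokes, latent_dim = 59132, 150, 32  # counts from our dataset; latent_dim is arbitrary

user_in = Input(shape=(1,))
joke_in = Input(shape=(1,))
# separate embedding layers map user/item ids into the latent space
user_vec = Flatten()(Embedding(n_users, latent_dim)(user_in))
joke_vec = Flatten()(Embedding(n_jokes, latent_dim)(joke_in))
# MLP layers over the concatenated user and item vectors
x = Concatenate()([user_vec, joke_vec])
x = Dense(64, activation='relu')(x)
x = Dense(32, activation='relu')(x)
rating_out = Dense(1)(x)  # predicted rating

model = Model(inputs=[user_in, joke_in], outputs=rating_out)
model.compile(optimizer='adam', loss='mse')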
Approaches Tried
In this section we will go over the various approaches I tried for the problem.
Before looking at models, let's see how the data is divided into train and test.
We took the first 90% of users for training and the last 10% for testing, as shown below:
import numpy as np

total_users = len(np.unique(ratings['userId'].values))
train_users = list(range(total_users))[0:int(total_users*0.90)]
test_users = list(range(total_users))[int(total_users*0.90):]
Next, we need to arrange the test data so that our metric, MAP@K, is easy to calculate. For this we must define a set of relevant jokes for each test user, which is needed to compute precision. We consider only those joke ids the test user rated above 0, and of those we keep only the latest 20% as relevant jokes. This can be done with the following code:
'''For each test user we will select 20% of ratings to test our predictions'''
def create_test_users():
    '''This function divides ratings given by test users into train and test'''
    user_joke_interaction = []
    # getting all the positive ratings given by each test user (99.0 marks no interaction)
    for userid in tqdm(test_users):
        user_joke_interaction.append([(jokeid, rating)
                                      for jokeid, rating in enumerate(ratings_matrix.toarray()[userid])
                                      if rating != 99.0 and rating > 0])
    query_users_train = []
    query_users_test = []
    # keeping 80% of each test user's ratings to train the model, 20% for testing
    for user_ratings in user_joke_interaction:
        query_users_train.append(user_ratings[:int(len(user_ratings)*0.8)])
        query_users_test.append(user_ratings[int(len(user_ratings)*0.8):])
    return query_users_train, query_users_test
Using this function, for each test user we get the jokes used as model input stored in query_users_train and the held-out relevant jokes stored in query_users_test.
The next important function is the MAP@K calculation; the code below computes it given query_users_train and query_users_test.
def mean_avg_precision(query_users_train, query_users_test, distance_matrix, neighbours):
    '''This function calculates the MAP@K value for the given train and test ratings data'''
    relevant_items = []
    mean_avg_precision = 0
    total_users = len([user for user in query_users_train if user != []])
    # getting the relevant jokes for each user
    for user in query_users_test:
        relevant_items.append([tup[0] for tup in user])
    # calculating average precision@k per user
    for userid, user in tqdm(enumerate(query_users_train)):
        if user != []:
            # getting 10 recommendations for the current user
            recommendations = recommend_model(user, distance_matrix, neighbours)
            k = len(recommendations)
            # relevant jokes for this user and their count
            relevant_items_user = relevant_items[userid]
            total_relevant_items = len(relevant_items_user)
            average_precision = 0  # reset for every user
            # iterating over k to calculate avg precision@k
            # Ref: http://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html
            for i in range(k):
                # checking if the (i+1)-th recommendation is relevant or not
                if recommendations[i] in relevant_items_user:
                    # recommendations up to and including position i
                    items_till_i = recommendations[:i+1]
                    # precision over the first i+1 recommendations
                    precision_i = len(list(filter(lambda x: x in relevant_items_user, items_till_i)))/(i+1)
                    # accumulating precision at every relevant position
                    average_precision = average_precision + precision_i
            if total_relevant_items != 0:
                average_precision = average_precision/total_relevant_items
            # accumulating average precision over all users
            mean_avg_precision = mean_avg_precision + average_precision
    mean_avg_precision = mean_avg_precision/total_users
    return mean_avg_precision
Now let's look at our models.
Item-Item Similarity based on BOW (bag of words)
The simplest model we can try is a similarity-based one. Since we have the actual joke texts, we can build a similarity matrix between jokes. Using it, we recommend jokes to a user by selecting the nearest n jokes to each joke the user has rated, then removing the jokes the user has already rated. The resulting set can still be large, so we need an ordering; the simplest is to sort the candidates in decreasing order of their average received rating and return the top k, as sketched below.
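Here is a minimal sketch of that recommendation step; recommend_model is the function our MAP@K code above calls, and avg_joke_rating (an array of each joke's average received rating) is an assumed precomputed helper:

def recommend_model(user_ratings, distance_matrix, neighbours, k=10):
    '''Recommend the top-k unseen jokes among the nearest neighbours of the user's rated jokes.'''
    rated = set(jokeid for jokeid, _ in user_ratings)
    candidates = set()
    for jokeid in rated:
        # indices of the `neighbours` closest jokes (smallest cosine distance, skipping itself)
        nearest = np.argsort(distance_matrix[jokeid])[1:neighbours+1]
        candidates.update(nearest)
    # drop jokes the user has already rated
    candidates -= rated
    # order the remaining candidates by their average received rating
    ranked = sorted(candidates, key=lambda j: avg_joke_rating[j], reverse=True)
    return ranked[:k]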
Now we know how to generate recommendations, but one question remains: how do we get the similarity matrix? We can simply convert each joke into a BOW representation, giving a sparse vector per joke. This is easy with the following code:
from sklearn.feature_extraction.text import CountVectorizer

def BOW_representation():
    '''Function returns the bag of words representation for each joke'''
    # initializing the count vectorizer
    bow = CountVectorizer()
    # fitting and transforming the jokes data
    encoding = bow.fit_transform(preprocessed_jokes)
    return encoding.toarray()
Here preprocessed_jokes contains all our jokes after preprocessing.
To calculate distances between jokes we can use the following code:
from sklearn.metrics import pairwise_distances

def bow_i_i_similarity(bow_encoding):
    '''This function generates a joke-joke similarity matrix
    given the BOW representations of the jokes'''
    distance_matrix = pairwise_distances(bow_encoding, metric='cosine')
    return distance_matrix
Here we are using cosine distance to measure the distance between two jokes.
This model falls under content-based filtering, because we use the actual joke text as metadata and build the model on top of it.
Another thing we can try is computing MAP@K for different values of n (neighbours). The plot below shows the metric across different values of n.
As you can see, the maximum MAP@K the model achieved was 7e-5, at n=45. This is quite low; we want MAP@K to be high.
Now let's see whether the same model with a different joke representation can improve our MAP@K.
Item-Item Similarity based on TF-IDF representation
Here there is only one difference: we represent the jokes with TF-IDF. Since TF-IDF vectors give more weight to rarer words, this might improve the model's performance.
We can compute the TF-IDF representation of the jokes with this simple code:
from sklearn.feature_extraction.text import TfidfVectorizer

'''Tfidf representation of jokes'''
tfidf = TfidfVectorizer()
tfidf_encoding = tfidf.fit_transform(preprocessed_jokes)
tfidf_encoding.toarray()[0]
Once we have the vector representations, we again compute similarities and generate recommendations. Here is the plot for various values of n.
As we can see, there is no improvement in MAP@K over the BOW model. Since ML is all about experimenting, let's try another representation: TF-IDF weighted Word2Vec.
Item-Item Similarity based on TF-IDF weighted W2Vec
To be fair, I was hopeful this representation would beat the previous two, simply because we know the power of Word2Vec models. Word2Vec is a well-known algorithm developed at Google that outperforms the previous representations because it preserves semantic relationships between words. Here we use a pretrained GloVe model trained on an enormous text corpus, which returns an n-dimensional dense representation for each word. The dense representation is another advantage, since all the previous representations were sparse vectors.
To represent a joke we use TF-IDF weighted vectors: we traverse each word in the joke, compute its TF-IDF value, and multiply it by the dense GloVe vector for that word. We sum these vectors over all words and finally divide by the sum of the TF-IDF weights. The code is as follows:
def tfidf_w2v():
    '''This function returns the jokes represented in tfidf weighted w2vec format'''
    tfidf_w2v_vectors = []  # the tfidf-weighted w2v vector for each joke is stored in this list
    for sentence in tqdm(preprocessed_jokes):  # for each joke
        vector = np.zeros(300)  # word vectors are 300-dimensional
        tf_idf_weight = 0  # sum of tf-idf weights of words with a valid vector in the joke
        for word in sentence.split():  # for each word in the joke
            if (word in glove_words) and (word in tfidf_words):
                vec = model[word]  # getting the vector for the word
                # tf-idf = idf value (dictionary[word]) times the tf value (word count / joke length)
                tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split()))
                vector += (vec * tf_idf)  # accumulating the tfidf weighted w2v
                tf_idf_weight += tf_idf
        if tf_idf_weight != 0:
            vector /= tf_idf_weight
        tfidf_w2v_vectors.append(vector)
    return tfidf_w2v_vectors
The plot for various values of n is as follows
As hoped, our MAP@K improved, by roughly 10x over the previous values. The best MAP@K comes at n = 15.
Collaborative Filtering Model
The next model is based on a collaborative approach: as we know, collaborative methods use the user-item interactions, i.e. the ratings, to generate recommendations.
In this model we again compute similarities between jokes; the only difference is how we vectorize them. In the previous models we vectorized the actual joke text, whereas here each joke is represented by the vector of ratings it received.
Everything else remains the same as in the previous models, as sketched below.
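As a sketch of this representation (assuming missing interactions in ratings_matrix have been replaced with 0 before computing distances):

# each column of the user-by-joke ratings matrix is a joke's representation,
# so transposing gives one row per joke: shape (n_jokes, n_users)
joke_vectors = ratings_matrix.toarray().T
# the same cosine distance used in the content-based models
distance_matrix = pairwise_distances(joke_vectors, metric='cosine')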
There is a small problem with this representation, though. Consider a general case where one item is a mobile phone and another a cricket bat: representing them by the ratings they received could make them look similar if the same users gave them roughly the same ratings. That would be bad, because a mobile phone is in no way similar to a cricket bat. To solve this we could use item metadata such as category. In our joke-only case, however, we can safely ignore this problem.
Our plot for different number of neighbours is as shown
As we can see, this model outperforms all of our previous models, with MAP@K = 0.00250 at n=15. That is almost a 4x improvement.
Eigentaste Model (Clustering of Users)
This algorithm was presented by the University of California team that collected the Jester dataset. In earlier collaborative methods, ratings came from user-selected queries: the user chooses which items to rate, resulting in a sparse representation. In this paper, instead, interactions are recorded for a fixed set of universal items covering as much of the item space as possible, to get an idea of each user's taste. For example, if our items were movies, we would select a subset covering almost every genre, show their trailers to all users, and record ratings based on the trailers. This yields a small but dense ratings submatrix that is then used to make predictions. Such an approach only works for items that can be compressed into a short form a user can quickly judge: books via title or summary, news articles via headline, or, in our case, the entire joke itself. Full details about the algorithm can be found in this paper.
In our joke dataset, certain jokes are rated by almost every user; these are the gauge (popular) items. All other jokes are non-popular. The basic algorithm has the following steps:
- Take the submatrix of the items that the most users have rated. If there are k such items, the resulting matrix is of order n*k, where n is the number of users.
- Standardize this n*k submatrix and apply PCA to it, so we can place the users in a 2-dimensional space that explains maximum variance.
- Project the n*k users matrix into 2-d using the top 2 eigenvectors.
- Cluster the users and compute the average rating of every non-popular joke per cluster. Sort these ratings in decreasing order and recommend jokes accordingly.
- For a query user, have them rate the popular jokes, then project the user onto the 2-d plane using the computed eigenvectors.
- Find the cluster to which the user belongs and recommend that cluster's jokes.
We start by extracting the submatrix of popular jokes; we know that joke ids 6, 7, 12, 13, 14, 15, 16, 17 and 18 are popular in our dataset. To get the submatrix we use the following code:
dense_ratings = ratings_matrix.toarray()
subrating_matrix = []
for i in tqdm(train_users):
    subrating_matrix.append(dense_ratings[i][[6,7,12,13,14,15,16,17,18]])
subrating_matrix = np.array(subrating_matrix)
subrating_matrix.shape
Next let's apply PCA to subrating_matrix to reduce the user dimensionality to 2.
from sklearn.decomposition import PCA

'''Applying PCA to the submatrix to get our user vectors in a 2-d space'''
pca = PCA(n_components=2)
pca.fit_transform(subrating_matrix)
pca.explained_variance_ratio_
As we can see, the top 2 components explain approximately 75% of the variance.
Next we take the top 2 eigenvectors and project our submatrix into 2D with them, as shown:
'''We got approximately 76% variance by taking 2 eigen vectors'''
eigen_vectors = pca.components_
eigen_vectors.shape
'''Projecting our users from the subratings matrix onto 2-d'''
subrating_matrix_2d = np.matmul(subrating_matrix, eigen_vectors.T)
subrating_matrix_2d.shape
As expected, we get all our users in 2-d space. Let's look at a simple scatter plot of these users.
Plotting them gives us the following plot.
We can see that the users form clusters, which helps us identify similar users.
Next we cluster these users using simple axis-parallel lines. From the plot, the users separate cleanly along the feature 1 values, so we divide the dataset into 3 clusters along feature 1, as shown below (see the sketch after this paragraph).
One cluster contains users with a feature 1 value < -1, the second those between -1 and 0, and the third those with feature 1 values > 0.
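A minimal sketch of this axis-parallel assignment (the thresholds are read off the plot; the function name is my own):

def assign_cluster(user_2d):
    '''Assign a 2-d user vector to one of the 3 axis-parallel clusters along feature 1.'''
    f1 = user_2d[0]  # value along the first principal component
    if f1 < -1:
        return 0
    elif f1 < 0:
        return 1
    else:
        return 2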
Code for clustering the dataset and getting recommendations from this system can be found on my github profile here.
Now, for each cluster of users, we compute the average rating of every non-gauge (non-popular) item, sort them in descending order, and recommend from the top.
For a query user, we have them rate the gauge items, form the user vector, and project it into 2-d space using the eigenvectors. We then identify the cluster the query user belongs to and recommend the top k items from its set of non-popular items, as sketched below.
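A sketch of that prediction flow, reusing assign_cluster above (cluster_top_jokes, a per-cluster list of non-gauge jokes sorted by average rating, is an assumed precomputed structure):

def recommend_for_query(query_gauge_ratings, eigen_vectors, cluster_top_jokes, k=10):
    '''query_gauge_ratings: the new user's ratings for the gauge jokes.'''
    # project the user onto the 2-d eigen plane
    user_2d = np.matmul(eigen_vectors, query_gauge_ratings)
    cluster = assign_cluster(user_2d)
    # recommend the cluster's top-rated non-gauge jokes
    return cluster_top_jokes[cluster][:k]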
Computing MAP@K for our query users gives the following result.
A MAP@K of 0.0054, almost twice our previous best.
One question you might have: why did we form only 3 clusters and not 7, given that the dataset separates well into 7?
To be fair, I did initially cluster the data into 7 clusters, but I noticed that in the clusters with feature 1 value > 0, most of the non-popular jokes had an average rating of 0; hardly 5 or 6 jokes had non-zero averages. That is a problem, because we want to recommend 10 jokes per user, and a cluster without enough jokes produces bad results. To overcome this, we settled on 3 clusters.
That said, having only three clusters has its own issues. One is that the system can produce only 3 distinct sets of recommendations, one per cluster. That is not good, because we want recommendations to be as distinct as possible based on user behaviour. To address this I increased the number of clusters from 3 to 33, as shown below.
Although the clusters increased, MAP@K did not improve. One reason might be that we split our users temporally, so many test users may fall into the same cluster. Splitting users randomly instead might improve the results.
Regression based Models
Another way to approach recommender problems is to pose them as a classical ML problem. Here we pose ours as a regression problem, where we try to predict joke ratings. A classical regression problem consists of data (x_i, y_i), where x_i is a d-dimensional vector and y_i a real value, so we first need to convert our problem into that format; then we can apply various ML models.
Our first task is to create the x_i, i.e. featurize the data. There are various ways to do this; here we use matrix factorization, a technique that factorizes a real-valued matrix into submatrices using SVD (singular value decomposition). More about SVD can be found here.
We apply the SVD decomposition to our ratings matrix, but before that we rescale the ratings so that the minimum rating is 1 and the maximum 21, with 0 representing no interaction. The code is as follows:
import numpy as np
from scipy.sparse import csr_matrix

#shifting ratings from [-10, 10] to [1, 21] so that 0 can mean "no interaction"
scaled_ratings = ratings['rating'].values + 11
#getting total users
users = len(np.unique(ratings['userId'].values))
#getting total jokes
items = len(np.unique(jokes['jokeId'].values))
#user and item dictionaries to fill out our matrix
user_dic = {user_id:i for i,user_id in enumerate(np.unique(ratings['userId'].values))}
item_dic = {item_id:i for i,item_id in enumerate(np.unique(jokes['jokeId'].values))}
#creating matrix of (users, items) shape; 0 represents no interaction
scaled_matrix = np.zeros((users,items))
#filling the matrix with the scaled user ratings
for i,tup in tqdm(enumerate(ratings.values)):
    scaled_matrix[user_dic[tup[0]]][item_dic[tup[1]]] = float(scaled_ratings[i])
scaled_matrix = csr_matrix(scaled_matrix)
Once we have the scaled matrix we apply the SVD decomposition, but first we must decide how many latent vectors to use. Again we look at the percentage of variance explained, plotting it against the number of latent vectors, as shown.
Taking 100 latent vectors explains about 97% of the variance, so we use 100 components and decompose the ratings matrix to obtain user and item representations in 100-dimensional space.
Following code does it for us
from sklearn.utils.extmath import randomized_svd

U,sigma,VT = randomized_svd(scaled_matrix,n_components=100,n_iter=5)
print(U.shape)
print(sigma.shape)
print(VT.shape)
Using the U and V matrices we can featurize our data: each row of U is the feature vector for a user and each row of V for an item.
We then convert the ratings into pairs (x_i, y_i), where x_i is a 200-dimensional vector (100 user features concatenated with 100 item features) and y_i the corresponding rating. With the dataset ready we can train various ML models; before that we split it 80:20, as sketched below.
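A minimal sketch of this featurization, reusing U, VT, user_dic and item_dic from above (the 80:20 split via train_test_split is my choice of helper):

from sklearn.model_selection import train_test_split

X, y = [], []
for user_id, joke_id, rating in ratings[['userId','jokeId','rating']].values:
    # 200-d feature vector: user row of U concatenated with the item column of VT
    X.append(np.concatenate([U[user_dic[user_id]], VT[:, item_dic[joke_id]]]))
    y.append(rating)
X, y = np.array(X), np.array(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)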
We trained various models, e.g. XGBoost, Lasso, Ridge, Linear SVR, Decision Tree, and LGBM, on this data; their MAP@K and RMSE values are shown below.
We can see that the regression-based approach performs well compared to the previous models.
To generate recommendations with these regression models we use the following technique. We featurize the query user and pair them with every joke in the system, pass this data through the trained model, and predict the ratings. Based on these predicted ratings we recommend the top k jokes to the user, as sketched below. Once we have predictions, we calculate MAP@K for every test user as shown in the code after the sketch.
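A sketch of the get_predictions helper the code below relies on (U, VT and items come from the SVD step; the exact implementation may differ):

def get_predictions(user_idx, model, k=10):
    '''Pair the user's feature vector with every joke and rank jokes by predicted rating.'''
    user_vec = U[user_idx]
    # one 200-d row per joke: [user features | joke features]
    X_query = np.array([np.concatenate([user_vec, VT[:, j]]) for j in range(items)])
    predicted = model.predict(X_query)
    # indices of the k jokes with the highest predicted ratings
    return np.argsort(predicted)[::-1][:k]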
def get_mapk(query_users, model, key='not dist'):
    '''This function calculates mean average precision for the given model'''
    #get the query user ids
    uniq_users = np.unique(ratings['userId'])
    query_users_id = uniq_users[query_users]
    mean_avg_precision = 0
    k = 10
    total_users = len(query_users)
    #iterating over each user
    for i, userid in tqdm(enumerate(query_users_id)):
        #getting relevant jokes for the current user
        relevant_items_user = ratings[ratings['userId']==userid]['jokeId'].values
        #getting recommendations for the user using the given model
        if key == 'dist':
            recommendations = get_predictions_dist_based_model(query_users[i], model)
        else:
            recommendations = get_predictions(query_users[i], model)
        total_relevant_items = len(relevant_items_user)
        average_precision = 0  #reset for every user
        #iterating over the number of recommendations
        for j in range(k):
            #checking if the (j+1)-th item is relevant or not
            if recommendations[j] in relevant_items_user:
                #recommendations up to and including position j
                items_till_j = recommendations[:j+1]
                #precision over the first j+1 recommendations
                precision_j = len(list(filter(lambda x: x in relevant_items_user, items_till_j)))/(j+1)
                #accumulating precision at every relevant position
                average_precision = average_precision + precision_j
        if total_relevant_items != 0:
            average_precision = average_precision/total_relevant_items
        #accumulating mean average precision over all users
        mean_avg_precision = mean_avg_precision + average_precision
    mean_avg_precision = mean_avg_precision/total_users
    return mean_avg_precision
So far each of these was a single model predicting our ratings, so the next idea was to try an ensemble. Our next model, which is our best performer, is based on the following approach.
We first divide the train dataset into a 50:50 split, d1 and d2. Next we decide how many base learners to train; in our case we chose 10, though we could create as many as we want. Each of the 10 base learners is randomly selected from XGBoost, Lasso, Ridge, Linear SVR, Decision Tree, LGBM, and Random Forest. We then row-sample with replacement from d1 to get a training set for each base learner; the sample size depends on the size of d1, and here we chose 200000. Using these samples we train the base learners, then pass every point of d2 through each of them. This yields 10 predictions per d2 point, forming a new dataset of base-learner predictions. On this new dataset we train a meta model, in our case an XGBoost model. This technique is called ensemble (stacking) modelling; a condensed sketch follows after the next paragraph.
Once the entire ensemble is trained, we save the base learners and the meta model to disk. At test time we load these models and pass the test data through them to get the predicted ratings.
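Here is a condensed sketch of that stacking procedure under the assumptions above (X_train and y_train from the earlier split; hyperparameters left at defaults):

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

rng = np.random.default_rng(42)
# 50:50 split of the training data into d1 (for base learners) and d2 (for the meta model)
half = len(X_train) // 2
X_d1, y_d1 = X_train[:half], y_train[:half]
X_d2, y_d2 = X_train[half:], y_train[half:]

candidates = [XGBRegressor, Lasso, Ridge, LinearSVR,
              DecisionTreeRegressor, LGBMRegressor, RandomForestRegressor]
base_learners, meta_features = [], []
for _ in range(10):
    # row sampling with replacement from d1 (sample size 200000, as described above)
    idx = rng.integers(0, len(X_d1), size=200000)
    model = candidates[rng.integers(len(candidates))]()
    model.fit(X_d1[idx], y_d1[idx])
    base_learners.append(model)
    # each base learner's predictions on d2 become one meta feature
    meta_features.append(model.predict(X_d2))

# the meta model learns to combine the 10 base-learner predictions
meta_model = XGBRegressor()
meta_model.fit(np.column_stack(meta_features), y_d2)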
Since the code for the regression-based approach is fairly long, I will share my entire code base for reference at the end.
The MAP@K for this ensemble model is shown below.
This MAP@K only fractionally outperforms our previous best of 0.0234 from the XGBoost model, but with some further experiments I reached 0.0268 with the ensemble.
The entire code for regression based approach can be found here.
Results
Here are the metric results of all the models discussed so far:
Looking at the results, the regression-based models perform better than the others, and the best model comes from the ensemble technique. With the ensemble we also noticed distinct recommendations for many users, an encouraging sign that the model actually learned something from the user interactions.
We also deployed our best solution, the ensemble model, on the Heroku cloud platform. It can be accessed via this link: joke-recommeder on heroku. Below is a short recording of the solution.
In the video you can see the trained model predicting the top 10 jokes and displaying them to the user. The solution was deployed with Flask by wrapping the model in a simple API.
Conclusions and Future Works
To conclude, we tried various techniques to solve the problem, but there is still room to improve. We could use the surprise library to predict ratings, as it already implements techniques from this paper.
There are other techniques to try as well: we could use deep learning to learn user and item embeddings for featurization, or combine the joke text vectors with our features, so the system can learn a bit more from the actual joke content.
I will try these in the future and update the blog as the metrics improve.
Feel free to contact me on LinkedIn with suggestions or for any ML/DL related topics.
Feel free to have a look at my entire code base for this case study in my GitHub repo.
References
Special thanks to the Applied AI team (appliedaicourse.com).
The following references were used for this problem:
- https://arxiv.org/pdf/1804.08891.pdf
- https://goldberg.berkeley.edu/pubs/eigentaste.pdf
- https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf
- https://arxiv.org/pdf/1708.05031.pdf
- https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
- http://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html
- https://towardsdatascience.com/evaluation-metrics-for-recommender-systems-df56c6611093