
Neural Collaborative Filtering for Deep Learning Based Recommendation Systems | Architecture Breakdown & Business Use Case

Matt Payne
·
May 26, 2022
Recommendation systems continue to be one of the best tools to increase user engagement and conversion rates among new and existing customers. 35% of Amazon.com's revenue comes directly from its recommendation system, which it uses to send personalized buying suggestions through email and show personalized products on-site. Many other companies like Netflix and Pinterest use recommendation systems to increase user engagement and keep their customers involved, which improves customer retention in the long run. These recommendation systems are becoming crucial for modern subscription businesses, as Netflix has stated that 80% of stream time comes from recommended movies and shows. All of these companies have moved from old school matrix factorization methods for these recommendations, like Netflix with the Netflix Prize winner, to deep learning based recommendation systems that outperform the old methods.

Neural Collaborative Filtering Architecture

Let’s take a look at the most popular and easy-to-use deep learning based system, Neural Collaborative Filtering (NCF), published by researchers from the National University of Singapore, Columbia University, Shandong University, and Texas A&M University in 2017. NCF’s power comes from using the non-linear nature of neural networks to learn a deeper understanding of the relationship between users and items, mixed with input from matrix factorization. This article will help you understand the NCF architecture as well as how you can implement NCF to provide buyers and customers with high ROI recommendations.

Prerequisite

It might be helpful to have read our simple breakdown of general recommendation systems first, as it covers terms and ideas that explain why NCF implements different parts of its architecture.

Interaction Matrix

Data Understanding

General Data layout

To understand the Neural Collaborative Filtering model’s architecture, we need to understand the input data required and how we engineer it for our use. The model takes in user ids and item ids corresponding to a target action. Many examples of NCF use user ids and movie ids, with the target being the rating the user gave the movie. These user and item fields must be converted to one hot encoded vectors, as they are not integer fields but identification numbers or user names.

one hot encoded vectors
Converting a vector of datapoints to a vector of one hot encoded datapoints
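As a quick illustration of that conversion, here is a minimal sketch (with made-up ids) that maps raw user ids to integer indices and then to one hot encoded vectors; the same mapping is applied to item ids:

```python
import numpy as np

# Hypothetical raw user ids (usernames / account numbers), not yet integers.
user_ids = ["user_42", "user_07", "user_42", "user_99"]

# Map each unique id to an integer index, then build one hot encoded rows.
unique_ids = sorted(set(user_ids))
index = {uid: i for i, uid in enumerate(unique_ids)}

one_hot = np.zeros((len(user_ids), len(unique_ids)), dtype=np.float32)
for row, uid in enumerate(user_ids):
    one_hot[row, index[uid]] = 1.0

# one_hot now has one row per interaction and one column per unique user id.
```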

Width.ai has used this same model for companies looking to recommend based on any user-item interaction, such as a purchase or an article read. Recommending products based on past purchases is a great way to use this model, and it allows you to start making recommendations with a rather small input space.

Implicit vs explicit feedback

Before we go any further into the model, it's important to understand the idea of implicit feedback and explicit feedback, and why we will be converting our data to implicit feedback. 


Explicit feedback is data collected from users that is direct and deliberate, meaning the user had to perform an action for that data point to be created. Examples of this are all around, with things like retweeting on Twitter, rating purchased goods 1-10 on a website, and clicking a link. The problem with this feedback data is that these actions are rare, and the non-action (not rating a video at all) is far more likely than the explicit feedback.


Implicit feedback, however, is the result of indirect data collection from user interactions, and is used as a proxy for a user's preference. These are things like video watch time, searches you made in Google, or views on a tweet. This data is far less sparse than explicit data and is used by deep learning architectures to recommend products in real time with quick updates.


Explicit feedback data vs Implicit feedback data

One important note is that most of the data used for these systems has an explicit feedback target field, with things like a movie rating, a YouTube subscription, or a Netflix subscription. We have to convert these data fields to implicit feedback points by reframing the problem. For example: instead of trying to predict what a user would rate a movie (from 1-5), we reframe it to predicting whether the user would rate the movie at all (1 or 0). The idea is that this models the likelihood of an interaction, and you can update your dataset to remove low ratings if wanted.
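A minimal sketch of that reframing, assuming a hypothetical explicit-feedback dataframe with `userId`, `movieId`, and `rating` columns:

```python
import pandas as pd

# Hypothetical explicit-feedback data: userId, movieId, rating (1-5).
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4, 2, 5, 1],
})

# Reframe to implicit feedback: any rating at all becomes a positive interaction.
implicit = ratings[["userId", "movieId"]].copy()
implicit["interaction"] = 1

# Optionally drop low ratings first if you only want "strong" positives:
# implicit = (ratings[ratings["rating"] >= 3][["userId", "movieId"]]
#             .assign(interaction=1))
```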


You might notice, if you’re following along, that you now have a dataset where your target variable is only positive (a 1 and never a 0). Because the dataset only contains points where the user interacted with the item, there are no rows where the user did not interact with an item. We cannot train a machine learning model on only positive interactions, as the model will not learn the user-item relationship well. To fix this we have to add negative feedback targets for each user, to give the model a good blend of both. Many implementations recommend adding random negative interactions for each user, arguing that even if a few of them are items the user would actually want to interact with, the chance of that being true is extremely low, and the benefit to the global loss outweighs the downside. We’ve built a better function for applying negative sampling to our datasets that shows better results on the validation and testing datasets (a sketch of the simpler random approach follows the list below):


  1. Create an understanding of items with generally low interaction and focus on using those as the negative sampled items. These items have a much lower chance of being a positive interaction, which makes it easier for the deep learning model to learn the nonlinear relationship. An example would look like this: if the item is a YouTube channel, you could consider the number of active subscribers, the number of videos posted last month, the number of comments per post, etc. Low numbers or all 0s would indicate an inactive channel, which would be great for negative interactions in the training dataset.
  2. Without side features to use as a primer for our low interaction data function, we can simply look for items with a generally low following count and focus on using those as our negative samples in the training dataset.
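Here is a minimal sketch of the simpler random negative sampling approach mentioned above (not our low-interaction heuristic), assuming the `implicit` dataframe from the previous sketch and a hypothetical `all_item_ids` list:

```python
import numpy as np
import pandas as pd

def add_random_negatives(implicit_df, all_item_ids, negatives_per_positive=4, seed=0):
    """For each user, sample items they never interacted with and label them 0.

    A handful of false negatives is accepted because the chance of randomly
    picking an item the user actually wants is extremely low.
    """
    rng = np.random.default_rng(seed)
    seen = implicit_df.groupby("userId")["movieId"].apply(set).to_dict()
    rows = []
    for user, items in seen.items():
        candidates = [item for item in all_item_ids if item not in items]
        n = min(len(candidates), negatives_per_positive * len(items))
        for item in rng.choice(candidates, size=n, replace=False):
            rows.append((user, item, 0))
    negatives = pd.DataFrame(rows, columns=["userId", "movieId", "interaction"])
    return pd.concat([implicit_df, negatives], ignore_index=True)

# Example usage with a hypothetical catalog of 100 item ids:
# train_df = add_random_negatives(implicit, all_item_ids=list(range(1, 101)))
```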

Embeddings

Now we’ve created our dataset and converted it to an implicit dataset with one hot encoded vectors instead of ids. Before we pass data into the different parts of the model, we need to re-represent our data points in a format that helps the model understand the relationships between them. Embedding vectors are lower dimensional representations used to capture relationships between vectors in a higher dimensional space. We pass these high dimensional one hot encoded data fields into embedding layers that produce lower dimensional representations, which help our model understand these sparse datapoints. We’ll use these embedding layers for both users and items and pass them into the model.

embedded vector
Mapping a one hot encoded user vector to a 2-dimensional embedded datapoint vector

Important Note:

Although it's not really mentioned in the original paper, more recent NCF implementations use different embedding initializations and regularizations to improve performance. Glorot Uniform initialization and L2 regularization are both seen in TensorFlow's implementation of NCF. Almost all newer implementations looking to model the same relationships as NCF use these to improve early epoch training results.
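A minimal Keras-style sketch of these embedding layers, assuming hypothetical `num_users`, `num_items`, and `latent_dim` values (integer-index inputs into an embedding layer are equivalent to one hot vectors hitting a dense projection):

```python
import tensorflow as tf

num_users, num_items = 10_000, 5_000   # hypothetical catalog sizes
latent_dim = 32                        # embedding size, tuned per dataset

user_input = tf.keras.Input(shape=(1,), name="user_id")
item_input = tf.keras.Input(shape=(1,), name="item_id")

# Glorot Uniform initialization and L2 regularization, as in TensorFlow's NCF.
def embed(vocab_size, name):
    return tf.keras.layers.Embedding(
        vocab_size,
        latent_dim,
        embeddings_initializer="glorot_uniform",
        embeddings_regularizer=tf.keras.regularizers.l2(1e-6),
        name=name,
    )

user_vec = tf.keras.layers.Flatten()(embed(num_users, "user_embedding")(user_input))
item_vec = tf.keras.layers.Flatten()(embed(num_items, "item_embedding")(item_input))
```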

Matrix Factorization: 

Matrix factorization has been used in older recommendation system implementations, including the winning Netflix Prize entry in 2009, but has fallen out of favor in recent years as a standalone model for these systems. MF struggles to learn the relationship between users and items for sparse matrices, and can result in high loss and underfit results on larger datasets. That being said, it performs well as an input to neural network models and as a side model that feeds learned relationships into a final output layer.

generalized matrix factorization
Generalized Matrix Factorization Architecture

Matrix factorization of the embedding layers is the first operation we’ll perform, used to model the user-item interactions through a scalar product of the above latent vectors. Matrix factorization reduces the matrix of n users and n items into two sub-matrices that are multiplied together to form a final predicted matrix. Using a loss function we can compare the predicted matrix to the actual interaction matrix and focus on minimizing the loss between them. Mean squared error and root mean squared error are the most common loss functions used.

neural collaborative filtering
Matrix Factorization works alongside the deep neural network to learn user-item relationships


Diving into what happens here a little more: each matching user and item is represented in a latent space by a latent vector. The more similar these latent vectors are, the more related the corresponding users’ preferences. Because we factorize the full matrix of users and items into the same space, we can measure the similarity of any two latent vectors with cosine similarity or a simple dot product. That being said, the final prediction coming from matrix factorization is the inner (dot) product of the matching latent vectors. One point to note is that many people, including the original paper's authors, argue against the dot product given its limitations in fully modeling the interactions of one hot encoded users and items in the latent space.
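A sketch of the GMF branch under the assumptions of the embedding example above (reusing the hypothetical `user_vec` and `item_vec` tensors): the element-wise product of the two latent vectors is the GMF output, and summing it recovers the plain MF dot-product score, while NeuMF instead feeds the full product vector into a learned output layer shown later.

```python
import tensorflow as tf

# Element-wise product of the matching user and item latent vectors (GMF).
gmf_vector = tf.keras.layers.Multiply()([user_vec, item_vec])

# Summing the product recovers the classic matrix factorization dot product;
# NeuMF keeps gmf_vector instead so the output layer can weight each dimension.
mf_score = tf.reduce_sum(gmf_vector, axis=-1, keepdims=True)
```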


Using pretrained GMF layers to improve early training results is a great way to reach a low loss early, prevent overfitting, and improve validation set hit rate. Width.ai has also experimented with non-negative matrix factorization as a way to improve raw MF loss function results on sparse datasets.

Multi-Layer Perceptron: (Deep Neural Network)

MLP NCF
Architecture of the neural network part of NCF

The other “side” of NCF that operates alongside the MF implementation is a deep neural network that takes the same one hot encoded vector to embedded vector inputs as the matrix factorization. Neural networks learn deep non-linear relationships between users and items better than matrix factorization, and can scale to handle sparse inputs much better than matrix factorization with its dot product. The embedded inputs are passed into the multilayer perceptron’s stacked hidden layers to learn these higher-order, non-linear relationships between users and items. Most NCF implementations use 4-5 dense layers with under 100 neurons per layer, but many different sizes have been tested and productionized. For extremely sparse datasets we’ve used a 6 layer architecture with up to 256 neurons per layer, allowing the larger model to create more representations of the inputs. There is a point, however, where the ratio of items to users becomes far too large and requires a much deeper model, closer to DeepFM or FLEM.

Layers and Activations

All production level NCF implementations use ReLU as the activation function in the hidden layers given its non-saturated nature, with sigmoid written off for its tendency to stop learning as neuron values approach 0 or 1. LeakyReLU has become popular in other fields such as CNNs and GANs as an upgrade from ReLU, but no improvements have been seen with it in the realm of NCF.

relu function graphed
ReLU Function is used to apply non-linearity to data passing through a neuron

To prevent overfitting on these datasets, especially ones with a low item to user ratio, we use dropout layers and kernel regularization (mainly L2, as most modern DNNs do) throughout the model. Some models have opted to use both of these at every layer, although it's doubtful that's necessary given the struggle these models can already have learning seriously sparse inputs.


dropout layer applied to a standard neural net
A neural network system before and after applying dropout to layers
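Pulling the last few paragraphs together, here is a rough sketch of the MLP tower under those assumptions: a second, MLP-specific pair of embedding outputs (the hypothetical names `user_vec_mlp` and `item_vec_mlp`, built exactly like the embedding sketch above but kept separate from the GMF pair), stacked dense layers with ReLU, L2 kernel regularization, and dropout. Layer sizes and rates are illustrative, not tuned values.

```python
import tensorflow as tf

# Assumes user_vec_mlp and item_vec_mlp: a second pair of embedding outputs,
# built like the embedding sketch above but separate from the GMF embeddings.
mlp_vector = tf.keras.layers.Concatenate()([user_vec_mlp, item_vec_mlp])

# Four hidden layers under 100 neurons each: ReLU activations, L2 + dropout.
for units in (64, 32, 16, 8):
    mlp_vector = tf.keras.layers.Dense(
        units,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-6),
    )(mlp_vector)
    mlp_vector = tf.keras.layers.Dropout(0.2)(mlp_vector)
```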

The Final Output Layer: NeuMF

The final proposed step in the Neural Collaborative Filtering algorithm is what the authors call the NeuMF layer. This final layer takes the concatenation of the outputs from the MLP and MF models as input into a dense neural network layer. Combining these models superimposes their desirable characteristics. This is the final layer that produces the prediction for a user and item pairing, determining whether the pair should result in a recommendation. The result comes out as the percentage chance the model predicts a “1” for our target, or, as we said before, the chance the user would rate/like/subscribe to the item.

NeuMF math
Combination of GMF operations and MLP operations
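A sketch of that NeuMF head, assuming the `gmf_vector` and `mlp_vector` tensors and the `user_input`/`item_input` placeholders from the earlier sketches, with the Lecun Uniform kernel initialization and sigmoid activation discussed in the layer information section below:

```python
import tensorflow as tf

# Concatenate the GMF and MLP vectors and map them to a single probability
# that the user interacts with the item (the NeuMF layer).
neumf_vector = tf.keras.layers.Concatenate()([gmf_vector, mlp_vector])
prediction = tf.keras.layers.Dense(
    1,
    activation="sigmoid",                 # original authors' choice (see below)
    kernel_initializer="lecun_uniform",   # matches the original NeuMF code
    name="neumf_output",
)(neumf_vector)

model = tf.keras.Model(inputs=[user_input, item_input], outputs=prediction)
```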


Due to the non-convex and noisy objective function of NeuMF, gradient based optimizers like SGD struggle to find a global minimum and settle for a poor local optimum. We can solve this with two different approaches, both of which have been outlined in the original paper and modern implementations.


  1. Most implementations at this point use Adaptive Moment Estimation (Adam) as the optimizer for all training, given its faster convergence for both models and the ease of tuning its learning rate. The paper, however, suggests using Adam to pretrain the weights of the MF and MLP models until convergence, then switching back to vanilla SGD when training the combined NeuMF model. The reasoning is that Adam needs to save momentum information to update parameters, which is not available after pretraining.
  2. Newer implementations like TensorFlow's or Microsoft's simply use Adam throughout, and just include initializers and regularizers throughout the model, excluding the NeuMF layer.

Loss Function:

Similar to the different optimizer implementations seen above, the loss function has become a talking point around what is best. Many implementations, including the original paper, use log loss, also known as binary cross-entropy. This is by far the most used loss function, showing great results on datasets with both a high number of users and high item sparsity. Interestingly enough, TensorFlow has opted for sigmoid cross-entropy with mean reduction, applied to the output of the concatenation that feeds NeuMF.
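In practice, for the model sketched above, this usually comes down to a compile call like the following, pairing Adam with binary cross-entropy (log loss); the learning rate is an illustrative default, not a tuned value:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```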


Layer Information:

Interestingly enough, none of the standard implementations of NCF use any regularization in the final layer, but they do use kernel initialization with Lecun Uniform, as seen in the original authors' NeuMF code. The point where implementations diverge is the activation function: the original authors use a sigmoid function, while the systems built by Microsoft and TensorFlow use none.


NCF Evaluation

Evaluating NCF on test data can be a tricky task, as we want to evaluate the system in a way that reflects its results in a real world environment rather than loss or accuracy based metrics. The NCF paper reports results using top-k hit rate and Normalized Discounted Cumulative Gain (NDCG) with a ranked list of 10 items as the cutoff. Hit rate measures whether a specific positively interacted test item appears in the top 10 out of 100 items, with the other 99 items being randomly sampled. NDCG is similar but focuses on how high the item appears in the list by assigning higher scores to hits near the top.


Evaluation datasets are set up using leave-one-out evaluation, which is widely used in recommendation systems to evaluate recent interactions from users. Leave-one-out produces an evaluation dataset with just positive interactions, so random negative samples have to be pulled from the main dataset. Evaluation should be run often, as it can be difficult to predict the evaluation results based solely on the training results, given that the metrics are quite different for each.
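A minimal sketch of those two metrics for a single user under the leave-one-out setup described above (100 scored candidates: the held-out positive plus 99 sampled negatives), averaged over all users afterwards; `per_user_rankings` below is a hypothetical list of (ranked item ids, held-out item) pairs:

```python
import numpy as np

def hit_rate_and_ndcg_at_k(ranked_items, held_out_item, k=10):
    """ranked_items: candidate item ids sorted by predicted score, best first."""
    top_k = list(ranked_items[:k])
    if held_out_item not in top_k:
        return 0.0, 0.0                    # miss: contributes nothing to HR or NDCG
    rank = top_k.index(held_out_item)      # 0-based position within the top k
    return 1.0, 1.0 / np.log2(rank + 2)    # hit, and NDCG of a single relevant item

# hr, ndcg = np.mean(
#     [hit_rate_and_ndcg_at_k(r, pos) for r, pos in per_user_rankings], axis=0)
```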


Evaluation results for NCF when using Hit Rate and NDCG. Different models evaluated over epochs (X axis)

Implementing Recommendation Systems For Your Business

Amazon on-site recommendations based on item-item

Amazon's secret weapon is its personalized recommendations, used to increase customer LTV and email marketing ROI. Huge ecommerce companies have made it clear that they plan on using recommendation systems at all stages of their sales funnel to put products in front of potential customers that they want to buy. If you're an ecommerce store looking to compete moving forward, or have a subscription product and want to increase user engagement, personalized recommendations have proven to be a requirement.


With the state of modern recommendation systems you can avoid many of the problems endured with past systems, like the cold start problem, by implementing item based recommendations to build deep sales data and then moving towards personalized recommendations for each user. Amazon published a paper explaining how they start with item recommendations and use them for on-site similar products and bundles. Data collected with these systems, plus real sales data, allows them to move towards personalized email marketing campaigns to retarget existing customers. This should be no surprise, as this combination of the best digital channel for ROI and powerful recommendations is moving ecommerce giants along.

Netflix recommendations

But let's not forget, recommendations aren’t just for ecommerce and retail. Subscription based businesses like Netflix have been at the forefront of modern recommendation systems since as early as 2006. Models like neural collaborative filtering help subscription products present customers with better recommendations, increasing user engagement. As we said before, Netflix has stated that 80% of stream time comes from recommended movies and shows, clearly improving retention rate and keeping users engaged.

Neural Collaborative Filtering is a great way for small businesses to get involved in the growing push for personalized recommendations in sales and marketing. The model requires very little data in terms of features (only two) and can be adjusted and augmented to fit different recommendation use cases depending on where in the sales cycle the data comes from. Simple product recommendations based on purchasing history can easily be made with this model and used in many email upsell campaigns to past customers.


Interested in learning more about how recommendation systems could be implemented in your business? Width.ai has built powerful recommendation systems for huge userbases and understands how your data can quickly become valuable personalized recommendations. Let's talk today!