Recommender System Flashcards

Question

What are the considerations when using novelty?

Answer 1

You need to strike a balance between recommending familiar, popular items and enabling the serendipitous discovery of new ones.

Answer 2

Sales typically follow a heavily skewed, long-tail distribution: most sales come from a very small number of items, but the long tail still accounts for a large share of total sales.

Answer 3

Recommender systems can help people discover those items in the long tail that are relevant to their own unique niche interests. If you can do that successfully, then the recommendations your system makes can help new authors get discovered, can help people explore their own passions, and make money for whoever you're building the system for as well.

Answer 4

How quickly new user behavior influences the recommendations.

Answer 5

Faster responsiveness lets the system catch up with current trends and patterns more easily.

Answer 6

The trade-off is between complexity and responsiveness.

Answer 7

Simply ask your users whether they think specific recommendations are good.

Answer 8

It provides explicit feedback for the recommender system.

Answer 9

The data is noisy because there is no clear indicator or standard for what makes a good recommendation.

Answer 10

Put recommendations from different algorithms in front of different sets of users and measure how they react to the presented recommendations

Answer 11

One of the best ways to tune a recommender system. The results of online A/B tests should be treated as the metric that matters most.

Answer 12

Complex and expensive to execute and maintain.

Answer 13

A technique that suggests items to users based on the **characteristics or attributes of items users have rated**, analyzing the content of the items themselves.

Answer 14

Cosine Similarity, Pearson Correlation Coefficient, Jaccard Similarity, Euclidean Distance, etc.

Answer 15

Cosine similarity measures the cosine of the angle between two vectors. It ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction); 0 means the vectors are orthogonal.

Answer 16

Commonly used in text mining, collaborative filtering, and information retrieval. Suitable for high-dimensional data where the direction of vectors matters more than the magnitude.
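
As a minimal sketch of the definition, the function below computes cosine similarity for two equal-length vectors using only the standard library. The function name, the sample vectors, and the choice to return 0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat the undefined case as "no similarity"
    return dot / (norm_a * norm_b)

# Hypothetical rating vectors: same direction, different magnitude.
alice = [5, 3, 0, 1]
bob = [10, 6, 0, 2]
print(cosine_similarity(alice, bob))
```

Because only direction matters, doubling every rating leaves the similarity unchanged.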

Answer 17

Euclidean distance measures the straight-line distance between two points in a multi-dimensional space. Smaller distances indicate greater similarity.

Answer 18

Suitable for **continuous numerical data** in a multidimensional space. Often used in clustering, pattern recognition, and image analysis.
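
A small sketch of Euclidean distance follows. The helper that turns a distance into a similarity score, 1 / (1 + d), is one common convention and an assumption here, not something the card prescribes.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_similarity(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed convention)."""
    return 1.0 / (1.0 + distance)

d = euclidean_distance([0, 0], [3, 4])
print(d, distance_to_similarity(d))
```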

Answer 19

Hamming distance measures the dissimilarity between two strings of equal length. It is defined as the number of positions at which the corresponding symbols (characters or bits) are different.

Answer 20

Primarily used for **binary data**, such as in **error-detecting** and **error-correcting codes**. Also applicable to categorical data whose positions are aligned (sequences of equal length).
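
The position-by-position count described above (commonly known as Hamming distance) can be sketched as follows. The function name and the example strings are illustrative assumptions.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("1011101", "1001001"))
print(hamming_distance("karolin", "kathrin"))
```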

Answer 21

Jaccard similarity measures the proportion of common elements between two sets. It ranges from 0 (no common elements) to 1 (identical sets).

Answer 22

Commonly used in set similarity, document similarity, and recommendation systems. Suitable for scenarios where the presence or absence of elements is important.
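
A minimal sketch of Jaccard similarity on Python sets. Returning 1.0 for two empty sets is an assumed convention, and the item sets are hypothetical.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

# Hypothetical sets of items two users have interacted with.
print(jaccard_similarity({"A", "B", "C"}, {"B", "C", "D"}))
```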

Answer 23

Adjusted cosine similarity measures the similarity between users or items based on their ratings. It extends traditional cosine similarity with adjustments for user (or item) biases: each rating is considered relative to the user's overall rating tendencies.

Answer 24

Collaborative filtering, especially when accounting for user/item biases.
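
One common formulation for item-to-item similarity subtracts each user's mean rating before applying the cosine formula. The sketch below assumes ratings are stored as a `user -> {item: rating}` dictionary; the data layout, names, and sample ratings are illustrative assumptions.

```python
import math

def adjusted_cosine(ratings, item_i, item_j):
    """Item-item similarity using ratings centered on each user's mean."""
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[item_i] - mean  # deviation from the user's habit
            dj = user_ratings[item_j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0 or den_j == 0:
        return 0.0  # assumption: no usable co-ratings means similarity 0
    return num / math.sqrt(den_i * den_j)

# Hypothetical ratings.
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 2, "B": 2, "C": 5},
}
print(adjusted_cosine(ratings, "A", "B"), adjusted_cosine(ratings, "A", "C"))
```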

Answer 25

Pearson correlation measures the linear relationship between two variables. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).

Answer 26

Widely used in statistics, collaborative filtering, and linear relationships. Suitable for continuous data where a linear relationship is expected.
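
A short sketch of the Pearson correlation coefficient for two equal-length lists. Returning 0 when one list has no variance is an assumed convention, since the coefficient is undefined in that case.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: undefined correlation reported as 0
    return cov / (sd_x * sd_y)

print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3], [6, 4, 2]))
```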

Answer 27

Measures the strength and direction of the association between two ranked variables.

Answer 28

Suitable for ordinal or ranked data, often used in non-parametric statistics. Useful when the relationship between variables is monotonic but not necessarily linear.
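
Assuming the rank-based measure on this card is Spearman's rank correlation (the card does not name it), it can be sketched as a Pearson correlation computed on ranks, with tied values sharing their average rank. All names below are illustrative.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation applied to the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0

# Monotonic but non-linear relationship.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```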

Answer 29

A measure of the average squared difference between corresponding elements of two vectors.

Answer 30

Commonly used as a loss function in regression analysis, optimization problems, and model evaluation. Measures the average squared differences between corresponding elements.
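
A minimal sketch of mean squared error between actual and predicted ratings. The sample values are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of squared element-wise differences."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(mean_squared_error([3, 5, 2], [2, 5, 4]))
```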

Answer 31

1. Measure the similarity score between a movie and all other rated items. Repeat for all unrated items.
2. Sort and produce the top items.
3. Use a weighted average to predict the rating.

Answer 32

A technique that takes cues from people like you and recommends things they like that you haven’t seen yet. In other words, recommendations are based on people’s collaborative behavior.

Answer 33

Sparsity refers to the proportion of missing or unrated values in the user-item interaction matrix. Recommender system matrices are often sparse because users typically interact with or rate only a small fraction of the items available in the system. Most entries in the matrix are missing.

Answer 34

The sparsity of the matrix poses challenges for recommendation algorithms because it means that a large portion of the user-item interaction data is unknown. This can lead to difficulties in accurately predicting user preferences for unrated items.
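
To make the notion concrete, the sketch below computes sparsity for a small rating matrix. Representing missing ratings as `None` and the sample matrix itself are assumptions made for illustration.

```python
def sparsity(matrix):
    """Fraction of entries in a user-item matrix that are missing (None)."""
    total = sum(len(row) for row in matrix)
    missing = sum(1 for row in matrix for value in row if value is None)
    return missing / total

# Hypothetical matrix: rows are users, columns are items.
ratings = [
    [5, None, None],
    [None, 3, None],
]
print(sparsity(ratings))
```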

Answer 35

Recommend stuff that similar users like that you haven’t seen yet.

Answer 36

1. Collect data as a 2D array in which each user has a vector representing their ratings for each item. This is the user-item rating matrix.
2. Compute the cosine similarity between every pair of users. This is the user similarity matrix.
3. Sort the list and pick the top-n similar users.
4. Candidate generation: take the items rated by the chosen users.
5. Candidate scoring: there are many ways, for example normalizing the rating scores.
6. Candidate sorting: rank the candidates using the above scores.
7. Candidate filtering: remove items that the user has already seen.
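
The user-based steps above can be sketched roughly as follows. This is one possible reading, not a definitive implementation: the `user -> {item: rating}` layout, the scoring rule (similarity times rating normalized by an assumed maximum of 5), and the sample data are all assumptions.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (missing = 0)."""
    dot = sum(r * v[i] for i, r in u.items() if i in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend_user_based(ratings, target, n_neighbors=2, max_rating=5.0):
    # Similarity of the target to every other user; keep the top-n.
    sims = [(cosine(ratings[target], r), u)
            for u, r in ratings.items() if u != target]
    neighbors = sorted(sims, reverse=True)[:n_neighbors]
    # Candidate generation and scoring from the neighbors' ratings.
    scores = defaultdict(float)
    for sim, user in neighbors:
        for item, rating in ratings[user].items():
            scores[item] += sim * (rating / max_rating)
    # Filter out items the target has already rated, then sort by score.
    seen = ratings[target]
    ranked = sorted(((s, i) for i, s in scores.items() if i not in seen),
                    reverse=True)
    return [item for _, item in ranked]

# Hypothetical data.
ratings = {
    "ann": {"Star Wars": 5, "Empire": 5},
    "bob": {"Star Wars": 5, "Empire": 4, "Jedi": 5},
    "cat": {"Casablanca": 5, "Empire": 1, "Notorious": 4},
}
print(recommend_user_based(ratings, "ann"))
```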

Answer 37

Recommend things that are similar to the items a user likes. One advantage is that items are more permanent than users, and the approach suits smaller datasets.

Answer 38

1. Represent the data so that each item has a vector containing its characteristics, or how much other users liked it. → item rating matrix
2. Compute the cosine similarity between every pair of items. → item similarity matrix
3. Sort the list and pick the top-n similar items.
4. Candidate filtering.
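
A rough sketch of the item-based steps, assuming each item is described by the ratings users gave it. The data layout, helper names, and sample ratings are illustrative assumptions.

```python
import math

def item_vectors(ratings):
    """Transpose user -> {item: rating} into item -> {user: rating}."""
    items = {}
    for user, user_ratings in ratings.items():
        for item, rating in user_ratings.items():
            items.setdefault(item, {})[user] = rating
    return items

def cosine(u, v):
    """Cosine similarity between two sparse vectors (missing = 0)."""
    dot = sum(r * v[k] for k, r in u.items() if k in v)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similar_items(ratings, liked_item, seen, top_n=2):
    """Top-n items most similar to liked_item, excluding items in seen."""
    vectors = item_vectors(ratings)
    sims = [(cosine(vectors[liked_item], vec), item)
            for item, vec in vectors.items() if item != liked_item]
    ranked = sorted(sims, reverse=True)
    return [item for _, item in ranked if item not in seen][:top_n]

# Hypothetical data.
ratings = {
    "ann": {"A": 5, "B": 5},
    "bob": {"A": 5, "B": 4, "C": 1},
    "cat": {"C": 5, "D": 4},
}
print(similar_items(ratings, "A", seen={"A"}))
```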

Answer 39

User-based KNN, for user u and item i:
1. Find the k most similar users who rated the item.
2. Compute the mean of their ratings, weighted by similarity score.
3. Use the result as the rating prediction.
Item-based KNN, for user u and item i:
1. Find the k most similar items also rated by the user.
2. Compute the mean of the user's ratings for those items, weighted by similarity score.
3. Use the result as the rating prediction.
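
A sketch of the user-based variant, assuming similarity scores have already been computed and are passed in as a nested dictionary. That precomputed structure, the decision to ignore non-positive similarities, and the sample numbers are assumptions.

```python
def predict_rating(ratings, similarity, user, item, k=2):
    """Similarity-weighted mean rating from the k nearest users who rated item."""
    neighbors = [
        (similarity[user][other], user_ratings[item])
        for other, user_ratings in ratings.items()
        if other != user and item in user_ratings
        and similarity[user].get(other, 0.0) > 0
    ]
    top_k = sorted(neighbors, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbors, so no prediction
    return sum(sim * rating for sim, rating in top_k) / total_sim

# Hypothetical ratings and precomputed similarities for user "ann".
ratings = {"ann": {}, "bob": {"X": 4}, "cat": {"X": 2}, "dan": {"X": 5}}
similarity = {"ann": {"bob": 0.9, "cat": 0.1, "dan": 0.5}}
print(predict_rating(ratings, similarity, "ann", "X", k=2))
```

The item-based variant has the same shape: swap the roles so that the neighbors are the k most similar items the user has already rated.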

Answer 40

Describe users and items as combinations of different amounts of each feature.

Answer 41

Matrix factorization is particularly effective in handling sparse matrices commonly encountered in recommender systems.

Answer 42

Describe the training data in terms of smaller matrices that are factors of the ratings we want to predict:
- R = M Σ U^T; multiplying the factors back together automatically fills in the unrated user-item pairs.
Given:
- M describes users in terms of latent features (the PCA matrix).
- Σ is a diagonal matrix of scaling values.
- U^T describes items in terms of the same latent features.
- Missing values in the original rating matrix can be given random default values at first and then refined by repeatedly minimizing the error.

Answer 43

PCA is a feature extraction technique. In a recommender system, PCA finds and extracts latent features from the data. PCA tries to find principal components, which are eigenvectors. The first eigenvector is the direction that best describes the variance in the data, and each further eigenvector is orthogonal to the previous ones. Together they define a new vector space that fits the data.

Answer 44

Singular value decomposition is a technique to decompose R into M, Σ and U. When data is missing, techniques such as SGD (stochastic gradient descent) and ALS (alternating least squares) can be used to learn the best values of those factor matrices. Strictly speaking, this makes it an SVD-inspired algorithm rather than true SVD.
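
The SGD idea can be sketched as follows: learn a user-factor matrix and an item-factor matrix from the known ratings only, then use their dot product to estimate any user-item pair. This simplified version folds the scaling matrix into the two factor matrices, and the hyperparameters, names, and sample ratings are assumptions.

```python
import random

def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.5) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 0.5) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Estimated rating: dot product of user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def mse(P, Q, ratings):
    return sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)

# Hypothetical known ratings as (user index, item index, rating).
known = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P, Q = factorize(known, n_users=3, n_items=3)
print(mse(P, Q, known), predict(P, Q, 0, 2))  # the pair (0, 2) was never rated
```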

Answer 45

Gradient descent is an optimization algorithm commonly used in training machine learning models. The goal of gradient descent is to minimize a cost or loss function by iteratively adjusting the model's parameters in the direction of steepest descent of the gradient. This iterative process continues until the algorithm converges to a minimum or until a predefined stopping criterion is met
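
A minimal one-dimensional sketch of the loop described above. The example function f(x) = (x - 3)^2, the learning rate, and the stopping tolerance are assumptions chosen for illustration.

```python
def gradient_descent(grad, start, lr=0.1, tol=1e-8, max_iters=1000):
    """Repeatedly step against the gradient until the steps become tiny."""
    x = start
    for _ in range(max_iters):
        step = lr * grad(x)
        x -= step
        if abs(step) < tol:  # predefined stopping criterion
            break
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), start=0.0))
```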

Answer 46

A technique for computing derivatives of functions automatically.

Answer 47

The softmax function takes a vector of real numbers as input and outputs a probability distribution that sums to 1. Used for tasks involving multi-class classification or probability distribution
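
A short sketch of softmax in plain Python. Subtracting the maximum score before exponentiating is a standard trick for numerical stability and does not change the result.

```python
import math

def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shifted for stability
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))
```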

Answer 48

A key algorithm used in training artificial neural networks. Backpropagation is a supervised learning method that enables the optimization of model parameters by minimizing a chosen loss or objective function.

Answer 49

1. Forward pass: input data is fed into the neural network to generate output values.
2. Calculate the loss function.
3. Backward pass: compute the gradient of the loss with respect to each model parameter. This is achieved by applying the chain rule of calculus.
4. Parameter update and repeat.
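
The four steps can be illustrated on the smallest possible network: a single sigmoid neuron trained on the logical OR function. In deeper networks the same chain-rule step is applied layer by layer. The squared-error loss, learning rate, epoch count, and all names are assumptions for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

def train_neuron(samples, lr=0.5, epochs=5000):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            out = predict(w, b, x)           # 1. forward pass
            d_out = out - target             # 2. loss 0.5 * (out - target)^2, dL/d_out
            d_z = d_out * out * (1.0 - out)  # 3. backward pass via the chain rule
            w[0] -= lr * d_z * x[0]          # 4. parameter update
            w[1] -= lr * d_z * x[1]
            b -= lr * d_z
    return w, b

OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_neuron(OR_DATA)
print([round(predict(w, b, x), 2) for x, _ in OR_DATA])
```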

Answer 50

A mathematical operation applied to the output of a neuron in a neural network. Examples:
- Sigmoid activation function: squashes the output into the range (0, 1), which suits binary outputs.
- Hyperbolic tangent activation function: similar to the sigmoid but with a range of (-1, 1), making it zero-centered.
- Rectified Linear Unit (ReLU) activation function: the most widely used activation function. It introduces non-linearity and is computationally efficient.
- Softmax activation function: often used in the output layer for multi-class classification.
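
The first three activation functions listed above are simple enough to sketch directly. The function names are illustrative.

```python
import math

def sigmoid(z):
    """Output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Output in (-1, 1), centered on zero."""
    return math.tanh(z)

def relu(z):
    """Passes positive values through and clips negatives to zero."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```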

Answer 51

Defines the strategy for adjusting the model parameters during training to minimize the chosen loss or objective function. Examples:
- Stochastic Gradient Descent (SGD): iteratively updates parameters based on the gradient of the loss with respect to the parameters.
- Adam (Adaptive Moment Estimation): combines ideas from RMSprop and momentum.
- RMSprop (Root Mean Square Propagation): adapts the learning rate of each parameter based on its historical gradients.
- Momentum optimization: adds a momentum term to the update rule for the model parameters.

Answer 52

1. Regularization techniques: L1, L2, and Elastic Net regularization.
2. Dropout: randomly drop (ignore) a fraction of neurons during training.
3. Data augmentation: increase the effective size of the training dataset by applying random transformations (rotations, translations, flips) to the input data.
4. Early stopping: monitor the performance on a validation dataset during training and stop the training process when the performance starts to degrade.

Answer 53

1. Start small and gradually increase complexity.
2. Start with a sequential approach, adding layers one at a time, and use early stopping when the model starts to degrade.
3. Use model zoos and pre-trained architectures.

Answer 54

An architecture for executing a graph of numerical operations. It optimizes the processing of that graph and can distribute the work across a network of machines and across GPUs. Tensor: an array or matrix of values.

Answer 55

1. Load up the training and testing data.
2. Construct a graph of the neural network:
   1. Use placeholders for the input data and target labels.
   2. Use variables for the learned weights for each connection and the learned biases for each neuron.
3. Associate an optimizer.
4. Run the optimizer with the data.
5. Evaluate with the testing data.
Note: Make sure features are normalized so that every input feature is comparable in terms of magnitude.

Answer 56

Keras is an open-source high-level neural networks API written in Python

Answer 57

The idea is to take source data, break it up into chunks called convolutions, and then assemble those and look for patterns of increasing complexity at higher levels of the neural network.

Answer 58

CNNs are good for unstructured data in which a feature can appear anywhere, because they find features regardless of location (feature-location invariance).

Answer 59

Resource-intensive, with many hyperparameters to tune and demanding training-data requirements.

Answer 60

Source data must have appropriate dimensions.
- Conv2D layers do the convolution on a 2D image.
- MaxPooling2D layers reduce a 2D layer by taking the maximum value in a given block.
- Flatten layers convert the 2D layer to a 1D layer for passing into a flat hidden layer of neurons.
Typical usage: Conv2D → MaxPooling2D → Dropout → Flatten → Dense → Dropout → Softmax

Answer 61

There is a loop over a neuron: the output from the previous run is fed back in as an input, so the neuron retains characteristics of the previous runs’ outputs, which improves its learning capability.

Answer 62

For sequences of data (time-series data) or sequences of arbitrary length (machine translation, image captions, machine-generated music).

Answer 63

Sensitive to topology and to the choice of hyperparameters; resource-intensive.

Answer 64

Implementing Recurrent Neural Networks (RNNs) with Keras involves using the SimpleRNN, LSTM (Long Short-Term Memory), or GRU (Gated Recurrent Unit) layers provided by Keras. These layers allow you to model sequential and time-dependent data, making them suitable for tasks such as time series prediction, natural language processing, and more

Answer 65

The challenge when using deep learning for a recommender system is that the algorithm has to work with sparse data. Deep learning models, especially neural collaborative filtering approaches, may struggle to effectively capture patterns in sparse data. This problem is magnified by the cold-start problem.

Answer 66

Contains two layers:
- A visible layer: the training data is fed into this layer in a forward pass.
- A hidden layer: weights and biases are trained during the backward pass, and an activation function produces the output.
RBMs for recommender systems: use each user in the training data as a set of inputs into the RBM.
- The visible nodes represent the ratings for a given user, and the network learns weights and biases to produce predictions for unknown user-item pairs.
- Sparse data is handled by excluding any missing ratings from processing while training the RBM.
- Function to optimize: contrastive divergence, which samples probability distributions during training using a Gibbs sampler.

Answer 67

Autoencoders are used to learn a low-dimensional representation of user-item interactions. The encoder captures essential features, and the decoder reconstructs the original input. Contains 3 layers:
- An input layer that contains individual ratings
- A hidden layer
- An output layer
A matrix of weights between the layers is maintained across every instance of this network, as well as the biases. Sparse data is handled by excluding the unrated observations. The algorithm does two things:
- Encoding: capturing the patterns in the input as a set of weights into the hidden layer.
- Decoding: using the weights between the hidden and output layers to reconstruct the output.

Answer 68

Given a sequence of actions, predict the actions or items that are most likely to continue the sequence of events. The idea is that each item is encoded as one event, passed through an embedding layer and then through GRU layers, producing scores for all items, from which we can select the items the deep network thinks are most likely to extend the sequence.

Answer 69

A framework for processing massive datasets across a cluster of computers. The Spark driver script communicates with the cluster manager to distribute workloads to the executors. This distributed architecture allows Spark to handle large-scale data processing tasks efficiently by distributing the workload across multiple nodes.

Answer 70

An object that encapsulates the data you want to process. A Spark driver script defines operations on an RDD, such as where to load its data from, what operations to perform, and where to write the outputs. RDDs serve as the foundational data structure in Spark, providing fault tolerance, parallel processing, and distributed computing capabilities.

Answer 71

Deep Scalable Sparse Tensor Neural Engine. It helps set up deep neural networks using sparse data. DSSTNE can run on a GPU and work with Apache Spark to further scale. Open-source. DSSTNE is designed to efficiently process sparse data, making it particularly well-suited for recommender systems where user-item interaction matrices are often sparse

Answer 72

It allows you to create notebooks hosted on AWS that can train large scale models in the cloud. It comes with some useful algorithms.

Answer 73

Examples: Cold-start problem, Filter Bubbles, Gaming the System, Temporal Effect

Answer 74

New users or new items for which there is little to no historical interaction data.

Answer 75

Solutions for new users: use implicit data, cookies, geo-IP, top-sellers or promotions, or interview the user. Solutions for new products: use content-based attributes, map attributes to latent features, or use random exploration.

Answer 76

Stoplists check if an item might cause unwanted controversy. For example: It could be adult-oriented content, vulgarity, legally prohibited topics, terrorism/political extremism, bereavement, medical, competing products, drug, religion, etc.

Answer 77

The content presented to the user is filtered so that it stays within a bubble of their pre-existing interests.

Answer 78

Randomly introduce new interests.

Answer 79

Users may find it challenging to understand the reasoning behind recommendations, leading to a lack of transparency and eroding user trust in the system.

Answer 80

Let users know why the system recommends a particular item and allow users to fix the root cause themselves.

Answer 81

Outliers might be created by professional reviewers, institutional customers, or bots, leading to skewed recommendations and degraded model performance.

Answer 82

Use appropriate algorithms, data processing, normalization or standardization

Answer 83

People trying to manipulate or disrupt your recommendation system.

Answer 84

Only consider users who spent real money; restrict reviews to people who actually consumed the content; be wary of click data; remember that implicit data can lead to many problems.

Answer 85

Failure to account for cultural differences may result in recommendations that are perceived as inappropriate or irrelevant. Different countries also have varying regulations regarding user privacy and data protection.

Answer 86

Keep things separated geographically, and consider the applicable regulations and policies carefully.

Answer 87

The impact of time-related factors on user preferences, item popularity, and overall system dynamics. Examples: seasonal variation, trending items.

Answer 88

Take recency into account.
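
One possible way to do this, offered only as a sketch, is to down-weight older interactions with an exponential decay. The half-life parameter and its 30-day default are assumptions, not something the card specifies.

```python
def recency_weight(age_days, half_life_days=30.0):
    """Weight that halves every half_life_days (assumed decay scheme)."""
    return 0.5 ** (age_days / half_life_days)

def recency_weighted_rating(rating, age_days, half_life_days=30.0):
    """Scale a rating by how recently it was given."""
    return rating * recency_weight(age_days, half_life_days)

print(recency_weighted_rating(5, age_days=0))
print(recency_weighted_rating(5, age_days=30))
```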

Answer 89

Create the system that drives the most profit, not the best relevance

Answer 90

Use profitability as a tie-breaker; look at profit margin instead of absolute profit.

Answer 91

Ensemble approaches: combining different algorithms; combining behavior and semantic data

(115 cards)