3_2: Market Analytics: Analysing and Predicting Aggregated Demand and Competition Flashcards

Question

What is the output of model-based clustering?

Answer 1

Shows the **optimal number of clusters** (if g was not predetermined), along with:
- BIC
- Log-likelihood

Answer 2

- Observations come from **groups with different statistical distributions** --> the algorithm tries to find the best set of such underlying distributions
- It **models clusters as being drawn from a mixture of normal distributions**
- Can **only be used** with **numerical data**

Answer 3

Differences are **attributable to unobserved groups** that one wishes to uncover (**nclass** is predetermined) --> Bayesian technique --> **Goal**: estimate the **probabilities of membership** in each class and assign individuals to their most likely class

Answer 4

1. Variable scores are caused by the hidden groups
2. LCA posits a latent variable that maximizes the **likelihood of observing the scores on the variables**
3. It creates a probability of each observation belonging to each segment
4. Each observation is placed in the segment for which it has the highest probability
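The membership step above can be sketched in a few lines. This is only an illustration under stated assumptions: the class sizes and item probabilities are hypothetical and taken as given (a real LCA estimates them from the data), the items are binary, and the function name `posterior_membership` is invented for this example.

```python
def posterior_membership(class_priors, item_probs, responses):
    """Return P(class | responses) for binary items, assumed independent within each class."""
    joint = []
    for prior, probs in zip(class_priors, item_probs):
        likelihood = 1.0
        for p, r in zip(probs, responses):
            # p is P(item = 1 | class); use 1 - p when the response is 0
            likelihood *= p if r == 1 else (1.0 - p)
        joint.append(prior * likelihood)
    total = sum(joint)
    return [j / total for j in joint]


priors = [0.6, 0.4]                     # hypothetical class sizes
item_probs = [[0.9, 0.8], [0.2, 0.3]]   # hypothetical P(item = 1 | class)
post = posterior_membership(priors, item_probs, [1, 1])

# Assign the observation to its most likely class
assigned = max(range(len(post)), key=post.__getitem__)
```

Here the respondent answered 1 on both items, so the first class, which has high probabilities for both, receives almost all of the posterior mass.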

Answer 5

* Possible for **complex data**
* Provides **optimal number of clusters**
* Provides **indicator for variables**
* Segment **probability score**

Answer 6

- **BIC:** Bayesian information criterion (lower values are better)
- **Error rate** (better if lower)
- **Log-likelihood**, which is typically negative (better if less negative)
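To make the "lower is better" reading of BIC concrete, here is a minimal sketch using the common definition BIC = -2 · log-likelihood + k · ln(n), where k is the number of estimated parameters and n the number of observations. The log-likelihoods and parameter counts below are hypothetical, not results from any model in these cards.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: lower values indicate a better trade-off."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical fits of a 2-class and a 3-class model on 300 observations
bic_two_classes = bic(log_likelihood=-1200.0, n_params=8, n_obs=300)
bic_three_classes = bic(log_likelihood=-1180.0, n_params=14, n_obs=300)

# The model with the smaller BIC would be preferred
preferred = "three classes" if bic_three_classes < bic_two_classes else "two classes"
```

The penalty term k · ln(n) is what stops the criterion from always favouring the model with more classes.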

Answer 7

1. Transform to the right format
2. Compute the **distance matrix**
3. Apply the clustering method
4. Analyze the groups
5. Examine the solution in the model and apply
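Step 2 can be illustrated with a small sketch. It assumes purely numeric data and Euclidean distance; the cards do not say which distance measure is used, so treat that choice and the name `distance_matrix` as assumptions.

```python
import math


def distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between numeric rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(rows[i], rows[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Three hypothetical observations with two numeric variables each
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
dist = distance_matrix(points)
```

A clustering method such as hierarchical clustering would then work from `dist` rather than from the raw rows.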

Answer 8

= use observations with **known status** to derive **predictors** which then can be applied to new observations

Answer 9

1. Collect data (group membership is known)
2. Split the data: training set with 50-80% of observations and test set with 20-50%
3. Build the prediction model: identify predictors from the training data
4. Assess performance by applying the predictors to the test dataset
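The splitting step can be sketched as follows. The 65% training share is simply one value inside the 50-80% range mentioned above, and `train_test_split` is a hypothetical helper written for this example, not a library call.

```python
import random


def train_test_split(rows, train_fraction=0.65, seed=42):
    """Shuffle a copy of the rows and cut it into a training and a test set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


observations = list(range(100))   # stand-in for 100 labelled observations
train, test = train_test_split(observations)
```

The model is then built on `train` only, and `test` is held back for the performance assessment.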

Answer 10

1. Consider the **raw agreement rate**: mean(raw$Segment == prediction) = 0.92 --> 92% correct predictions
2. Compare performance against random chance using the **ARI**: 1 = perfect agreement, 0 = random, negative = worse than random
3. Assess performance for **each** class using a **confusion matrix**: actual segments on the left (**rows**), predicted segments on top (**columns**)
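The raw agreement rate and the confusion matrix can be reproduced by hand on a toy example. The labels below are made up for illustration; rows are actual segments and columns are predicted segments, matching the layout described above.

```python
from collections import Counter


def agreement_rate(actual, predicted):
    """Share of observations whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_matrix(actual, predicted):
    """Return the sorted labels and a matrix with actual in rows, predicted in columns."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    return labels, [[counts[(a, p)] for p in labels] for a in labels]


actual = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]

rate = agreement_rate(actual, predicted)
labels, matrix = confusion_matrix(actual, predicted)
```

The diagonal of `matrix` holds the correct predictions; off-diagonal cells show which segments get confused with each other, which the single agreement rate hides.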

Answer 11

1. Compare performance against random chance using the **ARI** (adjustedRandIndex): 1 = perfect agreement, 0 = random, negative = worse than random
2. Assess performance for **each** class using a **confusion matrix** on the **test data**: actual segments on the left (**rows**), predicted segments on top (**columns**)
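For intuition about what the ARI measures, here is a sketch of the standard adjusted Rand index formula in plain Python. It is not the implementation behind adjustedRandIndex, and it assumes that at least one of the two labelings contains more than one group (otherwise the denominator is zero).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same observations."""
    n = len(labels_a)
    # Pairs of observations that share a cell, a row, or a column of the contingency table
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)


same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call gives perfect agreement even though the label names differ, because the ARI compares groupings rather than names. The second gives a negative value, meaning the agreement is below what chance would produce.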

Answer 12

1. Use the **test data** to predict() values based on the trained model
2. Review the **segment frequencies** and compare them to the initial **a-priori frequencies** based on the **training data**

Answer 13

1. Use the **test data** to predict() values based on the trained model
2. Plot the **clusters** based on the test data

Answer 14

= Training data is used to learn the **probability of class membership** as a function of each **predictor variable** considered independently --> using Bayes' rule --> starts with the **observed probabilities** of the variables conditional on the segments found in the training data --> **only uses one model**
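The idea can be sketched with a tiny categorical naive Bayes classifier. This is a simplified illustration, not a library implementation: it uses no smoothing, the two predictors and the segment labels are invented, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and, per class, the value frequencies of each predictor."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, predictor index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts


def predict_proba(model, row):
    """Bayes' rule with predictors treated as independent given the class."""
    class_counts, value_counts = model
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / n   # observed class probability
        for j, value in enumerate(row):
            score *= value_counts[(label, j)][value] / count   # P(value | class)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


rows = [("yes", "urban"), ("yes", "rural"), ("no", "urban"), ("yes", "urban"),
        ("no", "rural"), ("no", "urban"), ("yes", "rural"), ("no", "rural")]
labels = ["S1", "S1", "S1", "S1", "S2", "S2", "S2", "S2"]

model = train_naive_bayes(rows, labels)
proba = predict_proba(model, ("yes", "urban"))
```

Because each predictor contributes its own conditional probability, a value that never occurs with a class in the training data forces that class to probability zero; practical implementations usually add smoothing to avoid this.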

Answer 15

Instead of using a single model, it **builds an ensemble of models** that jointly classify the data by fitting **many classification trees** (forest) --> **not providing** class membership

Answer 16

When using randomForest for **prediction**, the model might generate predictions with 90% **being in one group** --> **imbalance** --> **Resolving**:
- Look at the frequency table of the **training data** to see the group **allocation** --> pick the **smallest** group size
- Set **sampsize** = "value" in the randomForest model
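The reasoning behind this fix can be illustrated outside of R: find the size of the smallest class in the training data and draw that many observations from every class. The sketch below only demonstrates the balancing idea; it is not what randomForest does internally, and `balanced_sample` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(rows, labels, seed=0):
    """Draw the same number of rows from each class, limited by the smallest class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    smallest = min(len(members) for members in by_class.values())
    rng = random.Random(seed)
    sample = []
    for label, members in by_class.items():
        for row in rng.sample(members, smallest):
            sample.append((row, label))
    return sample, smallest


# Hypothetical imbalanced training data: 6 x A, 3 x B, 2 x C
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 2
rows = list(range(len(labels)))

sample, smallest = balanced_sample(rows, labels)
class_sizes = Counter(label for _, label in sample)
```

With equal class sizes in the sample, the dominant group can no longer win simply by being the most frequent.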

Answer 17

The model uses many **predictor variables**, so it is useful to know the **importance of the different** classification variables --> randomForest(importance = **TRUE**)

Answer 18

- **Many classification trees (forest)** instead of one model
- **More accurate** because more models are applied
- **Useful to estimate the importance** of **predictor** variables

Answer 19

Used to **predict a categorical (and usually binary)** dependent variable

Answer 20

1. Identify the **most effective predictor variable** for predicting the binary dependent variable
2. Start with the root node with all combinations, then use the independent variables to split the root node to **create the most improvement in class** separation

Answer 21

**Pure decision node:** all data points associated with that node have **same value of dependent variable**

Answer 22

**Impurity**: assesses the **degree of impurity or heterogeneity** within a **subset of data** in the classification tree. The impurity of a **split** is the weighted average of the impurities of the nodes involved in that split.
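The weighted average can be shown with a short sketch. The card does not name a specific impurity measure, so the Gini index used here is an assumption, and both function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(child_nodes, impurity=gini):
    """Impurity of a split: node impurities weighted by the share of data in each node."""
    total = sum(len(node) for node in child_nodes)
    return sum(len(node) / total * impurity(node) for node in child_nodes)


left = ["yes", "yes", "yes", "yes"]    # pure node
right = ["yes", "no", "no", "no"]      # mixed node
impurity_after_split = split_impurity([left, right])
```

The pure left node contributes nothing, so the impurity of the split is half the impurity of the right node.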

Answer 23

**Entropy** is a measure of **impurity or disorder**
- Entropy is calculated using conditional probability
- Always between 0 and 1 (for a binary dependent variable, using log base 2)
- **Lower entropy** --> decreasing impurity
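A short sketch of the entropy calculation, assuming log base 2 and class proportions estimated from the labels in a node (the function name is hypothetical):

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


pure_node = entropy(["yes", "yes", "yes", "yes"])
mixed_node = entropy(["yes", "yes", "no", "no"])
three_classes = entropy(["a", "b", "c"])
```

A pure node has entropy 0 and an evenly mixed binary node has entropy 1. With three equally likely classes the value is log2(3), which is above 1, so the 0 to 1 range holds for binary outcomes only.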

Answer 24

**Filter and predict** choices based on other people's behavior --> typically using **rating matrices**. Idea: people with **similar** preferences tend to like similar items

Answer 25

1. **Neighborhood-based method:** predict unknown ratings by **using the nearest neighbor approach**
2. **Model-based method:** use more complex, predictive models
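A neighborhood-based prediction might look like the sketch below. It is a user-based variant that assumes cosine similarity on co-rated items and a similarity-weighted average over the k nearest neighbors; the cards do not fix these choices, and the users, items, and ratings are invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two users, computed on the items both have rated."""
    shared = [item for item in u if item in v]
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict_rating(ratings, user, item, k=2):
    """Similarity-weighted average of the k most similar users' ratings for the item."""
    neighbors = [
        (cosine_similarity(ratings[user], other), other[item])
        for name, other in ratings.items()
        if name != user and item in other
    ]
    top = sorted(neighbors, reverse=True)[:k]
    weight = sum(sim for sim, _ in top)
    return sum(sim * r for sim, r in top) / weight if weight else None


ratings = {
    "ann": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
predicted = predict_rating(ratings, "ann", "c")
```

Because bob's known ratings match ann's exactly, his rating of 4 for item "c" dominates the prediction, while the less similar user pulls it down only slightly.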

Answer 26

A **rating matrix** exhibits strong user and item biases, and one can account for these systematic user and item effects. The model captures **only the average user and item effects**, but it can help to **absorb the biases and isolate** the signal that represents user-item interactions
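One simple way to estimate such a baseline is global mean + user effect + item effect, as sketched below. The estimation order (item effects first, then user effects on what remains) is an assumption chosen for brevity, the function names are hypothetical, and the ratings are made up.

```python
def fit_baseline(ratings):
    """ratings: list of (user, item, rating) triples. Returns the mean and both bias tables."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    item_dev, user_dev = {}, {}
    for _, i, r in ratings:
        item_dev.setdefault(i, []).append(r - mu)
    item_bias = {i: sum(d) / len(d) for i, d in item_dev.items()}
    for u, i, r in ratings:
        # What is left after removing the global mean and the item effect
        user_dev.setdefault(u, []).append(r - mu - item_bias[i])
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    return mu, user_bias, item_bias


def baseline_prediction(model, user, item):
    """Global mean plus the average user and item effects (0 for unseen users or items)."""
    mu, user_bias, item_bias = model
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 3), ("u2", "i2", 1)]
model = fit_baseline(ratings)
generous_user_popular_item = baseline_prediction(model, "u1", "i1")
```

Whatever a rating deviates from this baseline is the part that reflects the specific user-item interaction.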

Answer 27

**Advantages:**
- Makes recommendations **without any additional information** about catalog items
- Helps to **produce non-trivial recommendations**
- Rating **captures human tastes and judgments**

Answer 28

**Disadvantages:**
- **Difficult to build reliable** prediction models with trustworthy ratings
- Content filtering is **biased towards popular items** and standard choices
- **Cold start problem**: does not work for new users
- **Product standardization** is **difficult**
- **Arbitrary assumptions**

Answer 29

- **Easy to program**
- More **attractive** when **users are personally familiar**

Answer 30

- Preferred when **customers are not familiar**
- **Item-based matrix** of correlation is **more stable** over time
- Matrix needs to be **updated less often**