08 - Evaluation Flashcards
What do you need to know to assess how good or meaningful results are?
- Which error metric was used
- Which dataset was used and how it was split
- The scale of the values being measured
- Context: how other algorithms perform
What can you use as a baseline? What are other algorithms that you can compare your own method with?
- State-of-the-art algorithms or the algorithm used so far
- Simple algorithms (e.g. linear regression)
- Mean or median
- Highest class probability (modal class)
- Some simple rules
- Random
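A minimal sketch of how some of these baselines could be computed and scored, assuming a rating-prediction task evaluated with mean absolute error (MAE). The toy ratings and the names `mae`, `train`, `test` and `baselines` are made up for illustration.

```python
import random
import statistics

def mae(predictions, truths):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)

# Hypothetical toy ratings on a 1-5 scale, made up for illustration.
train = [4, 5, 3, 4, 4, 2, 5, 4]
test = [4, 3, 5, 4]

rng = random.Random(0)  # fixed seed so the random baseline is repeatable

# Each baseline ignores user and item and predicts from the training ratings only.
baselines = {
    "mean": [statistics.mean(train)] * len(test),
    "median": [statistics.median(train)] * len(test),
    "modal": [statistics.mode(train)] * len(test),
    "random": [rng.choice(train) for _ in test],
}

baseline_mae = {name: mae(preds, test) for name, preds in baselines.items()}
for name, err in baseline_mae.items():
    print(f"{name:>6}: MAE = {err:.3f}")
```

An algorithm's MAE on the same test split only becomes meaningful once it is compared against numbers like these.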
Why do you need a baseline?
- Without a baseline, performance evaluations of an algorithm are typically of little or no relevance
- A baseline gives meaning to the results
What is the ground truth for recommender systems?
- Ratings, submitted ratings, relevance scores of a dataset
- These are considered “true” but may well be false, biased, sparse or noisy
What are the problems with the ground truth of recommender systems?
- Real ground truth is difficult to measure
- Ground truth is derived/approximated
- It is only the best approximation that is available
- Hard to find
What is the gold standard?
The best available thing you can get, even if it is not perfect
What are the assumptions of the Central Limit Theorem?
- Large number of observations
- Large random sample with n observations
- Observations are random (each is independent of the previous ones)
What is the Central Limit Theorem?
- The mean (and sum) of a sample approximately follows a normal distribution, whatever the distribution of the individual observations
- The larger n, the better this approximation, and the closer the sample mean tends to be to the true mean
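The statement can be checked with a small simulation. The sketch below assumes observations drawn from a uniform distribution on [0, 1] (clearly not normal) and looks at how the means of many samples behave as n grows. The helper `sample_means`, the seed and the sample sizes are illustrative choices.

```python
import random
import statistics

def sample_means(n, repetitions, rng):
    """Draw `repetitions` samples of size n from Uniform(0, 1); return their means."""
    return [
        statistics.fmean([rng.random() for _ in range(n)])
        for _ in range(repetitions)
    ]

rng = random.Random(42)  # fixed seed for repeatability
results = {}
for n in (5, 30, 100):
    means = sample_means(n, 2000, rng)
    centre = statistics.fmean(means)   # should be close to the true mean 0.5
    spread = statistics.stdev(means)   # should shrink roughly like 1 / sqrt(n)
    # For a normal distribution about 68% of values lie within one standard deviation.
    within_one_sd = sum(abs(m - centre) <= spread for m in means) / len(means)
    results[n] = (centre, spread, within_one_sd)
    print(f"n={n:>3}: mean={centre:.3f}  sd={spread:.3f}  within 1 sd={within_one_sd:.2f}")
```

The spread of the sample means shrinks as n grows, and the share of means within one standard deviation stays near the value expected for a normal distribution.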
What is statistical significance?
- Describes how likely a difference at least as large as the observed one would be if it were caused by chance alone
- Typical thresholds: the p value should be less than 0.05 or 0.01
- Statistically significant results can still be false or practically insignificant
What does statistical significance mean?
- Experimental data giving a p value of 0.05 means that there is only a 5% chance of getting a result at least as extreme as the observed one if no real effect exists
- The p value is the probability of obtaining such data if no real effect exists. It does not quantify the size or importance of the effect
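One way to obtain a p value without distributional assumptions is a permutation test. This is an illustrative technique chosen here, not one named in the flashcards. The sketch below assumes two hypothetical algorithms with made-up per-run errors and asks how often a random relabelling of the runs produces a mean difference at least as large as the observed one. The function name `permutation_p_value` and all numbers are assumptions.

```python
import random
import statistics

def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of means of a and b."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling: "no real effect" world
        diff = abs(
            statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):])
        )
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical errors of two algorithms over 8 runs each (made up).
errors_a = [0.90, 0.95, 1.00, 1.05, 1.10, 0.92, 0.98, 1.02]
errors_b = [0.70, 0.75, 0.80, 0.85, 0.72, 0.78, 0.82, 0.76]

p = permutation_p_value(errors_a, errors_b)
print(f"p = {p:.4f}")
```

A small p here only says that chance alone rarely produces such a gap. Whether a gap of that size matters in practice is a separate question.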
What is p-hacking?
Analyzing the data in many different ways until a statistically significant result appears: “If you torture your data long enough, they will confess”
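One reason this works can be shown with a short calculation. Assuming k independent tests, each at significance level alpha, and no real effect anywhere, the chance of at least one "significant" result is 1 - (1 - alpha)^k. The function name `chance_of_false_positive` is made up for this sketch, and real analyses are rarely fully independent.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability of at least one p < alpha among n independent tests with no real effect."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {chance_of_false_positive(k):.3f}")
```

With 20 such tests the chance of at least one spurious finding is already about 64%.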
Why is it important to analyze performance over time?
Because the standard (and naive) assumption is that performance is always the same over time, which often does not hold
What is dataset pruning?
Removing data that does not fit your purpose
When is it good to remove data?
- Wrong data
- Noisy data
- Missing data
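A minimal sketch of removing wrong and missing data, assuming records of the form (user, item, rating) and a valid rating range of 1 to 5. The function `clean_ratings`, the record layout and the toy data are hypothetical. Noisy data is not handled here, because detecting it depends on the dataset.

```python
def clean_ratings(records, min_rating=1, max_rating=5):
    """Keep only records whose rating is present and inside the valid range."""
    cleaned = []
    for user, item, rating in records:
        if rating is None:  # missing data
            continue
        if not (min_rating <= rating <= max_rating):  # wrong data
            continue
        cleaned.append((user, item, rating))
    return cleaned

# Hypothetical raw records, made up for illustration.
raw = [
    ("u1", "i1", 4),
    ("u1", "i2", None),  # missing rating
    ("u2", "i1", 11),    # outside the 1-5 scale
    ("u2", "i3", 2),
    ("u3", "i2", 0),     # outside the 1-5 scale
]

cleaned = clean_ratings(raw)
removed = len(raw) - len(cleaned)
print(f"kept {len(cleaned)} records, removed {removed}")
```

Reporting how many records were removed, and why, helps readers judge whether the removal was justified.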