Book Notes Flashcards
What is classification?
Predicts, for each individual in a population, which of a set of classes that individual belongs to.
List of classes must be exhaustive and mutually exclusive.
Related tasks: scoring and class probability estimation.
Scoring model: estimates, for each individual, a score representing the probability that the individual belongs to each class.
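Sketch: how a score becomes a class label. The weights, feature names and the 0.5 threshold are made-up assumptions for illustration; a real model would learn the weights from labeled data.

```python
import math

# Hypothetical, hand-picked weights (assumption, not learned from data).
WEIGHTS = {"late_payments": 0.8, "years_as_customer": -0.3}
BIAS = -1.0

def score(individual):
    """Estimated probability (0..1) that the individual belongs to class 'churn'."""
    z = BIAS + sum(WEIGHTS[k] * individual[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic function maps z to (0, 1)

def classify(individual, threshold=0.5):
    """Turn the score into one of two mutually exclusive classes."""
    return "churn" if score(individual) >= threshold else "stay"

customer = {"late_payments": 4, "years_as_customer": 1}
p = score(customer)        # class probability estimate
label = classify(customer) # hard class decision
```

The score alone allows ranking individuals; the threshold is only needed when a hard class decision is required.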
What is regression?
Attempts to estimate or predict, for each individual, the numerical value of some variable for that individual: “given x, what is y?”
Value estimation > Has a numerical target (€, height, time, etc.)
Predicts how much something will happen
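Sketch: "given x, what is y?" with a least-squares line through made-up points. The data and the function name fit_line are assumptions for illustration; this is only one simple form of regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # numerical estimate for a new individual with x = 5
```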
What is similarity matching?
Attempts to identify similar individuals (people, products, etc.) based on data known about them
Often used for making product recommendations
“Given these characteristics, who is similar to my target audience?”
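Sketch: finding the most similar individual by distance between feature vectors. The customers, the two features and the choice of Euclidean distance are assumptions; in practice features on different scales would usually be normalized first.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up customers described as (age, purchases per month).
customers = {"ann": (25, 3), "bob": (52, 1), "cat": (27, 4)}
target = (26, 3)  # profile of the target audience

# Smaller distance = more similar.
most_similar = min(customers, key=lambda name: euclidean(customers[name], target))
```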
What is clustering?
Attempts to group individuals in a population together by their similarity (but not driven by any specific purpose)
Useful in preliminary domain exploration to see which natural groups exist in a data set
Basis for developing subsequent subtasks
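Sketch: grouping by similarity without any target, using a very small one-dimensional k-means. k-means is one possible technique (an assumption, the notes do not name one); the data and starting centers are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its old center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Made-up monthly spend per customer; no labels are given.
spend = [10, 12, 11, 95, 100, 105]
centers, groups = kmeans_1d(spend, centers=[0.0, 50.0])
```

Whether the two groups found here mean anything (for example "low" versus "high" spenders) still has to be judged afterwards.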
What is co-occurrence grouping?
What is profiling?
Attempts to characterize the typical behavior of an individual, group or population
Can be done at different levels: entire population or sub-clusters of the data
Use case: Often used to establish behavioral norms (typical purchases or user behavior) for anomaly detection applications such as fraud detection and monitoring intrusions
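Sketch: a behavioral norm used for anomaly detection. The purchase history and the "more than 3 standard deviations from the mean" rule are assumptions for illustration, not a recommended fraud rule.

```python
import statistics

# Made-up past purchase amounts of one customer: the behavioral profile is built from these.
history = [20, 25, 22, 19, 24, 21, 23]
mean = statistics.mean(history)
stdev = statistics.stdev(history)  # sample standard deviation

def is_anomalous(amount, k=3.0):
    """Flag amounts that lie more than k standard deviations from the typical amount."""
    return abs(amount - mean) > k * stdev

suspicious = is_anomalous(250)  # far outside the norm
normal = is_anomalous(23)       # within the norm
```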
What is link prediction?
Attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly estimating the strength of the link
Use case A: Recommending friends in social networks, based on shared social connections
Use case B: Recommending movies on Netflix
Logic at work: Searching for links that do not exist, but are predicted to exist and be strong
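Sketch: suggesting links that do not exist yet, with the number of shared friends as link strength. The friendship graph is made up, and counting common neighbors is just one simple scoring choice (assumption).

```python
from itertools import combinations

# Made-up, symmetric friendship graph.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}

def predict_links(graph):
    """Score every missing link by the number of shared connections, strongest first."""
    scores = {}
    for a, b in combinations(sorted(graph), 2):
        if b not in graph[a]:                  # link does not exist yet
            shared = len(graph[a] & graph[b])  # estimated strength of the link
            if shared > 0:
                scores[(a, b)] = shared
    return sorted(scores.items(), key=lambda item: -item[1])

suggestions = predict_links(friends)
```

Here the strongest suggestion is ann and dan, who are not connected but share two friends.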
What is data reduction?
Attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set
Goal: Higher readability, thus easier to generate insights, at a (small) loss of information
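Sketch: replacing per-movie ratings with a smaller per-genre summary. The movies, the genre lookup and the use of simple averaging are assumptions; real data reduction often uses more involved techniques.

```python
from collections import defaultdict

# Hypothetical movie -> genre lookup and one viewer's ratings (made-up data).
GENRE = {"Alien": "sci-fi", "Arrival": "sci-fi", "Heat": "crime", "Ronin": "crime"}
ratings = {"Alien": 5, "Arrival": 4, "Heat": 2, "Ronin": 3}

def reduce_to_genres(movie_ratings):
    """Collapse many per-movie ratings into one average rating per genre."""
    by_genre = defaultdict(list)
    for movie, rating in movie_ratings.items():
        by_genre[GENRE[movie]].append(rating)
    return {genre: sum(rs) / len(rs) for genre, rs in by_genre.items()}

taste = reduce_to_genres(ratings)  # 4 values reduced to 2
```

The summary is easier to read, but the individual movie ratings can no longer be recovered from it: that is the loss of information.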
What is causal modeling?
Attempts to help us understand what events or actions actually influence others
Key question: Is event B actually influenced by A, or do other factors explain the observation?
Useful techniques: A/B testing
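Sketch: evaluating an A/B test with a two-proportion z-test. The conversion counts are made up, and the 1.96 cutoff assumes a two-sided test at roughly the 5% level with randomly assigned groups.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for the difference between two conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Made-up experiment: variant A converts 100 of 1000 visitors, variant B 130 of 1000.
z = two_proportion_z(100, 1000, 130, 1000)
significant = abs(z) > 1.96  # unlikely to be explained by chance alone
```

Random assignment to A or B is what rules out the "other factors"; the test only checks that the difference is larger than chance would explain.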
What is supervised data mining?
Problem statement is clearly defined (based on certain conditions which must be met)
⇒ If a specific target can be identified, a supervised method should be deployed
⇒ Results are typically more useful than those of unsupervised methods
⇒ Data on the target is essential > the target must be labeled before the analysis
⇒ Common methods: classification, regression & causal modeling
Subclass A: Regression = numeric target
Subclass B: Classification = categorical target
What is unsupervised data mining?
No specific outcome, purpose or target is defined at first.
Even if the technique yields a result, it is not clear whether causal relationships exist or whether the result can be used further. The groups created may not be meaningful.
Common methods: Clustering, co-occurrence grouping & profiling