1. General Flashcards
Types of structured data
- Usual dataset: Time independent.
- Time series: Ordered over uniform time intervals
- Cross section: Multiple subjects at a single point in time
- Multilevel: Several levels + clusters
- Panel (longitudinal): Multilevel with 2 dimensions, i.e. cross-sectional data observed over time
Characteristics of unstructured data
- Non-numeric (e.g. text data).
- Multi-faceted (multiple information points, e.g. speech recording).
- Concurrent representation (study of different research goals from one dataset, e.g. an email).
Types of variables
- Discrete random (finite or countably infinite range)
- Continuous random (range is an interval)
- Nominal / Categorical (unordered names)
- Binary (unordered nominal variable of two categories)
- Ordinal (ranked)
- Numerical (quantitative, scaled either by interval or ratio)
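The practical difference between nominal and ordinal variables shows up when encoding them for a model. The sketch below is illustrative only: the helper names (`one_hot`, `ordinal_encode`) and the example values are assumptions, not part of these notes.

```python
def one_hot(values):
    """Encode a nominal variable: one 0/1 indicator per category, no order implied."""
    categories = sorted(set(values))  # sorted only to make the column order reproducible
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def ordinal_encode(values, order):
    """Encode an ordinal variable: map each level to its rank in `order`."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


# Hypothetical example data
colors = ["red", "green", "red", "blue"]        # nominal: names without ranking
sizes = ["small", "large", "medium", "small"]   # ordinal: ranked levels

encoded_colors, color_categories = one_hot(colors)
encoded_sizes = ordinal_encode(sizes, order=["small", "medium", "large"])
```

One-hot encoding avoids implying an order between categories, while rank encoding keeps the order that an ordinal variable actually carries.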
Types of ML techniques
- Unsupervised
Observe data and construct a low-complexity description
e.g. clustering, dimensionality reduction techniques like PCA, discretization…
- Supervised
Observe training samples and learn a function mapping inputs -> outputs
e.g. regression, classification, ranking…
- Reinforcement Learning
Learn a mapping states -> actions that maximizes cumulative reward
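To make the supervised / unsupervised distinction concrete, here is a minimal sketch in plain Python. It assumes simple least-squares regression as the supervised example and a 1-D two-cluster k-means as the unsupervised one; the function names and the initialisation choice are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Supervised: learn y = slope * x + intercept from labelled (x, y) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def two_means(points, iterations=10):
    """Unsupervised: describe unlabelled 1-D points by two cluster centres.

    Assumption: centres are initialised at the minimum and maximum value.
    """
    c0, c1 = min(points), max(points)
    for _ in range(iterations):
        # Assign each point to its nearest centre, then recompute the centres.
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return c0, c1


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # needs outputs ys
centres = two_means([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])        # inputs only
```

`fit_line` needs the outputs `ys` to learn from, whereas `two_means` only sees the inputs and summarises them.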
Types of missing values in data cleaning
- Truly Missing (no value)
- Coded (value, with a different meaning)
- Miscoded (value, with wrong meaning)
Missing values can be …
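- MCAR (missing completely at random): completely random and unpredictable
- MAR (missing at random): predictable from observed variables
- MNAR (missing not at random): dependent on unobserved variables
-> Imputation (substitution) by mean, median, stratified (sorting into groups), regression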
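A minimal sketch of mean, median and stratified imputation using only the standard library. It assumes missing values are represented as `None`; the function names and example rows are hypothetical. Regression imputation is not shown here.

```python
from statistics import mean, median


def impute(values, strategy=mean):
    """Replace None with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


def impute_stratified(rows, group_key, value_key, strategy=mean):
    """Replace None within each group (stratum) by that group's own statistic.

    Assumption: every group has at least one observed value.
    """
    observed_by_group = {}
    for row in rows:
        if row[value_key] is not None:
            observed_by_group.setdefault(row[group_key], []).append(row[value_key])
    fills = {g: strategy(vals) for g, vals in observed_by_group.items()}
    return [
        {**row, value_key: fills[row[group_key]] if row[value_key] is None else row[value_key]}
        for row in rows
    ]


# Hypothetical example data
ages = [1, None, 3]
rows = [
    {"group": "a", "income": 10},
    {"group": "a", "income": None},
    {"group": "b", "income": 20},
    {"group": "b", "income": 40},
    {"group": "b", "income": None},
]

mean_filled = impute(ages)
median_filled = impute([1, None, 3, 100], strategy=median)
stratified = impute_stratified(rows, "group", "income")
```

The median is less sensitive to outliers than the mean, and stratified imputation uses only values from comparable subjects.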
Big Data Characteristics
- Explosion of data volume and velocity from different domains
- Cheaper storage availability
- Faster and cheaper computation power
- Analysis of unstructured data
ML vs. Statistics
Statistics vs. ML
- Model-focused vs. algorithm-focused
- Hypothesis -> data collection -> hypothesis-driven analysis
vs. data collection -> data-driven analysis
(few hypotheses possible in ML)
- ML: focus on prediction
Building a model
Partition the data set
(training, validation, test)
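A minimal sketch of such a partition. The 60/20/20 split, the fixed seed and the function name `partition` are assumptions chosen for illustration; the notes do not prescribe particular proportions.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle the rows and split them into training, validation and test sets.

    The test set receives whatever remains after the first two splits.
    """
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]
    return train_set, validation_set, test_set


data = list(range(10))  # hypothetical data set of 10 rows
train_set, validation_set, test_set = partition(data)
```

Typically the training set fits the model, the validation set is used to compare or tune models, and the test set is held back for a final performance estimate.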
Cross-Industry Standard Process for Data Mining (CRISP-DM):
Business understanding
Data understanding
Data preparation
Modelling
Evaluation
Deployment
Applications of ML
Prediction (financial markets)
Insurance
Credit Scoring
Fraud Detection
Consumer Credit and Marketing, CRM (Classification?)