1. General Flashcards

1
Q

Types of structured data

A
  • Usual dataset: Time-independent observations.
  • Time series: Observations ordered over uniform time intervals.
  • Cross-section: Multiple subjects observed at one point in time.
  • Multilevel: Several levels + clusters (observations nested in groups).
  • Panel (longitudinal): Multilevel with 2 dimensions: cross-sections repeated over time.
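
A minimal sketch of how a panel relates to a cross-section and a time series. The `panel` dictionary, its values, and the helper names `cross_section` and `time_series` are made up for illustration; they are not part of the original card.

```python
# Hypothetical panel: (subject, year) -> income. All values are invented.
panel = {
    ("A", 2020): 10, ("A", 2021): 12, ("A", 2022): 15,
    ("B", 2020): 20, ("B", 2021): 21, ("B", 2022): 19,
}


def cross_section(panel, year):
    """Cross-section: all subjects at one point in time."""
    return {subject: value for (subject, t), value in panel.items() if t == year}


def time_series(panel, subject):
    """Time series: one subject, ordered by time."""
    return [value for (s, t), value in sorted(panel.items()) if s == subject]


print(cross_section(panel, 2021))
print(time_series(panel, "A"))
```

Fixing the time index gives a cross-section; fixing the subject gives a time series; keeping both indices is the panel.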
2
Q

Characteristics of unstructured data

A
  • Non-numeric (e.g. text data).
  • Multi-faceted (multiple information points, e.g. speech recording).
  • Concurrent representation (study of different research goals from one dataset, e.g. an email).
3
Q

Types of variables

A
  • Discrete random (finite or countably infinite range)
  • Continuous random (range is an interval)
  • Nominal / Categorical (unordered names)
  • Binary (unordered nominal variable of two categories)
  • Ordinal (ranked)
  • Numerical (quantitative, scaled either by interval or ratio)
4
Q

Types of ML techniques

A
  • Unsupervised
    Observe data and construct a low-complexity description (e.g. clustering, dimensionality reduction techniques like PCA, discretization…).
  • Supervised
    Observe training samples and learn function mapping inputs -> outputs
    e.g. regression, classification, ranking…
  • Reinforcement Learning
    Learn a mapping (policy)
    states -> actions that maximizes cumulative reward
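
As a rough illustration of the supervised case (observe training samples, learn a function inputs -> outputs), the sketch below fits a straight line by ordinary least squares. The training data and the names `fit_line` and `predict` are assumptions chosen for the example; regression is only one of the techniques listed.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training samples generated from y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(train_x, train_y)


def predict(x):
    """The learned mapping input -> output."""
    return slope * x + intercept


print(predict(5.0))
```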
5
Q

Types of missing values in data cleaning

A
  • Truly Missing (no value)
  • Coded (value, with a different meaning)
  • Miscoded (value, with wrong meaning)

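Missing values can be …
MCAR (missing completely at random): missingness is random and unpredictable
MAR (missing at random): missingness is predictable from observed variables
MNAR (missing not at random): missingness depends on unobserved variables

-> Imputation (substitution) by mean, median, stratified (sorting into strata), regression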
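
A minimal sketch of mean and median imputation, assuming missing entries are represented as `None`. The function name `impute` and the sample `ages` list are hypothetical; stratified and regression imputation are not shown.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries and one outlier.
ages = [20, None, 30, 100, None]

print(impute(ages, "mean"))
print(impute(ages, "median"))
```

With an outlier like 100 in the column, the median fill is less distorted than the mean fill, which is one reason to prefer it for skewed variables.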

6
Q

Big Data Characteristics

A
  • Explosion of data volume and velocity from different domains
  • Cheaper storage availability
  • Faster and cheaper computation power
  • Analysis of unstructured data
7
Q

ML vs. Statistics

A

Statistics vs. ML

  1. Model-focused vs. algorithm-focused
  2. Hypothesis -> data collection -> hypothesis-driven analysis
    vs. Data collection -> data-driven analysis
    (Few hypotheses possible in ML)
  3. Focus on inference vs. focus on prediction
8
Q

Building a model

A

Partition the data set
(training, validation, test)
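
A minimal sketch of such a partition, assuming a 60/20/20 split and a fixed random seed. The function name `partition`, the proportions, and the seed are illustrative choices, not values from the card.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return training, validation_set, test


training, validation_set, test = partition(range(10))
print(len(training), len(validation_set), len(test))
```

The model is fitted on the training set, tuned on the validation set, and the test set is held back for the final evaluation.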

Cross-Industry Standard Process for Data Mining (CRISP-DM):
Business understanding
Data understanding
Data preparation
Modelling
Evaluation
Deployment

9
Q

Applications of ML

A

Prediction (financial markets)
Insurance
Credit Scoring
Fraud Detection
Consumer Credit and Marketing, CRM (Classification?)
