1. General Flashcards

Question 1

Q

Types of structured data

Answer

A

Usual dataset: Time-independent observations
Time series: Observations ordered over uniform time intervals
Cross-section: Multiple subjects observed at a single point in time
Multilevel: Several levels, with observations grouped in clusters
Panel (longitudinal): Multilevel with 2 dimensions, i.e. a cross-section followed over time
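
As a rough illustration of how these layouts relate, the sketch below stores a tiny panel keyed by (subject, period) and slices it into a time series and a cross-section. The firm names, years, and values are made-up toy data, and the helper names `time_series` and `cross_section` are hypothetical.

```python
# Hypothetical toy panel: one value per (subject, period) pair.
panel = {
    ("firm_a", 2020): 10.0, ("firm_a", 2021): 12.0,
    ("firm_b", 2020): 7.0,  ("firm_b", 2021): 6.5,
}

def time_series(panel, subject):
    """One subject followed over time, ordered by period."""
    return [(t, v) for (s, t), v in sorted(panel.items()) if s == subject]

def cross_section(panel, period):
    """All subjects observed at a single period."""
    return {s: v for (s, t), v in panel.items() if t == period}

firm_a_history = time_series(panel, "firm_a")
snapshot_2020 = cross_section(panel, 2020)
```

Fixing the subject gives a time series, fixing the period gives a cross-section, and keeping both dimensions gives the panel.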

Question 2

Q

Characteristics of unstructured data

Answer

A

Non-numeric (e.g. text data).
Multi-faceted (multiple information points, e.g. speech recording).
Concurrent representation (study of different research goals from one dataset, e.g. an email).

Question 3

Q

Types of variables

Answer

A

Discrete random (finite or countably infinite range)
Continuous random (range is an interval)
Nominal / Categorical (unordered names)
Binary (unordered nominal variable of two categories)
Ordinal (ranked)
Numerical (quantitative, scaled either by interval or ratio)

Question 4

Q

Types of ML techniques

Answer

A

Unsupervised
Observe data and construct a low-complexity description (e.g. clustering, dimensionality reduction techniques like PCA, discretization…).
Supervised
Observe training samples and learn function mapping inputs -> outputs
e.g. regression, classification, ranking…
Reinforcement Learning
Learn a mapping states -> actions that maximizes cumulative reward
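
To make the first two categories concrete, here is a minimal sketch in plain Python: a least-squares line fit stands in for supervised learning (labeled pairs in, function out), and a two-center k-means on 1-D points stands in for unsupervised learning (no labels, only a compact description). The function names and toy numbers are illustrative assumptions; reinforcement learning is not covered here.

```python
def fit_line(xs, ys):
    """Supervised: learn y = slope * x + intercept from labeled (x, y) pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

def two_means(points, iterations=10):
    """Unsupervised: summarize unlabeled 1-D points by two cluster centers."""
    lo, hi = min(points), max(points)
    if lo == hi:  # all points identical, nothing to separate
        return lo, hi
    for _ in range(iterations):
        # Assign each point to the nearer center, then recompute the centers.
        near_lo = [p for p in points if abs(p - lo) <= abs(p - hi)]
        near_hi = [p for p in points if abs(p - lo) > abs(p - hi)]
        lo = sum(near_lo) / len(near_lo)
        hi = sum(near_hi) / len(near_hi)
    return lo, hi

slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
centers = two_means([1, 2, 3, 10, 11, 12])
```

The supervised function needs the targets `ys` to learn from; the unsupervised one only ever sees the inputs.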

Question 5

Q

Types of missing values in data cleaning

Answer

A

Truly Missing (no value)
Coded (value, with a different meaning)
Miscoded (value, with wrong meaning)

Missing values can be …
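MCAR (missing completely at random): unrelated to any variable, unpredictable
MAR (missing at random): predictable from observed variables
MNAR (missing not at random): dependent on unobserved variables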

-> Imputation (substitution) by Mean, Median, Stratified (sorting), Regression
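
A minimal sketch of mean, median, and stratified imputation using only the standard library. Missing values are represented as `None`, and "stratified" is interpreted here as filling with the mean of the same group (stratum); both choices, as well as the function names, are assumptions for illustration. Regression imputation is not shown.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def impute_stratified(values, groups):
    """Replace None with the mean of the observed values in the same group.

    Assumes every group has at least one observed value.
    """
    by_group = {}
    for v, g in zip(values, groups):
        if v is not None:
            by_group.setdefault(g, []).append(v)
    group_means = {g: mean(vs) for g, vs in by_group.items()}
    return [group_means[g] if v is None else v for v, g in zip(values, groups)]

ages = [20, None, 40, 90]
filled_mean = impute(ages, "mean")
filled_median = impute(ages, "median")
filled_by_group = impute_stratified(ages, ["a", "a", "b", "b"])
```

With one large value in the data, the mean and median fills differ noticeably, which is one reason to choose the median for skewed variables.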

Question 6

Q

Big Data Characteristics

Answer

A

Explosion of data volume and velocity from different domains
Cheaper storage availability
Faster and cheaper computation power
Analysis of unstructured data

Question 7

Q

ML vs. Statistics

Answer

A

Statistics vs. ML

model-focused vs. algorithm-focused
Hypothesis -> data collection -> hypothesis-driven analysis
vs. Data collection -> data-driven analysis
(Few hypotheses possible in ML)
ML: focus on prediction

Question 8

Q

Building a model

Answer

A

Partition data set
(Training, Validation, Test)
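
The partition step can be sketched as a shuffled three-way split. The 60/20/20 proportions, the fixed seed, and the function name `partition` are assumptions chosen for illustration, not values prescribed by the card.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains becomes the test set
    )

train_set, val_set, test_set = partition(range(10))
```

The model is fitted on the training set, tuned on the validation set, and the test set is touched only once for the final evaluation.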

Cross-Industry Standard Process for Data Mining (CRISP-DM):
Business understanding
Data understanding
Data preparation
Modelling
Evaluation
Deployment

Question 9

Q

Applications of ML

Answer

A

Prediction (financial markets)
Insurance
Credit Scoring
Fraud Detection
Consumer Credit and Marketing, CRM (Classification?)

(9 cards)