Technological Aspects Flashcards

1
Q

Big Data definition

A
"The 3 Vs":
- high volume
- high velocity
- and/or high variety
information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimisation.
2
Q

Purpose of Business Analytics

A

Get Value from Big Data:

  • identify where the business stands relative to competitors
  • collect and use data to inform decision making
  • identify strengths & weaknesses of the business
  • use information for strategic planning
3
Q

Three primary activities in Business Analytics

A

(Data Sources) –> ACQUIRE DATA –> PERFORM ANALYSIS –> PUBLISH RESULTS (push & pull w/ knowledge workers)

4
Q

Methods of Business Analytics

A
  • descriptive statistics (mean, median, variance -> histogram, bar chart, scatter plot)
  • correlation analysis (univariate & multivariate regression)
  • predictive analysis (data mining, machine learning)
  • causal inference (controlled experiment, econometrics)
5
Q

Linear Regression Model

A

Yi = β0 + β1·Xi + ei

Yi = outcome variable (dependent variable)
β0 = intercept (y-intercept) / constant
β1 = slope of the line
Xi = independent variable
ei = error term (vertical deviation of observation i from the regression line)

THE BEST LINE SHOULD BE UNBIASED & EFFICIENT

6
Q

Four assumptions for valid statistical inference based on regression model

A
  1. at each value of x, there is a distribution of y with mean µ and variance σ^2
  2. the straight-line model is correct: the means of these distributions can be joined by a straight line.
  3. homoscedasticity
  4. independence of observations
7
Q

What does the OLS model do?

A

Under the four assumptions, the best fitting line is the one that minimises the sum of the squared residuals!

=> ∑ (êi)^2 = ∑ (yi - ŷi)^2
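
As a rough illustration of the minimisation above, here is a minimal sketch for a single predictor. The data points and the helper names `ols_fit` and `ssr` are made up for this example; the closed-form formulas are the standard ones for simple linear regression.

```python
def ols_fit(xs, ys):
    """Return (b0, b1), the intercept and slope that minimise the SSR."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def ssr(xs, ys, b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up observations, roughly on a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(xs, ys)
best = ssr(xs, ys, b0, b1)

# Moving either coefficient away from the OLS estimate increases the SSR.
worse_intercept = ssr(xs, ys, b0 + 0.1, b1)
worse_slope = ssr(xs, ys, b0, b1 - 0.1)
print(b0, b1, best, worse_intercept, worse_slope)
```

Any other line through these points gives a larger sum of squared residuals than the fitted one, which is what "best fitting" means here.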

8
Q

Why should one use a log-transformation for regression?

A
  • normalises skewed data: the transformed values look more like a normal distribution, which helps to satisfy assumption 1
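
A small sketch of the effect, assuming made-up right-skewed values (the list `incomes` and the helper `skewness` are hypothetical and only serve the illustration). Skewness near 0 indicates a symmetric distribution.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, strongly right-skewed values (each one doubles the previous).
incomes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)   # clearly positive: long right tail
log_skew = skewness(logged)    # close to zero: symmetric after the log
print(raw_skew, log_skew)
```

Real data will rarely become perfectly symmetric, but a long right tail is typically pulled in considerably.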
9
Q

What to check in regression output

A

R-square: how much of the total variance in the data is explained by the model? Value should be as close to 1 as possible.

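ANOVA: Significance of F-value: the p-value of the overall F-test should be significant (below 0.1), otherwise the model explains no more than the mean of Y alone. A Levene test, by contrast, should be not significant (p-value not lower than 0.1) for the data to have a similar spread at all Xi.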

Coefficients: intercept and β1 and their p-values. The p-values should be significant, i.e. below 0.1 at least!
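
To make the R-square entry concrete, here is a minimal sketch that computes it by hand as 1 - SSR/SST for a single predictor. The data are made up, and the variable names are chosen for this example only.

```python
# Made-up observations, roughly on a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Fit the line with the standard simple-regression formulas.
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

# SST: total variation of y around its mean.
sst = sum((y - y_mean) ** 2 for y in ys)
# SSR: variation left unexplained by the fitted line.
ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - ssr / sst
print(r_squared)
```

A value near 1 means the line leaves little of the total variation unexplained; a value near 0 means it does barely better than predicting the mean of y.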

10
Q

Hypothesis tests

A

p-value ≤ alpha
OR
critical value ≤ |test statistic|

=> reject H0!!!
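
The two rules above lead to the same decision. The following sketch assumes a two-sided z-test, which the card itself does not specify, and uses a hypothetical helper `z_test_decision`.

```python
from statistics import NormalDist


def z_test_decision(z, alpha=0.05):
    """Apply both rejection rules to a z statistic (two-sided test assumed)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    critical_value = std_normal.inv_cdf(1 - alpha / 2)
    reject_by_p = p_value <= alpha
    reject_by_critical = critical_value <= abs(z)
    return p_value, critical_value, reject_by_p, reject_by_critical


# A large test statistic: both rules reject H0.
p_hi, crit_hi, by_p_hi, by_crit_hi = z_test_decision(2.5)
# A small test statistic: neither rule rejects H0.
p_lo, crit_lo, by_p_lo, by_crit_lo = z_test_decision(1.0)
print(p_hi, crit_hi, by_p_hi, by_crit_hi)
print(p_lo, crit_lo, by_p_lo, by_crit_lo)
```

For other tests (t, F, chi-square) the reference distribution changes, but the comparison logic stays the same.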

11
Q

Predictive Analysis

A

Using past events to anticipate the future

  • statistical analytical techniques to find patterns in past events
  • the patterns are used to develop models which predict the likelihood of future events
12
Q

Data Mining & Machine Learning

A
  • like humans learning from past experiences, although experiences for computers are fed in via data of an application domain, from which the computer system “learns”
  • Supervised learning / classification: learn a target function that can be used to determine the values of a discrete class attribute, e.g. high-risk patient, bank loan approval
  • for this, a set of data records is needed with:
    • k attributes (age, has_house, credit_rating …)
    • a class (approved yes/no, high-risk yes/no)

GOAL: Learn a classification model from the data! Learning means that a system performs a task better with a model than without one (≈ guessing, or just assigning one class to all test data)

13
Q

Supervised vs. unsupervised learning

A

Supervised:
- the data are labeled with pre-defined classes

Unsupervised (clustering)

  • class labels of data are unknown
  • given a data set, the task is to establish the existence of classes or clusters
14
Q

Supervised learning process

A

Two steps:

  • Learning (training): divide the data into training data and test data and learn a model using the training data
  • Testing: test the model using unseen test data to determine the model accuracy (IMPORTANT: the distribution of the training data is assumed to be equal to that of the test data; this assumption is often violated!)

Accuracy = number of correct classifications / total number of test cases
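
The two steps and the accuracy formula can be sketched as follows. Everything here is a toy assumption: the records are made up (a numeric `rating` and an `approved` label), the split is a fixed every-third-record split rather than a random one, and the "model" is a single learned threshold.

```python
# Made-up records: (rating, approved). Approval follows rating >= 11.
records = [(rating, rating >= 11) for rating in range(1, 21)]

# Step 1a: split. Every third record is held out as unseen test data.
test = [r for i, r in enumerate(records) if i % 3 == 0]
train = [r for i, r in enumerate(records) if i % 3 != 0]


def learn_threshold(data):
    """Pick the threshold t with the best training accuracy for 'rating >= t'."""
    def train_accuracy(t):
        return sum((rating >= t) == label for rating, label in data) / len(data)
    candidates = sorted({rating for rating, _ in data})
    return max(candidates, key=train_accuracy)


def accuracy(predictions, labels):
    """Number of correct classifications / total number of test cases."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Step 1b: learn the model using the training data only.
threshold = learn_threshold(train)

# Step 2: evaluate on the unseen test data.
test_labels = [label for _, label in test]
model_accuracy = accuracy([rating >= threshold for rating, _ in test], test_labels)

# Baseline without a model: assign the majority training class to every case.
train_labels = [label for _, label in train]
majority = max(set(train_labels), key=train_labels.count)
baseline_accuracy = accuracy([majority] * len(test), test_labels)
print(threshold, model_accuracy, baseline_accuracy)
```

Comparing against the majority-class baseline shows whether the model has learned anything beyond assigning one class to all test cases.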

15
Q

Methods of Data Mining & Machine Learning

A
Supervised learning:
- decision tree (one of the most widely used techniques, very efficient!)
- perceptron
- logistic regression
...

Unsupervised learning:

  • k-means clustering (partitioning method, good for big data sets, but # of clusters has to be known/determined beforehand)
  • hierarchical clustering methods (single linkage, complete linkage etc. to determine # of clusters)
  • Hidden Markov Model (HMM)
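
As an illustration of k-means from the list above, here is a minimal one-dimensional sketch. The points, the starting centroids and the helper `k_means_1d` are made up for this example; note that the number of clusters (here 2) is fixed in advance through the starting centroids.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Made-up data with two obvious groups; deliberately poor starting centroids.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
final_centroids, final_clusters = k_means_1d(points, centroids=[1.0, 2.0])
print(final_centroids, final_clusters)
```

In practice the result can depend on the starting centroids, so the procedure is often run several times with different starts.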
16
Q

Causal Inference: How to interpret a correlation between X and Y?

A

Correlation does not necessarily suggest causality!

3 possibilities:
X -> Y
Y -> X
W -> X & Y

Possible solutions:

  • choose an X that occurred before Y
  • if we have data on W, then we should control for it
  • GOLD STANDARD: Controlled Experiment (A/B test): the population is randomly divided into 2 groups, A and B, to control for unobserved confounding variables. One is the treatment group (e.g. it is "given" a high Yelp rating) and the other is the control group (it is given a low Yelp rating).
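
Why random assignment helps can be sketched with simulated data. Everything below is an assumption made for illustration: the hidden confounder, the noise, and the true effect of 2.0 are invented, not measured.

```python
import random
from statistics import mean

rng = random.Random(42)   # fixed seed so the sketch is repeatable
TRUE_EFFECT = 2.0         # assumed effect of the treatment on the outcome
N = 1000

# Each unit has an unobserved confounder W that also drives the outcome.
confounder = [rng.gauss(0, 1) for _ in range(N)]

# Random assignment: shuffle the unit indices and treat the first half.
indices = list(range(N))
rng.shuffle(indices)
treated = set(indices[: N // 2])

# Simulated outcome = confounder + treatment effect (if treated) + noise.
outcome = [
    confounder[i] + (TRUE_EFFECT if i in treated else 0.0) + rng.gauss(0, 1)
    for i in range(N)
]

treatment_mean = mean(outcome[i] for i in range(N) if i in treated)
control_mean = mean(outcome[i] for i in range(N) if i not in treated)

# Because assignment is random, W is balanced across the groups on average,
# so the difference in means estimates the treatment effect.
estimated_effect = treatment_mean - control_mean
print(estimated_effect)
```

If units could choose their own group based on W, the same difference in means would mix the treatment effect with the effect of W.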