Technological Aspects Flashcards

Question 1

Q

Big Data definition

Answer

A

"The 3 Vs":
- high volume
- high velocity
- and/or high variety
information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimisation.

Question 2

Q

Purpose of Business Analytics

Answer

A

Get Value from Big Data:

identify where the business stands relative to competitors
collect and use data to inform decision making
identify strengths & weaknesses of the business
use information for strategic planning

Question 3

Q

Three primary activities in Business Analytics

Answer

A

(Data Sources) -> ACQUIRE DATA -> PERFORM ANALYSIS -> PUBLISH RESULTS (push & pull with knowledge workers)

Question 4

Q

Methods of Business Analytics

Answer

A

descriptive statistics (mean, median, variance -> histogram, bar chart, scatter plot)
correlation analysis (univariate & multivariate regression)
predictive analysis (data mining, machine learning)
causal inference (controlled experiment, econometrics)
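
As a small illustration of the first method, descriptive statistics can be computed with Python's standard `statistics` module. This is only a sketch: the `sales` figures are made-up example data, not taken from the cards.

```python
import statistics

# Hypothetical example data (assumption: six monthly sales figures).
sales = [12, 15, 11, 19, 15, 30]

mean_sales = statistics.mean(sales)      # arithmetic mean
median_sales = statistics.median(sales)  # middle value of the sorted data
var_sales = statistics.variance(sales)   # sample variance (divides by n - 1)

print(mean_sales, median_sales, var_sales)
```

The median is lower than the mean here because the single large value (30) pulls the mean upwards.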

Question 5

Q

Linear Regression Model

Answer

A

Yi = β0 + β1·Xi + ei

Yi = outcome variable (dependent variable)
β0 = y-intercept / constant / intercept
β1 = slope of the line
Xi = independent variable
ei = error term (vertical deviation of observation from regression line)

THE BEST LINE SHOULD BE UNBIASED & EFFICIENT

Question 6

Q

Four assumptions for valid statistical inference based on regression model

Answer

A

at each value of x, y is normally distributed with mean µ and variance σ^2
the straight-line model is correct: the means of these distributions can be joined by a straight line
homoscedasticity
independence of observations

Question 7

Q

What does the OLS model do?

Answer

A

Under the four assumptions, the best fitting line is the one that minimises the sum of the squared residuals!

=> minimise ∑ (êi)^2 = ∑ (yi - ŷi)^2
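
For a single independent variable, the line that minimises the sum of squared residuals has a well-known closed form: slope = Sxy / Sxx, intercept = mean(y) - slope · mean(x). The sketch below applies it to made-up data; the function names `ols_fit` and `sum_squared_residuals` are illustrative, not from the cards.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical deviations of the observations from the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical example data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = ols_fit(xs, ys)
ssr = sum_squared_residuals(xs, ys, beta0, beta1)
print(beta0, beta1, ssr)
```

Any other line, for example one with a shifted intercept, gives a larger sum of squared residuals on the same data.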

Question 8

Q

Why should one use a log-transformation for regression?

Answer

A

normalizes data / makes skewed data look more like a normal distribution, to satisfy assumption 1
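
A rough way to see the effect is to compare the skewness of a right-skewed variable before and after taking logs. The sketch below uses made-up, strictly positive `incomes` (the log is only defined for positive values), and `skewness` is a hand-written helper, not a library function.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data: a few very large values.
incomes = [1, 2, 2, 3, 4, 5, 8, 20, 100]
log_incomes = [math.log(v) for v in incomes]

skew_raw = skewness(incomes)
skew_log = skewness(log_incomes)
print(skew_raw, skew_log)
```

A skewness closer to 0 after the transformation means the data are more symmetric, which is one feature of a normal distribution; it does not by itself prove normality.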

Question 9

Q

What to check in regression output

Answer

A

R-square: how much of the total variance in the data is explained by the model? Value should be as close to 1 as possible.

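ANOVA: Significance F is the p-value of the overall F-test of the model; it should be significant, i.e. below 0.1 at least. The Levene test is a separate check of homoscedasticity: there the result should be not significant, i.e. the p-value should not be lower than 0.1, for the data to have similar variance at all Xi.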

Coefficients: intercept and β1 and their p-values. p-values should be significant, therefore below 0.1 at least!
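
R-square can be computed by hand as 1 - (residual sum of squares / total sum of squares). A minimal sketch with made-up values; `r_squared` is an illustrative helper name.

```python
def r_squared(ys, y_hats):
    """Share of the total variance in ys explained by the predictions y_hats."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical observations.
ys = [2.0, 4.0, 6.0, 8.0]

perfect = r_squared(ys, [2.0, 4.0, 6.0, 8.0])    # predictions hit every point
mean_only = r_squared(ys, [5.0, 5.0, 5.0, 5.0])  # always predict the mean
print(perfect, mean_only)
```

A model that reproduces every observation gets 1, and a model that only ever predicts the mean of y gets 0.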

Question 10

Q

Hypothesis tests

Answer

A

p-value ≤ alpha
OR
critical value ≤ |test statistic|

=> reject H0!!!
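
The two decision rules are equivalent. The sketch below checks this for a two-sided z-test, which is chosen here only as an example test; it uses `statistics.NormalDist` from the Python standard library, and alpha = 0.05 is an assumed default.

```python
from statistics import NormalDist


def two_sided_z_test(z, alpha=0.05):
    """Return (reject_by_p_value, reject_by_critical_value) for test statistic z."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    critical_value = std_normal.inv_cdf(1 - alpha / 2)
    return p_value <= alpha, critical_value <= abs(z)


strong = two_sided_z_test(2.5)  # test statistic far from 0
weak = two_sided_z_test(1.0)    # test statistic close to 0
print(strong, weak)
```

Both rules reject H0 for z = 2.5 and both fail to reject it for z = 1.0.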

Question 11

Q

Predictive Analysis

Answer

A

Using past events to anticipate the future

statistical analytical techniques to find patterns in past events
the patterns are used to develop models which predict the likelihood of future events

Question 12

Q

Data Mining & Machine Learning

Answer

A

like humans learning from past experiences, although experiences for computers are fed in via data of an application domain, from which the computer system “learns”
Supervised learning / classification: learn a target function that can be used to determine the values of a discrete class attribute, e.g. high-risk patient, bank loan approval
for this, a set of data records is needed with:
- k attributes (age, has_house, credit_rating …)
- a class (approved yes/no, high-risk yes/no)

GOAL: Learn a classification model from the data! Learning means that a system performs a task better with a model than w/out a model (≈guessing or just assigning one class to all test data)

Question 13

Q

Supervised vs. unsupervised learning

Answer

A

Supervised:
- the data are labeled with pre-defined classes

Unsupervised (clustering)

class labels of data are unknown
given a data set, the task is to establish the existence of classes or clusters

Question 14

Q

Supervised learning process

Answer

A

Two steps:

Learning (training): divide the data in training data and test data and learn a model using the training data
Testing: test the model using unseen test data to determine the model accuracy (IMPORTANT: this assumes the training data and the test data come from the same distribution; an often violated assumption!)

Accuracy = number of correct classifications / total number of test cases
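
The two steps and the accuracy formula can be sketched on toy data. Everything here is an assumption for illustration: the records (a credit rating and an approval label), the one-rule "model", and the fixed every-fourth-record split, which stands in for a random split so that the example is reproducible. The majority-class baseline corresponds to "just assigning one class to all test data".

```python
def accuracy(predicted, actual):
    """Number of correct classifications / total number of test cases."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical records: (credit_rating, approved); ratings of 5 or more are approved.
records = [(rating, rating >= 5) for rating in range(10)] * 4

# Step 1, learning: hold out every fourth record as test data, train on the rest.
test = records[::4]
train = [rec for i, rec in enumerate(records) if i % 4 != 0]

# The "model" is one learned threshold: the lowest rating approved in training.
threshold = min(rating for rating, approved in train if approved)

# Step 2, testing: classify the unseen test records.
test_labels = [approved for _, approved in test]
model_preds = [rating >= threshold for rating, _ in test]

# Baseline without a model: assign the majority training class to every test case.
train_labels = [approved for _, approved in train]
majority = max(set(train_labels), key=train_labels.count)
baseline_preds = [majority] * len(test)

model_acc = accuracy(model_preds, test_labels)
baseline_acc = accuracy(baseline_preds, test_labels)
print(model_acc, baseline_acc)
```

The system has "learned" something in the sense of the cards if `model_acc` is higher than `baseline_acc`.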

Question 15

Q

Methods of Data Mining & Machine Learning

Answer

A

Supervised learning:
- decision tree (one of most widely used techniques, very efficient!)
- perceptron
- logistic regression
...

Unsupervised learning:

k-means clustering (partitioning method, good for big data sets, but # of clusters has to be known/determined beforehand)
hierarchical clustering methods (single linkage, complete linkage etc. to determine # of clusters)
Hidden Markov Model (HMM)
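
A minimal sketch of k-means on one-dimensional data, assuming the number of clusters is fixed beforehand through the starting centroids. The function name `k_means_1d`, the data and the starting centroids are all illustrative.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data with two visible groups; k = 2 is set by the two start centroids.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No class labels are used at any point, which is what makes this unsupervised.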

Question 16

Q

Causal Inference: How to interpret a correlation between X and Y?

Answer

A

Correlation does not necessarily suggest causality!

3 possibilities:
X -> Y
Y -> X
W -> X & Y

Possible solutions:

choose an X that occurred before Y
if we have data on W, then we should control for it
GOLD STANDARD: Controlled Experiment (A/B test): the population is randomly divided into 2 groups, A & B, to control for unobserved confounding variables; one is the treatment group (is “given” a high Yelp rating, e.g.) and the other is the control group (is given a low Yelp rating)
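
A small simulation shows why random assignment works. All numbers are assumptions for illustration: an unobserved confounder W that strongly drives the outcome, a true treatment effect of 2.0, and 10,000 simulated units with a fixed random seed.

```python
import random

rng = random.Random(42)  # fixed seed so the run is reproducible
TRUE_EFFECT = 2.0        # assumed true effect of the treatment on the outcome


def mean(values):
    return sum(values) / len(values)


treated_y, control_y = [], []
treated_w, control_w = [], []
for _ in range(10_000):
    w = rng.gauss(0, 1)           # unobserved confounder W
    treated = rng.random() < 0.5  # random assignment, independent of W
    y = TRUE_EFFECT * treated + 3.0 * w + rng.gauss(0, 1)
    if treated:
        treated_y.append(y)
        treated_w.append(w)
    else:
        control_y.append(y)
        control_w.append(w)

# Randomisation balances W across the groups, so the gap is close to 0 ...
w_gap = mean(treated_w) - mean(control_w)
# ... and the difference in mean outcomes estimates the treatment effect.
estimated_effect = mean(treated_y) - mean(control_y)
print(w_gap, estimated_effect)
```

Because W is, on average, the same in both groups, it cannot explain the difference in outcomes, even though it was never measured.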

(16 cards)