Technological Aspects Flashcards
Big Data definition
"The 3 Vs": high-volume, high-velocity and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimisation.
Purpose of Business Analytics
Get Value from Big Data:
- identify where the business stands rel. to competitors
- collect and use data to inform decision making
- identify strengths & weaknesses of the business
- use information for strategic planning
Three primary activities in Business Analytics
(Data Sources) –> ACQUIRE DATA –> PERFORM ANALYSIS –> PUBLISH RESULTS (push & pull w/ knowledge workers)
Methods of Business Analytics
- descriptive statistics (mean, median, variance -> histogram, bar chart, scatter plot)
- correlation analysis (univariate & multivariate regression)
- predictive analysis (data mining, machine learning)
- causal inference (controlled experiment, econometrics)
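A minimal sketch of the first two methods, using only the Python standard library. The sales and ad-spend figures are made-up illustrative values, and `pearson_r` is a hypothetical helper name:

```python
import statistics

# Hypothetical figures for seven months (illustrative values only).
sales = [12, 15, 11, 18, 20, 15, 14]
ad_spend = [1.0, 1.5, 0.8, 2.0, 2.4, 1.4, 1.3]

# Descriptive statistics
mean_sales = statistics.mean(sales)
median_sales = statistics.median(sales)
var_sales = statistics.variance(sales)  # sample variance (n - 1 in the denominator)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Correlation analysis: how strongly do ad spend and sales move together?
r = pearson_r(ad_spend, sales)
print(mean_sales, median_sales, var_sales, round(r, 3))
```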
Linear Regression Model
Yi = β0 + β1 (Xi) + ei
- Yi = outcome variable (dependent variable)
- β0 = y-intercept / constant
- β1 = slope of the line
- Xi = independent variable
- ei = error term (vertical deviation of observation i from the regression line)
THE BEST LINE SHOULD BE UNBIASED & EFFICIENT
Four assumptions for valid statistical inference based on regression model
- at each value of x, there is a normal distribution of y with mean µ and variance σ^2
- the straight-line model is correct: the means of these distributions can be joined by a straight line
- homoscedasticity
- independence of observations
What does the OLS model do?
Under the four assumptions, the best fitting line is the one that minimises the sum of the squared residuals!
=> ∑ (êi)^2 = ∑ (yi - ŷi)^2
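The minimisation can be sketched for one independent variable with the standard closed-form OLS estimates (slope = Sxy / Sxx, intercept = ȳ - slope · x̄). The function names and the toy data are assumptions for illustration:

```python
def ols_fit(xs, ys):
    """Closed-form OLS estimates (intercept b0, slope b1) for one regressor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared vertical deviations from the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Toy data (illustrative values only).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(xs, ys)
ssr_best = sum_squared_residuals(xs, ys, b0, b1)

# Any other line, e.g. a shifted intercept or a tilted slope, fits worse.
ssr_shifted = sum_squared_residuals(xs, ys, b0 + 0.5, b1)
ssr_tilted = sum_squared_residuals(xs, ys, b0, b1 + 0.1)
print(b0, b1, ssr_best, ssr_shifted, ssr_tilted)
```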
Why should one use a log-transformation for regression?
- normalizes data / makes data look more like a normal distribution to satisfy assumption 1
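A small sketch of the effect, assuming right-skewed, strictly positive data (the values are made up). Skewness is measured here with the simple moment-based formula; a value near 0 indicates a symmetric distribution. The log does not guarantee normality, it only pulls in the long right tail:

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data, e.g. incomes in thousands.
incomes = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
log_incomes = [math.log(v) for v in incomes]

skew_raw = skewness(incomes)
skew_log = skewness(log_incomes)
print(round(skew_raw, 2), round(skew_log, 2))  # the log-transformed data are less skewed
```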
What to check in regression output
R-square: how much of the total variance in the data is explained by the model? Value should be as close to 1 as possible.
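ANOVA: Significance of F-value (Levene test): should be not significant, i.e. the p-value should not be lower than 0.1, so that the data have a similar spread at all Xi (homoscedasticity). This applies to the Levene test only; the overall F-test of the regression should be significant.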
Coefficients: intercept and β1 and their p-values. p-values should be significant, i.e. below 0.1 at least!
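R-square can be computed by hand as 1 - (residual sum of squares / total sum of squares). A minimal sketch with toy data; the coefficients are assumed to be the OLS estimates for exactly these values:

```python
def r_squared(ys, fitted):
    """Share of the total variance in ys that is explained by the fitted values."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    ss_tot = sum((y - y_bar) ** 2 for y in ys)              # total variation
    return 1 - ss_res / ss_tot

# Toy data (illustrative values only).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.05, 1.99  # assumed OLS estimates for this toy data

fitted = [b0 + b1 * x for x in xs]
r2 = r_squared(ys, fitted)
print(round(r2, 4))
```

A model that only predicts the mean of y yields an R-square of about 0, a perfect fit yields 1.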
Hypothesis tests
p-value ≤ alpha
OR
critical value ≤ |test statistic|
=> reject H0!!!
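Both decision rules can be sketched for a two-sided test. This sketch assumes the test statistic follows a standard normal distribution under H0 (a z-test); regression output usually relies on t- and F-distributions instead. The function name is hypothetical:

```python
from statistics import NormalDist

def two_sided_z_test(test_statistic, alpha=0.05):
    """Apply both rejection rules to a z-statistic; returns the details of each."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    p_value = 2 * (1 - std_normal.cdf(abs(test_statistic)))
    critical_value = std_normal.inv_cdf(1 - alpha / 2)
    reject_by_p = p_value <= alpha
    reject_by_critical = critical_value <= abs(test_statistic)
    return p_value, critical_value, reject_by_p, reject_by_critical

p_hi, crit, rej_p_hi, rej_c_hi = two_sided_z_test(2.5)  # far from 0: reject H0
p_lo, _, rej_p_lo, rej_c_lo = two_sided_z_test(1.0)     # close to 0: do not reject H0
print(round(p_hi, 4), round(crit, 3), rej_p_hi, rej_c_hi)
print(round(p_lo, 4), rej_p_lo, rej_c_lo)
```

For the same alpha, the two rules always lead to the same decision.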
Predictive Analysis
Using past events to anticipate the future
- statistical analytical techniques to find patterns in past events
- the patterns are used to develop models which predict the likelihood of future events
Data Mining & Machine Learning
- computers learn like humans do from past experiences, except that the "experiences" of a computer are fed in as data of an application domain, from which the computer system "learns"
- Supervised learning / classification: learn a target function that can be used to determine the values of a discrete class attribute, e.g. high-risk patient, bank loan approval
- for this, a set of data records is needed with:
  - k attributes (age, has_house, credit_rating …)
  - a class (approved yes/no, high-risk yes/no)
GOAL: Learn a classification model from the data! Learning means that a system performs a task better with a model than w/out a model (≈guessing or just assigning one class to all test data)
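A sketch of such data records and of the "no model" baseline that assigns one class to all records. The loan records are made up, and the `has_house` rule is a hand-written stand-in for a learned model, not an actual learning algorithm:

```python
from collections import Counter

# Hypothetical loan records: k = 3 attributes plus a class label ("approved").
records = [
    {"age": 25, "has_house": False, "credit_rating": "fair", "approved": "no"},
    {"age": 40, "has_house": True, "credit_rating": "good", "approved": "yes"},
    {"age": 35, "has_house": True, "credit_rating": "excellent", "approved": "yes"},
    {"age": 52, "has_house": False, "credit_rating": "good", "approved": "yes"},
    {"age": 23, "has_house": False, "credit_rating": "fair", "approved": "no"},
    {"age": 61, "has_house": True, "credit_rating": "fair", "approved": "yes"},
]

def majority_class(rows, label="approved"):
    """The most frequent class; predicting it for every record is the baseline."""
    counts = Counter(row[label] for row in rows)
    return counts.most_common(1)[0][0]

def accuracy(predictions, rows, label="approved"):
    """Share of records whose predicted class matches the true class."""
    correct = sum(p == row[label] for p, row in zip(predictions, rows))
    return correct / len(rows)

baseline_label = majority_class(records)
baseline_acc = accuracy([baseline_label] * len(records), records)

# Stand-in for a learned model: approve exactly the applicants who own a house.
rule_predictions = ["yes" if row["has_house"] else "no" for row in records]
rule_acc = accuracy(rule_predictions, records)
print(baseline_label, baseline_acc, rule_acc)
```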
Supervised vs. unsupervised learning
Supervised:
- the data are labeled with pre-defined classes
Unsupervised (clustering):
- class labels of data are unknown
- given a data set, the task is to establish the existence of classes or clusters
Supervised learning process
Two steps:
- Learning (training): divide the data into training data and test data and learn a model using the training data
- Testing: test the model using unseen test data to determine the model accuracy (IMPORTANT: this assumes that the training data and the test data follow the same distribution; this assumption is often violated!)
Accuracy = number of correct classifications / total number of test cases
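The two steps can be sketched with a very simple classifier: a single threshold on one attribute (a one-level decision tree, or "decision stump"). The credit-score data, the 70/30 split and all function names are assumptions for illustration:

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the data and split it into (training data, test data)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def learn_threshold(train):
    """Learning: pick the threshold that classifies the training data best."""
    best_t, best_correct = None, -1
    for t in sorted({x for x, _ in train}):
        correct = sum((x >= t) == label for x, label in train)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def accuracy(threshold, rows):
    """Number of correct classifications / total number of cases."""
    correct = sum((x >= threshold) == label for x, label in rows)
    return correct / len(rows)

# Hypothetical data: (credit score, approved?), approved exactly when score >= 600.
data = [(score, score >= 600) for score in range(400, 800, 20)]

train, test = train_test_split(data)
threshold = learn_threshold(train)          # step 1: learning on training data
train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)   # step 2: testing on unseen data
print(threshold, train_accuracy, test_accuracy)
```

The test accuracy can be lower than the training accuracy, because the learned threshold only reflects the scores that happened to land in the training data.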
Methods of Data Mining & Machine Learning
Supervised learning:
- decision tree (one of the most widely used techniques, very efficient!)
- perceptron
- logistic regression
- ...
Unsupervised learning:
- k-means clustering (partitioning method, good for big data sets, but # of clusters has to be known/determined beforehand)
- hierarchical clustering methods (single linkage, complete linkage etc. to determine # of clusters)
- Hidden Markov Model (HMM)
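A minimal sketch of k-means on one-dimensional data. The number of clusters is fixed beforehand through the initial centroids; the data points, the starting centroids and the function name are made up for illustration:

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups; k = 2 is chosen beforehand.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No class labels are used anywhere: the two groups are established from the distances between the data points alone.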