Big Data Flashcards
What are the FOUR dimensions of Big Data?
- Volume: refers to the quantity of available data
- Velocity: refers to the rate at which the data is recorded/collected
- Veracity: refers to the quality and applicability of the data
- Variety: refers to the different types of available data
What characterizes big data/how is it defined?
Big Data is an information asset characterized by such high Volume, Velocity and Variety that it requires specific technology and analytical methods for its transformation into value.
Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
What are the drivers behind big data?
- Increased data volumes being captured and stored
- Rapid acceleration of data growth
- Increased data volumes pushed into the network
- Growing variation in types of data assets for analysis
- Alternate and unsynchronized methods for facilitating data delivery
- Rising demand for real-time integration of analytical results
What is NoSQL?
“Not only SQL” –> alternate model for data management
- Provides a variety of methods for managing information to best suit specific business process needs, such as in-memory data management, columnar layouts to speed query response, and graph databases
What is MPP? and how is it related to Big Data?
Massively Parallel Processing
–> A type of computing architecture that couples many processors, utilizing high-bandwidth networks and massive I/O devices
RELATION TO BD:
- Big data approaches go a step further, coupling clusters of commodity hardware components with open source tools and technology
What five aspects will a corporation considering incorporating Big Data need to consider?
• Feasibility: Is the enterprise aligned in a way that allows for new and emerging technologies to be brought into
the organization?
• Reasonability: will the resource requirements exceed the capability of the existing or planned environment?
• Value: do the results warrant the investment?
• Integrability: any constraints or impediments within the organization from a technical, social, or political
perspective?
• Sustainability: are costs associated with maintenance, configuration, skills maintenance, and adjustments to the
level of agility sustainable?
Name the 7 types of people needed for implementing Big Data?
1) Business evangelist –> understands the current limitations of the existing tech infrastructure
2) Technical evangelist –> understands the emerging tech and the science behind it
3) Business analyst –> engages the business process owners and identifies measures to quantify
4) Big Data application architect –> experienced in high-performance computing
5) Application developer –> identifies the technical resources with the right set of skills for programming
6) Program manager –> experienced in project management
7) Data scientist –> experienced in coding and statistics/AI
What is the Big Data framework? and what key components does it consist of?
Overall picture of the Big Data landscape, consists of:
- Infrastructure (e.g. SAP and SQL)
- Analytics (e.g. Google analytics)
- Applications (e.g. Human capital, legal, security)
- Cross-infrastructure analytics (Google, Microsoft, Oracle)
- Open source (e.g. R-studio)
- Data Sources and API(Application Programming Interface)s (Garmin, Apple)
What is API?
Application programming interface
is a set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact. Additionally, APIs are used when programming graphical user interface (GUI) components
- Enables software components to talk to each other
Which is better, row- or column-oriented data?
Column-oriented data, since storing each column separately reduces latency
- Access performance: row-oriented storage handles many simultaneous queries poorly (as opposed to column-oriented)
- Speed of aggregation: much faster with column-oriented data
- Suitability to compression: column-oriented data is better suited for compression, decreasing storage needs
- Data load speed: faster with column-oriented storage; since each column is stored separately, you can load in parallel using multiple threads
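As a rough illustration (a toy Python sketch, not a real storage engine; the table and values are invented), the same data held row-wise vs column-wise, and why a column aggregate touches less data:

```python
# Row-oriented: one record per entry; an aggregate over "price"
# must walk every record and pick the field out of each row.
rows = [
    {"id": 1, "product": "A", "price": 10.0},
    {"id": 2, "product": "B", "price": 12.5},
    {"id": 3, "product": "A", "price": 9.0},
]
total_row_oriented = sum(r["price"] for r in rows)

# Column-oriented: each column is stored contiguously, so the same
# aggregate only reads the "price" column (and it compresses better,
# since values of one type/domain sit next to each other).
columns = {
    "id": [1, 2, 3],
    "product": ["A", "B", "A"],
    "price": [10.0, 12.5, 9.0],
}
total_column_oriented = sum(columns["price"])

assert total_row_oriented == total_column_oriented
print(total_column_oriented)  # 31.5
```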
Hardware versus software?
Go to slide 36 and 37 and discuss
Name the four tools and techniques?
Processing capability
- Often several interconnected nodes, allowing tasks to be run simultaneously (multithreading)
Storage of data
Memory
- Holds the data for the tasks currently running on the node
Network
- Communication infrastructure between the nodes
What types of architectural cluster topologies exist? And what are the two OVERALL types?
Slide 42 and 43:
OVERALL: centralized and decentralized
- Fully connected network topology
- Mesh network topology
- Star network topology
- Common bus topology
- Ring network topology
What does the general architecture distinguish between? and what are their roles?
Management of computing resources
- oversees the pool of processing nodes, assigns tasks and monitors activity
Management of data/storage
- oversees the data storage pool and distributes datasets across the collection of storage resources
What are the three important layers of Hadoop?
- Hadoop Distributed File System (HDFS)
- MapReduce
- YARN: a new generation framework for job scheduling and cluster management
What are the main functions of HDFS?
- Attempts to enable the storage of LARGE files, by distributing the data among a pool of data nodes
- Monitoring of communication between nodes and masters
- Rebalancing of data from one block to another if free capacity is available
- Managing integrity using checksums/digital signatures
- Metadata replication to protect against corruption
- Snapshots/copying of data to establish check-points
What are the four advantages of using HDFS?
1) decreasing the cost of specialty large-scale storage systems
2) providing the ability to rely on commodity components
3) enabling the ability to deploy using cloud-based services
4) reducing system management costs
What is MapReduce?
- It is a software framework
- Used to write applications which process vast amounts of data in parallel on large clusters
- It is fault-tolerant
- Combines both data and computational independence
(both data and computations can be distributed across nodes, which enables strong parallelization)
What are the two steps in MapReduce?
Map: Describes the computation analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs
Reduce: the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results
Example; count the number of occurrences of a word in a corpus:
key: is the word
value: is the number of times the word is counted
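A minimal single-process sketch of the word-count example in Python (no Hadoop involved; the function names and toy corpus are just illustrative):

```python
from collections import defaultdict

corpus = ["big data is big", "data is valuable"]

# Map: each document -> list of intermediate (word, 1) key/value pairs
def map_phase(document):
    return [(word, 1) for word in document.split()]

# Shuffle/group: collect all values emitted for the same key
grouped = defaultdict(list)
for doc in corpus:
    for key, value in map_phase(doc):
        grouped[key].append(value)

# Reduce: combine the values of each intermediate key into the result
def reduce_phase(key, values):
    return key, sum(values)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```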
What is parallelization?
the act of designing a computer program or system to process data in parallel. Normally, computer programs compute data serially: they solve one problem, and then the next, then the next. If a computer program or system is parallelized, it breaks a problem down into smaller pieces that can each independently be solved at the same time by discrete computing resources
What are the four use cases for big data?
Counting; document indexing, filtering, aggregation
Scanning; sorting, text analysis, pattern recognition
Modeling; analysis and prediction
Storing; rapid access to stored large datasets
What is data mining?
The art and science of discovering knowledge, insights and patterns in data
- e.g. predicting the winning chances of a sports team
- or identifying friends and foes in warfare
- or forecasting rainfall patterns in a region
It helps recognize the hidden value in data
Describe the typical process of data mining?
- Understand the application domain
- Identify data sources and select target data
- Pre-process: cleaning, attribute selection
- Data mining to extract patterns or models
- Post-process: identifying interesting or useful patterns
- Incorporate patterns in real world tasks
OR:
Data input –> data consolidation –> data cleaning –> data transformation –> data reduction –> well-formed data
In terms of data mining, what does ETL stand for?
Extract, transform, load
What are some of the common mistakes of data-mining?
- Selecting the wrong problem for data mining
- Not leaving sufficient time for data acquisition, selection and preparation
- Looking only at aggregated results and not at individual records/predictions
- Being sloppy about keeping track of the data mining procedure and results
- Ignoring suspicious (good or bad) findings and quickly moving on
- Running mining algorithms repeatedly and blindly, without thinking about the next stage
- Naively believing everything you are told about the data
- Naively believing everything you are told about your own data mining analysis
Describe the Harvard Business Review matrix
Two parameters;
1) Not useful –> Very useful
2) Time-consuming –> not time-consuming
Upper-left: Plan
Upper-right: learn
Lower-left: ignore
Lower-right: browse
BEWARE –> VERY bad matrix, cannot be used
Why do we have an error term in a statistical linear model?
Because the model is not deterministic (perfect)
Error terms are normally distributed!
What method is usually used to estimate a linear relationship?
Ordinary Least Square (OLS)
How do we define the slope, i.e. B1?
B1 = Cov(x,y) / Var(x)
How do we define the intercept?
B0 = ȳ - B1·x̄ (using the sample means of y and x)
What is the estimated variance of error?
SSE / (n - 2):
SSE: sum of squared errors, a.k.a. the sum of squared residuals (RSS/SSR)
How do we find R^2 and what is its nickname?
Coefficient of determination
ESS / TSS or (1 - RSS/TSS)
Or: the explained sum of squares over the total sum of squares
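A small numpy sketch (with made-up data) tying the last few cards together: the slope, intercept, estimated error variance and R² computed directly from the formulas above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope = Cov(x,y)/Var(x)
b0 = y.mean() - b1 * x.mean()                          # intercept = mean(y) - slope*mean(x)

residuals = y - (b0 + b1 * x)
sse = np.sum(residuals ** 2)                           # sum of squared errors
sigma2_hat = sse / (len(x) - 2)                        # estimated error variance = SSE/(n-2)
r2 = 1 - sse / np.sum((y - y.mean()) ** 2)             # R^2 = 1 - RSS/TSS

print(b1, b0, sigma2_hat, r2)
```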
What are the four principal assumptions that justify the use of a linear model?
1) Linearity and additivity
- The slope of the line does not depend on the values of the other variables
- The effects of the independent variables are additive in estimating the dependent variable
2) Statistical independence of the errors
3) Homoscedasticity: constant variance of errors
4) Normality of the error distribution
What happens when linearity and additivity is violated in the OLS?
- NOT good
- Use another model that is not linear
Either a logarithmic transformation (if all data points are positive) or a polynomial model
What happens when independence of the error terms is violated in the OLS?
Diagnosis: look at a residual time series plot
- Often a concern in panel data/longitudinal data
- If there are minor cases of positive serial correlation, add the lagged independent variable as a predictor
- If there are minor cases of negative serial correlation, solve by differencing
- If large serial correlation –> RESTRUCTURE THE MODEL, maybe standardize all variables
What happens when homoscedasticity is violated using the OLS?
Diagnosis; plot the residuals versus the predicted values
- You get imprecise predictions and confidence intervals
- If errors are increasing over time, CIs for out-of-sample predictions are unrealistically narrow
SOLUTION: robust standard errors or transformation of model
What happens when normality is violated using the OLS?
Diagnosis; Plot of normality or a Shapiro-Wilk test
- Makes it very hard to determine if a coefficient is significantly different from zero and to provide CIs
BUT: if your goal is only to estimate the values of the coefficients, THIS IS NOT A PROBLEM; it only matters if you have to do predictions
What model can we use, in the case the outcome is not normal (Gaussian)?
Generalized Linear Model
What is a Generalized Linear Model (GLM)?
Characterized by THREE components:
1) Random: associated with the dependent variable and its probability distribution
2) systematic: identifies the selected covariates through a linear predictor
3) link function: identifies the function of E[Y] such that it is equal to the systematic component
Normally, will a binary response be derived from a linear relation?
No; it will typically be non-linear –> could use logistic regression
What shape does the logistic regression have?
S-curved
Since the predicted probability approaches, but almost never reaches, 0 or 1
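A sketch of a logistic regression fitted as a GLM (binomial random component, logit link), assuming statsmodels is available; the simulated data and coefficients are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200)
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))        # true S-shaped probability
y = rng.binomial(1, p)                        # binary response

X = sm.add_constant(x)                        # intercept + predictor
model = sm.GLM(y, X, family=sm.families.Binomial())  # logit link is the default
result = model.fit()
print(result.params)                          # roughly recovers 0.5 and 2.0
```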
What is panel data and what is the associated model?
Panel (a.k.a. longitudinal) data: repeated measures of the same subjects –> Panel Regression
- AVOID GLM and LM, since the independence assumption does not hold anymore
What are the pros of a Panel regression?
- account for sample heterogeneity and compute individual-specific estimates
- suitable to study the dynamics
- minimise bias due to aggregation (like average over time)
- enable the control for unobserved variables like cultural factors or difference in business practices across companies (i.e. national policies, federal regulations, international agreements, etc.)
- enable more complex hierarchical models (Generalized Additive Models, GAM)
What are the cons of a Panel regression?
- data collection (i.e. sampling design, coverage)
- unwanted correlation (i.e. same country but different measures)
- analyses are much more complex
What are the two possible approaches to a panel regression, and how do they work?
Fixed effects (FE) model –> explores the relationship between predictor and outcome variables within a subject
Random effects (RE) model –> Assumes that the variation across subjects is random and uncorrelated with the predictors
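A hand-rolled sketch of the fixed-effects idea (the "within" transformation) on a made-up toy panel; this is one common way to compute the FE estimate, not necessarily how a given software package does it:

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "x":    [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "y":    [2.0, 4.1, 5.9, 7.0, 9.2, 11.1, 3.5, 5.4, 7.6],
})

# Within transformation: demean x and y inside each firm, so that
# time-invariant firm characteristics drop out of the equation
demeaned = panel[["x", "y"]] - panel.groupby("firm")[["x", "y"]].transform("mean")

# Plain OLS on the demeaned data gives the fixed-effects (within) slope,
# estimated only from variation within each firm
beta_fe = np.cov(demeaned["x"], demeaned["y"], ddof=1)[0, 1] / np.var(demeaned["x"], ddof=1)
print(beta_fe)
```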
What are the two assumptions of the FE model?
1) Correlation between the subject's error term and the predictors –> FE removes the effect of time-invariant characteristics, so we can estimate the NET effect of the predictors on the outcome
2) All the time-invariant characteristics are individual-specific, so they are NOT correlated with those of other individuals
CAUTIONS: lots of dummy variables and an increased risk of multicollinearity
When do we use the RE model and what is the assumptions behind?
Used when differences across subjects have some influence on the dependent variable
Assumption: The subject’s error term is uncorrelated with the predictors –> therefore, you can use time-invariant variables as EXPLANATORY variables
CAUTIONS: RE needs REALLY strong assumptions, and FE is often more convincing. The choice between RE and FE can be tested with the Hausman test
What are the between and within-entity errors in the RE model?
Slide 102
Between-entity error: u_it; within-entity error: e_it
What are the two overall types of classifications for clustering?
1) Clustering (Unsupervised learning)
- No prior classification, algorithm explores all possible combinations
2) Supervised learning
- We do have prior knowledge on the data; we know the labels, or categories etc.
- Training a computer to learn a new system we created
- Based on data, therefore, the more data, the better the system
What are the three partitions of unsupervised learning in clustering, and what are the sub-categories?
Hard clustering
1) Hierarchical methods (agglomerative and divisive)
2) Partitioning methods (non-hierarchical)
Soft clustering
3) Fuzzy methods
More complex methods
4) Density-based methods
5) Model-based methods
In regards to clustering, explain the concept of (statistical) distance
- Needed for every classification method
- Distance defines the DISsimilarity between two points
- Numerous distance measures exist –> think of the dissimilarity matrix (slide 118)
Name two examples of measures of statistical distance
1) Pearson correlation distance
2) Eisen cosine correlation distance
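A small numpy sketch of the Pearson correlation distance, d(a,b) = 1 - cor(a,b), on invented observations, building the kind of dissimilarity matrix mentioned above:

```python
import numpy as np

X = np.array([
    [1.0, 2.0, 3.0, 4.0],   # obs 0
    [2.0, 4.0, 6.0, 8.0],   # obs 1: same profile as obs 0, different scale
    [4.0, 3.0, 2.0, 1.0],   # obs 2: opposite trend
])

n = X.shape[0]
dist = np.zeros((n, n))                      # dissimilarity matrix
for i in range(n):
    for j in range(n):
        dist[i, j] = 1 - np.corrcoef(X[i], X[j])[0, 1]

print(np.round(dist, 2))
# obs 0 vs 1 -> 0.0 (perfectly correlated), obs 0 vs 2 -> 2.0 (anti-correlated)
```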
How does hierarchical clustering methods work? and which two types exist? How is visualized?
1) Agglomerative: each observation starts as its own cluster. Iteratively, the most similar clusters (leaves) are merged until a single cluster remains (the root)
2) Divisive: the INVERSE of agglomerative. It begins with the root, and the most heterogeneous clusters are subsequently divided until each observation forms its own cluster
Thus, there is no need to define the number of cluster groups beforehand
Visualization; Dendrogram
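A minimal agglomerative-clustering sketch using scipy's hierarchical clustering on toy 2-D points (the data and parameter choices are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (5, 2)),   # one cloud around (0, 0)
                    rng.normal(3, 0.3, (5, 2))])  # one cloud around (3, 3)

Z = linkage(points, method="ward")               # agglomerative merge history
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)                                    # e.g. [1 1 1 1 1 2 2 2 2 2]

# dendrogram(Z)  # visualize the merge tree (requires matplotlib)
```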
What are the pros and cons of the hierarchical method of clustering?
Pros:
• No a-priori information about the number of clusters required
• Easy to implement
• Very replicable
Cons:
• Not very efficient: O(n² log n)
• Based on a dissimilarity matrix, which has to be chosen in advance
• No objective function is directly minimised
• The dendrogram is not the best tool to choose the optimum number of clusters
• Hard to treat non-convex shapes
How does the partitioning method of clustering work?
- Simplest method
- Requires predefined number of clusters
- iterative methods, geometry based
3 most famous types:
1) K-means: each cluster is represented by the center of the cluster
2) K-medoids or PAM: each cluster is represented by one of the points in the cluster
3) CLARA (Clustering LARge Applications): Suitable for large datasets
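A short k-means sketch with scikit-learn on toy data (cluster locations and parameters are invented), showing that k must be fixed in advance and that each cluster is summarized by its centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print(km.cluster_centers_)            # roughly (0, 0) and (5, 5)
print(km.labels_[:5], km.labels_[-5:])
```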
What are the pros and cons of the partitioning method of clustering?
Pros:
• k-means is relatively efficient: O(tkn), with k, t ≪ n
• easy to implement and understand
• totally replicable
Cons:
• PAM does not scale well for large datasets
• Applicable only when mean is defined (i.e. no categorical data)
• Need to specify k in advance
• k-means is unable to handle noisy data and outliers. PAM does better
• Not suitable to discover clusters with non-convex shapes
How do we check whether the data has a clustering tendency, or whether we are just synthesising clusters?
- Use the Hopkins statistic test; it measures the probability that your dataset is uniformly distributed
- i.e. it tests the spatial randomness of the data
Mechanics:
- Based on each observation's distance to its nearest neighbor
- This distance is then compared to a random sample point's distance to its nearest real neighbor
Formula:
Avg. dist. to random neighbor / (avg. dist. to random neighbor + avg. dist. to real neighbor)
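A rough sketch of the Hopkins statistic as described above (the function name, sample size m and sampling choices are my own; implementations differ in the details). Values near 0.5 suggest random data; values near 1 suggest clustering tendency:

```python
import numpy as np

def hopkins(X, m=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample = X[rng.choice(n, size=m, replace=False)]
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))

    def nn_dist(points, exclude_self):
        d = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=2)
        if exclude_self:
            d[d == 0] = np.inf          # ignore each point's distance to itself
        return d.min(axis=1)

    u = nn_dist(uniform, exclude_self=False)  # random point -> nearest real point
    w = nn_dist(sample, exclude_self=True)    # real point -> nearest real neighbour
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2))])
print(hopkins(X))   # clearly above 0.5 for this clustered toy data
```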
Three ways to find the optimal number of clusters?
Plenty of methods exist, however we focus only on 3:
1) Silhouette
2) Gap-statistics
3) Within sum of squares
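A sketch of the silhouette approach to choosing k, assuming scikit-learn is available; the toy data and the range of k tried are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.4, (30, 2)),
               rng.normal(4, 0.4, (30, 2)),
               rng.normal((0, 4), 0.4, (30, 2))])

# Fit k-means for several k and keep the k with the highest average silhouette
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=4).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)   # expected: 3
```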
Name the four overall methods of supervised learning
1) Heuristic approach
- k-NN (nearest neighbor)
2) Model-based approach
- Linear discriminant analysis
- Quadratic discriminant analysis
- Logistic
- Naïve Bayes
3) Binary decision
- Classification and regression trees (CART)
- Random forest
4) Optimisation based
- Support Vector Machine (SVM)
- Neural Networks
What is Naïve Bayes, and how does it work?
- A probabilistic machine learning algorithm
- Based on two components: conditional probability and Bayes' rule
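A minimal Naïve Bayes sketch using scikit-learn's GaussianNB on invented data (any Naïve Bayes variant would illustrate the same Bayes-rule idea):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# The classifier applies Bayes' rule, P(class | features) ∝ P(features | class) * P(class),
# with the "naive" assumption that features are conditionally independent given the class.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # class 0
               rng.normal(3, 1, (50, 2))])     # class 1
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.2, -0.1], [2.8, 3.1]]))    # -> [0 1]
print(clf.predict_proba([[1.5, 1.5]]).round(2))  # posterior class probabilities
```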