Exam Practice Flashcards
What is Big Data?
Four pillars:
1) Information
2) Technology
3) Impact
4) Methods
Big Data is the Information Asset
What dimensions underlie Big Data?
1) Volume: quantity of available data
2) Velocity: rate at which data is collected/recorded
3) Veracity: quality and applicability of data
4) Variety: different types of data available
What is Gartner’s description of Big Data?
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
Why do we call it “Big” Data?
Because the data, and the resources required to process it, exceed the capabilities of traditional computing environments
What are the drivers of Big Data?
Non-exhaustive list:
- Increased data volumes
- Rapid acceleration of data growth
- Growing variation in data types for analysis
- Alternate and unsynchronized methods for facilitating data delivery
- Rising demand for real-time integration of analytical results
What is the process for piloting technologies to determine their feasibility and business value and for engaging business sponsors and socializing the benefits of a selected technique?
1) channel the energy and effort of test-driving big data technologies
2) determine whether those technologies add value
3) devise a communication strategy for sharing the message with the right people within the organisation
What must happen to bring big data analytics into the organization’s system development life cycle to enable its use?
1) develop tactics for technologists, data management professionals and business stakeholders to work together
2) migrate the Big Data projects into the production environment in a controlled and managed way
How to assess value of Big Data?
1) feasibility: does the organisational setup permit new and emerging technologies?
2) reasonability: are the resource requirements within capacity?
3) value: do the results warrant the investment?
4) integrability: any impediments within the organisation?
5) sustainability: maintenance costs manageable?
What is Hadoop? And mention the three important layers.
Apache Hadoop is a collection of open-source software utilities for distributed storage and processing of Big Data using the MapReduce programming model
Important layers:
1) Hadoop Distributed File System (HDFS)
2) MapReduce
3) YARN: job scheduling and cluster management
How can organisations plan to support Big Data?
Get the people right (Business Evangelists, Technical Evangelists, Business Analysts, Big Data Application Architect, Application Developer, Program Manager, Data Scientists)
What is parallel computing?
Type of computation where many calculations are carried out simultaneously. Problems can be broken into pieces and solved at the same time.
Parallelism has long been employed in high-performance computing (multi-core processors).
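A minimal Python sketch of this idea (a hypothetical example, not from the course material): the work is split into independent pieces that worker processes compute simultaneously.
```python
# Parallelism sketch: split one problem into independent pieces and
# compute them at the same time using a pool of worker processes.
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:              # four worker processes
        results = pool.map(square, range(10))    # pieces solved in parallel
    print(results)                               # [0, 1, 4, 9, ...]
```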
What is distributed computing?
Model in which components of a software system are shared among multiple computers to improve efficiency and performance.
For example, in the typical distribution using the 3-tier model, user interface processing is performed in the PC at the user’s location, business processing is done in a remote computer, and database access and processing is conducted in another computer that provides centralized access for many business processes. Typically, this kind of distributed computing uses the client/server communications model.
What is the Big Data Landscape 2016?
- Infrastructure (e.g. Hadoop)
- Analytics (e.g. Statistical Computing)
- Applications (e.g. Sales and Marketing)
- Cross-Infrastructure/Analytics (e.g. Google)
- Open Source
- Data Sources (e.g. Apple Health)
What is the Big Data Landscape of 2019?
- Infrastructure
- Analytics and Machine Learning
- Applications - Enterprise
- Applications - Industry
- Cross-Infrastructure/Analytics
- Open Source
- Data Sources
- Data Resources
What is the Big Data framework?
Analytical applications that combine the means for developing and implementing algorithms, which must access, consume and manage data
What is encompassed in a technological ecosystem?
- Scalable storage
- Computing platform
- Data management environment
- Application development framework
- Scalable analytics
- Project management processes and tools
Describe row-oriented data
The entire record must be read to access the required attributes
Traditional database systems employ a row-oriented layout: the values of each row are laid out consecutively in memory
Describe column-oriented data
Values are stored column by column, one column per variable
Columns can be stored and accessed separately
Reduced latency when accessing data, compared to the row-oriented layout
What are the key differences between row- and column-oriented data?
Four dimensions of comparison:
1) Access performance: column faster than row
2) Speed of joins and aggregation: column has less access latency than row
3) Suitability to compression: with column-oriented storage you can compress data to decrease storage needs while maintaining high performance; it is difficult to use compression to improve performance with row-oriented storage
4) Data load speed: in row-oriented storage, all of a record’s values are stored together, which prevents parallel loading; in column-oriented storage the data can be segregated, allowing columns to be loaded in parallel (e.g. on multiple cores) with a separate thread working on each column
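A toy Python sketch of the two layouts (the field names are made up for illustration): the same three records stored row-wise and column-wise, and how an aggregation over one attribute touches the data in each case.
```python
# Row-oriented: each record's values are stored together, so reading
# one attribute still touches every full record.
rows = [
    {"id": 1, "city": "Oslo",   "sales": 120},
    {"id": 2, "city": "Aarhus", "sales": 95},
    {"id": 3, "city": "Malmo",  "sales": 143},
]
total_row = sum(r["sales"] for r in rows)      # scans whole records

# Column-oriented: each attribute is stored separately, so an
# aggregation over one column reads only that column.
columns = {
    "id":    [1, 2, 3],
    "city":  ["Oslo", "Aarhus", "Malmo"],
    "sales": [120, 95, 143],
}
total_col = sum(columns["sales"])              # reads a single column
```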
Describe tools and techniques for data management.
1) Processing capability: processing nodes often incorporate multiple cores so that tasks can run simultaneously
2) Memory: holds the data currently being processed on the node and generally has an upper limit per node
3) Storage: provides persistence of data and it is the place where datasets and databases are kept ready to be accessed
4) Network: this is the communication infrastructure between nodes and allows for information exchange
What is a cluster in data architecture?
A collection of interconnected nodes
Mention the cluster architecture types from class.
- Fully connected network topology (all-to-all)
- Common bus topology (sequence, one-to-next)
- Mesh network topology (some-to-some)
- Star network topology (one-to-many)
- Ring network topology (neighbor-to-neighbor)
Describe in detail the three layers of Hadoop.
1) HDFS
Enables storage of large files by distributing the data among a pool of data nodes. An HDFS file appears to be a single file, even though it is split into blocks (“chunks”) that are stored on individual data nodes. HDFS provides a level of fault-tolerance through data replication.
2) MapReduce
Used to write applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes). It is also fault-tolerant.
3) YARN
Handles job scheduling and management of cluster resources for the applications running on Hadoop.
What is the value proposition of HDFS in Hadoop?
1) Decreasing the cost of specialty large-scale storage systems
2) Providing the ability to rely on commodity components
3) Enabling deployment using cloud-based services
4) Reducing system management costs
Describe the MapReduce framework from Hadoop, and give an example of where MapReduce can be used.
Two steps:
1) Map: describes the computation or analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs
2) Reduce: the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results
- MapReduce is a series of basic operations applied in a sequence to small chunks of many datasets
- Combines both data and computational independence
- It is fault-tolerant
- Can be used to count the number of occurrences of each word in a corpus (breaking down each document, each paragraph, each sentence; see slide 22 in Lecture 2)
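A minimal Python sketch of the word-count example (not Hadoop itself, just the map, shuffle and reduce pattern applied in memory to a tiny made-up corpus):
```python
# Word count in the MapReduce style: a map step emits (word, 1) pairs,
# the pairs are grouped by key, and a reduce step sums the counts.
from collections import defaultdict

documents = ["big data is big", "data demands processing"]

def map_step(doc):
    return [(word, 1) for word in doc.split()]

def reduce_step(word, counts):
    return word, sum(counts)

# Shuffle/group the intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_step(doc):
        grouped[word].append(count)

result = dict(reduce_step(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'demands': 1, 'processing': 1}
```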
What is cloud computing?
Cloud computing is the on-demand delivery of computing resources (storage, processing, services) over the internet, so data is stored and processed on remote, shared infrastructure rather than on local machines
What is data mining?
An art and science of discovering knowledge, insights and patterns in data
Why data mining?
- To recognize hidden value in data
- To effectively gather quality data and efficiently process it
Outline steps in a typical data mining process.
1) Understand the application domain
2) Identify data sources and select target data
3) Pre-process: cleaning, attribute selection
4) Data mining to extract patterns or models
5) Post-process: identifying interesting or useful patterns
6) Incorporate the process into real-world tasks
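A minimal illustration of steps 3 to 5 in Python, assuming scikit-learn and its built-in iris dataset as a stand-in for a real application domain:
```python
# Pre-process (scale), mine (fit a classifier), then inspect the result.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)                      # pre-processing
model = DecisionTreeClassifier().fit(scaler.transform(X_train), y_train)

preds = model.predict(scaler.transform(X_test))             # apply the mined model
print(accuracy_score(y_test, preds))                        # post-process: evaluate
```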
Name common mistakes around data mining.
1) Wrong problem for mining
2) Not having sufficient time for data acquisition
3) Focusing only on aggregated results
4) Being sloppy with the procedure
5) Ignoring suspicious findings
6) Running mining algorithms repeatedly and blindly
7) Naively believing in the data
What method do we use for estimating a linear relationship statistically?
Ordinary Least Squares
We use OLS because it minimizes the sum of squared errors (residuals)
What is pseudo out-of-sampling testing?
Split the dataset in two, estimate the model on one part, and then make predictions on the other part
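A small Python sketch of the idea, assuming statsmodels and simulated data: the model is fitted on the first part of the sample and evaluated on the held-out part.
```python
# Pseudo out-of-sample test: estimate on one part, predict on the rest.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=200)

X = sm.add_constant(x)
model = sm.OLS(y[:150], X[:150]).fit()            # estimate on the first part
pred = model.predict(X[150:])                     # predict on the held-out part
rmse = np.sqrt(np.mean((y[150:] - pred) ** 2))    # out-of-sample error
print(rmse)
```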
How do we compute beta in a single variable OLS?
How do we compute the intercept in a single variable OLS?
beta = Cov[X,Y] / Var[X]
intercept = E[Y] - beta * E[X] (plug the sample means into the fitted line)
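A quick numerical check of these formulas on simulated data (a numpy sketch; the true coefficients are chosen arbitrarily):
```python
# Closed-form single-variable OLS: slope from the covariance/variance
# ratio, intercept from plugging in the sample means.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.2, size=100)

beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - beta * x.mean()
print(beta, intercept)   # close to 0.8 and 1.5
```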
What are the model assumptions behind OLS?
1) Linearity and additivity
2) Statistical independence of errors
3) Homoscedasticity: constant variance of the errors
4) Normality of the error distribution
How to diagnose violation of linearity?
Plot observed vs. predicted values
How to diagnose violation of independence (concern in time series)?
Run a Durbin-Watson test on the residuals and check whether they exhibit autocorrelation
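A short sketch using statsmodels' durbin_watson on the residuals of a fitted OLS model, with simulated data; a statistic near 2 indicates no first-order autocorrelation.
```python
# Durbin-Watson statistic on OLS residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))   # ~2 here, since the errors are independent
```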
What statistical tests exist for assessing normality in OLS?
Shapiro-Wilk, Kolmogorov-Smirnov, Jarque-Bera, Anderson-Darling
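A sketch applying these tests with scipy.stats, using a simulated sample as a stand-in for model residuals; small p-values suggest a departure from normality.
```python
# The four normality tests applied to the same sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
resid = rng.normal(size=200)                     # stand-in for OLS residuals
z = (resid - resid.mean()) / resid.std()         # standardize for the KS test

print(stats.shapiro(resid))                      # Shapiro-Wilk
print(stats.kstest(z, "norm"))                   # Kolmogorov-Smirnov
print(stats.jarque_bera(resid))                  # Jarque-Bera
print(stats.anderson(resid, dist="norm"))        # Anderson-Darling
```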
Why do we need a Generalized Linear Model (GLM)? And what are its three components?
We need a GLM when the response variable does not follow a normal (Gaussian) distribution, so OLS is not appropriate.
Composed of;
1) Random component: the dependent variable and its probability distribution
2) Systematic component: the selected covariates, combined through a linear predictor
3) Link function: the function g applied to E[Y] so that g(E[Y]) equals the systematic component (e.g. a log link)
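As a sketch, a Poisson GLM with the default log link in statsmodels, on simulated data (coefficient values chosen arbitrarily); the three components are marked in the comments.
```python
# GLM sketch: Poisson random component, linear predictor, log link.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(200, 2)))              # systematic component
mu = np.exp(X @ np.array([0.2, 0.5, -0.3]))                 # inverse of the log link
y = rng.poisson(lam=mu)                                     # random component

glm = sm.GLM(y, X, family=sm.families.Poisson())            # log link is the default
print(glm.fit().params)
```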
What is the logistic regression? (Mention nature of response variable and derive if needed)
Binary response variable. The linear predictor is passed through the logistic (sigmoid) function, P(Y = 1 | X) = 1 / (1 + exp(-X*beta)); equivalently, the log-odds are linear: log(p / (1 - p)) = X*beta.
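A minimal sketch with statsmodels' Logit on simulated data (the true coefficients are chosen arbitrarily):
```python
# Logistic regression: binary outcomes generated through the sigmoid.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(300, 1)))
p = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.2]))))   # P(Y = 1 | X)
y = rng.binomial(1, p)                                # binary response

print(sm.Logit(y, X).fit(disp=0).params)              # roughly recovers [-0.5, 1.2]
```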
Mention key pros and cons of the panel regression model.
Pros:
- suitable for studying dynamics
- minimize bias
- time fixed effects to control for unobserved variables
Cons:
- data collection
- unwanted correlation
- complexity
Explain the difference between Random Effects and Fixed Effects Panel Regression.
FE: removes (controls for) each subject’s time-invariant characteristics, so estimation uses only within-subject variation
RE: assumes variation across subjects is random and uncorrelated with the predictors
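A minimal fixed-effects sketch using the within (demeaning) transformation with pandas and statsmodels on simulated data; dedicated panel libraries exist, but demeaning makes the FE idea explicit.
```python
# Fixed effects via the "within" transformation: demeaning by subject
# removes each subject's time-invariant level before running OLS.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(50), 10),   # 50 subjects, 10 periods each
    "x": rng.normal(size=500),
})
subject_effect = np.repeat(rng.normal(size=50), 10)          # time-invariant trait
df["y"] = 2.0 * df["x"] + subject_effect + rng.normal(size=500)

demeaned = df[["x", "y"]] - df.groupby("subject")[["x", "y"]].transform("mean")
fe = sm.OLS(demeaned["y"], demeaned["x"]).fit()              # no constant after demeaning
print(fe.params)                                             # close to 2.0
```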
What types of classification of data exist?
Supervised learning: a classifier is trained on labeled examples and then applied to new data
Clustering (unsupervised): no labels are available; groups are discovered from scratch every time
Mention five families of clustering methods.
1) Hierarchical methods (agglomerative and divisive)
2) Partitioning methods (non-hierarchical)
3) Fuzzy methods
4) Density-based methods
5) Model-based methods
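A short scikit-learn sketch contrasting one hierarchical and one partitioning method on the same simulated two-cluster data:
```python
# Agglomerative (hierarchical) vs. k-means (partitioning) clustering.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),   # cluster around (0, 0)
               rng.normal(3, 0.5, size=(50, 2))])  # cluster around (3, 3)

print(AgglomerativeClustering(n_clusters=2).fit_predict(X))  # hierarchical
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))        # partitioning
```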