Big Data Analytics Management Flashcards
What is the definition of Data Science?
Data science is a set of fundamental principles that guide the extraction of knowledge from data.
What is the definition of Data Mining?
Data mining is the extraction of knowledge from data, via technologies that incorporate these principles.
What is the definition of Data-Driven Decision-Making?
Data-Driven Decision-Making refers to the practice of basing decisions on the analysis of data, rather than purely intuition.
Tasks in data mining:
- Classification and class probability estimation
- Regression (“Value estimation”)
- Similarity matching
- Clustering
- Co-occurrence grouping (market-basket analysis)
- Profiling
- Link prediction
- Data reduction
- Causal modeling
Describe classification and class probability estimation task
It attempts to predict, for each individual in a population, which of a set of classes this individual belongs to.
- Classification gives a definitive output: e.g., will respond or will not respond.
- Class probability estimation gives, for each individual, the probability that the individual belongs to each class.
Describe regression task
Regression attempts to predict, for each individual, the numerical value of some variable for that individual. Example: “How much will a given customer use a service?”
Regression vs. Classification?
Classification predicts WHETHER something will happen, whereas regression predicts HOW MUCH something will happen.
Describe similarity matching task
Similarity matching attempts to IDENTIFY similar individuals based on data known about them. Example: finding companies that are similar to the ones you are already serving.
Describe clustering task
Clustering attempts to GROUP individuals in a population together by their similarity, but not driven by any specific purpose. Example: “Do our customers form natural groups or segments?”
Describe co-occurrence grouping task
It attempts to find ASSOCIATIONS between entities based on transactions involving them. Example: “What items are commonly purchased together?”
Clustering vs. co-occurrence?
While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
Describe profiling task
Profiling attempts to characterize the typical behavior of an individual, group, or population. Example: “What is the typical cell phone usage of this customer segment?”
Describe link prediction task
Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link. Example: “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?”
Describe data reduction task
Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences).
Describe causal modeling task
Causal modeling attempts to help us understand what events actually influence others. Example: “Was this because the advertisements influenced the consumers to purchase? Or did the predictive models simply do a good job of identifying those consumers who would have purchased anyway?” A business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions.
Conditions for supervised learning:
- It has to have a specific target;
- There must be data on the target.
Define label
The value for the target variable for an individual.
Supervised vs. unsupervised tasks
Supervised:
- Classification;
- Regression;
- Causal modeling.
Unsupervised:
- Clustering;
- Co-occurrence grouping;
- Profiling.
Both:
- Similarity matching;
- Link prediction;
- Data reduction.
Second stage of CRISP process - Data Understanding
- The critical part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is merited.
- We need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply.
Third stage of CRISP process - Data Preparation
The data preparation phase often proceeds along with data understanding; in it, the data are manipulated and converted into forms that yield better results.
Define data leak
A data leak is a situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made.
Fifth stage of CRISP process - Evaluation
The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on.
Sixth stage of CRISP process - Deployment
In deployment the results of data mining—and increasingly the data mining techniques themselves—are put into real use in order to realize some return on investment.
Data Mining vs. Software Development
Data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem.
What are informative attributes?
Information is a quantity that reduces uncertainty about something; an informative attribute is one whose value reduces uncertainty about the target variable.
Define predictive model
Predictive model is a formula for estimating the unknown value of interest: the target.
Define supervised learning
Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.
Creation of models from data is known as
Model induction. Induction is a term from philosophy that refers to generalizing from specific cases to general rules (or laws, or truths).
Define training data
The input data for the induction algorithm, used for inducing the model. They are also called labeled data because the value for the target variable (the label) is known.
Define Big Data
Data which does not fit into a single computer and needs multiple computers to process it
Define Artificial Intelligence
Entity that can mimic human behavior
Define data engineering
Methods to handle and prepare data
What characterizes Big Data?
VOLUME
- Data at Rest
- Terabytes to exabytes of existing data to process
VELOCITY
- Data in Motion
- Streaming data, milliseconds to seconds to respond
VARIETY
- Data in Many Forms
- Structured, unstructured, text, multimedia
VERACITY
- Data in Doubt
- Uncertainty due to data inconsistency, incompleteness, ambiguities, latency, deception, model approximations.
What type of data has been increasing rapidly?
Unstructured data - text, video, audio
What is the issue with the growth of unstructured data?
Traditional systems have been designed for transactions, not unstructured data.
How did Google solve the problem of unstructured data?
- Google goes through every page, scans its contents, and builds an index of keywords (which it updates every day)
- Traditional architecture is not enough to process the data
- Solution: cluster architecture. However, every day 900 machines die and need to be replaced
Google's challenges: How to distribute computation across multiple machines in a resource-efficient way? How to ensure that running computations are not lost when machines die? How to ensure that data is not lost when machines die?
Solution: Google File System - redundant storage of massive amounts of data on cheap and unreliable computers - and the MapReduce distributed computing paradigm
What is Hadoop?
Open source software framework for distributed storage and distributed processing that replicated Google’s MapReduce model.
Define MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
The MapReduce model consists of two main stages:
- Map: input data is split into discrete chunks to be processed
- Reduce: output of the map phase is aggregated to produce the desired result
The simple nature of the programming model lends itself to efficient and
large-scale implementations across thousands of cheap nodes (computers).
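A minimal word-count sketch of these two stages, assuming a toy in-memory dataset; a real framework such as Hadoop distributes the map, shuffle, and reduce phases across a cluster of machines.

```python
# Toy illustration of the MapReduce idea (map -> shuffle/group -> reduce).
from collections import defaultdict
from itertools import chain

documents = ["big data big ideas", "data science and big data"]

# Map: split each input chunk into (key, value) pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in chain.from_iterable(map_phase(d) for d in documents):
    grouped[word].append(count)

# Reduce: aggregate the values for each key into the final result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 3, ...}
```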
Data Mining Process
1. Data Engineering and Processing (big data technologies: Google File System, Hadoop)
2. Data Science -> automated DDD
Types of Analytics:
Increasing in value and difficulty:
- Descriptive - what happened? [Reports]
- Diagnostic - why did it happen? [Queries, statistical analysis]
- Predictive - what will happen? [Forecasts, machine learning]
- Diagnostic + Predictive = Prescriptive - how can we make it happen? [Optimization, planning]
Predictive analytics vs. Diagnostic analytics
Predictive analytics does not answer the question WHY; it just predicts. Diagnostic analytics tries to answer WHY something happened.
Why is diagnostic analytics harder than predictive analytics?
Determining correlation is easier than establishing causation (which typically requires a randomized controlled experiment).
How much human input is needed in each type of analytics?
- Descriptive/diagnostic analytics provide insight into the data, so that one can better understand what data to collect and store and provide insight into ways to improve future models.
- Predictive analytics is building a model to predict when something will happen.
- Prescriptive analytics automates action to be taken based on prediction.
Tasks of Descriptive/Diagnostic analytics:
Data visualization; Clustering; Co-occurrence grouping
Tasks of Prescriptive analytics:
- Uplift modeling - predict how individuals behave contingent on the action performed upon them
- Automation - determine optimal action based on predicted reaction of individuals
What does it mean for groups to be pure? (in terms of selecting meaningful attributes)
Homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is pure.
Complications while putting people into segments:
- Attributes rarely split a group perfectly.
- Not all attributes are binary.
- Some attributes take on numeric values (continuous or integer)
Define entropy
Entropy is a measure of disorder that can be applied to a set, such as one of our individual segments. Consider that we have a set of properties of members of the set, and each member has one and only one of the properties. In supervised segmentation, the member properties will correspond to the values of the target variable. Disorder corresponds to how mixed (impure) the segment is with respect to these properties of interest. So, for example, a mixed up segment with lots of write-offs and lots of non-write-offs would have high entropy.
What is a natural measure of impurity for numeric values?
Variance
Overview of technologies on a typical Business Intelligence (BI) stack
How did Netflix create competitive advantage with data?
They had data on what people liked and they personalized/customized/adapted their movies to customers’ preferences
How can SafeSize leverage their data to move to B2C space?
- They stored customers’ data as a ‘shoe profile’;
- They sold their service to e-commerce shoe stores as a widget.
Three ways to collect data from IMDb:
- Manually
- Manually download a file (which someone created)
- Pretending you are a human browsing a web site (web scraping), or using an API
Web scraping can be done using:
- A modern programming language which offers complete flexibility but requires more effort to implement;
- Specialized tools which allow faster implementation but provide less flexibility and make it harder to replicate data collection.
Web scraping steps using a programming language:
- Request a web page
- Parse the HTML
- Filter and transform data to desired format
- Save data.
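A hedged sketch of these four steps, assuming the requests and beautifulsoup4 packages are available; the URL and CSS selector below are hypothetical.

```python
import csv
import requests
from bs4 import BeautifulSoup

# 1. Request a web page (hypothetical URL)
response = requests.get("https://example.com/movies", timeout=10)

# 2. Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# 3. Filter and transform data to the desired format (hypothetical selector)
titles = [tag.get_text(strip=True) for tag in soup.select("h2.movie-title")]

# 4. Save data
with open("movies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([[t] for t in titles])
```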
Web scraping using a dedicated tool:
- import.io is a simple tool that tries to infer what is interesting on a website;
- webscraper.io gives you more flexibility.
Webscraper.io steps
- Define a starting page
- Define category links
- For each individual category or product page, determine which information to collect and which links to follow.
Why is web scraping not ideal?
- Many sites do not allow gathering information automatically.
- Sites detect whether you are human based on frequent requests, cookies, the Robots Exclusion Protocol (stored in robots.txt), and other trackers.
- Not all information is public (you can use authentication and an API to access the protected information).
What is the purpose of Robots Exclusion Protocol?
It tells everyone who is allowed to crawl their page.
Why don't people want their website to be crawled?
Data can be costly to acquire, so companies don’t want to be found.
Three steps for API access:
It is an official way of accessing information automatically.
- Get an API key
- Query an API endpoint using the API key - an API usually provides multiple endpoints or functions (most recent movies, most popular…)
- Process the response
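A hedged sketch of the three steps, assuming the requests package; the endpoint URL, parameter names, and response fields below are hypothetical.

```python
import requests

# 1. Get an API key (from the provider's developer portal)
API_KEY = "your-api-key"

# 2. Query an API endpoint using the API key (hypothetical endpoint and parameters)
response = requests.get(
    "https://api.example.com/movies/popular",
    params={"api_key": API_KEY, "page": 1},
    timeout=10,
)

# 3. Process the response (most APIs return JSON; field names are hypothetical)
data = response.json()
for movie in data.get("results", []):
    print(movie.get("title"))
```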
Twitter provides two types of API:
- Representational State Transfer (REST) APIs: Used for singular queries for one term
- Streaming API: Continuously get the tweets
Data structure (volume/velocity)
- Cross-sectional - data that (almost) never changes; e.g. city names, birth date
- Transactional - one observation represents one transaction; e.g. a website visit
- Panel - one observation represents one individual during a time period; e.g. monthly bill.
Data structure: Tidy data
Tidy data is on a single table according to the following rules:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
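A small sketch of tidying a wide table, assuming pandas is available; the monthly-bill columns below are a made-up example of a layout that violates the first two rules.

```python
import pandas as pd

# Wide layout: one column per month, so "month" is spread across columns.
wide = pd.DataFrame({
    "customer": ["A", "B"],
    "jan_bill": [30, 45],
    "feb_bill": [28, 50],
})

# melt() turns the month columns into rows: each variable gets its own column,
# each observation (customer-month) its own row, each value its own cell.
tidy = wide.melt(id_vars="customer", var_name="month", value_name="bill")
print(tidy)
```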
Types of data (Variety)
- Structured
- Unstructured
Structured data (Variety)
QUALITATIVE/CATEGORICAL DATA
- Nominal;
- Ordinal (satisfaction level);
QUANTITATIVE DATA
- Discrete (countable number);
- Continuous (interval value).
Types of unstructured data (Variety):
- Text-based documents (tweets, webpages)
- Images, videos, audio
What are the methods to transform unstructured data into structured data?
- Topic Modeling (text);
- Sentiment Analysis (text);
- Feature extraction (image/video/sound).
How can data quality be affected (Veracity)?
- Missing data
- Measurement error
Two ways data can be missing:
- Missing observations
- Missing values in some observations
Why is it important to know why data is missing?
You need to know the reason why it is missing because it will inform whether it is a problem or not. Is data missing at random or not?
What if data are missing at random?
It is fine. If data are missing at random, the remaining observations are still a representative sample of the population. Solution: listwise deletion i.e. delete all observations that do not have values for all variables in the analysis.
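A minimal sketch of listwise deletion, assuming pandas and NumPy are available; the toy values below are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 51],
    "income": [52000, 61000, np.nan],
})

# Listwise deletion: drop every observation (row) with a missing value
# in any of the analysis variables.
complete_cases = df.dropna()
print(complete_cases)
```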
What if data are missing not at random?
It is a problem! The remaining observations are not a representative sample of population.
Selection bias (Veracity)
Selection bias occurs when the sampling procedure is not random and thus the sample is not representative of the population.
Two types of selection bias (Veracity)
- Self-selection - some members of the population are more likely to be included in the sample because of their characteristics.
- Attrition - some observations may be less likely to be present in the sample due to time constraints
Define measurement error (Veracity)
Measurement error occurs when the data are collected with errors that are not random.
Types of measurement error
- Recall bias - respondents recall some events more vividly than others (child deaths by gun vs swimming pools);
- Sensitive questions - respondents may not report data accurately (wages, health conditions);
- Faulty equipment - equipment that exhibits systematic measurement error.
What does disorder in terms of entropy represent?
Disorder corresponds to how mixed (impure) the segment is with respect to properties of interest.
Entropy equation:
entropy = -p1 log(p1) - p2 log(p2) - … Each pi is the probability of property i within the set, ranging from 1 (when all members of the set have property i) down to 0 (when no members of the set have property i).
Define information gain (IG)
It measures how much an attribute improves (decreases) entropy over the whole segmentation it creates.
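A small sketch computing entropy and the information gain of one candidate split, following the formula above; the class counts below are made up.

```python
import math

def entropy(probabilities):
    # entropy = -sum(p_i * log2(p_i)); terms with p_i = 0 contribute nothing
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def probs(counts):
    total = sum(counts)
    return [c / total for c in counts]

# Parent segment: 7 write-offs, 5 non-write-offs (made-up counts)
parent = [7, 5]
# Candidate attribute splits the parent into two child segments
children = [[6, 1], [1, 4]]

parent_entropy = entropy(probs(parent))
# Information gain = parent entropy minus the size-weighted child entropies
weighted_child_entropy = sum(
    (sum(child) / sum(parent)) * entropy(probs(child)) for child in children
)
information_gain = parent_entropy - weighted_child_entropy
print(round(information_gain, 3))
```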
How to address the issue in classification when the probability may be overly optimistic with small samples?
We use Laplace correction. Its purpose is to moderate the influence of leaves with only a few instances.
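A sketch of a common form of the Laplace correction for a binary-class leaf, replacing the raw frequency n / (n + m) with (n + 1) / (n + m + 2); the counts below are made up.

```python
def laplace_estimate(n, m):
    """n = instances of the class in the leaf, m = instances of the other class."""
    # Common binary-class form of the Laplace correction (add-one smoothing).
    return (n + 1) / (n + m + 2)

print(2 / (2 + 0))             # raw estimate: 1.0 from only 2 instances
print(laplace_estimate(2, 0))  # corrected: 0.75, far less overconfident
print(laplace_estimate(20, 0)) # with more data the correction matters less: ~0.95
```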
How do we identify informative attributes?
We measure an attribute's informativeness using information gain, which is based on a purity measure called entropy; for a numeric target, the analogous measure is variance reduction.
How do we segment data by progressive attribute selection?
We use the tree induction technique.
What is tree induction technique?
Tree induction recursively finds informative attributes for subsets of the data. In so doing it segments the space of instances into similar regions. The partitioning is “supervised” in that it tries to find segments that give increasingly precise information about the quantity to be predicted, the target. The resulting tree-structured model partitions the space of all possible instances into a set of segments with different predicted values for the target.
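A minimal tree-induction sketch, assuming scikit-learn is available; the tiny dataset and feature names below are made up for illustration, not taken from the course material.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income]; target: write-off (1) vs. no write-off (0)
X = [[23, 20000], [45, 50000], [35, 30000], [52, 80000], [29, 25000], [60, 90000]]
y = [1, 0, 1, 0, 1, 0]

# criterion="entropy" means splits are chosen by information gain;
# each internal node recursively segments the instance space.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)
tree.fit(X, y)

# Print the induced segmentation as a set of tree rules.
print(export_text(tree, feature_names=["age", "income"]))
```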