Big Data Analytics Management Flashcards
What is the definition of Data Science?
Data science is a set of fundamental principles that guide the extraction of knowledge from data.
What is the definition of Data Mining?
Data mining is the extraction of knowledge from data, via technologies that incorporate these principles.
What is the definition of Data-Driven Decision-Making?
Data-Driven Decision-Making refers to the practice of basing decisions on the analysis of data, rather than purely intuition.
Tasks in data mining:
- Classification and class probability estimation
- Regression (“Value estimation”)
- Similarity matching
- Clustering
- Co-occurrence grouping (market-basket analysis)
- Profiling
- Link prediction
- Data reduction
- Causal modeling
Describe classification and class probability estimation task
It attempts to predict, for each individual in a population, which of a set of classes this individual belongs to.
- Classification gives a definitive output: will respond / will not respond.
- Class probability estimation gives, for each individual, a score representing the probability that the individual belongs to each class.
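A minimal sketch of the distinction in scikit-learn, using invented churn-style data (the feature names and values are assumptions for illustration):

```python
# Minimal sketch: classification vs. class probability estimation
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [monthly_minutes, support_calls] -> churned (1) or not (0)
X_train = [[120, 0], [450, 5], [300, 1], [80, 4], [500, 0], [60, 6]]
y_train = [0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X_train, y_train)

new_customer = [[200, 3]]
print(model.predict(new_customer))        # classification: a definitive class, e.g. [1]
print(model.predict_proba(new_customer))  # class probability estimation: e.g. [[0.45, 0.55]]
```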
Describe regression task
Regression attempts to predict, for each individual, the numerical value of some variable for that individual. Example: “How much will a given customer use a service?”
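A minimal regression sketch in scikit-learn; the usage numbers are invented for illustration:

```python
# Minimal sketch: predicting HOW MUCH a customer will use a service
from sklearn.linear_model import LinearRegression

# Hypothetical data: [tenure_months, num_devices] -> monthly usage in GB
X_train = [[3, 1], [12, 2], [24, 3], [36, 2], [48, 4]]
y_train = [5.0, 12.5, 30.0, 25.0, 55.0]

reg = LinearRegression().fit(X_train, y_train)
print(reg.predict([[18, 2]]))  # a numerical estimate of usage, not a class label
```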
Regression vs. Classification?
Classification predicts WHETHER something will happen, whereas regression predicts HOW MUCH something will happen.
Describe similarity matching task
Similarity matching attempts to IDENTIFY SIMILAR individuals based on data known about them. Example: finding companies that are similar to the ones you already serve.
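Similarity is usually operationalized as a distance or similarity measure over feature vectors; a minimal sketch with invented company attributes:

```python
# Minimal sketch: find the known company most similar to a reference company
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical attributes: [annual_revenue_musd, employees, it_spend_musd]
# (in practice the features would be scaled before computing distances)
reference = np.array([[50, 200, 2.0]])      # a company we already serve well
candidates = np.array([
    [48, 180, 1.8],     # company B
    [500, 3000, 20.0],  # company C
])

dist = euclidean_distances(reference, candidates)
print(dist)  # company B is much closer, i.e. more similar, than company C
```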
Describe clustering task
Clustering attempts to GROUP individuals in a population together by their similarity, but not driven by any specific purpose. Example: “Do our customers form natural groups or segments?”
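A minimal k-means sketch on invented customer data (the number of clusters and the features are illustrative assumptions):

```python
# Minimal sketch: grouping customers by similarity without a predefined target
from sklearn.cluster import KMeans

# Hypothetical features: [age, monthly_spend]
customers = [[25, 40], [27, 35], [45, 300], [50, 280], [23, 45], [48, 310]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # e.g. [0 0 1 1 0 1] -- two natural segments emerge
```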
Describe co-occurrence grouping task
It attempts to find ASSOCIATIONS between entities based on transactions involving them. Example: “What items are commonly purchased together?”
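A minimal market-basket sketch that simply counts how often item pairs appear together in transactions (full association-rule mining would also compute support, confidence, and lift):

```python
# Minimal sketch: counting which item pairs co-occur in the same transactions
from collections import Counter
from itertools import combinations

transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "bread"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))  # ('beer', 'chips') co-occur most often
```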
Clustering vs. co-occurrence?
While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
Describe profiling task
Profiling attempts to characterize the typical behavior of an individual, group, or population. Example: “What is the typical cell phone usage of this customer segment?”
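A small profiling sketch with pandas, summarizing typical usage per customer segment (segments and numbers are invented):

```python
# Minimal sketch: characterizing typical cell-phone usage per customer segment
import pandas as pd

usage = pd.DataFrame({
    "segment": ["student", "student", "business", "business", "business"],
    "monthly_minutes": [310, 280, 900, 1100, 950],
    "monthly_data_gb": [8.0, 9.5, 3.0, 2.5, 4.0],
})

# The mean/std per segment is a simple profile of "typical" behavior;
# individuals far from their segment's profile are candidates for anomaly or fraud review.
print(usage.groupby("segment").agg(["mean", "std"]))
```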
Describe link prediction task
Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link. Example: “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?”
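A minimal common-neighbors sketch for link prediction on an invented friendship graph; counting shared friends is just one simple scoring heuristic:

```python
# Minimal sketch: score a potential friendship link by shared friends
friends = {
    "you":   {"ann", "bob", "carl", "dina"},
    "karen": {"ann", "bob", "carl", "eve"},
    "steve": {"eve"},
}

def common_neighbors(a, b):
    """Number of shared neighbors -- a simple link-prediction score."""
    return len(friends[a] & friends[b])

print(common_neighbors("you", "karen"))  # 3 shared friends -> strong candidate link
print(common_neighbors("you", "steve"))  # 0 shared friends -> weak candidate
```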
Describe data reduction task
Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences).
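One common way to do this is a low-rank factorization of the large matrix; a minimal sketch with truncated SVD from scikit-learn on an invented viewer-by-movie matrix:

```python
# Minimal sketch: reducing a viewer-by-movie matrix to a few latent "taste" factors
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical viewing matrix: rows = viewers, columns = movies (1 = watched)
views = np.array([
    [1, 1, 1, 0, 0, 0],   # mostly action titles
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],   # mostly romance titles
    [0, 0, 1, 1, 1, 1],
])

svd = TruncatedSVD(n_components=2, random_state=0)
taste_factors = svd.fit_transform(views)   # 4 viewers x 2 latent preferences
print(taste_factors.shape)                 # (4, 2): much smaller, keeps the main structure
```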
Describe causal modeling task
Causal modeling attempts to help us understand what events or actions actually influence others. Example: after targeting advertisements at consumers a predictive model selected as likely buyers: “Did the consumers purchase because the advertisements influenced them to? Or did the predictive models simply do a good job of identifying those consumers who would have purchased anyway?” A business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions.
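The cleanest way to answer such a question is a randomized experiment: hold out a random control group that does not see the advertisement and compare purchase rates. A tiny sketch of that comparison, with invented counts:

```python
# Minimal sketch: randomized experiment to separate ad influence from model selection
# (all counts below are invented for illustration)
targeted_buyers, targeted_total = 120, 1000   # saw the advertisement
control_buyers, control_total = 80, 1000      # randomly held out, no advertisement

lift = targeted_buyers / targeted_total - control_buyers / control_total
print(f"Estimated causal effect of the ad: {lift:.1%} extra purchase rate")
```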
Conditions for supervised learning:
- The problem must have a specific target defined;
- There must be data on that target.
Define label
The value for the target variable for an individual.
Supervised vs. unsupervised tasks
Supervised:
- Classification;
- Regression;
- Causal modeling.
Unsupervised:
- Clustering;
- Co-occurrence grouping;
- Profiling.
Both:
- Similarity matching;
- Link prediction;
- Data reduction.
Second stage of CRISP process - Data Understanding
- The critical part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is merited.
- We need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply.
Third stage of CRISP process - Data Preparation
The data preparation phase often proceeds along with data understanding; in it, the data are manipulated and converted into forms that yield better results.
Define data leak
A data leak is a situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made.
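A short illustrative sketch of a leak with invented column names: a variable recorded only after (or because of) the churn event makes the model look unrealistically accurate but is useless at decision time:

```python
# Minimal sketch of a data leak (column names and data are invented)
import pandas as pd

history = pd.DataFrame({
    "monthly_minutes":  [300, 120, 450, 90],
    "called_to_cancel": [0,   1,   0,   1],   # only known AFTER the churn decision
    "churned":          [0,   1,   0,   1],   # target
})

# Training on 'called_to_cancel' gives near-perfect "predictions" of churn,
# but the variable is not available when the retention decision must be made,
# so it has to be dropped from the modeling data.
X = history.drop(columns=["churned", "called_to_cancel"])
y = history["churned"]
print(X.columns.tolist())  # ['monthly_minutes']
```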
Fifth stage of CRISP process - Evaluation
The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on.
Sixth stage of CRISP process - Deployment
In deployment the results of data mining—and increasingly the data mining techniques themselves—are put into real use in order to realize some return on investment.
Data Mining vs. Software Development
Data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem.