Chapters 1, 2, 3 Flashcards
What is Data Science?
A set of fundamental principles that guide the extraction of knowledge from data.
Data science is not the same as data processing and engineering; they complement each other.
What is Data Mining?
The extraction of knowledge from data via technologies that incorporate the principles of data science.
Data mining techniques provide some of the clearest illustrations of the principles of data science.
What is data-driven decision making?
The practice of basing decisions on the analysis of data rather than purely on intuition.
Firms engage in DDD in varying degrees.
Firms that are data driven are more productive.
What are the two types of Data-Driven Decision making problems?
1) Decisions for which “discoveries” need to be made within data
2) Decisions that repeat, especially at massive scale, so that decision making can benefit from even small increases in decision-making accuracy based on data analysis.
What is Big Data?
Datasets that are too large for traditional data processing systems and therefore require new processing technologies.
Big data techniques are most frequently used for data processing in support of data mining techniques.
What is the fundamental principle of Data Science?
Data and the capability to extract useful knowledge from data should be regarded as key strategic assets.
Within a company, it is necessary to have a close connection between data scientists and business people.
What is overfitting?
Tailoring a model so closely to a particular dataset that, if you look hard enough, you will find something, but what you find may not be generalizable beyond the given data.
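A tiny sketch of the idea (all data hypothetical, pure stdlib): a "model" that simply memorizes random training labels is perfect on the data it has seen but no better than chance on new data.

```python
import random

random.seed(0)

# Hypothetical data: the labels are pure noise, so there is nothing
# generalizable to find.
train = [(random.random(), random.choice([0, 1])) for _ in range(200)]
test = [(random.random(), random.choice([0, 1])) for _ in range(200)]

# An extreme overfit: memorize every training example exactly.
memory = {x: y for x, y in train}

def predict(x):
    # Return the memorized label if x was seen before, otherwise guess class 0.
    return memory.get(x, 0)

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
# train_acc is perfect (1.0); test_acc stays near chance on unseen data.
```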
What is Classification?
- predicts for each individual (item) in a population, which of a (small) set of classes that individual (item) belongs to.
- List must be exhaustive and mutually exclusive.
- A related task is scoring and class probability estimation.
- The scoring task gives each individual a score representing the probability that that individual belongs to a given class.
- Example:
- “Among all the customers of MegaTelCo, which are likely to respond to a given offer?”
- In this example the two classes could be called will respond and will not respond.
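A minimal sketch of scoring and classification on the MegaTelCo example, using made-up data (segment names and outcomes are hypothetical):

```python
# Hypothetical response history: (customer_segment, responded) pairs.
history = [
    ("young_urban", True), ("young_urban", True), ("young_urban", False),
    ("suburban", True), ("suburban", False), ("suburban", False),
]

def response_score(segment):
    """Scoring: estimated probability that a customer in `segment` responds."""
    outcomes = [responded for s, responded in history if s == segment]
    return sum(outcomes) / len(outcomes)

def classify(segment, threshold=0.5):
    # The two classes are exhaustive and mutually exclusive.
    return "will respond" if response_score(segment) >= threshold else "will not respond"
```

Scoring produces the probability; classification turns it into one of the two classes via a threshold.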
What is Regression?
- Attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
- Example:
- “How much will a given customer use the service?”
- The property (variable) to be predicted here is service usage, and a model could be generated by looking at other, similar individuals in the population and their historical usage.
- Regression is related to classification, but the two are different. Informally, classification predicts *whether* something will happen, whereas regression predicts *how much* something will happen.
- Regression has a numerical target.
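One simple way to "look at other, similar individuals and their historical usage" is k-nearest-neighbour regression (all numbers hypothetical):

```python
# Hypothetical historical data: (customer_age, monthly_minutes_used).
history = [(25, 300.0), (28, 310.0), (30, 280.0), (45, 150.0), (50, 120.0)]

def predict_usage(age, k=3):
    """Predict the numerical target by averaging the usage of
    the k most similar (closest-in-age) customers."""
    nearest = sorted(history, key=lambda row: abs(row[0] - age))[:k]
    return sum(minutes for _, minutes in nearest) / k
```

Note the numerical target: the output is an amount of usage, not a class label.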
What is similarity matching?
- Attempts to identify similar individuals based on data known about them.
- Similarity matching can be used directly to find similar entities.
- Example:
- For example, IBM is interested in finding companies similar to their best business customers, in order to focus their sales force on the best opportunities.
- Making product recommendations based on people who are similar to you in terms of the products they have liked or purchased.
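A sketch of the IBM-style example with made-up feature vectors, using cosine similarity (one common similarity measure; the attributes and values are hypothetical):

```python
import math

# Hypothetical company profiles: (revenue, employees, growth) feature vectors.
companies = {
    "best_customer": [9.0, 8.0, 7.0],
    "prospect_a": [8.5, 7.5, 7.2],
    "prospect_b": [1.0, 2.0, 9.0],
}

def cosine_similarity(a, b):
    """Similarity between two attribute vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank prospects by similarity to the best customer.
target = companies["best_customer"]
ranked = sorted(
    (name for name in companies if name != "best_customer"),
    key=lambda name: cosine_similarity(companies[name], target),
    reverse=True,
)
```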
What is Clustering?
- Attempts to group individuals in a population by their similarity, but not driven by any specific purpose.
- Why is it useful? Useful in preliminary domain exploration to see which natural groups exist.
- Example:
- “Do our customers form natural groups or segments?”
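A minimal 1-D k-means sketch (one common clustering algorithm; the spend values and starting centers are hypothetical):

```python
# Hypothetical monthly customer spend values.
spend = [10.0, 12.0, 11.0, 95.0, 98.0, 102.0]

def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = {i: [] for i in range(len(centers))}
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - v))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else centers[i] for i, g in groups.items()]
    return sorted(centers)

centers = kmeans_1d(spend, centers=[0.0, 50.0])
# Two natural segments emerge: low spenders and high spenders.
```

Note that no target variable was given: the groups emerge purely from similarity.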
What is Co-occurrence grouping?
- Attempts to find associations between entities based on transactions involving them.
- Example:
- What items are commonly purchased together in a supermarket?
- Analyzing purchase records from a supermarket may uncover that ground meat is purchased together with hot sauce much more frequently than we might expect.
- While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
- In that sense co-occurrence grouping is more specific: clustering looks at general similarity, whereas co-occurrence grouping focuses on objects that appear together in transactions.
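The supermarket example can be sketched by counting item pairs per transaction (baskets are hypothetical):

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket transactions (one set of items per basket).
baskets = [
    {"ground meat", "hot sauce", "milk"},
    {"ground meat", "hot sauce"},
    {"milk", "bread"},
    {"ground meat", "hot sauce", "bread"},
]

# Count how often each unordered pair of items appears in the same transaction.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
# ground meat and hot sauce co-occur in 3 of 4 baskets.
```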
What is profiling?
- Attempts to characterise the typical behaviour of an individual, group or population.
- Example:
- “What is the typical cell phone usage of this customer segment?” (note that this refers to a customer segment, so it differs from the regression question about a single customer)
- Behavior may not have a simple description; profiling cell phone usage might require a complex description of night and weekend airtime averages, international usage, roaming charges, text minutes, and so on.
- Often used to establish behavioral norms (baseline) for anomaly detection applications such as fraud detection and monitoring intrusions
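A very simple profile-as-baseline sketch for anomaly detection (usage numbers hypothetical; real profiles would be far richer, as noted above):

```python
import statistics

# Hypothetical nightly call minutes for one customer segment.
usage = [30, 32, 28, 31, 29, 33, 30, 27]

# Profile: characterize typical behaviour with a simple baseline.
baseline_mean = statistics.mean(usage)
baseline_stdev = statistics.stdev(usage)

def is_anomalous(minutes, z_threshold=3.0):
    """Flag behaviour far from the behavioural norm, e.g. for fraud detection."""
    return abs(minutes - baseline_mean) / baseline_stdev > z_threshold
```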
What is link prediction?
- Attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly estimating the strength of the link.
- Example:
- Often used in social networking systems
- “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?”
- Link prediction can also estimate the strength of a link. For example, for recommending movies to customers one can think of a graph between customers and the movies they’ve watched or rated. Within the graph, we search for links that do not exist between customers and movies, but that we predict should exist and should be strong. These links form the basis for recommendations.
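The "shared friends" example can be sketched with a common-neighbours score (names and the graph are hypothetical):

```python
# Hypothetical social graph as adjacency sets.
friends = {
    "you":   {"alice", "bob", "carol"},
    "karen": {"alice", "bob", "dave"},
    "eve":   {"dave"},
}

def common_neighbors_score(a, b):
    """A simple link-prediction score: the more friends two people share,
    the stronger the predicted (currently missing) link between them."""
    return len(friends[a] & friends[b])

# "you" and "karen" share alice and bob, so a link is suggested;
# "you" and "eve" share no one, so no link is suggested.
```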
What is Data Reduction?
- Attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information of the large set.
- Example
- For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences).
- Data reduction usually involves a loss of information; what matters is the trade-off for improved insight.
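A toy sketch of the movie-viewing example: reduce a per-movie viewing log to per-viewer genre preferences (viewers, titles, and genres are made up; real data reduction would use techniques such as latent-factor models):

```python
from collections import Counter

# Hypothetical viewing log: one (viewer, movie, genre) row per viewing event.
views = [
    ("u1", "Alien", "sci-fi"), ("u1", "Blade Runner", "sci-fi"),
    ("u1", "Amelie", "romance"),
    ("u2", "Titanic", "romance"), ("u2", "Notting Hill", "romance"),
]

# Reduce the large log to a small table of genre preferences.
# Smaller and often more revealing, but the individual titles are lost.
profiles = {}
for viewer, _movie, genre in views:
    profiles.setdefault(viewer, Counter())[genre] += 1

favourite_genre = {v: counts.most_common(1)[0][0] for v, counts in profiles.items()}
```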
What is Causal Modelling?
- Attempts to help us understand which events or actions actually influence others.
- Can be done using both experimental and observational methods.
- These methods perform “counterfactual analysis”: they attempt to understand the difference between two situations, which cannot both happen, one in which the “treatment” event (e.g., showing an advertisement to a particular individual) occurs and one in which it does not.
What are unsupervised methods?
- The data mining task has no specific target or purpose.
- Clustering, co-occurrence grouping, and profiling are unsupervised methods.
- The risk is that the method forms groups that are not meaningful.
What are supervised methods?
- The data mining task has a specific purpose or target; hence it is necessary to have data on the target.
- Supervised tasks require different techniques than unsupervised tasks, and the results are often more useful.
- Examples: Classification, regression, and causal modeling are solved with supervised methods.
What type of overarching method are similarity matching, link prediction, and data reduction?
They could be either supervised or unsupervised.
What is the CRISP-DM framework?
Cross Industry Standard Process for Data Mining → one codification of the data mining process
The process diagram makes explicit the fact that iteration is the rule rather than the exception; going through the process once without having solved the problem is, generally speaking, not a failure.
How do the different parts of the CRISP-DM framework relate to each other?
Business Understanding and Data Understanding link to each other.
Data Understanding → Data Preparation.
Data Preparation and Modelling link to each other.
Modelling → Evaluation.
Evaluation → Deployment, and Evaluation also links back to Business Understanding.
What are the different parts of the CRISP-DM framework and how are they defined?
- **Business Understanding**
    - The design team should think about the use scenario.
    - What exactly do we want to do?
    - Start off with a simplified use scenario.
- **Data Understanding**
    - Available data rarely matches the problem at hand.
    - Understand the strengths and limitations of the data.
    - The cost of data varies: perform a cost/benefit analysis of acquiring additional data.
    - Clean the data for subsequent analysis.
- **Data Preparation**
    - Convert data into a usable format.
    - Data are manipulated and converted into forms that yield better results.
    - Leakage must be considered → a situation in which a variable collected in the historical data gives information about the target variable, but that information is not available when the decision actually needs to be made.
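A hypothetical illustration of leakage (all field names and values made up):

```python
# Sketch of leakage in hypothetical churn data: "account_closed_date" is only
# recorded *after* a customer has churned, so it is a perfect "predictor" in
# historical data but unavailable at the time the decision must be made.
rows = [
    {"usage": 120, "account_closed_date": None,         "churned": False},
    {"usage": 10,  "account_closed_date": "2024-03-01", "churned": True},
]

def leaky_predict(row):
    # Looks flawless on historical data; useless in production,
    # where the field is always empty at decision time.
    return row["account_closed_date"] is not None
```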
- **Modelling**
    - The primary stage where data mining techniques are applied to the data.
    - Understand the techniques and algorithms that can be used.
- **Evaluation**
    - Assess the data mining results and gain confidence that they are valid and reliable before moving on.
    - Ensures that the model satisfies the original business goal.
    - Goal: show that the detected patterns are truly regularities, not chance occurrences.
    - The assessment is both qualitative and quantitative, using a comprehensive evaluation framework.
    - The model needs to be comprehensible to other stakeholders (non-data scientists).
    - Evaluation may be extended into the development environment.
- **Deployment**
    - Putting the results of data mining into real use in order to realize some return on investment (ROI).
    - Use case: implementing a predictive model in a business process.
    - Increasingly, the data mining techniques themselves are deployed, because:
        - the world changes faster than the data science team can adapt the model, or
        - a business has too many modelling tasks to manually curate each model,
        - so systems automatically build models for the associated process.
    - Deployment typically requires that the model be recoded for the production environment, e.g. to accommodate greater speed or compatibility.
    - After mining the data (successfully or not), the process often returns to the initial business problem.