M2 U1 - Distilling the Analytic Objective - Q1 Flashcards
At what point do you begin to define analytic objectives?
Once you have set the business goals.
Who is responsible for framing the project’s analytic objective?
The client and the data science team collaborate on all aspects of framing the analytic goal, with each side helping the other understand its business-related and technical components.
The problem (statement?) underlying an analytic objective should satisfy the following criteria:
- Solving the problem must be part of a possible solution vision towards the business objective
- Data can facilitate that solution, and that data either exists or is feasible to gather.
- The problem must be specific and realistic and add enough value if executed properly.
What should be done before the team commits to fulfilling any analytical expectations on the part of the client?
Data science projects and consultations involve a preliminary data survey that informs, or even precedes, longer substantive discussion. This survey checks for:
- Readiness of the organization
- Technical objections relating to the data.
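A preliminary data survey can start very small. The sketch below is only an illustration, not a prescribed procedure: the `survey` function, the field names, and the sample records are all hypothetical. It reports the row count and the share of missing values per field, which is one way a technical objection relating to the data might surface early.

```python
def survey(rows):
    """Summarize a list of dict records: row count and share of missing values per field.

    Hypothetical helper for illustration; a value counts as missing when the
    field is absent, None, or an empty string.
    """
    if not rows:
        return {"n_rows": 0, "missing_share": {}}
    # Collect every field name that appears in at least one record.
    fields = sorted({key for row in rows for key in row})
    missing = {
        f: sum(1 for row in rows if row.get(f) in (None, "")) / len(rows)
        for f in fields
    }
    return {"n_rows": len(rows), "missing_share": missing}


# Made-up sample records, assumed only for this example.
records = [
    {"customer_id": 1, "age": 34, "review": "great"},
    {"customer_id": 2, "age": None, "review": ""},
    {"customer_id": 3, "review": "ok"},
    {"customer_id": 4, "age": 51, "review": "fine"},
]
report = survey(records)
```

A field with a high missing share (here `age`) is the kind of finding worth raising before committing to any analytical expectations.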
What is the main purpose of the problem statement in the analytic objective?
It focuses on specific steps necessary to achieve the business objective that can benefit from data-driven methods in the project.
Describe a common variant of classification
Sequence labeling, where individual data points are not independent but form a series. (Typical examples of classification in general: sentiment analysis of text, labeling images as depicting certain objects such as animals or cars.)
What are the ways to characterize a data science task? (7)
- Classification: Individual instances in the dataset have a categorical (i.e. non-numerical) label associated with them, or will be labeled as such. The goal of the project is to develop a system capable of categorizing data points using these labels. A common variant of classification is sequence labeling, where individual data points are not independent but form a series. Typical examples: sentiment analysis of text, labeling images as depicting certain objects (animals, cars, etc.).
- Regression: Individual instances in the dataset have a numerical label associated with them whose magnitude carries a specific meaning. The goal of the project is to develop a model capable of predicting this target score for individual data instances. Like classification, regression is often applied over dependent series of data points. Typical examples: predicting measurements in medical or demographic data over time
- Retrieval & Ranking: The dataset can be thought of as one or more collections of data points and queries. In response to the query, one or more “ideal” data points should be retrieved and presented in a “correct” order. Typical example: search engines
- Recommendation: The dataset consists of users and items, as well as information about preferences of users for specific items. The task is to find items to recommend to users towards the maximization of some utility. Typical examples: movie and product recommendations
- Clustering: The dataset is assumed to have some latent structure which should be discovered by dividing data points into groups that are close to each other in variants of the feature space. Typical examples: exploratory analyses of unlabeled data
- Anomaly Detection: The dataset is assumed to have some latent structure which should be discovered in order to identify instances that do not adhere to the pattern. Typical example: fake customer review detection in online retail data
- Domain-specific tasks: Aside from the generic task types explained above, some domains have developed specific task patterns and associated evaluation metrics that should be used if the analytic goal is sufficiently specific. This is particularly true of natural language processing and image analysis. For example, machine translation will commonly be characterized as a text generation task and models will be evaluated using a specialized BLEU score. Similarly, models that segment a part of an image containing a specific object may be evaluated using average precision in conjunction with an “intersection over union” threshold.
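The "intersection over union" threshold mentioned in the last item can be sketched as follows. This is a minimal illustration that assumes axis-aligned bounding boxes given as (x_min, y_min, x_max, y_max) rather than pixel-level masks; the function names and the 0.5 default threshold are assumptions for the example, not something the notes specify.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(predicted, truth, threshold=0.5):
    """Count a prediction as correct when its IoU with the truth reaches the threshold."""
    return iou(predicted, truth) >= threshold
```

A metric such as average precision is then computed over predictions that have been marked correct or incorrect in this way.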
Which primary role does the task statement play in the analytic objective?
It characterizes the task for purposes of planning and the nature of the eventual evaluation.
The statement of methods fits what criteria? (3)
- It must be precise enough that the data science team understands it as a concise summary of the technical approach.
- At the same time, it should allow for testing different techniques around the main conceptual idea.
- The methods should be suitable to produce the target functionality/insight for the task given the available data.
The most general categories of methods one can identify in the methods statement typically include:
- Supervised learning methods involve learning to predict a target variable (typically through regression or classification) by training on “true” example data points whose target variable has been labeled manually or is available by other means.
- Unsupervised learning methods deal with finding patterns in unlabeled data without an explicit prediction target.
- Semi-supervised learning methods encompass hybrid methods that combine supervised and unsupervised learning in different ways.
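The difference between the first two categories can be illustrated with a toy example. This is a sketch under stated assumptions: the data, the label names, and both helper functions are hypothetical, and nearest-centroid classification and two-means clustering are chosen here only because they are short, not because the notes prescribe them.

```python
from statistics import mean

# Hypothetical toy data: one numeric feature per data point.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]  # known targets


def fit_centroids(xs, ys):
    """Supervised: use the labels to compute one centroid per class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))


def two_means(xs, iterations=10):
    """Unsupervised: split xs into two groups without using any labels.

    Assumes xs contains at least two distinct values.
    """
    lo, hi = min(xs), max(xs)  # initial cluster centers
    for _ in range(iterations):
        group_lo = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        group_hi = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(group_lo), mean(group_hi)
    return group_lo, group_hi


centroids = fit_centroids(points, labels)   # needs labels
groups = two_means(points)                  # needs only the points
```

The supervised function cannot run without `labels`, while `two_means` recovers a grouping from the points alone but has no names for the groups it finds.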
What’s a good beginning strategy when formulating (evaluating?) proposed methods?
Check your proposed methods against the target functionality or insight by conceptually thinking through their application and explicitly formulating the expected results.
The statement of methods is important because:
As with the problem and task statements, formulating it deepens your understanding of the business needs.
Unsupervised learning methods are best applied to
Tasks where the outcome is not known.
What are the most important criteria for evaluating data?
- The data is presumed to contain patterns that are informative for the analytic objective.
- The data allows one to successfully tackle the proposed task and make progress towards solving the problem in a data-driven way.
How does data collection and curation relate to the field of data science?
Data collection and curation is a complex sub-discipline of data science and is as important as data analysis.