Communicating Expectations to the Business Flashcards
You are a data scientist evaluating a set of historical data for possible use in model training. What quality would you be looking for in the data?
That it holds a minimum of 20,000 rows of data
That it contains data representative of the target population
That it contains personally identifiable information
That it is properly formatted as an OLAP cube
That it contains data representative of the target population
You need data input on a project. When is the best time to bring in a subject matter expert (SME)?
At the end of the project
Only when problems arise
When verification is needed
At the beginning of the project
At the beginning of the project
What kind of barrier does personally identifiable data present to data use?
Formatting
Speed of query
Legal
Licensing
Legal
You are working on a project that will generate insights into the potential popularity of compact cars. Your team has licensed a data set of automotive sales for training a model. Upon further inspection of the data, you find that it is made up of diesel truck sales. What is your next step?
Run the data set through a synthetic data generation utility
Transform the existing data set with discovery-transitioning-utils
Alter the model the solution will be using
Source a different set of data closer to the target population
Source a different set of data closer to the target population
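A quick way to spot the mismatch described in this card is to check how much of the data set actually falls in the target category. The following is a minimal sketch only: the `segment` field, the record layout, and the sample rows are hypothetical and invented for illustration, not taken from any real licensed data set.

```python
from collections import Counter

def category_shares(records, key):
    """Return the fraction of records that fall in each category of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Hypothetical rows standing in for the licensed automotive sales data.
licensed_sales = [
    {"segment": "diesel truck"},
    {"segment": "diesel truck"},
    {"segment": "diesel truck"},
    {"segment": "diesel truck"},
]

shares = category_shares(licensed_sales, "segment")

# Share of the target population (compact cars) present in the data.
target_share = shares.get("compact car", 0.0)
print(shares, target_share)
```

A target share at or near zero suggests the data does not represent the population the model is meant to serve, which is the situation where sourcing different data is the sensible next step.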
What should you do when facing a potential ethical barrier to using data in a solution?
Use EDA tools to identify alternatives
Consult a legal specialist
Locate another algorithm for the model
Find another data source
Consult a legal specialist
What kind of tool allows you to anonymize data but maintain character and richness?
Data mining tools
Machine learning models
Synthetic data utilities
Exploratory data analysis tools
Synthetic data utilities
Your team has reached the end of its exploratory data analysis and concluded that the data set will require significant cleaning and feature engineering. The funding for initial data analysis has been exhausted. What should you do?
Use a data set that is adjacent to the project’s domain
Ask the stakeholders for a go/no-go decision and budget for the next phase
Search for free public data in the domain of the project
Locate a new data source from a data broker and restart exploratory data analysis
Ask the stakeholders for a go/no-go decision and budget for the next phase
What is the minimum number of data sets typically needed for a data science/machine learning solution?
One for training, one for analysis
Two for training, two for analysis
Data set for analysis
Two for training, one for analysis
One for training, one for analysis
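When only one historical data set is available, a common way to obtain the two sets is to hold part of it back. The sketch below assumes that "analysis" here means data kept separate from training so the model can be checked against it; that reading, the 80/20 ratio, and the `split_rows` helper are all assumptions made for illustration.

```python
import random

def split_rows(rows, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (training, analysis) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

# Placeholder rows; in practice these would be records from the data set.
rows = list(range(100))
training, analysis = split_rows(rows)
print(len(training), len(analysis))
```

Keeping the two sets disjoint matters: a model checked only against rows it was trained on tends to look better than it really is.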
What does correlation point to in a data science/machine learning solution?
The cross validation of a data point from another data set
The causation of one data point by a preceding data point
The probability of causation of a data point by another data point
The relationship between two data points
The relationship between two data points
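The strength of such a relationship is often summarized with the Pearson correlation coefficient, which ranges from -1 to 1. The following is a minimal sketch using only the standard library; the engine size and fuel consumption figures are hypothetical values invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from each mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements, for illustration only.
engine_litres = [1.0, 1.4, 1.6, 2.0, 2.5]
fuel_l_per_100km = [4.8, 5.5, 6.1, 7.0, 8.4]

r = pearson(engine_litres, fuel_l_per_100km)
print(round(r, 3))
```

A coefficient close to 1 or -1 indicates a strong linear relationship, but on its own it says nothing about whether one quantity causes the other.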
What method do exploratory data analysis tools primarily use?
XML output
Visualization
Encoded data
CSV
Visualization
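To give a feel for what visualization adds during exploratory data analysis, here is a minimal sketch of a text histogram built with the standard library only. Real EDA tools draw far richer charts; the `text_histogram` helper and the price values are hypothetical and exist only for illustration.

```python
def text_histogram(values, bin_width):
    """Return one line per non-empty bin: its lower edge and a bar of '#' marks."""
    counts = {}
    for value in values:
        lower = (value // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    # Bins with no values are skipped rather than drawn as empty bars.
    return [f"{lower:>6} | {'#' * counts[lower]}" for lower in sorted(counts)]

# Hypothetical values, for illustration only.
prices = [12, 15, 18, 21, 22, 24, 25, 27, 31, 58]
lines = text_histogram(prices, bin_width=10)
print("\n".join(lines))
```

Even in this crude form, the cluster in the 20s and the isolated value in the 50s are easier to see than in the raw list, which is the kind of pattern visual exploration is meant to surface.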