Data Analysis Pipeline Flashcards
What are the 5 steps of a data analysis pipeline?
- Figure out the question.
- Find/acquire relevant data.
- Clean & prepare the data.
- Analyze the data.
- Interpret & present results.
It’s not unusual to spend most of your time _______ ______ when processing data.
cleaning data
What are some different ways to get data for the question you're attempting to answer?
- Files (CSV, Excel, XML, etc)
- API
- DB
- A sensor
When first retrieving data, it may have some inconsistencies that need to be cleaned such as:
- Irrelevant things
- Different ____ for similar values
- Different _______ in files
- _____ wrong
units
formats
shaped
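As a rough illustration of the "different units for similar values" case, the sketch below converts mixed temperature readings to Celsius. The record layout (`value`, `unit`), the sample readings, and the helper name `to_celsius` are assumptions made up for this example.

```python
def to_celsius(value, unit):
    """Convert a temperature reading to Celsius (hypothetical helper)."""
    unit = unit.strip().upper()  # tolerate " f ", "F", "c", ...
    if unit == "C":
        return float(value)
    if unit == "F":
        return (float(value) - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown unit: {unit!r}")


# Made-up readings recorded in mixed units and inconsistent spellings.
readings = [
    {"value": "20", "unit": "C"},
    {"value": "68", "unit": "f"},
    {"value": "212", "unit": " F "},
]

# After cleaning, every value is a float in the same unit.
cleaned = [to_celsius(r["value"], r["unit"]) for r in readings]
```

Raising on an unknown unit, rather than guessing, keeps bad rows visible instead of silently mixing them into the analysis.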
When first retrieving data, it may contain values that are simply wrong or unusable. What are some kinds of problem values, and what causes them?
- Missing values (failed sensor, incomplete collection, etc)
- Outliers (data entry errors, etc)
- Noise
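One possible way to separate missing values and likely outliers from usable readings is sketched below. It uses a modified z-score based on the median absolute deviation (MAD), which is only one of several outlier rules; the function name `split_readings`, the cutoff of 3.5, and the sample data are assumptions for illustration.

```python
import statistics


def split_readings(values, threshold=3.5):
    """Split readings into usable values, likely outliers, and a missing count.

    Missing readings are represented as None (an assumption of this sketch).
    """
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    mad = statistics.median(abs(v - median) for v in present)

    usable, outliers = [], []
    for v in present:
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        score = 0.6745 * abs(v - median) / mad if mad else 0.0
        (outliers if score > threshold else usable).append(v)

    missing = len(values) - len(present)
    return usable, outliers, missing


# Made-up sensor data: two failed readings and one data entry error.
raw = [10.1, 9.8, None, 10.3, 98.0, 10.0, None, 9.9]
usable, outliers, missing = split_readings(raw)
```

A median-based rule is used here because a single extreme value can distort the mean and standard deviation enough to hide itself.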
The full data analysis pipeline is often broken into steps. What are some reasons for this?
- Don’t want to spam an API every time we process data
- Test runs might take too long
- Intermediate results might be meaningful
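A minimal sketch of splitting "acquire" from "analyze" is shown below: the raw data is saved to a file on the first run, and later runs read the file instead of hitting the API again. The names `load_or_fetch` and `fake_fetch` and the JSON cache file are hypothetical; a real pipeline would substitute its own request code.

```python
import json
import os
import tempfile


def load_or_fetch(cache_path, fetch):
    """Return cached data if the file exists; otherwise fetch and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    data = fetch()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


calls = []


def fake_fetch():
    # Stand-in for a slow or rate-limited API request (hypothetical).
    calls.append(1)
    return {"temps": [20.0, 21.5]}


cache_path = os.path.join(tempfile.mkdtemp(), "raw.json")
first = load_or_fetch(cache_path, fake_fetch)   # calls fake_fetch
second = load_or_fetch(cache_path, fake_fetch)  # reads the cached file
```

The saved file is also an intermediate result in its own right: it can be inspected or reused by a later step without rerunning the earlier one.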
The full data pipeline is not always obvious. In the end you may need to run ______ programs. You should always _______ your code so you know how things should be done.
multiple, document
What are some reasons you might have manual steps in your pipeline?
- Easier to do by hand than automating
- Most cases can be automatically determined, but some outliers could be left for manual intervention
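The second reason above can be sketched as follows: values that match a known pattern are handled automatically, and anything else is set aside for a person to look at. The two date formats, the function name `parse_or_flag`, and the sample inputs are assumptions for this example.

```python
from datetime import datetime

# Formats the automatic step knows how to handle (assumed for this example).
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_or_flag(raw_dates):
    """Parse dates in known formats; collect the rest for manual review."""
    parsed, needs_review = [], []
    for raw in raw_dates:
        for fmt in KNOWN_FORMATS:
            try:
                parsed.append(datetime.strptime(raw, fmt).date())
                break
            except ValueError:
                continue
        else:
            # No known format matched: leave it for a human to resolve.
            needs_review.append(raw)
    return parsed, needs_review


parsed, needs_review = parse_or_flag(["2021-03-04", "04/03/2021", "March 4th"])
```

Writing `needs_review` out to a file gives the manual step a clear, documented input instead of an ad hoc fix buried in the code.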