Introduction to Data Science Challenges Flashcards
Finding data
• There may be hundreds or thousands of tables.
• There may be many different entities that are less relevant.
Transforming data
• Reorganizing data, filtering, etc.
• Extracting relevant features.
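The transformation steps on this card can be illustrated with a minimal sketch. The `orders` records, their field names, and the `transform` helper are hypothetical and chosen only for illustration; real pipelines would typically rely on a database or a dataframe library.

```python
from datetime import date

# Hypothetical raw records; the field names are illustrative only.
orders = [
    {"customer": "A", "amount": 120.0, "day": date(2024, 1, 5), "status": "paid"},
    {"customer": "B", "amount": 35.5, "day": date(2024, 1, 6), "status": "cancelled"},
    {"customer": "A", "amount": 80.0, "day": date(2024, 1, 7), "status": "paid"},
]

def transform(records):
    """Filter out irrelevant rows, then reorganize them into per-customer features."""
    # Filtering: keep only the rows relevant to the analysis.
    paid = [r for r in records if r["status"] == "paid"]
    # Reorganizing: go from one row per order to one row per customer.
    features = {}
    for r in paid:
        f = features.setdefault(r["customer"], {"n_orders": 0, "total": 0.0})
        f["n_orders"] += 1
        f["total"] += r["amount"]
    # Feature extraction: derive a new value from the aggregated ones.
    for f in features.values():
        f["mean_amount"] = f["total"] / f["n_orders"]
    return features

customer_features = transform(orders)
```

Here customer B disappears entirely because its only order was cancelled, which shows that filtering choices directly shape what later analysis can see.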
Dealing with data
• Dealing with big data
• Dealing with streaming data
Data quality
• Data may be incomplete, invalid, inconsistent, imprecise, and/or outdated.
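As a rough illustration, three of these quality problems (incomplete, invalid, outdated) can be flagged with simple rule-based checks. The record schema, the plausible age range, and the `max_age_days` threshold below are assumptions made for this sketch, not part of the original card.

```python
from datetime import date

def quality_issues(record, today, max_age_days=365):
    """Return a list of issue labels for one record (hypothetical schema)."""
    issues = []
    # Incomplete: at least one field has no value.
    if any(v is None for v in record.values()):
        issues.append("incomplete")
    # Invalid: the value lies outside an assumed plausible range.
    age = record.get("age")
    if age is not None and not (0 <= age <= 130):
        issues.append("invalid")
    # Outdated: the record was last updated too long ago.
    updated = record.get("updated")
    if updated is not None and (today - updated).days > max_age_days:
        issues.append("outdated")
    return issues

today = date(2024, 6, 1)
records = [
    {"name": "Ann", "age": 34, "updated": date(2024, 5, 1)},
    {"name": "Bob", "age": None, "updated": date(2024, 5, 1)},
    {"name": "Cy", "age": -3, "updated": date(2021, 1, 1)},
]
report = {r["name"]: quality_issues(r, today) for r in records}
```

Inconsistent and imprecise data usually need domain-specific rules (for example, cross-checking two tables), so they are left out of this sketch.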
Fitting the data
• Overfitting / underfitting
• Dealing with concept drift. Common options:
− Do Nothing (Static Model). The most common way is to not handle drift at all and assume that the data does not change. …
− Periodically Re-Fit. …
− Periodically Update. …
− Weight Data. …
− Learn The Change. …
− Detect and Choose Model. …
− Data Preparation.
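To make the difference between the first two options concrete, here is a minimal sketch that compares a static model with one that is periodically re-fit on a sliding window. The "model" is just a mean predictor, and the synthetic stream, the window size, and the re-fit interval are all assumptions chosen for illustration.

```python
def static_model(train):
    """Fit once: always predict the training mean (the 'do nothing' option)."""
    mean = sum(train) / len(train)
    return lambda: mean

def refit_errors(stream, window, refit_every):
    """Periodically re-fit a mean predictor on the most recent `window` values."""
    errors = []
    mean = sum(stream[:window]) / window
    for i in range(window, len(stream)):
        if (i - window) % refit_every == 0:
            recent = stream[i - window:i]
            mean = sum(recent) / len(recent)
        errors.append(abs(stream[i] - mean))
    return errors

# Synthetic stream: the level jumps from 10 to 20 halfway through (the drift).
stream = [10.0] * 20 + [20.0] * 20
window = 5

predict = static_model(stream[:window])
static_errors = [abs(x - predict()) for x in stream[window:]]
refit = refit_errors(stream, window, refit_every=5)

static_mae = sum(static_errors) / len(static_errors)
refit_mae = sum(refit) / len(refit)
```

On this stream the static model stays wrong after the jump, while the re-fit model is wrong only until its next scheduled re-fit, so `refit_mae` comes out lower than `static_mae`. The re-fit interval trades off adaptation speed against training cost.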
Making results actionable
• Analysis results need to be relevant, specific, novel and clear.
Ensuring fairness
Data science without prejudice: how to avoid unfair conclusions even if they are true?
Ensuring accuracy
Data science without guesswork: how to answer questions with a guaranteed level of accuracy?
Ensuring confidentiality
Data science that ensures confidentiality: how to answer questions without revealing secrets?
Ensuring transparency
Data science that provides transparency: how to clarify answers such that they become indisputable?
Ill-posed problems
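• A problem is well-posed if
− a solution exists and
− the solution is unique.
(Hadamard's full definition adds a third condition: the solution depends continuously on the data.)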
• Problems in data science are often ill-posed:
− there may be many possible models explaining the observed phenomena,
− the (training) data set is just a sample,
− there may be noise (exceptional or incorrectly recorded instances) in the data set, and
− the result needs to generalize to have predictive or explanatory value.
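The non-uniqueness point can be shown with a small sketch: two different models fit the same training sample perfectly yet disagree on an unseen input. The sample and both models are made up for this illustration.

```python
# A small training sample, not the full population.
train = [(0, 0), (1, 1), (2, 2)]

def model_a(x):
    """Simple explanation: y = x."""
    return x

def model_b(x):
    """Adds a term that vanishes at x = 0, 1, 2, so it also fits the sample."""
    return x + x * (x - 1) * (x - 2)

def training_error(model):
    """Sum of squared errors on the training sample."""
    return sum((model(x) - y) ** 2 for x, y in train)

errors = (training_error(model_a), training_error(model_b))
predictions_at_3 = (model_a(3), model_b(3))
```

Both training errors are zero, but the predictions at `x = 3` are 3 and 9. The sample alone cannot decide between the two models, which is why extra assumptions (such as preferring simpler models) or held-out data are needed for generalization.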