A Crash Course in Data Science Flashcards
What is Data Science?
Data science is an interdisciplinary field concerned with processes and systems for extracting knowledge or insights from data in various forms, whether structured or unstructured. It continues several data analysis fields, such as statistics, data mining, and predictive analytics, and is similar to Knowledge Discovery in Databases (KDD).
What are some key activities that define the field of Statistics?
- Descriptive statistics
- Inference
- Prediction
- Experimental Design
What is Descriptive Statistics?
Descriptive statistics includes exploratory data analysis, unsupervised learning, clustering and basic data summaries.
Descriptive statistics have many uses, most notably helping us get familiar with a data set.
Descriptive statistics are usually the starting point for any analysis.
Often, descriptive statistics help us arrive at hypotheses to be tested later with more formal inference.
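As a rough illustration of the "basic data summaries" mentioned above, the sketch below computes a few descriptive statistics with Python's standard `statistics` module. The `heights_cm` values are made up for the example and are not from any real data set.

```python
import statistics

# Hypothetical sample: heights in centimetres (made-up values for illustration).
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]

# A few basic summaries that help us get familiar with the data.
summary = {
    "n": len(heights_cm),
    "mean": statistics.mean(heights_cm),
    "median": statistics.median(heights_cm),
    "stdev": statistics.stdev(heights_cm),  # sample standard deviation
    "min": min(heights_cm),
    "max": max(heights_cm),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```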
What is Inference?
Inference is the process of making conclusions about populations from samples.
Inference includes most of the activities traditionally associated with statistics such as: estimation, confidence intervals, hypothesis tests and variability.
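One way to picture estimation and confidence intervals is the minimal sketch below, which builds an approximate 95% interval for a population mean from a sample. It assumes a normal approximation with z = 1.96; for a sample this small a t-quantile would usually be more appropriate. The function name and the sample values are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the population mean.

    Uses the normal approximation: mean +/- z * standard error.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * std_err, mean + z * std_err

# Hypothetical sample drawn from some population of interest.
sample = [160.0, 165.0, 170.0, 175.0, 180.0]
low, high = mean_confidence_interval(sample)
print(f"Approximate 95% CI for the mean: ({low:.2f}, {high:.2f})")
```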
What is Prediction?
Prediction overlaps quite a bit with inference, but modern prediction tends to have a different mindset.
Prediction is the process of trying to guess an outcome given a set of realizations of the outcome and some predictors.
What are some prediction algorithms?
- Machine learning
- Regression
- Deep learning
- Boosting
- Random forests
- Logistic regression
What is Classification?
If the target of prediction is binary or categorical, prediction is often called classification.
What is the purpose of Random Sampling?
In random sampling, one tries to randomly sample from a population of interest to get better generalizability of the results to the population.
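The idea can be sketched with a simulated population: a simple random sample drawn without replacement tends to have a mean close to the population mean. The population here is synthetic (normally distributed values with an assumed mean of 170 and standard deviation of 10), and the seed is fixed only to make the sketch repeatable.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "population of interest" (assumed values, for illustration only).
population = [rng.gauss(170.0, 10.0) for _ in range(10_000)]

# Simple random sample without replacement.
sample = rng.sample(population, k=200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```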
What are the two main activities of machine learning?
- Supervised Learning
- Unsupervised Learning
What is Supervised Learning?
Supervised learning uses a collection of predictors and some observed outcomes to build an algorithm that predicts the outcome when it is not observed.
Some examples include: neural networks, random forests, boosting and support vector machines.
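To make the idea concrete, here is a minimal sketch of supervised learning using a 1-nearest-neighbour classifier. This technique is swapped in for brevity and is not one of the algorithms listed above. The training points, labels, and function name are hypothetical.

```python
import math

def predict_nearest_neighbour(train_x, train_y, query):
    """Return the label of the training point closest to `query`."""
    distances = [math.dist(x, query) for x in train_x]
    nearest = distances.index(min(distances))
    return train_y[nearest]

# Predictors (two made-up measurements per item) with observed outcomes.
train_x = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
train_y = ["small", "small", "large", "large"]

# Predict the outcome for new items whose outcome was not observed.
print(predict_nearest_neighbour(train_x, train_y, (1.2, 1.4)))
print(predict_nearest_neighbour(train_x, train_y, (5.5, 5.0)))
```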
What is Unsupervised Learning?
Unsupervised learning tries to uncover unobserved factors in the data. It is called “unsupervised” because there is no gold standard outcome to judge against.
Example algorithms include hierarchical clustering, principal components analysis, factor analysis and k-means.
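As a small illustration of k-means, the sketch below alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its members. No labels are used at any step. The points and starting centroids are made up, and a practical implementation would choose starting centroids more carefully and check for convergence.

```python
import math

def k_means(points, centroids, iterations=10):
    """Basic k-means: returns the final centroids and the last cluster assignment."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical unlabeled data forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```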
What are some characteristics of Machine Learning?
- the emphasis on predictions;
- evaluating results via prediction performance;
- having concern for overfitting but not model complexity per se;
- emphasis on performance;
- obtaining generalizability through performance on novel datasets;
- usually no superpopulation model specified;
- concern over performance and robustness.
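Evaluating by prediction performance on novel data can be sketched with a simple hold-out split: fit on one part of the data, then measure error on the part the model never saw. The data below are simulated (an assumed relationship y = 2x plus noise), and the "model" is a deliberately simple slope through the origin.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simulated data under an assumed relationship: y = 2x + noise.
data = [(float(x), 2.0 * x + rng.gauss(0.0, 1.0)) for x in range(100)]
rng.shuffle(data)
train, test = data[:80], data[80:]

# Fit a slope through the origin using the training data only.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def mean_squared_error(rows):
    """Average squared prediction error of the fitted slope on `rows`."""
    return sum((y - slope * x) ** 2 for x, y in rows) / len(rows)

print(f"train MSE: {mean_squared_error(train):.3f}")
print(f"test MSE:  {mean_squared_error(test):.3f}")  # error on data not used for fitting
```

A test error much larger than the training error would be a sign of overfitting.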
What are some characteristics of traditional Statistics?
- emphasizing superpopulation inference;
- focusing on a-priori hypotheses;
- preferring simpler models over complex ones (parsimony), even if the more complex models perform slightly better;
- emphasizing parameter interpretability;
- having statistical modeling or sampling assumptions that connect data to a population of interest;
- having concern over assumptions and robustness.
In recent years, the distinction between the two fields has faded substantially. ML researchers have worked tirelessly to improve interpretability, while statistical researchers have improved the prediction performance of their algorithms.
Example of Supervised Learning
For supervised learning, we give an early example, the development of regression.
Francis Galton wanted to predict children’s heights from their parents’ heights, and he developed linear regression in the process.
Notice that having several children with known adult heights along with their parents allows one to build the model, then apply it to parents who are expecting.
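The build-then-apply workflow can be sketched with a least-squares line fitted to known parent and child heights, then used to predict for a new parent height. The numbers are made up for illustration and are not Galton's data. The function name is hypothetical.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up training data: parent height and adult child height, in inches.
parent_heights = [64.0, 66.0, 68.0, 70.0, 72.0]
child_heights = [66.0, 67.0, 68.0, 69.0, 70.0]

intercept, slope = fit_line(parent_heights, child_heights)

# Apply the fitted model to a parent whose child's adult height is not yet known.
predicted = intercept + slope * 69.0
print(f"predicted adult height: {predicted:.1f}")  # prints 68.5 for this toy data
```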
Example of Unsupervised Learning
A famous early example of unsupervised clustering is the computation of the g-factor.
The g-factor was postulated to be a measure of intrinsic intelligence. Early factor analytic models were used to cluster scores on psychometric questions to create it.
Notice the lack of a gold standard outcome. There was no true measure of intrinsic intelligence to train an algorithm to predict it.