A Crash Course in Data Science Flashcards

Question 1

Q

What is Data Science?

Answer

A

Data science is an interdisciplinary field concerned with the processes and systems used to extract knowledge or insights from data in various forms, whether structured or unstructured. It is a continuation of data analysis fields such as statistics, data mining, and predictive analytics, and is similar to Knowledge Discovery in Databases (KDD).

Question 2

Q

What are some key activities that define the field of Statistics?

Answer

A

Descriptive statistics
Inference
Prediction
Experimental Design

Question 3

Q

What is Descriptive Statistics?

Answer

A

Descriptive statistics includes exploratory data analysis, unsupervised learning, clustering and basic data summaries.

Descriptive statistics have many uses, most notably helping us get familiar with a data set.

Descriptive statistics are usually the starting point for any analysis.

Often, descriptive statistics help us arrive at hypotheses to be tested later with more formal inference.
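As a rough illustration of the "basic data summaries" mentioned above, the sketch below computes a few summary statistics with Python's standard `statistics` module. The `heights_cm` values are invented for illustration and do not come from the source.

```python
import statistics

# Hypothetical data: adult heights in centimetres (invented for illustration).
heights_cm = [158.0, 162.5, 165.0, 170.0, 171.5, 174.0, 180.0, 191.0]

summary = {
    "n": len(heights_cm),
    "mean": statistics.mean(heights_cm),
    "median": statistics.median(heights_cm),
    "stdev": statistics.stdev(heights_cm),  # sample standard deviation
    "min": min(heights_cm),
    "max": max(heights_cm),
}

# Print each summary value on its own line.
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

A summary like this is typically a first look at the data; a large gap between the mean and the median, for instance, can suggest skew worth investigating.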

Question 4

Q

What is Inference?

Answer

A

Inference is the process of making conclusions about populations from samples.

Inference includes most of the activities traditionally associated with statistics, such as estimation, confidence intervals, hypothesis tests and variability.
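To make estimation and confidence intervals concrete, here is a minimal sketch that estimates a population mean from a sample and attaches an approximate 95% interval. The `sample` values are invented, and the normal approximation is an assumption of this sketch; with a sample this small, a t quantile would give a somewhat wider interval.

```python
import math
import statistics

# Hypothetical sample of 10 measurements (invented for illustration).
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

n = len(sample)
mean = statistics.mean(sample)                      # point estimate
std_err = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

# Assumption: normal approximation to the sampling distribution of the mean.
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
ci_low, ci_high = mean - z * std_err, mean + z * std_err

print(f"estimate: {mean:.3f}, 95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```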

Question 5

Q

What is prediction?

Answer

A

Prediction overlaps quite a bit with inference, but modern prediction tends to have a different mindset.

Prediction is the process of trying to guess an outcome given a set of realizations of the outcome and some predictors.

Question 6

Q

What are some prediction algorithms?

Answer

A

Machine learning
Regression
Deep learning
Boosting
Random forests
Logistic regression

These are all prediction algorithms or families of prediction algorithms.

Question 7

Q

What is Classification?

Answer

A

If the target of prediction is binary or categorical, prediction is often called classification.
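A small sketch of classification follows. It uses a 1-nearest-neighbour rule, chosen here only because it is short; the source does not name this method. The `training` data and the `classify` helper are hypothetical.

```python
import math

# Hypothetical labelled data: (feature vector, class label), invented for illustration.
training = [
    ((1.0, 1.2), "small"),
    ((0.8, 1.0), "small"),
    ((1.1, 0.9), "small"),
    ((3.0, 3.2), "large"),
    ((3.3, 2.9), "large"),
    ((2.8, 3.1), "large"),
]

def classify(point):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training, key=lambda item: math.dist(item[0], point))
    return label

print(classify((1.0, 1.0)))
print(classify((3.1, 3.0)))
```

The target here is categorical ("small" or "large"), which is what distinguishes this from predicting a numeric outcome.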

Question 8

Q

What is the purpose of Random Sampling?

Answer

A

In random sampling, one draws a random sample from a population of interest so that the results generalize better to that population.
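The idea can be sketched with Python's `random` module. The simulated `population` below is invented for illustration; in practice the population mean is unknown and the sample is used to estimate it.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values (invented for illustration).
population = [rng.gauss(170.0, 10.0) for _ in range(10_000)]

# Simple random sample without replacement: every unit is equally likely to be chosen.
sample = rng.sample(population, k=200)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean: {pop_mean:.2f}, sample mean: {sample_mean:.2f}")
```

Because each unit has the same chance of selection, the sample mean tends to land close to the population mean, which is the sense in which results generalize.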

Question 9

Q

What are the two main activities of machine learning?

Answer

A

Supervised Learning
Unsupervised Learning

Question 10

Q

What is Supervised Learning?

Answer

A

Supervised learning uses a collection of predictors and some observed outcomes to build an algorithm that predicts the outcome when it is not observed.

Some examples include: neural networks, random forests, boosting and support vector machines.

Question 11

Q

What is Unsupervised Learning?

Answer

A

Unsupervised learning tries to uncover unobserved factors in the data. It is called “unsupervised” because there is no gold standard outcome to judge against.

Some example algorithms include hierarchical clustering, principal components analysis, factor analysis and k-means.
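As one illustration, k-means can be sketched in a few lines for one-dimensional data. The `values`, the starting centers, and the `k_means_1d` helper are all hypothetical; real uses normally rely on a library implementation and multi-dimensional data.

```python
# Hypothetical 1-D observations with two apparent groups (invented for illustration).
values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]

def k_means_1d(points, centers, iterations=10):
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

centers, clusters = k_means_1d(values, centers=[0.0, 2.0])
print(centers)
```

No labels are supplied anywhere: the grouping comes only from the distances between observations, which is what makes the method unsupervised.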

Question 12

Q

What are some characteristics of Machine Learning?

Answer

A

emphasizing predictions;
evaluating results via prediction performance;
having concern for overfitting but not model complexity per se;
emphasizing performance;
obtaining generalizability through performance on novel datasets;
usually specifying no superpopulation model;
having concern over performance and robustness.
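Evaluating a model by its prediction performance on data it has not seen can be sketched as a simple holdout split. Everything below is hypothetical: the simulated `data`, the 30/10 split, and the `fit_line` and `mse` helpers are assumptions made for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical data: y is roughly 2*x + 1 plus noise (invented for illustration).
data = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)) for x in range(40)]
rng.shuffle(data)
train, test = data[:30], data[30:]

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(points, intercept, slope):
    """Mean squared prediction error on a set of (x, y) pairs."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)

intercept, slope = fit_line(train)
train_mse = mse(train, intercept, slope)
test_mse = mse(test, intercept, slope)  # error on data not used for fitting
print(f"train MSE: {train_mse:.2f}, held-out MSE: {test_mse:.2f}")
```

A held-out error much larger than the training error is the usual symptom of overfitting.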

Question 13

Q

What are some characteristics of traditional Statistics?

Answer

A

emphasizing superpopulation inference;
focusing on a-priori hypotheses;
preferring simpler models over complex ones (parsimony), even if the more complex models perform slightly better;
emphasizing parameter interpretability;
having statistical modeling or sampling assumptions that connect data to a population of interest;
having concern over assumptions and robustness.

In recent years, the distinction between the two fields has substantially faded. ML researchers have worked to improve interpretability, while statistical researchers have improved the prediction performance of their algorithms.

Question 14

Q

What is an example of Supervised Learning?

Answer

A

An early example of supervised learning is the development of regression.

Francis Galton wanted to predict children’s heights from their parents’ heights, and he developed linear regression in the process.

Notice that data on several children with known adult heights, along with their parents’ heights, allows one to build the model and then apply it to parents who are expecting.
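The build-then-apply workflow can be sketched with a least-squares line. The `families` values are invented for illustration and are not Galton's data; using a single mid-parent height as the predictor is an assumption of this sketch.

```python
# Hypothetical (mid-parent height, adult child height) pairs in inches.
# These numbers are invented for illustration; they are not Galton's data.
families = [
    (64.0, 65.5), (66.0, 66.0), (67.5, 67.0), (68.0, 68.5),
    (69.0, 68.0), (70.5, 69.5), (72.0, 70.0),
]

n = len(families)
mean_parent = sum(p for p, _ in families) / n
mean_child = sum(c for _, c in families) / n

# Least-squares slope and intercept for: child = intercept + slope * parent
slope = (
    sum((p - mean_parent) * (c - mean_child) for p, c in families)
    / sum((p - mean_parent) ** 2 for p, _ in families)
)
intercept = mean_child - slope * mean_parent

def predict_child_height(parent_height):
    """Predict an adult child's height when the outcome is not yet observed."""
    return intercept + slope * parent_height

print(f"slope: {slope:.2f}")
print(f"predicted height for 71.0-inch parents: {predict_child_height(71.0):.1f}")
```

The model is fitted on families whose outcomes are known and then applied to a new predictor value, which is the supervised pattern described above.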

Question 15

Q

What is an example of Unsupervised Learning?

Answer

A

A famous early example of unsupervised clustering is the computation of the g-factor.

This was postulated to be a measure of intrinsic intelligence. Early factor analytic models were used to cluster scores on psychometric questions to create the g-factor.

Notice the lack of a gold standard outcome. There was no true measure of intrinsic intelligence to train an algorithm to predict it.
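A loose sketch of extracting one common factor from several test scores follows. It is not the historical factor-analytic method: as a stand-in it computes the first principal component of the correlation matrix by power iteration. The `scores` data are invented for illustration.

```python
import math
import statistics

# Hypothetical scores of six people on three tests (invented for illustration).
scores = {
    "vocabulary": [12.0, 15.0, 9.0, 18.0, 14.0, 11.0],
    "arithmetic": [10.0, 14.0, 8.0, 17.0, 15.0, 9.0],
    "puzzles": [11.0, 13.0, 10.0, 16.0, 12.0, 10.0],
}

def correlation(a, b):
    """Pearson correlation of two equal-length lists."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )

names = list(scores)
corr = [[correlation(scores[r], scores[c]) for c in names] for r in names]

# Power iteration: repeatedly multiply by the correlation matrix and normalise
# to approximate its leading eigenvector (the first principal component).
loadings = [1.0] * len(names)
for _ in range(100):
    product = [sum(row[j] * loadings[j] for j in range(len(names))) for row in corr]
    norm = math.sqrt(sum(v * v for v in product))
    loadings = [v / norm for v in product]

for name, weight in zip(names, loadings):
    print(f"{name:>10}: {weight:.2f}")
```

No outcome variable appears anywhere in the computation: the common component is inferred only from how the test scores move together.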