The Data Science Handbook-II Flashcards

1
Q

What are the phases in the Data Science Road Map?

A
  1. Frame the problem
  2. Understand the data
  3. Extract features
  4. Model and analyse (loops back to Frame the problem)
  5. Present results or deploy code
2
Q

What is data wrangling?

A

Data wrangling is the process of getting the data from its raw format into something suitable for more conventional analytics. This typically means creating a software pipeline that gets the data out of wherever it is stored, does any cleaning or filtering necessary, and puts it into a regular format.
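
As an illustration, here is a minimal wrangling-pipeline sketch in pandas; the file name and column names are hypothetical placeholders:

  import pandas as pd

  # Get the data out of wherever it is stored
  raw = pd.read_csv("raw_events.csv")

  # Do any cleaning or filtering necessary
  clean = (
      raw.drop_duplicates()
         .assign(timestamp=lambda df: pd.to_datetime(df["timestamp"], errors="coerce"))
         .dropna(subset=["timestamp", "value"])
         .query("value >= 0")
  )

  # Put it into a regular format for conventional analytics
  clean.to_parquet("events_clean.parquet")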

3
Q

What are the two things you typically get out of exploratory analysis?

A
  1. You develop an intuitive feel for the data, including what the salient patterns look like visually.
  2. You get a list of concrete hypotheses about what’s going on in the data.
4
Q

What is exploratory analysis?

A

A stage of analysis that focuses on exploring the data to generate hypotheses about it. Exploratory analysis relies heavily on visualizations.

5
Q

Deploy code stage.

If your ultimate clients are computers, then it is your job to produce code that will be run regularly in the future by other people. Typically, this falls into one of two categories. What are the two categories?

A
  1. Batch analytics code.
  2. Real-time code.
6
Q

What are five popular programming language options for data scientists?

A
  • Python
  • R
  • MATLAB and Octave
  • SAS
  • Scala
7
Q

How can you identify pathologies early? Four tips.

A
  1. If the data is text, look directly at the raw file rather than just reading it into your script.
  2. Read supporting documentation, if it is available.
  3. Have a battery of standard diagnostic questions you ask about the data.
  4. Do sanity checks, where you use the data to derive things you already know.
8
Q

What are eight examples of problems with data content?

A
  1. Duplicate entries
  2. Multiple entries for a single entity
  3. Missing entries
  4. NULLs
  5. Huge outliers
  6. Out-of-date data
  7. Artificial entries
  8. Irregular spacings
9
Q

What is a regular expression?

A

A way to specify a general pattern that strings can match.
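
For example, using Python’s re module (a minimal sketch; the date strings are made up):

  import re

  # Pattern: four digits, a dash, two digits, a dash, two digits
  date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

  print(bool(date_pattern.fullmatch("2021-07-04")))    # True
  print(bool(date_pattern.fullmatch("July 4, 2021")))  # False
  print(date_pattern.findall("shipped 2021-07-04, returned 2021-07-19"))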

10
Q

Name four commonalities that pretty much all machine learning algorithms seem to work with.

A
  • It’s all done using computers, leveraging them to do calculations that would be intractable by hand.
  • It takes data as input. If you are simulating a system based on some idealized model, then you aren’t doing machine learning.
  • The data points are thought of as being samples from some underlying “real-world” probability distribution.
  • The data is tabular (or at least you can think of it that way). There is one row per data point and one column per feature. The features are all numerical, binary or categorical.
11
Q

Name two types of machine learning.

A
  1. Supervised
  2. Unsupervised
11
Q

What is supervised machine learning?

A

In supervised machine learning, your training data consists of some points and a label or target value associated with them. The goal of the algorithms is to figure out some way to estimate that target value.

12
Q

What is unsupervised learning?

A

In unsupervised learning, there is just raw data, without any particular thing that is supposed to be predicted. Unsupervised algorithms are used for finding patterns in the data in general, teasing apart its underlying structure. Clustering algorithms are a prototypical example of unsupervised learning.

13
Q

What are four ways to train on some of your data and assess performance on other data?

A
  1. Most basically, you randomly divide your data points between training and testing.
  2. A fancier method that works specifically for supervised learning is called k-fold cross validation.
  3. If you’re very rigorous about your statistics, it is common to divide your data into a training set, a testing set, and a validation set.
  4. There is another approach where a model is retrained periodically, say every week, incorporating the new data acquired in the previous week.
14
Q

What is the goal of k-fold cross validation?

A

The goal of k-fold cross validation isn’t to measure the performance of a particular fitted classifier, but rather the performance of a family of classifiers.

15
Q

What are the steps in k-fold cross-validation?

A
  • Divide the data randomly into k partitions.
  • Train a classifier on all but one partition, and test its performance on the partition that was left out.
  • Repeat, but choosing a different partition to leave out and test on. Continue for all the partitions, so that you have k different trained classifiers and k performance metrics for them.
  • Take the average of the metrics. This is the best estimate of the “true” performance of this family of classifiers when it is trained on this kind of data.
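
A minimal sketch of these steps using scikit-learn, with the iris dataset and a logistic regression as stand-ins:

  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = load_iris(return_X_y=True)

  # k=5: five classifiers, each tested on the partition it did not see,
  # and the average of the five metrics estimates the family's performance
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
  print(scores, scores.mean())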
16
Q

A machine learning classifier is a computational object that has two stages. Which two?

A
  1. It gets “trained”. It takes in its training data, which is a bunch of data points and the correct label associated with them, and tries to learn some pattern for how the points map to the labels.
  2. Once it has been trained, the classifier acts as a function that takes in additional data points and outputs predicted classifications for them. Sometimes, the prediction will be a specific label; other times, it will give a continuous-valued number that can be seen as a confidence score for a particular label.
17
Q

Describe a decision tree classifier.

A

Using a decision tree to classify a data point is the equivalent of following a basic flow chart. It consists of a tree structure. Every node in the tree asks a question about one feature of a data point.

If the feature is numerical, the node asks whether it is above or below a threshold, and there are child nodes for “yes” and “no”. If the feature is categorical, typically there will be a different node for each value it can take. A leaf node in the tree will be the score that is assigned to the point being classified (or several scores, one for each possible thing the point could be flagged as).
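
A minimal sketch with scikit-learn’s DecisionTreeClassifier (the iris dataset is just a stand-in); export_text prints the flow-chart-like structure of thresholds:

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier, export_text

  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Each internal node asks whether one feature is above or below a threshold
  tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
  print(export_text(tree))
  print(tree.score(X_test, y_test))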

18
Q

What is a random forest classifier?

A

A random forest is a collection of decision trees, each of which is trained on a random subset of the training data and only allowed to use some random subset of the features. There is no coordination in the randomization - a particular data point or feature could randomly get plugged into all the trees, none of the trees, or anything in between. The final classification score for a point is the average of the scores from all the trees.

The one thing that you can do with a random forest is to get a “feature importance” score for any feature in the dataset. In practice, you can often take this list of features and, with a little bit of old-fashioned data analysis, figure out compelling real-world interpretations of what they mean. But the random forest itself tells you nothing.
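
A minimal sketch of pulling feature importance scores out of scikit-learn’s RandomForestClassifier (the breast cancer dataset is just a stand-in):

  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import RandomForestClassifier

  data = load_breast_cancer()
  forest = RandomForestClassifier(n_estimators=200, random_state=0)
  forest.fit(data.data, data.target)

  # Rank features by their importance scores
  ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                  key=lambda pair: pair[1], reverse=True)
  for name, score in ranked[:5]:
      print(name, round(score, 3))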

19
Q

What are ensemble classifiers?

A

Random forests are the best-known example of what are called “ensemble classifiers,” where a wide range of classifiers (decision trees, in this case) are trained under randomly different conditions (in our case, random selections of data points and features) and their results are aggregated. Intuitively, the idea is that if every classifier is at least marginally good, and the different classifiers are not very correlated with each other, then the ensemble as a whole will reliably slouch toward the right classification. Basically, it’s using raw computational power in lieu of domain knowledge or mathematical sophistication, relying on the power of the law of large numbers.

20
Q

What are two characteristics of Support Vector Machines?

A
  1. They make a very strong assumption about the data called linear separability.
  2. They are one of the few classifiers that are fundamentally binary; they don’t give continuous-valued “scores” that can be used to assess how confident the classifier is.
21
Q

What is a Support Vector Machine (SVM)?

A

Essentially, you view every data point as a point in d-dimensional space and then look for a hyperplane that separates the two classes. The assumption that there actually is such a hyperplane is called linear separability.

Training the SVM involves finding the hyperplane that (1) separates the two classes and (2) is “in the middle” of the gap between them. Specifically, the “margin” of a hyperplane is min(its distance to the nearest point in class A, its distance to the nearest point in class B), and you pick the hyperplane that maximizes the margin.

Mathematically, the hyperplane is specified by the equation:
f(x) = w·x + b = 0

where w is a vector perpendicular to the hyperplane and b measures how far offset it is from the origin.
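
A minimal sketch with scikit-learn’s SVC on two made-up, linearly separable blobs; coef_ and intercept_ correspond to w and b:

  import numpy as np
  from sklearn.svm import SVC

  rng = np.random.default_rng(0)
  class_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
  class_b = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))
  X = np.vstack([class_a, class_b])
  y = np.array([0] * 50 + [1] * 50)

  svm = SVC(kernel="linear").fit(X, y)
  print(svm.coef_, svm.intercept_)   # w (perpendicular to the hyperplane) and b (its offset)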

22
Q

What are three popular valid kernels (functions that take in two vectors)?

A
  • Polynomial kernel
  • Gaussian kernel
  • Sigmoid kernel
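
In scikit-learn’s SVC these correspond to the kernel parameter (“rbf” is the Gaussian kernel); a minimal sketch with a stand-in dataset:

  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import cross_val_score
  from sklearn.svm import SVC

  X, y = load_breast_cancer(return_X_y=True)
  for kernel in ["poly", "rbf", "sigmoid"]:   # polynomial, Gaussian, sigmoid
      print(kernel, cross_val_score(SVC(kernel=kernel), X, y, cv=3).mean())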
23
Q

Describe Logistic Regression.

A

Logistic regression is a great general-purpose classifier, striking an excellent balance between accurate classifications and real-world interpretability. It could be seen as kind of a nonbinary version of SVM, one that scores points with probabilities based on how far they are from the hyperplane, rather than using that hyperplane as a definitive cutoff.

If the training data is almost linearly separable, then all points that aren’t near the hyperplane will get a confident prediction near 0 or 1. But if the two classes bleed over the hyperplane a lot, the predictions will be more muted, and only points far from the hyperplane will get confident scores.
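
A minimal sketch with scikit-learn’s LogisticRegression (stand-in dataset); predict_proba gives the probability-style scores rather than hard 0/1 calls:

  from sklearn.datasets import load_breast_cancer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
  # Confidence depends on how far a point is from the separating hyperplane
  print(model.predict_proba(X_test[:5]))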

24
Q

What is Lasso Regression?

A

Lasso regression is a variant of logistic regression. One of the problems with logistic regression is that you can have many different features all with modest weights, instead of a few clearly meaningful features with large weights.

In lasso regression, p(x) has the same functional form. However, we train it in a way that punishes modest-sized weights.
For example:
- if features i and j have large weights, but they usually cancel each other out when classifying a point, set both their weights to 0.
- if features i and j are highly correlated, you can reduce the weight for one while increasing the weight for the other, keeping predictions more or less the same.
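
A minimal sketch of the lasso-style idea using scikit-learn’s LogisticRegression with an L1 penalty (stand-in dataset); the penalty drives most weights to exactly zero:

  from sklearn.datasets import load_breast_cancer
  from sklearn.linear_model import LogisticRegression

  X, y = load_breast_cancer(return_X_y=True)

  # penalty="l1" punishes weights; C controls how harsh the punishment is
  model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
  print((model.coef_ != 0).sum(), "nonzero weights out of", model.coef_.size)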

25
Q

Describe Naive Bayes.

A

Briefly, a Bayesian classifier operates on the following intuition: you start off with some initial confidence in the labels 0 and 1 (assume that it’s a binary classification problem). When new information becomes available, you adjust your confidence levels, depending on how likely that information is conditioned on each label. When you’ve gone through all available information, your final confidence levels are the probabilities of the labels 0 and 1.

The “naive” part is the assumption that all features in a dataset are independent of each other when you condition on the target variable.
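
A minimal sketch with scikit-learn’s GaussianNB (the iris dataset is just a stand-in):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import cross_val_score
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  # GaussianNB learns how common each label is and, per feature, a normal
  # distribution conditioned on the label - the "naive" independence assumption
  print(cross_val_score(GaussianNB(), X, y, cv=5).mean())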

26
Q

What does a naive Bayesian classifier learn during the training phase?

A
  • How common every label is in the whole training data
  • For every feature Xi, its probability distribution when the label is 0.
  • For every feature Xi, its probability distribution when the label is 1.
27
Q

What is a “perceptron”?

A

The simplest neural network is the perceptron. A perceptron is a network of “neurons”, each of which takes in multiple inputs and produces a single output.

28
Q

What is a ROC curve?

A

A two-dimensional box where you treat the false positive rate as the x-coordinate and the true positive rate as the y-coordinate.

For example, you can compare classifiers this way to see which has the best AUC (area under the curve) metric as the classification threshold is varied.
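
A minimal sketch with scikit-learn’s roc_curve and roc_auc_score (stand-in dataset and classifier):

  from sklearn.datasets import load_breast_cancer
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import roc_auc_score, roc_curve
  from sklearn.model_selection import train_test_split

  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  scores = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
  fpr, tpr, thresholds = roc_curve(y_test, scores)   # x: false positive rate, y: true positive rate
  print("AUC:", roc_auc_score(y_test, scores))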

29
Q

Define “Precision”.

A

Of all the results flagged by a classifier, this is the fraction that are actual hits.

Precision = TP / (TP + FP)

30
Q

Define “Recall”.

A

The fraction of all hits that get flagged by a classifier.

Recall = TP / (TP + FN)
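
A minimal sketch of both metrics on made-up labels, using scikit-learn:

  from sklearn.metrics import precision_score, recall_score

  y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # actual hits
  y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # what the classifier flagged

  print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
  print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 2/4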

31
Q

What are six guiding principles when creating technical documentation?

A
  1. Know your audience
  2. Show why it matters
  3. Make it concrete
  4. A picture is worth a thousand words
  5. Don’t be arrogant about your tech knowledge
  6. Make it look decent
32
Q

What is C.R.A.P. design?

A

Contrast: things that are different should look different.
Repetition: key points or design motifs should be repeated throughout your work.
Alignment: make sure that the different parts of your visual field line up with each other in a natural way.
Proximity: use distance between things to indicate their relationships.

33
Q

What common sections does a technical report consist of?

A
  • an executive summary
  • background and motivation
  • datasets used
  • analytical overview
  • results
  • software overview
  • future work
  • conclusions
  • appendices with technical details
34
Q

What are two types of unsupervised learning?

A
  1. Clustering
  2. Dimensionality reduction
35
Q

What is clustering?

A

Clustering is an attempt to group the data points into distinct “clusters”.

36
Q

What is dimensionality reduction?

A

In dimensionality reduction, the goal isn’t to look for distinct categories in the data. Instead, the idea is that the different fields are largely redundant, and we want to extract the real, underlying variability in the data.

37
Q

What is PCA?

A

Principal Component Analysis.

In PCA, the idea is generally dimensionality reduction: you find how many of these new coordinates are needed to capture most of your data’s variance, and then reduce your data points to just a few coordinates.

A prototypical application of PCA is analyzing pictures of faces: there is a staggering number of dimensions in the data, which are almost entirely redundant, and examining the principal components themselves gives insights into the behaviour of the system.
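
A minimal sketch with scikit-learn’s PCA (stand-in dataset); note the scaling step, so every dimension has a comparable standard deviation:

  from sklearn.datasets import load_breast_cancer
  from sklearn.decomposition import PCA
  from sklearn.preprocessing import StandardScaler

  X, _ = load_breast_cancer(return_X_y=True)

  # Scale first so every dimension has a comparable standard deviation
  X_scaled = StandardScaler().fit_transform(X)

  pca = PCA(n_components=5).fit(X_scaled)
  # Cumulative share of the data's variance captured by the first few components
  print(pca.explained_variance_ratio_.cumsum())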

38
Q

What are three limitations of PCA?

A
  1. Your dimensions all need to be scaled to have comparable standard deviations.
  2. PCA assumes that your data is linear. If the “real” shape of your dataset is that it’s bent into an arc in high-dimensional space, it will get blurred into several principal components. PCA will still be useful for dimensionality reduction, but the components themselves are likely not to be very meaningful.
  3. If you are using PCA on images of faces or something similar, the key parts of the pictures need to be aligned with each other.
39
Q

In which two varieties do algorithmic methods to evaluate the outcome of clustering come?

A
  1. Supervised ones, where we have some ground truth about what the “right” clusters are, and we see how closely the clusters we found match up to them.
  2. Unsupervised ones, where we think of the points as vectors in d-dimensional space and look at how geometrically distinct the clusters are from each other.
40
Q

Define the silhouette score.

A

Silhouette scores are the most common unsupervised method you’ll see, and they are ideal for scoring the output of k-means clustering. The score is based on the intuition that clusters should be dense and widely separated from each other, the same intuition behind k-means. It works best with dense, compact clusters that are all of comparable size. Silhouette scores aren’t applicable to things such as a doughnut-shaped cluster with a compact one in the middle.
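
A minimal sketch scoring k-means output with scikit-learn’s silhouette_score on made-up blob data:

  from sklearn.cluster import KMeans
  from sklearn.datasets import make_blobs
  from sklearn.metrics import silhouette_score

  X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

  for k in (2, 3, 4, 5):
      labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
      print(k, silhouette_score(X, labels))   # dense, well-separated clusters score higher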

41
Q

What does the “entropy” method measure?

A

How random a random variable is, and it comes from the field of information theory.

42
Q

What is “amortized performance”?

A

Amortized performance is the average cost of an operation over a long sequence of operations, rather than the worst-case cost of a single call.

A great example is Python dictionaries. When a dictionary starts to fill up, it must occasionally be reshuffled to create more room. Each reshuffle becomes more and more costly, but reshuffles also become more and more rare, so the average cost per insertion stays low.

43
Q

Hadoop consists of two parts. Which two?

A
  1. Hadoop Distributed File System (HDFS), which allows you to store data on a cluster of computers without worrying about what data is on which node.
  2. The actual MapReduce (MR) framework, which reads in data from HDFS, processes it in parallel, and writes its output to HDFS.
44
Q

In k-fold cross validation, how many trained classifiers and performance metrics are there?

A

k trained classifiers and k performance metrics

45
Q

What is the difference between a database and a database management system?

A

A database refers to the data itself and its organisation, while the database management system (DBMS) is the software framework that provides access to that data.

46
Q

What are four typical applications of Fourier transforms?

A
  • Using them to identify periodicity in a signal. For example, if you measure blood pressure several times a second, the main frequency in the data will be a person’s heart rate.
  • Fourier coefficients as features for windows. The amount of, say, 10 Hz frequency in a signal is an incredibly important, perhaps very physically meaningful feature that we can extract and plug into a machine learning algorithm. The one catch is that you have to take the magnitude of the Fourier coefficient rather than the coefficient itself, since it is a complex number.
  • Smoothing the data by removing high-frequency jitter. This is sometimes called a “low-pass filter”: you set all the higher-frequency coefficients to 0 and reconstruct the signal from that (see the sketch below).
  • Removing long-term trends to study shorter-timescale phenomena. This is called a “high-pass filter”, and it works analogously to a low-pass filter: set the low-frequency coefficients to 0, and then reconstruct the signal.
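
A minimal sketch with numpy.fft on a made-up signal, covering the first and third applications (finding the dominant frequency, then low-pass filtering):

  import numpy as np

  # A slow 3 Hz sine wave plus high-frequency jitter
  t = np.linspace(0, 1, 500)
  signal = np.sin(2 * np.pi * 3 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)

  coeffs = np.fft.rfft(signal)                       # Fourier transform of the raw signal
  freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
  print("dominant frequency:", freqs[np.argmax(np.abs(coeffs[1:])) + 1])   # skip the DC term

  # Low-pass filter: zero out the high-frequency coefficients, then reconstruct
  coeffs[20:] = 0
  smoothed = np.fft.irfft(coeffs, n=t.size)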
47
Q

What is Fourier analysis?

A

Looking at a time series signal as a linear combination of sinusoids with different frequencies.

48
Q

What is Fourier transformation?

A

The process of going from a raw signal to its Fourier decomposition.

49
Q

What is a spline?

A

A method of interpolation where a cubic polynomial is fit to a small number of points that are close together. That polynomial is used to give interpolated values near those points.
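
A minimal sketch with scipy’s CubicSpline on a handful of made-up points:

  import numpy as np
  from scipy.interpolate import CubicSpline

  x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
  y = np.sin(x)                       # the known points

  spline = CubicSpline(x, y)
  print(spline([0.5, 1.5, 2.5]))      # interpolated values near the known points
  print(np.sin([0.5, 1.5, 2.5]))      # the true values, for comparison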

50
Q

What are the three programming paradigms?

A
  1. Imperative
  2. Functional
  3. Object-oriented
51
Q

Describe imperative programming.

A

Your code is mostly a sequence of instructions for the computer to follow.

52
Q

Describe functional programming.

A

Functional programming is largely inspired by the desire to avoid “side effects.” A side effect means any modification that is done to existing variables (such as appending an element to a list or incrementing a number) or any interaction of the program with the outside world (such as printing to the screen).

In functional programming, your code is broken up into “pure” functions, which take some input (or maybe none at all) and return an output, but have no side effects.

53
Q

Describe object-oriented programming.

A

An object-oriented language will package data and the logic that handles the data into user-friendly black boxes called “objects”. When using an object, you don’t need to worry about how the data is structured or how to untangle that structure; you only interact with special-purpose functions called “methods” that the object presents to you.

54
Q

What is a compiler?

A

A software program that translates human-readable source code into a low-level language that is more suitable for actually running. This is often machine code or bytecode for a virtual machine.

55
Q

What is an interpreter?

A

An interpreter is a special-purpose program that reads and executes your code one line at a time. The interpreter itself is a blob of machine code that was originally written in something like C.

56
Q

What is entropy?

A

Entropy is a way to measure “how random” a random variable is, and it comes from the field of information theory.
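
A minimal sketch of computing entropy (in bits) for a few discrete distributions:

  import numpy as np

  def entropy(probabilities):
      """Shannon entropy, in bits, of a discrete distribution."""
      p = np.asarray(probabilities, dtype=float)
      p = p[p > 0]                     # treat 0 * log(0) as 0
      return -np.sum(p * np.log2(p))

  print(entropy([0.5, 0.5]))   # a fair coin: 1 bit, as random as two outcomes get
  print(entropy([0.9, 0.1]))   # a biased coin: less random, lower entropy
  print(entropy([1.0]))        # a certain outcome: 0 bits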

57
Q

What is a Bernoulli random variable?

A

A random variable describing the flip of a coin that is heads with some probability p.

58
Q

What type of programming language is Python?
a. Imperative
b. Functional
c. Object-oriented

A

Python is inherently an object-oriented language. Everything you ever use or define in Python code, including variables, functions, and even libraries that you import, is an object. Every action you ever take in Python is calling a method on some object.

59
Q

What is a “class” in object-oriented code?

A

The key feature of object-oriented code is the definition of whole new types of objects. These types of objects are called “classes”. A class specifies the internal structure of an object, as well as all of its associated methods. You can have many objects that are all of the same class, but the class itself is only defined once.

60
Q

What type of programming language is Python?
a. compiled language
b. interpreted language

A

Python is an interpreted language, in contrast to compiled languages such as C.

61
Q

What is “garbage collection”?

A

The process of deleting data structures that you no longer need, and hence freeing up the memory that they were taking up.

62
Q

What is the difference between “statically typed” and “dynamically typed”?

A

A language is “statically typed” if the computer figures out, at the time the code is compiled, what the type is of all the variables. This allows the compiler to store and process the data in the most efficient way possible.

It is dynamically typed if the types are not known until the code is run, meaning that there will be some additional boilerplate to keep track of what variables are integers, strings, lists, and so on.

Python is a great example of a dynamically typed language.

63
Q

What is RDD?

A

Resilient Distributed Dataset, the core data abstraction in Apache Spark.

64
Q

What is Map in the MapReduce context?

A

An operation where you take a collection of data structures and apply the same function to each of them. The outputs of the functions are, collectively, the output of the process.

65
Q

What is Reduce in the MapReduce context?

A

An operation where a stream of values is processed one at a time, updating an aggregate with each value. After the last value, the aggregate is returned as the result of the process.
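
A minimal sketch of the two operations using Python’s built-in map and functools.reduce (word counts over made-up documents):

  from functools import reduce

  documents = ["the cat sat", "the cat ran", "a dog sat"]

  # Map: apply the same function to every item in the collection
  word_counts = map(lambda doc: len(doc.split()), documents)

  # Reduce: process the stream one value at a time, updating an aggregate
  total_words = reduce(lambda aggregate, count: aggregate + count, word_counts, 0)
  print(total_words)   # 9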

66
Q

What is the Bonferroni correction?

A

The Bonferroni correction is the most common multiple-hypothesis correction that you will see: you divide your significance threshold by the number of tests you run. You should be aware, though, that it is unnaturally stringent. In particular, it assumes that every test you run is independent of every other test you run.
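
A minimal sketch with made-up p-values:

  # Hypothetical p-values from five separate hypothesis tests
  p_values = [0.003, 0.02, 0.04, 0.30, 0.75]
  alpha = 0.05

  # Bonferroni: divide the significance threshold by the number of tests
  corrected_threshold = alpha / len(p_values)
  significant = [p for p in p_values if p < corrected_threshold]
  print(corrected_threshold, significant)   # 0.01 [0.003]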

67
Q

What is a Bayesian network?

A

A dependency graph between several random variables, used to model which ones are likely to be conditionally dependent on which others. A good Bayesian network is more powerful than Naive Bayes, but still sparse enough that it can be effectively trained on data.

68
Q

What are parameter estimation and hypothesis testing about?

A

Hypothesis testing is about measuring whether effects are present, but not about quantifying their magnitude.

Parameter estimation is where we try to quantify the magnitude of effects and estimate the underlying parameters that characterize distributions.

69
Q

What is the t-test and where is it useful for?

A

The t-test is useful for situations where the thing you measure is a continuous number, rather than a binary coin flip, and you want to assess whether the real-world averages of two distributions are the same.

The t-test gives us two choices of a null hypothesis (see the sketch below):
- the two datasets come from the same normal distribution
- the two datasets come from normal distributions with the same means, although the standard deviations could be different
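
A minimal sketch with scipy.stats.ttest_ind on made-up data; equal_var roughly selects between the two null hypotheses above:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  group_a = rng.normal(loc=10.0, scale=2.0, size=100)
  group_b = rng.normal(loc=10.5, scale=2.0, size=100)

  # equal_var=True assumes the same standard deviation; equal_var=False (Welch's
  # t-test) allows the standard deviations to differ
  t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
  print(t_stat, p_value)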

70
Q

What are interpretation and compilation?

A

You should think of compilation as just the translation of code that is in one language into some lower-level language. Conversely, you should then think of interpretation as plugging code into a “machine” (virtual or physical) that executes the code.

71
Q

What is the difference between statically typed and dynamically typed ?

A

A language is statically typed if the computer figures out, at the time the code is compiled, what the type is of all the variables. This allows the compiler to store and process the data in the most efficient way possible.

It is dynamically typed if the types are not known until the code is run, meaning that there will be some additional boilerplate to keep track of what variables are integers, strings, lists, and so on.

Python is a great example of a dynamically typed language.

Dynamic typing has many benefits in terms of flexibility, but you pay a large performance cost.

72
Q

What is the difference between strong versus weak typing?

A

Typing strength is a much fuzzier notion than whether a language is dynamically or statically typed. Roughly, it means to what degree the language forces you to use types and their operations consistently.