The Data Science Handbook-II Flashcards
What are the phases in the Data Science Road Map?
- Frame the problem
- Understand the data
- Extract features
- Model and analyse (loops back to Frame the problem)
- Present results or deploy code
What is data wrangling?
Data wrangling is the process of getting the data from its raw format into something suitable for more conventional analytics. This typically means creating a software pipeline that gets the data out of wherever it is stored, does any cleaning or filtering necessary, and puts it into a regular format.
What are the two things you typically get out of exploratory analysis?
- You develop an intuitive feel for the data, including what the salient patterns look like visually.
- You get a list of concrete hypotheses about what’s going on in the data.
What is exploratory analysis?
A stage of analysis that focuses on exploring the data to generate hypotheses about it. Exploratory analysis relies heavily on visualizations.
Deploy code stage.
If your ultimate clients are computers, then it is your job to produce code that will be run regularly in the future by other people. Typically, this code falls into one of two categories. What are those two categories?
- Batch analytics code.
- Real-time code.
What are five popular programming language options for data scientists?
- Python
- R
- MATLAB and Octave
- SAS
- Scala
How to identify pathologies early? Four tips.
- If the data is text, look directly at the raw file rather than just reading it into your script.
- Read supporting documentation, if it is available.
- Have a battery of standard diagnostic questions you ask about the data.
- Do sanity checks, where you use the data to derive things you already know.
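A minimal sketch of what a few of these diagnostics and sanity checks might look like in practice, assuming the data arrives as CSV text (the contents here are made up, with an inline string standing in for a real file on disk):

```python
import io

import pandas as pd

# Stand-in for a raw file on disk; in practice you would open the actual file.
raw = "date,temperature\n2021-01-01,3.5\n2021-01-02,4.1\n2021-01-03,MISSING\n"

# Look directly at the raw text before trusting any parser.
for line in raw.splitlines()[:5]:
    print(line)

# Standard diagnostics and sanity checks against things you already know.
df = pd.read_csv(io.StringIO(raw))
print(df.dtypes)      # the "MISSING" entry forces temperature to parse as strings, a red flag
print(df.describe())  # counts and summaries: do they match your expectations?
```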
What are eight examples of problems with data content?
- Duplicate entries
- Multiple entries for a single entity
- Missing entries
- NULLs
- Huge outliers
- Out-of-date data
- Artificial entries
- Irregular spacings
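A minimal sketch of how a few of these problems could be spotted with pandas, using a small made-up DataFrame (the column names are purely illustrative):

```python
import pandas as pd

# Made-up data containing a duplicate row, a NULL, and a suspiciously huge value.
df = pd.DataFrame({"id": [1, 1, 2, 3], "value": [10.0, 10.0, None, 9999.0]})

print(df.duplicated().sum())       # duplicate entries
print(df["value"].isnull().sum())  # missing entries / NULLs

# Huge outliers: flag points far outside the interquartile range.
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["value"] < q1 - 3 * iqr) | (df["value"] > q3 + 3 * iqr)])
```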
What is a regular expression?
A way to specify a general pattern that strings can match.
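A small illustration using Python's built-in re module (the pattern and strings here are made-up examples):

```python
import re

# Pattern for dates like "2021-03-15": four digits, dash, two digits, dash, two digits.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

print(bool(date_pattern.fullmatch("2021-03-15")))      # True: the whole string matches
print(bool(date_pattern.fullmatch("March 15, 2021")))  # False: doesn't fit the pattern
print(date_pattern.findall("logged 2021-03-15 and 2021-04-01"))  # all matching substrings
```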
Name four commonalities that pretty much all machine learning algorithms seem to work with.
- It’s all done using computers, leveraging them to do calculations that would be intractable by hand.
- It takes data as input. If you are simulating a system based on some idealized model, then you aren’t doing machine learning.
- The data points are thought of as being samples from some underlying “real-world” probability distribution.
- The data is tabular (or at least you can think of it that way). There is one row per data point and one column per feature. The features are all numerical, binary or categorical.
Name two types of machine learning.
- Supervised
- Unsupervised
What is supervised machine learning?
In supervised machine learning, your training data consists of some points and a label or target value associated with them. The goal of the algorithms is to figure out some way to estimate that target value.
What is unsupervised learning?
In unsupervised learning, there is just raw data, without any particular thing that is supposed to be predicted. Unsupervised algorithms are used for finding patterns in the data in general, teasing apart its underlying structure. Clustering algorithms are a prototypical example of unsupervised learning.
What are four ways to train on some of your data and assess performance on other data?
- Most basically, you randomly divide your data points between training and testing (see the sketch after this list).
- A fancier method that works specifically for supervised learning is called k-fold cross validation.
- If you’re very rigorous about your statistics, it is common to divide your data into a training set, a testing set, and a validation set.
- There is another approach where a model is retrained periodically, say every week, incorporating the new data acquired in the previous week.
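For the first of these approaches, a minimal sketch using scikit-learn's train_test_split (the dataset is synthetic and just stands in for real features and labels):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Small synthetic dataset standing in for real features X and labels y.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Randomly hold out 20% of the points for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```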
What is the goal of k-fold cross validation?
The goal of k-fold cross validation isn’t to measure the performance of a particular, fitted classifier, but rather a family of classifiers.
What are the steps in k-fold cross-validation?
- Divide the data randomly into k partitions.
- Train a classifier on all but one partition, and test its performance on the partition that was left out.
- Repeat, but choosing a different partition to leave out and test on. Continue for all the partitions, so that you have k different trained classifiers and k performance metrics for them.
- Take the average of the metrics. This is the best estimate of the “true” performance of this family of classifiers when it is trained on this kind of data.
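A minimal sketch of these steps using scikit-learn's cross_val_score (the data is synthetic, and logistic regression is just a stand-in for whatever family of classifiers you are evaluating):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cross_val_score handles the partitioning, the k trainings, and the k evaluations.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)         # k = 5 performance metrics, one per held-out partition
print(scores.mean())  # the averaged estimate of the family's "true" performance
```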
A machine learning classifier is a computational object that has two stages. Which two?
- It gets “trained”. It takes in its training data, which is a bunch of data points and the correct label associated with them, and tries to learn some pattern for how the points map to the labels.
- Once it has been trained, the classifier acts as a function that takes in additional data points and outputs predicted classifications for them. Sometimes, the prediction will be a specific label; other times, it will give a continuous-valued number that can be seen as a confidence score for a particular label.
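In scikit-learn, for example, these two stages correspond to the fit and predict (or predict_proba) methods. A hedged sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic training data and a few additional synthetic points to classify.
X_train, y_train = make_classification(n_samples=100, n_features=4, random_state=0)
X_new, _ = make_classification(n_samples=5, n_features=4, random_state=1)

clf = LogisticRegression()
clf.fit(X_train, y_train)        # stage 1: train on points plus their correct labels

print(clf.predict(X_new))        # stage 2: specific predicted labels
print(clf.predict_proba(X_new))  # or continuous-valued confidence scores
```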
Describe a decision tree classifier.
Using a decision tree to classify a data point is the equivalent of following a basic flow chart. It consists of a tree structure. Every node in the tree asks a question about one feature of a data point.
If the feature is numerical, the node asks whether it is above or below a threshold, and there are child nodes for “yes” and “no”. If the feature is categorical, typically there will be a different node for each value it can take. A leaf node in the tree will be the score that is assigned to the point being classified (or several scores, one for each possible thing the point could be flagged as).
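A minimal decision tree sketch in scikit-learn on synthetic data; export_text prints the learned flow chart of threshold questions:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each internal node asks "is feature_i <= threshold?"; the leaves hold the scores.
print(export_text(tree))
```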
What is a random forest classifier?
A random forest is a collection of decision trees, each of which is trained on a random subset of the training data and only allowed to use some random subset of the features. There is no coordination in the randomization - a particular data point or feature could randomly get plugged into all the trees, none of the trees, or anything in between. The final classification score for a point is the average of the scores from all the trees.
One thing that you can do with a random forest is get a “feature importance” score for any feature in the dataset. In practice, you can often take this list of important features and, with a little bit of old-fashioned data analysis, figure out compelling real-world interpretations of what they mean. But the random forest itself tells you nothing about why a feature is important.
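A minimal sketch with scikit-learn's RandomForestClassifier on synthetic data; feature_importances_ is the score referred to above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One importance score per feature; interpreting *why* a feature matters is
# left to old-fashioned data analysis.
print(forest.feature_importances_)
```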
What are ensemble classifiers?
Random forests are the best-known example of what are called “ensemble classifiers,” where a wide range of classifiers (decision trees, in this case) are trained under randomly different conditions (in our case, random selections of data points and features) and their results are aggregated. Intuitively, the idea is that if every classifier is at least marginally good, and the different classifiers are not very correlated with each other, then the ensemble as a whole will reliably slouch toward the right classification. Basically, it’s using raw computational power in lieu of domain knowledge or mathematical sophistication, relying on the power of the law of large numbers.
What are two characteristics of Support Vector Machines?
- They make a very strong assumption about the data called linear separability.
- They are one of the few classifiers that are fundamentally binary; they don’t give continuous-valued “scores” that can be used to assess how confident the classifier is.
What is a Support Vector Machine (SVM)?
Essentially, you view every data point as a point in a d-dimensional space and then look for a hyperplane that separates the two classes. The assumption that there actually is such a hyperplane is called linear separability.
Training the SVM involves finding the hyperplane that (1) separates the datasets and (2) is “in the middle” of the gap between the two classes. Specifically, the “margin” of a hyperplane is min(its distance to the nearest point in class A, its distance to the nearest point in class B), and you pick the hyperplane that maximizes the margin.
Mathematically, the hyperplane is specified by the equation:
f(x) = w*x + b = 0
where w is a vector perpendicular to the hyperplane and b measures how far offset it is from the origin.
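A minimal sketch of a linear SVM in scikit-learn; coef_ and intercept_ correspond to the w and b in the equation above (the blobs dataset is synthetic and chosen so that linear separability actually holds):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, so a separating hyperplane exists.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)

svm = SVC(kernel="linear")
svm.fit(X, y)

# The learned hyperplane f(x) = w*x + b = 0.
w, b = svm.coef_[0], svm.intercept_[0]
print("w =", w, "b =", b)
print(svm.predict(X[:5]), y[:5])  # which side of the hyperplane each point falls on
```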
What are three popular valid kernels, which are functions that take in two vectors?
- Polynomial kernel
- Gaussian kernel
- Sigmoid kernel
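In scikit-learn's SVC these correspond to values of the kernel parameter (there the Gaussian kernel goes by the name "rbf"); a quick sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

for kernel in ["poly", "rbf", "sigmoid"]:  # polynomial, Gaussian, sigmoid
    svm = SVC(kernel=kernel)
    svm.fit(X, y)
    print(kernel, svm.score(X, y))         # training accuracy, just to show each one runs
```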
Describe Logistic Regression.
Logistic regression is a great general-purpose classifier, striking an excellent balance between accurate classifications and real-world interpretability. It could be seen as kind of a nonbinary version of SVM, one that scores points with probabilities based on how far they are from the hyperplane, rather than using that hyperplane as a definitive cutoff.
If the training data is almost linearly separable, then all points that aren’t near the hyperplane will get a confident prediction near 0 or 1. But if the two classes bleed over the hyperplane a lot, the predictions will be more muted, and only points far from the hyperplane will get confident scores.
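The probability assigned to a point x has the form p(x) = 1 / (1 + exp(-(w*x + b))), so the score depends on how far x is from the hyperplane w*x + b = 0. A minimal sketch in scikit-learn, where predict_proba returns these scores (synthetic data again):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

logreg = LogisticRegression()
logreg.fit(X, y)

# Probabilities near 0 or 1 mean the point is far from the hyperplane;
# values near 0.5 mean it sits close to the boundary.
print(logreg.predict_proba(X[:5]))
```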
What is Lasso Regression?
Lasso regression is a variant of logistic regression. One of the problems with logistic regression is that you can have many different features all with modest weights, instead of a few clearly meaningful features with large weights.
In lasso regression, p(x) has the same functional form. However, we train it in a way that punishes modest-sized weights.
For example:
- if features i and j have large weights, but they usually cancel each other out when classifying a point, set both their weights to 0.
- if features i and j are highly correlated, you can reduce the weight for one while increasing the weight for the other and keeping predictions more or less the same.
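A hedged sketch: in scikit-learn, the same kind of sparsity-inducing penalty can be requested on LogisticRegression via penalty="l1" (C controls how hard the weights are punished, smaller meaning harder):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

plain = LogisticRegression().fit(X, y)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print(plain.coef_[0])  # typically many modest-sized weights
print(lasso.coef_[0])  # the L1 penalty tends to drive several weights exactly to 0
```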
Describe Naive Bayes.
Briefly, a Bayesian classifier operates on the following intuition: you start off with some initial confidence in the labels 0 and 1 (assume that it’s a binary classification problem). When new information becomes available, you adjust your confidence levels, depending on how likely that information is conditioned on each label. When you’ve gone through all available information, your final confidence levels are the probabilities of the labels 0 and 1.
The “naive” part is the assumption that all features in a dataset are independent of each other when you condition on the target variable.
What does a naive Bayesian classifier learn during the training phase?
- How common every label is in the whole training data
- For every feature Xi, its probability distribution when the label is 0.
- For every feature Xi, its probability distribution when the label is 1.
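A minimal sketch with scikit-learn's GaussianNB, which models each feature's per-label distribution as a Gaussian (just one common choice of distribution; the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

nb = GaussianNB()
nb.fit(X, y)

print(nb.class_prior_)          # how common each label is in the training data
print(nb.theta_)                # per-label mean of each feature's Gaussian
print(nb.predict_proba(X[:3]))  # final confidence levels for a few points
```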
What is a “perceptron”?
The simplest neural network is the perceptron. A perceptron is a network of “neurons”, each of which takes in multiple inputs and produces a single output.
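A deliberately bare-bones sketch of the classic perceptron learning rule in plain NumPy, assuming labels of +1 and -1 and a tiny made-up dataset (an illustration, not a production implementation):

```python
import numpy as np

# Tiny linearly separable toy data with labels +1 / -1.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])
b = 0.0

# Nudge the weights whenever a point is misclassified.
for _ in range(10):
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:  # misclassified (or exactly on the boundary)
            w += yi * xi
            b += yi

print(np.sign(X @ w + b), y)        # on this toy data the predictions match the labels
```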
What is a ROC curve?
A two-dimensional box where you treat the false positive rate as the x-coordinate and the true positive rate as the y-coordinate.
For example, you can compare classifiers by plotting their ROC curves (sweeping the classification threshold) and seeing which one has the best AUC metric.
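A minimal sketch with scikit-learn's roc_curve and roc_auc_score; the curve is traced out by sweeping the classification threshold over the classifier's scores (synthetic data, logistic regression as a stand-in classifier):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # confidence scores for the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # x = false positive rate, y = true positive rate
print(roc_auc_score(y_test, scores))              # AUC: a single number for comparing classifiers
```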