The Data Science Handbook-II Flashcards
What are the phases in the Data Science Road Map?
- Frame the problem
- Understand the data
- Extract features
- Model and analyse (loops back to Frame the problem)
- Present results or deploy code
What is data wrangling?
Data wrangling is the process of getting the data from its raw format into something suitable for more conventional analytics. This typically means creating a software pipeline that gets the data out of wherever it is stored, does any cleaning or filtering necessary, and puts it into a regular format.
What are the two things you typically get out of exploratory analysis?
- You develop an intuitive feel for the data, including what the salient patterns look like visually.
- You get a list of concrete hypotheses about what’s going on in the data.
What is exploratory analysis?
A stage of analysis that focuses on exploring the data to generate hypotheses about it. Exploratory analysis relies heavily on visualizations.
Deploy code stage.
If your ultimate clients are computers, then it is your job to produce code that will be run regularly in the future by other people. Typically, this code falls into one of two categories. What are those two categories?
- Batch analytics code.
- Real-time code.
What are five popular programming language options for data scientists?
- Python
- R
- MATLAB and Octave
- SAS
- Scala
How to identify pathologies early? Four tips.
- If the data is text, look directly at the raw file rather than just reading it into your script.
- Read supporting documentation, if it is available.
- Have a battery of standard diagnostic questions you ask about the data.
- Do sanity checks, where you use the data to derive things you already know.
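A minimal sketch of what a few of these diagnostics and sanity checks might look like in practice, assuming the data arrives as CSV text (the contents here are made up, with an inline string standing in for a real file on disk):

```python
import io

import pandas as pd

# Stand-in for a raw file on disk; in practice you would open the actual file.
raw = "date,temperature\n2021-01-01,3.5\n2021-01-02,4.1\n2021-01-03,MISSING\n"

# Look directly at the raw text before trusting any parser.
for line in raw.splitlines()[:5]:
    print(line)

# Standard diagnostics and sanity checks against things you already know.
df = pd.read_csv(io.StringIO(raw))
print(df.dtypes)      # the "MISSING" entry forces temperature to parse as strings, a red flag
print(df.describe())  # counts and summaries: do they match your expectations?
```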
What are eight examples of problems with data content?
- Duplicate entries
- Multiple entries for a single entity
- Missing entries
- NULLs
- Huge outliers
- Out-of-date data
- Artificial entries
- Irregular spacings
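A minimal sketch of how a few of these problems could be spotted with pandas, using a small made-up DataFrame (the column names are purely illustrative):

```python
import pandas as pd

# Made-up data containing a duplicate row, a NULL, and a suspiciously huge value.
df = pd.DataFrame({"id": [1, 1, 2, 3], "value": [10.0, 10.0, None, 9999.0]})

print(df.duplicated().sum())       # duplicate entries
print(df["value"].isnull().sum())  # missing entries / NULLs

# Huge outliers: flag points far outside the interquartile range.
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["value"] < q1 - 3 * iqr) | (df["value"] > q3 + 3 * iqr)])
```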
What is a regular expression?
A way to specify a general pattern that strings can match.
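A small illustration using Python's built-in re module (the pattern and strings here are made-up examples):

```python
import re

# Pattern for dates like "2021-03-15": four digits, dash, two digits, dash, two digits.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

print(bool(date_pattern.fullmatch("2021-03-15")))      # True: the whole string matches
print(bool(date_pattern.fullmatch("March 15, 2021")))  # False: doesn't fit the pattern
print(date_pattern.findall("logged 2021-03-15 and 2021-04-01"))  # all matching substrings
```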
Name four commonalities that pretty much all machine learning algorithms seem to work with.
- It’s all done using computers, leveraging them to do calculations that would be intractable by hand.
- It takes data as input. If you are simulating a system based on some idealized model, then you aren’t doing machine learning.
- The data points are thought of as being samples from some underlying “real-world” probability distribution.
- The data is tabular (or at least you can think of it that way). There is one row per data point and one column per feature. The features are all numerical, binary or categorical.
Name two types of machine learning.
- Supervised
- Unsupervised
What is supervised machine learning?
In supervised machine learning, your training data consists of some points and a label or target value associated with them. The goal of the algorithms is to figure out some way to estimate that target value.
What is unsupervised learning?
In unsupervised learning, there is just raw data, without any particular thing that is supposed to be predicted. Unsupervised algorithms are used for finding patterns in the data in general, teasing apart its underlying structure. Clustering algorithms are a prototypical example of unsupervised learning.
What are four ways to train on some of your data and assess performance on other data?
- Most basically, you randomly divide your data points between training and testing (see the sketch after this list).
- A fancier method that works specifically for supervised learning is called k-fold cross validation.
- If you’re very rigorous about your statistics, it is common to divide your data into a training set, a testing set, and a validation set.
- There is another approach where a model is retrained periodically, say every week, incorporating the new data acquired in the previous week.
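For the first of these approaches, a minimal sketch using scikit-learn's train_test_split (the dataset is synthetic and just stands in for real features and labels):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Small synthetic dataset standing in for real features X and labels y.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Randomly hold out 20% of the points for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```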
What is the goal of k-fold cross validation?
The goal of k-fold cross validation isn’t to measure the performance of a particular, fitted classifier, but rather a family of classifiers.
What are the steps in k-fold cross-validation?
- Divide the data randomly into k partitions.
- Train a classifier on all but one partition, and test its performance on the partition that was left out.
- Repeat, but choosing a different partition to leave out and test on. Continue for all the partitions, so that you have k different trained classifiers and k performance metrics for them.
- Take the average of the metrics. This is the best estimate of the “true” performance of this family of classifiers when it is trained on this kind of data.
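A minimal sketch of these steps using scikit-learn's cross_val_score (the data is synthetic, and logistic regression is just a stand-in for whatever family of classifiers you are evaluating):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# cross_val_score handles the partitioning, the k trainings, and the k evaluations.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)         # k = 5 performance metrics, one per held-out partition
print(scores.mean())  # the averaged estimate of the family's "true" performance
```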
A machine learning classifier is a computational object that has two stages. Which two?
- It gets “trained”. It takes in its training data, which is a bunch of data points and the correct label associated with them, and tries to learn some pattern for how the points map to the labels.
- Once it has been trained, the classifier acts as a function that takes in additional data points and outputs predicted classifications for them. Sometimes, the prediction will be a specific label; other times, it will give a continuous-valued number that can be seen as a confidence score for a particular label.
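In scikit-learn, for example, these two stages correspond to the fit and predict (or predict_proba) methods. A hedged sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic training data and a few additional synthetic points to classify.
X_train, y_train = make_classification(n_samples=100, n_features=4, random_state=0)
X_new, _ = make_classification(n_samples=5, n_features=4, random_state=1)

clf = LogisticRegression()
clf.fit(X_train, y_train)        # stage 1: train on points plus their correct labels

print(clf.predict(X_new))        # stage 2: specific predicted labels
print(clf.predict_proba(X_new))  # or continuous-valued confidence scores
```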
Describe a decision tree classifier.
Using a decision tree to classify a data point is the equivalent of following a basic flow chart. It consists of a tree structure. Every node in the tree asks a question about one feature of a data point.
If the feature is numerical, the node asks whether it is above or below a threshold, and there are child nodes for “yes” and “no”. If the feature is categorical, typically there will be a different node for each value it can take. A leaf node in the tree will be the score that is assigned to the point being classified (or several scores, one for each possible thing the point could be flagged as).
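A minimal decision tree sketch in scikit-learn on synthetic data; export_text prints the learned flow chart of threshold questions:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each internal node asks "is feature_i <= threshold?"; the leaves hold the scores.
print(export_text(tree))
```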
What is a random forest classifier?
A random forest is a collection of decision trees, each of which is trained on a random subset of the training data and only allowed to use some random subset of the features. There is no coordination in the randomization - a particular data point or feature could randomly get plugged into all the trees, none of the trees, or anything in between. The final classification score for a point is the average of the scores from all the trees.
One thing that you can do with a random forest is get a “feature importance” score for any feature in the dataset. In practice, you can often take this list of important features and, with a little bit of old-fashioned data analysis, figure out compelling real-world interpretations of what they mean. But the random forest itself tells you nothing about why a feature is important.
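A minimal sketch with scikit-learn's RandomForestClassifier on synthetic data; feature_importances_ is the score referred to above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One importance score per feature; interpreting *why* a feature matters is
# left to old-fashioned data analysis.
print(forest.feature_importances_)
```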
What are ensemble classifiers?
Random forests are the best-known example of what are called “ensemble classifiers,” where a wide range of classifiers (decision trees, in this case) are trained under randomly different conditions (in our case, random selections of data points and features) and their results are aggregated. Intuitively, the idea is that if every classifier is at least marginally good, and the different classifiers are not very correlated with each other, then the ensemble as a whole will reliably slouch toward the right classification. Basically, it’s using raw computational power in lieu of domain knowledge or mathematical sophistication, relying on the power of the law of large numbers.
What are two characteristics of Support Vector Machines?
- They make a very strong assumption about the data called linear separability.
- They are one of the few classifiers that are fundamentally binary; they don’t give continuous-valued “scores” that can be used to assess how confident the classifier is.
What is a Support Vector Machine (SVM)?
Essentially, you view every data point as a point in a d-dimensional space and then look for a hyperplane that separates the two classes. The assumption that there actually is such a hyperplane is called linear separability.
Training the SVM involves finding the hyperplane that (1) separates the datasets and (2) is “in the middle” of the gap between the two classes. Specifically, the “margin” of a hyperplane is min(its distance to the nearest point in class A, its distance to the nearest point in class B), and you pick the hyperplane that maximizes the margin.
Mathematically, the hyperplane is specified by the equation:
f(x) = w*x + b = 0
where w is a vector perpendicular to the hyperplane and b measures how far offset it is from the origin.
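A minimal sketch of a linear SVM in scikit-learn; coef_ and intercept_ correspond to the w and b in the equation above (the blobs dataset is synthetic and chosen so that linear separability actually holds):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, so a separating hyperplane exists.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)

svm = SVC(kernel="linear")
svm.fit(X, y)

# The learned hyperplane f(x) = w*x + b = 0.
w, b = svm.coef_[0], svm.intercept_[0]
print("w =", w, "b =", b)
print(svm.predict(X[:5]), y[:5])  # which side of the hyperplane each point falls on
```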
What are three popular valid kernels, which are functions that take in two vectors?
- Polynomial kernel
- Gaussian kernel
- Sigmoid kernel
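In scikit-learn's SVC these correspond to values of the kernel parameter (there the Gaussian kernel goes by the name "rbf"); a quick sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

for kernel in ["poly", "rbf", "sigmoid"]:  # polynomial, Gaussian, sigmoid
    svm = SVC(kernel=kernel)
    svm.fit(X, y)
    print(kernel, svm.score(X, y))         # training accuracy, just to show each one runs
```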
Describe Logistic Regression.
Logistic regression is a great general-purpose classifier, striking an excellent balance between accurate classifications and real-world interpretability. It could be seen as kind of a nonbinary version of SVM, one that scores points with probabilities based on how far they are from the hyperplane, rather than using that hyperplane as a definitive cutoff.
If the training data is almost linearly separable, then all points that aren’t near the hyperplane will get a confident prediction near 0 or 1. But if the two classes bleed over the hyperplane a lot, the predictions will be more muted, and only points far from the hyperplane will get confident scores.
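The probability assigned to a point x has the form p(x) = 1 / (1 + exp(-(w*x + b))), so the score depends on how far x is from the hyperplane w*x + b = 0. A minimal sketch in scikit-learn, where predict_proba returns these scores (synthetic data again):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

logreg = LogisticRegression()
logreg.fit(X, y)

# Probabilities near 0 or 1 mean the point is far from the hyperplane;
# values near 0.5 mean it sits close to the boundary.
print(logreg.predict_proba(X[:5]))
```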
What is Lasso Regression?
Lasso regression is a variant of logistic regression. One of the problems with logistic regression is that you can have many different features all with modest weights, instead of a few clearly meaningful features with large weights.
In lasso regression, p(x) has the same functional form. However, we train it in a way that punishes modest-sized weights.
For example:
- if features i and j have large weights, but they usually cancel each other out when classifying a point, set both their weights to 0.
- if features i and j are highly correlated, you can reduce the weight for one while increasing the weight for the other and keeping predictions more or less the same.
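A hedged sketch: in scikit-learn, the same kind of sparsity-inducing penalty can be requested on LogisticRegression via penalty="l1" (C controls how hard the weights are punished, smaller meaning harder):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

plain = LogisticRegression().fit(X, y)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print(plain.coef_[0])  # typically many modest-sized weights
print(lasso.coef_[0])  # the L1 penalty tends to drive several weights exactly to 0
```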
Describe Naive Bayes.
Briefly, a Bayesian classifier operates on the following intuition: you start off with some initial confidence in the labels 0 and 1 (assume that it’s a binary classification problem). When new information becomes available, you adjust your confidence levels, depending on how likely that information is conditioned on each label. When you’ve gone through all available information, your final confidence levels are the probabilities of the labels 0 and 1.
The “naive” part is the assumption that all features in a dataset are independent of each other when you condition on the target variable.
What does a naive Bayesian classifier learn during the training phase?
- How common every label is in the whole training data
- For every feature Xi, its probability distribution when the label is 0.
- For every feature Xi, its probability distribution when the label is 1.
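A minimal sketch with scikit-learn's GaussianNB, which models each feature's per-label distribution as a Gaussian (just one common choice of distribution; the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

nb = GaussianNB()
nb.fit(X, y)

print(nb.class_prior_)          # how common each label is in the training data
print(nb.theta_)                # per-label mean of each feature's Gaussian
print(nb.predict_proba(X[:3]))  # final confidence levels for a few points
```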
What is a “perceptron”?
The simplest neural network is the perceptron. A perceptron is a network of “neurons”, each of which takes in multiple inputs and produces a single output.
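A deliberately bare-bones sketch of the classic perceptron learning rule in plain NumPy, assuming labels of +1 and -1 and a tiny made-up dataset (an illustration, not a production implementation):

```python
import numpy as np

# Tiny linearly separable toy data with labels +1 / -1.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])
b = 0.0

# Nudge the weights whenever a point is misclassified.
for _ in range(10):
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:  # misclassified (or exactly on the boundary)
            w += yi * xi
            b += yi

print(np.sign(X @ w + b), y)        # on this toy data the predictions match the labels
```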
What is a ROC curve?
A two-dimensional box where you treat the false positive rate as the x-coordinate and the true positive rate as the y-coordinate.
For example, you can compare classifiers by plotting their ROC curves (sweeping the classification threshold) and seeing which one has the best AUC metric.
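A minimal sketch with scikit-learn's roc_curve and roc_auc_score; the curve is traced out by sweeping the classification threshold over the classifier's scores (synthetic data, logistic regression as a stand-in classifier):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # confidence scores for the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # x = false positive rate, y = true positive rate
print(roc_auc_score(y_test, scores))              # AUC: a single number for comparing classifiers
```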