Final Review Flashcards

1
Q

Probability Density Function (PDF)

A

Describes the density of probability around a particular point of a continuous variable. The total area under the PDF curve is 1.
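
A quick numerical check of the area property, as a rough sketch. The standard normal PDF and the grid from -8 to 8 are illustrative assumptions, not part of the card.

```python
import numpy as np

# Standard normal PDF evaluated on a fine grid (illustrative choice).
x = np.linspace(-8.0, 8.0, 16001)
dx = x[1] - x[0]
pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Riemann-sum approximation of the area under the curve.
area = np.sum(pdf) * dx
print(round(area, 4))  # should be very close to 1
```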

2
Q

Poisson Distribution

A

Used to model the number of events in a fixed period of time or a fixed area of space. The mean and the variance both equal lambda.

  • Number of people who enter a grocery store every hour.
  • Number of childbirths in Hawai’i every day.
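
The mean-equals-variance property can be checked by simulation. This is a minimal sketch; the rate of 4 arrivals per hour and the seed are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
lam = 4.0                       # hypothetical rate: 4 arrivals per hour

# Simulate 100,000 hours of arrivals.
counts = rng.poisson(lam, size=100_000)

sample_mean = counts.mean()
sample_var = counts.var()
print(sample_mean, sample_var)  # both should be close to lam
```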
3
Q

Negative Binomial Distribution

A

The trials in the sequence are independent. Counts the failures that occur before a fixed number of successes is reached: how many times you had to do something before the event occurred.

How many times did you have to flip a coin before it landed on heads?
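
The coin question can be simulated as a rough sketch. It assumes a fair coin and stops at the first head; NumPy's negative_binomial returns the number of failures seen before the requested number of successes.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
p = 0.5           # probability of heads on a fair coin (assumption)
n_successes = 1   # stop at the first head

# Each draw is the number of tails (failures) seen before the first head.
failures = rng.negative_binomial(n_successes, p, size=100_000)

# Theoretical mean number of failures: n_successes * (1 - p) / p
expected = n_successes * (1 - p) / p
print(failures.mean(), expected)
```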

4
Q

Binomial Distribution

A

Models the number of successes in an experiment of n fixed, independent trials.

Each trial has a probability of success p and a probability of failure 1 - p

Such as flipping a coin with heads being 1 and tails being 0. What is the probability of the coin landing on heads a given number of times in a certain number of flips?
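
A small worked example of the coin question, using the standard binomial formula. The helper name binomial_pmf and the numbers (3 heads in 5 flips) are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 5 flips of a fair coin.
prob = binomial_pmf(3, 5, 0.5)
print(prob)  # 0.3125
```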

5
Q

Normal Distribution

A
Continuous probability distribution.
Samples such as:
- Heights of males in a population.
- Errors in instrumentation.
- Total sales
Bell-shaped, symmetric curve for continuous variables.
6
Q

Probability Mass Function

A

Probability distribution for a discrete random variable.
Takes a discrete random variable X and assigns a probability to each value in its sample space. Commonly represented with a histogram.
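
An empirical PMF can be built from observed counts. A minimal sketch, assuming ten made-up die rolls:

```python
import pandas as pd

# Hypothetical sample: outcomes of ten rolls of a die.
rolls = pd.Series([1, 3, 3, 6, 2, 3, 6, 1, 5, 3])

# Empirical PMF: the share of observations at each distinct value.
pmf = rolls.value_counts(normalize=True).sort_index()
print(pmf)
```

The probabilities add up to 1, and the value 3 (seen 4 times out of 10) gets probability 0.4.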

7
Q

Discrete Data

A

Only takes particular values. May potentially be an infinite number of values, but each is distinct. Can be numeric but also categorical.

8
Q

Continuous Data

A

Not restricted to defined separate values, but can occupy any value over a continuous range. Between any two continuous data values there may be an infinite number of others. Always numeric.

9
Q

Kernel Density Estimation

A

Non-parametric way to estimate the probability density function (pdf) of a random variable. A histogram depends on the number of bins and plots the data as bars; a kernel density estimate is the smooth equivalent of that picture. It lets you see the shape of the underlying probability model and tells you where the bulk of the probability lies under the curve.

10
Q

Top-Hat Kernel

A

Used to bypass bin boundaries. Each entry gets its own block centered on it; where blocks overlap in the plot, they are stacked on top of each other. This represents the data more accurately than fixed bins, but the result is still rough. The height at any position equals the sum of the overlapping blocks.

11
Q

Gaussian Kernel

A

Each data point contributes a Gaussian (normal) pdf centered on it. The value of the estimate at position x is simply the sum of the pdfs that overlap x. This results in a smoother distribution.
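
A hand-rolled sketch of the idea, not a library implementation. The function name gaussian_kde_sketch, the data, and the bandwidth of 0.5 are assumptions; the sum of pdfs is divided by the number of points so that the estimate integrates to 1.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average of Gaussian pdfs, one centered on each data point."""
    # Scaled distance from every grid point to every data point.
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Sum over data points, divided by their count.
    return kernels.mean(axis=1)

data = np.array([1.0, 1.5, 2.0, 4.0, 4.2])   # made-up observations
grid = np.linspace(-4.0, 10.0, 1401)
density = gaussian_kde_sketch(data, grid, bandwidth=0.5)

# The estimate is itself a density: its area is approximately 1.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```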

12
Q

Bandwidth

A

Controls how wide the kernel placed on each data point is, and so how much the graph is smoothed. A higher bandwidth means a smoother curve; a lower one means a more jagged curve with sudden changes. Too large a bandwidth makes the curve so smooth that the structure of the data becomes indiscernible.
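
One way to see the effect is to count the bumps in a Gaussian kernel estimate at two bandwidths. This is a rough sketch; the helper names, the data, and the bandwidths 0.2 and 3.0 are all made up for illustration.

```python
import numpy as np

def kde(data, grid, bandwidth):
    """Gaussian kernel density estimate evaluated on grid."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    return (np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))).mean(axis=1)

def count_peaks(curve):
    """Number of interior points strictly higher than both neighbours."""
    inner = curve[1:-1]
    return int(np.sum((inner > curve[:-2]) & (inner > curve[2:])))

data = np.array([1.0, 2.0, 3.0, 6.0, 7.0])   # made-up observations
grid = np.linspace(-2.0, 10.0, 1201)

peaks_narrow = count_peaks(kde(data, grid, bandwidth=0.2))  # jagged curve
peaks_wide = count_peaks(kde(data, grid, bandwidth=3.0))    # heavily smoothed curve
print(peaks_narrow, peaks_wide)
```

With the narrow bandwidth every observation produces its own bump; with the wide one the two clusters blur into a single broad hump.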

13
Q

Cumulative Distribution Function (CDF)

A

Function that maps a value to its percentile rank. The input is a value x and the output is the fraction of the distribution at or below x, a number between 0 and 1.

For discrete data the graph is a step function.
Answers:
- What is the fraction of the events that have occurred to the left or below of x?
- Where the bulk of the data lies on the Kernel Density.
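
An empirical version of the first question can be sketched in a few lines. The function name empirical_cdf and the scores are made up for illustration.

```python
import numpy as np

def empirical_cdf(sample, x):
    """Fraction of observations less than or equal to x."""
    sample = np.asarray(sample)
    return np.mean(sample <= x)

scores = [55, 62, 70, 70, 81, 90, 95, 99]   # made-up data
rank = empirical_cdf(scores, 70)
print(rank)  # 0.5, since 4 of the 8 scores are at or below 70
```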

14
Q

Series

A

Can be thought of as a column of a data frame.

X = pd.Series([6,3,4,6])

15
Q

Dataframe

A

Two dimensional table data structure with labeled axes.

pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)), columns=['a', 'b', 'c', 'd', 'e'])

16
Q

.loc

A
Label-based, but may also be used with a boolean array.
Gets rows (or columns) with particular labels from the index.
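
A short sketch of both uses. The frame, its labels, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"age": [31, 45, 27], "city": ["Hilo", "Kona", "Lihue"]},
    index=["ann", "bob", "cy"],
)

row = df.loc["bob"]                      # one row, selected by label
cell = df.loc["ann", "city"]             # one cell: row label, column label
over_30 = df.loc[df["age"] > 30, "age"]  # a boolean array picks the rows
print(cell, list(over_30.index))
```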
17
Q

.iloc

A

Integer location. Primarily integer-position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Gets rows (or columns) at particular positions in the index (so it only takes integers).

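Getting the row at position 5 and the column at position 3 in a dataframe (positions count from 0):
Dataframe.iloc[5,3]

To get a position without hardcoding it:
Dataframe.iloc[3,n], where n holds the column position; a list of positions may be passed in place of n.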
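
One possible way to avoid the hardcoded position is to look it up from the column name. A sketch with a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up 6x5 frame holding the numbers 0 to 29.
df = pd.DataFrame(np.arange(30).reshape(6, 5), columns=list("abcde"))

value = df.iloc[5, 3]          # row position 5, column position 3
n = df.columns.get_loc("d")    # look up the position instead of hardcoding 3
same_value = df.iloc[5, n]
print(value, same_value)  # 28 28
```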

18
Q

Slice notation [x:y]

A

States that x is the starting index and y is the ending index. In plain Python and with .iloc it gathers x and everything after it up to, but not including, y. With .loc, which slices by label, both x and y are included.

[:2] = Start from the very beginning and stop before index 2.
[2:] = Start from index two and gather everything past that until the very end.
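
A plain Python list shows the behavior; note that in Python the end position of a slice is not included. The list contents are arbitrary.

```python
letters = ["a", "b", "c", "d", "e"]

first_two = letters[:2]    # positions 0 and 1; position 2 is not included
from_two = letters[2:]     # position 2 through the end
middle = letters[1:3]      # positions 1 and 2
print(first_two, from_two, middle)
```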
19
Q

.unique()

A

Called on a column; checks that column for unique values and returns an array of the unique values found, in order of first appearance.
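
For example, with a hypothetical "island" column:

```python
import pandas as pd

df = pd.DataFrame({"island": ["Oahu", "Maui", "Oahu", "Kauai", "Maui"]})

# Distinct values, in order of first appearance.
islands = df["island"].unique()
print(islands)
```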

20
Q

.groupby( )

A

Hierarchical Index Groupby:
Data.groupby(['col1', 'col2']).mean()

Dataframe Groupby:
Data.groupby(['col1', 'col2'])['col3'].mean()

Split the dataset by some parameter, apply a function (such as a mean) to each group, then combine the results. The new index becomes the values on which you split.
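
A small sketch of split, apply, combine. The column names follow the card; the values are made up.

```python
import pandas as pd

data = pd.DataFrame({
    "col1": ["x", "x", "y", "y"],
    "col2": ["a", "a", "a", "b"],
    "col3": [1.0, 3.0, 5.0, 7.0],
})

# Split on (col1, col2), take the mean of col3 in each group, combine.
means = data.groupby(["col1", "col2"])["col3"].mean()
print(means)
```

The result is indexed by the (col1, col2) pairs that the data was split on.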

21
Q

Correlation coefficient

A

Number between -1 and 1 calculated so as to represent the linear dependence of two variables or sets of data.

R, R^2

Simple linear regression uses this linear relationship to predict a quantitative response given an input X.
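
R and R^2 can be computed with NumPy. A minimal sketch on made-up data that roughly follows y = 2x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # made-up, roughly y = 2x

# corrcoef returns a 2x2 matrix; the off-diagonal entry is R.
r = np.corrcoef(x, y)[0, 1]
r_squared = r**2
print(r, r_squared)
```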

22
Q

.astype()

A

Converts data from one type to another, for example to a categorical type.

df["b"] = df["a"].astype('category')

23
Q

Linear Regression

A

Y = ax + b

Y = a1 x1 + a2 x2 + b

The outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values.
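
Fitting Y = ax + b can be sketched with a least-squares line fit. np.polyfit is one option among several, and the data below is made up to lie exactly on y = 2x + 1.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0     # points that lie exactly on a line

# Degree-1 least-squares fit returns the slope a, then the intercept b.
a, b = np.polyfit(x, y, deg=1)

# Predict the continuous outcome for a new input.
prediction = a * 5.0 + b
print(a, b, prediction)
```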

24
Q

Logistic Regression

A
  • Target variable is binary
  • Predictive features are interval (continuous) or categorical
  • Features are independent of one another.

Used when the response variable is categorical in nature.

Uses:

  • Whether a customer will convert to an offer.
  • Predict and preempt customer churn.
  • Clinical testing to predict whether a new drug will cure the average patient.
25
Q

Bootstrap

A
  • Size of samples
  • Number of replicates (a reasonable amount)

Taking a series of random samples, drawn with replacement from the original data, that can be used to estimate how much a statistic varies.

Example:
Bootstrap_data_1 = inData.sample(inData.shape[0], replace=True)

Using the .sample() function a group of random samples can be taken from a given dataset.
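
Repeating that resampling step gives the replicates. A rough sketch, assuming a made-up inData frame, the mean as the statistic, and 1000 replicates:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # arbitrary seed
inData = pd.DataFrame({"value": rng.normal(50.0, 10.0, size=200)})  # made-up data

n_replicates = 1000
boot_means = np.empty(n_replicates)
for i in range(n_replicates):
    # Resample as many rows as the original, with replacement.
    resample = inData.sample(inData.shape[0], replace=True, random_state=i)
    boot_means[i] = resample["value"].mean()

# Middle 95% of the replicate means.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)
```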

26
Q

Curse of dimensionality

A

Refers to how certain learning algorithms may perform poorly in high-dimensional data.

27
Q

Principal Component Analysis

A

Technique used to emphasize variation and bring out strong patterns in a dataset.

Useful for eliminating dimensions: the data are described by their values along a new set of perpendicular lines (the principal components), and the lines that carry little variation can be dropped.
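
A bare-bones sketch using the eigen-decomposition of the covariance matrix, which is one common way to compute PCA. The two-column data set is made up so that almost all of the variation lies along one direction.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
# Made-up 2-D data whose two columns are strongly correlated.
t = rng.normal(size=500)
X = np.column_stack([t, 2.0 * t + rng.normal(scale=0.1, size=500)])

# Center the data, then decompose its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Share of total variance captured by the largest component.
explained = eigenvalues[-1] / eigenvalues.sum()

# Project onto the first principal component: 2 dimensions become 1.
projected = Xc @ eigenvectors[:, -1]
print(explained, projected.shape)
```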

28
Q

Eigenvector

A

Characteristic vector of a linear transformation, a non-zero vector that only changes by a scalar factor when that linear transformation is applied to it.

When it corresponds to a real nonzero eigenvalue, it points in a direction that is stretched by the transformation, and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed.

29
Q

Eigenvalue

A

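The scalar, taken from the field F, that is the characteristic root associated with the eigenvector v. It is the factor by which v is scaled when the linear transformation is applied.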
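
The defining relation, A v = lambda v, can be checked numerically. A small sketch with a made-up diagonal matrix that stretches one direction by 2 and reverses the other:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, -1.0]])

# Columns of `eigenvectors` pair up with the entries of `eigenvalues`.
eigenvalues, eigenvectors = np.linalg.eig(A)

# Applying A to an eigenvector only rescales it by its eigenvalue.
checks = [np.allclose(A @ v, lam * v)
          for lam, v in zip(eigenvalues, eigenvectors.T)]
print(eigenvalues, checks)
```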