Final Review Flashcards by Kurt Noe

Probability Density Function (PDF)

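Gives the probability density of a continuous random variable around a particular point. Probabilities come from areas under the curve, not from the height at a single point. The total area under the PDF curve is 1.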

Poisson Distribution

Used to model the number of events that occur in a fixed period of time or a fixed region of space. The mean and the variance both equal lambda (λ), the average number of events per interval.

Number of people who enter a grocery store every hour.
Number of childbirths in Hawai’i every day.
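
As a minimal sketch of the statement that the mean and variance both equal lambda, the snippet below evaluates the Poisson probabilities with the standard library and sums enough terms to approximate both moments. The rate `lam = 4.0` and the helper name `poisson_pmf` are illustrative assumptions, not part of the original card.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 4.0        # assumed rate, e.g. 4 shoppers per hour (illustrative value)
ks = range(100)  # the tail beyond k = 100 is negligible for lam = 4

# Mean and variance computed directly from the probabilities.
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(round(mean, 6), round(variance, 6))
```

Both printed values should agree with `lam` up to rounding.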

Negative Binomial Distribution

The trials in the sequence are independent. Counts the number of failures before a fixed number of successes is reached, i.e., how many times you had to do something before the event occurred the required number of times.

How many times did you have to flip a coin before it landed on heads?
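
The coin-flip question can be sketched as a small simulation: keep flipping until the required number of heads has appeared and count the tails along the way. The function name `failures_before_successes`, the seed, and the number of repetitions are assumptions made for illustration.

```python
import random

def failures_before_successes(r, p, rng):
    """Count failures (tails) seen before the r-th success (heads)."""
    failures = 0
    successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = [failures_before_successes(r=1, p=0.5, rng=rng) for _ in range(10_000)]
avg_failures = sum(trials) / len(trials)

# Theory: the expected number of failures is r * (1 - p) / p = 1.0 here.
print(avg_failures)
```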

Binomial Distribution

Models the number of successes in an experiment of n fixed, independent trials.

Each trial has a probability of success p and a probability of failure 1 - p

Such as flipping a coin, with heads counted as 1 and tails as 0: what is the probability of getting a given number of heads in a certain number of flips?
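
A short sketch of the coin example, assuming a fair coin flipped 10 times (both numbers are made up for illustration). The helper `binomial_pmf` is a hypothetical name; it computes the probability of exactly k successes in n independent trials.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5  # assumed: 10 flips of a fair coin

prob_5_heads = binomial_pmf(5, n, p)  # 252 of the 1024 equally likely outcomes
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))  # all outcomes together

print(prob_5_heads, total)
```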

Normal Distribution

Continuous probability distribution.
Examples:
- Heights of males in a population.
- Errors in instrumentation.
- Total sales.
Bell-shaped, symmetric curve for continuous variables.

Probability Mass Function

Probability distribution for a discrete random variable.
Takes a discrete random variable X and assigns a probability to each value in its sample space. Commonly represented with a histogram-style bar chart.

Discrete Data

Only takes particular values. May potentially be an infinite number of values, but each is distinct. Can be numeric but also categorical.

Continuous Data

Not restricted to defined separate values, but can occupy any value over a continuous range. Between any two continuous data values there may be an infinite number of others. Always numeric.

Kernel Density Estimation

Non-parametric way to estimate the probability density function (PDF) of a random variable. A histogram depends on the number of bins and yields a bar chart; a kernel density estimate instead places a kernel on every data point and sums them, which gives the equivalent of the probability model as a curve. It shows where the bulk of the probability lies under the curve.

Top-Hat Kernel

Used to bypass bin boundaries. Each data point contributes a block centered on the point itself; where blocks overlap in the plot they are stacked on top of each other, so the height at any position equals the sum of the overlapping blocks. This represents the data more accurately than fixed bins, but the result is still a rough, blocky curve.

Gaussian Kernel

Each data point contributes a Gaussian PDF centered on it. The estimate at position x is simply the sum of the PDFs that overlap x. This results in a smoother distribution.

Bandwidth

The width of the kernel placed on each data point, which controls how much the estimate is smoothed. A higher bandwidth gives a smoother curve; a lower one gives a more jagged curve with sudden changes. If the bandwidth is too large, the curve becomes so smooth that the structure of the data is indiscernible.
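
The kernel and bandwidth cards can be sketched together in a few lines of plain Python. This is an illustrative hand-rolled estimator under stated assumptions, not a library API: the function names `top_hat`, `gaussian`, and `kde`, the sample values, and the bandwidths are all made up for the example.

```python
import math

def top_hat(u):
    """Block kernel: constant height inside |u| <= 1, zero outside."""
    return 0.5 if abs(u) <= 1 else 0.0

def gaussian(u):
    """Standard normal PDF, used as a smooth kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, bandwidth, kernel):
    """Density estimate at x: sum of kernels centered on each data point, normalized."""
    total = sum(kernel((x - xi) / bandwidth) for xi in data)
    return total / (len(data) * bandwidth)

data = [1.0, 1.5, 2.0, 6.0]  # made-up sample with a gap around x = 4

# Three top-hat blocks overlap at x = 1.5, so their heights stack.
blocks = kde(1.5, data, bandwidth=0.5, kernel=top_hat)

# In the gap, a narrow bandwidth sees almost nothing; a wide one smooths over it.
narrow = kde(4.0, data, bandwidth=0.3, kernel=gaussian)
wide = kde(4.0, data, bandwidth=3.0, kernel=gaussian)

print(blocks, narrow, wide)
```

Comparing `narrow` and `wide` shows the trade-off: the small bandwidth keeps the gap visible, while the large one spreads density into a region with no data.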

Cumulative Distribution Function (CDF)

Function that maps a value to its percentile rank. The input is a value x and the output is its percentile rank Z, the fraction of the distribution at or below x.

For discrete data the graph is a step function.
Answers:
- What fraction of the events fall to the left of, or at, x?
- Where does the bulk of the data lie? The CDF rises most steeply where the kernel density estimate is highest.
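
A minimal sketch of an empirical CDF, assuming a small made-up sample. The helper name `empirical_cdf` is hypothetical; it returns a function that reports the fraction of sample values at or below x, which is why the graph moves in steps.

```python
import bisect

def empirical_cdf(sample):
    """Return a function x -> fraction of sample values that are <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # bisect_right gives the count of values <= x in the sorted sample.
        return bisect.bisect_right(ordered, x) / n

    return cdf

cdf = empirical_cdf([6, 3, 4, 6])  # made-up sample

print(cdf(2), cdf(4), cdf(5), cdf(6))
```

The value stays flat between observed data points (here between 4 and 6) and jumps at each one.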

Series

Can be thought of as a single column of a DataFrame.

X = pd.Series([6,3,4,6])

Dataframe

Two-dimensional tabular data structure with labeled axes.

pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)), columns=['a', 'b', 'c', 'd', 'e'])

.loc

Label-based, but may also be used with a boolean array.
Gets rows (or columns) with particular labels from the index.

.iloc

Integer location. Primarily integer-position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Gets rows (or columns) at particular positions in the index (so it only takes integers).

Getting the row at position 5 and the column at position 3 in a dataframe (positions start at 0):
Dataframe.iloc[5, 3]

To select a position without hardcoding it, pass a variable n that holds the position, or a list of positions:
Dataframe.iloc[3, n] or Dataframe.iloc[3, [...]]
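
The difference between label-based and position-based selection can be sketched on a tiny frame. The column names, row labels, and values below are made up; `Index.get_loc` is used here as one possible way to look up a position instead of hardcoding it.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 20, 30], "b": [1, 2, 3]},
    index=["x", "y", "z"],  # made-up row labels
)

by_label = df.loc["y", "b"]    # row labelled "y", column labelled "b"
by_position = df.iloc[1, 1]    # second row, second column (zero-based positions)
masked = df.loc[df["a"] > 15]  # boolean array keeps the rows where "a" exceeds 15

n = df.columns.get_loc("b")    # look the column position up instead of hardcoding it
same_cell = df.iloc[1, n]

print(by_label, by_position, same_cell)
print(masked)
```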

Slice notation [x:y]

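States that x is the starting index and y is the ending index; the slice gathers the starting element and everything after it up to y. In plain Python and with .iloc the ending index y itself is excluded; with label-based .loc slicing both endpoints are included.

[:2] = Start from the very beginning and stop just before index 2 (positions 0 and 1).
[2:] = Start from index 2 and gather everything from there until the very end.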
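
A short sketch of how the endpoints behave, assuming a made-up list of letters and a Series with made-up integer labels. In plain Python and with `.iloc` the stop position is left out; a `.loc` label slice keeps both ends.

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e"]

first_two = letters[:2]  # positions 0 and 1; position 2 is excluded
from_two = letters[2:]   # position 2 through the end
middle = letters[1:3]    # positions 1 and 2

s = pd.Series(letters, index=[10, 20, 30, 40, 50])  # made-up labels

by_position = s.iloc[1:3]  # stop position excluded
by_label = s.loc[20:40]    # both labels included

print(first_two, from_two, middle)
print(list(by_position), list(by_label))
```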

.unique()

Called on a column: checks that column for unique values and returns an array of the distinct values found.

.groupby( )

Hierarchical Index Groupby:
Data.groupby(['col1', 'col2']).mean()

Dataframe Groupby:
Data.groupby(['col1', 'col2'])['col3'].mean()

Split the dataset by some parameter, apply a function to each group (for example, to find the mean), then combine the results back together. The new index becomes the values on which you split.
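
The split, apply, combine steps can be sketched on a made-up frame that reuses the card's column names `col1`, `col2`, and `col3`; the values are invented for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "col1": ["x", "x", "y", "y"],
    "col2": ["p", "p", "p", "q"],
    "col3": [1.0, 3.0, 5.0, 7.0],
})

# Split on (col1, col2), take the mean of col3 in each group, combine the results.
means = data.groupby(["col1", "col2"])["col3"].mean()

print(means)
print(means.loc[("x", "p")])  # the group keys now form the (hierarchical) index
```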

Correlation coefficient

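Number between -1 and 1 that measures the strength and direction of the linear dependence between two variables or sets of data.

R, R^2

In simple linear regression, which predicts a quantitative response from an input X, R is the correlation coefficient and R^2 is the fraction of the variance in the response that the fitted line explains.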
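
As a minimal sketch, the Pearson correlation coefficient can be computed by hand with the standard library. The helper name `pearson_r` and the two small data lists are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_squared = r ** 2

# A perfectly decreasing relationship sits at the lower bound.
r_negative = pearson_r(xs, [-x for x in xs])

print(r, r_squared, r_negative)
```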

.astype()

Allows you to convert data to a different type (dtype), such as category.

df["b"] = df["a"].astype('category')

Linear Regression

Y = ax + b

Y = a1 x1 + a2 x2 + b

The outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values.
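
A minimal sketch of fitting Y = ax + b by ordinary least squares in plain Python. The helper name `fit_line` and the data points are assumptions; the points are chosen to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var_x
    b = mean_y - a * mean_x
    return a, b

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]  # made-up points lying exactly on y = 2x + 1

a, b = fit_line(xs, ys)
prediction = a * 4 + b  # continuous outcome predicted for a new input x = 4

print(a, b, prediction)
```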

Logistic Regression

Target variable is binary
Predictive features are interval (continuous) or categorical
Features are independent of one another.

Used when the response variable is categorical in nature.

Uses:

Predict whether a customer will convert on an offer.
Predict and preempt customer churn.
Clinical testing to predict whether a new drug will cure the average patient.

Bootstrap

Resampling technique: repeatedly take random samples, with replacement, from the original dataset and use them to estimate a statistic.

Choices to make:
- Size of the samples.
- Number of replicates (a reasonable amount).

Example:
Bootstrap_data_1 = inData.sample(inData.shape[0], replace=True)

Using the .sample() function, a random sample can be taken from a given dataset; replace=True samples with replacement.
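
The single `.sample()` call can be extended into a small bootstrap loop. This is a sketch under assumptions: the `value` column, its contents, the replicate count of 200, and the use of `random_state` for repeatability are all illustrative choices, not part of the original card.

```python
import pandas as pd

inData = pd.DataFrame({"value": [2, 4, 4, 5, 7, 9]})  # made-up data

n_replicates = 200  # assumed; choose enough replicates for a stable estimate
replicate_means = []
for i in range(n_replicates):
    # Each replicate has the same size as the original and is drawn with replacement.
    replicate = inData.sample(inData.shape[0], replace=True, random_state=i)
    replicate_means.append(replicate["value"].mean())

# The spread of the replicate means estimates the uncertainty of the sample mean.
spread = pd.Series(replicate_means).std()

print(spread)
```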

Curse of dimensionality

Refers to how certain learning algorithms may perform poorly on high-dimensional data.

Principal Component Analysis

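Technique used to emphasize variation and bring out strong patterns in a dataset. Useful for eliminating dimensions: for example, two-dimensional data is re-expressed along a pair of new perpendicular lines (the principal components) in place of the original x and y axes, and the line that captures little variation can be dropped.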

Eigenvector

Characteristic vector of a linear transformation: a non-zero vector that changes only by a scalar factor when that linear transformation is applied to it. When it corresponds to a real, nonzero eigenvalue, it points in a direction that is stretched by the transformation, and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed.

Eigenvalue

The scalar λ, an element of the field F, that is the characteristic root associated with the eigenvector v: applying the transformation to v gives λv.
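
A quick check of the defining relation A v = λ v, assuming a small made-up 2x2 matrix. The helper name `mat_vec` is hypothetical. One eigenvalue is positive (the direction is stretched) and one is negative (the direction is reversed).

```python
def mat_vec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[1, 2],
     [2, 1]]  # made-up symmetric matrix

v1, lam1 = [1, 1], 3    # stretched by a factor of 3
v2, lam2 = [1, -1], -1  # same line, direction reversed

stretched = mat_vec(A, v1)
reversed_dir = mat_vec(A, v2)

print(stretched, [lam1 * x for x in v1])
print(reversed_dir, [lam2 * x for x in v2])
```

In each printed pair the two vectors match, which is the eigenvector condition.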

Final Review Flashcards

(29 cards)