Final Review Flashcards
(29 cards)
Probability Density Function (PDF)
Gives the probability density of a continuous random variable around a particular point. The total area under the PDF curve is 1.
Poisson Distribution
Used in experiments to model the number of events in a fixed period of time or a fixed area of space. The mean and the variance are both equal to lambda.
- Number of people who enter a grocery store every hour.
- Number of childbirths in Hawai’i every day.
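A quick simulation can make the mean-equals-variance property concrete. This is a minimal sketch using NumPy; the rate of 4 events per hour, the sample size, and the seed are assumed values chosen only for illustration.

```python
import numpy as np

# Assumed for illustration: an average of 4 arrivals per hour.
lam = 4.0
rng = np.random.default_rng(seed=0)

# Draw 100,000 simulated hourly counts from a Poisson distribution.
counts = rng.poisson(lam=lam, size=100_000)

sample_mean = counts.mean()
sample_var = counts.var()

# Both values should land close to lam.
print(sample_mean, sample_var)
```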
Negative Binomial Distribution
The trials in the sequence are independent. Counts the failures that occur before a fixed number of successes is reached, that is, how many times you did something until the event occurred.
How many times did you have to flip a coin before it landed on heads?
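The coin question can be simulated directly. The sketch below assumes a fair coin (p = 0.5) and stops at the first head (r = 1); both values and the seed are illustrative. NumPy's `negative_binomial` returns the number of failures before the r-th success.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed setup: a fair coin (p = 0.5), stop at the first head (r = 1).
r, p = 1, 0.5

# Each draw is the number of failures (tails) before the r-th success (head).
failures = rng.negative_binomial(n=r, p=p, size=100_000)

# Theoretical mean number of failures is r * (1 - p) / p.
expected_failures = r * (1 - p) / p
print(failures.mean(), expected_failures)
```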
Binomial Distribution
Models the number of successes in an experiment of n fixed, independent trials.
Each trial has a probability of success p and a probability of failure 1 - p
Such as flipping a coin, with heads being 1 and tails being 0: what is the probability of the coin landing on heads a given number of times in a certain number of flips?
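The binomial probability can be computed from its formula, C(n, k) * p^k * (1 - p)^(n - k). The helper name `binomial_pmf` below is hypothetical, and the 5-flip fair-coin case is just an illustrative choice.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 5 flips of a fair coin.
prob_3_heads = binomial_pmf(k=3, n=5, p=0.5)
print(prob_3_heads)

# The probabilities over every possible k add up to 1.
total = sum(binomial_pmf(k, n=5, p=0.5) for k in range(6))
```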
Normal Distribution
Continuous probability distribution. Samples such as:
- Heights of males in a population.
- Errors in instrumentation.
- Total sales.
Bell-shaped, symmetric curve for continuous events.
Probability Mass Function
Probability distribution for a discrete random variable.
Takes a discrete random variable X and assigns a probability to each value in its sample space. Commonly represented with a histogram.
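An empirical PMF can be estimated from observed data by counting how often each value appears. This sketch assumes pandas and uses a made-up series of die rolls.

```python
import pandas as pd

# Made-up observations of a discrete variable (die rolls).
rolls = pd.Series([1, 2, 2, 3, 3, 3, 6, 6])

# normalize=True turns raw counts into relative frequencies.
pmf = rolls.value_counts(normalize=True).sort_index()
print(pmf)
```

Each index entry is a value of X and each entry in the series is its estimated probability; the probabilities sum to 1.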
Discrete Data
Only takes particular values. May potentially be an infinite number of values, but each is distinct. Can be numeric but also categorical.
Continuous Data
Not restricted to defined separate values, but can occupy any value over a continuous range. Between any two continuous data values there may be an infinite number of others. Always numeric.
Kernel Density Estimation
Non-parametric way to estimate the probability density function (pdf) of a random variable. Where a histogram depends on the number of bins and produces a bar chart, a kernel density estimate places a kernel on each data point and sums the kernels into a curve. Allows you to see the equivalent of the probability model. Tells you where the bulk of the probability resides under the curve.
Top-Hat Kernel
Used to bypass bin boundaries. Each entry is drawn as a block centered on its own value; where blocks overlap in the plot, they are stacked on top of each other. This represents the data more accurately than fixed bins, but the result is much rougher than a smooth kernel. Heights equal the sum of the overlapping blocks.
Gaussian Kernel
Each data point contributes a Gaussian (normal) pdf centered on that point. The estimate at position x is simply the sum of the pdfs that overlap x. This results in a smoother distribution.
Bandwidth
Controls the width of each kernel, and therefore how smooth the estimate is. Higher bandwidth means a smoother curve; lower bandwidth means a more jagged curve with sudden changes. If the bandwidth is too large, the curve will be too smooth and the features of the data will be indiscernible.
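The effect of bandwidth on a Gaussian kernel estimate can be sketched by hand in NumPy. The function name `gaussian_kde_sketch`, the sample values, and the two bandwidths are assumptions for illustration; the sum of kernels is divided by the number of points so that the area under the curve stays at 1.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian pdfs centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] is the scaled distance from grid point i to sample j.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

samples = [1.0, 1.5, 2.0, 4.0, 4.2]   # made-up data
grid = np.linspace(-5, 10, 1501)

narrow = gaussian_kde_sketch(samples, grid, bandwidth=0.2)  # jagged, peaky
wide = gaussian_kde_sketch(samples, grid, bandwidth=1.5)    # smooth, flat
```

The narrow bandwidth produces tall separate peaks around the clusters of points, while the wide bandwidth blurs them into a single low hump.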
Cumulative Distribution Function (CDF)
Function that maps a value to its percentile rank. The input is a value x and the output is the percentile rank of x.
For discrete data, the graph is a step function.
Answers:
- What is the fraction of the events that have occurred to the left or below of x?
- Where does the bulk of the data lie on the kernel density?
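For a finite sample, the CDF at x is the fraction of observations at or below x. A minimal sketch, with a hypothetical helper `empirical_cdf` and made-up data:

```python
import numpy as np

def empirical_cdf(data, x):
    """Fraction of observations less than or equal to x."""
    data = np.asarray(data)
    return float(np.mean(data <= x))

data = [2, 3, 3, 5, 7, 8, 9, 10]

# Four of the eight values (2, 3, 3, 5) are at or below 5.
at_five = empirical_cdf(data, 5)
print(at_five)
```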
Series
Can be thought of as a column of a data frame.
X = pd.Series([6,3,4,6])
Dataframe
Two dimensional table data structure with labeled axes.
pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)), columns=['a', 'b', 'c', 'd', 'e'])
.loc
Label-based, but may also be used with a boolean array. Gets rows (or columns) with particular labels from the index.
.iloc
Integer location. Primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
Gets rows (or columns) at particular positions in the index (so it only takes integers).
Getting the row at position 5, column at position 3 in a dataframe (positions count from 0):
Dataframe.iloc[5,3]
To get a value without hardcoding the column position, pass a list of positions or a variable n:
Dataframe.iloc[3,[]] or Dataframe.iloc[3,n]
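The difference between label-based and position-based access is easiest to see side by side. This sketch uses a small made-up frame; `get_loc` is one way to look up a column position instead of hardcoding it.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]},
    index=["x", "y", "z"],
)

by_label = df.loc["y", "b"]     # row labeled "y", column labeled "b"
by_position = df.iloc[1, 1]     # second row, second column (zero-based)

n = df.columns.get_loc("b")     # look up the column position by name
same_value = df.iloc[1, n]

# A boolean array keeps only the rows where the condition is True.
big_rows = df.loc[df["a"] > 15]
```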
Slice notation [x:y]
States that x is the starting index and y is the ending index. It gathers the starting index and everything after it up to, but not including, y. (Label-based .loc slices are the exception: they include the ending label.)
[:2] = Start from the very beginning and stop before index 2. [2:] = Start from index 2 and gather everything past that until the very end.
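A short sketch of slicing on a made-up Series shows which endpoints are kept. Positional slices through `.iloc` leave out the end position, while label slices through `.loc` keep the end label.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

first_two = s.iloc[:2]         # positions 0 and 1; position 2 is excluded
from_two = s.iloc[2:]          # position 2 through the end
label_slice = s.loc["a":"c"]   # labels "a", "b", and "c"; end label included
```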
.unique()
Called on a column; checks that column for unique values and returns an array of the unique values found.
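As a small illustration, with a made-up column named `city`:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Hilo", "Kona", "Hilo", "Lihue", "Kona"]})

# Unique values come back in order of first appearance.
unique_cities = df["city"].unique()
print(unique_cities)
```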
.groupby( )
Hierarchical Index Groupby:
Data.groupby(['col1', 'col2']).mean()
Dataframe Groupby:
Data.groupby(['col1', 'col2'])['col3'].mean()
Takes the dataset and splits it by some parameter, applies a function (such as a mean) to each group, then combines the results back together. Your new index becomes the values on which you split.
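The split, apply, and combine steps can be traced on a tiny made-up frame. The column names follow the card; the values are invented for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": ["x", "y", "x", "x"],
    "col3": [1.0, 3.0, 5.0, 7.0],
})

# Split on (col1, col2), take the mean of col3 in each group, then combine.
means = data.groupby(["col1", "col2"])["col3"].mean()
print(means)
```

The result is indexed by the (col1, col2) pairs that were used for the split, so the group ("b", "x") holds the mean of 5.0 and 7.0.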
Correlation coefficient
Number between -1 and 1 calculated so as to represent the linear dependence of two variables or sets of data.
R, R^2
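R is the correlation coefficient; R^2 is the fraction of the variance in the response that the model explains. In simple linear regression, which predicts a quantitative response from a single input X, R^2 is the square of the correlation between X and the response.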
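The link between R and R^2 can be checked numerically. This sketch uses made-up data and NumPy: it computes the correlation coefficient, fits a straight line, and compares R squared with the fraction of variance explained by the fit.

```python
import numpy as np

# Made-up, roughly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# Least-squares line, then R^2 as 1 - (residual variation / total variation).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

print(r, r**2, r_squared)
```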
.astype()
Allows you to convert data to a different type (dtype).
df["b"] = df["a"].astype('category')
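A runnable version of the conversion, using a made-up column of strings:

```python
import pandas as pd

df = pd.DataFrame({"a": ["low", "high", "low", "medium"]})

# Store the strings as a categorical column.
df["b"] = df["a"].astype("category")

print(df.dtypes)
print(df["b"].cat.categories)
```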
Linear Regression
Y = ax +b
Y = a1 x1 + a2 x2 +b
The outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values.
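The two-feature form Y = a1 x1 + a2 x2 + b can be fitted with ordinary least squares. This is a sketch on made-up, noiseless data generated from a1 = 2, a2 = -1, b = 3, so the fit should recover those values; `np.linalg.lstsq` is one of several ways to solve it.

```python
import numpy as np

# Made-up noiseless data generated from y = 2*x1 - 1*x2 + 3.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 2 * x1 - 1 * x2 + 3

# Design matrix with a column of ones for the intercept b.
X = np.column_stack([x1, x2, np.ones_like(x1)])

# Least-squares solution for [a1, a2, b].
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a1, a2, b = coeffs
print(a1, a2, b)
```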
Logistic Regression
- Target variable is binary
- Predictive features are interval (continuous) or categorical
Features are independent of one another.
Used when the response variable is categorical in nature.
Uses:
- Whether a customer will convert to an offer.
- Predict and preempt customer churn.
- Clinical testing to predict whether a new drug will cure the average patient.
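Logistic regression turns a linear score into a probability with the sigmoid function and then applies a threshold to get a binary prediction. The sketch below shows only that prediction step; the coefficients `a` and `b` are hypothetical values, not fitted to any data, and the function names are made up for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, not fitted to any real data.
a, b = 1.2, -3.0

def predict_probability(x: float) -> float:
    """Probability of the positive class for a single feature value x."""
    return sigmoid(a * x + b)

def predict_class(x: float, threshold: float = 0.5) -> int:
    """Binary prediction: 1 if the probability reaches the threshold."""
    return int(predict_probability(x) >= threshold)

print(predict_probability(1.0), predict_class(1.0))
print(predict_probability(4.0), predict_class(4.0))
```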