Final Review Flashcards
Probability Density Function (PDF)
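Tells the density of a continuous random variable around a particular point. The probability of landing in an interval is the area under the curve over that interval, and the total area under the pdf curve is 1.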
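A quick numerical check of the area property, as a sketch only. It assumes the standard normal pdf as the example density and approximates the area with a simple trapezoid rule; the function names are illustrative.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x (assumed example pdf)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under(pdf, lo, hi, steps=10_000):
    """Approximate the integral of pdf over [lo, hi] with the trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.5 * (pdf(lo) + pdf(hi))
    for i in range(1, steps):
        total += pdf(lo + i * width)
    return total * width

# The tails beyond +/- 8 are negligible, so this should be very close to 1.
total_area = area_under(normal_pdf, -8.0, 8.0)

# The probability of landing in an interval is the area over that interval.
prob_within_one = area_under(normal_pdf, -1.0, 1.0)
```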
Poisson Distribution
Used in experiments to model the number of events in a fixed period of time or a fixed area of space. The mean and variance are both equal to lambda.
- Number of people who enter a grocery store every hour.
- Number of childbirths in Hawai’i every day.
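The claim that the mean and variance both equal lambda can be checked numerically. The sketch below assumes a made-up rate of 4 events per hour and truncates the infinite support at 100, where the remaining tail is negligible.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 4.0        # hypothetical rate: 4 shoppers per hour on average
ks = range(100)  # truncated support; the tail beyond this is negligible here
probs = [poisson_pmf(k, lam) for k in ks]

# Mean and variance computed directly from the probabilities.
mean = sum(k * p for k, p in zip(ks, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))
```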
Negative Binomial Distribution
The trials in the sequence are independent. Counts failures until a fixed number of successes is reached: how many times you did something until the event occurred.
How many times did you have to flip a coin before it landed on heads?
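As a sketch of the coin question, the code below uses the "number of failures before the r-th success" form of the negative binomial pmf. That parameterization, the fair coin, and the truncation at 200 terms are assumptions made for illustration.

```python
import math

def neg_binomial_pmf(k, r, p):
    """P(exactly k failures occur before the r-th success)."""
    return math.comb(k + r - 1, k) * p**r * (1 - p) ** k

p = 0.5  # fair coin, heads counts as a success

# With r = 1 this is the coin example: k tails before the first head.
tails_before_first_head = [neg_binomial_pmf(k, 1, p) for k in range(5)]

# The expected number of failures before the r-th success is r * (1 - p) / p.
r = 3
mean_failures = sum(k * neg_binomial_pmf(k, r, p) for k in range(200))
```

With a fair coin the chance of each extra tail before the first head halves every time: 0.5, 0.25, 0.125, and so on.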
Binomial Distribution
Models the number of successes in an experiment of n fixed, independent trials.
Each trial has a probability of success p and a probability of failure 1 - p.
Such as flipping a coin with heads being 1 and tails being 0. What is the probability of the coin landing on heads a given number of times in a certain number of flips?
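The coin example can be worked through with the binomial pmf. This is a minimal sketch assuming ten flips of a fair coin; the helper name is illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # ten flips of a fair coin, heads = success

# Exactly five heads: C(10, 5) / 2**10 = 252 / 1024.
prob_five_heads = binomial_pmf(5, n, p)

# At least eight heads: add up the probabilities for 8, 9 and 10 heads.
prob_at_least_eight = sum(binomial_pmf(k, n, p) for k in range(8, n + 1))

# The probabilities over all possible counts add up to 1.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
```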
Normal Distribution
Continuous probability distribution. Bell-shaped, symmetric curve for continuous events. Examples:
- Heights of males in a population.
- Errors in instrumentation.
- Total sales.
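A short sketch of the symmetry and bell shape, using `NormalDist` from Python's standard library. The mean of 175 cm and standard deviation of 7 cm are made-up values for the height example, not measured data.

```python
from statistics import NormalDist

# Hypothetical population: mean height 175 cm, standard deviation 7 cm.
heights = NormalDist(mu=175, sigma=7)

# Symmetry: the density is the same at equal distances on either side of the mean.
left = heights.pdf(175 - 10)
right = heights.pdf(175 + 10)

# About 68% of values fall within one standard deviation of the mean.
within_one_sd = heights.cdf(182) - heights.cdf(168)
```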
Probability Mass Function
Probability distribution for a discrete random variable.
Takes a discrete random variable X and assigns a probability to each value in its sample space. The probabilities sum to 1. Commonly represented with a histogram.
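One way to see a pmf in practice is to build an empirical one from a sample by assigning each distinct value its relative frequency. The die rolls below are made up, and `empirical_pmf` is an illustrative helper, not a library function.

```python
from collections import Counter
from fractions import Fraction

def empirical_pmf(values):
    """Map each distinct value to its relative frequency in the sample."""
    counts = Counter(values)
    n = len(values)
    return {value: Fraction(count, n) for value, count in counts.items()}

rolls = [1, 3, 3, 6, 2, 3, 6, 1]  # made-up die rolls
pmf = empirical_pmf(rolls)

# Each value gets a probability, and together they account for the whole sample.
total = sum(pmf.values())
```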
Discrete Data
Only takes particular values. May potentially be an infinite number of values, but each is distinct. Can be numeric but also categorical.
Continuous Data
Not restricted to defined separate values, but can occupy any value over a continuous range. Between any two continuous data values there may be an infinite number of others. Always numeric.
Kernel Density Estimation
Non-parametric way to estimate the probability density function (pdf) of a random variable. A histogram depends on the number of bins and results in a bar chart; KDE instead places a kernel on each data point and adds the kernels up into a curve. Allows you to see the equivalent of the probability model. Tells you where the bulk of your data resides under the curve.
Top-Hat Kernel
Used to bypass bin boundaries. Each entry is a block centered on its own data point instead of on a fixed bin. When entries overlap each other in the plot, they are stacked on top of each other, which gives a more accurate representation of the data than fixed bins but a much rougher one than a smooth kernel. Heights are equal to the sum of the overlapping blocks.
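The stacking of blocks can be sketched directly. The version below assumes each point contributes a block of half-width h centered on it, scaled so the total area is 1; the data values and h are made up.

```python
def tophat_kde(x, data, h):
    """Top-hat estimate at x: each point adds a block of half-width h centered on it."""
    overlapping = sum(1 for point in data if abs(x - point) <= h)
    # Each block has height 1 / (2h); dividing by n makes the total area 1.
    return overlapping / (2 * h * len(data))

data = [1.0, 1.5, 4.0]  # made-up observations
h = 1.0                 # assumed half-width of each block

density_between = tophat_kde(1.25, data, h)  # two blocks overlap here
density_far = tophat_kde(4.0, data, h)       # only one block covers this point
density_gap = tophat_kde(2.75, data, h)      # no block reaches this point
```

The estimate jumps whenever x crosses the edge of a block, which is where the rough look comes from.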
Gaussian Kernel
Each data point contributes a Gaussian (normal) pdf centered on that point. The estimate at position x is simply the sum of the pdfs that overlap it. This results in a smoother distribution.
Bandwidth
The width of the kernel placed on each data point, which controls how smooth the estimate is. Higher bandwidth means a smoother curve; lower means a more jagged curve with sudden changes. Too large and the curve will be so smooth that the structure of the data is indiscernible.
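The effect of bandwidth can be illustrated by counting the bumps in a Gaussian kernel estimate at different widths. This is a rough sketch: the sample, the three bandwidth values, and the grid used to look for peaks are all assumptions chosen for illustration.

```python
import math

def gaussian_kde(x, data, h):
    """Gaussian estimate at x: average of normal pdfs (std dev h) centered on each point."""
    norm = 1.0 / (h * math.sqrt(2 * math.pi))
    total = sum(norm * math.exp(-0.5 * ((x - point) / h) ** 2) for point in data)
    return total / len(data)

def count_peaks(data, h, lo=-2.0, hi=8.0, steps=1000):
    """Count local maxima of the estimate on an evenly spaced grid."""
    width = (hi - lo) / steps
    ys = [gaussian_kde(lo + i * width, data, h) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

data = [1.0, 1.2, 1.9, 2.1, 5.0, 5.3]  # made-up sample with two clusters

peaks_small = count_peaks(data, h=0.05)  # jagged: roughly one spike per point
peaks_medium = count_peaks(data, h=0.5)  # the two clusters show up as two bumps
peaks_large = count_peaks(data, h=3.0)   # oversmoothed: the clusters merge into one bump
```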
Cumulative Distribution Function (CDF)
Function that maps a value to its percentile rank. Takes a value x and returns its percentile rank, the fraction of values less than or equal to x.
The graph is sometimes step-shaped, for example for discrete data.
Answers:
- What fraction of the events fall to the left of, or at or below, x?
- Where does the bulk of the data lie on the kernel density?
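For a sample, the percentile rank can be computed as the fraction of observations at or below x. The sketch below uses a small made-up sample; `empirical_cdf` is an illustrative helper name.

```python
def empirical_cdf(x, data):
    """Fraction of observations that are less than or equal to x."""
    return sum(1 for value in data if value <= x) / len(data)

data = [6, 3, 4, 6]  # made-up sample

rank_of_4 = empirical_cdf(4, data)  # 3 and 4 are <= 4, so 2 of the 4 values
rank_of_5 = empirical_cdf(5, data)  # no values between 4 and 5, so the step stays flat
rank_of_6 = empirical_cdf(6, data)  # every value is <= 6
```

The result only changes at values that occur in the sample, which is why the graph looks like a staircase.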
Series
One-dimensional labeled array. Can be thought of as a column of a data frame.
import pandas as pd
x = pd.Series([6, 3, 4, 6])
Dataframe
Two-dimensional table data structure with labeled axes (rows and columns).
import numpy as np
pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)), columns=['a', 'b', 'c', 'd', 'e'])