1 to 200 Math Flashcards
Histogram Buckets (Bins)
Bin size is very important. More bins are generally better than fewer: with too few bins, the data is summarized too coarsely to be accurate
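A small sketch of why bin count matters, using made-up ages and a hypothetical helper `histogram_counts` (equal-width bins, assuming the values are not all identical). With 2 bins the three clusters in the data blur together; with 8 bins they show up separately.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins  # assumes high > low
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin, not a new one.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

ages = [1, 2, 3, 21, 22, 23, 24, 61, 62, 80]  # made-up data

coarse = histogram_counts(ages, 2)  # few, wide bins
fine = histogram_counts(ages, 8)    # more, narrower bins

print(coarse)
print(fine)
```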
Line Plot
Also makes it easy to stack features
Inclusive internal
need to look this one up
Distribution Plot
Allows us to visualize the dispersion of data across a variable. The most common method is the histogram
Histogram is
The most common distribution plot
X axis
Horizontal Axis
Y axis
Vertical Axis
When do we use a scatter Plot
used to show the relationship between 2 features
When is a line plot Appropriate
When we know for sure there is a continuous relationship (linear) between 2 data points
2 Types of distribution Plots
Box and Whisker, KDE (Kernel Density Estimation)
Categorical Plots
Metric per category, many variations, most common is the simple bar plot
Always keep in mind the information I want to share or the story I am trying to tell
How does the story help convey that information to another person?
Bin Sizes
You can make them smaller or larger
Histogram is
A distribution plot
in a histogram which axis is continuous
x Axis
yHat = mx + b
Linear equation
M=
How steep the line is (the slope)
X=
How far along the horizontal axis (the input value)
B=
the value of y when x=0
M formula question
Rise over run
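As a quick illustration of rise over run, the slope between two points can be computed like this. The function name `slope` and the two points are made up for the example, and it assumes the two x values differ.

```python
def slope(x1, y1, x2, y2):
    """Slope m = rise / run between the points (x1, y1) and (x2, y2)."""
    rise = y2 - y1  # change along the vertical axis
    run = x2 - x1   # change along the horizontal axis (must not be 0)
    return rise / run

m = slope(1, 3, 4, 9)  # rise = 6, run = 3
print(m)
```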
Y=
How far up or down (the position on the vertical axis)
Ogive
Cumulative line plot
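A minimal sketch of the numbers behind such a plot: each point is the running total of the bin counts so far. The counts here are made up.

```python
from itertools import accumulate

bin_counts = [3, 0, 4, 2, 1]  # made-up histogram counts

# Running total: each entry adds the current bin to everything before it.
cumulative = list(accumulate(bin_counts))

print(cumulative)
```

The last cumulative value always equals the total number of data points.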
Line plots are great when
the data points are ordered and have no in-between points, like the weather across days
yHat means
In the equation of a straight line in slope-intercept form, y hat represents the predicted value
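A small sketch of a prediction from a line. The function name `predict` and the values m = 2 and b = 1 are assumptions chosen only for illustration.

```python
def predict(x, m, b):
    """Return the predicted value y hat = m * x + b."""
    return m * x + b

y_hat = predict(3, m=2, b=1)         # 2 * 3 + 1
y_at_zero = predict(0, m=2, b=1)     # when x = 0, y hat equals b

print(y_hat, y_at_zero)
```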
X Axis needs to be
Continuous Data
Why do we use line plots?
We use them for changes over time
Data is
Data is collected and observable information about something
Discrete Data
can only take on certain values, there are no in-between values, like Ford, Chevy, Cadillac
Continuous Data
data that can have in-between values, like a height of 175.5 cm
Nominal Data
Nominal data is classified without a natural order or rank, e.g., cats, dogs, fish
Ordinal Data
Can be sorted; it has an order, like 1, 2, 3 or hot, mild, cold. The order has to make logical sense.
Structured Data
Highly specific and stored in a predefined format, e.g., an Excel spreadsheet. If you send the data to someone else, they will be able to work with it
Unstructured Data
Not in any particular format, e.g., audio or text files. It does not follow a predefined format. Working with it often involves deep learning, e.g., DALL-E 2
Population
the entire data set
Sample
A sample is a subset of the population, often chosen at random
Mean
Mean is the most common measure of central tendency
Mean formula
Sum of all data points / number of data points
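The formula above, written out as a short sketch with made-up numbers:

```python
def mean(values):
    """Arithmetic mean: sum of all data points / number of data points."""
    return sum(values) / len(values)

average = mean([2, 4, 6, 8])  # (2 + 4 + 6 + 8) / 4

print(average)
```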
Average is the
Arithmetic mean
mu (μ) is
The population mean
x bar
The sample mean
Weighted mean
A weighted mean is a kind of average. Instead of each data point contributing equally to the final mean, some data points contribute more “weight” than others. If all the weights are equal, then the weighted mean equals the arithmetic mean (the regular “average” you’re used to). Weighted means are very common in statistics, especially when studying populations.
Weighted mean example
Weights 20 and 7 with values 8.4 and 6.1 would read (20 × 8.4 + 7 × 6.1) divided by (20 + 7), which is about 7.80
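One reading of the example above, sketched in code. It assumes 20 and 7 are the weights and 8.4 and 6.1 are the values being averaged; the card is terse, so treat that interpretation as an assumption. The helper name `weighted_mean` is made up.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total_weight = sum(weights)
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    return weighted_sum / total_weight

# Assumed reading: weights 20 and 7, values 8.4 and 6.1.
result = weighted_mean([8.4, 6.1], [20, 7])

# With equal weights the weighted mean is the ordinary arithmetic mean.
equal = weighted_mean([8.4, 6.1], [1, 1])

print(round(result, 2), round(equal, 2))
```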
Truncated Mean
Used to handle outliers: we drop the extreme values from both ends of the sorted data set and average the rest. Example: for 9, 50, 52, 78 we would take off 9 and 78 and divide the remaining sum by the number of remaining values. We must note that we took x% off the data set
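A minimal sketch of the 9, 50, 52, 78 example. The helper name `truncated_mean` and its `trim_each_side` parameter are made up for illustration; it assumes at least one value remains after trimming.

```python
def truncated_mean(values, trim_each_side):
    """Drop trim_each_side values from each end of the sorted data, then average."""
    ordered = sorted(values)
    kept = ordered[trim_each_side:len(ordered) - trim_each_side]
    return sum(kept) / len(kept)

data = [9, 50, 52, 78]
trimmed = truncated_mean(data, 1)  # drops 9 and 78, averages 50 and 52

# Share of the data set that was removed, to report alongside the result.
percent_removed = 100 * 2 * 1 / len(data)

print(trimmed, percent_removed)
```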
Mode
The value that occurs most often
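A short sketch of finding the mode by counting. It works on non-numeric data too, and returns every value tied for most frequent. The helper name `modes` and the sample data are made up.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often (there can be ties)."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

pets = ["cat", "dog", "cat", "fish"]
most_common_pets = modes(pets)
tied = modes([1, 1, 2, 2, 3])

print(most_common_pets, tied)
```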
Median (odd number of values)
It is the number in the middle of the sorted data
Median (even number of values)
Add the two central numbers and divide by 2 (take their arithmetic mean); that will be the median
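Both median cases, odd and even counts, can be sketched in one small function. The name `median` here is a hand-written helper for illustration, and the numbers are made up.

```python
def median(values):
    """Middle value of the sorted data; for an even count, the mean of the two central values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the middle number
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two central numbers

odd_case = median([7, 1, 5])       # sorted: 1, 5, 7
even_case = median([1, 3, 5, 7])   # central numbers: 3 and 5

print(odd_case, even_case)
```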
Discrete: which measures to use
mean, median, mode
Nominal Data
maybe mean, no median, use mode
Ordinal Data
mean maybe, median, mode
Numeric
mean, median, mode
Non Numeric
no mean, median, mode
Continuous
mean, median, mode
Reason non-numeric data has no mean
To find a mean we have to add and divide; if a value is a letter, we cannot find the sum
Continuous
Height of people
Discrete Data
Number of Children in a family
Non Numeric
Cats, dogs, birds, fish; they can’t be added up
Nominal Data
Has no specific order, cannot be sorted
Ordinal Data
Data that can be sorted 1,2,3 hot, mild, cold
To calculate a mean
We need numeric data
These categories can overlap 1
nominal, numeric, non-numeric
These categories can overlap 2
ordinal, numeric, non-numeric
Working with Household data
we would use median because of extreme values
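A small sketch of why extreme values push us toward the median. The household incomes below (in thousands) are made up, with one extreme value.

```python
from statistics import mean, median

# Made-up household incomes in thousands; 1000 is the extreme value.
incomes = [30, 35, 40, 45, 1000]

average_income = mean(incomes)    # pulled far upward by the one extreme value
typical_income = median(incomes)  # stays with the bulk of the households

print(average_income, typical_income)
```

Four of the five households earn 45 or less, so the median describes them far better than the mean does.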
step one to figure out central tendency
Is it even possible to use that measure of central tendency?
step two to figure out central tendency
If we can measure it, which measure makes the most sense?
Measurement of Dispersion
are measurements of spread
measurement of dispersion
it measures how the data is spread around the mean
Mean is
the number that is as close as possible to all of the data points (the balancing point)
effects of measurements of spread
We get two things: the variance and the standard deviation. They are closely related to each other
Variance number meaning
The smaller the value we find, the less the spread
reason for squaring
If we get a negative deviation, squaring makes it positive. Squaring also emphasises the larger deviations
Standard deviation is
the square root of variance
Variance is
Not usually used directly; we use standard deviation instead
Variance formula uses
n - 1 (for a sample) to correct the bias introduced by using the sample mean
Then for standard deviation
you take the square root
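The steps above (square the deviations, divide by n - 1, then take the square root) can be sketched as follows. The function names and the data are made up; the n - 1 divisor assumes the data is a sample rather than a whole population.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    center = sum(values) / n
    # Squaring makes every deviation positive and emphasises the larger ones.
    squared_deviations = [(x - center) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

def sample_std_dev(values):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample, mean is 5
variance = sample_variance(data)  # squared deviations sum to 32, divided by 7
std_dev = sample_std_dev(data)

print(round(variance, 3), round(std_dev, 3))
```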
Quartiles are
Values that divide the sorted data set into four equal parts
when talking about Quartiles we are talking about
the first, second, and third cut points of the sorted data
The 1st quartile
Will be the median of the lower half of the data
1st quartile is
the cutoff for the bottom or lower 25% of the data
3rd quartile is
the cutoff below which 75% of the data falls (the upper 25% lies above it)
Second Quartile
Is the median or 50th percentile
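A sketch of the quartiles using the "median of each half" approach. This is one of several conventions (others interpolate and can give slightly different numbers); the choice to leave the middle value out of both halves when the count is odd is an assumption. The helper name `quartiles` and the data are made up.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it is in neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])

print(q1, q2, q3)
```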