4 statistics and probability Flashcards
discrete data
something you can count
continuous data
something you measure
a hypothesis
a statement you test to see if it is true or false
raw data
data before it has been analysed or processed
primary data
data you collect yourself
secondary data
data you use which someone else has collected
categorical data
data given as words, not numbers
numerical data
data given as numbers
types of numerical data
continuous or discrete
ordinal data
data that is ordered in some way
adv of secondary data
available
cheaper
easy
adv of primary data
reliable
aware of bias
ways of collecting data
measurement or experiment
survey or questionnaire
modelling or simulation
mistakes to avoid when doing surveys or questionnaires
asking the wrong people or a biased sample
asking leading questions
asking confusing questions
asking personal questions
asking too open-ended questions
random
every member of the population has the same probability of being included
the members of a genuinely random sample have to be selected independently.
ways of collecting a sample
convenience
systematic
genuinely random
convenience sample
asking whoever is easiest to get hold of
systematic sample
asking every 3rd person
genuinely random sample
picking out of a hat or using a random number generator
quota sampling
Choosing a sample that is only comprised of members of the population that fit certain characteristics.
stratified sampling
Choosing a random sample in a way that the proportion of certain characteristics matches the proportion of those characteristics in the population.
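The sampling methods above can be sketched in a few lines of Python. This is only an illustration: the function names, the population of 30 people and the two year groups are made up for the example, and the stratified version assumes simple proportional allocation with rounding.

```python
import random

def systematic_sample(population, step, start=0):
    """Take every `step`-th member, beginning at index `start`."""
    return population[start::step]

def simple_random_sample(population, size, seed=None):
    """Every member has the same chance of being picked (no repeats)."""
    rng = random.Random(seed)
    return rng.sample(population, size)

def stratified_sample(strata, total_size, seed=None):
    """Sample each group in proportion to its share of the population.

    `strata` maps a group name to the list of members in that group.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation, rounded to the nearest whole member.
        group_size = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, group_size)
    return sample

people = list(range(1, 31))                       # hypothetical population of 30
every_third = systematic_sample(people, 3, start=2)   # persons 3, 6, 9, ...
five_at_random = simple_random_sample(people, 5, seed=1)

# Hypothetical school: 60 pupils in year 7 and 40 in year 8.
year_groups = {"year 7": list(range(60)), "year 8": list(range(40))}
chosen = stratified_sample(year_groups, 10, seed=1)
```

With a sample of 10, the 60:40 split in the population becomes 6 pupils from year 7 and 4 from year 8.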
hypothesis
a statement you test to see if it's true or false
raw data
data before analysis or processing
categorical data
word data not numbers
numerical data
number data
ordinal data
ordered in some way
adv of primary data
reliability
aware of bias
ways of collecting primary data
measurement or experiment
survey or questionnaire
modelling or simulation
random sample
every member of the population has the same probability of being included. members are selected independently.
what is the opposite of a census
a sample (only part of the population is surveyed, e.g. a random sample)
convenience sampling
asking friends or those easy to ask
systematic sampling
e.g. asking every 3rd person
genuinely random sampling
pick out of hat or use random number generator on calculator.
quota sampling
the population is divided into groups. a given number is surveyed from each group.
cluster sampling
the population is divided into groups or clusters. a random sample of clusters is chosen and every item in it is surveyed. a large number of small clusters minimises the chances of being unrepresentative.
opinion polls
large scale opinion polls often use a combination of cluster and quota sampling. large sample size based on small proportion of population. (geographical area, age). but opinions change over time
what is a uniform distribution
flat/even
what is a normal distribution
peaked in the middle
mean, median and mode in the same place (the middle)
gaussian distribution
what is negatively skewed
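leading up to the right: the long tail is on the left and the peak is on the right
what is positively skewed
leading up to the left (decreasing to the right): the peak is on the left and the long tail is on the right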
box plot left skewed
box on the right with the median line towards the right
box plot right skewed
box on the left with the median line towards the left
normal distribution and standard deviations
about 68% (roughly 70%) of the data lies within 1 standard deviation of the mean. the rest, next to the highlighted 70%, is about 30% in total, 15% on each side
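The 70% / 15% / 15% split can be checked against the exact normal curve. A minimal sketch, assuming a standard normal distribution (mean 0, standard deviation 1) and using Python's built-in `statistics.NormalDist`:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of data within 1 standard deviation of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

# What is left over is shared equally between the two sides.
each_side = (1 - within_one_sd) / 2

print(round(within_one_sd, 3), round(each_side, 3))  # 0.683 0.159
```

So "70% in the middle, 15% each side" is a rounded version of about 68% and about 16%.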
box plot name
box and whisker diagram
the ends of the box in a box plot are the
lower and upper quartiles (Q1 and Q3). the length of the box is the interquartile range
outlier definition
an item of data that is
at least 2 standard deviations away from the mean (histogram)
OR
at least 1.5 x IQR beyond the nearer quartile
(box and whisker)
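Both outlier rules can be tried on a small data set. This is a sketch under assumptions: the data are invented, the population standard deviation is used, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartiles a calculator or textbook gives.

```python
from statistics import mean, pstdev, quantiles

def outliers_by_sd(data, k=2):
    """Values at least k standard deviations away from the mean."""
    centre = mean(data)
    spread = pstdev(data)
    return [x for x in data if abs(x - centre) >= k * spread]

def outliers_by_iqr(data):
    """Values at least 1.5 x IQR beyond the nearer quartile."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    return [x for x in data if x <= q1 - 1.5 * iqr or x >= q3 + 1.5 * iqr]

data = [2, 4, 5, 5, 6, 7, 8, 30]   # hypothetical data set
print(outliers_by_sd(data))        # [30]
print(outliers_by_iqr(data))       # [30]
```

Here both rules agree that 30 is an outlier, but on other data sets they can give different answers.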
benefits of a curve in a cumulative frequency diagram
they use the data to show a gradient, so if the frequency decreases slightly then the gradient will show it by flattening a little. straight lines only show the data and not the link between them
datum
singular piece of data
why do bars not touch with discrete data
because there is no continuity between columns
what graph do you use for continuous data
histograms
what graph do you use for discrete data
bar graph
what graph do you use for cumulative frequency
line
how do you plot for cf graphs
to the upper bound
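Plotting to the upper bound means each point is (upper class boundary, running total so far). A small sketch with an invented grouped frequency table:

```python
from itertools import accumulate

# Hypothetical grouped data: (lower bound, upper bound, frequency).
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 5), (30, 40, 2)]

upper_bounds = [upper for _, upper, _ in classes]
running_totals = list(accumulate(freq for _, _, freq in classes))

# Each point on the cumulative frequency graph.
points = list(zip(upper_bounds, running_totals))
print(points)  # [(10, 4), (20, 13), (30, 18), (40, 20)]
```

The last running total always equals the total frequency, here 20.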
For data grouped into intervals or classes, we may identify the following:
mid-interval values
interval width (though it is not common to have a varying interval width)
lower interval boundary
upper interval boundary
modal class (the class with the highest frequency or the tallest class in the diagram; be aware, use the tallest class in the frequency diagram, not in the cumulative frequency diagram)
what is the 5 number summary
minimum Q1 median Q3 maximum
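The five values can be computed directly. This sketch assumes the "inclusive" quartile method of Python's `statistics.quantiles`; other quartile conventions can give slightly different Q1 and Q3. The function name and data are made up for the example.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

print(five_number_summary([9, 1, 5, 3, 7]))  # (1, 3.0, 5.0, 7.0, 9)
```

These are exactly the five values drawn on a box and whisker diagram.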
when is a box and whisker plot a normal distribution
when you can recognise symmetry
cumulative frequency polygon
The data points are connected by straight lines, implying a linear distribution of the data points within an interval.
cumulative frequency curve
All the data points are connected by a smooth curve
no correlation is
a bunch of dots
strong positive correlation is
a line going up to the right with all the dots very close to that line
perfect negative correlation
a line going down to the right with all the dots on it
moderate negative correlation is
going gently down to the right with dots around it
weak positive correlation is
a line faintly going up to the right with dots all around it
what is the r of no correlation
0
what is the r of strong positive correlation
0.9
what is the r of perfect negative correlation
-1
what is the r of moderate negative correlation
-0.5
what is the r of weak positive correlation
0.3
what is the r of a curved relationship or no correlation
don't add a straight line, so r is not meaningful
what are residuals
the vertical displacements of the points from the line
which residuals are positive and negative
above the line - positive
below the line - negative
what is the sum of all residuals
0
why would we square the residuals
so they are all positive
what does the sum of squared residuals show
how well the line fits the points
would a good line have a low or high sum of squared residuals
low. the line with the lowest possible sum of squared residuals is called the least squares regression line of y on x
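The residual facts above can be checked numerically. A minimal sketch, assuming the standard least squares formulas (gradient = Sxy / Sxx, and the line passes through the mean point); the data points are invented.

```python
def least_squares_line(xs, ys):
    """Return gradient a and intercept b of the y on x line y = ax + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)

# Residual = actual y minus the y predicted by the line.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
sum_of_residuals = sum(residuals)                    # cancels out to 0
sum_of_squares = sum(res ** 2 for res in residuals)  # measures the fit
```

For these points the line is y = 0.6x + 2.2. The positive and negative residuals cancel, which is why they are squared before adding: the sum of squares here is 2.4, and no other straight line gives a smaller value.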
if you want to calculate the y values from the x values how would you plot the line of best fit
vertical residuals to be as small as possible.
if you want to calculate the x values from the y values how would you plot the line of best fit
horizontal residuals to be as small as possible.
what is the line called that has the lowest possible sum of squared residuals
the least squares regression line of y on x
what are the two separate least squares regression lines
one for y on x
one for x on y
what 3 extra columns should you have if you're calculating regression and correlation
x squared
y squared
xy
what are the sections of the graph called
quadrants
which quadrants do the points lie in for positive correlation
in the 1st and 3rd (top right and bottom left)
which quadrants do the points lie in for negative correlation
in the 2nd and 4th quadrants (top left and bottom right)
product moment correlation coefficient
r = Sxy / √(Sxx × Syy)
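The extra x squared, y squared and xy columns are what you need to work out r by hand. A sketch of the calculation, assuming the usual definitions Sxx = Σx² - (Σx)²/n, Syy = Σy² - (Σy)²/n and Sxy = Σxy - (Σx)(Σy)/n; the function name and data are made up for the example.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r from the column totals."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)             # x squared column
    sum_y2 = sum(y * y for y in ys)             # y squared column
    sum_xy = sum(x * y for x, y in zip(xs, ys)) # xy column
    sxx = sum_x2 - sum_x ** 2 / n
    syy = sum_y2 - sum_y ** 2 / n
    sxy = sum_xy - sum_x * sum_y / n
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
r = pmcc(xs, ys)                                  # about 0.77
r_scaled = pmcc([10 * x for x in xs], [10 * y for y in ys])
r_perfect_negative = pmcc(xs, [10, 8, 6, 4, 2])   # -1
```

Multiplying every measurement by 10 leaves r unchanged, because the factor cancels between Sxy and √(Sxx × Syy).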
what is a in stats
gradient
what is b in stats
y intercept
gradient of y on x line
Sxy / Sxx
what is the product moment correlation coefficient
r
if measurements multiplied by 10 what effect would that have on the correlation
no effect
y = ax + b
whats a
gradient
y = ax + b
whats b
y intercept
interpolation
within data range
extrapolation
outside data range
r squared or variance is used to
show how close the points are to the line. they remove knowledge of whether the data is trending up or down.
small sd or variance means
the data is all close together
high sd and variance means
the data is spread out
if the values when calculating sd were all multiplied by 10 what would happen to sd
it would also be multiplied by 10
if the values when calculating variance were all multiplied by 10, what would happen to variance
it would get multiplied by 10^2
if the values when calculating mean were all multiplied by 10, what would happen to the mean
it would also be multiplied by 10
if the values when calculating r/correlation were all multiplied by 10, what would happen to the r/correlation
there would be no change
if 10 was added to all the values when calculating sd, what would happen
there would be no change
if 10 was added to all the values when calculating variance, what would happen
there would be no change
if 10 was added to all the values when calculating the mean, what would happen
10 would be added to the mean
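The multiply-by-10 and add-10 rules can be confirmed with Python's `statistics` module. A small sketch using invented data and the population standard deviation and variance:

```python
from statistics import mean, pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical data: mean 5, sd 2, variance 4
scaled = [10 * x for x in data]      # every value multiplied by 10
shifted = [x + 10 for x in data]     # 10 added to every value

print(mean(data), pstdev(data), pvariance(data))
print(mean(scaled), pstdev(scaled), pvariance(scaled))
print(mean(shifted), pstdev(shifted), pvariance(shifted))
```

Multiplying by 10 gives mean 50, sd 20 and variance 400 (10 squared times bigger). Adding 10 moves the mean to 15 but leaves the sd at 2 and the variance at 4, because the spread of the data has not changed.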
what is variance essentially
standard deviation squared
which type of standard deviation is the notation of sigma used for
population
which type of standard deviation is the notation of Sx used for
sample
to find x on y, give the order of the columns that you would enter into the calculator
y column and then the x column
to find y on x, give the order of the columns that you would enter into the calculator
x column and then the y column
when you are calculating x on y, what value are you finding
x
when you are calculating y on x, what value are you finding
y
what is the mean point
the point (x bar, y bar). the regression line of x on y and the regression line of y on x both pass through it.
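Both regression lines can be worked out and checked against the mean point. This is a sketch, assuming the standard gradients Sxy / Sxx for y on x and Sxy / Syy for x on y; the function name and data are invented.

```python
def regression_lines(xs, ys):
    """Return (a, b) for y = ax + b, (c, d) for x = cy + d, and the mean point."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx            # y on x: minimises vertical residuals
    b = y_bar - a * x_bar
    c = sxy / syy            # x on y: minimises horizontal residuals
    d = x_bar - c * y_bar
    return (a, b), (c, d), (x_bar, y_bar)

xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
(a, b), (c, d), (x_bar, y_bar) = regression_lines(xs, ys)

y_at_mean_x = a * x_bar + b   # y on x line evaluated at x bar
x_at_mean_y = c * y_bar + d   # x on y line evaluated at y bar
```

Here y on x is y = 0.6x + 2.2 and x on y is x = y - 1. They are different lines, but both go through the mean point (3, 4).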
what is relative frequency
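the number of times an event happens divided by the total number of trials. it is an estimate of the probability and can be written as a fraction, decimal or percentage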
0 on the probability scale is
never
1 on the probability scale is
absolutely certain
when might you use r squared
to plot a curve
probability P(A) =
n(A) / n(U), the number of outcomes in A divided by the total number of possible outcomes
complementary events are represented by an
apostrophe
when do you multiply probabilities
when they are independent events
when do you add probabilities
when they are mutually exclusive
what are independent events
when one event does not affect the probability of the other
what is relative frequency
the proportion of trials in which the event happened. multiplied by 100, it is a percentage
do you multiply or add AND
multiply
do you multiply or add OR
add
what are mutually exclusive events
when only one of the events can happen at a time. there is no intersection, so P(A∩B) = 0
combined events
P(A∪B) =
P(A) + P(B) - P(A∩B)
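The combined events formula can be checked by counting outcomes. A minimal sketch for one roll of a fair die, where the events A (even number) and B (number greater than 3) are chosen just for illustration:

```python
from fractions import Fraction

def probability(event, sample_space):
    """P(event) = n(event) / n(U), assuming equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

U = {1, 2, 3, 4, 5, 6}   # sample space for one die
A = {2, 4, 6}            # even number
B = {4, 5, 6}            # greater than 3

p_union = probability(A | B, U)   # counted directly
by_formula = probability(A, U) + probability(B, U) - probability(A & B, U)

print(p_union, by_formula)  # 2/3 2/3
```

Subtracting P(A∩B) stops the outcomes 4 and 6, which are in both events, from being counted twice.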
what is a random variable
an outcome of a random experiment which can be represented as a number
what is a probability distribution
a table showing all the possible outcomes and their probabilities.
probabilities add up to
1
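As an illustration, here is the probability distribution for the total shown by two fair dice. The example is an assumption chosen for the sketch, not part of the original cards.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely rolls give each total.
totals = Counter(first + second for first, second in product(range(1, 7), repeat=2))

# The probability distribution: each possible total with its probability.
distribution = {total: Fraction(count, 36) for total, count in sorted(totals.items())}

for total, p in distribution.items():
    print(total, p)

print(sum(distribution.values()))  # 1
```

The random variable is the total (2 to 12), the table lists every possible value with its probability, and the probabilities add up to 1.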