WEEK 1 Flashcards
2 types of Data
Categorical and Scalar
Categorical has 2, which are?
Nominal and Ordinal
Scalar has 2 types, which are? What does it work well with?
Continuous and Discrete
It works well with the median, range and interquartile range (IQR)
It doesn’t work well with the mode unless you group the data, because the frequency table would have too many different values for continuous data
What is Nominal?
Data that does not have a numerical value and can only be placed in a suitable category, like gender or yes/no questions. The categories are just labels, such as College or Breakfast in the example
What is Ordinal?
Ordinal is data that can be arranged in some meaningful order, such as confidence with numbers (agree, disagree, etc). The data is still categorical but adds the idea of order, and the bar chart is generally the best chart to use.
We cannot measure the distance between the confidence categories (e.g. between disagree and strongly disagree), so people just assume all the distances are equal; if it were weight, we could measure the distance
(Categorical variables with 2 different categories are called Dichotomous)
What is Continuous?
Measured on a scale such as temperature or weight
What is Discrete?
Data that takes on whole values, usually obtained by counting, e.g. the number of defective items
What is the mode? In a bar chart too
The most frequent score in our data set; in a bar chart, the tallest bar is the mode
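A small Python sketch of finding the mode by counting, assuming the data is a plain list of labels. The function name `mode_of` and the sample answers are made up for illustration.

```python
from collections import Counter

def mode_of(values):
    """Return the most frequent value (the tallest bar in a bar chart)."""
    counts = Counter(values)               # value -> how many times it appears
    value, _count = counts.most_common(1)[0]
    return value

# Hypothetical survey answers, not real data.
answers = ["College", "Breakfast", "College", "College", "Breakfast"]
print(mode_of(answers))  # College
```

If two values tie for the highest count, this sketch simply returns the one that appeared first.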
What is spread? In a bar chart too
For categorical data, how many different categories we have; in a bar chart, this is the number of bars
What is the median calculation?
First order the data. Then do (n + 1) divided by 2 to find the median position, e.g. (159 + 1) divided by 2 = 80, so the median is the 80th value
If there is an even number of data points, the position falls between 2 observations; with discrete or continuous data you can find the 2 observations, add them and divide by 2
If the data is categorical you cannot average, so you need to be lucky: the 2 middle observations have to be in the same category
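A minimal Python sketch of the median rule for numeric (discrete or continuous) data. The function name `median_value` is hypothetical; the steps are just the ones on the card: order the data, find position (n + 1) / 2, and average the 2 neighbours when the position lands on a half.

```python
def median_value(values):
    """Median using the (n + 1) / 2 position rule on ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                 # 1-based position in the ordered data
    if position.is_integer():
        return ordered[int(position) - 1]  # odd n: one middle observation
    lower = ordered[int(position) - 1]     # even n: the 2 middle observations
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median_value([3, 1, 2]))         # 2
print(median_value([4, 1, 3, 2]))      # 2.5
print(median_value(range(1, 160)))     # 80 (159 values, so the 80th value)
```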
What is Scalar?
Data such as height, weight and the guessing variable; it adds the idea of distance, not just order.
What is the Interquartile range?
Describes the middle 50% of values when ordered from lowest to highest
How to find IQR?
Find the median (middle value) of the lower half and of the upper half of the data; these are the LQ and the UQ, and IQR = UQ - LQ
What is Range?
The highest value (Maximum) - the lowest value (Minimum)
IQR CALCULATION
The LOWER QUARTILE (LQ) is the value at position (n + 1) divided by 4, and the UPPER QUARTILE (UQ) is the value at position 3 x (n + 1) divided by 4; then IQR = UQ - LQ
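A Python sketch of the IQR calculation using the (n + 1) / 4 positions. The names `value_at_position` and `iqr` are made up for illustration, and the sample data is invented. One assumption that is not on the card: when a position is not a whole number, the sketch interpolates between the 2 neighbouring values. It also assumes at least 3 data points.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in ordered data.

    Assumption: a fractional position is interpolated between its neighbours.
    """
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    below = ordered[whole - 1]
    above = ordered[whole]
    return below + fraction * (above - below)

def iqr(values):
    """IQR = UQ - LQ, with LQ at (n + 1) / 4 and UQ at 3 x (n + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    lq = value_at_position(ordered, (n + 1) / 4)
    uq = value_at_position(ordered, 3 * (n + 1) / 4)
    return uq - lq

# 11 invented data points: LQ is the 3rd value (4), UQ is the 9th value (12).
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 20]
print(iqr(data))  # 8
```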
What is five number summary?
It’s a set of 5 descriptive statistics that provides information about a dataset, e.g. in a BOX PLOT, where the sections SHOULD ALL BE AN EQUAL DISTANCE if the data is balanced
1) the sample minimum (smallest observation)
2) the lower quartile or first quartile, at position (TOTAL + 1) DIVIDED BY 4
3) the median (the middle value), at position (TOTAL + 1) DIVIDED BY 2
4) the upper quartile or third quartile, at position ((TOTAL + 1) DIVIDED BY 4) X 3
5) the sample maximum (largest observation)
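The 5 steps above can be sketched in Python as below. The function names and the sample data are invented for illustration, and fractional positions are interpolated between neighbours, which is an assumption rather than something the card states.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position; fractional positions are interpolated (assumption)."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    total = len(ordered)                                   # assumes total >= 3
    return (
        ordered[0],                                        # 1) minimum
        value_at_position(ordered, (total + 1) / 4),       # 2) lower quartile
        value_at_position(ordered, (total + 1) / 2),       # 3) median
        value_at_position(ordered, 3 * (total + 1) / 4),   # 4) upper quartile
        ordered[-1],                                       # 5) maximum
    )

print(five_number_summary([2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 20]))
# (2, 4, 8, 12, 20)
```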
AVERAGE - SCALAR DATA: how to calculate the mean of the data?
ADD all values and divide by the total number of observations to find the mean
THE IDEA OF DISTANCE, what is it?
Standard deviation measures the typical distance of the data from the mean. It does not matter if the value of an observation is above or below the mean, because of the squaring (only the distance matters)
The variance is the standard deviation squared; variance is sometimes more useful than the standard deviation
STANDARD DEVIATION CALCULATION
So first calculate the mean by adding all the values and dividing by n (the number of observations)
x minus mu: x reflects each individual height and mu is the mean, so we pick the first individual height and subtract the average (the mean)
We square the result, then go to the second individual, subtract the mean from the height and square it
Calculate all the squared differences and add them up (1;40 min in the first lecture)
We divide by N at the end, which gives us the variance (the inner part, under the square root)
We take the square root, which gives us the spread: the standard deviation (sigma, a Greek letter)
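A minimal Python sketch of the steps above, dividing by N as the card does (the population version). The function name `standard_deviation` and the sample numbers are invented for illustration.

```python
import math

def standard_deviation(values):
    """Population standard deviation: square root of the mean squared difference."""
    n = len(values)
    mean = sum(values) / n                               # step 1: the mean (mu)
    squared_diffs = [(x - mean) ** 2 for x in values]    # steps 2-3: (x - mu) squared
    variance = sum(squared_diffs) / n                    # step 4: add up, divide by N
    return math.sqrt(variance)                           # step 5: square root

# Invented data with mean 5 and variance 4.
print(standard_deviation([2, 4, 4, 4, 5, 5, 7, 9]))  # 2.0
```

Python's built-in `statistics.pstdev` also divides by N, so it can be used to check an answer worked out by hand.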
What is trimmed mean?
There might be some very high or very low heights, so we cut the lowest 5% of the data to get rid of low values (outliers), and the highest 5% to get rid of high values if there are any, to get a more accurate mean
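A rough Python sketch of a 5% trimmed mean. The function name `trimmed_mean` is hypothetical and the heights are invented. One assumption: the number of values cut from each end is rounded down, so with fewer than 20 values nothing is trimmed.

```python
def trimmed_mean(values, percent=5):
    """Mean after cutting the lowest and highest `percent` % of ordered values."""
    ordered = sorted(values)
    cut = len(ordered) * percent // 100      # values to drop at each end (rounded down)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

# 20 invented heights: one very low and one very high outlier.
heights = [1] + [10] * 18 + [1000]
print(sum(heights) / len(heights))  # 59.05 (ordinary mean, pulled up by the outlier)
print(trimmed_mean(heights))        # 10.0
```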
Standard deviation?
1 find the mean (mu) by adding all the numbers and dividing by how many there are
2 subtract the mean from each of the numbers given
3 that gives us the inside of the bracket, and we square each one (the power of 2)
4 the sigma (sum) part: add them all up and divide by n (how many points there are)
5 then we square root it to get the standard deviation
What is the best measure of spread?
Variance (the value we have before taking the square root to get the standard deviation)
Standardisation
z = (x - mean) divided by the standard deviation
x is the number we want to standardise
used to put values on the same scale when they are measured in different units
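A tiny Python sketch of standardisation. The function name `z_score` and all the numbers are invented for illustration; the point is that a height in cm and a weight in kg end up on the same unit-free scale.

```python
def z_score(x, mean, std):
    """Standardise x: how many standard deviations it is from the mean."""
    return (x - mean) / std

# Hypothetical person: 180 cm where heights have mean 170 and std 10,
# and 60 kg where weights have mean 70 and std 5.
print(z_score(180, 170, 10))  # 1.0  (1 standard deviation above the mean)
print(z_score(60, 70, 5))     # -2.0 (2 standard deviations below the mean)
```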
When comparing spread of 2 or more distributions we should?
compare the coefficients of variation for each, as these take into account differences in the means
Coefficient of variation
CV = standard deviation (sigma) divided by the mean
if the dispersion around the mean is large, there is more uncertainty and lower accuracy in the data
it can be positive or negative
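A short Python sketch of the coefficient of variation, using the divide-by-N standard deviation. The function name and both data sets are invented. The second data set is the first one times 10, so its standard deviation is 10 times bigger, but the CV is the same because the mean is 10 times bigger too.

```python
import math

def coefficient_of_variation(values):
    """CV = standard deviation (dividing by N) divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance) / mean

small = [2, 4, 4, 4, 5, 5, 7, 9]          # mean 5, std 2
large = [x * 10 for x in small]           # mean 50, std 20

print(coefficient_of_variation(small))
print(coefficient_of_variation(large))
```

Both lines should print a value of about 0.4, which is why CV is a fairer comparison of spread than the raw standard deviations.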
What are the 2 relative measures?
the coefficient of variation
and the idea of standardising data
Different statistical measures
all of them are absolute measures
IQR uses the middle 50% and is less influenced by extreme values
WHAT ARE 3 MEASURES OF AVERAGE (CENTRALITY)
MEAN MEDIAN AND MODE
Median if there’s a sales and frequency table
n = the number of data points = all the frequencies added together
then we do (n + 1) divided by 2
we get the 24th (for example)
then we count through the frequencies to see which value is the 24th
the ordered data can be written down like 0 x 5, 1 x 16, 2 x 12 etc. to find the 24th data point
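A Python sketch of counting through the frequencies to find the median. The function names are hypothetical. The first 3 rows of the table (0 x 5, 1 x 16, 2 x 12) come from the card; the last 2 rows are invented so that n = 47 and the median is the 24th data point.

```python
def value_at(rows, k):
    """Return the k-th data point (1-based) by counting through the frequencies."""
    running_total = 0
    for value, frequency in rows:
        running_total += frequency
        if k <= running_total:
            return value
    raise ValueError("k is larger than the number of data points")

def median_from_frequency_table(table):
    """Median of a list of (value, frequency) pairs using the (n + 1) / 2 rule."""
    rows = sorted(table)                      # order by value, small to large
    n = sum(frequency for _, frequency in rows)
    position = (n + 1) / 2
    lower = value_at(rows, int(position))
    if position.is_integer():
        return lower
    upper = value_at(rows, int(position) + 1)
    return (lower + upper) / 2                # even n: average the 2 middle points

# Rows (3, 9) and (4, 5) are made up to complete the example.
table = [(0, 5), (1, 16), (2, 12), (3, 9), (4, 5)]
print(median_from_frequency_table(table))  # 2 (the 24th of 47 data points)
```

Counting by hand: the running totals are 5, 21, 33, so the 24th data point falls in the row for the value 2.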
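Five number summary TIPS
LQ = we find the data point at the LQ position, and that value is the answer
UQ = we find the data point at the UQ position, and that value is the answer
DON’T SUBTRACT THEM (subtracting gives the IQR, which is a different question)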
Interquartile range (IQR)
order the data from small to large
FIND THE UQ AND LQ POSITIONS WITH THE FORMULA: (n + 1) divided by 4 for the LQ, then times that by 3 for the UQ, to find which data point each one is
then for each of them find the number in the data set that corresponds to that position, e.g. the 9th number and the 3rd number
then subtract the actual numbers from the data set (UQ number - LQ number) and you’ve got the IQR
If all data points decrease by 7, the IQR decreases by 7: true or false?
FALSE, because the IQR is a measure of spread, not centrality, so it doesn’t change as the whole dataset moves
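A quick Python check of this card with invented data. It uses `statistics.quantiles` from the standard library (Python 3.8+), whose default "exclusive" method places the quartiles at the (n + 1) / 4 style positions; the function name `iqr` is just for illustration.

```python
import statistics

def iqr(values):
    """IQR = UQ - LQ, using the standard library quartiles."""
    lq, _median, uq = statistics.quantiles(values, n=4)
    return uq - lq

original = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 20]
shifted = [x - 7 for x in original]          # every data point decreases by 7

print(iqr(original), iqr(shifted))           # 8.0 8.0 (spread is unchanged)
print(statistics.median(original), statistics.median(shifted))  # 8 1 (centre moves by 7)
```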
Why is the IQR not affected by outliers?
Because it only measures UQ - LQ (the middle 50% of the data), so the extreme data points do not affect it
what are outliers?
they are points that are far away from other data points
What is a boxplot?
it shows skew in the data
if both sides are equal = no skew, as there is balance
skew = sides not equal, or the median is not in the middle of the data; the distribution is more concentrated on the left or right side
it’s basically the 5 number summary
MEAN if there’s 1, 2, 3, 4
sum them all, then divide by 4 (the number of data points)
Mean if there is an x and frequency table
multiply each x by its frequency, then add all the results together
then divide by n (the number of data points)
n = all of the frequencies added together
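A minimal Python sketch of the mean from an x and frequency table. The function name `mean_from_frequency_table` and the tables are invented for illustration.

```python
def mean_from_frequency_table(table):
    """Mean of (x, frequency) pairs: sum of x times frequency, divided by n."""
    total = sum(x * frequency for x, frequency in table)
    n = sum(frequency for _, frequency in table)   # n = all the frequencies added
    return total / n

# The value 1 appears twice and the value 3 appears twice: (2 + 6) / 4.
print(mean_from_frequency_table([(1, 2), (3, 2)]))  # 2.0

# The plain case 1, 2, 3, 4 is the same as every frequency being 1.
print(mean_from_frequency_table([(1, 1), (2, 1), (3, 1), (4, 1)]))  # 2.5
```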