Week 3 (Data Types) Flashcards
What are the two main data types?
Categorical or nominal: things that can be counted
Measurement: things that can be measured
What are categorical/nominal variables?
Discrete: a fixed number of categories
Labels can be represented by a name or a number (vanilla, strawberry, chocolate; 1, 25, 18)
Only valid mathematical operation is counting
Ordinal scales?
Discrete: a fixed number of values
Inherent order (ranks)
Some information about quantity
Steps may not be equal
Movement along the scale indicates a change in amount but doesn’t indicate how much change
Can’t calculate means etc.
Interval scales?
Order and equal intervals
Continuous
Mathematical operations - addition and subtraction
No true 0
0 doesn’t mean absence
Ratio scales?
Order
Equal intervals
True 0 = absence
Physical quantities (mass, length, time)
Can calculate ratios of different values
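A small sketch of why ratios need a true zero. Temperature is an assumed example that is not in these notes: Celsius is an interval scale (0 does not mean absence), while Kelvin is a ratio scale. The helper name is made up for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return celsius + 273.15


cold_c, warm_c = 10.0, 20.0

# Addition and subtraction are meaningful on an interval scale.
difference_c = warm_c - cold_c

# Dividing Celsius values is misleading, because 0 on this scale is not "no temperature".
naive_ratio = warm_c / cold_c

# On the Kelvin scale 0 means absence, so the ratio is meaningful.
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(difference_c, naive_ratio, round(true_ratio, 3))
```

The Celsius ratio suggests "twice as hot", but the Kelvin ratio shows the real change is only a few percent.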
Continuous variables?
Theoretically infinite resolution between minimum and maximum
Can be converted to discrete variables but not vice versa - conversion causes loss of information/precision
A construct can be continuous but the method of quantifying it may be discrete
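A minimal sketch of converting a continuous variable to a discrete one. The heights and the cut-offs (160 and 180 cm) are made-up values for illustration only.

```python
heights_cm = [151.2, 163.8, 164.1, 178.5, 190.3]  # hypothetical measurements


def height_band(height):
    """Map a continuous height to one of three ordered bands (arbitrary cut-offs)."""
    if height < 160:
        return "short"
    if height < 180:
        return "medium"
    return "tall"


bands = [height_band(h) for h in heights_cm]
print(bands)

# Five distinct measurements collapse into three labels, so the
# original values cannot be recovered from the bands.
distinct_before = len(set(heights_cm))
distinct_after = len(set(bands))
print(distinct_before, distinct_after)
```

Three different heights all become "medium", which is the loss of precision the card describes.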
What are the 3 measures of central tendency?
Mean
Median
Mode
What is the mode?
The only measure of central tendency that can be used for categorical data
Most commonly occurring value in a set
Sample can have more than one mode
Bimodal: two modal values
Multimodal: more than two modal values
If there are no values that occur more than once, there isn’t a mode for the data set
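One possible way to find the mode in Python, following the rules on this card (more than one mode is allowed, and no repeats means no mode). The function name is made up for illustration.

```python
from collections import Counter


def modes(values):
    """Return every modal value, or an empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # No value occurs more than once, so there is no mode.
        return []
    return [value for value, count in counts.items() if count == highest]


flavours = ["vanilla", "chocolate", "vanilla", "strawberry", "chocolate"]
print(modes(flavours))     # two modal values, so this sample is bimodal
print(modes([1, 25, 18]))  # nothing repeats, so the list is empty
```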
What is the median?
All scores ordered in increasing value and the middle score is the median.
Same number of observations below and above the median
Odd sample size- middle score
Even sample size - average of the two middle scores
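A sketch of the median worked out by hand for an odd and an even sample, checked against Python's built-in statistics.median. The function name and the scores are made up for illustration.

```python
from statistics import median


def middle_score(scores):
    """Sort the scores, then take the middle one or average the two middle ones."""
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd: the middle score
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average of two middle scores


odd_sample = [7, 3, 9, 1, 5]   # ordered: 1 3 5 7 9
even_sample = [7, 3, 9, 1]     # ordered: 1 3 7 9

print(middle_score(odd_sample), middle_score(even_sample))
```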
What is the mean?
Most commonly used
Add all values and divide by total number
Value around which scores are distributed
Doesn’t systematically overestimate or underestimate
Isn’t biased
What happens with extreme values?
If a value isn’t obviously an outlier, be careful about discarding it
Median is unaffected by end points, whereas the mean uses all of the data, so in some cases the mean is not representative
Outliers can sometimes be seen using visual inspection
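A quick illustration of how one extreme value moves the mean but barely moves the median. The income figures are invented for the example.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]   # hypothetical incomes, in thousands
with_outlier = incomes + [400]   # add one extreme value

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)          # both sit in the middle of the data
print(round(mean_after, 1), median_after)  # the mean jumps, the median hardly moves
```

After the outlier is added, the mean is larger than five of the six scores, so it no longer represents a typical value.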
What measures are used to describe spread or dispersion?
Range
Interquartile range
Sample standard deviation
What is range?
Maximum - minimum
Depends entirely on the two extreme scores - if either is an outlier, the range overestimates variability in the data
Range tends to increase as sample size increases, because the bigger the sample, the more opportunities there are to get extreme values. Large samples give a better picture of the extremes
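A small simulation sketch of the range growing with sample size. The scores are randomly generated (mean 100, SD 15 are arbitrary choices), and the samples are nested: each larger sample contains the smaller one, so here the range can only stay the same or grow. With separate, independent samples the increase is a tendency rather than a guarantee.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
scores = [rng.gauss(100, 15) for _ in range(1000)]


def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


sizes = [10, 100, 1000]
# Use the first 10, first 100 and all 1000 scores.
ranges = [value_range(scores[:size]) for size in sizes]

for size, spread in zip(sizes, ranges):
    print(size, round(spread, 1))
```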
Explain quartiles?
Group the data into 4 ordered, equal groups
Q1 lower quartile: 25% are below, 75% are above
Q3 upper quartile: 75% below, 25% above
Interquartile range (IQR): difference between Q3 and Q1 (Q3 - Q1)
Shows how much spread there is in the middle 50% of scores
The bigger the IQR, the bigger the dispersion
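A sketch of quartiles and the IQR using statistics.quantiles from the Python standard library. Different textbooks and packages use slightly different quartile conventions, so the default method used here is an assumption and may not match hand calculations from class exactly. The data are made up.

```python
from statistics import median, quantiles

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 18]  # hypothetical scores

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

The middle cut point (Q2) is the median, and the IQR ignores the lowest and highest quarters of the scores, so extreme values at either end do not affect it.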
Variance and standard deviation?
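Variance is roughly the average of the squared differences from the mean
Calculate how far away each score is from the mean- some are below and some are above, so the differences are squared before averaging to stop them cancelling out
Standard deviation is the square root of the variance, which puts it back in the original units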
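A step-by-step sketch of the sample variance and sample standard deviation, checked against Python's statistics module. The scores are invented, and dividing by n - 1 (rather than n) is the usual sample formula, which is why the variance is only "roughly" the average squared difference.

```python
from math import isclose, sqrt
from statistics import mean, stdev, variance

scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical scores
centre = mean(scores)

# How far each score is from the mean: some negative, some positive.
deviations = [score - centre for score in scores]

# The raw deviations cancel out, which is why they get squared.
total_deviation = sum(deviations)

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = sum(d ** 2 for d in deviations) / (len(scores) - 1)

# Standard deviation: square root of the variance, back in the original units.
sample_sd = sqrt(sample_variance)

print(total_deviation, round(sample_variance, 3), round(sample_sd, 3))
```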
How do we choose descriptive statistics?
Nominal data should not be summarised using the mean, median or interquartile range; the mode is the appropriate summary
Ordinal data can be summarised with some descriptive statistics, including the median, quartiles and interquartile range
Explain distributional shape?
In a normal distribution, data values tend to be symmetrically clustered around the mean
Skewness of distributional shape?
Instead of being symmetrical, the tail is spread out much further on one side. This is not a normal distribution
Positive: right skewed. The tail stretches towards high values; most scores sit at the low end
Negative: left skewed. The tail stretches towards low values; most scores sit at the high end
Histograms can be used to assess skewness
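Besides looking at a histogram, skewness can be put into a number. The sketch below uses one common moment-based formula (third central moment divided by the 1.5 power of the second); this choice of formula and the data are assumptions for illustration, and statistics packages may apply slightly different corrections.

```python
from statistics import mean, median


def skewness(values):
    """Moment-based skewness: positive = tail to the right, negative = tail to the left."""
    centre = mean(values)
    n = len(values)
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]    # one long tail towards high values
left_skewed = [-v for v in right_skewed]    # mirror image: tail towards low values
symmetric = [1, 2, 3, 4, 5]

print(round(skewness(right_skewed), 2))
print(round(skewness(left_skewed), 2))
print(skewness(symmetric))

# In this right-skewed sample the tail pulls the mean above the median.
print(mean(right_skewed), median(right_skewed))
```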
What is Kurtosis?
Is about the shape of the two tails
Heavy (long, fat) tails, often with a sharper peak: high kurtosis
Light (short, thin) tails, often with a flatter top: low kurtosis
It is typically measured in relation to the normal distribution
Skewness is more important
EDA?
Exploratory data analysis
Refers to procedures designed to present data in an informative way using graphical, pictorial and summary methods
Graphs and tables help to organise, explore and present data and highlight its features
What can you do with one categorical variable to turn it into visual information?
Frequency: represents counts in each category
Contingency table (two-way table) of frequencies: row or column percentages are a means of summarising the relationship between the two variables
Pie charts: graphical representations used for a single categorical variable with typically a few categories
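A sketch of a frequency table and a two-way table with row percentages, built with collections.Counter. The orders (flavour and serving type) are invented data for illustration.

```python
from collections import Counter

# Each order is (flavour, serving): two categorical variables.
orders = [
    ("vanilla", "cone"), ("vanilla", "cup"), ("chocolate", "cone"),
    ("chocolate", "cone"), ("strawberry", "cup"), ("vanilla", "cone"),
]

# Frequency table for one variable: counts in each category.
flavour_counts = Counter(flavour for flavour, _ in orders)

# Two-way (contingency) table: counts for each flavour/serving pair.
two_way = Counter(orders)

# Row percentages: each cell as a percentage of its flavour's total.
row_percent = {
    (flavour, serving): 100 * count / flavour_counts[flavour]
    for (flavour, serving), count in two_way.items()
}

print(dict(flavour_counts))
for cell, percent in row_percent.items():
    print(cell, round(percent, 1))
```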
What can we do with one or more categorical variables to turn them into visual information?
Bar (column) graphs
Can be used for either one or two variables
How do we turn one continuous variable into visual information?
Stem and leaf plots: group data into intervals of equal length
Histograms: groups the data into equally sized intervals
The area of each box in a histogram is proportional to the frequencies of the intervals of the values
Box plots:
A way of presenting continuous data that gives a picture of how the data are distributed. The box shows the median and the interquartile range, and the whiskers cover the remaining data out to the minimum and the maximum
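A text-only sketch of the numbers behind a histogram and a box plot for the same sample. The scores, the bin start (50) and the bin width (10) are made-up choices, the function name is hypothetical, and the quartiles use the default convention of statistics.quantiles, which may differ slightly from other packages.

```python
from statistics import quantiles


def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each equal-width interval [low, low + width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if not 0 <= index < n_bins:
            raise ValueError(f"{value} falls outside the bins")
        counts[index] += 1
    return counts


scores = [52, 55, 61, 64, 67, 68, 70, 73, 75, 88]  # hypothetical scores

# Histogram: frequencies in equally sized intervals.
counts = bin_counts(scores, start=50, width=10, n_bins=4)
for i, count in enumerate(counts):
    low = 50 + 10 * i
    print(f"{low}-{low + 10}: {'#' * count}")

# Box plot: minimum, Q1, median, Q3, maximum.
q1, q2, q3 = quantiles(scores, n=4)
five_numbers = (min(scores), q1, q2, q3, max(scores))
print(five_numbers)
```

The box would run from Q1 to Q3 with a line at the median, and the whiskers would reach the minimum and maximum.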
How can we present more than one continuous variable in a visual way?
Scatter plots
Used to consider the relationship between two quantitative variables, e.g. price and thickness of textbooks
Can also be constructed with the inclusion of categorical variables to differentiate the relationship, e.g. price vs thickness differentiated by cover type
Principles of good graphs?
Clear images
Smooth and sharp lines
Legible and simple font
Measurement units are provided
Clearly labelled axes
Elements within figure are clearly labelled or explained
Error bars included when graphing descriptive statistics
Bad graphs?
Obscure or misrepresent information
Graphs containing outright mistakes
Three-dimensional graphs in which the third dimension does not represent anything
Bar charts with the scale starting above zero
Scatterplots with the Y and X scales not restricted to the range of the data
Fanciful plots that result in optical illusions that are known to mislead