data analysis and descriptive tendencies Flashcards
what is a population?
- complete set of objects
- group containing elements of anything you want to study
what is a sample?
- subset of a given population
does the sample have to be people?
- no, can be cells, products, SMS messages
why do you take a sample?
- cannot test every individual, so a sample is taken and used to infer about the population; the inference carries some error
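A minimal sketch of the idea above, assuming a made-up population of heights (the numbers, the seed and the sample size of 50 are illustrative assumptions, not from the flashcards). It draws a sample and compares its mean with the population mean; the gap is the error introduced by sampling.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 made-up heights in cm.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Measure only 50 individuals instead of the whole population.
sample = random.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# The difference between the two means is the error of this particular sample.
sampling_error = sample_mean - population_mean

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
print(f"sampling error  = {sampling_error:.2f}")
```

A different seed or sample size gives a different error, which is why conclusions about the population are stated with uncertainty.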
what should the sample represent? what should be considered?
- represents the population
- careful consideration of sub-categories is required to ensure that the sample reliably represents the population
what shouldn’t be done to the sample once it has been determined?
- the sample shouldn’t be modified or subdivided after it has been determined in order to reach a better conclusion (‘cherry picking’)
what is a variable?
- set of related events that can take on more than one value
can a variable be changed? give examples
- something that can be changed
e.g., characteristic or value like weight, exam mark, academic degree, hometown
what is statistical inference?
- involves figuring out how well a property of one variable can be predicted by that of another variable
what is an independent variable?
- value being changed or manipulated
- controlled or selected to determine its effect on an observed outcome
what is a dependent variable?
- observed result of the IV being manipulated
- it is something that may depend on the IV
what does research aim to do with the variables?
- attempts to find evidence that the DV is dependent on the IV
what do independent variables consist of?
- different categories called levels, conditions or treatments
how is the number of levels of an independent variable different from the number of independent variables?
- there can be multiple independent variables, but each observation belongs to only one level of a given independent variable
what is a control variable?
- kept constant to prevent them influencing the effect of IV on DV
what are control variables critical for?
- critical for study design e.g., recruitment criteria for participants
what are the different types of data?
- categorical
- ordered
- continuous
- measured
what are nominal and ordinal variables?
- qualitative and categorical
what are interval and ratio variables?
- quantitative and continuous
what is nominal data?
- categorical
- categories cannot be ordered or used in arithmetic (only their frequencies can be counted)
what are examples of nominal data?
- gender
- country
- occupation
- blood type
what is ordinal data?
- can be ordered but cannot be added or subtracted
what are examples of ordinal data?
- satisfaction rating
- education level
- spice level
what is interval data?
- can be ordered
- difference can be measured but cannot compute a ratio between two values
- no meaningful zero exists
what are examples of interval data?
- exam mark
- date
- year
what is ratio data?
- interval and can take a ratio between two
- has meaningful zero
what are examples of ratio data?
- distance
- height
- annual income
- number of successes
how do you distinguish between interval and ratio?
- can it be doubled, i.e., is "twice as much" meaningful?
yes = ratio; no = interval
what are the four main descriptive tendencies?
- central tendency
- spread
- shape
- outliers
what are the three central tendencies?
- mode
- median
- mean
what is the mode and what variable/ data is it used for?
- most frequent value (the value with the highest frequency)
- can be used for all types of variables
- often used for nominal and ordinal variables
what is the median and what variable/ data is it used for?
- middle value
- cannot be obtained for nominal variables
- obtained only on ordered variables e.g., ordinal, interval, ratio
what is the mean and what variable/ data is it used for?
- average
- balance point: the signed distances (1st moment) from the mean to the data points sum to zero
- only defined in interval and ratio variables
what two of the central tendencies are normally similar?
- mean and median are similar
how does an outlier affect central tendencies?
- strongly affects the mean but barely affects the median
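The three central tendencies and the effect of an outlier can be sketched with Python's standard `statistics` module. The exam marks below are made up for illustration.

```python
import statistics

# Made-up exam marks (interval/ratio-type data, so all three measures apply).
marks = [52, 55, 55, 58, 60, 61, 65]

mode_value = statistics.mode(marks)      # most frequent value -> 55
median_value = statistics.median(marks)  # middle value of the sorted data -> 58
mean_value = statistics.mean(marks)      # sum / count = 406 / 7 -> 58

# Add one extreme value and see how far each measure moves.
marks_with_outlier = marks + [300]
median_shift = statistics.median(marks_with_outlier) - median_value  # 59 - 58 = 1
mean_shift = statistics.mean(marks_with_outlier) - mean_value        # 88.25 - 58 = 30.25

print(mode_value, median_value, mean_value)
print(median_shift, mean_shift)
```

Without the outlier the mean and median coincide; a single extreme mark moves the mean by about 30 points and the median by 1.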
what are the three types of measures used for spread?
- quantile/ quartile/ percentile
- variance and standard deviation
- Z score
how do you find out the quantile, quartile and percentiles?
- divide data into sections containing the same number of data and report where the sections are located
what is a quantile? where do we plot this data?
- sample is divided into equal sized subgroups
- N sections are separated by N-1 values (cut points)
- plotted onto a scatterplot
what is a quartile? what is the median?
- 1st to 3rd
- when there are four sections in total
- median = 2nd quartile
what are percentiles? what is the median?
- 1st to 99th
- when there are 100 sections
- median is the 50th percentile
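A small sketch of the quantile cards, using `statistics.quantiles` from the Python standard library on made-up data. Note that software packages interpolate cut points slightly differently; this uses the function's default method, so other tools may report slightly different quartiles for the same data.

```python
import statistics

# Made-up data set with 11 values.
data = [2, 4, 4, 5, 7, 8, 9, 11, 13, 14, 15]

quartiles = statistics.quantiles(data, n=4)      # 4 sections -> 3 cut points
percentiles = statistics.quantiles(data, n=100)  # 100 sections -> 99 cut points

second_quartile = quartiles[1]          # 2nd of the 3 quartiles
fiftieth_percentile = percentiles[49]   # 50th percentile sits at index 49
median_value = statistics.median(data)

print(len(quartiles), len(percentiles))
print(second_quartile, fiftieth_percentile, median_value)
```

The 2nd quartile, the 50th percentile and the median all land on the same value.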
how do you calculate the 2nd moment?
variance = sum of (distance from mean)^2 over all data points / number of data points
what is the square root of variance called and what is it?
- called standard deviation
- typical (standard) distance of data points from the mean
what does mean + / - SD provide information on?
- where the centre is
- how spread the data points are
given SD, how can distance be described? what is this called and what does it enable?
- distance can be described as a ratio with respect to SD
- known as Z - score
- enables fair comparison of deviations
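The variance, standard deviation and Z-score cards can be worked through on a small made-up data set. This sketch divides by the number of data points, as the card does; estimates made from a sample often divide by one less than that instead.

```python
import math

# Made-up data chosen so the results are round numbers.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n                                # 40 / 8 = 5.0

# 2nd moment: average squared distance from the mean.
variance = sum((x - mean) ** 2 for x in data) / n   # 32 / 8 = 4.0

# Standard deviation: square root of the variance, in the data's own units.
sd = math.sqrt(variance)                            # 2.0

# Z-score: distance from the mean expressed as a multiple of the SD.
z_scores = [(x - mean) / sd for x in data]

print(mean, variance, sd)
print(z_scores)  # the value 9 is 2 SDs above the mean, the value 2 is 1.5 SDs below
```

Because Z-scores are in SD units, deviations from data sets with different scales can be compared directly.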
what are the two main measures of shape?
- skewness
- kurtosis
what does skewness measure and correspond to?
- measures degree of asymmetry
- corresponds to 3rd moment
how do you calculate the 3rd moment? what do you divide it by and why?
3rd moment = sum of (distance from mean)^3 over all data points / number of data points
- divide by SD^3 to make it dimensionless
what does zero skewness mean?
- data are symmetrically distributed
what does high skewness mean?
- distribution is highly asymmetrical
what does positive/ negative skewness mean?
- indicates which direction data are skewed
what does kurtosis measure? what does it correspond to?
- measures the sharpness of the peak and the heaviness of the tails
- corresponds to the 4th moment
how do you work out 4th moment? what do you divide it by and why?
4th moment = sum of (distance from mean)^4 over all data points / number of data points
- divide by SD^4 to make it dimensionless
what is kurtosis always by definition? what do we subtract?
- always positive (4th powers cannot be negative)
- subtract 3 (the kurtosis of a normal distribution); the result is called excess kurtosis and can be negative
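The moment-based definitions of skewness and kurtosis above can be sketched as follows. The helper names (`central_moment`, `skewness`, `excess_kurtosis`) and the data are illustrative assumptions; statistical packages may apply additional small-sample corrections that this sketch leaves out.

```python
import math

def central_moment(data, k):
    """Average of (distance from mean)^k over all data points."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """3rd moment divided by SD^3, so the result is dimensionless."""
    sd = math.sqrt(central_moment(data, 2))
    return central_moment(data, 3) / sd ** 3

def excess_kurtosis(data):
    """4th moment divided by SD^4, minus 3 (the value for a normal distribution)."""
    sd = math.sqrt(central_moment(data, 2))
    return central_moment(data, 4) / sd ** 4 - 3

symmetric = [1, 2, 3, 4, 5]            # mirror image around its mean
right_tailed = [1, 1, 2, 2, 3, 10]     # long tail towards large values
left_tailed = [-x for x in right_tailed]

print(skewness(symmetric))       # zero (up to rounding): symmetric data
print(skewness(right_tailed))    # positive: tail on the right
print(skewness(left_tailed))     # negative: tail on the left
print(excess_kurtosis(symmetric))  # negative: flatter than a normal distribution
```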
what are outliers?
- extreme values relative to bulk of values in a data set
what are outliers due to?
- inaccuracies in data processing
- problems with methodology e.g., measures, instruments, participants not following instructions
- actual extreme value from an unusual participant
what are the two ways you can detect outliers?
- based on z- score
- based on inter-quartile range
how does Z- score detect outliers?
- outlier if z-score is more than 3 or less than -3
- i.e., when the distance from the mean is more than 3 times the SD
how does inter- quartile range detect outliers?
- width between 1st and 3rd quartile
- outlier if value is more than 1.5 IQR above the 3rd quartile or more than 1.5 IQR below the 1st quartile
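Both detection rules can be tried on the same made-up data set. This is a sketch under assumptions: the Z-score rule here uses the SD computed by dividing by the number of data points, and the quartiles come from the default method of `statistics.quantiles`; other conventions can move the cut-offs slightly.

```python
import statistics

# Made-up data: 19 values close to 50 and one extreme value.
data = [48, 49, 50, 50, 50, 51, 51, 52, 49, 50,
        50, 51, 49, 50, 52, 48, 50, 51, 49, 120]

# Rule 1: Z-score. Flag values more than 3 SDs away from the mean.
mean = statistics.mean(data)
sd = statistics.pstdev(data)
z_outliers = [x for x in data if abs(x - mean) / sd > 3]

# Rule 2: inter-quartile range. Flag values beyond 1.5 IQR outside Q1 or Q3.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(z_outliers)    # [120]
print(iqr_outliers)  # [120]
```

The outlier itself inflates the mean and SD used by the Z-score rule, whereas the quartiles barely move, so the IQR rule tends to be the more robust of the two.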
in what samples do outliers distort the data most?
- in small samples
describe a histogram- what does height represent?
- visualises how data is distributed
- height represents frequency (how often a value appears in data)
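A text-only sketch of a histogram, assuming made-up satisfaction ratings. It uses one bar per distinct value; for continuous data the values would first be grouped into bins.

```python
from collections import Counter

# Made-up ratings on a 1-5 scale.
ratings = [3, 4, 4, 5, 2, 4, 3, 5, 4, 1, 3, 4]

# Frequency: how often each value appears in the data.
frequency = Counter(ratings)

# Draw each bar with a length equal to the frequency of that value.
for value in sorted(frequency):
    print(f"{value} | {'#' * frequency[value]}")
```

The longest bar belongs to the most frequent value, which is also the mode.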
describe a box plot
- plot summarising quartile-based stats of a data set; includes:
- location of quartiles
- range of data excluding outliers
- outliers detected by quartiles