Data analysis flashcards
when is linear regression used?
to improve on correlation when measuring the association between a continuous exposure and a continuous outcome
how can you get a more representative sample?
more data
what does statistics allow for?
it allows us to take all data in and summarise it in a way that is understandable and useful
what are the two main properties of data that we want to capture through statistics?
1) what the values look like - where quantitative data sits in numerical space and which categories of categorical data are more or less common; 2) the relationships between variables
what does the analysis done depend on?
how the data is recorded, how the data is distributed, and the research question - does the analysis answer what it is meant to?
how is categorical data usually recorded?
as text or labels
what is ordinal data?
categorical data whose categories are ordered or ranked
how can you present categorical data?
counts, percentages, tables and graphs
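A minimal sketch of the counts-and-percentages summary in Python. The blood group labels are made-up illustration data, not taken from these notes.

```python
from collections import Counter

# Hypothetical categorical variable recorded as text labels.
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O", "A", "O"]

counts = Counter(blood_groups)      # count of each category
total = sum(counts.values())        # total number of observations

# Percentage of observations falling in each category.
percentages = {group: 100 * n / total for group, n in counts.items()}

# Print a small frequency table, most common category first.
for group, n in counts.most_common():
    print(f"{group:<3} {n:>3} {percentages[group]:5.1f}%")
```

The same table could be drawn as a bar chart; the counts are what the bar heights would show.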
what alters how you present data?
who you are presenting the data to
in what order are the parts of a Stata command written?
command name, then the arguments for the command, then any further options after a comma
what are arguments?
they are the variables (or other inputs) given to the command that determine what it is run on, e.g. the variable to plot in a bar chart
when should you add graphics to the bar chart?
only if they provide more information and help to understand the information already given
what are the methods for testing relationships?
logistic regression and t-tests - this is where we have one categorical and one continuous variable; chi-squared - this is where both variables are categorical
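As an illustration of one of these tests, here is a rough sketch of the chi-squared statistic for a 2x2 table of counts (two categorical variables). The table is invented, no continuity correction is applied, and the p-value shortcut shown only holds for 1 degree of freedom.

```python
import math

# Hypothetical 2x2 table of counts: rows = exposed / unexposed,
# columns = outcome present / outcome absent.
observed = [[10, 20],
            [20, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count in each cell if the two variables were unrelated.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# A 2x2 table has 1 degree of freedom; for that case the p-value can be
# written with the complementary error function.
p_value = math.erfc(math.sqrt(chi_squared / 2))

print(f"chi-squared = {chi_squared:.2f}, p = {p_value:.4f}")
```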
what is numerical data?
it is when the data is in numbers - can count or measure the values
what is discrete?
when the numerical data is whole numbers
how can you summarise the size of numerical values?
mean and median
how can you summarise the spread of numerical values?
variance, SD and IQR
how can we report some sort of extreme in numerical values?
modal value, the minimum and the maximum
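The size, spread and extreme summaries from the last three cards can be sketched with Python's standard `statistics` module. The ages are hypothetical illustration data.

```python
import statistics

# Hypothetical numerical data: ages of ten participants.
ages = [23, 25, 25, 27, 29, 31, 31, 31, 35, 43]

# Size (centre) of the values.
mean_age = statistics.mean(ages)        # sum of values / number of values
median_age = statistics.median(ages)    # middle of the ordered values

# Spread of the values (sample versions, dividing by n - 1).
variance = statistics.variance(ages)
sd = statistics.stdev(ages)

# Most common value and the extremes.
modal_age = statistics.mode(ages)
youngest, oldest = min(ages), max(ages)

print(mean_age, median_age, variance, round(sd, 2), modal_age, youngest, oldest)
```

With ten values there is no single middle value, so the median here is halfway between the 5th and 6th ordered values (29 and 31).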
what do you need to consider when analysing numerical data?
the specific reason for comparing groups or populations
what can you get from the simple plot graph?
the range and the mode and understand how the data fits together
what do histograms show?
how common the values are relative to each other - where the typical or most common values fall
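A text-only sketch of what a histogram does: values are grouped into equal-width bins and the bar length shows how many values fall in each bin. The readings and the bin width of 10 are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical data: systolic blood pressure readings (mmHg).
readings = [112, 118, 121, 124, 125, 127, 128, 131, 133, 136, 139, 142, 158]

bin_width = 10
# Assign each value to the lower edge of its bin, e.g. 127 -> 120.
bins = Counter((value // bin_width) * bin_width for value in readings)

# One row per bin; the bar length is the number of values in that bin.
for lower in range(min(bins), max(bins) + bin_width, bin_width):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * bins[lower]}")

# The bin holding the most values is where the typical values fall.
most_common_bin = max(bins, key=bins.get)
```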
what would you use to summarise a normal distribution?
it is symmetrical so use the mean and SD
how do you calculate SD?
take each value and subtract the mean, then square each result. Add the squares all together, divide by one less than the total number of values, and take the square root
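The SD steps can be followed line by line in a short sketch. The eight values are an invented sample; the result is checked against `statistics.stdev`, which also divides by one less than the number of values.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

# Step 1: the mean.
mean = sum(values) / len(values)

# Step 2: subtract the mean from each value and square the result.
squared_deviations = [(v - mean) ** 2 for v in values]

# Step 3: add the squares and divide by one less than the number of values.
sample_variance = sum(squared_deviations) / (len(values) - 1)

# Step 4: take the square root.
sd = math.sqrt(sample_variance)

print(round(sd, 3), round(statistics.stdev(values), 3))
```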
what is the mean?
the sum of all values divided by the total number of values
what is the SD?
it is the average spread of values around the mean
what is left skew?
when the low values are quite rare and the long tail goes to the left - opposite is right skew
what is the IQR?
the spread of values around the median - the distance between the values one quarter and three quarters of the way through the ordered data
when would you use the IQR and median?
when the data is skewed
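A small sketch of why the median and IQR suit skewed data. The lengths of stay are invented and right-skewed; quartiles come from `statistics.quantiles` with its default method, and other software may use slightly different quartile conventions.

```python
import statistics

# Hypothetical right-skewed data: length of hospital stay in days.
stay_days = [1, 2, 2, 3, 3, 4, 5, 6, 8, 15, 39]

mean = statistics.mean(stay_days)       # pulled upwards by the long right tail
median = statistics.median(stay_days)   # middle value, not affected by the tail

# Quartiles: values one quarter, one half and three quarters into the data.
q1, q2, q3 = statistics.quantiles(stay_days, n=4)
iqr = q3 - q1

print(f"mean = {mean}, median = {median}, IQR = {q1} to {q3} (width {iqr})")
```

Here the two longest stays drag the mean well above the median, while the median and IQR still describe where most values sit.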
where is the median when there is a tie (two middle values)?
it lies halfway between them
what is true of a normal distribution?
median = mean
what does a scatterplot show?
it shows the relationship between two numeric variables - how the x changes relative to the y - how they covary
what is a perfect positive and negative correlation?
perfect positive correlation = 1 and perfect negative correlation = -1
what is the value for no correlation?
0
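The three reference values (1, -1 and 0) can be reproduced with a hand-written Pearson coefficient. The function name `pearson` and the data are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # How x and y vary together, and how each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]     # y goes up in a straight line with x
falling = [10 - 3 * v for v in x]   # y goes down in a straight line with x
no_trend = [2, 1, 3, 1, 2]          # no overall linear trend with x

print(pearson(x, rising), pearson(x, falling), pearson(x, no_trend))
```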
how would you formalise a correlation?
use a correlation test
what correlation test would you use for a) a normal distribution and b) a skewed?
a) Pearson
b) Spearman’s Rank
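A rough sketch of how the two tests differ: Spearman's rank is Pearson's coefficient calculated on the ranks of the values rather than the values themselves. The helper names are hypothetical, and the simple ranking below assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]   # always increasing, but curved rather than straight

print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Because ranking ignores how far apart the values are, Spearman's is less affected by skew and extreme values.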
what are the limits of correlations?
does not distinguish exposure from outcome, so cannot comment on the direction of the relationship; only uses two variables at a time; can only test for a linear relationship; can be an oversimplification - may show some things as similar when they are not
what is Anscombe’s quartet?
it is a set of four pairs of variables that all have almost exactly the same correlation, but when each pair is plotted they have very different structures - shows how reducing the relationship between two variables to one number may miss detail
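The point can be checked numerically with the quartet's published values, as best recalled here; they are worth verifying against a trusted copy before relying on them. The `pearson` helper is an illustrative assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

# The first three datasets share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One correlation per dataset; the values come out nearly identical.
correlations = {name: pearson(xs, ys) for name, (xs, ys) in quartet.items()}

for name, r in correlations.items():
    print(f"dataset {name}: r = {r:.3f}")
```

Plotting each pair as a scatterplot is what reveals the differences: a roughly linear cloud, a curve, a line with one outlier, and a vertical stack with one extreme point.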
why must many correlation tests be done?
to see the effect of confounders
what is regression analysis good for?
can specify multiple exposures, include non-linear relationships, and specify an exposure and an outcome
what is included in regression analysis?
for the relationship between two variables, an intercept value followed by an effect size is given - the effect size shows how much the outcome is expected to change for one unit of change in the exposure - can add a line of best fit to read this off easily
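A minimal sketch of how the intercept and effect size (slope) of a line of best fit can be calculated by least squares for one exposure and one outcome. The function name `fit_line` and the data are assumptions for illustration; statistical software would report these values directly.

```python
def fit_line(exposure, outcome):
    """Least-squares intercept and slope for outcome = intercept + slope * exposure."""
    n = len(exposure)
    mean_x = sum(exposure) / n
    mean_y = sum(outcome) / n
    # Slope: how x and y vary together, divided by how x varies on its own.
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(exposure, outcome))
        / sum((x - mean_x) ** 2 for x in exposure)
    )
    # The best-fit line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: exposure in whole units, outcome measured on some scale.
exposure = [1, 2, 3, 4, 5]
outcome = [2, 4, 5, 4, 5]

intercept, slope = fit_line(exposure, outcome)
print(f"outcome = {intercept:.1f} + {slope:.1f} * exposure")
```

Here each extra unit of exposure is associated with an expected rise of about 0.6 in the outcome, starting from about 2.2 when the exposure is zero.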
what is R² (R-squared)?
it is the proportion of variation in outcome explained by the exposure
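As a sketch, R² can be calculated as one minus the unexplained variation (around the fitted line) divided by the total variation (around the mean of the outcome). The data and variable names are hypothetical.

```python
# Hypothetical data for one exposure and one outcome.
exposure = [1, 2, 3, 4, 5]
outcome = [2, 4, 5, 4, 5]

n = len(exposure)
mean_x = sum(exposure) / n
mean_y = sum(outcome) / n

# Least-squares line of best fit.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(exposure, outcome))
    / sum((x - mean_x) ** 2 for x in exposure)
)
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in exposure]

# Variation left over after fitting the line.
residual_variation = sum((y - p) ** 2 for y, p in zip(outcome, predicted))
# Total variation in the outcome around its mean.
total_variation = sum((y - mean_y) ** 2 for y in outcome)

r_squared = 1 - residual_variation / total_variation
print(f"R-squared = {r_squared:.2f}")
```

An R² of about 0.6 here means roughly 60% of the variation in the outcome is explained by the exposure in this made-up example.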
what must you include in regression analysis?
the 95% CI and the P value