Data Analysis Flashcards

1
Q

When summarising and looking at relationships in data, what is the first stage?

A

summarise what the values look like

e.g. where does the quantitative data sit in numerical space?

which categories are more or less common?

then understand any relationships between different properties that data has been collected on (through analysis)

2
Q

What does the analysis that is conducted depend on?

A

the data itself:

  • how is the data recorded?
  • how is the data distributed?

the research question:

  • will this analysis answer the question I’m trying to ask?
3
Q

Why does a user (someone reading a report) need to know about statistics?

A

They need to know enough about what the researcher should have done to judge whether the results are accurate and meaningful

4
Q

How is categorical data usually recorded?

A

it is recorded as text (or labelled)

it is ordinal if there is some way to order or rank the data

5
Q

Why is deprivation ordinal data?

A

there is a way to order the data (low, medium, high)

the others are nominal

6
Q

How is categorical data described?

A

you need to show how often each category is seen

this is done through counts or percentages

it is shown through tables and graphs
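
As a rough illustration of the counts and percentages described above, here is a minimal Python sketch. The deprivation levels and the `frequency_table` helper are made up for this example and are not part of the course material.

```python
from collections import Counter

# Hypothetical deprivation levels for ten participants (illustrative data only).
levels = ["low", "high", "medium", "low", "low",
          "medium", "high", "low", "medium", "low"]

def frequency_table(values):
    """Return {category: (count, percentage)} for a list of category labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

table = frequency_table(levels)

# Print one row per category: label, count, percentage.
for cat, (n, pct) in sorted(table.items()):
    print(f"{cat:<8}{n:>3}{pct:>7.1f}%")
```

The same table could then be drawn as a bar chart, with one bar per category.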

7
Q

What is the following STATA command structure?

A

the command name comes first, followed by arguments for the command to use (usually variable names)

8
Q

What method is used for testing relationships involving categorical data?

A

logistic regression

9
Q

What tests are used for testing relationships when there is one categorical variable and one continuous variable?

A

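t-tests (comparing the mean of the continuous variable between two groups)

chi-squared does not apply here: it tests the relationship between two categorical variables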

10
Q

What is meant by numerical data?

What are the 2 types?

A

the data involves numbers - you can count or measure the values

it is discrete if only whole numbers make sense

it is continuous if the data can take any value

11
Q

How is numerical data described?

A
  1. give a summary of the size of the values, using the mean or median
  2. give a summary of the spread of the values, using the variance, standard deviation or interquartile range
  3. report the extremes (minimum, maximum) and the modal value
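
The three kinds of summary listed above can be sketched with Python's standard `statistics` module. The values are invented for illustration; Stata would report similar figures, though its output layout differs.

```python
import statistics

# Hypothetical measurements (illustrative values only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(values),          # size of the values
    "median": statistics.median(values),
    "variance": statistics.variance(values),  # sample variance (divides by n - 1)
    "sd": statistics.stdev(values),           # sample standard deviation
    "min": min(values),                       # extremes
    "max": max(values),
    "mode": statistics.mode(values),          # most common value
}

for name, value in summary.items():
    print(f"{name:<9}{value:.2f}")
```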

12
Q

What is the problem in using a point plot to describe numerical data?

A

It is quick, but not very useful

Summary measures are more useful for numerical data

13
Q

From the point plot, what values can be worked out?

A

Range:

this is the highest and lowest values and the difference between them

Mode:

this is the most common value

14
Q

Why are histograms used for numerical data?

What do they show?

A

histograms show how common the values are relative to each other

this allows you to see where the typical, or most common values fall
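
To make the idea concrete, the sketch below groups values into equal-width bins and prints a rough text histogram. The ages, the bin width of 10 and the `histogram_counts` helper are assumptions chosen for the example; the bin labels assume whole-number data.

```python
def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width  # lower edge of the bin containing v
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical ages (illustrative data only).
ages = [21, 23, 25, 31, 33, 34, 35, 38, 41, 44, 52]
bins = histogram_counts(ages, 10)

# One row per bin; the length of the bar shows how common that range is.
for start, n in bins.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * n}")
```

Here the longest bar is the 30-39 bin, which is where the typical values fall.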

15
Q

What type of distribution is shown here?

Why is it important to recognise this?

A

the most common values fall around the middle, forming a roughly symmetrical curve

this data is normally distributed

this is important for choosing how best to summarise the data

16
Q

What are the best summary measures to use when you have normally distributed data?

A

mean and standard deviation

17
Q

What are the mean and standard deviation?

How are they calculated?

A

Mean:

  • where, on average, the values lie
  • this is the sum of all the values / total number of values

Standard deviation:

  • this is the average spread of values around the mean
  • the mean is subtracted from each value and each difference is squared; the squared differences are summed and the total is divided by one less than the total number of values (this gives the variance)
  • the square root of this value is the SD
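
The calculation can be written out step by step. This is a minimal sketch with made-up values; the function names `mean` and `sample_sd` are chosen for the example. Note that each difference from the mean is squared before summing, otherwise the differences would cancel out to zero.

```python
import math

def mean(values):
    """Sum of all the values divided by the number of values."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation, dividing by n - 1."""
    m = mean(values)
    squared_diffs = [(v - m) ** 2 for v in values]      # subtract the mean, then square
    variance = sum(squared_diffs) / (len(values) - 1)   # divide by one less than n
    return math.sqrt(variance)                          # root of the variance is the SD

# Hypothetical values (illustrative data only).
values = [1, 2, 3, 4, 5]
print(mean(values))       # 3.0
print(sample_sd(values))  # square root of 2.5, roughly 1.58
```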
18
Q

What STATA command would be used to identify the mean and standard deviation?

A

the summarize command

19
Q

What type of distribution is shown here?

A

left skew

some of the low values are fairly rare, making a long tail

the peak of the curve is shifted over to the higher values

20
Q

What type of distribution is shown here?

A

right skew

21
Q

If data shows a right or left skew, what summary statistics are used?

A

median and interquartile range

22
Q

What is the median and how is it calculated?

A

it is the mid-point of the measurements

it can be found by putting all of the values in order and finding the mid-point

if there is an even number of values, there is no single middle value, so the median lies exactly halfway between the middle two values
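
A minimal sketch of this procedure, covering both an odd and an even number of values. The `median` function and the numbers are illustrative only.

```python
def median(values):
    """Middle value of the ordered data; halfway between the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle two

print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```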

23
Q

What is the interquartile range and how is it calculated?

A

it is the spread of values around the median

it is the difference between the value that is one quarter of the way into the data, and the value that is three quarters of the way in
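
As a sketch, the quartiles can be found with `statistics.quantiles` from Python's standard library (Python 3.8 or later). The values are invented, and software packages use slightly different conventions for locating quartiles, so Stata may not give identical figures for every data set.

```python
import statistics

# Hypothetical right-skewed values with one very large observation.
values = [2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 40]

# n=4 splits the ordered data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print(f"lower quartile = {q1}, median = {q2}, upper quartile = {q3}")
print(f"interquartile range = {iqr}")
```

Because the IQR only uses the middle half of the data, the single extreme value of 40 does not affect it.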

24
Q

What type of graph is used to visualise the IQR?

A

box plot

25
Q

What is a scatterplot used to visualise?

A

the relationship between two numerical variables

as values change along the x axis, we can see how this relates to changes in the values along the y axis

26
Q

What is meant by 2 variables that “covary”?

A

the variables change together

if there is some sort of pattern in how the two sets of values change along the axes, there is some sort of relationship between the 2 sets of values

27
Q

What are the following correlations?

A
28
Q

What correlation calculations are used for normally distributed and skewed data?

A

normally distributed - Pearson

skewed - Spearman’s rank
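
The difference between the two can be sketched in Python. Spearman's rank correlation is the Pearson correlation calculated on the ranks of the values rather than the values themselves. The data are made up, and the simple `ranks` helper assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank each value from 1 (smallest). Assumes no ties, for simplicity."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman's rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: y always rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(round(pearson(x, y), 3))
print(round(spearman(x, y), 3))
```

With these values Spearman's coefficient is 1 because the ordering is perfectly preserved, while Pearson's is a little lower because the points do not lie on a straight line.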

29
Q

What are the 3 limitations of correlation calculations?

A
  1. ignores the direction of the relationship
  2. can only test for linear relationships
  3. can only include 2 variables
30
Q

What is the problem with a correlation test ignoring the direction of the relationship?

A

Usually, the suggested exposure is on the x axis and the suggested outcome on the y axis

the correlation test can’t tell which is which, so it can’t be used to comment on whether there is an exposure and outcome association

it just tells you that the sets of numbers covary

31
Q

What are the benefits of using regression analysis?

A
  1. can specify an exposure and outcome
  2. can include non-linear relationships
  3. can specify multiple exposures
32
Q

What is the regression equation?

What question is asked that describes regression?

A

for a one unit change in X (e.g. 1% if X is measured as a percentage), what is the corresponding change in the expected value of Y?

33
Q

What is added to a scatter plot when using regression analysis?

A

a line of best fit

this allows you to see for a given value of X, what would be the predicted value of Y
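
A minimal sketch of how a line of best fit can be found, assuming the usual simple linear form Y = intercept + slope × X fitted by least squares. The data and the function names are invented for the example; the slope is the expected change in Y for a one unit change in X.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a straight line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical exposure (x) and outcome (y) values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x, y)

def predict(new_x):
    """Predicted value of Y for a given value of X, read off the fitted line."""
    return intercept + slope * new_x

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted Y at X = 6: {predict(6):.2f}")
```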

34
Q

What is the STATA command for regression analysis?

A

the command is regress, followed by the variable names: the outcome must come before the exposure

this is because regression analysis specifies an outcome and at least one exposure variable

35
Q

What values need to be identified from the regression analysis in STATA?

A
  1. coefficient
  2. confidence interval
  3. P value
  4. R squared value
36
Q

What can the P value never be reported as?

A

even if it says P is 0.000 in STATA, P can never be zero

it has been rounded down, so we report that P < 0.001
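
The reporting rule can be expressed as a small helper. This is only a sketch: the function name `format_p` and the choice of three decimal places for larger values are assumptions made for the example.

```python
def format_p(p):
    """Format a P value for a report, never showing it as zero."""
    if p < 0.001:
        return "P < 0.001"    # anything displayed as 0.000 is reported this way
    return f"P = {p:.3f}"     # otherwise report to three decimal places

print(format_p(0.00004))  # P < 0.001
print(format_p(0.0312))   # P = 0.031
```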

37
Q
A