Measurement and Descriptive Analysis Flashcards
Why is it important to classify the type of data?
-It determines the type of statistical test that is going to be used
-the type of data will determine how it is described
-when analyzing an article the type of data needs to be determined
What types of data are qualitative?
-Categorical (nominal) data
-Ordinal data
How are qualitative data described?
-Qualitative
-No mathematical data
-fall into distinct and discrete categories (finite number of categories)
-Gender (1=male, 2=female)
-Pass/fail
-Race
-Eye color
-Clinical diagnosis (1=heart failure, 2=renal failure,..)
Characteristics of categorical data
-Qualitative
-There is no natural order between categories (eye color, dead or alive, male or female)
What are dichotomous data?
If there are only 2 groups, the data are dichotomous (e.g., male/female)
What are ordinal data?
-Qualitative
-data with a natural order
-Values/observations can be ranked (put in order) or have a rating scale attached (e.g., rate your experience from good to bad)
-Numbers are not arbitrary in ordinal data (they carry meaning, e.g., the higher the better), but the distance between ranks is not necessarily equal
What are examples of Ordinal data?
-Pain scale (ranked, but not continuous)
-Likert scale (Strongly agree=5, agree=4, undecided=3)
-both are not continuous
The average score of an ordinal variable (scale from 1 to 5) is reported as 4.75. What is wrong with that statement?
-4.75 does not correspond to any category on the scale
-it is better to report the median (middlemost value) rather than the average, bc the median of ordinal data normally falls on an actual category
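A minimal Python sketch of this point, using hypothetical Likert responses (the list `responses` is invented for illustration) whose mean happens to be 4.75:

```python
import statistics

# Hypothetical Likert responses (5 = strongly agree ... 1 = strongly disagree)
responses = [5, 5, 5, 4]

# The mean lands between two categories of the scale.
mean_score = statistics.mean(responses)

# median_low always returns one of the observed values, so it is a real category.
median_score = statistics.median_low(responses)

print(mean_score, median_score)
```

`median_low` is used here instead of `median` so that the result is always an existing category, even with an even number of responses.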
What is Quantitative data?
-have mathematical meaning
-derived from counts or measurements
-most biological systems are represented in quantitative data
What type of data is quantitative?
-Continuous data
-values can take on any number (also fractions)
-biomedical values are continuous
-temperature, blood pressure, weight, LDL, age
What are baseline characteristics important for?
-Internal Validity: ensure that both groups are similar, thereby preventing confounding
-External Validity: are the results generalizable to another location?
Duration of treatment:
Number (%) of patients treated for < 4 weeks:
Drug A: 60 (25.1%) n=239
Drug B: 44 (18.5%) n=238
Drug C: 55 (22%) n=250
What type of data is that?
-Dichotomous: yes or no
-> The question is: was the patient on a given drug treated for less than 4 weeks? Yes or no
for Drug A: 60 of 239 were treated for < 4 weeks (yes); 179 were not (no)
for Drug B: 44 of 238 were treated for < 4 weeks; 194 were not
for Drug C: 55 of 250 were treated for < 4 weeks; 195 were not
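The percentages in this card can be reproduced from the counts. A minimal Python sketch; the names `treated_under_4wk` and `percent` are illustrative and not part of the original card:

```python
# Counts from the card: (patients treated < 4 weeks, total patients in the arm)
treated_under_4wk = {
    "Drug A": (60, 239),
    "Drug B": (44, 238),
    "Drug C": (55, 250),
}

def percent(yes: int, total: int) -> float:
    """Share of 'yes' answers for a dichotomous variable, in percent (1 decimal)."""
    return round(100 * yes / total, 1)

for drug, (yes, total) in treated_under_4wk.items():
    no = total - yes  # the complementary 'no' group
    print(f"{drug}: yes={yes} ({percent(yes, total)}%), no={no}")
```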
Concomitant psychotropic treatment with
Trazodone: 23
Anxiolytics: 44
Sedatives or hypnotics: 19
What type of data?
Categorical (nominal) bc each patient can be put into a bucket: trazodone / anxiolytics / sedatives or hypnotics
How many patients had fever/cough/runny nose?
Fever: 213
Cough: 163
Runny nose: 78
What type of data?
Categorical (nominal)
-can be put in buckets
Which type of data do percentages often fall into?
-Qualitative data
-Categorical (nominal), dichotomous
What is Descriptive Data?
-Measures of Central Tendency (values around the mean)
-Measure of Variability: How scattered, dispersed are the data
What does “Measures of Central Tendency” mean?
The data tend to converge on a central value (mean and median)
How are measures of variability expressed?
-Standard deviation SD
-standard error of the mean SEM
-confidence intervals
-range
-percentile
-interquartile range
What is the purpose of descriptive data?
-Describe, organize, or summarize actual data
-No statistical conclusions are drawn
What is a “Mean”?
-Arithmetic average of the data
-Affected by outliers (extreme values of the data distribution)
-often used to describe normally distributed continuous data
Which value is less affected by outliers?
-The median
-bc the median depends only on the rank order of the values, not on their size: an extreme value shifts the middle position little or not at all, whereas it pulls the mean toward itself -> the median gives a better picture of the typical value when outliers are present
What is the Median?
-Mid-most value (50th percentile)
-Half the data points are above and below
-Largely unaffected by outliers
-Often used to describe non-normally distributed continuous data
-often used to describe ordinal data (Pain scale, Likert scale)
What is the median out of these values?
1, 2, 3, 4, 5 -> 3
1, 2, 3, 4, 5, 6, 7, 8 -> average of the two middlemost values -> (4 + 5) / 2 = 4.5
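Both cases can be checked with Python's standard `statistics` module. A minimal sketch; the variable names are illustrative:

```python
import statistics

odd_values = [1, 2, 3, 4, 5]
even_values = [1, 2, 3, 4, 5, 6, 7, 8]

# Odd number of values: the single middle value.
median_odd = statistics.median(odd_values)

# Even number of values: the mean of the two middle values, (4 + 5) / 2.
median_even = statistics.median(even_values)

print(median_odd, median_even)
```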
Can Mean/Median be used for Continuous Data and Ordinal data?
Continuous data - Mean: Yes, Median: Yes
Ordinal data - Mean: No, Median: Yes
Nominal data - Mean: No, Median: No
Is the Mean/Median affected by outliers?
Mean: Yes Median: No
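A minimal Python sketch of the outlier effect, using hypothetical measurements (the values and the outlier of 200 are invented for illustration):

```python
import statistics

values = [70, 72, 75, 78, 80]   # hypothetical measurements
with_outlier = values + [200]   # same data plus one extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# How far each summary moved because of the single outlier
mean_shift = mean_after - mean_before
median_shift = median_after - median_before

print(f"mean:   {mean_before} -> {mean_after:.1f} (shift {mean_shift:.1f})")
print(f"median: {median_before} -> {median_after} (shift {median_shift})")
```

In this example the mean moves by about 21 units while the median moves by 1.5, which is why the median is preferred for data with outliers.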
How can the distribution of data be organized?
Distribution or graph of the frequency of occurrence
What does a normal Distribution look like?
-Symmetrical, bell-shaped
-also called Gaussian distribution
What is the Standard deviation SD?
-used to describe the variability of normally distributed data
-gives an idea of the width of the curve, the spread of the data around the mean
e.g., mean = 75 and SD = 10 -> reported as 75 +/- 10
-most commonly used measure of data variability with medical and health data
What percentage of normally distributed data lies within 1, 2, and 3 SD of the mean?
Mean +/- 1 SD: about 68.2%
Mean +/- 2 SD: about 95.5%
Mean +/- 3 SD: about 99.7%
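These percentages follow from the normal distribution and can be checked with the error function. A minimal Python sketch; `coverage_within` is an illustrative helper name:

```python
import math

def coverage_within(k: float) -> float:
    """Fraction of a normal distribution lying within k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mean +/- {k} SD covers {100 * coverage_within(k):.1f}% of the data")
```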
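What do whiskers (error bars) on the bars of a graph represent?
Usually the standard deviation (some graphs show the SEM or a confidence interval instead, so check the legend)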
What affects the SD?
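-Number of patients: n is part of the SD formula, but a larger sample does not systematically shrink the SD; it makes the estimate of the SD more stable. The measure that gets smaller as n grows is the SEM (SD divided by the square root of n)
-Outliers: increase the SD and pull the distribution in one direction, so it is no longer bell-shaped (Gaussian) -> skewed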
What does a skewed-shaped data curve imply?
-most of the data lie on one side of the curve, with a long tail on the other side
-the standard deviation is the wrong measure to use bc the SD is best used with a bell-shaped curve (use the median and IQR instead)
What are Z-scores?
-The number of standard deviations (SD) a value lies away from the mean: Z = (value - mean) / SD
-e.g., Z = +1.65: about 5% of a normally distributed population lies above this value, the rest (95%) lies below it
-e.g., a heart rate of 65 bpm that lies 1.5 SD below the mean has a Z-score of -1.5
How much data is under the curve with a Z-score of 1.5?
About 93.3% of the data lie below Z = +1.5 and about 6.7% lie above it (the 95% / 5% split belongs to Z = +1.65)
How much data is under the curve with a Z-score of 1.96?
About 97.5% of the data lie below Z = +1.96 and 2.5% lie above it
-in the case of +/-1.96 it would be 2.5% in each tail = 5% in the tails -> 95% between -1.96 and +1.96
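The areas that belong to a given Z-score can be checked against the standard normal distribution. A minimal Python sketch using `statistics.NormalDist` (Python 3.8+); the helper names and the heart-rate mean of 80 bpm with SD 10 are assumptions made up for the example:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def z_score(value: float, mean: float, sd: float) -> float:
    """Number of SDs a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

def area_below(z: float) -> float:
    """Fraction of a normal distribution lying below the given Z-score."""
    return standard_normal.cdf(z)

# Hypothetical population: mean heart rate 80 bpm, SD 10 bpm
print(z_score(65, mean=80, sd=10))  # 65 bpm is 1.5 SD below the mean

for z in (1.5, 1.645, 1.96):
    below = area_below(z)
    print(f"Z = {z}: {100 * below:.1f}% below, {100 * (1 - below):.1f}% above")

# Two-sided: fraction between -1.96 and +1.96
central = area_below(1.96) - area_below(-1.96)
print(f"between -1.96 and +1.96: {100 * central:.1f}%")
```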
What does a confidence interval of 95% imply?
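A range calculated from the sample that is expected to contain the true population value (e.g., the true mean) in 95% of repeated samples. It is not the range that contains 95% of the individual data points (that is roughly mean +/- 1.96 SD for normally distributed data)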
What does a Skewed data curve look like?
-skewed to the left (negative skew, tail toward the low values) or to the right (positive skew, tail toward the high values)
-The tail trails off toward either the high or the low end of the measurement
What is the Interquartile range?
-The first quartile cuts off the lowest 25% of the data
-The third quartile cuts off the highest 25% of the data
-IQR = 25th to 75th percentile
-Midspread is the middle 50% of the data
Example of IQR
Given an IQR of 65-95, we know that half of the data (50%) lie between 65 and 95, 25% lie below 65, and 25% lie above 95
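A minimal Python sketch of how quartiles and the IQR can be computed with `statistics.quantiles`. The data set is hypothetical and was chosen so that the quartiles match the 65-95 example; note that different software uses slightly different quartile conventions, so results on real data may differ a little between tools:

```python
import statistics

# Hypothetical measurements, chosen for illustration only
data = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105]

# n=4 splits the data into quartiles and returns the three cut points
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"25th percentile (Q1): {q1}")
print(f"median (Q2):          {q2}")
print(f"75th percentile (Q3): {q3}")
print(f"IQR width:            {iqr}")
```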
Explain Box Plots
-The Bottom of the box is the 25th percentile
-The top of the Box is the 75th percentile
-Black bar in the middle is the median
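-In this convention, the whisker on the bottom is the 10th percentile and the whisker on the top is the 90th percentile (other box plots draw the whiskers differently, e.g., at 1.5 x IQR)
-The dots represent values outside the whiskers (below the 10th or above the 90th percentile)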
If the mean of LDL values is 100 and the SD is +/- 40, what would be the shape of the data curve?
-positively skewed (tail toward the higher values)
-it cannot be Gaussian because mean - 3 SD = 100 - 120 = -20, and LDL cannot go below 0
If the mean of the exam score is 85 and the SD is +/- 15, what would be the shape of the data curve?
-possibly negatively skewed (tail toward the lower scores)
-mean + 2 SD = 115 and mean + 3 SD = 130, but a score cannot go over 100
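The reasoning in the LDL and exam-score cards is a rough plausibility check: if mean +/- 2 or 3 SD falls outside the range of possible values, the data cannot be bell-shaped. A minimal Python sketch of that heuristic; the function names and the chosen limits (LDL cannot be below 0, exam scores run from 0 to 100) are assumptions for illustration:

```python
def sd_interval(mean: float, sd: float, k: float) -> tuple:
    """Return (mean - k*SD, mean + k*SD)."""
    return (mean - k * sd, mean + k * sd)

def fits_within_bounds(mean, sd, lower, upper, k=2):
    """True if mean +/- k SD stays inside the range of possible values."""
    low, high = sd_interval(mean, sd, k)
    return lower <= low and high <= upper

# LDL: mean 100, SD 40, values cannot be negative
print(sd_interval(100, 40, 3))                               # lower end is below 0
print(fits_within_bounds(100, 40, 0, float("inf"), k=3))     # does not fit

# Exam score: mean 85, SD 15, scores limited to 0-100
print(sd_interval(85, 15, 2))                                # upper end is above 100
print(fits_within_bounds(85, 15, 0, 100, k=2))               # does not fit
```

This is only a quick screen; it can rule out a normal distribution but cannot confirm one.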
SD formula
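The answer side of this card is missing. A common form is the sample standard deviation, SD = sqrt( sum of (x - mean)^2 / (n - 1) ); whether the card intended this sample form or the population form (dividing by n) is an assumption. A minimal Python sketch with hypothetical values:

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

heart_rates = [65, 70, 75, 80, 85]  # hypothetical data, mean = 75

print(sample_sd(heart_rates))
print(statistics.stdev(heart_rates))  # standard-library equivalent (also n - 1)
```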
How are the number of patients and outliers related to the SD?
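-Patients: n sits in the denominator of the SD formula, but the SD does not systematically fall as more patients are added; what falls with more patients is the SEM (SD divided by the square root of n)
-Outliers: proportional -> the greater the distance of a data point from the mean, the greater the SD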
What is the crude mortality rate?
measures the share of the entire population that has died from the disease
-CALCULATE: the number of deaths from the disease DIVIDED by the total population
Why can the crude mortality rate be misinterpreted?
-it can make a disease look more harmless than it is, because it takes the whole population into account, including people who never had the disease
What is the Case Fatality Rate (CFR)?
-the share of those who died from the disease among all who were diagnosed with the disease over a period of time
-A measure of disease severity
-# of deaths in a period of time DIVIDED by the # of individuals diagnosed with the disease in that time, x 100 (for percentage)
Why is the CFR the measure of severity?
Bc only if we look at the people who actually have the disease, we can tell how deadly the disease is
-> Exclude all those who don’t have the disease
How might the CFR be misinterpreted?
-it is not the same as the risk of death for an infected person
-it is the ratio between the # of deaths from the disease and the # of confirmed cases (not total cases)
-it is less accurate than the IFR because it does not take into account patients who have the disease but were never diagnosed
What is the Infection Fatality Rate IFR?
-# of deaths from a disease / # of ALL cases (diagnosed and undiagnosed, not only confirmed cases)
-the IFR tells how likely someone who is infected with the disease is to die from it
Why is the IFR more accurate than the CFR?
Because the IFR takes all cases into account, whereas the CFR only refers to the # of confirmed (diagnosed) cases
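The three mortality measures share the same numerator and differ only in the denominator. A minimal Python sketch; every number below is hypothetical and invented for illustration, and `rate_percent` is an illustrative helper name:

```python
def rate_percent(events: int, denominator: int) -> float:
    """Events divided by the chosen denominator, expressed in percent."""
    return 100 * events / denominator

# Hypothetical outbreak (all numbers invented)
total_population = 100_000
all_infections = 5_000    # diagnosed + undiagnosed
confirmed_cases = 1_000   # diagnosed only
deaths = 50

crude_mortality = rate_percent(deaths, total_population)  # whole population
cfr = rate_percent(deaths, confirmed_cases)               # confirmed cases only
ifr = rate_percent(deaths, all_infections)                # all infections

print(f"crude mortality rate: {crude_mortality}%")
print(f"case fatality rate:   {cfr}%")
print(f"infection fatality:   {ifr}%")
```

With the same 50 deaths, the crude rate looks smallest and the CFR largest, because the denominator shrinks from the whole population to confirmed cases only.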
What is the Incidence?
-Occurrence of new cases of disease or injury in a population over time
-Incidence = new cases / (population x timeframe)
How can the incidence be specified?
-in person-years
e.g., 795,000 new strokes per year in the US (population 324 million)
795,000 / 324 million = 0.00245 (about 0.25%) -> about 0.0025 new strokes per person per year -> or roughly 2.5 new strokes per 1,000 people per year
Why is the period of time and the number of people combined -> person-years?
Because not all people are followed for the same period of time
-> So each person's follow-up time is counted, and the follow-up times of all people are added up
-it normalizes the data so that different follow-up periods can be combined into one denominator
e.g., 10 strokes observed in a group followed for 6 months (0.5 years) -> 10 / 0.5 = 20 strokes per year of follow-up for that group
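A minimal Python sketch of both calculations: the US stroke incidence from the card, and a person-years denominator. The follow-up times and the single new case in the second part are hypothetical, and the function name `person_years` is illustrative:

```python
# Part 1: incidence from the card (new strokes per year / population)
new_cases = 795_000
population = 324_000_000

incidence_per_person_year = new_cases / population
incidence_per_1000 = incidence_per_person_year * 1000

print(f"{incidence_per_person_year:.5f} new strokes per person per year")
print(f"{incidence_per_1000:.1f} new strokes per 1,000 people per year")

# Part 2: person-years when people are followed for different lengths of time
def person_years(follow_up_years):
    """Total follow-up time: the sum of each person's years of observation."""
    return sum(follow_up_years)

follow_up = [2.0, 1.5, 0.5, 1.0]   # hypothetical: four people, unequal follow-up
cases_observed = 1                  # hypothetical number of new cases

total_py = person_years(follow_up)
rate = cases_observed / total_py

print(f"{total_py} person-years, {rate} new cases per person-year")
```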
What is the Prevalence?
How many people in the population have the disease at a point in time or during a period of time -> usually given as a percentage (existing cases / population)
What is Sensitivity?
The probability of getting a positive test result if the patient has the suspected disease
True positive rate: true positives / (true positives + false negatives)
What is Specificity?
The probability of getting a negative test result if the patient does NOT have the disease
True negative rate: true negatives / (true negatives + false positives)
If the disease is absent in a patient and the probability of getting a positive test result is 3%, what does that say about the diagnostic test?
-The probability of getting a false positive result is low (3%)
-The specificity is high at 97% (disease absent and negative test result)
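Sensitivity and specificity can be computed from the four cells of a 2x2 table. A minimal Python sketch; the counts are hypothetical and were chosen so that the false positive rate matches the 3% in the card:

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """P(test positive | disease present) = TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """P(test negative | disease absent) = TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)

# Hypothetical study: 100 patients with the disease, 100 without
tp, fn = 90, 10   # diseased patients: test positive / test negative
tn, fp = 97, 3    # healthy patients:  test negative / test positive

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
false_positive_rate = 1 - spec

print(f"sensitivity: {sens:.0%}")
print(f"specificity: {spec:.0%}")
print(f"false positive rate: {false_positive_rate:.0%}")
```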
What is the strategy to prevent Interventions after False positive test results?
-Combination of test approaches
-Start with a test with a reasonably high sensitivity to detect anyone who potentially has the disease (a positive result could still be a false positive)
-For those who tested positive on the first, high-sensitivity test -> test again with a test with high specificity and high sensitivity for confirmation