PU520: Principles of Epidemiology Unit 6 Epidemiology and Data Presentation Flashcards

1
Q

What does the term population refer to?

A

Refers to a collection of people who share common observable characteristics.

2
Q

What are other ways populations can be demarcated? Just some examples.

A
  • All of the inhabitants of a country (e.g., China)
  • All of the people who live in a city (e.g., New York)
  • All students currently enrolled in a particular university
  • All of the people diagnosed with a disease such as type 2 diabetes or lung cancer
3
Q

What is the term for a measurable attribute of a population, that is, a variable used to describe a characteristic of the population?

A

Parameter

An example of a parameter is the average age of the population, designated by the symbol μ.

Returning to the average age of a population (μ), the sample estimate of μ is denoted by X̄ (the sample mean). Inferential statistics use sample-based data to draw conclusions about the population from which a sample has been selected; this process is known as estimation. Thus, X̄ can be used as an estimate for μ, the population mean (a parameter).

4
Q

What is the goal of statistical inference?

A

To characterize a population by using information from samples.

Thus samples must be representative of their parent population.

5
Q

What does representativeness mean with regard to the characteristics of a sample drawn from a population?

A

Representativeness means that the characteristics of the sample correspond to the characteristics of the population from which the sample was chosen.

6
Q

What is a subgroup that has been selected, by using one of several methods, from the population (universe)?

A

Sample.

In the terminology of sampling, the universe describes the total set of elements from which a sample is selected.

7
Q

What are numbers that describe a sample?

A

Statistics

8
Q

List of Important Terms (Review)

A

N/A

9
Q

What are two rationales for using samples to represent a population?

A

Improved parameter estimates and cost savings

Examples include reviewing income tax returns, verifying signatures on ballot initiatives, quality of manufacturing goods, and enumerating the U.S. population.

10
Q

What are the two ways in which sampling occurs?

A

Random sampling and nonrandom sampling

11
Q

What are the different ways of random sampling? (2)

A

Simple random sampling and stratified random sampling

12
Q

What are the different ways of nonrandom sampling? (3)

A

Convenience sampling, systematic sampling, cluster sampling

13
Q

What does it mean when nonrandom sampling is prone to sampling bias?

A

Sampling bias means that the individuals who have been selected are not representative of the population to which the epidemiologist would like to generalize the results of the research.

14
Q

What are surveys on the internet and media-based polling examples of in sampling?

A

Nonrandom sampling

These two methods are likely to produce nonrepresentative samples. Increasingly, the Internet has been used for conducting surveys; the resulting sample of respondents is likely to be biased because of self-selection: only people who are interested in the survey topic respond to the survey. We do not know about the nonrespondents and consequently have very little information about the target population (the population denominator, as it is called in epidemiology).

15
Q

What refers to the use of a random process to select a sample?

A

Simple random sampling (SRS)

A simplistic example of SRS is drawing names from a hat. Random digit dialed (RDD) telephone surveys are a more elaborate method for selecting random samples. At one time, RDD surveys obtained high response rates from the large proportion of the U.S. homes with telephones. However, as more people transition from land lines to cellular phones, RDD surveys of land-based telephones have had declining population coverage and reduced response rates.

Another method of SRS is to draw respondents randomly from lists that contain large and diverse populations (e.g., licensed drivers). In simple random sampling, one chooses a sample of size n from a population of size N. Each member of a population has an equal chance of being chosen for the sample. In addition, all samples of size n out of a population of size N are equally possible. Considerable effort surrounds the determination of the size of n.
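
The idea of drawing n members from a list of N can be sketched in a few lines of Python. This is only an illustration: the sampling frame of 2,000 made-up IDs, the sample size of 100, and the fixed seed are all assumptions chosen for the example, not values from the course material.

```python
import random

# Hypothetical sampling frame of N = 2,000 people (IDs stand in for names).
sampling_frame = [f"person_{i:04d}" for i in range(1, 2001)]
n = 100  # assumed sample size

# A fixed seed is used only so the example is reproducible.
rng = random.Random(42)

# random.sample draws without replacement; every subset of size n is equally likely.
simple_random_sample = rng.sample(sampling_frame, n)

print(len(simple_random_sample), simple_random_sample[:3])
```

Because the draw is without replacement, no person can appear in the sample twice.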

16
Q

According to statistical theory, what produces unbiased estimates of parameters?

A

Random sampling.

In addition, random sampling permits the use of statistical methods to make inferences about population characteristics. In the context of sampling theory, the term unbiased means that the average of the sample estimates over all possible samples of a fixed size is equal to the population parameter.

For example, if we select all possible samples of size n from a population of size N and compute X̄ for each sample, the mean of all of the X̄ values (symbol, μ_X̄) will be equal to μ (μ_X̄ = μ). However, any individual sample mean is likely to be slightly different from μ. This difference is from random error, which is defined as error due to chance.

Beware, therefore, that the unbiasedness property of random samples does not guarantee that any particular sample estimate will be close to the parameter value; also, a sample is not guaranteed to be representative of the population.
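
The unbiasedness property can be checked by brute force on a tiny population. The sketch below enumerates every possible sample of size 2 from a made-up population of five ages (the values are hypothetical) and compares the mean of all the sample means with the population mean.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of N = 5 ages (illustrative values only).
population = [20, 30, 40, 50, 60]
n = 2  # sample size

def exact_mean(values):
    """Arithmetic mean computed with fractions to avoid rounding error."""
    values = list(values)
    return Fraction(sum(values), len(values))

mu = exact_mean(population)  # the population parameter

# Sample mean for every possible sample of size n (drawn without replacement).
sample_means = [exact_mean(sample) for sample in combinations(population, n)]

# The mean of all sample means equals mu, even though individual means differ.
mean_of_sample_means = exact_mean(sample_means)

print(mu, mean_of_sample_means, min(sample_means), max(sample_means))
```

The individual sample means here range from 25 to 55, which illustrates the warning above: unbiasedness is a statement about the average over all samples, not about any single sample.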

17
Q

When a study needs enough members of a particular subgroup of a population, which random sampling technique ensures that the subgroup of interest is adequately represented in the sample?

A

Stratified random sampling

Returning to statistical terminology, we will designate N as the number in the population and n as the number in the sample. Suppose an epidemiologist wants to study the health characteristics of racial or ethnic subgroups that are uncommon in the general population. The size of n is limited by the available budget. If n is small (which is often the case) in comparison to N, then only a few individuals from the minority group will enter a simple random sample. Stratified random sampling addresses this problem by drawing a separate random sample from each subgroup.

18
Q

What word do we use to define a subgroup of a greater population?

A

Stratum

For example, a population can be stratified by racial or ethnic group, age category, or socioeconomic status. Stratified random sampling uses oversampling of strata in order to ensure that a sufficient number of individuals from a particular stratum are included in the final sample.

Statisticians have demonstrated that stratified random sampling can improve parameter estimates for large, complex populations, especially when there is substantial variability among subgroups.
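
A minimal sketch of stratified random sampling with oversampling follows. The population (950 people in one stratum, 50 in another), the stratum labels, and the allocation of 70 and 30 are all hypothetical choices made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed only for reproducibility

# Hypothetical population: 95% in a "majority" stratum, 5% in a "minority" stratum.
population = [("majority", i) for i in range(950)] + [("minority", i) for i in range(50)]

# Group person IDs by stratum.
strata = {}
for stratum, person_id in population:
    strata.setdefault(stratum, []).append(person_id)

# Assumed allocation that oversamples the small stratum
# (a proportional allocation of n = 100 would take only about 5 from it).
allocation = {"majority": 70, "minority": 30}

# Draw a separate simple random sample within each stratum.
stratified_sample = {
    name: rng.sample(members, allocation[name]) for name, members in strata.items()
}

print({name: len(chosen) for name, chosen in stratified_sample.items()})
```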

19
Q

What is nonrandom sampling that uses available groups selected by an arbitrary and easily performed method?

Samples generated by this sampling method are sometimes called “grab bag” samples.

A

Convenience sampling.

An example of a convenience sample is a group of patients who receive medical service from a physician who is treating them for a chronic disease. Convenience samples are highly likely to be biased and are not appropriate for application of inferential statistics. However, they can be helpful in descriptive studies and for suggesting additional research.

20
Q

What nonrandom sampling uses a systematic procedure to select a sample of a fixed size from a sampling frame (a complete list of people who constitute the population)?

A

Systematic sampling

Systematic sampling is feasible when a sampling frame such as a list of names is available.

As a hypothetical example of systematic sampling, an epidemiologist wants to select a sample of 100 individuals from an alphabetical list that contains 2,000 names.

A way to determine the sample size is to select a desired percentage of cases (e.g., 5%). After specifying a sample size, a sampling interval must be created, say, every tenth name.

An arbitrary starting point on the list is identified (e.g., the top of the list or a randomly selected name in the list); then from that point every tenth name is chosen until the quota of 100 is reached.
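
The procedure can be sketched as below. The list of 2,000 placeholder names and the helper `systematic_sample` are hypothetical. The second call uses an interval of N / n = 20, an alternative shown only for comparison, so that the selection spans the whole list.

```python
def systematic_sample(frame, interval, quota):
    """Take every `interval`-th entry from the top of the list until `quota` is reached."""
    return frame[interval - 1::interval][:quota]

# Hypothetical alphabetical list of 2,000 names.
names = [f"name_{i:04d}" for i in range(1, 2001)]
quota = 100  # 5% of 2,000

every_tenth = systematic_sample(names, interval=10, quota=quota)
every_twentieth = systematic_sample(names, interval=len(names) // quota, quota=quota)

# Position in the list (1-based) of the last name selected by each scheme.
last_position_tenth = names.index(every_tenth[-1]) + 1
last_position_twentieth = names.index(every_twentieth[-1]) + 1

print(last_position_tenth, last_position_twentieth)
```

With every tenth name taken from the top, the quota is reached at position 1,000, so the second half of the list is never sampled; an interval of 20 reaches position 2,000.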

21
Q

Why might nonrandom systematic sampling like choosing 5% of names (of a list of 2000) and then starting at the top and choosing every 10th name not be representative of the population?

A

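The quota of 100 names (5%) is filled by the 1,000th name, only halfway through the list, so everyone in the second half of the alphabetical list is excluded. If certain surnames are more common in particular minority groups, those groups may be missed, and the sample would be biased.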

22
Q

What does the sampling technique called cluster sampling mean?

A

Cluster sampling refers to a method of sampling in which the element selected is a group (as distinguished from an individual) called a cluster.

An example of a selected element is a city block (block cluster). The U.S. Census Bureau employs cluster sampling procedures to conduct surveys in the decennial census. Because it is a more parsimonious design than random sampling, cluster sampling can produce cost savings; also, statistical theory demonstrates that cluster sampling is able to create unbiased estimates of parameters.

23
Q

What are four terms used to classify data in epidemiology?

A

Qualitative, quantitative, discrete, and continuous

24
Q

What type of data employs categories that do not have numerical values or rankings?

A

Qualitative data are measured on a categorical scale. Occupation, marital status, and sex are examples of qualitative data that have no natural ordering.

25
Q

What type of data employs categories with numerical quantities?

A

“Quantitative data [are] data expressing a certain quantity, amount or range.”3 Such data are obtained by counting or taking measurements, for example, measuring a patient’s height.

26
Q

What are data that have a finite or countable number of values?

Examples include number of children in a family (there cannot be fractional numbers of children such as half a child); a patient’s number of missing teeth; and the number of spots on a die (one to six spots).

A

Discrete data

27
Q

What do you call discrete data when it only has two values?

Examples of this are alive or dead, present or absent, male or female.

A

Dichotomous data (Binary data)

28
Q

What do you call data that can take an infinite number of possible values along a continuum?

A

Continuous data.

Weight, for example, is measured on a continuous scale. A scientific weight scale in a school chemistry lab might report the weight of a substance to the nearest 100th of a gram. A research laboratory might have a scale that can report the weight of the same material to the nearest 1,000th of a gram or even more precisely.

29
Q

What is a variable, a discrete variable, and a continuous variable?

A

A variable is used to describe a quantity that can vary (that is, take on different values), such as age, height, weight, or sex. In epidemiology, it is common practice to refer to exposure variables (for example, contact with a microbe or toxic chemical) and outcome variables (for example, a health outcome such as a disease).

A discrete variable is made up of discrete data; examples are household size and number of doctor visits.

A continuous variable is one composed of continuous data, such as age, height, weight, heart rate, blood cholesterol, and blood sugar levels. However, once a measurement is made, the recorded value is discrete.

30
Q

Stevens’ measurement scales encompass four categories: nominal, ordinal, interval, and ratio. Define these terms briefly.

A

Nominal scales - These are a type of qualitative scale that consists of categories that are not ordered (ordered data have categories from worst to best).

Examples of nominal scales: race, religions, etc. This includes dichotomous scales as well.

Ordinal scales - These comprise categorical data that can be ordered (ranked data) but are still considered qualitative data. The intervals between points on the scale are not equal. Bar graphs can be used to present these data.

Examples of ordinal scales of qualitative data: a scale that measures self-perception of health (e.g., strongly agree, agree, disagree, and strongly disagree). Other ordinal scales measured in gradations from low to high – levels of educational attainment, socioeconomic status, occupational prestige, etc.

Interval scales - This consists of continuous data with equal intervals between points on the measurement scale and without a true zero point. These scales do not permit the calculation of ratios.

Examples of interval scales: the Fahrenheit temperature scale, which does not have a true zero point. The IQ test is also measured on an interval scale. We cannot state that someone with an IQ of 120 is twice as smart as a person with an IQ of 60, nor can we say that 100 degrees Fahrenheit is twice as hot as 50 degrees Fahrenheit.

Ratio scales - These retain the properties of an interval-level scale and have a true zero point. Because ratio scales have a true zero point, one can form ratios with the data.

Examples of ratio scales: The Kelvin temperature scale is a ratio scale because it has a meaningful zero point, which permits the calculation of ratios. A temperature of 0 kelvin signifies the absence of all heat; also, one can say that 200 kelvin is twice as hot as 100 kelvin.
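
The difference between interval and ratio scales can be made concrete by converting the Fahrenheit example to kelvin, where ratios are meaningful. This is a small illustrative sketch; the helper name `fahrenheit_to_kelvin` is made up for the example.

```python
def fahrenheit_to_kelvin(degrees_f):
    """Convert a Fahrenheit temperature to kelvin (a ratio scale)."""
    return (degrees_f - 32) * 5 / 9 + 273.15

hot_k = fahrenheit_to_kelvin(100)
cool_k = fahrenheit_to_kelvin(50)

# The ratio of the Fahrenheit numbers is 2, but the ratio on the
# ratio scale (kelvin) is only about 1.1.
fahrenheit_ratio = 100 / 50
kelvin_ratio = hot_k / cool_k

print(round(fahrenheit_ratio, 2), round(kelvin_ratio, 2))
```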

31
Q

What provides one of the most convenient ways to summarize or display data in a grouped format?

A prior step to this process is counting and tabulating cases; this process must take place after the data have been reviewed for accuracy and completeness (a process called data cleaning). A clean data set contains a group of related data that are ready for coding and data analysis.

A

A frequency table.

This helps identify outliers, extreme values that differ greatly from other values in the data set. These cases may be actual extreme cases or originate from data entry errors. For example, in a frequency table of ages, an age of 149 years would be an outlier.

An example of a data set with nominal, discrete data that has been formatted as a line listing of data. See the next slide for the tabulation of the data from the line listing as a frequency table.
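
Tabulating a line listing into a frequency table can be sketched as follows. The Yes/No responses are made-up data, and the column layout is just one possible presentation.

```python
from collections import Counter

# Hypothetical line listing: one Yes/No value per case.
line_listing = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

# Count how many cases fall into each category.
frequency = Counter(line_listing)
total = sum(frequency.values())

# Print category, count, and percentage of all cases.
for category, count in frequency.most_common():
    print(f"{category:<5}{count:>3}{count / total:>8.1%}")
```

The same tabulation applied to a numeric variable such as age makes implausible values (for example, 149 years) easy to spot.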

32
Q

Frequency Table Continuation (See attached photo)

A

N/A

33
Q

What are four ways of presenting data graphically, other than tabulating the data in a frequency table?

This happens when the epidemiologist is ready to plot the data graphically.

A

Bar charts - A bar chart is a type of graph that shows the frequency of cases for categories of a discrete variable. An example is a qualitative, discrete variable such as the Yes/No variable described in the foregoing example of data for hepatitis C patients. Along the base of the bar chart are categories of the variable; the height of the bars represents the frequency of cases for each category.

Histograms - These are charts that are used to display the frequency distributions for grouped categories of a continuous variable. (see attached photo)

Line graphs - A line graph is a type of graph in which the points in the graph have been joined by a line. A single point represents the frequency of cases for each category of a variable. Line graphs enable the reader to detect trends, for example, time trends in the data. When using more than one line, the epidemiologist is able to demonstrate comparisons among subgroups.

Pie charts - A pie chart is a circle that shows the proportion of cases according to several categories. The size of each piece of “pie” is proportional to the frequency of cases. The pie chart demonstrates the relative importance of each subcategory.

34
Q

What is a number that signifies a typical value of a group of numbers or of a distribution of numbers?

A

Measure of central tendency (also called a measure of location)

The number gives the center of the distribution or can refer to certain numerical values in the distribution where the numbers tend to cluster. The measures of central tendency covered in this section are the mode, median, and arithmetic mean.

35
Q

What measure of central tendency is defined as the number occurring most frequently in a set or distribution of numbers?

A

Mode

36
Q

What is the middle point of a set of numbers?

A

Median.

If a group of numbers is ranked from the smallest value to the highest value, the median is the point that demarcates the lower and upper half of the numbers.

37
Q

What is the average of a set of numerical data values?

A

Mean, or arithmetic mean.

It is a common measure of central tendency with many uses in epidemiology. For example, the mean could be used to describe the average systolic blood pressure of patients enrolled in a primary care clinic.
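
The three measures of central tendency (mode, median, and mean) can be computed with Python's standard `statistics` module. The ages below are hypothetical values chosen for illustration.

```python
import statistics

# Hypothetical ages (years) of seven patients.
ages = [67, 70, 70, 75, 84, 96, 98]

mode_age = statistics.mode(ages)      # most frequently occurring value
median_age = statistics.median(ages)  # middle value of the ranked numbers
mean_age = statistics.mean(ages)      # sum of the values divided by their count

print(mode_age, median_age, mean_age)
```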

38
Q

What are the different types of measures of variation? (4)

Synonyms for variation are dispersion and spread.

A

Range, midrange, mean deviation, and standard deviation.

39
Q

What is range in the measure of variation?

A

It is the difference between the highest (H) and lowest (L) value in a group of numbers.

Range = H-L; Range = 98 years - 67 years = 31 years (example)

40
Q

What is the arithmetic mean of the highest and lowest values?

A

Midrange.

Midrange = (H + L) / 2 = (98 years + 67 years) / 2 = 82.5 years (example)
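
As a quick arithmetic check, the range and the midrange (the arithmetic mean of the highest and lowest values) can be computed from the same H = 98 years and L = 67 years used in the range example.

```python
# Highest and lowest ages from the range example (years).
highest = 98
lowest = 67

value_range = highest - lowest     # H - L
midrange = (highest + lowest) / 2  # mean of the two extremes, (H + L) / 2

print(value_range, midrange)
```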

41
Q

What refers to the degree of variability in a set of numbers?

A

Variance

The variance reflects how different the numbers are from one another. The variance of a sample, denoted by s², indicates how variable the numbers in a sample are.

The standard deviation of a sample (s) is the square root of the variance. Refer to Formula 2 for the formulas for these terms. The formulas shown are for the deviation score method for the computations. The standard deviation can be used to quantify the degree of spread of a group of numbers.
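
A sketch of the deviation score method follows. Formula 2 is not reproduced in these cards, so the code assumes the usual sample formulas with n - 1 in the denominator; the data values are hypothetical.

```python
import math
import statistics

# Hypothetical sample of eight measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)
sample_mean = sum(sample) / n

# Deviation scores: how far each value lies from the sample mean.
deviations = [x - sample_mean for x in sample]
sum_of_squares = sum(d ** 2 for d in deviations)

# Assumed sample formulas: divide by n - 1, then take the square root.
variance = sum_of_squares / (n - 1)
standard_deviation = math.sqrt(variance)

# Cross-check against the standard library's sample variance.
library_variance = statistics.variance(sample)

print(round(variance, 3), round(standard_deviation, 3))
```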

42
Q

More about standard deviation and learning about mean deviation calculations!

A

N/A

43
Q

What is a graph that is constructed from the frequencies of the values of a variable, for example, variable X?

A

Distribution curve

The values are a “… complete summary of the frequency of values of… a measurement…” for variable X collected on a group of people. Such curves can take various forms, including symmetric and nonsymmetric (skewed) shapes.

Distribution curves can be described in terms of central tendency and dispersion.

The mode of a distribution curve is the most frequently occurring value of the variable. Distributions can have one mode or more than one mode. Different distributions may exhibit different degrees of spread or dispersion, which is the tendency for observations to depart from central tendency. The standard deviation is a measure of the dispersion (spread) of a distribution curve, as are the range, percentiles, and quartiles.

44
Q

Measures of Variability

Synonyms for measures of the variability of a distribution curve are dispersion and spread. Distribution curves can exhibit different degrees of spread or dispersion, which is the tendency for observations to depart from central tendency. An application of measures of variability is for comparison of distributions with respect to their dispersion. These measures include the range, percentiles, quartiles, mean deviation, and standard deviation.

Percentiles and Quartiles

Percentiles are created by dividing a distribution into 100 parts. The pth percentile is the number for which p% of the data have values equal to or smaller than that number. Thus, 80% of the values in the distribution are at or below the 80th percentile. Quartiles subdivide a distribution into units of 25% of the distribution. For example:
* 1st quartile (Q1) = 25%
* 2nd quartile (Q2) = 50%
* 3rd quartile (Q3) = 75%

The interquartile range (IQR), which is a measure of the spread of a distribution, is the portion of a distribution between the 1st quartile and 3rd quartile.

The formula is: IQR = Q3 - Q1
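
Quartiles and the IQR can be computed with `statistics.quantiles`. The data are hypothetical, and the function's default "exclusive" method is an assumption here; other software may use slightly different interpolation rules and give slightly different quartiles.

```python
import statistics

# Hypothetical data set of 11 values.
values = [2, 4, 5, 7, 8, 10, 12, 13, 15, 18, 20]

# n=4 splits the distribution into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(values, n=4)

# Interquartile range: IQR = Q3 - Q1.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Note that Q2 coincides with the median of the data.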

A

Normal Distribution

Many human characteristics, such as intelligence, follow a normal pattern of distribution. A normal distribution (also called a Gaussian distribution) is a symmetrical distribution with several interesting properties that pertain to its central tendency and dispersion. Figure 6 shows a normal distribution. (See attached)

45
Q

What is a type of normal distribution with a mean of zero and a standard deviation of one unit?

A

Standard normal distribution

The standard normal distribution has interesting properties (e.g., areas between standard deviation units) that are used for statistical analyses.

Refer to Figure 8. The figure demonstrates the percentage of cases contained within ranges of standard deviation (SD) units. Note that the area between one standard deviation above and one standard deviation below the mean covers about 68% of the distribution.
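
The 68% figure can be verified numerically with `statistics.NormalDist` from the Python standard library. This is only a check of the areas under the standard normal curve; the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Area between -1 SD and +1 SD, and between -2 SD and +2 SD.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_two_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(round(within_one_sd, 4), round(within_two_sd, 4))
```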

46
Q

What is dispersion a measure of again?

A

Dispersion is a measure that shows the degree of spread of the distribution.

See the attached figure of three distributions that have the same mean but different dispersions.

47
Q

What is an asymmetric distribution that has a concentration of values on either the left or the right side of the X-axis?

A

Skewed distribution.

Skewness is defined by the direction in which the tail of the distribution points. Figure 10 shows a symmetrical distribution (B) in comparison with a distribution that is skewed to the right (A; positively skewed; tail trails off to the right) and skewed to the left (C; negatively skewed; tail trails off to the left).

48
Q

What happens to the mean, median, and mode of a skewed distribution?

A

In a skewed distribution, the mean, median, and mode differ from one another, whereas in a normal distribution they are the same.

When a distribution is skewed, the median is a more appropriate measure of central tendency than the mean. This is because the median divides the distribution into halves. In comparison, the mean is a center of gravity (balancing point) of a distribution and does not indicate the central tendency of the skewed distribution.

The median is the 50% point of continuous distributions (distributions of continuous variables). You should bear in mind that the median is a better measure of central tendency when there are several extreme values in the data set. A noteworthy example is the use of median income instead of average income to represent central tendency. The median income is preferable to the average income because the incomes of a few high earners can raise the average disproportionately, making it not reflective of the central tendency of the majority of incomes.
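
The income example can be illustrated with a handful of made-up incomes, one of which is an extreme value.

```python
import statistics

# Hypothetical annual incomes; the last value is a single very high earner.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 1_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income, median_income)
```

The mean (200,000) is higher than five of the six incomes, whereas the median (42,500) stays close to what most people in the list earn.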

49
Q

More Information on Distributions

Symmetrical (Non-Skewed) Distributions

When the distributions are symmetrical, the mean and median are identical and can be used interchangeably. The arithmetic mean is generally preferred over the median as a measure of central tendency because it tends to be a more stable value; i.e., it varies less under sampling from one sample to the next.

Distributions with Multimodal Curves

As defined previously, the mode is the value in a frequency distribution that has the highest frequency of cases; there can be more than one mode in a frequency distribution. A multimodal curve is one that has several peaks in the frequency of a condition. Figure 12 demonstrates a hypothetical multimodal plot of age on the horizontal axis and frequency of the condition on the vertical axis. When plotted as a line graph, a multimodal curve takes the form shown in Figure 12, a multimodal distribution with three modes: A, B, and C.

A

Among the reasons for multimodal distributions are age-related changes in the immune status or lifestyle of the host (the person who develops a disease). Another explanation might be the occurrence of conditions such as chronic diseases that have long latency periods and appear later in life. (The term latency refers to the time period between initial exposure and a measurable response.) Referring back to Figure 12: As a purely hypothetical example, the increase at point A (for children) might be due to their relatively low immune status; the spike at point B (for young adults) might be due to the effect of behavioral changes that bring potential hosts into contact with other people, resulting in person-to-person spread of disease; and the increase at point C (for the oldest people) might reflect the operation of latency effects of exposures to carcinogens.

50
Q

What is a graphic plotting of the distribution of cases by time of onset?

A

Epidemic curve.

An epidemic curve is a type of unimodal (having one mode) curve that aids in identifying the cause of a disease outbreak. Let’s apply the concept of an epidemic curve to an outbreak of foodborne illness caused by Salmonella (associated illness: salmonellosis). An outbreak of Salmonella Heidelberg erupted in the United States from about mid-2012 to mid-2013.

The Pacific Northwest outbreak of 134 cases was linked with Foster Farms chickens. How did the epidemic curve support the investigation of the outbreak?

Salmonellosis is one of the leading forms of bacterially associated foodborne illnesses. Microbiologists classify the bacterium according to serotypes, which are subgroups of Salmonella. Heidelberg is a serotype of Salmonella.

The attached figure provides the epidemic curve for the outbreak. The solid line shows baseline cases of Salmonella Heidelberg. These are sporadic cases (four to eight per month) that typically occur. During the outbreak, the number of cases spiked and exceeded the 5-year baseline mean. All of the cases in the outbreak matched on the same serotype of Salmonella (Salmonella Heidelberg). A large percentage of the people who were sickened revealed that they had purchased Foster Farms chickens. The figure indicates that the outbreak peaked during September 2012. The epidemic curve aided in verifying the waxing and waning of the outbreak.

51
Q

What is a bivariate association?

A

A relationship between two variables; bivariate analyses examine such relationships.

Some of the types of bivariate analyses described in this section involve the use of scatter plots, correlation coefficients, and contingency tables. One should remember that an association between two variables signifies only that they are related and not that the association is causal. The matter of a causal association is complex and relies on a body of additional information beyond the observation of a relationship between two variables.

52
Q

What is the Pearson correlation coefficient (r)?

What is Pearson’s r also called?

A

This helps measure the strength of association between two CONTINUOUS variables.

The r is also called the Pearson product-moment correlation.

Pearson correlation coefficients (r) range from -1 to +1. When r is negative, the relationship between two variables is said to be inverse, meaning that as the value of one variable increases, the value of the other variable decreases.

A positive r denotes a positive association: when one variable increases, so does the other variable.

The closer r is to either +1 or -1, the stronger the association is between two variables. As r approaches 0, the association becomes weaker; the value 0 means that there is no linear association.
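
A minimal sketch of computing Pearson's r from deviation scores is shown below. The helper `pearson_r` and the three small data sets are hypothetical; they were chosen so that r comes out at +1, -1, and 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
r_direct = pearson_r(x, [2, 4, 6, 8, 10])   # y rises with x: r = +1
r_inverse = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises: r = -1
r_none = pearson_r(x, [1, 2, 3, 2, 1])      # rises then falls: r is about 0

print(round(r_direct, 3), round(r_inverse, 3), round(r_none, 3))
```

In the third data set the two variables are clearly related (y rises and then falls), yet r is 0, because r measures only the linear component of an association.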

53
Q

What is a method for graphically displaying relationships between variables by plotting two variables using an XY axis?

A

Scatter plot or scatter diagram.

The examples will indicate a perfect direct linear relationship (r = +1.0) and a perfect inverse linear relationship (r = –1.0); later we will examine other types of relationships.

54
Q

What is a dose-response curve?

A

It is a plot of a dose-response relationship, which is a type of correlative association between exposure (e.g., dose of a toxic chemical) and effect (e.g., a biological outcome).

The dose is indicated along the X-axis, with the response shown along the Y-axis. At the beginning of the curve, the flat portion suggests that at low levels of the dose, no effect or only a minimal effect occurs. This is also known as the subthreshold phase. After the threshold is reached, the curve rises steeply and then progresses to a linear state in which an increase in response is proportional to an increase in dose. The threshold refers to the lowest dose at which a particular response occurs. When the maximal response is reached, the curve flattens out.

A dose-response relationship is one of the indicators used to assess a causal effect of a suspected exposure associated with an adverse health outcome. For example, there is a dose-response relationship between the number of cigarettes smoked daily and mortality from lung cancer.

As the number of cigarettes smoked per day increases, so do the rates of lung cancer mortality. This dose-response relationship was one of the considerations that led to the conclusion that smoking is a cause of lung cancer mortality.

55
Q

What is another method for demonstrating associations which is a type of table that tabulates data according to two dimensions?

A

Contingency table, or Epidemiologist 2x2.

This type of contingency table is also called a 2 by 2 table or a fourfold table because it contains four cells, labeled A through D. The column and row totals are known as marginal totals. As noted previously, analytic epidemiology is concerned with the associations between exposures and health outcomes (disease status). Two study designs employ variations of a contingency table to present the results. One of these designs is a case-control study and the other is a cohort study.

56
Q

What do A, B, C, and D represent in the epidemiologist 2x2?

A

A = Exposure is present and disease is present
B = Exposure is present and disease is absent
C = Exposure is absent and disease is present
D = Exposure is absent and disease is absent

See attached photo for example of alcohol commercials and the association with binge drinking using an epidemiologist 2x2.
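
Filling the four cells from individual records can be sketched as follows. The records are made-up (exposed, diseased) pairs and are not the data from the attached alcohol-commercial example.

```python
# Hypothetical records: each tuple is (exposed, diseased).
records = (
    [(True, True)] * 30      # exposure present, disease present
    + [(True, False)] * 70   # exposure present, disease absent
    + [(False, True)] * 10   # exposure absent, disease present
    + [(False, False)] * 90  # exposure absent, disease absent
)

# Tabulate the four cells of the 2x2 table.
a = sum(1 for exposed, diseased in records if exposed and diseased)
b = sum(1 for exposed, diseased in records if exposed and not diseased)
c = sum(1 for exposed, diseased in records if not exposed and diseased)
d = sum(1 for exposed, diseased in records if not exposed and not diseased)

# Marginal totals: row totals (by exposure) and column totals (by disease).
row_totals = (a + b, c + d)
column_totals = (a + c, b + d)

print(f"A={a} B={b} | {row_totals[0]}")
print(f"C={c} D={d} | {row_totals[1]}")
print(f"{column_totals[0]} {column_totals[1]} | {len(records)}")
```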

57
Q

What is a single value used to estimate a parameter?

A

Point estimate.

An example of this would be the use of X̄, the sample mean, to estimate μ, which is the corresponding population mean.

An alternative to a point estimate is an interval estimate, defined as a range of values that with a certain level of confidence contains the parameter. One of the common levels of confidence is the 95% confidence level, although others are possible. This level of confidence is often read as being 95% certain that the confidence interval contains the parameter; more precisely, about 95% of the intervals constructed this way from repeated samples would contain the parameter.
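
A rough sketch of a 95% confidence interval for a mean is given below. The formula in the attached photo is not reproduced in these cards, so the code assumes the common large-sample form, X̄ ± z × s / √n with z ≈ 1.96; the summary statistics (n = 100, X̄ = 120, s = 15) are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical summary statistics, e.g., systolic blood pressure in mm Hg.
n = 100
sample_mean = 120.0  # point estimate of the population mean
sample_sd = 15.0

# z value that leaves 2.5% in each tail of the standard normal distribution.
z = NormalDist().inv_cdf(0.975)

# Assumed large-sample interval: mean plus or minus z times the standard error.
standard_error = sample_sd / math.sqrt(n)
margin = z * standard_error
lower, upper = sample_mean - margin, sample_mean + margin

print(round(z, 2), round(lower, 2), round(upper, 2))
```

For small samples, a t distribution is typically used in place of z, which widens the interval.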

See attached photo for the 95% confidence interval (CI)