Module 1: Introduction to Data Flashcards

1
Q

Concept

A

Answer

2
Q

A frequency table exhibits how…

A

frequencies are distributed over various categories (known as a frequency distribution)

3
Q

Associated variables

A

When two variables show some connection/relationship with one another

4
Q

Blocking (experimental design)

A

Grouping the sample based on variables that may affect the outcome, then randomizing within each group
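
As a rough illustration of the card above, the following sketch groups subjects into blocks and then randomizes within each block. The subject names, the "risk" blocking variable, and the even treatment/control split are all assumptions made for the example.

```python
import random

# Hypothetical subjects, each tagged with a blocking variable (risk level).
subjects = [("s1", "low"), ("s2", "low"), ("s3", "low"), ("s4", "low"),
            ("s5", "high"), ("s6", "high"), ("s7", "high"), ("s8", "high")]

rng = random.Random(0)  # seeded only so the sketch is reproducible

# Step 1: group subjects into blocks by the variable that may affect the outcome.
blocks = {}
for name, risk in subjects:
    blocks.setdefault(risk, []).append(name)

# Step 2: randomize within each block, half to treatment and half to control.
assignment = {}
for risk, members in blocks.items():
    shuffled = members[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    for name in shuffled[:half]:
        assignment[name] = "treatment"
    for name in shuffled[half:]:
        assignment[name] = "control"

print(assignment)
```

Because the split happens inside each block, both treatment groups end up with the same number of low-risk and high-risk subjects.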

5
Q

Categorical variable

A

The individual entries are categories; the possible values are called “levels”

6
Q

Cluster sample

A

Break the population into groups and then sample a fixed number of those groups and include all observations from each group; helpful when there’s a lot of variability between cases within a cluster but the clusters themselves don’t differ much from one another

7
Q

Confounding variable

A

A variable that is correlated with both the explanatory and the response variables

8
Q

Continuous variable

A

A numerical variable that can take any value within a range (in principle, to unlimited decimal precision); e.g. height, weight (think “how much”)

9
Q

Controlling (experimental design)

A

Mitigate the differences between groups

10
Q

Convenience sample bias

A

When individuals who are more accessible are more likely to be included in the sample

11
Q

Cumulative frequency

A

The total of a frequency and all frequencies below it in a frequency distribution; the running total of frequencies

12
Q

Cumulative relative frequency

A

Cumulative frequency for that category / sum of all frequencies
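
The four quantities on the frequency cards (frequency, relative frequency, cumulative frequency, cumulative relative frequency) can be computed together. This is a minimal sketch using a made-up set of ordinal survey ratings; the data and the variable names are assumptions for illustration only.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical ordinal ratings from ten subjects.
observations = ["low", "high", "medium", "medium", "low",
                "high", "medium", "medium", "low", "medium"]
order = ["low", "medium", "high"]  # natural ordering of the levels

counts = Counter(observations)
frequencies = [counts[level] for level in order]
total = sum(frequencies)

relative = [f / total for f in frequencies]              # frequency / sum of all frequencies
cumulative = list(accumulate(frequencies))               # running total of frequencies
cumulative_relative = [c / total for c in cumulative]    # cumulative frequency / sum of all frequencies

for row in zip(order, frequencies, relative, cumulative, cumulative_relative):
    print("{:<7} {:>2} {:.2f} {:>2} {:.2f}".format(*row))
```

The last cumulative frequency always equals the total count, so the last cumulative relative frequency is 1.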

13
Q

Data

A

Information we gather with experiments and with surveys

14
Q

Description

A

Summarizing the data that are obtained

15
Q

Descriptive statistics

A

Refers to methods for summarizing the data; describes the sample only (graphs, numerical summaries)

16
Q

Design

A

Planning how to obtain data to answer the questions of interest (experimental design, sample size, power, etc.)

17
Q

Discrete variable

A

A numerical variable that takes values only in jumps (e.g. whole numbers); e.g. the number that appears when throwing a die (think “how many”)

18
Q

Experiment

A

Used to investigate the possible causal connection between variables

19
Q

Explanatory variable

A

The (first) variable, which causally affects the other

20
Q

Frequency

A

The number of elements that belong in a certain category

21
Q

Graphical methods

A

Histogram, boxplot, bar graph, etc.

22
Q

Graphs (categorical)

A

Bar chart, pie chart; focuses on frequencies or relative frequencies of the levels of the variable

23
Q

Graphs (numerical/scale)

A

Dot chart (discrete variable), stem-and-leaf plot, histogram, boxplot, scatterplot

24
Q

Histogram

A

A bar chart that gives the frequencies or relative frequencies of occurrences of a scale variable in certain intervals; the heights of the bars in the histogram are called the distribution of the sample
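
To make the idea of "frequencies in certain intervals" concrete, here is a small sketch that counts values into equal-width intervals and prints a text histogram. The helper `histogram_counts`, the height data, and the choice of 10 cm intervals starting at 150 are all hypothetical.

```python
def histogram_counts(values, start, width, bins):
    """Count how many values fall in each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts

heights_cm = [152, 158, 161, 163, 167, 168, 171, 174, 179, 186]  # made-up sample
counts = histogram_counts(heights_cm, start=150, width=10, bins=4)
relative = [c / len(heights_cm) for c in counts]

for i, c in enumerate(counts):
    low = 150 + 10 * i
    print(f"[{low}, {low + 10}) {'#' * c}")
```

The bar heights (`counts`, or `relative` for relative frequencies) are what the card calls the distribution of the sample.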

25
Q

Characteristics of a distribution: left-skewed

A

Negatively skewed; the values to the left of the center fall further away from the center than those to the right of the center; the mean is less than the median

26
Q

Characteristics of a distribution: Right-skewed

A

Positively skewed; the values to the right of the center fall further away from the center than those to the left of the center; the mean is greater than the median
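
The mean-versus-median relationship on the two skewness cards can be checked on small made-up samples. This is only a sketch with hypothetical data; the relationship is a typical pattern rather than a guarantee for every data set.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]          # long tail to the right
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]   # mirror image, long tail to the left

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean={mean}, median={median}")
```

In the right-skewed sample the single large value pulls the mean above the median; in the left-skewed sample the single very small value pulls the mean below it.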

27
Q

Characteristics of a distribution: symmetric

A

Left and right sides of the graph are roughly mirror images of each other; the center is the mean, and the mean ≈ the median

28
Q

How to describe graphical data

A

Center, variation, distribution, outliers, time

29
Q

Independent variables

A

When two variables are not associated/there is no evident relationship between the two

30
Q

Inference

A

Making decisions and predictions based on the data

31
Q

Inferential statistics

A

Used when data are available only for a sample but we want to make a decision or prediction about the entire population (confidence intervals, significance tests)

32
Q

Intensity map (heat map)

A

Colors are used to show higher and lower values of a variable

33
Q

Multi-stage sample

A

Clustering, but sample within each cluster rather than the entire cluster
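
The difference between a cluster sample and a multi-stage sample can be sketched in a few lines. The village/household population, the number of clusters chosen, and the within-cluster sample size are assumptions for the example.

```python
import random

rng = random.Random(1)  # seeded only so the sketch is reproducible

# Hypothetical population: 6 villages (clusters) of 5 households each.
clusters = {f"village_{i}": [f"v{i}_h{j}" for j in range(5)] for i in range(6)}

# Cluster sample: pick 2 clusters at random and keep every household in them.
chosen = rng.sample(sorted(clusters), 2)
cluster_sample = [h for v in chosen for h in clusters[v]]

# Multi-stage sample: pick 2 clusters, then sample 3 households within each.
chosen_stage1 = rng.sample(sorted(clusters), 2)
multistage_sample = [h for v in chosen_stage1 for h in rng.sample(clusters[v], 3)]

print(cluster_sample)
print(multistage_sample)
```

The cluster sample contains all 10 households from the two chosen villages, while the multi-stage sample contains only 6.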

34
Q

Negatively associated

A

A downward trend between the two variables: as one increases, the other tends to decrease

35
Q

Nominal variable

A

A categorical variable where the levels have no hierarchy; e.g. eye color, type of car

36
Q

Non-response bias

A

When the nonresponse rate among those recruited is high, so it’s unclear whether those who respond really represent the population

37
Q

Numerical summaries, location (descriptive statistics)

A

Mean, median, quantile/percentile, quartile, mode

38
Q

Numerical summaries, spread (descriptive statistics)

A

Standard deviation, sample variance, range, interquartile range, coefficient of variation

39
Q

Numerical variable

A

Can take a wide range of number values, and it is sensible to add/subtract/take averages

40
Q

Observational data

A

No treatment has been explicitly applied or withheld with regard to the data collected

41
Q

Observational study

A

When data are collected in a way that does not interfere with how the data arise; can provide evidence of a naturally occurring association but alone cannot show a causal connection

42
Q

Ordinal variable

A

A categorical variable where the levels have a natural ordering; e.g. level of education

43
Q

Population

A

Is the total set of subjects in which we are interested

44
Q

Positively associated

A

An upward trend between the two variables: as one increases, the other tends to increase

45
Q

Probability

A

The basic tool for evaluating chances, and also the key to how well inferential statistics work

46
Q

Qualitative data in a one way table can include

A

Absolute frequency, relative frequency, cumulative frequency, cumulative relative frequency

47
Q

Qualitative data in a two way table can

A

Indicate the relationship between two variables
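
As a sketch of how a two-way table is built from raw records, the following counts each combination of two categorical variables and then converts each row to proportions. The smoking/exercise records are invented for illustration.

```python
from collections import Counter

# Hypothetical records: (smoking status, exercise habit) for eight subjects.
records = [("smoker", "rarely"), ("smoker", "rarely"), ("smoker", "often"),
           ("non-smoker", "often"), ("non-smoker", "often"), ("non-smoker", "often"),
           ("non-smoker", "rarely"), ("smoker", "rarely")]

cells = Counter(records)
row_levels = sorted({row for row, _ in records})
col_levels = sorted({col for _, col in records})

# Two-way table of absolute frequencies.
table = {row: {col: cells[(row, col)] for col in col_levels} for row in row_levels}

# Row proportions make the relationship easier to compare across rows.
row_proportions = {
    row: {col: n / sum(counts.values()) for col, n in counts.items()}
    for row, counts in table.items()
}

print(table)
print(row_proportions)
```

If the row proportions differ noticeably between rows, as they do in this made-up data, the two variables appear to be associated.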

48
Q

Random sample reduces…

A

The chance of introducing biases

49
Q

Randomization (experimental design)

A

Accounts for variables that can’t be controlled

50
Q

Randomized experiment

A

When individuals are randomly assigned to a group in an experiment

51
Q

Relative frequency

A

Frequency for that category / sum of all frequencies

52
Q

Replication (experimental design)

A

Can be accomplished by collecting a sufficiently large sample or by repeating the study

53
Q

Response variable

A

The (second) variable, which changes in response to the explanatory variable

54
Q

Sample

A

The subset of the population for whom we have or plan to have data

55
Q

Sampling methods are based on the notion of…

A

Implied randomness; samples tend to reflect the population well when each subject in the population has the same chance of being included in the sample.

56
Q

Scatterplot

A

Represents the bivariate relationship between two variables (usually continuous) by plotting a data point for each observation in the data set; useful for visualizing the relationship

57
Q

Simple random sampling

A

Each case in a population has an equal chance of being included in the final sample; knowing a case is included does not provide useful info about what other cases are included (raffle-style)

58
Q

Stratified sampling

A

The population is divided into strata (similar cases grouped together, e.g. by age), then a second sampling method (usually simple random sampling) is applied within each stratum; useful when cases within a stratum are similar with respect to the studied outcome
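
Simple random sampling and stratified sampling can be contrasted with a short sketch. The population, the age-group strata, and the sample sizes below are hypothetical.

```python
import random

rng = random.Random(2)  # seeded only so the sketch is reproducible

# Hypothetical population of 12 people tagged with an age group (the stratum).
population = [(f"p{i}", "under_40" if i < 8 else "40_plus") for i in range(12)]

# Simple random sample: every case has the same chance of selection (raffle-style).
srs = rng.sample(population, 4)

# Stratified sample: group by stratum, then draw a simple random sample in each.
strata = {}
for person, group in population:
    strata.setdefault(group, []).append(person)
stratified = {group: rng.sample(members, 2) for group, members in strata.items()}

print(srs)
print(stratified)
```

The simple random sample may by chance contain nobody from the smaller age group, whereas the stratified sample is guaranteed to include cases from every stratum.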

59
Q

Subjects

A

The entities that we measure in a study

60
Q

Tabular methods

A

Table summary with frequency and/or percent frequency

61
Q

Types of descriptive statistics

A

Numerical methods, tabular methods, graphical methods

62
Q

Characteristic of data: center

A

A representative or average value that indicates where the middle of the data set is located

63
Q

Characteristic of data: variation

A

A measure of the amount that the data values vary among themselves

64
Q

Characteristics of data: distribution

A

The nature or shape of the distribution of the data

65
Q

Characteristics of the data: outliers

A

Sample values that lie very far away from the vast majority of the other sample values

66
Q

Characteristics of data: time

A

Changing characteristics of the data over time (is there a trend?)

67
Q

Shape of a distribution: Modality

A

How many prominent peaks are apparent within the distribution

68
Q

Shape of a distribution: unimodal

A

A single prominent peak in the distribution

69
Q

Shape of a distribution: bimodal

A

Two prominent peaks in the distribution

70
Q

Shape of a distribution: multimodal

A

Several prominent peaks in the distribution

71
Q

Shape of a distribution: uniform

A

No prominent peaks, mostly smooth

72
Q

Mean (measure of center)

A

A measure of center; the sample mean is denoted x̄ (an x with a bar across the top), and the population mean is denoted by the Greek letter μ (mu)

73
Q

Sample mean (x with bar over it)

A

A sample statistic that serves as a point estimate of the population mean

74
Q

Variance (measures of variability)

A

The average squared deviation from the mean; we use the squared deviation to get rid of negatives, so that observations equally distant from the mean are weighted equally, and to weight larger deviations more heavily

75
Q

Standard deviation (measures of variability)

A

The square root of the variance, and has the same units as the data
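
The variance and standard deviation cards can be worked through numerically. This sketch uses made-up data and shows both common conventions: dividing by n (population variance) and dividing by n - 1 (sample variance). Which one a given course intends by "average" is an assumption to check against its materials.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
mean = statistics.mean(data)
squared_deviations = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_deviations) / len(data)      # divide by n
sample_variance = sum(squared_deviations) / (len(data) - 1)    # divide by n - 1

# The standard deviation is the square root of the variance,
# so it is in the same units as the data.
population_sd = population_variance ** 0.5
sample_sd = sample_variance ** 0.5

print(population_variance, population_sd)
print(sample_variance, sample_sd)
```

The standard library functions `statistics.pvariance` and `statistics.variance` implement the same two conventions.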

76
Q

Median (measures of center)

A

The value that splits the data in half when ordered in ascending order; if there is an even number of observations, the median is the average of the two values in the middle; also called the 50th percentile
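
The rule on this card translates directly into code. The function below is a minimal sketch written for illustration; the name `median` and the sample inputs are assumptions.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([5, 1, 3]))      # odd number of observations
print(median([4, 1, 3, 2]))   # even number of observations
```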

77
Q

IQR (measures of variability)

A

The range of the middle 50% of the data, between the first quartile (25th percentile) and the third quartile (75th percentile); IQR = Q3 - Q1

78
Q

Box plot

A

The box represents the middle 50% of the data, and the line dividing the box is the median; the upper and lower whiskers extend to the most extreme observations within their maximum reach, and any dots beyond them are suspected outliers

79
Q

Box plot: Whiskers

A

Max upper whisker reach = Q3 + 1.5 × IQR; max lower whisker reach = Q1 - 1.5 × IQR

80
Q

Box plot: Outliers

A

Defined as an observation beyond the maximum reach of the whiskers; helpful for identifying extreme skew in the distribution, identifying data collection/entry errors, and providing insight into interesting features of the data
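
The IQR, whisker, and outlier cards fit together as one calculation. This is a sketch on made-up data; note that quartile conventions differ between textbooks and software, and the "inclusive" method used here is just one assumed choice.

```python
import statistics

data = [2, 4, 5, 7, 8, 9, 11, 12, 40]  # hypothetical sample

# Quartiles: Q1 (25th percentile), Q2 (median), Q3 (75th percentile).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Maximum reach of the whiskers.
lower_reach = q1 - 1.5 * iqr
upper_reach = q3 + 1.5 * iqr

# Observations beyond the maximum reach are flagged as suspected outliers.
outliers = [x for x in data if x < lower_reach or x > upper_reach]

print(q1, q3, iqr, lower_reach, upper_reach, outliers)
```

Here only the value 40 lies beyond the upper reach, so it would be drawn as a separate dot on a box plot.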

81
Q

Robust statistics

A

The median and IQR are more robust to skewness and outliers than the mean and standard deviation

82
Q

For skewed distributions, use…

A

Median (center) and IQR (spread)

83
Q

For symmetric distributions, use…

A

Mean (center) and standard deviation (spread)

84
Q

Log transformation

A

Useful when data is extremely skewed, as it can make outliers less prominent; however, the results of the analysis may be harder to interpret because they are on the log scale rather than in the original units
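
The effect of a log transformation on a skewed sample can be seen in a small sketch. The income figures are invented, base 10 is an arbitrary choice, and the transformation only applies to positive values.

```python
import math
import statistics

incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 1_000_000]  # hypothetical, right-skewed
logged = [math.log10(x) for x in incomes]

# On the raw scale the outlier drags the mean far above the median.
raw_gap = statistics.mean(incomes) - statistics.median(incomes)

# On the log scale the gap between mean and median is much smaller.
log_gap = statistics.mean(logged) - statistics.median(logged)

# Back-transforming the mean of the logs gives the geometric mean, in original units.
geometric_mean = 10 ** statistics.mean(logged)

print(raw_gap, log_gap, geometric_mean)
```

Back-transforming, as in the last step, is one way to report a result in the original units after analyzing on the log scale.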