MODULE 1 - DESCRIPTIVE STATISTICS Flashcards

Question

Considerations for data visualization

Answer 1

1. Cardinality 2. depends on the kind of data being presented, and the information to be conveyed.

Answer 2

1. is the number of unique elements in a dataset. 2. Scatter graphs, line charts, and histograms work very well for high-cardinality data.
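A minimal sketch of the cardinality definition above, counting unique elements with a set. The list `colors` and the helper name `cardinality` are made up for illustration.

```python
# Hypothetical example data; the values are invented for illustration.
colors = ["red", "blue", "red", "green", "blue", "red"]

def cardinality(values):
    """Return the number of unique elements in values."""
    return len(set(values))

print(cardinality(colors))  # 3 unique values: red, blue, green
```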

Answer 3

are a good choice for low-cardinality data, and for showing the relative frequency with which unrelated categories occur.

Answer 4

can be used to identify trends.

Answer 5

is a good choice for displaying frequency or counts in low-cardinality data.

Answer 6

is a common computer application for organizing data like text or numbers, for using formulas to calculate a mathematical quantity using existing data as inputs, and for creating charts to visualize data.

Answer 7

1. A spreadsheet consists of cells organized into columns and rows. The column headings are letters and the row headings are numbers, but headings are not counted as cells. 2. A user can enter data, like words or numbers, into each cell. The spreadsheet is a convenient way to create a table of data.

Answer 8

is a predefined formula that supports common tasks such as computing the average, minimum, or maximum of a group of cells.

Answer 9

defines how the function is used, and specifies the function's name and accepted arguments

Answer 10

1. are surrounded by parentheses and specify the data that the function operates on. 2. Arguments may be numbers, cells, a range of cells, or a combination thereof. 3. The [ ] arguments are optional.

Answer 11

= is followed by the function's name and then arguments separated by commas.

Answer 12

1. defines a reference to a group of cells. 2. Ex: =SUM(A1:A4, B10) calculates the sum of cells A1, A2, A3, A4, and B10.
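The range example above can be mimicked outside a spreadsheet. This is only a sketch: `expand_range` and `sheet_sum` are hypothetical helpers, the cell values are invented, and the code assumes single-letter columns and ranges that stay within one column.

```python
def expand_range(ref):
    """Expand a reference such as 'A1:A4' into individual cell names.

    Assumes a single-letter column and a range within one column.
    """
    if ":" not in ref:
        return [ref]
    start, end = ref.split(":")
    column = start[0]
    first_row, last_row = int(start[1:]), int(end[1:])
    return [f"{column}{row}" for row in range(first_row, last_row + 1)]

def sheet_sum(cells, *refs):
    """Sum the values of every cell named by the given references."""
    return sum(cells[name] for ref in refs for name in expand_range(ref))

# Hypothetical cell contents.
cells = {"A1": 1, "A2": 2, "A3": 3, "A4": 4, "B10": 10}
total = sheet_sum(cells, "A1:A4", "B10")  # mirrors =SUM(A1:A4, B10)
print(total)  # 20
```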

Answer 13

confidence intervals, and hypothesis testing

Answer 14

specify the range within which a parameter falls with a given probability

Answer 15

allows differences between population parameters to be compared.

Answer 16

Are conducted to allow statisticians to make generalizations about a population.

Answer 17

is any collection of objects, people, or things about which statistical inferences are made.

Answer 18

is a numerical characteristic of a population, such as mean, median, or standard deviation.

Answer 19

is an individual in the population on which a measurement can be taken.

Answer 20

is the subset of the population from which a sample is drawn.

Answer 21

is composed of the sampling units that provide data to be collected.

Answer 22

is a numerical characteristic of a sample, rather than the population.

Answer 23

is a difference between the parameter predicted from a survey and the true value of the parameter in the population.

Answer 24

selection bias and response bias.

Answer 25

exists when the sampling units selected from a population are not representative of the entire population, and are instead biased toward certain subsets of the population.

Answer 26

1. Undercoverage bias 2. Nonresponse bias 3. Voluntary response bias 4. Response bias

Answer 27

occurs when certain members of a population are inadequately represented in a sample.

Answer 28

occurs when a sample is biased toward members of a population that participate in a survey.

Answer 29

occurs when a sample is biased toward members that self-select for participation in a survey.

Answer 30

can result if the responses of survey participants are affected by how a question is asked or the behaviors or attitudes of the participant.

Answer 31

1. Acquiescence bias 2. extreme responding 3. social desirability bias

Answer 32

occurs when respondents tend to agree with a statement in a survey.

Answer 33

occurs when respondents tend to select the most extreme options available.

Answer 34

1. occurs when respondents tend to answer questions in a way that is socially accepted by others. 2. In other words, a social desirability bias exists when respondents over-report "good" behaviors or under-report "bad" behaviors.

Answer 35

Different sampling methods can help mitigate certain types of statistical bias.

Answer 36

1. simple random sampling 2. systematic sampling 3. stratified 4. cluster 5. convenience

Answer 37

1. a sample is constructed by random selection from the population. 2. Mathematically, simple random sampling is a sampling method in which all possible samples consisting of n units selected from a population of N units are equally likely.
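A small sketch of simple random sampling using the standard library. The population of unit IDs, the sample size, and the seed are assumptions chosen for illustration; `math.comb` counts how many equally likely samples exist.

```python
import math
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

population = list(range(1, 11))  # 10 hypothetical unit IDs
sample = simple_random_sample(population, 3, seed=42)
n_possible = math.comb(len(population), 3)  # number of distinct samples of size 3

print(sample)
print(n_possible)  # 120
```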

Answer 38

every kth unit from a population of N units is selected to be in a sample.
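Systematic sampling as described above maps onto list slicing. The sketch below assumes the units are already in a list and that the starting position is supplied by the caller (in practice it is often chosen at random among the first k units).

```python
def systematic_sample(units, k, start=0):
    """Select every kth unit, beginning at index start (0 <= start < k)."""
    return units[start::k]

units = list(range(1, 21))  # 20 hypothetical unit IDs
print(systematic_sample(units, 5, start=2))  # [3, 8, 13, 18]
```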

Answer 39

the population is first divided into groups, or strata, depending on some characteristic. Next, samples within each stratum are randomly selected in a proportional manner.
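One way to sketch proportional stratified sampling: group the units by stratum, then draw the same fraction at random from each group. The function name, the strata labels, and the 10% fraction are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=0):
    """Randomly draw the same fraction of units from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        n = round(len(members) * fraction)  # proportional allocation
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: 60 first-year and 40 senior students.
students = [("first_year", i) for i in range(60)] + [("senior", i) for i in range(40)]
chosen = stratified_sample(students, stratum_of=lambda s: s[0], fraction=0.1)
print(len(chosen))  # 10 units: 6 first-year and 4 senior
```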

Answer 40

1. the population is first divided into groups, or clusters, depending on some characteristic. 2. Next, the sample is constructed by randomly selecting one or more clusters.
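A sketch of cluster sampling under the assumption that clusters are stored as a dictionary mapping a cluster name to its units. Whole clusters are selected at random and every unit in a selected cluster enters the sample; the school names and IDs are invented.

```python
import random

def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly pick n_clusters whole clusters and return all their units."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)
    units = [unit for name in picked for unit in clusters[name]]
    return picked, units

# Hypothetical clusters: students grouped by school.
schools = {
    "school_a": [1, 2, 3],
    "school_b": [4, 5, 6],
    "school_c": [7, 8, 9],
    "school_d": [10, 11, 12],
}
picked, cluster_units = cluster_sample(schools, 2, seed=1)
print(picked, cluster_units)
```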

Answer 41

units are drawn from a subset of the population that is readily available.

Answer 42

a data value that is either much greater than or much less than the rest of the data and not representative of the rest of the data being considered

Answer 43

1. is a measure of how far apart values in a dataset are from each other. 2. A larger spread means that the values are more scattered. 3. A lower spread means that the values are more clustered together.

Answer 44

using dot plots, box plots, and histograms

Answer 45

the interquartile range, variance, and standard deviation.

Answer 46

variance and standard deviation

Answer 47

is the average of the squared differences from the mean.

Answer 48

is the square root of the variance

Answer 49

the dataset contains the whole population or a subset of the population.
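Assuming this card refers to choosing between the population and sample formulas, Python's `statistics` module exposes both: `pvariance`/`pstdev` divide by the number of values, while `variance`/`stdev` divide by one less. The data values below are made up.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values; the mean is 5

# Treat the data as the whole population: divide by N.
print(pvariance(data))  # 4
print(pstdev(data))     # 2.0

# Treat the data as a sample of a larger population: divide by N - 1.
print(variance(data))   # 32 / 7, about 4.57
print(stdev(data))      # square root of the sample variance
```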

Answer 50

The typical difference between a data value and the mean

Answer 51

The spread between the maximum and minimum data values

Answer 52

For symmetric data, standard deviation is usually the better measure of spread. For data that is skewed, interquartile range is usually the better measure of spread.

Answer 53

is the largest value in the dataset

Answer 54

is the smallest value in the dataset.

Answer 55

is the difference between the maximum and minimum of the dataset.

Answer 56

is the data value such that p percent of the data falls at or below that value.

Answer 57

1. is the 25th percentile. One-quarter of the data fall at or below Q1. 2. The first quartile is the median of the lower half of the data.

Answer 58

1. is the 75th percentile. 2. Three-quarters of the data fall at or below Q3. 3. The third quartile is the median of the upper half of the data.

Answer 59

Because half of the data fall at or below the median, the median is also the 50th percentile of a dataset.

Answer 60

five-number summary.
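The five-number summary consists of the minimum, Q1, the median, Q3, and the maximum. The sketch below uses the median-of-halves rule from the quartile cards and assumes that, for an odd number of values, the middle value is left out of both halves. Other quartile conventions exist and can give slightly different results.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two values."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # lower half of the data
    upper = data[len(data) - half:]   # upper half of the data
    return data[0], median(lower), median(data), median(upper), data[-1]

print(five_number_summary([13, 1, 7, 3, 11, 5, 9]))  # (1, 3, 7, 11, 13)
```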

Answer 61

1. is a data visualization that uses a box and several lines to depict the distribution of data in a dataset. 2. A box spans the middle 50% of the data, with Q1 as the lower boundary of the box and Q3 as the upper boundary of the box. 3. The median is shown as a line inside the box. Two lines, known as whiskers, extend from the lower boundary of the box to the minimum and from the upper boundary of the box to the maximum. 4. The whiskers represent the lower and upper 25% of the data.

Answer 62

1. is the difference between the mean and the median 2. A positive skew means that the distribution is skewed to the right, while a negative skew means that the distribution is skewed to the left.
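A quick check of the sign rule above, using the card's mean-minus-median difference as a rough indicator rather than a standardized skewness coefficient. Both datasets are invented.

```python
from statistics import mean, median

def mean_minus_median(values):
    """Positive suggests a right skew; negative suggests a left skew."""
    return mean(values) - median(values)

right_skewed = [1, 2, 2, 3, 12]    # long tail toward large values
left_skewed = [1, 10, 11, 11, 12]  # long tail toward small values

print(mean_minus_median(right_skewed))  # 2
print(mean_minus_median(left_skewed))   # -2
```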

Answer 63

1. One way to detect outliers using a box plot is to determine how far each data element is from either Q1 or Q3. 2. A data value greater than Q3 + 1.5(IQR) or less than Q1 - 1.5(IQR) is considered an outlier. 3. Often, an outlier is not included in either whisker and is instead represented in the plot as a marker such as an open circle or a triangle.
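The fence rule can be sketched as follows, using a multiplier of 1.5 for both fences. Quartiles are computed as medians of the lower and upper halves, which is one of several conventions; the data values are hypothetical.

```python
from statistics import median

def iqr_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in data if v < lower_fence or v > upper_fence]

# Q1 = 4, Q3 = 12, IQR = 8, so the fences are -8 and 24.
print(iqr_outliers([1, 3, 5, 7, 9, 11, 13, 40]))  # [40]
```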

Answer 64

1. is the difference between Q3 and Q1 (Q3 - Q1), or the length of the box in a box plot.

Answer 65

1. is a table that displays how often an outcome occurs for a sample. 2. To construct a frequency distribution, the data set is divided into mutually exclusive classes.

Answer 66

is either a value of a categorical variable or an interval of a continuous variable.

Answer 67

is the number of events or values that fall under each class.
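For a categorical variable, a frequency distribution can be tallied with `collections.Counter`, where each distinct value is a class and its count is the frequency. The grade data is made up for illustration.

```python
from collections import Counter

grades = ["A", "B", "A", "C", "A", "B"]  # hypothetical categorical data

frequency = Counter(grades)  # class -> number of values in that class
relative = {cls: count / len(grades) for cls, count in frequency.items()}

for cls in sorted(frequency):
    print(cls, frequency[cls], relative[cls])
```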

Answer 68

depicts data values by splitting a continuous variable into a number of class intervals, each known as a bin.
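A sketch of the binning step behind a histogram, assuming equal-width bins and assuming that a value exactly on the top edge is counted in the last bin. The function name and data are hypothetical; plotting libraries make their own choices about bin edges.

```python
def bin_counts(values, low, width, n_bins):
    """Count how many values fall in each of n_bins equal-width bins."""
    high = low + width * n_bins
    counts = [0] * n_bins
    for v in values:
        if v < low or v > high:
            continue  # outside the binned range; ignored in this sketch
        # A value on the top edge joins the last bin.
        index = min(int((v - low) // width), n_bins - 1)
        counts[index] += 1
    return counts

values = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10]
print(bin_counts(values, low=0, width=5, n_bins=2))  # [4, 6]
```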

Answer 69

1. estimate the probability density function of the continuous variable on the X-axis. 2. In short, the goal is to fit a smooth curve over the most rectangles, while minimizing the white space under the curve.

Answer 70

occurs when one mode exists in the histogram.

Answer 71

Contains two prevalent modes

Answer 72

Contains multiple prevalent modes

Answer 73

Contains a mode on the right with a tail of low-frequency bins on the left

Answer 74

Contains a mode on the left with a tail of low-frequency bins on the right

Answer 75

1. depicts data trends by using straight lines to connect successive data points in a scatter plot. 2. The straight lines show the general direction that data changes over time. 3. Because trends involve time, line charts commonly use a time metric for the horizontal axis.

Answer 76

1. quickly convey whether values are increasing, decreasing, or remaining constant between data points. 2. Steeper lines indicate more rapid increases or decreases, while flatter lines indicate little change between data points

Answer 77

is a straight line that depicts the general direction data changes from the first to last data point, often added to summarize the entire chart
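Taking the card's description literally, a trend line can be sketched as the straight line through the first and last data points. This is an assumption about the intended construction; many charting tools fit a least-squares line instead. The data is invented.

```python
def endpoint_trend_line(xs, ys):
    """Return (slope, intercept) of the line through the first and last points."""
    slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
    intercept = ys[0] - slope * xs[0]
    return slope, intercept

months = [0, 1, 2, 3, 4]       # hypothetical time axis
sales = [10, 12, 11, 15, 18]   # hypothetical measurements

slope, intercept = endpoint_trend_line(months, sales)
print(slope, intercept)  # 2.0 10.0
```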

Answer 78

1. nominal categorical data. 2. Lines suggest some relation from one item to the next, but nominal variables have no ordering so can have no such relation.

Answer 79

1. depicts data values for a categorical variable, using rectangular bars having lengths proportional to category values. 2. The chart is drawn using two axes: a category axis that displays the category names and a value axis that displays the counts

Answer 80

shows each category's portion of the total data, typically as a percentage.
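The percentages behind a pie chart are each category's count divided by the total. A minimal sketch with invented commuting data:

```python
def category_percentages(counts):
    """Convert category counts to each category's percentage of the total."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}

commute = {"bus": 3, "car": 5, "bike": 2}  # hypothetical counts
print(category_percentages(commute))  # {'bus': 30.0, 'car': 50.0, 'bike': 20.0}
```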

(106 cards)