MODULE 1 - DESCRIPTIVE STATISTICS Flashcards

1
Q

The study of statistics is often broken into what two main categories?

A
  1. descriptive statistics
  2. inferential statistics
2
Q

inferential statistics (3)

A
  1. Frequently, it is impossible to contact every person in large populations, so a smaller group is used, called a sample.
  2. A researcher can draw conclusions about the larger population using the sample data.
  3. Focuses on using information from the sample to make conclusions about the population from which the sample was drawn.
3
Q

descriptive statistics (4)

A
  1. focuses on summarizing survey data about a sample drawn from a population.
  2. Summary statistics include measures of central tendency such as mean, median, and mode; and dispersion such as range and standard deviation.
  3. Descriptive statistics does not draw conclusions beyond the data itself.
  4. Rather, descriptive statistics is a way to present data in a meaningful way.
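
As a rough illustration of the summary statistics named above, the following sketch uses Python's standard `statistics` module. The `ages` list is a made-up sample, not data from the course.

```python
import statistics

# Hypothetical sample: ages of seven survey respondents (assumed data).
ages = [23, 25, 25, 31, 38, 42, 54]

summary = {
    # Measures of central tendency
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    # Measures of dispersion
    "range": max(ages) - min(ages),
    "std_dev": statistics.stdev(ages),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The sketch only describes the seven values; it says nothing about the population they came from.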
4
Q

What is data?

A

is information, especially facts or numbers, usually collected or computed for purposes of analysis.

5
Q

Common sources of data (3)

A
  1. Social networks
  2. Traditional Business Systems
  3. Internet of Things
6
Q

Data analytics

A

is the field of analyzing data to gain insight, draw conclusions, or make decisions.

7
Q

Big data

A

refers to very large data sets that cannot be processed by traditional methods, and is characterized by high volume, rapid velocity of collection, and variety in type and quality.

8
Q

3 Types of data analytics

A
  1. Descriptive
  2. Predictive
  3. Prescriptive
9
Q

Descriptive data analytics

A

seeks to describe data, providing insight and knowledge.

10
Q

Predictive data analytics

A

seeks to make predictions from data

11
Q

Prescriptive data analytics

A

seeks to make decisions (prescriptions) based on data

12
Q

Data is typically represented using what?

A

variables

13
Q

variable

A

is an item that can have different (“varying”) values

14
Q

Variables are often considered as being of two possible types:

A
  1. quantitative variable
  2. categorical variable
15
Q

quantitative variable

A

can take on a numeric value (quantitative data) that can be measured and ordered

16
Q

categorical variable (qualitative variable)

A

can take on the value (usually a label) of one of several categories

17
Q

reason for distinguishing variable types (3)

A
  1. Each type is handled differently in data analytics
  2. A categorical variable typically involves counting the instances of each category, often then depicted with a bar chart or pie chart.
  3. But a quantitative variable is commonly plotted versus another quantitative variable, often depicted with a scatter plot or line chart
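
To make the counting step for a categorical variable concrete, here is a minimal sketch using `collections.Counter`. The `responses` list is invented for illustration, and the text-based bars only stand in for a real bar chart.

```python
from collections import Counter

# Hypothetical survey responses for a nominal variable (assumed data).
responses = ["apple", "grape", "apple", "orange", "apple", "grape"]

# Count the instances of each category.
counts = Counter(responses)

# Print a crude text bar chart, most frequent category first.
for category, count in counts.most_common():
    print(f"{category:<8}{'#' * count} ({count})")
```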
18
Q

Two types of categorical variables are often distinguished

A
  1. Nominal
  2. Ordinal
19
Q

Nominal variable

A

has no ordering and exists in name only, like apples, oranges, and grapes. (“Nominal” means “in name only”.)

20
Q

Ordinal Variable

A

has an ordering, like disagree, neutral, and agree.

21
Q

Two types of quantitative variables are often distinguished

A
  1. continuous variable
  2. discrete variable
22
Q

continuous variable

A

can take any of infinitely many values along a continuum within a range, typically real numbers. Continuous variables usually represent measurements, like height (in meters) or temperature (in degrees).

23
Q

discrete variable (3)

A
  1. takes a finite number of values within a range, typically integers.
  2. Discrete variables usually represent countable items, like people in a family or cars in a city.
  3. Generally, if “number of” can be added to the beginning, the variable is discrete, like “number of people in a family”, but not “number of height”.
24
Q

Data visualization

A

is the display of data in a format, such as a table or chart, that seeks to achieve a goal of conveying particular information to a viewer

25
Q

Considerations for data visualization

A
  1. Cardinality
  2. The best choice of visualization depends on the kind of data being presented and the information to be conveyed.
26
Q

Cardinality (2)

A
  1. is the number of unique elements in a dataset.
  2. Scatter plots, line charts, and histograms work well for high-cardinality data.
27
Q

Pie charts

A

are a good choice for low-cardinality data, and for showing the relative frequency with which unrelated categories occur.

28
Q

scatter plot

A

can be used to identify trends.

29
Q

A bar chart

A

is a good choice for displaying frequency or counts in low-cardinality data.

30
Q

spreadsheet application

A

is a common computer application for organizing data like text or numbers, for using formulas to calculate a mathematical quantity using existing data as inputs, and for creating charts to visualize data.

31
Q

A spreadsheet consists of? (2)

A
  1. A spreadsheet consists of cells organized into columns and rows. The column headings are letters and the row headings are numbers, but headings are not counted as cells.
  2. A user can enter data, like words or numbers, into each cell. The spreadsheet is a convenient way to create a table of data.
32
Q

spreadsheet function

A

is a predefined formula that supports common tasks such as computing the average, minimum, or maximum of a group of cells.

33
Q

function syntax

A

defines how the function is used, and specifies the function’s name and accepted arguments

34
Q

Function’s arguments (3)

A
  1. are surrounded by parentheses and specify the data that the function operates on.
  2. Arguments may be numbers, cells, a range of cells, or a combination thereof.
  3. The [ ] arguments are optional.
35
Q

To call a function in a spreadsheet

A

An equals sign (=) is followed by the function’s name and then, in parentheses, the arguments separated by commas.

36
Q

range operator (:)

A
  1. defines a reference to a group of cells.
  2. Ex: =SUM(A1:A4, B10) calculates the sum of cells A1, A2, A3, A4, and B10.
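
For readers who think in code, the following Python sketch mimics how a range reference expands into individual cells. The helpers `expand_range` and `sheet_sum` and the `sheet` dictionary are hypothetical, they are not part of any spreadsheet product, and the sketch assumes single-letter columns and ranges within one column.

```python
# Hypothetical sheet contents keyed by cell name (assumed data).
sheet = {"A1": 2, "A2": 4, "A3": 6, "A4": 8, "B10": 10}

def expand_range(ref):
    """Expand 'A1:A4' into ['A1', 'A2', 'A3', 'A4']; pass single cells through."""
    if ":" not in ref:
        return [ref]
    start, end = ref.split(":")
    column = start[0]
    if end[0] != column:
        raise ValueError("this sketch only handles ranges within one column")
    first_row, last_row = int(start[1:]), int(end[1:])
    return [f"{column}{row}" for row in range(first_row, last_row + 1)]

def sheet_sum(*refs):
    """Rough analogue of =SUM(...): add up every cell named by the arguments."""
    return sum(sheet[cell] for ref in refs for cell in expand_range(ref))

total = sheet_sum("A1:A4", "B10")  # analogue of =SUM(A1:A4, B10)
print(total)
```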
37
Q

The two primary methods of inferential statistics

A

confidence intervals, and hypothesis testing

38
Q

Confidence Intervals

A

specify a range of values that is expected to contain a population parameter at a given confidence level

39
Q

hypothesis testing

A

allows differences between population parameters to be compared.

40
Q

Surveys

A

Are conducted to allow statisticians to make generalizations about a population.

41
Q

population

A

is any collection of objects, people, or things about which statistical inferences are made

42
Q

parameter of a population

A

is a numerical characteristic of a population, such as mean, median, or standard deviation.

43
Q

sampling unit

A

is an individual in the population on which a measurement can be taken.

44
Q

sampling frame

A

is the subset of the population from which a sample is drawn.

45
Q

sample

A

is composed of the sampling units that provide data to be collected.

46
Q

statistic

A

is a numerical characteristic of a sample, rather than the population.

47
Q

bias

A

is a difference between the parameter predicted from a survey and the true value of the parameter in the population.

48
Q

Two broad categories of statistical bias include

A

selection bias and response bias.

49
Q

selection bias

A

exists when the sampling units selected from a population are not representative of the entire population, and are instead biased toward certain subsets of the population.

50
Q

types of selection bias (3)

A
  1. Undercoverage bias
  2. Nonresponse bias
  3. Voluntary response bias
51
Q

Undercoverage

A

occurs when certain members of a population are inadequately represented in a sample.

52
Q

Nonresponse bias

A

occurs when a sample is biased toward members of a population that participate in a survey.

53
Q

Voluntary response bias

A

occurs when a sample is biased toward members that self-select for participation in a survey.

54
Q

response bias

A

can result if the responses of survey participants are affected by how a question is asked or the behaviors or attitudes of the participant.

55
Q

3 types of response bias

A
  1. Acquiescence bias
  2. extreme responding
  3. social desirability bias
56
Q

Acquiescence bias

A

occurs when respondents tend to agree with a statement in a survey.

57
Q

extreme responding

A

occurs when respondents tend to select the most extreme options available.

58
Q

social desirability bias (2)

A
  1. occurs when respondents tend to answer questions in a way that is socially accepted by others.
  2. In other words, a social desirability bias exists when respondents over-report “good” behaviors or under-report “bad” behaviors.
59
Q

Sampling methods

A

Different sampling methods can help mitigate certain types of statistical bias.

60
Q

Types of sampling methods (5)

A
  1. simple random sampling
  2. systematic sampling
  3. stratified
  4. cluster
  5. convenience
61
Q

simple random sampling (2)

A
  1. a sample is constructed by random selection from the population.
  2. Mathematically, simple random sampling is a sampling method in which all possible samples consisting of n units selected from a population of N units are equally likely.
62
Q

systematic sampling

A

every kth unit from a population of N units is selected to be in a sample.
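
The two methods above can be sketched with Python's `random` module. The population of 100 unit IDs, the sample size, and the seed are all assumptions made for illustration. The random starting point for the systematic sample is a common convention rather than something the cards specify.

```python
import random

rng = random.Random(42)           # seeded only so the sketch is repeatable
population = list(range(1, 101))  # hypothetical unit IDs 1..100, so N = 100
n = 10                            # desired sample size

# Simple random sampling: every possible sample of n units is equally likely.
simple_sample = rng.sample(population, n)

# Systematic sampling: pick a random start, then take every kth unit.
k = len(population) // n          # sampling interval
start = rng.randrange(k)          # random offset within the first interval
systematic_sample = population[start::k]

print(sorted(simple_sample))
print(systematic_sample)
```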

63
Q

stratified sampling

A

the population is first divided into groups, or strata, depending on some characteristic. Next, samples within each stratum are randomly selected in a proportional manner.

64
Q

cluster sampling

A
  1. the population is first divided into groups, or clusters, depending on some characteristic.
  2. Next, the sample is constructed by randomly selecting one or more clusters.
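
Stratified and cluster sampling both start by grouping the population but then diverge, which a short sketch may make clearer. The student population, the 10% sampling fraction, and the use of class year as the grouping characteristic are all hypothetical.

```python
import random

rng = random.Random(0)  # seeded only so the sketch is repeatable

# Hypothetical population: 60 students tagged with a class year (assumed data).
years = ["first"] * 30 + ["second"] * 20 + ["third"] * 10
population = [(f"student{i}", year) for i, year in enumerate(years)]

# Divide the population into groups by the chosen characteristic.
groups = {}
for unit in population:
    groups.setdefault(unit[1], []).append(unit)

# Stratified sampling: draw randomly from EVERY group, in proportion to its size.
fraction = 0.10
stratified = []
for members in groups.values():
    stratified.extend(rng.sample(members, round(len(members) * fraction)))

# Cluster sampling: randomly choose whole groups and keep ALL of their units.
chosen = rng.sample(sorted(groups), 1)
cluster = [unit for name in chosen for unit in groups[name]]

print(len(stratified), len(cluster), chosen)
```

In the stratified sample every group is represented. In the cluster sample only the randomly chosen group appears.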
65
Q

convenience sampling

A

units are drawn from a subset of the population that is readily available.

66
Q

outlier

A

a data value that is either much greater than or much less than the rest of the data and not representative of the rest of the data being considered

67
Q

Spread (3)

A
  1. is a measure of how far apart values in a dataset are from each other
  2. a larger spread means that the values are more scattered.
  3. A lower spread means that the values are more clustered together.
68
Q

Graphical techniques include (3)

A

using dot plots, box plots, and histograms

69
Q

numerical techniques include calculating (3)

A

the interquartile range, variance, and standard deviation.

70
Q

Two common numerical measures of spread are

A

variance and standard deviation

71
Q

Variance

A

is the average of the squared differences from the mean

72
Q

Standard deviation

A

is the square root of the variance

73
Q

The calculations for the variance and standard deviation depend on whether

A

the dataset contains the whole population or a subset of the population.
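
The population-versus-sample distinction shows up directly in Python's `statistics` module, which provides separate functions for each case. In this sketch the `data` values are made up. The population versions divide the sum of squared differences by N, while the sample versions divide by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values with mean 5

# Treating the data as the WHOLE population: divide by N.
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

# Treating the data as a SAMPLE from a larger population: divide by n - 1.
samp_var = statistics.variance(data)
samp_sd = statistics.stdev(data)

print(f"population: variance={pop_var:.3f}, sd={pop_sd:.3f}")
print(f"sample:     variance={samp_var:.3f}, sd={samp_sd:.3f}")
```

For the same data, the sample variance is always somewhat larger than the population variance because of the smaller divisor.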

74
Q

What does the standard deviation represent?

A

The typical difference between a data value and the mean

75
Q

What does the range represent?

A

The spread between the maximum and minimum data values

76
Q

Which is a better measure of spread for the dataset represented in the computer output?

A

For symmetric data, standard deviation is usually the better measure of spread. For data that is skewed, interquartile range is usually the better measure of spread.

77
Q

maximum of a dataset

A

is the largest value in the dataset

78
Q

minimum of a dataset

A

is the smallest value in the dataset.

79
Q

range of a dataset

A

is the difference between the maximum and minimum of the dataset.

80
Q

percentile of a dataset

A

The pth percentile is the data value such that p percent of the data falls at or below that value.

81
Q

first quartile (Q1) (2)

A
  1. is the 25th percentile. One-quarter of the data fall at or below Q1.
  2. The first quartile is the median of the lower half of the data.
82
Q

third quartile (Q3) (3)

A
  1. is the 75th percentile.
  2. Three-quarters of the data fall at or below Q3.
  3. The third quartile is the median of the upper half of the data.
83
Q

50th percentile of a dataset.

A

Because half of the data fall at or below the median, the median is also the 50th percentile of a dataset.

84
Q

Collectively, the minimum and maximum values, Q1 , median, and Q3 form a set of descriptive statistics called the

A

five-number summary.
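
A five-number summary can be sketched using the "median of each half" definition of the quartiles given in the cards above. The function `five_number_summary` and the `scores` data are hypothetical. Excluding the median from both halves when the count is odd is one common convention, and other software may use slightly different quartile rules.

```python
import statistics

def five_number_summary(values):
    """Return min, Q1, median, Q3, max using the median-of-halves quartile rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd count, leave the middle value out of both halves (assumed convention).
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return {
        "min": data[0],
        "Q1": statistics.median(lower),
        "median": statistics.median(data),
        "Q3": statistics.median(upper),
        "max": data[-1],
    }

scores = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # hypothetical data
print(five_number_summary(scores))
```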

85
Q

box plot (4)

A
  1. is a data visualization that uses a box and several lines to depict the distribution of data in a dataset.
  2. A box spans the middle 50% of the data, with Q1 as the lower boundary of the box and Q3 as the upper boundary of the box.
  3. The median is shown as a line inside the box. Two lines, known as whiskers, extend from the lower boundary of the box to the minimum and from the upper boundary of the box to the maximum.
  4. The whiskers represent the lower and upper 25% of the data.
86
Q

Skew (2)

A
  1. is a measure of asymmetry in a distribution, which can be gauged by the difference between the mean and the median
  2. A positive skew means that the distribution is skewed to the right, while a negative skew means that the distribution is skewed to the left.
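
As a quick check of the sign convention, the sketch below computes mean minus median for two made-up datasets. The helper `mean_minus_median` is a hypothetical name. This difference is only a rough indicator of skew, not the formal skewness coefficient.

```python
import statistics

def mean_minus_median(values):
    """Rough skew indicator: positive suggests right skew, negative suggests left skew."""
    return statistics.mean(values) - statistics.median(values)

right_skewed = [1, 2, 2, 3, 3, 4, 20]      # long tail of high values (assumed data)
left_skewed = [1, 17, 18, 18, 19, 19, 20]  # long tail of low values (assumed data)

print(mean_minus_median(right_skewed))
print(mean_minus_median(left_skewed))
```

A single extreme value pulls the mean toward the tail while the median barely moves, which is why the difference takes the sign of the skew.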
87
Q

Detecting outliers (3)

A
  1. One way to detect outliers using a box plot is to determine how far each data element is from either Q1 or Q3.
  2. A data value greater than Q3 + 1.5(IQR) or less than Q1 - 1.5(IQR) is considered an outlier.
  3. Often, an outlier is not included in either whisker and is instead represented in the plot as a marker such as an open circle or a triangle.
88
Q

interquartile range (IQR) of a dataset

A
  1. is the difference between Q3 and Q1 (Q3 - Q1), or the length of the box in a box plot.
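
Putting the IQR to work, the following sketch flags outliers with the standard 1.5 × IQR rule. The helper names `quartiles` and `find_outliers` and the `data` list are hypothetical, and the quartiles use the median-of-halves convention, which may differ slightly from what a given tool reports.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the data."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return statistics.median(lower), statistics.median(upper)

def find_outliers(values):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [10, 12, 13, 14, 15, 16, 17, 18, 45]  # hypothetical data
print(find_outliers(data))
```

Here Q1 is 12.5 and Q3 is 17.5, so the IQR is 5 and the fences sit at 5 and 25. Only the value 45 falls outside them.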
89
Q

frequency distribution (2)

A
  1. is a table that displays how often an outcome occurs for a sample
  2. To construct a frequency distribution, the data set is divided into mutually exclusive classes.
90
Q

class

A

is either a value of a categorical variable or an interval of a continuous variable.

91
Q

frequency of a class

A

is the number of events or values that fall under each class.

92
Q

The most common graphical representation of a frequency distribution is a

A

histogram

93
Q

histogram

A

depicts data values by splitting a continuous variable into a number of class intervals, each known as a bin.
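
The binning step behind a histogram can be sketched in a few lines. The function `bin_counts`, the `heights` data, and the choice of four 10-unit bins are assumptions for illustration. Placing a value that lands exactly on the top edge into the last bin is one common convention.

```python
def bin_counts(values, low, width, n_bins):
    """Count how many values fall in each of n_bins equal-width class intervals."""
    counts = [0] * n_bins
    top = low + width * n_bins
    for v in values:
        if v == top:
            index = n_bins - 1  # include the top edge in the last bin
        else:
            index = int((v - low) // width)
        if 0 <= index < n_bins:  # ignore values outside the binned range
            counts[index] += 1
    return counts

heights = [150, 152, 158, 161, 163, 164, 167, 170, 171, 175, 182, 190]  # assumed data, in cm
counts = bin_counts(heights, low=150, width=10, n_bins=4)

# Print a text histogram: one row per bin, one mark per value.
for i, count in enumerate(counts):
    left = 150 + 10 * i
    print(f"{left}-{left + 10}: {'#' * count}")
```

The resulting counts, `[3, 4, 3, 2]`, form the frequency distribution that the histogram's rectangles would display.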

94
Q

A key goal of a histogram is to:

A
  1. estimate the probability density function of the continuous variable on the X-axis.
  2. In short, the goal is to fit a smooth curve over the most rectangles, while minimizing the white space under the curve.
95
Q

unimodal distribution

A

occurs when one mode exists in the histogram.

96
Q

Bimodal:

A

Contains two prevalent modes

97
Q

multimodal

A

Contains multiple prevalent modes

98
Q

Skewed left:

A

Contains a mode on the right with a tail of low-frequency bins on the left

99
Q

Skewed right:

A

Contains a mode on the left with a tail of low-frequency bins on the right

100
Q

line chart (3)

A
  1. depicts data trends by using straight lines to connect successive data points in a scatter plot.
  2. The straight lines show the general direction that data changes over time.
  3. Because trends involve time, line charts commonly use a time metric for the horizontal axis.
101
Q

main benefit of a line graph is to (2)

A
  1. quickly convey whether values are increasing, decreasing, or remaining constant between data points.
  2. Steeper lines indicate more rapid increases or decreases, while flatter lines indicate little change between data points
102
Q

linear trend line

A

is a straight line that depicts the general direction data changes from the first to last data point, often added to summarize the entire chart
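
One common way to obtain such a line is an ordinary least-squares fit. The cards do not say which method is intended, so treat this as an assumption. The function `linear_trend` and the monthly sales figures are hypothetical.

```python
def linear_trend(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

months = [1, 2, 3, 4, 5]      # time metric on the horizontal axis (assumed data)
sales = [10, 12, 15, 15, 18]  # hypothetical values

slope, intercept = linear_trend(months, sales)
print(f"trend: sales = {slope:.2f} * month + {intercept:.2f}")
```

A positive slope indicates that the values are generally increasing over time, and a negative slope indicates a general decrease.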

103
Q

A line chart should not be used for (2)

A
  1. nominal categorical data.
  2. Lines suggest some relation from one item to the next, but nominal variables have no ordering so can have no such relation.
104
Q

bar chart (2)

A
  1. depicts data values for a categorical variable, using rectangular bars having lengths proportional to category values.
  2. The chart is drawn using two axes: a category axis that displays the category names and a value axis that displays the counts.
105
Q

relative-frequency bar chart

A

shows each category’s portion of the total data, typically as a percentage.
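
Relative frequencies follow from the raw counts by dividing each count by the total, as in this minimal sketch. The `responses` list is invented for illustration.

```python
from collections import Counter

# Hypothetical survey responses (assumed data).
responses = ["agree", "neutral", "agree", "disagree",
             "agree", "neutral", "agree", "agree"]

counts = Counter(responses)
total = sum(counts.values())

# Each category's share of the total, as a percentage.
relative = {category: 100 * count / total for category, count in counts.items()}

for category, percent in relative.items():
    print(f"{category:<9}{percent:5.1f}%")
```

The percentages sum to 100, which lets charts built from samples of different sizes be compared directly.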
