MODULE 1 - DESCRIPTIVE STATISTICS Flashcards
The study of statistics is often broken into what two main categories?
- descriptive statistics
- inferential statistics
inferential statistics (3)
- Frequently, it is impossible to contact every person in large populations, so a smaller group is used, called a sample.
- A researcher can draw conclusions about the larger population using the sample data.
- Focuses on using information from the sample to make conclusions about the population from which the sample was drawn.
descriptive statistics (4)
- focuses on summarizing survey data about a sample drawn from a population.
- Summary statistics include measures of central tendency such as mean, median, and mode; and dispersion such as range and standard deviation.
- Descriptive statistics cannot make conclusions based on the data.
- Rather, descriptive statistics is a way to present data in a meaningful way.
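The summary statistics named above can be computed with Python's standard `statistics` module. This is a minimal sketch; the `scores` list is made-up illustration data, not from the course.

```python
import statistics

# Hypothetical sample: quiz scores for eight students (made-up values).
scores = [70, 85, 85, 90, 60, 75, 85, 90]

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle value of the sorted data
mode = statistics.mode(scores)            # most frequent value
data_range = max(scores) - min(scores)    # maximum minus minimum
std_dev = statistics.stdev(scores)        # sample standard deviation

print(mean, median, mode, data_range, round(std_dev, 2))
```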
What is data?
is information, especially facts or numbers, usually collected or computed for purposes of analysis.
Common sources of data (3)
- Social networks
- Traditional Business Systems
- Internet of Things
Data analytics
is the field of analyzing data to gain insight, draw conclusions, or make decisions.
Big data
refers to very large data sets that cannot be processed by traditional methods, and is characterized by high volume, rapid velocity of collection, and variety in type and quality.
3 Types of data analytics
- Descriptive
- Predictive
- Prescriptive
Descriptive data analytics
seeks to describe data, providing insight and knowledge.
Predictive data analytics
seeks to make predictions from data
Prescriptive data analytics
seeks to make decisions (prescriptions) based on data
Data is typically represented using what?
variables
variable
is an item that can have different (“varying”) values
Variables are often considered as being of two possible types:
- quantitative variable
- categorical variable
quantitative variable
can take on a numeric value (quantitative data) that can be measured and ordered
categorical variable (qualitative variable)
can take on the value (usually a label) of one of several categories
reason for distinguishing variable types (3)
- Each type is handled differently in data analytics
- A categorical variable typically involves counting the instances of each category, often then depicted with a bar chart or pie chart.
- But a quantitative variable is commonly plotted versus another quantitative variable, often depicted with a scatter plot or line chart
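Counting the instances of each category can be sketched with `collections.Counter`. The `fruit` responses below are hypothetical; the resulting counts are the values a bar chart or pie chart would display.

```python
from collections import Counter

# Hypothetical survey responses (a categorical variable).
fruit = ["apple", "orange", "apple", "grape", "apple", "orange"]

# Count the instances of each category.
counts = Counter(fruit)
print(counts)
```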
Two types of categorical variables are often distinguished
- Nominal
- Ordinal
Nominal variable
has no ordering, existing in name only, like apples, oranges, and grapes. (“Nominal” means “in name only”.)
Ordinal Variable
has an ordering, like disagree, neutral, and agree.
Two types of quantitative variables are often distinguished
- continuous variable
- discrete variable
continuous variable
can take any of infinitely many values along a continuum within a range, typically real numbers. Continuous variables usually represent measurements, like height (in meters) or temperature (in degrees).
discrete variable (3)
- takes a finite number of values within a range, typically integers.
- Discrete variables usually represent countable items, like people in a family or cars in a city.
- Generally, if “number of” can be added to the beginning, the variable is discrete, like “number of people in a family”, but not “number of height”.
Data visualization
is the display of data in a format, such as a table or chart, that seeks to achieve a goal of conveying particular information to a viewer
Considerations for data visualization
- Cardinality
- The kind of data being presented and the information to be conveyed
Cardinality (2)
- is the number of unique elements in a dataset.
- Scatter plots, line charts, and histograms work very well for high-cardinality data.
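Cardinality as defined above is simply the count of distinct values. A minimal sketch, with a hypothetical helper `cardinality` and made-up data:

```python
def cardinality(values):
    """Return the number of unique elements in a collection."""
    return len(set(values))

colors = ["red", "blue", "red", "green", "blue"]     # few distinct values
heights = [1.62, 1.75, 1.80, 1.68, 1.71, 1.93]       # every value is unique

print(cardinality(colors), cardinality(heights))
```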
Pie charts
are a good choice for low-cardinality data, and for showing the relative frequency with which unrelated categories occur.
scatter plot
can be used to identify trends.
A bar chart
is a good choice for displaying frequency or counts in low-cardinality data.
spreadsheet application
is a common computer application for organizing data like text or numbers, for using formulas to calculate a mathematical quantity using existing data as inputs, and for creating charts to visualize data.
A spreadsheet consists of? (2)
- A spreadsheet consists of cells organized into columns and rows. The column headings are letters and the row headings are numbers, but headings are not counted as cells.
- A user can enter data, like words or numbers, into each cell. The spreadsheet is a convenient way to create a table of data.
spreadsheet function
is a predefined formula that supports common tasks such as computing the average, minimum, or maximum of a group of cells.
function syntax
defines how the function is used, and specifies the function’s name and accepted arguments
Function’s arguments (3)
- are surrounded by parentheses and specify the data that the function operates on.
- Arguments may be numbers, cells, a range of cells, or a combination thereof.
- Arguments shown in square brackets [ ] are optional.
To call a function in a spreadsheet
Type = followed by the function’s name and then the arguments, in parentheses and separated by commas.
range operator (:)
- defines a reference to a group of cells.
- Ex: =SUM(A1:A4, B10) calculates the sum of cells A1, A2, A3, A4, and B10.
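To see how the range operator expands, here is a rough Python imitation of =SUM(A1:A4, B10). The `cells` dictionary and the helpers `expand_range` and `sheet_sum` are hypothetical, and the sketch assumes single-letter columns and single-column ranges only.

```python
# Hypothetical sheet contents, keyed by cell reference.
cells = {"A1": 2, "A2": 4, "A3": 6, "A4": 8, "B10": 10}

def expand_range(ref):
    """Expand a single-column range such as 'A1:A4' into its cell references."""
    if ":" not in ref:
        return [ref]
    start, end = ref.split(":")
    column = start[0]  # assumes a single-letter column
    return [f"{column}{row}" for row in range(int(start[1:]), int(end[1:]) + 1)]

def sheet_sum(*refs):
    """Mimic =SUM(...) for single cells and single-column ranges."""
    return sum(cells[name] for ref in refs for name in expand_range(ref))

total = sheet_sum("A1:A4", "B10")
print(total)
```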
The two primary methods of inferential statistics
confidence intervals and hypothesis testing
Confidence Intervals
specify the range within which a parameter falls with a given probability
hypothesis testing
allows differences between population parameters to be compared.
Surveys
Are conducted to allow statisticians to make generalizations about a population.
population
is any collection of objects, people, or things about which statistical inferences are made.
parameter of a population
is a numerical characteristic of a population, such as mean, median, or standard deviation.
sampling unit
is an individual in the population on which a measurement can be taken.
sampling frame
is the subset of the population from which a sample is drawn.
sample
is composed of the sampling units that provide data to be collected.
statistic
is a numerical characteristic of a sample, rather than the population.
bias
is a difference between the parameter predicted from a survey and the true value of the parameter in the population.
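The relationship between a parameter, a statistic, and bias can be illustrated with one unrepresentative sample. The ages below are made up, and a single sample only hints at bias; it is not a formal estimate of it.

```python
import statistics

# Hypothetical population: ages of ten people (made-up values).
population = [22, 25, 31, 38, 41, 47, 52, 58, 63, 70]

# A sample that only reaches the five youngest people.
sample = population[:5]

parameter = statistics.mean(population)   # characteristic of the population
statistic = statistics.mean(sample)       # characteristic of the sample
bias = statistic - parameter              # negative: the sample skews young

print(parameter, statistic, bias)
```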
Two broad categories of statistical bias include
selection bias and response bias.
selection bias
exists when the sampling units selected from a population are not representative of the entire population, and are instead biased toward certain subsets of the population.
types of selection bias (3)
- Undercoverage bias
- Nonresponse bias
- Voluntary response bias
Undercoverage
occurs when certain members of a population are inadequately represented in a sample.
Nonresponse bias
occurs when a sample is biased toward members of a population that participate in a survey.
Voluntary response bias
occurs when a sample is biased toward members that self-select for participation in a survey.
response bias
can result if the responses of survey participants are affected by how a question is asked or the behaviors or attitudes of the participant.
3 types of response bias
- Acquiescence bias
- extreme responding
- social desirability bias
Acquiescence bias
occurs when respondents tend to agree with a statement in a survey.
extreme responding
occurs when respondents tend to select the most extreme options available.
social desirability bias (2)
- occurs when respondents tend to answer questions in a way that is socially accepted by others.
- In other words, a social desirability bias exists when respondents over-report “good” behaviors or under-report “bad” behaviors.
Sampling methods
Different sampling methods can help mitigate certain types of statistical bias.
Types of sampling methods (5)
- simple random sampling
- systematic sampling
- stratified
- cluster
- convenience
simple random sampling (2)
- A sample is constructed by random selection from the population.
- Mathematically, simple random sampling is a sampling method in which all possible samples consisting of n units selected from a population of N units are equally likely.
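A simple random sample can be sketched with `random.sample`, which draws without replacement. The population of 100 labeled units and the fixed seed are assumptions made only so the example is repeatable.

```python
import random

population = list(range(1, 101))   # hypothetical units labeled 1..100
rng = random.Random(42)            # fixed seed only so the run is repeatable

# Every possible set of 10 distinct units is equally likely to be chosen.
sample = rng.sample(population, 10)
print(sample)
```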
systematic sampling
every kth unit from a population of N units is selected to be in a sample.
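Systematic sampling maps naturally onto list slicing. In this sketch the helper `systematic_sample` is hypothetical, and starting at a random position within the first k units is an added assumption, not something the card states.

```python
import random

def systematic_sample(units, k, rng=random):
    """Select every kth unit, starting from a random position in the first k."""
    start = rng.randrange(k)
    return units[start::k]

population = list(range(1, 101))   # hypothetical units labeled 1..100
sample = systematic_sample(population, k=10, rng=random.Random(0))
print(sample)
```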
stratified sampling
the population is first divided into groups, or strata, depending on some characteristic. Next, samples within each stratum are randomly selected in a proportional manner.
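Proportional stratified sampling can be sketched by drawing the same fraction from every stratum. The helper `stratified_sample`, the strata names, and the 10% fraction are all hypothetical.

```python
import random

def stratified_sample(strata, fraction, rng=random):
    """Randomly draw the same fraction of units from every stratum."""
    sample = {}
    for name, units in strata.items():
        size = round(len(units) * fraction)
        sample[name] = rng.sample(units, size)
    return sample

strata = {
    "undergraduate": [f"U{i}" for i in range(60)],
    "graduate": [f"G{i}" for i in range(40)],
}
sample = stratified_sample(strata, fraction=0.1, rng=random.Random(1))
print(sample)
```

Because each stratum contributes in proportion to its size, the 60/40 split of the population is preserved in the sample.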
cluster sampling
- the population is first divided into groups, or clusters, depending on some characteristic.
- Next, the sample is constructed by randomly selecting one or more clusters.
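Cluster sampling differs from stratified sampling in that whole groups are selected. A minimal sketch, with a hypothetical helper `cluster_sample` and made-up regional clusters:

```python
import random

def cluster_sample(clusters, n_clusters, rng=random):
    """Randomly pick whole clusters and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

clusters = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1", "w2"],
}
sample = cluster_sample(clusters, n_clusters=2, rng=random.Random(7))
print(sample)
```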
convenience sampling
units are drawn from a subset of the population that is readily available.
outlier
a data value that is either much greater than or much less than the rest of the data and not representative of the rest of the data being considered
Spread (3)
- is a measure of how far apart the values in a dataset are from each other.
- a larger spread means that the values are more scattered.
- A lower spread means that the values are more clustered together.
Graphical techniques include (3)
using dot plots, box plots, and histograms
numerical techniques include calculating (3)
the interquartile range, variance, and standard deviation.
Two common numerical measures of spread are
variance and standard deviation
Variance
is the average of the squared differences from the mean.
Standard deviation
is the square root of the variance
The calculations for the variance and standard deviation depend on whether
the dataset contains the whole population or a subset of the population.
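The population/sample distinction shows up directly in Python's `statistics` module: the population versions divide by N, while the sample versions divide by n - 1. The data values are made up for illustration.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values

# Treating the data as a whole population: divide by N.
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

# Treating the data as a sample of a larger population: divide by n - 1.
samp_var = statistics.variance(data)
samp_sd = statistics.stdev(data)

print(pop_var, pop_sd, samp_var, samp_sd)
```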
What does the standard deviation represent?
The typical difference between a data value and the mean
What does the range represent?
The spread between the maximum and minimum data values
Which is a better measure of spread for the dataset represented in the computer output?
For symmetric data, standard deviation is usually the better measure of spread. For data that is skewed, interquartile range is usually the better measure of spread.
maximum of a dataset
is the largest value in the dataset
minimum of a dataset
is the smallest value in the dataset.
range of a dataset
is the difference between the maximum and minimum of the dataset.
pth percentile of a dataset
is the data value such that p percent of the data fall at or below that value.
first quartile (Q1) (2)
- is the 25th percentile. One-quarter of the data fall at or below Q1.
- The first quartile is the median of the lower half of the data.
third quartile (Q3) (3)
- is the 75th percentile.
- Three-quarters of the data fall at or below Q3.
- The third quartile is the median of the upper half of the data.
50th percentile of a dataset.
Because half of the data fall at or below the median, the median is also the 50th percentile of a dataset.
Collectively, the minimum and maximum values, Q1, median, and Q3 form a set of descriptive statistics called the
five-number summary.
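A five-number summary can be sketched using the median-of-halves rule for quartiles described in these cards. The helper `five_number_summary` is hypothetical, and excluding the middle value from both halves when the count is odd is an assumption; software packages use several different quartile conventions.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                 # values below the middle position
    upper = data[len(data) - half:]     # values above the middle position
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

summary = five_number_summary([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
print(summary)
```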
box plot (4)
- is a data visualization that uses a box and several lines to depict the distribution of data in a dataset.
- A box spans the middle 50% of the data, with Q1 as the lower boundary of the box and Q3 as the upper boundary of the box.
- The median is shown as a line inside the box. Two lines, known as whiskers, extend from the lower boundary of the box to the minimum and from the upper boundary of the box to the maximum.
- The whiskers represent the lower and upper 25% of the data.
Skew (2)
- describes the asymmetry of a distribution and is commonly gauged by the difference between the mean and the median
- A positive skew means that the distribution is skewed to the right, while a negative skew means that the distribution is skewed to the left.
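The sign of (mean - median) gives a quick rule-of-thumb check for the direction of skew. The helper `skew_direction` and the income values are hypothetical, and this is a rough indicator rather than a formal skewness coefficient.

```python
import statistics

def skew_direction(values):
    """Classify skew by the sign of (mean - median)."""
    diff = statistics.mean(values) - statistics.median(values)
    if diff > 0:
        return "right"
    if diff < 0:
        return "left"
    return "none"

incomes = [30, 32, 35, 36, 38, 40, 150]   # one large value pulls the mean up
print(skew_direction(incomes))
```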
Detecting outliers (3)
- One way to detect outliers using a box plot is to determine how far each data element is from either Q1 or Q3.
- A data value greater than Q3 + 1.5(IQR) or less than Q1 - 1.5(IQR) is considered an outlier.
- Often, an outlier is not included in either whisker and is instead represented in the plot as a marker such as an open circle or a triangle.
interquartile range (IQR) of a dataset
- is the difference between Q3 and Q1 (Q3 - Q1), or the length of the box in a box plot.
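The IQR is the basis of the conventional outlier fences at 1.5 × IQR beyond the quartiles. In this sketch the helper `iqr_outliers` is hypothetical and the quartiles are computed with the median-of-halves rule; other quartile conventions can give slightly different fences.

```python
import statistics

def iqr_outliers(values):
    """Return the values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])               # median of lower half
    q3 = statistics.median(data[len(data) - half:])   # median of upper half
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

outliers = iqr_outliers([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])
print(outliers)
```

For this made-up dataset Q1 = 36 and Q3 = 43, so the IQR is 7 and the fences sit at 25.5 and 53.5.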
frequency distribution (2)
- is a table that displays how often an outcome occurs for a sample
- To construct a frequency distribution, the dataset is divided into mutually exclusive classes.
class
is either a value of a categorical variable or an interval of a continuous variable.
frequency of a class
is the number of events or values that fall under each class.
The most common graphical representation of a frequency distribution is a
histogram
histogram
depicts data values by splitting a continuous variable into a number of class intervals, each known as a bin.
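Splitting a continuous variable into bins and counting the values in each bin can be sketched as follows. The helper `bin_counts`, the bin width of 10, and the height values are assumptions for illustration; each bin covers the interval from its left edge up to, but not including, the next left edge.

```python
from collections import Counter

def bin_counts(values, start, width):
    """Count how many values fall in each interval [left, left + width)."""
    counts = Counter(start + width * ((v - start) // width) for v in values)
    return dict(sorted(counts.items()))

heights_cm = [152, 158, 161, 163, 167, 168, 171, 174, 179, 186]  # hypothetical
frequencies = bin_counts(heights_cm, start=150, width=10)
print(frequencies)
```

The keys of `frequencies` are the left edges of the bins, and the values are the bar heights a histogram would draw.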
A key goal of a histogram is to:
- estimate the probability density function of the continuous variable on the X-axis.
- In short, the goal is to fit a smooth curve over the most rectangles, while minimizing the white space under the curve.
unimodal distribution
occurs when one mode exists in the histogram.
Bimodal:
Contains two prevalent modes
multimodal
Contains multiple prevalent modes
Skewed left:
Contains a mode on the right with a tail of low-frequency bins on the left
Skewed right:
Contains a mode on the left with a tail of low-frequency bins on the right
line chart (3)
- depicts data trends by using straight lines to connect successive data points in a scatter plot.
- The straight lines show the general direction that data changes over time.
- Because trends involve time, line charts commonly use a time metric for the horizontal axis.
main benefit of a line graph is to (2)
- quickly convey whether values are increasing, decreasing, or remaining constant between data points.
- Steeper lines indicate more rapid increases or decreases, while flatter lines indicate little change between data points
linear trend line
is a straight line that depicts the general direction data changes from the first to last data point, often added to summarize the entire chart
A line chart should not be used for (2)
- nominal categorical data.
- Lines suggest some relation from one item to the next, but nominal variables have no ordering so can have no such relation.
bar chart (2)
- depicts data values for a categorical variable, using rectangular bars having lengths proportional to category values.
- The chart is drawn using two axes: a category axis that displays the category names and a value axis that displays the counts
relative-frequency bar chart
shows each category’s portion of the total data, typically as a percentage.
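The values plotted in a relative-frequency bar chart can be sketched by dividing each category count by the total. The helper `relative_frequencies` and the survey responses are hypothetical.

```python
from collections import Counter

def relative_frequencies(labels):
    """Return each category's share of the total as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}

responses = ["agree", "agree", "neutral", "disagree", "agree",
             "neutral", "agree", "disagree"]
shares = relative_frequencies(responses)
print(shares)
```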