Chapter 1: Exploring data Flashcards
Individuals
Individuals are the objects described by a set of data
Variable
an attribute that describes a person, place, thing, or idea
Categorical Variable
categorical variables take on values that are names or labels
Quantitative Variable
quantitative variables take numerical values for which arithmetic, such as averaging, makes sense
Continuous
continuous distribution is one in which data can take on any value within a specified range
Univariate Data
data that describe only one variable
Bivariate Data
data that describe the relationship between two variables
Population
population refers to the total set of observations that can be made
Sample
a sample refers to a set of observations drawn from a population
Census
a study that obtains data from every member of a population
Distribution
The distribution of a statistical data set (or a population) is a listing or function showing all the possible values of the data and how often each value occurs
Inference
inference is the process of using data analysis to deduce properties of an underlying probability distribution
Frequency Table
when a table shows frequency counts for a categorical variable, it is called a frequency table
Relative Frequency
Relative frequency = Subgroup count / Total count
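The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck; the `responses` list is made-up data and the variable names are assumptions chosen for readability.

```python
from collections import Counter

# Hypothetical survey responses (made-up data for illustration only).
responses = ["dog", "cat", "dog", "fish", "dog", "cat", "dog", "cat"]

# Frequency table: count of individuals in each category.
counts = Counter(responses)
total = sum(counts.values())

# Relative frequency = subgroup count / total count.
relative = {category: count / total for category, count in counts.items()}

for category in counts:
    print(f"{category:<5} count={counts[category]}  relative={relative[category]:.3f}")
```

The relative frequencies across all categories add up to 1, which is a quick check that no individual was dropped or counted twice.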
Table
tables showing the values of the cumulative distribution functions, probability functions, or probability
Roundoff Error
the difference between an approximation of a number used in computation and its exact (correct) value
Pie Chart
a circular statistical graphic, which is divided into slices to illustrate numerical proportion
Bar Graph
a chart that plots data using rectangular bars or columns
Two-way Table
a statistical table that shows the observed number or frequency for two variables
Marginal Distribution
marginal distribution is the distribution of one variable alone in a two-way table: each row total (or column total) as a percentage of the grand total
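Conditional Distribution
conditional distribution is the distribution of one variable among only the individuals who share a specific value of the other variable: percentages out of a single row total or column total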
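To make the difference between marginal and conditional distributions concrete, here is a small sketch. The two-way table and its labels (`junior`, `senior`, `dog`, `cat`) are hypothetical, invented only for this example.

```python
# Hypothetical two-way table of counts (made-up data): rows are grade level,
# columns are preferred pet.
table = {
    "junior": {"dog": 30, "cat": 20},
    "senior": {"dog": 10, "cat": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of grade level: each row total / grand total.
marginal_grade = {
    grade: sum(row.values()) / grand_total for grade, row in table.items()
}

# Conditional distribution of pet among juniors: cell count / that row's total.
junior_total = sum(table["junior"].values())
pet_given_junior = {
    pet: count / junior_total for pet, count in table["junior"].items()
}

print("marginal (grade):", marginal_grade)
print("conditional (pet | junior):", pet_given_junior)
```

The marginal distribution divides by the grand total, while the conditional distribution divides by the total of one row only.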
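Segmented Bar Graph
a bar graph with one bar per category of the first variable, where each bar is split into colored segments showing the share of each category of a second variable
Side-by-side Bar Graph
a bar graph in which the bars for each category of a second variable are drawn next to one another within each category of the first variable, so the groups can be compared directly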
Association
any relationship between two measured quantities that renders them statistically dependent
Simpson’s Paradox
an association that holds within each of several groups can reverse direction when the groups are combined and the data are viewed in aggregate
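A small numeric sketch can show how such a reversal happens. The counts below are made up for illustration, and the labels (`A`, `B`, `mild`, `severe`) are hypothetical; this is not data from any real study.

```python
# Hypothetical (successes, trials) counts, made up for illustration.
data = {
    "A": {"mild": (9, 10), "severe": (30, 100)},
    "B": {"mild": (80, 100), "severe": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a proportion."""
    return successes / trials

# Within each group, treatment A has the higher success rate.
for group in ("mild", "severe"):
    print(group, {t: round(rate(*data[t][group]), 3) for t in data})

# Combining the groups reverses the comparison.
overall = {}
for treatment, groups in data.items():
    successes = sum(s for s, _ in groups.values())
    trials = sum(n for _, n in groups.values())
    overall[treatment] = rate(successes, trials)

print("overall", {t: round(r, 3) for t, r in overall.items()})
```

In this made-up example the reversal comes from unequal group sizes: most of A's cases are severe, while most of B's are mild.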
Dot Plot
a graph for displaying the distribution of numerical variables where each dot represents a value
Shape
the overall form of a distribution: whether it is symmetric or skewed to the left or right, how many peaks it has, and whether it is uniform
Mode
the value that appears most often in a set of data
Center
mean or median of the data
Spread
how similar or varied the observed values of a particular variable (data item) are
Range
a simple measure of variation: the largest value in the data set minus the smallest value
Outlier
a data point that diverges greatly from the overall pattern of data is called an outlier
Symmetric
a symmetric distribution can be divided at the center so that each half is a mirror image of the other
Skewed Right
distributions with a long right tail, where fewer observations lie on the right (toward higher values), are said to be skewed right
Skewed Left
distributions with a long left tail, where fewer observations lie on the left (toward lower values), are said to be skewed left
Unimodal
distributions with one clear peak are called unimodal
Bimodal
distributions with two clear peaks are called bimodal
Multimodal
a probability distribution with more than one peak, or “mode”
Stemplot
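a display that splits each value into a stem (all but the final digit) and a leaf (the final digit); the entries on the left are called stems, and the entries on the right are called leaves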
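As a rough sketch of how a stemplot is built, the snippet below splits two-digit values into a tens-digit stem and a ones-digit leaf. The data are made up, and the tens/ones split is an assumption that fits only data of this kind.

```python
from collections import defaultdict

# Hypothetical two-digit data (made-up values for illustration).
values = [12, 15, 21, 23, 23, 30, 34, 38, 41]

# Stem = tens digit, leaf = ones digit; sorting first keeps leaves in order.
stems = defaultdict(list)
for v in sorted(values):
    stems[v // 10].append(v % 10)

# One line per stem, including stems that have no leaves.
lines = [
    f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}"
    for stem in range(min(stems), max(stems) + 1)
]
print("\n".join(lines))
```

Stems with no leaves still get a line, so gaps in the distribution stay visible.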
Splitting Stems
stem-and-leaf plots that have more than 1 space on the stem for the same interval
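Back-to-back Stemplot
back-to-back stemplots are a graphic option for comparing data from two populations: both groups share one column of stems, with the leaves of one group extending to the left and the leaves of the other extending to the right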
Plots
a graphical technique for representing a data set, usually as a graph showing the relationship between two or more variables
Histogram
columns are positioned over a label that represents a continuous, quantitative variable, and the height of the column indicates the size of the group defined by the column label
Mean
the arithmetic average of the data: the sum of all the values divided by the number of values
Median
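the middle value of the data when the values are sorted from smallest to largest; with an even number of values, the average of the two middle values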
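The measures of center can be computed with Python's standard `statistics` module. The sketch below uses a made-up data set with one unusually large value, chosen to show how that value affects the mean more than the median.

```python
import statistics

# Hypothetical data set (made-up values) with one unusually large value.
values = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(values)      # sum of the values / number of values
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print("mean:", mean, "median:", median, "mode:", mode)
```

Here the single large value pulls the mean well above the median, which is why the median is often preferred as a measure of center for skewed data.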
Interquartile Range
a measure of variability based on dividing a data set into quartiles: the third quartile minus the first quartile (IQR = Q3 - Q1)
Five-number Summary
the minimum, first quartile, median, third quartile, and maximum of a data set; gives information about the location (from the median), spread (from the quartiles) and range (from the sample minimum and maximum) of the observations
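Below is a minimal sketch of computing the minimum, quartiles, median, and maximum, plus the interquartile range (Q3 - Q1). The helper `five_number_summary` is a hypothetical name, and the quartile rule it uses (medians of the lower and upper halves) is one common textbook convention assumed here; other software may use different rules and give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for at least two values.

    Assumes one common textbook convention: Q1 and Q3 are the medians of the
    lower and upper halves of the sorted data, leaving out the overall median
    when the number of values is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

# Hypothetical data set (made-up values).
values = [7, 1, 3, 9, 5, 11, 13]
minimum, q1, median, q3, maximum = five_number_summary(values)
iqr = q3 - q1  # interquartile range

print(minimum, q1, median, q3, maximum, "IQR:", iqr)
```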
Summary
A summary is a brief statement or restatement of main points
Boxplot
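a graph of the five-number summary of quantitative data: a box spans the first to the third quartile, a line inside the box marks the median, and whiskers extend toward the minimum and maximum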
Standard deviation
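a numerical value used to indicate how widely individuals in a group vary: roughly the typical distance of the values from their mean, computed as the square root of the variance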
Variance
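a numerical value used to indicate how widely individuals in a group vary: the average of the squared deviations from the mean (dividing by n - 1 for a sample); the square of the standard deviation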
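The link between variance and standard deviation can be sketched step by step. The data below are made up, and the sample convention (dividing by n - 1) is assumed; a population calculation would divide by n instead.

```python
import math
import statistics

# Hypothetical data set (made-up values).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in values]

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = sum(squared_deviations) / (n - 1)

# Standard deviation: square root of the variance.
sample_sd = math.sqrt(sample_variance)

print("variance:", round(sample_variance, 3), "sd:", round(sample_sd, 3))

# The standard library's sample versions should agree with the manual steps.
print(statistics.variance(values), statistics.stdev(values))
```

Because the standard deviation is in the same units as the data while the variance is in squared units, the standard deviation is usually the one reported alongside the mean.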