Exam 1 Flashcards
Cases
the objects described by a set of data. These can be companies, customers, etc. Basically anything that is described by the set of data collected.
Variable
A special characteristic of a case. Such as age, sex, gender, etc. This is a specific attribute of the cases/individual.
Categorical variables
variables which describe the characteristic of the individual or category.
Quantitative variables
variables which takes numerical values on which arithmetic operations make sense.
Data set
the collection of raw statistics and information collected by a research study.
Graphs for categorial variables examples
Bar graphs & pie charts
Graphs for quantitative variables examples
Stemplots and Histograms
Individual
Object described by data.
We have a data set where the cases are college students. One of the variables in the data set is “gender.” The values of gender are 1 if the student is male and 2 if the student is female. What type of variable is gender?
Categorical(because it doesn’t make sense to do any arithmetic with group 1 and group 2).
Units of measurement are an important part of the description of what type of variables?
Quantitative
The first day of class, the professor collects information on each student to make a data set that will be analyzed throughout the semester. The information asked includes hometown, GPA, number of classes taking, number of siblings, and favorite subject. How many quantitative variables are in this data set?
three (number of classes, GPA, number of siblings. All of these can be averaged and/or put through other arithmetic operations).
Consider the following data, which describe the amount of time in minutes that students spend studying for a quiz:
10, 11, 11, 12, 12, 14, 15, 18, 19, 20, 22, 24, 39, 40, 41, 44, 46, 50, 52, 52, 53, 55, 70
What numbers make up the leaf of the first stem?
0,1,1,2,2,4,5,8,9
classes
intervals on histograms of equal width.
Right skewed
When the right side of the distribution is “longer” aka not as large as the left.
What is a time series?
A data set which tracks a sample overtime.
When examining a distribution of a quantitative variable, which of the following features do we look for?
- Overall shape, center, and spread
- Symmetry or skewness
- Deviations from overall patterns, such as outliers
- The number of peaks or modes
Pie charts are useful for…
Seeing which part of the whole a group forms.
What is the interquartile range for a set of data?
The interquartile range is the difference between the 75th and 25th percentiles of data.
IQR = Q3 - Q1
A reporter wishes to portray baseball players as overpaid. Which measure of center should he report as the average salary of major league players?
The mean, as higher pays will pull the mean in their direction.
If the median is much larger than the mean, this is indicative of what type of stemplot?
The stemplot of the data would be skewed left
If the skew is symmetrical, what can be said about the median and mean?
They are equal
What does the five number summary consist of for a set of data?
- Minimum
- Q1
- Median
- Q3
- Maximum
Define Histogram
a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.
A boxplot is
A graph of the five number summary.
Box plot’s central box spans Q1 and Q3 with a line in the box indicating the median and lines extend out from the box to the smallest and largest observations.
What is the 1.5-IQR formula?
Outlier = 1.5 * IQR
What is the five number summary of this list of numbers?
175 199 204 234 259 275 299 304 317 345 355 384 549
Minimum: 175
Q1:219
Median: 299
Q3: 350
Maximum: 549
Density curves do not adeuquately reveal…
Outliers.
In a skewed density curve, what is pulled away in the direction of the long tail?
Mean
The time to complete an exam is approximately Normal with a mean of 70 minutes and a standard deviation of 10 minutes. Using the 68-95-99.7 rule, what percent of students will complete the exam in under an hour?
16%. This is because 68% will fall within one standard deviation of the time, the rest, 32% will fall outside of the 68% range (one STDV range). Since 32% will be split evenly above and below the 68% range, only 16% will fall in the “under and hour” group.
What information can we get from Z-scores?
How many standard deviations an observation in a set falls from the mean and in which direction.
Formula for z-score?
z = ((obervation value of set)-(mean of set))/(standard deviation)
A normal distribution is…
Symmetric about the mean
A data set lists apartments available for students to rent. Information provided includes the monthly rent per person, whether cable is included free of charge, whether or not pets are allowed, the number of bedrooms, the number of bathrooms, and the distance to the campus. Describe the cases in the data set, give the number of variables, and specify whether each variable is categorical or quantitative.
Cases = Apartments for students to rent
Variables: monthly rent per person(quantitative), cable (categorical), pets(categorical), number of bedrroms (quantitative - categorical), number of bathrooms(quantitative - categorical), distance to campus (quantitative - categorical).
This is the term used to describe how the value of a variable changes from case to case.
Distribution.
Exploratory data anaylsis is…
The “first impressions” of a data set. Just describing the general features of what is seen.
The processes of using data to predict aspects of the future is known as…
Predictive analytics
Bar graphs and pie charts are graphical representations for this type of variable.
Categorical
How are the percent and proportion related?
Percent = proportion * 100
What are the types of variables and their values?

Reasons are the categorical variables and their values are count.
Pie charts naturally use what type of value?
Percents
Stemplots and histograms are used for…
Quantitative variables
What is the process of separating each stem into 0-4 and 5-9?
Splitting stems
The processes of removing the last digit or digits before making a stemplot is known as…
Trimming
This type of graphical display breaks the values of a variable into classes and only show the percentage or count that fall into each class.
Histogram
Between stemplots and histograms, which is prefered for small data sets and why?
Stemplots, because they show the actual value of the data of the observations.
The Minimum and Quartile 1 values make up what “tail”?
Left tail.
The maximum and quartile 3 values make up what tail?
Right tail.
When does s = 0?
when all observations have the same value/ no spread.
Is the standard deviation resistant or not- resistant?
Not resistant. depends upon the mean.
x-bar and s are associated with what type of data set size?
Sample
What are µ and σ associated with (population or sample)?
Population
If standard deviation of a density curve changes, what will change graphically?
The spread
What is the abbreviation for normal distribution’s mean and standard deviation?
N(µ, σ)
Variables must be measured from the same ____ for a relationship to be measured.
Cases.
To evaluate or check something against a standard is called:
Benchmarking
The strength of a relationship between two quantitative variables is measured with…
correlation r
The , is the fraction of the variation in values of y that is explained by the least-squares regression of y on x.
square of the correlation, r2
R^2 formula
(var(mean) - var(line))/(var(mean))
“There is 81% less variation around the line than around the mean.”
This is saying that 81% of the association is accounted for by the relationship between the x and y variables.
The line that makes the sum of the squares of the vertical distance of the data points as small as possible.
Least Squares Regression Line
Correlations based on average tend to be _____ than correlations based on individuals.
Higher than
Interest rates for home mortgages have, in general, declined during recent months. With the apparent favorable influence for new-home building, there seems to be a clear relationship between 𝑥=the prevailing mortgage interest rate and 𝑦= the number of new houses being built per month in a Midwestern city over a period of 18 months. A scatterplot of the data collected shows that the linear model is appropriate. The equation of the least-squares regression line is
Number of new houses =672.89−30.65 × Interest rate and 𝑟2=0.49
Which of the following descriptions best represents the value of the slope?
When the interest rate increases by 1%, the number of new houses being built is expected to drop by 30.65.