Module 1 Flashcards

1
Q

Define the DCOVA framework.

A

The DCOVA framework consists of the following tasks:
• Define the data that you want to study in order to solve a problem or meet an objective.
• Collect the data from appropriate sources.
• Organize the data collected by developing tables.
• Visualize the data collected by developing charts.
• Analyze the data collected to reach conclusions and present those results.

2
Q

What is data?

A

Data are “the values associated with a trait or property that help distinguish the occurrences of something.”
Put another way, data are the set of individual values associated with a variable.

3
Q

What is a variable?

A

A variable is the trait or property of something that values (data) are associated with.
Put another way, a variable is a characteristic of an item or individual.

4
Q

What is statistics?

A

Statistics

The methods that help transform data into useful information for decision makers.

5
Q

What is big data?

A

Big data are data that are being collected in huge volumes and at very fast rates (typically
in real time) and data that arrive in a variety of forms, organized and unorganized. These
attributes of “volume, velocity, and variety,” first identified in 2001 (see reference 5), make big
data different from any of the data sets used in this book.

6
Q

Operational Definition

A

An operational definition is a universally accepted meaning that is clear to all associated with an analysis. This definition
should clearly identify the values of the variable necessary to ensure that collected data are acceptable and appropriate for analysis.

7
Q

Categorical variables (also known as qualitative variables)

A

Categorical variables have values that can only be placed into categories, such as yes and no.

8
Q

Numerical variables (also known as quantitative variables)

A

Numerical variables have values that represent a counted or measured quantity.

9
Q

Discrete variables

A

Discrete variables have numerical values that arise from a counting process. “Number of
items purchased” is an example of a discrete numerical variable.

10
Q

Continuous variables

A

Continuous variables have numerical values that arise from a
measuring process. “The time spent waiting on a checkout line” is an example of a continuous
numerical variable because its values can represent a measurement with a stopwatch.

11
Q

Can variables be categorical AND numerical?

A

Yes. For example, “age” would seem to be an obvious numerical variable, but what if you are
interested in comparing the buying habits of children, young adults, middle-aged persons, and
retirement-age people? In that case, defining “age” as a categorical variable would make better
sense. The right choice depends on the operational definition.

12
Q

Nominal Scale

A

A nominal scale classifies data into distinct categories in which no ranking
is implied. Examples of a nominal scaled variable are your favorite soft drink, your political
party affiliation, and your gender. Nominal scaling is the weakest form of measurement because you cannot specify any ranking across the various categories.

13
Q

Ordinal Scale

A

An ordinal scale classifies values into distinct categories in which ranking is implied.
For example, suppose that GT&M conducted a survey of customers who made a purchase and
asked the question “How do you rate the overall service provided by Good Tunes & More during your most recent purchase?” to which the responses were “excellent,” “very good,” “fair,”
and “poor.”

14
Q

interval scale

A

An interval scale is an ordered scale in which the difference between measurements is a
meaningful quantity but does not involve a true zero point. For example, a noontime temperature reading of 67° Fahrenheit is 2 degrees warmer than a noontime reading of 65°.

15
Q

ratio scale

A

A ratio scale is an ordered scale in which the difference between the measurements involves a true zero point, as in height, weight, age, or salary measurements. If GT&M conducted a survey and asked how much money you expected to spend on audio equipment in
the next year, the responses to such a question would be an example of a ratio-scaled variable.
A person who expects to spend $1,000 on audio equipment expects to spend twice as much
money as someone who expects to spend $500. As another example, a person who weighs
240 pounds is twice as heavy as someone who weighs 120 pounds.

16
Q

primary data source

A

You are using a primary data source if you collect your own data for analysis.

17
Q

secondary data source

A

You are using a secondary data source if the data for your analysis have been collected by someone else.

18
Q

population

A

A population consists of all the items or individuals about which you want to reach conclusions.

19
Q

sample

A

A sample is a portion of a population selected for analysis.

20
Q

Structured data

A

Structured data are data that follow some organizing principle or plan, typically a repeating pattern.

21
Q

unstructured data

A

Unstructured data follow no repeating pattern. For example, if five different
persons sent you an email message concerning the stock trades of a specific company, that data
could be anywhere in the message.

22
Q

outliers

A

Outliers are values that seem excessively different from most of the rest of the
values.

23
Q

missing value

A

A missing value is a value that could not be collected (and is therefore not available for analysis).

24
Q

recoded variable

A

After you have collected data, you may discover that you need to reconsider the categories that
you have defined for a categorical variable or that you need to transform a numerical variable
into a categorical variable by assigning the individual numeric data values to one of several
groups. In either case, you can define a recoded variable that supplements or replaces the
original variable in your analysis.

25
mutually exclusive
The category definitions cause each data value to be placed in one and only one category, a property known as being mutually exclusive.
26
collectively exhaustive
Ensure that the set of categories you create for the new, recoded variable includes all the data values being recoded, a property known as being collectively exhaustive. If you are recoding a categorical variable, you can preserve one or more of the original categories, as long as your recodings are both mutually exclusive and collectively exhaustive.
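As a rough illustration of recoding, the sketch below turns a numerical age into one of four categories. The cutoffs (18, 40, 65) and the name `recode_age` are assumptions made for this example and do not come from the deck. Every valid age lands in exactly one category (mutually exclusive) and no valid age is left out (collectively exhaustive).

```python
def recode_age(age):
    """Recode a numerical age into one category (cutoffs are hypothetical)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "child"
    if age < 40:
        return "young adult"
    if age < 65:
        return "middle-aged"
    # Everything from 65 upward falls here, so no valid age is left uncategorized.
    return "retirement-age"


ages = [7, 17, 18, 39, 40, 64, 65, 90]  # made-up values
recoded = [recode_age(a) for a in ages]
```

Because the conditions are checked in order and each boundary value belongs to only one branch, a value such as 18 cannot be both "child" and "young adult".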
27
frame
The frame is a complete or partial listing of the items that make up the population from which the sample will be selected.
28
nonprobability sample
In a nonprobability sample, you select the items or individuals without knowing their probabilities of selection.
29
probability sample
In a probability sample, you select items based on known probabilities. Whenever possible, you should use a probability sample, as such a sample will allow you to make inferences about the population being analyzed.
30
convenience sample
In a convenience sample, you select items that are easy, inexpensive, or convenient to sample. For example, in a warehouse of stacked items, selecting only the items located on the tops of each stack and within easy reach would create a convenience sample. So, too, would be the responses to surveys that the websites of many companies offer visitors. While such surveys can provide large amounts of data quickly and inexpensively, the convenience samples selected from these responses will consist of self-selected website visitors. (Read the Think About This essay on page 54 for a related story.)
31
judgment sample
In a judgment sample, you collect the opinions of preselected experts in the subject matter. Although the experts may be well informed, you cannot generalize their results to the population.
32
simple random sample
In a simple random sample, every item from a frame has the same chance of selection as every other item, and every sample of a fixed size has the same chance of selection as every other sample of that size. Simple random sampling is the most elementary random sampling technique. It forms the basis for the other random sampling techniques. However, simple random sampling has its disadvantages. Its results are often subject to more variation than other sampling methods. In addition, when the frame used is very large, carrying out a simple random sample may be time consuming and expensive.
33
Sampling with replacement
Sampling with replacement means that after you select an item, you return it to the frame, where it has the same probability of being selected again. Imagine that you have a fishbowl containing N business cards, one card for each person. On the first selection, you select the card for Grace Kim. You record pertinent information and replace the business card in the bowl. You then mix up the cards in the bowl and select a second card. On the second selection, Grace Kim has the same probability of being selected again, 1/N. You repeat this process until you have selected the desired sample size.
34
Sampling without replacement
Sampling without replacement means that once you select an item, you cannot select it again. The chance that you will select any particular item in the frame (for example, the business card for Grace Kim) on the first selection is 1/N. The chance that you will select any card not previously chosen on the second selection is now 1 out of N - 1.
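The two schemes can be contrasted with a short sketch using Python's standard `random` module. The frame of ten numbered cards, the sample size, and the seed are made-up values for illustration only.

```python
import random
from fractions import Fraction

rng = random.Random(42)  # fixed seed so the sketch is repeatable
frame = [f"card_{i}" for i in range(1, 11)]  # hypothetical frame, N = 10
N = len(frame)
n = 4  # desired sample size

# With replacement: each draw is made from the full frame,
# so the same card can be selected more than once.
with_replacement = [rng.choice(frame) for _ in range(n)]

# Without replacement: random.sample never returns the same item twice.
without_replacement = rng.sample(frame, n)

# Chance of picking one particular card on the first draw, and of picking
# one particular remaining card on the second draw without replacement.
p_first = Fraction(1, N)
p_second = Fraction(1, N - 1)
```

With replacement, the chance for any given card stays at 1/N on every draw; without replacement, the denominator shrinks by one after each draw.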
35
systematic sample
In a systematic sample, you partition the N items in the frame into n groups of k items, where k = N/n. You round k to the nearest integer. To select a systematic sample, you choose the first item to be selected at random from the first k items in the frame. Then, you select the remaining n - 1 items by taking every kth item thereafter from the entire frame.
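A minimal sketch of that procedure follows. The function name `systematic_sample` and the frame of 800 numbered items are assumptions for illustration. Note that Python's `round` rounds exact halves to the nearest even integer, and that when N is not a multiple of n the slice can return fewer than n items.

```python
import random


def systematic_sample(frame, n, rng):
    """Pick a random start among the first k items, then take every kth item."""
    N = len(frame)
    k = max(1, round(N / n))   # group size, rounded to the nearest integer
    start = rng.randrange(k)   # random position within the first k items
    return frame[start::k][:n]


rng = random.Random(7)           # fixed seed so the sketch is repeatable
frame = list(range(1, 801))      # hypothetical frame, N = 800
sample = systematic_sample(frame, n=40, rng=rng)  # k = 800 / 40 = 20
```

Only the first selection is random; every later item is determined by the starting position and k.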
36
stratified sample
In a stratified sample, you first subdivide the N items in the frame into separate subpopulations, or strata.
37
cluster sample
In a cluster sample, you divide the N items in the frame into clusters that contain several items.
38
Clusters
Clusters are often naturally occurring groups, such as counties, election districts, city blocks, households, or sales territories. You then take a random sample of one or more clusters and study all items in each selected cluster.
39
Coverage error
Coverage error occurs if certain groups of items are excluded from the frame so that they have no chance of being selected in the sample or if items are included from outside the frame. Coverage error results in a selection bias.
40
selection bias
Selection bias is the bias that results from coverage error. Coverage error occurs if certain groups of items are excluded from the frame so that they have no chance of being selected in the sample or if items are included from outside the frame.
41
Nonresponse error
Not everyone is willing to respond to a survey. Nonresponse error arises from failure to collect data on all items in the sample and results in a nonresponse bias.
42
nonresponse bias
Nonresponse bias is the bias that results from nonresponse error. Nonresponse error arises from failure to collect data on all items in the sample.
43
sampling error
Sampling error reflects the variation, or “chance differences,” from sample to sample, based on the probability of particular individuals or items being selected in the particular samples.
44
summary table
A summary table tallies the values as frequencies or percentages for each category. A summary table helps you see the differences among the categories by displaying the frequency, amount, or percentage of items in a set of categories in a separate column.
45
contingency table
A contingency table cross-tabulates, or tallies jointly, the values of two or more categorical variables, allowing you to study patterns that may exist between the variables. Tallies can be shown as a frequency, a percentage of the overall total, a percentage of the row total, or a percentage of the column total, depending on the type of contingency table you use. Each tally appears in its own cell, and there is a cell for each joint response.
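The joint tallying can be sketched with `collections.Counter`. The six (location, answer) pairs below are made-up responses, and the variable names are illustrative only.

```python
from collections import Counter

# Each response pairs two categorical variables: location and a yes/no answer.
responses = [
    ("City", "Yes"), ("City", "No"), ("Suburban", "Yes"),
    ("City", "Yes"), ("Suburban", "No"), ("Suburban", "No"),
]

cells = Counter(responses)                        # one tally per joint response
total = sum(cells.values())
row_totals = Counter(row for row, _ in responses)

# The same tallies shown as a percentage of the overall total
# and as a percentage of each row total.
pct_of_total = {cell: 100 * count / total for cell, count in cells.items()}
pct_of_row = {
    (row, col): 100 * count / row_totals[row]
    for (row, col), count in cells.items()
}
```

Percentages of the column total would follow the same pattern, using totals keyed on the second element of each pair.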
46
ordered array
An ordered array arranges the values of a numerical variable in rank order, from the smallest value to the largest value.
47
frequency distribution
A frequency distribution tallies the values of a numerical variable into a set of numerically ordered classes. Each class groups a mutually exclusive range of values, called a class interval.
48
class interval
Each class groups a mutually exclusive range of values, called a class interval.
49
class interval width
Interval width = (highest value - lowest value) / number of classes. For example, 55/10 = 5.5.
50
class boundaries
Because each value can appear in only one class, you must establish proper and clearly defined class boundaries for each class. For example, if you chose $10 as the class interval for the restaurant data, you would need to establish boundaries that would include all the values and simplify the reading and interpretation of the frequency distribution. Because the cost of a city restaurant meal varies from $25 to $80, establishing the first class interval as $20 to less than $30, the second as $30 to less than $40, and so on, until the last class interval is $80 to less than $90, would meet the requirements. Table 2.9 contains frequency distributions of the cost per meal for the 50 city restaurants and the 50 suburban restaurants using these class intervals.
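The width calculation and the "lower boundary to less than upper boundary" rule can be sketched as follows. The nine meal costs are made-up values, not the data from Table 2.9, and `frequency_distribution` is a hypothetical helper written for this example.

```python
def frequency_distribution(values, lower, width, num_classes):
    """Tally values into classes of the form [boundary, boundary + width)."""
    counts = [0] * num_classes
    for v in values:
        index = int((v - lower) // width)  # which class the value falls in
        if not 0 <= index < num_classes:
            raise ValueError(f"{v} falls outside the class boundaries")
        counts[index] += 1
    return [
        (lower + i * width, lower + (i + 1) * width, counts[i])
        for i in range(num_classes)
    ]


# Suggested width for values ranging from 25 to 80 split into 10 classes.
suggested_width = (80 - 25) / 10

# Rounding up to a width of 10 gives boundaries that are easier to read.
meal_costs = [25, 29, 30, 38, 41, 55, 62, 79, 80]  # made-up values
table = frequency_distribution(meal_costs, lower=20, width=10, num_classes=7)
```

A value of exactly 30 is counted in the class "30 to less than 40", never in "20 to less than 30", which is what keeps the classes mutually exclusive.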
51
Class Midpoint
For some charts discussed later in this chapter, class intervals are identified by their class midpoints, the values that are halfway between the lower and upper boundaries of each class. For the class intervals described above ($20 to less than $30 through $80 to less than $90), the class midpoints are $25, $35, $45, $55, $65, $75, and $85. Note that well-chosen class intervals lead to class midpoints that are simple to read and interpret, as in this example.
52
relative frequency distribution
A relative frequency distribution presents the relative frequency, or proportion, of the total for each group that each class represents.
53
percentage distribution
A percentage distribution presents the percentage of the total for each group that each class represents. When you compare two or more groups, knowing the proportion (or percentage) of the total for each group is more useful than knowing the frequency for each group, as Table 2.11 demonstrates. Compare this table to Table 2.9 on page 71, which displays frequencies. Table 2.11 organizes the meal cost data in a manner that facilitates comparisons.
54
proportion, or relative frequency
The proportion, or relative frequency, in each group is equal to the number of values in each class divided by the total number of values.
55
Computing the Proportion or Relative Frequency
Proportion = relative frequency = (number of values in each class) / (total number of values). If there are 80 values and the frequency in a certain class is 20, the proportion of values in that class is 20/80 = 0.25, and the percentage is 0.25 × 100% = 25%.
56
cumulative percentage distribution
The cumulative percentage distribution provides a way of presenting information about the percentage of values that are less than a specific amount. You use a percentage distribution as the basis to construct a cumulative percentage distribution.
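A small sketch ties the three distributions together. The three class frequencies are made up so that the first class reproduces the 20-out-of-80 case; here each cumulative figure is the percentage of values below the upper boundary of the corresponding class, which is one common way to tabulate it.

```python
from itertools import accumulate

frequencies = [20, 40, 20]  # made-up class frequencies, 80 values in total
total = sum(frequencies)

proportions = [f / total for f in frequencies]  # relative frequencies
percentages = [p * 100 for p in proportions]    # percentage distribution

# Running total of the percentages, class by class.
cumulative_percentages = list(accumulate(percentages))
```

The last cumulative percentage is always 100, since every value lies below the upper boundary of the final class.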
57
Stacked vs Unstacked
In an unstacked format, you create separate numerical variables for each group. For example, if you entered the meal cost data used in the examples in this section in unstacked format, you would create two numerical variables—city meal cost and suburban meal cost—enter the top data in Table 2.8A on page 70 as the city meal cost data, and enter the bottom data in Table 2.8A as the suburban meal cost data. In a stacked format, you pair a numerical variable that contains all of the values with a second, separate categorical variable that contains values that identify to which group each numerical value belongs. For example, if you entered the meal cost data used in the examples in this section in stacked format, you would create a meal cost numerical variable to hold the 100 meal cost values shown in Table 2.8A and create a second location (categorical) variable that would take the value “City” or “Suburban,” depending upon whether a particular value came from a city or suburban restaurant (the top half or bottom half of Table 2.8A).
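The difference between the two layouts can be sketched with plain Python containers. The five meal costs are made-up values, not the data from Table 2.8A.

```python
# Unstacked: one numerical variable per group.
unstacked = {
    "City": [33, 26, 43],
    "Suburban": [47, 32],
}

# Stacked: one numerical variable holding every value, paired with a
# categorical variable that records which group each value belongs to.
stacked = [
    (cost, location)
    for location, costs in unstacked.items()
    for cost in costs
]

# Going back from stacked to unstacked loses no information.
restored = {}
for cost, location in stacked:
    restored.setdefault(location, []).append(cost)
```

Which layout is more convenient usually depends on what the analysis software expects as input.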
58
Pareto chart
In a Pareto chart, the tallies for each category are plotted as vertical bars in descending order, according to their frequencies, and are combined with a cumulative percentage line on the same chart. Pareto charts get their name from the Pareto principle, the observation that in many data sets, a few categories of a categorical variable represent the majority of the data, while many other categories represent a relatively small, or trivial, amount of the data. Pareto charts help you to visually identify the “vital few” categories from the “trivial many” categories.
59
stem-and-leaf display
A stem-and-leaf display visualizes data by presenting the data as one or more row-wise stems that represent a range of values. In turn, each stem has one or more leaves that branch out to the right of their stem and represent the values found in that stem. For stems with more than one leaf, the leaves are arranged in ascending order.
60
Histogram
A histogram visualizes data as a vertical bar chart in which each bar represents a class interval from a frequency or percentage distribution. In a histogram, you display the numerical variable along the horizontal (X) axis and use the vertical (Y) axis to represent the frequency or percentage of values in each class interval.
61
percentage polygon
When using a categorical variable to divide the data of a numerical variable into two or more groups, you visualize data by constructing a percentage polygon. This chart uses the midpoints of each class interval to represent the data of each class and then plots the midpoints, at their respective class percentages, as points on a line along the X axis.
62
cumulative percentage polygon, or ogive
The cumulative percentage polygon, or ogive, uses the cumulative percentage distribution discussed in Section 2.2 to plot the cumulative percentages along the Y axis. Unlike the percentage polygon, the lower boundaries of the class intervals for the numerical variable are plotted, at their respective cumulative percentages, as points on a line along the X axis.
63
scatter plot
A scatter plot explores the possible relationship between two numerical variables by plotting the values of one numerical variable on the horizontal, or X, axis and the values of a second numerical variable on the vertical, or Y, axis.
64
time series plot
A time-series plot plots the values of a numerical variable on the Y axis and plots the time period associated with each numerical value on the X axis. A time-series plot can help you visualize trends in data that occur over time.
65
multidimensional contingency table
Both Excel and Minitab can organize many variables at the same time, but the two programs have different strengths. Using Excel, you can create a PivotTable.
66
Challenges in Organizing and Visualizing Variables
Obscuring data, creating false impressions, and chartjunk.
67
Central Tendency and Variation
Central tendency is the extent to which the values of a numerical variable group around a typical, or central, value. Variation measures the amount of dispersion, or scattering, away from a central value that the values of a numerical variable show. The shape of a variable is the pattern of the distribution of values from the lowest value to the highest value.
68
arithmetic mean
The arithmetic mean is the most common measure of central tendency. The mean can suggest a typical or central value and serves as a “balance point” in a set of data, similar to the fulcrum on a seesaw. The mean is the only common measure in which all the values play an equal role. You compute the mean by adding together all the values and then dividing that sum by the number of values in the data set.
69
Sample Mean
The sample mean is the sum of the values in a sample divided by the number of values in the sample: X̄ = (X1 + X2 + ... + Xn) / n, where n is the sample size.
70
median
The median is the middle value in an ordered array of data that has been ranked from smallest to largest. Half the values are smaller than or equal to the median, and half the values are larger than or equal to the median.
71
The Geometric Mean
When you want to measure the rate of change of a variable over time, you need to use the geometric mean instead of the arithmetic mean.
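A short sketch shows why the arithmetic mean can mislead for rates of change. It uses the standard geometric mean rate of return, [(1 + R1) × (1 + R2) × ... × (1 + Rn)]^(1/n) - 1, which is assumed here rather than quoted from the deck. The two yearly rates are made-up values: a gain of 100% followed by a loss of 50%, which leaves the original amount unchanged.

```python
import math


def geometric_mean_rate(rates):
    """Average rate of change per period, from per-period rates given as decimals."""
    growth = math.prod(1 + r for r in rates)  # overall growth factor
    return growth ** (1 / len(rates)) - 1


rates = [1.0, -0.5]  # +100% in year 1, then -50% in year 2 (made-up values)

arithmetic = sum(rates) / len(rates)   # suggests 25% growth per year
geometric = geometric_mean_rate(rates) # reflects that the value ended where it started
```

The arithmetic mean reports 25% per year even though nothing was gained over the two years, while the geometric mean reports a rate of zero.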
72
Variation
Variation measures the spread, or dispersion, of the values in a data set.
73
Range
The range is the difference between the largest and smallest values.
74
variance and the standard deviation
The variance and the standard deviation measure the “average” scatter around the mean: how larger values fluctuate above it and how smaller values fluctuate below it.
75
sum of squares (SS)
A simple measure of variation around the mean might take the difference between each value and the mean and then sum these differences. However, if you did that, you would find that these differences sum to zero because the mean is the balance point for every numerical variable. Squaring each difference before summing avoids this cancellation, and the result is the sum of squares (SS).
76
sample variance
The sample variance (S²) is the sum of squares divided by the sample size minus 1: S² = SS / (n - 1).
77
sample standard deviation
The sample standard deviation (S) is the square root of the sample variance.
78
coefficient of variation
The coefficient of variation is equal to the standard deviation divided by the mean, multiplied by 100%.
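The chain from deviations to the coefficient of variation can be traced step by step. The eight values are made up so that the mean is a whole number; the final assertions only cross-check the hand computation against Python's `statistics` module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
n = len(data)
mean = sum(data) / n

deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)   # cancels out around the mean

ss = sum(d ** 2 for d in deviations)  # sum of squares (SS)
s2 = ss / (n - 1)                     # sample variance
s = math.sqrt(s2)                     # sample standard deviation
cv = s / mean * 100                   # coefficient of variation, in percent

# Cross-check against the standard library's sample statistics.
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(s, statistics.stdev(data))
```

Dividing by n - 1 rather than n is what makes these the sample versions of the statistics.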
79
Z Score
The Z score of a value is the difference between that value and the mean, divided by the standard deviation. A Z score of 0 indicates that the value is the same as the mean. If a Z score is a positive or negative number, it indicates whether the value is above or below the mean and by how many standard deviations.
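As a brief sketch, Z scores can be computed from the sample mean and sample standard deviation. The helper name `z_scores` and the eight data values are made up for illustration.

```python
import statistics


def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mean) / s for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample with a mean of 5
scores = z_scores(data)
```

Values equal to the mean get a score of zero, values below the mean get negative scores, and values above it get positive scores.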
80
Kurtosis
Kurtosis measures the extent to which values that are very different from the mean affect the shape of the distribution of a set of data.
81
leptokurtic, platykurtic
A distribution that has a sharper-rising center peak than the peak of a normal distribution has positive kurtosis, a kurtosis value that is greater than zero, and is called leptokurtic. A distribution that has a slower-rising (flatter) center peak than the peak of a normal distribution has negative kurtosis, a kurtosis value that is less than zero, and is called platykurtic.
82
Quartiles
Quartiles split the values into four equal parts. The first quartile (Q1) divides the smallest 25.0% of the values from the other 75.0% that are larger. The second quartile (Q2) is the median; 50.0% of the values are smaller than or equal to the median, and 50.0% are larger than or equal to the median. The third quartile (Q3) divides the smallest 75.0% of the values from the largest 25.0%. Equations (3.10) and (3.11) define the first and third quartiles.
83
Interquartile Range
The interquartile range (also called the midspread) measures the difference in the center of a distribution between the third and first quartiles: interquartile range = Q3 - Q1.
84
Five number summary
The five-number summary for a variable consists of the smallest value (Xsmallest), the first quartile, the median, the third quartile, and the largest value (Xlargest).
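The quartiles, interquartile range, and five-number summary can be sketched with `statistics.quantiles`. Its "exclusive" method places quartiles at the (n + 1)/4, 2(n + 1)/4, and 3(n + 1)/4 ranked positions and interpolates between neighbors; the rounding rules that go with Equations (3.10) and (3.11) are not shown in this deck and may give slightly different answers for other sample sizes. The eleven values are made up so that each quartile falls exactly on a ranked value.

```python
import statistics

data = [7, 1, 9, 3, 11, 5, 2, 10, 4, 8, 6]  # made-up sample, n = 11
ordered = sorted(data)                       # ordered array

# Three cut points that split the ordered values into four parts.
q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")

iqr = q3 - q1  # interquartile range (midspread)
five_number_summary = (ordered[0], q1, q2, q3, ordered[-1])
```

With n = 11 the quartiles sit at the 3rd, 6th, and 9th ranked values, so no interpolation is needed in this example.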
85
Boxplot
The boxplot uses a five-number summary to visualize the shape of the distribution for a variable.
86
Descriptive vs. Inferential
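Descriptive statistics summarize and present the data that were collected. Inferential statistics use data from a sample to reach conclusions about a larger population, for example by testing a hypothesis.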