Market Research Flashcards
What are the 3 ways in which data is prepared? Define each.
- Data Entry: Convert data to electronic form
- Data coding: Group and assign numeric codes to responses
- Data Cleaning: Check for errors & inconsistencies
What is an example of data coding?
E.g., female = 1, male = 2
What are the types of errors & inconsistencies that can be found during data cleaning?
Skip patterns: the respondent answers a question they should have skipped, or skips one they should have answered
Incompleteness
Impossible values: e.g., age = 999
“Straight lining”: occurs when survey respondents give identical (or nearly identical) answers to items in a battery of questions using the same response scale
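A minimal Python sketch of two of these checks (impossible values and straight lining). The respondent rows, the field names (age, q1 to q4) and the 0 to 120 age range are illustrative assumptions, not course material:

```python
# Hypothetical respondents; field names and valid ranges are assumptions.
respondents = [
    {"id": 1, "age": 34, "q1": 4, "q2": 2, "q3": 5, "q4": 3},
    {"id": 2, "age": 999, "q1": 3, "q2": 4, "q3": 2, "q4": 5},  # impossible age
    {"id": 3, "age": 27, "q1": 5, "q2": 5, "q3": 5, "q4": 5},   # straight-liner
]

BATTERY = ["q1", "q2", "q3", "q4"]  # items that share one response scale


def has_impossible_age(row, low=0, high=120):
    """Flag ages outside an assumed plausible range."""
    return not (low <= row["age"] <= high)


def is_straight_lined(row, items=BATTERY):
    """Flag respondents who gave the identical answer to every item."""
    return len({row[item] for item in items}) == 1


impossible_ids = [r["id"] for r in respondents if has_impossible_age(r)]
straight_line_ids = [r["id"] for r in respondents if is_straight_lined(r)]
print(impossible_ids, straight_line_ids)
```

This strict version only catches identical answers; "nearly identical" answers would need a looser rule, such as a very small spread across the battery.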
What is descriptive statistics?
Summarizes data
Measures of central tendency: mean, median, mode
Measures of dispersion: standard deviation, variance, range
What type of data can be measured using mean, median and mode?
Mean: Interval/Ratio
Median: all except nominal
Mode: any
How are the mean & SD calculated?
Mean: sum of X(i) / n
SD: sqrt[ sum((X(i) - mean)^2) / n ]
How is variance calculated?
Var = sum((X(i) - mean)^2) / n
How do the sample SD and the population SD differ?
The sample SD divides by n-1, whereas the population SD divides by n
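A short Python sketch of these formulas on made-up numbers, showing the n versus n-1 difference. The data values are illustrative only:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

n = len(data)
mean = sum(data) / n
squared_devs = sum((x - mean) ** 2 for x in data)

pop_var = squared_devs / n            # population variance: divide by n
sample_var = squared_devs / (n - 1)   # sample variance: divide by n - 1
pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

# The standard library gives the same results as the hand calculation.
print(mean, pop_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```

The sample SD is always a little larger than the population SD computed on the same numbers, because it divides by the smaller n-1.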
What is a one way frequency table?
Table that shows the number of respondents choosing each answer to a survey question
Application Rule for Frequency Tables
Always applicable, but not always effective if the variable contains too many values
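A one-way frequency table can be sketched in Python with collections.Counter. The answers below are made up for illustration:

```python
from collections import Counter

answers = ["Yes", "No", "Yes", "Maybe", "Yes", "No"]  # made-up responses

counts = Counter(answers)
total = len(answers)

# Map each answer to (count, percent of respondents), most common first.
freq_table = {
    answer: (count, round(100 * count / total, 1))
    for answer, count in counts.most_common()
}

for answer, (count, pct) in freq_table.items():
    print(f"{answer:<6} {count:>3} {pct:>5.1f}%")
```

With a variable such as exact income, nearly every value would get its own row, which is why the table stops being effective when there are too many values.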
Which of mean, median and mode can be used with each measurement scale?
Nominal: Mode
Ordinal: Mode & Median
Interval/Ratio: Mode, Median, Mean
Covariance
How much two random variables change together
Pearson Correlation
A scaled version of covariance:
when p=0 no relationship,
|p|<0.3 weak
|p|> 0.49 strong
2 caveats on correlation
- When p=0 there is no linear correlation, but there may still be a non-linear relationship
- It measures how closely the data are scattered around a straight line and has nothing to do with the slope of that line
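Both caveats can be checked with a small Python sketch on made-up data. The helper names covariance and pearson_r are chosen here for illustration:

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Covariance scaled by both standard deviations, so -1 <= r <= 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [-2, -1, 0, 1, 2]
gentle = [0.1 * x for x in xs]   # straight line with slope 0.1
steep = [10 * x for x in xs]     # straight line with slope 10
curved = [x ** 2 for x in xs]    # perfect but non-linear relationship

print(pearson_r(xs, gentle), pearson_r(xs, steep), pearson_r(xs, curved))
```

The gentle and steep lines both give a correlation of essentially 1 despite very different slopes, while the curved relationship gives 0 even though Y is fully determined by X.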
How do you interpret the crosstabulation?
Lecture 12 Slide 2 - photo
What questions does the chi-square analysis answer?
Are the percentage differences found in a crosstab table real (a pattern in the overall population), or did they happen by chance?
What is a hypothesis?
An assumption that a researcher makes about some characteristic of the population
Null hypothesis?
The status quo, no effect, no relationship, no difference.
Alternative hypothesis
There is an effect, there is a difference, relationship
What is the hypothesis testing framework?
- Hypothesize something about the population (H0)
- Measure the chance of observing the sample if H0 is true
- If the chance is high, fail to reject H0; if it’s low, reject H0 and conclude H1
Test Statistic
A standardized value calculated from the sample data during a hypothesis test, under the assumption that the null hypothesis is true.
E.g., z-score, t-statistic, F-statistic, x^2
P-Value
It measures how likely we are to observe sample data at least as extreme as ours if the null hypothesis is true.
If p is small the null must go
Significance Level
Compare the p value to our significance level. Usually 0.05
We reject the null when ____ is less than the ______
P- value
Sig level
5 steps in hypothesis testing
- State the hypothesis
- Choose the appropriate test based on the problem
- Develop a decision rule
- Calculate the value of the test statistic/p-value
- State the conclusion
Decision Rule
A standard for rejecting or failing to reject the null hypothesis.
Compare the p-value with the significance level
When you look at SPSS where is the test statistic? Where is the P-value?
Pearson Chi-Square & Value = Test statistic
Pearson Chi-Square & Asymptotic Significance = P-Value
How do you state the conclusion?
With 95% confidence we can/cannot reject the null hypothesis that there is no relationship between X and Y
Explain the types of errors?
Type 1: False Positive
Type 2: False Negative
Type I or II?
The person is innocent but you conclude that the person is guilty
The person is guilty but you conclude that the person is innocent
Type 1: False Positive
Type 2: False Negative
When do you use the Chi-Square test?
• Examine the relationship between two nominal/ordinal variables
• Compare the proportions (nominal/ordinal) of different groups
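A sketch of the Pearson chi-square statistic computed by hand in Python. The 2x2 counts are made up; 3.84 is the standard chi-square table value for 1 degree of freedom at the 0.05 level:

```python
# Hypothetical crosstab: rows = saw the ad (yes/no), columns = bought (yes/no).
observed = [
    [30, 20],
    [15, 35],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if the two variables were unrelated.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_05_DF1 = 3.84  # table value for df = 1 at the 0.05 level
reject_null = chi_square > CRITICAL_05_DF1
print(round(chi_square, 2), df, reject_null)
```

A statistical package would report the same statistic together with its p-value, so the table lookup is only needed when working by hand.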
What is this problem type?
Have people’s perceptions of PCs changed after seeing the ads?
Problem Type: Compare the mean of an (interval/ratio) variable to a number
One-Sample T-Test
Do purchase intents of a PC vary between people who have and have not seen the ads?
Problem Type: Compare the mean of an (interval/ratio) variable of different groups (2 groups)
Independent Sample T-Test
Do purchase intents of a PC vary among people who have PC only, who have Mac only, and who have both?
Problem Type: Compare the mean of an (interval/ratio) variable of different groups (more than 2 groups)
One-Way Anova
Do people rate the importance of quality and that of reliability differently?
Problem Type: Compare the means of two (interval/ratio) variables
Paired Samples T-Test
Explain the 4 types of Mean Comparisons tests
Compare the mean of an (interval/ratio) variable to a number
-One-Sample T-Test
Compare the mean of an (interval/ratio) variable of different groups (2 groups)
-Independent Samples T-Test
Compare the mean of an (interval/ratio) variable of different groups (more than 2 groups)
-One-Way ANOVA
Compare the means of two (interval/ratio) variables
-Paired Samples T-Test
Hypothesis of a One-Way ANOVA
Null: The mean of the variable is the same across all groups of a categorical variable with 3+ categories, i.e., mean1 = mean2 = mean3
Alternative Hypothesis: at least one group mean differs from the others
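A sketch of the F statistic behind a one-way ANOVA, computed by hand in Python. The three groups and their scores are made up for illustration:

```python
groups = {
    "PC only": [3, 4, 5],
    "Mac only": [5, 6, 7],
    "Both": [7, 8, 9],
}  # made-up purchase-intent scores

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Variation of the group means around the grand mean.
ss_between = sum(
    len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in groups.values()
)
# Variation of the scores around their own group mean.
ss_within = sum(
    (x - sum(s) / len(s)) ** 2 for s in groups.values() for x in s
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(ss_between, ss_within, f_stat)
```

A large F means the group means differ by more than the spread within groups would suggest; the decision still comes from the p-value or an F table.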
How do we run a hypothesis test on the following?
Do purchase intents of a PC vary between people who have and have not seen the ads?
1 State Hypothesis: Null: There is no difference between the two means
2 Test: Independent Samples T-Test
3 Decision Rule: Reject the null if the p-value is less than 0.05
4 P-Value: 0.199 -> First check whether the variances of the two groups differ; based on that, choose which p-value to look at when comparing the means
5 With 95% confidence we fail to reject the null hypothesis XYZ
When do we use Z tests?
When the random sample is large (n >= 30); used for a proportion or for one or two means of a metric variable
____________________ are those in which measurement of the variable of interest in one sample has no effect on measurement of the variable in the other sample
Independent Samples
The number of ________________ is the number of observations in a statistical problem that are not restricted or are free to vary
Degrees of Freedom
The _________________________ enables the research analyst to determine whether an observed pattern of frequencies corresponds to, or fits, an “expected” pattern
Chi-Square
_______________ are those in which measurement of the variable of interest in one sample may influence measurement of the variable in another sample
Related Sample
“Because the calculated χ2 value (7.6) is higher than the table value (5.99), we ___ the null hypothesis”
Reject
For hypotheses about one mean, with small samples (n<30), the __________ with n − 1 degrees of freedom is the appropriate test for making statistical inferences
T-Test
How do you calculate the t-statistic?
t = (sample mean - population mean under H0) / estimated standard error
How do you calculate the estimated standard error?
SD: sqrt[ sum((X(i) - Xbar)^2) / (n-1) ], i.e., the square root of the sample variance
SE: SD / sqrt(n)
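These formulas can be traced in a few lines of Python. The ratings and the hypothesized mean of 5 are made up for illustration:

```python
import math

sample = [5, 7, 6, 8, 9]   # made-up ratings
mu_0 = 5                   # hypothesized population mean under H0

n = len(sample)
x_bar = sum(sample) / n
sd = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))  # sample SD
se = sd / math.sqrt(n)                                            # standard error
t_stat = (x_bar - mu_0) / se
df = n - 1                                                        # degrees of freedom
print(x_bar, round(sd, 4), round(se, 4), round(t_stat, 4), df)
```

The resulting t value is then compared with the t distribution with n-1 degrees of freedom, or a package reports the p-value directly.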
Marketing researchers often need to determine whether there is any association between two or more variables in a sample. The _______________ test for two independent samples is the appropriate test in such situations
Chi-Square
Although the _____________________ is generally used for large samples, nearly all statistical packages use the t test for all sample sizes
Z-test
In many situations, researchers are concerned with phenomena that are expressed in terms of percentages - also known as ____________________
Proportions
A hypothesis test of proportions is a test to determine whether the difference between proportions is greater than would be expected because of _________________________
Sampling Error
When the goal is to test the differences among the means of two or more independent samples, analysis of variance __________________________ is an appropriate statistical tool
ANOVA
What does ANOVA mean?
Analysis of Variance
- Mathematical differences
- Statistically significant difference
- Managerially important differences
a. if a difference is large enough to be unlikely to have occurred because of chance or sampling error
b. statistically significant difference large enough to be important to management
c. if numbers are not exactly the same
- C.
- A
- B
What is bivariate analysis?
The degree of association between two variables
Criterion and predictor variables
Criterion: Dependent Variable (Y) - explained by the X variable
Predictor: Independent Variable (X) - affects the value of the Y variable
______ AKA _______ is used to analyze the relationship between two variables when one is considered the dependent variable and the other the independent variable
Bivariate regression & Simple regression
How do you determine if using a linear regression model is appropriate?
Scatterplot
What is the Least-Squares Estimation Procedure?
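Fits the line Y = a + bX + e (also written Y = B0 + B1X + e) by choosing the intercept and slope that minimize the sum of squared errors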
What is the R^2? What does it measure? What is its range?
A measure of the strength of the linear relationship between X & Y
It measures the percentage of the total variation in Y explained by the variation in X
0-1 where 1 is the strongest
What is the formula for R2?
(Total variation - Unexplained variation) / Total variation
SST & SSR?
Total Sum of Squares: Total variation
Sum of Squares due to Regression: Explained Variation
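A Python sketch that fits a least-squares line to made-up data and then splits the variation into these pieces. The variable names sst, ssr and sse are chosen here for illustration:

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]  # made-up data

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar
predicted = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
ssr = sum((p - y_bar) ** 2 for p in predicted)          # explained variation
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # unexplained variation

r_squared = (sst - sse) / sst
print(a, b, sst, ssr, sse, r_squared)
```

Because explained plus unexplained variation equals total variation for a least-squares line, SSR / SST gives the same R^2.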
What is Beta? What is the hypothesis?
The Regression Coefficient
H0: B = 0; Ha: B != 0
What is the range of the Pearson Correlation? What is weak, moderate and strong
-1 <= p(X,Y) <= 1
Weak: |p| less than 0.3
Moderate: |p| greater than or equal to 0.3 and less than or equal to 0.49
Strong: |p| greater than 0.49
What common issues arise in correlational interpretations?
Outliers
Effect size may be too small to be a useful r
Non-linear relationships
High correlations are often tautological
What does it mean when the Pearson correlation is 0.762 & P-Value is less than 0.05?
Positive strong linear relationship between the way X & Y move
P-Value: The correlation is different from zero
- Pearson r = .93 indicates a larger/steeper slope than a Pearson r = .4.
- If a Pearson r is statistically significant, it means that a linear approach is the best one.
- A Pearson r with a p-value of < .001 indicates a weak correlation.
False
What is the simple linear regression model?
y= a + Bx + e
a= intercept
B=slope
e= Random error
Draw the OLS model
Ordinary Least Squares Regression
a hat: the intercept, the value of Y when X is zero
b hat: the slope, the estimated change in the average value of Y as a result of a one-unit change in X
e: the error, the difference between each observed point and the regression line
Explain the goodness of fit in terms of regression?
R-squared indicates how well the data points fit the regression line; the closer the points lie to the line, the better the fit
What is the caveat with regression?
We lose confidence in the predictions when X falls outside the range of X in the current data
The three classifications of the data problems are:
Type A: Long term data
Maximize profit for an existing product
Type B: Short term data
Increase visibility of just launched product
Type C: No data
Predict how a new product will perform
Which test to choose?
Within Group:
- Does mean differ from benchmark? One Sample T-Test
- Does mean of x and mean of y differ? Paired Sample T-Test
Between Groups:
- Do frequencies differ between groups? Chi-Square Test
- Does mean of X differ between 2 groups? Independent Sample T-Test
- Does mean of X differ between 3+ groups? ANOVA
What is multicollinearity? How do you discover it? Why is it bad?
When your independent variables are highly correlated with each other
- Look at the correlation matrix of the independent variables
Bad b/c we cannot distinguish between the individual effects of each independent variable on the dependent variable
How do you solve multicollinearity?
Get more data
Don’t include all of the independent variables
Drop the correlated variables
Or combine them to create a new variable, Factor Analysis
Dummy Variables
0 or 1 to let us know if there is or isn’t the presence of a categorical value
When should we use dummy variables?
A categorical variable should be recoded into a dummy variable in regression analysis
How many dummy variables should we include?
K-1 for k categories
The value of the categorical variable that is not represented explicitly by a dummy variable is called the ________.
An example of this in terms of gender would be ?
Reference group
Gender: X1 = 1 if female, 0 otherwise & X2 = 1 if male, 0 otherwise. The reference group would be non-binary
D1 = 1 if female, 0 if not
D2 = 1 if male, 0 if not
y = annual spending on clothes ($)
x = age
y = 200 + 20x - 50D2
Interpret the values
The average annual spending on clothes for women at the age of 0 is $200
If the age increases by one year, average spending on clothes increased by $20
Men on average spend $50 less per year than women of the same age
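The interpretation can be checked by plugging values into the model y = 200 + 20x - 50D2. The function name and the age of 30 are chosen here for illustration:

```python
def predicted_spending(age, is_male):
    """Predicted annual clothing spend ($) from y = 200 + 20x - 50*D2."""
    d2 = 1 if is_male else 0   # women are the reference group (D2 = 0)
    return 200 + 20 * age - 50 * d2


woman_30 = predicted_spending(30, is_male=False)
man_30 = predicted_spending(30, is_male=True)
one_year_older = predicted_spending(31, is_male=False) - woman_30
print(woman_30, man_30, man_30 - woman_30, one_year_older)
```

The gap between a man and a woman stays at 50 dollars at every age because the dummy only shifts the intercept; the slope on age is shared.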
How does the reference group relate to the regression model?
Slope is the same interpretation
alpha: When [the reference group] is activated then [alpha value] is [Y variable]
Beta coefficient: Compared to the [Reference group] the average [Dummy variable] in/decreases by [Beta coefficient value]
What does the ADJ R2 value do?
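It adjusts R2 for the number of independent variables, so adding a variable only raises it if the variable actually improves the model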
What is the rule of thumb for the Beta coefficient?
When the coefficient is more than 2 times its standard error (in absolute value), it is statistically significant at roughly the 0.05 level
What is VIF?
Variance Inflation Factor, gives a measure of multicollinearity.
Keep it below 10; a high VIF arises when independent variables overlap heavily, often from including too many of them
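A sketch of VIF for the simplest case of two independent variables, where VIF = 1 / (1 - R^2) and R^2 is just the squared correlation between the two. The data and variable names are made up for illustration:

```python
import math


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 is r squared."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


price = [1, 2, 3, 4, 5]
list_price = [1.1, 2.0, 3.2, 3.9, 5.1]   # nearly a copy of price
ad_spend = [3, 1, 4, 1, 5]                # only loosely related to price

vif_high = vif_two_predictors(price, list_price)
vif_low = vif_two_predictors(price, ad_spend)
print(round(vif_high, 1), round(vif_low, 2))
```

With more than two independent variables, R^2 comes from regressing each independent variable on all of the others, which is what statistical packages do.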
MC = 1.23/bottle
10% markup for retail price
Wholesale price points 1.80, 2.00 or 2.20
Regression: sales = 789.150 - 250.813 * RetailPriceOfBrand
How to calculate?
- Calculate Retail Price= Wholesale * (1 + 10%)
- Put price into regression model
- Calculate profit: Profit = (Retail price-MC) * Sales
Highest profit is your choice
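The three steps can be run in Python for each wholesale price point, using the numbers and the profit formula exactly as given on this card. The function and variable names are chosen here for illustration:

```python
MC = 1.23                                # marginal cost per bottle
MARKUP = 0.10                            # retail markup over wholesale
WHOLESALE_OPTIONS = [1.80, 2.00, 2.20]


def predicted_sales(retail_price):
    """Regression from the card: sales = 789.150 - 250.813 * retail price."""
    return 789.150 - 250.813 * retail_price


def profit(wholesale):
    retail = wholesale * (1 + MARKUP)
    # Profit formula as stated on the card: (retail price - MC) * sales.
    return (retail - MC) * predicted_sales(retail)


profits = {w: profit(w) for w in WHOLESALE_OPTIONS}
best_wholesale = max(profits, key=profits.get)
for w, p in profits.items():
    print(f"wholesale {w:.2f}: profit {p:.2f}")
print("best:", best_wholesale)
```

Raising the price increases the margin per bottle but lowers predicted sales, so the middle price point can come out ahead of both extremes.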
When can you not use linear regression?
When the dependent variable is binary (0 or 1), because:
Predictions would not be exactly 0 or 1 but some continuous number
Predictions could fall outside the range [0,1]
In binary regression what are the outcomes of the dependent and independent variables
Dependent: Outcome is binary
Independent: What do you think can predict the outcome
What is the logistic regression model? What are the constraints?
ln(p/(1-p)) = a + B1X1 + … + BkXk
p = exp(a + B1X1 + … + BkXk) / (1 + exp(a + B1X1 + … + BkXk))
0
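A Python sketch of the two logistic regression formulas, showing that they are inverses of each other. The coefficients and X values are hypothetical and chosen only for illustration:

```python
import math


def logistic_probability(intercept, coefs, xs):
    """p = exp(z) / (1 + exp(z)) with z = a + B1*X1 + ... + Bk*Xk."""
    z = intercept + sum(b * x for b, x in zip(coefs, xs))
    return math.exp(z) / (1 + math.exp(z))


def log_odds(p):
    """ln(p / (1 - p)), the left-hand side of the model."""
    return math.log(p / (1 - p))


# Hypothetical coefficients and inputs, for illustration only.
p = logistic_probability(-1.0, [0.5, 2.0], [2.0, 0.25])
print(round(p, 4), round(log_odds(p), 4))
```

Here the linear part works out to 0.5, and taking the log odds of the resulting probability returns that same 0.5, while the probability itself stays between 0 and 1.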