C14 understanding quant data Flashcards

1
Q

planning analysis - what is quant analysis? how does it relate to design?

A

You are likely to go into the analysis stage with fairly solid ideas about what you are looking for. The nature of quant research is that you have a clear idea about the concepts you want to measure, the questions you want to address and the hypotheses you want to test. You will have thought about this when deciding the sampling frame, who and how many to ask, and how to ask them. The quality of the analysis rests on the quality of the earlier stages, in particular problem definition and research design. The outcome of the analysis will be of much better quality if the problem was clearly defined and the research was designed to deliver evidence that helps address the client's business problem.

2
Q

planning analysis - reviewing materials

A


It may be useful to reacquaint yourself with the brief, sampling plan and questionnaire. The brief sets out the business problem and the information needed to address it, which you must not lose sight of while analysing. When reviewing the brief you should ask:

Why is the research needed?
How are the findings to be used?
What are the research objectives?
What is the aim of the research - to explore, describe, explain or evaluate?
What are the working hypotheses or ideas?

In tackling the analysis you are looking for information in the data - meaningful insights - that will allow the client to make informed decisions.

The sampling plan will tell you who you need to look at - which groups or types of people. The questionnaire is in effect a map or index of the data you have with which to address the research objectives. Use both in conjunction with the brief to work out what data to look at, by which groups, what comparisons are to be made and to what end.

3
Q

Planning analysis - benefits of it, secondary data

A

An analysis strategy will help you stretch resources; it will take you through the mass of data in a systematic and rigorous way, one that meets the requirements set out in the brief and makes the task more efficient. Strategies should not be set in stone: the data may throw up interesting or unexpected findings, and it is acceptable to explore these in relation to the research objectives.

It may be useful to revisit secondary data sources - the initial background or secondary research for the particular study, or the body of existing knowledge and literature; this can give you ideas and help develop your thinking and analysis. It may also be useful to look at well-developed models and theories, which can be a source of inspiration, but they should be used critically.

Once an analysis plan is in place, you should get to know the data and start working through and reorganising them to suit your purposes.

4
Q

Understanding data - concepts, questions and variables

A

Measuring in a research context means gathering data on the things of relevance. This may be straightforward (e.g. age) or conceptual (e.g. sexism). A valid and reliable measure of a concept starts with an examination of it: agreeing a definition, deciding which dimensions of it are relevant to the research objectives, and establishing which indicators will be used to measure it. Finally, the question is designed - this whole process is called operationalising the concept. A response format also has to be decided, e.g. age grouped into four bands on which statistical tests can be run. Back at the questionnaire design stage you would have been thinking ahead to the analysis when making these decisions.

At the analysis stage, conventional practice is to refer to the questions as variables and to the responses as the values of those variables. The important things to note at this point are the connection between questions and variables, the link back to the concept you are trying to measure, and the link between the choice of question and response format and its impact on what you can do in the analysis.

5
Q

Case, variables and values

A

A complete individual unit of analysis is called a case - typically one questionnaire, the record of an interview with one respondent, is one case, so 300 completed questionnaires are 300 cases. To identify each case a unique number, the serial number, is assigned. For each case the individual bits of information collected are the variables, and the answers respondents give to those questions are the values of the variables. The process of assigning numbers to responses is called coding: an answer to a question is converted into a number value that the analysis program can read.
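A minimal sketch of what coded cases might look like in Python; the variable names, serial numbers and coding frame below are illustrative, not from the source.

```python
import pandas as pd

# Each row is a case (one completed questionnaire), identified by a serial number.
# Each column is a variable; the cell values are the numeric codes assigned to responses.
# Illustrative coding frame: gender 1=male 2=female; q1_satisfaction 1=very dissatisfied ... 5=very satisfied.
cases = pd.DataFrame(
    {
        "serial": [1001, 1002, 1003],
        "gender": [1, 2, 2],
        "age_band": [2, 4, 3],          # e.g. 1=16-24, 2=25-34, 3=35-44, 4=45+
        "q1_satisfaction": [4, 5, 2],
    }
).set_index("serial")

print(cases)
```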

6
Q

Data entry

A

When the questionnaire is administered or completed via computer (CAPI, CATI, or online by respondents), the process of data entry - moving responses from questionnaire to data file - is done automatically. If you are using a paper questionnaire this must be done manually. For an analysis program to read the data they must be in a regular, predictable format. For most datasets the data appear in a grid arrangement - the sort seen in a spreadsheet or in analysis packages such as SPSS. The grid is made up of rows of cases and columns of variables. The number codes are what you or the data entry program transfer from the questionnaire into the analysis program, in a process known as data entry or data input. Packages also allow alphanumeric codes; these are called string variables.

Typically, frequency counts will be converted into percentages calculated on the most suitable base for a particular question - all those answering, or the total sample. You can ask in the DP spec, or when you write the table specifications, for both the percentage and the frequency count (raw number) to appear on tables.
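A hedged sketch of converting a frequency count into percentages on a chosen base, continuing the illustrative `cases` frame from the earlier sketch.

```python
# Frequency count for one variable, then percentages on the base of all who answered.
counts = cases["q1_satisfaction"].value_counts().sort_index()
base = counts.sum()                      # base: all answering this question
percentages = (counts / base * 100).round(1)

table = pd.DataFrame({"n": counts, "%": percentages})
print(f"Base (all answering): {base}")
print(table)
```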

7
Q

Levels of measurement - what are they?

A

Nominal scale numbers - used to classify or label things. Other symbols would be just as suitable, but numbers are used because they are familiar. The numbers have no arithmetic meaning or value.

Ordinal scale numbers - represent a category and indicate that there is a relationship between the numbered items; in other words there is a ranking, order or sequence to the numbers. House numbers are ordinal numbers; position in a race or birth order in a family (first, second, third) are ordinal rankings. Ordinal numbers do not represent a real amount, so arithmetic is not meaningful.

Interval scale numbers - represent measurements or values, so arithmetic is meaningful. Numbers on an interval scale are ordered and the intervals between them are of equal size, but there is no absolute zero - negative amounts mean something, e.g. -5 degrees. Temperature is based on an interval scale. Income is an example of an interval level variable.

Ratio scale numbers - have the same properties as interval scale numbers (rank order, equal intervals, arithmetic is meaningful), but on a ratio scale there is an absolute zero. Zero on a ratio scale means there is nothing there, whereas on an interval scale zero might just mean low or very low. Examples are time, weight, the number of times an item has been used, or the number of children in a household.

8
Q

Why do levels of measurement matter?

A

Interval and ratio level variables can be manipulated using a range of mathematical and statistical procedures, because they represent numeric amounts and arithmetic is meaningful with these types of numbers. Nominal and ordinal level variables are not suitable for this. To determine what kind of analysis is appropriate, and which statistical test to use when testing hypotheses, it is important to recognise what kind of number or variable you have: different tests are suitable for different levels of measurement.

9
Q

Editing and cleaning dataset - why?

A

Either as the data are being entered or afterwards, they are edited and cleaned to make sure they are free of errors and inconsistencies, e.g. missing values, out-of-range values, and errors due to misrouting.

10
Q

Missing values - why occur? how to avoid in design?

A

Blank responses. Can occur because:

Question may not apply to respondent
Respondent may not know answer
Respondent refuses to answer
Interviewer forgot to record response

Missing values must be dealt with to avoid contaminating the dataset or misleading the researcher or client. Adding a DK (don't know) or N/A response option at the questionnaire design stage, and covering it at interviewer briefing sessions, can prevent this; interviewers should be briefed on how to code such responses. It is also possible to catch missing answers by checking responses at the end of the interview or in quality control call-backs.

11
Q

Missing values - how to deal with them?

A

If missing values remain, a code can be added in the data entry program that allows a missing value to be recorded; typically a code is chosen with a value that is out of the range of possible values for that variable. Another, more extreme, option is casewise deletion, in which you remove cases that contain missing values. This results in a reduction of sample size and may lead to bias, as cases with missing values may differ from those with none. A less drastic approach is pairwise deletion, in which only cases without missing values are used in the table or calculation for the specific questions involved. You may also replace a missing value with a real one - there are two ways of approaching this (see the sketch after this list):

Calculate the mean of the variable and use that
Calculate an imputed value based on the pattern of responses of respondents with similar profiles to the respondent with the missing value
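A minimal sketch of two of these options - pairwise deletion and mean imputation - assuming an illustrative frame in which -99 has been used as the out-of-range missing-value code for 1-5 scale questions.

```python
import numpy as np
import pandas as pd

# Illustrative data: -99 is the missing-value code chosen because it is out of range for a 1-5 scale.
df = pd.DataFrame(
    {"serial": [1, 2, 3, 4], "q1": [4, -99, 5, 2], "q2": [3, 4, -99, 5]}
).set_index("serial")
df = df.replace(-99, np.nan)             # convert the missing-value code to NaN

# Pairwise deletion: each calculation simply ignores missing values for that variable.
print(df["q1"].mean())                   # uses only cases that answered q1

# Mean imputation: replace each missing value with that variable's mean.
imputed = df.fillna(df.mean())
print(imputed)
```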

12
Q

Inconsistencies, routing errors and out of range values

A

This involves resolving inconsistencies, cases where routing instructions were not followed correctly, extreme answers, and answers that are not valid or are outside the range of possible answers. Incorrect routing should not happen with CAPI, where routing is automatic and the program alerts the interviewer to inconsistent answers and refuses answer codes that are out of range. Further checks on the accuracy and consistency of the data can be made at the next stage of the process, when the data are available in the form of a frequency count or holecount.

Once the data have been entered, edited and verified, they are in a form in which they can be manipulated and analysed.
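A brief sketch of simple cleaning checks - flagging out-of-range values and a routing inconsistency - continuing the illustrative `cases` frame; the valid ranges and the routing rule are assumptions for the example only.

```python
# Flag values outside the coding frame for each variable (ranges are illustrative).
valid_range = {"q1_satisfaction": (1, 5), "gender": (1, 2)}

for var, (lo, hi) in valid_range.items():
    out_of_range = cases[(cases[var] < lo) | (cases[var] > hi)]
    if not out_of_range.empty:
        print(f"Out-of-range values in {var}:", out_of_range.index.tolist())

# Hypothetical routing rule: q2_reason should only hold an answer if q1_satisfaction <= 2.
# misrouted = cases[(cases["q1_satisfaction"] > 2) & cases["q2_reason"].notna()]
```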

13
Q

Manipulation of variables

A

Some variables may not be in a form that is useful for further analysis. It is possible to change variables and values by recoding them, or by manipulating them into new variables. If a variable is at the interval or ratio level of measurement you can use arithmetic functions to create new values based on the values of the original variable.
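A small sketch of both operations on the illustrative `cases` frame: recoding a scale into grouped categories, and (as a commented, hypothetical example) deriving a new variable arithmetically from two ratio-level variables.

```python
# Recoding: collapse the 1-5 satisfaction scale into three groups (cut points and labels are illustrative).
cases["q1_grouped"] = pd.cut(
    cases["q1_satisfaction"],
    bins=[0, 2, 3, 5],
    labels=["dissatisfied", "neutral", "satisfied"],
)

# Arithmetic on interval/ratio variables: derive spend per visit (hypothetical variable names).
# cases["spend_per_visit"] = cases["total_spend"] / cases["num_visits"]

print(cases[["q1_satisfaction", "q1_grouped"]])
```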

14
Q

Types of data analysis? what is data analysis?

A

The purpose of the project has been to answer the questions raised by the client in wanting to explore, describe, count, explain, understand or evaluate an issue relevant to their business problem. You are now at the point of being able to answer those questions (on the assumption that the research questions were relevant to the research problem and an appropriate research design was undertaken).

Four types of analysis

Univariate descriptive
Bivariate descriptive
Explanatory
Inferential

15
Q

What sample for inferential analysis?

A

You may in the course of the project use one or more types of analysis. Whether you can use inferential analysis depends on the type of sampling you used - that is, whether you used a probability or a non-probability sample. The reason for using a probability (random) sample is to generalise from the sample to the population - to estimate whether what you see in the sample exists in the population from which the sample was drawn. If you used this kind of sampling you can use inferential analysis to make such inferences.

16
Q

Univariate descriptive analysis?

A

Analysis that describes one variable - a basic but useful and informative type of analysis, the purpose of which is often to help you get to know the data. It involves summarising or describing responses using frequency counts and frequency distributions, and calculations known as descriptive statistics - measures of central tendency (averages) and measures of spread or variation.

17
Q

Frequency counts and skews

A

A count of the number of times a value occurs in the dataset - typically the number of respondents who gave a particular answer. It is useful to run frequency counts before the detailed analysis or table spec, as this gives an overview of each question, allowing you to see the size of sub-groups within the sample, which categories of responses might be grouped together, and what weighting may be required. You can then decide whether it is feasible to isolate certain groups to look at how their attitudes, behaviour or opinions differ from other groups.

Frequencies can also be displayed graphically on what is known as a frequency distribution chart: plot the range of values on the X axis and the frequency on the Y axis, allowing you to see quickly and easily the spread of values for a particular variable. This is a useful way of describing the shape of the distribution of continuous or metric variables. If the distribution is symmetrical (a bell curve, the normal distribution) there is no skewness in either direction and the mean, median and mode will be the same. When the distribution is skewed (asymmetrical) they will not be the same value: in a positively skewed distribution the long tail lies towards higher values and the mean is pulled above the median; a negatively skewed distribution is the opposite.
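A hedged sketch of a frequency distribution chart and a quick skewness check for a metric variable; the values are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative metric variable: respondents' ages.
ages = pd.Series([22, 25, 27, 28, 30, 31, 33, 35, 38, 41, 45, 52, 60, 67])

ages.plot(kind="hist", bins=6)           # values on the X axis, frequency on the Y axis
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

print(ages.mean(), ages.median())        # compare mean and median to judge skewness
print(ages.skew())                       # > 0 indicates positive skew (long tail of higher values)
```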

18
Q

Raw numbers, proportions, percentages and ratios

A

A frequency count is typically expressed in raw numbers, which does not tell you what proportion or percentage of the total sample the number represents. It is useful to reduce frequencies to proportions or percentages, as this allows us to compare data between groups. A proportion is the relative incidence of occurrence expressed out of 1.00 - the frequency of occurrence divided by the total number of cases; a percentage is the proportion multiplied by 100.

Ratios are a useful way of comparing the relative size of two groups - a good way of digesting data.
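A brief worked example with made-up numbers, showing the relationship between frequency, proportion, percentage and ratio.

```python
# Made-up numbers: 120 of 400 respondents are aware of the brand.
aware, total = 120, 400

proportion = aware / total               # 0.30
percentage = proportion * 100            # 30.0%
ratio = aware / (total - aware)          # aware : not aware, roughly 3:7

print(proportion, percentage, round(ratio, 2))
```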

19
Q

Graphical displays

A

Graphical displays of frequencies are charts - pie charts, bar charts, histograms and line graphs. In choosing one you should consider the data you have: for categorical data (nominal, ordinal) the most suitable formats are pie charts and bar charts; for interval/ratio data the most appropriate are histograms and line graphs.

20
Q

Pie charts

A

If you want to show how something divides into its parts, a pie chart is useful, e.g. the breakdown of votes by political party in an election. Each segment represents a proportion of the sample, and segments should be ordered logically in a clockwise direction. If you want to highlight one segment you can 'explode' it, removing it slightly from the rest. A pie chart is not a good choice if the variable has many categories, as it becomes hard to read.

21
Q

Bar charts

A

Bar charts should be used for nominal/ordinal data; histograms should be used if the data are interval/ratio. There are several ways of displaying bar charts: vertically as well as horizontally, or with two or more bars clustered together on one chart to show, for example, the sample's responses to different brands.

22
Q

Histograms

A

A histogram is a bar chart but without the spaces between the bars; this is because it shows continuous data at the interval/ratio level of measurement, e.g. age bands or income groups, not data grouped into discrete categories, e.g. male/female. The width of each bar represents the size of the interval covered, and so the area of each bar is proportional to the frequency of responses for that group.

23
Q

Measures of central tendency - what are they? what types?

A

Averages. A single figure is used to represent the average of a distribution or group of values; it anchors or locates the distribution on the scale of all its possible values. There are three versions: the mean, the mode and the median.

24
Q

Measures of central tendency - explain types?

A

Mean

The arithmetic mean is the average most often used - it can only be used on data of at least interval level of measurement. Add together all the values in the sample and divide by the total number of values.

Mode

The most frequent response. It requires no calculation - it simply shows which value occurs most often. It can be used on data of any level of measurement, and it is possible to have more than one mode.

Median

The middle value when all values are arranged in order. It can be used on all types of data except nominal level data. It has the same number of values or observations above it as below it; if there is no single middle value (an even number of values), take the mean of the two middle values.
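A minimal sketch of the three averages on made-up values, chosen to show how an outlier affects the mean but not the median or mode.

```python
import pandas as pd

# Illustrative values: number of texts sent per month by ten respondents (200 is an outlier).
texts = pd.Series([10, 12, 12, 15, 18, 20, 22, 25, 30, 200])

print(texts.mean())           # mean: pulled up by the outlier
print(texts.median())         # median: middle value, unaffected by the outlier
print(texts.mode().tolist())  # mode: most frequent value(s)
```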

25
Q

Measures of central tendency - properties and when to use

A

Each has particular properties. Extreme values can distort the mean in particular; the mode is unaffected, as it simply picks the most frequently occurring response.

Use mean when:

Need statistic that is widely understood
Want to take into account influence of all values, even outliers
Need statistic to use in further calculations
You do not need a ‘real’ value
Data are at interval / ratio level of measurement

E.g. average HH income, average spend, average age of users of X

Use median when:

Want average that is not affected by outliers
Do not need average to calculate further statistics
Middle value has some significance
Realistic representation of average
Data are interval / ratio level

E.g. average breakdown rates of dishwashers

Use mode when:

Do not need any further statistics based on average
Only interested in most frequent value
Data are numerical or non numerical

E.g. price willing to pay, favourite colour

26
Q

Measures of variation

A

Measures of central tendency tell us something about where the middle of a distribution is, but nothing about the range of values; this is where measures of variation come in. The range and the standard deviation are the most commonly used. Knowing the level of measurement of a variable is important in deciding which measure of variation to use.

27
Q

The range

A

The difference between the highest value in a distribution and the lowest value. It is suitable for metric data (ratio and interval level) and is helpful in determining the scope of a distribution - the range over which the values are spread. The bigger the range, the bigger the spread of values; the smaller the range, the more tightly clustered the values.
The range is a crude measure, as one outlier can have a huge impact. A more robust alternative is the interquartile range: divide the distribution into four quarters; the interquartile range is the difference between the third quartile and the first quartile.

28
Q

Variance and standard deviation

A

The standard deviation (SD) is a statistic that summarises the average distance of values from the mean. Like the range, the bigger the SD the greater the variation or spread of the data in the sample or distribution. It is a more robust calculation than the range because it uses all of the values in the distribution.

The SD is a useful statistic, particularly when used alongside the mean, but the distribution must be roughly normal for it to give a reasonable indication of spread.
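A short sketch of the three measures of variation on the same illustrative 'texts' values used earlier, showing how the outlier inflates the range far more than the interquartile range.

```python
import pandas as pd

texts = pd.Series([10, 12, 12, 15, 18, 20, 22, 25, 30, 200])   # made-up values, 200 is an outlier

value_range = texts.max() - texts.min()                  # range: crude, distorted by the outlier
iqr = texts.quantile(0.75) - texts.quantile(0.25)        # interquartile range: Q3 - Q1
sd = texts.std()                                         # standard deviation around the mean

print(value_range, iqr, round(sd, 1))
```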

29
Q

Story so far

A

You now have a reasonable armoury with which to explore the data: frequency counts and percentages will tell you how many people gave each answer, and measures of central tendency and variation give the average and the spread of the whole group of answers. At this level of analysis you are only looking at one variable at a time - hence univariate. Typically you will also need to compare the responses of different groups of people to see if there are patterns, and to determine whether relationships exist within the data.

30
Q

Bivariate descriptive analysis

A

Involves two variables, e.g. age and number of texts sent per month, and allows you to determine whether there are similarities or differences between the values of one variable in relation to the values of another. It allows you to describe and measure the strength of a relationship or association between two variables.

To get to grips with bivariate descriptive analysis there are a number of concepts and terms you need to master:

Ideas and hypotheses
Cross tabs and cross break / top breaks / banner headings
Dependent and independent variable
Bases and filtering 
Weighting
31
Q

bivariate - Ideas and hypotheses

A

Ideas and hunches about things of interest and relevance to the client's information needs and research objectives are called hypotheses. In planning the analysis - in reviewing the brief, sampling plan and questionnaire - further ideas or hypotheses may occur to you, and it is likely that as you work through the analysis further ones will emerge.

In inferential statistical tests you formulate a statistical hypothesis to find out whether a characteristic of interest, or a relationship that you can see in your sample data, can be expected to exist in the wider population. If you have data from a non-probability sample you have no use for statistical hypotheses, but you can still formulate ideas to examine in the data.

32
Q

bivariate - dependent and independent variables

A

In formulating ideas and hypotheses and talking about relationships between variables, we often designate one variable as the dependent variable and the other as the independent variable. The dependent variable is the one we predict will change as a result of the other; the independent (explanatory) variable is the one we think explains the change in the former.

Designating one variable as dependent and another as independent suggests that we know the direction of influence - which variable influences the other. Very often you will know this from knowledge of the subject area (literature reviews, previous research, results of exploratory research). Use this thinking about variables to design cross-tabs and to decide what should appear in the banner heading or cross-break, since it is traditional to look at responses to questions by the variables that help us look for and describe relationships and think further about influence; it helps if you can compare the responses of different groups or types of people side by side. Remember that in deciding that a variable is independent you are making assumptions, and if you suggest that it is the cause of a relationship you have gone too far.

33
Q

Cross tabs

A

The most common way of doing bivariate descriptive analysis is to cross-tabulate one variable (or set of variables or questions) against another; the result is called a cross-tab. The size of the cross-tab is determined by the number of categories each variable has. Each cell contains a percentage (and typically also a raw number or frequency count). Using the table we can compare side by side the responses of different groups.
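A minimal cross-tab sketch with made-up respondent-level data; the variable names and categories are illustrative.

```python
import pandas as pd

# Illustrative respondent-level data.
df = pd.DataFrame({
    "age_band": ["16-34", "16-34", "35-54", "35-54", "55+", "55+", "16-34", "55+"],
    "uses_product": ["yes", "no", "yes", "yes", "no", "no", "yes", "yes"],
})

# Cross-tab of product usage by age band, with column percentages (each column sums to 100%).
ct = pd.crosstab(df["uses_product"], df["age_band"], normalize="columns") * 100
print(ct.round(1))
```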

34
Q

How to read a cross-tab

A

Each column in the table is based on the total number of people in a particular group - the number of people who gave that answer to the question from which the column is derived. When respondents could give only one answer to the question, the column should add up to 100%; due to the rounding of proportions it may sum to slightly more or less. If respondents were able to give more than one answer, it is likely to add up to more than 100%.

35
Q

Including DKs in calculating %

A

Questions should offer DK (don't know) or N/A answer options where appropriate. When expressing percentages it is usual to specify whether these options have been included or excluded. How to handle such responses depends on the aims of the question - it may be important to report on those who answered DK, as it may be a genuine answer. Which way to report the data will depend on the context; in most cases it is useful to report both the percentage who said DK/NA and the split between the other responses excluding them. Also remember that people in some cultures are more likely than others to give a DK answer - if you are analysing and reporting multi-country data you must be aware of this and take it into account.

36
Q

Compiling set of cross-tabs

A

Cross-tabs often include an array of variables in the top break. The choice of variables to include in the top break (those that define the columns) should be made with the research objectives in mind - for example, the objectives may involve determining the profile of users of a product or service, or finding out whether different groups vary in their attitudes. Looking at the data through the eyes of those with different attitudes, or those behaving in different ways, can help you understand what motivates or influences different types of people, and can help build a picture of the dynamics of the market.

Getting a set of cross-tabs is easy, and it is often quicker to ask for all variables to be tabulated against every demographic, geodemographic, attitudinal and behavioural variable - anything that might be useful to the analysis. However, be selective in specifying variables for the top break and only ask for those relevant to the analysis, otherwise you may lose focus on what the analysis is trying to pull from the data to address the research objectives. Take an orderly and systematic approach - if questions arise that you cannot answer with the tables you have, think about what other tables or analyses might help and make a note to run those next.

37
Q

Uses of bases and filtering in tables

A

Each table is usually based on those in the sample eligible to answer the question to which it relates. Not all questions are asked of the total sample, and analysis based on the total sample is not always relevant; tables should only be based on those eligible to answer. In designing tables it is important to think about which base is relevant to the aims of the analysis.

If you have a particularly large or unwieldy dataset and you do not need to look at responses from the total sample, filtering the data (excluding some types of respondents, or basing tables on the relevant sub-sample) can make the analysis more efficient.
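A brief sketch of filtering to a relevant base, continuing the illustrative `df` frame from the cross-tab sketch above.

```python
# Filter to the sub-sample eligible for (or relevant to) the analysis, and report its base.
users = df[df["uses_product"] == "yes"]            # product users only
print(f"Base (product users): {len(users)}")
print(users["age_band"].value_counts(normalize=True).mul(100).round(1))
```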

38
Q

Labelling tables

A

Cross-tabs should be clearly laid out and easy to read - this makes the task of thinking about the findings much easier. Each table should have a heading that describes its content, the question number to which it refers, and, in full or in summary, the question or variable on which it is based. The base on which percentages are calculated should be clearly shown, and it should be indicated whether percentages are based on the column variable, the row variable or both.

39
Q

Weighting data

A

Weighting is used to adjust sample data to make it more representative of the target population on particular characteristics, including demographics and product or service usage. The procedure involves adjusting the profile of the sample data to bring it into line with the population profile, so that the relative importance of characteristics within the dataset reflects that within the target population. Any weighting procedure used should be clearly indicated, and data tables should show both unweighted and weighted data.
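A minimal cell-weighting sketch under the assumption of a single weighting characteristic (age band); the sample and population profiles are made up for illustration.

```python
import pandas as pd

# Weight for each cell = population share / achieved sample share.
sample_profile = pd.Series({"16-34": 0.50, "35-54": 0.30, "55+": 0.20})       # achieved sample
population_profile = pd.Series({"16-34": 0.30, "35-54": 0.35, "55+": 0.35})   # known population

weights = population_profile / sample_profile
print(weights.round(2))   # each respondent receives the weight for their age band

# df["weight"] = df["age_band"].map(weights)   # attach weights to respondent-level data
```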

40
Q

Data reduction

A

The process of reducing a mass of data to something more manageable and meaningful. It may be as simple as calculating the mean and standard deviation for a variable (univariate descriptive analysis), recoding variables, or removing variables from cross-tabs that are not useful or relevant. It also takes in more complicated procedures such as creating scales or indices based on responses to a range of questions; some researchers also consider factor and cluster analysis to be data reduction.

41
Q

Data reduction at frequency count stage

A

You should think about data reduction at this point: by reviewing frequency counts and frequency distributions for each question for the total sample you will be able to make decisions about recoding variables - which categories of which variables might usefully be combined. You will also be able to make decisions about the viability of key variables as top breaks for cross-tabs - are the base sizes big enough and robust enough to view separately in a column?

42
Q

Data reduction at univariate descriptive analysis stage

A

Reduce the mass of data with the relevant descriptive statistics (averages and measures of spread or variation). This is especially useful with scale questions, as you can get an average score for the whole sample or for sub-groups, plus one number that tells you the amount of variation.

43
Q

Data reduction at bivariate descriptive analysis stage

A

Having reviewed the research objectives and refreshed your mind about the client's business problem, and having seen the raw data and frequency counts, you will have a good idea about the variables you want to use as top breaks and how you want the data for each question to appear in the cross-tabs. You should have enough information about the data to allow you to be selective - and you do need to be selective so that you do not lose sight of the big picture.

Choose to run only the tables that are relevant to the research objectives, with only the relevant top breaks, relevant recoded variables and summary statistics. A final data reduction issue to consider at this stage is whether you want column and row percentages rounded to the nearest whole number or calculated to one or more decimal places. The editing process may involve removing rows and columns that do not tell you anything, and reordering and relabelling to draw attention to key insights - for example, reordering demographics by the level of interest and insight they provoke.

44
Q

Data display as data reduction

A

Data visualisation can also serve as data reduction: for bivariate data, scatter plots, line graphs or bar charts, as appropriate, can be used to determine whether there is a relationship between two variables. Scatter plots are often produced as the first step in looking for associations or relationships prior to running correlation or regression analysis.
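A closing sketch of a scatter plot as a first look at a bivariate relationship, followed by a correlation coefficient as a one-number summary; the data and variable names are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative bivariate data: age vs number of texts sent per month.
data = pd.DataFrame({
    "age":   [18, 22, 25, 30, 35, 40, 45, 50, 60, 65],
    "texts": [300, 280, 250, 200, 180, 150, 120, 100, 60, 40],
})

data.plot(kind="scatter", x="age", y="texts")      # first look at the relationship
plt.show()

print(data["age"].corr(data["texts"]).round(2))    # correlation coefficient as a summary of strength
```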