Vocab Flashcards
Chapter 1 - Collection of data.
Population
Everyone/everything involved in an investigation.
Census
An investigation with data taken from every member of a population.
Sample
An investigation with data taken from part (a subset) of the population.
Bias
Anything that distorts the data.
Strata
Subgroups/subcategories within a population (singular: stratum).
Sampling frame
A list of all the items/people forming a population.
Sampling unit
One item from a sampling frame.
Observation
You record something happening.
Experiment
You record data from something you make happen.
Qualitative data
Describes certain qualities.
Quantitative data
Describes certain quantities, can be discrete or continuous.
Continuous data.
Data we can measure.
Discrete data.
Data we can count.
Primary data.
Collected by the user.
Secondary data.
You obtain the data from somebody else.
Questionnaire.
A set of questions used to obtain data, which respondents complete, can be anonymous.
Interview/Survey.
Data collection methods. Ask people their opinions, can be anonymous.
Pilot survey.
Testing a questionnaire on a small group of people first.
-identifies likely responses
-checks response rate
-see if questions are understood
-checks how long it will take
-unexpected outcomes (refine hypothesis/change something)
-problems easier and less costly to fix before full study
-check methods of distribution/collection work
-estimate time/costs of full study
Open questions.
No suggested answers, differently worded answers can make data analysis difficult.
Closed questions.
Suggested answers to choose from; on opinion scales people tend to answer in the middle as they do not wish to seem extreme.
Capture recapture
A method for estimating the size of a population: capture and mark a first sample, release it, take a second sample, and use the proportion of marked members in the second sample to estimate the total.
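A quick worked sketch of the usual capture-recapture estimate (often called the Petersen estimate); the numbers are made up for illustration:

# Capture-recapture estimate with made-up numbers
marked_first = 40        # caught, marked and released in the first sample
caught_second = 50       # caught in the second sample
marked_in_second = 8     # of the second sample, how many were already marked
estimate = marked_first * caught_second / marked_in_second
print(estimate)          # 40 * 50 / 8 = 250, so roughly 250 in the population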
Judgement sampling.
Use judgement to select a sample representative of the population.
Opportunity sampling.
Use available people/objects at the time.
Systematic sampling.
Choose a starting point from your sampling frame at random, then choose items at regular intervals. (e.g. with an interval of 32: use a random number generator to pick a number within the first 32 items of the sampling frame, then select every 32nd item after that starting point.)
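A minimal Python sketch of the interval-of-32 example, assuming a hypothetical sampling frame of 800 items and a sample of 25 (so 800 / 25 = 32):

import random

population_size = 800                       # hypothetical sampling frame size
sample_size = 25
interval = population_size // sample_size   # 800 // 25 = 32
start = random.randint(1, interval)         # random start within the first 32
sample = list(range(start, population_size + 1, interval))
print(sample)                               # the chosen item numbers, every 32nd from the start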
Random sampling.
Everyone in the population has an equal chance of being selected (unbiased).
Quota sampling.
Group the population by characteristics, and interview a set number (quota) from each group.
Cluster sampling.
Data naturally splits. List of clusters = sampling frame. Randomly select clusters to form sample.
Stratified sampling.
The number sampled from each stratum is in proportion to the size of that stratum in the population. (e.g. 60 year 7s out of 1000 students, sample of 250: 60/1000 x 250 = 15 year 7s in the sample).
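The worked calculation from the year 7 example, as a short sketch:

# Stratified sampling: number from a stratum = (stratum size / population size) x sample size
year_7s = 60
population = 1000
sample_size = 250
from_year_7 = year_7s / population * sample_size
print(from_year_7)    # 60/1000 x 250 = 15 year 7s in the sample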
Random response method.
For sensitive questions which people are likely to answer dishonestly (e.g. flipping coins, if heads, tick yes, if tails, answer honestly.)
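A small simulation sketch of the coin-flip version above; the true proportion of 0.3 is an assumed value for illustration:

import random

true_rate = 0.3     # assumed proportion who would honestly answer yes
n = 10000
yes_count = 0
for _ in range(n):
    heads = random.random() < 0.5                 # coin flip
    would_say_yes = random.random() < true_rate   # whether they would honestly answer yes
    if heads or would_say_yes:                    # heads: tick yes regardless; tails: answer honestly
        yes_count += 1
p_yes = yes_count / n
# P(yes) = 0.5 + 0.5 * true_rate, so rearranging gives an estimate of the true rate:
print(2 * (p_yes - 0.5))    # should come out close to 0.3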
Primary data advantages
gather data that directly relates to hypothesis
you know reliability
primary data disadvantages
expensive
time consuming
difficult/impossible
secondary data advantages
easier to get hold of
can gather data quickly and cheaply
large data sets
secondary data disadvantages
wrong format/rounded
difficult to find data that matches your hypothesis exactly
(out of date, no relevant data available)
don’t know accuracy, may be biased, unreliable
census pros
representative of entire pop.
unbiased
census cons
hard/impossible for big pops.
expensive
impractical
might be tricky to define entire pop/access all members
not an option when items being used up/damaged by investigation
sample pros
quicker
cheaper
more practical than a census
sample cons
less accurate
not fully representative
biased
variability between samples
random sampling pros
unbiased
(should be) representative
random sampling cons
not always practical/convenient - if the pop. is spread over a large area, travel is needed
impossible to list entire pop. or access everyone
stratified sampling pros
likely gives a representative sample if you have easy to define categories (e.g. gender)
can compare results from different groups
stratified sample cons
not useful when no obvious categories/hard to define
can be expensive
systematic sampling pros
unbiased sample
can be done by machine
systematic sample cons
nth item might coincide with a pattern (e.g. fault) so biased
cluster sampling pros
convenient (saves travel time when pop. spread over large area)
cluster sampling cons
biased if similar clusters sampled, e.g. with similar incomes per region.
quota sampling pros
quick
representation of all diff groups (genders etc)
can be done with no sample frame
member easily replaced by one of the same characteristics
quota sampling cons
biased- interviewer bias
refusal to take part (might have similar views)
-not all may have an equal chance of being selected
opportunity sampling pros
convenient
opportunity sample cons
-not representative of pop.
-very biased.
-selecting at a particular time and place so not all students have an equal chance of being selected.
judgement sampling pros
quick
sometimes may be the only suitable method to use
judgement sampling cons
researcher bias
researcher unreliable-though should have good knowledge of pop.
not random -very biased
categorical scale
gives names or numbers to classes of qualitative data so it can be more easily processed. (numbers don’t have meaning).
ordinal scale
(rank scale)
gives numbers to the classes of data which can be ordered in a meaningful way.
multivariate data
made up of two or more variables
bivariate data
data made up of two variables (numerical)
questionnaire pros
quick and cheap
well written ones shouldn’t be biased
respondents aren’t under pressure, so their answers likely truthful
can distribute to large numbers of people
questionnaire cons
distribution can lead to bias
non-responses
(particularly on sensitive Qs)
(discarding them might remove certain parts of the pop.)
questions might not be understood by respondent
methods to distribute questionnaires (pros and cons)
hand it out - the target pop. receives it, but time consuming
put it online -data recorded and collected easily, but ppl without internet access excluded
post/email - wide reaching, not sure who is responding
ask ppl to collect it - easy, but people with strong views are more likely to take one.
interview pros
ask more complex questions
can explain Qs if someone doesn’t understand/ask follow up questions
higher response rate
you know the right person answered the questions
interview cons
time consuming - one person at a time
expensive - employ interviewers/travel if sample is geographically spread out
more likely to lie if questions are sensitive, they may be embarrassed
answers could be recorded in a biased way (accidental if untrained, deliberate if strong views)
statistical enquiry cycle
- planning (hyp, what data and how use)
- collecting data (prim/sec, constraints)
- processing and presenting data (diagrams/measures, tech)
- interpreting results (plan analysis, conclusions, predict)
- communicating results clearly and evaluating methods (aware of target audience, clear visual representation of results)
collecting data
primary data by experiment - you control the recording, so data can be recorded accurately/fairly
secondary data from a website - can be more reliable for sensitive topics (income, money spent, weight, age) where people may answer dishonestly
processing and presenting
Distribution?
-averages
-measures of spread
-box plots
-(pie charts)
-(histograms)
-(bar graphs)
Correlation
-Scatter graph
-line of best fit
-SRCC
-PMCC
Over time
-time series graph
Interpreting data
-compare averages
or
-find correlation
do the results support or disprove the hypothesis?
-do I need to repeat to find more results? (c+e)
Closed vs open questions
Closed questions have a fixed number of possible answers whereas open questions can be answered in any way.
Questionnaire questions, think: (SABCURL)
-Is it understandable and clear?
-Is it relevant?
-Is it leading?
-Is it biased?
-Is it ambiguous?
-Is it sensitive?
How can we reduce the problem of non-responses?
-Follow up people who did not respond
-Provide an incentive for people to answer (prize)
-Use clear questions that are easy to answer
Remember to:
Answer the question in a statement
Look at how many parts the question has and how many marks it is worth
How to use technology
Can use technology to…
-order data (e.g. by age)
-identify missing data
-remove irrelevant columns/data
-remove extraneous symbols
-remove outliers
-automate the calculation of summary statistics (using a computer) e.g. mean point, line of best fit.
-set up a computer to visually represent data
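A minimal sketch of a few of these steps, assuming numpy is available (not named in these notes) and using made-up data:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])            # e.g. hours revised (made up)
y = np.array([35, 41, 50, 58, 62, 71])      # e.g. test score (made up)

print(x.mean(), y.mean())                   # mean point (x-bar, y-bar)
slope, intercept = np.polyfit(x, y, 1)      # least-squares line of best fit
print(slope, intercept)                     # y = slope * x + intercept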
Advantages of using technology
-can reduce human error
-uses all data so unbiased
-more visually appealing
-saves time
constraints when planning an investigation:
time - under pressure?
costs - budget? minimise spending? longer investigation = more expensive, costs of travel and equipment
ethical issues - no harm/ distress
confidentiality- sensitive information e.g income? could be hard to get accurate data- ppl may lie or refuse to answer.
convenience - hyp could be difficult/ impossible to test, think abt most convenient way to access data you need
observation
involves counting or measuring
reference sources
secondary sources of information:
-acknowledge its source
-consider reliability(biased?)
-out of date? wrong format? data incomplete/missing?
explanatory variable
the variable you are in control of/ the variable that has an effect on the other variable
response variable
the variable you measure/ changes as a result of changing the explanatory variable.
when considering a lab, field, or natural experiment, think:
how far can I control the explanatory variable?
How can we clean raw data?
-Remove outliers
-Put data in the same format
-Remove extraneous symbols
-Identify missing values
-Remove irrelevant columns
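A short sketch of these cleaning steps, assuming the pandas library (an assumption, not named in the notes) and a made-up raw table:

import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben", "Cal"],
                   "height_cm": ["162", "175cm", ""],
                   "notes": ["x", "y", "z"]})

df = df.drop(columns=["notes"])                                        # remove irrelevant columns
df["height_cm"] = df["height_cm"].str.replace("cm", "", regex=False)   # remove extraneous symbols
df["height_cm"] = pd.to_numeric(df["height_cm"], errors="coerce")      # same format; blanks become NaN
print(df[df["height_cm"].isna()])                                      # identify missing values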
Why would we repeat a simulation/experiment a number of times?
-Find the mean average
-Compare results/see patterns
-Spot anomalous results
-Results will vary
Steps for a simulation
-Choose a suitable method for getting random numbers
-Assign numbers to the data
-Generate random numbers
-Match the random numbers
-count how many trials (e.g. rolls) it took
-repeat a number of times and find the mean average
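An illustrative Python sketch of these steps, using a made-up scenario (rolling a dice until every face has appeared) and repeating to find a mean:

import random

def one_trial():
    seen = set()
    rolls = 0
    while len(seen) < 6:
        seen.add(random.randint(1, 6))   # generate a random number and match it to a face
        rolls += 1
    return rolls

results = [one_trial() for _ in range(100)]   # repeat a number of times
print(sum(results) / len(results))            # mean average number of rolls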
Frequency polygon
Use midpoints
Cumulative frequency chart
Use upper class boundaries (endpoints/the highest value of each class).
Why would you expect a smaller sample to have a greater standard deviation?
There is more variation between small samples - a single extreme value has a bigger effect.
Why may it be appropriate to remove outliers?
-May be an error in data
-Doesn’t fit trend
What should you look for in tables?
Patterns in the data e.g. is distribution symmetric?
why might the mean be appropriate?
takes into account all the data
can be used to calculate standard deviation
why might the mean not be appropriate?
may be significantly affected by extreme values or outliers
why might the median be appropriate?
-useful when data is skewed or contains outliers as not distorted by extreme values
-easy to find in ordered data
-can be used alongside range and IQR
why might the median not be appropriate?
isn’t always a data value
not always a good representation of the data
why might the mode be appropriate?
always a data value
can be used with non-numerical data
easy to find in tallied data
why might the mode not be appropriate?
-doesn’t always exist
-may be more than one
-may be a misleading value far from the mean
-may not be a good representation of the data.
What does PMCC tell you?
It measures how close the points on a scatter diagram are to a straight line (how linear the correlation is)
What does SRCC tell you?
It measures correlation between ranks. (Ranks can agree strongly even if the data values themselves have a non-linear relationship, so SRCC can detect any monotonic association, linear or non-linear.)
How will SRCC and PMCC compare if there’s a non-linear association between two variables?
Both will have the same sign, but for a monotonic association SRCC will be stronger (closer to 1 or -1) because the relationship is not a straight line.
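A small check of this comparison, assuming the scipy library (an assumption) and made-up data where y = x cubed (monotonic but not linear):

from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6]
y = [1, 8, 27, 64, 125, 216]    # y = x cubed

pmcc, _ = pearsonr(x, y)
srcc, _ = spearmanr(x, y)
print(pmcc)    # strong positive but below 1 (not a straight line)
print(srcc)    # exactly 1.0 (the ranks agree perfectly)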
If the mean is low (below the median) then…
more than 50% of data values must be above the mean.
If the mean is high (above the median) then…
more than 50% of data values must be below the mean.
Why should a control group be used?
Allows for comparisons (between control and test group).
how could matched pairs be used? (2)
Pair people with similar characteristics (e.g. age, gender) and place one of each pair in each group.
What can you do when given a pie chart?(or comparative pie charts)
Measure the radius! With a ruler!
index numbers
when interpreting, talk about the rate/percentage change compared with the base year (index = 100)
for probability tree diagrams…
multiply along the branches to find the probability at the end of each branch.
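A tiny worked sketch with made-up probabilities (P(rain) = 0.3, P(late given rain) = 0.6, P(late given no rain) = 0.1):

p_rain_and_late = 0.3 * 0.6                   # multiply along the branches
p_no_rain_and_late = 0.7 * 0.1
print(p_rain_and_late + p_no_rain_and_late)   # add the end values: P(late) = 0.18 + 0.07 = 0.25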
for comparing regression lines…
-talk about gradient.
-substitute the given values into the equation, or set x = 0 to interpret the intercept.
-interpret each correlation.
Cumulative frequency step polygons
to read a value, go along and then up
the height of each step is the same as the frequency for its corresponding value e.g. 5 boxes (vertical) have 48 matches (horizontal)
why might the mean increase?
if you add a value greater than the mean, or take away a value less than the mean, the mean increases
Why is combining results (e.g. into one grouped frequency table) an advantage?
Only need to calculate one mean
Why is combining results (e.g. into one grouped frequency table) a disadvantage?
Can’t compare classes
What do we do for a systematic sample?
number
divide
choose
go in intervals
Are population or sample means more consistent? 😡
Sample means are more consistent - the standard deviation of the sample means is smaller than the standard deviation of the population.
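A quick simulation sketch of this idea (the population values are made up, using Python's random and statistics modules):

import random
import statistics

population = [random.gauss(50, 10) for _ in range(10000)]
sample_means = [statistics.mean(random.sample(population, 30)) for _ in range(500)]

print(statistics.pstdev(population))     # about 10
print(statistics.pstdev(sample_means))   # much smaller, so sample means are more consistent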