Chapter 5 Flashcards
Content analysis usually relies on a sample of relevant content rather than a census
How the sample is selected determines which statistics are appropriate (inferential or descriptive)
Social science theory
Describe people’s behaviour and mental processes
Sample: a subset of units from the population, selected to represent that population
Probability samples (units selected randomly) allow valid inferences about the population.
Probability samples are subject to sampling error; statistical procedures help to estimate it.
Non-probability samples: sampling error cannot be calculated.
Universe: all units being considered
Population: all sampling units to which the study will infer
Sampling frame: the actual list of units from which the sample is selected
If the population is specified but no sampling frame exists: use multistage sampling
Sampling Time periods
Cross-sectional studies are the most popular: people are sampled at one point in time to study behaviours, attitudes etc. Content, by contrast, appears over time.
For over time periods:
Longitudinal designs are possible.
Timing is a concern for content posted online and for mobile content. The lack of a predictable publication cycle for web content, and the ability to post at any time, make sampling across time more important (and more difficult).
Digital distribution: time sampling problems.
Interpersonal communication through writing and phone calls - changing content with no routine.
Impact of time on internet and mobile samples is a big problem when content does not have a timestamp.
Archived content: can be searched and a sampling frame created.
If not archived: content needs to be collected as it is posted. This problem can be addressed with software that scrapes internet content at randomly selected, predetermined times.
= researchers generate their own archive using software.
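A minimal sketch of how randomly selected, predetermined capture times could be drawn before a scraper is scheduled. The function name, the one-week window and the whole-minute resolution are assumptions made for illustration; the scraping step itself is not shown.

```python
import random
from datetime import datetime, timedelta

def random_capture_times(start, end, n, seed=None):
    """Pick n distinct capture times (whole minutes) at random between start and end."""
    rng = random.Random(seed)
    total_minutes = int((end - start).total_seconds() // 60)
    if n > total_minutes:
        raise ValueError("window too short for the requested number of captures")
    # Sampling minute offsets without replacement gives every minute an equal chance.
    offsets = rng.sample(range(total_minutes), n)
    return sorted(start + timedelta(minutes=m) for m in offsets)

# Hypothetical one-week window with 20 capture points.
schedule = random_capture_times(datetime(2024, 3, 4), datetime(2024, 3, 11), n=20, seed=1)
```

The schedule is fixed in advance, so the time dimension of the resulting archive rests on a probability sample rather than on when the researcher happened to visit the site.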
Decide whether the inference concerns content producers, time or both, and make sure that each dimension (content or time) is based on a probability sample.
Sampling techniques
For statistical inference the sample must be a probability sample. With a non-probability sample the inference is meaningless: no validity.
Problem: finding a sample that allows valid conclusions without taking too much time.
Census
Every unit in the population is included in the CA, e.g. all content about an event or series of events.
Census or sample? Ask how best to use coders' time for the research goals.
Whether to run a census depends on resources and goals: the larger the number of content units, the less bias but the more resources needed.
Non-prob sampling
Used often, sometimes because no adequate sampling frame is available.
Two non-prob sampling: convenience and purposive sampling. (mostly purposive)
Convenience samples
Using content because it's available. = it's a census in which the population is defined by availability rather than by the RQ. That population is a biased representation of the universe of units.
Problems: websites may not be equivalent; content can be difficult to access.
Convenience: no inference to a population, but justified under 3 conditions:
- The material being studied is hard to find
- Resources (time and money) limit the ability to generate a random sample of the population
- The researcher is exploring an under-researched but important area about which little is known (scholarly importance)
Consistent results from a large number of convenience samples can contribute to theory
Purposive sampling
Logical or deductive reasons dictated by the nature of the research project
Studies of particular publications or time
Purposive samples require specific research justifications other than lack of money or availability.
Consecutive unit sampling: a series of content produced during a certain time period, e.g. a two-week period in a consecutive-day sample. Important when studying continuing news (elections).
Probability sampling
Core: equal chance of being included
Extension of the logic: take many samples from the same population at one time. The best guess for the value of each sample mean would be the population mean, although individual sample means would vary from it.
With an infinite number of samples:
The mean of all the sample means would equal the population mean. Plotting all the means on a graph gives a distribution of sample means: the sampling distribution.
Central limit theorem: with an infinite number of samples, any sampling distribution has the characteristics of a normal curve.
This allows the researcher to estimate the amount of sampling error in a probability sample. For random samples, the probability that a particular sample mean is close to the true population mean can be calculated, because the mean of infinite samples will equal the population mean.
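The idea of a sampling distribution can be checked with a small simulation. This is only a sketch: the population of 100 case values, the sample size of 30 and the 2000 repetitions are made-up numbers, and a finite number of samples only approximates the "infinite samples" argument.

```python
import random
import statistics

def sampling_distribution(population, sample_size, n_samples, seed=None):
    """Draw many random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

population = list(range(1, 101))            # hypothetical case values, mean 50.5
means = sampling_distribution(population, sample_size=30, n_samples=2000, seed=42)

mean_of_means = statistics.mean(means)      # should sit close to the population mean
spread_of_means = statistics.stdev(means)   # empirical sampling error of the mean
```

Individual sample means differ from 50.5, but their average sits near it, and their spread is much smaller than the spread of the case values themselves.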
Sampling error combined with sample mean
Allows a researcher to estimate pop mean (given confidence)
Best guess: sample mean or proportion. Estimate range of error in the guess.
Key to understanding inference from a probability sample to the population is sampling error: an indication of the accuracy of the sample
Standard error formulas
They adjust the sample's SD for sample size, because sample size is one of three factors that affect how good an estimate a sample mean or proportion will be. Sample size is the most important.
Larger sample: better estimate of the population. More cases: smaller impact of the large and small values on the mean
The second factor affecting accuracy of the sample estimate is the variability of case values (homogeneity of the population).
If case values vary widely, the sample will have more error in estimating the population mean or proportion
Variability results from the presence of large and small values for cases. The larger the sample, the more likely the effect of case variability will decline.
The third factor affecting accuracy of the sample estimate is the proportion of the population in the sample. High proportion in the sample: error will decline (the sample distribution better approximates the population distribution).
The sample must equal or exceed 20% of the population cases before this factor matters in estimating sampling error.
Sampling a high proportion of a large population is not necessary to generate a representative sample.
When the sample exceeds 20% of the population, adjust sampling error using the finite population correction (fpc).
To adjust the standard error for such a sample: multiply the standard error formula by the fpc formula
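The adjustment can be sketched in a few lines. The notes do not give the formulas, so the ones below are the common textbook forms and should be read as assumptions: SE of a mean as SD / sqrt(n), SE of a proportion as sqrt(p(1-p)/n), and fpc as sqrt((N-n)/(N-1)). The chapter's own versions may differ slightly (for example by dividing by n-1).

```python
import math

def se_mean(sd, n):
    """Standard error of a sample mean (textbook form, assumed here)."""
    return sd / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion (textbook form, assumed here)."""
    return math.sqrt(p * (1 - p) / n)

def fpc(population_size, n):
    """Finite population correction, standard form sqrt((N - n) / (N - 1))."""
    return math.sqrt((population_size - n) / (population_size - 1))

def adjusted_se(se, population_size, n, threshold=0.20):
    """Multiply the standard error by the fpc once the sample reaches 20% of the population."""
    if n / population_size >= threshold:
        return se * fpc(population_size, n)
    return se

# 100 units from 10,000: the correction is skipped.
# 100 units from 400 (25% of the population): the standard error shrinks.
small_share = adjusted_se(se_proportion(0.5, 100), 10_000, 100)
large_share = adjusted_se(se_proportion(0.5, 100), 400, 100)
```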
All content involves a time dimension. Here the concept concerns trend studies over periods longer than a year (natural planning)
Sampling
Simple random sampling
All units have an equal chance of being selected. Example: from a list of all 375 films, draw 100 random numbers between 1 and 375
Simple random sampling has two conditions: with replacement (units are returned to the population after they are selected) and without replacement.
Large population: the small variation in probability without replacement has a negligible impact on sampling error estimates. Simple random sampling is not good in all situations: if the list is long, another technique is preferred.
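The film example can be sketched with Python's standard library. The function name and the seed are illustrative; `random.sample` draws without replacement and `random.choices` draws with replacement.

```python
import random

def simple_random_sample(frame_size, n, replace=False, seed=None):
    """Draw n unit numbers from a numbered sampling frame (1..frame_size)."""
    rng = random.Random(seed)
    unit_numbers = range(1, frame_size + 1)
    if replace:
        # With replacement: the same unit number can be drawn more than once.
        return rng.choices(unit_numbers, k=n)
    # Without replacement: every drawn unit number is distinct.
    return rng.sample(unit_numbers, n)

films_without = simple_random_sample(375, 100, seed=7)
films_with = simple_random_sample(375, 100, replace=True, seed=7)
```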
Systematic sampling
Selecting every nth unit from the sampling frame.
n is found by dividing the size of the sampling frame by the sample size.
Sample 1000 sentences from 10000 sentences: select every tenth sentence.
The starting point has to be random. Works well when simple random sampling creates problems.
Can have problems under two conditions:
- It requires a listing of all possible units (if the list is incomplete, inferences cannot be made)
- It's subject to periodicity, a bias in the arrangement of units in a list. A problem because, for example, a few months might not be represented in the sample.
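The sentence example above can be sketched as follows. The function name is hypothetical, and the frame is assumed to be a complete list held in memory.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th unit from the frame after a random start, where k = len(frame) // n."""
    rng = random.Random(seed)
    interval = len(frame) // n
    if interval < 1:
        raise ValueError("sample size exceeds the size of the sampling frame")
    start = rng.randrange(interval)        # random starting point within the first interval
    return frame[start::interval][:n]

sentences = list(range(10_000))            # stand-in for 10,000 numbered sentences
picked = systematic_sample(sentences, 1000, seed=11)
```

If the frame is ordered in a cycle whose length matches the interval (for example daily issues with an interval of 7), every selected unit falls on the same position in the cycle, which is the periodicity problem noted above.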
Stratified sampling
Break the population into smaller groups (strata) and sample randomly within the groups. The strata are more homogeneous than the population with respect to characteristics of importance.
Can be stratified by year: makes smaller homogeneous groups that guarantee a more representative sample.
Two purposes. First: increases representativeness (uses knowledge about the distribution of units to avoid oversampling and undersampling)
Proportionate sampling: sample sizes within strata are based on each stratum's proportion of the population.
Second: stratifying can increase the number of units in a study when some types of units make up a small proportion of the population.
Disproportionate sampling: selecting a sample size from a stratum that is larger than that stratum's proportion of the population.
= it oversamples some units to obtain enough cases for valid analysis. The sample is no longer representative of the population.
Mass media produce content on a regular basis: stratified sampling takes advantage of known variations within these production cycles.
Stratified sampling requires adjustments to sampling error estimates.
Sampling from homogeneous groups: standard error is reduced.
Standard error of proportion for a stratified sample: equals the sum of the standard errors for all strata.
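Proportionate and disproportionate allocation can be sketched like this. The strata (three years of content with made-up sizes) and the function names are assumptions for illustration only.

```python
import random

def proportionate_allocation(strata_sizes, n):
    """Split a total sample of n across strata in proportion to stratum size.

    Rounding means the parts may not add up to exactly n for awkward sizes.
    """
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}

def stratified_sample(strata, allocation, seed=None):
    """Draw a simple random sample of the allocated size within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, allocation[name]) for name, units in strata.items()}

# Hypothetical population: content units grouped by year.
strata = {
    "2019": [f"2019-{i}" for i in range(500)],
    "2020": [f"2020-{i}" for i in range(300)],
    "2021": [f"2021-{i}" for i in range(200)],
}
sizes = {name: len(units) for name, units in strata.items()}

proportionate = proportionate_allocation(sizes, 100)       # 50 / 30 / 20
disproportionate = {"2019": 40, "2020": 30, "2021": 30}    # oversamples the smallest stratum

sample = stratified_sample(strata, proportionate, seed=5)
oversample = stratified_sample(strata, disproportionate, seed=5)
```

With the disproportionate allocation the pooled sample no longer mirrors the population, so estimates for the whole population would need the strata to be weighted back to their true shares.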
Cluster sampling
Lists are sometimes unavailable: then use cluster sampling, selecting content units from clusters or groups of content.
Mass media: Google News is a cluster of many articles divided into topics (sports, business etc). Listing all websites is impossible; however, local websites can be grouped by city (clusters for sampling when geography is important).
Cluster sampling allows probability selection of groups and then subgroups: random sampling within these groups leads to specific content units
Cluster sampling adds sampling error because of intraclass correlation.
Content units may cluster together because they are similar in nature.
Shared characteristics: positive correlation among the attributes. Selecting clusters makes it more likely to exclude units whose characteristics differ from the units in the selected clusters. The sample may not be representative.
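A two-stage cluster sample can be sketched as below: clusters are selected at random first, then units within each selected cluster. The topic clusters and article labels are invented for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Randomly select clusters, then randomly select units inside each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)       # stage 1: pick clusters
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen                                    # stage 2: pick units
    }

# Hypothetical topic clusters of articles.
clusters = {
    "sports": ["s1", "s2", "s3", "s4", "s5"],
    "business": ["b1", "b2", "b3", "b4"],
    "politics": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "culture": ["c1", "c2", "c3"],
}
result = cluster_sample(clusters, n_clusters=2, n_per_cluster=3, seed=5)
```

Units from clusters that were not selected in stage 1 can never enter the sample, which is where the extra sampling error comes from when units within a cluster resemble each other.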
Multistage sampling
Common practice involving one or several of these techniques.
Mediated content three dimensions in sampling: titles, issues or dates.
Pure multistage sampling: random sampling for each stage
can also combine techniques: as representative as possible
Stratified sampling (legacy media)
Choosing simple random or stratified sampling is a question of efficiency.
Legacy media produce predictable variations. If these are known:
The variations can be used to select a more representative sample. They identify subsets of homogeneous content, so a smaller stratified sample can be as representative.
In daily newspapers:
Use a constructed week: randomly select an issue for each day of the week (one Monday, one Tuesday etc).
Two constructed weeks are sufficient to represent a year's content
One constructed week was adequate, and two constructed weeks worked even better
2 weeks for a year of content. For longer periods, e.g. 5 years? Nine constructed weeks are as representative as 2 from each year.
Health stories: 6 constructed weeks instead of two (for 1 or 5 years, but better for 1)
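A constructed week is a sample stratified by day of the week. The sketch below assumes the population is every calendar date in a period; the function name and the year 2023 are illustrative, and a real study would first drop dates on which no issue was published.

```python
import random
from datetime import date, timedelta

def constructed_weeks(start, end, n_weeks, seed=None):
    """For each weekday, randomly pick n_weeks dates between start and end (inclusive)."""
    rng = random.Random(seed)
    by_weekday = {weekday: [] for weekday in range(7)}      # 0 = Monday ... 6 = Sunday
    day = start
    while day <= end:
        by_weekday[day.weekday()].append(day)
        day += timedelta(days=1)
    picked = []
    for weekday in range(7):
        # All Mondays form one stratum, all Tuesdays another, and so on.
        picked.extend(rng.sample(by_weekday[weekday], n_weeks))
    return sorted(picked)

# Two constructed weeks for one year of a daily newspaper: 14 dates.
sample_dates = constructed_weeks(date(2023, 1, 1), date(2023, 12, 31), n_weeks=2, seed=3)
```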
Weekly newspapers
Either a simple random sample of 14 issues, or one issue randomly selected from each month (12 issues), which is most efficient.
The first is best when riskier decisions have to be made; the second is best when risk is lower and time and money are important
Magazines
One issue randomly selected from each month is the most efficient (for 1 year); alternative: 14 issues from a year.
Monthly magazines: examine all issues. For long-term trends: go for stratified sampling.
Network television news
Approaches used: a random 12 weeks from 60 months; the same two weeks from each year; two constructed weeks per quarter for nine years; four consecutive weeks per six-month period
Best:
Randomly select two days from each month, for a total of 24 days from the year. Simple random sampling needs 35 days for 1 year. In media not influenced by weekly, monthly etc cycles, use simple random sampling: stratifying there will introduce bias in the data.
sampling digital content
New information distribution and networking systems have a great impact on individuals. Assessing populations and drawing probability samples is difficult because of: lack of sampling frames, private areas of the web, difficulties in analysing big data sets
Digital:
People can communicate to large numbers (mass) and a single person (interpersonal).
RQ, access to content and cost
Varies with the type of digital content.
Digital designed for mass is easier to access. Twitter easier (snapchat difficult)
A prime difference between websites and social networking platforms
is the tendency of websites to represent organizations, profit or not, instead of individuals, while social networking platforms carry content of organisations etc as well as individuals. Social networks produce large data sets compared to websites, which complicates drawing a representative sample.
Sampling the web
Problems sampling online content:
- Interactivity
- Immediacy
- Multimodality
- Hyperlinks
Online content is unpredictable, which creates sampling challenges: stratified sampling may not work, and simple random sampling needs a longer time frame
Problem: absence of sampling frames for populations.
Digital: large number of studies on a wide range of topics samples news websites.
Online
Constructed week sampling is most efficient: two or more weeks. Another study: six days could represent a year, but this may not be generalisable; more than two weeks may be required.
Online press: 12 constructed weeks (three per quarter)
Associated press website: 8 weeks
Warnings
First: be aware of how the web is both similar and not similar to legacy media.
Second: sampling differs from traditional media and is difficult, because sampling frames are not available and content changes.
Third: the changing nature of the web makes coding difficult. Content must be captured, or sampling must take change into consideration.
Fourth: the multimedia nature of the web can affect the various study units.
Fifth: the changing nature of sites makes reliability testing difficult, because coders may not be coding identical content
Process of sampling:
Depends on the research conceptualisation. Differs from different websites
Convenience samples pose fewer problems, but cannot be generalised.
Researchers must ALWAYS be aware of the time element of changing web content
When no exact sampling frame: use multistage
First stage
Use a range of search engines and algorithms to generate multiple lists of sites
The lists become a sampling frame once the duplicate sites are removed
Second stage:
Select from the sites in the sampling frame. If geography important add more stages.
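The two stages can be sketched as follows. The result lists stand in for output from different search engines (the URLs are placeholders), and the normalisation rule (lowercase, no trailing slash) is a crude assumption; real duplicate detection would need more care, since URL paths can be case-sensitive.

```python
import random

def build_sampling_frame(result_lists):
    """Stage 1: merge several lists of sites and drop duplicates, keeping first-seen order."""
    seen = set()
    frame = []
    for results in result_lists:
        for url in results:
            key = url.strip().lower().rstrip("/")   # crude normalisation (assumption)
            if key not in seen:
                seen.add(key)
                frame.append(key)
    return frame

def sample_sites(frame, n, seed=None):
    """Stage 2: simple random sample of sites from the de-duplicated frame."""
    return random.Random(seed).sample(frame, n)

# Placeholder lists standing in for two search engines' results.
engine_a = ["https://example.org/", "https://news.example.com"]
engine_b = ["https://example.org", "https://blog.example.net/"]

frame = build_sampling_frame([engine_a, engine_b])
sites = sample_sites(frame, 2, seed=9)
```

Sampling from the merged frame rather than taking the top results avoids inheriting the ranking order of any single engine, although the frame still only covers what the engines returned.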
Problems:
Search engines and algorithms produce long lists of sites that are not random.
Different engines use different algorithms for generating the order of their lists.
Building sampling frames from search results can be time consuming, and the frames can over-represent certain sites.
Content on pages changes at varying rates
- use categories other than topics to classify web pages.
How to deal with news websites:
Micro-longitudinal sampling with software program: download specific components of a page (headlines) every 60 seconds from a news site. Only download changed content
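The "only download changed content" step can be sketched by comparing each capture with the previous one. This is not the software the notes refer to; the function name and the snapshot format are assumptions, and fetching the page every 60 seconds is left out.

```python
import hashlib

def record_changes(snapshots):
    """Keep a snapshot only when its text differs from the previous capture.

    snapshots: iterable of (timestamp, text) pairs in capture order.
    """
    stored = []
    last_digest = None
    for timestamp, text in snapshots:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest != last_digest:          # content changed since the last capture
            stored.append((timestamp, text))
            last_digest = digest
    return stored

# Hypothetical headline captures, one every 60 seconds.
captures = [(0, "Headline A"), (60, "Headline A"), (120, "Headline B"),
            (180, "Headline B"), (240, "Headline A")]
changes = record_changes(captures)
```

A headline that returns after being replaced is stored again, because the comparison is with the previous capture only, which preserves the sequence of changes over time.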
Size and complexity of web has led to development of machine learning strategies
Topic-specific search engines learn from training documents and filter by URL. Their approach uses both content and structure (links) to collect web content
After filtering: take a sample from the population found by searching the web for topics
Sampling with databases
CA is enhanced by increasing storage capacity. Messages are digitised, preserved and available online. This gives the capacity to search and retrieve specific types of content from different databases.
Database:
A structured collection of data that can be easily searched and retrieved. Usually text,
but also visual and audio. Can be commercial, created by researchers, or a combination of the two. What content goes into a database is decided by the database creator.
Most databases are searched with keywords (terms). This has limits: a database is unlikely to contain all content. It is purposively organised and does not represent the population
Use more than one database, and ask: are the indexing and archiving software equivalent?
Existing literature:
Absence of information about process used to generate content sample from databases.
Keywords: crucial in determining the ability of a sample to yield valid results. Searches with one keyword do not always return relevant content. Use strings of keywords instead.
Researchers should conduct and report formal evaluations of the recall and precision of a search string.
Recall: measures a string's capability of retrieving pertinent content
Precision: is the retrieved content actually relevant to the study’s goals?
Recall: relevant articles retrieved divided by all relevant articles in the database. Precision: relevant articles retrieved divided by all articles retrieved. The more precise a string, the more likely it will miss relevant content.
Relevant content in the data: established with a protocol that has reached acceptable reliability and is applied by two or more coders
Precision and recall are used to create a correction coefficient that estimates the error associated with a sample from a database. How: divide precision by recall.
Correction coefficient less than 1: the search string overestimated the number of articles
Greater than 1: the string underestimated the number of articles.
The correction coefficient is accurate for longer time periods, not for short time periods
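A worked example of the three measures, using made-up counts. Precision and recall follow the standard information-retrieval definitions, and the correction coefficient is precision divided by recall as described above; the function names are illustrative.

```python
def precision(relevant_retrieved, total_retrieved):
    """Share of the retrieved articles that are actually relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant):
    """Share of all relevant articles in the database that the string retrieved."""
    return relevant_retrieved / total_relevant

def correction_coefficient(precision_value, recall_value):
    """Precision divided by recall."""
    return precision_value / recall_value

# Hypothetical evaluation: the string retrieved 200 articles, coders judged 150 of them
# relevant, and the database is estimated to hold 250 relevant articles in total.
p = precision(150, 200)                    # 0.75
r = recall(150, 250)                       # 0.60
c = correction_coefficient(p, r)           # 1.25: the string underestimated the count
estimated_relevant = 200 * c               # 250 relevant articles
```

Multiplying the number of retrieved articles by the coefficient recovers the estimated number of relevant articles, which is why a value above 1 signals underestimation and a value below 1 overestimation.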
Using database: provide detailed description of the process
- Discussion of relevant media outlets
- Report the search strings and the process used to determine them
- Calculate precision, recall and the correction coefficient, and report them in the article
Sampling social media
Sampling Twitter: greater access to messages than Facebook.
Research is biased toward Twitter, which over-represents its social impact. Not many studies have used a representative sample.
Social networking sites: studies usually examine one platform at a time; sometimes using both Twitter and YouTube can yield a more diverse video collection.
Public organisations:
Good sampling frame for tweets by searching the internet
Organisations:
Tend to make their messages available. When Twitter is used interpersonally: problem. Twitter: either a census or a representative sample is useful.
The most used way to get representative samples of tweets: access through the Twitter firehose (API).
The firehose provides a census within a selected stream of tweets, from which probability samples can be drawn (expensive).
The streaming API is available free of cost; it estimates the top hashtags well, but is misleading for smaller numbers. How well the streaming API represents all tweets depends on the coverage and nature of the topic. (Does not change with more than one API.)
Algorithmic sampling
Uses the nodes (web pages) and edges (hyperlinks) of the web graph to generate a sample (all samples equally probable).
Learning algorithms for sampling social media: open-source software for selecting and analysing tweets (API).
Sampling suggestions for Digital media
Questions for sampling content
Studying content and messages created by organizations or individuals?
Which ones?
Time period
Sampling frame available?
Is the frame too long to include all units? If not, conduct a census
If too long, can simple random sampling be conducted?
Would stratified sampling be more representative?
If no sampling frame - sampling studies suggested ways to generate using search engines or specialised?
If a representative sample is impossible, is a convenience or purposive sample justified?
Sampling individual communication
Mass com: being regular in their creation style - records often available.
Individual com is complicated
Often convenience samples result. Scientific method is solution for inability to random sample.
SUMMARY:
Appropriate selection of content depends on theoretical issues and practical problems in RQ.
Number of units small: conduct a census of all content.
Large: a probability sample is better (allows inference to the population from the sample).
Appropriate sampling depends on the nature of the RQ. Probability: necessary if using statistical inference
Efficient sampling of mass media for a given time period: stratified sampling (mass media varies systematically with time periods). CA: sampling may involve prob samples based on time, content or both.