lecture 13 - big data + data analysis Flashcards
The three Vs of Big Data
- Volume (scale)
- Velocity (speed of info. production)
- Variety (diversity of forms)
(Laney)
big data - examples + research
e.g. social media, Smart Watches, Smart Homes, Video Doorbells, Classroom Scanners, etc.
Research = XXL large-N studies
- not possible without computer-assisted content analysis
- Data Mining = throw data into the computer and let it process the data (without any guidance) to see if it can detect patterns/relations/correlations
= data-driven search for correlations/patterns (rather than logical theory development)
- Machine Learning = computers 'learn' problem-solving algorithms from data (AI approaches) instead of being programmed with algorithms
!they can also be wrong, but sometimes come up with good strategies
data mining = less sophisticated than machine learning
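A minimal sketch of the contrast, with made-up data and column names: the "data mining" step just dumps the data in and looks for any correlations, while the "machine learning" step lets the computer learn a decision rule from examples instead of us programming the rule.

```python
# Illustrative sketch only -- data and variable names are invented.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "age":        [23, 35, 47, 52, 61, 29, 44, 38],
    "news_hours": [1.0, 2.5, 3.0, 4.5, 5.0, 1.5, 3.5, 2.0],
    "votes":      [0, 1, 1, 1, 1, 0, 1, 0],   # 1 = voted, 0 = did not vote
})

# "data mining": undirected search for correlations/patterns in the data
print(df.corr())

# "machine learning": the computer learns a classification rule from the data
# instead of us hand-coding the rule (it can also learn a wrong rule)
model = DecisionTreeClassifier(max_depth=2)
model.fit(df[["age", "news_hours"]], df["votes"])
new_case = pd.DataFrame([[30, 2.0]], columns=["age", "news_hours"])
print(model.predict(new_case))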
big data example: fake news on Twitter
tweet-based analysis:
- sample of Twitter users with voter registration
- focus on Tweets with links to political websites
- using list of ‘fake news’ websites to classify/code tweets
- calculate share of 'fake news' links by Tweet exposure & posts
findings = (extreme) left less likely to be exposed to fake news than (extreme) right
*superconsumers + supersharers = high exposure
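A hypothetical sketch of the coding step described above: each tweeted link is coded against a list of 'fake news' domains and the share of fake-news links is computed per user. The domain list and tweets are invented stand-ins, not the study's actual data.

```python
# Hypothetical coding sketch -- domains and tweets are made up.
from urllib.parse import urlparse
import pandas as pd

FAKE_NEWS_DOMAINS = {"fakesite.example", "hoaxnews.example"}   # stand-in list

tweets = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b"],
    "url":  [
        "https://fakesite.example/story1",
        "https://nytimes.com/article",
        "https://hoaxnews.example/story2",
        "https://bbc.co.uk/news",
        "https://bbc.co.uk/sport",
    ],
})

# code each link: 1 if its domain is on the fake-news list, else 0
tweets["is_fake"] = tweets["url"].apply(
    lambda u: int(urlparse(u).netloc in FAKE_NEWS_DOMAINS)
)

# share of fake-news links per user (posting / exposure share)
print(tweets.groupby("user")["is_fake"].mean())
```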
big data: ethical questions + recommendations
ownership
- personal vs commercial (who decides about informed consent)
- public vs private information
ethical principles
- minimizing harm
potential benefits (public interest) vs potential harm (rights, reputation, money)
- informed consent
retroactively vs in advance
- protecting privacy and confidentiality
intent and expectation of users/sources is crucial
(rule of thumb: people don't think about whether their information will be used)
big data ethics recommendations
5
- anonymization of personal data
!quoting specific posts can still lead to identification
pseudonymization: remove identifiers + store them separately in case you want to be more transparent later (see the sketch below)
- data minimization (only collect what you really need)
- data encryption (not everyone can access)
- secure storage
- arrangements that enable data subjects to exercise their fundamental rights = ways for people to agree
(e.g. direct access to their personal data and consent to its use or transfer)
(e.g. if you remove data from the site, then it would also be removed from the data available for research)
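A minimal sketch of the pseudonymization idea mentioned above: replace identifiers with random pseudonyms in the working dataset and keep the lookup table separately (and securely). Function and field names are made up for illustration.

```python
# Minimal pseudonymization sketch -- names and fields are hypothetical.
import secrets

def pseudonymize(records, id_field="username"):
    """Replace identifiers with random pseudonyms; return data + lookup table."""
    lookup = {}                                # pseudonym -> real identifier
    for rec in records:
        pseudo = "p_" + secrets.token_hex(4)
        lookup[pseudo] = rec[id_field]
        rec[id_field] = pseudo
    return records, lookup

data = [{"username": "jane_doe", "post": "example post"}]
pseudonymized, lookup = pseudonymize(data)
print(pseudonymized)   # identifiers removed from the working dataset

# 'lookup' is kept only so data subjects can later exercise their rights
# (e.g. ask for their data to be removed) -- store it encrypted/access-restricted.
```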
secondary data
= data collected by other researchers/institutions & made available, often at no cost, e.g. in data archives
trade-off
- quick & convenient access to data
- lack of control & constraints on measurement
e.g. questions not exactly phrased the way you want
assessment of validity & reliability
- during data analysis: the researcher needs to assess reliability and validity while analyzing the data
!this is necessary if you use the data beyond the purpose it was created for
Ethical principles still apply -> only informed consent is not required (already given)
- still ethics review necessary (esp. for privacy and confidentiality)
data management
- enter your own data or import existing (numerical) data into a database/spreadsheet
typical structure =
- columns = variables/categories/dimensions
- rows = data for individual cases (e.g. each participant)
!!documentation and archiving = make a BACKUP
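A small sketch of that typical structure in code: columns hold the variables, each row holds one case, and a backup copy is written out. All data and file names are invented.

```python
# Sketch of the typical data structure (invented data and file names).
import pandas as pd

data = pd.DataFrame({
    "participant_id": [1, 2, 3],     # columns = variables/categories/dimensions
    "age":            [21, 34, 29],
    "trust_score":    [4, 2, 5],
})
# each row = the data for one individual case (here: one participant)

data.to_csv("survey_data.csv", index=False)          # working copy
data.to_csv("survey_data_backup.csv", index=False)   # !!always keep a backup
```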
levels of measurement
4
- nominal = mutually exclusive, equivalent, and exhaustive categories
e.g. gender, ethnicity, religion, countries
- ordinal = rank ordering (with/without ties) on some dimension
e.g. agreement scale, evaluation, arbitrary intervals (e.g. time)
!intervals are arbitrary, but can be ranked
- interval = precise measurement units with arbitrary zero point
- ratio = precise measurement units with absolute zero point
interesting: statistical analysis usually assumes interval-level data, but most data is ordinal (e.g. agreement scales in surveys)
-> ordinal scales are sometimes treated as interval scales (requires the 'heroic' assumption that the intervals between units are equal)
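A tiny numeric illustration of that 'heroic' assumption: averaging a 5-point agreement scale treats the distance between categories as equal. The responses are made up.

```python
# Averaging a 5-point agreement scale (1 = strongly disagree ... 5 = strongly agree)
# treats it as interval data, i.e. assumes the step 1->2 equals the step 4->5.
responses = [1, 2, 2, 4, 5, 5, 3]                      # made-up Likert answers

mean_score = sum(responses) / len(responses)           # interval-level treatment
print(round(mean_score, 2))                            # 3.14

# the strictly ordinal alternative: report a rank-based summary (the median)
print(sorted(responses)[len(responses) // 2])          # 3
```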
tables - how to do it
- meaningful title
- self-explanatory labels
- consistent number format
- (total) number of cases
- data source(s)
Democratic Peace Theory
table 1: Wars by type of regime and time period
columns: wars 1800-1939, 1940-2010, totals
rows: dem-dem, dem-aut, aut-aut, total
chart types
- bar charts = best choice for cross-sectional data
- line charts = best choice for time series data
- pie charts = almost always a bad choice
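A small matplotlib sketch of that rule of thumb: a bar chart for a cross-sectional comparison, a line chart for a time series. All numbers are invented for illustration.

```python
# Rule-of-thumb sketch: bar chart for cross-sectional data, line chart for
# time-series data (all numbers invented).
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# comparison across categories at one point in time -> bar chart
countries = ["NL", "DE", "FR"]
turnout = [78, 76, 68]
ax1.bar(countries, turnout)
ax1.set_title("Turnout by country (bar)")

# development over time -> line chart
years = [2010, 2014, 2018, 2022]
trust = [55, 52, 48, 45]
ax2.plot(years, trust, marker="o")
ax2.set_title("Trust over time (line)")

plt.tight_layout()
plt.show()
```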
scatterplots
= two interval scales
scatterplotting data is a good idea: it visualizes numerical trends
different scatterplot patterns can have the same linear regression line
the same regression line can turn out in the scatterplot to reflect:
- linear relationship
- non-linear relationship
- linear relationship with an outlier
- no relationship
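A sketch of this point using two datasets from Anscombe's quartet: both have (almost) exactly the same regression line, but the scatterplots reveal a linear vs a clearly non-linear relationship, which is why plotting the data matters.

```python
# Anscombe's quartet, datasets I and II: same regression line (~ y = 3 + 0.5x),
# very different scatterplots.
import numpy as np
import matplotlib.pyplot as plt

x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, y, label in [(axes[0], y1, "linear"), (axes[1], y2, "non-linear")]:
    slope, intercept = np.polyfit(x, y, 1)        # fit the regression line
    ax.scatter(x, y)
    xs = np.linspace(x.min(), x.max(), 100)
    ax.plot(xs, slope * xs + intercept)
    ax.set_title(f"{label}: y = {intercept:.1f} + {slope:.2f}x")

plt.tight_layout()
plt.show()
```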
inferential statistics basic idea + key assumptions
= making inferences from sample to population
- sampling from complete target population
- SRS (simple random sampling) with perfect response rate (or non-response completely random)
- no nonsampling error
statistical significance
a sample is never 100% representative; there is always some error and we don't know how much -> but we can make informed guesses
rule of thumb =
- observed difference/relationship > random/sampling error
- Statistical tests are “mechanical” tests, not a substitute for substantive interpretation
sampling error = difference between different samples
the sampling distribution tells us whether the observed difference/mean in a single sample is likely just random sampling error or whether it represents a real, meaningful difference
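A small simulation of that idea: drawing many random samples from the same (invented) population shows how much sample means vary by chance alone; that spread is the sampling distribution against which an observed difference is judged.

```python
# Simulation sketch -- the population parameters are invented.
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50, scale=10, size=100_000)   # invented population

# draw many samples and record each sample's mean
sample_means = [rng.choice(population, size=200, replace=False).mean()
                for _ in range(1_000)]

print("population mean:", round(population.mean(), 2))
print("spread of sample means (standard error):", round(np.std(sample_means), 2))

# an observed difference is 'significant' when it is clearly larger than what
# this random sampling error alone would typically produce
```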
significance testing: issues and problems
4
- statistical vs practical significance
e.g. you could have the same statistic/correlation, but a small sample -> large random error -> not significant
with a large sample -> small random error -> highly significant
-> everything becomes significant in large-N studies (see the sketch below)
- population data (e.g. all countries) & significance testing
(view 1: no sample -> no sampling error -> no inferential statistics)
(view 2: the population is not static but changes over time, so (even) a single census is just one cross-sectional snapshot)
- non-probability samples, e.g. internet panels
increasingly hard to get probability samples (low response rates) -> a lot of research now runs via internet panels = not representative -> hard to fix with statistics
- publication bias & significance testing
(null findings are boring -> do not get published -> what you read in journals is not representative (it ignores studies that 'failed'))
solution: pre-registration, so that 'failed' studies show up
a lot of significance depends on the size of the sample
discussion of results should focus on: pattern of results + substantive meaning (effects shouldn't be so small that they are only meaningful statistically)
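A sketch of the statistical-vs-practical point referenced above: the same small difference between two groups is typically not significant with a small N but highly significant with a large N. The group means and sample sizes are invented.

```python
# Same (small) true difference, different sample sizes -- invented numbers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_value(n):
    group_a = rng.normal(loc=50.0, scale=10.0, size=n)
    group_b = rng.normal(loc=51.0, scale=10.0, size=n)   # tiny true difference
    return stats.ttest_ind(group_a, group_b).pvalue

print("n = 30:    p =", round(p_value(30), 3))       # usually not significant
print("n = 10000: p =", round(p_value(10_000), 6))   # usually highly significant
# which is why interpretation should focus on the substantive size of the effect
```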
outlook: science in the ‘‘post-truth’’ era
contemporary threats to science
- external: distrust of scientific research, accusations of ‘fake science’ (people make up data)
- internal: wrong incentives, quantity over quality (publications), production of ‘fake science’
outlook =
- correct incentives
- counter ‘fake news’
- commitment to and practice of good science, based on ethical principles and scientific integrity (and through education)