lecture 13 - big data + data analysis Flashcards
The three Vs of Big Data
- Volume (scale)
- Velocity (speed of info. production)
- Variety (diversity of forms)
(Laney)
big data - examples + research
e.g. social media, Smart Watches, Smart Homes, Video Doorbells, Classroom Scanners, etc.
Research = XXL large-N studies
- not possible without computer-assisted content analysis
- Data Mining = throw data into the computer and let it process the data (without any guidance) to see if it can detect patterns/relations/correlations
= data-driven search for correlations/patterns (rather than logical theory development)
- Machine Learning = computers 'learn' problem-solving algorithms from data (AI approaches) instead of being programmed with algorithms
!they can also be wrong, but sometimes come up with good strategies
data mining = less sophisticated than machine learning
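A minimal sketch of the contrast, with made-up data and column names: the "data mining" step just dumps the data in and looks for any correlations, while the "machine learning" step lets the computer learn a decision rule from examples instead of us programming the rule.

```python
# Illustrative sketch only -- data and variable names are invented.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "age":        [23, 35, 47, 52, 61, 29, 44, 38],
    "news_hours": [1.0, 2.5, 3.0, 4.5, 5.0, 1.5, 3.5, 2.0],
    "votes":      [0, 1, 1, 1, 1, 0, 1, 0],   # 1 = voted, 0 = did not vote
})

# "data mining": undirected search for correlations/patterns in the data
print(df.corr())

# "machine learning": the computer learns a classification rule from the data
# instead of us hand-coding the rule (it can also learn a wrong rule)
model = DecisionTreeClassifier(max_depth=2)
model.fit(df[["age", "news_hours"]], df["votes"])
new_case = pd.DataFrame([[30, 2.0]], columns=["age", "news_hours"])
print(model.predict(new_case))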
big data example: fake news on Twitter
tweet-based analysis:
- sample of Twitter users with voter registration
- focus on Tweets with links to political websites
- using list of ‘fake news’ websites to classify/code tweets
- calculate share of 'fake news' links by Tweet exposure & posts
findings = (extreme) left less likely to be exposed to fake news than (extreme) right
*superconsumers + supersharers = high exposure
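A hypothetical sketch of the coding step described above: each tweeted link is coded against a list of 'fake news' domains and the share of fake-news links is computed per user. The domain list and tweets are invented stand-ins, not the study's actual data.

```python
# Hypothetical coding sketch -- domains and tweets are made up.
from urllib.parse import urlparse
import pandas as pd

FAKE_NEWS_DOMAINS = {"fakesite.example", "hoaxnews.example"}   # stand-in list

tweets = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b"],
    "url":  [
        "https://fakesite.example/story1",
        "https://nytimes.com/article",
        "https://hoaxnews.example/story2",
        "https://bbc.co.uk/news",
        "https://bbc.co.uk/sport",
    ],
})

# code each link: 1 if its domain is on the fake-news list, else 0
tweets["is_fake"] = tweets["url"].apply(
    lambda u: int(urlparse(u).netloc in FAKE_NEWS_DOMAINS)
)

# share of fake-news links per user (posting / exposure share)
print(tweets.groupby("user")["is_fake"].mean())
```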
big data: ethical questions + recommendations
ownership
- personal vs commercial (who decides about informed consent)
- public vs private information
ethical principles
- minimizing harm
potential benefits (public interest) vs potential harm (rights, reputation, money)
- informed consent
retroactively vs in advance
- protecting privacy and confidentiality
intent and expectation of users/sources is crucial
(rule of thumb: people don't think about whether their information will be used)
big data ethics recommendations
5
- anonymization of personal data
!quoting specific posts can still lead to identification
pseudonymization: remove identifiers + store them separately in case you want to be more transparent later (see the sketch below)
- data minimization (only collect what you really need)
- data encryption (not everyone can access)
- secure storage
- arrangements that enable data subjects to exercise their fundamental rights = ways for people to agree
(e.g. direct access to their personal data and consent to its use or transfer)
(e.g. if you remove data from the site, then it would also be removed from the data available for research)
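A minimal sketch of the pseudonymization idea mentioned above: replace identifiers with random pseudonyms in the working dataset and keep the lookup table separately (and securely). Function and field names are made up for illustration.

```python
# Minimal pseudonymization sketch -- names and fields are hypothetical.
import secrets

def pseudonymize(records, id_field="username"):
    """Replace identifiers with random pseudonyms; return data + lookup table."""
    lookup = {}                                # pseudonym -> real identifier
    for rec in records:
        pseudo = "p_" + secrets.token_hex(4)
        lookup[pseudo] = rec[id_field]
        rec[id_field] = pseudo
    return records, lookup

data = [{"username": "jane_doe", "post": "example post"}]
pseudonymized, lookup = pseudonymize(data)
print(pseudonymized)   # identifiers removed from the working dataset

# 'lookup' is kept only so data subjects can later exercise their rights
# (e.g. ask for their data to be removed) -- store it encrypted/access-restricted.
```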
secondary data
= data collected by other researchers/institutions & made available, often at no cost, e.g. in data archives
trade-off
- quick & convenient access to data
- lack of control & constraints on measurement
e.g. questions not exactly phrased the way you want
assessment of validity & reliability
- during data analysis: the researcher needs to assess reliability and validity while analyzing the data
!this is necessary if you use the data beyond the purpose it was created for
Ethical principles still apply -> only informed consent is not required (already given)
- still ethics review necessary (esp. for privacy and confidentiality)
data management
- enter your own data or import existing (numerical) data into a database/spreadsheet
typical structure =
- columns = variables/categories/dimensions
- rows = data for individual cases (e.g. each participant)
!!documentation and archiving = make a BACKUP
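A small sketch of that typical structure in code: columns hold the variables, each row holds one case, and a backup copy is written out. All data and file names are invented.

```python
# Sketch of the typical data structure (invented data and file names).
import pandas as pd

data = pd.DataFrame({
    "participant_id": [1, 2, 3],     # columns = variables/categories/dimensions
    "age":            [21, 34, 29],
    "trust_score":    [4, 2, 5],
})
# each row = the data for one individual case (here: one participant)

data.to_csv("survey_data.csv", index=False)          # working copy
data.to_csv("survey_data_backup.csv", index=False)   # !!always keep a backup
```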
levels of measurement
4
- nominal = mutually exclusive, equivalent, and exhaustive categories
e.g. gender, ethnicity, religion, countries
- ordinal = rank ordering (with/without ties) on some dimension
e.g. agreement scale, evaluation, arbitrary intervals (e.g. time)
!intervals are arbitrary, but can be ranked
- interval = precise measurement units with arbitrary zero point
- ratio = precise measurement units with absolute zero point
interesting: statistical analysis usually assumes interval-level data, but most data is ordinal (e.g. agreement scales in surveys)
-> ordinal scales are sometimes treated as interval scales (requires the 'heroic' assumption that the intervals between units are equal)
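A tiny numeric illustration of that 'heroic' assumption: averaging a 5-point agreement scale treats the distance between categories as equal. The responses are made up.

```python
# Averaging a 5-point agreement scale (1 = strongly disagree ... 5 = strongly agree)
# treats it as interval data, i.e. assumes the step 1->2 equals the step 4->5.
responses = [1, 2, 2, 4, 5, 5, 3]                      # made-up Likert answers

mean_score = sum(responses) / len(responses)           # interval-level treatment
print(round(mean_score, 2))                            # 3.14

# the strictly ordinal alternative: report a rank-based summary (the median)
print(sorted(responses)[len(responses) // 2])          # 3
```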
tables - how to do it
- meaningful title
- self-explanatory labels
- consistent number format
- (total) number of cases
- data source(s)
Democratic Peace Theory
table 1: Wars by type of regime and time period
columns: wars 1800-1939, 1940-2010, totals
rows: dem-dem, dem-aut, aut-aut, total
chart types
- bar charts = best choice for cross-sectional data
- line charts = best choice for time series data
- pie charts = almost always a bad choice
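A small matplotlib sketch of that rule of thumb: a bar chart for a cross-sectional comparison, a line chart for a time series. All numbers are invented for illustration.

```python
# Rule-of-thumb sketch: bar chart for cross-sectional data, line chart for
# time-series data (all numbers invented).
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# comparison across categories at one point in time -> bar chart
countries = ["NL", "DE", "FR"]
turnout = [78, 76, 68]
ax1.bar(countries, turnout)
ax1.set_title("Turnout by country (bar)")

# development over time -> line chart
years = [2010, 2014, 2018, 2022]
trust = [55, 52, 48, 45]
ax2.plot(years, trust, marker="o")
ax2.set_title("Trust over time (line)")

plt.tight_layout()
plt.show()
```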
scatterplots
= two interval scales
scatterplotting data is a good idea: it visualizes numerical trends
different scatterplot patterns can have the same linear regression line
the same regression line can turn out in the scatterplot to reflect:
- linear relationship
- non-linear relationship
- linear relationship with an outlier
- no relationship
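A sketch of this point using two datasets from Anscombe's quartet: both have (almost) exactly the same regression line, but the scatterplots reveal a linear vs a clearly non-linear relationship, which is why plotting the data matters.

```python
# Anscombe's quartet, datasets I and II: same regression line (~ y = 3 + 0.5x),
# very different scatterplots.
import numpy as np
import matplotlib.pyplot as plt

x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, y, label in [(axes[0], y1, "linear"), (axes[1], y2, "non-linear")]:
    slope, intercept = np.polyfit(x, y, 1)        # fit the regression line
    ax.scatter(x, y)
    xs = np.linspace(x.min(), x.max(), 100)
    ax.plot(xs, slope * xs + intercept)
    ax.set_title(f"{label}: y = {intercept:.1f} + {slope:.2f}x")

plt.tight_layout()
plt.show()
```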
inferential statistics basic idea + key assumptions
= making inferences from sample to population
- sampling from complete target population
- SRS (simple random sampling) with perfect response rate (or non-response completely random)
- no nonsampling error
statistical significance
a sample is never 100% representative; there is always some error and we don't know how much -> but we can make informed guesses
rule of thumb =
- observed difference/relationship > random/sampling error
- Statistical tests are “mechanical” tests, not a substitute for substantive interpretation
sampling error = difference between different samples
the sampling distribution tells us whether the observed difference/mean in a single sample is likely just random sampling error or whether it represents a real, meaningful difference
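A small simulation of that idea: drawing many random samples from the same (invented) population shows how much sample means vary by chance alone; that spread is the sampling distribution against which an observed difference is judged.

```python
# Simulation sketch -- the population parameters are invented.
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50, scale=10, size=100_000)   # invented population

# draw many samples and record each sample's mean
sample_means = [rng.choice(population, size=200, replace=False).mean()
                for _ in range(1_000)]

print("population mean:", round(population.mean(), 2))
print("spread of sample means (standard error):", round(np.std(sample_means), 2))

# an observed difference is 'significant' when it is clearly larger than what
# this random sampling error alone would typically produce
```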
significance testing: issues and problems
4
- statistical vs practical significance
e.g. you could have the same statistic/correlation, but a small sample -> large random error -> not significant
with a large sample -> small random error -> highly significant
-> everything becomes significant in large-N studies (see the sketch below)
- population data (e.g. all countries) & significance testing
(view 1: no sample -> no sampling error -> no inferential statistics)
(view 2: the population is not static but changes over time, so (even) a single census is just one cross-sectional snapshot)
- non-probability samples, e.g. internet panels
increasingly hard to get probability samples (low response rates) -> a lot of research now runs via internet panels = not representative -> hard to fix with statistics
- publication bias & significance testing
(null findings are boring -> do not get published -> what you read in journals is not representative (it ignores studies that 'failed'))
solution: pre-registration, so that 'failed' studies show up
a lot of significance depends on the size of the sample
discussion of results should focus on: pattern of results + substantive meaning (effects shouldn't be so small that they are only meaningful statistically)
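A sketch of the statistical-vs-practical point referenced above: the same small difference between two groups is typically not significant with a small N but highly significant with a large N. The group means and sample sizes are invented.

```python
# Same (small) true difference, different sample sizes -- invented numbers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_value(n):
    group_a = rng.normal(loc=50.0, scale=10.0, size=n)
    group_b = rng.normal(loc=51.0, scale=10.0, size=n)   # tiny true difference
    return stats.ttest_ind(group_a, group_b).pvalue

print("n = 30:    p =", round(p_value(30), 3))       # usually not significant
print("n = 10000: p =", round(p_value(10_000), 6))   # usually highly significant
# which is why interpretation should focus on the substantive size of the effect
```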
outlook: science in the ‘‘post-truth’’ era
contemporary threats to science
- external: distrust of scientific research, accusations of ‘fake science’ (people make up data)
- internal: wrong incentives, quantity over quality (publications), production of ‘fake science’
outlook =
- correct incentives
- counter ‘fake news’
- commitment to and practice of good science, based on ethical principles and scientific integrity (and through education)