lecture 4 (1) Flashcards
What are data-driven studies good for?
Classification
Prediction
Causation
Classification
Identifying which of a set of categories an observation belongs to
E.g. is this email spam or not
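A minimal sketch of the spam example, assuming a hand-written keyword rule. The word list, the threshold of two hits, and the function name `classify_email` are all hypothetical and chosen only for illustration; real spam filters typically learn such rules from labelled data.

```python
# Hypothetical list of suspicious words, chosen only for illustration.
SPAM_WORDS = {"free", "winner", "prize"}

def classify_email(text: str) -> str:
    """Assign an email to one of two categories: 'spam' or 'not spam'."""
    # Normalise: split on whitespace, strip punctuation, lowercase.
    words = {w.strip(".,!?:;").lower() for w in text.split()}
    # Assumed rule: two or more suspicious words means spam.
    hits = len(words & SPAM_WORDS)
    return "spam" if hits >= 2 else "not spam"

print(classify_email("You are a WINNER! Claim your free prize now"))
print(classify_email("Lecture 4 notes are attached"))
```

The point is only that the output is always one of a fixed set of categories.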
Prediction
Making projections (not necessarily about the future) about the possible values of a target variable
E.g. What is the next word you will write, given the words you have already written
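A minimal sketch of the next-word example, assuming a simple bigram count: predict the word that most often followed the current word in some training text. The corpus and the names `train_bigrams` and `predict_next` are hypothetical; this is one simple way to predict a target variable, not how any particular system does it.

```python
from collections import Counter, defaultdict

def train_bigrams(text: str) -> dict:
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(following: dict, word: str):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Toy corpus, made up for illustration.
model = train_bigrams("the cat sat on the mat and the cat slept")
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```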
Causation
Making inferences about whether and how certain variables causally affect other variables
E.g. how do the eating habits in my town causally impact the inflow of patients at the hospital
Data science
The study and application of algorithmic, statistical and mathematical techniques for data mining
Data mining
The art of finding useful patterns in very large sets of data (similar to “exploratory data analysis” in statistics)
Data
Public records produced by sensory observation or by some measuring device
E.g. your clicks and searches online, your choices at the supermarket, etc.
Algorithm
An explicit set of step-by-step instructions for answering some question, or for performing some task
E.g. for tying your shoes, for making spaghetti alla carbonara, for multiplying two numbers, etc.
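The "multiplying two numbers" example can be written out as explicit steps. This is a minimal sketch, assuming non-negative integers and using repeated addition, which is one of several possible multiplication algorithms; the function name `multiply` is chosen for illustration.

```python
def multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers by repeated addition."""
    if a < 0 or b < 0:
        raise ValueError("this sketch only handles non-negative integers")
    total = 0            # Step 1: start from zero.
    for _ in range(b):   # Step 2: repeat b times...
        total += a       # ...adding a to the running total.
    return total         # Step 3: the total is the product.

print(multiply(6, 7))  # 42
```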
example from text (autonomous cars)
If an autonomous car hits a person, who is responsible?
When does data count as big data?
Large (=size of the files used to archive and distribute data) datasets
(typically) in digital format
Efficiently (=in a reasonable amount of time) analyzable with computational standards
How does data become big?
Data-sharing and data-producing practices are supported by political and economic interests
– Before the 19th century: data were mostly private, gathered and used by scientists (e.g., astronomical data) or the administrators of a state (e.g., demographics).
– 1900s: International institutions (such as the UN) gathering and spreading information on health, employment, migration, etc. to base policy on.
– 2000s: Corporations (e.g., Google, Amazon, TikTok) creating and controlling data left by billions of people on the Internet.
Why should we care about big data?
Ask yourself: Would you mind if I downloaded the content of your phone and used it for purposes you may not be aware of? Would you mind if I used everything you wrote online to train ChatGPT? Would you mind if you had to run a study on nutrition and healthcare but could not access or use the relevant data (because of privacy, or because you do not have the right computing infrastructure)?
One COMMON MISUNDERSTANDING of the nature of data science:
Data reflects objective truth
Data should be
Found and stored
E.g. subject to various legal and IT constraints, including privacy and data-protection laws, rules on the commercialization of data collection and distribution, and the availability of suitable technology
Data should be analysed, and all tools for data analysis make assumptions (about the statistical structure of the dataset, about how to weight different sources of data)
That “data reflects objective truth” could mean that data is simply “out there in the world uncontaminated”, free from human biases and theorising
But this understanding is misguided because of
Fake data (e.g. (chat) bots)
Incomplete data (e.g. missing records)