Lecture 11 – Issues Flashcards
What is meant when talking about “human-in-the-loop” in data science?
Automated systems may speed up processes, but humans are better at understanding context and should be involved in designing, understanding, and reviewing the data science process
Name different ways bias can be introduced into a data science project
Bias of design
- are the variables appropriate for all situations being modeled?
- are assumptions being made about the stakeholders the data relates to?
Bias of data
- regional
- undertested in varied contexts
- gender
- ethnicity/race
Explain data management and data governance
What are the stages of the CSIRO research data lifecycle?
Explain the Capability Maturity Model
- Good management happens all through the data lifecycle
- 4 key process areas:
➡ Data acquisition, processing and quality assurance
Goal: Reliably capture and describe scientific data in a way that facilitates preservation and reuse
➡ Data description and representation
Goal: Create quality metadata for data discovery, preservation, and provenance functions
➡ Data dissemination
Goal: Design and implement interfaces for users to obtain and interact with data
➡ Repository services/preservation
Goal: Preserve collected data for long-term use
- Good data governance uses a good management system
➡ A mature system manages data all through the data lifecycle and throughout all projects.
What is linked data?
How does semantic web work?
Name a format for linked (open) data and explain it
Explain ethics in data science
Ethics is the moral handling of data –> e.g. don’t sell private data to scammers
People have rights (privacy, access, erasure, etc.)
Companies have rights (ownership, confidentiality, intellectual property, copyright)
confidentiality vs. privacy:
privacy: I decide what happens with my data
confidentiality: my data is kept and handled the way I decided
Companies and governments build business models on data
–> data as a valuable asset
–> data as a valuable product
Breaking it down:
What can you do?
What should you do?
How can you make sure the right things are done?
Surveillance
Australian government
My.gov.au gives the public access to their data
➡ Greater dependency on online interfaces
➡ Less pen and paper data processing
➡ More automation of processing
➡ Cf. RoboDebt, Census
- Less clear what access each government can have to the data
(Australian) Data retention laws
* “require some telecommunications service providers to retain specific telecommunications data (the data set) relating to the services they offer for at least 2 years”
➡ Who talks to whom on the phone & when
➡ Who emails whom & when
➡ The IP address
- What doesn’t it include?
➡ information about telecommunications content or web browsing history
- Who has access to the data without a warrant?
➡ 20 intelligence agencies, criminal law enforcement agencies, ATO, ASIC and ACCC
➡ Civil litigation exemption
Data retention laws - issues
Rights vs functionality
* Change in responsibilities
➡ Change in processes and technology in response
- Where do automation and AI fit?
➡ Where is the responsibility and accountability?
➡ Snowden and the NSA surveillance
AI veracity
Can you trust the analysis?
Various factors can affect the “accuracy” of any analysis
➡ Data quality
➡ Choice of analysis
➡ Design of analysis
➡ Choice of data
- It is easy for the modelling to misrepresent what the data is supposed to reflect.
➡ Even statistical analysis can be biased!
AI veracity
What is meant by ‘bias of design’?
Not all bias is in the numbers
* Bias can also be in how you have designed the research
➡ Are the variables appropriate for all situations being modelled?
➡ Are assumptions made about the stakeholders who the data relates to?
➡ Are assumptions being made about the context of the data?
AI veracity
What is meant by ‘bias of data’?
Sometimes the data used to train a ML system is biased, regardless of its volume
➡ Narrow
➡ Regional
➡ Undertested in varied contexts
- A biased system may discriminate in its results, for instance by
➡ gender
➡ ethnic associations
➡ generalities
- A biased system may not be as accurate in its results for unfamiliar contexts and subjects
e.g. Google: shows ads for high-paying jobs to men more than to women
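One simple way to look for this kind of discrimination is to compare how often each group receives the favourable outcome. The sketch below is only an illustration: the records are made-up numbers (not real ad data), and the function name `positive_rate_by_group` is hypothetical.

```python
from collections import defaultdict

def positive_rate_by_group(records):
    """Share of favourable outcomes per group.

    records: iterable of (group, outcome) pairs, outcome is True/False.
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        if outcome:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}

# Hypothetical log: was a high-paying job ad shown to this user?
# These counts are invented for illustration only.
records = (
    [("men", True)] * 60 + [("men", False)] * 40
    + [("women", True)] * 30 + [("women", False)] * 70
)

rates = positive_rate_by_group(records)

# Ratio of the lowest to the highest rate: 1.0 means equal treatment,
# values well below 1.0 suggest one group is favoured.
ratio = min(rates.values()) / max(rates.values())

print(rates)
print(ratio)
```

A low ratio does not prove the system is biased on its own, but it is a signal that a human should review the data and the design.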
Sampling
What do you have to look out for when sampling populations?
When collecting data for processing, it has to be relevant
➡ Can you get all data relating to the scenario you are modelling?
➡ Can you only get a random sample of data?
The sample data has to be representative of the population being modelled
Observe the population before you make any unqualified assumptions
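A quick check of representativeness is to compare group proportions in the sample against those in the population. The sketch below assumes a hypothetical population of 70% urban and 30% rural records; the helper names `proportions` and `max_proportion_gap` are illustrative, not from the lecture.

```python
import random
from collections import Counter

def proportions(labels):
    """Share of each label in a list."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def max_proportion_gap(population, sample):
    """Largest absolute difference in group share between population and sample."""
    pop = proportions(population)
    samp = proportions(sample)
    return max(abs(pop[group] - samp.get(group, 0.0)) for group in pop)

# Hypothetical population, invented for illustration: 70% urban, 30% rural.
population = ["urban"] * 700 + ["rural"] * 300

# A random sample draws from the whole population.
rng = random.Random(0)  # fixed seed so the run is repeatable
random_sample = rng.sample(population, 100)

# A convenience sample (the first 100 records) only ever sees urban records.
convenience_sample = population[:100]

gap_random = max_proportion_gap(population, random_sample)
gap_convenience = max_proportion_gap(population, convenience_sample)

print(gap_random)
print(gap_convenience)
```

The convenience sample misses the rural group entirely, so its gap is much larger than that of the random sample.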