Lecture 11 – Issues Flashcards

1
Q

What is meant when talking about “human-in-the-loop” in data science?

A

Automated systems may speed up the process, but humans are better at understanding context and should be involved in designing, understanding, and reviewing the data science process

2
Q

Name different ways bias can be introduced into a data science project

A

Bias of design
- are the variables appropriate for all situations being modeled?
- assumptions made about the stakeholders the data relates to

Bias of data
- regional
- undertested in varied contexts
- gender
- ethnicity/race

3
Q

Explain data management and data governance

A
4
Q

What are the stages of the CSIRO research data lifecycle?

A
5
Q

Explain the Capability Maturity Model

A
  • Good management happens all through the data lifecycle
  • 4 key process areas:
    ➡ Data acquisition, processing and quality assurance
    Goal: Reliably capture and describe scientific data in a way that facilitates preservation and reuse
    ➡ Data description and representation
    Goal: Create quality metadata for data discovery, preservation, and provenance functions
    ➡ Data dissemination
    Goal: Design and implement interfaces for users to obtain and interact with data
    ➡ Repository services/preservation
    Goal: Preserve collected data for long-term use
  • Good data governance uses a good management system
    ➡ A mature system manages data all through the data lifecycle and throughout all projects.
6
Q

What is linked data?

A
7
Q

How does semantic web work?

A
8
Q

Name a format for linked (open) data and explain it

A
9
Q

Explain ethics in data science

A

Ethics is the moral handling of data –> e.g. don’t sell private data to scammers

People have rights (privacy, access, erasure, etc.)

Companies have rights (ownership, confidentiality, intellectual property, copyright)

confidentiality vs. privacy:
privacy: I decide what happens with my data
confidentiality: is my data kept as I decided?

Companies and governments build business models on data
–> data as a valuable asset
–> data as a valuable product

Breaking it down:
What can you do?
What should you do?
How can you make sure the right things are done?

10
Q

Surveillance

Australian government

A

My.gov.au provides the public with access to their data
➡ Greater dependency on online interfaces
➡ Less pen and paper data processing
➡ More automation of processing
➡ Cf. RoboDebt, Census

  • Less clear what access each government can have to the data

(Australian) Data retention laws
* “require some telecommunications service providers to retain specific telecommunications data (the data set) relating to the services they offer for at least 2 years”

➡ Who talks to whom on the phone & when
➡ Who emails whom & when
➡ The IP address

  • What doesn’t it include?
    ➡ information about telecommunications content or web browsing history
  • Who has access to the data without a warrant?
    ➡ 20 intelligence agencies, criminal law enforcement agencies,
    ATO, ASIC and ACCC
    ➡ Civil litigation exemption
11
Q

Data retention laws - issues

A

Rights vs functionality
* Change in responsibilities
➡ Change in processes and technology in response

  • Where does automation and AI fit?
    ➡ Where is the responsibility and accountability?
    ➡ Snowden and the NSA surveillance
12
Q

AI veracity

Can you trust the analysis?

A

Various factors can affect the “accuracy” of any analysis
➡ Data quality
➡ Choice of analysis
➡ Design of analysis
➡ Choice of data

  • It is easy for the modelling to misrepresent what the data is supposed to reflect.
    ➡ Even statistical analysis can be biased!
13
Q

AI veracity

What is meant by ‘bias of design’?

A

Not all bias is in the numbers
* Bias can also be in how you have designed the research
➡ Are the variables appropriate for all situations being modelled?
➡ Are assumptions made about the stakeholders who the data relates to?
➡ Are assumptions being made about the context of the data?

14
Q

AI veracity

What is meant by ‘bias of data’?

A

Sometimes the data used to train a ML system is biased, regardless of its volume
➡ Narrow
➡ Regional
➡ Undertested in varied contexts

  • A biased system may discriminate in its results, for instance by
    ➡ gender
    ➡ ethnic associations
    ➡ generalities
  • A biased system may not be as accurate in its results for unfamiliar contexts and subjects

e.g. Google: shows ads for high-paying jobs to men more often than to women

15
Q

Sampling

What do you have to look out for when sampling populations?

A

When collecting data for processing, it has to be relevant

➡ Can you get all data relating to the scenario you are modelling?
➡ Can you only get a random sample of data? The sample data has to be representative of the population being modelled

Observe the population before you make any unqualified assumptions
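The representativeness point can be sketched with a toy population (hypothetical numbers, not from the lecture): a convenience sample drawn from one region only misestimates the population mean, while a simple random sample across the whole population stays close to it.

```python
import random
from statistics import mean

# Hypothetical population: two regions with different values
population = [10] * 800 + [20] * 200   # true mean = 12

# Convenience sample: only the first region is surveyed (regional bias)
convenience = population[:100]

# Simple random sample drawn across the whole population
random.seed(42)  # fixed seed so the sketch is reproducible
srs = random.sample(population, 100)

print(f"population mean : {mean(population)}")   # 12
print(f"convenience mean: {mean(convenience)}")  # 10, systematically biased
print(f"random mean     : {mean(srs)}")          # close to 12
```

The convenience sample is precise but wrong; no amount of extra data from the same region fixes the bias.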

16
Q

A/B testing and significance testing

A
  • Blind experiments or A/B testing may be used to show if a relationship exists between variables
  • The experimental scenario needs to be divided into:
    ➡ A: Sample is subject to the known variable
    ➡ B: Sample is not subject to the known variable (the Control set)

Must test for statistical significance
➡ p-value: the probability (0 to 1) of a result at least this "surprising" => considers how likely you could get the same results regardless of the hypothesis
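A minimal sketch of the significance check for an A/B test, using a two-proportion z-test with the normal approximation and the standard library (the conversion counts below are illustrative, not from the lecture):

```python
import math

def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value for the difference between two proportions
    (normal approximation with a pooled standard error)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

# A: sample shown the change, B: control set (hypothetical counts)
p = two_proportion_p_value(120, 1000, 90, 1000)
print(f"p-value = {p:.4f}")  # small p => the A/B difference is unlikely by chance alone
```

A small p-value (commonly below 0.05) suggests the observed difference between A and B is unlikely under the null hypothesis of no effect; identical proportions give a p-value of 1.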