Data Science Flashcards

1
Q

Explain the Standard Value Chain: Parts of a Data Science Project

A

* Collection: getting the data;
* Wrangling: data preprocessing, cleaning;
* Analysis: discovery (learning, visualisation, etc.);
* Presentation: arguing the case that the results are significant and useful;
* Engineering: storage and computational resources across full lifecycle;
* Governance: overall management of data across full lifecycle;
* Operationalisation: putting the results to work, so as to gain benefits or value.

2
Q

Name four roles (entities) in a data science project and explain how each handles collection, engineering, wrangling and analysis

A

Business analyst:
Collection: copy and paste into Excel
Engineering: use Excel to store and retrieve
Wrangling: use Excel functions, VBA
Analysis: charts

Programmer:
Collection: web APIs, scraping, database queries
Engineering: flat files
Wrangling: Python and Perl, etc.
Analysis: Matplotlib in Python, R

Enterprise:
Collection: application databases, intranet files, server logs
Engineering: Teradata, Oracle, MS SQL Server
Wrangling: Talend, Informatica
Analysis: Cognos, Business Objects, SAS, SPSS

Web Company:
Collection: application databases, server logs, crawl data
Engineering: Hadoop/Hive, Flume, HBase
Wrangling: Pig, Oozie
Analysis: dashboards, R

3
Q

Which of the following definitions is TRUE about data science?

Data Science is:
A. machine learning on big data
B. extraction of knowledge/value from data through the complete
data lifecycle process
C. almost everything that has something to do with data:
collecting, analyzing, modeling, etc, yet the most important
part is its applications - all sorts of applications
D. All of the options

A

D. All of the options

4
Q

What is machine learning?

A

Machine Learning is concerned with the development of algorithms and techniques that allow computers to learn.

  • concerned with building computational artifacts,
  • the underlying theory is statistics
5
Q

Why use Machine Learning?

A

Machine learning is useful when:
  • Human expertise is not available, e.g., Martian exploration
  • Humans cannot explain their expertise (as a set of rules), or their explanation is incomplete and needs tuning, e.g., speech recognition
  • Many solutions need to be adapted automatically, e.g., user personalisation
  • The situation changes over time, e.g., junk email
  • There are large amounts of data, e.g., discovering astronomical objects
  • Humans are expensive to use for the work, e.g., handwritten zip code recognition
6
Q

Data Scientists vs. Data Engineers

A
7
Q

What is data science?

(L2)

A

Data science is about

  • technology for working with data
  • processes for working with data
  • getting value from data

in a way that is effective and consistent.

8
Q

Evolution of Data Science …

(L2)

A
9
Q

Explain Gartner’s Hype Cycle
(L2)

A

Gartner’s Hype Cycle attempts to quantify the level of maturity of various technologies.

10
Q

Hype Cycle in 2014
(L2)

A
11
Q

Hype Cycle for Analytics and Business Intelligence in 2019
(L2)

A
12
Q

Hype Cycle for Data Science and Machine Learning 2021

(L2)

A
13
Q

Relationship of Data Science to Other Disciplines

(L2)

A

Related: Data Analysis

Performing analysis and understanding results
* e.g. R, Tableau, Weka, Microsoft Azure Machine Learning, …
* machine learning, computational statistics, visualisation, …

Related: Data Engineering
Building scalable systems for storing and processing data
* e.g. Amazon Web Services, Teradata, Hadoop, …
* databases, distributed processing, data lakes, cloud computing, GPUs, wrangling, …

Related: Data Management
Managing data through its lifecycle
* e.g. ANDS, Talend, Master Data Management, …
* ethics, privacy, provenance, curation, backup, governance, …

14
Q

Impact of Data Science

Name three examples of how data science is impacting other areas:
(L2)

A

Your Life on the Cloud
–> datafication of you

Science and social good
… scientific method holds true, but broadens
technology

Futurology
… healthcare and automobiles

15
Q

Impact of Data Science

What is meant by “Your Life on the Cloud”?

(L2)

A

Our personal information is increasingly stored in the cloud:
* social life (Facebook),
* career (LinkedIn),
* search history (Google, etc.),
* health and medical (Fitbit, TBD),
* music (Apple), …

This provides many advantages:
* e.g. personal agents, computerised support for health,

but also some disadvantages:
* e.g. security and privacy breaches
* corporate leakage to government (security, tax, etc.)
* what if you don’t have rights to access/delete data?
* the department of pre-crime (e.g., having recidivism)
* corporate mergers
* “the science is settled” and government mandates

16
Q

Impact of Data Science

Scientific Method and Data Science:
What is the relationship?

How does Data Science affect the Scientific Method?

(L2)

A

“The end of theory?”

Science:
- is largely driven by laborious studies to find complex causal models
- aims to find an explanation that can be used for future prediction

Data Science:
- No semantic or causal analysis required

Example:
–> When Google is delivering an advert, it doesn’t need to be right, it just needs a good guess
–> Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough.

17
Q

Impact of Data Science

Data Science for Social Good examples

(L2)

A

Identifying Factors Driving School Dropout and Improving the Impact of Social Programs in El Salvador

Predicting long-term unemployment in Portugal

Quantifying the stability of society

Detecting and localising an earthquake from cell call data

18
Q

Impact of Data Science
Futurology

(L2)

A

Health Care
* Your stomach can be instrumented to assess contents, nutrients, etc.
* Your bloodstream can be instrumented to assess insulin levels, etc.
* Your “health” dashboard can be online and shared by your GP
* Health management organisations (HMO) tying funding levels to
patient care performance
* GP/HMO will know about your ice cream/beer binge last night and
you missing your morning run

Automobile Futurology
Self-driving cars:
* how does the city replace traffic fine revenue?
* can you drink and drive if the car is automatic?
* what happens to the taxi industry?
* what happens to the auto insurance industry?
* what happens to people still “self” driving, and their insurance?

19
Q

What is data wrangling?

(L2)

A

Data Wrangling is the process of transforming “raw” data into data that can be analysed to generate valid, actionable results and insights.

Needed when:
- there are mistakes in the data (e.g. missing or incorrect values, outliers)
- there is too much data (e.g. filter)
- data sets need to be combined (e.g. merge)
- data needs to be discretised
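
The cases above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample records, the plausible age range of 0 to 120, the `cities` lookup and the `age_band` cut-offs are all hypothetical and do not come from the lecture.

```python
# Hypothetical raw records: one has a missing value, one an implausible outlier.
readings = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 250},
    {"id": 4, "age": 71},
]
# Hypothetical second data set, keyed by the same id.
cities = {1: "Perth", 3: "Hobart", 4: "Darwin"}

# Mistakes / too much data: filter out missing and implausible ages.
clean = [r for r in readings if r["age"] is not None and 0 <= r["age"] <= 120]

# Combination of data sets: merge in the city for each remaining record.
merged = [{**r, "city": cities.get(r["id"])} for r in clean]


def age_band(age):
    """Discretise a numeric age into a coarse category (assumed cut-offs)."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


# Discretisation: add a categorical column derived from the numeric one.
wrangled = [{**r, "age_band": age_band(r["age"])} for r in merged]
print(wrangled)
```

In practice a library such as pandas would usually do this work; the plain-Python version only makes each step visible.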

20
Q

Tidy data: Gather

(L2)

A
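
A minimal sketch, assuming "gather" refers to the tidyr-style operation that reshapes wide data into long form (pandas offers a similar operation called `melt`). The helper `gather` and the sample table are hypothetical illustrations, not taken from the lecture.

```python
def gather(rows, id_cols, key, value):
    """Turn every non-id column into its own row holding (key, value)."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col in id_cols:
                continue
            new_row = {c: row[c] for c in id_cols}
            new_row[key] = col    # the old column name becomes a value
            new_row[value] = val  # the old cell becomes a value
            long_rows.append(new_row)
    return long_rows


# Wide table: the year is hidden in the column names.
wide = [
    {"country": "A", "1999": 10, "2000": 20},
    {"country": "B", "1999": 30, "2000": 40},
]

long_rows = gather(wide, id_cols=["country"], key="year", value="cases")
for r in long_rows:
    print(r)
```

After gathering, each row holds one observation (country, year, cases), which is the layout most analysis tools expect.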
21
Q

Tidy data: Spread

(L2)

A
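
A minimal sketch, assuming "spread" refers to the tidyr-style inverse of gather, which reshapes long data into wide form (pandas offers a similar operation called `pivot`). The helper `spread` and the sample rows are hypothetical illustrations.

```python
def spread(rows, key, value):
    """Turn the distinct values found in `key` into columns filled from `value`."""
    wide = {}
    for row in rows:
        # Everything except the key/value pair identifies the output row.
        ids = tuple((c, v) for c, v in row.items() if c not in (key, value))
        wide.setdefault(ids, dict(ids))[row[key]] = row[value]
    return list(wide.values())


# Long table: one row per (country, type) pair.
long_rows = [
    {"country": "A", "type": "cases", "count": 10},
    {"country": "A", "type": "population", "count": 200},
    {"country": "B", "type": "cases", "count": 30},
    {"country": "B", "type": "population", "count": 400},
]

wide_rows = spread(long_rows, key="type", value="count")
for r in wide_rows:
    print(r)
```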
22
Q

Tidy data: Separate

(L2)

A
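
A minimal sketch, assuming "separate" refers to the tidyr-style operation that splits one column into several on a separator. The helper `separate`, the `rate` column and the "/" separator are hypothetical illustrations.

```python
def separate(rows, col, into, sep):
    """Split the string in `col` on `sep` into the new columns named in `into`."""
    out = []
    for row in rows:
        parts = row[col].split(sep)
        if len(parts) != len(into):
            raise ValueError(f"expected {len(into)} parts, got {len(parts)}")
        new_row = {c: v for c, v in row.items() if c != col}
        new_row.update(zip(into, parts))
        out.append(new_row)
    return out


# One column packs two variables together as text.
rows = [
    {"country": "A", "rate": "10/200"},
    {"country": "B", "rate": "30/400"},
]

separated = separate(rows, col="rate", into=["cases", "population"], sep="/")
for r in separated:
    print(r)
```

The new columns are still strings here; converting them to numbers would be a further wrangling step.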