Lecture 1 & 2 – Overview & Impact Flashcards
Explain the Standard Value Chain: Parts of a Data Science Project
* Collection: getting the data;
* Wrangling: data preprocessing, cleaning;
* Analysis: discovery (learning, visualisation, etc.);
* Presentation: arguing the case that the results are
significant and useful;
* Engineering: storage and computational resources across
full lifecycle;
* Governance: overall management of data across full
lifecycle;
* Operationalisation: putting the results to work, so as to
gain benefits or value.
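The first four stages above can be sketched as a tiny pipeline. This is a minimal illustration under stated assumptions, not a prescribed implementation: the inline CSV data and the function names (collect, wrangle, analyse, present) are hypothetical and stand in for real sources and tools.

```python
import csv
import io
import statistics

# Hypothetical raw data standing in for a real source (file, API, database).
RAW_CSV = """city,temp_c
Melbourne,21.5
Sydney,
Brisbane,27.0
Perth,24.5
"""

def collect(source):
    """Collection: read the raw records."""
    return list(csv.DictReader(io.StringIO(source)))

def wrangle(rows):
    """Wrangling: drop rows with missing values and convert types."""
    return [
        {"city": r["city"], "temp_c": float(r["temp_c"])}
        for r in rows
        if r["temp_c"]
    ]

def analyse(rows):
    """Analysis: compute a simple summary statistic."""
    return statistics.mean(r["temp_c"] for r in rows)

def present(mean_temp, n):
    """Presentation: state the result in a readable form."""
    return f"Mean temperature across {n} cities: {mean_temp:.1f} C"

clean = wrangle(collect(RAW_CSV))
report = present(analyse(clean), len(clean))
print(report)
```

Engineering, governance and operationalisation span the whole lifecycle, so they are not shown as single functions here.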
Name four roles (entities) in a data science project and explain the tools each uses for collection, engineering, wrangling and analysis
Business analyst:
Collection: copy and paste into Excel
Engineering: use Excel to store and retrieve
Wrangling: use Excel functions, VBA
Analysis: charts
Programmer:
Collection: web APIs, scraping, database queries
Engineering: flat files
Wrangling: Python and Perl, etc.
Analysis: Matplotlib in Python, R
Enterprise:
Collection: application databases, intranet files, server logs
Engineering: Teradata, Oracle, MS SQL Server
Wrangling: Talend, Informatica
Analysis: Cognos, Business Objects, SAS, SPSS
Web Company:
Collection: application databases, server logs, crawl data
Engineering: Hadoop/Hive, Flume, HBase
Wrangling: Pig, Oozie
Analysis: dashboards, R
Which of the following definitions is TRUE about data science?
Data Science is:
A. machine learning on big data
B. extraction of knowledge/value from data through the complete
data lifecycle process
C. almost everything that has something to do with data:
collecting, analyzing, modeling, etc, yet the most important
part is its applications - all sorts of applications
D. All of the options
D. All of the options
What is machine learning?
Machine Learning is concerned with the development of algorithms and techniques that allow computers to learn.
- concerned with building computational artifacts,
- the underlying theory is statistics
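As a small illustration of "learning" from data, the sketch below fits a straight line by ordinary least squares and then predicts an unseen case. The data set (hours studied against exam score) and the names fit_line and predict are hypothetical, chosen only for this example; real projects would normally use a library rather than hand-written code.

```python
# Hypothetical training data: hours studied -> exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Learning": estimate the parameters from the examples.
slope, intercept = fit_line(hours, scores)

def predict(x):
    """Apply the learned parameters to a new input."""
    return slope * x + intercept

print(predict(6.0))
```

The program is never told the rule relating hours to scores; it estimates the parameters from examples, which is the statistical core of the definition above.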
Why use Machine Learning?
Machine learning is useful when:
- Human expertise is not available, e.g., Martian exploration
- Humans cannot explain their expertise (as a set of rules), or their explanation is incomplete and needs tuning, e.g., speech recognition
- Many solutions need to be adapted automatically, e.g., user personalisation.
- Situation changes over time, e.g., junk email
- There are large amounts of data, e.g., discovering astronomical objects
- Humans are expensive to use for the work, e.g., handwritten zip code recognition
Data Scientists vs. Data Engineers
What is data science?
(L2)
Data science is about
- technology for working with data
- processes for working with data
- getting value from data
in a way that is effective and consistent.
Evolution of Data Science …
(L2)
Explain Gartner’s Hype Cycle
(L2)
Gartner’s Hype Cycle attempts to quantify the level of maturity of various technologies.
Hype Cycle in 2014
(L2)
Hype Cycle for Analytics and Business Intelligence in 2019
(L2)
Hype Cycle for Data Science and Machine Learning 2021
(L2)
Relationship of Data Science to Other Disciplines
(L2)
Related: Data Analysis
Performing analysis and understanding results
* e.g. R, Tableau, Weka, Microsoft Azure Machine Learning, …
* machine learning, computational statistics, visualisation, …
Related: Data Engineering
Building scalable systems for storage, processing data
* e.g. Amazon Web Services, Teradata, Hadoop, …
* databases, distributed processing, data lakes, cloud computing, GPUs, wrangling, …
Related: Data Management
Managing data through its lifecycle
* e.g. ANDS, Talend, Master Data Management, …
* ethics, privacy, provenance, curation, backup, governance, …
Impact of Data Science
Name three examples of how data science is having an impact:
(L2)
Your Life on the Cloud
-> datafication of you
Science and social good
… scientific method holds true, but broadens
technology
Futurology
… healthcare and automobiles
Impact of Data Science
What is meant by “Your Life on the Cloud”?
(L2)
Our personal information is increasingly stored in the cloud:
* social life (Facebook),
* career (LinkedIn),
* search history (Google, etc.),
* health and medical (Fitbit, TBD),
* music (Apple), …
This provides many, many, many advantages:
* e.g. personal agents, computerised support for health,
but also some disadvantages:
* e.g. security and privacy breaches
* corporate leakage to government (security, tax, etc.)
* what if you don’t have rights to access/delete data?
* the department of pre-crime (e.g., having recidivism)
* corporate mergers
* “the science is settled” and government mandates