Tools for Data Science Flashcards

IBM Data Science Professional Certificate (Course 2/10)

1
Q

Do I Visit Beautiful Destinations Most Autumns?

What data science categories does raw data need to pass through before it is deemed useful?

A
  • Data management
  • Data integration and transformation
  • Data visualisation
  • Model building
  • Model deployment
  • Model monitoring and assessment
2
Q

Do Cats Dance Exquisitely Everywhere?

What tools are used to support the tasks performed in the Data Science Categories?

A
  • Data Asset Management
  • Code Asset Management
  • Development Environments
  • Execution Environments
3
Q

What is data management?

A

The process of collecting, persisting, and retrieving data securely, efficiently, and cost-effectively.

4
Q

Where is data collected from?

A

Many sources, including (but not limited to) Twitter, Flipkart, sensors, and the Internet.

5
Q

Where should you store data so it is available whenever you need it?

A

Store the data in persistent storage

6
Q

What is data integration and transformation?

A

The process of extracting, transforming, and loading data (ETL)

7
Q

Give examples of repositories where data is commonly distributed

A
  • Databases
  • Data cubes
  • Flat files
8
Q

During the extraction process, where should you save the extracted data?

A

It is common practice to save extracted data in a central repository like a data warehouse.

9
Q

What is a data warehouse primarily used for?

A

It is primarily used to collect and store massive amounts of data for data analysis.

10
Q

What is data transformation?

A

The process of transforming the values, structure, and format of data.

11
Q

What do you do after extracting data?

A

Transform the data

12
Q

What happens to data after it has been transformed?

A

It is loaded back to the data warehouse

13
Q

What is data visualisation?

A

It is the graphical representation of data and information.

14
Q

What are some ways to visualise data?

A

You can visualise the data through charts, plots, maps, and animations.

15
Q

Why is data visualisation a good thing?

A

It conveys data more effectively for decision-makers

16
Q

What happens after data visualisation?

A

Model building

17
Q

What is model building?

A

It is the step where you train a model on the data and analyse patterns using suitable machine learning algorithms.

18
Q

What happens after model building?

A

Model deployment

19
Q

What is model deployment?

A

It is the process of integrating a model into a production environment

20
Q

In model deployment, how is a machine learning model made available to third-party applications?

A

They are made available via APIs.

21
Q

What goal do third-party applications help achieve during model deployment?

A

They allow business users to access and interact with data

22
Q

What is the purpose of model monitoring?

A

To track the performance of deployed models.

23
Q

Give an example of a tool used during model monitoring

A

Fiddler

24
Q

What is the purpose of model assessment?

A

To check a model's accuracy and fairness, and to monitor its robustness

25
What are some common metrics used during model assessment?
  • F1 Score
  • True positive rate
  • Sum of squared errors
26
What is a popular Model Monitoring and Assessment tool?
IBM Watson OpenScale
27
What is Code Asset Management?
A tool providing a unified view where you manage an inventory of code assets
28
What do developers use versioning for?
To track the changes made to a software project's code
29
Give an example of a Code Asset Management platform | You use it
GitHub
30
What is Data Asset Management?
It is a platform for organising and managing data collected from different sources
31
What do DAM platforms typically support?
Replication, backup, and access right management for stored data
32
What do Development Environments provide a workspace to do? | Do I Eat Tasty Donuts?
IDEs provide a workspace and tools to develop, implement, execute, test, and deploy source code.
33
What do execution environments have?
They have libraries for code compiling and the system resources to execute and verify code.
34
Give an example of a fully-integrated visual tool
IBM Watson Studio
35
What are the most widely-used open-source data management tools?
  • MySQL
  • PostgreSQL
  • MongoDB (NoSQL)
  • Apache CouchDB (NoSQL)
  • Hadoop (file-based)
  • Ceph (cloud-based)
36
What is the task of data integration and transformation in the classic data warehousing world?
It performs ETL (extract, transform, load) or ELT (extract, load, transform).
37
What are the most widely-used data integration and transformation tools?
  • Apache Airflow
  • Kubeflow
  • Apache NiFi
38
What are the most widely-used data visualisation tools?
  • PixieDust
  • Hue
  • Kibana
  • Apache Superset
39
What are some popular model deployment tools?
  • PredictionIO
  • Seldon
  • MLeap
40
What are some popular model monitoring and assessment tools?
  • ModelDB
  • Prometheus
  • Adversarial Robustness 360 Toolbox
  • AI Explainability 360
41
Name one popular code asset management tool
git
42
What are some popular data asset management tools?
  • Apache Atlas
  • Kylo
43
What is the most popular development environment that data scientists are currently using?
Jupyter
44
What is the next version of Jupyter Notebooks?
JupyterLab (in the long term, it will replace Jupyter Notebooks)
45
What are some characteristics of RStudio?
  • Exclusively runs R and its associated libraries
  • Enables Python development
  • Provides the optimal user experience when R is tightly integrated into the tool
46
What is RStudio able to unify into one tool? | People Enjoy Delicious Red Dates Every Valentine's
  • Programming
  • Execution
  • Debugging
  • Remote data access
  • Data exploration
  • Visualisation
47
What are the features of Spyder?
It integrates code, documentation, and visualisation into a single canvas. It is not on par with the functionality of RStudio (Spyder tries to mimic RStudio's behaviour in order to bring its functionality to the Python world).
48
What is the key feature of Apache Spark?
Linear scalability
49
What does linear scalability mean? | In the context of Apache Spark
It essentially means that the more servers there are in a cluster, the better the performance.
50
What is the difference between Apache Spark and Flink?
  • Spark is a batch processing engine, capable of processing huge amounts of data file by file
  • Flink is a stream processing engine whose main focus is processing real-time data streams
51
What is a commercial tool? | In the context of data science
Commercial tools are software applications that are often licensed and used by businesses to perform various tasks related to data science.
52
What does open-source mean?
Open-source refers to a type of software where the source code is made publicly accessible. This means that anyone can view, modify, and distribute the code as they see fit.
53
Who delivers commercial support to data science tools?
  • Software vendors
  • Influential partners
  • Support networks
54
What do commercial tools do?
They support the most common tasks in data science
55
Name 2 commercial tools for data management
1. Oracle Database 2. Microsoft SQL Server
56
Name 2 commercial tools for data integration
1. Informatica Powercenter 2. IBM Infosphere DataStage
57
Name 2 commercial tools for model building
1. SPSS Modeler 2. SAS Enterprise Miner
58
Name 2 providers of commercial data asset management tools
1. Informatica 2. IBM
59
Name 1 fully-integrated commercial tool that covers the entire data science life cycle
IBM Watson Studio
60
Which 2 cloud-based tools cover the complete development life cycle for all data science, AI, and machine learning tasks?
  • Watson Studio
  • Watson OpenScale
61
In which category of data science tasks will you find a SaaS version of existing open-source and commercial tools? | with some exceptions, of course
Data management
62
What do Informatica Cloud Data Integration and IBM's Data Refinery have in common?
They are both cloud-based commercial data integration tools
63
What is IBM's Cognos Business Intelligence suite an example of?
A cloud-based data visualisation tool
64
Name a cloud-based tool for model building
Watson Machine Learning
65
What is Amazon SageMaker Model Monitor an example of?
A cloud-based tool for monitoring deployed machine learning and deep learning models continuously
66
How do you decide which programming language to learn?
It largely depends on your needs, the problems you are trying to solve, and who you are solving the problem for
67
What are the popular languages within data science?
Python, R, SQL, Scala, Java, C++, and Julia
68
Which Python scientific computing libraries are commonly used in data science?
Pandas, NumPy, SciPy, and Matplotlib
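As a small worked illustration of the NumPy entry above (a sketch, assuming NumPy is installed; the arrays and numbers are made up for the example):

```python
import numpy as np

# Element-wise arithmetic on arrays, the core idea behind NumPy
prices = np.array([100.0, 250.0, 80.0])
quantities = np.array([2, 1, 5])

# Vectorised multiply-and-sum instead of an explicit Python loop
revenue = prices * quantities  # array([200., 250., 400.])
total = float(revenue.sum())   # 850.0
```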
69
How can you use Python for Natural Language Processing (NLP)?
By making use of the Natural Language Toolkit
70
What are the similarities between open source and free software?
  • Both are free to use
  • Both commonly refer to the same set of licences
  • Both support collaboration
71
What are the differences between open source and free software
Open source is more business-focused while free software is more focused on a set of values
72
Who is R for?
  • Statisticians, mathematicians, and data engineers
  • People with minimal or no programming experience
  • Learners embarking on a data science career
  • R is popular in academia
73
What can you use R for?
For developing statistical software, graphing, as well as data analysis
74
What makes SQL great?
  • Knowing SQL will help you get a job in data science and data engineering
  • It speeds up workflow executions
  • It acts as an interpreter between you and the database
  • It is an ANSI standard
75
How is SQL different from other development languages?
It is a non-procedural language
76
What is SQL's scope?
It is limited to querying and managing data
77
What was SQL designed for?
Managing data in relational databases
78
What are the most popular languages in data science?
Python, R, SQL, Scala, Java, C++, and Julia.
79
What is a substantial benefit of learning SQL?
If you learn SQL and use it with one database, you can apply your SQL knowledge with many other databases easily
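A minimal sketch of that portability, using Python's built-in sqlite3 module (the table and values are invented for the example; the same standard SQL statements would also run on MySQL, PostgreSQL, and other relational databases):

```python
import sqlite3

# Standard SQL runs unchanged against an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE homes (city TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO homes (city, price) VALUES (?, ?)",
    [("Leeds", 250000), ("York", 310000), ("Leeds", 180000)],
)

# A portable aggregate query: average price per city
rows = conn.execute(
    "SELECT city, AVG(price) FROM homes GROUP BY city ORDER BY city"
).fetchall()
conn.close()
```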
80
What data science tools are built with Java?
Weka, Java-ML, Apache MLlib, and Deeplearning4j
81
What popular program is built with Scala?
Apache Spark, which includes Shark, MLlib, GraphX, and Spark Streaming
82
What data science programs are built with JavaScript?
TensorFlow.js and R-js
83
What's a great application for Julia in data science?
JuliaDB
84
What are libraries?
Libraries are collections of functions and methods that allow you to perform many actions without writing the code yourself
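For example, Python's standard-library statistics module is a library in exactly this sense: it supplies ready-made functions so you don't write the maths yourself (the scores below are made up):

```python
import statistics

# The library does the computation; we just call its functions
scores = [3, 5, 5, 8, 9]
mean_score = statistics.mean(scores)      # 6
median_score = statistics.median(scores)  # 5
```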
85
What are the popular scientific computing libraries in Python?
  • Pandas (used for data structures and tools)
  • NumPy (based on arrays and matrices)
86
What are data visualisation libraries used for?
They are used to communicate with others and explain meaningful results of an analysis
87
What are some popular data visualisation libraries in Python?
  • Matplotlib (used mostly for plots and graphs)
  • Seaborn (popular for its plots, e.g. time series, heat maps, violin plots)
88
What are popular Machine Learning and Deep Learning libraries in Python?
  • Scikit-learn (Machine Learning: regression, classification, clustering)
  • Keras (Deep Learning Neural Networks)
89
What are popular Deep Learning libraries in Python?
  • TensorFlow (Deep Learning: Production and Deployment)
  • PyTorch (Deep Learning: regression, classification)
90
What is TensorFlow?
A low-level framework used in large scale production of deep learning models
91
What does REST API stand for?
Representational State Transfer Application Programming Interface
92
What do REST APIs allow you to do?
  • They allow you to communicate through the internet
  • They enable you to use resources like storage, data, and artificially intelligent algorithms
93
What are REST APIs used for?
They are used to interact with web services. When using these web services, there are rules regarding communication, the input (request), and the output (response).
94
What does an API do?
It allows communication between two pieces of software
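A minimal sketch of that idea in Python: one piece of software composes a request URL that another piece of software (a web service) would understand. `api.example.com`, the resource name, and the parameters are placeholders, not a real API:

```python
from urllib.parse import urlencode

def build_request_url(base, resource, params):
    """Compose an HTTP GET request URL following REST conventions."""
    return f"{base}/{resource}?{urlencode(params)}"

# Hypothetical call: list up to 5 classification models
url = build_request_url(
    "https://api.example.com/v1", "models",
    {"task": "classification", "limit": 5},
)
```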
95
What is a data set?
A structured collection of data
96
What are the types of data ownership?
  • Private data
  • Open data
97
What are characteristics of private data?
  • It is confidential
  • It is commercially sensitive
98
What are characteristics of open data?
It is publicly available
99
What has open data contributed to?
The growth of data science, machine learning, and artificial intelligence
100
Where can you find open data?
  • Open data portal lists from around the world
  • Governmental, intergovernmental, and organisation websites
  • Kaggle
101
What is the CDLA-Sharing Licence for?
It grants you permission to use and modify data. The licence stipulates that if you publish your modified version of the data, you must do so under the same licence terms as the original data.
102
What is the CDLA-Permissive Licence for?
This licence also grants you permission to use and modify data. However, you are not required to share changes to the data.
103
What is important about the CDLA-Sharing and CDLA-Permissive Licences?
Neither licence imposes any restrictions on the results you might derive from the data.
104
Why is the Community Data Licence Agreement important?
It makes it easier to share open data
105
Why might open data sets not meet enterprise requirements?
  • Open data sets may not always be accurate or of high quality. They could contain errors, inconsistencies, or outdated information, which could lead to incorrect insights or decisions if used in an enterprise setting.
  • The data in open data sets might not be relevant to the specific needs of the enterprise. Businesses often require very specific data tailored to their operations, market, and customers.
106
What are open datasets?
Datasets that are freely available for anyone to access, use, modify, and share.
107
What are attributes of proprietary datasets?
  • They contain data primarily owned and controlled by specific individuals or organizations.
  • This data is limited in distribution because it is sold with a licensing agreement.
  • Some data from private sources cannot be easily disclosed the way public data can.
108
What are some examples of proprietary data?
National security data, geological, geophysical, and biological data
109
What does the Data Asset eXchange provide?
It provides a curated collection of open datasets, both from IBM Research and trusted third-party sources, that are ready for use in enterprise applications.
110
What do Machine Learning (ML) models do?
They identify patterns in data
111
What is model training?
The process by which the model learns the data patterns
112
What can a trained model be used for?
It can be used to make predictions
113
What are the types of Machine Learning?
  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning
114
What is the most commonly used type of machine learning?
Supervised learning
115
What does a supervised learning model do?
The model identifies relationships and dependencies between the input data and the correct output
116
What are the types of supervised learning models?
Regression and classification
117
What are regression models used for?
To predict numeric (or "real") values
118
Give an example of something you can use a regression model for
Predicting the estimated sales prices for homes in an area
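A toy version of such a regression, fit with ordinary least squares in plain Python (the sizes and prices are invented and perfectly linear, so the fit comes out exact):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: house size (m^2) vs price (thousands)
sizes = [50, 70, 90, 110]
prices = [150, 190, 230, 270]
slope, intercept = fit_line(sizes, prices)

# Estimate for a 100 m^2 home
predicted = slope * 100 + intercept
```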
119
What are classification models used for?
To predict whether some information or data belongs to a category (or "class")
120
Give an example of something you can use a classification model for
Given a set of emails labelled as spam or not spam, classifying whether a new email should be considered spam
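A deliberately crude sketch of that kind of classifier in Python: instead of a trained model, it scores an email against a hand-picked word list (the indicator words and threshold are made up for illustration; a real model would learn these from labelled data):

```python
# Hypothetical spam-indicator words (a real model would learn weights)
SPAM_WORDS = {"winner", "free", "prize", "urgent"}

def classify_email(text, threshold=2):
    """Return 'spam' when enough indicator words appear, else 'not spam'."""
    words = text.lower().split()
    score = sum(1 for w in words if w.strip(".,!?") in SPAM_WORDS)
    return "spam" if score >= threshold else "not spam"
```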
121
What happens in unsupervised learning?
The data is unlabeled and the model tries to identify patterns without external help
122
Give an example of an application of unsupervised learning
Clustering
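A tiny one-dimensional k-means sketch in plain Python shows the idea: with no labels, the algorithm discovers the groups on its own (the values and starting centroids are made up):

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means: repeatedly assign each point to the nearest
    centroid, then move each centroid to its cluster's mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, around 2 and around 10
centers = kmeans_1d([1, 2, 3, 9, 10, 11], centroids=[0.0, 5.0])
```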
123
What is anomaly detection used for?
Identifying outliers in a dataset
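A simple z-score sketch of anomaly detection in plain Python (the sensor readings are invented; a value far from the mean, in units of standard deviation, is flagged):

```python
import statistics

def find_outliers(values, z_threshold=2.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

readings = [10, 11, 9, 10, 12, 10, 45]  # 45 is the obvious outlier
outliers = find_outliers(readings)
```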
124
What has reinforcement learning been likened to?
It has been likened to the learning process of a human; e.g. learning something through trial and error.
125
How does a reinforcement learning model work?
A reinforcement learning model learns the best set of actions to take, given its current environment, to get the most rewards over time.
126
What is deep learning?
It is a specialised type of machine learning that loosely emulates the way the human brain solves a wide range of problems, using a general set of models and techniques
127
What are some applications of deep learning?
  • Natural language processing
  • Image, audio, and video analysis
  • Time series forecasting
128
What are some deep learning requirements?
  • It requires large datasets of labelled data and is compute intensive
  • It requires special-purpose hardware
129
Which popular frameworks are used to implement deep learning models?
  • TensorFlow
  • PyTorch
  • Keras
130
What are model zoos?
They are pre-trained state-of-the-art models from repositories
131
What are the high-level tasks involved in building a model?
  • Prepare data
  • Build the model
  • Train the model
This is an iterative process requiring data, expertise, time, and resources. Then you can deploy and use the model.
132
What is the Model Asset eXchange (MAX)?
It is a free open source repository for ready-to-use and customizable deep learning microservices.
133
How can you reduce the time to value of a project?
By making use of pre-trained models
134
Where are MAX model-serving microservices built and distributed?
On GitHub as open source Docker images
135
What is Red Hat OpenShift?
It is a Kubernetes platform used to automate deployment, scaling, and management of microservices
136
What is useful about Ml-exchange.org?
It has multiple predefined models
137
What does the Community Data License Agreement (CDLA) facilitate?
It facilitates open data sharing by providing clear licensing terms for distribution and use
138
In a sentence, what do machine learning models do?
Machine learning models analyse data and identify patterns to make predictions and automate complex tasks
139
What do Python libraries provide the tools for?
  • Data manipulation
  • Mathematical operations
  • Simplified machine learning model development
140
What is the best way to represent network data?
A graph is often used to represent connections between people on a social networking website
141
What does a tabular data set comprise?
It comprises a collection of rows containing columns that store the information
142
Name a popular tabular data format
Comma-separated values or ".csv"
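A short sketch of reading that format with Python's built-in csv module (the rows here are made up and held in memory rather than in a .csv file on disk):

```python
import csv
import io

# A small in-memory stand-in for a .csv file on disk
csv_text = "name,age,city\nAda,36,London\nGrace,45,Arlington\n"

# Each row becomes a dict keyed by the header columns
with io.StringIO(csv_text) as f:
    rows = list(csv.DictReader(f))
```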
143
What are hierarchical or network data structures typically used for?
They are used to represent relationships between data
144
What format are hierarchical data structures organised in?
They are organised in a tree-like format
145
What format are network structures organised in?
This data is organised in a graph
146
Name a popular dataset for data science
The Modified National Institute of Standards and Technology (MNIST) dataset. It contains images of handwritten digits and is commonly used to train image processing systems.
147
What did Jupyter Notebooks originate as?
IPython
148
What are the key functionalities of a Jupyter Notebook?
  • It records data science experiments
  • It allows combining text, code blocks, and code output in a single file
  • It exports the notebook to a PDF or HTML file format
149
What are the key functionalities of JupyterLab?
  • It allows access to multiple Jupyter Notebook files, other code, and data files
  • It enables working in an integrated manner
  • It is compatible with several file formats
  • It is open source
150
What is a kernel?
It is a computational engine that executes the code contained in a Notebook file
151
What do Jupyter notebooks represent?
They represent code, metadata, contents, and outputs
152
What does Jupyter implement?
A two-process model with a kernel and a client
153
What is the Notebook server responsible for?
Saving and loading the notebooks
154
What does the kernel execute?
The cells of code contained in the notebook
155
What does the Jupyter architecture use to convert files to other formats?
The nbconvert tool
156
What do computational notebooks do?
They combine code, computational output, explanatory text, and multimedia resources in a single document
157
What is JupyterLite?
It is a lightweight tool built from JupyterLab components that executes entirely in the browser
158
What is R?
R is a statistical programming language
159
What is R used for?
  • Data processing and manipulation
  • Statistical inference, data analysis, and machine learning
160
Where is R used the most?
Academia, healthcare, and the government
161
Why is R a preferred language for some data scientists?
  • It is easy to use compared to some other data science tools
  • It is a great tool for visualisation
  • It doesn't require installing packages for basic data analysis
162
What are some popular R libraries for Data Science?
  • dplyr (for data manipulation)
  • stringr (for string manipulation)
  • ggplot (for data visualisation)
  • caret (for machine learning)
163
What are some data visualisation packages in R?
  • ggplot (histograms, bar charts, scatterplots)
  • plotly (web-based data visualisations)
  • lattice (complex, multi-variable data sets)
  • leaflet (interactive plots)
164
What does version control do?
It allows you to keep track of changes to your documents
165
What is git?
It is a distributed version control system
166
What is one of the most common version control systems in the world?
git
167
What is the SSH protocol?
A method for secure remote login from one computer to another
168
What is a repository?
The folders of your project that are set up for version control
169
What do repositories do?
They store documents (including your source code) and they enable version control
170
What is a fork?
A copy of a repository
171
What is a pull request?
The process you use to request that someone reviews and approves your changes before they become final.
  • Pull requests serve as a way of proposing changes to the main branch
  • Other team members review the changes and approve merging them into the master branch
172
What is a working directory?
A directory on your file system, including its files and subdirectories, that is associated with a git repository
173
What is special about the Git Repository Model?
  • It is a distributed version control system
  • It tracks source code
  • It allows collaboration among programmers
  • It enables non-linear workflows
174
What is GitHub?
The online hosting service for git repositories
175
What is a branch?
A snapshot of your repository
176
What is the master branch?
The official version of the project
177
What does a child branch do?
It creates a copy of the master branch
178
When using git repositories, where are edits and changes made?
In the child branch. In this branch, you can build, make edits, and test the changes; when you are satisfied with them, you can merge them back into the master branch, where you can prepare the model for deployment.
179
What is a key benefit of using child branches?
Branches allow for simultaneous development and testing by multiple team members