Tools for Data Science Flashcards
IBM Data Science Professional Certificate (Course 2/10)
Do I Visit Beautiful Destinations Most Autumns?
What data science categories does raw data need to pass through before it is deemed useful?
- Data management
- Data integration and transformation
- Data visualisation
- Model building
- Model deployment
- Model monitoring and assessment
Do Cats Dance Exquisitely Everywhere?
What tools are used to support the tasks performed in the Data Science Categories?
- Data Asset Management
- Code Asset Management
- Development Environments
- Execution Environments
What is data management?
The process of collecting, persisting, and retrieving data securely, efficiently, and cost-effectively.
Where is data collected from?
Many sources, including (but not limited to) Twitter, Flipkart, sensors, and the Internet.
Where should you store data so it is available whenever you need it?
Store the data in persistent storage
What is data integration and transformation?
The process of extracting, transforming, and loading data (ETL)
Give examples of repositories where data is commonly distributed
- Databases
- Data cubes
- Flat files
During the extraction process, where should you save the extracted data?
It is common practice to save extracted data in a central repository like a data warehouse.
What is a data warehouse primarily used for?
It is primarily used to collect and store massive amounts of data for data analysis.
What is data transformation?
The process of transforming the values, structure, and format of data.
What do you do after extracting data?
Transform the data
What happens to data after it has been transformed?
It is loaded back into the data warehouse
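For illustration, a minimal ETL sketch in Python using pandas, with SQLite standing in for a data warehouse (the file, table, and column names are hypothetical):

```python
# A minimal ETL sketch (file, table, and column names are hypothetical).
import sqlite3
import pandas as pd

# Extract: read raw data from a flat file.
raw = pd.read_csv("sales_raw.csv")

# Transform: clean values and adjust structure/format.
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["amount"] = raw["amount"].fillna(0).astype(float)

# Load: write the transformed data into a central repository
# (SQLite stands in for the data warehouse here).
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales", conn, if_exists="replace", index=False)
```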
What is data visualisation?
It is the graphical representation of data and information.
What are some ways to visualise data?
You can visualise the data through charts, plots, maps, and animations.
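For example, a simple chart with Matplotlib (the values are made up):

```python
# A small visualisation sketch: a line chart of made-up monthly revenue.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 150, 170]

plt.plot(months, revenue, marker="o")
plt.title("Monthly revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (thousands)")
plt.show()
```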
Why is data visualisation a good thing?
It conveys data more effectively for decision-makers
What happens after data visualisation?
Model building
What is model building?
It is a step where you train a model on the data and analyse patterns using suitable machine learning algorithms.
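A minimal model-building sketch with scikit-learn, using made-up toy data:

```python
# Train a model on data so it learns patterns, then use it to predict.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]   # input features (made up)
y = [0, 1, 1, 0]                        # correct outputs (labels)

model = DecisionTreeClassifier()
model.fit(X, y)                         # the model learns patterns in the data
print(model.predict([[1, 1]]))          # use the trained model
```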
What happens after model building?
Model deployment
What is model deployment?
It is the process of integrating a model into a production environment
In model deployment, how is a machine learning model made available to third-party applications?
It is made available via APIs.
What goal do third-party applications help achieve during model deployment?
They allow business users to access and interact with data
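A minimal sketch of serving a model via an API, here using Flask (Flask is an assumption for illustration; the course names tools like PredictionIO and Seldon instead, and the endpoint and stand-in "model" below are hypothetical):

```python
# A hypothetical prediction endpoint exposing a model to applications.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    # In a real deployment, a trained model would be loaded and called here.
    prediction = sum(features) > 1.0     # stand-in for model.predict(...)
    return jsonify({"prediction": bool(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```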
What is the purpose of model monitoring?
To track the performance of deployed models.
Give an example of a tool used during model monitoring
Fiddler
What is the purpose of model assessment?
To check a model's accuracy and fairness, and to monitor its robustness
What are some common metrics used during model assessment?
- F1 Score
- True positive rate
- Sum of squared error
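A short sketch computing these metrics with scikit-learn and NumPy (the labels and values are made up; recall is the true positive rate):

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # made-up ground truth
y_pred = [1, 0, 0, 1, 0, 1]   # made-up model predictions

print("F1 score:", f1_score(y_true, y_pred))
print("True positive rate (recall):", recall_score(y_true, y_pred))

# Sum of squared error for a regression-style prediction:
actual = np.array([3.0, 2.5, 4.0])
predicted = np.array([2.8, 2.7, 3.6])
print("SSE:", np.sum((actual - predicted) ** 2))
```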
What is a popular Model Monitoring and Assessment tool?
IBM Watson OpenScale
What is Code Asset Management?
It is a tool that gives you a unified view where you manage an inventory of code assets.
What do developers use versioning for?
To track the changes made to a software project’s code
Give an example of a Code Asset Management platform
GitHub
What is Data Asset Management?
It is a platform for organising and managing data collected from different sources
What do DAM platforms typically support?
Replication, backup, and access right management for stored data
Do I Eat Tasty Donuts?
What do Development Environments (IDEs) provide?
IDEs provide a workspace and tools to develop, implement, execute, test, and deploy source code.
What do execution environments have?
They have libraries for code compiling and the system resources to execute and verify code.
Give an example of a fully-integrated visual tool
IBM Watson Studio
What are the most widely-used open-source data management tools?
- MySQL
- PostgreSQL
- MongoDB (NoSQL)
- Apache CouchDB (NoSQL)
- Hadoop (file-based)
- Ceph (cloud-based system)
What is the task of data integration and transformation in the classic data warehousing world?
It is for ETL or ELT.
What are the most widely-used data integration and transformation tools?
- Apache Airflow
- Kubeflow
- Apache NiFi
What are the most widely-used data visualisation tools?
- PixieDust
- Hue
- Kibana
- Apache Superset
What are some popular model deployment tools?
- PredictionIO
- Seldon
- MLeap
What are some popular model monitoring and assessment tools?
- ModelDB
- Prometheus
- Adversarial Robustness 360 Toolbox
- AI Explainability 360
Name one popular code asset management tool
git
What are some popular data asset management tools?
- Apache Atlas
- Kylo
What is the most popular development environment that data scientists are currently using?
Jupyter
What is the next version of Jupyter Notebooks?
JupyterLab
In the long term, it will replace Jupyter Notebooks
What are some characteristics of RStudio?
- Runs R and its associated libraries exclusively, although Python development is also possible
- Its tools are tightly integrated, providing an optimal user experience
People Enjoy Delicious Red Dates Every Valentine’s
What is RStudio able to unify into one tool?
- Programming
- Execution
- Debugging
- Remote data access
- Data exploration
- Visualisation
What are the features of Spyder?
It integrates:
- Code
- Documentation
- Visualisation
into a single canvas.
It is not on par with the functionality of RStudio, but it tries to mimic RStudio's behaviour in order to bring that functionality to the Python world.
What is the key feature of Apache Spark?
Linear scalability
What does linear scalability mean?
In the context of Apache Spark, it means that the more servers there are in a cluster, the higher the performance.
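A PySpark sketch of this scale-out model (assumes a local Spark installation; the file and column names are hypothetical). The same code runs unchanged whether the cluster has one server or hundreds:

```python
# Spark distributes the work below across however many servers the cluster has.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scalability-demo").getOrCreate()

# Read a (hypothetical) CSV file and aggregate it in parallel.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()

spark.stop()
```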
What is the difference between Apache Spark and Flink?
- Spark is a batch processing engine capable of processing huge amounts of data, file by file
- Flink is a stream processing engine with a main focus on processing real-time data streams
What is a commercial tool?
In the context of data science, commercial tools are software applications that are often licensed and used by businesses to perform various tasks related to data science.
What does open-source mean?
Open-source refers to a type of software where the source code is made publicly accessible. This means that anyone can view, modify, and distribute the code as they see fit.
Who delivers commercial support to data science tools?
- Software vendors
- Influential partners
- Support networks
What do commercial tools do?
They support the most common tasks in data science
Name 2 commercial tools for data management
- Oracle Database
- Microsoft SQL Server
Name 2 commercial tools for data integration
- Informatica PowerCenter
- IBM InfoSphere DataStage
Name 2 commercial tools for model building
- SPSS Modeler
- SAS Enterprise Miner
Name 2 providers of commercial data asset management tools
- Informatica
- IBM
Name 1 fully-integrated commercial tool that covers the entire data science life cycle
IBM Watson Studio
Which 2 cloud-based tools cover the complete development life cycle for all data science, AI, and machine learning tasks?
- Watson Studio
- Watson OpenScale
In which category of data science tasks will you find a SaaS version of existing open-source and commercial tools (with some exceptions, of course)?
Data management
What do Informatica Cloud Data Integration and IBM’s Data Refinery have in common?
They are both cloud-based commercial data integration tools
What is IBM’s Cognos Business Intelligence suite an example of?
A cloud-based data visualisation tool
Name a cloud-based tool for model building
Watson Machine Learning
What is Amazon SageMaker Model Monitor an example of?
A cloud-based tool for monitoring deployed machine learning and deep learning models continuously
How do you decide which programming language to learn?
It largely depends on your needs, the problems you are trying to solve, and who you are solving the problem for
What are the popular languages within data science?
Python, R, SQL, Scala, Java, C++, and Julia
Which Python scientific computing libraries are commonly used in data science?
Pandas, NumPy, SciPy, Matplotlib
How can you use Python for Natural Language Processing (NLP)?
By making use of the Natural Language Toolkit (NLTK)
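A minimal NLTK sketch, tokenising a sentence (requires a one-time download of tokeniser data; newer NLTK versions may ask for "punkt_tab" instead of "punkt"):

```python
import nltk

nltk.download("punkt")                 # one-time download of tokeniser data
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Data science is fun!")
print(tokens)  # ['Data', 'science', 'is', 'fun', '!']
```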
What are the similarities between open source and free software?
- Both are free to use
- Both commonly refer to the same set of licences
- Both support collaboration
What are the differences between open source and free software
Open source is more business-focused while free software is more focused on a set of values
Who is R for?
- Statisticians, mathematicians, and data engineers
- People with minimal or no programming experience
- Learners pursuing a data science career
- R is popular in academia
What can you use R for?
For developing statistical software, graphing, as well as data analysis
What makes SQL great?
- Knowing SQL will help you get a job in data science and data engineering
- It speeds up workflow executions
- It acts as an interpreter between you and the database
- It is an ANSI standard
How is SQL different from other development languages?
It is a non-procedural language
What is SQL’s scope?
It is limited to querying and managing data
What was SQL designed for?
Managing data in relational databases
What is a substantial benefit of learning SQL?
If you learn SQL and use it with one database, you can apply your SQL knowledge with many other databases easily
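A sketch of that portability: the standard SQL below runs against SQLite through Python's built-in sqlite3 module, and the same statements work on most other relational databases:

```python
# Standard SQL statements executed against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, role TEXT)")
conn.execute("INSERT INTO staff VALUES ('Ada', 'data scientist')")

# Querying data: the core of SQL's scope.
for row in conn.execute("SELECT name FROM staff WHERE role = 'data scientist'"):
    print(row)

conn.close()
```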
What data science tools are built with Java?
Weka, Java-ML, Apache MLlib, and Deeplearning4j
What popular program is built with Scala?
Apache Spark, which includes Shark, MLlib, GraphX, and Spark Streaming
What data science programs are built with JavaScript?
TensorFlow.js and R-js
What’s a great application for Julia in data science?
JuliaDB
What are libraries?
Libraries are collections of functions and methods that allow you to perform many actions without writing the code yourself
What are the popular scientific computing libraries in Python?
- Pandas (used for data structures and tools)
- NumPy (based on arrays and matrices)
- SciPy (for integrals, solving differential equations, and optimisation)
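A small sketch of the array-and-matrix style of computing that NumPy is based on:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])     # a 2x2 matrix
b = np.array([10, 20])             # a vector

print(a @ b)        # matrix-vector product -> [ 50 110]
print(a.mean())     # aggregate over the whole matrix -> 2.5
```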
What are data visualisation libraries used for?
They are used to communicate with others and explain meaningful results of an analysis
What are some popular data visualisation libraries in Python?
- Matplotlib, it is used mostly for plots and graphs
- Seaborn, popular for its plots (e.g. time series, heat maps, violin plots)
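For example, a Seaborn heat map built from made-up data:

```python
# A minimal Seaborn sketch: a heat map of random, made-up values.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.rand(4, 4)
sns.heatmap(data, annot=True)
plt.title("Example heat map")
plt.show()
```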
What are popular Machine Learning and Deep Learning libraries in Python?
- Scikit-learn (Machine Learning: regression, classification, clustering)
- Keras (Deep Learning Neural Networks)
What are popular Deep Learning libraries in Python?
- TensorFlow (Deep Learning: Production and Deployment)
- PyTorch (Deep Learning: regression, classification)
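A minimal Keras sketch of a small neural network (the layer sizes and activations are arbitrary choices for illustration):

```python
# Define and compile a tiny deep learning model with Keras.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(4,)),                       # 4 input features
    keras.layers.Dense(16, activation="relu"),     # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),   # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```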
What is TensorFlow?
A low-level framework used in large scale production of deep learning models
What does REST API stand for?
Representational State Transfer Application Programming Interface
What do REST APIs allow you to do?
- They allow you to communicate through the internet
- They enable you to use resources like storage, data, and artificially intelligent algorithms
What are REST APIs used for?
They are used to interact with web services, however, there are rules regarding communication, input or request, and output or response when using these web services.
What does an API do?
It allows communication between two pieces of software
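A sketch of calling a REST API over the internet with Python's requests library (the URL and parameters are hypothetical):

```python
import requests

# The request (input) to a hypothetical web service.
response = requests.get("https://api.example.com/v1/models",
                        params={"limit": 5})

print(response.status_code)   # result of the communication
print(response.json())        # the response (output)
```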
What is a data set?
A structured collection of data
What are the types of data ownership?
- Private data
- Open data
What are characteristics of private data?
- It is confidential
- It is commercially sensitive
What are characteristics of open data?
- It is publicly available
- Anyone can access, use, modify, and share it
What has open data contributed to?
The growth of data science, machine learning, and artificial intelligence
Where can you find open data?
- Open data portal lists from around the world
- Governmental, intergovernmental, and organisation websites
- Kaggle
What is the CDLA-Sharing Licence for?
It grants you permission to use and modify data.
The licence stipulates that if you publish your modified version of the data, you must do so under the same licence terms as the original data.
What is the CDLA-Permissive Licence for?
This licence also grants you permission to use and modify data. However, you are not required to share changes to the data.
What is important about the CDLA-Sharing and CDLA-Permissive Licences?
Neither licence imposes any restrictions on the results you might derive from the data.
Why is the Community Data Licence Agreement important?
It makes it easier to share open data
Why might open data sets not meet enterprise requirements?
- Open data sets may not always be accurate or of high quality. They could contain errors, inconsistencies, or outdated information, which could lead to incorrect insights or decisions if used in an enterprise setting.
- The data in open data sets might not be relevant to the specific needs of the enterprise. Businesses often require very specific data tailored to their operations, market, and customers.
What are open datasets?
Datasets that are freely available for anyone to access, use, modify, and share.
What are attributes of proprietary datasets?
- They contain data primarily owned and controlled by specific individuals or organizations.
- This data is limited in distribution because it is sold with a licensing agreement.
- Unlike public data, some data from private sources cannot be easily disclosed
What are some examples of proprietary data?
National security data, geological, geophysical, and biological data
What does the Data Asset eXchange provide?
It provides a curated collection of open datasets, both from IBM research and trusted third-party sources.
This data is ready for use in enterprise applications.
What do Machine Learning (ML) models do?
They identify patterns in data
What is model training?
The process by which the model learns the data patterns
What can a trained model be used for?
It can be used to make predictions
What are the types of Machine Learning?
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
What is the most commonly used type of machine learning?
Supervised learning
What does a supervised learning model do?
The model identifies relationships and dependencies between the input data and the correct output
What are the types of supervised learning models?
Regression and classification
What are regression models used for?
To predict numeric (or “real”) values
Give an example of something you can use a regression model for
Predicting the estimated sales prices for homes in an area
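A minimal regression sketch along those lines with scikit-learn (all numbers are made up):

```python
# Predict a numeric value (home price) from a feature (floor area).
from sklearn.linear_model import LinearRegression

areas = [[50], [80], [120], [200]]              # square metres (made up)
prices = [150_000, 240_000, 330_000, 540_000]   # sale prices (made up)

model = LinearRegression().fit(areas, prices)
print(model.predict([[100]]))                   # estimated price for 100 m²
```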
What are classification models used for?
To predict whether some information or data belongs to a category (or “class”)
Give an example of something you can use a classification model for
Given a set of emails, each with a designation of spam or not spam, you can train a model to classify whether new emails should be considered spam
What happens in unsupervised learning?
The data is unlabeled and the model tries to identify patterns without external help
Give an example of an application of unsupervised learning
Clustering
What is anomaly detection used for?
Identifying outliers in a dataset
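For illustration, an anomaly-detection sketch using scikit-learn's IsolationForest (the data is made up; -1 marks outliers):

```python
# Flag the obvious outlier in a small made-up dataset.
from sklearn.ensemble import IsolationForest

X = [[10], [11], [10], [12], [11], [95]]   # 95 is the outlier
labels = IsolationForest(random_state=0).fit_predict(X)
print(labels)   # e.g. [ 1  1  1  1  1 -1]
```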
What has reinforcement learning been likened to?
It has been likened to the learning process of a human; e.g. learning something through trial and error.
How does a reinforcement learning model work?
A reinforcement learning model learns the best set of actions to take, given its current environment, to get the most rewards over time.
What is deep learning?
It is a specialised type of machine learning that loosely emulates the way the human brain solves a wide range of problems, using a general set of models and techniques
What are some applications of deep learning?
- Natural language processing
- Image, audio, and video analysis
- Time series forecasting
What are some deep learning requirements?
- It requires large datasets of labeled data and is compute-intensive
- It requires special-purpose hardware
Which popular frameworks are used to implement deep learning models?
- TensorFlow
- PyTorch
- Keras
What are model zoos?
They are repositories of pre-trained, state-of-the-art models.
What are the high-level tasks involved in building a model?
- Prepare data
- Build the model
- Train the model
This is an iterative process requiring data, expertise, time, and resources
Then you can deploy and use the model.
What is the Model Asset eXchange (MAX)?
It is a free open source repository for ready-to-use and customizable deep learning microservices.
How can you reduce the time to value of a project?
By making use of pre-trained models
Where are MAX model-serving microservices built and distributed?
On GitHub as open source Docker images
What is Red Hat OpenShift?
It is a Kubernetes platform used to automate deployment, scaling, and management of microservices
What is useful about Ml-exchange.org?
It has multiple predefined models
What does the Community Data License Agreement (CDLA) facilitate?
It facilitates open data sharing by providing clear licensing terms for distribution and use
In a sentence, what do machine learning models do?
Machine learning models analyse data and identify patterns to make predictions and automate complex tasks
What do Python libraries provide the tools for?
- data manipulation,
- mathematical operations,
- and simplified machine learning model development
What is the best way to represent network data?
A graph is often used to represent connections between people on a social networking website
What does a tabular data set comprise?
It comprises a collection of rows containing columns that store the information
Name a popular tabular data format
Comma-separated values or “.csv”
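A sketch of loading a tabular .csv file with pandas (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("customers.csv")   # rows and columns become a DataFrame
print(df.head())                    # the first few rows
print(df.columns)                   # the columns that store the information
```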
What are hierarchical or network data structures typically used for?
They are used to represent relationships between data
What format are hierarchical data structures organised in?
They are organised in a tree-like format
What format are network structures organised in?
This data is organised in a graph
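A tiny sketch of network data as a graph, using a plain-dictionary adjacency list (the names are made up):

```python
# Each key is a person; each value lists the people they are connected to.
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice"],
    "Carol": ["Alice", "Dan"],
    "Dan":   ["Carol"],
}

for person, connections in friends.items():
    print(person, "->", connections)
```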
Name a popular dataset for data science
The Modified National Institute of Standards and Technology (MNIST) dataset. It contains images of handwritten digits and is commonly used to train image processing systems.
What did Jupyter Notebooks originate as?
IPython
What are the key functionalities of a Jupyter Notebook?
- It records data science experiments
- It allows combining text, code blocks, and code output in a single file
- It exports the notebook to a PDF or HTML file format
What are the key functionalities of JupyterLab?
- It allows access to multiple Jupyter Notebook files, other code, and data files
- It enables working in an integrated manner
- It is compatible with several file formats
- It is open source
What is a kernel?
It is a computational engine that executes the code contained in a Notebook file
What do Jupyter notebooks represent?
They represent code, metadata, contents, and outputs
What does Jupyter implement?
A two-process model with a kernel and a client
What is the Notebook server responsible for?
Saving and loading the notebooks
What does the kernel execute?
The cells of code contained in the notebook
What does the Jupyter architecture use to convert files to other formats?
The nbconvert tool
What do computational notebooks do?
They combine code, computational output, explanatory text, and multimedia resources in a single document
What is JupyterLite?
It is a lightweight tool built from JupyterLab components that executes entirely in the browser
What is R?
R is a statistical programming language
What is R used for?
- Data processing and manipulation
- Statistical inference, data analysis and machine learning
Where is R used the most?
Academia, healthcare, and the government
Why is R a preferred language for some data scientists?
- It is easy to use compared to some other data science tools
- It is a great tool for visualisation
- It doesn’t require installing packages for basic data analysis
What are some popular R libraries for Data Science?
- dplyr (for data manipulation)
- stringr (for string manipulation)
- ggplot (for data visualisation)
- caret (for machine learning)
What are some data visualisation packages in R?
- ggplot (histograms, bar charts, scatterplots)
- plotly (web-based data visualisations)
- lattice (complex, multi-variable data sets)
- leaflet (interactive plots)
What does version control do?
It allows you to keep track of changes to your documents
What is git?
It is a distributed version control system
What is one of the most common version control systems in the world?
git
What is the SSH protocol?
A method for secure remote login from one computer to another
What is a repository?
The folders of your project that are set up for version control
What do repositories do?
They store documents (including your source code) and they enable version control
What is a fork?
A copy of a repository
What is a pull request?
The process you use to request that someone reviews and approves your changes before they become final.
- It serves as a way of proposing changes to the master branch
- Other team members review the changes and approve merging them into the master branch
What is a working directory?
A directory on your file system, including its files and subdirectories, that is associated with a git repository
What is special about the Git Repository Model?
- It is a distributed version control system
- It tracks source code
- It allows collaboration among programmers
- It enables non-linear workflows
What is GitHub?
The online hosting service for git repositories
What is a branch?
A snapshot of your repository
What is the master branch?
The official version of the project
What does a child branch do?
It provides a copy of the master branch in which you can make changes.
When using git repositories, where are edits and changes made?
In the child branch.
In this branch, you can build, make edits, and test the changes; when you are satisfied with them, you can merge them back into the master branch, where you can prepare them for deployment.
What is a key benefit of using child branches?
Branches allow for simultaneous development and testing by multiple team members