Tools for Data Science Flashcards
IBM Data Science Professional Certificate (Course 2/10)
Do I Visit Beautiful Destinations Most Autumns?
What data science categories does raw data need to pass through before it is deemed useful?
- Data management
- Data integration and transformation
- Data visualisation
- Model building
- Model deployment
- Model monitoring and assessment
Do Cats Dance Exquisitely Everywhere?
What tools are used to support the tasks performed in the Data Science Categories?
- Data Asset Management
- Code Asset Management
- Development Environments
- Execution Environments
What is data management?
The process of collecting, persisting, and retrieving data securely, efficiently, and cost-effectively.
Where is data collected from?
Many sources, including (but not limited to) Twitter, Flipkart, sensors, and the Internet.
Where should you store data so it is available whenever you need it?
Store the data in persistent storage
What is data integration and transformation?
The process of extracting, transforming, and loading data (ETL)
Give examples of repositories where data is commonly distributed
- Databases
- Data cubes
- Flat files
During the extraction process, where should you save the extracted data?
It is common practice to save extracted data in a central repository like a data warehouse.
What is a data warehouse primarily used for?
It is primarily used to collect and store massive amounts of data for data analysis.
What is data transformation?
The process of transforming the values, structure, and format of data.
What do you do after extracting data?
Transform the data
What happens to data after it has been transformed?
It is loaded back into the data warehouse
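For illustration, a minimal ETL sketch in Python using pandas, with SQLite standing in for a data warehouse (the file, table, and column names are hypothetical):

```python
# A minimal ETL sketch (file, table, and column names are hypothetical).
import sqlite3
import pandas as pd

# Extract: read raw data from a flat file.
raw = pd.read_csv("sales_raw.csv")

# Transform: clean values and adjust structure/format.
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["amount"] = raw["amount"].fillna(0).astype(float)

# Load: write the transformed data into a central repository
# (SQLite stands in for the data warehouse here).
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales", conn, if_exists="replace", index=False)
```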
What is data visualisation?
It is the graphical representation of data and information.
What are some ways to visualise data?
You can visualise the data through charts, plots, maps, and animations.
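For example, a simple chart with Matplotlib (the values are made up):

```python
# A small visualisation sketch: a line chart of made-up monthly revenue.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 150, 170]

plt.plot(months, revenue, marker="o")
plt.title("Monthly revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (thousands)")
plt.show()
```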
Why is data visualisation a good thing?
It conveys data more effectively for decision-makers
What happens after data visualisation?
Model building
What is model building?
It is a step where you train a model on the data and analyse patterns using suitable machine learning algorithms.
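A minimal model-building sketch with scikit-learn, using made-up toy data:

```python
# Train a model on data so it learns patterns, then use it to predict.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]   # input features (made up)
y = [0, 1, 1, 0]                        # correct outputs (labels)

model = DecisionTreeClassifier()
model.fit(X, y)                         # the model learns patterns in the data
print(model.predict([[1, 1]]))          # use the trained model
```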
What happens after model building?
Model deployment
What is model deployment?
It is the process of integrating a model into a production environment
In model deployment, how is a machine learning model made available to third-party applications?
It is made available via APIs.
What goal do third-party applications help achieve during model deployment?
They allow business users to access and interact with data
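A minimal sketch of serving a model via an API, here using Flask (Flask is an assumption for illustration; the course names tools like PredictionIO and Seldon instead, and the endpoint and stand-in "model" below are hypothetical):

```python
# A hypothetical prediction endpoint exposing a model to applications.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    # In a real deployment, a trained model would be loaded and called here.
    prediction = sum(features) > 1.0     # stand-in for model.predict(...)
    return jsonify({"prediction": bool(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```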
What is the purpose of model monitoring?
To track the performance of deployed models.
Give an example of a tool used during model monitoring
Fiddler
What is the purpose of model assessment?
To check a model's accuracy and fairness, and to monitor its robustness
What are some common metrics used during model assessment?
- F1 Score
- True positive rate
- Sum of squared error
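A short sketch computing these metrics with scikit-learn and NumPy (the labels and values are made up; recall is the true positive rate):

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # made-up ground truth
y_pred = [1, 0, 0, 1, 0, 1]   # made-up model predictions

print("F1 score:", f1_score(y_true, y_pred))
print("True positive rate (recall):", recall_score(y_true, y_pred))

# Sum of squared error for a regression-style prediction:
actual = np.array([3.0, 2.5, 4.0])
predicted = np.array([2.8, 2.7, 3.6])
print("SSE:", np.sum((actual - predicted) ** 2))
```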
What is a popular Model Monitoring and Assessment tool?
IBM Watson OpenScale
What is Code Asset Management?
It is a tool that gives you a unified view where you manage an inventory of code assets.
What do developers use versioning for?
To track the changes made to a software project’s code
Give an example of a Code Asset Management platform
GitHub
What is Data Asset Management?
It is a platform for organising and managing data collected from different sources
What do DAM platforms typically support?
Replication, backup, and access right management for stored data
Do I Eat Tasty Donuts?
What do Development Environments (IDEs) provide?
IDEs provide a workspace and tools to develop, implement, execute, test, and deploy source code.
What do execution environments have?
They have libraries for code compiling and the system resources to execute and verify code.
Give an example of a fully-integrated visual tool
IBM Watson Studio
What are the most widely-used open-source data management tools?
- MySQL
- PostgreSQL
- MongoDB (NoSQL)
- Apache CouchDB (NoSQL)
- Hadoop (file-based)
- Ceph (cloud-based system)
What is the task of data integration and transformation in the classic data warehousing world?
It is for ETL or ELT.
What are the most widely-used data integration and transformation tools?
- Apache Airflow
- Kubeflow
- Apache NiFi
What are the most widely-used data visualisation tools?
- PixieDust
- Hue
- Kibana
- Apache Superset
What are some popular model deployment tools?
- PredictionIO
- Seldon
- MLeap
What are some popular model monitoring and assessment tools?
- ModelDB
- Prometheus
- Adversarial Robustness 360 Toolbox
- AI Explainability 360
Name one popular code asset management tool
git
What are some popular data asset management tools?
- Apache Atlas
- Kylo
What is the most popular development environment that data scientists are currently using?
Jupyter
What is the next version of Jupyter Notebooks?
JupyterLab
In the long term, it will replace Jupyter Notebooks
What are some characteristics of RStudio?
- Runs R and its associated libraries exclusively, although Python development is also possible
- Its tools are tightly integrated, providing an optimal user experience
People Enjoy Delicious Red Dates Every Valentine’s
What is RStudio able to unify into one tool?
- Programming
- Execution
- Debugging
- Remote data access
- Data exploration
- Visualisation
What are the features of Spyder?
It integrates:
- Code
- Documentation
- Visualisation
into a single canvas.
It is not on par with the functionality of RStudio, but it tries to mimic RStudio's behaviour in order to bring that functionality to the Python world.
What is the key feature of Apache Spark?
Linear scalability
What does linear scalability mean?
In the context of Apache Spark, it means that the more servers there are in a cluster, the higher the performance.
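A PySpark sketch of this scale-out model (assumes a local Spark installation; the file and column names are hypothetical). The same code runs unchanged whether the cluster has one server or hundreds:

```python
# Spark distributes the work below across however many servers the cluster has.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scalability-demo").getOrCreate()

# Read a (hypothetical) CSV file and aggregate it in parallel.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()

spark.stop()
```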
What is the difference between Apache Spark and Flink?
- Spark is a batch processing engine capable of processing huge amounts of data, file by file
- Flink is a stream processing engine with a main focus on processing real-time data streams
What is a commercial tool?
In the context of data science, commercial tools are software applications that are often licensed and used by businesses to perform various tasks related to data science.
What does open-source mean?
Open-source refers to a type of software where the source code is made publicly accessible. This means that anyone can view, modify, and distribute the code as they see fit.
Who delivers commercial support to data science tools?
- Software vendors
- Influential partners
- Support networks
What do commercial tools do?
They support the most common tasks in data science
Name 2 commercial tools for data management
- Oracle Database
- Microsoft SQL Server
Name 2 commercial tools for data integration
- Informatica PowerCenter
- IBM InfoSphere DataStage
Name 2 commercial tools for model building
- SPSS Modeler
- SAS Enterprise Miner
Name 2 providers of commercial data asset management tools
- Informatica
- IBM
Name 1 fully-integrated commercial tool that covers the entire data science life cycle
IBM Watson Studio
Which 2 cloud-based tools cover the complete development life cycle for all data science, AI, and machine learning tasks?
- Watson Studio
- Watson OpenScale
In which category of data science tasks will you find a SaaS version of existing open-source and commercial tools (with some exceptions, of course)?
Data management
What do Informatica Cloud Data Integration and IBM’s Data Refinery have in common?
They are both cloud-based commercial data integration tools
What is IBM’s Cognos Business Intelligence suite an example of?
A cloud-based data visualisation tool
Name a cloud-based tool for model building
Watson Machine Learning
What is Amazon SageMaker Model Monitor an example of?
A cloud-based tool for monitoring deployed machine learning and deep learning models continuously
How do you decide which programming language to learn?
It largely depends on your needs, the problems you are trying to solve, and who you are solving the problem for
What are the popular languages within data science?
Python, R, SQL, Scala, Java, C++, and Julia
Which Python scientific computing libraries are commonly used in data science?
Pandas, NumPy, SciPy, Matplotlib
How can you use Python for Natural Language Processing (NLP)?
By making use of the Natural Language Toolkit (NLTK)
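A minimal NLTK sketch, tokenising a sentence (requires a one-time download of tokeniser data; newer NLTK versions may ask for "punkt_tab" instead of "punkt"):

```python
import nltk

nltk.download("punkt")                 # one-time download of tokeniser data
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Data science is fun!")
print(tokens)  # ['Data', 'science', 'is', 'fun', '!']
```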
What are the similarities between open source and free software?
- Both are free to use
- Both commonly refer to the same set of licences
- Both support collaboration
What are the differences between open source and free software
Open source is more business-focused while free software is more focused on a set of values
Who is R for?
- Statisticians, mathematicians, and data engineers
- People with minimal or no programming experience
- Learners pursuing a data science career
- R is popular in academia
What can you use R for?
For developing statistical software, graphing, as well as data analysis
What makes SQL great?
- Knowing SQL will help you get a job in data science and data engineering
- It speeds up workflow executions
- It acts as an interpreter between you and the database
- It is an ANSI standard
How is SQL different from other development languages?
It is a non-procedural language
What is SQL’s scope?
It is limited to querying and managing data
What was SQL designed for?
Managing data in relational databases
What is a substantial benefit of learning SQL?
If you learn SQL and use it with one database, you can apply your SQL knowledge with many other databases easily
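A sketch of that portability: the standard SQL below runs against SQLite through Python's built-in sqlite3 module, and the same statements work on most other relational databases:

```python
# Standard SQL statements executed against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, role TEXT)")
conn.execute("INSERT INTO staff VALUES ('Ada', 'data scientist')")

# Querying data: the core of SQL's scope.
for row in conn.execute("SELECT name FROM staff WHERE role = 'data scientist'"):
    print(row)

conn.close()
```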
What data science tools are built with Java?
Weka, Java-ML, Apache MLlib, and Deeplearning4j
What popular program is built with Scala?
Apache Spark, which includes Shark, MLlib, GraphX, and Spark Streaming
What data science programs are built with JavaScript?
TensorFlow.js and R-js
What’s a great application for Julia in data science?
JuliaDB
What are libraries?
Libraries are collections of functions and methods that allow you to perform many actions without writing the code yourself
What are the popular scientific computing libraries in Python?
- Pandas (used for data structures and tools)
- NumPy (based on arrays and matrices)
- SciPy (for integrals, solving differential equations, and optimisation)
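A small sketch of the array-and-matrix style of computing that NumPy is based on:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])     # a 2x2 matrix
b = np.array([10, 20])             # a vector

print(a @ b)        # matrix-vector product -> [ 50 110]
print(a.mean())     # aggregate over the whole matrix -> 2.5
```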
What are data visualisation libraries used for?
They are used to communicate with others and explain meaningful results of an analysis
What are some popular data visualisation libraries in Python?
- Matplotlib, it is used mostly for plots and graphs
- Seaborn, popular for its plots (e.g. time series, heat maps, violin plots)
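For example, a Seaborn heat map built from made-up data:

```python
# A minimal Seaborn sketch: a heat map of random, made-up values.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.rand(4, 4)
sns.heatmap(data, annot=True)
plt.title("Example heat map")
plt.show()
```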
What are popular Machine Learning and Deep Learning libraries in Python?
- Scikit-learn (Machine Learning: regression, classification, clustering)
- Keras (Deep Learning Neural Networks)
What are popular Deep Learning libraries in Python?
- TensorFlow (Deep Learning: Production and Deployment)
- PyTorch (Deep Learning: regression, classification)
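A minimal Keras sketch of a small neural network (the layer sizes and activations are arbitrary choices for illustration):

```python
# Define and compile a tiny deep learning model with Keras.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(4,)),                       # 4 input features
    keras.layers.Dense(16, activation="relu"),     # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),   # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```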
What is TensorFlow?
A low-level framework used in large scale production of deep learning models
What does REST API stand for?
Representational State Transfer Application Programming Interface
What do REST APIs allow you to do?
- They allow you to communicate through the internet
- They enable you to use resources like storage, data, and artificially intelligent algorithms
What are REST APIs used for?
They are used to interact with web services, however, there are rules regarding communication, input or request, and output or response when using these web services.
What does an API do?
It allows communication between two pieces of software
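A sketch of calling a REST API over the internet with Python's requests library (the URL and parameters are hypothetical):

```python
import requests

# The request (input) to a hypothetical web service.
response = requests.get("https://api.example.com/v1/models",
                        params={"limit": 5})

print(response.status_code)   # result of the communication
print(response.json())        # the response (output)
```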
What is a data set?
A structured collection of data
What are the types of data ownership?
- Private data
- Open data
What are characteristics of private data?
- It is confidential
- It is commercially sensitive
What are characteristics of open data?
- It is publicly available
- Anyone can access, use, modify, and share it
What has open data contributed to?
The growth of data science, machine learning, and artificial intelligence
Where can you find open data?
- Open data portal lists from around the world
- Governmental, intergovernmental, and organisation websites
- Kaggle
What is the CDLA-Sharing Licence for?
It grants you permission to use and modify data.
The licence stipulates that if you publish your modified version of the data, you must do so under the same licence terms as the original data.
What is the CDLA-Permissive Licence for?
This licence also grants you permission to use and modify data. However, you are not required to share changes to the data.
What is important about the CDLA-Sharing and CDLA-Permissive Licences?
Neither licence imposes any restrictions on the results you might derive from the data.
Why is the Community Data Licence Agreement important?
It makes it easier to share open data
Why might open data sets not meet enterprise requirements?
- Open data sets may not always be accurate or of high quality. They could contain errors, inconsistencies, or outdated information, which could lead to incorrect insights or decisions if used in an enterprise setting.
- The data in open data sets might not be relevant to the specific needs of the enterprise. Businesses often require very specific data tailored to their operations, market, and customers.
What are open datasets?
Datasets that are freely available for anyone to access, use, modify, and share.
What are attributes of proprietary datasets?
- They contain data primarily owned and controlled by specific individuals or organizations.
- This data is limited in distribution because it is sold with a licensing agreement.
- Unlike public data, some data from private sources cannot be easily disclosed
What are some examples of proprietary data?
National security data, geological, geophysical, and biological data
What does the Data Asset eXchange provide?
It provides a curated collection of open datasets, both from IBM research and trusted third-party sources.
This data is ready for use in enterprise applications.
What do Machine Learning (ML) models do?
They identify patterns in data
What is model training?
The process by which the model learns the data patterns
What can a trained model be used for?
It can be used to make predictions
What are the types of Machine Learning?
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
What is the most commonly used type of machine learning?
Supervised learning
What does a supervised learning model do?
The model identifies relationships and dependencies between the input data and the correct output
What are the types of supervised learning models?
Regression and classification
What are regression models used for?
To predict numeric (or “real”) values
Give an example of something you can use a regression model for
Predicting the estimated sales prices for homes in an area
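A minimal regression sketch along those lines with scikit-learn (all numbers are made up):

```python
# Predict a numeric value (home price) from a feature (floor area).
from sklearn.linear_model import LinearRegression

areas = [[50], [80], [120], [200]]              # square metres (made up)
prices = [150_000, 240_000, 330_000, 540_000]   # sale prices (made up)

model = LinearRegression().fit(areas, prices)
print(model.predict([[100]]))                   # estimated price for 100 m²
```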
What are classification models used for?
To predict whether some information or data belongs to a category (or “class”)
Give an example of something you can use a classification model for
Given a set of emails, each with a designation of spam or not spam, you can train a model to classify whether new emails should be considered spam
What happens in unsupervised learning?
The data is unlabeled and the model tries to identify patterns without external help
Give an example of an application of unsupervised learning
Clustering
What is anomaly detection used for?
Identifying outliers in a dataset
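For illustration, an anomaly-detection sketch using scikit-learn's IsolationForest (the data is made up; -1 marks outliers):

```python
# Flag the obvious outlier in a small made-up dataset.
from sklearn.ensemble import IsolationForest

X = [[10], [11], [10], [12], [11], [95]]   # 95 is the outlier
labels = IsolationForest(random_state=0).fit_predict(X)
print(labels)   # e.g. [ 1  1  1  1  1 -1]
```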
What has reinforcement learning been likened to?
It has been likened to the learning process of a human; e.g. learning something through trial and error.
How does a reinforcement learning model work?
A reinforcement learning model learns the best set of actions to take, given its current environment, to get the most rewards over time.
What is deep learning?
It is a specialised type of machine learning that loosely emulates the way the human brain solves a wide range of problems, using a general set of models and techniques
What are some applications of deep learning?
- Natural language processing
- Image, audio, and video analysis
- Time series forecasting
What are some deep learning requirements?
- It requires large datasets of labeled data and is compute-intensive
- It requires special-purpose hardware
Which popular frameworks are used to implement deep learning models?
- TensorFlow
- PyTorch
- Keras
What are model zoos?
They are repositories of pre-trained, state-of-the-art models.
What are the high-level tasks involved in building a model?
- Prepare data
- Build the model
- Train the model
This is an iterative process requiring data, expertise, time, and resources
Then you can deploy and use the model.
What is the Model Asset eXchange (MAX)?
It is a free open source repository for ready-to-use and customizable deep learning microservices.
How can you reduce the time to value of a project?
By making use of pre-trained models
Where are MAX model-serving microservices built and distributed?
On GitHub as open source Docker images
What is Red Hat OpenShift?
It is a Kubernetes platform used to automate deployment, scaling, and management of microservices
What is useful about Ml-exchange.org?
It has multiple predefined models
What does the Community Data License Agreement (CDLA) facilitate?
It facilitates open data sharing by providing clear licensing terms for distribution and use
In a sentence, what do machine learning models do?
Machine learning models analyse data and identify patterns to make predictions and automate complex tasks
What do Python libraries provide the tools for?
- data manipulation,
- mathematical operations,
- and simplified machine learning model development
What is the best way to represent network data?
A graph is often used to represent connections between people on a social networking website
What does a tabular data set comprise?
It comprises a collection of rows containing columns that store the information
Name a popular tabular data format
Comma-separated values or “.csv”
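A sketch of loading a tabular .csv file with pandas (the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("customers.csv")   # rows and columns become a DataFrame
print(df.head())                    # the first few rows
print(df.columns)                   # the columns that store the information
```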
What are hierarchical or network data structures typically used for?
They are used to represent relationships between data
What format are hierarchical data structures organised in?
They are organised in a tree-like format
What format are network structures organised in?
This data is organised in a graph
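A tiny sketch of network data as a graph, using a plain-dictionary adjacency list (the names are made up):

```python
# Each key is a person; each value lists the people they are connected to.
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice"],
    "Carol": ["Alice", "Dan"],
    "Dan":   ["Carol"],
}

for person, connections in friends.items():
    print(person, "->", connections)
```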
Name a popular dataset for data science
The Modified National Institute of Standards and Technology (MNIST) dataset. It contains images of handwritten digits and is commonly used to train image processing systems.
What did Jupyter Notebooks originate as?
IPython
What are the key functionalities of a Jupyter Notebook?
- It records data science experiments
- It allows combining text, code blocks, and code output in a single file
- It exports the notebook to a PDF or HTML file format
What are the key functionalities of JupyterLab?
- It allows access to multiple Jupyter Notebook files, other code, and data files
- It enables working in an integrated manner
- It is compatible with several file formats
- It is open source
What is a kernel?
It is a computational engine that executes the code contained in a Notebook file
What do Jupyter notebooks represent?
They represent code, metadata, contents, and outputs
What does Jupyter implement?
A two-process model with a kernel and a client
What is the Notebook server responsible for?
Saving and loading the notebooks
What does the kernel execute?
The cells of code contained in the notebook
What does the Jupyter architecture use to convert files to other formats?
The nbconvert tool
What do computational notebooks do?
They combine code, computational output, explanatory text, and multimedia resources in a single document
What is JupyterLite?
It is a lightweight tool built from JupyterLab components that executes entirely in the browser
What is R?
R is a statistical programming language
What is R used for?
- Data processing and manipulation
- Statistical inference, data analysis and machine learning
Where is R used the most?
Academia, healthcare, and the government
Why is R a preferred language for some data scientists?
- It is easy to use compared to some other data science tools
- It is a great tool for visualisation
- It doesn’t require installing packages for basic data analysis
What are some popular R libraries for Data Science?
- dplyr (for data manipulation)
- stringr (for string manipulation)
- ggplot (for data visualisation)
- caret (for machine learning)
What are some data visualisation packages in R?
- ggplot (histograms, bar charts, scatterplots)
- plotly (web-based data visualisations)
- lattice (complex, multi-variable data sets)
- leaflet (interactive plots)
What does version control do?
It allows you to keep track of changes to your documents
What is git?
It is a distributed version control system
What is one of the most common version control systems in the world?
git
What is the SSH protocol?
A method for secure remote login from one computer to another
What is a repository?
The folders of your project that are set up for version control
What do repositories do?
They store documents (including your source code) and they enable version control
What is a fork?
A copy of a repository
What is a pull request?
The process you use to request that someone reviews and approves your changes before they become final.
- It serves as a way of proposing changes to the master branch
- Other team members review the changes and approve merging them into the master branch
What is a working directory?
A directory on your file system, including its files and subdirectories, that is associated with a git repository
What is special about the Git Repository Model?
- It is a distributed version control system
- It tracks source code
- It allows collaboration among programmers
- It enables non-linear workflows
What is GitHub?
The online hosting service for git repositories
What is a branch?
A snapshot of your repository
What is the master branch?
The official version of the project
What does a child branch do?
It provides a copy of the master branch in which you can make changes.
When using git repositories, where are edits and changes made?
In the child branch.
In this branch, you can build, make edits, and test the changes; when you are satisfied with them, you can merge them back into the master branch, where you can prepare them for deployment.
What is a key benefit of using child branches?
Branches allow for simultaneous development and testing by multiple team members