Tools for Data Science Flashcards
IBM Data Science Professional Certificate (Course 2/10)
Do I Visit Beautiful Destinations Most Autumns?
What data science categories does raw data need to pass through before it is deemed useful?
- Data management
- Data integration and transformation
- Data visualisation
- Model building
- Model deployment
- Model monitoring and assessment
Do Cats Dance Exquisitely Everywhere?
What tools are used to support the tasks performed in the Data Science Categories?
- Data Asset Management
- Code Asset Management
- Development Environments
- Execution Environments
What is data management?
The process of collecting, persisting, and retrieving data securely, efficiently, and cost-effectively.
Where is data collected from?
Many sources, including (but not limited to) Twitter, Flipkart, sensors, and the Internet.
Where should you store data so it is available whenever you need it?
Store the data in persistent storage
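A minimal sketch of persisting and retrieving data, assuming Python's built-in sqlite3 module stands in for a production data store (the table and values are hypothetical):

```python
import sqlite3

# Connect to (or create) a persistent on-disk database file
conn = sqlite3.connect("measurements.db")
cur = conn.cursor()

# Persist collected data
cur.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, value REAL)")
cur.execute("INSERT INTO readings VALUES (?, ?)", ("temp_sensor_1", 21.4))
conn.commit()

# Retrieve the data whenever it is needed
for row in cur.execute("SELECT sensor, value FROM readings"):
    print(row)

conn.close()
```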
What is data integration and transformation?
The process of extracting, transforming, and loading data (ETL)
Give examples of repositories where data is commonly distributed
- Databases
- Data cubes
- Flat files
Where should you save data once it has been extracted?
It is common practice to save extracted data in a central repository like a data warehouse.
What is a data warehouse primarily used for?
It is primarily used to collect and store massive amounts of data for data analysis.
What is data transformation?
The process of transforming the values, structure, and format of data.
What do you do after extracting data?
Transform the data
What happens to data after it has been transformed?
It is loaded back to the data warehouse
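The extract-transform-load flow above can be sketched with pandas, assuming a hypothetical flat file as the source and a local file standing in for the warehouse:

```python
import pandas as pd

# Extract: read raw data from a source (hypothetical CSV flat file)
raw = pd.read_csv("sales_raw.csv")

# Transform: clean values and change the structure/format
clean = (
    raw.dropna(subset=["amount"])                      # drop incomplete rows
       .assign(amount=lambda df: df["amount"].astype(float))
       .rename(columns={"cust": "customer_id"})
)

# Load: write the transformed data back to the central repository
# (a local CSV stands in for the data warehouse here)
clean.to_csv("sales_clean.csv", index=False)
```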
What is data visualisation?
It is the graphical representation of data and information.
What are some ways to visualise data?
You can visualise the data through charts, plots, maps, and animations.
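A minimal charting sketch with Matplotlib, using hypothetical monthly sales figures:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 160, 155, 170]

plt.plot(months, sales, marker="o")   # a simple line chart
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()
```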
Why is data visualisation a good thing?
It conveys data more effectively for decision-makers
What happens after data visualisation?
Model building
What is model building?
It is the step where you train a model on the data, using suitable machine learning algorithms to analyse patterns.
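A minimal model-building sketch, assuming scikit-learn and its bundled Iris dataset (not part of the course material):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split a small, well-known dataset into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a model so it learns patterns in the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Check how well the learned patterns generalise
print("Accuracy:", model.score(X_test, y_test))
```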
What happens after model building?
Model deployment
What is model deployment?
It is the process of integrating a model into a production environment
In model deployment, how is a machine learning model made available to third-party applications?
It is made available via APIs.
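A minimal sketch of exposing a trained model through an API, assuming Flask and a hypothetical pickled model file (model.pkl):

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical path: a previously trained and pickled model
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload such as {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)  # third-party applications call POST /predict
```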
What goal do third-party applications help achieve during model deployment?
They allow business users to access and interact with data
What is the purpose of model monitoring?
To track the performance of deployed models.
Give an example of a tool used during model monitoring
Fiddler
What is the purpose of model assessment?
To check a model's accuracy and fairness, and to monitor its robustness.
What are some common metrics used during model assessment?
- F1 Score
- True positive rate
- Sum of squared error
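These metrics can be computed directly, for example with scikit-learn and NumPy on hypothetical labels and predictions:

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

y_true = [0, 1, 1, 0, 1, 1, 0, 0]   # hypothetical ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]   # hypothetical model predictions

f1 = f1_score(y_true, y_pred)       # harmonic mean of precision and recall
tpr = recall_score(y_true, y_pred)  # true positive rate (recall)

# Sum of squared errors for regression-style predictions
y_actual = np.array([3.0, 2.5, 4.1])
y_hat = np.array([2.8, 2.7, 4.0])
sse = np.sum((y_actual - y_hat) ** 2)

print(f1, tpr, sse)
```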
What is a popular Model Monitoring and Assessment tool?
IBM Watson OpenScale
What is Code Asset Management?
It is a tool that provides a unified view for managing an inventory of assets.
What do developers use versioning for?
To track the changes made to a software project's code
Give an example of a Code Asset Management platform
GitHub
What is Data Asset Management?
It is a platform for organising and managing data collected from different sources
What do DAM platforms typically support?
Replication, backup, and access right management for stored data
Do I Eat Tasty Donuts?
What do Development Environments (IDEs) provide?
A workspace and tools to develop, implement, execute, test, and deploy source code.
What do execution environments have?
They have libraries for code compiling and the system resources to execute and verify code.
Give an example of a fully-integrated visual tool
IBM Watson Studio
What are the most widely-used open-source data management tools?
- MySQL
- PostgreSQL
- MongoDB (NoSQL)
- Apache CouchDB (NoSQL)
- Hadoop (file-based)
- Ceph (cloud-based system)
What is the task of data integration and transformation in the classic data warehousing world?
It is ETL (extract, transform, load) or ELT (extract, load, transform).
What are the most widely-used data integration and transformation tools?
- Apache Airflow
- Kubeflow
- Apache NiFi
What are the most widely-used data visualisation tools?
- PixieDust
- Hue
- Kibana
- Apache Superset
What are some popular model deployment tools?
- PredictionIO
- Seldon
- MLeap
What are some popular model monitoring and assessment tools?
- ModelDB
- Prometheus
- Adversarial Robustness 360 Toolbox
- AI Explainability 360
Name one popular code asset management tool
Git
What are some popular data asset management tools?
- Apache Atlas
- Kylo
What is the most popular development environment that data scientists are currently using?
Jupyter
What is the next version of Jupyter Notebooks?
JupyterLab
In the long term, it will replace Jupyter Notebooks
What are some characteristics of RStudio?
- Primarily runs R and its associated libraries
- Python development is also possible
- Provides an optimal user experience when all functionality is tightly integrated into the tool
People Enjoy Delicious Red Dates Every Valentine’s
What is RStudio able to unify into one tool?
- Programming
- Execution
- Debugging
- Remote data access
- Data exploration
- Visualisation
What are the features of Spyder?
It integrates:
* Code
* Documentation
* Visualisation
into a single canvas.
Although it is not on par with the functionality of RStudio, Spyder tries to mimic RStudio's behaviour to bring similar functionality to the Python world.
What is the key feature of Apache Spark?
Linear scalability
What does linear scalability mean?
In the context of Apache Spark, it means that the more servers there are in the cluster, the greater the performance.
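A minimal PySpark sketch illustrating the idea (the file path is hypothetical); adding worker nodes to the cluster lets the same code process larger datasets in roughly the same time:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a cluster, the work below
# is automatically split across however many servers are available
spark = SparkSession.builder.appName("scalability-demo").getOrCreate()

# Hypothetical file path; Spark parallelises the read and the count
df = spark.read.csv("events.csv", header=True, inferSchema=True)
print(df.count())

spark.stop()
```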
What is the difference between Apache Spark and Flink?
- Spark is a batch processing engine, capable of processing huge amounts of data file by file
- Flink is a stream processing engine with a main focus on processing real-time data streams
What is a commercial tool?
In the context of data science
Commercial tools are software applications that are often licensed and used by businesses to perform various tasks related to data science.
What does open-source mean?
Open-source refers to a type of software where the source code is made publicly accessible. This means that anyone can view, modify, and distribute the code as they see fit.
Who delivers commercial support to data science tools?
- Software vendors
- Influential partners
- Support networks
What do commercial tools do?
They support the most common tasks in data science
Name 2 commercial tools for data management
- Oracle Database
- Microsoft SQL Server
Name 2 commercial tools for data integration
- Informatica PowerCenter
- IBM InfoSphere DataStage
Name 2 commercial tools for model building
- SPSS Modeler
- SAS Enterprise Miner
Name 2 providers of commercial data asset management tools
- Informatica
- IBM
Name 1 fully-integrated commercial tool that covers the entire data science life cycle
IBM Watson Studio
Which 2 cloud-based tools cover the complete development life cycle for all data science, AI, and machine learning tasks?
- Watson Studio
- Watson OpenScale
In which category of data science tasks will you find SaaS versions of existing open-source and commercial tools?
Data management (with some exceptions, of course)
What do Informatica Cloud Data Integration and IBM’s Data Refinery have in common?
They are both cloud-based commercial data integration tools
What is IBM’s Cognos Business Intelligence suite an example of?
A cloud-based data visualisation tool
Name a cloud-based tool for model building
Watson Machine Learning
What is Amazon SageMaker Model Monitor an example of?
A cloud-based tool for monitoring deployed machine learning and deep learning models continuously
How do you decide which programming language to learn?
It largely depends on your needs, the problems you are trying to solve, and who you are solving the problem for
What are the popular languages within data science?
Python, R, SQL, Scala, Java, C++, and Julia
Which Python scientific computing libraries are commonly used in data science?
Pandas, NumPy, SciPy, and Matplotlib
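A short sketch touching each of these libraries (the data are synthetic, generated only for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# NumPy: generate synthetic data
x = np.linspace(0, 10, 50)
y = 2.0 * x + np.random.normal(0, 1, size=x.size)

# Pandas: organise it in a DataFrame and summarise it
df = pd.DataFrame({"x": x, "y": y})
print(df.describe())

# SciPy: fit a simple linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df["x"], df["y"])

# Matplotlib: plot the data and the fitted line
plt.scatter(df["x"], df["y"], s=10)
plt.plot(df["x"], slope * df["x"] + intercept, color="red")
plt.show()
```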
How can you use Python for Natural Language Processing (NLP)?
By making use of the Natural Language Toolkit (NLTK)
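A minimal NLTK sketch, assuming the punkt tokeniser data has been (or can be) downloaded; the sample sentence is made up:

```python
import nltk

nltk.download("punkt")  # one-time download of the tokeniser models

text = "Data science tools make it easier to turn raw text into insight."

# Tokenise the sentence into words
tokens = nltk.word_tokenize(text)

# Count how often each token appears
freq = nltk.FreqDist(tokens)

print(tokens)
print(freq.most_common(5))
```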
What are the similarities between open source and free software?
- Both are free to use
- Both commonly refer to the same set of licences
- Both support collaboration
What are the differences between open source and free software
Open source is more business-focused while free software is more focused on a set of values
Who is R for?
- Statisticians, mathematicians, and data engineers
- People with minimal or no programming experience
- Learners pursuing a data science career
- People in academia, where R is popular