Module One - Data Science Tools Flashcards
Name the data-related categories of Data Science tasks
Data Management
Data Integration and Transformation
Data Visualization
Name the model-related categories of Data Science tasks
Model Building
Model Deployment
Model Monitoring and Assessment
How are Data Science tasks supported?
Data Science Tasks are supported by:
Code Asset Management
Data Asset Management
Execution Environments
Development Environments
What is the main characteristic of Jupyter Notebooks?
A key property of Jupyter Notebooks is to unify documentation, code, output from the code, shell commands, and visualizations in a single document.
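For illustration, a minimal sketch of a single code cell (assuming matplotlib is installed); in a notebook, the printed output and the figure render directly below the cell, alongside any Markdown documentation cells.

```python
# A minimal sketch of what one Jupyter code cell might contain: the code,
# its printed output, and an inline plot all live in the same document.
# Assumes matplotlib is installed.
import matplotlib.pyplot as plt

values = [1, 4, 9, 16, 25]                  # toy data for illustration
print("number of points:", len(values))     # text output appears below the cell

plt.plot(values, marker="o")                # the figure renders inline in the notebook
plt.title("Example inline visualization")
plt.show()
```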
Suggest a tool suitable for web applications that handle large volumes of unstructured data
MongoDB is a document-oriented NoSQL database that stores data in flexible, JSON-like documents (BSON, i.e. Binary JSON), which makes it well suited to web applications handling large volumes of unstructured data.
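A minimal sketch with the pymongo driver, assuming a MongoDB server on localhost:27017; the database and collection names are illustrative.

```python
# Minimal sketch using the pymongo driver; assumes a MongoDB server is
# running on localhost:27017 and that pymongo is installed.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["webapp"]["events"]      # hypothetical database/collection names

# Documents are flexible JSON-like dicts; fields can vary between documents.
collection.insert_one({"user": "alice", "action": "login", "tags": ["web", "mobile"]})

# Query documents by field value.
for doc in collection.find({"user": "alice"}):
    print(doc)
```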
IBM tools that cover the complete Data Science, ML and AI life cycle
Watson Studio and Watson OpenScale together cover the complete development life cycle for all data science, machine learning, and AI tasks.
What is RStudio?
RStudio is an integrated development environment (IDE) designed specifically for the R programming language, a popular tool for statistical computing and data analysis.
What is the de facto standard tool in Code Asset Management?
Git is the de facto standard code asset management (version control) tool. GitHub is a popular web-based platform for hosting and managing Git repositories; its version control, issue tracking, and project management features make it well suited to collaborative software development.
Main difference between Apache Spark and Apache Hadoop for Big Data
Apache Hadoop - distributed storage (HDFS) and batch processing framework for large datasets
Apache Spark - in-memory analytics engine for large-scale data processing, with support for batch, streaming, and machine learning workloads
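A minimal PySpark sketch, assuming the pyspark package is installed; the tiny in-memory DataFrame stands in for a large distributed dataset.

```python
# Minimal PySpark sketch; assumes the pyspark package is installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# A small in-memory DataFrame stands in for a large distributed dataset.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Transformations are lazy; the aggregation executes when .show() is called.
df.groupBy().avg("age").show()

spark.stop()
```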
Is Apache Cassandra suitable for scaling your database services across many commodity servers?
Yes. Apache Cassandra is a highly scalable, distributed NoSQL database that handles large amounts of structured and unstructured data across servers.
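A minimal sketch with the DataStax cassandra-driver package, assuming a reachable Cassandra node and an existing keyspace and table (the names below are illustrative).

```python
# Minimal sketch with the cassandra-driver package; assumes a Cassandra node
# is reachable on localhost and the keyspace/table below already exist.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # add more contact points as the cluster grows
session = cluster.connect("demo_keyspace")

# Insert a row using positional parameters.
session.execute(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    ("u123", "alice"),
)

for row in session.execute("SELECT user_id, name FROM users"):
    print(row.user_id, row.name)

cluster.shutdown()
```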
Describe Apache Airflow
Apache Airflow is an open source platform for programmatically authoring, scheduling and monitoring workflows.
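A minimal DAG sketch, assuming an Airflow 2.x installation with the file placed in the scheduler's dags/ folder; the DAG and task names are illustrative.

```python
# Minimal Airflow 2.x DAG sketch; assumes apache-airflow is installed and
# this file lives in the scheduler's dags/ folder. Names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data from the source system")


def load():
    print("writing data to the warehouse")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task     # load runs only after extract succeeds
```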
Is Kubernetes a proper tool for scaling, managing and launching containerized applications?
Yes. Kubernetes is an open source platform for container orchestration. It automatically launches, scales and manages containerized applications.
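A minimal sketch using the official Python kubernetes client, assuming a valid kubeconfig for a cluster you can reach.

```python
# Minimal sketch using the official "kubernetes" Python client; assumes a
# valid kubeconfig (e.g. ~/.kube/config) for a reachable cluster.
from kubernetes import client, config

config.load_kube_config()                 # read credentials from the local kubeconfig
v1 = client.CoreV1Api()

# List the pods Kubernetes is currently running and managing.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```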
Which Apache tool is suitable for analyzing and visualizing large data sets in Apache Hadoop?
Hue is an open source web interface for analyzing and visualizing large data sets stored in Apache Hadoop.
What is the other name used for Data Asset Management?
Data asset management is also known as data governance or data lineage.
What is the key advantage of APIs in model deployment?
Real-time predictions. APIs enable real-time interaction between deployed models and applications: external systems can request predictions on the fly, and API-based deployment is often combined with CI/CD automation (e.g., GitLab) for seamless deployment and updates.
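A minimal sketch of serving predictions over HTTP with Flask (assuming Flask is installed); the scoring function is a stand-in for a real trained model.

```python
# Minimal sketch of exposing a model behind an HTTP API with Flask; assumes
# Flask is installed. The "model" here is a placeholder scoring function.
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict(features):
    # Stand-in for a real trained model's predict() call.
    return sum(features) / len(features)


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json()              # e.g. {"features": [1.0, 2.0, 3.0]}
    score = predict(payload["features"])
    return jsonify({"prediction": score})


if __name__ == "__main__":
    app.run(port=5000)                        # external systems call POST /predict
```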