Module One - Data Science Tools Flashcards

1
Q

Name the Data categories of Data Science tasks

A

Data Management
Data Integration and Transformation
Data Visualization

2
Q

Name the Model categories of Data Science tasks

A

Model Building
Model Deployment
Model Monitoring and Assessment

3
Q

How are Data Science tasks supported?

A

Data Science Tasks are supported by:
Code Asset Management
Data Asset Management
Execution Environments
Development Environments

4
Q

What is the main characteristic of Jupyter Notebooks?

A

A key property of Jupyter Notebooks is that they unify documentation, code, code output, shell commands, and visualizations in a single document.
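
For instance, a single notebook code cell can mix a shell command, Python code, and an inline plot. A minimal sketch, assuming matplotlib is installed (the `!` prefix is Jupyter/IPython shell syntax, so this runs in a notebook, not plain Python):

```python
# One Jupyter code cell: shell command, Python code, and an inline visualization.
!python --version                    # shell command, run via IPython's "!" syntax

import matplotlib.pyplot as plt

squares = [n ** 2 for n in range(10)]
plt.plot(squares)                    # the figure renders inline, below the cell
plt.title("Squares")
plt.show()
```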

5
Q

Suggest a tool suitable for web applications that handle large volumes of unstructured data

A

MongoDB is a document-oriented NoSQL database that stores data in flexible, JSON-like documents (BSON, Binary JSON).
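
A minimal pymongo sketch of that flexibility (assumes a local MongoDB instance on the default port; the database and collection names are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
events = client["webapp"]["events"]          # databases/collections are created lazily

# Documents are schema-flexible BSON: fields can differ from record to record.
events.insert_one({"user": "alice", "action": "click", "meta": {"page": "/home"}})
print(events.find_one({"user": "alice"}))
```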

6
Q

IBM tools that cover the complete Data Science, ML and AI life cycle

A

Watson Studio and Watson OpenScale together cover the complete development life cycle for all data science, machine learning, and AI tasks.

7
Q

What is RStudio?

A

RStudio is an integrated development environment (IDE) designed specifically for working with the R programming language, a popular tool for statistical computing and data analysis.

8
Q

What is the de facto standard tool in Code Asset Management?

A

Git is the de facto standard code asset management (version control) tool. GitHub is a popular web-based platform for storing and managing source code; its features, including version control, issue tracking, and project management, make it an ideal tool for collaborative software development.

9
Q

Main difference between Apache Spark and Apache Hadoop for Big Data

A

Apache Hadoop - a framework for distributed storage (HDFS) and batch processing of large datasets
Apache Spark - an analytics engine for large-scale, largely in-memory data processing, with support for batch, streaming, and ML workloads
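
A minimal PySpark sketch of a distributed aggregation (assumes pyspark is installed; the CSV file and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Spark parallelizes both the read and the aggregation across the cluster.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.groupBy("region").sum("amount").show()
spark.stop()
```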

10
Q

Is Apache Cassandra suitable for allowing your database services to scale across many commodity servers?

A

Yes. Apache Cassandra is a highly scalable, distributed NoSQL database that handles large amounts of structured and unstructured data across servers.
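
A minimal sketch with the DataStax cassandra-driver (assumes a local node; the keyspace and table are hypothetical):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])     # list more contact points as nodes are added
session = cluster.connect("shop")    # hypothetical keyspace
for row in session.execute("SELECT id, name FROM products LIMIT 5"):
    print(row.id, row.name)
cluster.shutdown()
```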

11
Q

Describe Apache Airflow

A

Apache Airflow is an open source platform for programmatically authoring, scheduling and monitoring workflows.
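
Workflows are defined as Directed Acyclic Graphs (DAGs) in Python code. A minimal Airflow 2.x sketch (the three task bodies are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract data")            # placeholder for real extraction logic

def transform():
    print("transform data")

def load():
    print("load data")

with DAG(dag_id="etl_example", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load   # explicit run order with dependencies
```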

12
Q

Is Kubernetes a proper tool for scaling, managing and launching containerized applications?

A

Yes. Kubernetes is an open source platform for container orchestration. It automatically launches, scales and manages containerized applications.
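
Scaling can also be driven programmatically; a minimal sketch using the official kubernetes Python client (assumes a configured kubeconfig; the deployment name is hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()            # reads credentials from ~/.kube/config
apps = client.AppsV1Api()

# Scale a hypothetical "model-api" deployment to 3 replicas.
apps.patch_namespaced_deployment_scale(
    name="model-api",
    namespace="default",
    body={"spec": {"replicas": 3}},
)
```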

13
Q

Which Apache tool is suitable for analyzing and visualizing large data sets in Apache Hadoop?

A

Hue is an open source web interface for analyzing and visualizing large data sets in Apache Hadoop.

14
Q

What is the other name used for Data Asset Management?

A

Data asset management is also known as data governance or data lineage.

15
Q

What are the key advantages of APIs in model deployment?

A

Real-time predictions.
APIs enable real-time interaction between models and applications. They are the most common enabler of real-time, data-driven decisions, as they allow external systems to request predictions on the fly, often combined with CI/CD automation (e.g., GitLab) for seamless deployment and updates.

16
Q

What is the key advantage of packaging during model deployment?

A

Packaging models as software libraries or containers can help with reproducibility, version control, and easier deployment.

17
Q

How do APIs work during the model deployment stage?

A

A model is deployed as a service via an API (often a REST or gRPC API), allowing external systems to send data to the model and receive predictions in return. This setup facilitates seamless integration into web applications, mobile apps, and other software systems.
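
A minimal sketch of such a REST endpoint with Flask (the model file and feature format are hypothetical):

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")          # hypothetical pre-trained model

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [[5.1, 3.5, 1.4, 0.2]]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```

An external system then POSTs JSON to /predict and receives the prediction in the response.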

18
Q

How does packaging work during the model deployment stage?

A

A model is wrapped into a package (e.g., Python .whl or .tar.gz files) or a container (e.g., Docker), making it easy to deploy across environments or integrate into other software systems. Serving frameworks such as TensorFlow Serving or TorchServe can then serve packaged models at scale.
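
A minimal, hypothetical setup.py illustrating the library route, bundling a serialized model with its inference code so both install together as a wheel (e.g., via `pip wheel .`):

```python
from setuptools import setup, find_packages

setup(
    name="churn-model",                      # hypothetical package name
    version="1.0.0",
    packages=find_packages(),
    package_data={"churn_model": ["artifacts/model.joblib"]},  # ship the model file
    install_requires=["scikit-learn>=1.0", "joblib"],
)
```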

19
Q

Describe the key features of the Datameer data analytics platform

A
  1. Data Integration
  2. Self-Service Data Preparation (Users can clean, join, filter, and transform datasets using a visual interface)
  3. Advanced Analytics
  4. Data Exploration and Visualization
  5. Collaboration and Governance
20
Q

What is the key purpose of the Datameer data analytics platform?

A

Datameer is designed to simplify and speed up the process of preparing, analyzing, and visualizing large datasets.
It is primarily used for data preparation, business intelligence (BI), and data exploration.

21
Q

Describe the key features of the Data Refinery data transformation platform

A

Data Preparation and Transformation
Scalability
Automation
Collaboration
Data Governance and Compliance

22
Q

What is the key purpose of the Data Refinery data transformation platform?

A

Data Refinery is designed to simplify and automate the process of preparing, cleaning, and transforming large datasets from various sources for analytics and machine learning, with a focus on scalability. It helps data teams streamline the often time-consuming and complex tasks of data wrangling, enabling faster and more efficient analysis.

23
Q

Is the Data Refinery tool part of IBM Watson Studio?

A

Yes. Data Refinery is part of IBM Watson Studio.

24
Q

What is the main advantage of the Apache Airflow platform?

A

Apache Airflow excels at automating, orchestrating, and monitoring workflows in data-intensive environments. It is highly versatile and can be used for a wide range of tasks, from data engineering and ETL to machine learning and infrastructure management.

25
Q

Give an example of an ETL use case with the Apache Airflow platform

A

Automating Extract, Transform, and Load (ETL) data pipelines.
Example: Extracting sales data from multiple APIs or databases, transforming it to unify formats, and loading the processed data into a cloud data warehouse like Amazon Redshift or Google BigQuery.

26
Q

Give an example of a data warehousing use case with the Apache Airflow platform

A

Airflow is often used to manage and automate the movement of data into and within data warehouses.
Example: Automating the nightly data load into a Snowflake warehouse, where new data is appended, and old data is archived or updated.

27
Q

Give an example of a use case implementing ML pipelines with the Apache Airflow platform

A

Airflow helps manage machine learning workflows by scheduling model training, validation, hyperparameter tuning, and model deployment. It ensures that the steps in the pipeline are run in a specific order with dependencies.
Example: Running daily batch jobs to retrain a machine learning model with new data, then evaluating the model’s performance and deploying it if it meets certain accuracy thresholds.

28
Q

What are the benefits of using Fiddler for model monitoring in Data Science?

A

Fiddler is essential in data science for ensuring that machine learning models in production continue to perform well, remain interpretable, avoid bias, and comply with industry regulations, enabling data teams to maintain transparency and trust in AI systems.

Proactive Management: Fiddler helps teams catch problems in production early.

Accountability and Trust: By providing explainability, Fiddler builds trust in machine learning models.

Ethical AI Development: It helps ensure that models are free from bias and are ethically sound.

Scalability: Fiddler scales across multiple models.

29
Q

Describe the key features of Apache Airflow

A
  1. Workflow as Code: define workflows as Directed Acyclic Graphs (DAGs) using Python code.
  2. Task Scheduling, Parallel Execution and Task Orchestration
  3. Extensibility and Integration with AWS, Google Cloud Platform, Apache Spark, Hadoop, Databricks, databases, and many more.
  4. Horizontal Scalability
  5. Web-based UI for Monitoring and Visualization