All Flashcards
List stages of experimentation and prototyping.
- problem refinement,
- data selection,
- data exploration,
- feature engineering,
- model prototyping, which covers algorithm selection, model training, hyperparameter tuning, and model evaluation (a minimal sketch of these steps follows).
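A minimal sketch of that prototyping loop, using scikit-learn on a toy dataset; the candidate estimators and parameter grids are illustrative assumptions, not a prescribed setup:

```python
# A sketch of algorithm selection, training, hyperparameter tuning,
# and evaluation in one experiment loop (toy dataset, toy grids).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Algorithm selection: compare candidate estimators.
candidates = {
    "logreg": (LogisticRegression(max_iter=5000), {"C": [0.1, 1.0, 10.0]}),
    "forest": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 200]}),
}

best_name, best_model, best_score = None, None, -1.0
for name, (estimator, grid) in candidates.items():
    # Training plus hyperparameter tuning via cross-validated grid search.
    search = GridSearchCV(estimator, grid, cv=3)
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_name, best_model, best_score = name, search.best_estimator_, search.best_score_

# Evaluation on the held-out test split.
print(best_name, best_model.score(X_test, y_test))
```

Grid search here stands in for whatever tuning strategy the team prefers; the point is that selection, training, tuning, and evaluation all happen inside a single experiment loop.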
What solutions are available for experimentation?
a low-code or no-code solution
What if this is a case of one-off development with no need to develop a retraining pipeline?
the validated model and its associated metadata and artifacts are registered with the model registry.
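As a hedged illustration, registering a model with Vertex AI Model Registry might look like the following; the project ID, bucket path, and serving image are placeholders:

```python
# A minimal sketch of registering a validated model with the
# Vertex AI Model Registry; all names below are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="validated-model",
    artifact_uri="gs://my-bucket/model/",  # exported model artifacts
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
    labels={"stage": "validated"},  # metadata attached to the registry entry
)
print(model.resource_name)
```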
What is referred to as training operationalization?
If the model needs to be retrained repeatedly in the future, an automated training pipeline is also developed (a minimal pipeline sketch follows).
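A minimal sketch of such a pipeline using Kubeflow Pipelines (KFP v2), which Vertex AI Pipelines can run; the component bodies are placeholders, not real training logic:

```python
# A sketch of an automated retraining pipeline in KFP v2;
# component names and bodies are hypothetical placeholders.
from kfp import dsl

@dsl.component
def train_model(train_data_uri: str) -> str:
    # Placeholder: a real component would train on the data at
    # train_data_uri and return the trained model's URI.
    return train_data_uri + "/model"

@dsl.component
def validate_model(model_uri: str) -> bool:
    # Placeholder evaluation gate before registering the model.
    return True

@dsl.pipeline(name="retraining-pipeline")
def retraining_pipeline(train_data_uri: str):
    train_task = train_model(train_data_uri=train_data_uri)
    validate_model(model_uri=train_task.output)
```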
What happens when the model is deployed to its target environment as a service?
It serves predictions to various consumers in the following forms:
- online inference, served in real time, typically as a REST API;
- streaming inference, served in near real time, for example in an event-processing pipeline;
- batch inference, run offline and usually integrated with your ETL processes;
- embedded inference, running on an embedded system (a minimal online-inference sketch follows).
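A minimal sketch of the online-inference case; the endpoint URL and payload schema are hypothetical:

```python
# A sketch of calling a model deployed as a REST prediction service;
# the endpoint and instance schema below are made-up placeholders.
import requests

ENDPOINT = "https://ml.example.com/v1/models/my-model:predict"

payload = {"instances": [{"feature_a": 1.0, "feature_b": "web"}]}
response = requests.post(ENDPOINT, json=payload, timeout=10)
response.raise_for_status()
print(response.json()["predictions"])
```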
From which sources can Data Catalog catalog data assets?
- BigQuery data sets, tables, and views,
- Pub/Sub topics,
- Dataproc Metastore services, databases, and tables.
- Non-GCP data assets: Hive, Oracle, SQL Server, Teradata, Redshift, MySQL, PostgreSQL, Looker, Tableau (a minimal catalog-search sketch follows).
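A minimal sketch of searching Data Catalog for BigQuery assets with the google-cloud-datacatalog client; the project ID is a placeholder:

```python
# A sketch of a Data Catalog search scoped to one project and
# restricted to BigQuery assets; "my-project" is a placeholder.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.append("my-project")

# "system=bigquery" limits results to BigQuery datasets, tables, views.
results = client.search_catalog(scope=scope, query="system=bigquery")
for result in results:
    print(result.linked_resource)
```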
What is Dataplex for?
Dataplex’s intelligent data fabric enables organizations to centrally manage, monitor, and govern their data across data lakes, data warehouses, and data marts with consistent controls, providing access to trusted data and empowering analytics at scale.
List advantages of Dataplex.
- It gives you the freedom to store data wherever you want, at the right price and performance.
- It lets you choose the best analytics tools for the job, including Google Cloud and open-source technologies such as Apache Spark and Presto.
- It enforces consistent controls across your data to ensure unified security and governance.
- It offers built-in data intelligence, using Google’s best-in-class AI/ML capabilities, to automate much of the manual toil of data management and provide access to higher-quality data.
What is one of the core tenets of Dataplex?
Letting you organize and manage your data in a way that makes sense for your business without data movement or duplication.
Dataplex provides built-in one-click templates for common data-management tasks.
True
What is one of the biggest differentiators for Dataplex?
Its data-intelligence capabilities, which use Google’s best-in-class AI/ML technologies.
What is Analytics Hub for?
Analytics Hub lets you exchange data analytics assets across organizations, addressing challenges of data reliability and cost. You can exchange data, ML models, or other analytics assets, and easily publish or subscribe to shared datasets in an open, secure, and privacy-safe environment.
It is a convenient way to build a data ecosystem.
List roles in Analytics Hub.
- a data publisher,
- an exchange administrator,
- a data subscriber.
List Analytics Hub components.
- a publisher project,
- a subscriber project,
- the exchange.
What are data exchanges?
Exchanges are collections of data and analytics assets designed for sharing.
What are BigQuery shared datasets?
Shared datasets are collections of tables and views in BigQuery, defined by a data publisher, that make up the unit of cross-project or cross-organizational sharing (a minimal query sketch follows).
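A minimal sketch of querying a shared dataset after subscribing to it, which makes it appear in the subscriber project as a linked dataset; the project, dataset, and table names are placeholders:

```python
# A sketch of querying a linked dataset in a subscriber project;
# all names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="subscriber-project")

# Once subscribed, the linked dataset can be queried like any other
# BigQuery dataset in the subscriber project.
query = """
    SELECT *
    FROM `subscriber-project.linked_dataset.shared_table`
    LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))
```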
What is recommended to use with large volumes of unstructured data?
- With large volumes of unstructured data, consider using Dataflow, which uses the Apache Beam programming model.
- You can use Dataflow to convert the unstructured data into binary data formats like TFRecord, which can improve the performance of data ingestion during training (a minimal Beam sketch follows).
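A minimal sketch of that conversion with Apache Beam, the programming model Dataflow runs; the input path and record schema are placeholder assumptions:

```python
# A sketch of converting text records to TFRecord with Apache Beam;
# bucket paths and the feature schema are placeholders. Without
# runner options this runs locally on the DirectRunner.
import apache_beam as beam
import tensorflow as tf

def to_tf_example(line: str) -> bytes:
    # Wrap each input line in a tf.train.Example and serialize it.
    feature = {
        "text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[line.encode("utf-8")])
        )
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)
    ).SerializeToString()

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.io.ReadFromText("gs://my-bucket/raw/*.txt")
        | beam.Map(to_tf_example)
        | beam.io.WriteToTFRecord("gs://my-bucket/tfrecords/train")
    )
```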
When is Dataproc recommended?
- if your organization has an investment in an Apache Spark code base and skills,
- if it has existing implementations that use Hadoop with Spark to perform ETL (a minimal PySpark ETL sketch follows).
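A minimal sketch of the kind of PySpark ETL job typically run on Dataproc; the bucket paths and column names are placeholders:

```python
# A sketch of a PySpark extract-transform-load job; all paths and
# column names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-job").getOrCreate()

# Extract: read raw CSV data from Cloud Storage.
raw = spark.read.option("header", True).csv("gs://my-bucket/raw/")

# Transform: filter out null amounts and aggregate per customer.
clean = (
    raw.filter(F.col("amount").isNotNull())
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# Load: write the curated results back to Cloud Storage as Parquet.
clean.write.mode("overwrite").parquet("gs://my-bucket/curated/")
```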