Data Architect 37 questions Flashcards

Question

WHAT ARE THE DIFFERENCES BETWEEN STRUCTURED AND UNSTRUCTURED DATA

Answer 1

Data engineers constantly work with data that comes into their systems in all sorts of formats, which can be broadly categorized as structured and unstructured. The two differ in how they are stored and accessed. Some of the differences are listed below.

Criteria | Structured Data | Unstructured Data
Storage | DBMS | Unmanaged file structures
Standard | ADO.NET, ODBC, and SQL | SMTP, XML, CSV, and SMS
Integration tool | ETL (Extract, Transform, Load) | Manual data entry or batch processing that includes code
Scaling | Schema scaling is difficult | Scaling is very easy

Answer 2

This is an important question, and you should be thorough with your answer, as it assesses your understanding of the role and how much you have invested in learning about it. Include the points below in your response. A data engineer might:

- Be involved in any one or more areas of architecting, building, and maintaining the data infrastructure, especially massive ones such as Big Data systems.
- Be responsible for data acquisition and data ingestion processes.
- Be responsible for developing pipelines for various ETL operations.
- Identify ways to improve data reliability and availability.

Answer 3

This question is asked to assess your knowledge of the systems from the ground up. There is no perfect answer to this question and no answer is bad. Working through the questions below should give you a good answer.

- What is the goal of the product?
- What are the data sources important to the customer and to the success of the product?
- In what formats are these available, and where are they located?
- What is the volume of data being acquired?
- What is the availability requirement for the data? In other words, how available do you want your data to be?
- Will there be a need to transform the acquired data?
- Will you need to respond to data being ingested in real time?
- Are there data streams involved, or is there a possibility of any in the future?

Once these questions are answered, you map the challenges and characteristics of each to the available technologies. This is not an exhaustive list of questions, but it is an approach you can take to respond to the original question from the interviewer.

Answer 4

The algorithm that you select to discuss must be one that you are good at and preferably used by the company. There will be follow up questions to understand the depth of your answer, like, What made you choose this algorithm? How scalable is this algorithm? What were the challenges you faced in using this algorithm? How did you tackle them?

Answer 5

Be sure to include in your response the challenges that crop up when transforming unstructured data into structured data.

Answer 6

There is a good likelihood of this question being asked if you are an experienced candidate. Mention the tools you used to build the model and give a brief account of how you did it.

Answer 7

Talk about the tool you have used and highlight some of its features that helped you pick it for ETL.

Answer 8

Big Data is a phenomenon, a result of exponential growth in data availability, storage technology and processing power, while Hadoop is a framework for storing and processing the huge volumes of data found in the Big Data ecosystem. Describe the components of Hadoop as below.

- HDFS (Hadoop Distributed File System)
- MapReduce
- Hadoop Common
- YARN (Yet Another Resource Negotiator)

Answer 9

NameNodes store metadata for all the files stored on the cluster: information such as the location of blocks on the DataNodes, the size of files, and the directory hierarchy. This is similar to a File Allocation Table (FAT), which stores information about the blocks of data that make up files and where they are stored on a single computer. NameNodes keep the same kind of information for a distributed file system. Under normal circumstances, a NameNode crash results in non-availability of data, even though all blocks of data are intact. A high availability setup ensures there is a standby NameNode that mirrors the active one and takes over if it fails.
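
As a rough mental model, the NameNode's metadata can be pictured as two lookup tables: file path to block IDs, and block ID to the DataNodes holding a replica. The sketch below is illustrative Python only. The path, block IDs and node names are made up, and the real NameNode keeps far richer structures.

```python
# Toy model of NameNode metadata (all names below are hypothetical).
# The NameNode stores no file contents, only where the blocks live.
file_to_blocks = {
    "/logs/2024/app.log": ["blk_1", "blk_2"],
}
block_to_datanodes = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}


def locate(path):
    """Return, for each block of the file, the DataNodes holding a replica."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]


locations = locate("/logs/2024/app.log")
```

If these tables are lost, the blocks are still on the DataNodes but nothing can say which blocks make up which file, which is why a NameNode crash makes the data unavailable.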

Answer 10

Blocks are the smallest unit of data allocated to a file, which the Hadoop system automatically creates for storage in different nodes in a distributed file system. Block Scanner verifies the integrity of a DataNode by checking the data blocks stored on it.

Answer 11

MapReduce is a processing module in the Apache Hadoop project. Hadoop is a platform built to tackle big data using a network of computers to store and process data.

What is so attractive about Hadoop is that affordable dedicated servers are enough to run a cluster. You can use low-cost commodity hardware to handle your data. Hadoop is highly scalable: you can start with as few as one machine and then expand your cluster to thousands of servers.

The two major default components of this software library are:

- MapReduce
- HDFS (Hadoop Distributed File System)

Map tasks turn the input into intermediate key-value pairs, and reducers combine that intermediate data into a smaller set of tuples, which forms the final output of the framework. The MapReduce framework also takes care of scheduling and monitoring tasks. By using the resources of multiple interconnected machines, MapReduce effectively handles large amounts of structured and unstructured data.

Before Spark and other modern frameworks, this platform was the dominant player in the field of distributed big data processing. MapReduce assigns fragments of data across the nodes in a Hadoop cluster. The goal is to split a dataset into chunks and use an algorithm to process those chunks at the same time. The parallel processing on multiple machines greatly increases the speed of handling even petabytes of data.

Distributed Data Processing Apps

This framework allows for the writing of applications for distributed data processing. Most programmers use Java, since Hadoop is based on Java. However, you can write MapReduce apps in other languages, such as Ruby or Python. No matter what language a developer uses, there is no need to worry about the hardware that the Hadoop cluster runs on.

Scalability

Hadoop infrastructure can employ enterprise-grade servers as well as commodity hardware. MapReduce creators had scalability in mind. There is no need to rewrite an application if you add more machines. Simply change the cluster setup, and MapReduce continues working with no disruptions.

What makes MapReduce so efficient is that it runs on the same nodes as HDFS. The scheduler assigns tasks to nodes where the data already resides. Operating in this manner increases available throughput in a cluster.

Answer 12

Basic Terminology of Hadoop MapReduce

MapReduce is a processing layer in a Hadoop environment. MapReduce works on tasks related to a job. The idea is to tackle one large request by slicing it into smaller units.

JobTracker and TaskTracker

In the early days of Hadoop (version 1), JobTracker and TaskTracker daemons ran operations in MapReduce. At the time, a Hadoop cluster could only support MapReduce applications.

A JobTracker controlled the distribution of application requests to the compute resources in a cluster. Since it monitored the execution and the status of MapReduce, it resided on a master node.

A TaskTracker processed the requests that came from the JobTracker. All task trackers were distributed across the worker nodes in a Hadoop cluster.

YARN

In Hadoop version 2 and above, YARN became the main resource and scheduling manager. Hence the name Yet Another Resource Negotiator. YARN also works with other frameworks for distributed processing in a Hadoop cluster.

MapReduce Job

A MapReduce job is the top unit of work in the MapReduce process. It is an assignment that Map and Reduce processes need to complete. A job is divided into smaller tasks over a cluster of machines for faster execution.

The tasks should be big enough to justify the task handling time. If you divide a job into unusually small segments, the total time to prepare the splits and create tasks may outweigh the time needed to produce the actual job output.

MapReduce Task

MapReduce jobs have two types of tasks.

A Map Task is a single instance of a MapReduce app. These tasks determine which records to process from a data block. The input data is split and analyzed, in parallel, on the assigned compute resources in a Hadoop cluster. This step of a MapReduce job prepares the <key, value> pair output for the reduce step.

A Reduce Task processes the output of a map task. Similar to the map stage, all reduce tasks occur at the same time, and they work independently. The data is aggregated and combined to deliver the desired output. The final result is a reduced set of <key, value> pairs which MapReduce, by default, stores in HDFS.

Answer 13

The partitioner is responsible for processing the map output. Once MapReduce splits the data into chunks and assigns them to map tasks, the framework partitions the key-value data. This process takes place before the final mapper task output is produced.

MapReduce partitions and sorts the output based on the key. Here, all values for individual keys are grouped, and the partitioner creates a list containing the values associated with each key. By sending all values of a single key to the same reducer, the partitioner ensures that each key is handled by exactly one reducer; with a well-behaved hash function it also spreads the map output roughly evenly across the reducers.

Note: The number of map output files depends on the number of different partitioning keys and the configured number of reducers. The number of reducers is set in the job configuration.

The default partitioner is well-configured for many use cases, but you can reconfigure how MapReduce partitions data. If you use a custom partitioner, make sure that the size of the data prepared for every reducer is roughly the same. When you partition data unevenly, one reduce task can take much longer to complete, which slows down the whole MapReduce job.
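
A minimal sketch of hash partitioning, written in Python purely for illustration (Hadoop itself is Java). The function names are hypothetical, and `zlib.crc32` is used as a stand-in hash because it is stable across runs; it is not what Hadoop uses internally.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always maps to the same index."""
    # zlib.crc32 is a stand-in hash for this sketch, chosen because it is deterministic.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_map_output(pairs, num_reducers):
    """Group (key, value) pairs into one bucket per reducer index."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return dict(buckets)


map_output = [("apache", 7), ("class", 8), ("track", 6), ("apache", 2)]
buckets = partition_map_output(map_output, num_reducers=3)
```

Both `apache` pairs land in the same bucket, so a single reducer sees every value for that key. If a handful of keys carry most of the values, their buckets grow much larger than the others, which is the skew problem described above.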

Answer 14

As the name suggests, MapReduce works by processing input data in two stages: Map and Reduce. To demonstrate this, we will use a simple example that counts the occurrences of words across a set of documents. The final output we are looking for is how many times the words Apache, Hadoop, Class, and Track appear in total in all documents.

For illustration purposes, the example environment consists of three nodes. The input contains six documents distributed across the cluster. We keep it simple here, but in real deployments you can have thousands of servers and billions of documents.

1. First, in the map stage, the input data (the six documents) is split and distributed across the cluster (the three servers). In this case, each map task works on a split containing two documents. During mapping, there is no communication between the nodes. They perform independently.

2. Then, map tasks create a <key, value> pair for every word. These pairs show how many times a word occurs. A word is a key, and a value is its count. For example, one document contains three of the four words we are looking for: Apache 7 times, Class 8 times, and Track 6 times. The key-value pairs in one map task output look like this: <Apache, 7>, <Class, 8>, <Track, 6>. This process is done in parallel tasks on all nodes for all documents, and each task produces its own output.

3. After input splitting and mapping complete, the outputs of every map task are shuffled. This is the first step of the Reduce stage. Since we are looking for the frequency of occurrence of four words, there are four parallel reduce tasks. The reduce tasks can run on the same nodes as the map tasks, or on any other node. The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted for the reduce step. This process groups the values by key in the form of <key, value-list> pairs.

4. In the reduce step of the Reduce stage, each of the four tasks processes a <key, value-list> to produce a final key-value pair. The reduce tasks also happen at the same time and work independently. In our example, each reduce task outputs the total count for one of the four words.

Note: The MapReduce process is not strictly sequential. The Reduce stage does not have to wait for all map tasks to complete before it starts: once a map output is available, a reduce task can begin fetching it, although the reduce function itself runs only after all map output for its keys has arrived.

5. Finally, the data in the Reduce stage is grouped into one output. MapReduce now shows us how many times the words Apache, Hadoop, Class, and Track appeared in all documents. The aggregate data is, by default, stored in HDFS.

The example we used here is a basic one. MapReduce performs much more complicated tasks. Some of the use cases include:

- Turning Apache logs into tab-separated values (TSV).
- Determining the number of unique IP addresses in weblog data.
- Performing complex statistical modeling and analysis.
- Running machine-learning algorithms using different frameworks, such as Mahout.
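
The word-count flow described above can be imitated in a few lines of single-process Python. This is only a sketch of the map, shuffle and reduce steps. The function names and the three tiny document splits are invented for the example, and nothing here runs on a real Hadoop cluster.

```python
from collections import defaultdict


def map_task(documents, wanted):
    """Emit one (word, count) pair per wanted word found in this split."""
    counts = defaultdict(int)
    for doc in documents:
        for word in doc.lower().split():
            if word in wanted:
                counts[word] += 1
    return list(counts.items())


def shuffle(map_outputs):
    """Group the values from all map outputs by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for word, count in output:
            grouped[word].append(count)
    return grouped


def reduce_task(word, counts):
    """Sum the partial counts for one key."""
    return word, sum(counts)


wanted = {"apache", "hadoop", "class", "track"}
# Three splits of two made-up documents each, one split per map task.
splits = [
    ["apache hadoop apache", "class track"],
    ["hadoop hadoop", "track class class"],
    ["apache", "track hadoop"],
]
map_outputs = [map_task(split, wanted) for split in splits]
totals = dict(reduce_task(w, c) for w, c in shuffle(map_outputs).items())
```

Each call to `map_task` is independent of the others, and each call to `reduce_task` needs only the values for its own key, which is what lets the real framework run them in parallel on different nodes.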

Answer 15

There are three steps to deploy a Big Data solution: Data Ingestion, Data Storage and Data Processing.

Data Ingestion

The first step in deploying a big data solution is data ingestion, i.e. the extraction of data from various sources. The data source may be a CRM like Salesforce, an Enterprise Resource Planning system like SAP, an RDBMS like MySQL, or log files, documents, social media feeds and so on. The data can be ingested either through batch jobs or real-time streaming.

Data Storage

After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g. HBase). HDFS storage works well for sequential access, whereas HBase suits random read/write access.

Data Processing

The final step in deploying a big data solution is data processing. The data is processed through one of the processing frameworks, such as Spark, MapReduce or Pig.

Answer 16

Deleting duplicate rows in SQL using GROUP BY and HAVING

In this method, we use the SQL GROUP BY clause to identify the duplicate rows. GROUP BY groups rows by the chosen columns, and the COUNT function, filtered with a HAVING clause, shows which combinations occur more than once.
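
A minimal sketch of the approach, run against an in-memory SQLite database from Python so that it is self-contained. The `employee` table and its columns are hypothetical, and the exact DELETE syntax can differ between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO employee (first_name, last_name) VALUES (?, ?)",
    [("Ann", "Lee"), ("Ann", "Lee"), ("Bob", "Ray"), ("Ann", "Lee")],
)

# Step 1: GROUP BY with HAVING COUNT(*) > 1 lists the duplicated combinations.
duplicates = conn.execute(
    "SELECT first_name, last_name, COUNT(*) FROM employee "
    "GROUP BY first_name, last_name HAVING COUNT(*) > 1"
).fetchall()

# Step 2: keep the row with the smallest id in each group and delete the rest.
conn.execute(
    "DELETE FROM employee WHERE id NOT IN ("
    "SELECT MIN(id) FROM employee GROUP BY first_name, last_name)"
)
remaining = conn.execute(
    "SELECT first_name, last_name FROM employee ORDER BY id"
).fetchall()
```

Keeping `MIN(id)` is one possible choice of survivor; keeping `MAX(id)` would retain the most recently inserted copy instead.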

Answer 17

With the help of big data, companies aim at offering improved customer services, which can help increase profit. Enhanced customer experience is the primary goal of most companies. Other goals include better target marketing, cost reduction, and improved efficiency of existing processes.

1. Customer Acquisition And Retention

To stand out, organizations must have a unique approach to market their products. By using big data, companies can pinpoint exactly what customers are looking for. They establish a solid customer base right out of the gate.

New big data processes observe the patterns of consumers. They then use those patterns to trigger brand loyalty by collecting more data to identify more trends and ways to make customers happy. Amazon has mastered this technique by providing one of the most personalized shopping experiences on the internet today. Suggestions are based not only on past purchases, but also on items that other customers have bought, browsing behavior and many other factors.

2. Focused And Targeted Campaigns

Businesses can use big data to deliver tailored products to their targeted market. Forget spending money on advertising campaigns that don't work. Big data helps companies make a sophisticated analysis of customer trends. This analysis usually includes monitoring online purchases and observing point-of-sale transactions.

These insights then allow companies to create successful, focused and targeted campaigns, thus allowing companies to match and exceed customer expectations and build greater brand loyalty.

3. Identification Of Potential Risks

These days businesses are thriving in high-risk environments, but these environments require risk management processes, and big data has been instrumental in developing new risk management solutions. Big data can improve the effectiveness of risk management models and create smarter strategies.

4. Innovative Products

Big data continues to help companies update existing products while innovating new ones. By collecting large amounts of data, companies are able to distinguish what fits their customer base.

If a company wants to remain competitive in today's market, it can no longer rely on instinct. With so much data to work off of, organizations can now implement processes to track their customer feedback, product success and what their competitors are doing.

5. Complex Supplier Networks

By using big data, companies offer supplier networks, otherwise known as B2B communities, greater precision and insights. Suppliers are able to escape constraints they typically face by applying big data analytics. Through the application of big data, suppliers use higher levels of contextual intelligence, which is necessary for their success.

Supply chain executives are now looking at data analytics as a disruptive technology that changes the foundation of supplier networks to include high-level collaboration. This collaboration lets networks apply new knowledge to existing problems or other scenarios.

How To Begin Putting Big Data To Work

If you are a business that has data, but you do not know where to begin or how to use it, don't worry. You are not alone.

First, you must determine what business problem you will be trying to solve with the data that you have. For instance, are you trying to determine the level of shopping cart abandonment and why?

Second, just because you have the data doesn't automatically mean that you can put it to use to solve your problem. Most organizations have been collecting data for a decade or more. Yet it is unstructured and messy, what is known as "dirty data." You will need to clean it up by putting it into a structured format before you can put it to use.

Third, if you decide to work with a firm, you will need one that can do more than just visualize the data. It will need to be a firm that can model the data to drive insights that will help you solve your business problem. Modeling data is not easy or inexpensive, so it's important to have a budget and plan in place before taking this step.

An Important Investment

The biggest businesses are continuing to grow, thanks to big data analytics. Developing technology is becoming available to more organizations than ever before. Once brands have data at their disposal, they can implement the appropriate analysis systems to solve many of their problems.

Answer 18

Replication Factor: the number of copies of each data block that the Hadoop framework keeps. Blocks are replicated to provide fault tolerance. The default replication factor is 3, which can be configured as required: it can be lowered (for example to 2) or raised above 3.
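
The storage cost of replication is simple arithmetic, sketched below. The helper name is hypothetical, and the 128 MB block size is an assumption of the example (it is the default in Hadoop 2 and later); the replication factor of 3 is the default stated above.

```python
import math


def raw_storage_needed(file_size_mb, block_size_mb=128, replication_factor=3):
    """Return (blocks, block replicas, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication_factor
    # A partially filled last block only occupies its actual size on disk,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * replication_factor
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = raw_storage_needed(1024)
```

For a 1 GB file this gives 8 blocks, 24 block replicas and 3 GB of raw disk, so lowering the factor to 2 saves a third of the space at the cost of tolerating one fewer failure.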

Answer 19

Block Scanner is used to identify corrupt blocks on a DataNode. During a write operation, when a DataNode writes data to HDFS, it verifies a checksum for that data. This checksum helps to detect data corruption during transmission.

Answer 20

Block Scanner is used to identify corrupt blocks on a DataNode.

During a write operation, the DataNode writing the data to HDFS verifies the checksum of the data being written, to detect corruption during transmission. During a read operation, the client verifies the checksum returned by the DataNode against the checksum it calculates over the data, to detect corruption caused by the disk while the data was stored on the DataNode.

These checksum verifications are very helpful, but they are only done when a client attempts a read from (or write to) HDFS. They do not find corruption before a client requests a read of the corrupted data. For that reason, every DataNode periodically runs a block scanner, which verifies all the blocks stored on that DataNode. This allows corrupted blocks to be identified and fixed before a client requests a read. With the block scanner service, HDFS can proactively identify and fix corruption.

Block - The minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2 and later.

Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to conserve disk bandwidth on the DataNode.
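
The idea of a periodic scan can be sketched as follows. This is a toy Python model, not HDFS code: blocks are held in a dictionary, the function names are hypothetical, and `zlib.crc32` stands in for the checksum algorithm.

```python
import zlib


def write_block(data):
    """Store the data together with a checksum computed at write time."""
    return {"data": data, "checksum": zlib.crc32(data)}


def verify_block(block):
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(block["data"]) == block["checksum"]


def scan(blocks):
    """Return the IDs of blocks whose contents no longer match their checksum."""
    return [block_id for block_id, block in blocks.items() if not verify_block(block)]


blocks = {"blk_1": write_block(b"hello"), "blk_2": write_block(b"world")}
blocks["blk_2"]["data"] = b"w0rld"  # simulate silent on-disk corruption
corrupt = scan(blocks)
```

Here `scan` flags `blk_2` without any client having asked to read it, which is the point of running the scanner in the background.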

Answer 21

Heartbeat and Block Report are the two messages that the NameNode receives from each DataNode.

Heartbeat: each DataNode sends a heartbeat signal to the NameNode every 3 seconds (the default, configurable in the hdfs-site.xml file) to show that it is alive. This helps the NameNode decide whether to use the blocks on that DataNode.

Block Report: tells the NameNode what data is stored on a specific DataNode. A block report contains the block ID, the generation stamp and the length for each block replica the server hosts.
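
A minimal sketch of how heartbeats can drive a liveness decision. The function name, the node names and the 30-second timeout are assumptions made for this example and are not the values HDFS uses.

```python
DEAD_AFTER_S = 30  # hypothetical timeout chosen for this sketch, not the HDFS default


def live_datanodes(last_heartbeat, now, dead_after=DEAD_AFTER_S):
    """Return the nodes whose most recent heartbeat is recent enough."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen <= dead_after
    )


# Timestamps (in seconds) of the last heartbeat received from each node.
last_heartbeat = {"datanode-a": 100.0, "datanode-b": 98.0, "datanode-c": 40.0}
alive = live_datanodes(last_heartbeat, now=101.0)
```

In this toy model `datanode-c` has been silent for 61 seconds and is dropped from the live set, so its replicas would no longer be offered to clients.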

Answer 22

Security features of Hadoop consist of Authentication, Service Level Authorization, Authentication for Web Consoles and Data Confidentiality.

Answer 23

What Are DAS Servers?

Direct-attached storage (DAS) is a digital storage system directly attached to the host computer that accesses it. Examples of DAS include:

- Internal hard drives and SSDs
- External hard drives and SSDs
- CDs
- USB flash drives

Pros of DAS Servers

DAS servers bring a lot of advantages to the table. Here are some of the most important ones:

- High performance
- Low maintenance
- Easy to set up and configure
- Relatively inexpensive

These advantages make DAS suitable for small to medium businesses (SMBs) that require a lot of storage but don't have large budgets for advanced storage solutions.

Cons of DAS Servers

While DAS servers have many advantages, they aren't without their drawbacks. One of the biggest is that a DAS system can't be managed over a network. Data can only be accessed and manipulated by the DAS's host computer. Another drawback is that DAS systems don't typically have the same data redundancy options as their NAS counterparts.

What Are NAS Servers?

Network-attached storage (NAS) refers to a self-contained storage system that's attached to a local area network (LAN) or a wide area network (WAN). All devices connected to the network can be granted access by the network administrator to the data stored on the NAS system. It comprises NAS servers and network management software that permits different users to log in to the storage system.

Pros of NAS Servers

Here are some of the advantages you get with NAS servers:

- Centralized file storage and access
- More efficient use of data storage resources
- Easy data recovery
- Easy scalability
- Better data redundancy
- Space savings due to their compact nature
- Affordable high-capacity storage

Cons of NAS Servers

- Require more administrator management than DAS.
- Latency issues may arise due to network issues.

NAS servers are more advanced than their DAS counterparts and offer a more reliable data storage solution.

DAS Servers vs. NAS Servers

The main difference between DAS and NAS servers is that DAS servers are a more localized storage solution. As such, they don't offer the flexibility of file sharing and remote updating that NAS servers offer. When you need more storage, NAS beats DAS because you can easily add another NAS storage device to your network as your data storage needs grow.

Answer 24

Google Big Data Services

GCP offers a wide variety of big data services you can use to manage and analyze your data, including:

Google Cloud BigQuery
BigQuery lets you store and query datasets holding massive amounts of data. The service uses a table structure, supports SQL, and integrates with other GCP services. You can use BigQuery for both batch processing and streaming. This service is ideal for offline analytics and interactive querying.

Google Cloud Dataflow
Dataflow offers serverless batch and stream processing. You can create your own management and analysis pipelines, and Dataflow will automatically manage your resources. The service can integrate with GCP services like BigQuery and third-party solutions like Apache Spark.

Google Cloud Dataproc
Dataproc lets you integrate your open source stack and streamline your process with automation. This is a fully managed service that can help you query and stream your data, using resources like Apache Hadoop in the GCP cloud. You can integrate Dataproc with other GCP services like Bigtable.

Google Cloud Pub/Sub
Pub/Sub is an asynchronous messaging service that manages the communication between different applications. Pub/Sub is typically used for stream analytics pipelines. You can integrate Pub/Sub with systems on or off GCP, and perform general event data ingestion and actions related to distribution patterns.

Google Cloud Composer
Composer is a fully managed, cloud-based workflow orchestration service based on Apache Airflow. You can use Composer to manage data processing across several platforms and create your own hybrid environment. Composer lets you define the process using Python. The service then automates processing jobs, like ETL.

Google Cloud Data Fusion
Data Fusion is a fully managed data integration service that enables stakeholders of various skill levels to prepare, transfer, and transform data. Data Fusion lets you create code-free ETL/ELT data pipelines using a point-and-click visual interface. Data Fusion is built on the open source CDAP project, which provides the portability needed to work with hybrid and multicloud integrations.

Google Cloud Bigtable
Bigtable is a fully managed NoSQL database service built to provide high performance for big data workloads. Bigtable runs on a low-latency storage stack, supports the open-source HBase API, and is available globally. The service is ideal for time-series, financial, marketing, graph, and IoT data. It powers core Google services, including Analytics, Search, Gmail, and Maps.

Google Cloud Data Catalog
Data Catalog offers data discovery capabilities you can use to capture business and technical metadata. To easily locate data assets, you can use schematized tags and build a customized catalog. To protect your data, the service uses access-level controls. To classify sensitive information, the service integrates with Google Cloud Data Loss Prevention.

Answer 25

Google provides a reference architecture for large-scale analytics on Google Cloud, for workloads with more than 100,000 events per second or over 100 MB streamed per second. The architecture is based on Google BigQuery.

Google recommends building a big data architecture with hot paths and cold paths. A hot path is a data stream that requires near-real-time processing, while a cold path is a data stream that can be processed after a short delay. This has several advantages, including:

- Ability to store logs for all events without exceeding quotas
- Reduced costs, because only some events need to be handled as streaming inserts (which are more expensive)

Data originates from two possible sources: analytics events published to Cloud Pub/Sub, and logs from Google Stackdriver Logging. Data is divided into two paths:

- The hot path feeds into BigQuery using a streaming insert, to enable continuous data flow
- The cold path feeds into Google Cloud Storage and from there is loaded in batches into BigQuery
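
The hot/cold split boils down to a routing decision per event. The sketch below shows that decision in plain Python, with no Google Cloud calls. The event types and the rule that decides what counts as "hot" are invented for the example.

```python
def route(event, hot_types):
    """Return 'hot' for events needing near-real-time handling, else 'cold'."""
    return "hot" if event["type"] in hot_types else "cold"


def split_paths(events, hot_types):
    """Separate events into the two paths."""
    paths = {"hot": [], "cold": []}
    for event in events:
        paths[route(event, hot_types)].append(event)
    return paths


events = [
    {"type": "purchase", "user": "u1"},
    {"type": "page_view", "user": "u2"},
    {"type": "purchase", "user": "u3"},
]
# Assumption: only purchases need immediate analysis in this example.
paths = split_paths(events, hot_types={"purchase"})
```

In a real deployment the hot list would go to a streaming insert and the cold list to batch files; here both are just lists, so the cost trade-off is only implied by how few events end up on the hot path.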

Answer 26

Here are a few best practices that will help you get more out of key Google Cloud big data services, such as Cloud Pub/Sub and Google BigQuery.

Data Ingestion and Collection

Ingesting data is a commonly overlooked part of big data projects. There are several options for ingesting data on Google Cloud:

- Using APIs on the data provider: pull data from APIs at scale using Compute Engine instances (virtual machines) or Kubernetes
- Real-time streaming: best with Cloud Pub/Sub
- Large volume of data on-premises: most suitable for Google Transfer Appliance or GCP Online Transfer, depending on volume
- Large volume of data on other cloud providers: use Cloud Storage Transfer Service

Streaming Insert

If you need to stream and process data in near-real time, you'll need to use streaming inserts. A streaming insert writes data to BigQuery and makes it queryable without requiring a load job, which can incur a delay. You can perform a streaming insert on a BigQuery table using the Cloud SDK or Google Dataflow.

Note that it takes a few seconds for streaming data to become available for querying. After data is ingested using a streaming insert, it can take up to 90 minutes for it to be available for operations like copy and export.

Use Nested Tables

You can nest records within tables to create efficiencies in Google BigQuery. For example, if you are processing invoices, the individual lines inside the invoice can be stored as an inner table. The outer table can contain data about the invoice as a whole (for example, the total invoice amount).

This way, if you only need to process data about invoices, and not individual invoice lines, you can run a query only on the outer table to save costs and improve performance. BigQuery only accesses items in the inner table when the query explicitly refers to them.

Big Data Resource Management

In many big data projects, you will need to grant access to certain resources to members of your team, other teams, partners or customers. Google Cloud Platform uses the concept of "resource containers". A container is a grouping of GCP resources that can be dedicated to a specific organization or project.

It is best to define a project for each big data model or dataset. Bring all the relevant resources, including storage, compute, and analytics or machine learning components, into the project container. This will allow you to more easily manage permissions, billing and security.
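
The nested invoice layout described above can be pictured with ordinary Python data structures. This is only an analogy for the outer and inner tables, not BigQuery code, and the field names and values are made up.

```python
# Each invoice is an outer record; its lines form the nested (inner) records.
invoices = [
    {
        "invoice_id": "inv-1",
        "total": 30.0,
        "lines": [
            {"item": "pen", "amount": 10.0},
            {"item": "ink", "amount": 20.0},
        ],
    },
    {
        "invoice_id": "inv-2",
        "total": 5.0,
        "lines": [{"item": "cap", "amount": 5.0}],
    },
]

# Invoice-level question: only the outer fields are touched.
grand_total = sum(inv["total"] for inv in invoices)

# Line-level question: the inner records are read because we refer to them explicitly.
line_items = [line["item"] for inv in invoices for line in inv["lines"]]
```

The first query never looks inside `lines`, which mirrors why a query against only the outer table is cheaper.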

Answer 27

BigQuery runs on a serverless architecture that separates storage and compute and lets you scale each resource independently, on demand. The service lets you easily analyze data using Standard SQL. When using BigQuery, you can run compute resources as needed and significantly cut down on overall costs. You also do not need to perform database operations or system engineering tasks, because BigQuery manages this layer.

Answer 28

BI Engine performs fast in-memory analysis of data stored in BigQuery. BI Engine offers sub-second query response times with high concurrency. You can integrate BI Engine with tools like Google Data Studio to accelerate your data exploration and analysis jobs. Once integrated, you can use Data Studio to create interactive dashboards and reports without compromising scale, performance, or security.

Answer 29

Data QnA is a natural language interface designed for running analytics jobs on BigQuery data. This service lets you get answers by running natural language queries. This means any stakeholder can get answers without having to first go through a skilled BI professional. Data QnA is currently running in private alpha.

Answer 30

Google Cloud Platform (GCP) offers a wide variety of database services. Of these, its NoSQL database services are designed to rapidly process very large, dynamic datasets with no fixed schema.

Answer 31

Google Cloud provides the following NoSQL database services:
- Cloud Firestore: a document-oriented database storing key-value pairs. Optimized for small documents and easy to use with mobile applications.
- Cloud Datastore: a document database built for automatic scaling, high performance, and ease of use.
- Cloud Bigtable: an alternative to HBase, a columnar database system running on HDFS. Suitable for high throughput applications.
- MongoDB Atlas: a managed MongoDB service, hosted by Google Cloud and built by the original makers of MongoDB.

Answer 32

Cloud Firestore is a NoSQL database that stores data in documents, arranged into collections. Firestore is optimized for such collections of small documents. Each of these documents includes a set of key-value pairs. Documents may contain subcollections and nested objects, and their fields can hold primitive values such as strings as well as complex objects and lists. Firestore creates these documents and collections implicitly. That is, when you assign data to a document or collection, Firestore creates the document or collection if it does not exist.
Features
Key features of Google Cloud Firestore include:
- Automatic scaling: Firestore scales data storage automatically, retaining the same query performance regardless of database size.
- Serverless development: networking and authentication are handled using client-side SDKs, with less need for coding.
- Backend security rules: enable complex validation rules on data.
- Offline support: databases can be accessed from user devices while offline on web browsers, iOS and Android.
- Datastore mode: support for the Cloud Datastore API, enabling applications that currently work with Google Cloud Datastore to switch to Firestore without code changes.

Answer 33

Best Practices
Here are a few best practices that will help you make the most of Cloud Firestore:
Database Location
Select a database location closest to your users, to reduce latency. You can choose between two types of locations:
- Multi-regional location: for improved availability, deploys the database in at least two Google Cloud regions.
- Regional location: provides lower cost and better write latency (because there is no need to synchronize with another region).
Indexes
Minimize the number of indexes, because too many indexes can increase write latency and storage costs. Do not index numeric values that increase monotonically, because this can impact latency in high throughput applications.
Optimizing Write Performance
In general, when using Firestore, write to a document no more than once per second. If possible, use asynchronous calls, because they have low latency impact. If there is no data dependency, there is no need to wait until a lookup completes before running a query.
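The "no more than one write per second per document" guideline can be enforced on the client side with a small throttle. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, it does not call the Firestore SDK, and it only tracks timestamps in memory.

```python
import time


class DocumentWriteThrottle:
    """Tracks the last write time per document path (hypothetical helper).

    try_write() returns False when a write to the same document would
    fall inside the minimum interval, so the caller can delay or merge it.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the behavior can be tested
        self._last_write = {}

    def try_write(self, doc_path):
        now = self.clock()
        last = self._last_write.get(doc_path)
        if last is not None and now - last < self.min_interval:
            return False  # too soon for this document
        self._last_write[doc_path] = now
        return True
```

A caller would check try_write("users/alice") before issuing the real write and, if it returns False, either queue the change or merge it into the next write to the same document.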

Answer 34

Cloud Datastore offers high performance and automatic scaling, with a simplified user experience. It is well suited to applications that must process structured data at large scale. Datastore supports ACID transactions, enabling rollback of complex multi-step operations. Behind the scenes, it stores data in Google Bigtable.
Features
Cloud Datastore’s key features include:
- Atomic transactions: executes sets of operations that must either succeed in their entirety or be rolled back.
- High read and write availability: uses a highly redundant design to minimize the impact of component failure.
- Automatic scalability: highly distributed, with scaling managed transparently.
- High performance: uses a mix of indexes and query constraints to ensure that queries scale according to the size of the result set rather than the size of the dataset.
- Flexible storage and querying of data: besides offering a SQL-like query language, maps naturally to object-oriented and scripting languages.
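To illustrate what "succeed in their entirety, or be rolled back" means, here is a minimal in-memory sketch. It is not the Datastore API: InMemoryStore and transfer are hypothetical names, and the store simply commits a working copy only when the whole block finishes without an exception.

```python
import copy
from contextlib import contextmanager


class InMemoryStore:
    """Toy stand-in for a transactional store (not the Datastore client)."""

    def __init__(self):
        self.entities = {}

    @contextmanager
    def transaction(self):
        # All changes go to a private copy of the data.
        working = copy.deepcopy(self.entities)
        yield working
        # Reached only if the with-block raised nothing: commit.
        self.entities = working


def transfer(store, src, dst, amount):
    """Multi-step operation that must apply completely or not at all."""
    with store.transaction() as txn:
        txn[src]["balance"] -= amount
        if txn[src]["balance"] < 0:
            raise ValueError("insufficient funds")
        txn[dst]["balance"] += amount
```

If the ValueError is raised after the first step, the working copy is discarded and the stored balances stay exactly as they were before the call.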

Answer 35

Best Practices
Here are a few best practices that can help you work with Cloud Datastore more effectively:
API Calls
- Use batch operations: these are more efficient because they incur the same overhead as a single operation.
- Roll back failed transactions: if there is another request for the same resources, this will improve the latency of the retry operation.
- Use asynchronous calls: as in Firestore, prefer asynchronous calls if there is no data dependency on the result of a query.
Entities
Do not write to an entity group more than once per second, to avoid timeouts for strongly consistent reads, which will negatively affect the performance of your application. If you are using batch writes or transactions, these count as one write operation.
Sharding and Replication
For hot Datastore keys, you can use sharding or replication to read keys at a higher rate than allowed by Bigtable, the underlying storage. For example, you can replicate keys three times to enable up to three times the read throughput. Alternatively, you can use sharding to break up the key range into several parts.
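The replication idea for hot keys can be sketched as follows. This is a simplified in-memory model, not Datastore client code: the ReplicatedKey class, the "#replica-N" key naming, and the plain dict used as the store are all assumptions made for illustration.

```python
import random


class ReplicatedKey:
    """Stores N copies of one hot value under different keys (toy model).

    Writes go to every replica; each read picks one replica at random,
    so read traffic is spread across N keys instead of one.
    """

    def __init__(self, store, key, replicas=3, rng=None):
        self.store = store  # any dict-like object
        self.replica_keys = [f"{key}#replica-{i}" for i in range(replicas)]
        self._rng = rng or random.Random()

    def write(self, value):
        for replica_key in self.replica_keys:
            self.store[replica_key] = value

    def read(self):
        return self.store[self._rng.choice(self.replica_keys)]
```

The trade-off is that every write now costs N operations, so this pattern fits values that are read far more often than they are written.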

Answer 36

Cloud Bigtable is a managed NoSQL database, intended for analytics and operational workloads. It is an alternative to HBase, a columnar database system that runs on HDFS. Cloud Bigtable is suitable for applications that need high throughput and scalability, for values under 10 MB in size. It can also be used for batch MapReduce, stream processing, and machine learning. Cloud Bigtable has several advantages over a traditional HBase deployment:
- Scalability: scales linearly in proportion to the number of machines in the cluster. HBase has a limit in cluster size, beyond which read/write throughput does not improve.
- Automation: handles upgrades and restarts automatically, and ensures data durability via replication. While in HBase you would need to manage replicas and regions, in Cloud Bigtable you only need to design table schemas and add a second cluster to instances, and replication is configured automatically.
- Dynamic cluster resizing: can grow and shrink cluster size on demand. It takes only a few minutes to rebalance performance across nodes in the cluster. In HBase, cluster resizing is a complex operation that requires downtime.

Answer 37

How it Works
Cloud Bigtable can support low-latency access to terabytes or even petabytes of single-keyed data. Because it is built as a sparsely populated table, it is able to scale to thousands of columns and billions of rows. Each row is indexed by a single row key, which enables very high read and write throughput and makes Bigtable well suited as a data source for MapReduce operations. Cloud Bigtable has several client libraries, including an extension of the Apache HBase library for Java. This allows it to integrate with multiple open source big data solutions.

Answer 38

Best Practices
Here are some best practices to make better use of Cloud Bigtable as an HBase replacement:
- Trade-off Between High Throughput and Low Latency
When planning Cloud Bigtable capacity, consider your goals: you can optimize for throughput at the expense of latency, or vice versa. Cloud Bigtable offers optimal latency when CPU load is under 70%, or preferably around 50%. If latency is less important, you can load CPUs to higher than 70%, to get higher throughput for the same number of cluster nodes.
- Tables and Schemas
If you have several datasets with a similar schema, store them in one big table for better performance. Use a unique row key prefix for each dataset so that its rows are stored one after the other in the table.
- Column Families
If you have rows with multiple related values, it is best to group those columns into a column family. Grouping data as closely as possible avoids the need for complex filters, so you can get exactly the data you need in a single read request.
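The row key prefix advice relies on the fact that Bigtable keeps rows sorted lexicographically by row key, so rows sharing a prefix sit next to each other and can be read as one contiguous range. The sketch below models that with a sorted Python list; the "#" separator and the function names are assumptions, and no Bigtable client is involved.

```python
from bisect import bisect_left


def make_row_key(dataset, entity_id):
    """Build a row key with the dataset name as a prefix (hypothetical format)."""
    return f"{dataset}#{entity_id}"


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix from a lexicographically sorted list."""
    start = bisect_left(sorted_keys, prefix)  # jump to the first candidate
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are contiguous, so the range has ended
        result.append(key)
    return result
```

For example, after sorting keys built for a "clicks" and a "sales" dataset, prefix_scan(keys, "sales#") returns only the sales rows without inspecting the clicks rows.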


(63 cards)