Data Architect 37 questions Flashcards

1
Q

What is a Data Architect?

A

Data architects are IT professionals who form an integral part of a company’s technology team. They work within database and network systems to streamline processes and to ensure the safety and security of a company’s important business information.

2
Q

How long have you worked as a data architect?

A

Approximately 2 years.

3
Q

Where did you complete your education?

A

4
Q

Do you hold any IT certifications?

A

5
Q

What is your experience leading a team?

A

6
Q

What kinds of operational improvements did you make to the processes of your last company?

A

7
Q

How do you approach new development projects?

A

Step 1: Identify the project.
Step 2: Determine the desired outcome(s).
Step 3: Delineate each of the project’s component tasks.
Step 4: Identify the players:
- Identify who the players are within the practice.
- Identify any “project killers.”
- Identify the external players.

8
Q

How do you choose the right project management methodology?

A

There are lots of factors that will impact which project management methodology is right for your project, team, and organization. Here’s a quick breakdown of some of the key considerations that can help you decide:

Cost and budget: On a scale of $ to $$$, what sort of budget are you working with? Is there room for that to change if necessary, or is it essential that it stays within these predetermined limits?

Team size: How many people are involved? How many stakeholders? Is your team relatively compact and self-organizing, or more sprawling, with a need for more rigorous delegation?

Ability to take risks: Is this a huge project with a big impact that needs to be carefully managed in order to deliver Very Serious Results? Or is it a smaller-scale project with a bit more room to play around?

Flexibility: Is there room for the scope of the project to change during the process? What about the finished product?

Timeline: How much time is allotted to deliver on the brief? Do you need a quick turnaround, or is it more important that you have a beautifully finished result, no matter how long it takes?

Client/stakeholder collaboration: How involved does the client/stakeholder need — or want — to be in the process? How involved do you need — or want — them to be?

9
Q

How do you solve problems that arise when working on projects?

A

Problem Solving Techniques for Project Managers

A 5-step approach: some problems are small and can be resolved quickly, while larger ones benefit from working through the steps in order.
1. Define the problem.
2. Determine the causes.
3. Generate ideas.
4. Select the best solution.
5. Take action.

10
Q

Have you ever had a disagreement with a manager? How did you handle it?

A

11
Q

What kind of statistical and data analysis tools do you prefer to work with?

A

Tableau Public, Power BI, Python (with Anaconda and Google Colab), Excel, KNIME, MATLAB, Databricks and SAC.

12
Q

What recent challenges did you face when completing database assignments? How did you resolve them?

A

13
Q

What do you feel are the most important aspects of a data architect’s role?

A

The interviewer will want to know that you understand the responsibilities of the job and your role in the company. Use examples of professional role models and the traits you’ve developed throughout your experience in your answer to demonstrate your thoughtfulness and awareness of the position’s demands.

Example: “From my previous internship experience, the most important aspects I’ve noticed about the role include hands-on experience with data warehousing tools, automating processes and ensuring the security of company databases. I assisted my team leader in many of these processes, where I developed my ability to work with SAS frameworks to compile and sort sales data and integrate cybersecurity measures to mitigate the risk of compromising private and confidential information.”

14
Q

What key skills do you feel will help you succeed on the job here?

A

This question allows the interviewer to gauge your strongest skills and see how those skills will be an asset to the company. In your answer, describe how your strengths and abilities helped you complete a project or complete an objective in your experience.

Example: “I have keen attention to detail when working with data warehousing projects and automating data sorting functions within both a SAS and SQL environment. I feel that my extensive knowledge of these processes will help me achieve objectives on the job that lead to the company’s overall goals. Additionally, my ability to communicate effectively with an IT team when breaking down complex tasks will be an advantage that contributes to Highlands Data Solutions, Inc.’s growth.”

15
Q

Can you describe the main elements of data warehouse architecture and how you apply them?

A

Technical questions allow the interviewer to assess your knowledge and expertise of working in database systems and structural frameworks. In your answer, highlight these elements and describe your experience with them.

Example: “Essentially, data warehouse architecture consists of three main tiers: a bottom tier (the database server where data from source systems is loaded and stored), a middle tier (the OLAP or application server that reorganizes the data for analysis) and a top tier (the front-end client layer used for querying and reporting). In my last position, I restructured the company’s warehouse along these lines: the bottom tier consolidated the data I compiled from recurring customer subscribers, the middle tier served up sales data for each reporting period and the top tier gave users the tools to run operations against the warehouse. This layered approach is especially useful for breaking down large amounts of data into easily accessible repositories.”

16
Q

What are the three major types of data warehouses?

A

The three main types of data warehouses are enterprise data warehouse (EDW), operational data store (ODS), and data mart. Differences between them:

Enterprise Data Warehouse (EDW)
An enterprise data warehouse (EDW) is a centralized warehouse that provides decision support services across the enterprise. EDWs are usually a collection of databases that offer a unified approach for organizing data and classifying data according to subject.

Operational Data Store (ODS)
An operational data store (ODS) is a central database used for operational reporting and as a data source for the enterprise data warehouse described above. An ODS is a complementary element to an EDW and is used for operational reporting, controls, and decision making.

An ODS is refreshed in real-time, making it preferable for routine activities such as storing employee records. An EDW, on the other hand, is used for tactical and strategic decision support.

Data Mart
A data mart is considered a subset of a data warehouse and is usually oriented to a specific team or business line, such as finance or sales. It is subject-oriented, making specific data available to a defined group of users more quickly, providing them with critical insights. The availability of specific data ensures that they do not need to waste time searching through an entire data warehouse.

17
Q

What is a Data Warehouse?

A

A data warehouse (often abbreviated as DW or DWH) is a central data repository used for reporting and data analysis. It can connect to and integrate multiple data sources to provide a common area to generate business insights.

Summary
A data warehouse (often abbreviated as DW or DWH) is a system used for reporting and data analysis from various sources to provide business insights. It operates as a central repository where information arrives from various sources.
Once in the data warehouse, the data is ingested, transformed, processed, and made accessible for use in decision-making.
The three main types of data warehouses are enterprise data warehouse (EDW), operational data store (ODS), and data mart.

18
Q

Can you describe a project or task you completed in SQL?

A

This software-specific question gives the interviewer insight into your experience level working within SQL to perform your job. If you have experience working in SQL, give some brief examples of the projects you’ve worked on.

Example: “I am currently completing a project for the company I am interning with to integrate more usability functions within its database. Right now, I’m normalizing the existing data to eliminate redundant information. This will help me organize the data into tables unambiguously so that users who want to access it can do so with as few functional commands as possible.”

19
Q

How do you compile user requirements when initiating a new project?

A

This question helps the interviewer evaluate your ability to break down project tasks, prioritize your work and manage your time efficiently to complete your assignments. Use examples of how you apply your organizational and time management skills to initiate your projects.

Example: “The very first thing I do when initiating new client projects is finding out exactly what they need. While these projects typically relate to company customer markets, I have also worked with companies whose stakeholders are highly involved with outlining some project requirements. Next, I determine the overall results I want to achieve on the end-user side of data processes. I also collaborate with other database architects to design the physical appearance of the system per business needs and technical capacity.”

20
Q

Which forecast models would you use for creating a physical model for our quarterly and yearly revenue?

A

These two elements of gathering and organizing data can show the interviewer that you understand when and how to use specific functions within a database to display information. Demonstrate your analytical skills and attention to detail by describing the functionality of each forecast model within a company’s database and how you create it.

Example: “Actually, a time-series model would be appropriate in both instances. But I would create a model to forecast a quarterly and annual report separately. I did this for my last organization, where I first generated a quarterly forecast that modeled projected sales revenue over the entire fiscal year. I integrated the quarterly data into an annual revenue forecast. Time-series modeling is extremely advantageous because it removes language ambiguity between the physical models and documentation reports. Additionally, it’s a time-sensitive way to measure performance and productivity KPIs and ensure the activity is on track with revenue goals.”
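
As a hedged illustration of the idea in the example above, here is a minimal Python sketch (the revenue figures and the naive growth-rate model are invented for demonstration) that rolls quarterly revenue up into annual totals and produces a simple time-series-style projection for the next year:

```python
# Minimal sketch: aggregate quarterly revenue and project next year's revenue
# with a naive average-growth model. All figures are made up for illustration.

quarterly_revenue = {
    "2022-Q1": 1.20, "2022-Q2": 1.35, "2022-Q3": 1.28, "2022-Q4": 1.55,  # $M
    "2023-Q1": 1.32, "2023-Q2": 1.49, "2023-Q3": 1.41, "2023-Q4": 1.72,
}

# Roll quarterly figures up into annual totals.
annual = {}
for quarter, value in quarterly_revenue.items():
    year = quarter.split("-")[0]
    annual[year] = annual.get(year, 0.0) + value

# Naive time-series forecast: apply last year's growth rate to project next year.
years = sorted(annual)
growth = annual[years[-1]] / annual[years[-2]]
forecast_next_year = annual[years[-1]] * growth

print("Annual totals:", {y: round(v, 2) for y, v in annual.items()})
print("Projected next-year revenue:", round(forecast_next_year, 2))
```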

21
Q

Can you describe your process for identifying improvements that need to be made in an existing database?

A

Similar to initiating new projects, your processes for breaking down workloads and prioritizing tasks can show the interviewer your ability to take on challenges and find creative solutions. Describe your approaches and specific data analysis strategies that help you move through processes efficiently.

Example: “I first assess the performance of the database, including its infrastructure, processes, operational speed and execution time. Depending on the size of the database, I’ll automate some of these check-ups to help me move through the initial process of checking for any issues. In my last role, this method helped me identify areas where improvement was necessary. I implemented structural improvements to information sharing between company networks, which resulted in a 10% reduction in operating costs.”

22
Q

How to Increase Database Performance

A

Tip 1: Optimize Queries
In many cases database performance issues are caused by inefficient SQL queries. Optimizing your SQL queries is one of the best ways to increase database performance. When you try to do that manually, you’ll encounter several dilemmas around choosing how best to improve query efficiency. These include understanding whether to write a join or a subquery, whether to use EXISTS or IN, and more. When you know the best path forward, you can write queries that improve efficiency and thus database performance as a whole. That means fewer bottlenecks and fewer unhappy end users.

The best way to optimize queries is to use a database performance analysis solution that can guide your optimization efforts by directing you to the most inefficient queries and offering expert advice on how best to improve them.

Tip 2: Improve Indexes
In addition to queries, the other essential element of the database is the index. When done right, indexing can increase your database performance and help optimize the duration of your query execution. Indexing creates a data structure that helps keep all your data organized and makes it easier to locate information. Because it’s easier to find data, indexing increases the efficiency of data retrieval and speeds up the entire process, saving both you and the system time and effort.
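
To make Tips 1 and 2 concrete, here is a small, hedged sketch using Python’s built-in sqlite3 module (the tables and columns are invented). It creates an index on a frequently joined column and uses EXPLAIN QUERY PLAN to check that the optimizer picks the index up; the same idea applies to other engines, though the tooling differs:

```python
import sqlite3

# Sketch only: a throwaway in-memory database with invented tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE TABLE customers(id INTEGER PRIMARY KEY, region TEXT);
""")

# Tip 2: an index on the column used in joins/filters speeds up lookups.
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id);")

# Tip 1: prefer a join or EXISTS over a correlated IN when it lets the
# optimizer use the index. EXPLAIN QUERY PLAN shows whether the index is used.
query = """
    SELECT c.id
    FROM customers AS c
    WHERE EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id);
"""
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)   # the plan should mention idx_orders_customer
conn.close()
```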

Tip 3: Defragment Data
Data defragmentation is one of the best approaches to increasing database performance. Over time, with so much data constantly being written to and deleted from your database, your data can become fragmented. That fragmentation can slow down the data retrieval process as it interferes with a query’s ability to quickly locate the information it’s looking for. When you defragment data, you allow for relevant data to be grouped together and you erase index page issues. That means your I/O related operations will run faster.

Tip 4: Increase Memory
The efficiency of your database can suffer significantly when you don’t have enough memory available for the database to work correctly. Even if it seems like you have a lot of memory in total, you might not be meeting the demands of your database. A good way to figure out if you need more memory is to check how many page faults your system has. When the number of faults is high, it means your hosts are either running low on or completely out of available memory. Increasing your memory allocation will help boost efficiency and overall performance.

Tip 5: Strengthen CPU
A better CPU translates directly into a more efficient database. That’s why you should consider upgrading to a higher-class CPU unit if you’re experiencing issues with your database performance. The more powerful your CPU is, the less strain it’ll have when dealing with multiple requests and applications. When assessing your CPU, you should keep track of all the elements of CPU performance, including CPU ready times, which tell you about the times your system tried to use the CPU, but couldn’t because the resources were otherwise occupied.

Tip 6: Review Access
Once you know your database hardware is working well, you need to review your database access, including which applications are actually accessing your database. If one of your services or applications is suffering from poor database performance, it’s important not to jump to conclusions about which service or application is responsible for the issue. It’s possible a single client is experiencing the bad performance, but it’s also possible the database as a whole is having issues. Dig into who and what is accessing the database and if it’s only one service that’s having an issue, drill down into its metrics to try and find the root cause.

23
Q

WHAT IS DATA MODELLING?

A

Data modelling is a scientific way of documenting complex data systems by way of a diagram to give a pictorial and conceptual representation of the system. You could also expand on any experience that you have had with data modelling.

24
Q

CAN YOU SPEAK ABOUT TYPES OF DESIGN SCHEMAS IN DATA MODELLING

A

There are mainly two types of schemas in data modelling: 1) Star schema and 2) Snowflake schema. Expand on each or any one of them that you are asked to explain.
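
For a concrete picture, here is a minimal, hedged sketch using Python’s sqlite3 module (all table names are invented): a star schema keeps a central fact table surrounded by flat, denormalized dimension tables, while a snowflake schema normalizes a dimension into further sub-tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: one fact table referencing flat, denormalized dimensions.
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        revenue     REAL
    );
""")

# Snowflake schema: the product dimension is normalized one level further,
# so 'category' moves into its own table referenced by the product dimension.
conn.executescript("""
    CREATE TABLE dim_category   (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product_sf (product_key INTEGER PRIMARY KEY, name TEXT,
                                 category_key INTEGER REFERENCES dim_category(category_key));
""")
conn.close()
```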

25
Q

WHAT ARE THE DIFFERENCES BETWEEN STRUCTURED AND UNSTRUCTURED DATA

A

Data engineers constantly work with data that comes into their systems in all sorts of formats, broadly categorized as structured and unstructured. The two differ in how they are stored and accessed. Some of the differences:

- Storage: structured data lives in a DBMS; unstructured data sits in unmanaged file structures.
- Standards: structured data uses standards such as ADO.NET, ODBC and SQL; unstructured data arrives via SMTP, XML, CSV, SMS and similar formats.
- Integration tools: structured data is integrated with ETL (Extract, Transform, Load) tools; unstructured data often requires manual data entry or custom batch processing.
- Scaling: scaling a schema for structured data is difficult; unstructured data scales very easily.

26
Q

CAN YOU ELABORATE ON THE DAILY RESPONSIBILITIES OF A DATA ENGINEER?

A

This is an important question and you should be thorough with your answer, as it assesses your understanding of the role and how much you have invested in learning it. Include the points below in your response.

- A data engineer might be involved in one or more areas of architecting, building and maintaining data infrastructure, especially infrastructure that is massive in size, as in Big Data platforms.
- Being responsible for data acquisition and data ingestion processes.
- Being responsible for developing pipelines for various ETL operations.
- Identifying ways to improve data reliability and availability.

27
Q

HOW WOULD YOU GO ABOUT DEVELOPING AN ANALYTICAL PRODUCT FROM SCRATCH?

A

This question assesses your knowledge of building systems from the ground up. There is no single perfect answer, and no answer is bad. Working through the questions below will help you form a good response.

What is the goal of the product?
What are the data sources important to the customer and to the success of the product?
In what formats are these sources available, and where are they located?
What is the volume of data being acquired?
What is the requirement on availability of the data or in other words, how available do you want your data to be?
Will there be a need to transform the acquired data?
Will you need to respond to data being ingested in real-time?
Are there data streams involved or will there be a possibility of any in the future?
Once these questions are answered, you can map the requirements to the available technologies, addressing the challenges and characteristics of each. This is not an exhaustive list of questions, but it is an approach you can take to respond to the interviewer’s original question.

28
Q

TAKE US THROUGH ANY ALGORITHM YOU USED ON A RECENT PROJECT?

A

The algorithm you select to discuss should be one you know well, and preferably one used by the company. There will be follow-up questions to probe the depth of your answer, such as:

What made you choose this algorithm?
How scalable is this algorithm?
What were the challenges you faced in using this algorithm? How did you tackle them?

29
Q

HAVE YOU EVER TRANSFORMED UNSTRUCTURED DATA INTO STRUCTURED DATA?

A

Be sure to include in your response the challenges that crop up when transforming unstructured data into structured data.

30
Q

WHAT IS YOUR EXPERIENCE WITH DATA MODELLING?

A

There is a good likelihood of this question being asked if you are an experienced candidate. Mention the tools you used to build the model and give a brief overview of how you built it.

31
Q

WHAT IS YOUR EXPERIENCE WITH ETL AND WHICH ETL TOOLS HAVE YOU USED?

A

Talk about the tool you have used and highlight some of its features that helped you pick it for ETL.

32
Q

WHAT IS BIG DATA AND HOW IS HADOOP RELATED TO BIG DATA?

A

Big Data is a phenomenon resulting from the exponential growth in data availability, storage technology and processing power, while Hadoop is a framework that helps handle the huge volumes of data that reside in the Big Data ecosystem. Describe the components of Hadoop, as below:

HDFS (Hadoop Distributed File System)
MapReduce
YARN (Yet Another Resource Negotiator)
Hadoop Common

33
Q

WHAT IS A NAMENODE AND WHAT ARE THE IMPLICATIONS OF A NAMENODE CRASH?

A

The NameNode stores metadata about all the files stored on the cluster: information such as the location of blocks on DataNodes, file sizes and the directory hierarchy. It is similar to a File Allocation Table (FAT), which stores information about the blocks of data that make up files and where they are stored on a single computer; the NameNode keeps the same kind of information for a distributed file system. Under normal circumstances, a NameNode crash results in the non-availability of data, even though all the blocks of data are intact. A high-availability setup ensures there is a standby NameNode that backs up the active one and takes over if it fails.

34
Q

WHAT IS A BLOCK AND WHAT ROLES DOES BLOCK SCANNER PLAY?

A

Blocks are the smallest unit of data allocated to a file, which the Hadoop system automatically creates for storage in different nodes in a distributed file system. Block Scanner verifies the integrity of a DataNode by checking the data blocks stored on it.

35
Q

What is MapReduce in Hadoop, what role does Reducer play?

A

MapReduce is a processing module in the Apache Hadoop project. Hadoop is a platform built to tackle big data using a network of computers to store and process data.

What is so attractive about Hadoop is that affordable dedicated servers are enough to run a cluster. You can use low-cost consumer hardware to handle your data.

Hadoop is highly scalable. You can start with as low as one machine, and then expand your cluster to an infinite number of servers. The two major default components of this software library are:

MapReduce
HDFS – Hadoop distributed file system

Reducers process the intermediate data produced by the map tasks into a smaller set of tuples, which leads to the final output of the framework. The MapReduce framework also handles the scheduling and monitoring of tasks.

By using the resources of multiple interconnected machines, MapReduce effectively handles a large amount of structured and unstructured data.

Before Spark and other modern frameworks, this platform was the only player in the field of distributed big data processing.

MapReduce assigns fragments of data across the nodes in a Hadoop cluster. The goal is to split a dataset into chunks and use an algorithm to process those chunks at the same time. The parallel processing on multiple machines greatly increases the speed of handling even petabytes of data.

Distributed Data Processing Apps

This framework allows for the writing of applications for distributed data processing. Usually, Java is what most programmers use since Hadoop is based on Java.

However, you can write MapReduce apps in other languages, such as Ruby or Python. No matter what language a developer may use, there is no need to worry about the hardware that the Hadoop cluster runs on.

Scalability

Hadoop infrastructure can employ enterprise-grade servers, as well as commodity hardware. MapReduce creators had scalability in mind. There is no need to rewrite an application if you add more machines. Simply change the cluster setup, and MapReduce continues working with no disruptions.

What makes MapReduce so efficient is that it runs on the same nodes as HDFS. The scheduler assigns tasks to nodes where the data already resides. Operating in this manner increases available throughput in a cluster.

36
Q

What is the Basic terminology of Hadoop MapReduce?

A

Basic Terminology of Hadoop MapReduce
As we mentioned above, MapReduce is a processing layer in a Hadoop environment. MapReduce works on tasks related to a job. The idea is to tackle one large request by slicing it into smaller units.

JobTracker and TaskTracker
In the early days of Hadoop (version 1), JobTracker and TaskTracker daemons ran operations in MapReduce. At the time, a Hadoop cluster could only support MapReduce applications.

A JobTracker controlled the distribution of application requests to the compute resources in a cluster. Since it monitored the execution and the status of MapReduce, it resided on a master node.

A TaskTracker processed the requests that came from the JobTracker. All task trackers were distributed across the slave nodes in a Hadoop cluster.

YARN
In Hadoop version 2 and above, YARN became the main resource and scheduling manager, hence the name Yet Another Resource Negotiator. YARN also works with other frameworks for distributed processing in a Hadoop cluster.

MapReduce Job
A MapReduce job is the top unit of work in the MapReduce process. It is an assignment that Map and Reduce processes need to complete. A job is divided into smaller tasks over a cluster of machines for faster execution.

The tasks should be big enough to justify the task handling time. If you divide a job into unusually small segments, the total time to prepare the splits and create tasks may outweigh the time needed to produce the actual job output.

MapReduce Task
MapReduce jobs have two types of tasks.

A Map Task is a single instance of a MapReduce app. These tasks determine which records to process from a data block. The input data is split and analyzed, in parallel, on the assigned compute resources in a Hadoop cluster. This step of a MapReduce job prepares the key-value pair output for the reduce step.

A Reduce Task processes an output of a map task. Similar to the map stage, all reduce tasks occur at the same time, and they work independently. The data is aggregated and combined to deliver the desired output. The final result is a reduced set of key-value pairs which MapReduce, by default, stores in HDFS.

37
Q

How Hadoop Partitions Map Input Data

A

The partitioner is responsible for processing the map output. Once MapReduce splits the data into chunks and assigns them to map tasks, the framework partitions the key-value data. This process takes place before the final mapper task output is produced.

MapReduce partitions and sorts the output based on the key. Here, all values for individual keys are grouped, and the partitioner creates a list containing the values associated with each key. By sending all values of a single key to the same reducer, the partitioner ensures an even distribution of map output across the reducers.

Note: The number of map output files depends on the number of distinct partitioning keys and the configured number of reducers. The number of reducers is defined in the job configuration.

The default partitioner is well suited to many use cases, but you can reconfigure how MapReduce partitions data.

If you happen to use a custom partitioner, make sure that the size of the data prepared for every reducer is roughly the same. When you partition data unevenly, one reduce task can take much longer to complete. This would slow down the whole MapReduce job.
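
The default behaviour described above is essentially hash partitioning. A minimal Python sketch of the idea (not Hadoop’s actual Java implementation) shows how each key is mapped to one of N reducers, and why a skewed custom scheme could overload a single reducer:

```python
# Sketch: hash-partition map output keys across a fixed number of reducers.
# Note: Python's built-in hash() is salted per process; Hadoop's default
# partitioner uses the key's hashCode, but the principle is the same.
NUM_REDUCERS = 4

def default_partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    # All values for the same key land on the same reducer.
    return hash(key) % num_reducers

map_output = [("Apache", 7), ("Hadoop", 4), ("Class", 8), ("Track", 6)]

buckets = {i: [] for i in range(NUM_REDUCERS)}
for key, value in map_output:
    buckets[default_partition(key)].append((key, value))

for reducer_id, pairs in buckets.items():
    print(f"reducer {reducer_id}: {pairs}")
```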

38
Q

How Hadoop Map and Reduce Work Together

A

As the name suggests, MapReduce works by processing input data in two stages – Map and Reduce. To demonstrate this, we will use a simple example with counting the number of occurrences of words in each document.

The final output we are looking for is: How many times the words Apache, Hadoop, Class, and Track appear in total in all documents.

For illustration purposes, the example environment consists of three nodes. The input contains six documents distributed across the cluster. We will keep it simple here, but in real circumstances, there is no limit. You can have thousands of servers and billions of documents.

1. First, in the map stage, the input data (the six documents) is split and distributed across the cluster (the three servers). In this case, each map task works on a split containing two documents. During mapping, there is no communication between the nodes. They perform independently.

2. Then, map tasks create a key-value pair for every word. These pairs show how many times a word occurs. A word is a key, and a value is its count. For example, one document contains three of the four words we are looking for: Apache 7 times, Class 8 times, and Track 6 times. The key-value pairs in that map task’s output are (Apache, 7), (Class, 8) and (Track, 6).

This process is done in parallel tasks on all nodes for all documents and gives a unique output.

3. After input splitting and mapping complete, the outputs of every map task are shuffled. This is the first step of the Reduce stage. Since we are looking for the frequency of occurrence of four words, there are four parallel Reduce tasks. The reduce tasks can run on the same nodes as the map tasks, or they can run on any other node.

The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted for the reduce step. This process groups the values by key in the form of key-value pairs.

4. In the reduce step of the Reduce stage, each of the four tasks processes the grouped values for its key to produce a final key-value pair. The reduce tasks also happen at the same time and work independently.

In our example, each reduce task arrives at an individual total for its word.

Note: The MapReduce process is not strictly sequential. The Reduce stage does not have to wait for all map tasks to complete. Once a map output is available, a reduce task can begin.

5. Finally, the data in the Reduce stage is grouped into one output. MapReduce now shows us how many times the words Apache, Hadoop, Class, and Track appeared in all documents. The aggregate data is, by default, stored in HDFS.

The example we used here is a basic one. MapReduce performs much more complicated tasks.

Some of the use cases include:

Turning Apache logs into tab-separated values (TSV).
Determining the number of unique IP addresses in weblog data.
Performing complex statistical modeling and analysis.
Running machine-learning algorithms using different frameworks, such as Mahout.
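
A compact Python sketch of the word-count flow described above (the input documents are invented). It mimics the three phases — map emits key-value pairs, shuffle groups values by key, reduce sums them — without any of Hadoop’s distribution machinery:

```python
from collections import defaultdict

# Invented input "documents"; in Hadoop each split would live on a different node.
documents = [
    "Apache Hadoop Class Track Apache",
    "Hadoop Track Track Class",
    "Apache Apache Hadoop",
]

# Map phase: emit a (word, 1) pair for every occurrence.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = [pair for doc in documents for pair in map_phase(doc)]

# Shuffle phase: group all values belonging to the same key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each key's values into the final count.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'Apache': 4, 'Hadoop': 3, 'Class': 2, 'Track': 3}
```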

39
Q

What are the steps to deploy a big data solution?

A

There are three steps to deploy a Big Data solution: Data Ingestion, Data Storage and Data Processing.

Data Ingestion
The first step in deploying a big data solution is data ingestion, i.e. the extraction of data from various sources.
The data sources may be a CRM such as Salesforce, an enterprise resource planning system such as SAP, an RDBMS such as MySQL, or other sources such as log files, documents and social media feeds.
The data can be ingested either through batch jobs or real-time streaming. The extracted data is then stored in HDFS.

Data Storage
After the data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g. HBase).
HDFS storage works well for sequential access, whereas HBase works well for random read/write access.

Data Processing
The final step in deploying a big data solution is the data processing.
The data is processed through one of the processing frameworks like Spark, MapReduce, Pig, etc.

40
Q

How To Deal with Duplicate Entries Using SQL

A

Delete duplicate rows using GROUP BY and HAVING

In this method, we use the SQL GROUP BY clause to identify the duplicate rows. The GROUP BY clause groups data by the defined columns, and we can use the COUNT function to check the number of occurrences of each row.
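
A hedged, self-contained sketch using Python’s sqlite3 module (the table and columns are invented). The GROUP BY / HAVING COUNT(*) > 1 query identifies the duplicates, and the delete keeps the row with the lowest id in each duplicate group; exact syntax varies slightly between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO contacts (name, email) VALUES
        ('Ana',  'ana@example.com'),
        ('Ana',  'ana@example.com'),   -- duplicate
        ('Bram', 'bram@example.com');
""")

# Identify duplicate rows with GROUP BY + HAVING.
duplicates = conn.execute("""
    SELECT name, email, COUNT(*) AS occurrences
    FROM contacts
    GROUP BY name, email
    HAVING COUNT(*) > 1;
""").fetchall()
print("duplicates:", duplicates)

# Delete duplicates, keeping the row with the smallest id in each group.
conn.execute("""
    DELETE FROM contacts
    WHERE id NOT IN (SELECT MIN(id) FROM contacts GROUP BY name, email);
""")
print("remaining:", conn.execute("SELECT * FROM contacts").fetchall())
conn.close()
```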

41
Q

What is your experience of Big Data in a cloud environment?

A

42
Q

How can Data Analytics and Big Data help to positively impact the bottom line of the company?

A

With the help of big data, companies aim to offer improved customer service, which can help increase profit. An enhanced customer experience is the primary goal of most companies. Other goals include better targeted marketing, cost reduction, and improved efficiency of existing processes.

1. Customer Acquisition And Retention

To stand out, organizations must have a unique approach to market their products. By using big data, companies can pinpoint exactly what customers are looking for. They establish a solid customer base right out of the gate.

New big data processes observe the patterns of consumers. They then use those patterns to trigger brand loyalty by collecting more data to identify more trends and ways to make customers happy. Amazon has mastered this technique by providing one of the most personalized shopping experiences on the internet today. Suggestions are based not only on past purchases, but also on items that other customers have bought, browsing behavior and many other factors.

2. Focused And Targeted Campaigns

Businesses can use big data to deliver tailored products to their targeted market. Forget spending money on advertising campaigns that don’t work. Big data helps companies make a sophisticated analysis of customer trends. This analysis usually includes monitoring online purchases and observing point-of-sale transactions.

These insights then allow companies to create successful, focused and targeted campaigns, thus allowing companies to match and exceed customer expectations and build greater brand loyalty.

3. Identification Of Potential Risks

These days businesses are thriving in high-risk environments, but these environments require risk management processes — and big data has been instrumental in developing new risk management solutions. Big data can improve the effectiveness of risk management models and create smarter strategies.

4. Innovative Products

Big data continues to help companies update existing products while innovating new ones. By collecting large amounts of data, companies are able to distinguish what fits their customer base.

If a company wants to remain competitive in today’s market, it can no longer rely on instinct. With so much data to work off of, organizations can now implement processes to track their customer feedback, product success and what their competitors are doing.

5. Complex Supplier Networks

By using big data, companies offer supplier networks, otherwise known as B2B communities, with greater precision and insights. Suppliers are able to escape constraints they typically face by applying big data analytics. Through the application of big data, suppliers use higher levels of contextual intelligence, which is necessary for their success.

Supply chain executives now see data analytics as a disruptive technology that is changing the foundation of supplier networks to include high-level collaboration. This collaboration lets networks apply new knowledge to existing problems or other scenarios.

How To Begin Putting Big Data To Work

If you are a business that has data, but you do not know where to begin or how to use it, don’t worry. You are not alone.

First, you must determine what business problem you will be trying to solve with the data that you have. For instance, are you trying to determine the level of shopping cart abandonment and why?

Second, just because you have the data doesn’t automatically mean that you can put it to use to solve your problem. Most organizations have been collecting data for a decade or more. Yet, it is unstructured and messy — what is known as “dirty data.” You will need to clean it up by putting it into a structured format before you can put it to use.

Third, if you decide to work with a firm, you will need one that can do more than just visualize the data. It will need to be a firm that can model the data to drive insights that will help you solve your business problem. Modeling data is not easy or inexpensive, so it’s important to have a budget and plan in place before taking this step.

An Important Investment

The biggest businesses are continuing to grow, thanks to big data analytics. Developing technology is becoming available to more organizations than ever before. Once brands have data at their disposal, they can implement the appropriate analysis systems to solve many of their problems.

43
Q

What is the replication factor in HDFS?

A

Replication factor: the number of times the Hadoop framework replicates each data block. Blocks are replicated to provide fault tolerance. The default replication factor is 3, which can be configured as per requirements; it can be decreased (for example, to 2) or increased beyond 3.

44
Q

What is a Block Scanner in HDFS?

A

A Block Scanner is used to identify corrupt blocks on a DataNode. During a write operation, when a DataNode writes into HDFS, it verifies a checksum for that data. This checksum helps detect data corruption during transmission.

45
Q

What sequence of events takes place when Block Scanner detects a problem with a data block?

A

Block Scanner is used to identify corrupt blocks on a DataNode.

During a write operation, the DataNode writing data to HDFS verifies a checksum for the data being written, to detect corruption during transmission. During a read operation, the client verifies the checksum returned by the DataNode against the checksum it calculates over the data, to detect corruption caused by the disk during storage on the DataNode.

These checksum verifications are helpful, but they only happen when a client attempts a read or write; on their own they do not find corruption before a client requests a corrupted block.

Therefore every DataNode periodically runs a block scanner, which verifies all the blocks stored on that DataNode. This helps identify and fix corrupt blocks before a client requests a read operation. When the block scanner detects a corrupt block, the DataNode reports it to the NameNode, which arranges for a new replica to be created from a healthy copy and removes the corrupt replica once the replication factor is restored.

Block - The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2 and later.

Block Scanner - The Block Scanner tracks the list of blocks present on a DataNode and verifies them to find checksum errors. Block Scanners use a throttling mechanism to limit the disk bandwidth they consume on the DataNode.

46
Q

What messages are transacted between NameNode and DataNode?

A

Heartbeat and Block Report are the two messages that the NameNode receives from DataNodes.

Heartbeat:
Each DataNode sends a heartbeat signal to the NameNode every 3 seconds (the default, configured in hdfs-site.xml) to report that the node is alive. This helps the NameNode decide whether to keep using that DataNode’s blocks.

Block Report:
The block report gives the NameNode information about which blocks are stored on a specific DataNode.
A block report contains the block ID, the generation timestamp and the length of each block replica the server hosts.

47
Q

What are the main security features of Hadoop?

A

Security features of Hadoop consist of Authentication, Service Level Authorization, Authentication for Web Consoles and Data Confidentiality.

48
Q

What is the difference between NAS and DAS?

A

What Are DAS Servers?
Direct-attached storage (DAS) is a digital storage system directly attached to a host computer accessing it. Examples of DAS include:

Internal hard drives and SSDs
External hard drives and SSDs
CDs
USB flash drives
Pros of DAS Servers
DAS servers bring a lot of advantages to the table. Here are some of the most important ones:
High performance
Low maintenance
Easy to set up and configure
Relatively inexpensive
These advantages of portable DAS storage make it suitable for small to medium businesses (SMBs) that require a lot of storage but don’t have large budgets for advanced storage solutions.

Cons of DAS Servers
While DAS servers have many advantages, they aren’t without their drawbacks. One of the biggest is that a DAS system can’t be managed over a network. Data can only be accessed and manipulated by the DAS’s host computer.

Another drawback is DAS systems don’t typically have the same data redundancy level options as their NAS counterparts.

What Are NAS Servers?
Network-attached storage (NAS) refers to a self-contained storage system that’s attached to a local area network (LAN) or a wide area network (WAN). All devices connected to the network can be granted access by the network administrator to the data stored on the NAS system. It comprises portable NAS servers and network management software that permits different users to log in to the storage system.

Pros of NAS Servers
Are there any advantages of going with a storage system that relies on a network?

There sure are. Here are some of the advantages you get with NAS servers:

Centralized file storage and access
More efficient use of data storage resources
Easy data recovery
Easy scalability
Better data redundancy
Save space due to their compact nature
Affordable high-capacity storage solution
Cons of NAS Servers
Require more administrative management than DAS.
Latency issues may arise due to network problems.

NAS servers are more advanced than their DAS counterparts and offer a more reliable data storage solution.
DAS Servers vs. NAS Servers
The main difference between DAS and NAS servers is that DAS servers are a more localized storage solution. As such, they don’t offer you the flexibility of file sharing and remote updating that NAS servers offer you. When you need more storage, NAS beats DAS because you can easily add another portable NAS storage device to your network as your data storage needs grow.

49
Q

Explain GCP Data Services

A

Google Big Data Services

GCP offers a wide variety of big data services you can use to manage and analyze your data, including:

Google Cloud BigQuery
BigQuery lets you store and query datasets holding massive amounts of data. The service uses a table structure, supports SQL, and integrates seamlessly with all GCP services. You can use BigQuery for both batch processing and streaming. This service is ideal for offline analytics and interactive querying.

Google Cloud Dataflow
Dataflow offers serverless batch and stream processing. You can create your own management and analysis pipelines, and Dataflow will automatically manage your resources. The service can integrate with GCP services like BigQuery and third-party solutions like Apache Spark.

Google Cloud Dataproc
Dataproc lets you integrate your open source stack and streamline your process with automation. This is a fully managed service that can help you query and stream your data, using resources like Apache Hadoop in the GCP cloud. You can integrate Dataproc with other GCP services like Bigtable.

Google Cloud Pub/Sub
Pub/Sub is an asynchronous messaging service that manages the communication between different applications. Pub/Sub is typically used for stream analytics pipelines. You can integrate Pub/Sub with systems on or off GCP, and perform general event data ingestion and actions related to distribution patterns.

Google Cloud Composer
Composer is a fully-managed, cloud-based workflow orchestration service based on Apache Airflow. You can use Composer to manage data processing across several platforms and create your own hybrid environment. Composer lets you define the process using Python. The service then automates processing jobs, like ETL.

Google Cloud Data Fusion
Data Fusion is a fully-managed data integration service that enables stakeholders of various skill levels to prepare, transfer, and transform data. Data Fusion lets you create code-free ETL/ELT data pipelines using a point-and-click visual interface. Data Fusion is based on an open source project (CDAP), which provides the portability needed to work with hybrid and multicloud integrations.

Google Cloud Bigtable
Bigtable is a fully-managed NoSQL database service built to provide high performance for big data workloads. Bigtable runs on a low-latency storage stack, supports the open-source HBase API, and is available globally. The service is ideal for time-series, financial, marketing, graph data, and IoT. It powers core Google services, including Analytics, Search, Gmail, and Maps.

Google Cloud Data Catalog
Data Catalog offers data discovery capabilities you can use to capture business and technical metadata. To easily locate data assets, you can use schematized tags and build a customized catalog. To protect your data, the service uses access-level controls. To classify sensitive information, the service integrates with Google Cloud Data Loss Prevention.

50
Q

Detail an Architecture for Large Scale Big Data Processing on Google Cloud

A

Google provides a reference architecture for large-scale analytics on Google Cloud, with more than 100,000 events per second or over 100 MB streamed per second. The architecture is based on Google BigQuery.

Google recommends building a big data architecture with hot paths and cold paths. A hot path is a data stream that requires near-real-time processing, while a cold path is a data stream that can be processed after a short delay.

This has several advantages, including:
Ability to store logs for all events without exceeding quotas

Reduced costs, because only some events need to be handled as streaming inserts (which are more expensive)

Data originates from two possible sources—analytics events published to Cloud Pub/Sub, and logs from Google Stackdriver Logging. Data is divided into two paths:

The hot path feeds into BigQuery using streaming inserts, enabling a continuous data flow.

The cold path feeds into Google Cloud Storage and, from there, is loaded in batches into BigQuery.

51
Q

Explain the GCP Big Data Best Practices

A

Here are a few best practices that will help you make the most of key Google Cloud big data services, such as Cloud Pub/Sub and Google BigQuery.

Data Ingestion and Collection
Ingesting data is a commonly overlooked part of big data projects. There are several options for ingesting data on Google Cloud:
-Using the data provider’s APIs—pull data from APIs at scale using Compute Engine instances (virtual machines) or Kubernetes
-Real-time streaming—best handled with Cloud Pub/Sub
-Large volumes of on-premises data—most suitable for Google Transfer Appliance or GCP online transfer, depending on volume
-Large volumes of data on other cloud providers—use Cloud Storage Transfer Service

Streaming Insert
If you need to stream and process data in near-real time, you’ll need to use streaming inserts. A streaming insert writes data to BigQuery and queries it without requiring a load job, which can incur a delay. You can perform a streaming insert on a BigQuery table using the Cloud SDK or Google Dataflow.
Note that it takes a few seconds for streaming data to become available for querying. After data is ingested using streaming insert, it takes up to 90 minutes for it to be available for operations like copy and export.
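
Besides the Cloud SDK and Dataflow, the BigQuery Python client library also supports streaming inserts. A hedged sketch, assuming the google-cloud-bigquery package is installed, application-default credentials are configured, and the project/dataset/table names and row fields are placeholders:

```python
from google.cloud import bigquery

# Assumes application-default credentials; the table name is a placeholder.
client = bigquery.Client()
table_id = "my-project.analytics.events"  # hypothetical table

rows = [
    {"event": "page_view", "user_id": "u-123", "ts": "2024-01-01T12:00:00Z"},
    {"event": "click",     "user_id": "u-456", "ts": "2024-01-01T12:00:05Z"},
]

# insert_rows_json performs a streaming insert (no load job required).
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert reported errors:", errors)
else:
    print("Rows streamed; they become queryable within a few seconds.")
```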

Use Nested Tables
You can nest records within tables to create efficiencies in Google BigQuery. For example, if you are processing invoices, the individual lines inside the invoice can be stored as an inner table. The outer table can contain data about the invoice as a whole (for example, the total invoice amount). This way, if you only need to process data about invoices, and not individual invoice lines, you can run a query only on the outer table to save costs and improve performance. Google only accesses items in the inner table when the query explicitly refers to them.
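
To illustrate the nested-record idea, here is a hedged Standard SQL sketch run through the same Python client (the invoices table, with a nested invoice_lines column of repeated records, is hypothetical). The first query touches only outer-level fields; the second explicitly UNNESTs the inner table when line-level detail is needed:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# Outer-level query: no need to scan the nested invoice lines.
totals_sql = """
    SELECT invoice_id, total_amount
    FROM `my-project.billing.invoices`        -- hypothetical table
    WHERE invoice_date >= '2024-01-01'
"""

# Line-level query: UNNEST exposes the nested records as rows.
lines_sql = """
    SELECT invoice_id, line.sku, line.amount
    FROM `my-project.billing.invoices`,
         UNNEST(invoice_lines) AS line
    WHERE line.amount > 100
"""

for row in client.query(totals_sql).result():
    print(row.invoice_id, row.total_amount)

for row in client.query(lines_sql).result():
    print(row.invoice_id, row.sku, row.amount)
```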

Big Data Resource Management
In many big data projects, you will need to grant access to certain resources to members of your team, other teams, partners or customers. Google Cloud Platform uses the concept of “resource containers”. A container is a grouping of GCP resources that can be dedicated to a specific organization or project. It is best to define a project for each big data model or dataset. Bring all the relevant resources, including storage, compute, and analytics or machine learning components, into the project container. This will allow you to more easily manage permissions, billing and security.

52
Q

How Does Google BigQuery Work?

A

BigQuery runs on a serverless architecture that separates storage and compute and lets you scale each resource independently, on demand. The service lets you easily analyze data using Standard SQL.

When using BigQuery, you can run compute resources as needed and significantly cut down on overall costs. You also do not need to perform database operations or system engineering tasks, because BigQuery manages this layer.

53
Q

What is BigQuery BI Engine?

A

BI Engine performs fast, in-memory analysis of data stored in BigQuery. It offers sub-second query response times with high concurrency.

You can integrate BI Engine with tools like Google Data Studio and accelerate your data exploration and analysis jobs. Once integrated, you can use Data Studio to create interactive dashboards and reports without compromising scale, performance, or security.

54
Q

What is Google Cloud Data QnA?

A

Data QnA is a natural language interface designed for running analytics jobs on BigQuery data. This service lets you get answers by running natural language queries. This means any stakeholder can get answers without having to first go through a skilled BI professional. Data QnA is currently running in private alpha.

55
Q

What is Google Cloud NoSQL?

A

Google’s cloud platform (GCP) offers a wide variety of database services. Of these, its NoSQL database services are unique in their ability to rapidly process very large, dynamic datasets with no fixed schema.

56
Q

What are Google Cloud NoSQL Database Options?

A

Google Cloud provides the following NoSQL database services:

Cloud Firestore—a document-oriented database storing key-value pairs. Optimized for small documents and easy to use with mobile applications.

Cloud Datastore—a document database built for automatic scaling, high performance, and ease of use.

Cloud Bigtable—an alternative to HBase, a columnar database system running on HDFS. Suitable for high throughput applications.

MongoDB Atlas—a managed MongoDB service, hosted by Google Cloud and built by the original makers of MongoDB.

57
Q

What is Google Cloud Firestore?

A

Cloud Firestore is a NoSQL database that stores data in documents, arranged into collections.

Firestore is optimized for such collections of small documents. Each of these documents includes a set of key-value pairs. Documents may contain subcollections and nested objects, including strings, complex objects, lists, or other primitive fields.

Firestore creates these documents and collections implicitly. That is, when you assign data to a document or collection, Firestore creates the document or collection if it does not exist.

Features
Key features of Google Cloud Firestore include:
-Automatic scaling—Firestore scales data storage automatically, retaining the same query performance regardless of database size.
-Serverless development—networking and authentication are handled using client-side SDKs, with less need for coding.
-Backend security rules—enabling complex validation rules on data.
-Offline support—databases can be accessed from user devices while offline on web browsers, iOS and Android.
-Datastore mode—support for the Cloud Datastore API, enabling applications that currently work with Google Cloud Datastore to switch to Firestore without code changes.
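
A hedged sketch of the document/collection model described above, using the google-cloud-firestore Python client (the collection and field names are invented; assumes credentials and a Firestore database already exist):

```python
from google.cloud import firestore

client = firestore.Client()  # assumes application-default credentials

# Writing to a document implicitly creates the collection and the document.
user_ref = client.collection("users").document("alice")
user_ref.set({
    "name": "Alice",
    "visits": 1,
    "preferences": {"theme": "dark"},   # nested object inside the document
})

# Documents can also hold subcollections.
user_ref.collection("orders").document("order-001").set({"total": 42.50})

# Reading a document back as a plain dict.
snapshot = user_ref.get()
print(snapshot.to_dict())
```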

58
Q

What are the Best Practices for GCP Firestore?

A

Best Practices
Here are a few best practices that will help you make the most of Cloud Firestore:

Database Location
Select a database location closest to your users, to reduce latency. You can select two types of locations:
Multi-regional location—for improved availability, deploys the database in at least two Google Cloud regions.
Regional location—provides lower cost and better write latency (because there is no need to synchronize with another region)

Indexes
Minimize the number of indexes—too many indexes can increase write latency and storage costs. Do not index numeric values that increase monotonically, because this can impact latency in high throughput applications.

Optimizing Write Performance
In general, when using Firestore, write to a document no more than once per second. If possible, use asynchronous calls, because they have low latency impact. If there is no data dependency, there is no need to wait until a lookup completes before running a query.

59
Q

What is Google Cloud Datastore?

A

Cloud Datastore offers high performance and automatic scaling, with a simplified user experience. It is perfect for applications that must process structured data at large scale. Datastore supports ACID transactions, enabling complex multi-step operations to be rolled back. Behind the scenes, it stores data in Google Bigtable.

Features
Cloud Datastore’s key features include:
-Atomic transactions—executing operation sets which must succeed in their entirety, or be rolled back.
-High read and write availability—uses a highly redundant design to minimize the impact of component failure.
-Automatic scalability—highly distributed, with scaling transparently managed.
-High performance—mixes index and query constraints to ensure that queries scale according to result-set size rather than the size of the data set.
-Flexible storage and querying of data—besides offering a SQL-like query language, maps naturally to object-oriented scripting languages.

60
Q

Name some Best Practices of GCP Datastore

A

Best Practices
Here are a few best practices that can help you work with Cloud Datastore more effectively:

API Calls
-Use batch operations—these are more efficient because they use the same overhead as a single operation.
-Roll back failed transactions—if there is another request for the same resources, rolling back will improve the latency of the retry operation.
-Use asynchronous calls—as in Firestore, prefer asynchronous calls if there is no data dependency on the result of a query.

Entities
Do not write to an entity group more than once per second, to avoid timeouts for strongly consistent reads, which will negatively affect performance for your application. If you are using batch writes or transactions, these count as one write operation.

Sharding and Replication
For hot Datastore keys, you can use sharding or replication to read keys at a higher rate than Bigtable, the underlying storage, would otherwise allow. For example, you can replicate a key three times to enable 3x faster read throughput, or you can use sharding to break the key range into several parts.
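
Tying back to the batch-operation recommendation above, here is a hedged sketch using the google-cloud-datastore Python client (the kind, key names and properties are invented; assumes credentials are configured). put_multi writes several entities with the overhead of a single call:

```python
from google.cloud import datastore

client = datastore.Client()  # assumes application-default credentials

# Build a few entities of a hypothetical "Task" kind with explicit key names.
descriptions = ("load sales data", "refresh dashboard", "archive logs")
tasks = []
for i, description in enumerate(descriptions):
    entity = datastore.Entity(key=client.key("Task", f"task-{i}"))
    entity.update({"description": description, "done": False})
    tasks.append(entity)

# Batch write: one round trip instead of three separate put() calls.
client.put_multi(tasks)

# Batch read works the same way with get_multi on a list of keys.
print(client.get_multi([task.key for task in tasks]))
```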

61
Q

What is Google Cloud Bigtable?

A

Cloud Bigtable is a managed NoSQL database, intended for analytics and operational workloads. It is an alternative to HBase, a columnar database system that runs on HDFS.

Cloud Bigtable is suitable for applications that need high throughput and scalability, for values under 10 MB in size. It can also be used for batch MapReduce, stream processing, and machine learning.

Cloud Bigtable has several advantages over a traditional HBase deployment.

Scalability—scales linearly in proportion to the number of machines in the cluster. HBase has a limit in cluster size, beyond which read/write throughput does not improve.

Automation—handles upgrades and restarts automatically, and ensures data durability via replication. While in HBase you would need to manage replicas and regions, in Cloud Bigtable you only need to design table schemas and add a second cluster to instances, and replication is configured automatically.

Dynamic cluster resizing— can grow and shrink cluster size on demand. It takes only a few minutes to rebalance performance across nodes in the cluster. In HBase, cluster resizing is a complex operation that requires downtime.

62
Q

How does BigTable work?

A

How it Works

Cloud Bigtable can support low-latency access to terabytes or even petabytes of single-keyed data. Because it is built as a sparsely populated table, it is able to scale to thousands of columns and billions of rows. A row key indexes each row as a single value, enabling very high read and write throughput, making it ideal as a data source for MapReduce operations.

Cloud Bigtable has several client libraries including an extension of Apache HBase for Java. This allows it to integrate with multiple open source big data solutions.

63
Q

Name the GCP BigTable best-practices

A

Best Practices

Here are some best practices to make better use of Cloud Bigtable as an HBase replacement:

Trade-off Between High Throughput and Low Latency
When planning Cloud Bigtable capacity, consider your goals—you can optimize for throughput and reduce latency, or vice versa. Cloud Bigtable offers optimal latency when CPU load is under 70%, or preferably around 50%. If latency is less important, you can load CPUs to higher than 70% to get higher throughput for the same number of cluster nodes.

Tables and Schemas
If you have several datasets with a similar schema, store them in one big table for better performance. Use a unique row key prefix to ensure datasets are stacked one after the other in the table.

Column Families
If you have rows with multiple related values, it is best to group those columns into a column family. Grouping data as closely as possible avoids the need for complex filters—you can get exactly the data you need in a single read request.