Big Data Domain Flashcards
I: ANALYTICS
Improved Service Level Performance
Better Order Fulfillment
Improved Supplier Management
Maximized Customer Value
Lower Costs
Improved Advertising
Better Product Management
I: DATA WAREHOUSE
A data warehouse is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI), and machine learning.
I: DATA LAKE
A data lake is the place where you dump all forms of data generated in various parts of your business:
- Structured data feeds
- Chat logs
- Emails
- Images (of invoices, receipts, checks etc.)
- Videos
I: DATA POOL
Data Pool is a centralized repository of data where trading partners (e.g., retailers, distributors or suppliers) can obtain, maintain and exchange information about products in a standard format. Suppliers can, for instance, upload data to a data pool that cooperating retailers can then receive through their data pool.
I: BIG DATA
Big Data is often used in enterprise settings to describe large amounts of data. It does not refer to a specific amount of data, but rather describes a dataset that cannot be stored or processed using traditional database software.
I: BIG DATA LIFE CYCLE
Stated simply, there are three primary stages of the Big Data Life Cycle, along with an overarching glue (data governance) that helps to manage these:
Data Acquisition
Data Awareness
Data Analytics
Data Governance
I: HADOOP
Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.
I: MAPREDUCE
MapReduce is a software framework and programming model for processing huge amounts of data. MapReduce programs work in two phases: Map and Reduce.
Map – tasks split and map the input data, while
Reduce – tasks shuffle and reduce the mapped data.
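The two phases can be sketched in plain Python (the word-count job, the helper names, and the sample documents are illustrative assumptions, not part of any real Hadoop API):

```python
from collections import defaultdict

def map_phase(document):
    # Map: split the document and emit (word, 1) pairs
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values (here, sum the counts)
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big ideas", "data lakes hold data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts["data"])  # → 3
```

In a real cluster, Map and Reduce tasks run in parallel on many nodes and the framework performs the shuffle over the network; this single-process sketch only shows the data flow.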
I: Apache Spark
Spark is an open-source framework focused on interactive query, machine learning, and real-time workloads. It does not have its own storage system, but runs analytics on other storage systems like HDFS, or other popular stores like Amazon Redshift, Amazon S3, Couchbase, Cassandra, and others.
I: ETL
ETL is short for extract, transform, load: three database functions combined into one tool to pull data out of one source and place it in another. Common ETL use cases:
- Data migration
- Data merge
- Data warehousing
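The three ETL steps can be sketched with the standard library (the in-memory CSV source, the `orders` table, and its columns are illustrative assumptions standing in for real source and target systems):

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (an in-memory CSV here)
source_csv = "id,amount\n1, 10.5 \n2, 7.25 \n"
rows = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: strip whitespace and cast fields to proper types
cleaned = [(int(r["id"]), float(r["amount"].strip())) for r in rows]

# Load: insert the cleaned rows into the target store (in-memory SQLite)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # → 17.75
```

Production ETL tools add scheduling, error handling, and incremental loads, but the extract → transform → load shape is the same.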
I: DATA SYNCHRONIZATION ISSUES
Data synchronization has become an important part of organizations' information systems. However, the complexity of the process is often underestimated, and it can be quite risky due to factors such as:
- Different data formats, owing to the variety of applications, tools, and databases an organization may use.
- The quality of the data being synchronized. There is a risk of distributing inconsistent or out-of-date data enterprise-wide, which can result in additional (and quite palpable) data cleansing expenses.
- Constant maintenance: as soon as new data enters one application, it must be synchronized across the organization's other applications and systems.
I: HDFS
HDFS (Hadoop Distributed File System) is a distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, along with high fault tolerance and native support for large datasets.