Big Data Domain Flashcards

1
Q

I: ANALYTICS

A

Improved Service Level Performance

Better Order Fulfillment

Improved Supplier Management

Maximized Customer Value

Lower Costs

Improved Advertising

Better Product Management

2
Q

I: DATA WAREHOUSE

A

is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI), and machine learning.
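The "single, central, consistent store" idea can be sketched in a few lines. This is an illustration only: the table names, columns, and unit conversion are invented, and a real warehouse would use a dedicated database and an ETL pipeline rather than in-memory SQLite.

```python
import sqlite3

# Two hypothetical "source systems" with differently shaped sales records.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pos_sales (sku TEXT, amount_cents INTEGER)")
cur.execute("CREATE TABLE web_sales (sku TEXT, amount_dollars REAL)")
cur.executemany("INSERT INTO pos_sales VALUES (?, ?)",
                [("A1", 1250), ("B2", 300)])
cur.executemany("INSERT INTO web_sales VALUES (?, ?)", [("A1", 7.5)])

# Central, consistent store: one schema, one unit (dollars).
cur.execute("CREATE TABLE warehouse_sales (sku TEXT, amount REAL)")
cur.execute("""INSERT INTO warehouse_sales
               SELECT sku, amount_cents / 100.0 FROM pos_sales""")
cur.execute("""INSERT INTO warehouse_sales
               SELECT sku, amount_dollars FROM web_sales""")

# Analysis now runs against a single source of truth.
cur.execute("""SELECT sku, SUM(amount) FROM warehouse_sales
               GROUP BY sku ORDER BY sku""")
print(cur.fetchall())  # [('A1', 20.0), ('B2', 3.0)]
```

The point of the sketch: analysis queries hit one consistent table instead of reconciling each source's format on every question.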

3
Q

I: DATA LAKE

A

A data lake is the place where you store all forms of raw data generated in various parts of your business:

  • Structured data feeds
  • Chat logs
  • Emails
  • Images (of invoices, receipts, checks etc.)
  • Videos
4
Q

I: DATA POOL

A

A data pool is a centralized repository of data where trading partners (e.g., retailers, distributors, or suppliers) can obtain, maintain, and exchange information about products in a standard format. Suppliers can, for instance, upload data to a data pool that cooperating retailers then receive through their own data pool.

5
Q

I: BIG DATA

A

Big Data is often used in enterprise settings to describe large amounts of data. It does not refer to a specific amount of data, but rather describes a dataset that cannot be stored or processed using traditional database software.

6
Q

I: BIG DATA LIFE CYCLE

A

Stated simply, the Big Data life cycle has three primary stages, plus an overarching glue (data governance) that helps manage them:

Data Acquisition
Data Awareness
Data Analytics
Data Governance

7
Q

I: HADOOP

A

Apache Hadoop is an open-source framework used to efficiently store and process large datasets, ranging in size from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop clusters multiple computers so they can analyze massive datasets in parallel, more quickly.

8
Q

I: MAPREDUCE

A

MapReduce is a software framework and programming model for processing huge amounts of data. MapReduce programs work in two phases, Map and Reduce:

Map – tasks split the input and map it to intermediate key-value pairs, while
Reduce – tasks shuffle those pairs and reduce them to final results.
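The two phases can be illustrated with word count, the classic MapReduce example, sketched here in plain Python. A real framework distributes the map and reduce tasks across many machines; here they are ordinary functions so the data flow is visible.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: split each document and emit (word, 1) pairs."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big clusters process big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1, 'process': 1}
```

Because each map call touches only one document and each reduce call touches only one word's values, both phases parallelize naturally across a cluster.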

9
Q

I: Apache Spark

A

Spark is an open-source framework focused on interactive query, machine learning, and real-time workloads. It does not have its own storage system, but runs analytics on other storage systems such as HDFS, Amazon Redshift, Amazon S3, Couchbase, and Cassandra.

10
Q

I: ETL

A

ETL is short for extract, transform, load: three database functions combined into one tool to pull data out of one database, reshape it, and place it into another. Common uses include:

  • Data migration
  • Data merge
  • Data warehousing
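A toy ETL pass makes the three functions concrete: extract rows from a CSV export, transform them (normalize names, parse numbers, drop incomplete rows), and load them into a target database. The schema and field names here are invented for illustration; real ETL tools add scheduling, logging, and error handling around the same three steps.

```python
import csv
import io
import sqlite3

raw = "name,price\n Widget ,9.99\ngadget,12.50\nbroken,\n"

# Extract: read rows out of the source export.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim and lowercase names, parse prices, skip bad rows.
clean = [(r["name"].strip().lower(), float(r["price"]))
         for r in rows if r["price"]]

# Load: insert the cleaned rows into the target database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (name TEXT, price REAL)")
db.executemany("INSERT INTO products VALUES (?, ?)", clean)
print(db.execute("SELECT COUNT(*) FROM products").fetchone()[0])  # 2
```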
11
Q

I: DATA SYNCHRONIZATION ISSUES

A

Data synchronization has become an important part of organizations' information systems. However, the complexity of the process is often underestimated, and it can be quite risky due to factors such as:

Different data formats, owing to the variety of applications, tools, and databases an organization may use.

The quality of the data synchronized. There is a risk of distributing inconsistent or out-of-date data enterprise-wide, which may result in additional (and quite palpable) data-cleansing expenses.

Constant maintenance: as soon as new data enters one application, it must be synchronized across the organization's other applications and systems.
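One simple synchronization policy, "last write wins," can be sketched as follows: merge records from two applications by key, keeping whichever copy carries the newer timestamp. The record shapes and system names here are invented; real systems must also reconcile formats, propagate deletes, and surface genuine conflicts rather than silently overwriting.

```python
def sync(system_a, system_b):
    """Merge two {key: (timestamp, value)} stores; the newer copy wins."""
    merged = dict(system_a)
    for key, (ts, value) in system_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Hypothetical CRM and ERP systems holding overlapping customer records.
crm = {"cust-1": (100, "Alice"), "cust-2": (90, "Bob")}
erp = {"cust-1": (120, "Alice Smith"), "cust-3": (95, "Carol")}

merged = sync(crm, erp)
print(merged["cust-1"])  # (120, 'Alice Smith') -- the newer copy won
```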

12
Q

I: HDFS

A

A distributed file system that runs on standard or low-end (commodity) hardware. HDFS provides better data throughput than traditional file systems, along with high fault tolerance and native support for large datasets.
