Big Data Domain Flashcards
I: ANALYTICS
Improved Service Level Performance
Better Order Fulfillment
Improved Supplier Management
Maximized Customer Value
Lower Costs
Improved Advertising
Better Product Management
I: DATA WAREHOUSE
A data warehouse is a system that aggregates data from different sources into a single, central, consistent data store to support data analysis, data mining, artificial intelligence (AI), and machine learning.
I: DATA LAKE
A data lake is the place where you dump all forms of data generated in various parts of your business:
- Structured data feeds
- Chat logs
- Emails
- Images (of invoices, receipts, checks etc.)
- Videos
I: DATA POOL
Data Pool is a centralized repository of data where trading partners (e.g., retailers, distributors or suppliers) can obtain, maintain and exchange information about products in a standard format. Suppliers can, for instance, upload data to a data pool that cooperating retailers can then receive through their data pool.
I: BIG DATA
Big Data is often used in enterprise settings to describe large amounts of data. It does not refer to a specific amount of data, but rather describes a dataset that cannot be stored or processed using traditional database software.
I: BIG DATA LIFE CYCLE
Stated simply, there are three primary stages of the Big Data Life Cycle, along with an overarching glue (data governance) that helps to manage these:
Data Acquisition
Data Awareness
Data Analytics
Data Governance
I: HADOOP
Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.
I: MAPREDUCE
MapReduce is a software framework and programming model for processing huge amounts of data. MapReduce programs work in two phases: Map and Reduce.
Map – tasks split and map the input data, while
Reduce – tasks shuffle and reduce the mapped data.
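The two phases can be sketched in plain Python (the word-count job, the helper names, and the sample documents are illustrative assumptions, not part of any real Hadoop API):

```python
from collections import defaultdict

def map_phase(document):
    # Map: split the document and emit (word, 1) pairs
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values (here, sum the counts)
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big ideas", "data lakes hold data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts["data"])  # → 3
```

In a real cluster, Map and Reduce tasks run in parallel on many nodes and the framework performs the shuffle over the network; this single-process sketch only shows the data flow.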
I: Apache Spark
Spark is an open-source framework focused on interactive query, machine learning, and real-time workloads. It does not have its own storage system, but runs analytics on other storage systems like HDFS, or other popular stores like Amazon Redshift, Amazon S3, Couchbase, Cassandra, and others.
I: ETL
ETL is short for extract, transform, load: three database functions combined into one tool to pull data out of one source and place it in another. Common ETL use cases:
- Data migration
- Data merge
- Data warehousing
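The three ETL steps can be sketched with the standard library (the in-memory CSV source, the `orders` table, and its columns are illustrative assumptions standing in for real source and target systems):

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (an in-memory CSV here)
source_csv = "id,amount\n1, 10.5 \n2, 7.25 \n"
rows = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: strip whitespace and cast fields to proper types
cleaned = [(int(r["id"]), float(r["amount"].strip())) for r in rows]

# Load: insert the cleaned rows into the target store (in-memory SQLite)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # → 17.75
```

Production ETL tools add scheduling, error handling, and incremental loads, but the extract → transform → load shape is the same.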
I: DATA SYNCHRONIZATION ISSUES
Data synchronization has become an important part of organizations' information systems. However, the complexity of the process is often underestimated, and it can be quite risky due to factors such as:
- Different data formats, owing to the variety of applications, tools, and databases an organization may use.
- The quality of the data being synchronized. There is a risk of distributing inconsistent or out-of-date data enterprise-wide, which can result in additional (and quite palpable) data cleansing expenses.
- Constant maintenance: as soon as new data enters one application, it must be synchronized across the organization's other applications and systems.
I: HDFS
HDFS (Hadoop Distributed File System) is a distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, along with high fault tolerance and native support for large datasets.