technology and tools Flashcards
what is Hadoop?
open source distributed computing framework
what is Hadoop written in?
Java
what are the 4 main components of Hadoop?
map reduce
YARN
HDFS
Hadoop common
what size blocks does a Hadoop distributed file system (hdfs) use?
128 MB blocks (the default)
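As a rough worked example of the block size on this card, the sketch below counts how many blocks a file would be split into. The function name `hdfs_block_count` and the example file sizes are hypothetical; 128 MB is taken from the card and is configurable on a real cluster.

```python
import math

BLOCK_SIZE_MB = 128  # block size from the card above (assumed, configurable in practice)


def hdfs_block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Return how many blocks a file of the given size would be split into."""
    if file_size_mb <= 0:
        return 0
    # The last block may be only partly filled, so round up.
    return math.ceil(file_size_mb / block_size_mb)


print(hdfs_block_count(1024))  # 1 GB file -> 8
print(hdfs_block_count(300))   # -> 3
```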
in the HDFS of Hadoop is failure normal?
yes, because it is highly fault tolerant
what is the name node?
master server
what does the name node do?
holds the file system namespace (metadata)
undertakes file and directory operations
maps blocks to datanodes
what is the data node?
a worker server that stores the blocks that files are split into
what do data nodes do?
serves read and write requests
reports back to namenode
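The division of work between the name node and the data nodes can be sketched as a toy in-memory model. This is not the Hadoop API: the classes `ToyNameNode` and `ToyDataNode`, the block ids, and the file path are all hypothetical, and replication is left out for brevity.

```python
from collections import defaultdict


class ToyDataNode:
    """Stores block contents and serves read/write requests."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

    def block_report(self):
        # What the data node reports back: the ids of the blocks it holds.
        return sorted(self.blocks)


class ToyNameNode:
    """Holds the namespace (file -> block ids) and maps blocks to data nodes."""

    def __init__(self):
        self.files = {}
        self.block_locations = defaultdict(set)

    def add_file(self, path, block_ids):
        self.files[path] = list(block_ids)

    def receive_report(self, datanode):
        for block_id in datanode.block_report():
            self.block_locations[block_id].add(datanode.name)

    def locate(self, path):
        return [(b, sorted(self.block_locations[b])) for b in self.files[path]]


dn1, dn2 = ToyDataNode("dn1"), ToyDataNode("dn2")
dn1.write("blk_1", b"hello ")
dn2.write("blk_2", b"world")

nn = ToyNameNode()
nn.add_file("/demo.txt", ["blk_1", "blk_2"])
nn.receive_report(dn1)
nn.receive_report(dn2)
print(nn.locate("/demo.txt"))  # [('blk_1', ['dn1']), ('blk_2', ['dn2'])]
```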
what is bad about HDFS?
not good for small reads
not good for many small files
append only: data can be added to a file but not amended in place
what is map reduce?
Java-based programming paradigm
when to use map reduce?
problems that are embarrassingly parallel
what does the Map from map reduce do?
Performs a map function on input key-value pairs to generate intermediate key-value pairs
what does the reduce from map reduce do?
Performs a reduce function on intermediate key-value groups to generate output key-value pairs
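The map and reduce steps on the two cards above can be sketched as a word count in plain Python. This is a single-process illustration of the paradigm, not Hadoop's Java API; `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the grouping step in the middle stands in for the shuffle that a real cluster performs between the two phases.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: turn one input pair (offset, line) into intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: turn one intermediate group (word, [1, 1, ...]) into an output pair."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)  # group intermediate values by key (the "shuffle")
    output = {}
    for k in sorted(groups):
        for out_k, out_v in reducer(k, groups[k]):
            output[out_k] = out_v
    return output


lines = [(0, "spam ham spam"), (1, "ham eggs")]
result = run_mapreduce(lines, map_fn, reduce_fn)
print(result)  # {'eggs': 1, 'ham': 2, 'spam': 2}
```

Each call to `map_fn` depends only on its own input line, which is why this kind of problem parallelises so easily.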
name a case where you would use map reduce?
data mining
spam detection
ad optimisation
index building in search engines
article clustering for news
statistical machine translation
What does YARN stand for?
Yet another resource negotiator
What does yarn do?
Manages and monitors workloads
What are the main features of yarn? A shared B fast C scalability D flexibility E efficiency
A
C
D
E
What is Pig?
Data flow language
What is Hive/HiveQL?
SQL-style query language
What is HBase?
Column-orientated database
What is Mahout?
Machine learning library
What is Spark?
In-memory processing
In Hadoop, what are the data ingestion programs? Flume, HBase, Sqoop, Storm
Flume
Sqoop
Storm
In Hadoop, what are the analytic and machine learning programs? Spark, Giraph, Mahout
Giraph
Mahout
What are the NoSQL programs on Hadoop? Tez, HBase, Cassandra, Spark
HBase
Cassandra
In Hadoop, what programs are the engines? Spark, Storm, Tez
Spark
Tez
What is ZooKeeper in Hadoop?
Cluster and workflow management
What does hive do?
Converts SQL-style queries into Java MapReduce jobs
What does HBase allow you to do?
Real-time read/write operations on large datasets
What does spark do?
Analytic engine for large scale data processing
What is different about Spark's data sharing?
Data is shared in memory rather than on disk
What is greenplum
Open source data platform
What is postgresql
RDBMS with object-oriented features
What is MADlib
Open source library for in database analytics
In greenplum what is the intersect operation
Rows that appear in all answer sets
In greenplum what is the except operation
Rows from first answer set minus rows from second
In greenplum what is the union all operation
Rows from all answer sets, including duplicate rows
In greenplum what is the union operation
Rows from all answer sets, with duplicate rows removed
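The four set operations on the cards above can be compared side by side with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for Greenplum, on the assumption that these standard SQL operators behave the same way on this data; the tables `first_set` and `second_set` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE first_set (n INTEGER);
    CREATE TABLE second_set (n INTEGER);
    INSERT INTO first_set VALUES (1), (2), (2), (3);
    INSERT INTO second_set VALUES (2), (3), (4);
""")


def run(op):
    """Combine the two answer sets with the given set operator."""
    sql = f"SELECT n FROM first_set {op} SELECT n FROM second_set ORDER BY n"
    return [row[0] for row in conn.execute(sql)]


results = {op: run(op) for op in ("INTERSECT", "EXCEPT", "UNION ALL", "UNION")}
for op, rows in results.items():
    print(op, rows)
# INTERSECT [2, 3]
# EXCEPT [1]
# UNION ALL [1, 2, 2, 2, 3, 3, 4]
# UNION [1, 2, 3, 4]
```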
In greenplum what is the group by operation
Group results based on one or more specified columns
In greenplum what is the group by with union all operation
Add sub totals and grand totals
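A minimal sketch of the GROUP BY with UNION ALL pattern: one query produces the per-group subtotals, a second produces the grand total, and UNION ALL stacks them. It runs on Python's built-in `sqlite3` as a stand-in for Greenplum; the `sales` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('north', 10), ('north', 20), ('south', 5);
""")

sql = """
    SELECT * FROM (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region  -- subtotals
        UNION ALL
        SELECT NULL, SUM(amount) FROM sales                             -- grand total
    )
    ORDER BY region IS NULL, region  -- keep the grand total row last
"""
rows = conn.execute(sql).fetchall()
print(rows)  # [('north', 30), ('south', 5), (None, 35)]
```

The NULL in the last row is a summary marker rather than a missing region, which is the ambiguity the GROUPING function exists to resolve.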
In greenplum what is the roll up operation
Replaces GROUP BY with UNION ALL, producing subtotals and a grand total in a single clause
In greenplum what is the cube operation
Creates sub totals of all possible combinations
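To make "all possible combinations" concrete, the sketch below lists the grouping sets that CUBE aggregates over, next to the smaller hierarchical list that ROLLUP uses. It only enumerates column combinations and runs no SQL; the function names and the `region` and `product` columns are hypothetical.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """Every subset of the columns, largest first (what CUBE aggregates over)."""
    sets = []
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


def rollup_grouping_sets(columns):
    """Only the left-to-right prefixes of the columns (what ROLLUP aggregates over)."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]


cols = ["region", "product"]
print(cube_grouping_sets(cols))
# [('region', 'product'), ('region',), ('product',), ()]
print(rollup_grouping_sets(cols))
# [('region', 'product'), ('region',), ()]
```

The empty tuple stands for the grand total. With n columns, CUBE yields 2^n grouping sets while ROLLUP yields n + 1.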
In greenplum what is the grouping function
Distinguishes genuine NULL values from the NULLs that mark summary rows
In greenplum what is a window function?
Performs a calculation across a set of rows that are related to the current row
In greenplum and window functions what clause should you apply to specify which data window
OVER
In greenplum window functions how would you define window partitions
PARTITION BY
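The OVER and PARTITION BY cards above can be seen together in one query. This sketch uses Python's built-in `sqlite3` as a stand-in for Greenplum and assumes the bundled SQLite is version 3.25 or later, which is when window functions were added; the `sales` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 'ann', 10), ('north', 'bob', 20), ('south', 'cat', 5);
""")

sql = """
    SELECT region, rep, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, rep
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
# ('north', 'ann', 10, 30)
# ('north', 'bob', 20, 30)
# ('south', 'cat', 5, 5)
```

Unlike GROUP BY, the window function keeps every input row and attaches the partition's total to each one.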
what does MAD stand for in MADlib?
magnetic
agile
deep
what are the MADlib in-database analytical functions? a) regression b) classification c) validation d) text analysis e) descriptive analytics f) clustering and topic modelling g) association rule mining
a) regression
b) classification
c) validation
e) descriptive analytics
f) clustering and topic modelling
g) association rule mining
what does MADlib do?
creates models without moving data out of the DBMS