Week 2: Data Collection Flashcards

1
Q

Batch Mode

A

It’s an analysis mode where results are updated infrequently (after days or months).

2
Q

Real-time Mode

A

It’s an analysis mode where results are updated frequently (after seconds).

3
Q

Interactive Mode

A

It’s an analysis mode where results are updated on demand as answers to queries.

4
Q

Hadoop/MapReduce

A

It’s a framework for distributed data processing. It operates in batch mode.

5
Q

Pig

A

It’s a high-level language for writing MapReduce programmes. It operates in batch mode.

6
Q

Spark

A

It’s a cluster computing framework with various data analytics components. It operates in batch mode.

7
Q

Solr

A

It’s a scalable framework for searching data. It operates in batch mode.

8
Q

Spark Streaming Component

A

It’s an extension of the core Spark API used for stream processing. It operates in real-time mode.

9
Q

Storm

A

It’s used for stream processing. It operates in real-time mode.

10
Q

Hive

A

It’s a data warehousing framework built on HDFS (the Hadoop Distributed File System), and it uses a SQL-like language (HiveQL).
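
A rough sketch of what the SQL-like language (HiveQL) looks like, run here through the hive command-line tool with its -e flag; the table and column names are made up for this example.

hive -e "CREATE TABLE page_views (ts STRING, url STRING) STORED AS TEXTFILE;"
hive -e "SELECT url, COUNT(*) FROM page_views GROUP BY url;"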

11
Q

Spark SQL Component

A

It’s a component of Apache Spark and allows for SQL-like queries within Spark programmes.

12
Q

Publish-subscribe Messaging

A

It’s a type of data access connector. Examples include Apache Kafka and Amazon Kinesis. Publishers send messages to topics. The messages are managed by an intermediary broker. Subscribers subscribe to topics. The broker routes the message from publishers to subscribers.
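
A minimal sketch of the publish-subscribe flow using Kafka’s console tools; the topic name sensor-data and the broker address are made up for this example, and exact flag names vary between Kafka versions.

# Publisher: send messages to the topic via the broker
kafka-console-producer.sh --broker-list localhost:9092 --topic sensor-data

# Subscriber: receive the messages the broker routes for that topic
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sensor-data --from-beginning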

13
Q

Source-sink Connectors

A

It’s a type of data access connector. Apache Flume is an example. Source connectors import data from another system, e.g. a relational database, into a centralised data store, e.g. a distributed file system. Sink connectors export the data to another system, such as HDFS.

14
Q

Database Connectors

A

It’s a type of data access connector. Apache Sqoop is an example. It imports data from relational DBMSs into big data storage and analytics frameworks.

15
Q

Messaging Queues

A

It’s a type of data access connector. Examples include RabbitMQ, ZeroMQ, and Amazon SQS. Producers push data into queues and consumers pull data from the queues. Producers and consumers don’t need to be aware of each other.

16
Q

Custom Connectors

A

It’s a type of data access connector. Examples include connectors that gather data from social networks, NoSQL databases, or IoT devices. They’re built based on the data sources and the data collection requirements.

17
Q

Apache Sqoop Imports

A

It imports data from an RDBMS into HDFS. The exact steps are as follows:

  1. The table to be imported is examined, mainly by looking at its metadata.
  2. Java code is generated, mainly a class for the table, a method for each attribute, and methods to interact with JDBC.
  3. Sqoop connects to the Hadoop cluster and submits a MapReduce job, which transfers the data from the DBMS to HDFS in parallel.
18
Q

Apache Sqoop Exports

A

It exports data from HDFS back to an RDBMS. Here are the exact steps:

  1. Sqoop examines the target table’s metadata and picks an export strategy.
  2. Java code is generated to parse records from the text files and build INSERT statements.
  3. The Java code is used in a submitted MapReduce job that exports the data.
  4. For efficiency, m mappers write the data in parallel, so a single INSERT statement may transfer multiple rows.
19
Q

Sqoop Import Command Template

A

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName

20
Q

JDBC

A

It’s a set of classes and interfaces written in Java that allows Java programmes to send SQL statements to the database.

21
Q

Parallelism

A

This is when the same task is split across multiple workers running at the same time, reducing the time needed for completion. In the case of Sqoop, more mappers can transfer the data in parallel, but this also increases the number of concurrent queries sent to the DBMS. The “-m” parameter specifies the number of mappers to run in parallel, with each mapper reading only a certain section of the rows in the dataset.
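
For instance, building on the import template from card 19, the sketch below asks for eight parallel mappers; the --split-by column (id) is a hypothetical primary key used to divide the rows between the mappers.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName \
--split-by id \
-m 8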

22
Q

Sqoop Imports Involving Updates

A

In this case, we use the “--incremental append” option, which imports only the newly added rows and doesn’t update existing rows.
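
A minimal sketch of such an import, assuming a hypothetical numeric key column id and that rows up to id 1000 have already been imported:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName \
--incremental append \
--check-column id \
--last-value 1000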

23
Q

Sqoop Export Command Template

A

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName \
--export-dir Directory

24
Q

Apache Flume

A

It’s a source-sink connector for collecting, aggregating, and moving data. Apache Flume gathers data from different sources and sends it to a centralised data store, which is either a distributed file system or a NoSQL database. Compared to ad-hoc alternatives, Apache Flume is reliable, scalable, and has high performance. It’s also manageable, customisable, and has low-cost installation, operation, and maintenance.

25
Q

Apache Flume: Architecture

A

It consists of a distributed pipeline, with agents connected to each other in a chain. It can also be optimised for different data sources and destinations.

26
Q

Apache Flume: Event

A

An event is a unit of data flow that has a byte payload and possibly headers.

27
Q

Apache Flume: Headers

A

These are sets of attributes, which can be used for contextual routing. They consist of unordered collections (maps) of string key-value pairs.

28
Q

Apache Flume: Payloads

A

They are byte arrays, and are opaque to Flume. Opaque in this context means that Flume treats the payloads as raw bytes and doesn’t attempt to interpret or restructure them.

29
Q

Apache Flume: Agents

A

Agents are processes that host the components through which events move from one place to another. An agent operates in the following manner:
1. Events are received.
2. Events are passed through interceptors, if any exist.
3. Events are put into channels selected by the channel selector if there are multiple channels. Otherwise, they’re fed into the single channel.
4. The sink processor, if it exists, invokes a sink.
5. The sink (or the invoked sink) takes events from its channel and sends them to the next-hop destination.
6. If event transmission fails, the sink processor takes a secondary action (e.g. failing over to another sink).

30
Q

Apache Flume: Source

A

The source receives data from data generators and transfers the data to one or more channels. It requires at least one channel to function. There are also different source types for integration with well-known systems. Common sources include:
1. Data serialisation frameworks such as Avro and Thrift
2. Social media posts
3. Data from a command’s stdout
4. Data written to a network port (read by a Netcat-style source)
5. HTTP POST events

31
Q

Apache Flume: Channel

A

A channel is a transient store which buffers events until they’re consumed by sinks. It can work with any number of sources and sinks. Channel types include:
1. Memory
2. File
3. JDBC

32
Q

Apache Flume: Sink

A

A sink removes events from the channel and transmits them to their next-hop destination. A sink requires exactly one channel to function. Sink types include:
1. HDFS
2. HBase
3. File Roll
4. Avro
5. Thrift
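
To tie the source, channel, and sink cards together, here is a minimal sketch of a single-agent Flume configuration and its launch command; the agent, component, and path names (agent1, src1, ch1, sink1, the HDFS URL) are made up for illustration.

cat > agent1.conf <<'EOF'
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Netcat-style source: reads events written to a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Memory channel: transient buffer between source and sink
agent1.channels.ch1.type = memory

# HDFS sink: writes events to the distributed file system
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.channel = ch1
EOF

# Start the agent with the Flume launcher script
flume-ng agent --conf conf --conf-file agent1.conf --name agent1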

33
Q

Apache Flume: Interceptor

A

This isn’t present in all Apache Flume Agents. The interceptor is applied to the source in order to modify, filter, or drop events. It can also add information to headers, such as a timestamp, the host name, or static values.
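
For example, extending the hypothetical agent1 configuration sketched after card 32, a timestamp interceptor could be attached to the source like this (the interceptor name i1 is made up):

# Timestamp interceptor: adds a timestamp header to each event
agent1.sources.src1.interceptors = i1
agent1.sources.src1.interceptors.i1.type = timestamp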

34
Q

Apache Flume: Channel Processor/Selector

A

This isn’t present in all Apache Flume Agents. When there are multiple channels, the channel processor/selector defines the policy for distributing events to the channels. If there are interceptors, the channel selector is applied after them, so that it operates on the events as modified by the interceptors. Channel selector types include:

  1. Replicating, which is the default option and replicates each event to all connected channels.
  2. Multiplexing, which routes each event to a subset of the connected channels based on the value of a header attribute “A” of the event and the configured mapping from values of A to channels (see the sketch below).
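
A sketch of a multiplexing selector for the hypothetical agent1, routing on a made-up header attribute named region (the second channel ch2 is also hypothetical):

# Route events to ch1 or ch2 depending on the value of the "region" header
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = region
agent1.sources.src1.selector.mapping.us = ch1
agent1.sources.src1.selector.mapping.eu = ch2
agent1.sources.src1.selector.default = ch1
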
35
Q

Apache Flume: Sink Selector/Processor

A

This isn’t present in all Apache Flume Agents. If present, the sink selector picks one sink from a specified group of sinks. Common sink selector types include:
1. Load balancing: the load is distributed across sinks selected at random. If a sink fails, the next available sink is chosen. Failed sinks can be blacklisted for a given timeout, which increases exponentially if the sink still fails after the timeout.
2. Failover: unique priorities are assigned to the sinks. The sink with the highest priority writes the data until it fails. If a sink fails while sending an event, it is moved to a pool to “cool down” for a maximum penalty time period, and the sink with the next-highest priority is used instead (see the sketch after this list).
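
A sketch of a failover sink processor for the hypothetical agent1, where sink1 (priority 10) is preferred over a second, made-up sink2 (priority 5); maxpenalty is the upper bound on the cool-down period in milliseconds.

# Group the two sinks and fail over from sink1 to sink2 when sink1 fails
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sink1 sink2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.sink1 = 10
agent1.sinkgroups.g1.processor.priority.sink2 = 5
agent1.sinkgroups.g1.processor.maxpenalty = 10000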

36
Q

Data Access Connectors

A

These are tools and frameworks to collect and ingest data from various sources into big data storage and analytics frameworks.