Week 2: Data Collection Flashcards
Batch Mode
It’s an analysis mode where results are updated infrequently (after days or months).
Real-time Mode
It’s an analysis mode where results are updated frequently (after seconds).
Interactive Mode
It’s an analysis mode where results are updated on demand as answers to queries.
Hadoop/MapReduce
It’s a framework for distributed data processing. It operates in batch mode.
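A minimal word-count sketch of the map and reduce phases, assuming Hadoop’s org.apache.hadoop.mapreduce Java API (the job driver and cluster set-up are omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in a line of input.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), ONE);
            }
        }
    }
}

// Reduce phase: sum the counts emitted for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Pig scripts compile down to MapReduce jobs like this one, which is why Pig is described below as a high-level language for writing MapReduce programmes.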
Pig
It’s a high-level language for writing MapReduce programmes. It operates in batch mode.
Spark
It’s a cluster computing framework with various data analytics components. It operates in batch mode.
Solr
It’s a scalable framework for searching data. It operates in batch mode.
Spark Streaming Component
It’s an extension of the core Spark API used for stream processing. It operates in real-time mode.
Storm
It’s used for stream processing. It operates in real-time mode.
Hive
It’s a data warehousing framework built on HDFS (Hadoop Distributed File System), and uses a SQL-like language.
Spark SQL Component
It’s a component of Apache Spark and allows for SQL-like queries within Spark programmes.
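A minimal sketch of a SQL-like query inside a Spark programme, assuming the Java Spark SQL API; the file path, view name, and column names are illustrative:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlSketch")
                .master("local[*]")               // local mode, for illustration only
                .getOrCreate();

        // Register a JSON file as a temporary view (the path is made up)...
        Dataset<Row> readings = spark.read().json("readings.json");
        readings.createOrReplaceTempView("readings");

        // ...then query it with SQL inside the Spark programme.
        spark.sql("SELECT sensor, AVG(temperature) FROM readings GROUP BY sensor").show();

        spark.stop();
    }
}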
Publish-subscribe Messaging
It’s a type of data access connector. Examples include Apache Kafka and Amazon Kinesis. Publishers send messages to topics. The messages are managed by an intermediary broker. Subscribers subscribe to topics. The broker routes the message from publishers to subscribers.
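A minimal sketch of the publisher side, assuming the Kafka Java client (kafka-clients) and a broker at localhost:9092; the topic name, key, and value are made up for illustration:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TopicPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // broker address (assumed)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The publisher only knows the topic; the broker routes the message to subscribers of that topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-42", "23.5"));
        }
    }
}

A subscriber would use a KafkaConsumer subscribed to the same topic and poll the broker for new messages.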
Source-sink Connectors
It’s a type of data access connector. Apache Flume is an example. Source connectors import data from another system, e.g. a relational database, into a centralised data store, e.g. a distributed file system. Sink connectors export the data to another system, such as HDFS.
Database Connectors
It’s a type of data access connector. Apache Sqoop is an example. Database connectors import data from relational DBMSs into big data storage and analytics frameworks.
Messaging Queues
It’s a type of data access connector. Examples include RabbitMQ, ZeroMQ, and Amazon SQS. Producers push data into queues and consumers pull data from the queues. Producers and consumers don’t need to be aware of each other.
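A minimal sketch using the RabbitMQ Java client (the 5.x API), assuming a broker on localhost; the queue name and message are made up for illustration:

import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class QueueSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                        // broker host (assumed)

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // The producer pushes a message into the queue...
            channel.queueDeclare("collection-tasks", false, false, false, null);
            channel.basicPublish("", "collection-tasks", null,
                    "ingest batch 7".getBytes(StandardCharsets.UTF_8));

            // ...and a consumer pulls from the same queue without knowing who produced the message.
            channel.basicConsume("collection-tasks", true,
                    (consumerTag, delivery) ->
                            System.out.println(new String(delivery.getBody(), StandardCharsets.UTF_8)),
                    consumerTag -> { });

            Thread.sleep(1000);   // give the consumer a moment before the connection closes
        }
    }
}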
Custom Connectors
It’s a type of data access connector. Examples include connectors that gather data from social networks, NoSQL databases, or IoT devices. They’re built based on the data sources and data collection requirements.
Apache Sqoop Imports
It imports data from an RDBMS into HDFS. The steps are as follows:
- The table to be imported is examined, mainly by looking at its metadata.
- Java code is generated: a class for the table, a method for each attribute, and methods to interact with the database via JDBC.
- Sqoop connects to the Hadoop cluster and submits a MapReduce job, which transfers the data from the DBMS to HDFS in parallel.
Apache Sqoop Exports
It exports data from HDFS back to an RDBMS. The steps are as follows:
- An export strategy for the target table is determined.
- Java code is generated to parse records from text files and build INSERT statements (see the sketch after this list).
- The generated Java code is used in a submitted MapReduce job that exports the data.
- For efficiency, “m” mappers write the data in parallel, and a single INSERT statement may transfer multiple rows.
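This isn’t the code Sqoop actually generates, but a simplified sketch of the idea behind the export step: parse one text record and turn it into a parameterised INSERT over JDBC. The connection values reuse the command templates on these cards; the column names are illustrative, and a MySQL JDBC driver is assumed on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ExportRecordSketch {
    public static void main(String[] args) throws Exception {
        // One line of an HDFS text file, as a mapper might read it (illustrative record).
        String record = "42,thermostat,23.5";
        String[] fields = record.split(",");

        // Parse the record and build a parameterised INSERT against the target table.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://mysql.example.com/sqoop", "Username", "Password");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO tableName (id, name, reading) VALUES (?, ?, ?)")) {
            insert.setInt(1, Integer.parseInt(fields[0]));
            insert.setString(2, fields[1]);
            insert.setDouble(3, Double.parseDouble(fields[2]));
            insert.executeUpdate();
        }
    }
}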
Sqoop Import Command Template
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--table tableName
JDBC
It’s a set of classes and interfaces written in Java that allows Java programmes to send SQL statements to the database.
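A minimal sketch of sending a SQL statement from a Java programme over JDBC, reusing the connection values from the Sqoop command templates; the table and column names are illustrative, and a MySQL JDBC driver is assumed on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcQuerySketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://mysql.example.com/sqoop", "Username", "Password");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT id, name FROM tableName WHERE id > ?")) {
            stmt.setInt(1, 100);                             // parameterised query
            try (ResultSet rows = stmt.executeQuery()) {
                while (rows.next()) {
                    System.out.println(rows.getInt("id") + " " + rows.getString("name"));
                }
            }
        }
    }
}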
Parallelism
This is when the same task is split across multiple parallel workers, reducing the time to completion. In the case of Sqoop, more mappers can send the data in parallel, but this increases the number of concurrent queries issued to the DBMS. The parameter “m” specifies the number of mappers to run in parallel, with each mapper reading only a certain section of the rows in the dataset.
Sqoop Imports Involving Updates
In this case, we use the “--incremental append” option, which imports only the new data and doesn’t update existing rows.
Sqoop Export Command Template
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username Username \
--password Password \
--export-dir Directory
Apache Flume
It’s a source-sink connector for collecting, aggregating, and moving data. Apache Flume gathers data from different sources and sends them into a centralised data store, which is either a distributed file system or a NoSQL database. Compared to ad-hoc alternatives, Apache Flume is reliable, scalable, and has high performance. It’s also manageable, customisable, and has low-cost installation, operation, and maintenance.
Apache Flume: Architecture
It consists of a distributed pipeline, with agents connected to each other in a chain. It can also be optimised for different data sources and destinations.
Apache Flume: Event
An event is a unit of data flow that has a byte payload and possibly headers.
Apache Flume: Headers
These are sets of attributes, which can be used for contextual routing. They consist of unordered collections (maps) of string key-value pairs.
Apache Flume: Payloads
They are byte arrays, and are opaque to Flume. Opaque in this context means that Flume treats the payloads as raw bytes and doesn’t attempt to interpret or restructure them.
Apache Flume: Agents
Agents are processes that host components through which events move from one place to another. The agent operates in the following manner:
1. Events are received
2. Events are passed through interceptors if they exist
3. Events are put in channels selected by the channel selector if there are multiple channels. Otherwise, they’re fed into the single channel.
4. The sink processor, if present, invokes a sink.
5. The sink (or the invoked sink) takes events from its channel and sends them to the next-hop destination.
6. If event transmission fails, the sink processor takes the secondary action.
Apache Flume: Source
The source receives data from data generators and transfers the data to one or more channels. It requires at least one channel to function. There are also different types of sources for integration with well-known systems. Common sources include:
1. Data serialisation frameworks such as Avro and Thrift
2. Social Media posts
3. Data from stdout
4. Data written to a port (read by the NetCat source)
5. HTTP POST events
Apache Flume: Channel
A channel is a transient store which buffers events until they’re consumed by sinks. It can work with any number of sources and sinks. Channel types include:
1. Memory
2. File
3. JDBC
Apache Flume: Sink
A sink removes events from the channel and transmits them to their next hop destination. A sink requires exactly one channel to function. Sink types include:
1. HDFS
2. HBase
3. File Roll
4. Avro
5. Thrift
Apache Flume: Interceptor
This isn’t present in all Apache Flume Agents. The interceptor is applied to the source in order to modify, filter, or drop events. It can also add information to the headers, such as a timestamp, the host, or a static value.
Apache Flume: Channel Processor/Selector
This isn’t present in all Apache Flume Agents. When there are multiple channels, the channel processor/selector defines the policy for distributing events to the channels. If there are interceptors, the channel selector is applied after them, so that it operates on the events as modified by the interceptors. Channel selector types include:
- Replicating, which is the default option and replicates events to all connected channels.
- Multiplexing, which distributes events to the connected channels based on a header attribute “A” of the event and its mapping properties (the values of A).
Apache Flume: Sink Selector/Processor
This isn’t present in all Apache Flume Agents. If present, the sink selector picks one sink from a specified group of sinks. Common sink selector types include:
1. Load Balancing: the load is distributed to sinks selected at random. If a sink fails, the next available sink is chosen. Failed sinks can be blacklisted for a given timeout, which increases exponentially if the sink still fails after the timeout.
2. Failover: unique priorities are assigned to sinks. The sink with the highest priority writes the data until it fails. If the sink fails while sending an event, it is moved to a pool to “cool down” for a maximum penalty time period, and the sink with the next-highest priority is then used.
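Putting the Flume cards together, a minimal sketch of an agent configuration in Flume’s properties-file format, assuming a NetCat source, a memory channel, and a logger sink; the agent name, port, and component names are illustrative.

# One agent (a1) with a single source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# NetCat source: each line written to the port becomes an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Memory channel: buffers events until the sink consumes them
a1.channels.c1.type = memory

# Logger sink: prints events; in practice this could be an HDFS or HBase sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

The agent is then started with the flume-ng command, naming this agent and configuration file.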
Data Access Connectors
These are tools and frameworks to collect and ingest data from various sources into big data storage and analytics frameworks.