Week 2: Data Collection Flashcards
Batch Mode
It’s an analysis mode where results are updated infrequently (after days or months).
Real-time Mode
It’s an analysis mode where results are updated frequently (after seconds).
Interactive Mode
It’s an analysis mode where results are updated on demand as answers to queries.
Hadoop/MapReduce
It’s a framework for distributed data processing. It operates in batch mode.
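To make the MapReduce idea concrete, here is a minimal single-machine sketch of a word count. It does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions chosen for illustration, and a real job would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Hypothetical driver: map every document, shuffle, then reduce each group.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big batch"])
print(counts)
```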
Pig
It’s a high-level language for writing MapReduce programmes. It operates in batch mode.
Spark
It’s a cluster computing framework with various data analytics components. It operates in batch mode.
Solr
It’s a scalable framework for searching data. It operates in batch mode.
Spark Streaming Component
It’s an extension of the core Spark API used for stream processing. It operates in real-time mode.
Storm
It’s used for stream processing. It operates in real-time mode.
Hive
It’s a data warehousing framework built on HDFS (Hadoop Distributed File System) that uses a SQL-like language.
Spark SQL Component
It’s a component of Apache Spark and allows for SQL-like queries within Spark programmes.
Publish-subscribe Messaging
It’s a type of data access connector. Examples include Apache Kafka and Amazon Kinesis. Publishers send messages to topics, and subscribers subscribe to topics. An intermediary broker manages the messages and routes them from publishers to subscribers.
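The publish-subscribe flow can be sketched with a small in-memory broker. This is only an illustration under stated assumptions: the `Broker` class and the topic names are hypothetical, and it is not how Kafka or Kinesis are implemented or called.

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-memory broker that routes messages by topic."""

    def __init__(self):
        # Maps each topic name to the list of subscriber callbacks.
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of this topic.
        # Topics without subscribers simply receive nothing.
        for callback in self._subscribers.get(topic, []):
            callback(message)


broker = Broker()
received_a, received_b = [], []

# Two subscribers register interest in the same topic.
broker.subscribe("clicks", received_a.append)
broker.subscribe("clicks", received_b.append)

# The publisher only knows the topic, not who is listening.
broker.publish("clicks", {"user": 1})
broker.publish("views", {"user": 2})  # no subscribers for this topic
```

Both subscribers receive the "clicks" message, while the "views" message reaches no one because nobody subscribed to that topic.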
Source-sink Connectors
It’s a type of data access connector. Apache Flume is an example. Source connectors import data from another system, e.g. a relational database, into a centralised data store, e.g. a distributed file system. Sink connectors export the data to another system, such as HDFS.
Database Connectors
It’s a type of data access connector. Apache Sqoop is an example. It imports data from relational DBMSs into big data storage and analytics frameworks.
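As a rough illustration of what a database connector does, the sketch below reads every row of a relational table and writes it out as CSV, a format that big data stores commonly ingest. It uses Python's built-in `sqlite3` and `csv` modules rather than Sqoop; the `export_table_to_csv` function and the `users` table are assumptions made up for this example.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out):
    """Copy all rows of `table` into `out` as CSV, header row first.

    The table name is interpolated directly, so it must come from a
    trusted source in this sketch.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Hypothetical source database, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "lin")])

# A StringIO buffer stands in for a file on a distributed file system.
buffer = io.StringIO()
export_table_to_csv(conn, "users", buffer)
print(buffer.getvalue())
```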
Messaging Queues
It’s a type of data access connector. Examples include RabbitMQ, ZeroMQ, and Amazon SQS. Producers push data into queues and consumers pull data from the queues. Producers and consumers don’t need to be aware of each other.
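The push/pull pattern can be sketched with Python's standard `queue.Queue` and two threads. This is a single-process illustration, not the RabbitMQ, ZeroMQ, or Amazon SQS API; the `producer` and `consumer` functions and the `None` end-of-stream marker are assumptions for the example.

```python
import queue
import threading


def producer(q, items):
    """Push each item into the queue, then a None marker to signal the end."""
    for item in items:
        q.put(item)
    q.put(None)


def consumer(q, out):
    """Pull items from the queue until the None marker arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        out.append(item)


q = queue.Queue()
consumed = []

# The producer and consumer share only the queue, not references to each other.
producer_thread = threading.Thread(target=producer, args=(q, [1, 2, 3]))
consumer_thread = threading.Thread(target=consumer, args=(q, consumed))
producer_thread.start()
consumer_thread.start()
producer_thread.join()
consumer_thread.join()
```

Because the queue is first-in, first-out and there is a single consumer, the items arrive in the order they were pushed.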