Section 1: Introduction Flashcards
What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on the abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from a messaging queue into a full-fledged event streaming platform. It supports use cases such as:
- Real-time processing
- Batch processing
- Operational use cases such as application logs collection
What is Structured data?
The term structured data refers to data that resides in a fixed field within a file or record. Structured data is typically stored in a relational database (RDBMS). It can consist of numbers and text, and sourcing can happen automatically or manually, as long as it’s within an RDBMS structure. It depends on the creation of a data model, defining what types of data to include and how to store and process it.
The programming language used for structured data is SQL (Structured Query Language). Developed by IBM in the 1970s, SQL handles relational databases. Typical examples of structured data are names, addresses, credit card numbers, geolocation, and so on.
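The fixed-field, data-model-driven nature of structured data can be illustrated with an in-memory relational database. A minimal sketch using Python's built-in `sqlite3` module (the table and column names are illustrative, not from any real system):

```python
import sqlite3

# Structured data depends on a data model defined up front:
# a schema with fixed, typed fields.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        name        TEXT,
        address     TEXT,
        card_number TEXT
    )
""")
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?)",
    ("Ada Lovelace", "12 Example St", "4111-XXXX-XXXX-1111"),
)

# SQL queries rely on those fixed fields existing for every record.
row = conn.execute(
    "SELECT name FROM customers WHERE address = '12 Example St'"
).fetchone()
print(row[0])  # Ada Lovelace
```

Because every record must fit the schema, the data can be queried, indexed, and joined reliably; this is exactly what unstructured data lacks.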
e.g.:
- Databases
- Census records
- Phone numbers
- Metadata
  - Time and date of creation
  - File size
  - Author
- Economic data
  - GDP, PPI, ASX
What is Semi-Structured data?
A simple definition of semi-structured data is data that can't be organized in relational databases or doesn't have a strict structural framework, yet does have some structural properties or a loose organizational framework. Semi-structured data includes text that is organized by subject or topic or fits into a hierarchical programming language, yet the text within is open-ended, having no structure itself.
Emails, for example, are semi-structured by Sender, Recipient, Subject, Date, etc., or with the help of machine learning, are automatically categorized into folders, like Inbox, Spam, Promotions, etc.
e.g. JSON, XML, Spreadsheets
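The email example above can be sketched in code: the envelope fields (Sender, Subject, Date) are structured, while the body is free text. A small illustration using Python's standard `json` module (the field names and addresses are made up for the example):

```python
import json

# Semi-structured: a loose organizational framework of named fields,
# with open-ended text inside.
raw = """
{
    "sender": "alice@example.com",
    "recipient": "bob@example.com",
    "subject": "Quarterly report",
    "date": "2023-01-15",
    "body": "Hi Bob, the numbers look good this quarter..."
}
"""

email = json.loads(raw)

# The organizational fields can be queried like structured data...
print(email["sender"])  # alice@example.com

# ...but the body itself is just unstructured text.
print(type(email["body"]))  # <class 'str'>
```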
What is Unstructured Data?
Unstructured data is more or less all the data that is not structured. Even though unstructured data may have a native, internal structure, it’s not structured in a predefined way. There is no data model; the data is stored in its native format.
Typical examples of unstructured data are rich media, text, social media activity, surveillance imagery, and so on.
The amount of unstructured data is much larger than that of structured data. Unstructured data makes up a whopping 80% or more of all enterprise data, and the percentage keeps growing. This means that companies not taking unstructured data into account are missing out on a lot of valuable business intelligence.
As per IDG, unstructured data volume is growing at a rate of 62% per year.
e.g.:
- Text files
  - Word processing
  - Spreadsheets
  - Presentations
  - Email body
  - Mobile data (text messages)
- Social media
  - Data from Facebook
- Communications
  - Chat
  - IM
  - Phone recordings
- Media files
  - MP3
  - Digital photos
  - Audio and video files
What is Quasi-Structured data?
Quasi-structured data is textual data with erratic data formats. It can be formatted with effort, tools, and time. This data type includes web clickstream data, such as Google searches.
Example: Clickstream data - which webpages a user has visited, and in what order.
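Clickstream data shows why quasi-structured data "can be formatted with effort, tools, and time": a raw web-server log line is just erratic text, but a regular expression can impose structure and recover the visit order. A hypothetical sketch in Python (the log format and field names are assumptions for illustration):

```python
import re

# Raw clickstream lines: textual data with no formal schema.
log_lines = [
    "2023-01-15T10:00:01 user=42 GET /home",
    "2023-01-15T10:00:09 user=42 GET /products",
    "2023-01-15T10:00:30 user=42 GET /checkout",
]

# With effort and tooling (here, a regex), structure can be imposed.
pattern = re.compile(r"(\S+) user=(\d+) GET (\S+)")

visits = []
for line in log_lines:
    m = pattern.match(line)
    if m:
        timestamp, user, page = m.groups()
        visits.append(page)

# Webpages the user visited, in order:
print(visits)  # ['/home', '/products', '/checkout']
```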
What is Big Data?
- Big Data is a phrase used to mean a massive volume of data (both structured and unstructured) that is so large that it is difficult to process using traditional database and software techniques.
- Contribution of different social media platforms to the exploding data:
  - YouTube - 300 hours of new videos are uploaded by users (every minute)
  - Instagram - 4,166,667 posts and 1,736,111 likes (every minute)
  - Facebook - Users generate 4 million likes (every minute)
  - Twitter - 350,000 Tweets (every minute)
  - Pinterest - 9,723 articles are pinned (every minute)
- Big Data Examples:
- New York Stock Exchange generates 1 terabyte of trade data per day
- Facebook generates 500+ terabytes of data every day
- A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time
What are the 6 V’s that Characterises Big data?
- Velocity
  - The speed at which enormous amounts of data are being generated.
- Volume
  - The amount of data from myriad sources.
- Variety
  - The types of data: structured, semi-structured, unstructured.
- Veracity
  - The degree to which big data can be trusted (quality of the data).
- Value
  - The business value of the data collected.
- Variability
  - The ways in which the big data can be used and formatted.
What is Big data Analytics?
Big Data Analytics is the complex process of inspecting big data to uncover information that can help organizations make informed business decisions. Items which big data analytics can help uncover:
- Hidden patterns
- Unknown correlations
- Market trends
- Customer preferences
What are the Benefits of Big Data?
- Big data helps organizations create new growth opportunities by increasing their efficiency and enabling better decision making.
- Understand the market
  - With enhancements in analytical systems, such as in-memory analytics able to analyze new sources of data, companies can understand the market much better, which in turn helps them make smarter decisions on the go.
- Reduced time for decision making
  - To meet ever-changing customer demands, organizations streamline their operational processes and gain the insight to take quick business decisions.
- Cost savings
  - Big data technologies help companies reduce the cost of storing large amounts of data.
- New product development and services
  - Using big data analytics, organizations become familiar with customers' needs and create new products to meet those needs.
- Manage online reputation
  - Big data tools can perform sentiment analysis, giving an organization feedback about its online reputation, which it can then manage.
What are the Use Cases for Big Data Analytics?
- Customer sentiment analysis
- Behavioral Analytics
- Predictive support
- Fraud detection
- Customer segmentation
What are the Kafka Features?
- High throughput
  - Support for millions of messages per second
- Durability
  - Provides support for persisting messages on disk
- No data loss
  - Ensures no data loss
  - Provides compression and security
- Replication
  - Messages can be replicated across clusters, which supports multiple subscribers
- Stream processing
  - Kafka can be used along with Spark and Storm
- Scalability
  - Highly scalable distributed system with no downtime
What are the Kafka Use Cases - Messaging?
- Messaging - Kafka is a good solution for large-scale event processing applications, offering:
  - High throughput
  - Built-in partitioning
  - Fault tolerance
  - Durability
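The "built-in partitioning" above means Kafka assigns each keyed message to a partition by hashing its key, so all messages with the same key land on the same partition and keep their order. A simplified sketch of that idea in plain Python (the real Kafka client uses murmur2 hashing; `crc32` stands in here just to keep the example deterministic and self-contained):

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Hash the message key to pick a partition (Kafka uses murmur2;
    # crc32 is an illustrative stand-in).
    return zlib.crc32(key) % num_partitions

# All messages for a given key go to one partition, preserving order.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for key, value in [(b"user-42", "login"), (b"user-7", "view"),
                   (b"user-42", "purchase")]:
    partitions[partition_for(key)].append((key, value))

# Every message for user-42 sits in the same partition, in send order.
p = partition_for(b"user-42")
print([v for k, v in partitions[p] if k == b"user-42"])  # ['login', 'purchase']
```

Partitioning is what lets Kafka scale throughput across brokers while still guaranteeing per-key ordering.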
What are the Kafka Use Cases - Activity Tracking?
- Kafka was originally designed at LinkedIn to track user activity
- The following site activities are published, with one topic per activity type:
  - Page views
  - Searches
What are the Kafka Use Cases - Metrics and Logging?
- Collect application and system metrics and logs
- Kafka is used for monitoring data:
  - Accumulating measurements from distributed applications
  - Producing centralized feeds of operational data
What are the Kafka Use Cases - Log Aggregation?
- Log aggregation collects physical log files from servers and puts them in a central place (a file server or HDFS, perhaps) for processing
- Kafka abstracts the details of files
- Gives log/event data as stream of messages which is a cleaner abstraction
- In comparison to log-centric systems (such as Scribe or Flume), Kafka offers:
- Equally good performance
- Strong durability
- Lower end-to-end latency
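The abstraction described above, from physical log files to a stream of messages, can be sketched as a small generator that turns raw log lines into structured message records. A hypothetical illustration in Python (the record fields are assumptions for the sketch, not a Kafka API):

```python
# Log aggregation as a stream: instead of shipping whole files,
# each log line becomes one message in an ordered stream
# (which is what Kafka provides at scale).
def as_message_stream(hostname, lines):
    for offset, line in enumerate(lines):
        yield {"host": hostname, "offset": offset, "message": line.rstrip("\n")}

raw_log = [
    "INFO  service started\n",
    "WARN  disk usage at 85%\n",
]

stream = list(as_message_stream("web-01", raw_log))
print(stream[0])
# {'host': 'web-01', 'offset': 0, 'message': 'INFO  service started'}
```

Downstream consumers then process log/event data uniformly, without caring which file or server a line came from.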