Section 1: Introduction Flashcards

1
Q

What is Apache Kafka?

A

Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on the abstraction of a distributed commit log. Since being created and open-sourced by LinkedIn in 2011, Kafka has quickly evolved from a messaging queue into a full-fledged event streaming platform used for:

  • Real-time processing
  • Batch processing
  • Operational use cases such as application log collection
2
Q

What is Structured data?

A

The term structured data refers to data that resides in a fixed field within a file or record. Structured data is typically stored in a relational database (RDBMS). It can consist of numbers and text, and sourcing can happen automatically or manually, as long as it’s within an RDBMS structure. It depends on the creation of a data model, defining what types of data to include and how to store and process it.

The programming language used for structured data is SQL (Structured Query Language). Developed by IBM in the 1970s, SQL is used to manage relational databases. Typical examples of structured data are names, addresses, credit card numbers, geolocation, and so on.

e.g.:

  • Databases
  • Census records
  • Phone numbers
  • Metadata
    • Time and date of creation
    • File size
    • Author
  • Economic data
    • GDP, PPI, ASX
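As a minimal sketch of what "fixed fields within a record" means, the example below uses Python's built-in sqlite3 module; the table and column names are purely illustrative:

```python
import sqlite3

# Structured data: every record must fit a fixed schema defined up front
# by the data model (column names and types).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        name        TEXT,
        address     TEXT,
        card_number TEXT
    )
""")
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?)",
    ("Alice", "1 Example St", "4111-xxxx"),
)

# SQL queries rely on the fixed fields the schema defines.
rows = conn.execute("SELECT name FROM customers").fetchall()
print(rows)  # [('Alice',)]
```

A record that does not match the schema (e.g. a missing or extra column) is rejected, which is exactly what distinguishes structured from semi-structured data.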
3
Q

What is Semi-Structured data?

A

A simple definition of semi-structured data is data that can’t be organized in relational databases or doesn’t have a strict structural framework, yet does have some structural properties or a loose organizational framework. Semi-structured data includes text that is organized by subject or topic or fits into a hierarchical framework, yet the text within is open-ended, having no structure itself.

Emails, for example, are semi-structured by Sender, Recipient, Subject, Date, etc., or with the help of machine learning, are automatically categorized into folders, like Inbox, Spam, Promotions, etc.

e.g. JSON, XML, spreadsheets
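The email example above can be sketched with JSON, using Python's standard json module (the field names are illustrative): the top-level fields give the data its loose structure, while the body remains free-form text.

```python
import json

# Semi-structured data: known top-level fields (sender, subject, ...)
# but the body text itself has no internal structure.
raw = '{"sender": "a@example.com", "subject": "Hi", "body": "free-form text..."}'
email = json.loads(raw)
print(email["subject"])  # Hi
```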

4
Q

What is Unstructured Data?

A

Unstructured data is more or less all the data that is not structured. Even though unstructured data may have a native, internal structure, it’s not structured in a predefined way. There is no data model; the data is stored in its native format.

Typical examples of unstructured data are rich media, text, social media activity, surveillance imagery, and so on.

The amount of unstructured data is much larger than that of structured data. Unstructured data makes up a whopping 80% or more of all enterprise data, and the percentage keeps growing. This means that companies not taking unstructured data into account are missing out on a lot of valuable business intelligence.

As per IDG, unstructured data volume is growing at a rate of 62% per year.

e.g.

  • Text files
    • Word processing
    • Spreadsheets
    • Presentations
  • Email Body
  • Mobile data (Text messages)
  • Social Media
    • Data from Facebook
    • Twitter
    • LinkedIn
  • Communications:
    • Chat
    • IM
    • Phone recordings
  • Media files
    • MP3
    • digital photos
    • Audio and video files
5
Q

What is Quasi-Structured data?

A

Quasi-structured data is textual data with erratic data formats. It can be formatted with effort, tools, and time. This data type includes web clickstream data, such as Google searches.

Example: clickstream data - the web pages a user has visited and the order of those visits.

6
Q

What is Big Data?

A
  • Big Data is a phrase used to describe a massive volume of data (both structured and unstructured) that is so large it is difficult to process using traditional database and software techniques.
  • Different social media platforms contribute to the exploding data:
    • YouTube - 300 hours of new video are uploaded by users (every minute)
    • Instagram - 4,166,667 posts and 1,736,111 likes (every minute)
    • Facebook - Users generate 4 million likes (every minute)
    • Twitter - 350,000 tweets (every minute)
    • Pinterest - 9,723 articles are pinned (every minute)
  • Big Data examples:
    • The New York Stock Exchange generates 1 terabyte of trade data per day
    • Facebook generates 500+ terabytes of data every day
    • A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time.
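To put the jet-engine figure in perspective, a quick back-of-the-envelope calculation (assuming decimal terabytes):

```python
# 10 TB generated in 30 minutes of flight time (decimal units assumed).
total_bytes = 10 * 10**12      # 10 terabytes in bytes
seconds = 30 * 60              # 30 minutes
rate_gb_per_s = total_bytes / seconds / 10**9
print(round(rate_gb_per_s, 2))  # ~5.56 GB per second
```

That sustained rate is far beyond what a traditional single-server RDBMS pipeline is designed to ingest, which is the point of the definition above.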
7
Q

What are the 6 V’s that characterise Big Data?

A
  • Velocity
    • The speed at which incredible amounts of data are being generated.
  • Volume
    • The amount of data from myriad sources.
  • Variety
    • The types of data: structured, semi-structured, unstructured.
  • Veracity
    • The degree to which big data can be trusted (quality of the data).
  • Value
    • The business value of the data collected.
  • Variability
    • The ways in which the big data can be used and formatted.
8
Q

What is Big data Analytics?

A

Big Data Analytics is the complex process of inspecting big data to uncover information that can help organizations make informed business decisions. Big data analytics can help uncover:

  • Hidden patterns
  • Unknown correlations
  • Market trends
  • Customer preferences
9
Q

What are the Benefits of Big Data?

A
  • Big Data helps organizations create new growth opportunities by increasing their efficiency and enabling better decisions.
    • Understand the Market
      • With enhancements in analytical systems, such as in-memory analytics with the ability to analyze new sources of data, companies can understand the market much better, which in turn helps them make smarter decisions on the go.
    • Reducing time for decision making
      • To meet ever-changing customer demands, organizations streamline their operational processes and gain the insight needed to take quick business decisions.
    • Cost saving
      • Big data technologies help companies reduce the cost of storing large amounts of data.
    • New product development services
      • Using big data analytics, organizations become familiar with customers’ needs and create new products to meet them.
    • Manage Online Reputation
      • Big data tools can perform sentiment analysis, giving feedback about an organization’s online reputation so it can be managed.
10
Q

What are the Use Cases for Big Data Analytics?

A
  • Customer sentiment analysis
  • Behavioral Analytics
  • Predictive support
  • Fraud detection
  • Customer segmentation
11
Q

What are the Kafka Features?

A
  • High Throughput
    • Support for millions of messages per second
  • Durability
    • Provides support for persisting messages on disk
  • No Data Loss
    • Ensures no data loss
    • Provides compression and security
  • Replication
    • Messages can be replicated across clusters, which supports multiple subscribers.
  • Stream Processing
    • Kafka can be used along with Spark and Storm
  • Scalability
    • A highly scalable distributed system with no downtime
12
Q

What are the Kafka Use Cases - Messaging?

A
  • Messaging - Kafka is a good solution for large-scale event processing applications, offering:
    • High Throughput
    • Built-in partitioning
    • Fault-tolerance
    • Durability
13
Q

What are the Kafka Use Cases - Activity Tracking?

A
  • Kafka was originally designed at LinkedIn to track user activity
  • The following site activities are published, with one topic per activity type:
    • Page views
    • Searches
14
Q

What are the Kafka Use Cases - Metrics and Logging?

A
  • Collects application and system metrics and logs
  • Kafka is used for monitoring data:
    • Accumulating measurements from distributed applications.
    • Producing centralized feeds of operational data
15
Q

What are the Kafka Use Cases - Log Aggregation?

A
  • Log aggregation collects physical log files from servers and puts them in a central place (a file server or HDFS, perhaps) for processing
  • Kafka abstracts away the details of files
  • Gives log/event data as a stream of messages, which is a cleaner abstraction
  • In comparison to log-centric systems (Scribe or Flume), Kafka offers:
    • Equally good performance
    • Stronger durability guarantees
    • Lower end-to-end latency
16
Q

What are the Kafka Use Cases - Stream Processing?

A
  • Kafka can perform stream processing.
  • Stream processing operates on data in real time, as quickly as messages are produced.
  • The raw input data is consumed from Kafka topics, aggregated, enriched, and, if required, sent to new topics for further consumption by other downstream systems.
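The consume-aggregate-produce pattern can be sketched in plain Python; the "topics" here are simulated as in-memory lists, whereas a real deployment would use a Kafka client or Kafka Streams:

```python
# Simulated topics: in a real system these would be Kafka topics.
raw_topic = [
    {"user": "alice", "clicks": 1},
    {"user": "bob", "clicks": 2},
    {"user": "alice", "clicks": 3},
]
enriched_topic = []

# Consume raw events and aggregate per user.
totals = {}
for event in raw_topic:
    totals[event["user"]] = totals.get(event["user"], 0) + event["clicks"]

# Produce the aggregated records to a downstream topic.
for user, clicks in totals.items():
    enriched_topic.append({"user": user, "total_clicks": clicks})

print(enriched_topic)
```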
17
Q

What are the Kafka Use Cases - Event Sourcing?

A
  • A style of application design
  • State changes are logged as time-sequenced records
  • Kafka makes an ideal backend for applications built with event sourcing, as it can support very large stored log data.
  • It also provides ordering of messages.
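The idea of time-sequenced state changes can be sketched as follows (the event types and the replay function are illustrative, not part of any Kafka API): current state is never stored directly, only rebuilt by replaying the ordered log.

```python
# Event sourcing: state changes are stored as a time-sequenced log;
# current state is derived by replaying the events in order.
events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 5},
]

def replay(events):
    balance = 0
    for e in events:
        if e["type"] == "deposit":
            balance += e["amount"]
        elif e["type"] == "withdraw":
            balance -= e["amount"]
    return balance

print(replay(events))  # 75
```

Because Kafka guarantees ordering within a partition, a topic partition can serve as exactly this kind of replayable log.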
18
Q

What are the Kafka Use Cases - Commit Log?

A
  • Kafka can serve as a kind of external commit log for a distributed system.
  • The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
  • Log compaction in Kafka supports this usage. In this use case, Kafka is similar to Apache BookKeeper.
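Log compaction can be sketched in a few lines of Python (a simplified model, not Kafka's actual implementation): for each key, only the latest value must be retained for a failed node to restore its current state.

```python
# Log compaction: keep only the most recent value per key.
log = [
    ("user:1", "alice@old.example"),
    ("user:2", "bob@example"),
    ("user:1", "alice@new.example"),
]

def compact(log):
    latest = {}
    for key, value in log:   # later entries overwrite earlier ones
        latest[key] = value
    return latest

print(compact(log))
```

After compaction the log is smaller, yet replaying it still reconstructs the same final state, which is why it works as a re-sync mechanism.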
19
Q

What are Kafka Terminologies?

A
  • Message:
    • Unit of data
    • Key-value pair
    • Similar to a row or a record in a database system
    • Messages are distributed to different partitions based on a defined key (optional); if no key is defined, messages are placed randomly into partitions.
    • Batch
      • Collection of messages
  • Topic
    • A topic is a category or feed name to which records/messages are published.
    • Partition
      • Topics are broken up into ordered commit logs called partitions.
      • Each partition is an ordered, append-only sequence of messages.
      • The first message in the partition has an “offset” of “0”, the next message “1”, and so on.
      • An offset uniquely identifies each message within the partition.
      • Writes to a partition are sequential.
      • Messages can be read from any point using an offset value.
  • Producer
    • An application that publishes messages to a topic.
    • If no key is defined, messages are distributed across partitions in a round-robin fashion.
  • Consumer
    • An application that subscribes to a topic and consumes the messages.
    • A consumer (sometimes called a subscriber or reader) reads messages.
  • Broker
    • A single Kafka server is called a broker
    • Receives messages from producers
    • Assigns offsets
    • Commits the messages to storage on disk
    • On a fetch request for a partition, it responds with the messages committed to disk.
    • Retains all published messages, irrespective of whether they have been consumed.
  • Cluster
    • A Kafka cluster is a set of brokers coordinated by ZooKeeper.
    • The cluster keeps track of the messages a consumer has consumed using the offset of the message, per partition per topic.
  • Zookeeper
    • Used to manage and coordinate the Kafka cluster
    • Keeps a list of the brokers in the cluster
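The key-to-partition and offset behaviour described above can be sketched in plain Python. This is a toy model: real Kafka uses murmur2 hashing in its default partitioner, while this sketch uses a simple byte sum, and the topic here is just a list of lists.

```python
# A toy topic with 3 partitions, each an append-only list of messages.
NUM_PARTITIONS = 3
topic = [[] for _ in range(NUM_PARTITIONS)]

def produce(key, value):
    # Messages with the same key always land in the same partition.
    # (Real Kafka uses murmur2 hashing; this sketch uses a byte sum.)
    partition = sum(key.encode()) % NUM_PARTITIONS
    offset = len(topic[partition])   # offsets are sequential per partition
    topic[partition].append((offset, key, value))
    return partition, offset

p1, o1 = produce("user-42", "login")
p2, o2 = produce("user-42", "logout")
print(p1 == p2, o2 == o1 + 1)  # True True: same partition, consecutive offsets
```

This is why keyed messages preserve per-key ordering: they all go to one partition, and within a partition writes (and offsets) are strictly sequential.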
20
Q

What are the Kafka Cluster Types - Single Node-Single Broker Cluster?

A
  • Single Node - Single Broker Cluster
    • Not for production use (no backup, single point of failure)
21
Q

What are the Kafka Cluster Types - Single Node-Multiple Broker Cluster?

A
  • Single Node - Multiple Broker Cluster
    • Multiple brokers
    • One broker will act as a controller
22
Q

What are the Kafka Cluster Types - Multiple Nodes-Multiple Brokers Cluster?

A
  • Multiple Nodes - Multiple Broker Cluster
    • The cluster does not depend on any single node.