Section 1: Introduction Flashcards
What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on the abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from a messaging queue into a full-fledged event streaming platform. It supports use cases such as:
- Real-time processing
- Batch processing
- Operational use cases such as application logs collection
What is Structured data?
The term structured data refers to data that resides in a fixed field within a file or record. Structured data is typically stored in a relational database (RDBMS). It can consist of numbers and text, and sourcing can happen automatically or manually, as long as it’s within an RDBMS structure. It depends on the creation of a data model, defining what types of data to include and how to store and process it.
The programming language used for structured data is SQL (Structured Query Language). Developed by IBM in the 1970s, SQL handles relational databases. Typical examples of structured data are names, addresses, credit card numbers, geolocation, and so on.
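The fixed-field, data-model-driven nature of structured data can be illustrated with an in-memory relational database. A minimal sketch using Python's built-in `sqlite3` module (the table and column names are illustrative, not from any real system):

```python
import sqlite3

# Structured data depends on a data model defined up front:
# a schema with fixed, typed fields.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        name        TEXT,
        address     TEXT,
        card_number TEXT
    )
""")
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?)",
    ("Ada Lovelace", "12 Example St", "4111-XXXX-XXXX-1111"),
)

# SQL queries rely on those fixed fields existing for every record.
row = conn.execute(
    "SELECT name FROM customers WHERE address = '12 Example St'"
).fetchone()
print(row[0])  # Ada Lovelace
```

Because every record must fit the schema, the data can be queried, indexed, and joined reliably; this is exactly what unstructured data lacks.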
e.g.:
- Databases
- Census records
- Phone numbers
- Metadata
  - Time and date of creation
  - File size
  - Author
- Economic data
  - GDP, PPI, ASX
What is Semi-Structured data?
A simple definition of semi-structured data is data that can't be organized in relational databases or doesn't have a strict structural framework, yet does have some structural properties or a loose organizational framework. Semi-structured data includes text that is organized by subject or topic or fits into a hierarchical programming language, yet the text within is open-ended, having no structure itself.
Emails, for example, are semi-structured by Sender, Recipient, Subject, Date, etc., or with the help of machine learning, are automatically categorized into folders, like Inbox, Spam, Promotions, etc.
e.g. JSON, XML, Spreadsheets
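The email example above can be sketched in code: the envelope fields (Sender, Subject, Date) are structured, while the body is free text. A small illustration using Python's standard `json` module (the field names and addresses are made up for the example):

```python
import json

# Semi-structured: a loose organizational framework of named fields,
# with open-ended text inside.
raw = """
{
    "sender": "alice@example.com",
    "recipient": "bob@example.com",
    "subject": "Quarterly report",
    "date": "2023-01-15",
    "body": "Hi Bob, the numbers look good this quarter..."
}
"""

email = json.loads(raw)

# The organizational fields can be queried like structured data...
print(email["sender"])  # alice@example.com

# ...but the body itself is just unstructured text.
print(type(email["body"]))  # <class 'str'>
```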
What is Unstructured Data?
Unstructured data is more or less all the data that is not structured. Even though unstructured data may have a native, internal structure, it’s not structured in a predefined way. There is no data model; the data is stored in its native format.
Typical examples of unstructured data are rich media, text, social media activity, surveillance imagery, and so on.
The amount of unstructured data is much larger than that of structured data. Unstructured data makes up a whopping 80% or more of all enterprise data, and the percentage keeps growing. This means that companies not taking unstructured data into account are missing out on a lot of valuable business intelligence.
As per IDG, unstructured data volume is growing at a rate of 62% per year.
e.g.:
- Text files
  - Word processing
  - Spreadsheets
  - Presentations
  - Email body
  - Mobile data (text messages)
- Social media
  - Data from Facebook
- Communications
  - Chat
  - IM
  - Phone recordings
- Media files
  - MP3
  - Digital photos
  - Audio and video files
What is Quasi-Structured data?
Quasi-structured data is textual data with erratic data formats. It can be formatted with effort, tools, and time. This data type includes web clickstream data, such as Google searches.
Example: Clickstream data - which webpages a user has visited, and in what order.
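Clickstream data shows why quasi-structured data "can be formatted with effort, tools, and time": a raw web-server log line is just erratic text, but a regular expression can impose structure and recover the visit order. A hypothetical sketch in Python (the log format and field names are assumptions for illustration):

```python
import re

# Raw clickstream lines: textual data with no formal schema.
log_lines = [
    "2023-01-15T10:00:01 user=42 GET /home",
    "2023-01-15T10:00:09 user=42 GET /products",
    "2023-01-15T10:00:30 user=42 GET /checkout",
]

# With effort and tooling (here, a regex), structure can be imposed.
pattern = re.compile(r"(\S+) user=(\d+) GET (\S+)")

visits = []
for line in log_lines:
    m = pattern.match(line)
    if m:
        timestamp, user, page = m.groups()
        visits.append(page)

# Webpages the user visited, in order:
print(visits)  # ['/home', '/products', '/checkout']
```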
What is Big Data?
- Big Data is a phrase used to mean a massive volume of data (both structured and unstructured) that is so large that it is difficult to process using traditional database and software techniques.
- Contribution of different social media platforms to the exploding data:
  - YouTube - 300 hours of new videos are uploaded by users (every minute)
  - Instagram - 4,166,667 posts and 1,736,111 likes (every minute)
  - Facebook - Users generate 4 million likes (every minute)
  - Twitter - 350,000 Tweets (every minute)
  - Pinterest - 9,723 articles are pinned (every minute)
- Big Data Examples:
- New York Stock Exchange generates 1 terabyte of trade data per day
- Facebook generates 500+ terabytes of data every day
- A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time
What are the 6 V’s that Characterises Big data?
- Velocity
  - The speed at which enormous amounts of data are being generated.
- Volume
  - The amount of data from myriad sources.
- Variety
  - The types of data: structured, semi-structured, unstructured.
- Veracity
  - The degree to which big data can be trusted (quality of the data).
- Value
  - The business value of the data collected.
- Variability
  - The ways in which the big data can be used and formatted.
What is Big data Analytics?
Big Data Analytics is the complex process of inspecting big data to uncover information that can help organizations make informed business decisions. Items which big data analytics can help uncover:
- Hidden patterns
- Unknown correlations
- Market trends
- Customer preferences
What are the Benefits of Big Data?
- Big data helps organizations create new growth opportunities by increasing their efficiency and enabling better decision making.
- Understand the market
  - With enhancements in analytical systems, such as in-memory analytics able to analyze new sources of data, companies can understand the market much better, which in turn helps them make smarter decisions on the go.
- Reduced time for decision making
  - To meet ever-changing customer demands, organizations streamline their operational processes and gain the insight to take quick business decisions.
- Cost savings
  - Big data technologies help companies reduce the cost of storing large amounts of data.
- New product development and services
  - Using big data analytics, organizations become familiar with customers' needs and create new products to meet those needs.
- Manage online reputation
  - Big data tools can perform sentiment analysis, giving an organization feedback about its online reputation, which it can then manage.
What are the Use Cases for Big Data Analytics?
- Customer sentiment analysis
- Behavioral Analytics
- Predictive support
- Fraud detection
- Customer segmentation
What are the Kafka Features?
- High throughput
  - Support for millions of messages per second
- Durability
  - Provides support for persisting messages on disk
- No data loss
  - Ensures no data loss
  - Provides compression and security
- Replication
  - Messages can be replicated across clusters, which supports multiple subscribers
- Stream processing
  - Kafka can be used along with Spark and Storm
- Scalability
  - Highly scalable distributed system with no downtime
What are the Kafka Use Cases - Messaging?
- Messaging - Kafka is a good solution for large-scale event processing applications, offering:
  - High throughput
  - Built-in partitioning
  - Fault tolerance
  - Durability
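The "built-in partitioning" above means Kafka assigns each keyed message to a partition by hashing its key, so all messages with the same key land on the same partition and keep their order. A simplified sketch of that idea in plain Python (the real Kafka client uses murmur2 hashing; `crc32` stands in here just to keep the example deterministic and self-contained):

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # Hash the message key to pick a partition (Kafka uses murmur2;
    # crc32 is an illustrative stand-in).
    return zlib.crc32(key) % num_partitions

# All messages for a given key go to one partition, preserving order.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for key, value in [(b"user-42", "login"), (b"user-7", "view"),
                   (b"user-42", "purchase")]:
    partitions[partition_for(key)].append((key, value))

# Every message for user-42 sits in the same partition, in send order.
p = partition_for(b"user-42")
print([v for k, v in partitions[p] if k == b"user-42"])  # ['login', 'purchase']
```

Partitioning is what lets Kafka scale throughput across brokers while still guaranteeing per-key ordering.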
What are the Kafka Use Cases - Activity Tracking?
- Kafka was originally designed at LinkedIn to track user activity
- The following site activities are published, with one topic per activity type:
  - Page views
  - Searches
What are the Kafka Use Cases - Metrics and Logging?
- Collect application and system metrics and logs
- Kafka is used for monitoring data:
  - Accumulating measurements from distributed applications
  - Producing centralized feeds of operational data
What are the Kafka Use Cases - Log Aggregation?
- Log aggregation collects physical log files from servers and puts them in a central place (a file server or HDFS, perhaps) for processing
- Kafka abstracts the details of files
- Gives log/event data as stream of messages which is a cleaner abstraction
- In comparison to log-centric systems (such as Scribe or Flume), Kafka offers:
- Equally good performance
- Strong durability
- Lower end-to-end latency
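The abstraction described above, from physical log files to a stream of messages, can be sketched as a small generator that turns raw log lines into structured message records. A hypothetical illustration in Python (the record fields are assumptions for the sketch, not a Kafka API):

```python
# Log aggregation as a stream: instead of shipping whole files,
# each log line becomes one message in an ordered stream
# (which is what Kafka provides at scale).
def as_message_stream(hostname, lines):
    for offset, line in enumerate(lines):
        yield {"host": hostname, "offset": offset, "message": line.rstrip("\n")}

raw_log = [
    "INFO  service started\n",
    "WARN  disk usage at 85%\n",
]

stream = list(as_message_stream("web-01", raw_log))
print(stream[0])
# {'host': 'web-01', 'offset': 0, 'message': 'INFO  service started'}
```

Downstream consumers then process log/event data uniformly, without caring which file or server a line came from.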