Big Data Flashcards
What is Big Data?
- information that cannot be processed using traditional methods due to:
- volume - extremely large datasets
- variety - of sources
- velocity - often needs to be real-time
Often
- raw – unclear how to obtain value
- Unstructured / Semi-structured
Today companies have the ability to store any data they generate, but don’t know what to do with it.
more data = more processing
key drivers of Big Data
This trend started when
- enterprises created more data from their operations
- Web and social media content explosion
- IoT
- networked devices
3 reasons for the exponential data growth
Instrumentation
- more sensors
- more storage
Interconnection
- more things are interconnected
Intelligence
- computers have become cheap
- software has become powerful
- Systems on a Chip (SoC) are dense, fast, cheap
3 Big Data Characteristics
- Volume
- Variety
- Velocity
- (Veracity - trustworthiness)
- (Value)
What is the Blind Zone?
We create more data than we can process.
Blind Zone = we don’t know
- Could be a great opportunity
- Could be nothing
- But we don’t know and we don’t have the capacity to find out.
Can data warehouses handle Big Data and why?
- They’re great with structured data, ideally relational
- Struggle with Big Data due to variety
How much of the data an organisation creates is cleansed, transformed, and loaded into a Data Warehouse?
Only 20% of data that could be used.
- albeit the very important 20%
80% of data is Raw, Unstructured or Semi Structured.
3 categories of data based on their form in the primary source
- Structured: transactional data from enterprise applications
- Semi-structured: machine data from the IoT
- Unstructured: text, audio and video from social media and Web applications
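The three forms can be sketched side by side (the records below are made-up examples, not data from any real system):

```python
import json

# Structured: fixed schema, every record has the same fields (like a table row)
structured_row = ("2024-01-15", "ORD-1001", 249.99)  # (date, order_id, total)

# Semi-structured: self-describing JSON from a hypothetical IoT sensor;
# fields may vary from record to record
semi_structured = json.loads('{"sensor": "temp-7", "reading": 21.5, "meta": {"battery": 0.9}}')

# Unstructured: free text with no schema at all
unstructured = "Loved the product, but shipping took two weeks..."

print(semi_structured["sensor"])  # fields accessed by name, no predefined schema
```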
Relational databases only work well with structured data.
How to handle unstructured data?
NoSQL databases (“Not only” SQL databases)
NoSQL Databases Attributes
- Significant installed base of systems, particularly websites, using a NoSQL database
- Supports distributed, scalable, and real-time data updates
- Schema-free design that provides flexibility to start loading data and then changing it later
- Provides BASE rather than ACID support.
- Basically available: the 24/7 high availability often demanded of transactional systems is relaxed
- Soft state: database may be inconsistent at any point in time
- Eventually consistent
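The BASE behaviour can be illustrated with a toy two-replica key-value store (a simplified sketch, not a real database):

```python
# Two replicas of a key-value store plus a replication log.
replica_a = {}
replica_b = {}
pending = []  # writes waiting to be copied to replica_b

def write(key, value):
    # The write is acknowledged after updating only one replica: basically available
    replica_a[key] = value
    pending.append((key, value))

def replicate():
    # Background replication eventually brings the replicas into agreement
    while pending:
        key, value = pending.pop(0)
        replica_b[key] = value

write("user:1", "Alice")
print(replica_b.get("user:1"))  # None -> soft state: replicas disagree for a while
replicate()
print(replica_b.get("user:1"))  # "Alice" -> eventually consistent
```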
NoSQL databases fall into several technology architecture categories:
- Key-Value
- Column-Family
- Document
- Graph
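One way to picture the four data models is how each might shape the same "user" data (toy sketches only; real stores have their own formats and query layers):

```python
# Key-Value: an opaque value looked up by a single key
kv = {"user:1": b"...serialized blob..."}

# Column-Family: rows hold sparse groups (families) of related columns
column_family = {"user:1": {"profile": {"name": "Alice", "city": "Oslo"}}}

# Document: a self-contained, nested, queryable document per record
document = {"_id": "user:1", "name": "Alice",
            "orders": [{"id": "ORD-1", "total": 20}]}

# Graph: nodes plus explicit, first-class relationships (edges)
graph = {"nodes": {"user:1": "Alice", "user:2": "Bob"},
         "edges": [("user:1", "follows", "user:2")]}

print(document["orders"][0]["total"])
```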
Relational Databases Attributes
- Large installed base of applications, often running key business processes within an enterprise
- Large pool of experienced people with skills such as DBA, application developer, architect, and business analyst
- Increasing scalability and capability due to advances in relational technology and underlying infrastructure
- Large pool of BI, data integration, and related tools that leverage the technology
- Requires a schema with tables, columns, and other entities to load and query database
- For transactional data it provides ACID support to guarantee transactional integrity and reliability.
- Atomic: Entire transaction succeeds or it is rolled back
- Consistent: A transaction needs to be in a consistent state to be completed
- Isolated: Transactions are independent of each other
- Durable: Transactions persist after they are completed
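Atomicity in particular can be demonstrated with SQLite from the standard library (the table and amounts are made up): either both rows of the transfer are written, or neither is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # 'with' commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass  # the CHECK constraint fails, so the whole transfer is rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0} -- nothing partial was persisted
```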
define data velocity
How fast data is generated, flows, is stored, retrieved and analysed.
key characteristics of stream analytics + 2 use cases
- Data has a short shelf life
- Spot trend, opportunity, or problem in microseconds
- algorithmic traders
- fraud detection
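A fraud-detection rule over a stream can be sketched with a sliding time window (the threshold and events are illustrative, not a real scoring system):

```python
from collections import deque

WINDOW_SECONDS = 60  # hypothetical window length
MAX_TXNS = 3         # hypothetical rate threshold
recent = {}          # card -> deque of recent transaction timestamps

def is_suspicious(card, timestamp):
    window = recent.setdefault(card, deque())
    window.append(timestamp)
    while window and timestamp - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop events that aged out of the window
    return len(window) > MAX_TXNS

events = [("card-1", t) for t in (0, 10, 20, 30, 500)]
flags = [is_suspicious(card, t) for card, t in events]
print(flags)  # the 4th rapid transaction trips the rule; the late 5th does not
```

Each event is scored against only the last minute of history, which is what gives stream analytics its microsecond-scale "spot it now or never" character.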
Basically, what is required to make Big Data valuable?
Need to be able to process a massive volume of disparate types of data and analyse it to produce insight in a time frame driven by the business need.
- The algorithms and models haven’t changed.
- We are still doing correlation and link analysis and prediction.
- It’s just that the volume of data we run the models against has become much larger.
- Machine learning has an increasing part to play
Are DWs trusted? Why are they needed?
Businesses need trust.
- Data in a data warehouse is trusted.
- It goes through a rigorous process of cleansing, formatting, enrichment, metadata attachment, etc.
- It’s high quality
- Quality is expensive
- Data in a DW is high value.
Need
- Government regulations require high quality data.
- CEOs and CFOs of companies publicly traded on US-based stock exchanges must certify the accuracy of their financial statements. This also applies to their non-US operations.