Big Data Flashcards
The major characteristics used to define big data are
volume, variety, and velocity (often called the three V’s).
distributed computing allows us to
process big data because it divides the data into more manageable chunks and distributes the work among multiple computers that can process the data in parallel.
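As a minimal sketch of the divide-and-distribute idea, here is a word count split across worker processes (assuming Python and its standard multiprocessing module; the dataset, chunk size, and word-count task are illustrative assumptions):

```python
from multiprocessing import Pool

def count_words(chunk):
    # Each worker processes one manageable chunk of the data.
    return sum(len(line.split()) for line in chunk)

def chunked(lines, size):
    # Divide the full dataset into smaller chunks.
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

if __name__ == "__main__":
    data = ["the quick brown fox"] * 1_000_000  # stand-in for a large dataset
    with Pool() as pool:
        # Distribute the chunks across worker processes...
        partial_counts = pool.map(count_words, chunked(data, 100_000))
    # ...then combine the partial results.
    print(sum(partial_counts))
```

The same split-process-combine pattern scales from processes on one machine to clusters of many machines.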
When we think in terms of big data processing, there are two types of data that we process: batch data and streaming data.
What are the differences between batch and streaming data?
Batch data is data that we have in storage and that we process all at once, or in a batch.
Streaming data is data that is being continually produced by one or more sources and therefore must be processed incrementally as it arrives.
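A minimal sketch of the contrast in plain Python (the sensor-reading source and running-average logic are illustrative assumptions):

```python
import random
import statistics

# Batch: the full dataset is already in storage, so we process it all at once.
batch = [random.gauss(20.0, 2.0) for _ in range(10_000)]  # e.g., stored sensor readings
print("batch mean:", statistics.mean(batch))

# Streaming: readings arrive one at a time, so we process them incrementally.
def sensor_stream():
    while True:  # an unbounded source that keeps producing data
        yield random.gauss(20.0, 2.0)

count, total = 0, 0.0
for reading in sensor_stream():
    count += 1
    total += reading
    if count % 1_000 == 0:
        print(f"running mean after {count} readings: {total / count:.2f}")
    if count >= 5_000:  # stop the demo; a real stream would never end
        break
```

Note that the batch computation sees all the data before producing a result, while the streaming version must keep a running state and emit results as data arrives.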
Data storage systems:
Data warehouse
Data lakes
Unified data platform
Data warehouse technology emerged in the 1980s and provides a centralized repository for storing all of an organization’s data. Data warehouses can be on-premises or in the cloud.
Unlike data warehouses, which usually take clean data, data lakes store data in its raw format. Data lakes can store unstructured as well as structured data, and they are more horizontally scalable (in other words, it’s easy to keep adding more data to a data lake).
Finally, a data storage system that is quickly gaining popularity today is the unified data platform. Unified data platforms provide all of the benefits of data lakes, with the addition of some data warehousing capabilities, all wrapped up in a single platform that your data teams can work in together.
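A minimal sketch of the warehouse-versus-lake distinction (using Python’s built-in sqlite3 as a stand-in warehouse and a directory of JSON files as a stand-in lake; both stand-ins and the file layout are assumptions for illustration):

```python
import json
import sqlite3
from pathlib import Path

# Warehouse-style storage: data is cleaned into a fixed schema before loading.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES (?, ?)", ("EMEA", 1250.0))
conn.commit()

# Lake-style storage: raw records land as-is; structure is applied later, on read.
lake = Path("data_lake/raw/sales")
lake.mkdir(parents=True, exist_ok=True)
raw_event = {"region": "EMEA", "amount": "1,250.00", "note": "unvalidated input"}
(lake / "event_0001.json").write_text(json.dumps(raw_event))
```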
The whole point of working with big data is
to be able to extract insights that can help drive business decisions.
Artificial intelligence, machine learning, deep learning, and data science
Artificial intelligence (AI) is a branch of computer science in which computer systems are developed to perform tasks that would typically need human intelligence. AI is a broad field, and it encompasses many techniques under its umbrella.
Machine learning (ML) is a subset of artificial intelligence that works very well with structured data. The goal behind machine learning is for machines to learn patterns in your data without you explicitly programming them to do so. There are a few types of machine learning; the most commonly used type is called supervised machine learning.
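A minimal sketch of supervised machine learning (assuming Python with scikit-learn installed; the iris dataset and logistic regression model are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Supervised learning: the model learns patterns from labeled examples.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # learn the patterns without explicit rules

# Evaluate on examples the model has never seen.
print("test accuracy:", model.score(X_test, y_test))
```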
Deep learning (DL) is a subset of machine learning that uses neural networks, sets of algorithms modeled on the structure of the human brain. Deep learning models are much more complex than most machine learning models, and they require significantly more time and effort to build. Unlike machine learning models whose performance plateaus after a certain amount of data, deep learning continues to improve as the data size increases. It performs well on complex data like images, sequences, and natural language.
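A minimal neural network sketch (again assuming scikit-learn; the small two-layer MLPClassifier and the digits dataset are illustrative stand-ins, far smaller than the deep networks used on images or language in practice):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A small multi-layer perceptron: layers of artificial neurons,
# loosely modeled on the structure of the brain.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```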
Data science is a field that combines tools and workflows from disciplines like math and statistics, computer science and business, to process, manage and analyze data. Data science is very popular in businesses today as a way to extract insights from big data to help inform business decisions.
data science workflow
The data science workflow is a series of steps that data practitioners follow to work with big data. It is a cyclical process that often starts with identifying business problems and ends with delivering business value.
The data science workflow
- Identifying business needs
- Data ingestion
- Data cleansing / preparation
- Data analysis (machine learning and deep learning aren’t the only types of analysis that can be applied to your data, but they are becoming more and more popular today, especially when it comes to big data)
- Sharing insights (a minimal sketch of these workflow steps follows below)
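A minimal end-to-end sketch of these steps (assuming Python with pandas; the inline CSV, column names, and cleaning rule are illustrative assumptions):

```python
import io
import pandas as pd

# Business need (assumed for illustration): which region drives the most revenue?

# Data ingestion: read raw data (an inline CSV stands in for a real source).
raw = io.StringIO("region,amount\nEMEA,1250\nAPAC,\nEMEA,900\nAMER,1100\n")
df = pd.read_csv(raw)

# Data cleansing / preparation: drop records with missing amounts.
df = df.dropna(subset=["amount"])

# Data analysis: aggregate revenue by region.
revenue = df.groupby("region")["amount"].sum().sort_values(ascending=False)

# Sharing insights: report the result to stakeholders.
print("Revenue by region:\n", revenue)
print("Top region:", revenue.idxmax())
```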
Big data refers to data that …
is nearly impossible to process using traditional methods, like a single computer, because there’s so much of it (volume), it’s being generated so quickly (velocity), and it comes in many different formats (variety).
Velocity?
refers to the speed at which new data is generated and the speed at which data moves around.