1. Introduction Flashcards
What are the 4 Vs of Big Data?
Volume, velocity, veracity, and variety
What does volume refer to in terms of big data?
The size of the data being processed
For example, YouTube processes 72 hours of video uploads every minute
What are the challenges associated with the volume of big data?
- Big data takes a lot of storage
- It is computationally expensive to access and process a large amount of data
- As volume increases, performance decreases, and costs increase
What does variety refer to in terms of big data?
The complexity of the data: it can come in many different forms with varying levels of structure
What are issues associated with the variety of big data?
- It is harder to ingest a variety of data
- More difficult to create common storage
- Difficult to compare and match data
- Difficult to integrate
What does velocity refer to in terms of big data?
The speed at which data is created, stored, and analyzed; the amount of data produced per unit of time
Why is it important that we use real-time processing to keep up with the velocity of big data?
Real-time processing reduces the risk of missing opportunities due to acting too slowly. The faster we can act on data insights, the better the results we can get
What does veracity refer to in terms of big data?
The quality of the data, including its validity and volatility, as well as the reliability of the data source
What are the 5 steps of the data science process?
Acquire, prepare, analyze, report, act
What is included in the acquire data step of the data science process?
- Identify data sets
- Identify suitable data
- Retrieve data
- Query data
The exact process can vary depending on the structure and source of the data
What are the 2 parts of the prepare step of the data science process?
A. Explore data
B. Pre-process data
What is included in the explore data step of the data science process?
- Perform preliminary analysis to understand the nature of the data
- Identify correlations, general trends, and outliers
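A minimal sketch of the explore step, using hypothetical temperature readings: compute summary statistics, flag outliers more than two standard deviations from the mean, and compute a Pearson correlation between two columns (all data and names here are illustrative, not from the source).

```python
import statistics

# Hypothetical sensor readings; 48.0 is a deliberate outlier
temps = [21.0, 22.5, 21.8, 23.1, 22.0, 48.0, 21.4]

mean = statistics.mean(temps)
stdev = statistics.stdev(temps)

# Flag values more than 2 standard deviations from the mean
outliers = [t for t in temps if abs(t - mean) > 2 * stdev]  # [48.0]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Perfectly linearly related columns correlate at 1.0
r = pearson([1, 2, 3], [2, 4, 6])
```

Even this tiny pass reveals the shape of the data: one reading is far outside the normal range and would deserve attention before modeling.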
What is included in the pre-process data step of the data science process?
- Clean the data of inconsistent values, missing values, etc
- Reduce dimensionality by removing and combining features
- Scale values
- Package the data to prepare it for analysis. Important because of "garbage in, garbage out"
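Two of the bullet points above can be sketched in a few lines: filling missing values with the column mean, then min-max scaling into [0, 1]. The data and function names are hypothetical, and real pipelines would typically use a library such as pandas or scikit-learn for this.

```python
def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Scale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, None, 40, 60]
cleaned = fill_missing(ages)      # [20, 40.0, 40, 60]
scaled = min_max_scale(cleaned)   # [0.0, 0.5, 0.5, 1.0]
```

Scaling matters because many models weight features by magnitude; leaving one column in the thousands and another in fractions skews the result, which is exactly the "garbage in, garbage out" risk.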
What is included in the analyze data step of the data science process?
Select analytical techniques, build models, and run them on the data set to produce output
What are some common analysis techniques?
- Classification
- Clustering
- Regression
- Association Analysis
- Graph Analytics
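As a minimal sketch of one technique from the list, here is simple linear regression fit by ordinary least squares. The data points are hypothetical and chosen to lie exactly on a line so the fit is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)  # (2.0, 1.0)
```

The same select-build-run pattern applies to the other techniques: classification and clustering fit a model to labeled or unlabeled data, then apply it to produce predictions or group assignments.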