1. Introduction Flashcards
What are the 4 Vs of Big Data?
Volume, velocity, veracity, and variety
What does volume refer to in terms of big data?
The size of the data being processed
For example, YouTube processes 72 hours of video uploads every minute
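To put that rate in perspective, here is a quick back-of-the-envelope calculation. The 72 hours per minute figure is the one quoted above; the storage cost per hour of video is an assumed value chosen only for illustration.

```python
# Upload rate quoted in the card above.
HOURS_UPLOADED_PER_MINUTE = 72

# Assumed average storage per hour of video (hypothetical figure, not from the source).
ASSUMED_GB_PER_VIDEO_HOUR = 1.0

# Scale the per-minute rate up to a full day.
hours_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24
gb_per_day = hours_per_day * ASSUMED_GB_PER_VIDEO_HOUR

print(f"{hours_per_day:,} hours of video uploaded per day")
print(f"about {gb_per_day / 1000:.1f} TB per day at {ASSUMED_GB_PER_VIDEO_HOUR} GB per hour")
```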
What are the challenges associated with the volume of big data?
- Big data takes a lot of storage
- It is computationally expensive to access and process a large amount of data
- As volume increases, performance decreases, and costs increase
What does variety refer to in terms of big data?
The complexity of the data: it can come in many different forms with varying levels of structure
What are issues associated with the variety of big data?
- It is harder to ingest a variety of data
- It is more difficult to create common storage
- It is difficult to compare and match data
- It is difficult to integrate the data
What does velocity refer to in terms of big data?
The speed at which data is created, stored, and analyzed, i.e. the amount of data per unit of time
Why is it important that we use real-time processing to keep up with the velocity of big data?
Real-time processing reduces the risk of missing opportunities by acting too slowly. The faster we can act on data insights, the better the results
What does veracity refer to in terms of big data?
The quality of the data, including its validity and volatility, as well as the reliability of the data source
What are the 5 steps of the data science process?
Acquire, prepare, analyze, report, act
What is included in the acquire data step of the data science process?
- Identify data sets
- Identify suitable data
- Retrieve data
- Query data
The exact process can vary depending on the structure and source of the data
What are the 2 parts of the prepare step of the data science process?
A. Explore data
B. Pre-process data
What is included in the explore data step of the data science process?
- Perform preliminary analysis to understand the nature of the data
- Get correlations, general trends, and outliers
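A minimal sketch of what such a preliminary analysis might look like, using only the Python standard library. The data, the function names, and the choice of Tukey's 1.5 * IQR rule for outliers are all assumptions made for illustration.

```python
from statistics import mean, quantiles

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def iqr_outliers(values):
    """Values lying more than 1.5 * IQR outside the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

# Made-up example data.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]
response_ms = [10, 12, 11, 13, 12, 11, 95]

print("correlation:", round(pearson(ad_spend, sales), 4))
print("outliers:", iqr_outliers(response_ms))
```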
What is included in the pre-process data step of the data science process?
- Clean the data of inconsistent values, missing values, etc.
- Dimensionality reduction through removing and combining features
- Scaling values
- Packaging data to prepare for analysis
Pre-processing is important because of "garbage in, garbage out": poor-quality input produces poor-quality results
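The cleaning and scaling steps could be sketched roughly as follows. The records, the rule that a negative income is "inconsistent", the mean-fill strategy for missing ages, and the helper names are all hypothetical choices for this example.

```python
from statistics import mean

# Made-up raw records with one missing value and one inconsistent value.
raw = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},  # missing age
    {"age": 40, "income": 61000},
    {"age": 31, "income": -1},       # inconsistent: negative income
]

def clean(rows):
    """Drop rows with negative income, then fill missing ages with the mean age."""
    valid = [r for r in rows if r["income"] >= 0]
    known_ages = [r["age"] for r in valid if r["age"] is not None]
    fill = mean(known_ages)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in valid]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: nothing to spread out
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

cleaned = clean(raw)
scaled_income = min_max_scale([r["income"] for r in cleaned])
print(cleaned)
print(scaled_income)
```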
What is included in the analyze data step of the data science process?
Select analytical techniques, build models, and run the models on the data set to produce output
What are some common analysis techniques?
- Classification
- Clustering
- Regression
- Association Analysis
- Graph Analytics
What is the goal of classification analysis?
Predicting a category based on data. Supervised learning technique
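As one possible illustration, a 1-nearest-neighbor classifier predicts the category of the closest labeled example. This is just one of many classification techniques, and the training data and labels below are made up.

```python
import math

# Labeled training examples: (features, category). Made-up data.
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((5.0, 5.0), "large"),
    ((6.0, 5.5), "large"),
]

def classify(point):
    """Return the label of the closest training example (1-nearest neighbor)."""
    _, label = min(training, key=lambda example: math.dist(example[0], point))
    return label

print(classify((1.2, 1.4)))
print(classify((5.5, 5.0)))
```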
What is the goal of regression analysis?
Predicting a numeric value based on data. Supervised learning technique
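A small sketch of one regression technique, ordinary least squares with a single input feature. The data is invented so that the relationship is exactly linear, and `fit_line` is a hypothetical helper name.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return slope, intercept

# Made-up data: price is exactly 3 times the size.
sizes = [50, 60, 70, 80]
prices = [150, 180, 210, 240]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept  # predict a numeric value for an unseen size
print(slope, intercept, predicted)
```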
What is the goal of clustering analysis?
Organizing similar items into groups. Unsupervised learning technique
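One common clustering technique is k-means. The sketch below runs it on one-dimensional made-up data with fixed starting centroids so the result is repeatable; real uses typically involve more dimensions and random initialization.

```python
from statistics import mean

def k_means_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids

values = [1, 2, 3, 10, 11, 12]
clusters, centroids = k_means_1d(values, centroids=[0, 5])
print(clusters, centroids)
```

No labels are supplied here: the groups emerge only from how close the values are to each other, which is what makes the technique unsupervised.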
What is the goal of association analysis?
To find rules to capture associations between items. Unsupervised learning technique
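Association rules are usually scored with support and confidence. A minimal sketch over made-up shopping baskets follows; the basket contents and function names are assumptions for illustration.

```python
# Made-up transactions: each basket is a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent` across the baskets."""
    return support(antecedent | consequent) / support(antecedent)

# Score the candidate rule "bread -> milk".
print("support:", support({"bread", "milk"}))
print("confidence:", round(confidence({"bread"}, {"milk"}), 3))
```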
What is the goal of graph analytics?
To use graph structures to find connections between entities. Unsupervised learning technique
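As a simple illustration of finding connections, the sketch below uses breadth-first search to count the fewest hops between two people in a made-up friendship graph. The graph and the `hops` helper are hypothetical.

```python
from collections import deque

# Made-up undirected friendship graph stored as adjacency lists.
friends = {
    "Ana": ["Ben"],
    "Ben": ["Ana", "Cai"],
    "Cai": ["Ben"],
    "Dee": [],
}

def hops(start, goal):
    """Fewest edges between two nodes via breadth-first search, or None if unconnected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in friends[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

print(hops("Ana", "Cai"))
print(hops("Ana", "Dee"))
```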
What are the 3 steps of the data analysis process?
- Select a technique
- Build a model
- Evaluate the model
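The three steps above can be sketched end to end on a toy problem. The technique (a single-threshold classifier), the made-up data, and the hold-out split are assumptions chosen to keep the example short.

```python
from statistics import mean

# Made-up (hours_studied, passed) pairs, split into training and held-out test data.
train = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (8, 1)]
test = [(2.5, 0), (4, 0), (6.5, 1), (9, 1)]

# 1. Select a technique: predict "passed" when hours exceed a single threshold.

# 2. Build the model: place the threshold midway between the two class means.
def build(data):
    mean_failed = mean(x for x, y in data if y == 0)
    mean_passed = mean(x for x, y in data if y == 1)
    return (mean_failed + mean_passed) / 2

# 3. Evaluate the model: accuracy on data that was not used to build it.
def accuracy(threshold, data):
    hits = sum((x > threshold) == bool(y) for x, y in data)
    return hits / len(data)

threshold = build(train)
print("threshold:", threshold)
print("test accuracy:", accuracy(threshold, test))
```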
What is included in the report step of the data science process?
Communicating results effectively, including anything that was unexpected or went wrong, to increase the chances for learning
What is included in the act step of the data science process?
Apply results to take action. Determine the next steps, which may include revisiting the model depending on the results
What are the 4 characteristics of patterns and models that are the goal of data mining?
- Valid: holds on new data with some certainty
- Useful: it should be possible to act on the pattern
- Unexpected: non-obvious to the system
- Understandable: humans should be able to interpret the pattern