1. Introduction Flashcards

Question 1

Q

What are the 4 Vs of Big Data?

Answer

A

Volume, velocity, veracity, and variety

Question 2

Q

What does volume refer to in terms of big data?

Answer

A

The size of the data being processed
For example, YouTube processes 72 hours of video uploads every minute

Question 3

Q

What are the challenges associated with the volume of big data?

Answer

A

Big data takes a lot of storage
It is computationally expensive to access and process a large amount of data
As volume increases, performance decreases, and costs increase

Question 4

Q

What does variety refer to in terms of big data?

Answer

A

The complexity of the data. Data can come in many different forms with varying levels of structure

Question 5

Q

What are issues associated with the variety of big data?

Answer

A

Harder to ingest a variety of data
Harder to create common storage
Harder to compare and match data across sources
Harder to integrate the data

Question 6

Q

What does velocity refer to in terms of big data?

Answer

A

The speed at which data is created, stored, and analyzed, i.e. the amount of data per unit of time

Question 7

Q

Why is it important that we use real-time processing to keep up with the velocity of big data?

Answer

A

Real-time processing reduces the risk of missing opportunities due to moving slowly. The faster we can act on data insights, the better results we can get

Question 8

Q

What does veracity refer to in terms of big data?

Answer

A

The quality of the data, including its validity and volatility. It also includes the reliability of the data source

Question 9

Q

What are the 5 steps of the data science process?

Answer

A

Acquire, prepare, analyze, report, act

Question 10

Q

What is included in the acquire data step of the data science process?

Answer

A

Identify data sets
Identify suitable data
Retrieve data
Query data
The exact process can vary depending on the structure and source of the data

Question 11

Q

What are the 2 parts of the prepare step of the data science process?

Answer

A

A. Explore data
B. Pre-process data

Question 12

Q

What is included in the explore data step of the data science process?

Answer

A

Perform preliminary analysis to understand the nature of the data
Identify correlations, general trends, and outliers
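
As a rough illustration of this exploration step, the sketch below computes a Pearson correlation and flags outliers by z-score using only the Python standard library. The function names, the sample values, and the z-score threshold of 2 are illustrative assumptions, not part of the original deck.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

def outliers(values, z_threshold=2.0):
    """Values whose distance from the mean exceeds z_threshold standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Hypothetical sample: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 67, 71]
r = pearson(hours, scores)   # close to 1: strong positive correlation

# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 48]
flagged = outliers(readings)  # only the 48 is flagged
```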

Question 13

Q

What is included in the pre-process data step of the data science process?

Answer

A

Clean the data of inconsistent values, missing values, etc.
Reduce dimensionality by removing and combining features
Scale values
Package the data to prepare it for analysis. This matters because of garbage in, garbage out
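
A minimal sketch of two of these pre-processing tasks, cleaning and scaling, might look like the following. Dropping incomplete rows is only one possible cleaning strategy, min-max scaling is only one possible scaling method, and the helper names and sample rows are hypothetical.

```python
def drop_incomplete(rows):
    """Keep only rows with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]

def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical (age, income) records; one record is missing its income.
raw = [(25, 40000), (32, None), (47, 85000), (51, 62000)]

clean = drop_incomplete(raw)
ages = min_max_scale([age for age, _ in clean])
incomes = min_max_scale([income for _, income in clean])
```

After scaling, both features lie on the same [0, 1] range, so neither dominates a distance-based technique simply because of its units.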

Question 14

Q

What is included in the analyze data step of the data science process?

Answer

A

Select analytical techniques, build models, and run the models on the data set to produce output

Question 15

Q

What are some common analysis techniques?

Answer

A

Classification
Clustering
Regression
Association Analysis
Graph Analytics

Question 16

Q

What is the goal of classification analysis?

Answer

A

Predicting a category based on data. Supervised learning technique
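
To make the idea concrete, here is a small sketch of one classification technique, k-nearest neighbors, chosen purely for illustration (the deck does not name a specific algorithm). The labeled points and the function name are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, point, k=3):
    """Predict the majority label among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: (features, category).
train = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"), ((2.0, 1.5), "small"),
    ((8.0, 8.0), "large"), ((9.0, 7.5), "large"), ((8.5, 9.0), "large"),
]

label = knn_predict(train, (7.5, 8.0))
```

The labels in the training data are what make this supervised: the model learns from examples whose category is already known.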

Question 17

Q

What is the goal of regression analysis?

Answer

A

Predicting a numeric value based on data. Supervised learning technique
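
One simple regression technique is a least-squares line fit, sketched below with the standard library only. The choice of linear regression and the sample data are assumptions made for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical training data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept   # numeric prediction for an unseen x = 5
```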

Question 18

Q

What is the goal of clustering analysis?

Answer

A

Organizing similar items into groups. Unsupervised learning technique
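
As one possible illustration, the sketch below runs a bare-bones k-means clustering. K-means is just one clustering technique, and the naive initialization (using the first k points as starting centroids), the fixed iteration count, and the sample points are all simplifying assumptions.

```python
import math

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point
    to its nearest centroid and recomputing the centroids."""
    centroids = list(points[:k])  # naive initialization: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
```

No labels are supplied anywhere, which is what makes this unsupervised: the groups emerge from the distances between the points alone.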

Question 19

Q

What is the goal of association analysis?

Answer

A

To find rules to capture associations between items. Unsupervised learning technique
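
Association rules are commonly scored with support and confidence. The sketch below computes both for a single candidate rule; the metric definitions are the standard ones, while the basket data and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Candidate rule: {diapers} -> {beer}
rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
```

Here three of the five baskets contain both items, and three of the four baskets with diapers also contain beer, so the rule has support of about 0.6 and confidence of about 0.75.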

Question 20

Q

What is the goal of graph analytics?

Answer

A

To use graph structures to find connections between entities. Unsupervised learning technique
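
One simple way to find a connection between two entities is a breadth-first search over an adjacency list, sketched below. The choice of BFS and the small friendship graph are assumptions made for illustration; real graph analytics covers many other measures.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search: return a shortest path from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # the two entities are not connected

# Hypothetical friendship graph stored as an adjacency list.
friends = {
    "ana": ["ben", "cho"],
    "ben": ["ana", "dev"],
    "cho": ["ana"],
    "dev": ["ben", "eli"],
    "eli": ["dev"],
}

path = shortest_path(friends, "ana", "eli")
```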

Question 21

Q

What are the 3 steps of the data analysis process?

Answer

A

Select a technique
Build a model
Evaluate the model
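
The three steps can be sketched end to end on a toy problem. The technique selected here (a single-threshold classifier), the data, and the train/test split are all hypothetical choices made to keep the example short.

```python
# Step 1. Select a technique: a one-feature threshold classifier (illustrative choice).

def build_threshold_model(train):
    """Step 2. Build: place the threshold midway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, test):
    """Step 3. Evaluate: fraction of held-out examples classified correctly."""
    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)

# Hypothetical (feature, label) pairs; the last two are held out for evaluation.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1), (2.5, 0), (7.5, 1)]
train, test = data[:6], data[6:]

threshold = build_threshold_model(train)
score = accuracy(threshold, test)
```

Evaluating on examples that were not used to build the model gives a fairer estimate of how it would behave on new data.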

Question 22

Q

What is included in the report step of the data science process?

Answer

A

Communicating results effectively, including anything that was unexpected or went wrong, to increase the chances for learning

Question 23

Q

What is included in the act step of the data science process?

Answer

A

Apply the results to take action. Determine the next steps, which may include revisiting the model depending on the results

Question 24

Q

What are the 4 characteristics of patterns and models that are the goal of data mining?

Answer

A

Valid: holds on new data with some certainty
Useful: it is possible to act on the pattern
Unexpected: non-obvious to the system
Understandable: humans can interpret the pattern

(24 cards)