1. Introduction Flashcards

1
Q

What are the 4 Vs of Big Data?

A

Volume, velocity, veracity, and variety

2
Q

What does volume refer to in terms of big data?

A

The size of the data being processed
For example, YouTube processes 72 hours of video uploads every minute
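
To get a feel for that scale, the per-minute figure above can be converted into a daily total. This is only arithmetic on the number quoted in the card; the variable names are illustrative.

```python
# Hours of video uploaded per minute, as quoted in the card above.
HOURS_UPLOADED_PER_MINUTE = 72

MINUTES_PER_DAY = 60 * 24

# Total hours of new video arriving in one day at that rate.
hours_per_day = HOURS_UPLOADED_PER_MINUTE * MINUTES_PER_DAY

# The same amount expressed as years of continuous playback (365-day years).
years_of_playback_per_day = hours_per_day / (24 * 365)

print(hours_per_day)
print(round(years_of_playback_per_day, 1))
```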

3
Q

What are the challenges associated with the volume of big data?

A
  1. Big data takes a lot of storage
  2. It is computationally expensive to access and process a large amount of data
  3. As volume increases, performance decreases, and costs increase
4
Q

What does variety refer to in terms of big data?

A

The complexity of the data: it can come in many different forms with varying levels of structure

5
Q

What are issues associated with the variety of big data?

A
  1. Harder to ingest data that arrives in many forms
  2. More difficult to create common storage
  3. Difficult to compare and match data across sources
  4. Difficult to integrate
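
The ingestion and integration issues above can be illustrated with a minimal sketch that maps three differently structured inputs onto one common record shape. The sample inputs, field names, and helper functions are all hypothetical, chosen only for illustration.

```python
import csv
import io
import json
import re

# Three hypothetical inputs describing the same kind of event.
csv_text = "user,amount\nalice,12.50\n"                  # structured
json_text = '{"user": "bob", "payment": {"amount": 7}}'  # semi-structured
free_text = "carol paid 3.25 at the kiosk"               # unstructured


def from_csv(text):
    """Structured rows map directly onto the common schema."""
    return [{"user": row["user"], "amount": float(row["amount"])}
            for row in csv.DictReader(io.StringIO(text))]


def from_json(text):
    """Semi-structured data requires knowing how the fields are nested."""
    doc = json.loads(text)
    return [{"user": doc["user"], "amount": float(doc["payment"]["amount"])}]


def from_text(text):
    """Unstructured text needs pattern matching, which is fragile."""
    match = re.search(r"(\w+) paid (\d+(?:\.\d+)?)", text)
    if match is None:
        return []
    return [{"user": match.group(1), "amount": float(match.group(2))}]


# Only after each source has its own parser can the data be combined.
records = from_csv(csv_text) + from_json(json_text) + from_text(free_text)
total = sum(record["amount"] for record in records)
```

Each source needs its own parser before the records can be stored, compared, or summed together, which is where the extra cost of variety shows up.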
6
Q

What does velocity refer to in terms of big data?

A

The speed at which data is created, stored, and analyzed, i.e. the amount of data per unit of time
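
One way to make "data per unit of time" concrete is to count events per fixed time window. The sketch below assumes a hypothetical list of arrival timestamps and a 5-second window; neither comes from the card.

```python
# Hypothetical arrival times of events, in seconds since the stream started.
arrival_times = [0.0, 0.5, 1.0, 1.5, 2.0, 4.0, 4.5, 5.0, 8.0, 10.0]

WINDOW_SECONDS = 5  # assumed window size


def events_per_window(times, window_seconds):
    """Count how many events fall into each consecutive time window."""
    counts = {}
    for t in times:
        window_index = int(t // window_seconds)
        counts[window_index] = counts.get(window_index, 0) + 1
    return counts


counts = events_per_window(arrival_times, WINDOW_SECONDS)

# Peak velocity: the busiest window, expressed in events per second.
peak_rate = max(counts.values()) / WINDOW_SECONDS
```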

7
Q

Why is it important that we use real-time processing to keep up with the velocity of big data?

A

Real-time processing reduces the risk of missing opportunities by moving too slowly. The faster we can act on insights from the data, the better the results we can get

8
Q

What does veracity refer to in terms of big data?

A

The quality of the data, including its validity and volatility. It also covers the reliability of the data source

9
Q

What are the 5 steps of the data science process?

A

Acquire, prepare, analyze, report, act

10
Q

What is included in the acquire data step of the data science process?

A
  1. Identify data sets
  2. Identify suitable data
  3. Retrieve data
  4. Query data
    Note: the exact process can vary depending on the structure and source of the data
11
Q

What are the 2 parts of the prepare step of the data science process?

A

A. Explore data
B. Pre-process data

12
Q

What is included in the explore data step of the data science process?

A
  1. Perform preliminary analysis to understand the nature of the data
  2. Identify correlations, general trends, and outliers
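
As a rough illustration of the exploration step above, the sketch below computes a Pearson correlation and flags outliers with a z-score rule. The data, the function names, and the threshold of 2 standard deviations are assumptions made for the example, not part of the card.

```python
import math
import statistics

# Hypothetical paired observations and a hypothetical column of readings.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
readings = [10.0, 12.0, 11.0, 13.0, 12.0, 95.0]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def z_score_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]


correlation = pearson(x, y)
outliers = z_score_outliers(readings)
```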
13
Q

What is included in the pre-process data step of the data science process?

A
  1. Clean the data by handling inconsistent values, missing values, etc.
  2. Reduce dimensionality by removing and combining features
  3. Scale values
  4. Package the data to prepare it for analysis. This matters because of "garbage in, garbage out"
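
Two of the steps above, cleaning missing values and scaling, can be sketched as follows. Filling gaps with the mean and using min-max scaling are just one possible choice each, and the column values are hypothetical.

```python
# Hypothetical raw column with one missing value (None).
raw = [4.0, None, 10.0, 7.0]


def fill_missing(values):
    """Replace missing entries with the mean of the present ones."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so they fall in the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


cleaned = fill_missing(raw)
scaled = min_max_scale(cleaned)
```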
14
Q

What is included in the analyze data step of the data science process?

A

Select analytical techniques, build models, and run the models on the data set to produce output

15
Q

What are some common analysis techniques?

A
  1. Classification
  2. Clustering
  3. Regression
  4. Association Analysis
  5. Graph Analytics
16
Q

What is the goal of classification analysis?

A

Predicting a category based on data. Supervised learning technique
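
A minimal sketch of classification, assuming a 1-nearest-neighbor rule (one of many possible classifiers) and hypothetical labeled training points:

```python
import math

# Hypothetical labeled training data: (height_cm, weight_kg) -> category.
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]


def classify(point):
    """Predict the category of the closest labeled training point."""
    nearest = min(training, key=lambda item: math.dist(item[0], point))
    return nearest[1]


prediction_a = classify((152.0, 52.0))
prediction_b = classify((182.0, 88.0))
```

The labels in the training data are what make this a supervised technique.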

17
Q

What is the goal of regression analysis?

A

Predicting a numeric value based on data. Supervised learning technique
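
A minimal sketch of regression, assuming an ordinary least-squares straight-line fit on hypothetical data (other regression models are equally valid):

```python
# Hypothetical training data that follows y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Least-squares slope and intercept for a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# Predict a numeric value for an input that was not in the training data.
predicted = slope * 5.0 + intercept
```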

18
Q

What is the goal of cluster analysis?

A

Organizing similar items into groups. Unsupervised learning technique
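
A minimal sketch of clustering, assuming one-dimensional k-means with hand-picked starting centers. The values and starting centers are hypothetical, and real implementations usually pick the starting centers differently.

```python
# Hypothetical unlabeled values that visibly form two groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]


def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


centers, clusters = k_means_1d(values, centers=[1.0, 12.0])
```

No labels are supplied; the groups come only from how close the values are to one another.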

19
Q

What is the goal of association analysis?

A

To find rules to capture associations between items. Unsupervised learning technique
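
Association rules are commonly scored by support and confidence. The sketch below computes both for a hypothetical set of shopping baskets; the items and function names are illustrative.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


support_bread_butter = support({"bread", "butter"})
conf_butter_to_bread = confidence({"butter"}, {"bread"})
conf_bread_to_butter = confidence({"bread"}, {"butter"})
```

Here the rule "butter implies bread" holds in every basket containing butter, while "bread implies butter" holds in only two of the three baskets containing bread.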

20
Q

What is the goal of graph analytics?

A

To use graph structures to find connections between entities. Unsupervised learning technique
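
One simple example of finding connections between entities is a breadth-first search for the shortest chain of links between two nodes. The graph below is a hypothetical friendship network, and the function name is illustrative.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list.
graph = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}


def shortest_path(start, goal):
    """Breadth-first search; returns one shortest path or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = shortest_path("ann", "eve")
```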

21
Q

What are the 3 steps of the data analysis process?

A
  1. Select a technique
  2. Build a model
  3. Evaluate the model
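
The three steps above can be sketched end to end on a toy problem. The data, the single-threshold classifier, and the use of accuracy on held-out rows as the evaluation measure are all assumptions made for illustration.

```python
# Hypothetical labeled data: (hours_studied, passed).
train = [(1.0, False), (2.0, False), (6.0, True), (8.0, True)]
test = [(1.5, False), (3.0, False), (7.0, True), (4.0, True)]

# 1. Select a technique: a single-threshold classifier.


# 2. Build the model: put the threshold midway between the class means.
def build_model(rows):
    positives = [x for x, label in rows if label]
    negatives = [x for x, label in rows if not label]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return (mean_pos + mean_neg) / 2


# 3. Evaluate the model on rows it has not seen during building.
def accuracy(threshold, rows):
    correct = sum(1 for x, label in rows if (x > threshold) == label)
    return correct / len(rows)


threshold = build_model(train)
test_accuracy = accuracy(threshold, test)
```

Evaluating on separate test rows rather than on the training rows gives a more honest picture of how the model behaves on new data.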
22
Q

What is included in the report step of the data science process?

A

Communicating results effectively, including anything that was unexpected or went wrong, to increase the chances for learning

23
Q

What is included in the act step of the data science process?

A

Apply the results to take action. Determine the next steps, which may include revisiting the model depending on the results

24
Q

What are the 4 characteristics of patterns and models that are the goal of data mining?

A
  1. Valid: holds on new data with some certainty
  2. Useful: it should be possible to act on the pattern
  3. Unexpected: non-obvious to the system
  4. Understandable: humans should be able to interpret the pattern