Introduction to big data Flashcards

1
Q

What is big data?

A

A term used to refer to data sets that are too large or complex for traditional data processing application software to adequately deal with

2
Q

What are the three characteristics of big data?

A
  1. High volume
  2. High velocity
  3. High variety
3
Q

What is the difference between data and information?

A

Data are raw, unorganized facts that need to be processed to become meaningful
Information is a set of data that has been processed in a meaningful way according to the given requirements
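
As a rough illustration of the distinction above, the sketch below turns raw log lines into a per-sensor average. The sensor names, the readings and the `to_information` helper are all made up for the example.

```python
from collections import defaultdict

# Raw data: unorganized facts (hypothetical sensor log lines).
raw_data = [
    "sensor_a,21.5",
    "sensor_b,19.0",
    "sensor_a,22.5",
    "sensor_b,20.0",
]


def to_information(lines):
    """Process raw lines into a meaningful summary: mean reading per sensor."""
    readings = defaultdict(list)
    for line in lines:
        sensor, value = line.split(",")
        readings[sensor].append(float(value))
    return {sensor: sum(vals) / len(vals) for sensor, vals in readings.items()}


information = to_information(raw_data)
print(information)
```

The raw lines say little on their own; the summary answers a specific requirement (the average reading per sensor).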

4
Q

Why might a conventional data engineering system not be suitable for a particular application? (4)

A
  1. Insufficient hard disk storage
  2. Data cannot be retrieved fast enough to keep up with the volume of requests
  3. The processing unit executes instructions too slowly to produce the desired results on time for the volume of data or volume of requests
  4. The volume of data that needs processing does not fit in RAM
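
The four limits above can be read as simple capacity checks. The sketch below is one possible way to write them down; the `Machine` and `Workload` fields and all the numbers are hypothetical, chosen only to make the comparison concrete.

```python
from dataclasses import dataclass


@dataclass
class Machine:  # hypothetical description of a single conventional server
    disk_gb: float
    ram_gb: float
    reads_per_sec: float
    ops_per_sec: float


@dataclass
class Workload:  # hypothetical description of what the application needs
    data_gb: float
    working_set_gb: float
    requests_per_sec: float
    ops_per_request: float


def limitations(machine, workload):
    """Return which of the four limits the workload exceeds on this machine."""
    problems = []
    if workload.data_gb > machine.disk_gb:
        problems.append("insufficient disk storage")
    if workload.requests_per_sec > machine.reads_per_sec:
        problems.append("retrieval too slow for request volume")
    if workload.requests_per_sec * workload.ops_per_request > machine.ops_per_sec:
        problems.append("processing too slow")
    if workload.working_set_gb > machine.ram_gb:
        problems.append("data does not fit in RAM")
    return problems


server = Machine(disk_gb=1000, ram_gb=64, reads_per_sec=5000, ops_per_sec=1e9)
job = Workload(data_gb=5000, working_set_gb=128, requests_per_sec=2000, ops_per_request=1e6)
print(limitations(server, job))
```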
5
Q

What is the slowest part of any pipeline?

A

I/O

6
Q

What is scaling up/vertical scaling?

A

Improving the performance of a data processing system by increasing the processing power, storage and I/O capacity of a single machine

7
Q

What is scaling out/horizontal scaling?

A

Increasing the performance of a data engineering system by adding more machines
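
A back-of-the-envelope sketch of scaling out: given a data volume and the capacity of one node, how many nodes are needed? The `nodes_needed` helper and the figures are assumptions for illustration, and the estimate ignores replication and overhead.

```python
import math


def nodes_needed(total_gb, per_node_gb):
    """Smallest number of equal nodes whose combined capacity holds the data."""
    return math.ceil(total_gb / per_node_gb)


# Hypothetical figures: 10 TB of data, 512 GB of usable storage per node.
print(nodes_needed(10_000, 512))
```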

8
Q

How are cluster nodes interconnected?

A

Through fast networks

9
Q

What is the difference between concurrent and parallel processing?

A

Parallel processing is the use of more than one processor at the same time to complete tasks
Concurrent processing is the use of a single processor to make progress on more than one task over the same period: the CPU switches context between the tasks, so they only appear to run at the same time
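
The context switching behind concurrency can be imitated with a toy round-robin scheduler. This is only a sketch: the generator-based `task` and `run_round_robin` are invented for the example, whereas a real operating system switches between threads or processes.

```python
from collections import deque


def task(name, steps):
    """A task that pauses (yields) after each step so another task can run."""
    for i in range(steps):
        yield f"{name}{i}"


def run_round_robin(tasks):
    """Run all tasks on one 'processor' by switching between them in turn."""
    log = []
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            log.append(next(current))  # run one step of the current task
            ready.append(current)      # context switch: back of the queue
        except StopIteration:
            pass                       # task finished, drop it
    return log


log = run_round_robin([task("A", 3), task("B", 3)])
print(log)  # ['A0', 'B0', 'A1', 'B1', 'A2', 'B2']
```

Only one step runs at any instant, yet both tasks make progress over the same period. With parallel processing the steps of A and B would run at the same time on separate processors.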

10
Q

In parallel processing, how are multi-threaded processes handled in terms of threads and the use of multiple CPUs?

A

An application must have more than one thread running, and each thread must run on a separate CPU, CPU core or GPU core

11
Q

What is necessary for parallel processing?

A

Synchronized or coordinated processes and some method of communication
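
A minimal sketch of both ingredients, using Python's standard `queue` and `threading` modules: the queues are the communication method, and the sentinel values plus `join()` are the coordination. The `worker` and `run` names are made up, and threads stand in for separate processors here (in CPython, threads do not execute Python code on several cores at once).

```python
import queue
import threading


def worker(tasks, results):
    """Take items from the task queue until a sentinel (None) arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        results.put((item, item * item))  # communicate the result back


def run(items, n_workers=3):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()         # synchronization point: wait for every worker
    squares = {}
    while not results.empty():
        key, value = results.get()
        squares[key] = value
    return squares


print(run([1, 2, 3, 4]))
```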

12
Q

What are some challenges associated with parallel processing? (3)

A
  1. Many algorithms are hard to divide into subtasks, or cannot be divided at all
  2. Subtasks might use results from each other, so coordinating the different tasks might be difficult
  3. The communication network is the main bottleneck: the data exchange between the processors can overwhelm the shared network
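
For contrast, here is a task that divides easily: a sum of squares, where each chunk is independent and only one number per chunk has to be sent back. This is a sketch using the standard `concurrent.futures` thread pool; `split` and `parallel_sum_of_squares` are hypothetical names, and for CPU-bound work in CPython a process pool would normally be swapped in to get true parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide the data into at most `parts` chunks of near-equal size."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, workers=4):
    if not data:
        return 0
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each subtask needs only its own chunk, so no coordination is needed.
        partials = list(pool.map(lambda chunk: sum(x * x for x in chunk), chunks))
    return sum(partials)  # combine step: one small value per subtask


print(parallel_sum_of_squares(list(range(1, 101))))  # 338350
```

A task such as a running total, where each step needs the previous result, would not split this cleanly.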
13
Q

What are some examples of parallelism used for data analysis? (3)

A
  1. Sentiment analysis
  2. Text and URL crawling
  3. Bio-informatics
14
Q

What are possible outputs of the big data pipeline? (4)

A
  1. Another big data pipeline
  2. Small data analysis
  3. Query system
  4. Visualisation or an action
15
Q

What are some examples of data sources? (5)

A
  1. Mobile and web apps
  2. Websites
  3. IoT devices
  4. Databases
  5. Output of other big data pipelines
16
Q

What are the data actions in a big data pipeline? (3)

A
  1. Ingestion
  2. Storage (not always required, ingested data can be stored or directly processed)
  3. Processing
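
The three actions can be sketched as three small functions. Everything here is a toy stand-in: the `ingest`, `store` and `process` names, the in-memory list used as storage and the event strings are assumptions for illustration.

```python
def ingest(sources):
    """Gather records from several sources and clean them up."""
    for source in sources:
        for record in source:
            yield record.strip().lower()


def store(records, storage):
    """Optional step: keep the ingested records for later processing."""
    storage.extend(records)
    return storage


def process(records):
    """Run the analysis: here, count how often each event occurs."""
    counts = {}
    for record in records:
        counts[record] = counts.get(record, 0) + 1
    return counts


sources = [[" Click ", "view"], ["CLICK"]]  # hypothetical app and web events

stored_result = process(store(ingest(sources), []))  # ingest -> store -> process
direct_result = process(ingest(sources))             # ingest -> process
print(stored_result, direct_result)
```

Both paths give the same counts; the stored path additionally keeps the cleaned records for future use.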
17
Q

What is data ingestion in the big data pipeline?

A

Gathering and pre-processing the incoming data from a variety of data sources and making it readily usable by the onward stages; this can include transformation operations on raw data
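
A small sketch of ingestion from two differently shaped sources, bringing both into one common record format. The field names (`device`, `temp`, `temp_f`, `temp_c`) and the sample payloads are invented for the example; the Fahrenheit to Celsius conversion stands in for a transformation on raw data.

```python
import csv
import io
import json


def from_json_lines(text):
    """Source 1 (assumed): one JSON object per line, temperature in Celsius."""
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            yield {"device": record["device"], "temp_c": float(record["temp"])}


def from_csv(text):
    """Source 2 (assumed): CSV with a header row, temperature in Fahrenheit."""
    for row in csv.DictReader(io.StringIO(text)):
        celsius = (float(row["temp_f"]) - 32) * 5 / 9  # transformation step
        yield {"device": row["device"], "temp_c": round(celsius, 1)}


def ingest(json_text, csv_text):
    """Gather both sources into one list of uniform records."""
    return list(from_json_lines(json_text)) + list(from_csv(csv_text))


records = ingest('{"device": "a", "temp": "21.5"}\n', "device,temp_f\nb,68\n")
print(records)
```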

18
Q

What can be used during data ingestion?

A

Communication protocols such as HTTP, MQTT or FTP, because data is moved from the original sources across a network

19
Q

What is data storage in the big data pipeline?

A

The stage where data are stored, usually on the target processing nodes, for future processing

20
Q

Why are distributed storage solutions often used?

A

Because of the high volumes of data: these solutions need to provide flexibility and fast retrieval of high volumes of data

21
Q

What is data processing in the big data pipeline?

A

Running the algorithms intended to process the big data

22
Q

What happens to the results of data processing?

A

They can be stored back in the storage system or be used as the input of the pipeline

23
Q

What are the types of data processing? (3)

A
  1. Batch
  2. Stream
  3. Graph
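
One way to picture the three types, as a sketch on toy data: batch works on a complete data set at once, stream updates its result as each item arrives, and graph processing works on relationships between items. The function names and data are made up for the example.

```python
def batch_mean(values):
    """Batch: the whole data set is available before processing starts."""
    return sum(values) / len(values)


def stream_mean(values):
    """Stream: emit an updated mean after each arriving value."""
    count, total = 0, 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


def graph_degrees(edges):
    """Graph: count how many connections each node has."""
    degrees = {}
    for a, b in edges:
        degrees[a] = degrees.get(a, 0) + 1
        degrees[b] = degrees.get(b, 0) + 1
    return degrees


readings = [2, 4, 6, 8]
print(batch_mean(readings))                       # 5.0
print(list(stream_mean(readings)))                # [2.0, 3.0, 4.0, 5.0]
print(graph_degrees([("a", "b"), ("b", "c")]))    # {'a': 1, 'b': 2, 'c': 1}
```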
24
Q

What is data analysis?

A

The process of examining data to find facts, relationships, patterns, insights and/or trends

25
Q

What are the four general categories of data analytics?

A
  1. Descriptive analytics
  2. Diagnostic analytics
  3. Predictive analytics
  4. Prescriptive analytics
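
The four categories are commonly glossed as: what happened, why it happened, what is likely to happen, and what to do about it. The sketch below applies that reading to toy weekly sales figures. The data, the promotion flag, the stock level and the reorder rule are all assumptions for illustration, and the simple straight-line fit is just one possible predictive method.

```python
weeks = [1, 2, 3, 4]
sales = [10, 12, 14, 16]             # hypothetical units sold per week
promo = [False, False, True, True]   # hypothetical: was a promotion running?
stock = 15                           # hypothetical units currently in stock


def mean(values):
    return sum(values) / len(values)


# Descriptive: what happened?
average_sales = mean(sales)

# Diagnostic: why? Compare weeks with and without a promotion.
uplift = mean([s for s, p in zip(sales, promo) if p]) - mean(
    [s for s, p in zip(sales, promo) if not p]
)

# Predictive: what is likely to happen? Least-squares line through the data.
mx, my = mean(weeks), mean(sales)
slope = sum((x - mx) * (y - my) for x, y in zip(weeks, sales)) / sum(
    (x - mx) ** 2 for x in weeks
)
forecast_week_5 = (my - slope * mx) + slope * 5

# Prescriptive: what should be done? A simple rule based on the forecast.
action = "reorder" if forecast_week_5 > stock else "hold"

print(average_sales, uplift, forecast_week_5, action)
```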