Question 1

What is big data?

Accepted Answer

A term used to refer to data sets that are too large or complex for traditional data processing application software to adequately deal with

Question 2

What are the three characteristics of big data?

Accepted Answer

1. High volume
2. High velocity
3. High variety

Question 3

What is the difference between data and information?

Accepted Answer

Data are raw, unorganized facts that need to be processed to become meaningful.
Information is a set of data that has been processed in a meaningful way according to the given requirements.

Question 4

Why might a conventional data engineering system not be suitable for a particular application? (4)

Accepted Answer

1. Insufficient hard disk storage
2. Data cannot be retrieved fast enough to keep up with the volume of requests
3. The processing unit executes instructions too slowly to produce the desired results on time for the volume of data or volume of requests
4. The volume of data that needs processing does not fit in RAM

Question 5

What is the slowest part of any pipeline?

Accepted Answer

I/O

Question 6

What is scaling up/vertical scaling?

Accepted Answer

Improving the performance of a data processing system by increasing the processing power, storage and I/O capacity of a single machine

Question 7

What is scaling out/horizontal scaling?

Accepted Answer

Increasing the performance of a data engineering system by adding more machines
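
A minimal sketch of the idea, assuming the workload is a list of records that can be split evenly and processed independently. The function name `partition` and the record counts are hypothetical, chosen only for illustration: adding machines shrinks the share of the data each one has to handle.

```python
def partition(records, n_machines):
    """Split records into one shard per machine (round-robin assignment)."""
    return [records[i::n_machines] for i in range(n_machines)]

records = list(range(1000))  # hypothetical workload of 1000 records

# Largest shard any single machine must process as the cluster grows.
for n_machines in (1, 2, 4):
    shards = partition(records, n_machines)
    print(n_machines, "machine(s):", max(len(s) for s in shards), "records each")
```

Real clusters also pay a cost for moving data between machines, which this sketch ignores.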

Question 8

How are cluster nodes interconnected?

Accepted Answer

Through fast networks

Question 9

What is the difference between concurrent and parallel processing?

Accepted Answer

Parallel processing is the use of more than one processor at the same time to complete tasks.
Concurrent processing is the use of a single processor to work on more than one task over the same period: the CPU switches context between the tasks, so they only appear to run at the same time.
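
A toy sketch of the concurrent case, assuming each task can be modelled as a Python generator. A single loop stands in for the one processor and switches to the next task after every step. The names `task` and `run_round_robin` are hypothetical, chosen for illustration; parallel processing would instead give each task its own processor.

```python
from collections import deque

def task(name, steps):
    """A task that hands control back to the scheduler after each step."""
    for i in range(steps):
        yield f"{name}{i}"

def run_round_robin(tasks):
    """Run all tasks on one 'processor', switching context after every step."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # run one step of the current task
            ready.append(current)        # context switch: back of the queue
        except StopIteration:
            pass                         # task finished, drop it
    return trace

trace = run_round_robin([task("A", 3), task("B", 2)])
print(trace)  # ['A0', 'B0', 'A1', 'B1', 'A2']
```

The interleaved trace shows both tasks making progress even though only one step ever runs at a time.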

Question 10

In parallel processing, how are multi-threaded processes handled in terms of the threads and the use of multiple CPUs?

Accepted Answer

An application must have more than one thread running, and each thread must run on a separate CPU, CPU core, or GPU core

Question 11

What is necessary for parallel processing?

Accepted Answer

Synchronized or coordinated processes and some method of communication
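
A minimal sketch of both ingredients, assuming the work is a sum that can be split into chunks. For brevity it uses Python threads rather than separate processors, so it illustrates the coordination pattern, not a real speed-up. The names `worker` and `coordinated_sum` are hypothetical.

```python
import queue
import threading

def worker(chunk, results):
    """Compute a partial sum and send it back over the shared queue."""
    results.put(sum(chunk))

def coordinated_sum(numbers, n_workers=4):
    results = queue.Queue()  # communication: workers report results here
    chunks = [numbers[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(chunk, results))
               for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()             # synchronization: wait until every worker is done
    return sum(results.get() for _ in threads)

total = coordinated_sum(list(range(1, 101)))
print(total)  # 5050
```

The queue is the method of communication and `join()` is the synchronization point: the final sum is only computed once all partial results have arrived.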

Question 12

What are some challenges associated with parallel processing? (3)

Accepted Answer

1. Many algorithms are hard to divide into subtasks, or cannot be divided at all
2. Subtasks might use results from each other, so coordinating the different tasks might be difficult
3. The communication network is the main bottleneck: the data exchange between the processors can overwhelm the shared network

Question 13

What are some examples of parallelism used for data analysis? (3)

Accepted Answer

1. Sentiment analysis
2. Text and URL crawling
3. Bio-informatics

Question 14

What are possible outputs of the big data pipeline? (4)

Accepted Answer

1. Another big data pipeline
2. Small data analysis
3. Query system
4. Visualisation or an action

Question 15

What are some examples of data sources? (5)

Accepted Answer

1. Mobile and web apps
2. Websites
3. IoT devices
4. Databases
5. Output of other big data pipelines

Introduction to big data Flashcards

(25 cards)