Introduction to big data Flashcards
What is big data?
A term used to refer to data sets that are too large or complex for traditional data processing application software to adequately deal with
What are the three characteristics of big data?
- High volume
- High velocity
- High variety
What is the difference between data and information?
Data are raw, unorganized facts that need to be processed to become meaningful
Information is data that has been processed in a meaningful way according to the given requirements
Why might a conventional data engineering system not be suitable for a particular application? (4)
- Insufficient hard disk storage
- The speed at which data can be retrieved cannot keep up with the volume of requests
- The speed at which the processing unit executes instructions is too low to produce the desired results on time for the volume of data or requests
- The volume of data that needs processing does not fit in RAM
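When the data does not fit in RAM, one common single-machine workaround is to stream it in fixed-size chunks and keep only a running summary in memory. The sketch below is illustrative only: `read_in_chunks` and `chunked_mean` are hypothetical helper names, and a generator stands in for a data source that is too large to load at once.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists holding at most chunk_size values."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values: Iterable[float], chunk_size: int = 1000) -> float:
    """Compute the mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no values to average")
    return total / count


# Assumption: this generator stands in for a file or stream that is too big for RAM.
print(chunked_mean((float(i) for i in range(1, 101)), chunk_size=10))
```

This only helps when the computation can be expressed as a running summary; it does not address slow disks or a slow processor.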
What is the slowest part of any pipeline?
I/O (input/output)
What is scaling up/vertical scaling?
Improving the performance of a data processing system by increasing the processing power, storage and I/O capacity of a single machine
What is scaling out/horizontal scaling?
Increasing the performance of a data engineering system by adding more machines
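Scaling out only helps if the work can be spread across the machines. One simple way to do that, shown here as a sketch rather than as the method any particular system uses, is to hash each record key to a node index. The names `assign_node` and `partition` and the key format are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def assign_node(key: str, node_count: int) -> int:
    """Map a record key to a node index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def partition(keys: Iterable[str], node_count: int) -> Dict[int, List[str]]:
    """Group keys by the node that would store or process them."""
    placement: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        placement[assign_node(key, node_count)].append(key)
    return dict(placement)


# Hypothetical keys spread over a three-machine cluster.
print(partition([f"user-{i}" for i in range(10)], node_count=3))
```

With this plain modulo scheme, changing `node_count` reassigns most keys, so it is a starting point for reasoning about data placement rather than a production design.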
How are cluster nodes interconnected?
Through fast networks
What is the difference between concurrent and parallel processing?
Parallel processing uses more than one processor (or core) at the same time, so tasks literally run simultaneously
Concurrent processing uses a single processor to make progress on more than one task over the same period; the CPU switches context between the tasks, so they only appear to run simultaneously
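The distinction can be sketched in Python. This assumes standard CPython, where the global interpreter lock lets only one thread execute Python bytecode at a time: a thread pool then interleaves CPU-bound tasks (concurrency), while a process pool can run them on different cores (parallelism). The function names and the prime-counting workload are made up for the illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import List


def count_primes(limit: int) -> int:
    """CPU-bound task: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_concurrently(limits: List[int]) -> List[int]:
    """Threads in one process: the tasks take turns on the interpreter."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(count_primes, limits))


def run_in_parallel(limits: List[int]) -> List[int]:
    """Separate processes: the tasks can run on different cores at once."""
    with ProcessPoolExecutor(max_workers=4) as pool:
        return list(pool.map(count_primes, limits))
```

Both functions return the same results for the same input; only the way the work is scheduled differs. When running this as a script, call `run_in_parallel` from inside an `if __name__ == "__main__":` block, since some platforms start worker processes by re-importing the main module.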
In parallel processing, how are multi-threaded processes handled in terms of the threads and the use of multiple CPUs?
An application must have more than one thread running, and each thread must run on a separate CPU, CPU core or GPU core
What is necessary for parallel processing?
Synchronized or coordinated processes and some method of communication
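Both requirements can be seen in a small sketch. Threads stand in for separate processors here purely for brevity; the queue is the communication channel, and the lock and `join` calls provide synchronization. `parallel_sum` is a hypothetical name, not a library function.

```python
import queue
import threading
from typing import List, Tuple


def parallel_sum(chunks: List[List[int]]) -> Tuple[int, int]:
    """Sum chunks in worker threads; return (total, number of items processed)."""
    results: "queue.Queue[int]" = queue.Queue()  # communication channel
    lock = threading.Lock()                       # synchronization primitive
    processed = 0                                 # shared state guarded by the lock

    def worker(chunk: List[int]) -> None:
        nonlocal processed
        results.put(sum(chunk))   # send the partial result to the coordinator
        with lock:                # only one worker updates the counter at a time
            processed += len(chunk)

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                  # coordination: wait for every worker to finish

    total = sum(results.get() for _ in threads)
    return total, processed


print(parallel_sum([[1, 2, 3], [4, 5], [6]]))
```

Without the lock, two workers could read and write `processed` at the same moment and lose an update; without the queue, the coordinator would have no way to collect the partial sums.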
What are some challenges associated with parallel processing? (3)
- Many algorithms are hard to divide into subtasks, or cannot be divided at all
- Subtasks might depend on each other's results, so coordinating them can be difficult
- The communication network is the main bottleneck: the data exchange between the processors can overwhelm the shared network
What are some examples of parallelism used for data analysis? (3)
- Sentiment analysis
- Text and URL crawling
- Bio-informatics
What are possible outputs of the big data pipeline? (4)
- Another big data pipeline
- Small data analysis
- Query system
- Visualisation or an action
What are some examples of data sources? (5)
- Mobile and web apps
- Websites
- IoT devices
- Databases
- Output of other big data pipelines