L6 Flashcards
(23 cards)
def Big Data
data sets that are too large or complex to be dealt with by traditional data-processing application software
- challenging because of its size or type
- data comes from multiple sources (sensors, devices, web, log files)
- includes structured, semi-structured, and unstructured data
5 V’s of Big Data
volume
variety
veracity
value
velocity
(variability is often counted as a sixth V)
Variety in BD
- various formats
- structured, semi-structured, unstructured
Structured data
transactional data in a traditional DBMS, organized into rows and columns
Semi-structured data
combination of structured and unstructured data
eg) email, CSV files, log files, NoSQL databases
Unstructured data
text, images, videos, social media
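A small Python sketch contrasting the three data types; the field names and values are invented for illustration.

```python
import json

# Structured: fixed schema, rows and columns, as in a traditional DBMS table
structured_row = {"order_id": 1001, "customer_id": 42, "amount": 19.99}

# Semi-structured: self-describing but flexible, e.g. JSON from a log or NoSQL store
semi_structured = json.loads(
    '{"user": "alice", "events": [{"type": "click", "ts": "2024-01-01T10:00:00"}]}'
)

# Unstructured: no predefined schema (free text, images, video, social media posts)
unstructured = "Loved the product, will definitely order again!"

print(structured_row["amount"])               # direct column-style access
print(semi_structured["events"][0]["type"])   # nested, optional fields
print(len(unstructured.split()))              # needs parsing/NLP to extract meaning
```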
Variability in Big Data
- must adapt to ever-changing data patterns
- inconsistencies and outliers need to be detected
- data can be used in many different ways, for different purposes
4 BD Technologies
- Data storage: Apache (Hadoop, Cassandra)
- Data mining: RapidMiner
- Data analytics: Apache Spark
- Data visualization: Tableau
Apache
- open source
- American nonprofit (Apache Software Foundation) formed in 1999
Apache Hadoop
- written in Java
- supports processing of large data sets
- can store large volumes of all kinds of data in its distributed file system (HDFS)
- parallel processing
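A minimal sketch of Hadoop-style parallel processing using Hadoop Streaming, which lets any stdin/stdout program act as mapper and reducer; the input/output paths and exact jar location are placeholders.

```python
# Word count with Hadoop Streaming: mapper and reducer are plain Python scripts.
# Run roughly as: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
#   -mapper mapper.py -reducer reducer.py -input <hdfs-in> -output <hdfs-out>
import sys

# --- mapper.py: emit "<word>\t1" for every word in the input split ---
def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

# --- reducer.py: sum counts per word (Hadoop sorts mapper output by key) ---
def reducer():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")
```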
Cassandra
- apache NoSQL database management system
- handles large amounts of data across many commodity servers
- high availability with no single point of failure
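A minimal sketch using the DataStax cassandra-driver package from Python; the contact point, keyspace, and table are made up for illustration (a production cluster would use a replication factor of 3 across several nodes).

```python
from uuid import uuid4
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])   # contact point: any node works, no single point of failure
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders (
        order_id uuid PRIMARY KEY, customer text, amount double
    )
""")

session.execute(
    "INSERT INTO shop.orders (order_id, customer, amount) VALUES (%s, %s, %s)",
    (uuid4(), "alice", 19.99),
)
for row in session.execute("SELECT customer, amount FROM shop.orders LIMIT 5"):
    print(row.customer, row.amount)

cluster.shutdown()
```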
Commodity servers
use of many inexpensive, readily available servers/components for parallel computing
get a large amount of computation at low cost
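Toy single-machine sketch of the same idea: split the work into chunks and process them in parallel on cheap, ordinary cores; in a real cluster each worker would be a whole commodity server.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(chunk):
    """Count words in one chunk (the work done by one cheap worker)."""
    return Counter(chunk.split())

if __name__ == "__main__":
    words = ("big data needs clusters of small cheap machines " * 2000).split()
    # Split the workload into word-aligned chunks, one per task
    chunks = [" ".join(words[i:i + 4000]) for i in range(0, len(words), 4000)]

    with ProcessPoolExecutor() as pool:          # one process per available core
        partial_counts = pool.map(count_words, chunks)

    total = Counter()
    for partial in partial_counts:               # merge partial results ("reduce")
        total.update(partial)
    print(total.most_common(3))
```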
Hadoop vs Cassandra
Cassandra is NoSQL and good for
- high-speed, online transactional data
- online web/mobile applications
eg) e-commerce and inventory management, personalization, recommendations, IoT and edge computing
Hadoop is good for
- big data analytics
- warehousing
- data lake uses
- cold and historical data
eg) retail analytics, financial risk analysis, trading, forecasting, very high-volume social media data
-> can complement each other
Apache Spark
- unified analytics engine for large-scale data processing
- leading platform for large scale SQL, batch processing, stream processing and ML
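A minimal PySpark sketch (pip install pyspark) showing SQL-style batch processing with in-memory caching; the file name and columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Batch: read a CSV, keep it in memory, aggregate with SQL-style operations
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.cache()                                        # in-memory processing across actions
summary = df.groupBy("region").agg(F.sum("amount").alias("total"))
summary.show()

spark.stop()
```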
Hadoop vs Spark
Spark:
- in-memory processing (much faster)
- both batch processing and real-time
- fewer lines of code
- easier authentication
- fault tolerance
- better analytics
- more flexibility
Hadoop:
- slower, disk-based data processing
- only batch processing
- more lines of code
- written in Java, takes longer to execute
- difficult to manage authentication -> requires Kerberos
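To make "fewer lines of code" concrete: the same word count that took two scripts in the Hadoop Streaming sketch above fits in a few lines of PySpark (the input path is a placeholder).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
counts = (spark.sparkContext.textFile("input.txt")   # distributed read
          .flatMap(lambda line: line.split())        # split lines into words
          .map(lambda word: (word, 1))               # (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))          # sum per word
print(counts.take(10))
spark.stop()
```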
RapidMiner
comprehensive data science platform with visual workflow design and full automation
Tableau
data visualization and BI tool used for reporting and analyzing vast volumes of data
Big Data Life Cycle (5)
- collection
- preprocessing
- storage
- analysis
- interpretation
BD preprocessing (4)
- integration: combine sources into a uniform view
- cleaning: handle outliers, noise, missing values
- reduction: reduce dimensions/volume
- transformation: convert into a suitable format
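A compressed pandas sketch of the four preprocessing steps; the tables, columns, and thresholds are invented for illustration.

```python
import pandas as pd

# Integration: merge data from multiple sources into one uniform table
sensors = pd.DataFrame({"device_id": [1, 2, 3], "temp_c": [21.5, 22.1, 95.0]})
devices = pd.DataFrame({"device_id": [1, 2, 3], "site": ["A", "A", "B"]})
df = sensors.merge(devices, on="device_id")

# Cleaning: drop readings outside a plausible range (outliers)
df = df[df["temp_c"].between(-20, 60)]

# Reduction: keep only the dimensions the analysis needs
df = df[["site", "temp_c"]]

# Transformation: convert into the format the downstream tool expects
df["temp_f"] = df["temp_c"] * 9 / 5 + 32
print(df)
```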
def Data Science
interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data
Data Science Life Cycle (7)
- business understanding
- data collection
- data preparation
- exploratory data analysis
- modelling
- model evaluation
- model deployment
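A compressed scikit-learn sketch of the preparation -> modelling -> evaluation stages on a built-in toy dataset (business understanding, EDA, and deployment are left out).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)                  # data collection
X_train, X_test, y_train, y_test = train_test_split(        # data preparation
    X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)                       # scale features
model = LogisticRegression(max_iter=5000).fit(               # modelling
    scaler.transform(X_train), y_train)

preds = model.predict(scaler.transform(X_test))              # model evaluation
print("accuracy:", accuracy_score(y_test, preds))
```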
Big Data Challenges
- lack of understanding and expertise
- poor data quality and data silos
- issues in scaling
- variety of BD technologies
- incorrect integration
- expensive
- real-time big data problems
- data verification
- organizational resistance
- security and privacy
Applications
any industry:
banking, media, insurance, healthcare, transportation, energy, education, manufacturing