L6 Flashcards

1
Q

def Big Data

A

data sets that are too large or complex to be dealt with by traditional data-processing application software
- too large (size) or too complex (type)
- data comes from multiple sources (sensors, devices, web, log files)
- includes structured, semi-structured, and unstructured data

2
Q

5 V’s of Big Data

A

volume
variety
veracity
value
velocity
variability (sometimes added as a sixth V)

3
Q

Variety in BD

A
  • various formats
  • structured, semi-structured, unstructured
4
Q

Structured data

A

transactional data stored in a traditional DBMS as rows and columns

5
Q

Semi-structured data

A

combination of structured and unstructured data
e.g. email, CSV files, log files, NoSQL databases

6
Q

Unstructured data

A

text, images, videos, social media
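
A minimal Python sketch of the three categories from the cards above. The table, log records, and review text are invented for illustration; only the standard library is used.

```python
import json
import sqlite3

# Structured: fixed rows and columns in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.50)])
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Semi-structured: self-describing keys, but fields can differ per record.
log_lines = [
    '{"user": "ann", "action": "login"}',
    '{"user": "bob", "action": "buy", "item": "book"}',
]
events = [json.loads(line) for line in log_lines]

# Unstructured: free text with no schema; any structure has to be inferred.
review = "Great book, arrived quickly!"
word_count = len(review.split())

print(total)
print(events[1].get("item"))
print(word_count)
```

The structured data can be queried by column, the semi-structured records need a check for optional fields, and the free text only yields structure after processing.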

7
Q

Variability in Big Data

A
  • processing must adapt because data patterns are ever-changing
  • inconsistencies and outliers need to be detected
  • data can be used in many different ways, for different purposes
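
One common way to detect outliers is the interquartile range (IQR) rule, sketched below. The card does not name a method, so the rule, the `find_outliers` helper, and the sample readings are all assumptions for illustration.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 95, 12]  # hypothetical sensor readings
print(find_outliers(readings))
```

The factor `k=1.5` is the conventional default; a larger value flags fewer points.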
8
Q

4 BD Technologies

A
  1. Data storage: Apache Hadoop, Apache Cassandra
  2. Data mining: RapidMiner
  3. Data analytics: Apache Spark
  4. Data visualization: Tableau
9
Q

Apache

A
  • open source
  • American nonprofit (Apache Software Foundation) formed in 1999
10
Q

Apache Hadoop

A
  • written in Java
  • supports processing of large data sets
  • can store large volumes of all kinds of data (structured, semi-structured, unstructured) in a distributed file system
  • parallel processing
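
The parallel processing idea can be sketched as a word count in the map/reduce style. The card does not name MapReduce, so treat that as an assumption; this is plain Python, not the Hadoop API, and threads stand in for cluster nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: emit (word, 1) pairs for one chunk of the input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_pairs(pairs):
    """Reduce step: sum the counts for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count(lines, workers=2):
    # One chunk per worker, standing in for file blocks stored on different nodes.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    all_pairs = [pair for chunk in mapped for pair in chunk]
    return reduce_pairs(all_pairs)


print(word_count(["big data big", "data lake"]))
```

Each chunk is mapped independently, which is what lets the work spread across many machines; only the reduce step needs to see all pairs for a given word.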
11
Q

Cassandra

A
  • Apache NoSQL database management system
  • handles large amounts of data across many commodity servers
  • high availability with no single point of failure
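
A toy sketch of why replication removes the single point of failure: each key is stored on several nodes, so it stays readable when one node is down. This is not Cassandra's actual partitioner or API; the node names, the `replicas_for` and `readable` helpers, and the replication factor of 3 are assumptions for illustration.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical server names


def replicas_for(key, nodes=NODES, replication_factor=3):
    """Pick consecutive nodes for a key, starting from a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def readable(key, failed, nodes=NODES):
    """A key stays readable while at least one of its replicas is up."""
    return any(node not in failed for node in replicas_for(key, nodes))


print(replicas_for("user:42"))
print(readable("user:42", failed={"node-a"}))
```

With three replicas on four nodes, any single failure still leaves at least two copies of every key.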
12
Q

Commodity servers

A

use of large numbers of already-available computing components for parallel computing
get the most useful computation at low cost

13
Q

Hadoop vs Cassandra

A

Cassandra is NoSQL and good for
- high-speed, online transactional data
- online web and mobile applications
e.g. e-commerce and inventory management, personalization, recommendations, IoT and edge computing
Hadoop is good for
- big data analytics
- warehousing
- data lake uses
- cold and historical data
e.g. retail analytics, financial risk analysis, trading, forecasting, social media with very high volumes

-> can complement each other

14
Q

Apache Spark

A
  • unified analytics engine for large-scale data processing
  • leading platform for large-scale SQL, batch processing, stream processing and ML
15
Q

Hadoop vs Spark

A

Spark:
- in-memory processing (much faster)
- both batch and real-time (stream) processing
- fewer lines of code
- easier authentication
- fault tolerance
- better analytics
- more flexibility

Hadoop:
- slower, disk-based data processing
- only batch processing
- more lines of code
- written in Java, takes longer to execute
- authentication is difficult to manage -> Kerberos

16
Q

RapidMiner

A

comprehensive data science platform with visual workflow design and full automation

17
Q

Tableau

A

data visualization and BI tool used for reporting and analyzing vast volumes of data

18
Q

Big Data Life Cycle (5)

A
  1. collection
  2. preprocessing
  3. storage
  4. analysis
  5. interpretation
19
Q

BD preprocessing (4)

A
  1. integration: combine sources into a uniform format
  2. cleaning: detect and handle outliers
  3. reduction: reduce the number of dimensions
  4. transformation: convert to the required format
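
The four steps can be sketched as a small pipeline. The sensor records, field names, plausible temperature range, and function names are all invented for illustration; real pipelines would use a data-processing library.

```python
def integrate(*sources):
    """Integration: merge records from several sources into one uniform schema."""
    unified = []
    for source in sources:
        for rec in source:
            # Sources name the temperature field differently; map both to temp_c.
            temp = rec.get("temp_c", rec.get("temperature"))
            unified.append(
                {"sensor": rec["sensor"], "temp_c": float(temp), "note": rec.get("note", "")}
            )
    return unified


def clean(records, low=-50.0, high=60.0):
    """Cleaning: drop readings outside a plausible range (assumed limits)."""
    return [r for r in records if low <= r["temp_c"] <= high]


def reduce_dimensions(records, keep=("sensor", "temp_c")):
    """Reduction: keep only the fields (dimensions) needed for analysis."""
    return [{k: r[k] for k in keep} for r in records]


def transform(records):
    """Transformation: convert to the target format, here (sensor, Fahrenheit) tuples."""
    return [(r["sensor"], round(r["temp_c"] * 9 / 5 + 32, 1)) for r in records]


source_a = [{"sensor": "s1", "temp_c": 21.5, "note": "ok"}, {"sensor": "s2", "temp_c": 999}]
source_b = [{"sensor": "s3", "temperature": "19.0"}]

result = transform(reduce_dimensions(clean(integrate(source_a, source_b))))
print(result)
```

The 999 reading from `s2` is removed in the cleaning step, and the `note` field is dropped in the reduction step.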
20
Q

def Data Science

A

interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insight from structured and unstructured data

21
Q

Data Science Life Cycle (7)

A
  1. business understanding
  2. data collection
  3. data preparation
  4. exploratory data analysis
  5. modelling
  6. model evaluation
  7. model deployment
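
A compressed sketch of steps 2 to 7 with a one-variable linear model. The (ad spend, sales) pairs are invented, and the least-squares fit, the train/test split, and mean absolute error are assumed choices, not prescribed by the card.

```python
def fit_line(xs, ys):
    """Modelling: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Model evaluation: average absolute prediction error on held-out data."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Data collection and preparation: hypothetical (ad spend, sales) pairs,
# split into a training set and a held-out test set.
data = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1), (6, 12.9)]
train, test = data[:4], data[4:]

model = fit_line([x for x, _ in train], [y for _, y in train])
error = mean_abs_error(model, [x for x, _ in test], [y for _, y in test])


def predict(x):
    """Model deployment: expose the fitted model behind a simple function."""
    slope, intercept = model
    return slope * x + intercept


print(model, error, predict(7))
```

Evaluating on data the model has not seen is the point of the split; a low training error alone says little about how the model behaves after deployment.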
22
Q

Big Data Challenges

A
  1. lack of understanding and expertise
  2. poor data quality and data silos
  3. issues in scaling
  4. variety of BD technologies
  5. incorrect integration
  6. expensive
  7. real-time big data problems
  8. data verification
  9. organizational resistance
  10. security and privacy
23
Q

Applications

A

any industry:
banking, media, insurance, healthcare, transportation, energy, education, manufacturing