College 1: Introduction Flashcards

Question 1

Q

Types of data

Answer

A

Structured: data warehouse repository, relational data
Unstructured: Wikipedia, news articles (no predefined format or organization, often human-generated)
Graph data: social networks, road networks, knowledge bases
Semi-structured: log files, product reviews, JSON, XML (some level of organization through tags, labels, or markers)

Question 2

Q

Large-scale data

Answer

A

Very large collections (terabytes to exabytes): scientific, business, or user-generated.

Question 3

Q

Streaming data

Answer

A

Sensors, feeds, continuous analytics. Response times range from seconds down to nanoseconds.

Question 4

Q

Heterogeneous data

Answer

A

A collection of data consisting of diverse types, formats, or structures. It draws on multiple data sources, with data elements varying in representation, organization, or underlying technology. Can be computer- or human-generated.

Question 5

Q

Private data

Answer

A

Correlating data from multiple sources poses risks, so privacy and accountability are important. Examples: credit card history, mobile phone usage, GPS tracking.

Question 6

Q

Uncertain data

Answer

A

Imprecision, inconsistencies, incompleteness, ambiguities, latency, deception, approximations, privacy-preserving transformations. The process, the data, and the model can all be uncertain.

Question 7

Q

Data processing pipeline

Answer

A

Acquisition & Storage
Cleaning, Annotation, Integration, Aggregation
Exploration & Querying
Modeling & Data Analysis
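
The four stages above can be sketched as a chain of small functions. This is a minimal illustration only: the records, field names, and function names (`acquire`, `clean`, `aggregate`, `query`, `model`) are hypothetical and not taken from the course material.

```python
def acquire():
    # Acquisition & Storage: hypothetical raw records, as they might arrive.
    return [
        {"city": " Amsterdam", "temp": "18"},
        {"city": "amsterdam", "temp": None},   # incomplete record
        {"city": "Utrecht ", "temp": "21"},
        {"city": "Amsterdam", "temp": "20"},
    ]

def clean(records):
    # Cleaning: drop incomplete records, normalize names, convert types.
    cleaned = []
    for r in records:
        if r["temp"] is None:
            continue
        cleaned.append({"city": r["city"].strip().title(),
                        "temp": float(r["temp"])})
    return cleaned

def aggregate(records):
    # Integration & Aggregation: group temperatures per city.
    by_city = {}
    for r in records:
        by_city.setdefault(r["city"], []).append(r["temp"])
    return by_city

def query(by_city, city):
    # Exploration & Querying: look up the readings for one city.
    return by_city.get(city, [])

def model(by_city):
    # Modeling & Data Analysis: here just a mean per city.
    return {city: sum(temps) / len(temps) for city, temps in by_city.items()}

grouped = aggregate(clean(acquire()))
print(query(grouped, "Amsterdam"))
print(model(grouped))
```

Each stage consumes the output of the previous one, which is the main point of the pipeline view; real systems would replace each function with storage, ETL, query, and analysis components.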

Question 8

Q

Question 9

Q

Knowledge Discovery in Databases (KDD) process

Answer

A

Data Sources -> Data Cleaning -> Data Warehouse -> Selection -> Task-relevant data -> Data mining -> Pattern Evaluation -> KNOWLEDGE

Question 10

Q

The 3 Vs of Big Data challenges

Answer

A

Volume: size of data
Velocity: speed at which data is generated, collected, and processed
Variety: the diverse types and formats of data encountered in big data (heterogeneity)

Question 11

Q

TF-IDF

Answer

A

TF(t,d) = (number of times t appears in d) / (total number of terms in d)
IDF(t) = log(N / (1 + df))
TF-IDF(t,d) = TF(t,d) * IDF(t)

N = total number of documents, df = number of documents that contain t
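
A small worked sketch of these formulas, assuming documents are already tokenized into lists of words and that the logarithm is the natural log (the card does not state the base). The function names and the toy corpus are illustrative assumptions.

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of term in doc / total terms in doc.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency with the +1 smoothing used on the card.
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log(n / (1 + df))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical toy corpus of four tokenized documents.
docs = [
    ["big", "data", "is", "big"],
    ["data", "streams"],
    ["graph", "data"],
    ["road", "networks"],
]

print(tf("big", docs[0]))            # 2 of 4 terms
print(tf_idf("big", docs[0], docs))  # rare term: positive weight
print(tf_idf("data", docs[0], docs)) # common term: weight 0 here
```

In this corpus "data" appears in 3 of 4 documents, so N / (1 + df) = 1 and its IDF is 0, while "big" appears in only one document and keeps a positive weight. That is the intended effect: terms common across the collection are down-weighted.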

Question 12

Q

Clustered vs. Unclustered index

Answer

A

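A clustered index determines the order in which the table's rows are stored: the data itself is sorted by the index key, like a dictionary in alphabetical order, so a table can have only one. A non-clustered index is kept separately from the data: the index is stored in one place and the records in another, and each index entry points to the location of its record.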
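
The difference can be sketched in memory with sorted lists. This is only an analogy under simplifying assumptions (a tiny table of hypothetical `(name, age)` rows, unique keys, binary search standing in for a B-tree), not how a database engine implements indexes.

```python
import bisect

# Clustered: the rows themselves are stored sorted by the index key (name).
rows_clustered = sorted([("carol", 31), ("alice", 25), ("bob", 40)])
names = [row[0] for row in rows_clustered]

def lookup_clustered(name):
    # Binary search directly over the sorted rows.
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return rows_clustered[i]
    return None

# Non-clustered: rows stay in insertion order; a separate sorted index
# holds (key, row position) pairs that point back into the table.
heap = [("carol", 31), ("alice", 25), ("bob", 40)]
age_index = sorted((age, pos) for pos, (_, age) in enumerate(heap))
ages = [age for age, _ in age_index]

def lookup_by_age(age):
    # Search the index first, then follow the pointer to the row.
    i = bisect.bisect_left(ages, age)
    if i < len(ages) and ages[i] == age:
        return heap[age_index[i][1]]
    return None

print(lookup_clustered("bob"))
print(lookup_by_age(25))
```

The non-clustered lookup takes an extra step (index entry, then row), which is why several such indexes can coexist on one table while only one ordering of the rows themselves is possible.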

Question 13

Q

Hash function

Answer

A

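A function that maps a key to a bucket. Example: h(k) = the last 2 bits of k's binary representation (equivalent to k mod 4), which gives 4 buckets.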
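
A minimal sketch of this hash function for non-negative integer keys. The function name `h` and the sample keys are assumptions chosen for illustration.

```python
def h(key: int) -> int:
    # Keep only the last 2 bits of the binary representation.
    # For non-negative integers this equals key % 4.
    return key & 0b11

# Distribute some hypothetical keys over the 4 possible buckets.
buckets = {b: [] for b in range(4)}
for key in [4, 7, 10, 13, 18]:
    buckets[h(key)].append(key)

for b, keys in buckets.items():
    print(format(b, "02b"), keys)
```

Keys 10 (binary 1010) and 18 (binary 10010) both end in 10, so they collide in the same bucket; any hash function with fewer buckets than keys must produce such collisions.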

Question 14

Q

Opportunities for data exploitation

Answer

A

Energy saving, financial systemic risk analysis, security, computer security, democracy (citizens understanding how government operates), education, healthcare, urban planning, intelligent transportation, environmental modeling