Lecture 1: Introduction Flashcards

1
Q

Types of data

A
  • Structured: data warehouse repository, relational data
  • Unstructured: Wikipedia, news articles (no predefined format or organization, often human-generated)
  • Graph data: social networks, road networks, knowledge bases
  • Semi-structured: log files, product reviews, JSON, XML (some level of organization by tags, labels, or markers)
2
Q

Large-scale data

A

Very large collections (terabytes to exabytes): scientific, business, or user-generated.

3
Q

Streaming data

A

Sensors, feeds, continuous analytics. Response times in seconds to nanoseconds.

4
Q

Heterogeneous data

A

A collection of data consisting of diverse types, formats, or structures. It comes from multiple data sources, with data elements that vary in representation, organization, or underlying technology. (Computer or human generated)

5
Q

Private data

A

Data such as credit card history, mobile phone usage, GPS tracking. Correlating such data from multiple sources poses risks, so privacy and accountability are important.

6
Q

Uncertain data

A

Imprecision, inconsistencies, incompleteness, ambiguities, latency, deception, approximations, privacy-preserving transformations. Process, data, and model are uncertain.

7
Q

Data processing pipeline

A
  1. Acquisition & Storage
  2. Cleaning, Annotation, Integration, Aggregation
  3. Exploration & Querying
  4. Modeling & Data Analysis
8
Q
A
9
Q

Knowledge Discovery Process (KDD)

A

**Data Sources** -> Data Cleaning -> **Data Warehouse** -> Selection -> Task-relevant data -> Data mining -> Pattern Evaluation -> KNOWLEDGE

10
Q

3V of Big Data Challenges

A
  1. Volume: size of data
  2. Velocity: speed at which data is generated, collected, and processed
  3. Variety: the diverse types and formats of data encountered in big data (heterogeneity)
11
Q

TF-IDF

A

TF(t,d) = (number of times t appears in d) / (total number of terms in d)
IDF(t) = log(N / (1 + df))
TF-IDF(t,d) = TF(t,d) * IDF(t)

N = total number of documents, df = number of documents that contain t
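
A minimal Python sketch of the formulas above, reading the IDF denominator as 1 + df. The toy documents, the whitespace tokenization, and the choice of the natural logarithm are assumptions for illustration; the card does not specify a log base.

```python
import math

def tf(term, doc):
    # doc is a list of tokens; term frequency is normalized by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # df = number of documents that contain the term at least once
    df = sum(1 for d in docs if term in d)
    # natural log is an assumption; the +1 avoids division by zero for unseen terms
    return math.log(len(docs) / (1 + df))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical toy corpus, tokenized by splitting on whitespace
docs = [
    "big data needs big storage".split(),
    "streaming data arrives fast".split(),
    "graphs model social networks".split(),
    "road networks are graphs".split(),
]

print(tf_idf("big", docs[0], docs))   # frequent in doc 0, rare in the corpus
print(tf_idf("data", docs[0], docs))  # appears in more documents, so it scores lower
```

With this smoothed IDF, a term that occurs in every document (or in all but one) gets a score of zero or below, which is worth keeping in mind on very small corpora.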

12
Q

Clustered vs. Unclustered index

A

A clustered index defines the order in which the table's rows are stored: the data itself is kept sorted by the index key, like the words in a dictionary. A non-clustered index is stored separately from the data: the index entries sit in one place and point to the records, which are stored in another place.
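
The difference can be sketched in a few lines of Python. This is only an in-memory analogy, not how a database engine lays out pages; the table contents and the function names `clustered_range` and `nonclustered_lookup` are hypothetical.

```python
import bisect

# Hypothetical table of (id, name) rows in arrival order
rows = [(17, "Eve"), (3, "Bob"), (42, "Zed"), (8, "Ann")]

# Clustered: the rows themselves are stored sorted by the key (id)
clustered = sorted(rows)
ids = [row_id for row_id, _ in clustered]

def clustered_range(lo, hi):
    # A range scan reads one contiguous slice because the data is in key order
    start = bisect.bisect_left(ids, lo)
    end = bisect.bisect_right(ids, hi)
    return clustered[start:end]

# Non-clustered: a separate sorted structure of (name, position in rows)
name_index = sorted((name, pos) for pos, (_, name) in enumerate(rows))
names = [name for name, _ in name_index]

def nonclustered_lookup(name):
    # Find the entry in the index, then follow its pointer to the record
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return rows[name_index[i][1]]
    return None

print(clustered_range(5, 20))
print(nonclustered_lookup("Zed"))
```

Because the data can only be stored in one order, a table has at most one clustered index, while several non-clustered indexes can point into the same rows.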

13
Q

Hash function

A

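Maps a key to a bucket. Example: take the last 2 bits of the key's binary representation (the same as key mod 4 for non-negative integers, giving 4 buckets).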
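
As a small illustration, the "last 2 bits" hash can be written with a bit mask. The sample keys and the name `last_two_bits` are made up for this sketch, and non-negative integer keys are assumed.

```python
def last_two_bits(key: int) -> int:
    # Mask with binary 11 to keep only the two lowest bits (values 0..3)
    return key & 0b11

# Four buckets, one per possible hash value
buckets = {b: [] for b in range(4)}
for key in [4, 7, 10, 13, 18]:
    buckets[last_two_bits(key)].append(key)

print(buckets)
# 10 (binary 1010) and 18 (binary 10010) both end in 10, so they collide in bucket 2
```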

14
Q

Opportunities for data exploitation

A

Energy saving, financial systemic risk analysis, security, computer security, democracy (citizens understanding how government operates), education, healthcare, urban planning, intelligent transportation, environmental modeling
