College 1: Introduction Flashcards
Types of data
- Structured: data warehouse repository, relational data
- Unstructured: Wikipedia, news articles (no predefined format or organization, often human-generated)
- Graph data: social networks, road networks, knowledge bases
- Semi-structured: log files, product reviews, JSON, XML (some level of organization through tags, labels, or markers)
Large-scale data
Very large collections (terabytes to exabytes): scientific, business, or user-generated.
Streaming data
Sensors, feeds, continuous analytics. Response times range from seconds down to nanoseconds.
Heterogeneous data
A collection of data consisting of diverse types, formats, or structures: multiple data sources, with data elements that vary in representation, organization, or underlying technology. (Computer- or human-generated)
Private data
Correlating data from multiple sources poses risks, so privacy and accountability are important. Examples: credit card history, mobile phone usage, GPS tracking.
Uncertain data
Imprecision, inconsistencies, incompleteness, ambiguities, latency, deception, approximations, privacy-preserving transformations. The process, the data, and the model are all uncertain.
Data processing pipeline
- Acquisition & Storage
- Cleaning, Annotation, Integration, Aggregation
- Exploration & Querying
- Modeling & Data Analysis
Knowledge Discovery Process (KDD)
**Data Sources** -> Data Cleaning -> **Data Warehouse** -> Selection -> Task-relevant data -> Data mining -> Pattern Evaluation -> KNOWLEDGE
The 3 Vs of Big Data challenges
- Volume: size of data
- Velocity: speed at which data is generated, collected, and processed
- Variety: the diverse types and formats of data encountered in big data (heterogeneity)
TF-IDF
TF(t,d) = (number of times t appears in d) / (total number of terms in d)
IDF(t) = log(N / (1 + df))
TF-IDF(t,d) = TF(t,d) * IDF(t)
N = total number of documents, df = number of documents that contain t
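A minimal sketch of the TF-IDF formulas above, reading the IDF denominator as 1 + df. The natural logarithm, the function names, and the toy documents are assumptions made for illustration; the card does not fix a log base. Documents are assumed to be non-empty lists of terms.

```python
import math
from collections import Counter


def tf(term, doc):
    """Term frequency: occurrences of term in doc / total terms in doc."""
    counts = Counter(doc)
    return counts[term] / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(N / (1 + df)), natural log assumed."""
    n = len(docs)                              # N: total number of documents
    df = sum(1 for d in docs if term in d)     # df: documents containing term
    return math.log(n / (1 + df))


def tf_idf(term, doc, docs):
    """TF-IDF(t, d) = TF(t, d) * IDF(t)."""
    return tf(term, doc) * idf(term, docs)


# Hypothetical toy corpus: each document is a list of terms.
docs = [
    ["big", "data", "is", "big"],
    ["data", "mining"],
    ["streaming", "sensors"],
    ["graph", "networks"],
]

score_big = tf_idf("big", docs[0], docs)    # TF = 2/4, IDF = log(4/2)
score_data = tf_idf("data", docs[0], docs)  # TF = 1/4, IDF = log(4/3)
```

"big" scores higher than "data" in the first document: it occurs more often there and appears in fewer documents overall. Note that with the 1 + df denominator, a term that occurs in every document gets a negative IDF.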
Clustered vs. Unclustered index
A clustered index defines the physical order of the table: the rows are stored sorted by the index key, like the words in a dictionary. A non-clustered index keeps the index in one place and the data records in another; each index entry points to the location of its record.
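A rough in-memory sketch of the difference, not how a real database stores pages. The `heap`, `clustered`, and `nonclustered` names and the sample rows are hypothetical: in the clustered case the rows themselves are kept sorted by key, while in the non-clustered case a separate sorted list of (key, position) entries points into the unsorted rows.

```python
import bisect

# Rows in arbitrary (insertion) order, as (name, age).
heap = [("carol", 31), ("alice", 25), ("bob", 40)]

# Clustered: the data itself is stored in key order.
clustered = sorted(heap)

# Non-clustered: a separate sorted index of (key, position in heap).
nonclustered = sorted((name, pos) for pos, (name, _) in enumerate(heap))


def lookup_clustered(name):
    """Binary search directly on the sorted rows."""
    i = bisect.bisect_left(clustered, (name,))
    if i < len(clustered) and clustered[i][0] == name:
        return clustered[i]
    return None


def lookup_nonclustered(name):
    """Binary search on the index, then follow the pointer to the row."""
    i = bisect.bisect_left(nonclustered, (name,))
    if i < len(nonclustered) and nonclustered[i][0] == name:
        return heap[nonclustered[i][1]]
    return None
```

Both lookups return the same row; the non-clustered one needs the extra hop from index entry to record.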
Hash function
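Example: h(key) = the last 2 bits of the key's binary representation (the same as key mod 4 for non-negative integers), which gives 4 buckets.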
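A small sketch of the hash function above, which takes the last 2 bits of a key's binary representation. It assumes non-negative integer keys; the name `h`, the sample keys, and the bucket layout are illustrative only.

```python
def h(key: int) -> int:
    """Keep only the last 2 bits of key (same as key % 4 for key >= 0)."""
    return key & 0b11


# Two bits give four possible buckets: 00, 01, 10, 11.
buckets = {0: [], 1: [], 2: [], 3: []}
for key in [4, 7, 10, 13, 18]:
    buckets[h(key)].append(key)

# 4 = 100, 7 = 111, 10 = 1010, 13 = 1101, 18 = 10010
# buckets is now {0: [4], 1: [13], 2: [10, 18], 3: [7]}
```

Keys 10 and 18 both end in the bits 10, so they collide in the same bucket.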
Opportunities for data exploitation
Energy saving, financial systemic risk analysis, security, computer security, democracy (citizens understanding how government operates), education, healthcare, urban planning, intelligent transportation, environmental modeling