Data Intensive Ch3 - Storage & Retrieval Flashcards
Full-text-search-engine secret sauce
Term dictionary
Each term from query is looked up in reverse dict which maps term to doc ids that contain it
Key-value store of word to IDs (postings list)
Term dictionary is kept in a SSTable like structure
Bloom filter
In memory structure
Approximates content of key set to quickly tell if key exists in SSTable
SSTables
Sorted string table
AKA LSM-tree (Log Structured Merge Tree) + SST
Log structured as append only segments
Merge tree because of merging segments and traveling from latest to oldest segments
Old segments are immutable
Append-only sequential writes = high throughput
Each segment contains keys in sorted order
Index is a hashmap key-offset
Keys are sparse (every N kilobytes of keys. So if segment contains records for keys a, b, c, d, e, f, g and each record is 2kB and we store index. every 6kB then hash map contains offsets for a, d and g keys)
Latest segment is stored in memory only (memtable using red black tree fex)
When reading memtable is read first and if key is not found segments are checked in latest-oldest order one-by-one until hit or miss if key is not found
Merging and compaction can be run periodically (merge sort of adjacent segments)
Compaction of segments for which key is not stored?
Range queries can be done
Cons: key lookup which does not exist takes long (go over all segments in vain)
For durability in case of failure - write-ahead-log is used so data can be recovered
ETL
extract transform load
(Usually periodic) Process of getting data from different data sources into data warehouse
Star & Snowflake schemas
Data model optimized for analytics in data warehouse
Consists of “central” table of facts
Each record is a business relevant event in time like purchase of a product
Some columns are regular attributes like price
Other are references to “dimension” tables
Dimension table answer question who, what, when, where, how, why
Example: sold product (with its description, category etc) or buyer. special deal offer etc
Date and time can also be a dimension - this allows to add metadata about holidays etc easily
Structure - fact table in the middle and dimensions protrude outwards around like a star
Snowflake - same as star except dimensions can have subdimensions
Column oriented storage
OLAP system DBs with star/snowflake data model queries usually refer to small subset of columns and the rest is ignored
Number of rows is usually of high orders of magnitude
Keeping rows close together is inefficient - large rows, lots of IO read where most of the read data is discarded, lots of scanning over disk as multiple row fetches are required
Idea - keep data column-wise close together
Each column is stored in the same order
Enables column compression
Enums can be turned into bitmaps (bit equals 1 means row has a value coded by ith index)
Materialized aggregates
Most queries of data warehouse revolve around count, sum, avg, max functions
Most frequent ones can be pre-computed over some possible dimensions and cached into materialized view
Each data change requires refresh of the view