Final Review Sheet Flashcards
index
table only has two attributes, a key identifier attribute and a pointer attribute that tells you which block of data it is in
B+tree index
it follows the pattern of a tree the pointers have a top level then they go to the next level and to the next level until you get to the lowest grain of data
used for items with a large cardinality, it is more geared towards transactional systems
block
logical unit of data transferred between disk and memory is block (4 bytes)
bitmap index
a collection of bit vectors where there are only 1’s and 0’s
it is used for boolean operations and in data warehouses
Data Staging Steps
Extraction Data cleansing Data Integration Transformation Loading Maintenance
Extraction
take it from the source, ex: operational systems, flat files, web
Data Staging
data is placed in a staging area away from the original system to cleanse it
Data Cleansing:Missing Values
- Exclude the record
- Exclude the attribute/field
- Replaced by a global constant
- Replaced by the attribute mean
- replaced by the most probable value
6 Manual correction - Apply specific algorithm
Data Integration
data from different sources with different data formats need to be integrated into one data warehouse
problems: same names between different attributes or same attribute with different names
1. Get source and target schemas
2. Integrate source schemas and map to data warehouse schema with the help from tools and business
Transformation
Prepare data for loading into the data warehouse
- change data formats
- Create derived attributes and tables
- Aggregate - make aggregate fact table
- Create warehouse keys
Loading
Load cleansed, integrated, transform data into data warehouse
Data integration tools
Pentaho Oracle SQL Commercial lower maintenance costs, easier to use, metadata is useful by default Scripting/SQL more productive, specific to your needs more difficult to maintain
DW Maintenance
DW refreshment, is the process of keeping data in DW consistent/current with data in source systems,
can be based on user driven policy where it is only refreshed when asked for
also it can be based on a warehouse driven policy that has a schedule of refreshing it
Trends in Data Warehousing
Online forums are used to predict product defects example: honda
social networks are used to predict stock behavior example: stocktwitter
Marketing trends are predicted by who you are in contact with on your phone
mapreduce
a programming model invented at google. uses thousands of commodity pcs connected by ethernet
map reads an input and produces a key value pair, all pairs are associated with the same key and are grouped then passed to reduce. multiple computers do these things
reduce: receive a group of pair and merge(aggregate) them