10/9 Class Flashcards
difference between datawarehouse and database
the way the data is organized
step 1
decide on the data format:
data type
length
null or not null
char(size)
size is the number of characteristics to store
varchar(size)
size is the number of characters to store, between variable length string
integer
stores ranges of integer values
decimal(p,s)
P is the precision and s is the scale
ex: deimal (7,2) is a number that has 5 digits before the decimal and 2 digits after the decimal
Step 2
decide index attributes
date time
stores year, month, day, hours, minutes, and seconds
disk
is the slowest part of our index
logical unit of transfer data(block)
4k bytes, 0.01 seconds
index table
only has two attributes
key: unique key, pointer: shows which block it is in
B+ tree index
where there are multiple indexes in our records
height of the tree
log2n the two is based on a two block table
time process structure
4min non index
12s simple index
.21 s b+ tree
oracle command for creating B+tree index
create index pidindex ON product(product_key)
bitmap index
collection of bit vectors, one for each possible value of A. The vector for value v has 1 in position i if the ith record has v
more common in data warehouse
aggregation
can be an alternative or a complement to indexes on tables
data cleaning(missing values)
- exclude the record
- exclude the attribute/field
- replaced by a global constant
- replaced by the attribute mean
5 Replaced by the most probable value
6 Manual correction
7 Apply specific algorithm
imputation
used to find a missing value, it uses regression
data integration common challenges
same attribute with different name
different attribute with same name
data integration process
- Get source and target schemas
- Integrate source schema and map to data warehouse schema with the help from: business domain experts, source system tech experts, ETL tools
maintenance
DW refreshment, is the process of keeping data in DW consistent/current with data in source systems