Data-warehousing/analysis/mining Flashcards
What do queries run by big stores or hospitals tend to feature?
- they tend to be complex: use aggregates and other advanced features
- they tend to examine large parts of the data
- they tend to block large parts of the database
What types of queries should we avoid executing on a DBMS that serves a lot of customers simultaneously?
- Running very complex queries that examine and block large parts of the database
What are data warehouses?
- Systems designed to support data analysis to answer very complex queries that examine large parts of the database
How are data warehouses updated?
- By periodically loading the updated/new information from the smaller operational servers (the databases that handle day-to-day transactions)
What is OLAP?
(Online Analytic Processing) OLAP refers to the process of analysing complex data stored in a data warehouse
what is an OLAP query?
- An SQL query issued as part of an OLAP process: typically complex, and touching a large part of the data
What is OLTP?
- Online transaction processing is used in traditional DBMS
- OLTP queries only touch small (and often different) parts of the database that usually don’t overlap, so many of them can run quickly and simultaneously
What do OLAP applications typically feature?
- A single central fact table, which can be viewed as a data cube, representing the events and objects of interest for the analysis.
Describe an example fact table with schema Sales(productNo, date, store, price)?
- productNo, date and store are its 3 dimension attributes, so we can represent it as a 3-dimensional data cube
- one axis of the cube is productNo, another is date and the third is store
- this way, each point in the cube corresponds to the sale of a product, and the dependent attribute price is the value stored at that point
What do we get back when we query a data cube (fact table) with a range for each edge?
- as it’s represented as a cube, we get back the slice of the cube whose points satisfy all the ranges in the query
- we can then further query this smaller section of information (a minimal sketch follows below)
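A minimal sketch of such a range query, using Python’s sqlite3 with a hypothetical Sales fact table (all names and values are invented for illustration):

```python
import sqlite3

# Hypothetical Sales fact table: productNo, date and store are the
# dimensions of the cube; price is the dependent attribute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (productNo INT, date TEXT, store TEXT, price REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    (1, "2024-01-05", "London", 9.99),
    (1, "2024-02-10", "Leeds",  9.49),
    (2, "2024-01-20", "London", 4.50),
    (3, "2024-03-01", "Leeds",  7.25),
])

# A range on each dimension selects a slice of the cube.
slice_rows = conn.execute("""
    SELECT * FROM Sales
    WHERE productNo BETWEEN 1 AND 2
      AND date BETWEEN '2024-01-01' AND '2024-01-31'
      AND store IN ('London', 'Leeds')
""").fetchall()
print(slice_rows)  # only the points of the cube inside all three ranges
```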
What are star schemas?
- a star schema has a single fact table (whose rows are the points in the data cube)
- and dimension tables that describe the values along each axis of the cube
- the fact table linked to the surrounding dimension tables is what gives the schema its star shape
What are star schemas made up of?
- the name of the fact table
- dimensions: keys into the dimension tables, used to answer queries
- dependent attributes: values that there is no other information about elsewhere
- dimensions are attributes we do have other information about elsewhere
What exactly does a star schema describe?
A database consisting of:
- a fact table R(A1, …, An, B1, …, Bm)
- A1, …, An are the dimension attributes: in essence, foreign keys into the dimension tables
- each dimension table can also have further dimensions of its own
- B1, …, Bm are dependent attributes (see the sketch below)
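A minimal sketch of such a star schema, created here via Python’s sqlite3 (all table and attribute names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: describe the values along each axis of the cube.
CREATE TABLE Product (productNo INT PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE Store   (storeId   INT PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE Day     (dayId     INT PRIMARY KEY, month INT, year INT);

-- Fact table R(A1, ..., An, B1, ..., Bm): the dimension attributes
-- are foreign keys into the dimension tables; price is the
-- dependent attribute.
CREATE TABLE Sales (
    productNo INT REFERENCES Product(productNo),
    storeId   INT REFERENCES Store(storeId),
    dayId     INT REFERENCES Day(dayId),
    price     REAL
);
""")
```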
What are the Characteristics of Star Schemas?
- They are denormalised (they have duplicate data)
- gains: they don’t require many joins
- faster and easier aggregation of data
Why don’t we mind that the data is denormalised in star schemas?
We don’t mind storing data in multiple places because we care more about making this type of query fast
What is Dicing?
- Dicing: when we query the schema and get the small slice of the cube back, we then cut it into even smaller pieces, each small piece telling us something about a specific group
What is the difference between slicing and dicing?
- Slicing is done by the WHERE clause, to get a slice of the data cube
- Dicing is done by the GROUP BY clause, to cut up the slice of data into groups (a minimal sketch follows below)
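A minimal sketch of both operations on the hypothetical Sales fact table, using Python’s sqlite3 again (names and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (productNo INT, date TEXT, store TEXT, price REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    (1, "2024-01-05", "London", 9.99),
    (1, "2024-01-12", "London", 9.99),
    (2, "2024-02-20", "Leeds",  4.50),
])

rows = conn.execute("""
    SELECT productNo, store, SUM(price) AS revenue
    FROM Sales
    WHERE date BETWEEN '2024-01-01' AND '2024-03-31'  -- slicing: WHERE picks the slice
    GROUP BY productNo, store                         -- dicing: GROUP BY cuts it into groups
""").fetchall()
print(rows)  # one row per (productNo, store) group inside the slice
```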
What is data mining?
- Can be seen as an extended form of OLAP
- the idea is that you have a lot of data and want to find the answers to questions you care about directly, instead of issuing long sequences of queries by hand
Why do we use data mining?
To discover patterns/knowledge in data
- can use data mining to discover relationships that may not be obvious (for example, nappies and beer being bought together)
If we have a table of films and which watchers enjoyed the films what could this help us to predict?
- Streaming services may be interested in which viewers frequently like the same films.
- As this can allow them to accurately recommend films to people
What are some applications of data mining?
- Deviation Detection: Identify anomalies (e.g., intruders trying to break into a system)
- Link Analysis: Try to discover links between attributes (e.g., association rules)
- Predictive Modelling - Try to predict future behaviour of certain attributes in the data based on past behaviour
- Database Segmentation: Group data by similar “behaviour” to better target the needs of each group
What are the types of Discovered Knowledge from data mining?
- Association rules
- Classification hierarchies
Uses tree structures: if you’ve followed a similar path to others, you may behave similarly in the future (and continue following the same path of leaves)
- Sequential patterns: Example: “If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he/she is likely to suffer from kidney failure within the next 18 months.”
- Clustering: Example: group treatment data on a disease based on similarity of side effects.
What are the basics of the market-basket model
- the data is described by a set I of all the different items
- and a set of baskets B: each basket b is a subset of the items in I (a minimal sketch follows below)
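A minimal sketch of the model in Python (the items and baskets are invented for illustration):

```python
# The set I of all items, and the baskets: each basket is a subset of I.
items = {"eggs", "bread", "milk", "beer", "nappies"}
baskets = [
    {"eggs", "bread"},
    {"beer", "nappies"},
    {"milk", "bread", "beer"},
    {"beer", "nappies", "bread"},
]
assert all(b <= items for b in baskets)  # every basket b is a subset of I
```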
What is Frequent-Itemset Mining?
- what counts as frequent depends on the application in question and what information is wanted
- a support threshold (a percentage) is set, which defines whether an itemset is frequent
- we want to check whether two items frequently appear together, and if they do we can exploit this fact in some way
How do we define whether an item appears frequently?
Use a formula to determine how frequently an itemset appears:
- support = number of baskets containing the itemset / total number of baskets
If the support threshold = 1/2 and an item appears in 3/4 baskets what does this mean?
- It means that the item is frequent, because its support 3/4 >= the support threshold 1/2
How does Frequent-Itemset Mining work with subsets of items?
- if we also want to check whether a subset of items is frequent, we can use the same formula:
- number of baskets containing the subset / total number of baskets
- and then check whether this fraction is >= the support threshold (a minimal sketch follows below)
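A minimal sketch of the support check in Python, reusing the invented baskets from the sketch above:

```python
def support(itemset, baskets):
    """Fraction of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

baskets = [
    {"eggs", "bread"},
    {"beer", "nappies"},
    {"milk", "bread", "beer"},
    {"beer", "nappies", "bread"},
]
s = 1 / 2  # support threshold

print(support({"beer"}, baskets) >= s)             # 3/4 >= 1/2 -> frequent
print(support({"beer", "nappies"}, baskets) >= s)  # 2/4 >= 1/2 -> frequent
```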
How do we calculate baskets if we have more complex information on the data such as the identities of the people who bought a basket of shopping (not just all the different baskets of shopping)?
A basket is the union of all of a specific customer’s shopping trips, so if they bought {eggs, bread} in one shop and {milk} in the next shop, then their basket = {eggs, bread, milk}
- we can GROUP BY the customers’ identities and then discard the purchase ID column (a minimal sketch follows below)
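A minimal sketch of this grouping in Python (the per-purchase rows are invented for illustration):

```python
from collections import defaultdict

# Hypothetical per-purchase rows: (purchaseId, customer, item).
purchases = [
    (1, "alice", "eggs"),
    (1, "alice", "bread"),
    (2, "alice", "milk"),
    (3, "bob",   "beer"),
]

# Group by customer and discard the purchase ID: a customer's basket
# is the union of all their shopping trips.
baskets = defaultdict(set)
for _purchase_id, customer, item in purchases:
    baskets[customer].add(item)

print(dict(baskets))  # {'alice': {'eggs', 'bread', 'milk'}, 'bob': {'beer'}}
```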
Why is a table with baskets and specific customers more useful than a table with just every separate purchase?
- It’s more accurate since we can calculate how many different people bought a certain item
- we can also recommend items to specific customers using this method
- We can also say whether items are frequent with respect to customers, not just frequent across all individual purchases
What are association rules?
- an association rule J -> i says that baskets containing all the items in the set J are likely to also contain the item i
- because if the items in J are bought, i is more likely to be bought as well
What is the confidence?
The fraction of the baskets containing the itemset J that also contain the item i (for an association rule J -> i)
How do we calculate the confidence?
confidence(J -> i) = number of baskets containing J together with i / number of baskets containing J on its own
Why is it important that the confidence is high and that it differs from the fraction of baskets containing i on its own?
- The confidence (the strength of the link) should be high for the rule to be worth exploiting (acting upon it to try to improve sales, for example)
- It should differ significantly from the fraction of all baskets containing i, because if it doesn’t then i and J are more likely independent of each other and not worth trying to exploit
If {B, S, HP1} appears in 4/8 baskets and {B, S} appears in 5/8 baskets, what is the confidence of {B, S} -> HP1?
- confidence({B, S} -> HP1) = (4/8) / (5/8) = 4/5
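A minimal sketch of the confidence calculation in Python (the 8 baskets are invented so the counts match the card above):

```python
def support_count(itemset, baskets):
    """Number of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets)

def confidence(J, i, baskets):
    """Fraction of the baskets containing J that also contain i."""
    return support_count(J | {i}, baskets) / support_count(J, baskets)

# {B, S, HP1} appears in 4 of the 8 baskets, {B, S} in 5 of them.
baskets = [
    {"B", "S", "HP1"}, {"B", "S", "HP1"}, {"B", "S", "HP1"}, {"B", "S", "HP1"},
    {"B", "S"},
    {"B"}, {"S"}, {"HP1"},
]
print(confidence({"B", "S"}, "HP1", baskets))  # (4/8) / (5/8) = 0.8
```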
What is the goal of the A-Priori Algorithm?
Compute all the itemsets J with support >= s
What rule does the A-Priori Algorithm use?
- monotonicity: if J has support >= s, then every subset of J also has support >= s
- so we work level by level: first find the frequent single items, then build candidate pairs out of frequent items only, count their support, and keep the frequent pairs
- we keep growing the itemsets this way until we reach however large an itemset we’re interested in
What does the A-Priori Algorithm have poor dependency on?
It scales poorly with the size of the itemsets being mined: each level requires another pass over the data (another round of queries), and in the worst case the number of candidate itemsets grows exponentially (a minimal sketch of the algorithm follows below)
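A minimal in-memory sketch of the level-wise A-Priori idea in Python (the baskets are invented; a real system would instead run one pass of counting queries over the database per level):

```python
from itertools import combinations

def apriori(baskets, s, max_size):
    """All itemsets of size <= max_size with support >= s."""
    n = len(baskets)
    # Level 1: frequent single items.
    frequent = {frozenset({i}) for b in baskets for i in b}
    frequent = {J for J in frequent
                if sum(J <= b for b in baskets) / n >= s}
    result = set(frequent)
    for k in range(2, max_size + 1):
        # Build size-k candidates from frequent (k-1)-itemsets; by
        # monotonicity, every (k-1)-subset of a candidate must be frequent.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # Count supports and keep only the frequent candidates.
        frequent = {c for c in candidates
                    if sum(c <= b for b in baskets) / n >= s}
        if not frequent:
            break
        result |= frequent
    return result

baskets = [{"beer", "nappies"}, {"beer", "nappies", "bread"},
           {"bread", "milk"}, {"beer", "bread"}]
print(apriori(baskets, s=0.5, max_size=2))
```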