Data-warehousing/analysis/mining Flashcards
What do queries run by big stores or hospitals tend to feature?
- they tend to be complex: use aggregates and other advanced features
- they tend to examine large parts of the data
- they tend to block large parts of the database
What types of queries should we avoid executing on DBMS that serve a lot of customers simultaneously?
- Running very complex queries that examine and block large parts of the database
What are data warehouses?
- Systems designed to support data analysis to answer very complex queries that examine large parts of the database
How are data warehouses updated?
- By getting the updated/new information from the smaller servers
What is OLAP?
(Online Analytic processing) OLAP refers to the process of analysing complex data stored in a data warehouse
what is an OLAP query?
- An SQL query that is for an OLAP process?
What is OLTP?
- Online transaction processing is used in traditional DBMS
- OLTP queries only touch small (and often different) parts of the database that probably don’t overlap so you can therefore make a lot of fast queries simultaneously
What do OLAP applications typically feature?
- A unique fact cube that represents events and objects of interest for the analysis.
Describe an example fact table with schema Sales(productNo, date, store, price)?
- if it only has 3 attributes it has 3 dimensions so we can represent it at a data cube
- so on one side of the cube the height could be productNo, the length could be date and the width could be price
- this way, each point in the cube corresponds to the sale of a product
What do we get back when we query a data cube (fact table)with a range for each edge?
- as it’s represented as a cube, we get a slice of the cube back that all satisfies the ranges in the query
- we can then further query this smaller section of information
What are star schemas?
- they have unique fact tables (contains points in the data cube)
- Dimension tables: describe values along each axis of the cube
- use star schemas to represent dimension tables
What are star schemas made up of?
- name of the table
- keys to virtual fact tables that answer queries (known as dimensions)
- a dependent attribute that there is no other information about elsewhere
- dimensions are attributes we do have other information about elsewhere
What exactly does a star schema a describe?
A database consisting of:
a fact table R(A1,…An, B1…Bm)
-What are star schemas?
- In essence A1 to An are foreign keys
- Each dimension can also have further dimensions of their own
- B1…Bm are dependent attributes
What are the Characteristics of Star Schemas?
- They are denormalised (they have duplicate data)
- gains: they don’t require many joins
- faster and easier aggregation of data
Why don’t we mind that the data is denormalised in star schemas?
We don’t mind storing data in multiple places because we care more about making these type of queries fast
What is Dicing?
- Dicing: When we query the schema and get this small slice of the cube back, then we cut it into even smaller pieces. Each small piece telling you something about a specific model
What is the difference between slicing and dicing?
- Slicing is done by the where clause to get a slice of the data cube
- Dicing is done by the group by clause to cut up the slice of data into groups
What is data mining?
Can be seen as extended form of OLAP
- Is more that you have a lot of data and you’re trying to find answers to questions that you care about. For instance instead of making sequences of queries, we just want a way to get this information directly.
Why do we use data mining?
To discover patterns/knowledge in data
- can use data mining to determine the reason behind a relationship that may not be obvious (for example nappies and beer)
If we have a table of films and which watchers enjoyed the films what could this help us to predict?
- Streaming services may be interested in which viewers frequently like the same films.
- As this can allow them to accurately recommend films to people
What are some applications of data mining?
- Deviation Detection: Identify anomalies (e.g., intruders trying to break into a system)
- Link Analysis: Try to discover links between attributes (e.g., association rules)
- Predictive Modelling - Try to predict future behaviour of certain attributes in the data based on past behaviour
- Database Segmentation: Group data by similar “behaviour” to better target their needs
What are the types of Discovered Knowledge from data mining?
- Association rules
- Classification hierarchies
Uses tree structures - if you’ve a similar path to others then you may behave similarly in the future (and continue following the same path of leaves) - Sequential patterns: Example: “If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he/she is likely to suffer from kidney failure within the next 18 months.”
- Clustering - Example: Group treatment data on a disease based on similarity of side effects.
What are the basics of the market-basket model
- Data can be described as a set of all the different items I
- and a set of baskets B: Each basket b is subset of the items in I
What is Frequent-Itemset Mining?
- frequence depends on the application in question and what information is wanted
- but a support threshold (percentage) is set which defines if an object is frequent.
- we want to check whether two items appear frequently together and if they do we can exploit this fact in some way