Data-warehousing/analysis/mining Flashcards
What do queries run by big stores or hospitals tend to feature?
- they tend to be complex: use aggregates and other advanced features
- they tend to examine large parts of the data
- they tend to block large parts of the database
What types of queries should we avoid executing on a DBMS that serves a lot of customers simultaneously?
- Running very complex queries that examine and block large parts of the database
What are data warehouses?
- Systems designed to support data analysis to answer very complex queries that examine large parts of the database
How are data warehouses updated?
- By periodically loading the updated/new information from the smaller operational servers (the databases that handle day-to-day transactions)
What is OLAP?
(Online Analytic Processing) OLAP refers to the process of analysing complex data stored in a data warehouse
what is an OLAP query?
- An SQL query issued as part of an OLAP process: typically complex, and touching a large part of the data
What is OLTP?
- Online transaction processing is used in traditional DBMS
- OLTP queries only touch small (and often different) parts of the database that usually don’t overlap, so many of them can run quickly and simultaneously
What do OLAP applications typically feature?
- A single central fact table, which can be viewed as a data cube, representing the events and objects of interest for the analysis.
Describe an example fact table with schema Sales(productNo, date, store, price)?
- productNo, date and store are its 3 dimension attributes, so we can represent it as a 3-dimensional data cube
- one axis of the cube is productNo, another is date and the third is store
- this way, each point in the cube corresponds to the sale of a product, and the dependent attribute price is the value stored at that point
What do we get back when we query a data cube (fact table) with a range for each edge?
- as it’s represented as a cube, we get back the slice of the cube whose points satisfy all the ranges in the query
- we can then further query this smaller section of information (a minimal sketch follows below)
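A minimal sketch of such a range query, using Python’s sqlite3 with a hypothetical Sales fact table (all names and values are invented for illustration):

```python
import sqlite3

# Hypothetical Sales fact table: productNo, date and store are the
# dimensions of the cube; price is the dependent attribute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (productNo INT, date TEXT, store TEXT, price REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    (1, "2024-01-05", "London", 9.99),
    (1, "2024-02-10", "Leeds",  9.49),
    (2, "2024-01-20", "London", 4.50),
    (3, "2024-03-01", "Leeds",  7.25),
])

# A range on each dimension selects a slice of the cube.
slice_rows = conn.execute("""
    SELECT * FROM Sales
    WHERE productNo BETWEEN 1 AND 2
      AND date BETWEEN '2024-01-01' AND '2024-01-31'
      AND store IN ('London', 'Leeds')
""").fetchall()
print(slice_rows)  # only the points of the cube inside all three ranges
```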
What are star schemas?
- a star schema has a single fact table (whose rows are the points in the data cube)
- and dimension tables that describe the values along each axis of the cube
- the fact table linked to the surrounding dimension tables is what gives the schema its star shape
What are star schemas made up of?
- the name of the fact table
- dimensions: keys into the dimension tables, used to answer queries
- dependent attributes: values that there is no other information about elsewhere
- dimensions are attributes we do have other information about elsewhere
What exactly does a star schema describe?
A database consisting of:
- a fact table R(A1, …, An, B1, …, Bm)
- A1, …, An are the dimension attributes: in essence, foreign keys into the dimension tables
- each dimension table can also have further dimensions of its own
- B1, …, Bm are dependent attributes (see the sketch below)
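A minimal sketch of such a star schema, created here via Python’s sqlite3 (all table and attribute names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: describe the values along each axis of the cube.
CREATE TABLE Product (productNo INT PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE Store   (storeId   INT PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE Day     (dayId     INT PRIMARY KEY, month INT, year INT);

-- Fact table R(A1, ..., An, B1, ..., Bm): the dimension attributes
-- are foreign keys into the dimension tables; price is the
-- dependent attribute.
CREATE TABLE Sales (
    productNo INT REFERENCES Product(productNo),
    storeId   INT REFERENCES Store(storeId),
    dayId     INT REFERENCES Day(dayId),
    price     REAL
);
""")
```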
What are the Characteristics of Star Schemas?
- They are denormalised (they have duplicate data)
- gains: they don’t require many joins
- faster and easier aggregation of data
Why don’t we mind that the data is denormalised in star schemas?
We don’t mind storing data in multiple places because we care more about making this type of query fast
What is Dicing?
- Dicing: when we query the schema and get the small slice of the cube back, we then cut it into even smaller pieces, each small piece telling us something about a specific group
What is the difference between slicing and dicing?
- Slicing is done by the WHERE clause, to get a slice of the data cube
- Dicing is done by the GROUP BY clause, to cut up the slice of data into groups (a minimal sketch follows below)
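A minimal sketch of both operations on the hypothetical Sales fact table, using Python’s sqlite3 again (names and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (productNo INT, date TEXT, store TEXT, price REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    (1, "2024-01-05", "London", 9.99),
    (1, "2024-01-12", "London", 9.99),
    (2, "2024-02-20", "Leeds",  4.50),
])

rows = conn.execute("""
    SELECT productNo, store, SUM(price) AS revenue
    FROM Sales
    WHERE date BETWEEN '2024-01-01' AND '2024-03-31'  -- slicing: WHERE picks the slice
    GROUP BY productNo, store                         -- dicing: GROUP BY cuts it into groups
""").fetchall()
print(rows)  # one row per (productNo, store) group inside the slice
```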
What is data mining?
- Can be seen as an extended form of OLAP
- the idea is that you have a lot of data and want to find the answers to questions you care about directly, instead of issuing long sequences of queries by hand
Why do we use data mining?
To discover patterns/knowledge in data
- can use data mining to discover relationships that may not be obvious (for example, nappies and beer being bought together)
If we have a table of films and which watchers enjoyed the films what could this help us to predict?
- Streaming services may be interested in which viewers frequently like the same films.
- As this can allow them to accurately recommend films to people
What are some applications of data mining?
- Deviation Detection: Identify anomalies (e.g., intruders trying to break into a system)
- Link Analysis: Try to discover links between attributes (e.g., association rules)
- Predictive Modelling - Try to predict future behaviour of certain attributes in the data based on past behaviour
- Database Segmentation: Group data by similar “behaviour” to better target the needs of each group
What are the types of Discovered Knowledge from data mining?
- Association rules
- Classification hierarchies
Uses tree structures: if you’ve followed a similar path to others, you may behave similarly in the future (and continue following the same path of leaves)
- Sequential patterns: Example: “If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he/she is likely to suffer from kidney failure within the next 18 months.”
- Clustering: Example: group treatment data on a disease based on similarity of side effects.
What are the basics of the market-basket model
- the data is described by a set I of all the different items
- and a set of baskets B: each basket b is a subset of the items in I (a minimal sketch follows below)
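A minimal sketch of the model in Python (the items and baskets are invented for illustration):

```python
# The set I of all items, and the baskets: each basket is a subset of I.
items = {"eggs", "bread", "milk", "beer", "nappies"}
baskets = [
    {"eggs", "bread"},
    {"beer", "nappies"},
    {"milk", "bread", "beer"},
    {"beer", "nappies", "bread"},
]
assert all(b <= items for b in baskets)  # every basket b is a subset of I
```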
What is Frequent-Itemset Mining?
- what counts as frequent depends on the application in question and what information is wanted
- a support threshold (a percentage) is set, which defines whether an itemset is frequent
- we want to check whether two items frequently appear together, and if they do we can exploit this fact in some way
How do we define whether an item appears frequently?
Use a formula to determine how frequently an itemset appears:
- support = number of baskets containing the itemset / total number of baskets
If the support threshold = 1/2 and an item appears in 3/4 baskets what does this mean?
- It means that the item is frequent, because its support 3/4 >= the support threshold 1/2
How does Frequent-Itemset Mining work with subsets of items?
- if we also want to check whether a subset of items is frequent, we can use the same formula:
- number of baskets containing the subset / total number of baskets
- and then check whether this fraction is >= the support threshold (a minimal sketch follows below)
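A minimal sketch of the support check in Python, reusing the invented baskets from the sketch above:

```python
def support(itemset, baskets):
    """Fraction of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

baskets = [
    {"eggs", "bread"},
    {"beer", "nappies"},
    {"milk", "bread", "beer"},
    {"beer", "nappies", "bread"},
]
s = 1 / 2  # support threshold

print(support({"beer"}, baskets) >= s)             # 3/4 >= 1/2 -> frequent
print(support({"beer", "nappies"}, baskets) >= s)  # 2/4 >= 1/2 -> frequent
```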
How do we calculate baskets if we have more complex information on the data such as the identities of the people who bought a basket of shopping (not just all the different baskets of shopping)?
A basket is the union of all of a specific customer’s shopping trips, so if they bought {eggs, bread} in one shop and {milk} in the next shop, then their basket = {eggs, bread, milk}
- we can GROUP BY the customers’ identities and then discard the purchase ID column (a minimal sketch follows below)
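A minimal sketch of this grouping in Python (the per-purchase rows are invented for illustration):

```python
from collections import defaultdict

# Hypothetical per-purchase rows: (purchaseId, customer, item).
purchases = [
    (1, "alice", "eggs"),
    (1, "alice", "bread"),
    (2, "alice", "milk"),
    (3, "bob",   "beer"),
]

# Group by customer and discard the purchase ID: a customer's basket
# is the union of all their shopping trips.
baskets = defaultdict(set)
for _purchase_id, customer, item in purchases:
    baskets[customer].add(item)

print(dict(baskets))  # {'alice': {'eggs', 'bread', 'milk'}, 'bob': {'beer'}}
```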
Why is a table with baskets and specific customers more useful than a table with just every separate purchase?
- It’s more accurate since we can calculate how many different people bought a certain item
- we can also recommend items to specific customers using this method
- We can also say whether items are frequent with respect to customers, not just frequent across all individual purchases
What are association rules?
- an association rule J -> i says that baskets containing all the items in the set J are likely to also contain the item i
- because if the items in J are bought, i is more likely to be bought as well
What is the confidence?
The fraction of the baskets containing the itemset J that also contain the item i (for an association rule J -> i)
How do we calculate the confidence?
confidence(J -> i) = number of baskets containing J together with i / number of baskets containing J on its own
Why is it important that the confidence is high and that it differs from the fraction of baskets containing i on its own?
- The confidence (the strength of the link) should be high for the rule to be worth exploiting (acting upon it to try to improve sales, for example)
- It should differ significantly from the fraction of all baskets containing i, because if it doesn’t then i and J are more likely independent of each other and not worth trying to exploit
If {B, S, HP1} appears in 4/8 baskets and {B, S} appears in 5/8 baskets, what is the confidence of {B, S} -> HP1?
- confidence({B, S} -> HP1) = (4/8) / (5/8) = 4/5
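A minimal sketch of the confidence calculation in Python (the 8 baskets are invented so the counts match the card above):

```python
def support_count(itemset, baskets):
    """Number of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets)

def confidence(J, i, baskets):
    """Fraction of the baskets containing J that also contain i."""
    return support_count(J | {i}, baskets) / support_count(J, baskets)

# {B, S, HP1} appears in 4 of the 8 baskets, {B, S} in 5 of them.
baskets = [
    {"B", "S", "HP1"}, {"B", "S", "HP1"}, {"B", "S", "HP1"}, {"B", "S", "HP1"},
    {"B", "S"},
    {"B"}, {"S"}, {"HP1"},
]
print(confidence({"B", "S"}, "HP1", baskets))  # (4/8) / (5/8) = 0.8
```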
What is the goal of the A-Priori Algorithm?
Compute all the itemsets J with support >= s
What rule does the A-Priori Algorithm use?
- monotonicity: if J has support >= s, then every subset of J also has support >= s
- so we work level by level: first find the frequent single items, then build candidate pairs out of frequent items only, count their support, and keep the frequent pairs
- we keep growing the itemsets this way until we reach however large an itemset we’re interested in
What does the A-Priori Algorithm have poor dependency on?
It scales poorly with the size of the itemsets being mined: each level requires another pass over the data (another round of queries), and in the worst case the number of candidate itemsets grows exponentially (a minimal sketch of the algorithm follows below)
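A minimal in-memory sketch of the level-wise A-Priori idea in Python (the baskets are invented; a real system would instead run one pass of counting queries over the database per level):

```python
from itertools import combinations

def apriori(baskets, s, max_size):
    """All itemsets of size <= max_size with support >= s."""
    n = len(baskets)
    # Level 1: frequent single items.
    frequent = {frozenset({i}) for b in baskets for i in b}
    frequent = {J for J in frequent
                if sum(J <= b for b in baskets) / n >= s}
    result = set(frequent)
    for k in range(2, max_size + 1):
        # Build size-k candidates from frequent (k-1)-itemsets; by
        # monotonicity, every (k-1)-subset of a candidate must be frequent.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # Count supports and keep only the frequent candidates.
        frequent = {c for c in candidates
                    if sum(c <= b for b in baskets) / n >= s}
        if not frequent:
            break
        result |= frequent
    return result

baskets = [{"beer", "nappies"}, {"beer", "nappies", "bread"},
           {"bread", "milk"}, {"beer", "bread"}]
print(apriori(baskets, s=0.5, max_size=2))
```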