Data Analysis and Mining Flashcards
Use Cases
Scenarios that motivate the use of data analysis.
Data Technology
Scenarios that motivate the use of data warehouses.
Aggregates in SQL
- COUNT()
- MAX()
- MIN()
- AVG()
- SUM()
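As a minimal sketch, the standard aggregates can be tried against a throwaway in-memory SQLite table from Python (the `sales` table and its values are made up for illustration):

```python
import sqlite3

# Hypothetical one-column table of sale prices.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (price REAL)")
conn.executemany("INSERT INTO sales VALUES (?)", [(10.0,), (25.0,), (5.0,)])

# One query exercising COUNT, MAX, MIN and AVG together.
row = conn.execute(
    "SELECT COUNT(*), MAX(price), MIN(price), AVG(price) FROM sales"
).fetchone()
print(row)  # count, max, min, avg
```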
Complex SQL Queries
Typically use aggregates and examine a large portion of the database.
Data warehouses
A type of database system that supports data analysis.
Typically not updated constantly, but may be refreshed periodically, e.g. once a day.
Extract Transform Load (ETL)
Extracts data from source systems and transforms it into a specific schema/pattern before loading it into the warehouse.
OLAP
Online analytic processing.
Refers to the process of analysing complex data stored in a data warehouse.
OLAP Query
Complicated queries that touch a lot of data, discover trends and patterns in the data, and typically take a long time to compute.
OLTP
Online transaction processing.
Typical DBMS workloads, where queries are fast and touch a small portion of the database.
Unique Fact Tables
Used in OLAP to store the events and objects of interest for the analysis.
May be thought of as representing a data cube, where each axis (length, width, depth, …) represents a different variable, such as product, date, or store.
Star Schemas
One of the more common data warehouse architectures, containing a unique fact table whose rows correspond to points in a data cube.
The fact table is accompanied by dimension tables.
Dimensional Tables
Describe the values along each axis (dimension) of the data cube.
Star Schema Example
Let's say we have the fact table Sales(productNo, date, store, price). We can split this further:
- productNo comes from Products(productNo, type, model)
- date comes from Days(date, day, week, month, year)
- store comes from Stores(name, city, country, phone), where “name” is renamed to store.
- price can be its own dependent attribute.
The attributes productNo, date and store are called dimensions, and each one has its own dimension table, which the fact table references.
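As a sketch, the star schema above can be built with SQLite from Python (all rows below are made-up sample data), ending with a typical OLAP-style query that joins the fact table to a dimension table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (productNo INTEGER PRIMARY KEY, type TEXT, model TEXT);
CREATE TABLE Days     (date TEXT PRIMARY KEY, day TEXT, week INT, month TEXT, year INT);
CREATE TABLE Stores   (name TEXT PRIMARY KEY, city TEXT, country TEXT, phone TEXT);
-- Fact table: each row is a point in the data cube; price is the dependent attribute.
CREATE TABLE Sales (
    productNo INTEGER REFERENCES Products(productNo),
    date      TEXT    REFERENCES Days(date),
    store     TEXT    REFERENCES Stores(name),
    price     REAL
);
""")
conn.execute("INSERT INTO Products VALUES (1, 'laptop', 'X1')")
conn.execute("INSERT INTO Days VALUES ('2024-01-01', 'Mon', 1, 'Jan', 2024)")
conn.execute("INSERT INTO Stores VALUES ('S1', 'London', 'UK', '000')")
conn.execute("INSERT INTO Sales VALUES (1, '2024-01-01', 'S1', 999.0)")

# OLAP-style query: total sales per city, joining fact and dimension tables.
rows = conn.execute("""
    SELECT s.city, SUM(f.price)
    FROM Sales f JOIN Stores s ON f.store = s.name
    GROUP BY s.city
""").fetchall()
print(rows)  # [('London', 999.0)]
```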
Denormalised Schemas
We store some data redundantly because we care more about making complex queries fast.
The main data is in one table, the fact table, and the remaining data can be joined with the fact table very quickly.
All in all, fewer joins are required.
Slice and Dice technique
Another way to explore the data in a data cube.
Slicing
Narrows down our search of the data cube by selecting (slicing) a section of it.
Dicing
Partitions the selected slice further into smaller sections of data.
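One common way to realise slicing and dicing is with SQL `WHERE` and `GROUP BY` clauses. The sketch below uses a made-up, flattened `Sales` table in SQLite purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (product TEXT, year INT, store TEXT, price REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("laptop", 2023, "S1", 900.0),
    ("laptop", 2024, "S1", 950.0),
    ("phone",  2024, "S2", 500.0),
])

# Slice: fix one dimension (year = 2024) to cut out a section of the cube.
slice_rows = conn.execute("SELECT * FROM Sales WHERE year = 2024").fetchall()

# Dice: partition that slice further, here by product and store.
dice_rows = conn.execute("""
    SELECT product, store, SUM(price)
    FROM Sales WHERE year = 2024
    GROUP BY product, store
""").fetchall()
print(slice_rows)
print(dice_rows)
```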
Data mining
An extended form of OLAP.
Data mining objective
Data mining takes a large amount of data and tries to answer questions that you care about.
Typical data mining queries
For example, “Find factors that have had the most influence over sales of product X”, rather than SQL queries.
Essentially, they are plain-English questions where we do not specify exactly what we want; instead, we discover things directly from the data.
Data mining applications
- Deviation detection -> identify anomalies.
- Link analysis -> discover links between attributes.
- Predictive modelling -> predict the future behaviour of certain attributes based upon past behaviour.
- Database segmentation -> group data by similar behaviour.
Types of discovered knowledge
- association rules
- classification hierarchies
- sequential patterns
- clustering
Classification hierarchies
An example could be classifying mutual funds based on performance characteristics such as growth, income, etc. Essentially, it is ordering items into the classes we care about.
Sequential Patterns
A specific sequence of patterns, say A, B and C, leading to an outcome D.
Clustering
Grouping data together in clusters.
Market-basket model
A type of data mining technique, which uses Market-Basket Data.
Market-basket Data
Market-Basket Data can be described by a set of items I, and a set of baskets B, with each basket being a subset of I.
Market-basket data example
Purchase ID ; Items Bought
101 ; milk, bread, cookies, juice
792 ; milk, juice
1130 ; milk, bread, eggs
1735 ; bread, cookies, coffee
Items (I) = {milk, bread, cookies, juice, eggs, coffee}
Baskets (B) = {b1, b2, b3, b4}, where bn is the set of “items bought” in purchase n.
Frequent-itemset mining
Questions like “Which items occur frequently together in a basket?”
Frequent-itemset mining support
support(J) = (number of baskets in B containing all items of J) / (total number of baskets in B)
Frequent-itemset mining example part 1
We first define how frequent counts as “frequent”.
We do this by choosing a support threshold s. For example, if we want to know whether bread appears in at least one of every two baskets, we set s = 0.5.
Frequent-itemset mining example part 2
So let's say we have s = 0.5, meaning the itemset should appear in at least one of every two baskets. Now let's go back to our earlier example:
101 ; milk, bread, cookies, juice
792 ; milk, juice
1130 ; milk, bread, eggs
1735 ; bread, cookies, coffee
If we ask “is buying milk and juice together frequent?”, we compute the support of J = {milk, juice}:
2/4, since baskets 101 and 792 both contain all items of J.
This meets our threshold of 0.5, so we can deem the itemset frequent.
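The support calculation above can be sketched in a few lines of Python, with basket contents copied from the example table:

```python
# The four example baskets (purchase IDs 101, 792, 1130, 1735).
baskets = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "bread", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item of `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

print(support({"milk", "juice"}, baskets))  # 0.5 -> frequent at s = 0.5
```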
More complex tables
Data in real tables is often more complex, and sometimes includes information that is irrelevant to the analysis. For example, our previous table could look like this instead:
Purchase ID ; Customer ID ; Items Bought
101 ; A ; milk, bread, cookies, juice
792 ; B ; milk, juice
1130 ; A ; milk, bread, eggs
1735 ; C ; bread, cookies, coffee
where we add the additional Customer ID column.
WRT (with example)
Now we can mine With Respect To (WRT) a chosen attribute, deciding which column defines the baskets.
Using our ongoing example, taking Purchase ID as the basket attribute leaves the table unchanged (except for the removal of the Customer ID column), since no two rows share a Purchase ID.
However, using Customer ID leads to A appearing twice, so A's purchases merge and the table becomes
A ; milk, bread, cookies, juice, eggs
B ; milk, juice
C ; bread, cookies, coffee
Using this table can result in different support values. For example, {milk, juice} now has support 2/3, making it even more frequent.
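The regrouping step can be sketched in Python: merging the purchase rows by Customer ID yields three baskets, and {milk, juice} then appears in 2 of the 3:

```python
from collections import defaultdict

# (Purchase ID, Customer ID, items bought), copied from the example table.
rows = [
    (101,  "A", {"milk", "bread", "cookies", "juice"}),
    (792,  "B", {"milk", "juice"}),
    (1130, "A", {"milk", "bread", "eggs"}),
    (1735, "C", {"bread", "cookies", "coffee"}),
]

by_customer = defaultdict(set)
for _pid, cust, items in rows:
    by_customer[cust] |= items          # A's two purchases merge into one basket

baskets = list(by_customer.values())    # 3 baskets: one each for A, B, C
hits = sum(1 for b in baskets if {"milk", "juice"} <= b)
print(hits, "/", len(baskets))          # 2 / 3
```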
Association Rules
Rules of the form {i1, i2, …} -> j.
In plain English, for example: “customers who buy diapers frequently also buy beer”, or “people who buy Game of Thrones and Harry Potter also buy Twin Peaks”.
Association Rule Properties
- support of {i1, i2, …, j} -> ideally, we want a high support.
- confidence
Confidence
Confidence is the fraction of baskets containing {i1, i2, …, in} that also contain j, and is written as:
support of {i1, …, in, j} / support of {i1, …, in}
This should also be high, and should differ significantly from the overall fraction of baskets containing j. If the two are relatively similar, then buying the other items is close to independent of buying j.
Association Rule Example
The rule in question is {milk} -> juice.
Using:
101 ; A ; milk, bread, cookies, juice
792 ; B ; milk, juice
1130 ; A ; milk, bread, eggs
1735 ; C ; bread, cookies, coffee
Support:
{milk, juice} = 2/4 = 0.5
Confidence:
support of {milk, juice} / support of {milk}
(2/4) / (3/4) = 2/3 ≈ 0.67
Result:
67 percent of all baskets that contain milk also contain juice.
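The same numbers fall out of a short Python sketch, with the baskets copied from the example:

```python
baskets = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "bread", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

# Confidence of {milk} -> juice: support({milk, juice}) / support({milk}).
confidence = support({"milk", "juice"}) / support({"milk"})
print(round(confidence, 2))  # 0.67
```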
A-Priori Algorithm
Computes all itemsets J with support >= s.
It exploits the key property that if J has support >= s, then all subsets of J also have support >= s; so frequent itemsets can be built level by level, starting from single items and only combining itemsets that are already frequent.
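A minimal Python sketch of the idea (not an optimised implementation): candidates at level k are built only from itemsets already found frequent at level k-1, which is valid because of the subset property above:

```python
from itertools import combinations

def apriori(baskets, s):
    """Return all itemsets with support >= s, as frozensets."""
    n = len(baskets)
    support = lambda J: sum(1 for b in baskets if J <= b) / n
    items = sorted({i for b in baskets for i in b})
    # Level 1: frequent single items.
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= s]
    result = list(frequent)
    k = 2
    while frequent:
        # Candidate k-sets: unions of frequent (k-1)-sets that have size k.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        frequent = [J for J in candidates if support(J) >= s]
        result.extend(frequent)
        k += 1
    return result

# The example baskets from the earlier cards.
baskets = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "bread", "eggs"},
    {"bread", "cookies", "coffee"},
]
frequent_sets = apriori(baskets, 0.5)
for J in frequent_sets:
    print(set(J))
```

With s = 0.5 this finds four frequent single items (milk, bread, cookies, juice) and three frequent pairs, including {milk, juice}.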