Data Analysis and Mining Flashcards
Use Cases
Prompts the use of data analysis.
Data Technology
Prompts the use of data warehouses.
Aggregates in SQL
- count()
- max()
- min()
- avg()
Complex SQL Queries
Typically contain the use of aggregates, and examines a large amount of the database.
Data warehouses
Type of database systems to support data analysis.
Typically is not updated constantly, but could be updated every day or so.
Extract transform load
Transforms data into a specific schema/pattern.
OLAP
Online analytic processing.
Refers to the process of analysing complex data stored in a data warehouse.
OLAP Query
Complicated queries that touch a lot of data, discover trends and patterns in data and typically take a large amount of time to compute.
OLTP
Online transactions processing.
Typical DBMS tasks, where queries are typically fast and touch a small portion of the database.
Unique Fact Tables
Used in OLAPs to store events and objects of interest for the analysis.
May be thought as representing a data cube, where the length, width and depth all represent different variables, like product, or dates, or store etc.
Star Schemas
One of the more common data warehouse architectures, containing unique fact tables which contain points in a data cube.
We could also have dimensional tables.
Dimensional Tables
Describe values along each axis.
Star Schema Example
Lets say we have the fact table Sales(productNo, date, store, price). We can split this further:
- productNo comes from Products(productNo, type, model)
- date comes from Days(date, day, week, month, year)
- store comes from Stores(name, city, country, phone) where “name” is being changed to store.
- price can be its own dependent attribute.
All of these attributes are called dimensions, and the dimension tables are what make up the fact table itself.
Denormalised Schemas
We save the data in multiple areas because we care more about making complex queries faster.
The main data is in one table, the fact table, and the rest of the data can be joined with fact table very quickly.
And all in all, joins are not required as much.
Slice and Dice technique
Another way to find data from a data cube.
Slicing
Narrows down our search of the data cube by slicing a section.
Dicing
Dices the selected section further into different sections of data.
Data mining
Extended form of OLAP.
Data mining objective
Data mining takes a lot of data and tries to get answers to questions that you care about.
Typical data mining queries
“Find factors that have had the most influence over sales of product X” rather than SQL queries.
Essentially, they are pure English queries where we don’t specify what we want, but we discover things directly from data.
Data mining applications
- Deviation detection -> identify anomalies.
- Link analysis -> discover links between attributes.
- Predictive modelling -> predict the future behaviour of certain attributes based upon past behaviour.
- Database segmentation -> group data by similar behaviour.
Types of discovered knowledge
- association rules
- classification hierarchies
- sequential patterns
- clustering
Classification hierarchies
An example could be mutual funds based on performance data characteristics such as growth, income etc. Essentially, it is ordering what we care about.
Sequential Patterns
An specific set of patterns, lets say A, B and C, leading to an outcome D.