Revision Flashcards
What is a Data Cube?
- Data cubes are the building blocks of multidimensional models on which a data warehouse is based
- A data cube allows data to be modelled and viewed in multiple dimensions.
- A data cube is defined by dimensions and facts.
- Dimensions are the perspectives or entities with respect to which an organization wants to keep records. For example, if our data model were related to sales, we might have dimension tables such as item (item_name, brand, type) or time (day, week, month, quarter, year).
- The fact table will contain measures such as Euros_sold and keys to each of the related dimension tables. A data cube is really a lattice of cuboids.
What is a base cuboid?
The n-D cuboid at the lowest level of summarization, i.e. the cuboid over all n dimensions, e.g. the 4-D cuboid (time, item, location, supplier).
What is the apex cuboid?
Topmost 0-D cuboid which holds the highest level of summarization (all).
What is a data cube in terms of cuboids?
A lattice of cuboids.
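The lattice can be enumerated directly: an n-dimensional cube yields one cuboid per subset of its dimensions, 2^n in total. A minimal Python sketch (the dimension names are illustrative):

```python
from itertools import combinations

def cuboids(dimensions):
    """Enumerate the lattice of cuboids: one cuboid per subset of dimensions."""
    return [combo
            for k in range(len(dimensions) + 1)
            for combo in combinations(dimensions, k)]

lattice = cuboids(["time", "item", "location", "supplier"])
# () is the apex cuboid; the full 4-tuple is the base cuboid.
print(len(lattice))  # a 4-D cube has 2**4 = 16 cuboids
```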
What is a Fact Table?
A fact table is a large central table with the bulk of the data, contains no redundancy and is connected to a set of smaller attendant tables (dimension tables).
What is a distributive measure?
If the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning, e.g. count(), sum(), min(), max()
What is an algebraic measure?
If it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function, e.g. avg(), min_N(), standard_deviation().
What is a holistic measure?
If there is no constant bound on the storage size needed to describe a sub-aggregate. In other words, we need to look at all the data. E.g. median(), mode(), rank()
A distributive measure
An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire dataset (without partitioning), the function can be computed in a distributed manner.
An example of a distributive measure.
For example, sum() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing sum() for each subcube, and then summing up the partial sums obtained for each subcube. Hence, sum() is a distributive aggregate function. For the same reason, count(), min(), and max() are distributive aggregate functions.
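This partitioned computation of sum() can be sketched in a few lines of Python (the partitioning is simulated with two lists):

```python
def distributed_sum(partitions):
    """sum() is distributive: apply it per partition, then combine
    the partial sums with another sum()."""
    return sum(sum(part) for part in partitions)

data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:4], data[4:]]  # pretend each half lives on a different node
assert distributed_sum(partitions) == sum(data)  # identical to the unpartitioned result
```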
A measure is distributive if…
it is obtained by applying a distributive aggregate function.
Can distributive measures be computed efficiently?
Yes, because of the way the computation can be partitioned.
An example of an algebraic measure.
For example, avg() can be computed by sum()/count() where both sum() and count() are distributive aggregate functions.
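The avg() = sum()/count() decomposition can be sketched as follows: each partition ships back M = 2 distributive sub-aggregates, which are combined centrally.

```python
def partition_summary(part):
    # Each partition returns two distributive sub-aggregates.
    return (sum(part), len(part))  # (sum(), count())

def distributed_avg(partitions):
    # avg() is algebraic: combine the M = 2 sub-aggregates per partition.
    summaries = [partition_summary(p) for p in partitions]
    total = sum(s for s, _ in summaries)
    n = sum(c for _, c in summaries)
    return total / n

data = [2.0, 4.0, 6.0, 8.0]
assert distributed_avg([data[:2], data[2:]]) == sum(data) / len(data)
```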
Another example of an algebraic measure.
standard_deviation()
A measure is algebraic if…
it is obtained by applying an algebraic aggregate function.
Describe a holistic function.
An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank().
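A quick Python check illustrates why median() is holistic: combining per-partition medians does not, in general, recover the true median (the data below are made up).

```python
import statistics

data = [1, 2, 3, 4, 100, 101]
parts = [data[:3], data[3:]]

true_median = statistics.median(data)  # 3.5
# Naively combining sub-aggregates gives median(median([1,2,3]), median([4,100,101]))
# = median(2, 100) = 51, which is wrong: no fixed-size summary suffices.
median_of_medians = statistics.median(statistics.median(p) for p in parts)
assert median_of_medians != true_median
```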
A measure is holistic if…
it is obtained by applying a holistic aggregate function.
Explain why holistic measures are not desirable when designing a data warehouse.
Most large data cube applications require efficient computation of distributive and algebraic measures. Many efficient techniques exist for this. In contrast, it is difficult to compute holistic measures efficiently. Efficient techniques to approximate the computation of some holistic measures, however, do exist. For example, rather than computing the exact median(), techniques can be used to estimate the approximate median value for a large data set. In many cases, such techniques are sufficient to overcome the difficulties of efficient computation of holistic measures.
What are 2 costs of the Apriori algorithm?
The bottleneck of Apriori is candidate generation. Two costs:
- Possibly huge candidate sets
- Requires multiple scans of the database to count supports for each candidate by pattern matching
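A minimal Apriori sketch makes both costs visible: every level builds a candidate set, and counting support for it requires a full pass over the database. The item names are illustrative, and the prune step is omitted for brevity.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: each level generates candidates, then a
    full database scan counts their support (the bottleneck)."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    k, result = 1, {}
    while candidates:
        # Full scan of the database to count support by pattern matching.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = [c for c, n in counts.items() if n >= min_support]
        result.update({c: counts[c] for c in frequent})
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = list({a | b for a in frequent for b in frequent
                           if len(a | b) == k})
    return result

txns = [frozenset(t) for t in (["milk", "bread"],
                               ["milk", "bread", "butter"],
                               ["bread"])]
freq = apriori(txns, min_support=2)
```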
How does the Frequent Pattern Growth approach avoid the two costly problems of Apriori?
- Compresses a large database into a compact frequent pattern tree structure - highly condensed but complete for frequent pattern mining
- Avoids costly database scans
- Avoids candidate generation: sub-database test only
Why is the FP growth method compact?
- Reduces irrelevant information - infrequent items are gone.
- Frequency descending ordering - more frequent items are more likely to be shared
- Can never be larger than the original database (not counting node-links and counts)
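The first two points can be sketched directly: one scan counts item frequencies, infrequent items are dropped, and each transaction is reordered by descending frequency so that common prefixes can be shared near the FP-tree root. The transactions below are illustrative.

```python
from collections import Counter

transactions = [["f", "a", "c", "d", "g"],
                ["a", "b", "c", "f", "l"],
                ["b", "f", "h", "j"]]
min_support = 2

# First scan: count items; infrequent items cannot appear in any frequent pattern.
counts = Counter(i for t in transactions for i in t)
frequent = {i for i, n in counts.items() if n >= min_support}

# Frequency-descending order (ties broken alphabetically) maximizes shared prefixes.
ordered = [sorted((i for i in t if i in frequent),
                  key=lambda i: (-counts[i], i))
           for t in transactions]
```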
Notion of ‘closeness’ in K-NN.
A measure of distance between the test object and the training observations; the K closest observations vote on the class. For numeric attributes, the distance is usually Euclidean.
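A minimal K-NN sketch under these assumptions (Euclidean distance, majority vote; the training points are made up):

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote among its k nearest
    training observations under Euclidean distance."""
    by_distance = sorted(train, key=lambda xy: math.dist(xy[0], test_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
         ((5.0, 5.0), "B"), ((5.1, 4.8), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))  # "A" wins 2 of the 3 nearest votes
```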
What are the 3 main steps of a Data Mining process?
Assuming we have defined the problem around which we are developing a DM solution for, we can describe the 3 main steps of DM as:
- Data gathering + preparation
- Model building and evaluation
- Knowledge deployment
Describe the tasks that need to be performed at each step of the DM process.
- Data gathering + preparation = data access, data sampling, data transformation
- Model building + evaluation = create model, test model, evaluate model, interpret model
- Knowledge deployment = Apply model, custom reports, external applications
What is spatio-temporal data?
Spatiotemporal data are data that relate to both space and time. Spatiotemporal data mining refers to the process of discovering patterns and knowledge from spatiotemporal data. Typical examples of spatiotemporal data mining include:
- discovering the evolutionary history of cities and lands
- uncovering weather patterns
- predicting earthquakes and hurricanes
- determining global warming trends
Give an example of spatio-temporal data
Moving-object data, i.e. data about moving objects. E.g. telemetry equipment on wildlife to analyze ecological behaviour, mobility managers embed GPS monitors in cars to better monitor and guide vehicles, and meteorologists use weather satellites and radars to observe hurricanes.
What is sequence data?
- time series
- symbolic sequence
- biological sequences
What is Sequential pattern mining?
A symbolic sequence consists of an ordered set of elements or events recorded with or without a concrete notion of time. E.g. customer shopping sequence data, web click streams, biological sequences, etc. Because biological sequence data carry very complicated and hidden semantic meaning and pose many challenging research issues, most investigations are conducted in the field of bioinformatics. And so, sequential pattern mining has focused mostly on mining symbolic sequences. A sequential pattern is a frequent subsequence existing in a single sequence or a set of sequences.
Mining of sequential patterns involves mining the set of subsequences that are frequent in one sequence or a set of sequences.
What is Text Mining?
An interdisciplinary field that draws on Information Retrieval, Data Mining, Machine Learning, Statistics & Computational Linguistics.
List an important goal of Text Mining.
To derive high-quality information from text. High-quality usually refers to a combination of:
- relevance
- novelty
- interestingness
What is Information Retrieval?
IR is a field developed in parallel with database systems. Information is organized into a large number of documents.
IR problem: locating relevant documents based on user input, such as keywords or example documents.
Typical IR systems:
- Online library catalogue systems
- Online document management systems
IR vs DB systems
Some DB problems are not present in IR. e.g. update, transaction management, complex objects, concurrency control, recovery.
Some IR problems are not addressed well in DBMS. e.g. unstructured documents, approximate search using keywords and the notion of relevance.
Database Systems vs IR.
Database systems include the creation, maintenance and use of databases for organizations and end-users.
DB systems:
- highly recognized principles in data models, query languages, query processing and optimization methods, data storage, indexing and accessing methods.
- well-known for their scalability in processing very large, relatively structured data
IR:
- IR is the science of searching for documents or information in documents (documents can be text or multimedia, and may reside on the Web)
The difference between IR and traditional DB systems are twofold:
1) IR assumes that the data under search are unstructured
2) in IR, the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems)
What is a Data Warehouse?
A data warehouse integrates data from multiple sources and various timeframes. It consolidates data in multidimensional space to form partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.
What is a language model in IR?
The typical approaches in IR adopt probabilistic models. For example, a text document can be regarded as a bag of words, that is, a multiset of words appearing in the document. The document’s language model is the probability density function that generates the bag of words in the document. The similarity between two documents can be measured by the similarity between their corresponding language models.
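A very simple instance is a unigram model with cosine similarity between the resulting word distributions; the documents below are made up, and real systems add smoothing and much larger corpora.

```python
import math
from collections import Counter

def language_model(text):
    """Unigram language model: P(word) estimated from the bag of words."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def cosine_similarity(p, q):
    vocab = set(p) | set(q)
    dot = sum(p.get(w, 0) * q.get(w, 0) for w in vocab)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm

doc1 = language_model("data mining finds patterns in data")
doc2 = language_model("mining data for hidden patterns")
doc3 = language_model("the cat sat on the mat")
assert cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3)
```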
What is a topic model in terms of IR?
A topic in a set of text documents can be modelled as a probability distribution over the vocabulary. A text document, which may involve one or multiple topics, can be regarded as a mixture of multiple topic models.
By integrating IR models and DM techniques, we can find the major topics in a collection of documents, and for each document in the collection, the major topics involved.
What is a data warehouse?
Loosely speaking, a data warehouse refers to a data repository that is maintained separately from an organization’s operational databases.
- Data warehouse systems allow for integration of a variety of application systems.
- They support information processing by providing a solid platform of consolidating historic data for analysis.
- Data warehouses generalize and consolidate data in a multidimensional space
Define a data warehouse.
A data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model.
- provides online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining.
How can business analysts benefit from a data warehouse?
DWs provide the architecture and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
What’s a data warehouse?
It is a data repository architecture: a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision making.
William H. Inmon’s definition of a data warehouse
“A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.”
What are the main characteristics of a data warehouse?
- Subject-oriented
- Integrated
- Time-variant
- Non-volatile
What does ‘subject-oriented’ in terms of data warehouses mean?
A data warehouse is organized around major subjects, e.g. customer, supplier, product, sales
It excludes data that is not relevant to the decision-making process.
What does ‘integrated’ mean in terms of data warehouses?
That data come from multiple heterogeneous sources, e.g. flat files, relational databases, online transaction records. These data usually require data cleaning and integration techniques.
What does ‘time-variant’ in terms of data warehouses mean?
That data are stored to provide information from a historical perspective, e.g. past 5-10 years. Also, data is typically summarized, e.g. organized per item type or per region.
What does ‘non-volatile’ mean in terms of a data warehouses?
A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. It usually requires only two operations in data accessing: the initial loading of data and access of data.
Differences between traditional DBs and Data Warehouses.
DBs vs. Data Warehouses:
- OLTP vs. OLAP
- Day-to-day operations of an organization (purchasing, inventory, manufacturing, banking, payroll, registration, accounting) vs. serving users or knowledge workers in the role of data analysis and decision making
- Customer-oriented (used for transaction and query processing by clerks, clients, and IT professionals) vs. market-oriented (used for data analysis by knowledge workers, including managers, executives and analysts)
- Current data (typically too detailed to be easily used for decision making) vs. historic data
- Adopts an ER model and an application-oriented database design vs. a star or snowflake model and a subject-oriented database design
- Short, atomic transactions vs. facilitation of summarization and aggregation, storing and managing information at different levels of granularity
- Requires concurrency control and recovery mechanisms vs. mostly read-only operations
KDD
Knowledge discovery from data
What is meant by OLAP processing?
Analysis techniques with functionalities such as summarization, aggregation and consolidation, as well as the ability to view information from different angles.
What is Data Mining?
- misnomer -> should have been more appropriately named ‘knowledge mining from data’
- synonym for KDD
What are the steps involved in the KDD process?
- Data cleaning
- Data integration
- Data selection
- Data Transformation
- Data mining
- Pattern evaluation
- Knowledge representation
Data mining is just one step of the KDD process.
Enterprise Data Warehouse
Collects all info about subjects spanning the entire organization.
Data Mart
Subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
Virtual Warehouse
A set of views over operational databases -> requires excess capacity on databases
Only some of the possible summary views may be materialized.
Define Data Mining.
DM is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from large datasets.
Data Mining
The process of discovering interesting patterns and knowledge from large amounts of data.
What is DM not?
- (Deductive) query processing
- Expert systems or small ML/statistical programs
What kind of data is used in DM
- DBs
- DW
- Transaction data
- Other kinds of data
What is the process of constructing a Data Warehouse?
Data cleaning, integration, transformation, loading and periodic data refreshing.
What is a Data Cube?
A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count() or sum(sales_amount).
A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
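The idea can be sketched with a tiny fact table: precomputing cuboids is essentially running a group-by aggregation at different granularities. The sales rows and dimension names below are invented.

```python
from collections import defaultdict

sales = [  # (time, item, amount) fact rows
    ("Q1", "laptop", 100), ("Q1", "phone", 80),
    ("Q2", "laptop", 120), ("Q2", "phone", 90),
]

def aggregate(rows, dims):
    """Sum the measure over the chosen dimensions (i.e. compute one cuboid)."""
    cube = defaultdict(int)
    for time, item, amount in rows:
        key = tuple(v for d, v in (("time", time), ("item", item)) if d in dims)
        cube[key] += amount
    return dict(cube)

base = aggregate(sales, {"time", "item"})  # base cuboid: full detail
by_time = aggregate(sales, {"time"})       # 1-D cuboid, summarized over item
apex = aggregate(sales, set())             # apex cuboid: grand total
```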
Why are OLAP operations useful?
OLAP (online analytical processing) operations make use of background knowledge regarding the domain of the data being studied to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints.
Potential Applications of DM
Database analysis & decision support
> market analysis & management
> risk analysis & management
> fraud detection and management
Other applications
> Text mining (news group, email, documents) and Web analyses
> Intelligent query answering
What types of data do we perform DM on?
Besides relational database data, data warehouse data, and transaction data, there are many other kinds of data that have versatile forms and structures and rather different semantic meanings. These include:
- time-related or sequence data
- data streams
- spatial data
- engineering design data
- hypertext or multimedia data
- graph and network data
- the Web
List some DM functionalities
- Characterization & discrimination
- frequent pattern mining, associations and correlations
- classification & regression
- cluster analysis
- outlier analysis
Interesting patterns represent…?
Knowledge
What are frequent patterns?
- Patterns that occur frequently in the data
Many kinds:
- frequent itemsets
- frequent subsequences (sequential patterns)
- frequent substructures
What is a frequent itemset?
A set of items that often appear together in a transactional dataset, e.g. milk and bread are frequently bought together in grocery stores by many customers
What is frequent subsequence?
E.g. the pattern that customers tend to purchase a laptop first, then a digital camera, and then a memory card is a frequent sequential pattern.
What is a frequent sub-structure?
A substructure can refer to different structural forms (e.g. graphs, trees, lattices) that may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern.
Define support as a measure.
For example:
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
Here, a support of 1% means that 1% of all transactions under analysis show that computer and software are purchased together.
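Support and confidence for such a rule can be computed directly from transaction counts; the tiny transaction set below is invented.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """support = fraction of transactions containing A and C together;
    confidence = among transactions containing A, the fraction also containing C."""
    n = len(transactions)
    both = sum(antecedent | consequent <= t for t in transactions)
    ante = sum(antecedent <= t for t in transactions)
    return both / n, both / ante

txns = [frozenset(t) for t in (
    {"computer", "software"}, {"computer", "printer"},
    {"computer", "software", "mouse"}, {"printer"})]
support, confidence = support_and_confidence(
    txns, frozenset({"computer"}), frozenset({"software"}))
# computer appears in 3 of 4 transactions, both items in 2:
# support = 2/4 = 0.5, confidence = 2/3
```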