Week 1 Flashcards

1
Q

L 1.1.x

What are some alternative names for the process of data mining?

A

What are some alternative names for the process of data mining?

  • Knowledge Discovery in Databases
  • Knowledge Extraction
  • Data/Pattern Analysis
  • Data Archeology
  • Data Dredging
  • Information Harvesting
2
Q

Why do we need data mining?

A

Why do we need data mining?

To handle the explosive growth of data. By 2025 ~100 zettabytes of data will be generated worldwide.

1 zettabyte is 1 billion terabytes.
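
The unit conversion above can be checked with a line or two of arithmetic. This is a minimal sketch that assumes SI (decimal) prefixes; the constant names are illustrative, not from the lecture.

```python
# Assumption: decimal (SI) prefixes, so 1 TB = 10**12 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_TERABYTE = 10**12
BYTES_PER_ZETTABYTE = 10**21

# How many terabytes fit in one zettabyte, and in the ~100 ZB figure quoted above.
terabytes_per_zettabyte = BYTES_PER_ZETTABYTE // BYTES_PER_TERABYTE
terabytes_in_100_zb = 100 * terabytes_per_zettabyte
```

Under that assumption, `terabytes_per_zettabyte` is one billion, which matches the card.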

3
Q

What is one of the purposes of data mining?

A

What is one of the purposes of data mining?

To extract knowledge from the data.

4
Q

What are Jim Gray’s four paradigms of Science?

A

What are Jim Gray’s four paradigms of Science?

  1. Empirical/Experimental Science <1600’s
  2. Theoretical Science (1600 - 1950’s)
  3. Computational Science (1950’s - 1990’s)
  4. Data-Intensive Science “eScience” >2000’s
5
Q

What is another name for data mining?

A

What is another name for data mining?

Knowledge Discovery from Data

6
Q

Sunita Sarawagi’s definition of data mining?

A

Sunita Sarawagi’s definition of data mining?

The process of semi-automatically analyzing large databases to find patterns that are:
• Valid: hold on new data with some certainty
• Novel: non-obvious to the system
• Useful: should be possible to act on the item
• Understandable: humans should be able to interpret the pattern.

7
Q

Jiawei Han’s definition of data mining?

A

Jiawei Han’s definition of data mining?

The extraction of interesting [non-trivial, implicit, previously unknown, potentially useful] patterns or knowledge from huge amounts of data.

8
Q

Vipin Kumar’s definition of data mining?

A

Vipin Kumar’s definition of data mining?

Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

9
Q

Not everything is data mining. Give some examples.

A

Not everything is data mining. Give some examples.

  • Looking up a phone number in a phone directory.
  • Querying a search engine for data.
  • Deductive or Expert Systems.
10
Q

L 1.2.x

Name concepts related to but different from data mining.

A

Name concepts related to but different from data mining.

  • Machine Learning and Pattern Recognition
    • Techniques utilized in data mining processes.
  • RDBMS, Data Warehouses
    • The systems that support data mining.
  • Big Data Analytics, Data Science
    • Data mining is a key component of these broad fields.
  • Business Intelligence
    • The application of data mining in business
11
Q

What is the database view of data mining?

A

What is the database view of data mining?

The processes and techniques that connect data warehouses to the discovery of patterns.

  • Transactional/Operational databases to
  • Data Cleaning to
  • Data Warehousing to
  • Task-Relevant Data/Data Selection to
  • DATA MINING to
  • Pattern discovery to
  • Evaluation to
  • Knowledge
12
Q

What is the machine learning view?

A

What is the machine learning view?

  • Input Data - Pre-Processing
    • Data Integration
    • Extraction
    • Normalization
    • Feature Selection
    • Dimension Reduction
  • Model Training
    • Supervised Learning
    • Unsupervised Learning
    • Semi-supervised learning
    • Reinforcement Learning
  • Post-Processing
    • Model testing/evaluation
    • Model selection
    • Model interpretation
    • Model visualization
  • Classification/Prediction/Ranking
13
Q

What is the Business Intelligence view?

A

What is the Business Intelligence view?

Business Intelligence (bottom up)
• Lower level
	• Data sources
	• Preprocessing/ Integration/Warehousing
• Middle level
	• Data Mining
• Higher level
	• Knowledge evaluation
	• Presentation
	• Business Decisions
14
Q

What is the Human-Centered view (as presented by UM)?

A

What is the Human-Centered view?

  • For the People
    • Data mining for social good
    • End-user applications
  • Of the People
    • Bring users into the loop
    • Wisdom of the crowd
  • By the People
    • Information about people
    • Generated data
    • Process owned by people

• Selection, Detection, Characterization, Explanation, Prediction, Intervention

15
Q

What is the first dimension of data mining?

A

What is the first dimension of data mining?

The Data to be mined (inputs).

  • type/representation: vectors/matrices, sequences, time-series, spatiotemporal, data streams, or graphs
  • genre/application: transactional data, text and web, multimedia, social and information networks, biological data, or user behaviors
16
Q

What is the second dimension of data mining?

A

What is the second dimension of data mining?

Knowledge to be discovered

  • Data mining functionalities
    • lower-level output, such as patterns of data, the similarity of data, or association of data.
    • Decision-driven output, such as classification, clustering, trend/deviation, prediction, and outlier analysis.

• Can be descriptive or predictive

17
Q

What is the third dimension of data mining?

A

What is the third dimension of data mining?

Techniques Utilized

  • Data cubing, machine learning
  • statistics, pattern recognition,
  • user modeling, visualization,
  • data-intensive computing.
18
Q

What is the fourth dimension of data mining?

A

What is the fourth dimension of data mining?

Application of Data Mining

  • Retail: advertising, marketing
  • Telecommunications: spam call detection
  • Banking: loan approvals, credit score estimates
  • Social Networks: Facebook, Twitter
  • Scientific Discovery: Biological data mining
  • Web Search: smart question answering
  • Stock Market Analysis: stock picking
  • Text Mining: Natural Language Processing
  • Clinical: health informatics
19
Q

L 1.3.x

What is the name of the object that bridges the gap between typical data structures and those required for data mining?

A

What is the name of the object that bridges the gap between typical data structures and those required for data mining?

Data Representation (DR).

DR is a mathematical way to represent data.

20
Q

What are the three V’s of big data?

A

What are the three V’s of big data?

Volume, Variety, and Velocity

21
Q

What more about Data Formulation?

A

What more about Data Formulation?

  • There are more data science applications than one might expect.
  • There are not so many basic data types.
  • How do we abstract, formulate, and represent the data in real applications?
22
Q

What questions does a Data Scientist look to answer when observing data?

A

What questions does a Data Scientist look to answer when observing data?

  • What is the basic object of information?
  • What are the properties/attributes of the data object?
  • How are the attributes structured?
  • How are values assigned to the attributes?
  • How are the different data objects related?

A Data Scientist must be able to answer these questions in a mathematical way!

• This is the task and purpose of data representation.

23
Q

Name several types of data representation.

A

Name several types of data representation.

  • Item Set
  • Vector / Matrix
  • Sequences
  • Time Series
  • Spatial
  • Spatiotemporal
  • Graph / Network
  • Stream
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Describe an Item Set.

A

Describe an Item Set.

  • Data Object: shopping basket, text, BoD
  • Attributes: categorical items such as a product, a word (bag of words, word cloud), or a person
    • X = {x1, x2, x3, …, xk}
      • xi belongs to X if and only if that categorical item appears in the set.
      • Order and count do not matter.
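
The item set above maps naturally onto a built-in set type. This is a minimal sketch using Python's `set`; the basket contents are hypothetical examples, not from the lecture.

```python
# Hypothetical shopping baskets; the item names are illustrative only.
basket_a = {"milk", "bread", "eggs"}
basket_b = set(["eggs", "bread", "milk", "milk"])  # different order, one repeat

# Membership: an item belongs to X if and only if it appears in the set.
has_milk = "milk" in basket_a
has_tea = "tea" in basket_a

# Order and count do not matter, so both baskets are the same item set.
same_item_set = basket_a == basket_b
```

Because repeats collapse and order is ignored, `basket_a` and `basket_b` compare equal, which is the property the card describes.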
25
Q

Describe Vector data.

A

Describe Vector data.

  • Data Object: product ratings, course grades
  • Attributes: numerical properties/values representing the data object
    • X = [x1, x2, x3, …, xn] (X is written with an arrow over it)
    • xi is the numerical value of X at the ith dimension (attribute).
    • Each dimension corresponds to one specific attribute, so order matters.
    • Vector attributes may also represent text or categorical objects, using numeric values associated with the object such as count or rank.
26
Q

What is the difference between a vector and a matrix?

A

What is the difference between a vector and a matrix?

A matrix is a collection of vectors.
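
A small sketch of the vector/matrix relationship, using plain Python lists rather than a numerical library. The course names and grades are hypothetical.

```python
# Hypothetical course grades: each student is a vector, one dimension per course.
courses = ["math", "stats", "programming"]  # dimension i always means the same course
alice = [90.0, 85.0, 77.0]
bob = [60.0, 92.0, 88.0]

# Stacking the vectors row by row gives a matrix
# (rows = data objects, columns = attributes).
grades = [alice, bob]

n_rows = len(grades)
n_cols = len(grades[0])

# Reading one column picks out a single attribute across all objects.
stats_column = [row[courses.index("stats")] for row in grades]
```

Swapping two entries inside `alice` would change her meaning as a data object, which is why order is fixed for vectors but not for item sets.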

27
Q

Describe Sequence data.

A

Describe Sequence data.

  • Data Object: curriculum paths, DNA sequences, a session of search queries, a sentence of words, a trace of user actions.
  • Attributes: pairs of positions and categorical items, in a sequential order.
    • Intro to Python -> Python for Data Science -> Data Mining I -> Data Mining II
    • X = { (x1, 1), (x2, 2), …, (xk, k)}
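
The (item, position) formulation above can be sketched as follows, reusing the curriculum path from the card. The variable names are illustrative.

```python
path = ["Intro to Python", "Python for Data Science", "Data Mining I", "Data Mining II"]

# X = {(x1, 1), (x2, 2), ..., (xk, k)}: pair each item with its 1-based position.
X = [(item, position) for position, item in enumerate(path, start=1)]

# Unlike an item set, reordering the items gives a different sequence.
reversed_X = [(item, position) for position, item in enumerate(reversed(path), start=1)]
is_same_sequence = X == reversed_X
is_same_item_set = set(path) == set(reversed(path))
```

The reversed path contains the same items, so it is the same item set, but it is a different sequence.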
28
Q

Describe Time Series data.

A

Describe Time Series data.

  • Data Object: growth chart, stock price over time, battery life over time
  • Attribute: the measurement of a numerical property observed at a specific point in time.
  • X = {(x1, t1), (x2, t2), …, (xn, tn)} or x = f(t)
  • xi is the numerical measurement of a property of X observed at timestamp ti.
  • Order is mandatory
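
A minimal sketch of the time-series formulation above. The battery readings and the minute-based timestamps are hypothetical values chosen for illustration.

```python
# Hypothetical battery readings: (measurement in percent, timestamp in minutes).
readings = [(80.0, 30), (100.0, 0), (55.0, 90), (67.0, 60)]

# Order is mandatory, so sort the pairs by timestamp before using them.
X = sorted(readings, key=lambda pair: pair[1])

def f(t):
    """Return the measurement observed at timestamp t, i.e. x = f(t)."""
    lookup = {timestamp: value for value, timestamp in X}
    return lookup[t]

# Change between consecutive observations only makes sense on ordered data.
deltas = [later[0] - earlier[0] for earlier, later in zip(X, X[1:])]
```

Here `f` is only defined at observed timestamps; interpolating between them would be a separate modeling choice.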
29
Q

Describe Spatial and Spatiotemporal data.

A

Describe Spatial and Spatiotemporal data.

  • Object: GPS trajectory of a vehicle, the spread of disease, a heat map
  • Attribute: the numerical measurement of a property at a specific location is spatial data.
    • If time is also included in the measurement, the data is spatiotemporal.
    • 1D = name of location
    • 2D = LON and LAT (lambda = LON, phi = LAT)
      • X = {(x1, lambda1, phi1), (x2, lambda2, phi2), …, (xn, lambdan, phin)}
      • x = f(lambda, phi)
    • 3D = LON, LAT, time
      • X = {(x1, lambda1, phi1,t1), (x2, lambda2, phi2, t2), …, (xn, lambdan, phin, tn)}
      • x = f(lambda, phi, t)
    • 3D = LON, LAT, altitude
      • X = {(x1, lambda1, phi1, a1), (x2, lambda2, phi2, a2), …, (xn, lambdan, phin, an)}
      • x = f(lambda, phi, a)
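
The tuples above can be sketched with a small record type. The `Observation` class, the temperature readings, and the coordinates are all hypothetical and only illustrate the (x, lambda, phi, t) layout.

```python
from typing import NamedTuple

class Observation(NamedTuple):
    x: float    # measured value (here a hypothetical temperature in Celsius)
    lon: float  # lambda, longitude in degrees
    lat: float  # phi, latitude in degrees
    t: int      # timestamp (hypothetical hour of day)

# Spatiotemporal data: X = {(x1, lambda1, phi1, t1), ...}
X = [
    Observation(18.5, -83.74, 42.28, 9),
    Observation(21.0, -83.74, 42.28, 15),
    Observation(19.2, -83.05, 42.33, 9),
]

# Dropping the time attribute leaves purely spatial records (x, lambda, phi).
spatial_only = [(o.x, o.lon, o.lat) for o in X]

# x = f(lambda, phi, t), written as a lookup over the observed points.
f = {(o.lon, o.lat, o.t): o.x for o in X}
```

The same location appears twice with different timestamps, which is exactly what separates spatiotemporal data from spatial data.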
30
Q

Describe Graph or Network data.

A

Describe Graph or Network data.

  • Data Object: social network, the Internet
  • Attributes: nodes and links (edges)
    • Text can be represented in the nodes of a network.
  • G = (V, E)
    • V is a set of nodes (vertices, entities)
      • V = {v1, v2, …, vn}
      • A node can be a categorical item or a complex data object.
    • E is the set of links or edges (relations)
      • E = {(vi, vj), …, }
      • Each pair in E represents the nodes that are being linked.
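
A minimal sketch of G = (V, E) with built-in sets. The node names are hypothetical, and treating links as undirected is an assumption made here; the card does not say whether edges have direction.

```python
# G = (V, E) for a small hypothetical social network.
V = {"ann", "ben", "cat", "dan"}
E = {("ann", "ben"), ("ben", "cat"), ("ann", "cat")}

# Build an adjacency list, treating every link as undirected (an assumption).
adjacency = {v: set() for v in V}
for vi, vj in E:
    adjacency[vi].add(vj)
    adjacency[vj].add(vi)

# Degree: the number of links touching each node.
degree = {v: len(neighbors) for v, neighbors in adjacency.items()}
```

Note that `dan` is in V but appears in no pair of E, so a node can exist without any links.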
31
Q

Describe Stream data.

A

Describe Stream data.

  • Object: continuous arrivals with timestamps
    • email, news feeds,
  • Attributes: arrival time or order as one specific attribute
  • Formulation: (tk <= tk+1 <= tk+2)
  • D = {…, (Xk, tk), (Xk+1, tk+1), …, (Xn, tn), …}
    • Uppercase X as the object can be complex.
  • Xk can be any simple or complex data object.
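
The stream formulation above can be sketched with a generator that yields (Xk, tk) pairs. The email subjects, the timestamps, and the two-item window are hypothetical choices for illustration.

```python
from collections import deque

def stream():
    """Yield hypothetical (X_k, t_k) pairs with non-decreasing timestamps."""
    yield ({"subject": "hello"}, 1)
    yield ({"subject": "invoice"}, 4)
    yield ({"subject": "meeting"}, 4)
    yield ({"subject": "news"}, 9)

window = deque(maxlen=2)  # keep only the two most recent arrivals
timestamps = []
for obj, t in stream():
    # Check the ordering constraint t_k <= t_{k+1} as items arrive.
    assert not timestamps or timestamps[-1] <= t
    timestamps.append(t)
    window.append((obj, t))

latest_subjects = [obj["subject"] for obj, _ in window]
```

A bounded window is one common way to handle data that keeps arriving; here each Xk is a small dictionary, but it could be any simple or complex object.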
32
Q

What to know about Data Representations?

A

What to know about Data Representations?

  • Data Formulation is the first task of data mining.
  • Different representations of data may be applied.
  • How to represent data as item sets, vectors, matrices, time series, sequences, networks, or streams
  • The particular choice depends on the task and application.
33
Q

What is the first task in data mining?

A

What is the first task in data mining?

Data formulation is usually the first task of data mining.

34
Q

Recall two open questions in data mining.

A

Recall two open questions in data mining.

  1. Are there “basic” functionalities of data mining that most or all data mining tasks care about?
  2. Are there general ways to implement complex functionalities of data mining through basic functionalities?
35
Q

Describe two ways of decision-making.

A

Describe two ways of decision-making.

• Make decisions (classification, prediction, clustering, ranking, …) about an object
• Because it has a particular pattern similar to that of other objects
– or –
• Because it is similar or comparable to other objects

36
Q

Pattern and similarity are two basic outputs of data mining that can be what?

A

Pattern and similarity are two basic outputs of data mining that can be what?

  • applied to almost all data representations
  • can be used to build almost all other functionalities
    • May not be optimal!
37
Q

What is a pattern?

A

What is a pattern?

  • A structure of attributes that represents the intrinsic and important properties of data objects.
  • The particular mathematical formulation depends on the data representation (vector, item set, …)
38
Q

Name four similar concepts to patterns.

A

Name four similar concepts to patterns.

  • Property
  • Characteristic
  • Regularity
  • Feature
39
Q

In what ways can patterns be used in classification?

A

In what ways can patterns be used in classification?

  • Given labeled training examples, assign a new object to a class(es).
  • Examine multiple patterns to classify X.
  • Features can be combined with a machine learning algorithm.
40
Q

How are patterns used in clustering?

A

How are patterns used in clustering?

• To group data objects into classes with no predefined class or training examples.

41
Q

What is similarity?

A

What is similarity?

  • Similarity is the measure of how much two data objects are alike.
  • Distance measures the opposite: how much two objects differ from each other.
  • The measurement of similarity/distance depends on the data representation (item sets or vectors).
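
To illustrate how the measure depends on the representation, here is a sketch with one measure per representation. Jaccard similarity for item sets and Euclidean distance for vectors are common choices picked here as examples; the card does not name specific measures.

```python
import math

def jaccard_similarity(a, b):
    """Item sets: shared items divided by all distinct items."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

def euclidean_distance(u, v):
    """Vectors: straight-line distance; larger means less alike."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

sim = jaccard_similarity({"milk", "bread", "eggs"}, {"milk", "bread", "tea"})
dist = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```

The two baskets share two of four distinct items, and the two vectors are the endpoints of a 3-4-5 right triangle.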
42
Q

Name three methods for which similarity is useful.

A

Name three methods for which similarity is useful.

Similarity in Classification
• K-Nearest Neighbors

Similarity in Clustering
• minimize in-group distance
• maximize cross-group distance

Similarity in Ranking
• Query: objects closer to the query rank higher
• e.g. web query (search) results
• No Query: object X ranks higher if closer to many objects versus fewer.
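
The query-based ranking above can be sketched as a sort by distance to the query. The document vectors, the query vector, and the use of Euclidean distance are assumptions made for illustration.

```python
import math

# Hypothetical documents already represented as 2-D vectors.
documents = {"doc_a": (1.0, 1.0), "doc_b": (5.0, 5.0), "doc_c": (2.0, 1.0)}
query = (1.0, 2.0)

# With a query: objects closer to the query rank higher.
# math.dist computes Euclidean distance (Python 3.8+).
ranked = sorted(documents, key=lambda name: math.dist(documents[name], query))
```

The nearest document comes first, so `ranked` reads from most to least similar to the query.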

43
Q

What to know about patterns and similarity

A

What to know about patterns and similarity

  • Patterns and similarity are two basic data mining outputs.
  • Complex functionalities can be produced by patterns and similarities.
    • Classification, Clustering, Ranking
  • How the K-Nearest Neighbor classifier works.
  • The definition of pattern or similarity depends on the data representation.
    • Itemsets, vectors, etc.
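
A minimal sketch of how a K-Nearest Neighbor classifier can work on vector data. The function name `knn_classify`, the training points, the labels, k = 3, and Euclidean distance are all assumptions for illustration, not the lecture's implementation.

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    """Label new_point by majority vote among its k nearest training examples."""
    by_distance = sorted(train, key=lambda example: math.dist(example[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training examples: (vector, class label).
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"), ((9.0, 7.5), "high"), ((8.5, 9.0), "high"),
]

label_near_low = knn_classify(train, (1.2, 1.4))
label_near_high = knn_classify(train, (8.7, 8.1))
```

The classifier builds no model in advance: it relies only on similarity between the new object and the stored examples, which is why it is a natural example of similarity-based classification.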
44
Q

How might a pattern be defined or described?

A

How might a pattern be defined or described?

A pattern can be generally described as the structure of attributes that represents the intrinsic and important property of your data objects.

The formulation is dependent on the data representation (vector, itemset, etc.)

45
Q

What is similarity in terms of data mining?

A

What is similarity in terms of data mining?

Similarity is a measure of how much two data objects are like each other.

47
Q

What does Support represent?

A

What does Support represent?

The proportion of all transactions (items, objects, etc.) that satisfy the rule, i.e., that contain both X and Y: support(X->Y) = P(X and Y).

49
Q

How is Confidence defined?

A

How is Confidence defined?

The degree of certainty of the detected association.

50
Q

How is Confidence represented mathematically?

A

How is Confidence represented mathematically?

confidence (X->Y) = P(Y|X)
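
Support and confidence can be computed directly from a list of transactions. This is a sketch under the standard definitions (support as the fraction of transactions containing an itemset, confidence as a conditional probability); the transactions are hypothetical.

```python
# Hypothetical transactions, each represented as an item set.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread", "eggs"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """confidence(X -> Y) = P(Y | X) = support(X and Y) / support(X)."""
    return support(x | y) / support(x)

support_rule = support({"milk", "bread"})
confidence_rule = confidence({"milk"}, {"bread"})
```

Two of the four transactions contain both milk and bread, and two of the three milk transactions also contain bread, so the rule milk -> bread has support 1/2 and confidence 2/3.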

51
Q

So what is similarity in data mining?

A

So what is similarity in data mining?

Similarity is a measure of how much two data objects are like each other.

52
Q

Describe an important measure used in establishing similarity between data objects.

A

Describe an important measure used in establishing similarity between data objects.

Mathematical Distance between objects is a way of describing their similarity. The greater the distance measure, the less the similarity.

54
Q

What processes are typically involved in data mining activities?

A

What processes are typically involved in data mining activities?

As a knowledge discovery process, data mining typically involves data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation, and knowledge presentation.

55
Q

What makes a pattern interesting?

A

What makes a pattern interesting?

A pattern is interesting if it is valid on test data with some degree of certainty, novel, potentially useful (e.g., can be acted on or validates a hunch about which the user was curious), and easily understood by humans.

56
Q

What are the more advanced data types?

A

What are the more advanced data types?

Advanced data types include time-related or sequence data, data streams, spatial and spatiotemporal data, text and multimedia data, graph and networked data, and Web data.

57
Q

Define “Data Warehouse.”

A

Define “Data Warehouse.”

A data warehouse is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making.

58
Q

How is a data warehouse typically used?

A

How is a data warehouse typically used?

The data are stored under a unified schema and are typically summarized. Data warehouse systems provide multidimensional data analysis capabilities, collectively referred to as online analytical processing.

59
Q

What are data mining functionalities?

A

What are data mining functionalities?

Data mining functionalities are used to specify the kinds of patterns or knowledge to be found in data mining tasks.

60
Q

What are the types of data mining functionalities?

A

What are the types of data mining functionalities?

The functionalities include characterization and discrimination; the mining of frequent patterns, associations, and correlations; classification and regression; cluster analysis; and outlier detection.

61
Q

Data mining is comprised of which other domain technologies?

A

Data mining is comprised of which other domain technologies?

These technologies include:
• Statistics
• Machine Learning
• Database and Data Warehouse systems
• Information retrieval
62
Q

What are some of the challenges associated with data mining research?

A

What are some of the challenges associated with data mining research?

Challenge areas include:
  • Mining methodology
  • User interaction
  • Efficiency and scalability
  • Issues related to diverse data types
63
Q

Why are typical programming-language data structures (Python, JavaScript, Visual Basic) insufficient for data mining tasks?

A

Why are typical programming-language data structures (Python, JavaScript, Visual Basic) insufficient for data mining tasks?

Real-world data objects are far too complex to be represented within the limited confines of core language data structures.

64
Q

What inverse measure is used to quantify similarity?

A

What inverse measure is used to quantify similarity?

Distance is an inverse measure of similarity between two objects. The greater the distance the more dissimilar.

65
Q

What anomaly can distance be used to find?

A

What anomaly can distance be used to find?

Outliers: objects that lie beyond a threshold distance from most other objects in the dataset.
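
A minimal sketch of distance-based outlier detection. The function name `distance_outliers`, the threshold, the `min_fraction` parameter (how many of the other objects must be far away), and the sample points are all hypothetical choices.

```python
import math

def distance_outliers(points, threshold, min_fraction=0.5):
    """Return points that lie farther than threshold from more than
    min_fraction of the other points."""
    outliers = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(math.dist(p, q) > threshold for q in others)
        if far > min_fraction * len(others):
            outliers.append(p)
    return outliers

# Four points clustered near (1, 1) and one point far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (9.0, 9.0)]
found = distance_outliers(points, threshold=3.0)
```

Each clustered point is far from only one other point, while (9.0, 9.0) is far from all four, so only that point is reported.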