Week 1 Flashcards

Answer 1

Describe Vector data.
* Data Object: product ratings, course grades
* Attributes: numerical properties/values representing the data object
* X = [x1, x2, x3, …, xn] (X is written with an arrow over it)
* xi is the numerical value of X at the ith dimension (attribute)
* Each attribute belongs to one dimension, and the order of the dimensions is fixed.
* Vector attributes may also represent text or categorical objects, with numeric values associated with the object such as count or rank.
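As a minimal sketch of the vector representation above, a data object can be stored as an ordered list in which position i holds attribute xi. The course names and grades here are hypothetical values chosen for illustration only.

```python
# Hypothetical example: one student's course grades as a vector.
# The dimension order is fixed, so index i always refers to the same course.
dimensions = ["Intro to Python", "Data Mining I", "Data Mining II"]
x = [88.0, 92.5, 79.0]  # x[i] is the numerical value at the i-th dimension

# Look up an attribute through the name of its dimension.
grade = x[dimensions.index("Data Mining I")]
print(grade)
```

Keeping the dimension names in a separate list is just one possible design; it makes the "order is specific" requirement explicit.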

Answer 2

What is the difference between a vector and a matrix? A vector is a single ordered list of values; a matrix is a collection of vectors.

Answer 3

Describe Sequence data.
* Data Object: curriculum paths, DNA sequences, a session of search queries, a sentence of words, a trace of user actions.
* Attributes: pairs of positions and categorical items, in a sequential order.
* Example: Intro to Python -> Python for Data Science -> Data Mining I -> Data Mining II
* X = {(x1, 1), (x2, 2), …, (xk, k)}
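The (item, position) formulation can be sketched in a few lines, using the curriculum path from the card. The variable names `path` and `X` are illustrative assumptions.

```python
# Build X = {(x1, 1), (x2, 2), ..., (xk, k)} from an ordered list of items.
path = [
    "Intro to Python",
    "Python for Data Science",
    "Data Mining I",
    "Data Mining II",
]

# Pair every categorical item with its 1-based position in the sequence.
X = [(item, position) for position, item in enumerate(path, start=1)]

for item, position in X:
    print(position, item)
```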

Answer 4

Describe Time Series data.
* Data Object: growth chart, stock price over time, battery life over time
* Attribute: the measurement of a numerical property observed at a specific point in time.
* X = {(x1, t1), (x2, t2), …, (xn, tn)} or x = f(t)
* xi is the numerical measurement of a property of X observed at timestamp ti.
* Order is mandatory
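A small sketch of the two formulations, the set of (xi, ti) pairs and the function x = f(t). The battery readings are made-up numbers and `is_time_ordered` is a hypothetical helper name.

```python
# Hypothetical battery-life readings as (x_i, t_i) pairs, with t in minutes.
readings = [(97.0, 0), (91.5, 60), (84.0, 120)]

def is_time_ordered(series):
    """Return True if timestamps never decrease (order is mandatory)."""
    return all(a[1] <= b[1] for a, b in zip(series, series[1:]))

# The x = f(t) view: map each observed timestamp to its measurement.
f = {t: x for x, t in readings}

print(is_time_ordered(readings), f[60])
```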

Answer 5

Describe Spatial and Spatiotemporal data.
* Object: GPS trajectory of a vehicle, the spread of disease, a heat map
* Attribute: the numerical measurement of a property at a specific location (spatial data).
* If time is also included in the measurement, the data is spatiotemporal.
* 1D = name of location
* 2D = LON and LAT (lambda = LON, phi = LAT)
  * X = {(x1, lambda1, phi1), (x2, lambda2, phi2), …, (xn, lambdan, phin)}
  * x = f(lambda, phi)
* 3D = LON, LAT, time
  * X = {(x1, lambda1, phi1, t1), (x2, lambda2, phi2, t2), …, (xn, lambdan, phin, tn)}
  * x = f(lambda, phi, t)
* 3D = LON, LAT, altitude
  * X = {(x1, lambda1, phi1, a1), (x2, lambda2, phi2, a2), …, (xn, lambdan, phin, an)}
  * x = f(lambda, phi, a)

Answer 6

Describe Graph or Network data.
* Data Object: social network, the Internet
* Attributes: nodes and links (edges)
* Text can be represented in the nodes of a network.
* G = (V, E)
* V is a set of nodes (vertices, entities)
  * V = {v1, v2, …, vn}
  * A node can be a categorical item or a complex data object.
* E is the set of links or edges (relations)
  * E = {(vi, vj), …}
  * Each pair in E represents the nodes that are being linked.
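The G = (V, E) formulation maps directly onto two Python sets. This is only a sketch: the node names are hypothetical, and treating each edge as undirected is an assumption the card does not state.

```python
# Hypothetical social network: V is the node set, E the set of linked pairs.
V = {"ann", "bob", "cy"}
E = {("ann", "bob"), ("bob", "cy")}

def neighbors(node, edges):
    """Return the nodes linked to `node`, treating edges as undirected (assumption)."""
    linked = set()
    for vi, vj in edges:
        if vi == node:
            linked.add(vj)
        elif vj == node:
            linked.add(vi)
    return linked

print(sorted(neighbors("bob", E)))
```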

Answer 7

Describe Stream data.
* Object: continuous arrivals with timestamps
  * email, news feeds
* Attributes: arrival time or order as one specific attribute
* Formulation: tk <= tk+1 <= tk+2 (the subscripts are k, k+1, k+2)
  * D = {…, (Xk, tk), (Xk+1, tk+1), …, (Xn, tn), …}
* The uppercase X indicates that the object can be complex.
* Xk can be any simple or complex data object.
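One way to picture a stream is as a generator that yields (Xk, tk) pairs one at a time, with a consumer that checks the ordering constraint as items arrive. The function names `arrivals` and `count_in_order` and the email-like objects are hypothetical.

```python
def arrivals():
    """Hypothetical stream: yields (X_k, t_k) pairs in arrival order."""
    yield {"subject": "hello"}, 1
    yield {"subject": "news"}, 4
    yield {"subject": "re: hello"}, 4  # equal timestamps are allowed (t_k <= t_k+1)

def count_in_order(stream):
    """Consume the stream one object at a time, checking t_k <= t_k+1."""
    count = 0
    last_t = None
    for obj, t in stream:
        if last_t is not None and t < last_t:
            raise ValueError("timestamps must be non-decreasing")
        last_t = t
        count += 1
    return count

print(count_in_order(arrivals()))
```

Because the consumer never holds the whole stream in memory, the same loop works for arrivals that never end.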

Answer 8

What to know about Data Representations?
* Data formulation is the first task of data mining.
* Different representations of data may be applied.
* How to represent data as item sets, vectors, matrices, time series, sequences, networks, or streams
* The particular choice depends on the task and application.

Answer 9

What is the first task in data mining? Data formulation is usually the first task of data mining.

Answer 10

Recall two open questions in data mining.
1. Are there "basic" functionalities of data mining that most or all data mining tasks care about?
2. Are there general ways to implement complex functionalities of data mining through basic functionalities?

Answer 11

Describe two ways of decision-making.
Make decisions (classification, prediction, clustering, ranking, …) about an object:
* Because it has a particular pattern similar to that of another object(s)
-- or --
* Because it is similar or comparable to another object(s)

Answer 12

Pattern and similarity are two basic outputs of data mining that can be what?
* Applied to almost all data representations
* Used to build almost all other functionalities
* May not be optimal!

Answer 13

What is a pattern?
* A structure of attributes that represents the intrinsic and important properties of data objects.
* The particular mathematical formulation depends on the data representation (vector, item set, …)

Answer 14

Name four similar concepts to patterns.
* Property
* Characteristic
* Regularity
* Feature

Answer 15

In what ways can patterns be used in classification?
* Given labeled training examples, assign a new object to a class(es).
* Examine multiple patterns to classify X.
* Features can be combined with a machine learning algorithm.

Answer 16

How are patterns used in clustering?
* To group data objects into classes with no predefined class or training examples.

Answer 17

What is similarity?
* Similarity is the measure of how much two data objects are alike.
* Distance measures the opposite: how much two objects differ from each other.
* The measurement of similarity/distance depends on the data representation (item sets or vectors)

Answer 18

Name three methods for which similarity is useful.
* Similarity in Classification
  * K-Nearest Neighbors
* Similarity in Clustering
  * minimize in-group distance
  * maximize cross-group distance
* Similarity in Ranking
  * Query: objects closer to the query rank higher (e.g. web search results)
  * No Query: object X ranks higher if it is close to many objects rather than few.
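To illustrate how similarity drives classification, here is a minimal K-Nearest Neighbors sketch. It assumes vector data, Euclidean distance, and a simple majority vote; the training points, labels, and the name `knn_classify` are hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """Label `query` by majority vote among its k closest labeled examples."""
    # Sort the labeled examples by Euclidean distance to the query.
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training examples: (vector, class label).
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

print(knn_classify((1.1, 1.0), training))
```

Other distance measures or weighted votes could be swapped in; the choice depends on the data representation.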

Answer 19

What to know about patterns and similarity
* Patterns and similarity are two basic data mining outputs.
* Complex functionalities can be produced by patterns and similarities.
  * Classification, Clustering, Ranking
* How the K-Nearest Neighbor classifier works.
* The definition of pattern or similarity depends on the data representation.
  * Itemsets, vectors, etc.

Answer 20

How might a pattern be defined or described? A pattern can be generally described as the structure of attributes that represents the intrinsic and important property of your data objects. The formulation is dependent on the data representation (vector, itemset, etc.)

Answer 21

What is similarity in terms of data mining? Similarity is a measure of how much two data objects are like each other.

Answer 22

What does Support represent? The proportion of all items (objects, transactions, etc.) that satisfy a rule requirement. For a rule X -> Y, this is the proportion of all transactions that contain both X and Y.

Answer 23

How is Confidence defined? The degree of certainty of the detected association.

Answer 24

How is Confidence represented mathematically? confidence(X -> Y) = P(Y | X) = support(X and Y together) / support(X)
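A short worked sketch of support and confidence, assuming the standard association-rule definitions over a list of transactions. The shopping baskets and the function names are hypothetical.

```python
# Hypothetical transactions, each represented as an item set.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Estimate P(Y | X) as support(X and Y together) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

Here bread appears in 3 of 4 transactions and bread with milk in 2 of 4, so the confidence of bread -> milk is 2/3.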

Answer 25

So what is similarity in data mining? Similarity is a measure of how much two data objects are like each other.

Answer 26

Describe an important measure used in establishing similarity between data objects. Mathematical Distance between objects is a way of describing their similarity. The greater the distance measure, the less the similarity.
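The inverse relationship between distance and similarity can be sketched as follows. Euclidean distance is used as one example of a mathematical distance, and the conversion 1 / (1 + d) is just one common convention, assumed here for illustration rather than taken from the card.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Assumed conversion: similarity shrinks toward 0 as distance grows."""
    return 1.0 / (1.0 + euclidean(a, b))

near = similarity((0, 0), (1, 1))
far = similarity((0, 0), (3, 4))
print(near > far)
```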

Answer 27

What processes are typically involved in data mining activities? As a knowledge discovery process, data mining typically involves data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation, and knowledge presentation.

Answer 28

What makes a pattern interesting? A pattern is interesting if it is valid on test data with some degree of certainty, novel, potentially useful (e.g., can be acted on or validates a hunch about which the user was curious), and easily understood by humans.

Answer 29

What are the more advanced data types? Advanced data types include time-related or sequence data, data streams, spatial and spatiotemporal data, text and multimedia data, graph and networked data, and Web data.

Answer 30

Define "Data Warehouse." A data warehouse is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making.

Answer 31

How is a data warehouse typically used? The data are stored under a unified schema and are typically summarized. Data warehouse systems provide multidimensional data analysis capabilities, collectively referred to as online analytical processing.

Answer 32

What are data mining functionalities? Data mining functionalities are used to specify the kinds of patterns or knowledge to be found in data mining tasks.

Answer 33

What are the types of data mining functionalities? The functionalities include characterization and discrimination; the mining of frequent patterns, associations, and correlations; classification and regression; cluster analysis; and outlier detection.

Answer 34

Data mining draws on which other domain technologies?
These technologies include:
* Statistics
* Machine Learning
* Database and Data Warehouse systems
* Information retrieval

Answer 35

What are some of the challenges associated with data mining research?
Challenge areas include:
* Mining methodology
* User interaction
* Efficiency and scalability
* Issues related to diverse data types

Answer 36

Why are typical programming-language data structures (Python, JS, VB) insufficient for data mining tasks? Real-world data objects are far too complex to be represented within the limited confines of core language data structures.

Answer 37

What inverse measure is used to quantify similarity? Distance is an inverse measure of similarity between two objects. The greater the distance, the more dissimilar the objects.

Answer 38

What anomaly can distance be used to find? Outliers: objects within a dataset that are beyond a threshold distance from most other objects of the dataset.
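A minimal sketch of this idea, reading "most other objects" as "more than half of the other objects" (an assumption), with Euclidean distance and a hypothetical set of 2D points. The name `distance_outliers` is illustrative.

```python
import math

def distance_outliers(points, threshold):
    """Return points that lie beyond `threshold` from most other points."""
    outliers = []
    for i, p in enumerate(points):
        # Count how many of the other points are farther away than the threshold.
        far = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) > threshold
        )
        if far > (len(points) - 1) / 2:
            outliers.append(p)
    return outliers

# Hypothetical data: four clustered points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(distance_outliers(points, threshold=3.0))
```

This pairwise check is quadratic in the number of points, which is fine for a small illustration but not for large datasets.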

(62 cards)