Midterm Flashcards

1
Q

5 V’s of Big Data

A

Value - Turning big data into value
Velocity - The speed at which data is generated and at which changes occur across the diverse data sets
Volume - The amount of data being generated
Variety - Data can be structured as well as unstructured
Veracity - Data reliability and trust

2
Q

Data Mining

A

Extraction of interesting patterns or knowledge from huge amounts of data

3
Q

Web Mining Framework

A

Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation

AKA

Data pre-processing
Data Mining
Post-processing
Patterns, Info, Knowledge

4
Q

Data Mining on what data?

A
  • Text files
  • Database-oriented data sets and applications
  • Advanced data sets and advanced applications
5
Q

Supervised learning (classification)

A

Supervision: The training data are accompanied by labels indicating the class of the observations
- New data are classified based on the training set

6
Q

Unsupervised learning (clustering)

A
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc. - try to establish the existence of classes or clusters in the data
7
Q

Classification and label prediction

A
  • construct models based on some training examples
  • describe and distinguish classes or concepts for future prediction
  • predict the class, classify the new example
8
Q

Regression

A
  • Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
9
Q

Attribute

A

A property or characteristic of an object (columns)

10
Q

Object

A

An object is described by a collection of attributes (rows)

11
Q

Types of Data sets

A

Record (Data matrix, documents, transactions)
Graph ( World Wide Web, molecular structures)
Ordered (spatial data, temporal data, sequential data, genetic sequence data)
Structured vs unstructured data

12
Q

Important characteristics of structured data

A

Dimensionality - Many attributes per object
Sparsity - only presence counts
Resolution - Patterns depend on the scale
Distribution

13
Q

Types of Attributes

A

Nominal - ID numbers, gender, zip codes
Ordinal - rankings, grades, height in {tall, medium, short}

Numeric Attribute Types:
Interval - measures on a scale of equal-sized units
Ratio - Inherent zero-point

14
Q

Properties of Attribute Values

A

The type of an attribute depends on which of the following properties/operations it possesses:
Distinctness
Order
Differences are meaningful
Ratios are meaningful

15
Q

Discrete vs Continuous Attributes

A

Discrete Attribute - Has only a finite or countably infinite set of values
- Sometimes represented as integer variables
- countable
- number of students, shoe size

Continuous attribute - measurable
- height, weight, length
- represented as floating-point variables

16
Q

Similarity and Dissimilarity Measures

A

Similarity - numerical measure of how alike two data objects are
- Value is higher when objects are more alike
- Often falls in the range [0,1]

Dissimilarity - numerical measure of how different two data objects are
- Value is lower when objects are more alike
- minimum dissimilarity is often 0

Proximity refers to a similarity or dissimilarity

17
Q

Cosine Similarity

A

Cosine measure can be used to measure the similarity between 2 document vectors
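As a sketch, the cosine measure can be computed by hand (the toy term-frequency vectors below are illustrative):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two hypothetical document vectors (term frequencies)
d1 = [3, 2, 0, 5]
d2 = [1, 0, 0, 2]
print(round(cosine_similarity(d1, d2), 3))  # values near 1 mean similar documents
```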

18
Q

What is frequent pattern analysis?

A

Frequent pattern: a pattern that occurs frequently in a data set
Motivation: Finding inherent regularities in data

19
Q

(absolute) support, or support count of X is

A

The frequency or occurrence count of an itemset X

20
Q

(relative) support

A

Is the fraction of transactions that contains X

21
Q

An itemset X is frequent IF

A

X’s support is no less than a minimum support threshold

22
Q

support s is the probability that

A

a transaction contains X ∪ Y (i.e., both X and Y)

23
Q

confidence c, conditional probability that a transaction

A

containing X also contains Y

24
Q

Frequent itemsets

A
  • An itemset that contains k items is a k-itemset
  • rules that satisfy the minimum support and minimum confidence thresholds are considered strong rules
25
Q

Basic association rule process

A
  1. Find all frequent itemsets - each of these itemsets must occur at least as frequently as predetermined by the minimum support count
  2. Generate strong association rules from the frequent itemsets: These rules must satisfy the minimum support and minimum confidence
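The two steps above rest on the support and confidence measures from the previous cards; a minimal sketch (the toy transactions are illustrative):

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(itemset):
    # relative support: fraction of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    # conf(X => Y) = support(X ∪ Y) / support(X)
    return support(x | y) / support(x)

print(support({"milk", "bread"}))       # 0.5
print(confidence({"milk"}, {"bread"}))  # 2/3: of the 3 transactions with milk, 2 have bread
```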
26
Q

Apriori: A candidate generation and test approach

A

If there is any itemset which is infrequent, its superset should not be generated/tested
- in other words, all subsets of a frequent itemset must be frequent

27
Q

General apriori method:

A
  • scan dataset to get frequent 1-itemsets
  • generate length (k+1) candidate itemsets from length k frequent itemsets
  • test the candidates against dataset to obtain support counts
  • terminate when no frequent or candidate set can be generated
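The steps above can be sketched in Python (a minimal levelwise implementation; the toy transactions are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: levelwise candidate generation, pruning, and support test."""
    n = len(transactions)
    # Scan once for frequent 1-itemsets
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(1 for t in transactions if i in t) / n >= min_support}
    all_frequent = set(freq)
    k = 1
    while freq:
        # Join: build (k+1)-candidates from pairs of frequent k-itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Scan: keep candidates that meet min_support
        freq = {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}
        all_frequent |= freq
        k += 1
    return all_frequent

freq = apriori([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}], min_support=0.5)
print(sorted(sorted(s) for s in freq))  # [['a'], ['a', 'b'], ['a', 'c'], ['b'], ['c']]
```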
28
Q

Major Tasks in Data Preprocessing

A

Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data Integration
- Integration of multiple databases, data streams or files
Data reduction
- Dimensionality reduction
- Numerosity reduction
Data transformation and data discretization
- Normalization

29
Q

Data Cleaning

A

incomplete, noisy, inconsistent

30
Q

How to handle missing data?

A

Ignore the record, or fill in the missing value automatically: with a constant like NA, with the attribute mean, or with the attribute mean for all samples belonging to the same class (the smartest approach)
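A sketch of the class-conditional mean fill (the helper name and toy rows are my own):

```python
def fill_missing_with_class_mean(rows):
    """Fill None (missing) values with the attribute mean of samples in the same class."""
    # rows are (value, class_label) pairs; a None value means the attribute is missing
    sums, counts = {}, {}
    for value, label in rows:
        if value is not None:
            sums[label] = sums.get(label, 0) + value
            counts[label] = counts.get(label, 0) + 1
    return [(value if value is not None else sums[label] / counts[label], label)
            for value, label in rows]

rows = [(10, "a"), (None, "a"), (20, "a"), (4, "b"), (None, "b")]
filled = fill_missing_with_class_mean(rows)
print(filled)  # the missing "a" value becomes 15.0, the missing "b" value becomes 4.0
```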

31
Q

How to handle noisy data?

A
  • Binning - first sort data and partition into equal-frequency bins
  • Regression - smooth by fitting the data into regression functions
  • Clustering - detect and remove outliers
  • Combined computer and human inspection - detect suspicious values and manually check
32
Q

What is data integration?

A

Combining data from multiple sources into a coherent dataset
Schema integration - integrate metadata from different sources

33
Q

Handling Redundancy in Data Integration

A

Object identification
Derivable data
Redundant attributes may be detected by correlation analysis and covariance analysis

34
Q

Correlation Analysis (Nominal Data)

A

Chi-squared test:
χ² = Σ (Observed − Expected)² / Expected
The larger the χ², the more likely the variables are related
Correlation does not imply causality
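The test can be sketched for a 2x2 contingency table, with expected counts derived from the row and column totals (the observed counts below are hypothetical):

```python
def chi_squared_2x2(table):
    """Pearson chi-squared for a 2x2 contingency table [[a, b], [c, d]]."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total  # E = (row total * col total) / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# A large chi-squared suggests the two attributes are related (not that one causes the other)
print(round(chi_squared_2x2([[250, 200], [50, 1000]]), 2))  # 507.94
```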

35
Q

Covariance

A

How much do attributes change together
Positive covariance - if Cov(A,B) > 0, then A and B both tend to be larger than their expected values
Negative covariance - if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value
Independence - Cov(A,B) = 0 (but zero covariance does not imply independence)
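A sketch of the (population) covariance computation (the toy series are illustrative):

```python
def covariance(a, b):
    # Cov(A, B) = mean of (A - mean_A) * (B - mean_B)
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

# Two series that tend to rise together => positive covariance
print(round(covariance([2, 3, 5, 4, 6], [5, 8, 10, 11, 14]), 2))  # 4.0
```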

36
Q

Data Reduction

A

Obtain a reduced representation of the data set that is much smaller in volume, but produces the same analytical results

37
Q

Normalization is

A

Scaling data to fall within a smaller, more specified range
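For example, min-max normalization rescales values into a chosen range (a minimal sketch; the function name is my own):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```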

38
Q

Sampling

A

Main technique for data reduction
- Typically used because obtaining and processing the entire set of data of interest is too expensive or time consuming

39
Q

Types of Sampling

A
  • Simple random sampling - There is an equal probability of selecting any particular item
  • Sampling without replacement - Once an object is selected, it is removed from the population
  • Sampling with replacement - a selected object is not removed from the population, so it may be drawn again
  • Stratified sampling - partition the data set and draw samples from each partition
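The sampling schemes above can be sketched with the standard library (the toy population and the strata cut-off are my own):

```python
import random

random.seed(0)  # for repeatability
data = list(range(100))

# Simple random sampling without replacement: each object can be drawn at most once
srs = random.sample(data, 10)

# Sampling with replacement: a selected object stays in the population
with_repl = [random.choice(data) for _ in range(10)]

# Stratified sampling: partition the data set, then draw from each partition
strata = {"low": [x for x in data if x < 50], "high": [x for x in data if x >= 50]}
stratified = [random.choice(group) for group in strata.values() for _ in range(5)]
```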
40
Q

Curse of Dimensionality

A

When dimensionality increases, data becomes increasingly sparse in the space that it occupies

41
Q

Discretization

A
  • The process of converting a continuous attribute into an ordinal attribute
  • A potentially infinite number of values are mapped into a small number of categories
  • Discretization is commonly used in classification
42
Q

Binning

A

Equal-width: partition based on a set bin width; equal-frequency: partition based on the number of values per bin
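Both variants in a short sketch (function names and toy values are my own):

```python
def equal_width_bins(values, k):
    # partition the value range into k bins of equal width
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    # assign (roughly) the same number of values to each bin, by sorted position
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for pos, idx in enumerate(order):
        bins[idx] = pos * k // len(values)
    return bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))      # [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_frequency_bins(prices, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```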

43
Q

Unsupervised discretization

A

Finds breaks in the data values

44
Q

Supervised discretization

A

Uses class labels to find breaks

45
Q

Binarization

A
  • Maps a continuous or categorical attribute into one or more binary values
  • Typically used for association analysis
  • continuous to categorical then categorical to binary
  • Association analysis needs asymmetric binary attributes
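A sketch of the categorical-to-binary step via one-hot encoding (the toy values are illustrative):

```python
def binarize(values):
    """Map each categorical value to a one-hot (asymmetric binary) vector."""
    categories = sorted(set(values))  # one binary attribute per category
    return [[1 if v == c else 0 for c in categories] for v in values]

# categories sort as ["green", "red"]
print(binarize(["red", "green", "red"]))  # [[0, 1], [1, 0], [0, 1]]
```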