Week 5 Flashcards

1
Q

what is data mining?

A

Data mining focuses on better understanding the characteristics of, and patterns among, variables in large databases, using a variety of analytical and statistical tools.

2
Q

what is classification?

A

Classification is a data-mining approach that analyses data to predict how to classify a new data element, e.g. spam filtering in email.

3
Q

what is cluster analysis?

A

Cluster analysis, also known as data segmentation, is a collection of techniques that seek to group or segment a collection of observations into subsets whose members are highly similar to one another.
- Cluster analysis is a data reduction technique: it can take a large number of observations, such as survey responses, and reduce the information into smaller, homogeneous groups.

4
Q

what is data exploration and reduction?

A

Data exploration and reduction often involves identifying groups whose elements are similar in some way; it is often used to understand differences among people.
- These techniques often break down large data sets into smaller subsets that provide clearer insights.

5
Q

what is dirty data?

A

Data sets that have missing values or errors are ‘dirty’ and need to be ‘cleaned’ before they are analysed.

6
Q

how can dirty data be cleaned?

A
  • eliminate the records that have missing data
  • estimate reasonable values for missing observations, e.g. the mean
  • identify errors by looking at outliers
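
The three cleaning options above can be sketched in Python. This is a minimal illustration, not a prescribed method: the record layout (dicts with None for missing values), the function names, and the two-standard-deviation rule for flagging outliers are all assumptions made for the example.

```python
from statistics import mean, pstdev

def drop_missing(records):
    """Eliminate records that contain any missing (None) value."""
    return [r for r in records if None not in r.values()]

def impute_mean(records, field):
    """Replace missing values of `field` with the mean of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

def flag_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean.

    The threshold is an assumed rule of thumb; flagged values should be
    inspected, not deleted automatically.
    """
    centre, spread = mean(values), pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - centre) / spread > threshold]

# Hypothetical survey records with one missing age.
records = [
    {"age": 20, "income": 100},
    {"age": None, "income": 200},
    {"age": 40, "income": 300},
]
cleaned = drop_missing(records)
filled = impute_mean(records, "age")
suspects = flag_outliers([10, 11, 9, 10, 12, 11, 10, 95])
```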
7
Q

what is an outlier?

A
  • An observation that differs markedly from the rest of the data
  • Can arise for a variety of reasons, e.g., incorrect recording of an observation
  • Can make a significant difference in the statistical analysis and results
  • Should not be blindly eliminated, as it might indicate a deficiency in the model
8
Q

what is hierarchical clustering?

A

In hierarchical clustering the data are not partitioned into particular clusters in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object.
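
One way to picture the series of partitions is the sketch below, which runs in the merging direction (from n single-object clusters down to one cluster). It assumes Euclidean distance and a "closest pair of members" rule for choosing which clusters to merge; both choices, and the function name, are assumptions for illustration only.

```python
import math

def agglomerative_partitions(points):
    """Return every partition produced, from n singleton clusters to one cluster."""
    clusters = [[p] for p in points]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest to each other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history

history = agglomerative_partitions([(0, 0), (0, 1), (5, 5)])
for step in history:
    print(len(step), "clusters:", step)
```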

9
Q

what is k-means clustering?

A

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
- used in marketing and healthcare

10
Q

what are the elements of effective segmentation?

A
  • Measurable: be able to quantify the size
  • Substantial: large enough to warrant separate treatment
  • Differentiable: Exclusive, each segment reacts differently
  • Actionable: Should be possible to develop actions (e.g.,
    sales and marketing) for different segments
11
Q

what is the process of k-means clustering?

A

k-means clustering
Simple algorithm:
1. Randomly assign each observation to one of the k clusters
2. Iterate until the cluster assignments stop changing:
   i. For each of the k clusters, compute the centroid
   ii. Reassign each observation to the cluster with the closest centroid
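
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: points are tuples of numbers, closeness is Euclidean distance, and an empty cluster is given a randomly chosen observation as its centroid. The function name, the seed, and the iteration cap are hypothetical choices, not part of the card.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2(i): compute the centroid (mean point) of each cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
            elif centroids[c] is None:
                # Assumption: a cluster that starts empty borrows a random observation.
                centroids[c] = rng.choice(points)
        # Step 2(ii): reassign each observation to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

Because the starting assignment is random, different seeds can give different final clusters, so it is common to run the procedure several times and compare the results.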

12
Q

what is the euclidean distance?

A

The most commonly used measure of distance between objects is Euclidean distance. This is an extension of the way in which the distance between two points on a plane is computed as the hypotenuse of a right triangle.
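
As a small worked example, the distance is the square root of the sum of squared differences across every coordinate. The function name below is a hypothetical choice for illustration.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# On a plane this is the hypotenuse of a 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0

# The same formula extends to more than two dimensions.
print(euclidean_distance((1, 2, 3), (1, 2, 3)))  # 0.0
```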

13
Q

what is classification?

A

these methods seek to classify a categorical outcome into one of two or more categories based on various data attributes

14
Q

before building a model how can we partition data?

A

Into a training data set and a validation data set.
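
A minimal sketch of such a partition is shown below. The 50/50 split in the usage example, the shuffling step, and the function name are assumptions for illustration; the card does not specify how the split is made.

```python
import random

def partition(records, train_fraction=0.6, seed=0):
    """Shuffle a copy of the records, then cut it into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))  # stand-in for ten data records
train, validation = partition(records, train_fraction=0.5)
print(len(train), len(validation))
```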

15
Q

what is a training data set?

A

Training data sets have known outcomes and are used to teach a data mining algorithm.

16
Q

what is a validation data set?

A

This data set is often used to fine-tune models.
- Some data miners additionally use a test data set to assess performance

17
Q

what are the three different data-mining approaches used for classification?

A

k-nearest neighbours, discriminant analysis and logistic regression.

18
Q

what is the k-nearest neighbours (K-NN) algorithm?

A
  • K-NN is a classification scheme that attempts to find records in a database that are similar to one we wish to classify. Similarity is based on the “closeness” of a record to the numerical predictors in the other records.
    e.g. In the Credit Approval Decisions database, we have the predictors Homeowner, Credit Score, Years of Credit History, Revolving Balance, and Revolving Utilization. We seek to classify the decision to approve or reject the credit application.
19
Q

how do you work out the K-nearest neighbour?

A
  • For a new data point you want to classify, check the K nearest neighbours
  • Distance is measured using the Euclidean distance
  • The category held by the majority of those neighbours is assigned to the new data point
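
The three steps above can be sketched as follows. The training records and labels are made up for illustration (they are not taken from the Credit Approval Decisions database), and the function name and the default k=3 are assumptions.

```python
import math
from collections import Counter

def knn_classify(training, new_point, k=3):
    """training is a list of (features, category) pairs."""
    # Find the k records closest to the new point by Euclidean distance.
    neighbours = sorted(training, key=lambda rec: math.dist(rec[0], new_point))[:k]
    # Assign the category held by the majority of those neighbours.
    votes = Counter(category for _, category in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-predictor records.
training = [
    ((1, 1), "approve"),
    ((1, 2), "approve"),
    ((2, 1), "approve"),
    ((8, 8), "reject"),
    ((9, 8), "reject"),
]
print(knn_classify(training, (1.5, 1.5)))
print(knn_classify(training, (8.5, 8)))
```

In practice the predictors are usually rescaled first, since a predictor measured in large units would otherwise dominate the distance.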
20
Q

what is discriminant analysis?

A

A technique for classifying a set of observations into predefined classes.

21
Q

what is logistic regression?

A

Logistic regression is a variation of ordinary regression in which the dependent variable is categorical. The independent variables may be continuous or categorical, as in the case of ordinary linear regression. However, whereas multiple linear regression seeks to predict the numerical value of the dependent variable Y based on the values of the independent variables, logistic regression seeks to predict the probability that the output variable will fall into a category based on the values of the independent (predictor) variables. This probability is used to classify an observation into a category.
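
The last two sentences can be illustrated with a short sketch for a two-category outcome. It assumes the model has already been fitted: the intercept and coefficients below are made-up values, and the 0.5 cutoff and function names are assumptions chosen for the example.

```python
import math

def predict_probability(x, intercept, coefficients):
    """Probability that the outcome falls in category 1, via the logistic function."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, intercept, coefficients, cutoff=0.5):
    """Turn the predicted probability into a category using an assumed cutoff."""
    return 1 if predict_probability(x, intercept, coefficients) >= cutoff else 0

# Hypothetical fitted model with a single predictor.
intercept, coefficients = -1.0, [1.0]
print(predict_probability([2.0], intercept, coefficients))
print(classify([2.0], intercept, coefficients), classify([0.0], intercept, coefficients))
```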

22
Q

what is association rule mining?

A

Association rule mining, often called affinity analysis, seeks to uncover interesting associations and/or correlation relationships among large sets of data. Association rules identify attributes that occur frequently together in a given data set. A typical and widely used example of association rule mining is market basket analysis.
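
A minimal market-basket sketch: count how often each pair of items appears in the same basket and keep the pairs that occur frequently. The baskets are made up, and the function name and the minimum-share threshold (commonly called "support", a term the card does not use) are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support=0.5):
    """Return item pairs with the share of baskets that contain both items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
pairs = frequent_pairs(baskets)
print(pairs)
```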