Week 5 Flashcards
what is data mining?
it focuses on better understanding the characteristics of, and patterns among, variables in large databases, using a variety of analytical and statistical tools
what is classification?
it's a data mining approach: the process of analysing data to predict how to classify a new data element, e.g. spam filtering in email
what is cluster analysis?
also known as data segmentation, it is a collection of techniques that seek to group or segment a collection of observations into subsets whose members are highly similar to one another
- cluster analysis is a data reduction technique: it can take a large number of observations, such as survey responses, and reduce the information into smaller, homogeneous groups
what is data exploration and reduction?
it often involves identifying groups whose elements are similar in some way, and is often used to understand differences among people
- this technique often breaks large data sets down into smaller samples that provide clearer insights
what is dirty data?
data sets that have missing values or errors are ‘dirty’ and need to be ‘cleaned’ prior to analysing them
how can dirty data be cleaned?
- eliminate the records that have missing data
- estimate reasonable values for missing observations, e.g. the mean
- errors can be identified by looking at outliers
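The first two cleaning options above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the `records` list, the `age` field, and the helper names `drop_missing` and `impute_mean` are all made up for the example, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing value (an assumption).
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 28},
    {"id": 4, "age": 40},
]

def drop_missing(rows, field):
    """Option 1: eliminate the records that have a missing value in `field`."""
    return [r for r in rows if r[field] is not None]

def impute_mean(rows, field):
    """Option 2: replace missing values in `field` with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

dropped = drop_missing(records, "age")
imputed = impute_mean(records, "age")
```

Dropping records is simpler but shrinks the data set; imputing keeps every record but adds values that were never actually observed.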
what is an outlier?
- An observation that differs markedly from the rest of the data
- Can arise for a variety of reasons, e.g., incorrect recording of an observation
- Can make a significant difference in the statistical analysis and results
- Should not blindly be eliminated as it might indicate a deficiency with the model
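One way to surface candidate outliers for review is the 1.5 x IQR rule. The rule is a common convention rather than something these notes specify, and the function name `flag_outliers` and the sample `readings` are hypothetical. In line with the last point above, the sketch only flags values; it does not delete them.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    Assumption: the 1.5 * IQR rule is used here as one common convention.
    Flagged values are candidates for inspection, not automatic removal.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings with one suspicious entry.
readings = [10, 12, 11, 13, 12, 11, 95]
suspects = flag_outliers(readings)
```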
what is hierarchical clustering?
the data isn’t partitioned into a particular number of clusters in a single step; instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object
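The series of partitions can be illustrated with a small bottom-up (agglomerative) sketch, which runs the series in the other direction: from n single-object clusters to one cluster containing everything. Single linkage, one-dimensional points, and the names `single_linkage` and `agglomerate` are assumptions chosen to keep the example short.

```python
def single_linkage(a, b):
    """Distance between two clusters: the smallest gap between any pair of members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    """Merge the two closest clusters repeatedly, recording every partition."""
    clusters = [[p] for p in points]          # start: n clusters of one object each
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history                             # end: one cluster with all objects

points = [1.0, 1.5, 5.0, 5.2, 9.0]            # hypothetical observations
history = agglomerate(points)
```

Each entry in `history` is one partition, so the user can pick the level of the hierarchy (number of clusters) after the fact.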
what is k-means clustering?
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
- used in marketing and healthcare
what are the elements of effective segmentation?
- Measurable: be able to quantify the size
- Substantial: large enough to warrant separate treatment
- Differentiable: Exclusive, each segment reacts differently
- Actionable: Should be possible to develop actions (e.g., sales and marketing) for different segments
what is the process of k-means clustering?
Simple algorithm:
1. Randomly assign each observation to one of the k clusters
2. Iterate until the cluster assignments stop changing:
   i. For each of the k clusters, compute the centroid
   ii. Reassign each observation to the closest centroid
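The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the function names, the `seed` and `max_iter` parameters, and the handling of an empty cluster (re-seeding it with a random observation) are choices made for the example and are not part of the algorithm as stated in the notes.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(observations, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in observations]
    centroids = []
    for _ in range(max_iter):
        # Step 2(i): compute the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(observations, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random observation.
            centroids.append(centroid(members) if members else rng.choice(observations))
        # Step 2(ii): reassign each observation to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in observations
        ]
        if new_labels == labels:   # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Hypothetical two-dimensional observations.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = k_means(data, k=2)
```

Because the starting assignment is random, different seeds can give different final clusters, so the algorithm is usually run several times and the best result kept.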
what is the euclidean distance?
The most commonly used measure of distance between objects is Euclidean distance. It extends, to any number of dimensions, the way the distance between two points on a plane is computed as the hypotenuse of a right triangle
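As a small worked example, the distance can be computed by summing the squared differences along each dimension and taking the square root. The function name `euclidean` and the sample points are made up for illustration; `math.dist` from the standard library (Python 3.8 and later) does the same calculation.

```python
import math

def euclidean(p, q):
    """Square root of the summed squared differences between coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two points on a plane: the legs of the right triangle are 3 and 4.
d_plane = euclidean((0, 0), (3, 4))

# The same formula works with more dimensions.
d_space = euclidean((1, 2, 3), (4, 6, 3))

# Cross-check against the standard-library implementation.
same = math.isclose(d_plane, math.dist((0, 0), (3, 4)))
```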
what is classification?
these methods seek to assign a categorical outcome to one of two or more categories based on various data attributes
before building a model how can we partition data?
into a training data set and a validation data set
what is a training data set?
it contains records with known outcomes and is used to teach (train) a data mining algorithm
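Partitioning the data into training and validation sets can be sketched as a shuffle followed by a cut. The 70/30 split, the fixed `seed`, and the function name `partition` are assumptions for the example; the notes do not state a particular ratio.

```python
import random

def partition(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records, then cut it into training and validation sets."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(records)           # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records, identified here only by number.
records = list(range(10))
train, validation = partition(records)
```

Shuffling before the cut matters when the records are stored in a meaningful order (for example by date), since otherwise the two sets could differ systematically.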