Week 5 Flashcards
What does it mean that the miner is often the end user?
Data mining is often carried out by knowledgeable people within the business units themselves.
What is the output of regression data mining (belongs to predictions)?
A number
What does link analysis try to achieve?
Find links (relationships) between objects of interest.
What does the robustness of data mining refer to?
Its ability to overcome noisy data to make somewhat accurate predictions.
What does the accuracy of data mining refer to?
Its ability to predict the outcome of a previously unknown data set accurately.
In estimating the accuracy of data mining (or other) classification models, the true positive rate is
the number of correctly classified positives divided by the sum of correctly classified positives and false negatives (positives incorrectly classified as negative): TP / (TP + FN).
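A minimal sketch of the true positive rate calculation. The function name, the label encoding (1 = positive, 0 = negative), and the sample labels are assumptions made for illustration.

```python
def true_positive_rate(actual, predicted, positive=1):
    """Return TP / (TP + FN) for parallel lists of actual and predicted labels."""
    pairs = list(zip(actual, predicted))
    # True positives: actual positives that were predicted positive.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    # False negatives: actual positives that were predicted as something else.
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive cases in the actual labels")
    return tp / (tp + fn)


# Made-up labels: 4 actual positives, 3 of which the model found.
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]
rate = true_positive_rate(actual, predicted)
print(rate)
```

Note that the false positive in the last position does not affect the true positive rate, because only actual positives appear in the denominator.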
When would the iteration of steps 3 and 4 stop in K-means clustering?
Two answers:
- When the recalculation of center points no longer leads to a reassignment of data points.
- When a pre-defined number of iterations has been carried out.
Would the algorithm always show the same results if we keep K the same and all other parameters the same?
No, because the initial selection of cluster center points is random.
What is the output variable in Association?
There is no output variable.
What is the Euclidean distance?
Ordinary distance between two points that one would measure with a ruler.
What is the Manhattan distance?
The rectilinear distance (or taxicab distance) between two points.
It's the total travel distance if one can only move along grid lines.
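A small sketch comparing the two distance measures on the same pair of points. The function names and the sample points are chosen for illustration only.

```python
import math


def euclidean(p, q):
    """Straight-line ("ruler") distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Grid-line ("taxicab") distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


a, b = (0, 0), (3, 4)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7
```

The Manhattan distance is never shorter than the Euclidean distance, since moving along grid lines cannot beat the straight line.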
What can cluster analysis be used for?
Cluster analysis can be used for automatic identification of natural groupings of things.
What kind of learning does cluster analysis use?
Unsupervised learning
What kind of data set does supervised learning use?
A labeled data set
How does clustering work?
It works by learning the clusters of things from past data, then assigning new instances.
What are some of the use cases of clustering?
- Identify natural groupings of customers;
- Identify rules for assigning new cases to classes for targeting/diagnostic purposes;
- Provide characterization, definition, labeling of populations;
- Decrease the size and complexity of problems for other data mining methods;
- Identify outliers in a specific domain.
How is the optimal number of clusters determined?
There is no optimal way to calculate the number of clusters, hence heuristics are often used.
What does K-means clustering mean?
K-means clustering means that there is a pre-determined number of clusters.
What are the steps of K-means clustering?
- Determine the value of k;
- Randomly generate k points as initial cluster centers;
- Assign each point to the nearest cluster center;
- Re-compute the new cluster centers;
- Repeat steps 3 and 4 until some convergence criterion is met.
When is the convergence criterion met?
The convergence criterion is usually met when the assignment of points to clusters becomes stable.
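The K-means steps above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the function name `k_means`, the `max_iterations` and `seed` parameters, and the choice to draw the initial centers from the data points themselves are all assumptions.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=None):
    """Cluster a list of coordinate tuples into k groups."""
    rng = random.Random(seed)
    # Step 2: random initial centers (assumption: drawn from the data points).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):  # stop after a pre-defined number of iterations
        # Step 3: assign each point to the nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Convergence: the assignment of points to clusters is stable.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: re-compute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = k_means(points, k=2, seed=0)
print(centers, assignment)
```

Because the initial centers are random, which cluster gets label 0 or 1 can change between runs with different seeds, and on less clearly separated data the groupings themselves can differ.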
What is association rule mining?
It finds interesting relationships (affinities) between variables (items or events).
What kind of machine learning does association utilize?
Unsupervised.
What is the input of association?
Point-of-sale transaction data
What is the output of association?
Most frequent affinities among items.
What are the use cases for association for business and medicine?
Business: cross-marketing, cross-selling, store design, catalog design, e-commerce, site design, optimization of online advertising, product pricing, and sales promotion/configuration.
Medicine: relationships between symptoms and illnesses, diagnosis and patient characteristics and treatments, and genes and their functions.
What is support in association?
How often X and Y go together
Baskets with both X and Y divided by total baskets
What is the confidence in association?
How often Y appears in the baskets that contain X
Supp(X -> Y) divided by Supp(X)
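A worked sketch of support and confidence on a handful of made-up baskets. The function names and the basket contents are assumptions for illustration.

```python
def support(baskets, items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= set(basket)) / len(baskets)


def confidence(baskets, x, y):
    """Supp(X -> Y) / Supp(X): how often Y appears in baskets containing X."""
    return support(baskets, set(x) | set(y)) / support(baskets, x)


# Made-up point-of-sale transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

supp = support(baskets, {"bread", "milk"})       # 2 of 4 baskets
conf = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
print(supp, conf)
```

Confidence is not symmetric: here the rule milk -> bread has the same confidence only because bread and milk each appear in three baskets.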
Why is text mining needed to stay competitive?
Because 85-90 percent of corporate data is unstructured.
What is text mining?
A semi-automatic process of extracting knowledge from unstructured data sources.
What is the difference between data mining and text mining?
Text mining uses unstructured data.
What does text mining do?
Text mining imposes structure on the data and then mines the structured data.
What is structured data?
Composed of clearly defined data types whose pattern makes them easily searchable.
What is unstructured data?
Everything that is not structured data. It may have some internal structure but is not organized via pre-defined models or schemas. Usually not easily searchable.
What is semi-structured data?
Data that has some internal structure but also unstructured elements (e.g., emails, XML files).
What is the corpus?
A large and structured set of texts prepared for the purpose of conducting knowledge discovery.
What is tokenizing?
A token is a categorized block of text in a sentence. The assignment of meaning to blocks of text is known as tokenizing.
What are terms?
A term is a single word or multi word phrase extracted directly from the corpus of a specific domain by means of natural language processing (NLP) methods.
What are concepts?
Concepts are features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology. Compared to terms, concepts are the result of higher-level abstraction.
What is stemming?
The process of reducing inflected words to their stem (or base or root) form.
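A deliberately crude sketch of the idea behind stemming, using simple suffix stripping. This is not a real stemming algorithm such as Porter's; the suffix list, the minimum stem length, and the function name are assumptions made for illustration.

```python
# Hypothetical suffix list, checked longest first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["connected", "connecting", "connection", "connections", "connects"]
stems = {crude_stem(w) for w in words}
print(stems)  # all five inflected forms reduce to the same stem
```

A rule set this small will mis-stem many words, which is why real text mining tools rely on more elaborate stemmers.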
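What are stop words (and include words)?
Stop words are words that are filtered out prior to or after processing of natural language data. Usually a list of common words with little differentiating value, such as articles and auxiliary verbs (a, the, is, are). Include words are the opposite: a pre-defined list of terms to keep.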
What are synonyms?
Syntactically different words with identical or similar meanings.
What are polysemes?
Syntactically identical words with different meanings.
What is the term dictionary?
A collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus.
What is word frequency?
The number of times a word is found in a specific document.
What is natural language processing (NLP)?
A subfield of AI and computational linguistics that studies the understanding of the natural human language.
What is bag-of-words?
A representation of text as an unordered collection of its words, disregarding grammar and word order.
What are the applications of NLP?
- Language translation applications;
- Checking grammatical accuracy;
- Personal assistant applications.
What are the steps in the text mining process?
- Establish the corpus
- Create the term-document matrix
- Extract patterns/knowledge
What is the term-by-document Matrix (TDM)?
This matrix describes how often each term is mentioned in each document. Building it involves decisions such as which words to include (stop words, synonyms/homonyms, and stemming) and which kind of indices to use.
The dimensionality of the TDM can be reduced manually by an expert, or the matrix can be transformed using singular value decomposition (SVD).
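A minimal sketch of building a term-by-document matrix with raw word counts. The stop word list, the whitespace tokenizer, the sample corpus, and the orientation (rows are terms, columns are documents) are all assumptions for illustration; real pipelines also apply stemming and often use weighted indices instead of raw counts.

```python
from collections import Counter

# Hypothetical stop word list for this example only.
STOP_WORDS = {"the", "a", "is", "of", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def term_document_matrix(corpus):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in document j."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


corpus = ["the cat sat", "the cat and the dog", "a dog is a dog"]
terms, matrix = term_document_matrix(corpus)
for term, row in zip(terms, matrix):
    print(term, row)
```

Removing stop words before counting already reduces the number of rows, which is one simple way of keeping the matrix small.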