text_mining Flashcards
what is text mining
The discovery of knowledge through text analysis
what are the text characteristics
High dimensionality, unstructured form, not readily accessible for use by a computer, huge collections of documents, words have positions and position matters
what are the steps of text mining
preprocessing, feature extraction, feature selection, discovery and interpretation
provide examples of preprocessing
stop word removal, stemming, removal of punctuation marks
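The preprocessing examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `toy_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's.

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to"}

# Suffixes removed by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation marks, remove stop words, then stem."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]
```

For instance, `preprocess("The miners are mining texts!")` drops "the" and "are", strips the exclamation mark, and reduces the remaining words to rough stems.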
what is transformation normalization
this includes document representation in the vector space model
inverse document frequency: normalizes word frequency over documents
frequency damping: normalizes word frequency within a document
the normalized frequency of a word is tf-idf; this normalized frequency can later be used in similarity measurement
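A minimal sketch of tf-idf vectors in the vector space model, followed by a cosine similarity between documents. The weighting used here, log(1 + count) for the damped term frequency and log(N / df) for the inverse document frequency, is one common variant chosen as an assumption; the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of tf-idf vectors)."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = []
        for term in vocab:
            tf = math.log(1 + counts[term])    # damped frequency within the document
            idf = math.log(n_docs / df[term])  # normalization over documents
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Documents that share rare words end up with a similarity close to 1, while documents with no words in common get 0.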
what is captured by tf-idf
if a word is rare in the corpus but frequent in a document, then it is interesting for that document
what is the idea behind the inverse document frequency
the more documents a word appears in, the less interesting it is; raw counts can overstate how interesting a word is, so we apply frequency damping by taking the log
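The effect of the log can be seen with a small sketch. The corpus size of 1000 documents and the formula log(N / df) are assumptions chosen for illustration; other idf variants exist.

```python
import math


def idf(n_docs, doc_freq):
    """Log-damped inverse document frequency: log(N / df)."""
    return math.log(n_docs / doc_freq)


n_docs = 1000  # hypothetical corpus size
for doc_freq in (1, 10, 100, 1000):
    raw_ratio = n_docs / doc_freq
    print(f"df={doc_freq:>4}  N/df={raw_ratio:>6.0f}  idf={idf(n_docs, doc_freq):.2f}")
```

The raw ratio N/df spans three orders of magnitude, while the damped value only ranges from about 6.9 down to 0. A word that appears in every document gets an idf of 0, so it contributes nothing to the tf-idf weight.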