Chapter 2 Flashcards
Characteristics of data warehousing
- Subject oriented
- Integrated
- Time-variant (time series)
- Nonvolatile
- Web based
- Relational/multi-dimensional
- Client/server
- Real-time
- Include metadata
What is data?
A collection of facts usually obtained as the result of experiences, observations or experiments.
- the lowest level of abstraction
Data in Analytics can be categorized into:
- structured data
- unstructured or semi-structured data
Structured data can be categorized into:
- categorical
- numerical
Categorical data can be categorized into:
- nominal
- ordinal
Numerical data can be categorized into:
- interval
- ratio
Unstructured or semi-structured data can be categorized into:
- textual
- multimedia
- XML/JSON
Multimedia data can be categorized into:
- image
- audio
- video
What are the measures of centrality?
- arithmetic mean
- median
- mode
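A minimal sketch of three common measures of centrality (mean, median, mode), using Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # middle of the sorted values
mode_value = statistics.mode(values)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```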
What are the measures of dispersion?
- range
- variance
- standard deviation
- mean absolute deviation
- quartiles
- box plots
- shape of a distribution: skewness, kurtosis
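A rough sketch of several of these measures with Python's standard `statistics` module. The sample values are hypothetical, the population forms of variance and standard deviation are assumed, and skewness is computed here as the third standardized moment, which is only one of several conventions.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean_value = statistics.mean(values)

value_range = max(values) - min(values)   # largest minus smallest value
variance = statistics.pvariance(values)   # population variance (assumed form)
std_dev = statistics.pstdev(values)       # population standard deviation

# Mean absolute deviation: average distance of each value from the mean.
mean_abs_dev = sum(abs(v - mean_value) for v in values) / n

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4)

# Skewness as the third standardized moment (one common convention).
skewness = sum((v - mean_value) ** 3 for v in values) / n / std_dev ** 3

print(value_range, variance, std_dev, mean_abs_dev, (q1, q2, q3), skewness)
```

A positive skewness value suggests a longer tail on the right side of the distribution.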
Define data visualization
the use of visual representations to explore, make sense of, and communicate data
What is the role of dashboards?
they provide visual displays of important information that is consolidated and arranged on a single screen
What are the best practices in dashboard design?
- Benchmark KPIs with industry standards
- Wrap metrics with contextual metadata
- Validate the design with a usability specialist
- Prioritize and rank alerts and exceptions
- Pick the right visual constructs
- Provide guided analytics
What are types of Information Retrieval?
- Document Matching
- Link Analysis
- Search Engines
What are types of Web Mining?
- Web content mining
- Web structure mining
- Web usage mining
What are types of data mining?
- Classification
- Clustering
- Association
What are types of Natural Language Processing?
- POS Tagging
- Lemmatization
- Word Sense Disambiguation
What are types of Text Mining?
- Web Mining
- Data Mining
- Information Retrieval
- Natural Language Processing
Why is text difficult?
- often “unstructured”
- linguistic nature intended for humans, not for computers
- text is relatively “dirty”
- context is important
- goal is to turn text into feature-vector form
What is a Document?
one piece of text, no matter how large or small
What do individual tokens or terms compose?
a document
What is a collection of documents called?
a corpus
What are representation techniques?
- bag of words
- term frequency
- inverse document frequency
- TFIDF
- N-grams
Bag of words: what does it involve?
creating a “bag” or set of words from a text document
Bag of words: what does it only consider?
the presence or absence of words, not their sequence
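A minimal sketch of the bag-of-words idea, assuming simple lowercasing and whitespace tokenization (real pipelines usually do more cleaning). The function name `bag_of_words` and the example sentences are hypothetical.

```python
def bag_of_words(text):
    """Return the set of distinct words in a text, ignoring their order."""
    # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
    return set(text.lower().split())

doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# The two sentences mean different things, but their bags are identical
# because word sequence is discarded.
print(bag_a == bag_b)
print("dog" in bag_a, "bird" in bag_a)
```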
What is term frequency?
measure of how often a term (word) appears in a document
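One way to sketch term frequency, assuming the common normalization of dividing a term's count by the total number of tokens in the document (raw counts are another valid convention). The function name `term_frequency` is hypothetical.

```python
def term_frequency(term, tokens):
    """Share of the tokens in one document that equal `term`."""
    if not tokens:
        return 0.0  # assumption: an empty document gives a frequency of 0
    return tokens.count(term) / len(tokens)

# Hypothetical document, tokenized by whitespace.
tokens = "the cat sat on the mat".split()

tf_the = term_frequency("the", tokens)  # "the" appears 2 times in 6 tokens
tf_cat = term_frequency("cat", tokens)  # "cat" appears 1 time in 6 tokens
print(tf_the, tf_cat)
```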
What does inverse document frequency measure?
how important a term is across a collection of documents; terms that appear in fewer documents score higher
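A rough sketch of inverse document frequency, assuming the basic variant log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries often use smoothed variants. The function name and the tiny corpus are hypothetical.

```python
import math

def inverse_document_frequency(term, corpus):
    """log(N / df) for a corpus given as a list of token lists."""
    # Number of documents that contain the term at least once.
    doc_count = sum(1 for doc in corpus if term in doc)
    if doc_count == 0:
        return 0.0  # assumption: terms absent from the corpus score 0
    return math.log(len(corpus) / doc_count)

# Hypothetical corpus of three tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

idf_the = inverse_document_frequency("the", corpus)  # in every document
idf_cat = inverse_document_frequency("cat", corpus)  # in two documents
idf_dog = inverse_document_frequency("dog", corpus)  # in one document
print(idf_the, idf_cat, idf_dog)
```

A term found in every document scores 0 under this variant, while rarer terms score higher.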
What is TFIDF?
numerical statistic that combines the TF and IDF scores to reflect the importance of a term in a document within a larger collection
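A minimal sketch of TFIDF as the product TF x IDF, assuming TF is the term count divided by document length and IDF is log(N / df). Both are common textbook choices, not the only ones, and the helper names and corpus are hypothetical.

```python
import math

def tf(term, tokens):
    # Assumed TF: term count divided by the number of tokens in the document.
    return tokens.count(term) / len(tokens) if tokens else 0.0

def idf(term, corpus):
    # Assumed IDF: log(N / df); terms absent from the corpus score 0.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, tokens, corpus):
    # High when a term is frequent in this document but rare in the corpus.
    return tf(term, tokens) * idf(term, corpus)

# Hypothetical corpus of three tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

score_the = tf_idf("the", corpus[1], corpus)  # common everywhere
score_dog = tf_idf("dog", corpus[1], corpus)  # distinctive for document 1
print(score_the, score_dog)
```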
What is the role of N-grams?
they capture local patterns and relationships between adjacent elements in a sequence
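A short sketch of extracting word-level N-grams with a sliding window, assuming whitespace tokenization. The function name `ngrams` and the example sentence are hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sentence, tokenized by whitespace.
tokens = "new york is big".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

The bigram ("new", "york") keeps together two words that a plain bag of words would treat as unrelated.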