BIS II - Data, Dashboards & Visual Analytics with Tableau Flashcards
Performance Dashboards
- Provide visual displays of important information
- That is consolidated and arranged on a single screen
- So that the information can be easily digested at a glance, drilled into and further explored
What to look for in a dashboard
- Use of visual components to highlight data and exceptions
- Transparent to the user, i.e. require minimal training and are easy to use
- Combine data from a variety of systems into a single, unified view of the business
- Enable drill-down or drill-through to underlying data sources/reports
- Present a dynamic, real-world view with timely data
- Require little coding to implement/deploy/maintain
Best practices in Dashboard Design
- Benchmark KPIs with industry standards
- Wrap the metrics with contextual data
- Validate the design by a usability specialist
- Prioritize and rank alerts and exceptions
- Enrich dashboard with business-user comments
- Present information in three different levels
- Pick the right visual constructs
- Provide for guided analytics
Text Mining & Text Analytics
Data in DSS – Data
- Is a collection of facts usually obtained as the result of experiences, observations or experiments
- May consist of numbers, words, images, …
- Is the lowest level of abstraction (from which information and knowledge are derived)
Why text is important
• Text is everywhere
o Many legacy applications still produce text
o Medical records (hand-written)
o Consumer complaint logs
o Customer reviews
• 85% of corporate data is stored in some unstructured form, doubling every 18 months
• Text mining can be applied in many contexts, e.g. measuring the impact of online word of mouth on sales, classifying and filtering junk e-mail, or natural language processing (NLP), as in Alexa
Why text is difficult
• Often referred to as “unstructured data”
• Has a linguistic structure – intended for human consumption, not for computers.
o Words have varying lengths
o Text fields can have varying numbers of words
o Sometimes word order matters, sometimes not
• Text is relatively “dirty”, because people…
o Write ungrammatically
o Misspell words
o Abbreviate unpredictably
o … etc.
• Context is important
Representation of Textual Data
• Goal: to turn the text into a feature-vector form
• General strategy: to use the simplest (least expensive) technique
• Terminology that is borrowed from information retrieval:
o Document = one piece of text, no matter how large or small
o A document is composed of individual tokens or terms, e.g. a word is a token
o Corpus = a collection of documents
• Representation techniques:
1. Bag of words
2. Term frequency
3. Sparseness: inverse document frequency, TFIDF
4. N-grams
1) Bag of Words
Approach:
• Treats each document as just a collection of individual words
• Ignores grammar, word order, sentence structure and punctuation
• Straightforward and inexpensive to generate
• Works well for many tasks
Application case: Spam filtering:
• Represent e-mail messages as unordered bag of words
• Then compare them to the typical “spam” bag of words, e.g. containing “Viagra”, “Stock”, “buy”
• Where there is a big overlap, the message is classified as spam e-mail
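The spam-filtering case above can be sketched in a few lines of Python. This is only a minimal illustration: the spam vocabulary `SPAM_BAG`, the 0.5 threshold and the function names are assumptions made for the example, not part of the course material.

```python
import re

def bag_of_words(text):
    """Return the set of lowercase word tokens; order and grammar are discarded."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

# Hypothetical "typical spam" bag of words, chosen only for illustration.
SPAM_BAG = {"viagra", "stock", "buy"}

def spam_overlap(message, spam_bag=SPAM_BAG):
    """Fraction of the spam vocabulary that also occurs in the message."""
    return len(bag_of_words(message) & spam_bag) / len(spam_bag)

def is_spam(message, threshold=0.5):
    """Classify as spam when the overlap reaches the (assumed) threshold."""
    return spam_overlap(message) >= threshold

print(is_spam("Buy cheap Viagra now!"))    # True: 2 of 3 spam words present
print(is_spam("Meeting moved to Monday"))  # False: no overlap
```

A real filter would learn the spam vocabulary and the threshold from labelled e-mails rather than hard-coding them.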
2) Term frequency
The next step is to use each word's count in the document instead of just a zero or one (absent or present)
Usually, the following steps are performed:
• Normalization: every term is in lowercase. E.g. iPhone, iphone, IPHONE -> iphone
• Stemming: suffixes are removed, plurals are turned to the singular forms. E.g. announces, announced, announcing -> announc; directors -> director
• Removal of stopwords: very common words in the language being parsed are dropped. E.g. the, and, of, on in English
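The three steps above can be sketched as follows. The stopword list is deliberately tiny and `crude_stem` is a hypothetical stand-in for a real stemmer (e.g. Porter's); both are assumptions made for this example. Stopwords are removed before stemming here so that the list can contain ordinary, unstemmed words.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "on"}  # tiny illustrative list

def crude_stem(word):
    """Very rough suffix stripping; not a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # normalization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return Counter(crude_stem(t) for t in tokens)       # stemming + counting

print(term_frequencies("The directors announced and the director announces"))
# Counter({'director': 2, 'announc': 2})
```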
3) Sparseness: inverse document frequency - IDF and TFIDF
a. Take into account the distribution of the term over a corpus as well
b. The term should not be too rare and not too common
c. Impose upper & lower limits of term frequency
d. Inverse document frequency:
i. IDF(t) = 1 + log(total # of documents / # of documents containing t)
ii. IDF may be thought of as the boost a term gets for being rare
e. TFIDF: is a product of term frequency (TF) and inverse document frequency (IDF)
i. TFIDF is specific to a single document, whereas IDF depends on the entire corpus
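A small worked sketch of IDF and TFIDF, under stated assumptions: the natural logarithm is used (the notes do not fix a base), TF is the raw count of the term in the document, and the four-document corpus is invented for illustration.

```python
import math
from collections import Counter

def idf(term, corpus):
    """IDF(t) = 1 + log(total documents / documents containing t), natural log assumed."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with TF taken as the raw count in d."""
    return Counter(doc)[term] * idf(term, corpus)

# Invented corpus: each document is already a list of terms.
corpus = [
    ["jazz", "music", "jazz"],
    ["music", "concert"],
    ["stock", "market"],
    ["jazz", "club"],
]

print(idf("music", corpus))              # 1 + ln(4/2), term in 2 of 4 documents
print(idf("stock", corpus))              # 1 + ln(4/1), rarer term gets a larger boost
print(tfidf("jazz", corpus[0], corpus))  # count of 2 in the first document times IDF
```

The same term gets a different TFIDF value in each document, while its IDF stays fixed for the whole corpus.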
4) N-gram sequences
a. Used when word order is important & the information about it should be preserved
b. N-grams = sequences of adjacent words, which are included as terms
c. E.g. “The quick brown fox jumps” (after removing the stopword “The”) -> {quick, brown, fox, jumps}, {quick_brown, brown_fox, fox_jumps}, {quick_brown_fox, brown_fox_jumps}
d. Advantage: easy to generate, requires no linguistic knowledge
e. Disadvantage: greatly increases the size of the feature set -> needs special consideration for dealing with massive numbers of features and the associated computation and storage
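The fox example can be reproduced with a short sketch. The helper name `ngrams` and the underscore join are choices made for this illustration; the stopword “The” is assumed to have been removed beforehand.

```python
def ngrams(tokens, n):
    """Return all sequences of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["quick", "brown", "fox", "jumps"]  # "The" already removed as a stopword

# Combine single words, adjacent pairs and adjacent triples into one feature list.
features = [gram for n in (1, 2, 3) for gram in ngrams(tokens, n)]
print(features)
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps',
#  'quick_brown_fox', 'brown_fox_jumps']
```

Four words already produce nine features, which shows how quickly the feature set grows.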
Sentiment Analysis
• Sentiment: belief, view, opinion, conviction
• Sentiment analysis: opinion mining, subjectivity analysis, and appraisal extraction
• Goal: to answer the question: “What do people feel about a certain topic?”
• Explicit vs. implicit sentiment
• Sentiment polarity
o Positive vs. negative vs. neutral
• E.g. Linguistic Inquiry and Word Count (LIWC) Program:
o Counts percentage of words that reflect different emotions, thinking styles, social concerns, etc. in a text, to capture people’s social and psychological states.
o Words are grouped into categories, e.g. swearing or past, and the program counts how many words of each category occur in the text
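The category-counting idea can be sketched as below. This is not the LIWC program or its dictionary: the two word lists, the function names and the simple polarity rule are hypothetical and exist only to show the counting principle.

```python
import re

# Hypothetical mini-dictionary; real category dictionaries are far larger.
CATEGORIES = {
    "positive": {"good", "great", "love", "happy"},
    "negative": {"bad", "awful", "hate", "sad"},
}

def category_percentages(text, categories=CATEGORIES):
    """Percentage of the words in the text that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {name: 0.0 for name in categories}
    return {
        name: 100 * sum(w in vocab for w in words) / len(words)
        for name, vocab in categories.items()
    }

def polarity(text):
    """Assumed rule: the category with the larger share wins, ties are neutral."""
    scores = category_percentages(text)
    if scores["positive"] > scores["negative"]:
        return "positive"
    if scores["negative"] > scores["positive"]:
        return "negative"
    return "neutral"

print(category_percentages("I love this great phone"))  # 2 of 5 words are positive
print(polarity("The battery is awful"))
```

Pure word counting of this kind captures explicit sentiment only; implicit sentiment and context (e.g. negation) need more than a dictionary lookup.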