BIS II - Data, Dashboards & Visual Analytics with Tableau Flashcards

Question 1

Q

Performance Dashboards

Answer

A

Provide visual displays of important information
That is consolidated and arranged on a single screen
So that the information can be easily digested at a glance and easily drilled into and further explored

Question 2

Q

What to look for in a dashboard

Answer

A

Use of visual components to highlight data and exceptions
Transparent to the user, i.e. require minimal training and are easy to use
Combine data from a variety of systems into a single, unified view of the business
Enable drill-down or drill-through to underlying data sources/reports
Present a dynamic, real-world view with timely data
Require little coding to implement/deploy/maintain

Question 3

Q

Best practices in Dashboard Design

Answer

A

Benchmark KPIs with industry standards
Wrap the metrics with contextual data
Validate the design by a usability specialist
Prioritize and rank alerts and exceptions
Enrich dashboard with business-user levels
Present information in three different levels
Pick the right visual constructs
Provide for guided analytics

Question 4

Q

Text Mining & Text Analytics

Data in DSS – Data

Answer

A

Is a collection of facts usually obtained as the result of experiences, observations or experiments
May consist of numbers, words, images, …
Is the lowest level of abstraction (from which information and knowledge are derived)

Question 5

Q

Why text is important

Answer

A

• Text is everywhere
o Many legacy applications still produce text
o Medical records (hand-written)
o Consumer complaint logs
o Customer reviews
• 85% of corporate data is stored in some unstructured form, doubling every 18 months
• Text filtering can be applied in many contexts, e.g. measuring the impact of online word of mouth on sales, classifying and filtering junk e-mail, or NLP (natural language processing, e.g. Alexa)

Question 6

Q

Why text is difficult

Answer

A

• Often referred to as “unstructured data”
• Has a linguistic structure – intended for human consumption, not for computers.
o Words have varying lengths
o Text fields can have varying numbers of words
o Sometimes word order matters, sometimes not
• Text is relatively “dirty”, because people…
o Write ungrammatically
o Misspell words
o Abbreviate unpredictably
o … etc.
• Context is important

Question 7

Q

Representation of Textual Data

Answer

A

• Goal: to turn the text into a feature-vector form
• General strategy: to use the simplest (least expensive) technique
• Terminology that is borrowed from information retrieval:
o Document = one piece of text, no matter how large or small
o Document is composed of individual tokens and terms. E.g. a word is a token.
o Corpus = a collection of documents
• Representation techniques:
1. Bag of words
2. Term frequency
3. Sparseness: inverse document frequency, TFIDF
4. N-grams

Question 8

Q

1) Bag of Words

Answer

A

Approach:
• Treats each document as just a collection of individual words
• Ignores grammar, word order, sentence structure and punctuation
• Straightforward and inexpensive to generate
• Works well for many tasks

Application case: Spam filtering:
• Represent e-mail messages as unordered bag of words
• Then compare them to the typical “spam” bag of words, e.g. containing “Viagra”, “Stock”, “buy”
• Where there is a big overlap, the message is classified as spam e-mail
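
The spam-filtering case above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the spam vocabulary is just the three example words from this card, and the helper names and the 0.5 threshold are assumptions chosen for the example.

```python
import re

# Hypothetical "typical spam" bag of words, taken from the card's examples.
SPAM_WORDS = {"viagra", "stock", "buy"}


def bag_of_words(text):
    """Return the set of lowercase words in text; order and punctuation are ignored."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def spam_overlap(text, spam_words=SPAM_WORDS):
    """Fraction of the spam vocabulary that also appears in the message."""
    return len(bag_of_words(text) & spam_words) / len(spam_words)


def is_spam(text, threshold=0.5):
    """Classify as spam when the overlap reaches the (assumed) threshold."""
    return spam_overlap(text) >= threshold
```

For example, is_spam("Buy Viagra now!") is True because two of the three spam words occur, while a message sharing none of them is not flagged.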

Question 9

Q

2) Term frequency

Answer

A

The next step is to use the word count (frequency) in the document instead of just a zero or one
Usually, the following preprocessing steps are performed:
• Normalization: every term is in lowercase. E.g. iPhone, iphone, IPHONE -> iphone
• Stemming: suffixes are removed, plurals are turned to the singular forms. E.g. announces, announced, announcing -> announc; directors -> director
• Removal of stopwords: very common words in the language being parsed are dropped. E.g. the, and, of, on in English
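
The three steps above, followed by the word count, might look like this in Python. It is a rough sketch: the stopword list contains only the four words from this card, and naive_stem is a hypothetical suffix stripper written for the card's examples, not a real stemming algorithm such as Porter's.

```python
import re
from collections import Counter

# Assumed stopword list: only the examples named on the card.
STOPWORDS = {"the", "and", "of", "on"}

# Suffixes tried in order by the toy stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text):
    """Normalize to lowercase, drop stopwords, stem, then count each term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(terms)
```

With this sketch, "announces", "announced" and "announcing" are all counted under the term "announc", and "iPhone" and "IPHONE" under "iphone".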

Question 10

Q

3) Sparseness: inverse document frequency - IDF and TFIDF

Answer

A

a. Take into account the distribution of the term over a corpus as well
b. The term should not be too rare and not too common
c. Impose upper & lower limits of term frequency
d. Inverse document frequency:
i. IDF(t) = 1 + log(total # of documents / # of documents containing t)
ii. IDF may be thought of as the boost a term gets for being rare
e. TFIDF: the product of term frequency (TF) and inverse document frequency (IDF), i.e. TFIDF(t, d) = TF(t, d) × IDF(t)
i. TFIDF is specific to a single document, whereas IDF depends on the entire corpus
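
A small Python sketch of the two formulas above. It assumes documents are already tokenized into lists of terms, that TF is the raw count of the term in the document, and that the logarithm is the natural log (the card does not state a base); the function names are made up for this example.

```python
import math
from collections import Counter


def idf(term, corpus):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(corpus) / n_containing)


def tfidf(doc, corpus):
    """TFIDF for every term of one document (doc must be part of corpus)."""
    counts = Counter(doc)  # raw term frequencies within this document
    return {term: tf * idf(term, corpus) for term, tf in counts.items()}
```

A term that occurs in every document gets IDF = 1 + log(1) = 1, so its TFIDF equals its plain term frequency; rarer terms receive a larger boost.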

Question 11

Q

4) N-gram sequences

Answer

A

a. Used when word order is important & the information about it should be preserved
b. N-grams = sequences of N adjacent words, which are included as terms
c. E.g. “The quick brown fox jumps” -> {quick, brown, fox, jumps}, {quick_brown, brown_fox, fox_jumps}, {quick_brown_fox, brown_fox_jumps} (the stopword “The” is dropped first)
d. Advantage: easy to generate, requires no linguistic knowledge
e. Disadvantage: greatly increases the size of the feature set -> needs special consideration for dealing with massive numbers of features and their computational storage
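
The example in (c) can be reproduced with a short Python sketch. It assumes the text has already been tokenized and the stopword removed; the function names and the underscore-joined term format are illustrative choices.

```python
def ngrams(tokens, n):
    """All sequences of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, max_n=3):
    """Single words plus every n-gram up to max_n, as one feature list."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features
```

For the four tokens quick, brown, fox, jumps this yields 4 + 3 + 2 = 9 features, which illustrates how quickly the feature set grows compared with single words alone.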

Question 12

Q

Sentiment Analysis

Answer

A

• Sentiment: belief, view, opinion, conviction
• Sentiment analysis: also known as opinion mining, subjectivity analysis, and appraisal extraction
• Goal: to answer the question: “What do people feel about a certain topic?”
• Explicit vs. implicit sentiment
• Sentiment polarity
o Positive vs. negative vs. neutral
• E.g. Linguistic Inquiry and Word Count (LIWC) Program:
o Counts percentage of words that reflect different emotions, thinking styles, social concerns, etc. in a text, to capture people’s social and psychological states.
o Words are assigned to categories, e.g. swearing or past, and the program counts how many words of each category were found
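
The category-counting idea can be illustrated with a toy Python sketch. This is not the LIWC program or its dictionary: the three categories and their word lists below are invented for the example, and the function name is hypothetical.

```python
import re

# Invented mini-dictionary for illustration only (not LIWC's categories).
CATEGORIES = {
    "positive": {"good", "great", "love", "happy"},
    "negative": {"bad", "awful", "hate", "sad"},
    "past": {"was", "were", "had", "did"},
}


def category_percentages(text, categories=CATEGORIES):
    """Percentage of the words in text that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(w in vocab for w in words) / len(words)
        for name, vocab in categories.items()
    }
```

Comparing the positive and negative percentages gives a crude polarity signal; it only captures explicit sentiment carried by the listed words.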