Session 2 Flashcards
what are the four steps of business performance management?
- strategize
- plan
- monitor/analyse
- act/adjust
what happens in the strategize phase?
defining the mission, values, goals, objectives, incentives and strategy maps
what happens in the plan phase?
we need to conduct budgeting, forecasting and modelling, introduce initiatives and set targets
what happens in the monitor/analyse phase?
we need to check performance dashboards, reports and analytical tools
what happens in the act/adjust phase?
we execute the strategy; we need to interpret, collaborate, assess, decide, act, adjust and track what is happening to deal with changed circumstances
What is data?
collection of facts usually obtained as the result of experiences, observations and experiments
- may consist of numbers, words or images
- lowest level of abstraction from which information and knowledge are derived
- data becomes information and knowledge once we analyze it
what are the two types of data?
structured
unstructured
what are the data categories for structured data?
- categorical
- numerical
what is categorical data?
can be put into groups and categories using names and labels
what is nominal data?
type of categorical data: 1. nominal data: data classified without an ordering or a rank (female/male)
what is ordinal data?
type of categorical data: 2. ordinal data: data that can be ranked w/o measurable intervals in-between (lower/middle/upper class)
what is numerical data? (and the two sub categories)
data referring to numbers (can be categorised, ranked and has equal intervals in-between)
what is interval data?
type of numerical data:
without a true/natural zero (e.g. Celsius: 0 degrees is not the absence of temperature, and 40 degrees is not 2x 20 degrees)
what is ratio data?
type of numerical data:
with a true/natural zero that indicates complete absence of quantity (weight, income)
what three operations can we make to measure centrality?
- arithmetic mean (sum / number of observations)
- median (middle value of an ordered dataset)
- mode (value that occurs most often)
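The three centrality measures above map directly onto Python's standard `statistics` module; a minimal sketch with made-up sample data:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # arithmetic mean: sum / number of observations -> 5
median = statistics.median(data)  # middle value of the ordered dataset -> 4.0
mode = statistics.mode(data)      # value that occurs most often -> 3
```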
what operations can we make to measure dispersion?
- range
- variance
- standard deviation
- mean absolute deviation
- quartile and interquartile range
- box-whiskers plot
- shape and distribution (skewness, kurtosis)
what is the range?
(difference between highest and lowest value)
what is a variance?
(measure of how much data spreads around the average)
what is a standard deviation?
(square root of variance)
what is the mean absolute deviation?
(average distance between each data point and the mean)
what is quartile and interquartile range?
range of each quartile of data
what is the box-whiskers plot?
graphical representation of the data dispersion
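Most of the dispersion measures above can be computed with the standard `statistics` module; a minimal sketch with made-up data (population variance is used here, sample variance would divide by n-1):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

rng = max(data) - min(data)           # range: highest minus lowest -> 7
var = statistics.pvariance(data)      # variance: spread around the average -> 4
std = statistics.pstdev(data)         # standard deviation: sqrt(variance) -> 2.0

m = statistics.mean(data)
mad = sum(abs(x - m) for x in data) / len(data)  # mean absolute deviation -> 1.5

q1, q2, q3 = statistics.quantiles(data, n=4)     # quartile cut points
iqr = q3 - q1                          # interquartile range
```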
what is the shape and distribution (skewness, kurtosis)?
describes where the data has more points than elsewhere (via skewness and kurtosis)
what is kurtosis?
describes the peakedness of the probability distribution
the more peaked –> positive kurtosis
what is skewness?
quantifies the extent and direction of departure from horizontal symmetry in a data set
–> positive skew: longer tail to the right
–> negative skew: longer tail to the left
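Skewness and kurtosis can be sketched from their moment-based definitions with only the standard library (the sample data is made up; libraries like SciPy offer ready-made versions):

```python
import statistics

def skewness(data):
    """Third standardized moment: > 0 -> tail to the right, < 0 -> tail to the left."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def excess_kurtosis(data):
    """Fourth standardized moment minus 3: > 0 means more peaked than a normal distribution."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    n = len(data)
    return sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3

right_skewed = [1, 1, 2, 2, 3, 10]   # one large value pulls the tail to the right
print(skewness(right_skewed) > 0)    # True
```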
what is a dashboard?
dashboards provide visual displays of important information, consolidated and arranged on a single screen so that the information can be digested at a glance and drilled into and further explored
what should a dashboard have?
- visual components to highlight data and exceptions that require an action
- transparency to the user, so that it requires minimal training and is easy to use (especially for execs)
- combining data from a variety of systems into a single, summarized and unified view of the business
- enabling drill-down (go from monthly to weekly to daily) or drill-through (go deeper into a data point) for underlying data sources and reports
- presenting a dynamic, real-world view with timely data
- requiring little coding to implement /deploy / maintain
what are best practices for dashboards designs?
- benchmark KPIs with industry standards to see where the company stands
- wrap the metrics with contextual metadata: include which data is included and when the last update was made
- validate the design by usability specialist
- prioritize and rank alerts and exceptions, but be careful that only important warnings are shown
- pick the right visual constructs: histograms for distributions, pie charts for market shares
- provide guided analytics
what is included in text analysis?
information retrieval, natural language processing, text mining, web mining and data mining
why is text difficult to analyse?
it is unstructured data:
- it has linguistic structure intended for human consumption, not for computers
- it is relatively dirty: spelling errors, emojis, abbreviations, grammatical errors, or sarcasm
- context is important
how can we represent data for analysis?
by turning it into a feature-vector form (a vector with values; we get variables from the text)
what is a token / term ?
small individual elements that compose a document; may be words, sentences or paragraphs depending on our definition
what is a document?
one piece of text regardless of how large or small
what is a corpus?
a collection of documents (all of Wikipedia, all tweets etc.)
what are 5 representation techniques for textual data?
- bag of words
- term frequency (TF)
- inverse document frequency (IDF)
- TFIDF
- N-grams
what is the bag of words method?
- treat every document as just a collection of individual words
- ignore grammar, word order, sentence structure and punctuation
- inexpensive and straightforward
- we can check if word exists (0 or 1)
example: used in spam filters
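The binary bag-of-words idea above can be sketched in a few lines of Python (vocabulary and document are made up):

```python
def bag_of_words(document, vocabulary):
    """Binary bag-of-words: 1 if the term occurs in the document, else 0.
    Grammar, word order and sentence structure are ignored."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

vocab = ["free", "money", "meeting", "tomorrow"]
print(bag_of_words("FREE money money FREE", vocab))  # [1, 1, 0, 0]
```

A spam filter could feed such vectors into a classifier, since repeated words count the same as a single occurrence here.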
what is the term frequency method?
examine word count in three steps:
- normalization: every term becomes lowercase
- stemming: suffixes are removed and plurals are turned into singulars (only leave stem)
- removal of stop words: very common words in the respective language that do not carry any useful meaning (the, and, of etc.) are removed
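The three term-frequency steps can be sketched as a small pipeline; the stemmer below is deliberately naive (it only strips a plural "s" — real stemmers such as Porter's do far more), and the stop-word list is a made-up sample:

```python
from collections import Counter

STOP_WORDS = {"the", "and", "of", "a", "is"}

def naive_stem(word):
    # simplified stemming: turn plurals into singulars by stripping a trailing "s"
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def term_frequencies(document):
    tokens = document.lower().split()                     # normalization: lowercase
    tokens = [naive_stem(t) for t in tokens]              # stemming: keep only the stem
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return Counter(tokens)

print(term_frequencies("The cats and the cat"))  # Counter({'cat': 2})
```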
what is the inverse document frequency method?
- next to TF also look at the distribution of a term over a corpus: the term should neither be too rare nor too common
- IDF: the boost a term gets for being rare; the measure increases the rarer the term is
what is the TFIDF method?
we multiply TF and IDF
-> TF is specific to a single document, whereas IDF depends on an entire corpus
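A minimal TF-IDF sketch, using the common logarithmic IDF variant (the tiny corpus is made up, and the function assumes the term occurs in at least one document):

```python
import math
from collections import Counter

def tfidf(term, document, corpus):
    """TF counts the term in one document; IDF boosts terms that are rare in the corpus."""
    tf = Counter(document).get(term, 0)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

corpus = [["data", "mining"], ["data", "analysis"], ["text", "mining"]]
# "data" appears in 2 of 3 documents -> small IDF boost
# "text" appears in 1 of 3 documents -> larger IDF boost
print(tfidf("text", corpus[2], corpus) > tfidf("data", corpus[0], corpus))  # True
```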
what is the n-gram method?
- in some cases, word order is important and we want to preserve some information about it
- so we include sequences of adjacent words, called n-grams, as terms (a document of 1,000 words yields roughly 1,000 n-grams for each n)
example: the 2-grams of "bag of words" are "bag_of" and "of_words"
what is an advantage and what is a disadvanatge of n-grams?
+: easy to generate, requires no linguistic knowledge
-: the feature set grows greatly in size, so dealing with massive numbers of features requires special consideration for computation and storage space
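Generating n-grams from a token list takes only one line, which illustrates why they are so easy to produce (the join character `_` is an arbitrary choice here):

```python
def ngrams(tokens, n):
    """All sequences of n adjacent tokens, each joined into a single term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["bag", "of", "words"], 2))  # ['bag_of', 'of_words']
```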
what is sentiment analysis?
the goal is to answer the question of what people feel about a certain topic
(LIWC as dictionary for sentiments and words)
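Dictionary-based sentiment scoring can be sketched as a word-score lookup; the dictionary below is a made-up toy, not the real LIWC lexicon:

```python
# toy sentiment dictionary (invented scores, stand-in for a real lexicon such as LIWC)
SENTIMENT = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "hate": -1}

def sentiment_score(text):
    """Sum the dictionary scores of all words: > 0 positive, < 0 negative, 0 neutral."""
    return sum(SENTIMENT.get(word, 0) for word in text.lower().split())

print(sentiment_score("I love this great product"))  # 2
print(sentiment_score("this is awful and bad"))      # -2
```

Note that simple word lookup misses sarcasm and negation ("not bad"), which is part of why text is hard to analyse.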