4: Big Data Flashcards
Big Data differs from traditional data sources by a set of characteristics:
- Volume: huge quantities of data
- Variety: many different data sources and formats
- Velocity: speed at which data are created
- (Veracity): trustworthiness/dependability of the data
Preprocessing for structured data:
- Extraction
- Aggregation
- Filtration
- Selection
- Conversion
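A minimal sketch of these structured-data preprocessing steps using pandas; the file name `trades.csv` and the column names are hypothetical placeholders.

```python
import pandas as pd

# Extraction: read the raw structured data (hypothetical file and columns).
df = pd.read_csv("trades.csv", parse_dates=["trade_date"])

# Aggregation: combine rows into a coarser unit of analysis (daily totals).
daily = df.groupby("trade_date", as_index=False)["volume"].sum()

# Filtration: drop rows that are out of scope or clearly erroneous.
daily = daily[daily["volume"] > 0]

# Selection: keep only the columns (features) needed for the model.
daily = daily[["trade_date", "volume"]]

# Conversion: cast values to the appropriate types/units.
daily["volume"] = daily["volume"].astype(float)
```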
Algorithms with errors in their assumptions lead to:
High bias w/ poor approximation,
Leading to: underfitting & high in-sample error
The degree to which a model fits the training data
Bias Error
Unstable models lead to:
Picking up noise and producing high variance,
Resulting in: overfitting and high out-of-sample error
Harmonic mean of precision and recall
F1 score
Most useful in situations where the chance of rejecting the null, when it is true, is costly:
Precision; helps guard against Type 1 errors
High precision = reduced risk of Type 1 error (false positive)
Most useful in situations where the chance of accepting the null, when it is false, is costly:
Recall; helps guard against Type 2 errors
High recall = reduced risk of Type 2 error (false negative)
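A quick sketch of how precision, recall, and their harmonic mean (F1) follow from confusion-matrix counts; the counts passed in below are made up for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)  # high precision -> fewer Type 1 errors (false positives)
    recall = tp / (tp + fn)     # high recall -> fewer Type 2 errors (false negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with made-up counts:
print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)
```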
Training a model such that noise or randomness is mistaken for patterns and relationships leads to:
Overfitting
The Receiver Operating Characteristic (ROC) curve shows the tradeoff between:
False positive rate & true positive rate
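A sketch of that tradeoff: sweep a classification threshold and compute the true positive rate and false positive rate at each cutoff. The scores, labels, and thresholds below are illustrative.

```python
def roc_points(scores, labels, thresholds):
    """Return (false_positive_rate, true_positive_rate) pairs for each threshold."""
    points = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((fpr, tpr))
    return points

# Lowering the threshold raises both TPR (good) and FPR (bad) -- the ROC tradeoff.
print(roc_points([0.9, 0.7, 0.4, 0.2], [1, 1, 0, 0], thresholds=[0.8, 0.5, 0.1]))
# [(0.0, 0.5), (0.0, 1.0), (1.0, 1.0)]
```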
Root mean squared error (RMSE) is used when:
The target variable is continuous
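A short sketch of RMSE for a continuous target; the actual and predicted values are illustrative.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: average squared error, then square root."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

print(rmse([3.0, 5.0, 2.5], [2.5, 5.0, 3.5]))  # ~0.645
```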
When there are unequal class distributions in the data set, the best measure of accuracy is:
F1; the harmonic mean of precision & recall
Bag of words (BOW) is used in feature selection, after the text has been cleansed and normalized, so the BOW is concise. Text cleansing & wrangling have already been completed, so the BOW already reflects:
* stemming
* lemmatization
* lowercasing
* stop-word removal
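A minimal bag-of-words sketch over tokens that have already been cleansed and normalized; the token list is illustrative.

```python
from collections import Counter

# Tokens after cleansing/normalization: lowercased, stemmed, stop words removed.
tokens = ["market", "rose", "market", "fell", "rate", "rose"]

bow = Counter(tokens)  # bag of words: token -> count, word order discarded
print(bow)             # Counter({'market': 2, 'rose': 2, 'fell': 1, 'rate': 1})
```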
Preprocessing for unstructured (text) data:
What is removed in the text processing/cleansing stage?
- HTML tags
- punctuation
- numbers
- white space
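A sketch of the cleansing step using regular expressions; the sample string is made up.

```python
import re

raw = "<p>Revenue grew 12% in 2023!</p>   "

text = re.sub(r"<[^>]+>", " ", raw)       # remove HTML tags
text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
text = re.sub(r"\d+", " ", text)          # remove numbers
text = re.sub(r"\s+", " ", text).strip()  # collapse extra white space

print(text)  # "Revenue grew in"
```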
What occurs in the text wrangling/preprocessing/normalization stage?
- stemming
- lemmatization
- lowercasing
- stop-word removal
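A toy sketch of those normalization steps in plain Python; the stop-word list and the suffix-stripping "stemmer" are simplified stand-ins (real work would typically use a library such as NLTK).

```python
STOP_WORDS = {"the", "in", "of", "a", "an", "and"}  # tiny illustrative stop list

def crude_stem(word: str) -> str:
    """Very rough stand-in for stemming/lemmatization: strip common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(cleansed_text: str) -> list[str]:
    tokens = cleansed_text.lower().split()               # lowercasing + tokenizing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [crude_stem(t) for t in tokens]               # stemming

print(normalize("Revenue grew strongly in emerging markets"))
# ['revenue', 'grew', 'strongly', 'emerg', 'market']
```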