4: Big Data Flashcards

1
Q

Big Data differs from traditional data sources by these characteristics:

A
  1. Volume: huge quantities of data
  2. Variety: array of data sources
  3. Velocity: speed at which data is generated
  4. (Veracity): trustworthiness/dependability
2
Q

Preprocessing for structured data:

A
  • Extraction
  • Aggregation
  • Filtration
  • Selection
  • Conversion
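
A minimal sketch of the five steps on a toy table. The card only names the steps, so the meaning given to each one in the comments is an assumption, and the rows, field names, and `REFERENCE_YEAR` are hypothetical.

```python
REFERENCE_YEAR = 2024  # hypothetical "as of" year

raw_rows = [
    {"name": "Ann", "birth_year": 1980, "salary": "50,000", "other_income": "5,000", "country": "US"},
    {"name": "Bo", "birth_year": 1995, "salary": "40,000", "other_income": "0", "country": "CA"},
    {"name": "Cy", "birth_year": 1970, "salary": "90,000", "other_income": "10,000", "country": "US"},
]

def to_int(text):
    # Conversion: turn a formatted string such as "50,000" into an integer.
    return int(text.replace(",", ""))

prepared = []
for row in raw_rows:
    # Filtration: drop rows (observations) that are not needed.
    if row["country"] != "US":
        continue
    prepared.append({
        "name": row["name"],
        # Extraction: derive a new variable from an existing one.
        "age": REFERENCE_YEAR - row["birth_year"],
        # Aggregation: combine related variables into a single one.
        "total_income": to_int(row["salary"]) + to_int(row["other_income"]),
        # Selection: columns that are no longer needed are simply not carried over.
    })

print(prepared)
```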
3
Q

Algorithms with erroneous assumptions lead to:

A

High bias with poor approximation

Leading to: underfitting & high in-sample error

4
Q

The degree to which a model fits the training data

A

Bias Error

5
Q

Unstable models lead to:

A

Picking up noise and producing high variance,
resulting in: overfitting and high out-of-sample error
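
A rough illustration, assuming made-up data and two deliberately extreme models that are not from the deck. The mean model stands in for a high-bias learner. The memorizing model, which copies the nearest training label, stands in for an unstable one: it reproduces the training noise perfectly and pays for it on new data.

```python
# Hypothetical toy data: y is roughly 2*x, with small fixed "noise" in the training labels.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [0.5, 1.5, 4.4, 5.6, 8.3, 9.7]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [1.0, 3.0, 5.0, 7.0, 9.0]

def mean_model(x):
    # High bias: ignores x and always predicts the training average.
    return sum(train_y) / len(train_y)

def memorizing_model(x):
    # High variance: copies the label of the closest training point, noise included.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mse(model, xs, ys):
    # Mean squared error of a model over a data set.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

for name, model in [("mean", mean_model), ("memorizing", memorizing_model)]:
    print(name,
          "in-sample:", round(mse(model, train_x, train_y), 3),
          "out-of-sample:", round(mse(model, test_x, test_y), 3))
```

The memorizing model has zero in-sample error but a positive out-of-sample error, while the mean model's error is already large in-sample.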

6
Q

Harmonic mean of precision and recall

A

F1 score
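
As a quick sketch, the harmonic mean can be computed directly. The function name `f1_score` and the sample values are illustrative only.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean is pulled toward the weaker of the two inputs:
# here it comes out below the arithmetic mean of 0.75.
print(f1_score(0.5, 1.0))
```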

7
Q

Most useful in situations where rejecting the null when it is true (a false positive) is costly

A

Precision; helps with Type 1 errors

High precision = reduced risk of Type 1 error
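
A small sketch of why precision tracks Type 1 errors: false positives sit in its denominator. The labels below are made up, with 1 as the positive class.

```python
def precision(actual, predicted):
    # Precision = TP / (TP + FP): of everything flagged positive, the share that was right.
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

actual = [1, 0, 1, 0, 0, 1]
predicted = [1, 1, 1, 0, 0, 0]  # one false positive (second item)
print(precision(actual, predicted))
```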

8
Q

Most useful in situations where accepting the null when it is false (a false negative) is costly

A

Recall; helps with Type 2 errors

High recall = reduced risk of Type 2 error
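
A matching sketch for recall, where false negatives (Type 2 errors) sit in the denominator. The labels are again made up, with 1 as the positive class.

```python
def recall(actual, predicted):
    # Recall = TP / (TP + FN): of all true positives, the share the model found.
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]  # two positives are missed
print(recall(actual, predicted))
```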

9
Q

Training a model to such a degree that noise or randomness is mistaken for patterns and relationships leads to:

A

Overfitting

10
Q

The Receiver Operating Characteristic (ROC) curve shows the tradeoff between:

A

False positive rate
&
True positive rate
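
A rough sketch of how the tradeoff appears as the classification threshold moves. The labels, scores, and thresholds are invented for illustration.

```python
labels = [1, 1, 0, 1, 0, 0]              # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical model scores

def roc_point(threshold):
    # Classify as positive when the score reaches the threshold,
    # then return (false positive rate, true positive rate).
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    return fp / labels.count(0), tp / labels.count(1)

# Lowering the threshold catches more true positives but admits more false positives.
for t in (0.85, 0.65, 0.5, 0.1):
    print(t, roc_point(t))
```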

11
Q

Root mean squared error (RMSE) is used when:

A

Target variable is continuous
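
A minimal sketch of the calculation, using made-up actual and predicted values.

```python
import math

def rmse(actual, predicted):
    # Square each prediction error, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(rmse(actual, predicted))
```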

12
Q

When class distributions in the data set are unequal, a better performance measure than accuracy is:

A

F1 score; harmonic mean of precision & recall
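
A sketch of why accuracy can mislead here, assuming a made-up data set with 5 positives out of 100 and a "model" that always predicts the majority class. F1 is computed from counts as 2TP / (2TP + FP + FN), which is equivalent to the harmonic-mean form.

```python
actual = [1] * 5 + [0] * 95   # hypothetical: 5 positives, 95 negatives
predicted = [0] * 100         # always predicts the negative class

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)
denominator = 2 * tp + fp + fn
f1 = 2 * tp / denominator if denominator else 0.0

# Accuracy looks strong even though no positive case is ever found.
print("accuracy:", accuracy, "F1:", f1)
```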

13
Q

Bag of words (BOW) occurs in Feature Selection, after the text has been cleansed & normalized, so the BOW is concise:

A

Text cleansing & wrangling have been completed, so the BOW tokens have already gone through:
* stemming
* lemmatization
* lower casing
* stop word removal
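
A minimal sketch of building a BOW, assuming the documents below are already cleansed and normalized token lists. The tokens themselves are invented.

```python
from collections import Counter

# Hypothetical documents, already cleansed and normalized into tokens.
documents = [
    ["profit", "rise", "beat", "estimate"],
    ["profit", "fall", "miss", "estimate"],
]

# A bag of words keeps each distinct token and how often it occurs, ignoring word order.
bag_of_words = Counter(token for doc in documents for token in doc)
print(bag_of_words)
```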

14
Q

Unstructured data

What is removed in the text processing/cleansing stage?

A
  • HTML tags
  • punctuation
  • numbers
  • white space
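
The four removals above can be sketched with regular expressions. The patterns and the sample string are illustrative assumptions, not a complete cleanser.

```python
import re

def cleanse(text):
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"\d+", "", text)        # remove numbers
    text = re.sub(r"[^\w\s]", "", text)    # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse extra white space

print(cleanse("<p>Revenue rose 12%,  beating   estimates!</p>"))
```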
15
Q

What occurs in the text wrangling/preprocessing/normalization stage?

A
  • stemming
  • lemmatization
  • lowercasing
  • stop word removal
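
A rough sketch of three of these steps. The stop-word list and the suffix-stripping `toy_stem` are hypothetical stand-ins for a real stop-word list and stemmer. Lemmatization is left out because it needs a dictionary of word forms.

```python
STOP_WORDS = {"the", "a", "an", "and", "of", "were", "is"}  # hypothetical short list

def toy_stem(token):
    # Crude stand-in for a real stemmer: strip a common suffix if enough of the word remains.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(text):
    tokens = text.lower().split()                        # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [toy_stem(t) for t in tokens]                 # stemming

print(normalize("The analysts were expecting profits"))
```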