8.1 Data Analysis Steps Flashcards

You may prefer our related Brainscape-certified flashcards:
1
Q

What is the main difference between a “Traditional ML model” versus a “Textual ML model”?

A

A Traditional ML model uses data inputs; a textural ML model uses text inputs
.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the 5 steps involved in analyzing data for financial forecasting?

A
  1. Conceptualization of the modeling task.
  2. Data collection.
  3. Data preparation and wrangling.
  4. Data exploration.
  5. Model training.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Data collection (or “________”) involves determining the sources of the data to be used.

A

curation;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Data _________ deals with reducing errors in the raw data.

A

cleansing;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Data _________ involving processing data for model use.

A

wrangling;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Addressing the following are attributes of what process?

  • Missing values;
  • Invalid values (i.e., data outside of a meaningful range);
  • Inaccurate values;
  • Non-uniform values due to wrong use of format or unit of measurement;
  • Duplicate observations;
A

data cleansing;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Addressing the following are attributes of what process?

  • Extraction (e.g., extracting number of years employed based on dates provided);
  • Aggregation (consolidating two related variables into one, using appropriate weights);
  • Filtration (removing irrelevant observations);
  • Selection (removing features, e.g., data columns, not needed for processing);
  • Conversion of data of diverse types (e.g., nominal, ordinal, etc.).
A

data wrangling;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the term for removing outliers whereby the highest and lowest x% of observations are excluded?

A

trimming;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the term for replacing extreme values (e.g., a person’s height) by the maximum value allowable for that variable?

A

winsorization;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Which scaling method scales variable values between 0 and 1 in order to describe them as a normal distribution, and what is the formula?

A

normalization;

normalized xi = [x - min(x)] / [[max(x) - min(x)]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Which scaling method centers the variables at 0 and scales them as units of standard deviations from the mean, and what is the formula?

A

standardization;

standardized xi = [x - x-bar] / std. deviation of x

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Addressing the following are attributes of what process?

  • Remove HTML tags;
  • Remove punctuations;
  • Remove numbers;
  • Remove white spaces;
A

text preparation or cleansing;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Addressing the following are attributes of what process?

  • Lowercasing;
  • Removal of stop words (e.g., “is”, “the”, etc.);
  • Stemming;
  • Lemmatization;
A

normalization of cleansed text.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Give an example of “stemming”.

A

Integrate, integration, integrating are all assigned a common value of “integrat”.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the term for the conversion of inflected forms of a word into their “lemma” (i.e., morphological root). Similar to stemming but more computationally advanced and resource intensive.

A

lemmatization;

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

In text wrangling, a ______ is a word, and ________ is the process of splitting a sentence into ________.

A

token;
tokenization;
tokens;

17
Q

The following are examples of _______ :

“the_market”; “market_is”; “is_up”; “up_today”;

A

bigrams, or more generally “N-grams”.