8.1 Data Analysis Steps Flashcards
What is the main difference between a “Traditional ML model” and a “Textual ML model”?
A traditional ML model uses structured data inputs; a textual ML model uses unstructured text inputs.
What are the 5 steps involved in analyzing data for financial forecasting?
- Conceptualization of the modeling task.
- Data collection.
- Data preparation and wrangling.
- Data exploration.
- Model training.
Data collection (or “________”) involves determining the sources of the data to be used.
curation;
Data _________ deals with reducing errors in the raw data.
cleansing;
Data _________ involves processing data for model use.
wrangling;
Addressing the following issues is part of what process?
- Missing values;
- Invalid values (i.e., data outside of a meaningful range);
- Inaccurate values;
- Non-uniform values due to wrong use of format or unit of measurement;
- Duplicate observations;
data cleansing;
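The cleansing issues listed above can be illustrated with a minimal Python sketch. The records, the field names (`id`, `age`, `salary`), and the 0 to 120 age range are all made-up assumptions for illustration. Dropping bad rows is only one possible treatment; imputing missing values is another.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw = [
    {"id": 1, "age": 34, "salary": "55000"},
    {"id": 2, "age": None, "salary": "61000"},   # missing value
    {"id": 3, "age": 240, "salary": "48000"},    # invalid value (outside a meaningful range)
    {"id": 4, "age": 45, "salary": "$72,000"},   # non-uniform format
    {"id": 1, "age": 34, "salary": "55000"},     # duplicate observation
]


def cleanse(records):
    """Return records with duplicates, missing and invalid ages removed."""
    seen, clean = set(), []
    for row in records:
        if row["id"] in seen:                    # skip duplicate observations
            continue
        seen.add(row["id"])
        age = row["age"]
        if age is None or not 0 <= age <= 120:   # skip missing or out-of-range ages
            continue
        # Bring salary into one uniform numeric format.
        salary = float(row["salary"].replace("$", "").replace(",", ""))
        clean.append({"id": row["id"], "age": age, "salary": salary})
    return clean


cleaned = cleanse(raw)
```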
The following transformations are part of what process?
- Extraction (e.g., extracting number of years employed based on dates provided);
- Aggregation (consolidating two related variables into one, using appropriate weights);
- Filtration (removing irrelevant observations);
- Selection (removing features, e.g., data columns, not needed for processing);
- Conversion of data of diverse types (e.g., nominal, ordinal, etc.).
data wrangling;
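A minimal sketch of several of these wrangling steps, assuming hypothetical employment records. The field names, the "US only" filter, and the rating scale are illustrative assumptions, not part of the flashcards.

```python
from datetime import date

# Hypothetical records; every field and value here is invented for illustration.
rows = [
    {"name": "A", "start": date(2015, 6, 1), "end": date(2020, 6, 1),
     "country": "US", "rating": "high"},
    {"name": "B", "start": date(2018, 1, 1), "end": date(2020, 1, 1),
     "country": "CA", "rating": "low"},
    {"name": "C", "start": date(2010, 3, 1), "end": date(2020, 3, 1),
     "country": "US", "rating": "medium"},
]

# Conversion: map an ordinal text label to an integer code.
RATING_ORDER = {"low": 0, "medium": 1, "high": 2}


def wrangle(records):
    out = []
    for r in records:
        if r["country"] != "US":                         # filtration: drop irrelevant rows
            continue
        years = (r["end"] - r["start"]).days / 365.25    # extraction: years employed from dates
        out.append({                                     # selection: keep only the needed features
            "years_employed": round(years, 1),
            "rating": RATING_ORDER[r["rating"]],
        })
    return out


wrangled = wrangle(rows)
```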
What is the term for removing outliers whereby the highest and lowest x% of observations are excluded?
trimming;
What is the term for replacing extreme values (e.g., an implausible value for a person’s height) with the maximum or minimum value allowable for that variable?
winsorization;
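The difference between trimming and winsorization can be sketched as follows. The sample data, the 10% trimming fraction, and the allowable bounds of 12 and 19 are assumptions chosen for illustration.

```python
def trim(values, pct):
    """Drop the lowest and highest `pct` fraction of observations."""
    s = sorted(values)
    k = int(len(s) * pct)          # number of observations to cut from each tail
    return s[k:len(s) - k]


def winsorize(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


data = [1, 12, 13, 14, 15, 16, 17, 18, 19, 100]
trimmed = trim(data, 0.10)             # the outliers 1 and 100 are removed
winsorized = winsorize(data, 12, 19)   # 1 becomes 12 and 100 becomes 19
```

Trimming shrinks the sample, while winsorization keeps every observation and only changes the extreme ones.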
Which scaling method rescales variable values to the range 0 to 1, and what is the formula?
normalization;
normalized x_i = [x_i - min(x)] / [max(x) - min(x)]
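As a quick check of the formula, here is a minimal sketch of min-max normalization. The function name and sample values are illustrative assumptions.

```python
def normalize(xs):
    """Rescale values to the range [0, 1] using min-max normalization."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("all values are identical; the range is zero")
    return [(x - lo) / (hi - lo) for x in xs]


scaled = normalize([10, 20, 30, 50])   # [0.0, 0.25, 0.5, 1.0]
```

The smallest observation always maps to 0 and the largest to 1, which is why this method is sensitive to outliers.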
Which scaling method centers the variables at 0 and scales them as units of standard deviations from the mean, and what is the formula?
standardization;
standardized x_i = [x_i - mean(x)] / [standard deviation of x]
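A minimal sketch of standardization using Python's standard library. It assumes the sample standard deviation (`statistics.stdev`); the flashcard does not say whether the sample or population version is intended, and `statistics.pstdev` would be the population alternative.

```python
from statistics import mean, stdev


def standardize(xs):
    """Center values at 0 and express them in units of standard deviation."""
    m = mean(xs)
    s = stdev(xs)   # assumption: sample standard deviation
    return [(x - m) / s for x in xs]


z = standardize([1.0, 2.0, 3.0])   # mean is 2 and sample std. deviation is 1
```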
The following steps are part of what process?
- Remove HTML tags;
- Remove punctuation;
- Remove numbers;
- Remove white spaces;
text preparation or cleansing;
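The four text cleansing steps above can be sketched with regular expressions. The patterns are simple assumptions for illustration; real HTML is usually better handled by a proper parser, and some workflows replace numbers with a placeholder token instead of deleting them.

```python
import re


def cleanse_text(text):
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation (not a word character or white space)
    text = re.sub(r"\d+", " ", text)           # remove numbers
    return re.sub(r"\s+", " ", text).strip()   # collapse and trim white space


cleaned_text = cleanse_text("<p>Revenue rose 12% in 2023,  beating estimates!</p>")
# cleaned_text == "Revenue rose in beating estimates"
```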
The following steps are part of what process?
- Lowercasing;
- Removal of stop words (e.g., “is”, “the”, etc.);
- Stemming;
- Lemmatization;
normalization of cleansed text.
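The first two normalization steps, lowercasing and stop word removal, can be sketched as below. The stop word list is a tiny illustrative assumption; practical lists are much longer.

```python
# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"is", "the", "a", "an", "of"}


def normalize_tokens(text):
    """Lowercase the text, split on white space, and drop stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


tokens = normalize_tokens("The market is Closed")   # ["market", "closed"]
```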
Give an example of “stemming”.
Integrate, integration, integrating are all assigned a common value of “integrat”.
What is the term for converting inflected forms of a word into their “lemma” (i.e., morphological root)? It is similar to stemming but more computationally advanced and resource intensive.
lemmatization;
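The contrast between stemming and lemmatization can be sketched with two toy functions. Both are deliberate simplifications: the suffix list is not a real stemming algorithm such as Porter's, and the lemma dictionary is a hypothetical three-entry lookup standing in for the full vocabulary a real lemmatizer needs.

```python
# Toy suffix list, chosen only to reproduce the "integrat" example.
SUFFIXES = ("ing", "ion", "e")

# Hypothetical lemma lookup; a real lemmatizer relies on a large dictionary.
LEMMAS = {"integrating": "integrate", "was": "be", "mice": "mouse"}


def stem(word):
    """Strip the first matching suffix (rule-based, no dictionary needed)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lemma dictionary; return it unchanged if absent."""
    return LEMMAS.get(word, word)


stems = [stem(w) for w in ["integrate", "integration", "integrating"]]
# stems == ["integrat", "integrat", "integrat"]
```

The stem "integrat" is not a real word, whereas a lemma such as "integrate" or "mouse" always is; that dictionary lookup is what makes lemmatization more resource intensive.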