Flash cards made by gpt based on summary

Question

What is Skewness in data?

Answer 1

Skewness measures the asymmetry of a data distribution. Positive skew indicates right-skewed data, negative skew left-skewed, and zero skew indicates symmetry.
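
The sign convention can be checked with a small calculation. This is a minimal sketch that assumes the moment-based definition of skewness (third central moment divided by the 1.5 power of the second); the function name and sample lists are illustrative only.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]         # long right tail
symmetric = [1, 2, 3, 4, 5]                      # mirror image around 3
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]  # long left tail

print(skewness(right_skewed), skewness(symmetric), skewness(left_skewed))
```

The right-tailed list gives a positive value, the symmetric list gives zero, and the left-tailed list gives a negative value.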

Answer 2

Outliers can be identified and managed through methods like trimming or winsorizing. Missing values can be replaced, discarded, or analyzed further.
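
The difference between trimming and winsorizing can be sketched as follows. The fixed bounds of 0 and 10 are an assumption made for illustration; in practice the bounds are usually chosen as percentiles of the data.

```python
def trim(values, lower, upper):
    """Trimming: drop observations outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def winsorize(values, lower, upper):
    """Winsorizing: keep every observation but clamp it to [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


data = [2, 3, 3, 4, 5, 5, 6, 95]  # 95 is the outlier

trimmed = trim(data, 0, 10)           # the outlier is removed
winsorized = winsorize(data, 0, 10)   # the outlier is replaced by the bound

print(trimmed)
print(winsorized)
```

Trimming shortens the sample, while winsorizing keeps the sample size and only pulls extreme values in to the bound.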

Answer 3

Machine learning is used to estimate a prediction function relating inputs to outputs, using training data and various methods for feature extraction and model evaluation.

Answer 4

Loss functions assess the quality of a prediction function. In classification, the loss is typically binary (0 if the prediction is correct, 1 if it is wrong); in regression, it is typically the squared difference between the predicted and actual values.
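
Both losses can be written in a few lines. This is a sketch that averages the loss over all observations; the function names and the example values are hypothetical.

```python
def zero_one_loss(y_true, y_pred):
    """Classification: share of predictions that are wrong."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Regression: average squared difference between actual and predicted."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


class_loss = zero_one_loss(["a", "b", "a", "b"], ["a", "a", "a", "b"])
reg_loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

print(class_loss, reg_loss)
```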

Answer 5

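The bias-variance tradeoff is the balance between error from an overly simple model (bias) and error from a model that is overly sensitive to its particular training data (variance). Increasing model complexity tends to lower bias but raise variance. High bias can lead to underfitting, while high variance can lead to overfitting.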

Answer 6

The training set is used to fit the model. The test set, which contains data the model has not seen, is used to evaluate the model's performance and to estimate how it will perform in real-world use.
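
A random split can be sketched with the standard library. The 25% test share and the fixed seed are assumptions chosen for the example, not values from the summary.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))
train, test = train_test_split(rows)

print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, so the model is never evaluated on data it was trained on.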

Answer 7

In K-fold cross-validation, data is divided into K roughly equal parts (folds). The model is trained on K-1 parts and tested on the remaining part. This process repeats K times, so that each part serves as the test set exactly once, and the K results are averaged for a more stable estimate.
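
The splitting step can be sketched as an index generator. This version does not shuffle, which is a simplification; in practice the rows are usually shuffled before the folds are formed.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(test_idx)
```

Each observation appears in a test fold exactly once and in the training folds K-1 times.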

Answer 8

Regularization adds a penalty on model complexity to the loss function and minimizes this adjusted loss. Calibrating the penalty strength guards against overfitting (too little penalty) and underfitting (too much penalty), improving the model's generalizability.
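
One concrete instance of an adjusted loss is ridge regression, used here only as an assumed example since the summary does not name a specific method. For a single predictor without intercept, minimizing sum((y - b*x)^2) + lam * b^2 has the closed-form solution b = sum(x*y) / (sum(x^2) + lam).

```python
def ridge_slope(x, y, lam):
    """Slope minimizing sum((y - b*x)**2) + lam * b**2 (no intercept)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exactly y = 2x

unpenalized = ridge_slope(x, y, lam=0.0)   # ordinary least squares
penalized = ridge_slope(x, y, lam=14.0)    # coefficient shrunk toward zero

print(unpenalized, penalized)
```

A larger penalty shrinks the coefficient further, which lowers variance at the cost of some bias.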

Answer 9

Predicting earnings changes requires recognizing that financial statement data alone has limited predictive power, because earnings are also influenced by factors that the statements do not capture.

Answer 10

Challenges include difficulty estimating when the number of variables is close to or larger than the number of observations, leading to overfitting or high variance, and reduced interpretability.

Answer 11

Methods include subset selection, shrinkage (or regularization), and dimension reduction, which help improve predictive accuracy and manage model complexity.

Answer 12

Forward stepwise selection gradually adds the most significant variables, while backward stepwise selection starts with all variables and removes the least significant ones iteratively.

Answer 13

Hybrid methods combine forward and backward steps, adding significant variables and removing non-significant ones to balance model complexity and predictive performance.

Answer 14

AIC helps determine the best model by penalizing complexity; a lower AIC value indicates a better balance between model fit and complexity. For least-squares models, AIC is proportional to Mallows' Cp: Cp = (1/n)(RSS + 2dσ^2), where RSS is the residual sum of squares, d is the number of predictors, σ^2 is an estimate of the error variance, and n is the number of observations.
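
The Cp formula in this answer can be applied directly to compare candidate models. The RSS values, predictor counts, variance estimate, and sample size below are made-up numbers for illustration.

```python
def mallows_cp(rss, d, sigma2_hat, n):
    """Cp = (RSS + 2 * d * sigma^2) / n; lower is better."""
    return (rss + 2 * d * sigma2_hat) / n


n = 50            # observations (hypothetical)
sigma2_hat = 4.0  # estimated error variance (hypothetical)

cp_small = mallows_cp(rss=120.0, d=2, sigma2_hat=sigma2_hat, n=n)  # 2 predictors
cp_large = mallows_cp(rss=115.0, d=5, sigma2_hat=sigma2_hat, n=n)  # 5 predictors

print(cp_small, cp_large)
```

The larger model fits slightly better (lower RSS), but the complexity penalty outweighs the gain, so the smaller model has the lower Cp.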

Answer 15

It allows for identifying the most predictive variables, but can be computationally intensive, potentially lead to overfitting, and may not account for multicollinearity.

Answer 16

SQL is scalable, offers fast processing, and is often embedded into other programming languages.

Answer 17

A relational database allows access to related data points in different tables, using SQL for data manipulation.

Answer 18

"CREATE TABLE" is used to define a new table in SQL, specifying its columns, data types, and constraints.

Answer 19

Inner joins return only rows with matching values in both tables. Outer joins also keep unmatched rows: all rows from the left table (LEFT JOIN), the right table (RIGHT JOIN), or both (FULL OUTER JOIN). Columns from the table without a match are filled with NULL.
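
The difference shows up as soon as one row has no match. This sketch uses Python's built-in sqlite3 module with an in-memory database; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0);  -- Grace has no order
""")

inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(inner_rows)  # only the customer with an order
print(left_rows)   # every customer; missing amount comes back as None (NULL)
```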

Answer 20

Primary keys uniquely identify each record, while foreign keys establish relationships between tables.
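
A sketch of both key types using Python's built-in sqlite3 module. The schema is hypothetical, and the PRAGMA line is specific to SQLite, which only enforces foreign keys when that setting is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable enforcement

conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- uniquely identifies each customer
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)  -- foreign key
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1)")  # refers to an existing customer

try:
    conn.execute("INSERT INTO orders VALUES (11, 99)")  # customer 99 does not exist
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

The foreign key prevents an order from pointing at a customer that is not in the customers table.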

Answer 21

"NOT NULL" ensures a column cannot have null values, and "UNIQUE" ensures all values in a column are different.

Answer 22

Data warehouses store processed data organized for efficient querying and analysis, often in data marts.

Answer 23

XML uses a hierarchical tree structure with nested elements, while JSON is based on name-value pairs and arrays.
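
The same record in both formats, parsed with Python's standard library. The company record is invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# XML: nested elements, with extra detail carried in attributes.
xml_text = "<company><name>Acme</name><revenue currency='USD'>1200</revenue></company>"

# JSON: name-value pairs, with nesting through objects and arrays.
json_text = '{"name": "Acme", "revenue": {"currency": "USD", "value": 1200}}'

root = ET.fromstring(xml_text)
xml_revenue = int(root.find("revenue").text)        # element text is always a string
xml_currency = root.find("revenue").get("currency")

record = json.loads(json_text)
json_revenue = record["revenue"]["value"]           # JSON numbers arrive as numbers
json_currency = record["revenue"]["currency"]

print(xml_revenue, xml_currency, json_revenue, json_currency)
```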

Answer 24

XBRL instance documents tag financial data for electronic communication, making it machine-readable and standardized.

Answer 25

Taxonomy in XBRL acts like a dictionary, defining specific tags for different financial reporting elements.

Answer 26

Schema documents describe the structure of XBRL instance documents, listing taxonomies used and declaring unique elements.

Answer 27

Linkbases in XBRL link additional useful information to facts in instance documents, enhancing data usability.

Answer 28

Textual data analysis faces challenges like unstructured format, complexity in collection and analysis, and issues with detecting deceptive language.

Answer 29

Effective data visualization follows principles like maintaining informativeness, readability, viewer-centric design, and appropriate use of colors and contrast.

Answer 30

Visualization types like bar charts, histograms, scatter plots, and maps are chosen based on data characteristics like amounts, distributions, relationships, and geospatial information.

Answer 31

Tree-based models use if-then rules to generate predictions, applicable for both regression (predicting numerical values) and classification (predicting categorical values).
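
A fitted tree is equivalent to nested if-then rules. The two functions below are hand-written stand-ins for a small classification tree and a small regression tree; the variables and thresholds are hypothetical, not learned from data.

```python
def classify_default(income, debt_ratio):
    """Classification tree: each leaf returns a category."""
    if debt_ratio > 0.5:
        if income < 30000:
            return "default"
        return "no default"
    return "no default"


def predict_price(area_m2, has_garden):
    """Regression tree: each leaf returns a number (e.g. a leaf average)."""
    if area_m2 < 80:
        return 150000.0
    if has_garden:
        return 320000.0
    return 260000.0


print(classify_default(income=25000, debt_ratio=0.7))
print(predict_price(area_m2=120, has_garden=True))
```

A tree-fitting algorithm chooses the split variables and thresholds from training data; only the resulting rule structure is shown here.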

Answer 32

Tree pruning involves cutting back a large, complex tree to a simpler version, reducing overfitting and improving interpretability.

Answer 33

Ensemble methods, like bagging and random forests, combine multiple models to reduce variance, increase prediction accuracy, and prevent overfitting.

Answer 34

K-means clustering partitions data into K distinct, non-overlapping clusters based on similarity, useful for identifying natural groupings in data.
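
A minimal sketch of the standard iterative procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points). The starting centroids are passed in explicitly to keep the example deterministic, which is an assumption; they are normally chosen at random.

```python
def k_means(points, centroids, n_iter=10):
    """Alternate between assigning points and updating centroids."""
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(centroids)
print(clusters)
```

With K = 2, the three points near the origin form one cluster and the three points near (8, 8) form the other, and no point belongs to both.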

Answer 35

The Elbow method plots the within-cluster sum of squares against the number of clusters, helping to find the optimal number of clusters where the curve bends.

Answer 36

Challenges include selecting the number of clusters, data standardization, and the non-robustness of results to changes in the underlying data.

Answer 37

Loss functions assess prediction accuracy, with binary losses for classification and squared difference losses for regression.

Answer 38

Sensitivity (true positive rate) and specificity (true negative rate) measure a model's ability to correctly identify positive and negative cases, respectively.
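
Both rates follow from the four confusion-matrix counts. The labels and predictions below are invented for the example, with 1 as the positive class.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (true positive rate, true negative rate)."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1  # positive case correctly identified
            else:
                fn += 1  # positive case missed
        else:
            if p == positive:
                fp += 1  # negative case flagged as positive
            else:
                tn += 1  # negative case correctly identified
    return tp / (tp + fn), tn / (tn + fp)


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(sensitivity, specificity)
```

Here 3 of 4 positives are found (sensitivity 0.75) and 4 of 6 negatives are correctly rejected (specificity about 0.67).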

(62 cards)