Flash cards made by GPT based on summary

1
Q

What are the four Vs of big data?

A

Volume (vast amounts of data), Variety (different types of data), Velocity (speed of data generation and movement), and Veracity (quality of data).

2
Q

What is the difference between supervised and unsupervised learning in data analytics?

A

Supervised learning involves training a model on a labeled dataset, while unsupervised learning involves the model trying to understand and structure unlabeled data.

3
Q

Describe the OSEMN steps in data science.

A

Obtain (extract, import, scrape), Scrub (clean and manage), Explore, Model (analyze), and Interpret (communicate).

4
Q

What is the challenge in organizational data management?

A

The main challenge is making effective use of organizational data. Collecting data is not enough; it must be used effectively.

5
Q

What are the features of organizational data management systems?

A

Features include a storage medium, a common structure for the dataset, and an interface for rapid entry and retrieval. (NB: acknowledging trade-offs in system design is also important.)

6
Q

What are the types of organizational data management systems?

A

Types include Transaction Processing System, Management Information System, Decision Support Systems, Business Intelligence, Online Analytical Processing, Data Mining, and Machine Learning.

7
Q

What are desirable attributes of data in data management?

A

Data should be shareable, transportable, secure, accurate, timely, and relevant.

8
Q

How do XML and HTML differ in data exchange?

A

XML is an extensible language without predefined tags, used for data exchange, while HTML focuses more on presentation and formatting with predefined tags.

9
Q

What is JSON and its significance in data interchange?

A

JSON (JavaScript Object Notation) is a text-based, human-readable format widely used for data interchange in web applications. It’s based on JavaScript object syntax.

10
Q

Why is XBRL (Extensible Business Reporting Language) important in digital reporting?

A

XBRL standardizes digital reporting, making it more accurate and secure. It’s crucial for facilitating standardized information exchange.

11
Q

What are the main components of an XBRL instance?

A

An XBRL instance includes values (text or numbers), context and variables (like entity, period, decimals, currency), concepts (business terms representation), and a dictionary (linking concepts to business terms).

12
Q

How are data attributes stored in XBRL elements?

A

Attributes in XBRL elements are stored as dimensions, including label, ID, definition, type, period, balance, and reference.

13
Q

What is the role of schema documents in XBRL?

A

Schema documents in XBRL describe the structure of an XBRL instance document, list the taxonomies used, and declare unique company-specific elements with attributes.

14
Q

What are some common mistakes to avoid in data visualization?

A

Avoid having too much or too little information, inconsistency, ignoring the limits of human perception, misrepresenting data, using inappropriate data, and bad taste.

15
Q

What are the different types of data in visualization?

A

Data types include items (individual entities), attributes (properties measured or observed), links (relationships between items), and positions (spatial data providing location).

16
Q

What is the significance of data visualization in analyzing data?

A

Data visualization helps in finding relationships, discovering structure, quantifying values and influences, and effectively communicating data insights.

17
Q

Explain the concept of ‘Anscombe’s quartet’ in data visualization.

A

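Anscombe’s quartet is a set of four datasets with nearly identical summary statistics that look very different when plotted. It demonstrates how a lack of visualization can lead to misinterpretation of data, and highlights the effects of outliers and influential observations on statistical properties.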

18
Q

What is the CRAP principle in data visualization?

A

CRAP stands for Contrast, Repetition, Alignment, and Proximity, guiding the effective design and layout of visual data presentations.

19
Q

How does the ‘Gestalt principle’ apply to data visualization?

A

The Gestalt principle suggests that our brain tends to organize visual elements into structured groups based on proximity, similarity, connection, and other factors.

20
Q

What is Exploratory Data Analysis (EDA)?

A

EDA is a method used to analyze, investigate, and summarize the main characteristics of data sets, often using visualization methods.

21
Q

Define Sampling Bias and its types.

A

Sampling bias is a systematic error in selecting participants for a sample. Types include self-selection, nonresponse, undercoverage, and survivorship biases.

22
Q

What are the key elements of Descriptive Statistics?

A

Key elements include typical values (mean, median), variation (standard deviation), distribution (skewness, quantiles), abnormalities (outliers, missing values), and variable relationships (correlation).

23
Q

How do mean and median differ, and when is each more appropriate?

A

Mean is the average, best for symmetric distributions without outliers. Median is the middle value, better for skewed distributions or with outliers.
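
A small sketch of why the median is preferred when outliers are present. The salary figures are made up for illustration.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands); the last value is an outlier.
salaries = [30, 32, 35, 36, 40, 400]

avg = mean(salaries)    # pulled upward by the single outlier
mid = median(salaries)  # middle of the sorted values, barely affected

print(avg, mid)  # 95.5 35.5
```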

24
Q

Explain the concept of a Boxplot.

A

A Boxplot displays the distribution of data based on five summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
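
The five summary statistics can be computed as in the sketch below. Quartile conventions differ between tools, so `method="inclusive"` is an assumption here, and many plotting libraries draw whiskers at 1.5 times the interquartile range rather than at the minimum and maximum.

```python
from statistics import quantiles

# Illustrative data; quantiles() sorts it internally.
data = [7, 1, 9, 3, 5, 2, 8, 4, 6]

# n=4 returns the three cut points Q1, median, Q3.
q1, med, q3 = quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, med, q3, max(data))

print(five_number_summary)  # (1, 3.0, 5.0, 7.0, 9)
```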

25
Q

What is Skewness in data?

A

Skewness measures the asymmetry of a data distribution. Positive skew indicates right-skewed data, negative skew left-skewed, and zero skew indicates symmetry.
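
One common way to compute skewness is the moment coefficient (third central moment divided by the standard deviation cubed). The sketch below assumes that definition; other variants with small-sample corrections exist.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 10]               # long tail to the right
left_skewed = [-v for v in right_skewed]      # mirror image
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0, skewness(left_skewed) < 0, skewness(symmetric))
```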

26
Q

How are outliers and missing values managed in data analysis?

A

Outliers can be identified and managed through methods like trimming or winsorizing. Missing values can be replaced, discarded, or analyzed further.
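
A minimal sketch of trimming versus winsorizing, assuming the `k` smallest and `k` largest values are treated as outliers (real analyses usually pick the cutoffs by percentile).

```python
def trim(values, k=1):
    """Drop the k smallest and k largest values."""
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]

def winsorize(values, k=1):
    """Replace the k smallest and k largest values with the nearest retained value."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

data = [2, 3, 4, 5, 100]  # 100 is an obvious outlier

print(trim(data))       # [3, 4, 5]
print(winsorize(data))  # [3, 3, 4, 5, 5]
```

Trimming shrinks the sample, while winsorizing keeps the sample size and only caps the extreme values.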

27
Q

What is the primary purpose of machine learning in analytics?

A

Machine learning is used to estimate a prediction function relating inputs to outputs, using training data and various methods for feature extraction and model evaluation.

28
Q

Describe the concept of loss functions in evaluating predictions.

A

Loss functions assess the quality of a prediction function. In classification, the loss is typically binary (a prediction is either wrong or correct); in regression, it is typically the squared difference between predicted and actual values.
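
The two losses can be sketched as follows, averaged over a set of predictions. The function names and sample values are illustrative only.

```python
def zero_one_loss(y_true, y_pred):
    """Share of wrong classifications (loss 1 if wrong, 0 if correct)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

class_loss = zero_one_loss(["a", "b", "a", "b"], ["a", "b", "b", "b"])  # 1 of 4 wrong
reg_loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])         # (0 + 0 + 4) / 3

print(class_loss, reg_loss)
```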

29
Q

What is the significance of the bias-variance tradeoff in machine learning?

A

The bias-variance tradeoff is the balance between error from overly simple assumptions (bias) and error from sensitivity to the particular training data (variance). Simple models tend to have high bias and underfit, while complex models tend to have high variance and overfit.

30
Q

How do training and test sets function in machine learning?

A

The training set is used to train the model, and the test set, which contains unseen data, is used to evaluate the model’s performance and estimate how it will perform on real-world data.
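
A minimal sketch of a random train/test split. The 20% test share and the fixed seed are assumptions chosen for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train, test = train_test_split(rows)

print(len(train), len(test))  # 8 2
```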

31
Q

Explain K-fold cross-validation.

A

In K-fold cross-validation, data is divided into K equal parts. The model is trained on K-1 parts and tested on the remaining part. This process repeats K times, averaging the results for stability.
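
The procedure can be sketched as below. To stay self-contained, the "model" is a stand-in that predicts the mean of the training targets, and the data is not shuffled; both are simplifying assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validated_mse(y, k):
    """Average test MSE over k folds for a predict-the-training-mean model."""
    fold_scores = []
    for train, test in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train) / len(train)
        fold_scores.append(sum((y[i] - prediction) ** 2 for i in test) / len(test))
    return sum(fold_scores) / k

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
score = cross_validated_mse(y, k=3)

print(score)  # 6.25
```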

32
Q

What is regularization in machine learning?

A

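Regularization adds a penalty on model complexity (for example, on the size of the coefficients) to the loss function. Minimizing this adjusted loss helps prevent overfitting and improves the model’s generalizability. The penalty strength must be tuned, since too much regularization leads to underfitting.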

33
Q

What are the key considerations in using machine learning for predicting earnings changes?

A

Predicting earnings changes requires recognizing the limited predictive power of financial statement data alone, as earnings are influenced by various factors beyond the data.

34
Q

What challenges arise with least squares estimation in regression?

A

Challenges include difficulty estimating when the number of variables is close to or larger than the number of observations, leading to overfitting or high variance, and reduced interpretability.

35
Q

What are the methods to reduce model complexity in regression?

A

Methods include subset selection, shrinkage (or regularization), and dimension reduction, which help improve predictive accuracy and manage model complexity.

36
Q

Describe the concept of forward and backward stepwise selection in regression.

A

Forward stepwise selection gradually adds the most significant variables, while backward stepwise selection starts with all variables and removes the least significant ones iteratively.

37
Q

How do hybrid methods work?

A

Hybrid methods combine forward and backward steps, adding significant variables and removing non-significant ones to balance model complexity and predictive performance.

38
Q

How does the Akaike Information Criterion (AIC) function in model selection?

A

AIC helps determine the best model by penalizing complexity; a lower AIC value indicates a better balance between model fit and complexity. For least squares models, AIC is proportional to Mallows’ Cp: Cp = (RSS + 2·d·σ̂²) / n, where RSS is the residual sum of squares, d is the number of predictors, σ̂² is an estimate of the error variance, and n is the number of observations.
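
The formula on this card is Mallows' Cp, which is proportional to AIC for least squares models. A small sketch of using it to compare two candidate models follows; the RSS values, predictor counts, and variance estimate are hypothetical numbers.

```python
def mallows_cp(rss, n, d, sigma2):
    """Cp = (RSS + 2 * d * sigma^2) / n; lower is better."""
    return (rss + 2 * d * sigma2) / n

n = 50         # observations (hypothetical)
sigma2 = 2.5   # estimated error variance (hypothetical)

cp_small = mallows_cp(rss=120.0, n=n, d=2, sigma2=sigma2)  # fewer predictors, slightly worse fit
cp_large = mallows_cp(rss=115.0, n=n, d=5, sigma2=sigma2)  # more predictors, slightly better fit

print(cp_small, cp_large)  # 2.6 2.8
```

Here the larger model fits a little better, but the complexity penalty outweighs the gain, so the smaller model is preferred.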

39
Q

What are the advantages and drawbacks of best subset selection in regression?

A

It allows for identifying the most predictive variables, but can be computationally intensive, potentially lead to overfitting, and may not account for multicollinearity.

40
Q

What are the main advantages of SQL for relational models?

A

SQL is scalable, offers fast processing, and is often embedded into other programming languages.

41
Q

Explain the concept of a relational database.

A

A relational database allows access to related data points in different tables, using SQL for data manipulation.

42
Q

Describe the “CREATE TABLE” SQL command and its purpose.

A

“CREATE TABLE” is used to define a new table in SQL, specifying its columns, data types, and constraints.
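
A minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The `customers` table and its columns are made up for illustration, and type names vary between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define a table: column names, data types, and constraints.
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT UNIQUE
    )
""")

# PRAGMA table_info is SQLite-specific; the second field of each row is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
print(columns)  # ['id', 'name', 'email']
```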

43
Q

How do inner and outer joins differ in SQL?

A

Inner joins return only rows with matching values in both tables. Outer joins also keep unmatched rows: a LEFT (or RIGHT) join keeps every row from the left (or right) table, and a FULL join keeps every row from both. Columns with no match are filled with NULL.
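
The difference can be seen in a small sketch with SQLite via Python's `sqlite3` module. The tables and rows are hypothetical, and only a LEFT outer join is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 2, 5.0);
""")

query = """
    SELECT c.name, o.amount
    FROM customers AS c
    {join} orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
"""
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

print(inner_rows)  # Cy has no orders, so Cy is missing
print(left_rows)   # Cy appears with amount None (NULL)
```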

44
Q

What are the functions of primary and foreign keys in database tables?

A

Primary keys uniquely identify each record, while foreign keys establish relationships between tables.

45
Q

How do constraints like “NOT NULL” and “UNIQUE” function in SQL?

A

“NOT NULL” ensures a column cannot have null values, and “UNIQUE” ensures all values in a column are different.
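
A short sketch of both constraints rejecting bad rows, again using SQLite through Python's `sqlite3` module. The `users` table and the helper `try_insert` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

def try_insert(email):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        return False

null_accepted = try_insert(None)                  # violates NOT NULL
duplicate_accepted = try_insert("a@example.com")  # violates UNIQUE
new_accepted = try_insert("b@example.com")        # satisfies both constraints

print(null_accepted, duplicate_accepted, new_accepted)  # False False True
```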

46
Q

Explain the role of a data warehouse in data management.

A

Data warehouses store processed data organized for efficient querying and analysis, often in data marts.

47
Q

How do XML and JSON differ in structuring data?

A

XML uses a hierarchical tree structure with nested elements, while JSON is based on name-value pairs and arrays.
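
The same record in both formats, parsed with Python's standard library. The company record and its field names are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# XML: nested elements, with the currency carried as an attribute.
xml_text = '<company><name>Acme</name><revenue currency="USD">1200</revenue></company>'
# JSON: name-value pairs, with a nested object for revenue.
json_text = '{"name": "Acme", "revenue": {"currency": "USD", "value": 1200}}'

root = ET.fromstring(xml_text)
xml_revenue = int(root.find("revenue").text)   # element text is always a string
xml_currency = root.find("revenue").get("currency")

record = json.loads(json_text)
json_revenue = record["revenue"]["value"]      # already parsed as a number
json_currency = record["revenue"]["currency"]

print(xml_revenue, xml_currency, json_revenue, json_currency)
```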

48
Q

How do XBRL instance documents work?

A

XBRL instance documents tag financial data for electronic communication, making it machine-readable and standardized.

49
Q

Describe the function of taxonomy in XBRL.

A

Taxonomy in XBRL acts like a dictionary, defining specific tags for different financial reporting elements.

50
Q

What role do schema documents play in XBRL?

A

Schema documents describe the structure of XBRL instance documents, listing taxonomies used and declaring unique elements.

51
Q

Explain the concept of linkbases in XBRL.

A

Linkbases in XBRL link additional useful information to facts in instance documents, enhancing data usability.

52
Q

What are the challenges in analyzing textual data in financial contexts?

A

Textual data analysis faces challenges like unstructured format, complexity in collection and analysis, and issues with detecting deceptive language.

53
Q

What are key principles of effective data visualization?

A

Effective data visualization follows principles like maintaining informativeness, readability, viewer-centric design, and appropriate use of colors and contrast.

54
Q

How do different visualization types cater to various data characteristics?

A

Visualization types like bar charts, histograms, scatter plots, and maps are chosen based on data characteristics like amounts, distributions, relationships, and geospatial information.

55
Q

What is the role of tree-based models in analytics?

A

Tree-based models use if-then rules to generate predictions, applicable for both regression (predicting numerical values) and classification (predicting categorical values).

56
Q

Explain the concept of tree pruning in decision trees.

A

Tree pruning involves cutting back a large, complex tree to a simpler version, reducing overfitting and improving interpretability.

57
Q

How do ensemble methods like bagging and random forests improve prediction?

A

Ensemble methods, like bagging and random forests, combine multiple models to reduce variance, increase prediction accuracy, and prevent overfitting.

58
Q

Describe K-means clustering and its purpose.

A

K-means clustering partitions data into K distinct, non-overlapping clusters based on similarity, useful for identifying natural groupings in data.
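
A bare-bones sketch of the usual iterative procedure (assign each point to the nearest centroid, then recompute centroids). Starting from the first `k` points is a simplification made for reproducibility; real implementations use random or smarter initialization.

```python
def k_means(points, k, iterations=10):
    """Return (centroids, clusters) after a fixed number of assign/update rounds."""
    centroids = list(points[:k])  # naive initialization: the first k points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)

print(clusters[0])  # the three points near (1, 1)
print(clusters[1])  # the three points near (9, 9)
```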

59
Q

How does the Elbow method aid in determining the number of clusters in K-means?

A

The Elbow method plots the within-cluster sum of squares against the number of clusters, helping to find the optimal number of clusters where the curve bends.
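
A sketch of the numbers behind an elbow plot, using a tiny one-dimensional K-means on made-up data with three obvious groups. The evenly spaced initialization is an assumption chosen to keep the example deterministic.

```python
def k_means_1d(values, k, iterations=20):
    """Tiny 1-D K-means; initial centroids are evenly spaced sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

def wcss(values, k):
    """Within-cluster sum of squares for a K-means fit with k clusters."""
    centroids, clusters = k_means_1d(values, k)
    return sum((v - c) ** 2 for c, cluster in zip(centroids, clusters) for v in cluster)

data = [1, 2, 3, 11, 12, 13, 21, 22, 23]  # three well-separated groups
scores = {k: wcss(data, k) for k in range(1, 5)}

for k, score in scores.items():
    print(k, round(score, 2))
```

The score drops sharply up to k = 3 and barely improves afterwards, which is the bend the Elbow method looks for.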

60
Q

What are the challenges in using clustering methods?

A

Challenges include selecting the number of clusters, data standardization, and the non-robustness of results to changes in the underlying data.

61
Q

How do loss functions in machine learning evaluate prediction quality?

A

Loss functions assess prediction accuracy, with binary losses for classification and squared difference losses for regression.

62
Q

Describe the significance of sensitivity and specificity in model evaluation.

A

Sensitivity (true positive rate) and specificity (true negative rate) measure a model’s ability to correctly identify positive and negative cases, respectively.
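
Both rates follow directly from the confusion-matrix counts, as in the sketch below. Labels are assumed to be coded 1 for positive and 0 for negative, and the sample predictions are invented.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # positives correctly found
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives missed
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # negatives correctly found
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives flagged as positive
    return tp / (tp + fn), tn / (tn + fp)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)

print(sensitivity, specificity)  # 0.75 0.75
```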