Flashcards made by GPT, based on a summary

1
Q

What are the four Vs of big data?

A

Volume (vast amounts of data), Variety (different types of data), Velocity (speed of data generation and movement), and Veracity (quality of data).

2
Q

What is the difference between supervised and unsupervised learning in data analytics?

A

Supervised learning involves training a model on a labeled dataset, while unsupervised learning involves the model trying to understand and structure unlabeled data.

3
Q

Describe the OSEMN steps in data science.

A

Obtain (extract, import, scrape), Scrub (clean and manage), Explore, Model (analyze), and Interpret (communicate).

4
Q

What is the challenge in organizational data management?

A

The main challenge is making effective use of organizational data. Collecting data is not enough; it must be used effectively.

5
Q

What are the features of organizational data management systems?

A

Features include a storage medium, a common structure for the dataset, and an interface for rapid entry and retrieval. (NB: acknowledging trade-offs in system design is also important.)

6
Q

What are the types of organizational data management systems?

A

Types include Transaction Processing System, Management Information System, Decision Support Systems, Business Intelligence, Online Analytical Processing, Data Mining, and Machine Learning.

7
Q

What are desirable attributes of data in data management?

A

Data should be shareable, transportable, secure, accurate, timely, and relevant.

8
Q

How do XML and HTML differ in data exchange?

A

XML is an extensible language without predefined tags, used for data exchange, while HTML focuses more on presentation and formatting with predefined tags.

9
Q

What is JSON and its significance in data interchange?

A

JSON (JavaScript Object Notation) is a text-based, human-readable format used universally for web applications’ data interchange. It’s based on JavaScript object syntax.
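
As a minimal sketch of the interchange idea, the snippet below serializes a Python dictionary to JSON text and parses it back with the standard `json` module. The `invoice` record is hypothetical and exists only for illustration.

```python
import json

# Hypothetical record, used only for illustration.
invoice = {"id": 17, "customer": "Acme", "paid": False, "lines": [10.5, 4.0]}

text = json.dumps(invoice)    # Python dict -> JSON text
restored = json.loads(text)   # JSON text -> Python dict

print(text)
print(restored == invoice)
```

Note how the Python value `False` becomes the JSON literal `false` in the text form, which any other language's JSON parser can read.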

10
Q

Why is XBRL (Extensible Business Reporting Language) important in digital reporting?

A

XBRL standardizes digital reporting, making it more accurate and secure. It’s crucial for facilitating standardized information exchange.

11
Q

What are the main components of an XBRL instance?

A

An XBRL instance includes values (text or numbers), context and variables (like entity, period, decimals, currency), concepts (business terms representation), and a dictionary (linking concepts to business terms).

12
Q

How are data attributes stored in XBRL elements?

A

Attributes in XBRL elements are stored as dimensions, including label, ID, definition, type, period, balance, and reference.

13
Q

What is the role of schema documents in XBRL?

A

Schema documents in XBRL describe the structure of an XBRL instance document, list the taxonomies used, and declare unique company-specific elements with attributes.

14
Q

What are some common mistakes to avoid in data visualization?

A

Avoid having too much or too little information, inconsistency, ignoring the limits of human perception, misrepresenting data, using inappropriate data, and bad taste.

15
Q

What are the different types of data in visualization?

A

Data types include items (individual entities), attributes (properties measured or observed), links (relationships between items), and positions (spatial data providing location).

16
Q

What is the significance of data visualization in analyzing data?

A

Data visualization helps in finding relationships, discovering structure, quantifying values and influences, and effectively communicating data insights.

17
Q

Explain the concept of ‘Anscombe’s quartet’ in data visualization.

A

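Anscombe’s quartet consists of four datasets with nearly identical summary statistics (means, variances, correlation, regression line) that look very different when plotted. It demonstrates how skipping visualization can lead to misinterpretation of data, and highlights the effects of outliers and influential observations on statistical properties.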
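
The point can be checked numerically. The sketch below uses the commonly reproduced values for the first two of the four datasets (they share the same x values) and compares their means and Pearson correlations. The `pearson` helper is written here for illustration and is not taken from the summary.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation computed from sums of squared deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Datasets I and II of the quartet, as commonly reproduced.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in (("I", y1), ("II", y2)):
    print(name, round(mean(y), 2), round(pearson(x, y), 2))
```

Both datasets print almost the same mean and correlation, yet a scatter plot shows a roughly linear cloud for dataset I and a clear curve for dataset II.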

18
Q

What is the CRAP principle in data visualization?

A

CRAP stands for Contrast, Repetition, Alignment, and Proximity, guiding the effective design and layout of visual data presentations.

19
Q

How does the ‘Gestalt principle’ apply to data visualization?

A

The Gestalt principle suggests that our brain tends to organize visual elements into structured groups based on proximity, similarity, connection, and other factors.

20
Q

What is Exploratory Data Analysis (EDA)?

A

EDA is a method used to analyze, investigate, and summarize data sets’ main characteristics, often using visualization methods.

21
Q

Define Sampling Bias and its types.

A

Sampling bias is a systematic error in selecting participants for a sample. Types include self-selection, nonresponse, undercoverage, and survivorship biases.

22
Q

What are the key elements of Descriptive Statistics?

A

Key elements include typical values (mean, median), variation (standard deviation), distribution (skewness, quantiles), abnormalities (outliers, missing values), and variable relationships (correlation).

23
Q

How do mean and median differ, and when is each more appropriate?

A

Mean is the average, best for symmetric distributions without outliers. Median is the middle value, better for skewed distributions or with outliers.
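
A small illustration of why the median is more robust: the hypothetical salary figures below (in thousands) are symmetric until one extreme value is added.

```python
from statistics import mean, median

# Hypothetical salaries in thousands, symmetric around 35.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]  # one extreme value added

print(mean(salaries), median(salaries))          # 35 35
print(mean(with_outlier), median(with_outlier))  # mean jumps above 95, median is 36.5
```

A single outlier nearly triples the mean, while the median barely moves.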

24
Q

Explain the concept of a Boxplot.

A

A Boxplot displays the distribution of data based on five summary statistics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
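
The five statistics behind a boxplot can be computed directly. This is a sketch using the standard library; quartile conventions differ between tools, and the `"inclusive"` method is just one assumed choice. Many plotting tools also draw whiskers only up to 1.5 times the interquartile range and show points beyond that as outliers.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 9)
```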

25
What is Skewness in data?
Skewness measures the asymmetry of a data distribution. Positive skew indicates right-skewed data, negative skew left-skewed, and zero skew indicates symmetry.
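One common way to quantify this is the moment-based skewness coefficient, the third central moment divided by the 1.5 power of the second. The sketch below assumes that definition; statistical packages may apply small-sample corrections and give slightly different numbers.

```python
from statistics import mean

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))        # symmetric data: 0.0
print(skewness([1, 1, 1, 2, 10]))       # long right tail: positive
print(skewness([-10, -2, -1, -1, -1]))  # long left tail: negative
```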
26
How are outliers and missing values managed in data analysis?
Outliers can be identified and managed through methods like trimming or winsorizing. Missing values can be replaced, discarded, or analyzed further.
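The difference between the two outlier treatments can be sketched as follows. The bounds are passed in by hand here as an assumption; in practice they are usually chosen as percentiles of the data.

```python
def winsorize(values, lower, upper):
    """Clamp each value into [lower, upper]; no observation is dropped."""
    return [min(max(v, lower), upper) for v in values]

def trim(values, lower, upper):
    """Drop every value outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

data = [2, 3, 3, 4, 5, 120]
print(winsorize(data, 2, 10))  # [2, 3, 3, 4, 5, 10]
print(trim(data, 2, 10))       # [2, 3, 3, 4, 5]
```

Winsorizing keeps the sample size intact, while trimming shrinks it.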
27
What is the primary purpose of machine learning in analytics?
Machine learning is used to estimate a prediction function relating inputs to outputs, using training data and various methods for feature extraction and model evaluation.
28
Describe the concept of loss functions in evaluating predictions.
Loss functions assess the quality of prediction functions. In classification the loss is typically binary (correct or wrong, the 0-1 loss); in regression it is typically based on the squared difference between predictions and actual values.
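Both losses are short enough to write out. This is a minimal sketch that averages the loss over all observations; the function names and example values are illustrative.

```python
def zero_one_loss(y_true, y_pred):
    """Share of misclassified observations (binary loss, averaged)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(zero_one_loss(["a", "b", "a", "a"], ["a", "a", "a", "b"]))  # 0.5
print(mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))       # (1 + 0 + 4) / 3
```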
29
What is the significance of the bias-variance tradeoff in machine learning?
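The bias-variance tradeoff is the balance between error from overly simple model assumptions (bias) and error from sensitivity to the particular training sample (variance). High bias can lead to underfitting, while high variance can lead to overfitting; more complex models typically lower bias but raise variance.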
30
How do training and test sets function in machine learning?
The training set is used to fit the model, and the test set, which contains data the model has not seen, is used to evaluate its performance and estimate how it will perform in real-world use.
31
Explain K-fold cross-validation.
In K-fold cross-validation, data is divided into K equal parts. The model is trained on K-1 parts and tested on the remaining part. This process repeats K times, averaging the results for stability.
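The splitting step can be sketched in plain Python. This version does not shuffle the data first, which is a simplifying assumption; real implementations usually shuffle or stratify. When the sample size is not divisible by K, the first folds get one extra observation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train size:", len(train_idx))
```

Each observation lands in exactly one test fold, so every data point is used for evaluation once and for training K-1 times.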
32
What is regularization in machine learning?
Regularization adds a penalty on model complexity to the loss function, and the model is fit by minimizing this adjusted loss. The penalty strength is tuned to prevent overfitting without pushing the model into underfitting, which improves the model's generalizability.
33
What are the key considerations in using machine learning for predicting earnings changes?
Predicting earnings changes requires recognizing that financial statement data alone has limited predictive power, because earnings are influenced by many factors outside that data.
34
What challenges arise with least squares estimation in regression?
Challenges include difficulty estimating when the number of variables is close to or larger than the number of observations, leading to overfitting or high variance, and reduced interpretability.
35
What are the methods to reduce model complexity in regression?
Methods include subset selection, shrinkage (or regularization), and dimension reduction, which help improve predictive accuracy and manage model complexity.
36
Describe the concept of forward and backward stepwise selection in regression.
Forward stepwise selection gradually adds the most significant variables, while backward stepwise selection starts with all variables and removes the least significant ones iteratively.
37
How do hybrid methods work?
Hybrid methods combine forward and backward steps, adding significant variables and removing non-significant ones to balance model complexity and predictive performance.
38
How does the Akaike Information Criterion (AIC) function in model selection?
AIC helps determine the best model by penalizing complexity; a lower AIC value indicates a better balance between model fit and complexity. For least-squares models, AIC is proportional to Mallows' $C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$, where RSS is the residual sum of squares, $d$ the number of predictors, $\hat{\sigma}^2$ an estimate of the error variance, and $n$ the number of observations.
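The Cp formula in this card can be evaluated directly. In the sketch below all numbers are hypothetical: a small model with 2 predictors is compared with a larger one that lowers RSS only slightly.

```python
def mallows_cp(rss, d, sigma2_hat, n):
    """Cp = (RSS + 2 * d * sigma_hat^2) / n, as given in the card."""
    return (rss + 2 * d * sigma2_hat) / n

# Hypothetical fits on n = 100 observations with error variance estimate 1.0.
small = mallows_cp(rss=120.0, d=2, sigma2_hat=1.0, n=100)  # (120 + 4) / 100
large = mallows_cp(rss=118.0, d=6, sigma2_hat=1.0, n=100)  # (118 + 12) / 100

print(small, large)
print("prefer small model:", small < large)
```

The larger model fits slightly better, but the penalty for four extra predictors outweighs the gain, so the smaller model has the lower criterion value.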
39
What are the advantages and drawbacks of best subset selection in regression?
It allows for identifying the most predictive variables, but can be computationally intensive, potentially lead to overfitting, and may not account for multicollinearity.
40
What are the main advantages of SQL for relational models?
SQL is scalable, offers fast processing, and is often embedded into other programming languages.
41
Explain the concept of a relational database.
A relational database stores data in tables that are linked through shared keys, allowing access to related data points across tables, with SQL used for querying and data manipulation.
42
Describe the "CREATE TABLE" SQL command and its purpose.
"CREATE TABLE" is used to define a new table in SQL, specifying its columns, data types, and constraints.
43
How do inner and outer joins differ in SQL?
Inner joins return only rows with matching values in both tables. Outer joins also keep unmatched rows from the left table, the right table, or both (LEFT, RIGHT, and FULL OUTER JOIN); columns from the table without a match are filled with NULL.
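The difference is easiest to see on a tiny example. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `customers` and `orders` tables and their contents are hypothetical. Only a LEFT JOIN is shown as the outer join, since older SQLite versions lack RIGHT and FULL joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0);  -- Bob has no order
""")

inner_rows = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()
left_rows = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()
conn.close()

print(inner_rows)  # [('Ann', 25.0)]
print(left_rows)   # [('Ann', 25.0), ('Bob', None)]
```

Bob disappears from the inner join but is kept by the left join, with NULL (shown as `None` in Python) for the missing order amount.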
44
What are the functions of primary and foreign keys in database tables?
Primary keys uniquely identify each record, while foreign keys establish relationships between tables.
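A sketch of both key types being enforced, again using `sqlite3` with hypothetical `departments` and `employees` tables. One SQLite-specific assumption: foreign keys are only enforced after `PRAGMA foreign_keys = ON`; other database systems typically enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department_id INTEGER REFERENCES departments(id)
    );
    INSERT INTO departments VALUES (1, 'Audit');
    INSERT INTO employees VALUES (100, 'Ann', 1);
""")

attempts = [
    "INSERT INTO departments VALUES (1, 'Tax')",      # duplicate primary key
    "INSERT INTO employees VALUES (101, 'Bob', 99)",  # department 99 does not exist
]
errors = []
for statement in attempts:
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError as exc:
        errors.append(str(exc))
conn.close()

print(errors)
```

Both inserts are rejected: the first would break the uniqueness of the primary key, the second would point to a department that does not exist.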
45
How do constraints like "NOT NULL" and "UNIQUE" function in SQL?
"NOT NULL" ensures a column cannot have null values, and "UNIQUE" ensures all values in a column are different.
46
Explain the role of a data warehouse in data management.
Data warehouses store processed data organized for efficient querying and analysis, often split into subject-specific data marts.
47
How do XML and JSON differ in structuring data?
XML uses a hierarchical tree structure with nested elements, while JSON is based on name-value pairs and arrays.
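The same hypothetical record is written both ways below and read with Python's standard library. In XML the currency is an attribute on a nested element; in JSON it is one more name-value pair in a nested object.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record expressed in both formats.
xml_text = "<company><name>Acme</name><revenue currency='EUR'>1200</revenue></company>"
json_text = '{"name": "Acme", "revenue": {"currency": "EUR", "value": 1200}}'

# XML: walk the element tree; text content arrives as a string.
root = ET.fromstring(xml_text)
xml_revenue = int(root.find("revenue").text)
xml_currency = root.find("revenue").get("currency")

# JSON: parse into dicts and lists; numbers arrive already typed.
record = json.loads(json_text)
json_revenue = record["revenue"]["value"]
json_currency = record["revenue"]["currency"]

print(xml_revenue, xml_currency)
print(json_revenue, json_currency)
```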
48
How do XBRL instance documents work?
XBRL instance documents tag financial data for electronic communication, making it machine-readable and standardized.
49
Describe the function of taxonomy in XBRL.
Taxonomy in XBRL acts like a dictionary, defining specific tags for different financial reporting elements.
50
What role do schema documents play in XBRL?
Schema documents describe the structure of XBRL instance documents, listing taxonomies used and declaring unique elements.
51
Explain the concept of linkbases in XBRL.
Linkbases in XBRL link additional useful information to facts in instance documents, enhancing data usability.
52
What are the challenges in analyzing textual data in financial contexts?
Textual data analysis faces challenges like unstructured format, complexity in collection and analysis, and issues with detecting deceptive language.
53
What are key principles of effective data visualization?
Effective data visualization follows principles like maintaining informativeness, readability, viewer-centric design, and appropriate use of colors and contrast.
54
How do different visualization types cater to various data characteristics?
Visualization types like bar charts, histograms, scatter plots, and maps are chosen based on data characteristics like amounts, distributions, relationships, and geospatial information.
55
What is the role of tree-based models in analytics?
Tree-based models use if-then rules to generate predictions, applicable for both regression (predicting numerical values) and classification (predicting categorical values).
56
Explain the concept of tree pruning in decision trees.
Tree pruning involves cutting back a large, complex tree to a simpler version, reducing overfitting and improving interpretability.
57
How do ensemble methods like bagging and random forests improve prediction?
Ensemble methods, like bagging and random forests, combine multiple models to reduce variance, increase prediction accuracy, and prevent overfitting.
58
Describe K-means clustering and its purpose.
K-means clustering partitions data into K distinct, non-overlapping clusters based on similarity, useful for identifying natural groupings in data.
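The usual iterative procedure (assign each point to its nearest center, then move each center to the mean of its points) can be sketched for one-dimensional data. The starting centers are passed in by hand as a simplifying assumption; real implementations pick them randomly and rerun several times.

```python
def k_means_1d(points, centers, iterations=10):
    """Alternate assignment and update steps for 1-D points."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = k_means_1d(points, centers=[0.0, 5.0])
print(centers)   # [1.5, 10.0]
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```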
59
How does the Elbow method aid in determining the number of clusters in K-means?
The Elbow method plots the within-cluster sum of squares against the number of clusters, helping to find the optimal number of clusters where the curve bends.
60
What are the challenges in using clustering methods?
Challenges include selecting the number of clusters, data standardization, and the non-robustness of results to changes in the underlying data.
61
How do loss functions in machine learning evaluate prediction quality?
Loss functions assess prediction accuracy, with binary losses for classification and squared difference losses for regression.
62
Describe the significance of sensitivity and specificity in model evaluation.
Sensitivity (true positive rate) and specificity (true negative rate) measure a model's ability to correctly identify positive and negative cases, respectively.
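Both rates follow from the four confusion-matrix counts. The sketch below assumes labels coded as 1 (positive) and 0 (negative); the example labels are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# 4 actual positives (3 caught), 6 actual negatives (4 correctly rejected).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.75 and 4/6
```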