Test 1 Flashcards

1
Q

What are the six phases of CRISP-DM?

A

1) Business Understanding 2) Data Understanding 3) Data Preparation 4) Modeling 5) Evaluation 6) Deployment

2
Q

What is the main purpose of the Data Understanding phase in CRISP-DM?

A

To evaluate the raw material (data) for the project and perform Exploratory Data Analysis (EDA).

3
Q

What are the different types of decisions in data science?

A

1) Structured 2) Semi-Structured 3) Unstructured.

4
Q

What are common types of data visualizations?

A

Scatterplots, bar charts, cross tabs, pie charts, histograms, etc.

5
Q

What is the difference between operational and analytic data stores?

A

Operational data stores support day-to-day operations, while analytic data stores support decision-making and analysis.

6
Q

What is the purpose of database normalization?

A

To reduce data redundancy and improve data integrity by organizing the database into tables.

7
Q

What is UMLS?

A

Unified Medical Language System, often used to integrate and manage medical terminologies and relationships in databases.

8
Q

What do associations and multiplicities represent in database design?

A

Associations represent relationships between entities, and multiplicities define the number of instances in those relationships (e.g., one-to-many, many-to-many).

9
Q

What are the key SQL concepts?

A

SELECT, JOIN, WHERE, GROUP BY, INSERT, UPDATE, DELETE, and more.
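
As a rough illustration of these statements, here is a minimal sketch using Python's built-in sqlite3 module. The orders table and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database; the "orders" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# INSERT: add rows.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Ann", 20.0), ("Ann", 35.0), ("Bob", 10.0), ("Cy", 5.0)],
)

# UPDATE: change existing rows that satisfy a WHERE condition.
conn.execute("UPDATE orders SET amount = 12.0 WHERE customer = 'Bob'")

# DELETE: remove rows that satisfy a WHERE condition.
conn.execute("DELETE FROM orders WHERE customer = 'Cy'")

# SELECT with GROUP BY: total amount per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
conn.close()
```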

10
Q

What does the REA model stand for?

A

Resources, Events, and Agents. The model is used in database design to represent business transactions.

11
Q
A
12
Q

What are the different types of keys in databases?

A

Primary key, foreign key, composite key, and unique key.
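
The four key types can be seen side by side in a small schema. This is a sketch using Python's sqlite3 module; the student and enrollment tables are hypothetical, and foreign-key enforcement has to be switched on explicitly in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,       -- primary key
    email      TEXT NOT NULL UNIQUE       -- unique key
);
CREATE TABLE enrollment (
    student_id  INTEGER NOT NULL REFERENCES student(student_id),  -- foreign key
    course_code TEXT NOT NULL,
    PRIMARY KEY (student_id, course_code) -- composite key
);
""")
conn.execute("INSERT INTO student VALUES (1, 'ann@example.com')")
conn.execute("INSERT INTO enrollment VALUES (1, 'DS101')")


def rejected(sql):
    """Return True if the statement violates a key constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Each statement below breaks one kind of key.
dup_email = rejected("INSERT INTO student VALUES (2, 'ann@example.com')")  # unique
orphan = rejected("INSERT INTO enrollment VALUES (99, 'DS101')")          # foreign
dup_pair = rejected("INSERT INTO enrollment VALUES (1, 'DS101')")         # composite
print(dup_email, orphan, dup_pair)
```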

13
Q

What does a scatterplot represent?

A

It shows the relationship between two variables using points plotted in two dimensions.

14
Q

What are bar charts used for?

A

To compare different categories or groups using bars of varying lengths.

15
Q

What is the purpose of a cross-tabulation (cross tab)?

A

To summarize the relationship between two categorical variables in a matrix format.
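
A cross tab is essentially a table of counts. The sketch below builds one with the standard library only; the region and product records are made-up sample data.

```python
from collections import Counter

# Hypothetical survey records: (region, preferred_product).
records = [
    ("North", "A"), ("North", "B"), ("North", "A"),
    ("South", "B"), ("South", "B"), ("South", "A"),
]

counts = Counter(records)
rows = sorted({region for region, _ in records})
cols = sorted({product for _, product in records})

# Matrix of counts: one row per region, one column per product.
crosstab = {r: {c: counts[(r, c)] for c in cols} for r in rows}

print("      " + "  ".join(cols))
for r in rows:
    print(f"{r:<6}" + "  ".join(str(crosstab[r][c]) for c in cols))
```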

16
Q

What are evaluation metrics used for?

A

To measure the performance of models or attribute combinations. Common metrics include accuracy, precision, recall, and F1-score.

17
Q

What does R² represent in statistics?

A

It represents the proportion of variance in the dependent variable that is predictable from the independent variable(s).
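
One common way to compute R² is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition; the observed and predicted values are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total
    return 1 - ss_res / ss_tot


# Hypothetical observed values and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_true, y_pred)
print(round(r2, 2))  # 0.98
```

Predicting the mean of y for every observation gives an R² of 0 under this definition, and perfect predictions give 1.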

18
Q

What is the purpose of ANOVA?

A

Analysis of Variance (ANOVA) is used to compare means among three or more groups to see if there is a significant difference.

19
Q

What is a t-test used for?

A

To determine if there is a significant difference between the means of two groups.
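
The card does not say which variant of the t-test is meant. As one possible sketch, the code below computes Welch's t statistic for two independent samples, which does not assume equal variances. The sample values are hypothetical, and turning the statistic into a p-value would additionally require the t distribution, which is not shown here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical measurements for two groups.
group_a = [5.0, 6.0, 7.0, 8.0]
group_b = [1.0, 2.0, 3.0, 4.0]
t_stat = welch_t(group_a, group_b)
print(round(t_stat, 2))  # about 4.38
```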

20
Q

What does the Chi-Square test measure?

A

It tests the association between two categorical variables.

21
Q

What are the Four Vs of Big Data?

A

Volume, Velocity, Variety, and Veracity.

22
Q

What are data integration pipelines and models?

A

Pipelines automate the movement and transformation of data, while models define how data from different sources is combined.

23
Q

What are the key components of data architecture?

A

Requirements, approaches, workflows, data storage, and integration mechanisms.

24
Q

What are the core aspects of data governance?

A

Cataloging, discovery, quality management, master data management, and data sharing.

25
Q

What are the key concerns in information security?

A

Classification, privacy, encryption, and management in-depth.

26
Q

What does XaaS stand for and what services does it include?

A

Anything-as-a-Service, including SaaS (Software), PaaS (Platform), and IaaS (Infrastructure).

27
Q

What are the key components of IT infrastructure?

A

Hardware, software, networking, and data storage systems.

28
Q

What are some basic macroeconomic concepts?

A

GDP, inflation, unemployment, monetary policy, fiscal policy.

29
Q

Why is Exploratory Data Analysis (EDA) crucial in the data understanding phase?

A

EDA helps to detect patterns, anomalies, and relationships within data, guiding decisions about how to prepare the data and select models.

30
Q

What are the key components of a data quality report in the CRISP-DM framework?

A

Data completeness, consistency, accuracy, validity, and timeliness.

31
Q

What are the main categories of models used in data science projects?

A

1) Predictive models (e.g., regression)
2) Descriptive models (e.g., clustering)
3) Prescriptive models (e.g., optimization).

32
Q

What factors should be considered when deciding on data inclusion or exclusion during the data preparation phase?

A

Relevance to the business problem, data quality, presence of outliers, missing values, and potential bias.

33
Q

Why is it important to list assumptions when building a model?

A

Assumptions impact the model’s reliability and must be validated to ensure accurate predictions and generalizability.

34
Q

What are the main types of SQL JOINs, and when are they used?

A

1) INNER JOIN: returns only the records that have a match in both tables.
2) LEFT JOIN: returns all records from the left table and the matched records from the right.
3) RIGHT JOIN: returns all records from the right table and the matched records from the left.
4) FULL OUTER JOIN: returns all records from both tables, matched where possible and filled with NULLs where there is no match.
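
The difference between the first two join types can be sketched with Python's sqlite3 module. The customers and orders tables are hypothetical. RIGHT and FULL OUTER JOIN are left out because only newer SQLite versions support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
-- Order 11 points at a customer id that does not exist.
INSERT INTO orders VALUES (10, 1, 20.0), (11, 3, 15.0);
""")

# INNER JOIN keeps only customers that have a matching order.
inner = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as NULL.
left = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

print(inner)  # [('Ann', 20.0)]
print(left)   # [('Ann', 20.0), ('Bob', None)]
conn.close()
```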

35
Q

What are the key normalization forms in databases, and what do they achieve?

A

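1st Normal Form (1NF): every column holds atomic values, with no repeating groups or duplicate columns,
2nd Normal Form (2NF): removes partial dependencies, so every non-key column depends on the whole primary key,
3rd Normal Form (3NF): removes transitive dependencies, so non-key columns depend only on the primary key.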

36
Q

What are derived attributes, and why are they important in data preparation?

A

Derived attributes are new variables created from existing data (e.g., calculating profit from revenue and cost), which help enhance the model’s performance by providing more meaningful inputs.
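
The profit example can be sketched in a few lines. The rows are hypothetical, and the margin column is an extra derived attribute added here purely for illustration.

```python
# Hypothetical product rows with raw revenue and cost.
rows = [
    {"product": "A", "revenue": 120.0, "cost": 80.0},
    {"product": "B", "revenue": 90.0, "cost": 95.0},
]

for row in rows:
    # Derived attribute from the card: profit = revenue - cost.
    row["profit"] = row["revenue"] - row["cost"]
    # A second derived attribute (an assumption, not from the card).
    row["margin"] = row["profit"] / row["revenue"]

for row in rows:
    print(row["product"], row["profit"], round(row["margin"], 3))
```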

37
Q

What metrics can be used to evaluate the performance of a predictive model?

A

Accuracy, precision, recall, F1-score, R² for regression, and confusion matrix for classification models.
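
For a binary classifier, the classification metrics above all follow from the four cells of the confusion matrix. A minimal sketch, assuming labels coded as 1 (positive) and 0 (negative) and using invented predictions:

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical true labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```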

38
Q

When should you use a t-test versus ANOVA?

A

A t-test is used to compare the means of two groups, while ANOVA is used to compare means among three or more groups.
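
For the three-or-more-groups case, one-way ANOVA compares between-group variation with within-group variation through the F statistic. The sketch below computes only that statistic, on hypothetical data. A p-value would additionally need the F distribution.

```python
def anova_f(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Three hypothetical groups of measurements.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = anova_f(groups)
print(round(f_stat, 2))  # 13.0
```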

39
Q

What are the primary functions of a data pipeline in data integration?

A

To extract, transform, and load (ETL) data from various sources into a centralized system for analysis and reporting.
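
The three ETL steps can be sketched as three small functions. Everything here is an assumption made for illustration: the inline CSV text stands in for a source system, the cleaning rules are arbitrary, and an in-memory SQLite database plays the role of the centralized store.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent casing, spacing, and a missing amount.
RAW_CSV = """customer,amount
ann, 20.5
BOB,10
cy,
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, normalize names and types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((row["customer"].strip().title(), float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)  # [('Ann', 20.5), ('Bob', 10.0)]
```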

40
Q

What are the four key components of an information security strategy?

A

1) Confidentiality 2) Integrity 3) Availability 4) Non-repudiation.

41
Q

How do privacy and encryption differ in the context of data security?

A

Privacy ensures that personal and sensitive information is protected, while encryption converts data into a secure format to prevent unauthorized access during transmission or storage.

42
Q

What does a low R² value indicate about a regression model’s fit?

A

A low R² suggests that the model explains only a small portion of the variability in the data, indicating a poor fit.

43
Q

How do the Four Vs of Big Data impact data analytics strategies?

A

Volume influences storage,
velocity affects real-time processing,
variety introduces complexities in data formats,
veracity impacts data quality and trustworthiness.

44
Q

What role does data cataloging play in data governance?

A

Data cataloging organizes metadata to enable users to discover, understand, and trust data, improving accessibility and governance.

45
Q

What are the key metrics for measuring XaaS performance?

A

Service availability, response time, cost-efficiency, scalability, and user satisfaction.

46
Q

How do scatterplots visually indicate correlation between variables?

A

The direction of the trend in the points shows whether the correlation is positive or negative, and how tightly the points cluster around that trend shows its strength. No visible pattern suggests no correlation.
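
What the eye reads off a scatterplot can also be summarized numerically. The sketch below uses the Pearson correlation coefficient, which is one common choice and is not named on the card. The data points are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return cov / spread


# Hypothetical data: one rising and one falling relationship.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

r_up = pearson_r(x, rising)     # close to +1: points rise along a line
r_down = pearson_r(x, falling)  # close to -1: points fall along a line
print(r_up, r_down)
```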

47
Q

In what scenario would a business analyst use cross-tabulation?

A

To explore the relationship between two categorical variables, such as customer demographics and product preferences.

48
Q

What are the assumptions necessary to perform a Chi-Square test?

A

Observations must be independent, the sample size should be sufficiently large, and expected frequency counts should be 5 or more for each category.
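
The expected-count condition can be checked while computing the test statistic. This is a sketch for a test of independence on a contingency table. The observed counts are hypothetical, and the p-value lookup against the chi-square distribution is not shown.

```python
def chi_square(table):
    """Return the chi-square statistic and the expected counts for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return stat, expected


# Hypothetical 2x2 table of observed counts.
observed = [[20, 30], [30, 20]]
stat, expected = chi_square(observed)

# Rule of thumb from the card: every expected count should be at least 5.
all_expected_ok = all(e >= 5 for row in expected for e in row)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(stat, dof, all_expected_ok)
```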