Test 1 Flashcards

Question 1

Q

What are the six phases of CRISP-DM?

Answer

A

1) Business Understanding 2) Data Understanding 3) Data Preparation 4) Modeling 5) Evaluation 6) Deployment

Question 2

Q

What is the main purpose of the Data Understanding phase in CRISP-DM?

Answer

A

To evaluate the raw material (data) for the project and perform Exploratory Data Analysis (EDA).

Question 3

Q

What are the different types of decisions in data science?

Answer

A

1) Structured 2) Semi-Structured 3) Unstructured.

Question 4

Q

What are common types of data visualizations?

Answer

A

Scatterplots, bar charts, cross tabs, pie charts, histograms, etc.

Question 5

Q

What is the difference between operational and analytic data stores?

Answer

A

Operational data stores support day-to-day operations, while analytic data stores support decision-making and analysis.

Question 6

Q

What is the purpose of database normalization?

Answer

A

To reduce data redundancy and improve data integrity by organizing the database into tables.

Question 7

Q

What is UMLS?

Answer

A

Unified Medical Language System, often used to integrate and manage medical terminologies and relationships in databases.

Question 8

Q

What do associations and multiplicities represent in database design?

Answer

A

Associations represent relationships between entities, and multiplicities define the number of instances in those relationships (e.g., one-to-many, many-to-many).

Question 9

Q

What are the key SQL concepts?

Answer

A

SELECT, JOIN, WHERE, GROUP BY, INSERT, UPDATE, DELETE, and more.

Question 10

Q

What does the REA model stand for?

Answer

A

Resources, Events, and Agents—used in database design to model business transactions.

Question 12

Q

What are the different types of keys in databases?

Answer

A

Primary key, foreign key, composite key, and unique key.
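
As a minimal sketch of how these key types behave, the example below uses Python's built-in sqlite3 module with an in-memory database. The tables (students, enrollments) and their columns are hypothetical and chosen only for illustration; the helper try_insert is also an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,   -- primary key
        email TEXT UNIQUE                 -- unique key
    );
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(student_id),  -- foreign key
        course_code TEXT,
        PRIMARY KEY (student_id, course_code)                -- composite key
    );
    INSERT INTO students VALUES (1, 'ada@example.com');
    INSERT INTO enrollments VALUES (1, 'DS101');
""")

def try_insert(sql):
    """Run one INSERT and report whether a key constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

# Each statement violates a different key constraint.
dup_email = try_insert("INSERT INTO students VALUES (2, 'ada@example.com')")
bad_fk = try_insert("INSERT INTO enrollments VALUES (99, 'DS101')")
dup_pair = try_insert("INSERT INTO enrollments VALUES (1, 'DS101')")
valid = try_insert("INSERT INTO enrollments VALUES (1, 'DS102')")
print(dup_email, bad_fk, dup_pair, valid)
```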

Question 13

Q

What does a scatterplot represent?

Answer

A

It shows the relationship between two variables using points plotted in two dimensions.

Question 14

Q

What are bar charts used for?

Answer

A

To compare different categories or groups using bars of varying lengths.

Question 15

Q

What is the purpose of a cross-tabulation (cross tab)?

Answer

A

To summarize the relationship between two categorical variables in a matrix format.
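
A cross tab can be built by counting how often each pair of categories occurs. The sketch below uses only the Python standard library; the survey rows (age group, preferred product) are made-up illustration data.

```python
from collections import Counter

# Hypothetical survey rows: (age_group, preferred_product)
rows = [
    ("18-34", "A"), ("18-34", "B"), ("18-34", "A"),
    ("35-54", "B"), ("35-54", "B"), ("55+", "A"),
]

# Count each (row category, column category) pair.
counts = Counter(rows)
row_labels = sorted({r for r, _ in rows})
col_labels = sorted({c for _, c in rows})

# Build the matrix: one row per age group, one column per product.
# Counter returns 0 for pairs that never occur.
crosstab = {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

for r in row_labels:
    print(r, crosstab[r])
```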

Question 16

Q

What are evaluation metrics used for?

Answer

A

To measure the performance of models or attribute combinations. Examples include accuracy, precision, recall, and F1-score.

Question 17

Q

What does R² represent in statistics?

Answer

A

It represents the proportion of variance in the dependent variable that is predictable from the independent variable(s).
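
One common way to compute R² is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition; the function name r_squared and the sample values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for observed and predicted values."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Made-up observations and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_true, y_pred)
print(round(r2, 2))
```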

Question 18

Q

What is the purpose of ANOVA?

Answer

A

Analysis of Variance (ANOVA) is used to compare means among three or more groups to see if there is a significant difference between them.
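
As a rough sketch, the one-way ANOVA F statistic is the between-group mean square divided by the within-group mean square. The function name and the three sample groups below are assumptions for illustration. Turning F into a p-value needs the F distribution, which a statistics library would normally supply and which is not shown here.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up measurements for three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
print(round(f_stat, 3))
```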

Question 19

Q

What is a t-test used for?

Answer

A

To determine if there is a significant difference between the means of two groups.
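
There are several t-test variants. The sketch below computes the Welch (unequal-variance) t statistic for two independent samples, which is one common choice and not necessarily the variant intended by this card. The data are made up, and the p-value step (which needs the t distribution) is omitted.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Made-up samples from two groups.
group_a = [1.0, 2.0, 3.0]
group_b = [5.0, 6.0, 7.0]
t_stat = welch_t(group_a, group_b)
print(round(t_stat, 3))
```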

Question 20

Q

What does the Chi-Square test measure?

Answer

A

It tests the association between two categorical variables.

Question 21

Q

What are the Four Vs of Big Data?

Answer

A

Volume, Velocity, Variety, and Veracity.

Question 22

Q

What are data integration pipelines and models?

Answer

A

Pipelines automate the movement and transformation of data, while models define how data from different sources is combined.

Question 23

Q

What are the key components of data architecture?

Answer

A

Requirements, approaches, workflows, data storage, and integration mechanisms.

Question 24

Q

What are the core aspects of data governance?

Answer

A

Cataloging, discovery, quality management, master data management, and data sharing.

Question 25

Q

What are the key concerns in information security?

Answer

A

Classification, privacy, encryption, and management in-depth.

Question 26

Q

What does XaaS stand for and what services does it include?

Answer

A

Anything-as-a-Service, including SaaS (Software), PaaS (Platform), and IaaS (Infrastructure).

Question 27

Q

What are the key components of IT infrastructure?

Answer

A

Hardware, software, networking, and data storage systems.

Question 28

Q

What are some basic macroeconomic concepts?

Answer

A

GDP, inflation, unemployment, monetary policy, fiscal policy.

Question 29

Q

Why is Exploratory Data Analysis (EDA) crucial in the data understanding phase?

Answer

A

EDA helps to detect patterns, anomalies, and relationships within data, guiding decisions about how to prepare the data and select models.

Question 30

Q

What are the key components of a data quality report in the CRISP-DM framework?

Answer

A

Data completeness, consistency, accuracy, validity, and timeliness.

Question 31

Q

What are the main categories of models used in data science projects?

Answer

A

1) Predictive models (e.g., regression)
2) Descriptive models (e.g., clustering)
3) Prescriptive models (e.g., optimization).

Question 32

Q

What factors should be considered when deciding on data inclusion or exclusion during the data preparation phase?

Answer

A

Relevance to the business problem, data quality, presence of outliers, missing values, and potential bias.

Question 33

Q

Why is it important to list assumptions when building a model?

Answer

A

Assumptions impact the model’s reliability and must be validated to ensure accurate predictions and generalizability.

Question 34

Q

What are the main types of SQL JOINs, and when are they used?

Answer

A

1) INNER JOIN: returns only the records that have a match in both tables. 2) LEFT JOIN: returns all records from the left table and the matched records from the right, with NULLs where there is no match. 3) RIGHT JOIN: returns all records from the right table and the matched records from the left, with NULLs where there is no match. 4) FULL OUTER JOIN: returns all records from both tables, matched where possible and filled with NULLs where there is no match.
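
The difference between an INNER JOIN and a LEFT JOIN can be seen on a tiny example. This sketch runs SQL through Python's built-in sqlite3 module; the customers and orders tables are hypothetical. RIGHT and FULL OUTER JOIN are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 25.0);   -- Ben has no orders
""")

# INNER JOIN keeps only customers that have at least one order.
inner_rows = conn.execute("""
    SELECT c.name, o.total
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as NULL (None).
left_rows = conn.execute("""
    SELECT c.name, o.total
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()
conn.close()

print(inner_rows)
print(left_rows)
```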

Question 35

Q

What are the key normalization forms in databases, and what do they achieve?

Answer

A

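1st Normal Form (1NF): each column holds atomic values, with no repeating groups,
2nd Normal Form (2NF): meets 1NF and removes partial dependencies, so every non-key field depends on the whole primary key,
3rd Normal Form (3NF): meets 2NF and removes transitive dependencies, so non-key fields depend only on the primary key.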
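
A transitive dependency is a non-key field that depends on another non-key field. It can be removed by moving the dependent field into its own table. The sketch below does this with plain Python dictionaries; the order and customer fields are made up for illustration.

```python
# Denormalized rows: customer_city depends on customer_id, not on order_id,
# so the city is repeated for every order by the same customer.
flat = [
    {"order_id": 1, "customer_id": 7, "customer_city": "Oslo"},
    {"order_id": 2, "customer_id": 7, "customer_city": "Oslo"},
    {"order_id": 3, "customer_id": 9, "customer_city": "Lima"},
]

# Customers table: one row per customer_id, city stored once.
customers = {r["customer_id"]: {"customer_city": r["customer_city"]} for r in flat}

# Orders table: keeps only the key and the reference to the customer.
orders = [{"order_id": r["order_id"], "customer_id": r["customer_id"]} for r in flat]

print(customers)
print(orders)
```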

Question 36

Q

What are derived attributes, and why are they important in data preparation?

Answer

A

Derived attributes are new variables created from existing data (e.g., calculating profit from revenue and cost), which help enhance the model’s performance by providing more meaningful inputs.
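
The profit example can be sketched in a few lines. The records and the extra profit_margin attribute are illustrative assumptions, not part of the original card.

```python
# Made-up sales records with the raw attributes.
records = [
    {"product": "A", "revenue": 120.0, "cost": 80.0},
    {"product": "B", "revenue": 50.0, "cost": 65.0},
]

for row in records:
    # Derived attribute: profit from revenue and cost.
    row["profit"] = row["revenue"] - row["cost"]
    # A second derived attribute built on the first (assumes revenue is not zero).
    row["profit_margin"] = row["profit"] / row["revenue"]

for row in records:
    print(row["product"], row["profit"], round(row["profit_margin"], 3))
```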

Question 37

Q

What metrics can be used to evaluate the performance of a predictive model?

Answer

A

For classification models: accuracy, precision, recall, F1-score, and the confusion matrix. For regression models: R².
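
The classification metrics can all be derived from the four counts of a binary confusion matrix. The sketch below assumes the usual definitions and non-zero denominators; the function name and the counts are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
    precision = tp / (tp + fp)                  # share of predicted positives that are correct
    recall = tp / (tp + fn)                     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1

# Made-up counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(accuracy, precision, round(recall, 3), round(f1, 3))
```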

Question 38

Q

When should you use a t-test versus ANOVA?

Answer

A

A t-test is used to compare the means of two groups, while ANOVA is used to compare means among three or more groups.

Question 39

Q

What are the primary functions of a data pipeline in data integration?

Answer

A

To extract, transform, and load (ETL) data from various sources into a centralized system for analysis and reporting.
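
A very small ETL flow might look like the sketch below, which uses only the Python standard library. The CSV content, the cleaning rules (drop rows with a missing amount, upper-case the country code), and the orders table are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data with messy spacing and a missing amount.
RAW_CSV = """order_id,amount,country
1, 19.99 ,us
2,5.00,CA
3,,us
"""

def extract(text):
    """Extract: read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, fix types, normalize country codes."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with a missing amount
        cleaned.append((int(row["order_id"]), float(amount), row["country"].strip().upper()))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the central table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```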

Question 40

Q

What are the four key components of an information security strategy?

Answer

A

1) Confidentiality 2) Integrity 3) Availability 4) Non-repudiation.

Question 41

Q

How do privacy and encryption differ in the context of data security?

Answer

A

Privacy ensures that personal and sensitive information is protected, while encryption converts data into a secure format to prevent unauthorized access during transmission or storage.

Question 42

Q

What does a low R² value indicate about a regression model’s fit?

Answer

A

A low R² suggests that the model explains only a small portion of the variability in the data, indicating a poor fit.

Question 43

Q

How do the Four Vs of Big Data impact data analytics strategies?

Answer

A

Volume influences storage;
velocity affects real-time processing;
variety introduces complexities in data formats;
veracity impacts data quality and trustworthiness.

Question 44

Q

What role does data cataloging play in data governance?

Answer

A

Data cataloging organizes metadata to enable users to discover, understand, and trust data, improving accessibility and governance.

Question 45

Q

What are the key metrics for measuring XaaS performance?

Answer

A

Service availability, response time, cost-efficiency, scalability, and user satisfaction.

Question 46

Q

How do scatterplots visually indicate correlation between variables?

Answer

A

The direction of the points (or of a fitted trendline) shows whether the correlation is positive or negative, and how tightly the points cluster around that trend shows its strength. A cloud with no visible pattern suggests no correlation.
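
What the eye reads from a scatterplot can be summarized numerically with the Pearson correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). The sketch below computes it by hand; the function name and data are illustrative, and it assumes neither variable is constant.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1.0, 2.0, 3.0, 4.0]
r_up = pearson_r(xs, [2.0, 4.0, 6.0, 8.0])    # points rise together
r_down = pearson_r(xs, [8.0, 6.0, 4.0, 2.0])  # one falls as the other rises
print(round(r_up, 3), round(r_down, 3))
```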

Question 47

Q

In what scenario would a business analyst use cross-tabulation?

Answer

A

To explore the relationship between two categorical variables, such as customer demographics and product preferences.

Question 48

Q

What are the assumptions necessary to perform a Chi-Square test?

Answer

A

Observations must be independent, the data must be counts of categorical variables, the sample size should be sufficiently large, and the expected frequency should be 5 or more in each cell.
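
The expected-frequency assumption can be checked while computing the test statistic. The sketch below derives expected counts from the row and column totals of a contingency table and then sums (observed - expected)² / expected. The function name and the 2x2 table are made up, and the p-value lookup against the Chi-Square distribution is not shown.

```python
def chi_square(observed):
    """Return the Chi-Square statistic and the expected counts for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return statistic, expected

# Made-up 2x2 table of counts for two categorical variables.
observed = [[20, 30], [30, 20]]
statistic, expected = chi_square(observed)

# Rule-of-thumb check from the card: every expected count is at least 5.
assumption_ok = all(e >= 5 for row in expected for e in row)
print(statistic, expected, assumption_ok)
```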