Test 1 Flashcards
What are the six phases of CRISP-DM?
1) Business Understanding 2) Data Understanding 3) Data Preparation 4) Modeling 5) Evaluation 6) Deployment
What is the main purpose of the Data Understanding phase in CRISP-DM?
To evaluate the raw material (data) for the project and perform Exploratory Data Analysis (EDA).
What are the different types of decisions in data science?
1) Structured 2) Semi-Structured 3) Unstructured.
What are common types of data visualizations?
Scatterplots, bar charts, cross tabs, pie charts, histograms, etc.
What is the difference between operational and analytic data stores?
Operational data stores support day-to-day operations, while analytic data stores support decision-making and analysis.
What is the purpose of database normalization?
To reduce data redundancy and improve data integrity by organizing the database into tables.
What is UMLS?
Unified Medical Language System, often used to integrate and manage medical terminologies and relationships in databases.
What do associations and multiplicities represent in database design?
Associations represent relationships between entities, and multiplicities define the number of instances in those relationships (e.g., one-to-many, many-to-many).
What are the key SQL concepts?
SELECT, JOIN, WHERE, GROUP BY, INSERT, UPDATE, DELETE, and more.
What does the REA model stand for?
Resources, Events, and Agents—used in database design to model business transactions.
What are the different types of keys in databases?
Primary key, foreign key, composite key, and unique key.
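The four key types can be sketched in a small schema. This is an illustrative example using Python's built-in sqlite3 module; the table and column names (customer, orders, order_line) are hypothetical and not taken from the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,   -- primary key
    email       TEXT NOT NULL UNIQUE   -- unique key
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)  -- foreign key
);
CREATE TABLE order_line (
    order_id INTEGER NOT NULL REFERENCES orders(order_id),
    line_no  INTEGER NOT NULL,
    PRIMARY KEY (order_id, line_no)    -- composite key
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1)")

try:
    # Customer 99 does not exist, so the foreign key should reject this row.
    conn.execute("INSERT INTO orders VALUES (11, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

Here the foreign key is what ties each order to exactly one existing customer, while the composite key makes each (order_id, line_no) pair unique.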
What does a scatterplot represent?
It shows the relationship between two variables using points plotted in two dimensions.
What are bar charts used for?
To compare different categories or groups using bars of varying lengths.
What is the purpose of a cross-tabulation (cross tab)?
To summarize the relationship between two categorical variables in a matrix format.
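A cross tab is essentially a count of each (row category, column category) pair. A minimal sketch using only the standard library, with made-up survey records as the assumed input:

```python
from collections import Counter

# Hypothetical records: (age group, preferred product)
records = [
    ("18-34", "A"), ("18-34", "B"), ("18-34", "A"),
    ("35+", "B"), ("35+", "B"), ("35+", "A"),
]

def crosstab(pairs):
    """Count each (row, column) combination and return a nested dict."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

table = crosstab(records)
for row_label, row in table.items():
    print(row_label, row)
```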
What are evaluation metrics used for?
To measure the performance of models or attribute combinations; common examples are accuracy, precision, recall, and F1-score.
What does R² represent in statistics?
It represents the proportion of variance in the dependent variable that is predictable from the independent variable(s).
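One common way to compute R² is 1 minus the ratio of residual variation to total variation. A short sketch, assuming the predictions come from some already-fitted model (the function name r_squared is chosen here for illustration):

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot

perfect = r_squared([1, 2, 3], [1, 2, 3])    # predictions match exactly
mean_only = r_squared([1, 2, 3], [2, 2, 2])  # always predicting the mean
```

Perfect predictions give 1, while a model that always predicts the mean explains none of the variance and gives 0.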
What is the purpose of ANOVA?
Analysis of Variance (ANOVA) is used to compare means among three or more groups to see if there is a significant difference.
What is a t-test used for?
To determine if there is a significant difference between the means of two groups.
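As a rough illustration, the test statistic for two independent samples can be computed by hand. This sketch uses Welch's form (which does not assume equal variances) and stops at the statistic; turning it into a p-value needs the t distribution, which is not in the standard library. The function name welch_t is an assumption for this example.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

t_stat = welch_t([1, 2, 3], [2, 3, 4])
```

The further the statistic is from zero, the stronger the evidence that the two group means differ.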
What does the Chi-Square test measure?
It tests the association between two categorical variables.
What are the Four Vs of Big Data?
Volume, Velocity, Variety, and Veracity.
What are data integration pipelines and models?
Pipelines automate the movement and transformation of data, while models define how data from different sources is combined.
What are the key components of data architecture?
Requirements, approaches, workflows, data storage, and integration mechanisms.
What are the core aspects of data governance?
Cataloging, discovery, quality management, master data management, and data sharing.
What are the key concerns in information security?
Classification, privacy, encryption, and management in-depth.
What does XaaS stand for and what services does it include?
Anything-as-a-Service, including SaaS (Software), PaaS (Platform), and IaaS (Infrastructure).
What are the key components of IT infrastructure?
Hardware, software, networking, and data storage systems.
What are some basic macroeconomic concepts?
GDP, inflation, unemployment, monetary policy, fiscal policy.
Why is Exploratory Data Analysis (EDA) crucial in the data understanding phase?
EDA helps to detect patterns, anomalies, and relationships within data, guiding decisions about how to prepare the data and select models.
What are the key components of a data quality report in the CRISP-DM framework?
Data completeness, consistency, accuracy, validity, and timeliness.
What are the main categories of models used in data science projects?
1) Predictive models (e.g., regression)
2) Descriptive models (e.g., clustering)
3) Prescriptive models (e.g., optimization).
What factors should be considered when deciding on data inclusion or exclusion during the data preparation phase?
Relevance to the business problem, data quality, presence of outliers, missing values, and potential bias.
Why is it important to list assumptions when building a model?
Assumptions impact the model’s reliability and must be validated to ensure accurate predictions and generalizability.
What are the main types of SQL JOINs, and when are they used?
1) INNER JOIN: returns only rows that have a match in both tables. 2) LEFT JOIN: returns all rows from the left table plus matching rows from the right, with NULLs where there is no match. 3) RIGHT JOIN: returns all rows from the right table plus matching rows from the left, with NULLs where there is no match. 4) FULL OUTER JOIN: returns all rows from both tables, matched where possible and filled with NULLs where there is no match.
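The difference between INNER and LEFT joins is easiest to see on a tiny dataset. A sketch using Python's built-in sqlite3 module; the customer and orders tables are hypothetical. RIGHT and FULL OUTER joins are left out because SQLite only supports them from version 3.39 onward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customer VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO orders VALUES (10, 1, 25.0);   -- only Ana has an order
""")

# INNER JOIN keeps only customers that have at least one order.
inner = conn.execute(
    "SELECT c.name, o.total FROM customer c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as None (NULL).
left = conn.execute(
    "SELECT c.name, o.total FROM customer c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

print(inner)  # [('Ana', 25.0)]
print(left)   # [('Ana', 25.0), ('Ben', None)]
```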
What are the key normalization forms in databases, and what do they achieve?
1st Normal Form (1NF): requires atomic column values and eliminates repeating groups,
2nd Normal Form (2NF): removes partial dependencies, so every non-key attribute depends on the whole primary key,
3rd Normal Form (3NF): removes transitive dependencies, so non-key attributes depend only on the primary key.
What are derived attributes, and why are they important in data preparation?
Derived attributes are new variables created from existing data (e.g., calculating profit from revenue and cost), which help enhance the model’s performance by providing more meaningful inputs.
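The profit example can be sketched in a few lines. The row layout (dicts with revenue and cost fields) and the helper name add_profit are assumptions for illustration only.

```python
# Hypothetical input rows with the two source attributes.
rows = [
    {"revenue": 120.0, "cost": 80.0},
    {"revenue": 50.0, "cost": 65.0},
]

def add_profit(row):
    """Return a copy of the row with a derived 'profit' attribute."""
    return {**row, "profit": row["revenue"] - row["cost"]}

enriched = [add_profit(r) for r in rows]
```

Returning a copy rather than mutating the input keeps the original data intact, which makes the preparation step easier to rerun and audit.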
What metrics can be used to evaluate the performance of a predictive model?
For classification models: accuracy, precision, recall, F1-score, and the confusion matrix; for regression models: R².
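The classification metrics all derive from the four cells of a binary confusion matrix. A minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics built from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```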
When should you use a t-test versus ANOVA?
A t-test is used to compare the means of two groups, while ANOVA is used to compare means among three or more groups.
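For three or more groups, one-way ANOVA compares the variation between group means with the variation inside the groups. A sketch of the F statistic only (no p-value), with anova_f as an assumed helper name:

```python
from statistics import mean

def anova_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)        # df = k - 1
    ms_within = ss_within / (n_total - k)    # df = N - k
    return ms_between / ms_within

f_stat = anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
```

With exactly two groups, this F statistic equals the square of the pooled-variance t statistic, which is why the two tests agree in that case.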
What are the primary functions of a data pipeline in data integration?
To extract, transform, and load (ETL) data from various sources into a centralized system for analysis and reporting.
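The three ETL stages can be sketched as three small functions. This is a toy pipeline under stated assumptions: the source is an in-memory CSV string, the target is an in-memory SQLite database, and the sales table and column names are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw source with inconsistent spacing and capitalization.
RAW_CSV = "region,amount\nNorth, 10.5\nsouth,4\nnorth,2.5\n"

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize region names and convert amounts to floats."""
    return [(r["region"].strip().lower(), float(r["amount"])) for r in rows]

def load(rows, conn):
    """Load: write the cleaned rows into the analytic store."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)
```

Keeping each stage as its own function makes it possible to test and replace the stages independently.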
What are the four key components of an information security strategy?
1) Confidentiality 2) Integrity 3) Availability 4) Non-repudiation.
How do privacy and encryption differ in the context of data security?
Privacy ensures that personal and sensitive information is protected, while encryption converts data into a secure format to prevent unauthorized access during transmission or storage.
What does a low R² value indicate about a regression model’s fit?
A low R² suggests that the model explains only a small portion of the variability in the data, indicating a poor fit.
How do the Four Vs of Big Data impact data analytics strategies?
Volume influences storage;
velocity affects real-time processing;
variety introduces complexities in data formats;
veracity impacts data quality and trustworthiness.
What role does data cataloging play in data governance?
Data cataloging organizes metadata to enable users to discover, understand, and trust data, improving accessibility and governance.
What are the key metrics for measuring XaaS performance?
Service availability, response time, cost-efficiency, scalability, and user satisfaction.
How do scatterplots visually indicate correlation between variables?
The direction of the trend shows whether the correlation is positive or negative, and how tightly the points cluster around that trend shows its strength; a shapeless cloud of points suggests little or no correlation.
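What the eye reads off a scatterplot can be quantified with the Pearson correlation coefficient, which ranges from -1 to 1. A minimal sketch written out by hand (the function name pearson_r is an assumption):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(sxx * syy)

rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # points on an upward line
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # points on a downward line
```

Values near 1 or -1 correspond to points lying close to an upward or downward line; values near 0 correspond to no linear pattern.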
In what scenario would a business analyst use cross-tabulation?
To explore the relationship between two categorical variables, such as customer demographics and product preferences.
What are the assumptions necessary to perform a Chi-Square test?
Observations must be independent, the sample size should be sufficiently large, and expected frequency counts should be 5 or more for each category.
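The expected-count assumption can be checked while computing the statistic itself. A sketch for a contingency table given as a list of rows; the observed counts are made up, and the result is the statistic and degrees of freedom only, not a p-value.

```python
def chi_square(observed):
    """Return (statistic, expected counts, degrees of freedom) for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, expected, dof

observed = [[20, 30], [30, 20]]   # hypothetical 2x2 cross tab
stat, expected, dof = chi_square(observed)

# Rule-of-thumb check: every expected count should be at least 5.
assumption_ok = all(e >= 5 for row in expected for e in row)
```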