Quiz 1 Review Flashcards
What are the main differences between structured, semi-structured, and unstructured data?
Structured data is organized with a clear format (e.g., tables), semi-structured has some structure (e.g., JSON), and unstructured lacks a defined format (e.g., text, images)
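A minimal sketch of the three categories, using made-up sample values (the names, fields, and text are illustrative assumptions, not course data):

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (like a table).
structured = list(csv.DictReader(io.StringIO("id,name,age\n1,Ada,36\n2,Linus,54\n")))

# Semi-structured: self-describing keys, but nesting and optional fields are allowed.
semi_structured = json.loads(
    '{"id": 1, "name": "Ada", "tags": ["math"], "address": {"city": "London"}}'
)

# Unstructured: free text with no predefined fields.
unstructured = "Ada wrote to say she will visit London next week."

print(structured[1]["name"])               # access by column name
print(semi_structured["address"]["city"])  # access by nested key
print(len(unstructured.split()))           # only generic text operations apply
```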
List common types of data in programming and data analysis
Common data types include string (text), integer (whole numbers), float (decimal numbers), boolean (true/false), and character
Outline the steps in the CRISP-DM process and their purpose in data mining and analysis
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. Its six phases are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment; together they give data analysis and mining projects a structured, repeatable approach.
Define API and describe its primary purpose in software development and data integration
An API (Application Programming Interface) is a set of rules that lets software interact with other software; it is commonly used to access web services and retrieve data.
What is a black box model and how is it utilized in machine learning?
A black box model conceals its internal workings; used when the focus is on the model’s output rather than its process or logic.
Explain what it means to have imbalanced classes in a dataset
Imbalanced classes occur when one class has significantly fewer instances than others in a dataset.
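As a rough illustration, the sketch below measures the imbalance in a made-up label list and then balances it with random oversampling. The labels, the 95/5 split, and the fixed seed are assumptions chosen for the example.

```python
import random
from collections import Counter

# Hypothetical labels: 95 majority-class rows and 5 minority-class rows.
labels = ["ok"] * 95 + ["fraud"] * 5
counts = Counter(labels)
minority_share = counts["fraud"] / len(labels)

# Random oversampling: duplicate minority rows (with replacement) until balanced.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [y for y in labels if y == "fraud"]
extra = rng.choices(minority, k=counts["ok"] - counts["fraud"])
balanced_counts = Counter(labels + extra)

print(minority_share, balanced_counts)
```

Oversampling is normally applied to the training data only, so that duplicated rows do not end up in the test set.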
Provide key actions to take and avoid when dealing with missing data, imbalanced classes, and outliers in a dataset
Impute or remove missing data; address imbalanced classes with techniques like oversampling or undersampling; handle outliers by identifying and potentially removing them.
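One possible way to sketch the missing-data and outlier steps, assuming median imputation and the common 1.5 × IQR rule (both are choices made for this example, not the only options):

```python
import statistics

# Hypothetical measurements; None marks a missing value.
values = [12.0, None, 15.0, 14.0, None, 13.0, 90.0]

# Impute: replace missing entries with the median of the observed values.
observed = [v for v in values if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in values]

# Identify outliers: anything more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in imputed if v < low or v > high]

print(imputed, outliers)
```

Flagged values are candidates for review; whether to remove them depends on whether they are errors or genuine extreme observations.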
Define ETL and describe its significance in data processing and analysis
ETL (Extract, Transform, Load) is a process for data integration: extracting data from sources, transforming it into a usable format, and loading it into a target database.
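The three stages can be sketched in a few lines. Everything here is an illustrative assumption: the CSV text stands in for a source system, the `sales` table is hypothetical, and an in-memory SQLite database plays the role of the target.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: a CSV export with inconsistent formatting.
RAW_CSV = """name,amount
 alice ,10.5
BOB,3
carol,
"""

def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy names, convert amounts to floats, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((row["name"].strip().title(), float(amount)))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(loaded)
```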
Why do we conduct exploratory data analysis (EDA) in the field of data science?
EDA is used to understand data’s main characteristics, patterns, and relationships; aids in making informed decisions for subsequent analysis.
Define metadata and explain its relevance to understanding and managing data
Metadata is data about data, providing information about data attributes, structure, and context; aids in understanding and managing data effectively.
Fishbone Diagram
This is a handy diagram to help us think
through the factors that could be inputs to
the business question/problem.
Classification
We use this to put new data points in
categories.
Type of learning we use to find patterns in
unlabeled data
Unsupervised
We implement an ______________ to study
what is in our data.
Algorithm
Type of learning we use to predict or
classify using labeled training data
Supervised
The data mining process we’re following is
____
CRISP-DM
Type of learning we use to teach robots how
to vacuum floors.
Reinforcement
We perform ________ analysis to find
subgroups in the data.
Clustering
To figure out what type of products people
buy together, we perform _____ ______
mining.
Association Rule
A ______ _______ diagram helps us
formulate the analytic problem.
Black-Box
To predict the value of y based on the value
of x, we use __________ analysis.
Regression
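A minimal sketch of simple linear regression by ordinary least squares, written without libraries. The data points are made up and lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data generated from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict y for a new x value of 6
print(slope, intercept, prediction)
```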
It all starts with a _____________.
Question
PCA is a type of ________ _________.
Dimension Reduction
Data mining involves deriving ___________ from raw data using algorithms.
Insights
Clustering ___________ the distance between points within the same cluster.
minimizes
Clustering ____________ the distance between points that are in different clusters.
maximizes
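The two clustering cards can be checked numerically. This sketch assumes four made-up 2-D points and a hand-picked assignment to two clusters, then compares the average distance within clusters to the average distance between them.

```python
import math
from itertools import combinations

# Hypothetical points and a hand-picked cluster assignment.
points = {"a": (1.0, 1.0), "b": (1.5, 2.0), "c": (8.0, 8.0), "d": (9.0, 9.5)}
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}

within, between = [], []
for p, q in combinations(points, 2):
    d = math.dist(points[p], points[q])
    # Same cluster -> within-cluster distance, otherwise between-cluster.
    (within if cluster_of[p] == cluster_of[q] else between).append(d)

avg_within = sum(within) / len(within)
avg_between = sum(between) / len(between)
print(avg_within, avg_between)
```

A good assignment gives a small within-cluster average and a large between-cluster average.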
Another term for association rule mining, which describes what it is often used for, is _______ ________ analysis.
market basket
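Two measures commonly used in market basket analysis are support (how often an itemset appears) and confidence (how often the rule holds when its left-hand side appears). The sketch below computes both on four made-up shopping baskets.

```python
# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

support_bread_butter = support({"bread", "butter"})
confidence_bread_to_butter = confidence({"bread"}, {"butter"})
print(support_bread_butter, confidence_bread_to_butter)
```

Here bread and butter appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain butter.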
Geospatial data can be raster or ______________.
Vector
Examining data for missing values, errors, formatting, statistical distributions etc. is (2 words) ____ _____ analysis.
Exploratory Data
Central tendency, dispersion, shape, frequency and percentiles are all examples of (1 word) _____ statistics.
Descriptive
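A short sketch of a few descriptive statistics using Python's standard `statistics` module. The data values are made up for illustration.

```python
import statistics

# Hypothetical sample.
data = [2, 4, 4, 5, 7, 9]

mean = statistics.mean(data)      # central tendency
median = statistics.median(data)  # central tendency, robust to extremes
mode = statistics.mode(data)      # most frequent value
spread = statistics.stdev(data)   # dispersion (sample standard deviation)
value_range = max(data) - min(data)
quartiles = statistics.quantiles(data, n=4)  # percentile cut points

print(mean, median, mode, spread, value_range, quartiles)
```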
What does stratified k-fold cross-validation do?
It ensures that each fold keeps approximately the same class proportions as the whole dataset, so every test fold reflects the overall class balance.
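The idea can be sketched without any libraries by dealing out each class's samples to the folds in turn. The function name `stratified_folds` and the 15/5 label split are assumptions for this example; it skips the shuffling a real implementation would usually do.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal each class's indices round-robin across the folds.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical imbalanced labels: 75% "neg", 25% "pos".
labels = ["neg"] * 15 + ["pos"] * 5
folds = stratified_folds(labels, k=5)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
print(fold_counts)
```

Each of the five folds ends up with 3 "neg" and 1 "pos" sample, matching the 75/25 split of the full dataset. Libraries such as scikit-learn offer this as `StratifiedKFold`.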
How can your time series model training and testing be ruined if you don’t use forward chaining?
Without forward chaining (for example, with random or shuffled splits), the training set can include observations from after the test period. This leaks future information into the model, violates causality, and produces unrealistic evaluations and invalid performance estimates.
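A minimal sketch of forward chaining with an expanding training window, assuming the samples are already sorted by time. The function name and parameters are hypothetical, chosen for this example.

```python
def forward_chaining_splits(n_samples, n_splits, test_size):
    """Yield (train, test) index lists where each test block follows its training data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train = list(range(0, test_start))  # everything before the test block
        test = list(range(test_start, test_start + test_size))
        yield train, test

splits = list(forward_chaining_splits(n_samples=10, n_splits=3, test_size=2))
for train, test in splits:
    print(train, test)
```

Every training index is earlier than every test index in the same split, so the model never sees the future. scikit-learn's `TimeSeriesSplit` provides a similar scheme.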