Quiz 1 Review Flashcards
What are the main differences between structured, semi-structured, and unstructured data?
Structured data follows a fixed schema (e.g., tables); semi-structured data has some organizing markers, such as keys or tags, but no rigid schema (e.g., JSON); unstructured data has no predefined format (e.g., text, images).
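As a small illustration of the three categories, here is a sketch in Python. The sample values (names, ages, the note) are made up for the example and do not come from the course material.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (hypothetical sample data).
csv_text = "id,name,age\n1,Ana,34\n2,Ben,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys, but records may differ in shape.
json_text = '[{"id": 1, "name": "Ana", "tags": ["vip"]}, {"id": 2, "name": "Ben"}]'
records = json.loads(json_text)

# Unstructured: free text with no predefined fields.
note = "Ana called on Monday to ask about her order."

print(rows[0]["name"])         # every CSV row has a "name" column
print(records[1].get("tags"))  # the second JSON record has no "tags" key
print(len(note.split()))       # free text can only be split into words
```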
List common types of data in programming and data analysis
Common data types include string (text), integer (whole numbers), float (decimal numbers), boolean (true/false), and character (a single symbol).
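A quick sketch of these types in Python, with made-up variable names. Python has no separate character type, so a single character is shown as a string of length one.

```python
name = "Ada"        # string (text)
count = 42          # integer (whole number)
price = 19.99       # float (decimal number)
is_active = True    # boolean (true/false)
initial = "A"       # "character": in Python this is just a length-1 string

# Print each value next to the name of its type.
for value in (name, count, price, is_active, initial):
    print(type(value).__name__, value)
```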
Outline the steps in the CRISP-DM process and their purpose in data mining and analysis
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. Its six phases are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment; together they give data analysis and mining projects a structured, repeatable approach.
Define API and describe its primary purpose in software development and data integration
API (Application Programming Interface) is a set of rules for software to interact with other software; used to access web services and retrieve data.
What is a black box model and how is it utilized in machine learning?
A black box model is one whose internal workings are hidden or too complex to interpret; used when the focus is on the model’s output rather than its process or logic.
Explain what it means to have imbalanced classes in a dataset
Imbalanced classes occur when one class has significantly fewer instances than others in a dataset.
Provide key actions to take and avoid when dealing with missing data, imbalanced classes, and outliers in a dataset
Impute or remove missing data; address imbalanced classes with techniques like oversampling or undersampling; handle outliers by identifying and potentially removing them.
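The three actions can be sketched with the Python standard library. This is only one possible approach: the toy data is invented, median imputation, the 1.5 × IQR outlier rule, and random oversampling are assumptions chosen for the example, and whether to drop an outlier is always a judgment call.

```python
import random
import statistics

# Hypothetical toy data: (feature value, class label); None marks a missing value.
data = [(5.0, "a"), (6.0, "a"), (None, "a"), (5.5, "a"),
        (6.5, "a"), (90.0, "a"), (5.8, "b"), (6.2, "b")]

# 1. Missing data: impute with the median of the observed values.
observed = [x for x, _ in data if x is not None]
median = statistics.median(observed)
imputed = [(x if x is not None else median, y) for x, y in data]

# 2. Outliers: flag values outside 1.5 * IQR of the quartiles, then drop them.
values = [x for x, _ in imputed]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [(x, y) for x, y in imputed if not low <= x <= high]
cleaned = [(x, y) for x, y in imputed if low <= x <= high]

# 3. Imbalanced classes: randomly oversample smaller classes up to the largest one.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
by_class = {}
for x, y in cleaned:
    by_class.setdefault(y, []).append((x, y))
target = max(len(rows) for rows in by_class.values())
balanced = []
for rows in by_class.values():
    balanced.extend(rows)
    balanced.extend(rng.choices(rows, k=target - len(rows)))

print(median, outliers, {y: len(rows) for y, rows in by_class.items()})
```

In practice, oversampling is usually applied only to the training split so that duplicated rows do not leak into the evaluation data.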
Define ETL and describe its significance in data processing and analysis
ETL (Extract, Transform, Load) is a process for data integration: extracting data from sources, transforming it into a usable format, and loading it into a target database.
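A minimal ETL sketch using only the standard library. The in-memory CSV, the `sales` table, and the column names are hypothetical stand-ins for a real source and target.

```python
import csv
import io
import sqlite3

# Extract: read rows from a source (an in-memory CSV stands in for a real file or API).
source = io.StringIO("name,amount\n ana ,10.50\nBEN,3\n")
raw_rows = list(csv.DictReader(source))

# Transform: trim and normalize names, convert amounts from text to numbers.
clean_rows = [(row["name"].strip().title(), float(row["amount"])) for row in raw_rows]

# Load: write the cleaned rows into a target database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
conn.commit()

# Query the target to confirm the load worked.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(clean_rows, total)
```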
Why do we conduct exploratory data analysis (EDA) in the field of data science?
EDA is used to understand data’s main characteristics, patterns, and relationships; aids in making informed decisions for subsequent analysis.
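A first pass at EDA often starts with summary statistics and category counts. The sketch below uses invented `ages` and `plans` columns; real EDA would typically add plots and look at relationships between variables.

```python
import statistics
from collections import Counter

# Hypothetical sample columns from a customer dataset.
ages = [23, 35, 31, 52, 46, 29]
plans = ["basic", "pro", "basic", "pro", "basic", "basic"]

# Numeric column: central tendency and range.
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "min": min(ages),
    "max": max(ages),
}

# Categorical column: how often each category appears.
plan_counts = Counter(plans)

print(summary)
print(plan_counts.most_common())
```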
Define metadata and explain its relevance to understanding and managing data
Metadata is data about data, providing information about data attributes, structure, and context; aids in understanding and managing data effectively.
Fishbone Diagram
This is a handy diagram to help us think through the factors that could be inputs to the business question/problem.
Classification
We use this to put new data points in categories.
Type of learning we use to find patterns in unlabeled data
Unsupervised
We implement an ______________ to study what is in our data.
Algorithm
Type of learning we use to predict or classify using labeled training data
Supervised