Quiz 1 Review Flashcards by Christopher Bucsa

What are the main differences between structured, semi-structured, and unstructured data?

Structured data is organized with a clear format (e.g., tables), semi-structured has some structure (e.g., JSON), and unstructured lacks a defined format (e.g., text, images)
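
As a rough illustration of the three categories, the sketch below holds the same kind of information in each form. The sample records and variable names are invented for this card.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (a table)
structured = list(csv.DictReader(io.StringIO("id,name\n1,Ada\n2,Grace\n")))

# Semi-structured: self-describing keys, but nesting and fields can vary per record
semi_structured = json.loads(
    '{"id": 1, "name": "Ada", "tags": ["math"], "address": {"city": "London"}}'
)

# Unstructured: free text with no predefined fields
unstructured = "Customer called to say the package arrived late but intact."

print(structured[1]["name"])
print(semi_structured["address"]["city"])
print(len(unstructured.split()), "words")
```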

List common types of data in programming and data analysis

Common data types include string (text), integer (whole numbers), float (decimal numbers), boolean (true/false), and character
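
A minimal sketch of these types in Python, using made-up values. Note that Python has no separate character type, so a single character is just a string of length one.

```python
# Illustrative values only; the variable names are invented for this card.
name = "Ada"        # string (text)
count = 42          # integer (whole number)
price = 19.99       # float (decimal number)
is_active = True    # boolean (true/false)
initial = name[0]   # "character": in Python this is a 1-length string

type_names = [type(v).__name__ for v in (name, count, price, is_active, initial)]
print(type_names)  # ['str', 'int', 'float', 'bool', 'str']
```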

Outline the steps in the CRISP-DM process and their purpose in data mining and analysis

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It has six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. It provides a structured approach to data mining and analysis projects.

Define API and describe its primary purpose in software development and data integration

An API (Application Programming Interface) is a set of rules that lets software interact with other software; it is commonly used to access web services and retrieve data.

What is a black box model and how is it utilized in machine learning?

A black box model conceals its internal workings; used when the focus is on the model’s output rather than its process or logic.

Explain what it means to have imbalanced classes in a dataset

Imbalanced classes occur when one class has significantly fewer instances than others in a dataset.

Provide key actions to take and avoid when dealing with missing data, imbalanced classes, and outliers in a dataset

Impute or remove missing data; address imbalanced classes with techniques like oversampling or undersampling; handle outliers by identifying and potentially removing them.
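
The three actions can be sketched with the standard library alone. This is only one possible approach under stated assumptions: median imputation, random oversampling, and the 1.5 x IQR rule are common choices picked here for illustration, and the sample data is invented.

```python
import random
import statistics

# Missing data: impute with the median of the observed values
ages = [23, 31, None, 45, 29, None, 38]
observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)
imputed_ages = [median_age if a is None else a for a in ages]

# Imbalanced classes: randomly oversample the minority class until counts match
labels = ["no"] * 8 + ["yes"] * 2
rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [lab for lab in labels if lab == "yes"]
extra = [rng.choice(minority) for _ in range(labels.count("no") - len(minority))]
balanced = labels + extra

# Outliers: flag values outside 1.5 * IQR from the quartiles
values = [10, 12, 11, 13, 12, 95, 11, 10]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

print(imputed_ages)
print(balanced.count("no"), balanced.count("yes"))
print(outliers)
```

Flagged outliers are worth inspecting before removal, since an extreme value can be a genuine observation rather than an error.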

Define ETL and describe its significance in data processing and analysis

ETL (Extract, Transform, Load) is a process for data integration: extracting data from sources, transforming it into a usable format, and loading it into a target database.
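
A minimal sketch of the three ETL steps, assuming an in-memory CSV as the source and an in-memory SQLite database as the target. The table name `sales` and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read rows from a source (an in-memory CSV stands in for a real file)
raw_csv = "name,amount\n alice ,10.5\nBOB,3\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: tidy up names and convert amounts from text to numbers
cleaned = [(r["name"].strip().title(), float(r["amount"])) for r in rows]

# Load: write the cleaned rows into the target database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
conn.commit()

loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(loaded)  # [('Alice', 10.5), ('Bob', 3.0)]
```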

Why do we conduct exploratory data analysis (EDA) in the field of data science?

EDA is used to understand data’s main characteristics, patterns, and relationships; aids in making informed decisions for subsequent analysis.

Define metadata and explain its relevance to understanding and managing data

Metadata is data about data, providing information about data attributes, structure, and context; aids in understanding and managing data effectively.

Fishbone Diagram

This is a handy diagram to help us think
through the factors that could be inputs to
the business question/problem.

Classification

We use this to put new data points in
categories.

Type of learning we use to find patterns in
unlabeled data

Unsupervised

We implement an ______________ to study
what is in our data.

Algorithm

Type of learning we use to predict or
classify using labeled training data

Supervised

The data mining process we're following is ____

CRISP-DM

Type of learning we use to teach robots how
to vacuum floors.

Reinforcement

We perform ________ analysis to find
subgroups in the data.

Clustering

To figure out what type of products people
buy together, we perform _____ ______
mining.

Association Rule

A ______ _______ diagram helps us
formulate the analytic problem.

Black-Box

To predict the value of y based on the value
of x, we use __________ analysis.

Regression
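
As one concrete case, simple linear regression fits a line y = intercept + slope * x by least squares. The sketch below computes the fit by hand on invented data; `predict` is a helper name made up for this example.

```python
# Invented sample data: y is roughly 2 * x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least-squares estimates: slope = cov(x, y) / var(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x


def predict(x):
    """Predict y for a new x using the fitted line."""
    return intercept + slope * x


print(round(slope, 2), round(intercept, 2), round(predict(6), 2))
```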

It all starts with a _____________.

Question

PCA is a type of ________ _________.

Dimension Reduction

Data mining involves deriving ___________ from raw data using algorithms.

Insights

Clustering ___________ the distance between points within the same cluster.

minimizes

Clustering ____________ the distance between points that are in different clusters.

maximizes

Another term for association rule mining, one that describes what it is often used for, is _______ ________ analysis.

market basket
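
Two standard measures in market basket analysis are support (how often an itemset appears) and confidence (how often the consequent appears given the antecedent). A minimal sketch on invented baskets; the function names are chosen for this example.

```python
# Invented transactions: each basket is the set of items bought together
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"bread", "milk"}))                # 0.5
print(round(confidence({"bread"}, {"milk"}), 3))
```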

Geospatial data can be raster or ______________.

Vector

Examining data for missing values, errors, formatting, statistical distributions etc. is (2 words) ____ _____ analysis.

Exploratory Data

Central tendency, dispersion, shape, frequency and percentiles are all examples of (1 word) _____ statistics.

Descriptive
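
Several of these measures are available in Python's standard library. A small sketch on invented scores:

```python
import statistics
from collections import Counter

scores = [70, 85, 85, 90, 100]  # invented sample

# Central tendency
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
mode_score = statistics.mode(scores)

# Dispersion (sample standard deviation)
stdev_score = statistics.stdev(scores)

# Frequency and percentiles (quartiles are the 25th, 50th, 75th percentiles)
frequency = Counter(scores)
quartiles = statistics.quantiles(scores, n=4)

print(mean_score, median_score, mode_score)
print(round(stdev_score, 2))
print(frequency[85], quartiles)
```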

What does Stratified k fold do?

Stratified k-fold ensures that each test fold has approximately the same class proportions (including any imbalance) as the whole dataset.
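
The idea can be sketched in plain Python by dealing each class's indices round-robin into k folds. This is a simplified illustration without shuffling, not a library implementation; `stratified_k_fold` is a name made up here. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used.

```python
from collections import Counter, defaultdict


def stratified_k_fold(labels, k):
    """Yield (train_idx, test_idx) pairs whose test folds keep class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    # Deal each class's indices round-robin so every fold gets its share
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)

    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Invented labels with a 2:1 imbalance
labels = ["neg"] * 8 + ["pos"] * 4
splits = list(stratified_k_fold(labels, k=4))
for train_idx, test_idx in splits:
    print(Counter(labels[i] for i in test_idx))
```

Each test fold here holds 2 "neg" and 1 "pos", matching the 2:1 ratio of the full dataset.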

How can your time series model training and testing be ruined if you don't use forward chaining?

Without forward chaining, a training set can contain observations that occur after the test observations. That leaks future information into training, violates causality, and produces unrealistically optimistic, invalid performance estimates.
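
A minimal sketch of forward chaining with an expanding training window, assuming the samples are already sorted by time. The function name and parameters are hypothetical, chosen for this example.

```python
def forward_chaining_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) where training always precedes testing in time."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for s in range(n_splits):
        test_start = first_test_start + s * test_size
        train_idx = list(range(0, test_start))  # everything before the test window
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


fc_splits = list(forward_chaining_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in fc_splits:
    print(train_idx, test_idx)
```

In every split the largest training index is smaller than the smallest test index, so the model never sees data from the future of its test window.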

(32 cards)