Quiz 1 Review Flashcards

1
Q

What are the main differences between structured, semi-structured, and unstructured data?

A

Structured data is organized with a clear format (e.g., tables), semi-structured has some structure (e.g., JSON), and unstructured lacks a defined format (e.g., text, images)
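
A minimal sketch of the three categories, using small made-up samples (the field names and values are hypothetical, chosen only for illustration):

```python
import csv
import io
import json

# Structured: fixed columns, and every row has the same fields.
table_text = "id,name,age\n1,Ada,36\n2,Grace,45\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Semi-structured: self-describing keys, but records may differ in shape.
records = json.loads('[{"id": 1, "name": "Ada"}, {"id": 2, "tags": ["navy"]}]')

# Unstructured: free text with no predefined fields.
note = "Met Ada on Tuesday; she mentioned the engine design."

print(rows[0]["name"], sorted(records[1].keys()), note[:7])
```

Note that the second JSON record has no "name" key at all, which a table with fixed columns would not allow.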

2
Q

List common types of data in programming and data analysis

A

Common data types include string (text), integer (whole numbers), float (decimal numbers), boolean (true/false), and character
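
As a rough illustration, here is how these types might look in Python. The variable names are hypothetical, and Python has no separate character type, so a one-character string stands in for it:

```python
# Illustrative values only; the variable names are made up for this sketch.
name = "Ada"        # string (text)
count = 42          # integer (whole number)
ratio = 3.14        # float (decimal number)
is_valid = True     # boolean (true/false)
initial = "A"       # character, represented in Python as a length-1 string

type_names = [type(v).__name__ for v in (name, count, ratio, is_valid, initial)]
print(type_names)
```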

3
Q

Outline the steps in the CRISP-DM process and their purpose in data mining and analysis

A

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. Its six phases are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment; together they provide a structured approach to data mining and analysis projects.

4
Q

Define API and describe its primary purpose in software development and data integration

A

An API (Application Programming Interface) is a set of rules that lets software interact with other software; it is commonly used to access web services and retrieve data.

5
Q

What is a black box model and how is it utilized in machine learning?

A

A black box model conceals its internal workings; used when the focus is on the model’s output rather than its process or logic.

6
Q

Explain what it means to have imbalanced classes in a dataset

A

Imbalanced classes occur when one class has significantly fewer instances than others in a dataset.

7
Q

Provide key actions to take and avoid when dealing with missing data, imbalanced classes, and outliers in a dataset

A

Impute or remove missing data; address imbalanced classes with techniques like oversampling or undersampling; handle outliers by identifying and potentially removing them.
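
One possible way to sketch these three actions with the standard library. The toy data, the median imputation, the 1.5 x IQR rule, and random oversampling are all assumptions chosen for illustration, not the only valid choices:

```python
import random
import statistics
from collections import Counter

# Hypothetical toy data: None marks a missing value, 40.0 is a suspicious reading.
values = [4.0, 5.0, None, 6.0, 5.5, 40.0, None, 4.5]

# Missing data: impute with the median of the observed values.
observed = [v for v in values if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in values]

# Outliers: flag points outside 1.5 * IQR of the quartiles (flag first, then decide).
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
outliers = [v for v in imputed if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

# Imbalanced classes: randomly oversample the minority class until counts match.
labels = ["majority"] * 6 + ["minority"] * 2
rng = random.Random(0)  # fixed seed so the sketch is repeatable
counts = Counter(labels)
minority = min(counts, key=counts.get)
needed = max(counts.values()) - counts[minority]
extra = rng.choices([label for label in labels if label == minority], k=needed)
balanced = labels + extra

print(outliers, Counter(balanced))
```

The sketch only flags the outlier rather than deleting it, since whether to remove it depends on whether it is an error or a real observation. Resampling is normally applied to the training data only, so the test set keeps its original class balance.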

8
Q

Define ETL and describe its significance in data processing and analysis

A

ETL (Extract, Transform, Load) is a process for data integration: extracting data from sources, transforming it into a usable format, and loading it into a target database.
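
A minimal ETL sketch, assuming an in-memory CSV string as the source and an in-memory SQLite database as the target (the table name `sales` and its columns are hypothetical):

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (an in-memory CSV stands in for a real file).
raw_csv = "name,amount\n alice ,10.5\nBOB,3\n"
raw_rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: normalize the names and convert amounts from text to floats.
clean_rows = [(r["name"].strip().title(), float(r["amount"])) for r in raw_rows]

# Load: write the cleaned rows into a target database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
conn.commit()

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(clean_rows, total)
```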

9
Q

Why do we conduct exploratory data analysis (EDA) in the field of data science?

A

EDA is used to understand data’s main characteristics, patterns, and relationships; aids in making informed decisions for subsequent analysis.

10
Q

Define metadata and explain its relevance to understanding and managing data

A

Metadata is data about data, providing information about data attributes, structure, and context; aids in understanding and managing data effectively.

11
Q

Fishbone Diagram

A

This is a handy diagram to help us think
through the factors that could be inputs to
the business question/problem.

12
Q

Classification

A

We use this to put new data points in
categories.

13
Q

Type of learning we use to find patterns in
unlabeled data

A

Unsupervised

14
Q

We implement an ______________ to study
what is in our data.

A

Algorithm

15
Q

Type of learning we use to predict or
classify using labeled training data

A

Supervised

16
Q

The data mining process we’re following is
____

A

CRISP-DM

17
Q

Type of learning we use to teach robots how
to vacuum floors.

A

Reinforcement

18
Q

We perform ________ analysis to find
subgroups in the data.

A

Clustering

19
Q

To figure out what type of products people
buy together, we perform _____ ______
mining.

A

Association Rule

20
Q

A ______ _______ diagram helps us
formulate the analytic problem.

A

Black-Box

21
Q

To predict the value of y based on the value
of x, we use __________ analysis.

A

Regression
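
As an illustration, simple linear regression fits a line y = slope * x + intercept by least squares. The data below is made up so that the points lie exactly on y = 2x + 1:

```python
# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least-squares estimates: slope = S_xy / S_xx, intercept = mean_y - slope * mean_x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Predict y for a new x using the fitted line.
prediction = slope * 6 + intercept
print(slope, intercept, prediction)
```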

22
Q

It all starts with a _____________.

A

Question

23
Q

PCA is a type of ________ _________.

A

Dimension Reduction

24
Q

Data mining involves deriving ___________ from raw data using algorithms.

A

Insights

25
Q

Clustering ___________ the difference between points within the same cluster.

A

minimizes

26
Q

Clustering ____________ the distance between points that are in different clusters

A

maximizes

27
Q

Another term for association rule mining, which describes what it is often used for, is _______ ________ analysis.

A

market basket

28
Q

Geospatial data can be raster or ______________.

A

Vector

29
Q

Examining data for missing values, errors, formatting, statistical distributions, etc. is (2 words) ____ _____ analysis.

A

Exploratory Data

30
Q

Central tendency, dispersion, shape, frequency and percentiles are all examples of (1 word) _____ statistics.

A

Descriptive
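
A short sketch of a few of these measures using Python's built-in `statistics` module (the sample values are made up):

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion
stdev = statistics.stdev(data)          # sample standard deviation
value_range = max(data) - min(data)

# Percentiles: the three cut points that split the data into quartiles
quartiles = statistics.quantiles(data, n=4)

print(mean, median, mode, round(stdev, 2), value_range, quartiles)
```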

31
Q

What does Stratified k fold do?

A

Ensures that each fold used for testing preserves approximately the same class proportions (including any class imbalance) as the whole dataset.
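
To make the idea concrete, here is a simplified pure-Python sketch of stratified fold assignment. The helper `stratified_folds` is hypothetical and omits shuffling; in practice a library routine such as scikit-learn's `StratifiedKFold` would usually be used instead:

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical imbalanced dataset: 8 negatives and 4 positives (a 2:1 ratio).
labels = ["neg"] * 8 + ["pos"] * 4
folds = stratified_folds(labels, k=4)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
print(fold_counts)
```

Each fold ends up with the same 2:1 ratio as the full dataset, so no test fold is missing the minority class.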

32
Q

How can your time series model training and testing be ruined if you don’t use forward chaining?

A

Without forward chaining, the model can be trained on observations that come after the test period (for example, with a random or reversed split). This leaks future information into training, violates causality, and produces unrealistic evaluations and invalid performance estimates.
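
A minimal sketch of forward chaining (an expanding training window), assuming the samples are already sorted by time. The helper `forward_chaining_splits` is hypothetical and written only for illustration:

```python
def forward_chaining_splits(n_samples, n_splits):
    """Return (train_indices, test_indices) pairs where training always precedes testing."""
    fold_size = n_samples // (n_splits + 1)
    splits = []
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        test_end = min(train_end + fold_size, n_samples)
        train = list(range(train_end))            # everything up to the cutoff
        test = list(range(train_end, test_end))   # the block right after the cutoff
        splits.append((train, test))
    return splits


splits = forward_chaining_splits(n_samples=10, n_splits=4)
for train, test in splits:
    print("train:", train, "test:", test)
```

Every test block lies strictly after its training block, so the model never sees the future when it is evaluated.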