Advanced DA (ML) Flashcards

Question 1

Q

What Is Clustering? List the Main Properties of Clustering Algorithms.

Answer

A

Clustering is an unsupervised technique that identifies groups within a dataset and assigns each data point to one of them, so that points in the same cluster are more similar to each other than to points in other clusters.

Clustering algorithms have the following properties:

Iterative
Hard or soft
Disjunctive
Flat or hierarchical
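
To make the "hard or soft" property concrete, here is a minimal sketch, not a reference implementation. The 1-D centers, the function names, and the softmax-style weighting are all assumptions chosen for illustration. A hard assignment puts a point in exactly one cluster, while a soft assignment gives it a membership weight for every cluster.

```python
import math

def hard_assign(point, centers):
    """Return the index of the single nearest center (hard clustering)."""
    distances = [abs(point - c) for c in centers]
    return distances.index(min(distances))

def soft_assign(point, centers):
    """Return one membership weight per center, summing to 1 (soft clustering)."""
    # Assumption: weights come from a softmax over negative squared distances,
    # so closer centers receive larger weights.
    scores = [math.exp(-(point - c) ** 2) for c in centers]
    total = sum(scores)
    return [s / total for s in scores]

centers = [0.0, 4.0]              # hypothetical 1-D cluster centers
hard = hard_assign(1.5, centers)  # one cluster index
soft = soft_assign(1.5, centers)  # one weight per cluster
```

The point 1.5 is closer to the first center, so the hard assignment picks index 0 and the soft assignment gives that center the larger weight.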

Question 2

Q

What Is Logistic Regression?

Answer

A

Logistic regression is a form of predictive analysis that is used in cases where the dependent variable is dichotomous in nature.
Examples: whether an e-mail is spam or not, whether a tumor is malignant or not…
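
As an illustration of how a fitted logistic regression model produces a yes/no answer, the sketch below passes a weighted sum of the features through the sigmoid function and thresholds the resulting probability. The weights, bias, and feature values are hand-picked assumptions, not the result of training on real data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability that the example belongs to the positive class."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a dichotomous outcome (1 or 0)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical, hand-picked parameters (a real model would learn these).
weights = [1.2, 0.8]
bias = -3.0

label_a = predict_label([3, 1], weights, bias)  # z = 1.4, probability above 0.5
label_b = predict_label([0, 1], weights, bias)  # z = -2.2, probability below 0.5
```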

Question 3

Q

What Is Linear Regression?

Answer

A

Linear regression is a statistical method that models the relationship between a dependent variable and one or more independent variables. The relationship is established by fitting a linear equation to the observed data.
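
For the simplest case of one independent variable, fitting the linear equation can be sketched with the ordinary least squares formulas. The function name and the sample data below are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

Because the sample points lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1. With noisy data the same formulas return the line that minimizes the sum of squared vertical errors.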

Question 4

Q

Explain K-means Clustering.

Answer

A

Analysts use K-means clustering to partition observations into k non-overlapping sub-groups called clusters, where each observation belongs to the cluster with the nearest mean (centroid). It is a popular technique for cluster analysis in data mining.
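
One common way to compute such a partition is Lloyd's algorithm, which alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The sketch below is a minimal version under stated assumptions: 2-D points, caller-supplied starting centroids, and a fixed iteration cap. The function name and sample data are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Minimal K-means for 2-D points; k is the number of starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments will no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visually separate groups, k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, [(1, 1), (9, 8)])
```

Real implementations usually pick the starting centroids randomly or with a seeding heuristic and rerun several times, since the result depends on initialization.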

Question 5

Q

What Do You Mean by Hierarchical Clustering?

Answer

A

Hierarchical clustering, in its agglomerative (bottom-up) form, is a data analysis method that first treats every data point as its own cluster. It then repeats the following two steps to build larger clusters, until a single cluster or the desired number of clusters remains:

Identify the two clusters that are closest to each other.
Merge those two clusters into one.
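
The merge loop can be sketched as follows. This is a minimal illustration, assuming 1-D values and single linkage (the distance between two clusters is the distance between their closest members). The function name, the stopping rule, and the sample values are assumptions.

```python
def agglomerative(values, target_clusters):
    """Bottom-up clustering of 1-D values using single linkage."""
    # Start with every value as its own cluster.
    clusters = [[v] for v in values]
    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair into one cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

values = [1, 2, 9, 10, 25]            # hypothetical data
three = agglomerative(values, 3)      # [[1, 2], [9, 10], [25]]
two = agglomerative(values, 2)        # [[1, 2, 9, 10], [25]]
```

Other linkage rules (complete, average) only change how the distance `d` between two clusters is computed.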

Question 6

Q

Explain Data Warehousing.

Answer

A

A data warehouse is a data storage system that collects data from various disparate sources and stores it in a way that makes it easy to produce business insights. Data warehousing is the process of identifying heterogeneous data sources, extracting data from them, cleaning it, and transforming it into a manageable form for storage in a data warehouse.
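
The sourcing, cleaning, and transforming steps can be illustrated with a toy pipeline. This is only a sketch under stated assumptions: the two "sources" are hypothetical in-memory records, the table and column names are invented, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical raw records from two sources with inconsistent formats.
crm_rows = [{"customer": " Alice ", "amount": "120.50"},
            {"customer": "bob", "amount": "80"}]
shop_rows = [("ALICE", 30.0), ("Carol", None)]

def clean(name, amount):
    """Normalize the name and amount; drop records with a missing amount."""
    if amount is None:
        return None
    return name.strip().title(), float(amount)

def load_warehouse(connection):
    """Clean records from both sources and store them in one common table."""
    connection.execute(
        "CREATE TABLE sales (customer TEXT, amount REAL, source TEXT)")
    records = []
    for row in crm_rows:
        cleaned = clean(row["customer"], row["amount"])
        if cleaned:
            records.append((*cleaned, "crm"))
    for name, amount in shop_rows:
        cleaned = clean(name, amount)
        if cleaned:
            records.append((*cleaned, "shop"))
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")
load_warehouse(connection)
# With the data in one consistent form, a cross-source question is one query.
totals = connection.execute(
    "SELECT customer, SUM(amount) FROM sales "
    "GROUP BY customer ORDER BY customer").fetchall()
```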

Question 7

Q

How Do You Differentiate Between Overfitting and Underfitting?

Answer

A

Underfitting and overfitting are both modeling errors.

OVERFITTING:
The model fits the training set very well, often by learning its noise.
Performance drops considerably on the test set.

UNDERFITTING:
The model neither fits the training data well nor generalizes to new data.
It performs poorly on both the training set and the test set.
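
The two patterns above can be turned into a rough diagnostic that compares training and test error. The sketch below is a heuristic only: the function name and both thresholds are assumptions, and sensible values depend on the problem and the error metric.

```python
def diagnose_fit(train_error, test_error, max_error=0.1, max_gap=0.05):
    """Classify a model from its training and test error (heuristic)."""
    # Poor performance even on the training set suggests underfitting.
    if train_error > max_error:
        return "underfitting"
    # Good training performance but a large drop on the test set
    # suggests overfitting.
    if test_error - train_error > max_gap:
        return "overfitting"
    return "good fit"

# Hypothetical error rates for three models.
model_a = diagnose_fit(train_error=0.02, test_error=0.30)
model_b = diagnose_fit(train_error=0.35, test_error=0.38)
model_c = diagnose_fit(train_error=0.05, test_error=0.07)
```

Here `model_a` is flagged as overfitting, `model_b` as underfitting, and `model_c` as a good fit.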

Question 8

Q