Advanced DA (ML) Flashcards
What Is Clustering? List the Main Properties of Clustering Algorithms.
Clustering is the technique of identifying groups or categories within a dataset and assigning each data point to one of them, so that points in the same cluster are more similar to each other than to points in other clusters.
Clustering algorithms have the following properties:
Iterative
Hard or soft
Disjunctive
Flat or hierarchical
What Is Logistic Regression?
Logistic regression is a predictive modeling technique used when the dependent variable is dichotomous (binary). It models the probability that an observation belongs to one of the two classes.
Examples: is an e-mail spam or not, is a tumor malignant or not…
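As an illustration, here is a minimal sketch of logistic regression on a single feature, trained by plain batch gradient descent. The toy data (a count of suspicious words per e-mail), the function names, the learning rate and the epoch count are all assumptions made for this example; in practice a library implementation would normally be used.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals (predicted probability minus label) drive both gradients.
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(x, w, b, threshold=0.5):
    """Return 1 if the modeled probability reaches the threshold, else 0."""
    return 1 if sigmoid(w * x + b) >= threshold else 0

# Hypothetical toy data: x = suspicious words in an e-mail, y = 1 if spam.
xs = [0, 1, 2, 3, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(xs, ys)

print(predict(1, w, b), predict(8, w, b))
```

The output of the model is a probability; the dichotomous answer (spam or not) comes from comparing that probability with a threshold, here assumed to be 0.5.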
What Is Linear Regression?
Linear regression is a statistical method used to find out how a dependent variable is related to one or more independent variables (two variables in total in the simplest case). The process used to establish this relationship involves fitting a linear equation to the dataset.
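A minimal sketch of fitting a linear equation y = slope * x + intercept to a dataset, using the ordinary least squares formulas for the one-variable case. The function name and the sample data are assumptions chosen so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # prints 2.0 1.0
```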
Explain K-Means Clustering.
Analysts use K-means clustering to partition observations into k non-overlapping sub-groups called clusters, where each observation is assigned to the cluster with the nearest mean (centroid). It is a popular technique for cluster analysis in data mining.
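The partitioning can be sketched with the standard assign-then-update loop (often called Lloyd's algorithm). This is an illustrative version for 2-D points only; the function name, the choice to pass the starting centroids explicitly, and the sample points are assumptions for this example.

```python
def kmeans(points, centroids, max_iter=100):
    """Partition 2-D points into len(centroids) clusters, starting from given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(c)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # no centroid moved, so the assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visibly separate groups, k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(clusters)  # prints [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
```

Every point ends up in exactly one cluster, which is what makes the sub-groups non-overlapping. The result can depend on the starting centroids, so real implementations usually try several initializations.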
What Do You Mean by Hierarchical Clustering?
Hierarchical clustering (in its bottom-up, agglomerative form) is a data analysis method that first treats every data point as its own cluster. It then repeats the following steps to create larger clusters:
Identify the two clusters that are closest to each other.
Merge those two clusters into one, and repeat until a single cluster (or the desired number of clusters) remains.
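The two steps above can be sketched as a short loop. This version works on 1-D values and measures the distance between two clusters as the smallest gap between any pair of their members (single linkage); that linkage choice, the function name, and the sample values are assumptions for illustration, and other linkage rules are equally valid.

```python
def agglomerative(values, n_clusters=1):
    """Bottom-up clustering of 1-D values using single-linkage distance."""
    # Start with every value as its own cluster.
    clusters = [[v] for v in values]
    while len(clusters) > n_clusters:
        # Step 1: find the pair of clusters that are closest to each other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Step 2: merge that pair into a single cluster.
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

values = [1, 2, 9, 10, 25]  # hypothetical sample values
print(agglomerative(values, n_clusters=3))  # prints [[1, 2], [9, 10], [25]]
print(agglomerative(values, n_clusters=2))  # prints [[1, 2, 9, 10], [25]]
```

Stopping at different values of n_clusters corresponds to cutting the hierarchy at different levels.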
Explain Data Warehousing.
A data warehouse is a data storage system that collects data from various disparate sources and stores it in a way that makes it easy to produce important business insights. Data warehousing is the process of identifying heterogeneous data sources, sourcing the data, cleaning it, and transforming it into a manageable form for storage in a data warehouse.
How Do You Differentiate Between Overfitting and Underfitting?
Underfitting and overfitting are both modeling errors.
OVERFITTING:
The model fits the training set very well, often by learning its noise along with the underlying pattern.
Its performance drops considerably on the test set.
UNDERFITTING:
The model neither fits the training data well nor generalizes to new data.
It performs poorly on both the training set and the test set.
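A small sketch can make the contrast concrete by comparing training and test error for two deliberately extreme models. Everything here is an assumption made for illustration: the data (true relation y = 2x with +/-1 noise on the training labels), the "memorizer" that looks up the nearest training point, and the "constant" model that always predicts the mean training label.

```python
def mse(model, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: the true relation is y = 2x; training labels carry +/-1 noise.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]
test_x = [x + 0.5 for x in range(9)]
test_y = [2 * x for x in test_x]

def memorizer(x):
    """Overfitting: return the label of the nearest training point, noise included."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

mean_y = sum(train_y) / len(train_y)

def constant(x):
    """Underfitting: ignore x and always predict the mean training label."""
    return mean_y

for name, model in [("memorizer", memorizer), ("constant", constant)]:
    print(name,
          "train MSE:", mse(model, train_x, train_y),
          "test MSE:", mse(model, test_x, test_y))
```

The memorizer has zero error on the training set but a clearly worse error on the test set (the overfitting pattern), while the constant model has a large error on both (the underfitting pattern).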