Random Question Flashcards
What are the commonly used programming languages in data science?
Python, R, and SQL.
Fill in the blank: A __________ is a combination of data, algorithms, and machine learning techniques used to make predictions.
predictive model
What is overfitting in machine learning?
When a model learns the training data too well, capturing noise instead of the underlying pattern.
Which of the following is a common metric for evaluating classification models? A) Mean Absolute Error B) Accuracy C) R-squared
B) Accuracy
What is the purpose of cross-validation?
To assess how the results of a statistical analysis will generalize to an independent data set.
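As a rough illustration of the answer above, here is a minimal sketch of how k-fold cross-validation partitions a dataset, using only the Python standard library. The helper name `k_fold_indices` is hypothetical and not taken from any library; it assumes contiguous, unshuffled folds, whereas real workflows usually shuffle first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Each sample lands in the held-out fold exactly once across the k rounds.
folds = list(k_fold_indices(10, 3))
```

A model would be trained on each `train_idx` and scored on the matching `test_idx`, and the k scores averaged to estimate how well it generalizes.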
What does ETL stand for in data processing?
Extract, Transform, Load.
True or False: Feature engineering is the process of selecting, modifying, or creating features to improve model performance.
True
What is the difference between supervised and unsupervised learning?
Give examples of supervised and unsupervised algorithms
Supervised learning uses labeled data to train models, while unsupervised learning uses unlabeled data.
Supervised learning gets feedback during training by comparing its predictions against the known labels.
Supervised: decision trees, SVM
Unsupervised: k-means, hierarchical clustering
What is a confusion matrix?
A table used to evaluate the performance of a classification model by comparing predicted and actual outcomes.
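To make the table above concrete, here is a small sketch that tallies the four cells of a binary confusion matrix with the standard library. The labels are made-up example data and `confusion_counts` is a hypothetical helper, not a library function.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


# Made-up labels for illustration: 1 = positive class, 0 = negative class.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
tp, fn = counts[(1, 1)], counts[(1, 0)]  # actual positives: hit, miss
fp, tn = counts[(0, 1)], counts[(0, 0)]  # actual negatives: false alarm, correct reject

# Accuracy is the share of predictions on the matrix diagonal.
accuracy = (tp + tn) / len(actual)
```

Metrics such as precision and recall can be derived from the same four counts.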
Fill in the blank: The __________ is a statistical measure that represents the likelihood of an event occurring.
probability
What is the purpose of a data pipeline?
To automate and streamline the process of data collection, transformation, and storage.
Which algorithm is commonly used for regression tasks?
Linear regression.
What is the significance of p-values in hypothesis testing?
A p-value is the probability of observing data at least as extreme as the data actually observed, assuming the null hypothesis is true.
True or False: Data visualization is an important part of data analysis in data science.
True
What is the purpose of the ‘train-test split’ in machine learning?
To evaluate the performance of a model on unseen data.
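A minimal sketch of the idea, assuming a simple shuffled hold-out split. The helper `split_rows` is hypothetical (it is not scikit-learn's function), and the 20% test fraction is just an illustrative default.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(10), test_fraction=0.2)
```

The model is fit on `train` only; `test` stays untouched until the final evaluation so that it behaves like unseen data.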
Name one common library used for data manipulation in Python.
Pandas.
What does the term ‘big data’ refer to?
Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations.
Fill in the blank: __________ learning is a subset of artificial intelligence focused on teaching computers to learn from data without being explicitly programmed.
Machine
What is the purpose of regularization in machine learning?
To prevent overfitting by adding a penalty for larger coefficients.
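One way to picture the penalty is an L2 (ridge-style) term added to the mean squared error. The sketch below is illustrative only: `ridge_loss`, the toy data, and the `alpha` values are assumptions chosen for the example.

```python
def ridge_loss(weights, features, targets, alpha):
    """Mean squared error plus an L2 penalty on the weights."""
    n = len(targets)
    mse = sum(
        (sum(w * x for w, x in zip(weights, row)) - y) ** 2
        for row, y in zip(features, targets)
    ) / n
    # Larger coefficients raise the penalty, which discourages overly complex fits.
    penalty = alpha * sum(w * w for w in weights)
    return mse + penalty


features = [[1.0, 2.0], [3.0, 4.0]]
targets = [5.0, 11.0]
weights = [1.0, 2.0]  # fits these two toy points exactly

unpenalized = ridge_loss(weights, features, targets, alpha=0.0)
penalized = ridge_loss(weights, features, targets, alpha=0.1)
```

Even with a perfect fit on the training points, the penalized loss is nonzero, so an optimizer is pushed toward smaller weights.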
What is a common use case for clustering algorithms?
Market segmentation.
What does ‘data wrangling’ involve?
Cleaning and transforming raw data into a usable format.
Which of the following is a regression algorithm? A) K-means B) Decision Trees C) Naive Bayes
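B) Decision Trees (they can be used for both regression and classification; K-means is a clustering algorithm and Naive Bayes is a classifier)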
True or False: Dimensionality reduction techniques are used to reduce the number of features in a dataset.
True
What is the role of a data scientist?
To analyze and interpret complex data to help organizations make informed decisions.
What is the main advantage of using ensemble methods in machine learning?
They combine multiple models to improve predictive performance.
Fill in the blank: A __________ is a graphical representation of the distribution of numerical data.
histogram
What are outliers?
Data points that differ significantly from other observations in a dataset.
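One common convention for flagging such points is the 1.5 × IQR rule. The sketch below assumes that rule and uses Python's `statistics.quantiles` with its default method; `iqr_outliers` and the sample values are hypothetical, and other definitions of "outlier" are equally valid.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up measurements with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(data)
```

Whether a flagged point should be removed, corrected, or kept depends on why it differs from the rest of the data.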