Random Question Flashcards

1
Q

What are the commonly used programming languages in data science?

A

Python, R, and SQL.

2
Q

Fill in the blank: A __________ is a combination of data, algorithms, and machine learning techniques used to make predictions.

A

predictive model

3
Q

What is overfitting in machine learning?

A

When a model learns the training data too well, capturing noise instead of the underlying pattern.

4
Q

Which of the following is a common metric for evaluating classification models? A) Mean Absolute Error B) Accuracy C) R-squared

A

B) Accuracy

5
Q

What is the purpose of cross-validation?

A

To assess how the results of a statistical analysis will generalize to an independent data set.
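One common form is k-fold cross-validation. The sketch below is a minimal illustration, assuming contiguous, unshuffled folds; the helper name `k_fold_indices` is hypothetical and not taken from any library.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


# Each sample lands in the test set exactly once across the k folds.
folds = list(k_fold_indices(10, 3))
```

In practice the model is trained on each `train_idx` and scored on the matching `test_idx`, and the k scores are averaged.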

6
Q

What does ETL stand for in data processing?

A

Extract, Transform, Load.

7
Q

True or False: Feature engineering is the process of selecting, modifying, or creating features to improve model performance.

A

True

8
Q

What is the difference between supervised and unsupervised learning?
Give examples of supervised and unsupervised algorithms.

A

Supervised learning uses labeled data to train models, while unsupervised learning uses unlabeled data.

Supervised learning has a feedback mechanism: predictions are compared against the known labels.
Supervised examples: decision trees, SVM.
Unsupervised examples: k-means, hierarchical clustering.

9
Q

What is a confusion matrix?

A

A table used to evaluate the performance of a classification model by comparing predicted and actual outcomes.
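For a binary classifier the table has four cells. A minimal sketch of how they could be counted, assuming labels are 0/1 and that 1 is the positive class; `confusion_counts` and the sample labels are made up for illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN by comparing actual and predicted labels."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
cm = confusion_counts(y_true, y_pred)
```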

10
Q

Fill in the blank: The __________ is a statistical measure that represents the likelihood of an event occurring.

A

probability

11
Q

What is the purpose of a data pipeline?

A

To automate and streamline the process of data collection, transformation, and storage.

12
Q

Which algorithm is commonly used for regression tasks?

A

Linear regression.

13
Q

What is the significance of p-values in hypothesis testing?

A

P-values indicate the probability of observing the data, or something more extreme, under the null hypothesis.

14
Q

True or False: Data visualization is an important part of data analysis in data science.

A

True

15
Q

What is the purpose of the ‘train-test split’ in machine learning?

A

To evaluate the performance of a model on unseen data.
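A minimal sketch of such a split, assuming a simple shuffled hold-out with a fixed seed; the function name, the 20% ratio, and the seed are illustrative choices, not prescribed by the card.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(list(range(10)), test_ratio=0.2)
```

The model is fitted on `train` only; `test` stays untouched until the final evaluation.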

16
Q

Name one common library used for data manipulation in Python.

A

Pandas.

17
Q

What does the term ‘big data’ refer to?

A

Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations.

18
Q

Fill in the blank: __________ learning is a subset of artificial intelligence focused on teaching computers to learn from data without being explicitly programmed.

A

Machine

19
Q

What is the purpose of regularization in machine learning?

A

To prevent overfitting by adding a penalty for larger coefficients.

20
Q

What is a common use case for clustering algorithms?

A

Market segmentation.

21
Q

What does ‘data wrangling’ involve?

A

Cleaning and transforming raw data into a usable format.

22
Q

Which of the following is a regression algorithm? A) K-means B) Decision Trees C) Naive Bayes

A

B) Decision Trees

23
Q

True or False: Dimensionality reduction techniques are used to reduce the number of features in a dataset.

A

True

24
Q

What is the role of a data scientist?

A

To analyze and interpret complex data to help organizations make informed decisions.

25
Q

What is the main advantage of using ensemble methods in machine learning?

A

They combine multiple models to improve predictive performance.

26
Q

Fill in the blank: A __________ is a graphical representation of the distribution of numerical data.

A

histogram

27
Q

What are outliers?

A

Data points that differ significantly from other observations in a dataset.

28
Q

What is the purpose of exploratory data analysis (EDA)?

A

To summarize the main characteristics of a dataset, often using visual methods.

29
Q

What is the difference between batch and online learning?

A

Batch learning trains on the full dataset at once, while online learning updates the model incrementally, one instance (or small batch) at a time.

30
Q

How is logistic regression done?

A

Data -> linear model -> sigmoid function -> probability between 0 and 1 -> threshold classifier

31
Q

Formula of the sigmoid function

A

P = 1 / (1 + exp(-y)), where y = ax + b
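A minimal sketch of this formula followed by a threshold step, assuming a single feature x and a cut-off of 0.5; the function names and the threshold value are illustrative assumptions.

```python
import math


def sigmoid(y):
    """Map a real-valued score to a probability in (0, 1)."""
    # Plain formula; very negative y can overflow math.exp in this simple form.
    return 1.0 / (1.0 + math.exp(-y))


def predict(x, a, b, threshold=0.5):
    """Linear score y = a*x + b, then sigmoid, then threshold."""
    p = sigmoid(a * x + b)
    return p, int(p >= threshold)


p_high, label_high = predict(2.0, a=1.0, b=-1.0)  # score y = 1
p_low, label_low = predict(0.0, a=1.0, b=-1.0)    # score y = -1
```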

32
Q

What are the steps to build a decision tree?

A

1) Calculate the entropy of the target and of each predictor attribute.
2) Calculate the information gain of each attribute.
3) Choose the attribute with the highest information gain as the root.
4) Repeat the procedure on each branch.
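The root-selection steps can be sketched as follows. This is a minimal illustration on a made-up four-row dataset, assuming categorical attributes stored in dicts; the attribute names and helper functions are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the target minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

gains = {f: information_gain(rows, labels, f) for f in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
```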

33
Q

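How do you build a random forest?

A

1) Select k features at random from the m available (k < m).
2) Calculate the node D using the best split among those k features.
3) Repeat for the daughter nodes.
4) Repeat with another random selection of k to grow the next tree.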

34
Q

How do you avoid overfitting?

A

1) Keep the model simple.
2) Detect it with cross-validation.
3) Apply regularization.
4) Add more data or use feature selection.
5) Use early stopping.

35
Q

What are the feature selection methods?

A

Filter methods: LDA, chi-square, ANOVA.
Wrapper methods: forward feature selection, backward feature elimination, and recursive feature elimination (the first two consider one feature at a time).

36
Q

Why use dimensionality reduction?

A

Less storage space, less computational power, and removal of redundant features.

37
Q

Calculate the eigenvalues and eigenvectors of the matrix
-2 -4 2
-2 1 2
4 2 5

A

Characteristic equation: λ^3 - 4λ^2 - 27λ + 90 = 0, so the eigenvalues are λ = 3, -5, and 6.
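The cubic in this answer can be checked numerically. The sketch below finds its integer roots and builds one eigenvector per root as the cross product of two rows of (A - λI). It assumes the roots are integers in [-10, 10] and that the first two rows of each shifted matrix are linearly independent, which holds for this matrix.

```python
A = [[-2, -4, 2],
     [-2, 1, 2],
     [4, 2, 5]]


def char_poly(lam):
    """Characteristic polynomial from the card: lam^3 - 4 lam^2 - 27 lam + 90."""
    return lam**3 - 4 * lam**2 - 27 * lam + 90


def cross(u, v):
    """Cross product of two 3-vectors."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def eigenvector(matrix, lam):
    """A vector orthogonal to the first two rows of (matrix - lam * I)."""
    shifted = [[matrix[i][j] - (lam if i == j else 0) for j in range(3)]
               for i in range(3)]
    return cross(shifted[0], shifted[1])


def mat_vec(matrix, v):
    """Matrix-vector product for a 3x3 matrix."""
    return [sum(matrix[i][j] * v[j] for j in range(3)) for i in range(3)]


# Brute-force search for integer roots (an assumption that works here).
eigenvalues = [lam for lam in range(-10, 11) if char_poly(lam) == 0]
eigenvectors = {lam: eigenvector(A, lam) for lam in eigenvalues}
```

Each eigenvector is defined only up to a scalar factor, so any nonzero multiple of the vectors found here is equally valid.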

38
Q

What are recommender systems?

A

Systems that suggest items to users. The two main approaches are collaborative filtering and content-based filtering.

39
Q

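How do you select k for k-means?

A

Calculate the WSS (within-cluster sum of squares: the sum of squared distances between each cluster member and its centroid) for several values of k, then apply the elbow method: pick the k where the curve stops dropping sharply.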
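A minimal sketch of the WSS calculation itself, assuming the cluster assignments are already known. The two hand-made clusterings below stand in for the output of k-means at k = 1 and k = 2; no clustering is actually run here.

```python
def wss(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        # Centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(coord) / len(points) for coord in zip(*points)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(point, centroid))
            for point in points
        )
    return total


# Hypothetical points forming two well-separated groups.
k1 = [[(0, 0), (0, 2), (10, 0), (10, 2)]]
k2 = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]

wss_k1 = wss(k1)
wss_k2 = wss(k2)
```

Plotting WSS against k for a real dataset gives the curve on which the elbow is read.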

40
Q

How do you treat outliers?

A

Remove them if they are garbage values.
Normalize the data.
Use another model.
Use an algorithm that is robust to outliers, such as random forest.

41
Q

Precision

A

TP / (TP + FP)

42
Q

Recall

A

TP / (TP + FN)

43
Q

Entropy formula (impurity level)

A

-sum(p_i * log2(p_i))

44
Q

TPR (true positive rate)

A

TP / (TP + FN)

45
Q

FPR (false positive rate)

A

FP / (FP + TN)

46
Q

Difference between logistic and linear regression

A

The output is categorical (logistic) vs. continuous (linear).

47
Q

Bagging vs boosting

A

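Bagging: aims to reduce variance in a noisy dataset. Split the data into samples, train a model on each, and average the predictions.
Boosting: an ensemble method that strengthens weak models by training them in sequence, each one learning from the previous one's errors.
Example: gradient boosting (risk of overfitting).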

48
Q

F1 score

A

2 x precision x recall / (precision + recall)
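A minimal sketch tying precision, recall, and F1 together, starting from hypothetical counts (TP = 8, FP = 2, FN = 4). It assumes the denominators are nonzero; a real implementation would need to handle the zero case.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```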

49
Q

Assumptions of linear regression

A

Linear relationship between the features and y.
Independence of the observations (errors).

50
Q

What is logistic regression?

A

A predictive analysis technique that models the relationship between a binary dependent variable and independent features using the logistic regression equation.

51
Q

What is a decision tree?

A

A tool to classify data and determine the probabilities of the outcomes of a system. It starts at a root node, branches through decision nodes, and ends in leaf nodes.

52
Q

Pruning the decision tree

A

Eliminate leaves to avoid overfitting, using the Gini index.

53
Q

Error vs. residual

A

Error: observed value - true value.
Residual: observed value - estimated value.

54
Q

Ensemble learning

A

Multiple models are used to improve predictive performance.

55
Q

Naive Bayes

A

A classification algorithm that assumes the features are independent of one another given the class.

56
Q

SVM

A

Prediction and classification using hyperplanes to separate two classes.

57
Q

Law of large numbers

A

As an experiment is repeated a large number of times, the average of the results converges to the expected value.

58
Q

Confounding variable

A

A variable that affects both the supposed cause and the supposed effect, distorting their apparent relationship.

59
Q

Does gradient descent always converge to the same point?

A

No. The loss surface can have several local optima, so the end point depends on the starting point.
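A minimal sketch of this behaviour, assuming a made-up one-dimensional loss f(x) = (x^2 - 1)^2 + 0.3x, which has two minima of different depth. The learning rate, step count, and starting points are arbitrary illustrative choices.

```python
def loss(x):
    """Hypothetical non-convex loss with two minima."""
    return (x**2 - 1) ** 2 + 0.3 * x


def gradient(x):
    """Derivative of the loss above: 4x^3 - 4x + 0.3."""
    return 4 * x**3 - 4 * x + 0.3


def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent with a fixed learning rate."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


left = gradient_descent(-2.0)   # ends in the deeper minimum (x < 0)
right = gradient_descent(2.0)   # ends in the shallower local minimum (x > 0)
```

Both runs stop where the gradient is essentially zero, yet `loss(left)` is lower than `loss(right)`: the second run is stuck in a local optimum.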

60
Q

Binomial formula

A

P(X = x) = n! / ((n - x)! x!) * p^x * q^(n - x), where q = 1 - p
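A minimal sketch of the binomial formula, using `math.comb` (Python 3.8+) for the n! / ((n - x)! x!) factor. The function name and the example values are illustrative.

```python
import math


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return math.comb(n, x) * p**x * q ** (n - x)


# Probability of exactly 2 heads in 4 fair coin flips.
two_heads = binomial_pmf(2, 4, 0.5)

# The probabilities over all possible x should sum to 1.
total = sum(binomial_pmf(x, 4, 0.5) for x in range(5))
```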

61
Q

Type I error

A

False positive

62
Q

Type II error

A

False negative

63
Q

L1 vs. L2 regularization

A

L1: penalty of lambda times the sum of the absolute values of the weights. Leads to sparse models, with some weights driven to zero; good for high-dimensional data with irrelevant features.
L2: penalty on the squared values of the weights. Prevents overfitting without eliminating features; works well with correlated features.
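A minimal sketch of the two penalty terms as they would be added to a loss, assuming a single regularization strength `lam` and a plain list of weights; the function names and the numbers are hypothetical.

```python
def l1_penalty(weights, lam):
    """L1 (lasso-style) penalty: lam * sum of |w|."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 (ridge-style) penalty: lam * sum of w^2."""
    return lam * sum(w * w for w in weights)


# Hypothetical weights; the zero weight adds nothing to either penalty.
weights = [0.5, -2.0, 0.0]
l1 = l1_penalty(weights, lam=0.1)
l2 = l2_penalty(weights, lam=0.1)
```

The squared term punishes the large weight (-2.0) much more heavily than the small one, while the absolute-value term penalizes every nonzero weight at the same rate.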

64
Q

Feature scaling

A

Min-max scaling: (x - min) / (max - min)
Z-score transformation: (x - mean) / std
Log transformation
Robust scaling
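A minimal sketch of the first two formulas using only the standard library. It assumes the population standard deviation for the z-score (the card does not say which one) and that the values are not all identical.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population std
    return [(v - mean) / std for v in values]


values = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(values)
standardized = z_score_scale(values)
```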

65
Q

How do you deal with outliers?

A

Detect them by visualization or with statistical methods (z-score, IQR), then:
1) Removal
2) Transformation
3) Capping
4) Investigation
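A minimal sketch of IQR-based detection, assuming the usual 1.5 x IQR fences and the default quartile method of `statistics.quantiles` (Python 3.8+). The data are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one obvious outlier.
values = [10, 12, 11, 13, 12, 11, 14, 100]
outliers = iqr_outliers(values)
```

Whether a flagged point is then removed, transformed, capped, or investigated depends on why it is extreme.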

66
Q

One-hot encoding

A

Transforms each category into a binary (0/1) column.
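A minimal sketch, assuming the column order is the sorted list of distinct categories; the helper name and the sample colours are illustrative.

```python
def one_hot_encode(values):
    """Return the category order and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
```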

67
Q

How do you deal with categorical values?

A

One-hot encoding (0, 1)
Label encoding (1, 2, 3, …)
Target encoding: replaces the category with the mean of the target
Frequency encoding: replaces the category with its frequency
Domain-specific encoding, i.e. encoding based on domain knowledge, such as the distance from a central point
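A minimal sketch of label, frequency, and target encoding on a made-up column. It assumes labels are assigned in sorted order starting at 1, frequencies are relative, and the target is numeric. The target means are computed on the full data here, which in practice risks leaking the target.

```python
from collections import Counter, defaultdict


def label_encode(values):
    """Map each category to an integer 1, 2, 3, ... in sorted order."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)), start=1)}
    return [mapping[v] for v in values]


def frequency_encode(values):
    """Replace each category with its relative frequency."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


def target_encode(values, targets):
    """Replace each category with the mean target observed for it."""
    sums, counts = defaultdict(float), Counter(values)
    for v, t in zip(values, targets):
        sums[v] += t
    return [sums[v] / counts[v] for v in values]


# Hypothetical column and binary target.
colors = ["red", "green", "red", "blue"]
targets = [1, 0, 0, 1]

labels = label_encode(colors)
frequencies = frequency_encode(colors)
target_means = target_encode(colors, targets)
```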

68
Q

Bias-variance tradeoff

A

Bias: error introduced by oversimplifying the model; it underfits the data.
Variance: error introduced by the model's sensitivity to the training data; it overfits the data.