AI Flashcards

1
Q

What are the 2 types of data?

A

Numerical Data and Categorical Data.

2
Q

What kind of value does Numerical or Continuous data accept?

A

Can accept any value within a finite or infinite interval (e.g., height, weight, temperature, blood glucose, …).

3
Q

What are the 2 types of Numerical or Continuous data?

A

Interval and ratio.

4
Q

Describe data on an interval scale.

A

Can be added and subtracted but cannot be meaningfully multiplied or divided because there is no true zero. For example, we cannot say that one day is twice as hot as another day.

5
Q

Describe data on a ratio scale.

A

Has true zero and can be added, subtracted, multiplied or divided (e.g., weight).

6
Q

A Categorical or Discrete variable is one that has ….. .

A

two or more categories (values).

7
Q

What are the 2 types of categorical variables?

A

Nominal and ordinal.

8
Q

Describe Nominal variables.

A

Has no intrinsic ordering to its categories. For example, gender is a categorical variable having two categories (male and female) with no intrinsic ordering to the categories.

9
Q

Describe Ordinal variables.

A

Has a clear ordering. For example, temperature as a variable with three orderly categories (low, medium and high).

10
Q

What is a frequency table?

A

Is a way of counting how often each category of the variable in question occurs. It may be enhanced by the addition of percentages that fall into each category.

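A minimal sketch of such a table in plain Python; the `outlook` values are made up for illustration:

```python
from collections import Counter

# Made-up categorical variable
outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast"]

counts = Counter(outlook)     # how often each category occurs
total = sum(counts.values())
# Enhanced with the percentage that falls into each category
freq_table = {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}
print(freq_table)  # {'Sunny': (2, 28.6), 'Overcast': (2, 28.6), 'Rainy': (3, 42.9)}
```
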
11
Q

What is Encoding or continuization?

A

Is the transformation of categorical variables to binary or numerical counterparts. An example is to treat male or female for gender as 1 or 0. Categorical variables must be encoded in many modeling methods (e.g., linear regression, SVM, neural networks).

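A sketch of both flavors in plain Python, on made-up values: a two-category variable maps to 1/0, while a variable with more categories gets one indicator column per category (one-hot):

```python
# Binary encoding of a two-category variable (gender as 1/0), plus a minimal
# one-hot encoding for a variable with more than two categories.
gender = ["male", "female", "female", "male"]
gender_encoded = [1 if g == "male" else 0 for g in gender]

outlook = ["Sunny", "Overcast", "Rainy"]
categories = sorted(set(outlook))   # ['Overcast', 'Rainy', 'Sunny']
one_hot = [[1 if v == c else 0 for c in categories] for v in outlook]

print(gender_encoded)  # [1, 0, 0, 1]
print(one_hot)         # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```
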
12
Q

What are the 2 types of encoding?

A

Binary and Target-based.

13
Q

What is Binning or discretization?

A

Is the process of transforming numerical variables into categorical counterparts.
An example is to bin values for Age into categories such as 20-39, 40-59, and 60-79.

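The Age example can be sketched as a small Python function; the bin edges follow the card, while the "other" fallback for out-of-range ages is an added assumption:

```python
def bin_age(age):
    """Map a numerical Age to the categorical bins from the example.
    The "other" fallback for out-of-range ages is an added assumption."""
    if 20 <= age <= 39:
        return "20-39"
    if 40 <= age <= 59:
        return "40-59"
    if 60 <= age <= 79:
        return "60-79"
    return "other"

print([bin_age(a) for a in (25, 44, 67, 81)])  # ['20-39', '40-59', '60-79', 'other']
```
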
14
Q

Numerical variables are usually discretized in the modeling methods based on ….. .

A

frequency tables (e.g., decision trees).

15
Q

Binning may improve accuracy of the predictive models by ….. or ….. .

A

reducing the noise, non-linearity.

16
Q

What is a Dataset?

A

Is a collection of data, usually presented in a tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data.

17
Q

Alternatives for columns: ….., ….., ….. .

A

Fields, Attributes, Variables.

18
Q

Alternatives for rows: ….., ….., ….., ….., ….., ….. .

A

Records, Objects, Cases, Instances, Examples, Vectors.

19
Q

Alternatives for values: ….. .

A

Data.

20
Q

In predictive modeling, ….. or ….. are the input variables.

A

predictors, attributes.

21
Q

In predictive modeling, ….. or ….. is the output variable.

A

target, class attribute.

22
Q

In predictive modeling, the output variable value is determined by ….. and ….. .

A

the values of the predictors, function of the predictive model.

23
Q

Pattern recognition predicts the future by ….. .

A

means of modeling.

24
Q

What is Predictive modeling?

A

Is the process by which a model is created to predict an outcome.

25
Q

If the outcome is categorical it is called ….. .

A

Classification.

26
Q

If the outcome is numerical it is called ….. .

A

Regression.

27
Q

What is Descriptive modeling or clustering?

A

Is the assignment of observations into clusters so that observations in the same cluster are similar.

28
Q

What is Classification?

A

Is predicting the value of a categorical variable (target or class) by building a model based on one or more numerical and/or categorical variables (predictors or attributes).

29
Q

What is the ZeroR classifier?

A

Is the simplest classification method which relies on the target and ignores all predictors.

30
Q

ZeroR classifier simply predicts the ….. .

A

Majority category (class).

31
Q

Although there is no predictability power in ZeroR, it is useful for ….. .

A

determining a baseline performance as a benchmark for other classification methods.

32
Q

What is the ZeroR classifier algorithm?

A

Construct a frequency table for the target and select its most frequent value.

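A minimal sketch of this algorithm in Python, on a made-up Play Golf target:

```python
from collections import Counter

def zero_r(targets):
    """ZeroR: frequency table over the target; always predict its most frequent value."""
    return Counter(targets).most_common(1)[0][0]

play_golf = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes"]
print(zero_r(play_golf))  # yes  (6 of 9 -- the baseline predicts yes for every case)
```
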
33
Q

What is the OneR classifier?

A

Short for “One Rule”, is a simple classification algorithm that generates one rule for each predictor in the data, then selects the rule with the smallest total error as its “one rule”.

34
Q

To create a rule for a predictor, we ….. .

A

Construct a frequency table for each predictor against the target.

35
Q

OneR produces rules only slightly less accurate than state-of-the-art classification algorithms while ….. .

A

producing rules that are simple for humans to interpret.

36
Q

What is the OneR classifier algorithm?

A

  • For each predictor, and for each value of that predictor, make a rule as follows: count how often each value of the target (class) appears, find the most frequent class, and make the rule assign that class to this value of the predictor.
  • Calculate the total error of the rules of each predictor.
  • Choose the predictor with the smallest total error.

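The steps above can be sketched in Python; the tiny dataset is made up, and ties between predictors are broken by keeping the first one seen:

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """OneR: per predictor, map each value to its majority class;
    pick the predictor whose rules make the fewest errors on the data."""
    best = None
    for attr in rows[0]:
        if attr == target:
            continue
        tables = defaultdict(Counter)          # value -> class frequency table
        for row in rows:
            tables[row[attr]][row[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in tables.items()}
        errors = sum(n for v, c in tables.items()
                     for cls, n in c.items() if cls != rule[v])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

rows = [
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
attr, rule, errors = one_r(rows, "play")
print(attr, rule, errors)  # outlook {'sunny': 'no', 'rainy': 'yes'} 1
```
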
37
Q

A low total error means ….. .

A

a higher contribution to the predictability of the model.

38
Q

A is a random variable if:

A
  • A denotes something about which we are uncertain, perhaps the outcome of a randomized experiment.

39
Q

What is P(A)?

A

The fraction of possible worlds in which A is true.

40
Q

What is the set of possible worlds called?

A

Sample space (S).

41
Q

The Naive Bayesian (NB) classifier is based on ….. .

A

Bayes’ theorem with independence assumptions between predictors.

42
Q

A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation which makes it particularly useful for ….. .

A

very large datasets.

43
Q

Why does the Naive Bayesian classifier often do surprisingly well, and why is it widely used?

A

Because it often outperforms more sophisticated classification methods.

44
Q

What is the Naive Bayes classifier algorithm?

A
Bayes’ theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors.
This assumption is called class conditional independence.
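
A toy frequency-count version of this computation, hand-rolled on made-up data (not a library implementation); note that an unseen value-class combination yields a zero probability, which is exactly the zero-frequency problem:

```python
from collections import Counter, defaultdict

def naive_bayes(rows, target, query):
    """Score each class c by P(c) * prod_i P(x_i | c), using raw counts."""
    classes = Counter(row[target] for row in rows)
    cond = defaultdict(Counter)                # (attr, class) -> value counts
    for row in rows:
        for attr, value in row.items():
            if attr != target:
                cond[(attr, row[target])][value] += 1
    scores = {}
    for c, n_c in classes.items():
        p = n_c / len(rows)                    # prior P(c)
        for attr, value in query.items():
            p *= cond[(attr, c)][value] / n_c  # likelihood P(x|c)
        scores[c] = p
    return max(scores, key=scores.get)

rows = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "sunny", "play": "yes"},
]
print(naive_bayes(rows, "play", {"outlook": "sunny"}))  # no
```
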
45
Q

How to solve the zero-frequency problem?

A

Add 1 to the count for every attribute value-class combination (Laplace estimator) when an attribute value (Outlook=Overcast) doesn’t occur with every class value (Play Golf=no).

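A numeric sketch of the Laplace estimator, with made-up counts for one class:

```python
# Made-up counts of Outlook values within one class (say Play Golf=no):
# add 1 to every attribute value-class count so the unseen combination
# (here Outlook=Overcast) no longer produces a zero probability.
counts = {"sunny": 3, "overcast": 0, "rainy": 2}
k = len(counts)                 # number of distinct attribute values
n = sum(counts.values())

p_raw = {v: c / n for v, c in counts.items()}
p_laplace = {v: (c + 1) / (n + k) for v, c in counts.items()}
print(p_raw["overcast"], p_laplace["overcast"])  # 0.0 0.125
```
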
46
Q

What to do when working with numerical predictors?

A
  • Numerical variables need to be transformed to their categorical counterparts (binning) before constructing their frequency tables.
  • The other option we have is using the distribution of the numerical variable to have a good guess of the frequency.
  • For example, one common practice is to assume normal distributions for numerical variables.
47
Q

What are the 2 parameters that define the probability density function (PDF) for the normal distribution?

A

Mean and standard deviation.

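The PDF with these two parameters can be written directly; the example numbers (class mean 79.1, standard deviation 10.2, evaluated at x = 74) are illustrative:

```python
import math

def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

print(round(normal_pdf(74, 79.1, 10.2), 4))  # 0.0345
```
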
48
Q

What is Decision Tree Classification?

A
  • Decision tree builds classification or regression models in the form of a tree structure.
  • It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed.
  • The final result is a tree with decision nodes and leaf nodes.
  • Decision trees can handle both categorical and numerical data.
49
Q

Describe a decision node.

A

A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy).

50
Q

Describe a leaf node.

A

A leaf node (e.g., Play) represents a classification or decision.

51
Q

The topmost decision node in a tree which corresponds to the best predictor is called ….. .

A

root node.

52
Q

What is the decision tree classifier algorithm?

A

The core algorithm, called ID3, employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree.

53
Q

What is Entropy?

A
  • A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous).
  • The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one.
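
A minimal entropy function matching this description, on made-up labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class distribution, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(entropy(["yes"] * 9 + ["no"] * 9))  # 1.0  (equally divided)
# A completely homogeneous sample, e.g. entropy(["yes"] * 14), gives zero.
```
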
54
Q

The information gain is based on ….. .

A

the decrease in entropy after a dataset is split on an attribute.

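A sketch of the computation on a made-up four-row dataset:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((n / len(labels)) * math.log2(n / len(labels))
                for n in counts.values())

def information_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    total = entropy([r[target] for r in rows])
    split = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        split += len(subset) / len(rows) * entropy(subset)
    return total - split

rows = [
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
]
print(round(information_gain(rows, "outlook", "play"), 3))  # 0.311
```
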
55
Q

Constructing a decision tree is all about ….. .

A

Finding the attribute that returns the highest information gain.

56
Q

How to transform a decision tree into decision rules?

A

By mapping from the root node to the leaf nodes one by one.

57
Q

What are some decision tree issues?

A
  • Working with continuous attributes (binning).
  • Avoiding overfitting.
  • Super Attributes (attributes with many values).
  • Working with missing values.
58
Q

What is Logistic Regression?

A

Logistic regression predicts the probability of an outcome that can only have two values (i.e. a dichotomy).

59
Q

In logistic regression, what is the prediction based on?

A

The use of one or several predictors (numerical and categorical).

60
Q

What are the 2 reasons why a linear regression is not appropriate for predicting the value of a binary variable?

A
  • A linear regression will predict values outside the acceptable range (e.g. predicting probabilities outside the range 0 to 1).
  • Since the dichotomous experiments can only have one of two possible values for each experiment, the residuals will not be normally distributed about the predicted line.
61
Q

A logistic regression produces a logistic curve, which is limited to values between ….. .

A

0 and 1.

62
Q

Logistic regression is similar to a linear regression, but the curve is constructed using ….. , rather than the probability.

A

The natural logarithm of the “odds” of the target variable.

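The logistic curve and its inverse (the logit, i.e. the natural log of the odds) can be sketched as:

```python
import math

def logistic(z):
    """Map the log-odds (a linear combination of predictors) to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Inverse of the logistic curve: the natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

print(logistic(0))           # 0.5
print(round(logit(0.5), 1))  # 0.0
```
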
63
Q

….. is the method used to estimate coefficients for the best fit line in linear regression.

A

Ordinary least square regression.

64
Q

Logistic regression uses ….. to obtain the model coefficients that relate predictors to the target.

A

Maximum likelihood estimation (MLE).

65
Q

A ….. value is available to indicate the adequacy of the regression model.

A

pseudo R2.

66
Q

What is a Likelihood ratio test?

A

Is a test of the significance of the difference between the likelihood ratio for the baseline model and the likelihood ratio for a reduced model.

67
Q

A ….. is the difference between the likelihood ratio for the baseline model and the likelihood ratio for a reduced model.

A

model chi-square

68
Q

….. is used to test the statistical significance of each coefficient (b) in the model (i.e., predictors contribution).

A

Wald test

69
Q

What is the Core of ML?

A

Making predictions or decisions from Data.

70
Q

What are the 5 principles of Representation?

A
  • Coverage.
  • Concision.
  • Directness.
  • Templates.
  • Histograms.
71
Q

What is K nearest neighbors (KNN)?

A

Is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).

72
Q

….. has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s.

A

K nearest neighbors (KNN).

73
Q

What is the KNN algorithm?

A
  • A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors measured by a distance function.
  • If K = 1, then the case is simply assigned to the class of its nearest neighbor.
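
A minimal sketch of this majority vote with Euclidean distance; the training points are made up:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training cases
    (Euclidean distance via math.dist)."""
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up two-class training points: (features, label)
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((5.5, 4.5), "b"), ((4.8, 5.2), "b")]
print(knn_predict(train, (5.1, 5.0), k=3))  # b
print(knn_predict(train, (1.1, 0.9), k=1))  # a  (K=1: class of the single nearest neighbor)
```
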
74
Q

What are the 3 distance functions?

A
  • Euclidean
  • Manhattan
  • Minkowski
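
All three can be expressed through the Minkowski form, of which Manhattan (p = 1) and Euclidean (p = 2) are special cases; a sketch:

```python
def minkowski(x, y, p):
    """Minkowski distance; p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))  # 7.0 (Manhattan)
print(minkowski(x, y, 2))  # 5.0 (Euclidean)
```
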
75
Q

All three distance measures are only valid for ….. variables.

A

Continuous.

76
Q

In the case of categorical variables, the ….. distance must be used.

A

Hamming.

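A minimal sketch of Hamming distance, counting mismatched categories between two made-up cases:

```python
def hamming(x, y):
    """Hamming distance: the number of positions where the categories differ."""
    return sum(a != b for a, b in zip(x, y))

print(hamming(("male", "Sunny", "high"), ("female", "Sunny", "low")))  # 2
```
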
77
Q

Choosing the optimal value for K is best done by first ….. .

A

Inspecting the data.

78
Q

In general, a ….. K value is more precise as it reduces the overall noise but there is no guarantee.

A

Large.

79
Q

….. is a way to retrospectively determine a good K value by using an independent dataset to validate the K value.

A

Cross-validation.

80
Q

Historically, the optimal K for most datasets has been between ….. and ….. . That produces much better results than 1NN.

A

3, 10.

81
Q

One major drawback in calculating distance measures directly from the training set is in the case where ….. or ….. .

A

Variables have different measurement scales, there is a mixture of numerical and categorical variables.

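A common remedy for the different-measurement-scales case is to normalize each numerical variable before computing distances; a min-max sketch on made-up ages:

```python
def min_max_scale(values):
    """Rescale a numerical variable to [0, 1] so that predictors measured on
    different scales contribute comparably to a distance function."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([20, 35, 50, 80]))  # [0.0, 0.25, 0.5, 1.0]
```
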
82
Q

What is Linear Discriminant Analysis (LDA)?

A

LDA is a classification method based upon the concept of searching for a linear combination of the variables that best separates two classes. It is simple, mathematically robust, and often produces models whose accuracy is as good as that of more complex methods.

83
Q

One way of assessing the effectiveness of the discrimination is to calculate ….. .

A

the Mahalanobis distance between two groups.

84
Q

In LDA model assessment, what does a distance greater than 3 mean?

A

It means that the two group averages differ by more than 3 standard deviations, so the overlap (probability of misclassification) is quite small.

85
Q

A simple linear correlation between the LDA model scores and predictors can be used to ….. .

A

test which predictors contribute significantly to the discriminant function.

86
Q

How to avoid overfitting?

A
  • Stop growing when the data split is not statistically significant.
  • Grow full tree, then post-prune.
87
Q

What are decision tree Pros?

A
  • Simple to understand and interpret.
  • Little data preparation and little computation.
  • Indicates which attributes are most important for classification.
88
Q

What are decision tree Cons?

A
  • Learning an optimal decision tree is NP-complete.
  • Perform poorly with many classes and small data.
  • Computationally expensive to train.
  • Over-complex trees do not generalize well from the training data (overfitting).