Chapters 1, 2, 3 Flashcards
What is Data Science?
A set of fundamental principles that guide the extraction of knowledge from data.
Data science is not the same as data processing and engineering; they complement each other.
What is Data Mining?
The extraction of knowledge from data via technologies that incorporate the principles of data science.
Data mining techniques provide some of the clearest illustrations of the principles of data science.
What is data-driven decision making?
The practice of basing decisions on the analysis of data rather than purely on intuition.
Firms engage in DDD in varying degrees.
Firms that are data driven are more productive.
What are the two types of Data-Driven Decision making problems?
1) Decisions for which “discoveries” need to be made within data
2) Decisions that repeat, especially at massive scale, so that decision making can benefit from even small increases in decision-making accuracy based on data analysis.
What is Big Data?
Datasets that are too large for traditional data processing systems and therefore require new processing technologies.
Big data techniques are most frequently used for data processing in support of data mining techniques.
What is the fundamental principle of Data Science?
Data and the capability to extract useful knowledge from data should be regarded as key strategic assets.
Within a company, it is necessary to have a close connection between data scientists and business people.
What is overfitting?
Tailoring a model so closely to a particular dataset that, if you look hard enough, you will find something, but what you find may not be generalizable beyond the given data.
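A tiny sketch of the idea (all data hypothetical, pure stdlib): a "model" that simply memorizes random training labels is perfect on the data it has seen but no better than chance on new data.

```python
import random

random.seed(0)

# Hypothetical data: the labels are pure noise, so there is nothing
# generalizable to find.
train = [(random.random(), random.choice([0, 1])) for _ in range(200)]
test = [(random.random(), random.choice([0, 1])) for _ in range(200)]

# An extreme overfit: memorize every training example exactly.
memory = {x: y for x, y in train}

def predict(x):
    # Return the memorized label if x was seen before, otherwise guess class 0.
    return memory.get(x, 0)

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
# train_acc is perfect (1.0); test_acc stays near chance on unseen data.
```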
What is Classification?
- predicts for each individual (item) in a population, which of a (small) set of classes that individual (item) belongs to.
- List must be exhaustive and mutually exclusive.
- A related task is scoring and class probability estimation.
- The scoring task gives each individual a score representing the probability that that individual belongs to a given class.
- Example:
- “Among all the customers of MegaTelCo, which are likely to respond to a given offer?”
- In this example the two classes could be called will respond and will not respond.
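A minimal sketch of scoring and classification on the MegaTelCo example, using made-up data (segment names and outcomes are hypothetical):

```python
# Hypothetical response history: (customer_segment, responded) pairs.
history = [
    ("young_urban", True), ("young_urban", True), ("young_urban", False),
    ("suburban", True), ("suburban", False), ("suburban", False),
]

def response_score(segment):
    """Scoring: estimated probability that a customer in `segment` responds."""
    outcomes = [responded for s, responded in history if s == segment]
    return sum(outcomes) / len(outcomes)

def classify(segment, threshold=0.5):
    # The two classes are exhaustive and mutually exclusive.
    return "will respond" if response_score(segment) >= threshold else "will not respond"
```

Scoring produces the probability; classification turns it into one of the two classes via a threshold.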
What is Regression?
- Attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
- Example:
- “How much will a given customer use the service?”
- The property (variable) to be predicted here is service usage, and a model could be generated by looking at other, similar individuals in the population and their historical usage.
- Regression is related to classification, but the two are different. Informally, classification predicts *whether* something will happen, whereas regression predicts *how much* something will happen.
- Regression has a numerical target.
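One simple way to "look at other, similar individuals and their historical usage" is k-nearest-neighbour regression (all numbers hypothetical):

```python
# Hypothetical historical data: (customer_age, monthly_minutes_used).
history = [(25, 300.0), (28, 310.0), (30, 280.0), (45, 150.0), (50, 120.0)]

def predict_usage(age, k=3):
    """Predict the numerical target by averaging the usage of
    the k most similar (closest-in-age) customers."""
    nearest = sorted(history, key=lambda row: abs(row[0] - age))[:k]
    return sum(minutes for _, minutes in nearest) / k
```

Note the numerical target: the output is an amount of usage, not a class label.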
What is similarity matching?
- Attempts to identify similar individuals based on data known about them.
- Similarity matching can be used directly to find similar entities.
- Example:
- For example, IBM is interested in finding companies similar to their best business customers, in order to focus their sales force on the best opportunities.
- Making product recommendations based on people who are similar to you in terms of the products they have liked or purchased.
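A sketch of the IBM-style example with made-up feature vectors, using cosine similarity (one common similarity measure; the attributes and values are hypothetical):

```python
import math

# Hypothetical company profiles: (revenue, employees, growth) feature vectors.
companies = {
    "best_customer": [9.0, 8.0, 7.0],
    "prospect_a": [8.5, 7.5, 7.2],
    "prospect_b": [1.0, 2.0, 9.0],
}

def cosine_similarity(a, b):
    """Similarity between two attribute vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank prospects by similarity to the best customer.
target = companies["best_customer"]
ranked = sorted(
    (name for name in companies if name != "best_customer"),
    key=lambda name: cosine_similarity(companies[name], target),
    reverse=True,
)
```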
What is Clustering?
- Attempts to group individuals in a population by their similarity, but not driven by any specific purpose.
- Why is it useful? Useful in preliminary domain exploration to see which natural groups exist.
- Example:
- “Do our customers form natural groups or segments?”
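A minimal 1-D k-means sketch (one common clustering algorithm; the spend values and starting centers are hypothetical):

```python
# Hypothetical monthly customer spend values.
spend = [10.0, 12.0, 11.0, 95.0, 98.0, 102.0]

def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = {i: [] for i in range(len(centers))}
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - v))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else centers[i] for i, g in groups.items()]
    return sorted(centers)

centers = kmeans_1d(spend, centers=[0.0, 50.0])
# Two natural segments emerge: low spenders and high spenders.
```

Note that no target variable was given: the groups emerge purely from similarity.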
What is Co-occurrence grouping?
- Attempts to find associations between entities based on transactions involving them.
- Example:
- What items are commonly purchased together in a supermarket?
- Analyzing purchase records from a supermarket may uncover that ground meat is purchased together with hot sauce much more frequently than we might expect.
- While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
- In that sense co-occurrence grouping is more specific: clustering looks at general similarity, whereas co-occurrence grouping focuses on objects that appear together in transactions.
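The supermarket example can be sketched by counting item pairs per transaction (baskets are hypothetical):

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket transactions (one set of items per basket).
baskets = [
    {"ground meat", "hot sauce", "milk"},
    {"ground meat", "hot sauce"},
    {"milk", "bread"},
    {"ground meat", "hot sauce", "bread"},
]

# Count how often each unordered pair of items appears in the same transaction.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
# ground meat and hot sauce co-occur in 3 of 4 baskets.
```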
What is profiling?
- Attempts to characterise the typical behaviour of an individual, group or population.
- Example:
- “What is the typical cell phone usage of this customer segment?” (note that this refers to a customer segment, so it differs from the regression question about a single customer)
- Behavior may not have a simple description; profiling cell phone usage might require a complex description of night and weekend airtime averages, international usage, roaming charges, text minutes, and so on.
- Often used to establish behavioral norms (baseline) for anomaly detection applications such as fraud detection and monitoring intrusions
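A very simple profile-as-baseline sketch for anomaly detection (usage numbers hypothetical; real profiles would be far richer, as noted above):

```python
import statistics

# Hypothetical nightly call minutes for one customer segment.
usage = [30, 32, 28, 31, 29, 33, 30, 27]

# Profile: characterize typical behaviour with a simple baseline.
baseline_mean = statistics.mean(usage)
baseline_stdev = statistics.stdev(usage)

def is_anomalous(minutes, z_threshold=3.0):
    """Flag behaviour far from the behavioural norm, e.g. for fraud detection."""
    return abs(minutes - baseline_mean) / baseline_stdev > z_threshold
```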
What is link prediction?
- Attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly estimating the strength of the link.
- Example:
- Often used in social networking systems
- “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?”
- Link prediction can also estimate the strength of a link. For example, for recommending movies to customers one can think of a graph between customers and the movies they’ve watched or rated. Within the graph, we search for links that do not exist between customers and movies, but that we predict should exist and should be strong. These links form the basis for recommendations.
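The "shared friends" example can be sketched with a common-neighbours score (names and the graph are hypothetical):

```python
# Hypothetical social graph as adjacency sets.
friends = {
    "you":   {"alice", "bob", "carol"},
    "karen": {"alice", "bob", "dave"},
    "eve":   {"dave"},
}

def common_neighbors_score(a, b):
    """A simple link-prediction score: the more friends two people share,
    the stronger the predicted (currently missing) link between them."""
    return len(friends[a] & friends[b])

# "you" and "karen" share alice and bob, so a link is suggested;
# "you" and "eve" share no one, so no link is suggested.
```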
What is Data Reduction?
- Attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information of the large set.
- Example
- For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences).
- Data reduction usually involves a loss of information; what matters is the trade-off for improved insight.
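A toy sketch of the movie-viewing example: reduce a per-movie viewing log to per-viewer genre preferences (viewers, titles, and genres are made up; real data reduction would use techniques such as latent-factor models):

```python
from collections import Counter

# Hypothetical viewing log: one (viewer, movie, genre) row per viewing event.
views = [
    ("u1", "Alien", "sci-fi"), ("u1", "Blade Runner", "sci-fi"),
    ("u1", "Amelie", "romance"),
    ("u2", "Titanic", "romance"), ("u2", "Notting Hill", "romance"),
]

# Reduce the large log to a small table of genre preferences.
# Smaller and often more revealing, but the individual titles are lost.
profiles = {}
for viewer, _movie, genre in views:
    profiles.setdefault(viewer, Counter())[genre] += 1

favourite_genre = {v: counts.most_common(1)[0][0] for v, counts in profiles.items()}
```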
What is Causal Modelling?
- Attempts to help us understand which events or actions actually influence others.
- Can be done using both experimental and observational methods.
- These methods perform “counterfactual analysis”: they attempt to understand the difference between two situations, which cannot both happen, one in which the “treatment” event (e.g., showing an advertisement to a particular individual) occurs and one in which it does not.
What are unsupervised methods?
- The data mining task has no specific target or purpose.
- Clustering, co-occurrence grouping, and profiling are unsupervised methods.
- The risk is that the method forms groups that are not meaningful.
What are supervised methods?
- The data mining task has a specific purpose or target; hence it is necessary to have data on the target.
- Supervised tasks require different techniques than unsupervised tasks, and the results are often more useful.
- Examples: Classification, regression, and causal modeling are solved with supervised methods.
What type of overarching method are similarity matching, link prediction, and data reduction?
They could be either supervised or unsupervised.
What is the CRISP-DM framework?
Cross Industry Standard Process for Data Mining → one codification of the data mining process
The process diagram makes explicit the fact that iteration is the rule rather than the exception; going through the process once without having solved the problem is, generally speaking, not a failure.
How do the different parts of the CRISP-DM framework relate to each other?
Business Understanding and Data Understanding link to each other.
Data Understanding → Data Preparation.
Data Preparation and Modelling link to each other.
Modelling → Evaluation.
Evaluation → Deployment, and Evaluation also links back to Business Understanding.
What are the different parts of the CRISP-DM framework and how are they defined?
- **Business Understanding**
    - The design team should think about the use scenario.
    - What exactly do we want to do?
    - Start off with a simplified use scenario.
- **Data Understanding**
    - Available data rarely matches the problem at hand.
    - Understand the strengths and limitations of the data.
    - The cost of data varies: perform a cost/benefit analysis of acquiring additional data.
    - Clean the data for subsequent analysis.
- **Data Preparation**
    - Convert data into a usable format.
    - Data are manipulated and converted into forms that yield better results.
    - Leakage must be considered → a situation in which a variable collected in the historical data gives information about the target variable, but that information is not available when the decision actually needs to be made.
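A hypothetical illustration of leakage (all field names and values made up):

```python
# Sketch of leakage in hypothetical churn data: "account_closed_date" is only
# recorded *after* a customer has churned, so it is a perfect "predictor" in
# historical data but unavailable at the time the decision must be made.
rows = [
    {"usage": 120, "account_closed_date": None,         "churned": False},
    {"usage": 10,  "account_closed_date": "2024-03-01", "churned": True},
]

def leaky_predict(row):
    # Looks flawless on historical data; useless in production,
    # where the field is always empty at decision time.
    return row["account_closed_date"] is not None
```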
- **Modelling**
    - The primary stage where data mining techniques are applied to the data.
    - Understand the techniques and algorithms that can be used.
- **Evaluation**
    - Assess the data mining results and gain confidence that they are valid and reliable before moving on.
    - Ensures that the model satisfies the original business goal.
    - Goal: show that the detected patterns are truly regularities, not chance occurrences.
    - The assessment is both qualitative and quantitative, using a comprehensive evaluation framework.
    - The model needs to be comprehensible to other stakeholders (non-data scientists).
    - Evaluation may be extended into the development environment.
- **Deployment**
    - Putting the results of data mining into real use in order to realize some return on investment (ROI).
    - Use case: implementing a predictive model in a business process.
    - Increasingly, the data mining techniques themselves are deployed, because:
        - the world changes faster than the data science team can adapt the model, or
        - a business has too many modelling tasks to manually curate each model,
        - so systems automatically build models for the associated process.
    - Deployment typically requires that the model be recoded for the production environment, e.g. to accommodate greater speed or compatibility.
    - After mining the data (successfully or not), the process often returns to the initial business problem.