final exam Flashcards by Michael Markovinovic

Append

Add records from one dataset to another

Merge

Add fields from one dataset to another
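
The difference between Append and Merge can be sketched in plain Python. The datasets, field names, and the helper functions `append` and `merge` below are hypothetical and only illustrate the idea; they are not any tool's actual API.

```python
# Hypothetical toy datasets: each record is a dict of field -> value.
jan_sales = [{"id": 1, "amount": 20}, {"id": 2, "amount": 35}]
feb_sales = [{"id": 3, "amount": 50}]
customers = {1: {"name": "Ana"}, 2: {"name": "Ben"}, 3: {"name": "Cy"}}


def append(first, second):
    """Append: stack the records of one dataset under another."""
    return first + second


def merge(records, lookup, key):
    """Merge: add the fields found in `lookup` to each record with a matching key."""
    return [{**rec, **lookup.get(rec[key], {})} for rec in records]


appended = append(jan_sales, feb_sales)     # more records, same fields
merged = merge(appended, customers, "id")   # same records, more fields
```

Appending grows the dataset downward (3 records instead of 2), while merging grows it sideways (each record gains a `name` field).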

Rectangular Data

Data arranged as a table of records (rows) and fields (columns)

Stages of CRISP-DM

Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment

SuperNode

Condensing several nodes into a single node

Histogram is used for which fields

Continuous Fields

Distribution is used for which fields

Categorical Fields

Direct cause-and-effect link between 2 variables

Causation

2 variables change at a certain rate in relation to each other

Correlation
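
One common way to put a number on correlation is the Pearson coefficient r, which ranges from -1 to 1. The sketch below assumes that measure; the card itself does not name one, and the sample data is made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of paired deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 78]
r = pearson_r(hours, scores)  # close to 1: strong positive correlation
```

A value near 1 or -1 shows that the two fields move together, but it says nothing about whether one causes the other.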

Data point that deviates far from the other observations

Outlier/Extreme

Values 3-5 SD from the mean

Outlier

Values more than 5 SD away from the mean

Extreme (Outlier)
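
The two SD thresholds on these cards can be written as a small rule. This is a sketch that assumes the mean and SD are already known; the function name `classify` and the example numbers are illustrative.

```python
def classify(value, mean, sd):
    """Label a value by how many standard deviations it lies from the mean."""
    distance = abs(value - mean) / sd
    if distance > 5:
        return "extreme"   # more than 5 SD from the mean
    if distance >= 3:
        return "outlier"   # 3 to 5 SD from the mean
    return "normal"


# Hypothetical field with mean 100 and SD 10.
labels = [classify(v, mean=100, sd=10) for v in (125, 140, 160)]
# labels == ["normal", "outlier", "extreme"]
```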

2 or more categories that can be ranked (School ranking)

Ordinal Ranking

2 or more categories that have no inherent order and cannot be ranked (people's favorite color)

Nominal Ranking

All the numbers added together, then divided by how many numbers there are (average)

Mean

The number in the middle when you arrange a set from least to greatest

Median

The number that appears most often in the set

Mode
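
Mean, median, and mode can be checked quickly with Python's standard `statistics` module. The list of values is made up for illustration.

```python
import statistics

values = [2, 3, 3, 5, 7]

mean = statistics.mean(values)      # (2 + 3 + 3 + 5 + 7) / 5 = 4
median = statistics.median(values)  # middle value of the sorted list = 3
mode = statistics.mode(values)      # most frequent value = 3
```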

1) Rank
2) Fractional rank as %
3) Sum of case weights
4) Savage score
5) Fractional rank

Options for ranking models
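
Three of these options can be sketched as follows. The sketch assumes the usual statistics-package definitions (fractional rank = rank divided by the number of cases when every case has weight 1) and assumes no tied values. Savage scores and weighted cases are left out.

```python
def rank_types(values):
    """Return rank, fractional rank, and fractional rank as % for each value.

    Assumes all values are distinct and every case has weight 1.
    """
    n = len(values)
    ordered = sorted(values)
    rows = []
    for v in values:
        rank = ordered.index(v) + 1          # 1 = smallest value
        rows.append({
            "value": v,
            "rank": rank,
            "fractional_rank": rank / n,
            "fractional_rank_pct": 100 * rank / n,
        })
    return rows


rows = rank_types([30, 10, 20, 40])
# ranks are 3, 1, 2, 4; fractional ranks are 0.75, 0.25, 0.5, 1.0
```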

Estimates and compares models for continuous numeric range outcomes

AutoNumeric

Estimates and compares models for either nominal or binary targets

AutoClassifier

When testing multiple models, this node presents the results for all models and for both partitions, so you can easily determine which model performed best

Running an Analysis Node

1) Undefined values represented as $null$
2) White spaces
3) Values that are not in the allowed set of values

Invalid Values
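
The three kinds of invalid value listed above can be checked with one small function. This is a sketch in plain Python where `None` stands in for `$null$`; the function name `is_invalid` and the `allowed` parameter are illustrative.

```python
def is_invalid(value, allowed=None):
    """Return True if the value is null, blank, or outside the allowed set."""
    if value is None:                       # stands in for $null$
        return True
    if isinstance(value, str) and value.strip() == "":
        return True                         # empty or whitespace-only string
    if allowed is not None and value not in allowed:
        return True                         # not in the allowed set of values
    return False


# Hypothetical gender field that only allows "M" or "F".
flags = [is_invalid(v, allowed={"M", "F"}) for v in ["M", None, "  ", "X", "F"]]
# flags == [False, True, True, True, False]
```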

Defines an area of a certain size (space-time box)

Geohash

Models in this category predict a target field, using one or more predictors

Supervised Models

These models create groups of records with similar values on the input fields

Unsupervised Models (Segmentation)

The process of extracting valuable insights from large datasets. It helps organizations make data-driven decisions and understand customer behavior. By uncovering patterns and relationships in data, businesses can gain a competitive edge and improve efficiency.

Data Mining

Modify unit of analysis, remove duplicates, create a dataset with one record per customer

Distinct Node
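
The "one record per customer" idea can be sketched like this. The sketch keeps the first record seen for each key, which is an assumption; the field names and the helper `distinct` are hypothetical.

```python
def distinct(records, key):
    """Keep only the first record for each value of `key`."""
    seen = set()
    result = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            result.append(rec)
    return result


# Hypothetical purchases: customer 1 appears twice.
purchases = [
    {"customer": 1, "item": "pen"},
    {"customer": 2, "item": "ink"},
    {"customer": 1, "item": "pad"},
]
one_per_customer = distinct(purchases, "customer")
# two records remain: customer 1 (pen) and customer 2 (ink)
```

The unit of analysis changes from one record per purchase to one record per customer.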

Only data from records present in all source datasets will be merged

Inner Join
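
An inner join can be sketched in a few lines. The helper `inner_join` and the sample data are hypothetical, and the sketch assumes each key appears at most once in the right-hand dataset.

```python
def inner_join(left, right, key):
    """Keep only records whose key is present in both datasets."""
    right_by_key = {rec[key]: rec for rec in right}
    return [
        {**rec, **right_by_key[rec[key]]}
        for rec in left
        if rec[key] in right_by_key
    ]


orders = [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 3, "total": 40}]
regions = [{"id": 2, "region": "East"}, {"id": 3, "region": "West"}, {"id": 4, "region": "North"}]

joined = inner_join(orders, regions, "id")
# only ids 2 and 3 appear in both datasets, so only they are kept
```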

Automatically create new nominal fields based on the values of one or more existing continuous fields

Binning Node
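
Fixed-width binning is one simple way to turn a continuous field into a nominal one. The sketch below assumes that method (other binning methods exist); the labels and the helper `bin_fixed_width` are hypothetical.

```python
def bin_fixed_width(value, start, width, labels):
    """Map a continuous value to the label of its fixed-width bin."""
    index = int((value - start) // width)
    # Values beyond the last bin fall into the last bin; values below go to the first.
    index = max(0, min(index, len(labels) - 1))
    return labels[index]


age_labels = ["0-29", "30-59", "60+"]
age_bands = [bin_fixed_width(age, start=0, width=30, labels=age_labels)
             for age in (25, 30, 75, 95)]
# age_bands == ["0-29", "30-59", "60+", "60+"]
```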

1) First m - Returns/discards the first m records in the dataset
2) 1-in-n - Every nth record is selected/discarded
3) Random % - Each record has an r% probability of being selected/discarded

Sample Node (Simple method)
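
The three options above can be sketched in plain Python, shown here in "select" mode. The function names are illustrative, and the choice to start 1-in-n at the nth record is an assumption.

```python
import random


def first_m(records, m):
    """Select the first m records."""
    return records[:m]


def one_in_n(records, n):
    """Select every nth record (the nth, 2nth, 3nth, ...)."""
    return records[n - 1::n]


def random_pct(records, pct, seed=None):
    """Select each record independently with probability pct / 100."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < pct / 100]


records = list(range(1, 11))          # stand-in for 10 records
head = first_m(records, 3)            # [1, 2, 3]
every_fifth = one_in_n(records, 5)    # [5, 10]
about_half = random_pct(records, 50, seed=42)  # size varies with the seed
```

Discard mode would simply keep the complement of each selection.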

Measures how far values typically spread from the center (mean)

Standard Deviation
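
As a worked example, the standard deviation can be computed step by step. The sketch uses the population formula (divide by n rather than n - 1), which is an assumption, and made-up values.

```python
import math


def std_dev(values):
    """Population standard deviation: root of the mean squared distance from the mean."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)


sd = std_dev([2, 4, 4, 4, 5, 5, 7, 9])
# mean = 5, variance = 32 / 8 = 4, so sd = 2.0
```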

Sample from groups of records rather than from individual records

Clustered Sample

Sample independently within subgroups

Stratified Sample
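
The contrast between clustered and stratified sampling can be sketched as follows. The grouping field `store`, the helper names, and the fixed per-stratum sample size are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def group_by(records, key):
    """Collect records into lists keyed by the value of `key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return groups


def clustered_sample(records, cluster_key, n_clusters, seed=None):
    """Pick whole groups at random and keep every record in the chosen groups."""
    rng = random.Random(seed)
    groups = group_by(records, cluster_key)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [rec for cluster in chosen for rec in groups[cluster]]


def stratified_sample(records, stratum_key, per_stratum, seed=None):
    """Sample records independently inside each subgroup."""
    rng = random.Random(seed)
    groups = group_by(records, stratum_key)
    sample = []
    for stratum in sorted(groups):
        members = groups[stratum]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


records = [{"store": s, "sale": i} for s in "ABC" for i in range(3)]
clustered = clustered_sample(records, "store", n_clusters=2, seed=1)
stratified = stratified_sample(records, "store", per_stratum=1, seed=1)
```

The clustered sample contains all records from 2 of the 3 stores, while the stratified sample contains 1 record from every store.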

Relationship between 2 Categorical Fields

Matrix, Distribution

Relationship between 1 Categorical Field and 1 Continuous Field

Means, Histogram

Relationship between 2 Continuous Fields

Statistics, Plot

Process of reading or specifying information such as measurement levels and values for a field

Instantiating Data

A field has unknown storage

Uninstantiated

K-Means, Kohonen, TwoStep

3 segmentation methods
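
To give a feel for how one of these methods groups records, here is a minimal K-Means sketch on a single numeric field. The starting centers, the fixed iteration count, and the function name are assumptions; real tools handle many fields and choose starting centers themselves.

```python
def k_means_1d(values, centers, iterations=10):
    """Repeatedly assign values to the nearest center, then move each center to its group's mean."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


centers, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
# centers == [2.0, 11.0]; clusters == [[1, 2, 3], [10, 11, 12]]
```

No target field is involved: the groups come only from how similar the input values are, which is what makes this an unsupervised (segmentation) model.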