Domain 3 - Data Flashcards
Completeness
Are all the fields of the data complete?
Correctness
Is the data accurate?
Consistency
Is the data provided under a given field and for a given concept consistent with the definition of that field and concept?
Currency
Is the data obsolete?
Collaborative
Is the data based on one opinion or on a consensus of experts in the relevant area?
Confidential
Is the data secure from unauthorized use by individuals other than the decision maker?
Clarity
Is the data legible and comprehensible?
Common Format
Is the data in a format easily used in the application for which it is intended?
Convenient
Can the data be conveniently and quickly accessed by the intended user, in a time frame that allows it to be used effectively?
Cost-effective
Is the cost of collecting and using the data commensurate with its value?
Data warehouses typically include (three things)
- A staging area
- Data integration
- Access layers
Data warehouse staging area
The operational data sets from which the information is extracted
Data integration
The centralized source where the data is conveniently stored
Access layers
Multiple OLAP data marts which store the data in a form that is easy for analysts to retrieve
Data mart
A subset of the data warehouse organized along a single point of view (e.g., time, product type, geography) for efficient data retrieval.
Usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.
Data marts allow analysts to… (five things)
- Slice Data
- Dice Data
- Drill-down/up
- Roll-up
- Pivot
Slice data
Filtering data by picking a specific subset of the data cube and choosing a single value for one of its dimensions
Dice data
Grouping data by picking specific values for multiple dimensions
Drill-down/up
Navigating from the most summarized view of the data (drill-up) to the most detailed view (drill-down)
Roll-up
Summarizing the data along a dimension (e.g., computing totals or using some other formula)
Pivot
Interchanging rows and columns ("rotating the cube")
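To make these five operations concrete, here is a minimal sketch in pandas, using a made-up sales table (the tool and column names are illustrative assumptions, not prescribed by the cards):

```python
import pandas as pd

# Hypothetical "cube" of sales facts with three dimensions: time, geography, product.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "amount":  [100, 150, 200, 120, 130, 210],
})

# Slice: choose a single value for one dimension (year == 2024).
slice_2024 = sales[sales["year"] == 2024]

# Dice: pick specific values for multiple dimensions.
dice = sales[(sales["region"] == "East") & (sales["product"] == "A")]

# Roll-up: summarize along a dimension (total amount per year).
rollup = sales.groupby("year")["amount"].sum()

# Pivot: interchange rows and columns ("rotate the cube").
pivoted = sales.pivot_table(index="region", columns="product",
                            values="amount", aggfunc="sum")
print(pivoted)
```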
Fact tables
Used to record measurements or metrics for specific events at a fairly granular level of detail
Transaction fact tables
Record facts about specific events (like sales events)
Snapshot fact tables
Record facts at a given point in time (like account details at month end)
Accumulating snapshot tables
Record aggregate facts at a given point in time
Dimension tables
Have a smaller number of records compared to fact tables, although each record may have a very large number of attributes. Dimension tables include time, geography, product, employee, and range dimension tables.
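As a sketch of how fact and dimension tables work together, here is a made-up star-schema fragment in pandas (the tables and column names are illustrative assumptions):

```python
import pandas as pd

# Fact table: one row per sales event, at a granular level of detail.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1, 3],
    "date":       ["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-07"],
    "amount":     [100.0, 50.0, 75.0, 20.0],
})

# Product dimension table: few records, many descriptive attributes.
dim_product = pd.DataFrame({
    "product_id": [1, 2, 3],
    "name":       ["Widget", "Gadget", "Gizmo"],
    "category":   ["Hardware", "Hardware", "Toys"],
})

# Analysis joins facts to dimensions and aggregates along a dimension attribute.
report = (fact_sales.merge(dim_product, on="product_id")
                    .groupby("category")["amount"].sum())
print(report)
```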
What to do with missing data (4 things)
- Deletion of records
- Deletion when necessary
- Imputation
- Imputation at random
Filtering
Filtering can involve using relational algebra projection and selection to add or remove data based on its value.
Filtering usually involves outlier removal, exponential smoothing and the use of either Gaussian or median filters.
Filling in missing data with imputation
If other observations in the dataset can be used, then values for missing data can be generated using random sampling or Markov chain Monte Carlo (MCMC) methods. To avoid using other observations, imputation can be done using the mean, regression models, or statistical distributions based on existing observations.
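As a minimal sketch of the mean-imputation option, assuming scikit-learn's SimpleImputer and a toy matrix (neither is prescribed by the cards):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing values encoded as np.nan.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Mean imputation: replace each missing value with its column mean.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)  # the missing entries become 4.0 and 2.5
```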
Dimensionality reduction options for structured data
Principal component analysis or factor analysis can help determine whether there is correlation across different dimensions in the data
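A short PCA sketch with scikit-learn, on synthetic data deliberately built so all three columns are highly correlated (the data and tool choice are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# The second and third columns are noisy copies of the first.
X = np.column_stack([x,
                     x + 0.1 * rng.normal(size=200),
                     x + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=3).fit(X)
# Nearly all variance loads on one component: the data are effectively one-dimensional.
print(pca.explained_variance_ratio_)
```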
Dimensionality reduction options for unstructured text data
Term frequency-inverse document frequency (tf-idf): a numerical statistic intended to reflect how important a word is to a document in a collection or corpus
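A minimal tf-idf sketch using scikit-learn's TfidfVectorizer on made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "actuaries model insurance risk"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: one row per document, one column per term
# Words shared across documents ("the", "sat") score lower than rare, distinctive words.
print(vec.get_feature_names_out())
print(X.toarray().round(2))
```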
Feature hashing
Dimensionality reduction technique for when the data has a variable number of features. Feature hashing is an efficient method for creating a fixed number of features, which form the indices of an array
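A sketch of feature hashing with scikit-learn's FeatureHasher, on made-up records that carry different numbers of features:

```python
from sklearn.feature_extraction import FeatureHasher

# Records with a variable number of features (e.g., word counts per document).
records = [{"cat": 2, "sat": 1},
           {"dog": 1, "log": 1, "sat": 1}]

# Every feature name is hashed into one of a fixed number of array indices.
hasher = FeatureHasher(n_features=8, input_type="dict")
X = hasher.transform(records)
print(X.toarray())  # always 8 columns, however many distinct features appear
```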
Sensitivity analysis and wrapper methods
Used when you don’t know which features of your data are important.
Wrapper methods involve identifying a set of features on a small sample and then testing that set on a holdout sample.
Self-organizing maps and Bayes nets
Used to understand the probability distribution of the data
Normalization
Used to ensure data stays within common ranges. Prevents scales of data from obscuring interpretation and analysis
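A minimal normalization sketch, assuming min-max scaling with scikit-learn (standardization or other scalers would also fit this card):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (e.g., age vs. annual income).
X = np.array([[25, 40_000],
              [40, 90_000],
              [60, 30_000]])

# Rescale each column to [0, 1] so neither scale dominates distance-based methods.
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```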
When is format conversion used?
When data is in binary format
When are Fast Fourier Transforms and Discrete wavelet transforms used?
With frequency data
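A small NumPy sketch of the FFT recovering the dominant frequency of a synthetic signal (the signal is made up for illustration):

```python
import numpy as np

# A 5 Hz sine wave sampled at 100 Hz for one second.
t = np.arange(0, 1, 0.01)
signal = np.sin(2 * np.pi * 5 * t)

# The FFT moves the signal into the frequency domain; the peak sits at 5 Hz.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=0.01)
print(freqs[np.argmax(spectrum)])  # 5.0
```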
When are coordinate transformations used?
For geometric data defined over a Euclidean space.
Connectivity-Based clustering methods
AKA Hierarchical clustering
Generates an ordered set of clusters with variable precision
Hierarchical clustering
AKA Connectivity-Based methods
Generates an ordered set of clusters with variable precision
Centroid–Based clustering methods
When the number of clusters is known, k-means is a popular technique. When the number is unknown, x-means is a useful extension of k-means that both creates clusters and searches for the optimal number of clusters. Canopy clustering is an alternate way of enhancing k-means when the number of clusters is unknown.
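A minimal k-means sketch with scikit-learn on synthetic blobs (the data and parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs in two dimensions.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# With the number of clusters known, k-means recovers the three centers.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.round(1))
```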
Distribution-based clustering methods
Gaussian mixture models, which typically use the expectation-maximization (EM) algorithm. Used if you want any data element's membership in a segment to be "soft."
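A sketch of "soft" membership using a Gaussian mixture model in scikit-learn (fitted via EM); the data are made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping one-dimensional populations centered at 0 and 6.
X = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(6, 1, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
# "Soft" membership: a probability of belonging to each segment,
# roughly [0.5, 0.5] for a point halfway between the two centers.
print(gmm.predict_proba([[3.0]]).round(2))
```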
Density-based methods
Clustering methods for non-elliptical clusters; fractal methods and DBSCAN can be used.
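A DBSCAN sketch on scikit-learn's two-moons data, a standard example of non-elliptical clusters that defeat k-means (the parameter values are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents: non-elliptical shapes.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Density-based clustering finds the two crescents without assuming their shape.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # e.g., {0, 1}; any -1 labels would mark noise points
```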
Graph-Based methods
Clustering methods for when you have knowledge of how one item is connected to another; techniques based on cliques and semi-cliques can be used
Topic modelling
Clustering method for text data
How to determine important variables when the structure of the data is unknown?
Tree-based methods
How to determine important variables when statistical measures of importance are needed?
GLMs (generalized linear models)
How to determine important variables when statistical measures of importance are not needed?
Regression with shrinkage (e.g., LASSO, elastic net) and stepwise regression
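A LASSO sketch with scikit-learn, showing shrinkage pushing irrelevant coefficients to exactly zero (the data and penalty strength are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two of ten features actually drive the response.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
# Shrinkage zeroes out the eight irrelevant coefficients.
print(lasso.coef_.round(2))
```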
How to classify data into existing groups when unsure of feature importance?
Neural nets and random forests are helpful
How to classify data into existing groups when unsure of feature importance but a transparent model is required?
Decision trees (e.g., CART, CHAID)
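A decision tree sketch with scikit-learn; unlike the "black box" methods above, the fitted rules can be printed and audited (the dataset and depth are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The full set of if/then split rules is human-readable.
print(export_text(tree))
```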
Key problem with neural nets and random forests
Difficult to explain; they are "black boxes", less transparent than decision trees
How to classify data into existing groups with fewer than 20 dimensions?
K-nearest neighbours
When to use Naive Bayes?
When you have a large dataset with an unknown classification signal
When to use Hidden Markov Chains?
When estimating an unobservable state based on observable values
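A minimal NumPy sketch of the forward algorithm for a two-state hidden Markov model; all probabilities, and the claim/no-claim reading of the observations, are made-up illustrations:

```python
import numpy as np

start = np.array([0.6, 0.4])                 # P(initial hidden state)
trans = np.array([[0.7, 0.3], [0.4, 0.6]])   # P(next state | current state)
emit  = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(observation | hidden state)

obs = [0, 0, 1]  # observed sequence (e.g., 0 = "no claim", 1 = "claim")

# alpha[i] = P(observations so far, current hidden state = i)
alpha = start * emit[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ trans) * emit[:, o]

print(alpha / alpha.sum())  # posterior over the unobservable state after the sequence
```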