Domain 3 - Data Flashcards

1
Q

Completeness

A

Are all the fields of the data complete

2
Q

Correctness

A

Is the data accurate

3
Q

Consistency

A

Is the data provided under a given field and for a given concept consistent with the definition of that field and concept

4
Q

Currency

A

Is the data obsolete

5
Q

Collaborative

A

Is the data based on one opinion or on a consensus of experts in the relevant area

6
Q

Confidential

A

Is the data secure from unauthorized use by individuals other than the decision maker

7
Q

Clarity

A

Is the data legible and comprehensible

8
Q

Common Format

A

Is the data in a format easily used in the application for which it is intended

9
Q

Convenient

A

Can the data be conveniently and quickly accessed by the intended user, in a time frame that allows it to be used effectively

10
Q

Cost-effective

A

Is the cost of collecting and using the data commensurate with its value

11
Q

Data warehouses typically describe (three things)

A
  1. A Staging area
  2. Data integration
  3. Access Layers
12
Q

Data warehouse staging area

A

The operational data sets from which the information is extracted

13
Q

Data integration

A

The centralized source where the data is conveniently stored

14
Q

Access layers

A

Multiple OLAP data marts that store the data in a form that is easy for the analyst to retrieve

15
Q

Data mart

A

A subset of the data warehouse organized along a single point of view (e.g., time, product type, geography) for efficient data retrieval.

Usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.

16
Q

Data marts allow analysts to… (five things)

A
  1. Slice Data
  2. Dice Data
  3. Drill-down/up
  4. Roll-up
  5. Pivot
17
Q

Slice data

A

filtering data by picking a specific subset of the data-cube and choosing a single value for one of its dimensions

18
Q

Dice data

A

grouping data by picking specific values for multiple dimensions

19
Q

Drill-down/up

A

allows the user to navigate from the most summarized (high-level) view to the most detailed one (drill-down), and back up to the summary (drill-up)

20
Q

Roll-up

A

summarize the data along a dimension (e.g., computing totals or using some other formula)

21
Q

Pivot

A

interchange rows and columns ('rotate the cube')
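
The four operations above can be sketched in a few lines of Python. This is a minimal illustration on a toy cube; the data, dimension indices, and function names are all invented for the example:

```python
# A tiny 3-D "cube" as a dict: (year, product, region) -> sales.
# Data and function names are invented for this sketch.
cube = {
    (2023, "widget", "EU"): 10,
    (2023, "widget", "US"): 20,
    (2023, "gadget", "EU"): 5,
    (2024, "widget", "EU"): 12,
    (2024, "gadget", "US"): 8,
}

def slice_cube(cube, dim, value):
    """Slice: choose a single value for one dimension."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice_cube(cube, criteria):
    """Dice: pick specific values for multiple dimensions.
    `criteria` maps a dimension index to the set of allowed values."""
    return {k: v for k, v in cube.items()
            if all(k[d] in allowed for d, allowed in criteria.items())}

def roll_up(cube, dim):
    """Roll-up: summarize (total) the measure along one dimension."""
    totals = {}
    for k, v in cube.items():
        key = k[:dim] + k[dim + 1:]
        totals[key] = totals.get(key, 0) + v
    return totals

def pivot(cube, d1, d2):
    """Pivot: interchange two dimensions ('rotate the cube')."""
    out = {}
    for k, v in cube.items():
        k = list(k)
        k[d1], k[d2] = k[d2], k[d1]
        out[tuple(k)] = v
    return out
```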

22
Q

Fact tables

A

used to record measurements or metrics for specific events at a fairly granular level of detail

23
Q

Transaction fact tables

A

record facts about specific events (like sales events)

24
Q

Snapshot fact tables

A

record facts at a given point in time (like account details at month end)

25
Q

Accumulating snapshot tables

A

record aggregate facts at a given point in time

26
Q

Dimension tables

A

Have a smaller number of records than fact tables, although each record may have a very large number of attributes. Dimension tables include time, geography, product, employee, and range dimension tables.

27
Q

What to do with missing data (4 things)

A

  1. Deletion of record
  2. Deletion when necessary
  3. Imputation
  4. Imputation at random

28
Q

Filtering

A

Filtering can involve using relational algebra projection and selection to add or remove data based on its value.

Filtering usually involves outlier removal, exponential smoothing and the use of either Gaussian or median filters.

29
Q

Filling in missing data with imputation

A

If other observations in the dataset can be used, then values for missing data can be generated using random sampling or Markov chain Monte Carlo methods.

To avoid using other observations, imputation can be done using the
mean, regression models or statistical distributions based on existing
observations.
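
A minimal pure-Python sketch of two of these approaches, mean imputation and imputation by random sampling from the observed values (function names and data are illustrative):

```python
import random

def impute_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_random(values, seed=0):
    """Replace each None by sampling at random from the observed values."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]
```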

30
Q

Dimensionality reduction options for structured data

A

Principal component analysis or factor analysis can help determine whether there is correlation across different dimensions in the data

31
Q

Dimensionality reduction options for unstructured text data

A

Term frequency-inverse document frequency (tf-idf): a numerical statistic intended to reflect how important a word is to a document in a collection or corpus
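
A minimal sketch of the tf-idf computation, using the plain variants tf = term count / document length and idf = log(N / document frequency); several other weighting schemes exist:

```python
import math

def tf_idf(docs):
    """tf-idf per term per document:
    tf = term count / document length, idf = log(N / document frequency)."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores
```

A term that appears in every document gets an idf of log(1) = 0, so it carries no weight; rarer terms score higher.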

32
Q

Feature hashing

A

Dimensionality reduction technique for when data has a variable number of features. Feature hashing is an efficient method for creating a fixed number of features, which form the indices of an array
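
A minimal sketch of the hashing trick, with CRC32 standing in for a proper stable hash such as MurmurHash; the bucket count and names are illustrative:

```python
import zlib

def hash_features(tokens, n_buckets=8):
    """Hash each token to one of a fixed number of array indices and
    count occurrences; collisions are accepted as the price of a
    fixed-size representation."""
    vec = [0] * n_buckets
    for tok in tokens:
        # CRC32 stands in for a proper stable hash (e.g. MurmurHash).
        vec[zlib.crc32(tok.encode()) % n_buckets] += 1
    return vec
```

However many distinct tokens appear, the output vector always has `n_buckets` entries.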

33
Q

Sensitivity analysis and wrapper methods

A

Used when you don’t know which features of your data are important.
Wrapper methods involve identifying a set of features on a small sample and then testing that set on a holdout sample.

34
Q

Self-organizing maps and Bayes nets

A

Used to understand the probability distribution of the data

35
Q

Normalization

A

Used to ensure data stays within common ranges. Prevents scales of data from obscuring interpretation and analysis

36
Q

When is format conversion used?

A

When data is in binary format

37
Q

When are Fast Fourier Transforms and Discrete wavelet transforms used?

A

With frequency data
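
A sketch of the underlying idea: the naive O(n²) discrete Fourier transform below computes the same spectrum an FFT produces in O(n log n), and shows how a dominant frequency appears as a peak in the magnitude spectrum:

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform; an FFT computes the
    same spectrum in O(n log n)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# An 8-sample cosine at frequency 2 produces a peak in bin 2
# (and its mirror image, bin 6) of the magnitude spectrum.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = [abs(x) for x in dft(signal)]
```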

38
Q

When are coordinate transformations used?

A

For geometric data defined over a Euclidean space.

39
Q

Connectivity-Based clustering methods

A

AKA Hierarchical clustering

Generates an ordered set of clusters with variable precision

40
Q

Hierarchical clustering

A

AKA Connectivity-Based methods

Generates an ordered set of clusters with variable precision

41
Q

Centroid–Based clustering methods

A

When the number of clusters is known, k-means is a popular technique. When the number is unknown, x-means is a useful extension of k-means that both creates clusters and searches for the optimal number of clusters. Canopy clustering is an alternate way of enhancing k-means when the number of clusters is unknown.
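
A minimal 1-D k-means sketch (the initialization and fixed iteration count are deliberately simplistic; real implementations use smarter seeding such as k-means++):

```python
def k_means(points, k, iters=20):
    """Plain 1-D k-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster.  The naive
    initialization here is where k-means++ or canopy clustering
    would plug in."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)
```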

42
Q

Distribution-based clustering methods

A

Gaussian mixture models, which typically use the expectation-maximization (EM) algorithm.
Used if you want any data element's membership in a segment to be 'soft.'

43
Q

Density-based methods

A

Clustering method for non-elliptical clusters; fractal clustering and DBSCAN can be used.

44
Q

Graph-Based methods

A

Clustering method for when you have knowledge of how one item is connected to another; looks for cliques and semi-cliques

45
Q

Topic modelling

A

Clustering method for text data

46
Q

How to determine important variables when structure of data is unknown?

A

Tree-based methods

47
Q

How to determine important variables when statistical measures of importance are needed?

A

GLM models

48
Q

How to determine important variables when statistical measures of importance are not needed?

A

Regression with shrinkage (e.g., LASSO, elastic net) and stepwise regression

49
Q

How to classify data into existing groups when unsure of feature importance?

A

Neural nets and random forests are helpful

50
Q

How to classify data into existing groups when unsure of feature importance but a transparent model is required?

A

Decision trees (e.g., CART, CHAID)

51
Q

Key problem with neural nets and random forests

A

Difficult to explain ("black box"); less transparent than decision trees

52
Q

How to classify data into existing groups with fewer than 20 dimensions?

A

K-nearest neighbours
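
A minimal k-nearest-neighbours classifier sketch (Euclidean distance, majority vote; the training data and names are invented for the example):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest labelled
    training points, using Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```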

53
Q

When to use Naive Bayes?

A

When you have a large dataset with an unknown classification signal

54
Q

When to use Hidden Markov Chains?

A

When estimating an unobservable state based on observable values