Domain 3 - Data Flashcards
Completeness
Are all the fields of the data complete?
Correctness
Is the data accurate?
Consistency
Is the data provided under a given field and for a given concept consistent with the definition of that field and concept?
Currency
Is the data obsolete?
Collaborative
Is the data based on one opinion or on a consensus of experts in the relevant area?
Confidential
Is the data secure from unauthorized use by individuals other than the decision maker?
Clarity
Is the data legible and comprehensible?
Common Format
Is the data in a format easily used in the application for which it is intended?
Convenient
Can the data be conveniently and quickly accessed by the intended user, in a time frame that allows it to be used effectively?
Cost-effective
Is the cost of collecting and using the data commensurate with its value?
Data warehouses typically include (three things)
- A staging area
- Data integration
- Access layers
Data warehouse staging area
The operational data sets from which the information is extracted
Data integration
The centralized source where the data is conveniently stored
Access layers
Multiple OLAP data marts which store the data in a form that is easy for analysts to retrieve
Data mart
A subset of the data warehouse organized along a single point of view (e.g., time, product type, geography) for efficient data retrieval.
Usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.
Data marts allow analysts to… (five things)
- Slice Data
- Dice Data
- Drill-down/up
- Roll-up
- Pivot
Slice data
Filtering data by picking a specific subset of the data cube and choosing a single value for one of its dimensions
Dice data
Grouping data by picking specific values for multiple dimensions
Drill-down/up
Navigating from the most summarized view of the data (drill-up) to the most detailed view (drill-down)
Roll-up
Summarizing the data along a dimension (e.g., computing totals or using some other formula)
Pivot
Interchanging rows and columns ("rotating the cube")
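To make these five operations concrete, here is a minimal sketch in pandas, using a made-up sales table (the tool and column names are illustrative assumptions, not prescribed by the cards):

```python
import pandas as pd

# Hypothetical "cube" of sales facts with three dimensions: time, geography, product.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "amount":  [100, 150, 200, 120, 130, 210],
})

# Slice: choose a single value for one dimension (year == 2024).
slice_2024 = sales[sales["year"] == 2024]

# Dice: pick specific values for multiple dimensions.
dice = sales[(sales["region"] == "East") & (sales["product"] == "A")]

# Roll-up: summarize along a dimension (total amount per year).
rollup = sales.groupby("year")["amount"].sum()

# Pivot: interchange rows and columns ("rotate the cube").
pivoted = sales.pivot_table(index="region", columns="product",
                            values="amount", aggfunc="sum")
print(pivoted)
```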
Fact tables
Used to record measurements or metrics for specific events at a fairly granular level of detail
Transaction fact tables
Record facts about specific events (like sales events)
Snapshot fact tables
Record facts at a given point in time (like account details at month end)
Accumulating snapshot tables
Record aggregate facts at a given point in time
Dimension tables
Have a smaller number of records compared to fact tables, although each record may have a very large number of attributes. Dimension tables include time, geography, product, employee, and range dimension tables.
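As a sketch of how fact and dimension tables work together, here is a made-up star-schema fragment in pandas (the tables and column names are illustrative assumptions):

```python
import pandas as pd

# Fact table: one row per sales event, at a granular level of detail.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1, 3],
    "date":       ["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-07"],
    "amount":     [100.0, 50.0, 75.0, 20.0],
})

# Product dimension table: few records, many descriptive attributes.
dim_product = pd.DataFrame({
    "product_id": [1, 2, 3],
    "name":       ["Widget", "Gadget", "Gizmo"],
    "category":   ["Hardware", "Hardware", "Toys"],
})

# Analysis joins facts to dimensions and aggregates along a dimension attribute.
report = (fact_sales.merge(dim_product, on="product_id")
                    .groupby("category")["amount"].sum())
print(report)
```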
What to do with missing data (4 things)
- Deletion of records
- Deletion when necessary
- Imputation
- Imputation at random
Filtering
Filtering can involve using relational algebra projection and selection to add or remove data based on its value.
Filtering usually involves outlier removal, exponential smoothing and the use of either Gaussian or median filters.
Filling in missing data with imputation
If other observations in the dataset can be used, then values for missing data can be generated using random sampling or Markov chain Monte Carlo (MCMC) methods. To avoid using other observations, imputation can be done using the mean, regression models, or statistical distributions based on existing observations.
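As a minimal sketch of the mean-imputation option, assuming scikit-learn's SimpleImputer and a toy matrix (neither is prescribed by the cards):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing values encoded as np.nan.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Mean imputation: replace each missing value with its column mean.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)  # the missing entries become 4.0 and 2.5
```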
Dimensionality reduction options for structured data
Principal component analysis or factor analysis can help determine whether there is correlation across different dimensions in the data
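A short PCA sketch with scikit-learn, on synthetic data deliberately built so all three columns are highly correlated (the data and tool choice are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# The second and third columns are noisy copies of the first.
X = np.column_stack([x,
                     x + 0.1 * rng.normal(size=200),
                     x + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=3).fit(X)
# Nearly all variance loads on one component: the data are effectively one-dimensional.
print(pca.explained_variance_ratio_)
```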
Dimensionality reduction options for unstructured text data
Term frequency-inverse document frequency (tf-idf): a numerical statistic intended to reflect how important a word is to a document in a collection or corpus
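A minimal tf-idf sketch using scikit-learn's TfidfVectorizer on made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "actuaries model insurance risk"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: one row per document, one column per term
# Words shared across documents ("the", "sat") score lower than rare, distinctive words.
print(vec.get_feature_names_out())
print(X.toarray().round(2))
```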
Feature hashing
Dimensionality reduction technique for when the data has a variable number of features. Feature hashing is an efficient method for creating a fixed number of features, which form the indices of an array
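A sketch of feature hashing with scikit-learn's FeatureHasher, on made-up records that carry different numbers of features:

```python
from sklearn.feature_extraction import FeatureHasher

# Records with a variable number of features (e.g., word counts per document).
records = [{"cat": 2, "sat": 1},
           {"dog": 1, "log": 1, "sat": 1}]

# Every feature name is hashed into one of a fixed number of array indices.
hasher = FeatureHasher(n_features=8, input_type="dict")
X = hasher.transform(records)
print(X.toarray())  # always 8 columns, however many distinct features appear
```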
Sensitivity analysis and wrapper methods
Used when you don’t know which features of your data are important.
Wrapper methods involve identifying a set of features on a small sample and then testing that set on a holdout sample.
Self-organizing maps and Bayes nets
Used to understand the probability distribution of the data
Normalization
Used to ensure data stays within common ranges. Prevents scales of data from obscuring interpretation and analysis
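A minimal normalization sketch, assuming min-max scaling with scikit-learn (standardization or other scalers would also fit this card):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (e.g., age vs. annual income).
X = np.array([[25, 40_000],
              [40, 90_000],
              [60, 30_000]])

# Rescale each column to [0, 1] so neither scale dominates distance-based methods.
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```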
When is format conversion used?
When data is in binary format
When are Fast Fourier Transforms and Discrete wavelet transforms used?
With frequency data
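A small NumPy sketch of the FFT recovering the dominant frequency of a synthetic signal (the signal is made up for illustration):

```python
import numpy as np

# A 5 Hz sine wave sampled at 100 Hz for one second.
t = np.arange(0, 1, 0.01)
signal = np.sin(2 * np.pi * 5 * t)

# The FFT moves the signal into the frequency domain; the peak sits at 5 Hz.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=0.01)
print(freqs[np.argmax(spectrum)])  # 5.0
```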
When are coordinate transformations used?
For geometric data defined over a Euclidean space.
Connectivity-Based clustering methods
AKA Hierarchical clustering
Generates an ordered set of clusters with variable precision
Hierarchical clustering
AKA Connectivity-Based methods
Generates an ordered set of clusters with variable precision
Centroid–Based clustering methods
When the number of clusters is known, k-means is a popular technique. When the number is unknown, x-means is a useful extension of k-means that both creates clusters and searches for the optimal number of clusters. Canopy clustering is an alternate way of enhancing k-means when the number of clusters is unknown.
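A minimal k-means sketch with scikit-learn on synthetic blobs (the data and parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs in two dimensions.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# With the number of clusters known, k-means recovers the three centers.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.round(1))
```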
Distribution-based clustering methods
Gaussian mixture models, which typically use the expectation-maximization (EM) algorithm. Used if you want any data element's membership in a segment to be "soft."
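A sketch of "soft" membership using a Gaussian mixture model in scikit-learn (fitted via EM); the data are made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping one-dimensional populations centered at 0 and 6.
X = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(6, 1, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
# "Soft" membership: a probability of belonging to each segment,
# roughly [0.5, 0.5] for a point halfway between the two centers.
print(gmm.predict_proba([[3.0]]).round(2))
```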
Density-based methods
Clustering methods for non-elliptical clusters; fractal methods and DBSCAN can be used.
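A DBSCAN sketch on scikit-learn's two-moons data, a standard example of non-elliptical clusters that defeat k-means (the parameter values are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents: non-elliptical shapes.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Density-based clustering finds the two crescents without assuming their shape.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # e.g., {0, 1}; any -1 labels would mark noise points
```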
Graph-Based methods
Clustering methods for when you have knowledge of how one item is connected to another; techniques based on cliques and semi-cliques can be used
Topic modelling
Clustering method for text data
How to determine important variables when the structure of the data is unknown?
Tree-based methods
How to determine important variables when statistical measures of importance are needed?
GLMs (generalized linear models)
How to determine important variables when statistical measures of importance are not needed?
Regression with shrinkage (e.g., LASSO, elastic net) and stepwise regression
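A LASSO sketch with scikit-learn, showing shrinkage pushing irrelevant coefficients to exactly zero (the data and penalty strength are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two of ten features actually drive the response.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
# Shrinkage zeroes out the eight irrelevant coefficients.
print(lasso.coef_.round(2))
```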
How to classify data into existing groups when unsure of feature importance?
Neural nets and random forests are helpful
How to classify data into existing groups when unsure of feature importance but a transparent model is required?
Decision trees (e.g., CART, CHAID)
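A decision tree sketch with scikit-learn; unlike the "black box" methods above, the fitted rules can be printed and audited (the dataset and depth are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The full set of if/then split rules is human-readable.
print(export_text(tree))
```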
Key problem with neural nets and random forests
Difficult to explain; they are "black boxes", less transparent than decision trees
How to classify data into existing groups with fewer than 20 dimensions?
K-nearest neighbours
When to use Naive Bayes?
When you have a large dataset with an unknown classification signal
When to use Hidden Markov Chains?
When estimating an unobservable state based on observable values
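A minimal NumPy sketch of the forward algorithm for a two-state hidden Markov model; all probabilities, and the claim/no-claim reading of the observations, are made-up illustrations:

```python
import numpy as np

start = np.array([0.6, 0.4])                 # P(initial hidden state)
trans = np.array([[0.7, 0.3], [0.4, 0.6]])   # P(next state | current state)
emit  = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(observation | hidden state)

obs = [0, 0, 1]  # observed sequence (e.g., 0 = "no claim", 1 = "claim")

# alpha[i] = P(observations so far, current hidden state = i)
alpha = start * emit[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ trans) * emit[:, o]

print(alpha / alpha.sum())  # posterior over the unobservable state after the sequence
```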