EXAM 2 Flashcards
4.1 How do Accountants Design Data Analysis Projects?
Create a Data Analysis Project Plan
What are the 5 steps?
Step 1: F___ on the o____e
Step 2: Select a d___ str___y
Step 3: Select an an___ str___y
Step 4: Consider r___
Step 5: Em___ co___s
Step 1: Focus on the objective
Step 2: Select a data strategy
Step 3: Select an analysis strategy
Step 4: Consider risks
Step 5: Embed controls
Create a Data Analysis Project Plan
Step 1: Focus on the objective
Keep the project’s 1) o___ve and 2) sp___ q____s in mind to 3) select the 4) b___ data and 4) an___ strategies to 5) ful__ the ob___tive and answer those 6) q___
Simply remembering to ask how the plan’s 7) pro__ d__ and analysis st___y decisions 8) re___ to the 9) objective helps us make better 10) ch__
1) objective
2) specific questions
3) select
4) best data
4) analysis strategies
5) fulfill the objective
6) questions
7) proposed data and analysis strategy decisions
8) relate
9) objective
10) choices
Create a Data Analysis Project Plan
Step 2: Select a data strategy
Use 1) cr___ t____g to 2) d__p and r___ a few data 3) al___s.
This 4) en___ that we choose the data option most 5) ap___e for the objective
1) critical thinking
2) develop and rank
3) alternatives
4) ensures
5) appropriate
Create a Data Analysis Project Plan
Step 3: Select an analysis strategy
Use what was 1) le___ from 2) s___g the data 3) str__ and apply that same 4) dev___t and r____g process to analyses 5) alt___
Following this step 6) inc___ the likelihood of selecting the 7) b___ analysis 8) o___n
1) learned
2) selecting
3) strategy
4) development and ranking
5) alternatives
6) increases
7) best
8) option
Create a Data Analysis Project Plan
Step 4: Consider risks
1) Co___ and pr___g cr___l risks to both the data and the analysis strategies 2) re__s how these risks can create 3) mis___ and in___ results
1) Considering and prioritizing critical
2) reveals
3) misleading and invalid
Create a Data Analysis Project Plan
Step 5: Embed controls
Designing and implementing 1) pre____ and de___ co___s into the analysis process leads to results that are 2) ac___, v____, and re___
1) preventative and detective controls
2) accurate, valid, and reliable
Step 2: Select the Data Strategy
Develop several data alternatives that could help answer the objective question.
Then, to select the most useful data alternative for the project plan, identify the factors you want to use to rank these alternatives and assign values to each alternative’s factors.
The best data strategy alternative is the one with the 1) hi___t ov__l factor r___gs
1) highest overall factor
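The ranking logic can be sketched in Python; the alternatives, factor names, and scores below are hypothetical examples, not from the text:

```python
# Rank hypothetical data-strategy alternatives by summing factor ratings.
alternatives = {
    "internal sales data": {"relevance": 5, "availability": 5, "cost": 4},
    "external market data": {"relevance": 4, "availability": 2, "cost": 2},
}

def overall_rating(factors):
    # Higher total = better alternative under this simple scheme.
    return sum(factors.values())

# The best data strategy alternative has the highest overall factor rating.
best = max(alternatives, key=lambda name: overall_rating(alternatives[name]))
print(best)
```

A real plan might weight some factors more heavily; a weighted sum works the same way.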
Step 3: Select the Analysis Strategy
For any project plan, choosing the best analysis strategy involves considering and evaluating several possible alternative analyses given the objective questions and the already selected strategy for the data
Steps 4 and 5: Data Strategy Risks and Controls
These data risks can be controlled by comparing the extracted data to the source invoice and collection documents.
Possible risks involving the management’s estimates of bad debt percentages include human bias embedded in the authorized bad debt percentages, changes in customer payment behavior, and business process changes to credit customer approvals and collection practices.
Steps 4 and 5: Data Strategy Risks and Controls
One way to control for these risks is to ask the finance, sales, and accounts receivable managers if there were changes to the customer base, market conditions, or business policies and procedures for credit approval, write-offs, or collection policies during the year. This knowledge could confirm existing bad debt percentages or motivate their adjustment.
Another control for evaluating the risks in management estimates and assumptions is to compare 1) c___t data to p___ y__ data. Evaluating increases in the median number of days outstanding and the number of new to returning customers in each age group can offer insights into the reasonableness of bad debt estimation percentages.
1) current data to prior year data
4.2 What Should We Consider When Selecting Data for Analysis?
It’s easier to make better decisions when we have the 1) ne___ in___
1) necessary information
A successful data analysis project hinges on selecting data that are 1) rel___t and ap___ for the objective of the project, respecting the data’s 2) cha___s and me___ scales, and controlling for in__ data ri__
1) relevant and appropriate
2) characteristics and measurement scales, and controlling for inherent data risks
Identify Appropriate Data
Data can be considered appropriate for analysis when they are 1) r___t, av__e, and the ch___ ma__ the analysis method 2) req___.
Appropriate data can be 3) in___, e___al, or a co____ of both
1) relevant, available, and the characteristics match
2) requirements
3) internal, external, or a combination of both
Identify Appropriate Data
Internal data are generated within the organization, such as 1) s__ data, pur___ data, in__y data, c___er data, and ve___r data. Internal data can typically be more 2) ea___ co___d and v___d by an organization.
External data are obtained from sources outside of an organization. This data can include 3) we___ data, ge___ic data, and pub__ available co___or data. External data are somewhat 4) ri___r to use since we often 5) ca__t know if the data are 6) ac___ or co___. External data can, however, provide 7) in___ that internal data alone 8) c___t p___e
After identifying the available and relevant data alternatives, the 9) cha__ of the possible data sets need to be 10) ve___ as 11) s___e for the planned analysis.
1) sales data, purchase data, inventory data, customer data, and vendor data
2) easily controlled and verified
3) weather data, geographic data, and publicly available competitor data
4) riskier
5) cannot
6) accurate or complete
7) insights
8) cannot provide
9) characteristics
10) verified
11) suitable
Define Data Set
A 1) co____ of data 2) col___ and r___ available for analysis
1) collection
2) columns and rows
Understanding the 1) cha____ of a data set is important because, for example, 1.1) st___al me___ and t__s often require certain data characteristics or a minimum of data points.
2) Vi___ the data requirements for these measures and tests can threaten the 3) ac___, re___y, and sig____e of the analysis results.
1) characteristics
1.1) statistical measures and tests
2) Violating
3) accuracy, reliability, and significance
Define Fields
1) In____l columns representing the 2) cha___s about each 3) r___ stored in the columns of a data set
1) Individual columns
2) characteristics
3) record
Define Attributes
The data fields that describe aspects of a 1) re___, e___t, or a___t of the object of 2) in___.
When the data source is a database, they are the 3) co__ in a data set.
1) resource, event, or agent 2) interest
3) columns
Define Records
1) R___ in a data set from a database are records, which represent the collection of 2) co____ that hold the 3) descriptions of a 4) s___e oc_____ of the data set’s 5) pu____
1) Rows
2) columns
3) descriptions
4) single occurrence
5) purpose
In addition to understanding the content of data fields in accounting databases, considering the 1) so__ of the data is important because the 2) q____of the data in the 3) f___s impact the 3.1) q___ of the 4) a__s
1) source
2) quality
3) fields
3.1) quality
4) analysis
What are the 2 types of raw data fields?
-____d raw data
-N____-___ed raw data
-Measured raw data
-Non-measured raw data
Define Measured Raw Data
Data 1) cr____ or ca___d by a 2) con____process capturing the 3) v__ of the data.
Examples include 4) p___e, c___t, n___r on hand, w___t, d___, h____s worked, temp___, sensor r___s, h____observation, or ec____c value.
Their format can be 5) di___ or co___ data.
1) created or captured
2) controlled
3) value
4) price, cost, number on hand, weight, date, hours worked, temperature, sensor readings, human observation, or economic value
5) discrete or continuous
Measured raw numeric data
can be used in ma___l cal__
mathematical calculations
Define Non-measured raw data
Data often created 1) au___ by the 2) co___ or com___y policy for 3) c___l.
Examples include 4) ide____ codes, chart of ac___t nu___, standard d____s, product cat___ c__s, or loc___ codes such as city.
These fields are typically formatted as 5) di___ data
1) automatically
2) computer or company
3) control
4) identification codes, chart of account numbers, standard descriptions, product category codes, or location codes such as city
5) discrete
Non-measured raw data
Non-measured raw data that have been formatted as 1) nu___ and di___e, meaning that they are 2) n__-con____ and will 3) n__r have 4) pa___ u___t v__.
These fields 5) ca___ be used in analysis for 6) mat____ cal___s because they refer to 7) u___e ide___ers for 8) in____ items and their 9) pro____ category group.
1) numeric and discrete
2) non-continuous
3) never
4) partial unit values
5) cannot
6) mathematical calculations 7) unique identifiers
8) inventory items
9) product category
Define Calculated Data
Data created when 1) o__ or m___ fields in a particular 2) r____ (r___) have any number of 3) ma___l operators (such as +, −, ×, %) applied, and often are 4) d___d from using the data in another 5) fi___ or fi__ within the 6) sa__ record (7) n___ across 8) di____ rows).
These fields can be formatted as 9) d___e or co___ data.
1) one or more
2) record (row)
3) mathematical operators
4) derived
5) field or fields
6) same
7) never a
8) different
9) discrete or continuous
Calculated data that is also 1) nu__ and likely to be used in a variety of 2) dif___ mat___ ca___s in accounting.
1) numeric
2) different mathematical calculations
Define measurement scale
refers to the type of 1) in___ pr___ by the data.
Data measurement scales should be considered when 2) de___ the data strategy, as they impact which analyses can be 3) pe___ on the data.
1) information provided
2) designing
3) performed
What are the 4 types of data measurement scales?
-ca____
-o___l
-i____l
-r___
-categorical
-ordinal
-interval
-ratio
Define Categorical (nominal) Data
1) La____ or n____d data that can be sorted into 2) gr___s according to 3) sp___ ch___s. The data 4) d__ n___ have a 5) qu___ value
1) Labeled or named
2) groups
3) specific characteristics
4) do not
5) quantitative
Define Ordinal Data
1) Or___or r__d ca___al data. 2) Di___ between the categories 3) d___ n__ need to be 4) k___ or e___
1) Ordered or ranked categorical
2) Distance
3) does not
4) known or equal
Define Interval Data
1) O___ data that have 2) e__l and con___ differences between 3) o___ns and an ar___ z___ po___t
1) Ordinal
2) equal and constant
3) observations and an arbitrary zero point
Define Ratio Data
1) Int__ data with a 2) n___l z__ p__t. A natural zero point means that it’s 3) n__ ar___
1) Interval
2) natural zero point
3) not arbitrary.
Consider Data Strategy Risks and Implement Controls
Another benefit of documenting data strategy in a project plan is that examining each choice can help identify three common data 1) ri___ that might affect an analysis’s 2) v___, a____, and r___ty
1) risks
2) validity, accuracy, and reliability
What are the 3 common data risks? (summarized)
- N___-re____e sa___ sel____
- O___ data points
- D__Data
- Non-representative sample selections
- Outlier data points
- Dirty Data
What are the recommended controls for the 3 data risks?
- Non-representative sample selections
  - 1) v___ rep____s of sample
- Outlier data points
  - perform a 2) h___m or quartile analysis to identify 3) ou___, then 4) ex__ or justify the 5) r___ used for outlier 6) ad___t or re__l
- Dirty Data
  - 7) v___ in___y of data set and cl___ u__ dirty data is___
1) verify representativeness of sample
2) histogram
3) outliers
4) explain or justify
5) rule
6) adjustment or removal
7) verify integrity of data set and clean up dirty data issues
3 common data risks (more detailed)
The first risk is that a sample 1) ex____ from a 2) la___r po___ of data is a 3) po___ representation of the 4) un____ population.
That is why performing tests on the sample’s 5) repr____ can either 6) ve___ its 7) val___ or d___t its significant 8) we___
1) extracted
2) larger population
3) poor
4) underlying
5) representativeness
6) verify
7) validity or detect
8) weakness
3 common data risks (more detailed)
The second risk is potentially including unusual data points in an analysis.
Outlier data points are 1) un__ data points compared to the rest of the data in either an activity point (x-axis or the independent variable), such as number of units produced, or an unusual level of the economic value (y-axis or the dependent variable), such as total costs or total sales.
The best control for identifying outliers is to 2) vi___ the data in a graph to check if the measure is very different than the rest of the data.
Identified outliers can either be 3) re___ (if possible) or 4) re___ with a logical and documented rule.
1) unusual
2) visualized
3) remeasured
4) removed
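The quartile-based outlier check described above can be sketched in Python with made-up cost figures (the 1.5 × IQR cutoff is one common rule, not the only one, and should itself be documented):

```python
from statistics import quantiles

def iqr_outliers(values):
    # Quartile analysis: flag points beyond 1.5 * IQR from Q1/Q3.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

total_costs = [100, 102, 98, 101, 99, 103, 500]  # 500 looks unusual
print(iqr_outliers(total_costs))
```

Visualizing the same data in a histogram or box plot is the graphical equivalent of this test.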
Define Dirty Data
variety of data ___
provides 1) in___ or inc___e descriptions of the economic 2) ac___ of a business
errors
1) inaccurate or incomplete
2) activities
3 common data risks (more detailed)
Dirty Data
Dirty data include 1) m__g, in___ du__e, and inc___ formatted data.
For example, the purchase orders, invoices, or checks written by the company should be sequentially accounted for, with no missing numbers or two with the same identifying number.
Controls should 2) a__s t__t for dirty data 3) be__ the analysis is performed.
Data can be 4) co___ to its source 5) doc____, or if that is not possible, tested for 6) rea____. Testing for reasonableness can include checking for 7) m___g seq___ num__, 8) du___ numbers, or 9) ve___ the data have the ex___ fo___t or ac___ble characters.
1) missing, invalid, duplicate, and incorrectly
2) always test
3) before
4) compared
5) documents
6) reasonableness
7) missing sequence numbers
8) duplicate numbers
9) verifying the data have the expected format or acceptable characters.
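The reasonableness tests above (duplicate identifying numbers, expected format) can be sketched in Python; the check numbers and the `CHK-` plus four digits format are hypothetical:

```python
import re
from collections import Counter

check_numbers = ["CHK-1001", "CHK-1002", "CHK-1002", "CHK-10x3"]  # hypothetical

# Two documents with the same identifying number indicate a duplicate.
duplicates = [n for n, count in Counter(check_numbers).items() if count > 1]

# Verify the data have the expected format (assumed: CHK- plus four digits).
bad_format = [n for n in check_numbers if not re.fullmatch(r"CHK-\d{4}", n)]

print(duplicates, bad_format)
```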
Once data have been selected, evaluated, and any risks identified and controlled, the next step is to develop an analysis 1) st___ ap___e for the 2) o___ of the project
1) strategy appropriate
2) objective
4.3 What Should We Consider When Selecting an Analysis?
What are the two questions when designing a data analysis strategy?
- Can the ch___ a___is strategy an__r the sp__c ob___ qu___s?
- Is the me____ sc___ of the data ap____ for the selected analysis strategy?
- Can the chosen analysis strategy answer the specific objective questions?
- Is the measurement scale of the data appropriate for the selected analysis strategy?
The specific type of descriptive and diagnostic analysis used depends on the 1) me____ s__ of the data
1) measurement scale
1) Des____ analyses are the most common analysis strategies using accounting data.
They are useful for 2) ev____g str____ pe____e because they provide more 3) me_____ and b____ss in____
1) Descriptive
2) evaluating strategy performance
3) meaning and business intelligence
Diagnostic analysis strategies can be compared to detective work. Accountants can use these analysis strategies to identify and discover the most likely 1) ca__s of accounting phenomena
Diagnostic strategies always require applying 2) cr___l thi___ skills.
1) causes
2) critical thinking
Define Variable
a data 1) it__ that will be 2) us___ in the analysis
1) item
2) used
Descriptive and diagnostic analyses help to better understand 1) w___ happened and 2) w__ it happened.
What if the objective is to predict a 3) fu___ outcome or determine what strategies 4) mi___t achieve a specific outcome? In those cases, we need to use 5) pre____ or pr____e analyses
1) what
2) why
3) future
4) might
5) predictive or prescriptive
Predictive analysis strategies use 1) hi___l data to create models that estimate a 2) fu___ va___e or o___e.
Prescriptive analysis strategies, on the other hand, help determine which 3) op___ is most likely to produce the 4) b___ ou___e given the objective.
Predictive and prescriptive analysis strategies can determine the best use of 5) res___, im___e pe___ce, anticipate market re__s and mo___s, and av___ process breakdowns:
1) historical
2) future value or outcome
3) option is most
4) best outcome
5) resources, improve performance, anticipate market reactions and movements, and avoid process breakdowns
The last two steps of a data analysis plan are to consider 1) ri___ in__t to the data and analysis 2) str___ and to put in place 3) co___s to 4) re___ those risks
1) risks inherent
2) strategies
3) controls
4) reduce
It is critical to consider these risks during analysis strategy planning and to add controls to help 1) m___ them.
Without those controls we risk 2) pre___, inter___g, and rep____ inc___t analysis res___, possibly causing ourselves or our stakeholders to make 3) har__ dec___s
1) mitigate
2) preparing, interpreting, and reporting incorrect analysis results
3) harmful decisions
4.4 How Do Data and Analysis Strategies Differ Across Practice Areas?
As the last step of the data analysis planning stage, accounting professionals across practice areas 1) d___ data and analysis 2) str___ based on their project 3) ob___s
1) design
2) strategies
3) objectives
Accounting Information Systems
Because a company’s accounting information system is involved in 1) pla___, ex___, con___, and rep___ the business’ operations, data analysis projects in this area often involve 2) in____ary data and 3) kn___
1) planning, executing, controlling, and reporting
2) interdisciplinary
3) knowledge
Accounting Information Systems
These project objectives can range in scope from impacting just one or two employees to involving most of the organization.
All four types of analytics objectives (1) de___e, diag___, pr___e, and pre____) are commonly used.
1) descriptive, diagnostic, predictive, and prescriptive
Accounting Information Systems
The data analyzed in these projects can include a variety of IT operational data with different levels of data 1) inte___ and co__l doc___n.
More traditional accounting data are often analyzed with 2) qua___ cat____l and or____ data:
-Counts of h___ tickets and inc____
-Er___ counts and d___y times
-L___n issues
-System s____n and _ service satisfaction
1) integrity and control documentation
2) qualitative categorical and ordinal
-Counts of help tickets and incidents
-Error counts and delay times
-Login issues
-System satisfaction and IT service satisfaction
Accounting Information Systems
AIS data analyses projects may also choose data strategies which use 1) qua____ int___l and ra___ data:
Costs such as equipment and software installations and updates, and IT staff labor.
Budget variances for any of these costs
1) quantitative interval and ratio
Accounting Information Systems
It is important to select data that captures the AIS system’s 1) per__, vuln___, and er___
1) performance, vulnerabilities, and errors
Accounting Information Systems
Accountants designing analysis strategies for AIS projects are focused on increasing 1) co___ve adv____ and improving the 2) or___’s op__s.
They often use 3) st___s to understand which 4) ele__ of the accounting system they should 5) f___ on. The mathematical formality of statistical tools can help to 6) pe___e management to make the necessary investments.
1) competitive advantage
2) organization’s operations
3) statistics
4) element
5) focus
6) persuade
Accounting Information Systems
Risks
Auditing
Auditors use statistics to determine 1) r__s to account balances and 2) un__ transactions, especially in 3) jo___l en__s that involve estimates and assumptions.
They commonly work with different kinds of 4) d___ so___, from process documentation, journal entries, general and subsidiary ledgers, trial balances, to financial ratios
1) risks
2) unusual
3) journal entries
4) data sources
Auditing
Auditors use analysis strategies for:
-Continuous auditing modules to test large data populations rather than testing samples with inference risks.
-Automated identification of dirty data, unusual transactions, and pattern anomalies to better reduce audit risk and fraud risks.
-Testing entire transaction cycles’ internal controls with process mining and tracing purchasing to payment flows, P-cards (purchasing credit cards) usage and payroll documentation. Testing the revenue cycle from order to collection
Auditing
Auditors use analysis strategies for:
-Using robotic process automation (RPA) to remove the human (and often inconsistent) element of repetitive audit tasks, freeing auditors to focus on areas that require thoughtful critical thinking and judgments. (RPA applications and benefits to accountants are explained in the chapter that covers data and analysis developments in accounting.)
-Testing hypotheses about inventory and fixed assets with sampling, statistical tests, and inferences for the population of these assets.
Auditing
Risks
Financial Accounting
Financial accountants are primarily responsible for 1) cap____, re__g, pro___g, sto___, and re__g accounting information. 2) Ac__y, thoroughness, and documentation are essential.
The accounting data they use for analyses are both guided and restrained by accounting 3) re___ns and 4) go___ agency compliance
1) capturing, recording, processing, storing, and reporting
2) Accuracy
3) regulations
4) government
Financial Accounting
Due to the measurement scales of the variables they need for analysis, financial accountants may have to 1) tran___ data depending on their 2) ob___ q___ns.
These data most often include 3) cat___al, in__al, and ra___ data
1) transform
2) objective questions
3) categorical, interval, and ratio
Financial Accounting
The purpose of a data analysis strategy might be 1) des___ and di____ng based on corresponding financial accounting rules and regulations:
-2) Na___, ti___g, and aut____ of transactions (and diagnose 3) is___) charged to each account.
-4) In___l control effectiveness and weaknesses.
-The 5) com___ and rea___ of adjusting entries at the end of the period.
1) describing and diagnosing
2) Nature, timing, and authorization
3) issues
4) Internal
5) completeness and reasonableness
Financial Accounting
Financial accountants also use analysis strategies designed for their 1) pre___ and pr____e objectives regarding financial outcomes for managers and board members. One example is 2) es___g additional information to be presented along with their financial statements, such as:
-Ranking alternative capital sources and costs of capital in terms of favorability.
-Future net income and cash flows from operations, investing, and financing activities.
-Impacts of new strategies on the financial statement.
-Expected future costs associated with contingent liabilities, pension costs, new business units or the discontinuation of business units
1) predictive and prescriptive
2) estimating
Financial Accounting
Financial accountants use statistics to identify 1) opp___s and pr___s with 2) pr___, li___y, and business va___
1) opportunities and problems
2) profitability, liquidity, and business valuation
Financial Accounting
Risks
Managerial Accounting
The purpose of most managerial accounting data analyses is to improve 1) pla___, op___al control, and de___-m___g to support an organization’s mission and strategies.
This type of data analysis is valuable for improving the 2) se____, ex___n, and ev____n of organizational strategy.
These kinds of analyses can also lead to decisions that give employees more access to information, which improves performance and, eventually, organizational culture and operations
1) planning, operational control, and decision-making
2) selection, execution, and evaluation
Managerial Accounting
Managerial accountants prepare analysis strategies, both for routine and ad hoc (one-time) objectives for each functional area of their organization. These strategies use data across measurement scales to describe, diagnose, predict, and prescribe strategy impacts:
-Identifying areas where innovation in business organization, policy and process will increase efficiencies, for example, to reduce non-value-added steps and delays.
-Identifying areas where innovation in business partnerships, operations, and internal controls will increase effectiveness.
-Increasing competitive advantage through many new business intelligence opportunities.
-Improving compliance with all legal and regulatory requirements
Managerial Accounting
Identifying risks to the data and analyses used in managerial accounting is important to 1) e__e analysis results are 2) ac___ and re__e
1) ensure
2) accurate and reliable
Managerial Accounting
Risks
Tax Accounting
Tax accountants have more information for their tax 1) pla___ and co___ce services, which improves their 2) jud____ and how they document and defend their positions to both clients and regulators.
1) planning and compliance
2) judgments
Tax Accounting
Perhaps the biggest effect of data analysis strategies on tax practice is the movement away from a dependency on 1) h__c data and toward a forward-looking, 2) v___-ad___ se__ perspective by using 3) pr___ and pre___g analytics strategies.
Examples of these predictive and prescriptive analytics include using complex data modeling to evaluate decision and position alternatives, which improves the value and accuracy of advice
1) historic
2) value-adding service
3) predicting and prescribing
Tax Accounting
Risks
5.1 What is Data Profiling?
Dirty Data
-can result in issues from 1) in__ pricing to an inability to detect 2) f__d to sending wrong bills to customers
-data preparation helps 3) a___d these 4) is__ and more
1) inaccurate
2) fraud
3) avoid
4) issues
Define Data Preparation
The process of 1) tran____ data into an analytical database by 2) pr___g, cle___, re__ng and in____g data prior to processing and analysis
It helps ensure 3) hi___-qu___ data and, therefore, improved 4) d___n-m__g.
1) transforming
2) profiling, cleaning, restructuring and integrating
3) high-quality
4) decision-making
Define Data Profiling
the process of 1) in____ data 2) q___y and st____re
1) investigating
2) quality and structure
What are the 3 parts of Data Profiling
- Inv___ the data q___ty
- Inv____ data str___e
- De___ and inf___g
- Investigating the data quality
- Investigating data structure
- Deciding and informing
3 parts of Data Profiling
- Investigating data quality: Search for 1) an___s in the data. That is, are the data 2) di__?
- Investigating data structure: Find the 3) b__t way to 4) org___ the data and 5) im___ an__s.
- Deciding and informing: Make 6) dec___ about whether it is possible to address the identified 7) is___, what the 8) c__t of doing so would be, and consider the possible 9) cons___ if the issues are not 10) ad___.
1) anomalies
2) dirty
3) best
4) organize
5) improve analytics
6) decisions
7) issues
8) cost
9) consequences
10) addressed
Decisions made in the final phase will guide the 1) ex__, tran___, l__ (__) process by determining what must be 2) c___e
1) extract, transform, load (ETL)
2) change
For now, keep in mind that data profiling detects data 1) iss__ and 2) __ corrects them.
1) issues
2) ETL (extract, transform, load)
Investigate Data Quality
When discussing the “quality” of the data we work with, we are referring to the 1) sui__ of using the data for 2) de___n-m___.
Assessing data quality identifies 3) fla___ values in the data set, which reveals if there are data that must be 4) cle___.
There are different methods for doing this, which are called:
-5) ru__-dr__ method
-6) exp___n and in___e
1) suitability
2) decision-making
3) flawed
4) cleaned
5) rule-driven method
6) exploration and inference
Investigate Data Quality
Rule-Driven Method
The rule-driven method is a 1) t__p-d___ approach.
A 2) log___ relationship, or 3) r___, is defined 4) a___g data and 5) te__ to determine if the data 6) con__ to it.
The number of rules that can be specified is nearly 7) unl___
1) top-down
2) logical
3) rule
4) among
5) tested
6) conform
7) unlimited
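A top-down validation rule can be sketched in Python; the rule (invoice total must equal quantity × unit price) and the records below are hypothetical:

```python
# Hypothetical rule: total must equal quantity * unit_price for every record.
invoices = [
    {"id": 1, "quantity": 3, "unit_price": 10.0, "total": 30.0},
    {"id": 2, "quantity": 2, "unit_price": 5.0, "total": 11.0},  # violates rule
]

def rule_violations(records):
    # Return the ids of records that do not conform to the rule.
    return [r["id"] for r in records
            if r["quantity"] * r["unit_price"] != r["total"]]

print(rule_violations(invoices))
```

Any number of such rules can be defined and tested the same way.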
Investigate Data Quality
Exploration and Inference Methods
The exploration and inference methods are 1) bot___-__ approaches. The goal is to find 2) an___s by 3) exa___ the data from many different 4) per____.
Sorting, 5) fre___ distributions, and 6) o___r analysis are examples of powerful techniques for exploration purposes.
The second bottom-up approach, inference, is a method that relies on 7) co___ algorithms to identify 8) an___s
EX. sorting and frequency distributions
1) bottom-up
2) anomalies
3) examining
4) perspectives
5) frequency
6) outlier
7) computer
8) anomalies
Rule-driven, exploration, and inference methods identify data 1) ano__, which occur when the data 2) d_ n_t meet expectations of 3) cor___, val__, co___y, and com___
1) anomalies
2) do not
3) correctness, validity, consistency, and completeness
Define Validation Rules
An 1) in___l part of data profiling, these rules define what 2) va__ are and are 3) n__ acceptable for analysis when investigating data 4) q___
1) integral
2) values
3) not
4) quality
What are Data quality characteristics?
-cor__
-va____
-inc___t value, inv___value
-correctness
-validity
-incorrect value, invalid value
Correctness
Data describe facts about entities, such as the name of a customer, the price of a product, or the date of a transaction.
Data are incorrect when the value assigned to an entity’s characteristic is wrong
Validity
An incorrect value means the 1) wr__ value is assigned, and an invalid value means an 2) un____ value is assigned
1) wrong
2) unacceptable
Consistency
In addition to being correct and valid, data should be 1) con___
Data inconsistency occurs when the 2) s___e cha__c is represented in 3) m___ ways
1) consistent
2) same characteristic
3) multiple
Using mgr, mngr, and manager to describe the same job position would create data in___
inconsistency
Consistency
How can we identify inconsistencies like these?
Here are two profiling techniques:
-Create a 1) li__ with all distinct 2) val___, then 3) s___ and review. The inconsistent values of mgr, mngr, and manager will likely be noticeable immediately.
-Build a 4) fre__ table, or a table that 5) co___ how many times a 6) v__e occurs. Values with a 7) l___ frequency might indicate 8) inc___ data.
Which is the best option?
1) list
2) values
3) sort
4) frequency
5) counts
6) value
7) low
8) inconsistent
The second one is the best option
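The frequency-table technique can be sketched in Python using the mgr/mngr/manager example (the title list is made up):

```python
from collections import Counter

job_titles = ["manager", "manager", "mgr", "manager", "mngr", "clerk", "clerk"]

# Build a frequency table: values with a low count may be inconsistent data.
freq = Counter(job_titles)
suspects = [value for value, count in freq.items() if count == 1]
print(suspects)
```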
Completeness
Data that are correct, valid, and consistent are only 1) ac___ if they are also 2) co___e, as incomplete data might result in 3) dis___ insights
1) accurate
2) complete
3) distorted
Completeness
Data can be incomplete in 2 ways.
- A missing instance is when a 1) co__t occurred but is not 2) rec___, (nothing there) such as when goods were sold, but the sales transaction was not recorded. A missing instance such as a missing sales transaction can be identified with gap analysis. If there is a sequential number for each sales invoice, then a missing number might indicate a sale that occurred but was not recorded.
- A missing value occurs when the 3) tra___ is recorded, but we 4) __ n___ have information for all 5) cha__. (we have some info, but not complete) The result is empty cells. This might happen if a customer is recorded, but the record does not include the customer’s email address. The term null typically indicates a missing or unknown value. It is important to assess how the missing information affects decision-making
1) concept
2) recorded
3) transaction
4) do not
5) characteristics
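Gap analysis for missing instances can be sketched in Python; the invoice numbers are hypothetical:

```python
recorded_invoices = [1001, 1002, 1004, 1005, 1007]

# A gap in a sequential numbering scheme may indicate a sale that
# occurred but was not recorded.
expected = set(range(min(recorded_invoices), max(recorded_invoices) + 1))
missing = sorted(expected - set(recorded_invoices))
print(missing)
```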
Investigate Data Structure
Along with their quality, data are investigated to assess whether they are 1) str___ in a way that makes data analytics 2) e__ and ef___t
1) structured
2) easy and efficient
Investigate Data structure
Unambiguous descriptions
-poorly named columns make it complicated to develop information models and to conduct analysis
-column names should be correct, intuitive, and clear because they’re part of the analytical database, which might have many different users
-unambiguous descriptions should be a priority when developing an analytical database
-analytical database: integrated data set used for analysis purposes
Define Analytical Database
A 1) c__, w___l-st___d, in___ data set that can be used for analysis
1) clean, well-structured, integrated
Define Slicing
Analysis includes 1) ag___, or gr__, data and then examining them from different 2) vie___s in many ways
1) aggregating, or grouping
2) viewpoints
What 2 table structures are common best practices?
Si__-Va__d Co__s
F___ Tables
Single-Valued Columns
Flat Tables
Define Single-Valued Columns
A 1) co___ in which each cell contains 2) o__ value describing a 3) si__e ch__
4) T__ or m___e values in the same cell make analysis more 5) ch___g
1) column
2) one value
3) single characteristic
4) Two or more
5) challenging
Define Composite Column
A column with cells that 1) co___ values for 2) t__o or more cha__
1) combines
2) two or more characteristics
Define Multi-Valued Column
A column with cells that contain 1) mu___ values of the 2) s__ ch___c
1) multiple
2) same characteristic
Define Flat Tables
Column 0) h___s 1) d__ n__t contain data values 2) u__l for analysis purposes.
For data analysis, flat table structures are 3) pre___ over 4) cros___n tables
it’s the 5) b__t practice for structuring data
0) headers
1) do not
2) useful
3) preferred
4) crosstabulation
5) best
Crosstabulation tables have li___ for analysis
limitations
Define Data Model
defines how different 1) ta___s rel___ to each other
1) tables relate
Define Star Schema
The 1) reco___ data structure for analytical databases. They consist of 2) f___ t___s and di___n tables
1) recommended
2) fact tables and dimension
Define Facts
In an accounting context, the 1) st__ data that correspond to business 2) tran___ such as 3) or__, sa___, pu__s, and pa___s
1) stored
2) transactions
3) orders, sales, purchases, and payments
Fact Tables
The columns in fact tables are also mostly 1) qua___, and more specifically, 2) ad___
The values in these columns can be easily 3) ag__d, or gro___ together, as me__s.
These measures can then be 4) sl___d using the columns in the next type of tables
1) quantitative,
2) additive
3) aggregated, or grouped together, as measures
4) sliced
Define Dimensions
In an accounting context, these are the data that 1) pro___ co__t to analysis and give 2) me___ to f__s.
They are the 3) va___es or fi___s that can be used to 4) dr___ do___ or dis___te (sli___) analysis measures
1) provide context
2) meaning to facts
3) variables or fields
4) drill down or disaggregate (slice)
Relationships
Relationships, the final element of data models, 1) l___k tables and are represented by 2) l___s in 3) s___r sc___a model.
All relationships in a star schema have a 4) o___to-m___y (___-___) 5) car____ pattern
1) link
2) lines
3) star schema
4) one-to-many (1-N)
5) cardinality
Define cardinality
A constraint that defines how many times an instance of an entity may participate in a relationship
Cardinality
It can take two possible values, N or 1:
N: An instance can 1) pa___e 2) m___ times in the 3) re____. There is 4) ___ restriction.
1: An instance can participate only 5) o___ time in the relationship.
1) participate
2) many
3) relationship
4) no
5) one
A 1-N cardinality pattern for a relationship between a dimension and a fact table can be interpreted as follows:
1: For every fact, there is just 1) o__e co____ng va__ in each dimension table. For each instance of the Sale table (an invoice line), there is one date, product, customer, and salesperson.
N: There can be 2) m__y fa__ for each dimension. There can be many sales on the same day, and the same product type can be sold many times. Both customers and salespeople can be involved in many sales transactions.
1) one corresponding value
2) many facts
Decide and Inform
After investigating the quality and structure of the data, it is time to take the next step:
-D__ n__ m__e fo___d
-Re___ the s____ce system
-Co___m the data’s fit___ for the analysis and move f__
If the decision is made to move forward, the next step in the process would be informed by any 1) is___s that must be 2) ad___.
-Do not move forward
-Redesign the source system
-Confirm the data’s fitness for the analysis and move forward
1) issues
2) addressed
Do not move forward
If the data are 1) u__t, such as the data having 2) p___r qu___, then using the data for 3) d__n-ma___ would be too 4) ris___.
Similarly, if the data structure is poor, then 5) res___g and pr___g the data might be too 6) com___ and eco___y 7) unf___
1) unfit
2) poor quality
3) decision-making
4) risky
5) restructuring and processing
6) complicated and economically
7) unfeasible
Redesign the source system
This decision is usually made because 1) er__s generated by the source system need to be 2) cor___. Once 3) f__d, the data source should be 4) pr___ again
1) errors
2) corrected
3) fixed
4) profiled
Confirm the data’s fitness for the analysis and move forward
Moving forward with the data requires being 1) ri___-aw___
That is, we can move forward with the data with an awareness of potential 2) rel___y iss__ and their 3) im___ on 4) dec___
1) risk-aware
2) reliability issues
3) impact
4) decisions
5.2 What Does it Mean to Extract-Transform-Load (ETL) Data?
Define Extract-Transform-Load (ETL)
co__s is___s
corrects issues
What’s the Extract-Transform-Load (ETL)
- R__ Data
- Ex___
- Tr___m
-cle___
-re____g
-in___g
- Lo___
- An___l Da___se
- Raw Data
- Extract
- Transform
-cleaning
-restructuring
-integrating
- Load
- Analytical Database
Define Data Extraction
The 1) f___t step in the ETL process, which involves 2) tra___ data from where they are 3) st__d to the platform where they will be 4) tra___d
1) first
2) transferring
3) stored
4) transformed
Extract Data
-Source data are moved to the platform where they will be transformed. This platform is normally a data warehouse, which is software that 1) st___ and an___s large data sets. Power BI and Tableau are examples of data warehouses.
-The process also includes data 2) va___, or confirming that the data were transferred 3) com__ and cor___.
1) stores and analyzes
2) validation
3) completely and correctly
Transform Data
1) R__ data are 2) ra___ ready for analysis after they are 3) e___d
Data transformation 4) im___ the raw data for analysis through 5) cl___, res___g, and int___n
1) Raw
2) rarely
3) extracted
4) improves
5) cleaning, restructuring, and integration
Cleaning Data
Data can be 1) inc____, inv___, inco____ or inco___.
Cleaning data, one of the most 2) im___ and t___e-con___g aspects of data transformation, is also known as data 3) cl___g or data 4) sc___ng. It involves 5) ad___g, mo___g, and de___g data
1) incorrect, invalid, inconsistent or incomplete
2) important and time-consuming
3) cleansing
4) scrubbing
5) adding, modifying, and deleting data
Cleaning Data
-In the case of incomplete data, such as a missing sales transaction, data might need to be 1) ad__. A 1.1) sp__c st___y for dealing with incomplete data is 2) imp__, which is when estimated values are 3) sub___ for m__g data.
-Modifying data is necessary when a 4) cu__ value must be 5) re__ with a new 6) v__ if data are 7) inc___ in__, in__, or inc___te. Replacing the value NY with NJ in a column that records a customer’s state is an example of this.
-8) De__e data that are not 9) re___ for analysis. 10) Re___t data, such as a duplicate sales transaction, should be removed.
1) added
1.1) specific strategy
2) imputation
3) substituted for missing
4) current
5) replaced
6) value
7) incorrect, invalid, inconsistent, or incomplete
8) Delete
9) relevant
10) Redundant
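The three cleaning actions (adding via imputation, modifying, and deleting) can be sketched in Python with made-up values:

```python
# Hypothetical sales-amount column with missing values
amounts = [120.0, None, 95.0, 95.0, None, 130.0]

# Add: impute missing values by substituting the mean of known values
known = [a for a in amounts if a is not None]
mean = sum(known) / len(known)
imputed = [a if a is not None else mean for a in amounts]

# Modify: replace an incorrect current value with a new value
state = "NY"
state = "NJ" if state == "NY" else state

# Delete: remove a redundant (duplicate) transaction
rows = [(1, 120.0), (2, 95.0), (2, 95.0)]
deduped = list(dict.fromkeys(rows))  # keeps first occurrence of each row
```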
Define Data Restructuring
Also known as data 1) wra___ or data 2) mu___g, this process 3) ch__s how the data in a data set are 4) or__d, such as 5) a___g, d___g, and ren___g columns or splitting and combining tables
does 6) n__ ch__ data values, but it does change how the data are 7) org__
1) wrangling
2) munging
3) changes
4) organized
5) adding, deleting, and renaming
6) not
7) organized
Define Data Integration
The process of 1) co___ related data by 2) d___g rela__s with 3) p__y and f__n keys or combining 4) t___ or m___ tables with information from the 5) s___ entity.
1) connecting
2) defining
3) primary and foreign
4) two or more
5) same
Integrating Data
There are two distinctive forms of integration:
-1) Li__g two tables by defining a 2) re___p between them. Relationships are created using 3) p___ry and for___ keys. Other aspects of relationships that must be specified are 4) ca___s.
-Combining two or more tables unites 5) inf___about the 6) s___ entity. Tables can be combined two ways.
-A union 7) co___es dif___ tables with the same data 8) str___. The result is a table with 9) m__ r___s.
-Recall from the foundational data analysis chapter that a join, or merge, combines data 10) el___ or co__s from 11) dif___ t__s. The result is a table with more 12) co__s.
Integrating Data
There are two distinctive forms of integration:
1) Linking
2) relationship
3) primary and foreign
4) cardinalities.
5) information
6) same entity
7) combines
8) structure
9) more rows.
10) elements or columns
11) different tables
12) columns.
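A minimal Python sketch of the two ways to combine tables, using hypothetical rows: a union adds rows, a join adds columns:

```python
# Hypothetical tables as lists of dicts
sales_q1 = [{"invoice": 1, "amount": 100}]
sales_q2 = [{"invoice": 2, "amount": 250}]

# Union: combines different tables with the same structure;
# the result has more rows.
sales = sales_q1 + sales_q2

# Join (merge): combines columns from different tables on a shared
# key; the result has more columns.
customers = {1: "Ada", 2: "Bo"}  # invoice -> customer name
joined = [dict(row, customer=customers[row["invoice"]]) for row in sales]
```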
Define Data Matching
A process that 1) co____s data and determines whether they 2) d___e the 3) s___ entity
Data Matching is a challenge specific to 4) in__
1) compares
2) describe
3) same
4) integration
Define Data Loading
The process of making the 1) ana___ d__e available for 2) u__e in software that 3) p__s analysis
Once the data are 4) cle___ and tra___, they are 5) lo__ into the 6) sof__ for analysis
1) analytical database available
2) use
3) performs
4) cleaned and transformed
5) loaded
6) software
5.3 Which Patterns Extract Data?
Each data analytics project has unique data preparation challenges.
While there is not a single, common approach for all projects, a structured set of data preparation patterns can address most challenges. These patterns signal potential problems and provide guidance for finding them within the data set and correcting them.
Each pattern identifies a data issue, discusses how to detect it with a profiling method, and explains one or more ETL methods to correct it. Think of the patterns as a menu; you can select those which are most appropriate for your needs.
The first step is identifying the 1) a__ data and 2) c___ng a data 3) dic___
1) available
2) creating
3) dictionary
Define Data Dictionary
A 1) ch___ of the data in a data set that 2) in___ what data are 3) a___ and where they can be 4) f__.
1) chart
2) indicates
3) available
4) found
A data dictionary 1) re___ds different pieces of 2) inf___ for each field, including a 3) na__, a brief 4) des___ of content, the data 5) t__e, whether the field is a 6) pr__y or for___n key, and whether the field is 7) man___ or op___al
1) records
2) information
3) name
4) description
5) type,
6) primary or foreign key
7) mandatory or optional
A data dictionary is often referred to as 1) m___d__a–it is data about 2) d__a.
It is built gradually throughout the data 3) pre___ process, but creating 4) col__ names and descriptions is a good starting point
1) metadata
2) data
3) preparation
4) column
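A data dictionary can be sketched as a simple mapping with one entry per field; the field names and metadata below are hypothetical:

```python
# Hypothetical data dictionary: for each field, record a name,
# description, data type, key role, and whether it is mandatory.
data_dictionary = {
    "CustomerID": {
        "description": "Unique identifier for a customer",
        "data_type": "integer",
        "key": "primary",
        "mandatory": True,
    },
    "CustomerEmail": {
        "description": "Customer's email address",
        "data_type": "text",
        "key": None,
        "mandatory": False,
    },
}

# Metadata like this can be queried, e.g., to list primary keys
primary_keys = [f for f, meta in data_dictionary.items()
                if meta["key"] == "primary"]
```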
What are the 2 extraction patterns ?
-I____e Data Transfer
-In___ Data Transfer
these determine if all data have been e__d and if they were tra___ cor___
-Incomplete Data Transfer
-Incorrect Data Transfer
these determine if all data have been extracted and if they were transferred correctly
Data Preparation Pattern 1: Incomplete Data Transfer
Extraction transfers data from the source files to the 1) E__ tool for further processing. An 2) in___e transfer of data and 3) mi___g data leads to 4) unr___e results
1) ETL
2) incomplete
3) missing
4) unreliable
Data Preparation Pattern 1: Incomplete Data Transfer
Compare Row Counts
Comparing row counts is one way to check com___ss
completeness
Data Preparation Pattern 1: Incomplete Data Transfer
Add Missing rows
If the row counts 1) d__ ma__, add the 2) m__g r__s to the source data
1) don’t match
2) missing rows
Data Preparation Pattern 1: Incomplete Data Transfer
Summary
Problem: All the data d__ n__t tra___r.
Detect (Data Profiling): Com__ r__ counts.
Correct (ETL): a__ the m__g rows.
Problem: All the data did not transfer.
Detect (Data Profiling): Compare row counts.
Correct (ETL): Add the missing rows.
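The row-count check is simple to sketch; the counts below are made up:

```python
# Hypothetical extraction check: compare source and target row counts
source_rows = 1_000
target_rows = 998

transfer_complete = (source_rows == target_rows)
missing = source_rows - target_rows  # rows to locate and add if nonzero
```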
Data Preparation Pattern 2: Incorrect Data Transfer
Even when all the rows have been transferred, the data might 1) n___ have been transferred 2) cor___. This is often caused by 3) di____ in data types.
1) not
2) correctly
3) differences
Data Preparation Pattern 2: Incorrect Data Transfer
Problem: Data was 1) n__ cor___ transferred.
Detect (Data Profiling): 2) Com___ co___l amounts.
Correct (ETL): 3) mo___ i__ct values.
1) not correctly
2) Compare control
3) Modify incorrect
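A control-amount check can be sketched the same way; the amounts below are hypothetical:

```python
# Even with matching row counts, compare control amounts (e.g., total
# sales) to detect values corrupted during transfer.
source_amounts = [100.00, 250.50, 75.25]
target_amounts = [100.00, 250.50, 75.00]  # last value transferred wrong

source_total = round(sum(source_amounts), 2)
target_total = round(sum(target_amounts), 2)
transfer_correct = (source_total == target_total)
```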
5.4 Which Patterns Transform Columns?
Once all data are transferred to the ETL tool, it is time to transform them.
Transformation has two purposes–1) cle___g the data by 2) cor___g values and 3) re___g and int___g the data for analytics
1) cleaning
2) correcting
3) restructuring and integrating
Data Preparation Pattern 3: Irrelevant and Unreliable Data
Data that are irrelevant for decisions 1) bl___ the data model. It is also important to avoid integrating 2) unr___ data into the data model (recall the old adage, “garbage in, garbage out”).
Keep in mind that 3) ex___g data from an analytical database is 4) n__ the same as 5) del___g the data. The raw data still 6) e___t and can be 7) int___d if necessary.
1) bloat
2) unreliable
3) excluding
4) not
5) deleting
6) exist
7) integrated
Data Preparation Pattern 3: Irrelevant and Unreliable Data
-Scan Columns for Irrelevant and Unreliable Data
Irrelevant columns can be identified primarily by 1) sc__ the data 2) vi__y. The data 3) dic__ can also be a helpful tool
Scanning data can also determine whether a 4) co___ contains 5) u___e data.
Most ETL tools provide statistics about 6) er___, null values, and more, that can help determine a column’s 7) rel__
1) scanning
2) visually
3) dictionary
4) column
5) unreliable
6) errors
7) reliability
Data Preparation Pattern 3: Irrelevant and Unreliable Data
-Remove Columns With Irrelevant or Unreliable Data
To correct the issue, 1) d___e the columns with 2) ir___nt and 3) unr___ data from the analytical database
1) delete
2) irrelevant
3) unreliable
Data Preparation Pattern 3: Irrelevant and Unreliable Data
Summary
Problem:
1) Irr__ data 2) b__t a data set. 3) Un__ data 4) incr__ the 5) r___ of making 6) b___ decisions.
Detect (Data Profiling):
7) Vi__ s__n columns for 8) irrelevant and unreliable data.
Correct (ETL):
9) Re__ columns with irrelevant or unreliable data.
1) Irrelevant
2) bloat
3) Unreliable
4) increases
5) risk
6) bad
7) Visually scan
8) irrelevant and unreliable
9) Remove
Data Preparation Pattern 4: Incorrect and Ambiguous Column Names
Column names become variables during data 1) exp__ and int___on. Their names are important because other people 2) m___ use the analytical database. Essentially, column names become part of the database’s 3) voc__.
Column names that are 4) co__, int___, and una__us make it 5) e___ for other people to 6) u___ a database and 7) ex___ data.
1) exploration and interpretation
2) might
3) vocabulary
4) correct, intuitive, and unambiguous
5) easier
6) use
7) explore
Data Preparation Pattern 4: Incorrect and Ambiguous Column Names
-Scan Columns for Incorrect or Ambiguous Names
Visually scanning a column’s content and its data dictionary definition can reveal whether the column name accurately reflects its content. Here are four rules for naming columns:
- 1) Na__ should 2) ac___y describe the column’s 3) con___
- They should be 4) int__e to 5) bu__ people.
- Use only common 6) ab____ that are 7) un____ by everyone, such as YTD (year to date)
- Eliminate 8) s___s, un___, or other symbols. For example, use CustomerName instead of Customer_Name.
1) Names
2) accurately
3) content
4) intuitive
5) business
6) abbreviations
7) understood
8) spaces, underscores
Data Preparation Pattern 4: Incorrect and Ambiguous Summary
Problem:
1) In___t or am___s column 2) na___s make it 3) ha___r to 4) u___d and work with a data set.
Detect (Data Profiling):
5) Sc__ columns for 6) inc_ or amb__ names.
Correct (ETL):
7) Re__ columns.
1) Incorrect or ambiguous
2) names
3) harder
4) understand and work
5) Scan
6) incorrect or ambiguous
7) Rename
Data Preparation Pattern 5: Incorrect Data Types
1) Ver__ data 2) t__s is an essential part of data 3) pre___
Data types are an integral part of column definitions because they 4) det__ what we 5) c___ and ca__ do with the data in a column.
1) Verifying
2) types
3) preparation
4) determine
5) can and cannot
Data Preparation Pattern 5: Incorrect Data Types
Summary
Problem:
An 1) in__ data type 2) li__s what 3) c__ be done with the data in a column.
Detect (Data Profiling):
4) I___t the data type.
Correct (ETL):
5) C___e the data type.
1) incorrect
2) limits
3) can
4) inspect
5) change
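A minimal sketch of inspecting and then changing a data type, with made-up values that were extracted as text:

```python
# Hypothetical amounts column extracted as text; as strings, the
# values cannot be summed for analysis.
amounts_as_text = ["100.50", "250.00", "75.25"]

# Inspect the data type
assert all(isinstance(a, str) for a in amounts_as_text)

# Change the data type, which enables numeric operations
amounts = [float(a) for a in amounts_as_text]
total = round(sum(amounts), 2)
```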
Data Preparation Pattern 6: Composite and Multi-Valued Columns
Each cell should contain 1) o__ va__ describing one 2) cha___c because 3) t__ or m__ values in the 4) sa___ cell make analysis more 5) cha__
1) one value
2) characteristic
3) two or more
4) same
5) challenging
Data Preparation Pattern 6: Composite and Multi-Valued Columns
What are two specific scenarios that violate the single-valued rule and make analysis more complex?
composite columns and multi-valued columns.
Data Preparation Pattern 6: Composite and Multi-Valued Columns
-Scan for Composite and Multi-Valued Columns
The best method to detect composite or multi-valued columns is 1) vi__ sc___g
1) visual scanning
Data Preparation Pattern 6: Composite and Multi-Valued Columns
Summary
Problem:
Columns that are 1) n___ sin___v__d make analytics 2) di__
Detect (Data Profiling):
3) Ex___e the 4) v___s in the columns.
Correct (ETL):
5) Sp___ combined columns. Create a 6) sep__ ta__ for 7) m__-v__d columns.
1) not single-valued
2) difficult.
3) Examine
4) values
5) Split combined columns
6) separate table
7) multi-valued
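Splitting a composite column and moving a multi-valued column to a separate table can be sketched as follows; the location and phone values are hypothetical:

```python
# Hypothetical record: "location" is a composite column (city and
# state), and "phones" is a multi-valued column.
records = [{"id": 1, "location": "Newark, NJ",
            "phones": "555-0100;555-0199"}]

cleaned, phone_table = [], []
for r in records:
    # Split the composite column into two single-valued columns
    city, state = [part.strip() for part in r["location"].split(",")]
    cleaned.append({"id": r["id"], "city": city, "state": state})
    # Move the multi-valued column to a separate table, one row per value
    for phone in r["phones"].split(";"):
        phone_table.append({"id": r["id"], "phone": phone})
```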
Data Preparation Pattern 7: Incorrect Values
Summary
Problem:
1) Inc__ data might result in 2) po___ dec___-m__g.
Detect (Data Profiling):
3) De___ inc__ values with 4) ou___
Correct (ETL):
5) Mo__ incorrect values.
1) Incorrect
2) poor decision-making.
3) Detect incorrect
4) outliers
5) Modify
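One common way to detect incorrect values is an interquartile-range (IQR) outlier test; a sketch with made-up values, where 9,999 is a likely data-entry error:

```python
from statistics import quantiles

# Hypothetical amounts; 9,999 is far from the rest of the values
values = [100, 102, 98, 105, 99, 101, 9_999]

# Flag values outside 1.5 * IQR beyond the first or third quartile
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values
            if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```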
Data Preparation Pattern 8: Inconsistent Values
The next pattern addresses data inconsistency, which occurs when 1) t___ or m___ 2) di___ re___s of the 3) sa__ value are 4) m__d in the same 5) c___n.
1) two or more
2) different representations
3) same
4) mixed
5) column
Data Preparation Pattern 8: Inconsistent Values
-Identify Inconsistent Values
Two profiling techniques are useful for detecting inconsistent values:
- 1) Di__ v__: 2) Vi__ sc___ the 3) di___values of a column is an 4) ef__e way to 5) id__ inco___ data.
- 6) Fr__: Values with a 7) l__ frequency could indicate 8) in__t data.
1) Distinct values
2) Visually scanning
3) distinct
4) effective
5) identify inconsistent
6) Frequencies
7) low
8) inconsistent
Data Preparation Pattern 8: Inconsistent Values
Summary
Problem:
Inconsistent data might result in 1) po__ d___n-ma__g.
Detect (Data Profiling):
2) I____ inconsistent values.
Correct (ETL):
3) Mo__ inconsistent values
1) poor decision-making.
2) identify
3) modify
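Modifying inconsistent values often amounts to mapping every representation to one standard value; a sketch reusing the mgr/mngr/manager example:

```python
# Hypothetical column with inconsistent representations of "manager"
titles = ["manager", "mgr", "mngr", "clerk"]

# Map each inconsistent representation to the consistent value
standard = {"mgr": "manager", "mngr": "manager"}
consistent = [standard.get(t, t) for t in titles]
```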
Data Preparation Pattern 9: Incomplete Values
This pattern addresses 1) inco__ that might make data 2) unu____ and un__.
For example, without customers’ addresses, a business cannot send them marketing materials.
The different meanings given to 3) n__ values might also result in unreliable data.
1) incompleteness
2) unusable and unreliable
3) null
Data Preparation Pattern 9: Incomplete Values
There are a few dimensions of incompleteness worth exploring:
- Should null values be allowed, and if not, are there any?
- If null values are allowed, what percentage of the values are null? If the percentage is high, should the column be loaded?
- How are incomplete values represented: nulls, or a specific code? Are the representations consistent?
Data Preparation Pattern 9: Incomplete Values
-Remove the Column or Replace the Null Values
The correction scenario depends on the situation. If null values are not allowed but they exist, they should be 1) re__
If the number of null values is too high to be useful, then 2) re___ the column from the analytical database.
If there is inconsistency in representing missing values, design a 3) co___t schema and 4) co__ the values in terms of that schema
1) replaced
2) remove
3) consistent
4) correct
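A sketch of the correction options: normalize inconsistent missing-value codes to one schema, measure the null percentage, and replace remaining nulls with a consistent code (the email values are made up):

```python
# Hypothetical email column where missing values are represented
# inconsistently (None and the code "N/A")
emails = ["ada@x.com", None, "N/A", None, "bo@x.com"]

# Design a consistent schema: normalize every missing code to None
normalized = [None if e in (None, "N/A", "") else e for e in emails]

# If this percentage is too high, remove the column instead
null_pct = normalized.count(None) / len(normalized)

# If the column stays, replace nulls with one explicit code
filled = [e if e is not None else "UNKNOWN" for e in normalized]
```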
Data Preparation Pattern 9: Incomplete Values
Summary
Problem:
Incompleteness might make data 1) un__ and un__e and result in 2) p___r decision-making.
Detect (Data Profiling):
3) In___ n___l values.
Correct (ETL):
4) Re__ the column or 5) replace the 6) n__ values.
1) unusable and unreliable
2) poor
3) Investigate null
4) Remove
5) replace
6) null
Data Preparation Pattern 10: Invalid Values
Domain-specific rules that determine whether data are acceptable can be created for most columns. Data that do not meet these expectations are considered invalid.
Data Preparation Pattern 10: Invalid Values
Create and Apply Validation Rules
Data Preparation Pattern 10: Invalid Values
Summary
Problem:
Invalid data might result in 1) p___ decision-making.
Detect (Data Profiling):
2) Cr__and ap__ v___on rules.
Correct (ETL):
3) Mo__ invalid values.
1) poor
2) Create and apply validation
3) modify
5.5 Which Patterns Transform Tables?
Data Preparation Pattern 11: Non-Intuitive and Ambiguous Table Names
-Scan Tables for Incorrect or Ambiguous Names
Examining a table’s 1) c__t and its data 2) di___ry definition can help determine whether the name 3) acc__ reflects its 4) co__. The rules for naming 5) ta___ are the same as those for naming 6) c__s. They should be 7) in___, avoid sp__, un__, and s__l co__g:
For example, use CashReceipt instead of Cash Receipt.
Avoid special coding like DCustomer (the D refers to a dimension table, but not everyone will understand that).
1) content
2) dictionary
3) accurately
4) content
5) tables
6) columns
7) intuitive, avoid spaces, underscores, and special coding:
Data Preparation Pattern 11: Non-Intuitive and Ambiguous Table Names
Summary
Problem:
Incorrect or ambiguous table names make it 1) h___r to 2) und__ and wo___ with a data set.
Detect (Data Profiling):
3) Vi__ sc__ tables for incorrect or ambiguous names.
Correct (ETL):
4) Re__ tables.
1) harder
2) understand and work
3) visually scan
4) rename
Data Preparation Pattern 12: Missing Primary Keys
This pattern focuses on primary keys. Tables are descriptions of entities, and each instance of an entity should be 1) un__ id__d.
To be a primary key, a column must have a 2) un___ value for each instance and 3) n__ n__ values. Primary keys are normally already in place when data are 4) e__d from a 5) rel__ database.
However, a primary key will 6) n__ be in place when data are extracted from a spreadsheet, such as in the Beans case.
To establish a primary key, the field must be 7) se__ and both rules must be 8) v__
1) uniquely identified
2) unique
3) no null
4) extracted
5) relational
6) not
7) selected
8) validated
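The two primary-key rules (a unique value for each instance, no null values) can be validated with a short check; the key values below are hypothetical:

```python
def is_valid_primary_key(values):
    """Validate both primary-key rules for a column of values."""
    no_nulls = all(v is not None for v in values)
    unique = len(set(values)) == len(values)
    return no_nulls and unique

# Hypothetical candidate-key columns
valid = is_valid_primary_key([101, 102, 103])
not_unique = is_valid_primary_key([101, 101, 103])   # duplicate value
has_null = is_valid_primary_key([101, None, 103])    # null value
```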
Data Preparation Pattern 12: Missing Primary Keys
Problem:
Some tables 1) __ n__ have a 2) p__ key.
Detect (Data Profiling):
3) Ide__ tables 4) mi__ a 5) pr__ key.
Correct (ETL):
6) Cr__ a primary key
1) do not
2) primary key.
3) Identify
4) missing
5) primary
6) Create
Data Preparation Pattern 13: Redundant Content Across Columns
This pattern looks for 1) red___s that create 2) inc___
Data inconsistencies occur when the 3) s__ data are recorded 4) __e than on__ and 5) ch__ in one place but 6) n__ the other, such as a customer’s email address
1) redundancies
2) inconsistencies.
3) same
4) more than once
5) changed
6) not
Data Preparation Pattern 13: Redundant Content Across Columns
Here are two scenarios where two or more columns in a table might have the same content:
When there is 1) ov___p, such as an address that contains state information and a separate state field.
When there is 2) de___cy, which exists when one column’s values are 3) de__ on the values of another column in the 4) s__ table. Assume both age and date of birth are recorded.
However, the values of age change when time passes and the data will become inconsistent. Therefore, instead of transferring age from the data source, it should be calculated as part of the analytical database.
1) overlap
2) dependency
3) dependent
4) same
Data Preparation Pattern 13: Redundant Content Across Columns
-Perform Column-By-Column Comparisons
Performing column-by-column comparisons for 1) ov__ or de___s would 2) d__ this issue.
Doing so within the Beans data set would show there is currently no redundant content.
1) overlaps or dependencies
2) detect
Data Preparation Pattern 13: Redundant Content Across Columns
-Delete Redundant and Dependent Columns
Columns that contain redundant information can be 1) de__
When there is 2) d___cy, delete the column that contains the 3) de__nt value. Instead, use a 4) fo___ to recreate the column in the analytical database.
1) deleted
2) dependency
3) dependent
4) formula
Data Preparation Pattern 13: Redundant Content Across Columns
Summary
Problem:
1) Re___t content among columns in a table might result in 2) in___es.
Detect (Data Profiling):
Perform column-by-column 3) com__
Correct (ETL):
4) D___ red____ and de___nt data
1) redundant
2) inconsistencies
3) comparisons
4) Delete redundant and dependent
Data Preparation Pattern 14: Find Invalid Values with Intra-Table Rules
Pattern 14 is similar to Pattern 10 in that it also defines 1) a___ values for a column.
However, Pattern 14 2) determines the 3) v___ of a column’s values based on the values in 4) o__ or m___ other columns in the 5) s__ table
1) acceptable
2) determines
3) validity
4) one or more
5) same
Data Preparation Pattern 14: Find Invalid Values with Intra-Table Rules
-Create and Apply Intra-Table Validation Rules
The goal of the validation rule is to 1) id___ invalid data.
Creating 2) va___n rules requires 3) i__-de__ knowledge of the business, and they are implemented using a 4) scr___ language.
1) identify
2) validation
3) in-depth
4) scripting
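An intra-table rule can be expressed directly in a short script; the hypothetical rule here is that an invoice line's amount must equal quantity times unit price:

```python
# Hypothetical invoice lines; the second violates the rule
lines = [
    {"invoice": 1, "qty": 2, "unit_price": 10.0, "amount": 20.0},
    {"invoice": 2, "qty": 3, "unit_price": 5.0, "amount": 14.0},
]

# Intra-table validation rule: amount must equal qty * unit_price,
# comparing values in other columns of the same table
invalid = [l["invoice"] for l in lines
           if round(l["qty"] * l["unit_price"], 2) != round(l["amount"], 2)]
```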
Data Preparation Pattern 14: Find Invalid Values with Intra-Table Rules
Summary
Problem:
1) Inv___ data might result in 2) p__ decision-making.
Detect (Data Profiling):
3) Cr___ and ap___ in___-table validation rules.
Correct (ETL):
4) M__y the invalid values.
1) invalid
2) poor
3) create and apply intra-table
4) modify
5.6 Which Patterns Transform Models?
Data Preparation Pattern 15: Data Spread Across Tables
Analysis becomes more challenging when data that describe the same entity are spread across 1) m___ tables
1) multiple
Data Preparation Pattern 15: Data Spread Across Tables
To identify similarly structured tables, look for 1) t___ or m___ tables with the 2) sa___ st___. These tables would have the 3) s__ co__s and 4) sim___ d___.
Another option is to 5) se___ for tables that describe different 6) cha___ of the 7) s__ ent__.
In the Beans case, Employee and EmployeeDemographics are such tables. The goal is to create a single table with all the employee information
1) two or more
2) same structure
3) same columns
4) similar data
5) search
6) characteristics
7) same entity
Data Preparation Pattern 15: Data Spread Across Tables
Summary
Problem:
An entity’s data are 1) sp___d across 2) mu__ tables, which 3) co__s analysis.
Detect (Data Profiling):
4) Id___ tables with a 5) si___r structure or tables that describe 6) dif___ characteristics of the 7) s___e entity.
Correct (ETL):
8) U__ or me___ tables.
1) spread
2) multiple
3) complicates
4) Identify
5) similar structure or tables
6) different
7) same entity
8) union or merge
Data Preparation Pattern 16: Data Models Do Not Comply with Principles of Dimensional Modeling
Dimensional modeling is the technique of 1) cr___ data 2) mo___with 3) f___t tables surrounded by 4) d____n tables.
These data models, such as 5) st___ s____s, are 6) e__ to understand and result in 7) ef___t data processing.
1) creating
2) models
3) fact
4) dimension
5) star schemas
6) easy
7) efficient
Data Preparation Pattern 16: Data Models Do Not Comply with Principles of Dimensional Modeling
-Analyze a Data Model’s Compliance with Dimensional Modeling Principles
Use these dimensional modeling principles by determining the 1) f___ and di___n tables and ensuring all 2) fi__s belong to the 3) cor__ table.
In an accounting context, fact tables correspond to 4) b___s transactions.
Dimension tables, on the other hand, 5) de___ who par__ in the transactions, 6) w___ the transactions 7) oc____, and 8) w__t was g___ u__ or a___d.
1) fact and dimension
2) fields
3) correct
4) business
5) describe who participates in the transactions
6) when
7) occurred
8) what was given up or acquired.
Data Preparation Pattern 16: Data Models Do Not Comply with Principles of Dimensional Modeling
Summary
Problem:
Data models that are 1) n__t structured following the principles of 2) d__l modeling are typically more 3) d__lt to analyze.
Detect (Data Profiling):
4) A___ a data model’s compliance with 5) di___l modeling principles.
Correct (ETL):
6) Rec___ the data model as a 7) s__/s__ke schema.
1) not structured
2) dimensional
3) difficult
4) Analyze
5) dimensional
6) Reconfigure
7) star/snowflake
Define Snowflake Schema
A data 1) st___ in which the information for a 2) dimension is 3) sp__ a___s mu___ tables.
1) structure
2) dimension
3) spread across multiple
Data Exploration Pattern 17: Find Invalid Values with Inter-Table Rules
Patterns 10, 14, and 17 are similar because they define the acceptable values for a column.
However, Pattern 17 1) de___ the v__y of a column’s values based on the values in 2) o__ or m___ other tables.
1) determines the validity
2) one or more
Define Referential Integrity
A rule that states the values of a 1) f___ key must be a 2) su___ of the values of its corresponding 3) p___ key.
which refers to the fact that all values in a foreign key should also 4) e__t as values in the corresponding primary key
1) foreign
2) subset
3) primary
4) exist
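Referential integrity reduces to a set-subset check; a sketch with hypothetical keys, where foreign-key value 4 has no matching primary key:

```python
# Hypothetical primary-key values in the Customer dimension table
customer_pk = {1, 2, 3}

# Hypothetical foreign-key values in the Sale fact table; 4 has no
# matching customer, violating referential integrity
sale_customer_fk = [1, 2, 2, 4]

# Every foreign-key value must exist among the primary-key values
violations = sorted(set(sale_customer_fk) - customer_pk)
referential_integrity_holds = not violations
```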
Data Exploration Pattern 17: Find Invalid Values with Inter-Table Rules
-Create and Apply Inter-Table Validation Rules
1) In___-t___ val___n rules identify 2) in___ data.
Their creation 3) re___ in-d___ knowledge of the 4) b__s.
1) Inter-table validation
2) invalid
3) requires in-depth
4) business.
Data Exploration Pattern 17: Find Invalid Values with Inter-Table Rules
Summary
Problem:
1) In___data might result in poor decision-making.
Detect (Data Profiling):
Create and apply 2) i__-t__v_____ r__s.
Correct (ETL):
3) Mo___ the 4) in___ values.
1) invalid
2) inter-table validation rules
3) modify
4) invalid
5.7 Which Patterns Apply to Data Loading?
Once the data are cleaned and transformed, it is time to load them into the software for analysis. Data loading is the process of making the analytical database available for use. Since both extraction and loading are transfer processes, they have similar issues when it comes to the completeness and correctness of transferred data. It is also important that the data model of the analytical database is validated–that is, that all relationships have been defined.
Data Preparation Pattern 18: Incomplete Data Loading
Summary
Problem:
All the data 1) d___ n__t tra__r during 2) l___g.
Detect (Data Profiling):
3) Co__ row counts.
Correct (ETL):
Add the 4) mis___ data.
1) did not transfer
2) loading
3) compare
4) missing
Data Preparation Pattern 19: Incorrect Data Loading
Even when all the rows have been transferred, the data might 1) n__ have been transferred 2) co___. Pattern 19 addresses this potential issue.
1) not
2) correctly
Data Preparation Pattern 19: Incorrect Data Loading
Problem:
The 1) co___ data 2) d__ n__ tra___r during 3) lo___g.
Detect (Data Profiling):
4) Co___e co___l amounts.
Correct (ETL):
5) Mo__ inc__values.
1) correct
2) did not transfer
3) loading
4) Compare control
5) Modify incorrect
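Pattern 19's detection step, comparing control amounts, can be sketched the same way. Here a hypothetical invoice amount was garbled during loading (75.00 became 57.00), so the control totals disagree even though the row counts match:

```python
# Detect incorrect loading by comparing control totals (e.g., the sum of
# invoice amounts) between the source data and the loaded data.
source_amounts = [120.50, 75.00, 310.25]
loaded_amounts = [120.50, 57.00, 310.25]  # 75.00 was transposed to 57.00

source_total = round(sum(source_amounts), 2)
loaded_total = round(sum(loaded_amounts), 2)
difference = round(source_total - loaded_total, 2)
print(difference)  # 18.0 -> a nonzero difference flags a loading error
```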
Data Preparation Pattern 20: Missing or Incorrect Data Relationships
Analytics rely heavily on the underlying data 1) mo__.
2) Mis___or in___y defined data 3) re___ps make analytics 4) c___ng or even impossible, so the 5) comp___ and ac___y of the data model must be 6) va___ after 7) l___g.
1) model
2) Missing or incorrectly
3) relationships
4) challenging
5) completeness and accuracy 6) validated
7) loading
Data Preparation Pattern 20: Missing or Incorrect Data Relationships
Problem:
1) Mis__ or in___y defined data 2) re___s makes analysis 3) c___ng or even 4) im___.
Detect (Data Profiling):
5) Inv___ the 6) completeness and 7) ac___ of the data model.
Correct (ETL):
8) M___ the data 9) m___l.
1) Missing or incorrectly
2) relationships
3) challenging
4) impossible
5) Investigate
6) completeness
7) accuracy
8) Modify
9) model
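The detection step of Pattern 20, investigating the completeness of the data model, can be sketched by comparing the relationships the model should contain against those actually defined after loading. The table names here are hypothetical:

```python
# Validate the data model after loading: check that every expected
# relationship between tables has actually been defined.
# (Hypothetical model for illustration.)
expected = {("Sales", "Customer"), ("Sales", "Merchandise"), ("Sales", "Calendar")}
defined = {("Sales", "Customer"), ("Sales", "Merchandise")}

missing_relationships = expected - defined
print(missing_relationships)  # the Sales-Calendar link was not defined
```

Any missing relationship would then be added by modifying the data model (the Correct/ETL step).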
6.1 What is Information Modeling?
Define Information Modeling
The process of generating 1) ad___l k___ge from 2) d___ that is 3) rel__t for analysis 4) p___es
1) additional
2) data
3) relevant
4) purposes
The Information Modeling Process
In information modeling, 1) d___ are the 2) in__.
They are the raw figures and facts.
3) A___s are sets of 4) ins___ that 5) tra___ the data into 6) inf___, which is the 7) o___ of additional 8) k___dge gained from the data
1) data
2) input
3) algorithms
4) instructions
5) transform
6) information
7) output
8) knowledge
The Information Modeling Process
Algorithms are the 1) l___ between the 2) i___t of 3) f___s and the 4) ou__ of useful 5) inf__. Keep in mind that data (6) i__t) for one application can also be 7) in___n (8) ou__) for another application.
One example is how financial statements are information (output) for financial statement preparers, but they are data (input) for financial analysts.
1) link
2) input
3) facts
4) output
5) information
6) input
7) information
8) output
Algorithms
are used to calculate 1) d___n, c___, fin____l r___s, and more
-they range from 2) sim__ to co__ and can be 3) c___d with a host of languages
1) depreciation, cost, financial ratios
2) simple to complex
3) coded
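As a concrete sketch of an algorithm transforming data into information, here is annual straight-line depreciation, one of the calculations named above. The function name and inputs are illustrative, not from the text:

```python
# An algorithm as a set of instructions: inputs (cost, salvage value,
# useful life) are transformed into information (annual depreciation).
def straight_line_depreciation(cost, salvage_value, useful_life_years):
    """Annual depreciation expense under the straight-line method."""
    return (cost - salvage_value) / useful_life_years

expense = straight_line_depreciation(cost=10_000, salvage_value=1_000,
                                     useful_life_years=5)
print(expense)  # 1800.0
```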
The Information Modeling Process
2 types of information fields
-cal___ co__n
-m___e
-calculated column
-measure
Define Calculated Column
A 1) n__data set column with 2) va___ that are calculated based on the 3) va___s in other 4) c___ns
They are the 5) int__ p__s of the table
1) new
2) values
3) values
4) columns
5) integral parts
Define Measure
An 1) ag___d (to__l) data information field that can be used in 2) re___s for 3) an___ purposes
Measures are created by 4) al____, but they are 5) n__ in__ pa__ of a table.
1) aggregated (total)
2) reports
3) analytical
4) algorithms
5) not integral parts
1) M___s are at the center of data analytics.
They can be calculated and then sliced in many ways during data 2) ex___n
1) Measures
2) exploration
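A minimal Python sketch of a measure being calculated and then sliced (hypothetical transaction rows and a made-up region dimension):

```python
# A measure is an aggregate (here, total sales) that can then be sliced
# by a dimension such as region during data exploration.
sales = [
    {"region": "North", "amount": 100},
    {"region": "South", "amount": 250},
    {"region": "North", "amount": 150},
]

total_sales = sum(row["amount"] for row in sales)  # the measure
by_region = {}                                     # the measure sliced by region
for row in sales:
    by_region[row["region"]] = by_region.get(row["region"], 0) + row["amount"]

print(total_sales, by_region)  # 500 {'North': 250, 'South': 250}
```

Note the measure is not stored as a column in the table; it is computed by an algorithm, consistent with the definition above.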
A Structured Approach
Performing data analysis requires determining the 1) nu___ to analyze (these are the 2) m___es) and 3) h__ to analyze them (these are the 4) dim__, or the 5) f___s that can 6) sl__ them)
1) numbers
2) measures
3) how
4) dimensions
5) fields
6) slice
A Structured Approach
to developing accounting 1) inf__ mo__ will be used as an example to define 2) dim___ of w___, w__, and w___ in transactions to help make accounting 3) s__e of data
1) information models
2) dimensions of who, what, and when
3) sense
Define Dimensions
In an accounting context, these are the data that provide 1) co__ to analysis and give 2) m___g to 3) f__s.
They are the 4) var___ or fi___s that can be used to drill down or disaggregate (slice) analysis measures.
They are typically configured in a 5) st__ sc___a
1) context
2) meaning
3) facts
4) variables or fields
5) star schema
Define Data Model
shows the 1) rel___ among the 2) dif__ e___es in a data set
It shows the concepts being 3) des___, the tab__, and the fi__ used to 4) de___e the concepts
1) relationships
2) different entities
3) described, the tables, and fields
4) describe
Define Information Model
An 1) ext__ of the data model that includes 2) c___d columns and 3) me__
It has additional 4) inf__ that is calculated from the 5) d__a s__t, which can be used for 6) an___s pu___s
1) extension
2) calculated columns
3) measures
4) information
5) data set
6) analysis purposes.
Create Measures and Dimensions
What are the 2 goals for building an information model for a star schema?
- The first is to 1) c__e a 2) ri__set of 3) me___s for the 4) f__ table
- The second goal for building an information model is a rich set of 5) di__ns that can 6) br__ do___, or sl__e, measures in many ways
1) create
2) rich
3) measures
4) fact
5) dimensions
6) break down or slice
Dimension tables can describe specific characteristics of transactions: w__ par___d, wh___ was in___d, and w___n they oc___d
who participated, what was involved, and when they occurred
Define Who Table
A 1) di___ table that describes 2) ag__ involved in the 3) bu___ transaction data
Internal agents are the employees who 4) p__e in the transaction
1) dimension
2) agents
3) business
4) participate
External agents, such as customers and vendors, are parties external to the organization involved in the accounting transactions.
A participates relationship 1) li__ 2) a__s to 3) s___c transactions.
1) links
2) agents
3) specific
Define What Table
A 1) dim___ table that describes the 2) re___s involved in a 3) b___s transaction
1) dimension
2) resources
3) business
Define Flows Relationship
The data 1) rela___ between 2) re___s and spe___ tran___s
1) relationship
2) resources and specific transactions.
Define When Table
A 1) di___ table that 2) de__s when the data from the 3) bus___ tran___ oc___.
1) dimension
2) describes
3) business transaction occurred
Define Occurs Relationship
The data 1) re__ that 2) l__s the 3) cal__ to 4) sp__ bu__ss tran__
1) relationship
2) links
3) calendar
4) specific business transactions
In__ mo__ is the heart of data analytics
Information
6.2 Which Patterns Implement Information Modeling Algorithms?
Information Modeling Pattern 1: Within-Table Numeric Calculation
The within-table numeric calculation pattern creates a new 1) fi__ (a 2) cal__ column) from 3) o__ or m__ nu___ columns, or fields in the 4) s__ table.
Basic 5) ari___ operations of addition, subtraction, multiplication, and division can do this
A typical EX is determining the dollar total for an order line or invoice line by multiplying price and quantity
1) field
2) calculated
3) one or more numeric
4) same
5) arithmetic
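The typical example above (line total = price × quantity) can be sketched in Python with hypothetical order-line rows:

```python
# Within-table numeric calculation: derive a new calculated column
# (line_total) from numeric columns already in the same table.
order_lines = [
    {"price": 10.0, "quantity": 3},
    {"price": 4.5, "quantity": 2},
]
for line in order_lines:
    line["line_total"] = line["price"] * line["quantity"]

print([line["line_total"] for line in order_lines])  # [30.0, 9.0]
```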
Information Modeling Pattern 2: Within-Table Text Calculation
The second implementation pattern also involves creating a 0) n___ cal___ col__ from o__ or more fi___s in the sa__ table.
However, the information in the new column is created from 1) t__ f__
For example, different pieces of location information could be linked, or concatenated, into an address so it can be read by a map service such as Bing Maps or Google Maps
0) new calculated column from one or more fields in the same table.
1) text fields
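The concatenation example above can be sketched as follows; the field names are hypothetical:

```python
# Within-table text calculation: concatenate location fields in the same
# row into a single address string a map service could read.
row = {"street": "123 Main St", "city": "Springfield", "state": "IL",
       "zip": "62701"}
row["address"] = f'{row["street"]}, {row["city"]}, {row["state"]} {row["zip"]}'
print(row["address"])  # 123 Main St, Springfield, IL 62701
```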
Information Modeling Pattern 3: Within-Table Classification
The number of 0) cl__n algorithms that can be applied to a data set is, in essence, 1) un__
Used when classifications are 1.1) d___d based on data that are 2) p__ of the 3) s__e ta__
most classification algorithms rely on 4) Bo__ lo__, so it's necessary to understand Boolean operations like AND, OR, NOT, and IF functions and how to create them
0) classification
1) unlimited
1.1) determined
2) part
3) same table
4) Boolean logic
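A small sketch of a within-table classification built from Boolean logic, equivalent to an IF(AND(...)) formula. The risk thresholds and invoice rows are hypothetical:

```python
# Within-table classification using Boolean logic (AND / IF):
# flag each invoice as "High" risk if it is both large and overdue.
invoices = [
    {"amount": 12_000, "days_overdue": 45},
    {"amount": 500,    "days_overdue": 90},
    {"amount": 15_000, "days_overdue": 0},
]
for inv in invoices:
    is_large = inv["amount"] > 10_000          # Boolean condition 1
    is_overdue = inv["days_overdue"] > 30      # Boolean condition 2
    inv["risk"] = "High" if (is_large and is_overdue) else "Low"

print([inv["risk"] for inv in invoices])  # ['High', 'Low', 'Low']
```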
Information Modeling Pattern 4: Across-Table Calculation
Regardless of the software, there are multiple questions to consider when linking tables:
- Are there 1) p___y defined 2) rel___ between the 3) ta__?
- What is the 4) na___ of the 5) rel__ and what 6) t___of 7) jo___ should be considered?
- What 8) car___s apply to the relationship: 9) __-__, ___-__, __-__, or __-__? For example, calculations become more complex when navigating a 1-N relationship.
1) properly
2) relationships
3) tables
4) nature
5) relationship
6) type
7) join
8) cardinalities
9) 1-1, 1-N, N-1, or N-N?
Information Modeling Pattern 4: Across-Table Calculation
The calculations for this pattern are like the first three–a new column is created from existing data using 1) ari___, te__ cal___, or cla___
But this pattern is different from the earlier patterns because data from 2) dif__nt tables are used to create a 3) n__ co__n. This means the tables 3.1) m__ be 4) li__. How this is accomplished depends on the 5) sof__ being used
1) arithmetic, text calculation, or classification.
2) different
3) new column
3.1) must
4) linked
5) software
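An across-table calculation can be sketched by linking two hypothetical tables through a 1-N relationship (one customer, many sales) and then computing a new column from fields in both:

```python
# Across-table calculation: the tables must first be linked. Here each
# sale follows its N-1 link to the customer, and a new column combines
# fields from both tables (hypothetical data and field names).
customers = {1: {"name": "Acme", "discount": 0.10},
             2: {"name": "Beta", "discount": 0.00}}
sales = [{"sale_id": 101, "customer_id": 1, "amount": 200.0},
         {"sale_id": 102, "customer_id": 2, "amount": 50.0}]

for sale in sales:
    customer = customers[sale["customer_id"]]  # follow the relationship
    sale["net_amount"] = round(sale["amount"] * (1 - customer["discount"]), 2)

print([s["net_amount"] for s in sales])  # [180.0, 50.0]
```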
Information Modeling Pattern 5: Single-Column Aggregation
The 1) si___-co__ aggregation pattern is the 2) fi___ to fo__ on 3) me__s, which, as you have learned, are the 4) h__ of data analytics.
Measures calculate 5) a___s that can be 6) sli___, or br__en down, in many ways
1) single-column
2) first to focus
3) measures
4) heart
5) aggregates
6) sliced, or broken down,
In its simplest form, a measure applies a mathematical operation to all the values of a single column. Operations that can be applied include aggregate functions such as SUM, AVERAGE, COUNT, MIN, and MAX. Each of these functions generates a single value.
Information Modeling Pattern 5: Single-Column Aggregation
For most applications, single-column measures make up a significant part of the information model. A few more things about aggregation functions:
Except for aggregation functions that count, most require a numeric data type.
Be sure to understand how the aggregate functions deal with specific values, such as null, as this can differ depending on the implementation platform.
Pattern 6: Filtered Aggregation
it’s often necessary to analyze aggregates based on 1) fi___ d__a sets
1) filtered data
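Filtered aggregation can be sketched as an aggregate computed over only the rows that pass a filter (hypothetical rows; compare a SUMIF-style formula):

```python
# Filtered aggregation: aggregate only the rows that satisfy a filter,
# e.g. total sales for the "North" region only.
sales = [
    {"region": "North", "amount": 100},
    {"region": "South", "amount": 250},
    {"region": "North", "amount": 150},
]
north_total = sum(row["amount"] for row in sales if row["region"] == "North")
print(north_total)  # 250
```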
Define Measure Hierarchy
used when a 1) ne__, m__ co___ measure is created using 2) ex__ m___es
1) new, more complex
2) existing measures
Information Modeling Pattern 7: Measure Hierarchies
In fact, 0) co__ problems can sometimes be solved by 1) di___ the problem into 2) s__ problems, then 3) combining the 4) d___t solutions.
Measure hierarchies are especially useful when calculating 5) ra___ and ben____
0) complex
1) dividing
2) smaller
3) combining
4) different
5) ratios and benchmarks.
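A measure hierarchy can be sketched by building a ratio from two simpler existing measures; the gross margin example below uses hypothetical figures:

```python
# Measure hierarchy: a more complex measure (gross margin %) is built
# from two simpler existing measures (total revenue, total COGS).
revenues = [500.0, 300.0, 200.0]
costs = [300.0, 180.0, 120.0]

total_revenue = sum(revenues)  # existing measure 1
total_cogs = sum(costs)        # existing measure 2
gross_margin_pct = (total_revenue - total_cogs) / total_revenue  # new measure
print(gross_margin_pct)  # 0.4
```

This mirrors the divide-and-combine idea above: the complex problem (a ratio) is split into smaller aggregation problems whose solutions are then combined.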
6.3 Which Patterns Help Develop and Implement Accounting Information Models?
Information Modeling Pattern 8: How Many
Tables usually describe concepts like 1) c___s, ve___, sa___, and p___ts.
The next pattern creates a measure that 2) c___s the number of 3) r___s or instances, of a concept within a table.
This pattern can count 4) re___, tr___s, and a__s, which means it can be used for both 5) fa__ (tra___s) and di___n (re___s and a__s) tables.
1) customers, vendors, sales, and products
2) count
3) rows
4) resources, transactions, and agents
5) fact (transactions) and dimension (resources and agents)
Information Model
Transactions: how many sales
Agents (who): how many customers
Resources (what): how many products does KLUB have
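The "how many" pattern reduces to counting rows, and it works for both fact and dimension tables. A minimal sketch with hypothetical tables:

```python
# Pattern 8 (How Many): count the rows, or instances, of a concept.
# Works for fact tables (how many sales) and dimension tables
# (how many customers). Hypothetical data for illustration.
sales = [{"sale_id": 101}, {"sale_id": 102}, {"sale_id": 103}]
customers = [{"customer_id": 1}, {"customer_id": 2}]

how_many_sales = len(sales)
how_many_customers = len(customers)
print(how_many_sales, how_many_customers)  # 3 2
```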
Information Modeling Pattern 9: Participates | Transaction-Who
The participates relationship describes 1) w__ is 2) inv__d in a transaction
1) who
2) involved
In KLUB’s data model there are three participates relationships:
From Sales to Customer (the customer is an external agent).
From Sales to Employee (the employee is an internal agent).
From Purchases to Vendor (the vendor is an external agent).
There are also other types of participates relationships that do not appear in the KLUB data set:
CashReceipts to Customer.
CashReceipts to Employee.
CashDisbursements to Vendor.
CashDisbursements to Employee.
Because the information captured by these relationships is similar, the same types of analyses apply to all of them.
Information Modeling Pattern 10: Flows | Transaction-What
The flows relationship describes 1) w__ is inv___ in a transaction, such as goods and services.
A transaction results in either an 2) in___e or a dec__ of the 3) re__es
1) what is involved
2) increase or a decrease
3) resources
Information Modeling Pattern 10: Flows | Transaction-What
In KLUB’s data model there are two flows relationships:
From Sales to Merchandise.
From Purchases to Merchandise.
There are also other types of flows relationships that do not appear in the KLUB data set:
Cash Receipts to Cash.
Cash Disbursements to Cash.
Because the information captured by these relationships is similar, the same types of analysis apply to all of them.
Information Modeling Pattern 10: Flows | Transaction-What
The first step is to develop measures for the transactions, such as sales, that can be broken down by dimensions that are part of the Resource tables, such as merchandise.
Information Modeling Pattern 11: Occurs | Transaction-When
The occurs relationship describes when a transaction takes 1) p__ by connecting it to a 2) Cal__ table
1) place
2) Calendar
Information Modeling Pattern 11: Occurs | Transaction-When
There are two occurs relationships in KLUB’s data model:
Sales to Calendar.
Purchases to Calendar.
There are other occurs relationships that do not appear in the KLUB data set:
CashReceipts to Calendar.
CashDisbursements to Calendar.
Because the information captured by these relationships is similar, the same types of analyses apply to all of them.
Information Modeling Pattern 12: Who-What-When Star Schema
While patterns 9, 10, and 11 analyze the who, what, and when dimensions of accounting transactions, this pattern integrates these 1) t__e re__ps
1) three relationships
Information Modeling Pattern 13: Integrated Star Schemas
shows how 1) w___o-wh__-w__ stars are connected through resources.
Resources are being acquired for a purpose–to sell or to use them for manufacturing, for example
1) who-what-when