Chapter 2 - Mastering the Data Flashcards
Cleaning data
generally means filling in missing values, finding rows with empty values, checking for duplicate records, and so on;
data transformation
converting data from one format to another, typically from the format of the source system to the format of the destination system.
4 benefits of data preparation
- catch errors before the actual process begins
- produce better quality of data
- processing better data = better insights
- better insights = better decisions
What are the 4 characteristics of a data analytics mindset?
- Asking the right questions
- ETL (extract, transform, and load) relevant data
- Apply appropriate DA techniques
- Interpret and share results with stakeholders
A.E.A.I
A delimiter (sometimes known as a field separator)
is a sequence of one or more characters specifying the boundary between distinct data attributes. For example, if we write a name as “Smith, David,” then the comma delimits, or separates, the last and first names. Any combination of characters can be used as delimiters, but the most common are a comma, tab, space, colon, and pipe (which is a vertical line, typed as |).
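As a minimal sketch, Python's built-in csv module can split records on any of these delimiters (the employee names and IDs below are hypothetical sample data):

```python
import csv
import io

# A small pipe-delimited sample (hypothetical data, for illustration only)
raw = "EmployeeID|LastName|FirstName\n1001|Smith|David\n1002|Jones|Mary\n"

# csv.reader splits each line on whatever delimiter we specify
rows = list(csv.reader(io.StringIO(raw), delimiter="|"))

print(rows[0])  # ['EmployeeID', 'LastName', 'FirstName']
print(rows[1])  # ['1001', 'Smith', 'David']
```

Swapping `delimiter="|"` for `","` or `"\t"` handles CSV and TSV files the same way.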
Proprietary file types
There are many different proprietary file types, but the most commonly used are .xls or .xlsx, the file types for saving Microsoft Excel documents. When proprietary file types save files, they use (underlying) coding to distinguish the rows and columns.
–Strength: The program “gets it right” when putting the data in the correct columns and rows.
–Weakness: Proprietary file types often cannot be opened in other software, and the number of records they hold can be restricted. For example, Excel files (currently) can hold approximately 1 million rows of data.
CSV
(comma-separated values)
TSV
(tab-separated values)
pipe
(|)
XBRL
eXtensible Business Reporting Language
is a freely available and global framework for exchanging business information. XBRL allows the expression of semantic meaning commonly required in business reporting.
text qualifier
When a delimiter (such as a comma) is used as a legitimate part of the text, you need to tell the program that it actually is part of the text. To do this, you add text qualifiers around the text. The most common text qualifier is the double quotation mark (").
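A small sketch of text qualifiers in action, again with the stdlib csv module (the names and departments are hypothetical):

```python
import csv
import io

# "Smith, David" contains a comma, so it is wrapped in double quotes
# (the text qualifier) to keep it together as a single field.
raw = '"Smith, David",Accounting\n"Lee, Ana",Audit\n'

rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))

print(rows[0])  # ['Smith, David', 'Accounting'] — the embedded comma survived
```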
Understand the importance of unique identifiers when dealing with data, and how they can save you time and prevent errors.
example: a location number combined with an employee ID number can serve as a unique identifier
5 main types of join
- inner join
- left join
- right join
- full outer join
- cross join
Inner join
The inner join (sometimes just called a join) combines data in two tables that match on one or more identified attributes. Importantly, an inner join will not pull data from the tables if there is no match on the identified attribute.
Left join - merging/joining data
The left join combines all data from the table listed on the left and only data that matches the identified attributes from the right table.
Right Join - merging/joining data
The right join functions similarly to the left join, except it keeps all data in the right table and only merges matching data from the left table. A left join and a right join will produce exactly the same results if you switch which tables are listed on the left or right.
full outer join
A full outer join returns all values from both tables when they match on a specified dimension, and then returns all values that do not match on that dimension with a null value for the non-matching fields.
Cross join (or Cartesian product)
A cross join (or Cartesian product) does not use any variable to match; rather, it pairs every single instance in one table with every instance of the other table.
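The joins above can be sketched with Python's built-in sqlite3 module (the employee and department tables below are hypothetical sample data; note that SQLite only added RIGHT and FULL OUTER JOIN in version 3.39, so this sketch sticks to inner, left, and cross joins):

```python
import sqlite3

# In-memory database with two small hypothetical tables.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (emp_id INTEGER, name TEXT, dept_id INTEGER)")
con.execute("CREATE TABLE departments (dept_id INTEGER, dept_name TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "Smith", 10), (2, "Jones", 20), (3, "Lee", None)])
con.executemany("INSERT INTO departments VALUES (?, ?)",
                [(10, "Audit"), (20, "Tax"), (30, "Advisory")])

# Inner join: only rows whose dept_id matches in both tables (Lee is dropped).
inner = con.execute("""SELECT e.name, d.dept_name
                       FROM employees e
                       JOIN departments d ON e.dept_id = d.dept_id""").fetchall()
print(len(inner))  # 2

# Left join: every employee is kept; Lee gets NULL (None) for dept_name.
left = con.execute("""SELECT e.name, d.dept_name
                      FROM employees e
                      LEFT JOIN departments d ON e.dept_id = d.dept_id""").fetchall()
print(len(left))   # 3

# Cross join: every employee paired with every department (3 x 3 rows).
cross = con.execute("SELECT * FROM employees CROSS JOIN departments").fetchall()
print(len(cross))  # 9
```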
Aggregation
Aggregation is the level at which the data is summarized. It can be at a low level (no aggregation is used) or at a high level (data is aggregated into a single number).
Data formats
Data can be formatted in many ways. A format specifies how the data should be treated. Common formats include treating data as a number, text, percent, scientific notation, etc. It is important to understand all of the different formats used in data and what they mean.
Similarly, different data formats often don’t “speak with each other.” If a unique identifier of 1731 is listed as a number in one data set but the identical number 1731 is listed as a text string in another data set, the tables will not merge correctly until the formats are the same. Each program processes formats differently, so making sure you understand how your program deals with formats is important.
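The 1731 example can be sketched in a few lines of Python (the company names and terms are hypothetical):

```python
# The same identifier stored as a number in one dataset and as text in another.
table_a = {1731: "Acme Supply"}   # key stored as an integer
table_b = {"1731": "Net 30"}      # key stored as a text string

key = 1731
print(key in table_a)        # True
print(key in table_b)        # False — the formats differ, so no match on merge

# Converting both sides to one format fixes the mismatch.
print(str(key) in table_b)   # True
```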
Common messy data problems
- Data formats
- Dates
- Duplicate and redundant data
- Units of measurement
- International differences
Creating a repeatable ETL process
- Data format
- Data scope
- Documentation
- Automation
Mastering the data can also be described via the ETL process. The ETL process stands for __________.
- Enter, transform and load
- Enter, total and load
- Extract, transform and load
- Extract, total and load
3
Which of the following is not a common way that data will need to be cleaned after extraction and validation?
- Remove heading and subtotals
- Remove trailing zeros
- Correct inconsistencies across data
- Format negative numbers
2
What is the purpose of transforming data?
- Validate the data for completeness and integrity
- Identify which data are necessary to complete the analysis
- Load the data into the appropriate tool for analysis
- Obtain the data from the appropriate sources
3
Which of these is not included in the five steps of the ETL process?
- Scrub the data
- Validate the data for completeness and integrity
- Determine the purpose and scope of the data request
- Obtain the data
1
What are attributes that exist in a relational database that are neither primary nor foreign keys?
- Relational table attributes
- Composite key
- Descriptive attributes
- Nondescript attributes
3
The advantages of storing data in a relational database include which of the following?
a - Help in enforcing business rules
b - Increased information redundancy
c - Integrating business processes
- All possible choices
- Only B and C
- Only A and B
- Only A and C
4
Which of the following is the metadata that describes each attribute in a database?
- Data dictionary
- Flat file
- Composite primary key
- Descriptive attributes
1
Which attribute is required to exist in each table of a relational database and serves as the “unique identifier” for each record in a table?
- Key attribute
- Foreign key
- Primary key
- Unique identifier
3
In data preparation, which step includes extracting the information from any source?
- Transform
- Gather
- Discover
- Enrich
2
Which delimiter does the AICPA recommend in their Audit Data Standards?
- Pipe or vertical line
- Space
- Tab
- Comma
1
6 Steps in data preparation
(according to food analogy article)
- Gather - extracting info from any source
- Discover - meaning/understanding the data
- Cleanse - missing values, outliers
- Transform - changing data format
- Enrich - enhancing raw data to improve insights
- Store - store or send for analysis
GDCTES (great days come to each son)
composite primary key
A special case of a primary key that exists in linking tables. The composite primary key is made up of the primary keys of the two tables that it links.
data dictionary
Centralized repository of descriptions for all of the data attributes of a dataset.
Data request form
A method for obtaining data if you do not have access to obtain the data directly yourself
Descriptive (or nonkey) attributes
Attributes that exist in relational databases that are neither primary nor foreign keys. These attributes provide business information, but are not required to build a database. An example would be “Company Name” or “Employee Address.”
ETL
Extract, transform and load process integral to mastering the data
Flat file
A means of storing data in one place, such as in an Excel spreadsheet, as opposed to storing the data in multiple tables, such as in a relational database
Foreign Key
An attribute that exists in relational databases in order to carry out the relationship between two tables. This does not serve as the “unique identifier” for each record in a table. These must be identified when mastering the data from a relational database in order to extract the data correctly from more than one table.
The foreign key is another type of attribute, and its function is to create the relationship between two tables
Mastering the data
The second step in the IMPACT cycle; it involves identifying and obtaining the data needed for solving the data analysis problem, as well as cleaning and preparing the data for analysis.
Primary key
An attribute that is required to exist in each table of a relational database and serves as the “unique identifier” for each record in a table
Relational Database
A means of storing data in order to ensure that the data are complete, not redundant, and to help enforce business rules. Relational databases also aid in communication and integration of business processes across an organization
Relational databases are made up of tables with uniquely identified records (this is done through primary keys) and are related through the usage of foreign keys
What are the 5 steps of the ETL process?
- Step 1 Determine the purpose and scope of the data request (extract).
- Step 2 Obtain the data (extract).
- Step 3 Validate the data for completeness and integrity (transform).
- Step 4 Sanitize the data (transform).
- Step 5 Load the data in preparation for data analysis (load).
RDBMS acronym and 3 examples
Relational Database Management Systems
- Microsoft Access
- SQLite
- Microsoft SQL Server
Microsoft Access
For any user of Microsoft products (Word, Excel, PowerPoint, etc.), the navigation of Microsoft Access is familiar, so it is a relatively easy entry point for working with relational databases. It is a great entry tool for learning how tables are related via primary and foreign keys, because entire databases can be built via a graphical user interface instead of using SQL statements to create tables and relationships.
SQLite
SQLite is an open-source solution to data management. For a user that is at least somewhat familiar with relational database management, it is a friendly tool, and presents an intuitive interface for writing SQL statements
open-source
denoting software for which the original source code is made freely available and may be redistributed and modified.
Microsoft SQL Server
Microsoft SQL Server can support enterprise-level data in ways that smaller RDBMS programs, such as Access and SQLite, cannot.
Working with SQL Server is meant to provide experience that replicates working with the much larger and more complex datasets you will likely encounter in the professional world.
Other examples of RDBMS
- Teradata
- MySQL
- Oracle RDBMS
- IBM DB2
- Amazon RDS
- PostgreSQL
Unified Modeling Language (UML) Class Diagram
is an illustration or a drawing of the tables and their relationships to each other (i.e., a database schema)
Benefits of a normalized relational database over a flat file (4):
- Completeness
- No redundancy
- Business rules are enforced
- Communication and integration of business processes
What is meant by these 4 benefits of using normalized relational databases over flat files
- Completeness
- No redundancy
- Business rules are enforced
- Communication and integration of business processes
- Completeness - ensures all data required for a business process are included in the dataset
- No redundancy - one version of the truth; the redundancy found in flat files takes up unnecessary space (which is expensive) and increases the risk of data-entry errors
- Business rules are enforced - allows for better placement and enforcement of internal controls in ways that flat files simply cannot
- Communication and integration of business processes - the design of relational databases supports business processes, which results in improved communication across functional areas
What are the three types of columns in a relational database table?
- Primary Key (PK)
- Foreign Key (FK)
- Descriptive Attributes
Procure-to-pay
Procure to pay is the process of requisitioning, purchasing, receiving, paying for and accounting for goods and services.
ADS developed by the AICPA
Audit Data Standards
While the ADSs provide an opportunity for standardization, they are voluntary
What is SQL? what are its uses and specific use related to DA?
SQL stands for structured query language
is a computer language that can be used to create, update, and delete records and tables in relational databases. In data analysis, however, the focus is on extracting data: selecting the precise attributes and records that fit the criteria of the data analysis goal.
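A minimal sketch of SQL used for extraction, run through Python's built-in sqlite3 module (the invoices table and the "unpaid invoices over 100 dollars" question are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE invoices (invoice_id INTEGER, amount REAL, paid INTEGER)")
con.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                [(1, 250.0, 1), (2, 900.0, 0), (3, 120.0, 0)])

# Extract only the attributes and records that fit the analysis goal:
# unpaid invoices over 100 dollars.
unpaid = con.execute("""SELECT invoice_id, amount
                        FROM invoices
                        WHERE paid = 0 AND amount > 100
                        ORDER BY invoice_id""").fetchall()
print(unpaid)  # [(2, 900.0), (3, 120.0)]
```

The SELECT clause picks the attributes and the WHERE clause picks the records, which is exactly the "extract" focus described above.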
Important Appendices in the Book
Appendices:
D & H for SQL - writing queries and creating joins
C for Excel VLOOKUP
What is the Vlookup function in Excel and what is it used for?
One of Excel’s most useful tools for looking up data in two separate tables and matching them based on a primary key/foreign key relationship is the VLOOKUP function. There are a variety of ways that VLOOKUP can be used, but for extracting and transforming data it is best used to add a column to a table.
When to use SQL vs. VLookup for ET?
SQL - pulling out specific information of interest to answer your biz question
VLOOKUP - for exploratory analysis where you don’t mind pulling the entire table of data. Note: Excel has the roughly 1-million-row limit, and large lookups can get very slow.
What 4 steps help validate data after Extraction?
- Compare the # of records that were extracted to the # of records in the source database.
- Compare descriptive statistics for numeric fields (calculating min, max, avg, medians - help ensure numeric data were extracted completely)
- Validate Date/Time Fields - converting to numeric and running descriptive statistics
- Compare string limits for text fields - ensure you haven’t cut off any characters
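The four validation steps above can be sketched with the stdlib statistics module (the record count, amounts, and names are hypothetical sample data):

```python
import statistics

# Hypothetical extract vs. known source record count.
source_count = 4
extracted = [250.0, 900.0, 120.0, 430.0]

# Step 1: compare the number of extracted records to the source.
print(len(extracted) == source_count)   # True

# Step 2: descriptive statistics for a numeric field.
print(min(extracted), max(extracted))   # 120.0 900.0
print(statistics.mean(extracted))       # 425.0
print(statistics.median(extracted))     # 340.0

# Step 4: check string limits on a text field against the source column width.
names = ["Acme Supply", "Globex"]
print(max(len(n) for n in names))       # 11
```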
Descriptive Statistics
Central Tendency
Variability
Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent.
Measures of central tendency include the mean, median and mode
measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness.[3]
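The central tendency and variability measures can be computed directly with Python's statistics module (the sample values below are arbitrary illustrative data):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency
print(statistics.mean(data))    # 5
print(statistics.median(data))  # 4.5
print(statistics.mode(data))    # 4

# Variability
print(statistics.pstdev(data))  # 2.0 (population standard deviation)
print(min(data), max(data))     # 2 9
```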
Kurtosis
In probability theory and statistics, kurtosis (from Greek: κυρτός, kyrtos or kurtos, meaning “curved, arching”) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution and there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. Different measures of kurtosis may have different interpretations.
Pearson kurtosis - higher kurtosis corresponds to greater extremity of deviations (or outliers), not to the configuration of data near the mean.
Moors’ kurtosis interpretation
Based on the standardized value Z = (X − mean)/SD, where X is the random variable.
Skewness
In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.
For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right.
A left-skewed distribution usually appears as a right-leaning curve.
A right-skewed distribution usually appears as a left-leaning curve.
The mass of the distribution is concentrated on the side opposite the name, since the terms left and right refer to where the tail is, not where the mass leans.
Sometimes approximated nonparametrically as:
(Mean − Median)/SD
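A quick check of the nonparametric approximation on a right-skewed sample (the data values are arbitrary illustrative numbers):

```python
import statistics

# A right-skewed sample: one large value pulls the mean above the median.
data = [1, 2, 2, 3, 3, 3, 4, 20]

mean = statistics.mean(data)      # 4.75
median = statistics.median(data)  # 3.0
sd = statistics.pstdev(data)

skew_approx = (mean - median) / sd
print(skew_approx > 0)  # True — positive, matching the long right tail
```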
What are 4 common ways data has to be cleaned after extraction and validation?
- Remove headings or subtotals
- Clean leading zeroes and nonprintable characters - happens when values are stored in the source database as text but need to be analyzed as numbers
- Format negative numbers
- Correct inconsistencies across data, in general
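Two of the cleaning steps above (leading zeros and negative-number formats) can be sketched with a small helper function; the accounting-style parenthesized negative is an assumed input format for illustration:

```python
def clean_number(raw):
    """Turn a messy extracted text value into a usable number (illustrative helper)."""
    value = raw.strip()                        # drop surrounding whitespace
    if value.startswith("(") and value.endswith(")"):
        return -float(value[1:-1])             # accounting negatives: (500) -> -500.0
    return float(value.lstrip("0") or "0")     # leading zeros: "007" -> 7.0

print(clean_number("(500)"))    # -500.0
print(clean_number("007"))      # 7.0
print(clean_number(" 12.5 "))   # 12.5
```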
5 main quality issues to look for when analyzing a data set for the first time:
- Dates - main issue is format; preferred format is yyyy-mm-dd (ISO 8601) (from big y to small d)
- Numbers - 1 vs. I, 0 vs. O, 7 vs. seven, stray $ signs, etc.; remove any extra characters to leave the raw number
- International characters and encoding - ASCII vs. Unicode, invisible computer characters (tabs, returns, line breaks, etc.)
- Languages and measures - cheese vs. fromage, pounds vs. lbs, dollars vs. euros
- Human error
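The date-format issue above can be sketched with the stdlib datetime module; the messy input strings and their assumed source formats are hypothetical:

```python
from datetime import datetime

# Hypothetical messy date strings paired with the formats they arrive in.
messy = [("03/14/2024", "%m/%d/%Y"),
         ("14-Mar-24", "%d-%b-%y"),
         ("2024.03.14", "%Y.%m.%d")]

# Normalize each to the preferred ISO 8601 yyyy-mm-dd format.
for raw, fmt in messy:
    iso = datetime.strptime(raw, fmt).date().isoformat()
    print(iso)  # 2024-03-14 each time
```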