Chapter 2 - Mastering the Data Flashcards
Cleaning data
generally means filling in missing values, finding rows with empty values, checking for duplicate records, and so on;
data transformation
converting data from one format to another, typically from the format of the source system to the format of the destination system.
4 benefits of data preparation
- catch errors before the actual process begins
- produce better quality of data
- processing better data = better insights
- better insights = better decisions
What are the 4 characteristics of a data analytics mindset?
- Asking the right questions
- ETL (extract, transform, and load) relevant data
- Apply appropriate DA techniques
- Interpret and share results with stakeholders
A.E.A.I
A delimiter (sometimes known as a field separator)
is a sequence of one or more characters specifying the boundary between distinct data attributes. For example, if we write a name as “Smith, David,” then the comma delimits, or separates, the last and first names. Any combination of characters can be used as delimiters, but the most common are a comma, tab, space, colon, and pipe (which is a vertical line, typed as |).
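As a minimal sketch, Python's built-in csv module can split records on any of these delimiters (the employee names and IDs below are hypothetical sample data):

```python
import csv
import io

# A small pipe-delimited sample (hypothetical data, for illustration only)
raw = "EmployeeID|LastName|FirstName\n1001|Smith|David\n1002|Jones|Mary\n"

# csv.reader splits each line on whatever delimiter we specify
rows = list(csv.reader(io.StringIO(raw), delimiter="|"))

print(rows[0])  # ['EmployeeID', 'LastName', 'FirstName']
print(rows[1])  # ['1001', 'Smith', 'David']
```

Swapping `delimiter="|"` for `","` or `"\t"` handles CSV and TSV files the same way.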
Proprietary file types
There are many different proprietary file types, but the most commonly used are .xls or .xlsx, the file types for saving Microsoft Excel documents. When proprietary file types save files, they use (underlying) coding to distinguish the rows and columns.
–Strength: The program “gets it right” when putting the data in the correct columns and rows.
–Weakness: Proprietary file types often cannot be opened in other software, and the number of records they hold can be restricted. For example, Excel files (currently) can hold approximately 1 million rows of data.
CSV
(comma-separated values)
TSV
(tab-separated values)
pipe
(|)
XBRL
eXtensible Business Reporting Language
is a freely available and global framework for exchanging business information. XBRL allows the expression of semantic meaning commonly required in business reporting.
text qualifier
When a delimiter (such as a comma) is used as a legitimate part of the text, you need to tell the program that it actually is part of the text. To do this, you add text qualifiers around the text. The most common text qualifier is the double quotation mark (").
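A small sketch of text qualifiers in action, again with the stdlib csv module (the names and departments are hypothetical):

```python
import csv
import io

# "Smith, David" contains a comma, so it is wrapped in double quotes
# (the text qualifier) to keep it together as a single field.
raw = '"Smith, David",Accounting\n"Lee, Ana",Audit\n'

rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))

print(rows[0])  # ['Smith, David', 'Accounting'] — the embedded comma survived
```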
Understand the importance of unique identifiers when dealing with data, and how they can save you time and prevent errors.
example: a location number combined with an employee ID number can serve as a unique identifier
5 main types of join
- inner join
- left join
- right join
- full outer join
- cross join
Inner join
The inner join (sometimes just called a join) combines data in two tables that match on one or more identified attributes. Importantly, an inner join will not pull data from the tables if there is no match on the identified attribute.
Left join - merging/joining data
The left join combines all data from the table listed on the left and only data that matches the identified attributes from the right table.
Right Join - merging/joining data
The right join functions similarly to the left join, except it keeps all data in the right table and only merges matching data from the left table. A left join and a right join will produce exactly the same results if you switch which tables are listed on the left or right.
full outer join
A full outer join returns all values from both tables when they match on a specified dimension, and then returns all values that do not match on that dimension with a null value for the non-matching fields.
Cross join (or Cartesian product)
A cross join (or Cartesian product) does not use any variable to match; rather, it pairs every single instance in one table with every instance of the other table.
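The joins above can be sketched with Python's built-in sqlite3 module (the employee and department tables below are hypothetical sample data; note that SQLite only added RIGHT and FULL OUTER JOIN in version 3.39, so this sketch sticks to inner, left, and cross joins):

```python
import sqlite3

# In-memory database with two small hypothetical tables.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (emp_id INTEGER, name TEXT, dept_id INTEGER)")
con.execute("CREATE TABLE departments (dept_id INTEGER, dept_name TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "Smith", 10), (2, "Jones", 20), (3, "Lee", None)])
con.executemany("INSERT INTO departments VALUES (?, ?)",
                [(10, "Audit"), (20, "Tax"), (30, "Advisory")])

# Inner join: only rows whose dept_id matches in both tables (Lee is dropped).
inner = con.execute("""SELECT e.name, d.dept_name
                       FROM employees e
                       JOIN departments d ON e.dept_id = d.dept_id""").fetchall()
print(len(inner))  # 2

# Left join: every employee is kept; Lee gets NULL (None) for dept_name.
left = con.execute("""SELECT e.name, d.dept_name
                      FROM employees e
                      LEFT JOIN departments d ON e.dept_id = d.dept_id""").fetchall()
print(len(left))   # 3

# Cross join: every employee paired with every department (3 x 3 rows).
cross = con.execute("SELECT * FROM employees CROSS JOIN departments").fetchall()
print(len(cross))  # 9
```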
Aggregation
Aggregation is the level at which the data is summarized. It can be at a low level (no aggregation is used) or at a high level (data is aggregated into a single number).
Data formats
Data can be formatted in many ways. A format specifies how the data should be treated. Common formats include treating data as a number, text, percent, scientific notation, etc. It is important to understand all of the different formats used in data and what they mean.
Similarly, different data formats often don’t “speak with each other.” If a unique identifier of 1731 is listed as a number in one data set but the identical number 1731 is listed as a text string in another data set, the tables will not merge correctly until the formats are the same. Each program processes formats differently, so making sure you understand how your program deals with formats is important.
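The 1731 example can be sketched in a few lines of Python (the company names and terms are hypothetical):

```python
# The same identifier stored as a number in one dataset and as text in another.
table_a = {1731: "Acme Supply"}   # key stored as an integer
table_b = {"1731": "Net 30"}      # key stored as a text string

key = 1731
print(key in table_a)        # True
print(key in table_b)        # False — the formats differ, so no match on merge

# Converting both sides to one format fixes the mismatch.
print(str(key) in table_b)   # True
```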
Common messy data problems
- Data formats
- Dates
- Duplicate and redundant data
- Units of measurement
- International differences
Creating a repeatable ETL process
- Data format
- Data scope
- Documentation
- Automation
Mastering the data can also be described via the ETL process. The ETL process stands for __________.
- Enter, transform and load
- Enter, total and load
- Extract, transform and load
- Extract, total and load
3
Which of the following is not a common way that data will need to be cleaned after extraction and validation?
- Remove heading and subtotals
- Remove trailing zeros
- Correct inconsistencies across data
- Format negative numbers
2
What is the purpose of transforming data?
- Validate the data for completeness and integrity
- Identify which data are necessary to complete the analysis
- Load the data into the appropriate tool for analysis
- Obtain the data from the appropriate sources
3
Which of these is not included in the five steps of the ETL process?
- Scrub the data
- Validate the data for completeness and integrity
- Determine the purpose and scope of the data request
- Obtain the data
1
What are attributes that exist in a relational database that are neither primary nor foreign keys?
- Relational table attributes
- Composite key
- Descriptive attributes
- Nondescript attributes
3
The advantages of storing data in a relational database include which of the following?
a - Help in enforcing business rules
b - Increased information redundancy
c - Integrating business processes
- All possible choices
- Only B and C
- Only A and B
- Only A and C
4
Which of the following is the metadata that describes each attribute in a database?
- Data dictionary
- Flat file
- Composite primary key
- Descriptive attributes
1
Which attribute is required to exist in each table of a relational database and serves as the “unique identifier” for each record in a table?
- Key attribute
- Foreign key
- Primary key
- Unique identifier
3
In data preparation, which step includes extracting the information from any source?
- Transform
- Gather
- Discover
- Enrich
2
Which delimiter does the AICPA recommend in their Audit Data Standards?
- Pipe or vertical line
- Space
- Tab
- Comma
1
6 Steps in data preparation
(according to food analogy article)
- Gather - extracting info from any source
- Discover - meaning/understanding the data
- Cleanse - missing values, outliers
- Transform - changing data format
- Enrich - enhancing raw data to improve insights
- Store - store or send for analysis
GDCTES (great days come to each son)
composite primary key
A special case of a primary key that exists in linking tables. The composite primary key is made up of the primary keys of the two tables that it links.
data dictionary
Centralized repository of descriptions for all of the data attributes of a dataset.
Data request form
A method for obtaining data if you do not have access to obtain the data directly yourself
Descriptive (or nonkey) attributes
Attributes that exist in relational databases that are neither primary nor foreign keys. These attributes provide business information, but are not required to build a database. An example would be “Company Name” or “Employee Address.”
ETL
Extract, transform and load process integral to mastering the data
Flat file
A means of storing data in one place, such as in an Excel spreadsheet, as opposed to storing the data in multiple tables, such as in a relational database
Foreign Key
An attribute that exists in relational databases in order to carry out the relationship between two tables. This does not serve as the “unique identifier” for each record in a table. These must be identified when mastering the data from a relational database in order to extract the data correctly from more than one table.
The foreign key is another type of attribute, and its function is to create the relationship between two tables
Mastering the data
The second step in the IMPACT cycle; it involves identifying and obtaining the data needed for solving the data analysis problem, as well as cleaning and preparing the data for analysis.
Primary key
An attribute that is required to exist in each table of a relational database and serves as the “unique identifier” for each record in a table
Relational Database
A means of storing data in order to ensure that the data are complete, not redundant, and to help enforce business rules. Relational databases also aid in communication and integration of business processes across an organization
Relational databases are made up of tables with uniquely identified records (this is done through primary keys) and are related through the usage of foreign keys
What are the 5 steps of the ETL process?
- Step 1 Determine the purpose and scope of the data request (extract).
- Step 2 Obtain the data (extract).
- Step 3 Validate the data for completeness and integrity (transform).
- Step 4 Sanitize the data (transform).
- Step 5 Load the data in preparation for data analysis (load).
RDBMS acronym and 3 examples
Relational Database Management Systems
- Microsoft Access
- SQLite
- Microsoft SQL Server
Microsoft Access
For any user of Microsoft products (Word, Excel, PowerPoint, etc.), the navigation of Microsoft Access is familiar, so it is a relatively easy entry point for working with relational databases. It is a great entry tool for learning how tables are related via primary and foreign keys, because entire databases can be built via a graphical user interface instead of using SQL statements to create tables and relationships.
SQLite
SQLite is an open-source solution to data management. For a user that is at least somewhat familiar with relational database management, it is a friendly tool, and presents an intuitive interface for writing SQL statements
open-source
denoting software for which the original source code is made freely available and may be redistributed and modified.
Microsoft SQL Server
Microsoft SQL Server can support enterprise-level data in ways that smaller RDBMS programs, such as Access and SQLite, cannot.
Working with SQL Server is meant to provide experience that replicates working with the much larger and more complex datasets you will likely encounter in the professional world.
Other examples of RDBMS
- Teradata
- MySQL
- Oracle RDBMS
- IBM DB2
- Amazon RDS
- PostgreSQL
Unified Modeling Language (UML) Class Diagram
is an illustration or a drawing of the tables and their relationships to each other (i.e., a database schema)
Benefits of a normalized relational database over a flat file (4):
- Completeness
- No redundancy
- Business rules are enforced
- Communication and integration of business processes
What is meant by these 4 benefits of using normalized relational databases over flat files
- Completeness
- No redundancy
- Business rules are enforced
- Communication and integration of business processes
- Completeness - ensures all data required for a business process are included in the dataset
- No redundancy - one version of the truth; the redundancy found in flat files takes up unnecessary space (which is expensive) and increases the risk of data-entry errors
- Business rules are enforced - allows for better placement and enforcement of internal controls in ways that flat files simply cannot
- Communication and integration of business processes - the design of relational databases supports business processes, which results in improved communication across functional areas
What are the three types of columns in a relational database table?
- Primary Key (PK)
- Foreign Key (FK)
- Descriptive Attributes
Procure-to-pay
Procure to pay is the process of requisitioning, purchasing, receiving, paying for and accounting for goods and services.
ADS developed by the AICPA
Audit Data Standards
While the ADSs provide an opportunity for standardization, they are voluntary
What is SQL? what are its uses and specific use related to DA?
SQL stands for structured query language
is a computer language that can be used to create, update, and delete records and tables in relational databases. In data analysis, however, the focus is on extracting data: selecting the precise attributes and records that fit the criteria of the data analysis goal.
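A minimal sketch of SQL used for extraction, run through Python's built-in sqlite3 module (the invoices table and the "unpaid invoices over 100 dollars" question are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE invoices (invoice_id INTEGER, amount REAL, paid INTEGER)")
con.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                [(1, 250.0, 1), (2, 900.0, 0), (3, 120.0, 0)])

# Extract only the attributes and records that fit the analysis goal:
# unpaid invoices over 100 dollars.
unpaid = con.execute("""SELECT invoice_id, amount
                        FROM invoices
                        WHERE paid = 0 AND amount > 100
                        ORDER BY invoice_id""").fetchall()
print(unpaid)  # [(2, 900.0), (3, 120.0)]
```

The SELECT clause picks the attributes and the WHERE clause picks the records, which is exactly the "extract" focus described above.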
Important Appendices in the Book
Appendices:
D & H for SQL - writing queries and creating joins
C for Excel VLOOKUP
What is the Vlookup function in Excel and what is it used for?
One of Excel’s most useful tools for looking up data in two separate tables and matching them based on a primary key/foreign key relationship is the VLOOKUP function. There are a variety of ways that VLOOKUP can be used, but for extracting and transforming data it is best used to add a column to a table.
When to use SQL vs. VLookup for ET?
SQL - pulling out specific information of interest to answer your biz question
VLOOKUP - for exploratory analysis where you don’t mind pulling the entire table of data. Note: Excel has the roughly 1-million-row limit, and large lookups can get very slow.
What 4 steps help validate data after Extraction?
- Compare the # of records that were extracted to the # of records in the source database.
- Compare descriptive statistics for numeric fields (calculating min, max, avg, medians - help ensure numeric data were extracted completely)
- Validate Date/Time Fields - converting to numeric and running descriptive statistics
- Compare string limits for text fields - ensure you haven’t cut off any characters
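The four validation steps above can be sketched with the stdlib statistics module (the record count, amounts, and names are hypothetical sample data):

```python
import statistics

# Hypothetical extract vs. known source record count.
source_count = 4
extracted = [250.0, 900.0, 120.0, 430.0]

# Step 1: compare the number of extracted records to the source.
print(len(extracted) == source_count)   # True

# Step 2: descriptive statistics for a numeric field.
print(min(extracted), max(extracted))   # 120.0 900.0
print(statistics.mean(extracted))       # 425.0
print(statistics.median(extracted))     # 340.0

# Step 4: check string limits on a text field against the source column width.
names = ["Acme Supply", "Globex"]
print(max(len(n) for n in names))       # 11
```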
Descriptive Statistics
Central Tendency
Variability
Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent.
Measures of central tendency include the mean, median and mode
measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness.[3]
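The central tendency and variability measures can be computed directly with Python's statistics module (the sample values below are arbitrary illustrative data):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency
print(statistics.mean(data))    # 5
print(statistics.median(data))  # 4.5
print(statistics.mode(data))    # 4

# Variability
print(statistics.pstdev(data))  # 2.0 (population standard deviation)
print(min(data), max(data))     # 2 9
```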
Kurtosis
In probability theory and statistics, kurtosis (from Greek: κυρτός, kyrtos or kurtos, meaning “curved, arching”) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution and there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. Different measures of kurtosis may have different interpretations.
Pearson kurtosis - higher kurtosis corresponds to greater extremity of deviations (or outliers), not to the configuration of data near the mean.
Moors’ kurtosis interpretation
Based on the standardized value Z = (X − mean)/SD, where X is the random variable.
Skewness
In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.
For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right.
A left-skewed distribution usually appears as a right-leaning curve.
A right-skewed distribution usually appears as a left-leaning curve.
The mass of the distribution is concentrated on the side opposite the name, since the terms left and right refer to where the tail is, not where the mass leans.
Sometimes approximated nonparametrically as:
(Mean − Median)/SD
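A quick check of the nonparametric approximation on a right-skewed sample (the data values are arbitrary illustrative numbers):

```python
import statistics

# A right-skewed sample: one large value pulls the mean above the median.
data = [1, 2, 2, 3, 3, 3, 4, 20]

mean = statistics.mean(data)      # 4.75
median = statistics.median(data)  # 3.0
sd = statistics.pstdev(data)

skew_approx = (mean - median) / sd
print(skew_approx > 0)  # True — positive, matching the long right tail
```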
What are 4 common ways data has to be cleaned after extraction and validation?
- Remove headings or subtotals
- Clean leading zeroes and nonprintable characters - happens when values are stored in the source database as text but need to be analyzed as numbers
- Format negative numbers
- Correct inconsistencies across data, in general
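Two of the cleaning steps above (leading zeros and negative-number formats) can be sketched with a small helper function; the accounting-style parenthesized negative is an assumed input format for illustration:

```python
def clean_number(raw):
    """Turn a messy extracted text value into a usable number (illustrative helper)."""
    value = raw.strip()                        # drop surrounding whitespace
    if value.startswith("(") and value.endswith(")"):
        return -float(value[1:-1])             # accounting negatives: (500) -> -500.0
    return float(value.lstrip("0") or "0")     # leading zeros: "007" -> 7.0

print(clean_number("(500)"))    # -500.0
print(clean_number("007"))      # 7.0
print(clean_number(" 12.5 "))   # 12.5
```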
5 main quality issues to look for when analyzing a data set for the first time:
- Dates - main issue is format; preferred format is yyyy-mm-dd (ISO 8601) (from big y to small d)
- Numbers - 1 vs. I, 0 vs. O, 7 vs. seven, stray $ signs, etc.; remove any extra characters to leave the raw number
- International characters and encoding - ASCII vs. Unicode, invisible computer characters (tabs, returns, line breaks, etc.)
- Languages and measures - cheese vs. fromage, pounds vs. lbs, dollars vs. euros
- Human error
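The date-format issue above can be sketched with the stdlib datetime module; the messy input strings and their assumed source formats are hypothetical:

```python
from datetime import datetime

# Hypothetical messy date strings paired with the formats they arrive in.
messy = [("03/14/2024", "%m/%d/%Y"),
         ("14-Mar-24", "%d-%b-%y"),
         ("2024.03.14", "%Y.%m.%d")]

# Normalize each to the preferred ISO 8601 yyyy-mm-dd format.
for raw, fmt in messy:
    iso = datetime.strptime(raw, fmt).date().isoformat()
    print(iso)  # 2024-03-14 each time
```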