Prepare - Clean, transform, and load data in Power BI Flashcards
What is shaping the data?
Shaping data is about cleaning, simplifying and organising data to meet your needs and tell your story
What are the 7 steps in shaping data?
Step 1 - Identify column headers and names
Check that the column headers are correct
Remove non-descriptive columns
Step 2 - Promote headers
Power Query sometimes treats every row as data and does not recognise the first row as column headers; promote that row to headers when this happens
Step 3 - Rename Columns
Check to make sure columns are named correctly
Adopt naming conventions
Step 4 - Remove top rows
Remove top rows (for example, blank or title rows) so the table starts at the actual data
Step 5 - Remove columns
Remove unnecessary columns
Step 6 - Unpivot columns (many to 2)
Changes column headers into rows (an attribute column and a value column)
Step 7 - Pivot columns (2 to many)
Changes rows back into column headers (see the M sketch below, which walks through several of these steps)
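A minimal M sketch covering several of these steps, using placeholder data (table and column names are illustrative; the blank top row is removed before the headers are promoted):

let
    // Placeholder data standing in for a raw import whose headers were not recognised
    Source = #table(
        {"Column1", "Column2", "Column3", "Column4"},
        {
            {null, null, null, null},
            {"Customer", "Jan", "Feb", "Internal Code"},
            {"Contoso", 100, 120, "X1"},
            {"Fabrikam", 90, 80, "X2"}
        }
    ),
    // Step 4: remove the blank top row
    RemovedTopRows = Table.Skip(Source, 1),
    // Step 2: promote the first remaining row to column headers
    PromotedHeaders = Table.PromoteHeaders(RemovedTopRows, [PromoteAllScalars = true]),
    // Step 3: rename columns to follow a naming convention
    Renamed = Table.RenameColumns(PromotedHeaders, {{"Customer", "Customer Name"}}),
    // Step 5: remove unnecessary columns
    RemovedColumns = Table.RemoveColumns(Renamed, {"Internal Code"}),
    // Step 6: unpivot the month columns (many columns become two)
    Unpivoted = Table.UnpivotOtherColumns(RemovedColumns, {"Customer Name"}, "Month", "Sales")
in
    Unpivoted

Step 7 (Pivot) is the reverse operation: Table.Pivot turns an attribute/value pair of columns back into many columns.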
What is best practice in shaping data?
Best practice
Remove unnecessary columns as early as possible, before the data reaches the data model, so that only the most important information is retained
Best steps are
- Removing Columns
- Unpivot
- Pivot
What is simplifying the data structure?
When you import data from multiple sources into Power BI Desktop, the data retains its predefined table and column names. You might want to change some of these names so that they are in a consistent format, easier to work with, and more meaningful to a user.
What are the 4 steps in simplifying the data structure?
Rename a query
It’s good practice to change uncommon or unhelpful query names to names that are more obvious or that the user is more familiar with.
Remove unnecessary suffixes and prefixes
Replace values
Replace values that are uninformative or misspelled
Replace null values
Data sources often contain a null (nothing entered) where a zero value is intended
Replacing these is important because calculations on null values will not behave as expected
Remove duplicates
Remove duplicates so that only unique values remain (see the M sketch after this list)
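A minimal M sketch of the replace and remove-duplicates steps, using placeholder data (the misspelling, the null, and the duplicate row are illustrative):

let
    Source = #table(
        {"CustomerID", "City", "Discount"},
        {
            {1, "Melborne", 0.1},
            {2, "Sydney", null},
            {2, "Sydney", null}
        }
    ),
    // Replace a misspelled value with the correct one
    FixedSpelling = Table.ReplaceValue(Source, "Melborne", "Melbourne", Replacer.ReplaceText, {"City"}),
    // Replace null with 0 so calculations behave as expected
    NoNulls = Table.ReplaceValue(FixedSpelling, null, 0, Replacer.ReplaceValue, {"Discount"}),
    // Keep only unique rows based on the key column
    Deduplicated = Table.Distinct(NoNulls, {"CustomerID"})
in
    Deduplicated

Renaming the query itself is done in the Queries pane (or the query settings), not in the M code.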
What is best practice for naming tables, columns, and values?
General
Use language and abbreviations commonly used within your organisation
Replace underscores with spaces
Give tables, columns, measures descriptive business terms
Be consistent with abbreviations and prefixes
Short abbreviations can cause confusion
Remove prefixes and suffixes
Values
Think about how values might appear on a report or visual
Values that are too long might be difficult to read on a visual
Values that are too short might be difficult to interpret
Try to avoid acronyms as much as possible
Why evaluate column data types?
When you import a table from any data source, Power BI Desktop automatically starts scanning the first 1,000 rows (default setting) and tries to detect the type of data in the columns.
Some situations might occur where Power BI Desktop does not detect the correct data type. Where incorrect data types occur, you will experience performance issues.
When are data detection problems worst?
Non-Database: You have a higher chance of getting data type errors when you are dealing with flat files, such as comma-separated values (.CSV) files and Excel workbooks (.XLSX), because data was entered manually into the worksheets and mistakes were made.
Databases: Conversely, in databases, the data types are predefined when tables or views are created.
What are the implications of incorrect data types?
When Power BI does not detect the correct data type, various problems can arise:
- Certain calculations are prevented
- Hierarchies cannot be derived
- Proper relationships with other tables cannot be created
- A date hierarchy cannot be created (the column is treated as a flat list of values rather than dates)
What are the best practice ways to evaluate columns?
Data type
Evaluate column data types before loading data into a Power BI data model
Format data
Apply a format to the column (e.g. currency, decimal)
Change the default summarisation if needed
Date
It is best practice to use a date table and turn off auto date/time, which removes the auto-generated hierarchy (see the M sketch below for setting column types)
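A minimal M sketch of setting explicit column data types with placeholder data (turning off auto date/time is done in Power BI Desktop's options rather than in M):

let
    Source = #table(
        {"OrderDate", "Amount"},
        {
            {"2024-01-15", "100.50"},
            {"2024-02-03", "87.20"}
        }
    ),
    // Set explicit data types so relationships, date hierarchies,
    // and calculations work as expected downstream
    Typed = Table.TransformColumnTypes(
        Source,
        {{"OrderDate", type date}, {"Amount", Currency.Type}},
        "en-US"
    )
in
    Typed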
Why combine multiple tables into a single table?
There are numerous reasons why you might need to combine queries or tables together.
Data model: too many tables exist, making it difficult to navigate an overly-complicated data model.
Redundant tables: a) several tables have a similar role OR b) table has only a column or two that can fit into a different table.
Single source of truth: reports and dashboards are derived from one-source-of-truth
Custom Column: You want to use several columns from different tables in a custom column
What are the two ways to combine tables?
Append (vertical): adds rows to the bottom of a table
Merge (horizontal): adds columns to the side of a table
How can you append tables?
Select the tables you want to append from “Available tables” and move them to “Tables to append”. You can append queries as a new query or append to an existing one.
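A minimal M sketch of an append, assuming two placeholder queries with the same column layout:

let
    Sales2023 = #table({"Product", "Amount"}, {{"A", 100}, {"B", 200}}),
    Sales2024 = #table({"Product", "Amount"}, {{"A", 150}, {"C", 50}}),
    // Append: stack the rows of one table underneath the other
    Appended = Table.Combine({Sales2023, Sales2024})
in
    Appended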
How can you merge tables?
Select the tables you want to merge and a matching column between them, then select one of the following join kinds (see the M sketch after this list):
Left Outer - Displays all rows from the first table and only the matching rows from the second.
Full Outer - Displays all rows from both tables.
Inner - Displays the matched rows between the two tables.
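A minimal M sketch of a merge using a left outer join, with placeholder tables and a hypothetical ProductID key:

let
    Sales = #table({"ProductID", "Amount"}, {{1, 100}, {2, 200}, {3, 50}}),
    Products = #table({"ProductID", "ProductName"}, {{1, "Bike"}, {2, "Helmet"}}),
    // Merge: keep all Sales rows and bring in matches from Products
    Merged = Table.NestedJoin(Sales, {"ProductID"}, Products, {"ProductID"}, "Products", JoinKind.LeftOuter),
    // Expand the nested table column to surface ProductName next to the Sales columns
    Expanded = Table.ExpandTableColumn(Merged, "Products", {"ProductName"})
in
    Expanded

Here ProductID 3 has no match, so its ProductName is null, which is the left outer behaviour described above.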
What is best practice when combining tables?
Step 1: Edit and format first
- Delete all extraneous columns
- Keep only the pertinent columns
- Rename columns so the different tables share the same column names
Step 2: Join the tables
What is profiling data in Power BI?
Profiling data is about evaluating data quality: determining anomalies, examining and developing the underlying data structures, and querying data statistics such as row counts, value distributions, minimum and maximum values, averages, and so on.
Why is profiling data important?
This concept is important because it allows you to shape and organize the data by:
- Removing outliers
- Finding anomalies
- Checking distributions
- Investigating specific data points
Explain the different profiling tools?
Column Distribution
Distribution visualisation at the top of columns
Displays the counts of distinct and unique values
Column Quality
Gives the percentage of valid, error and empty values
Column Profile
Gives a more in-depth look at the statistics within a column, based on the first 1,000 rows of data
Row count = can indicate whether the import was successful
Column statistics = summary statistics such as minimum, maximum, average, and zero counts
Value distribution = the count for each distinct value within the column
What is a common troubleshoot with profile tools?
Profiling results can be misleading when a column does not use the correct data type (e.g. numbers stored as text)
What is advanced editor?
- Each time you shape data in Power Query, you create a step in the Power Query process.
- Each cleaning step that you made was likely created by using the graphical interface, but Power Query uses the M language behind the scenes.
- These steps can be read in the Power Query Advanced Editor
How is M code written?
- Each step is one or two lines of code
- M code is written top-down (later steps refer to earlier steps by the name on the left of the equals sign)
- It is case sensitive
- A query starts with let and ends with in (see the sketch below)
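A minimal example showing that structure with placeholder data (misspelling a name, such as writing table.SelectRows instead of Table.SelectRows, produces an error because M is case sensitive):

let
    // Each step is named on the left of the equals sign...
    Source = #table({"Value"}, {{1}, {2}, {3}}),
    // ...and later steps refer to earlier steps by that name
    Doubled = Table.TransformColumns(Source, {{"Value", each _ * 2, type number}}),
    Filtered = Table.SelectRows(Doubled, each [Value] > 2)
in
    // The expression after "in" is what the query returns
    Filtered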
What is the difference between DAX and M code?
DAX
- App = Power BI Desktop
- Purpose = used to create new calculated columns and measures within Power BI Desktop
M-code (Mashup)
- App = language used by Power Query
- Purpose = (a) connect to data sources (b) simplify, combine, and transform data in Power Query Editor
What are table transforms?
Transformations applied to a whole table rather than to values within individual columns (think of the table icon versus the column icon shown on functions)
Name the table transforms?
- Promote or demote column headers
- Filter by row position
- Filter by values
- Choose or remove columns
- Group or summarise rows
- Unpivot or pivot
- Transpose
- Reverse rows
- Data types
- Dealing with errors
- Working with duplicates
What are column transforms?
Transforms values within columns
Name the column transforms?
- Fill values
- Sort columns
- Rename columns
- Move columns
- Replace values
- Parse text
- Extract text
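A minimal sketch of a few of these column transforms (fill values, text cleanup, sort, rename), using placeholder data:

let
    Source = #table(
        {"Region", "Product"},
        {
            {"North", "  bike  "},
            {null, "helmet"}
        }
    ),
    // Fill values: copy the value above into the nulls below it
    Filled = Table.FillDown(Source, {"Region"}),
    // Clean up text within a column (trim whitespace, capitalise)
    Cleaned = Table.TransformColumns(Filled, {{"Product", each Text.Proper(Text.Trim(_)), type text}}),
    // Sort by a column and rename it
    Sorted = Table.Sort(Cleaned, {{"Product", Order.Ascending}}),
    Renamed = Table.RenameColumns(Sorted, {{"Product", "Product Name"}})
in
    Renamed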
How do you record profile statistics?
- Column quality = copy quality metrics
- Column distribution = copy distribution
- Column profile = copy value distribution
What is default profile behaviour and what is the alternative?
- The default is to profile the first 1,000 rows (which is not a full reflection of the data)
- Change to the entire dataset for a more complete picture (this can slow the Power Query Editor down); see the M sketch below for a code-based alternative
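Profiling is normally toggled from the status bar of the Power Query Editor, but as an alternative sketch, M's Table.Profile function computes similar statistics over the whole table (placeholder data):

let
    Source = #table({"Amount"}, {{10}, {25}, {null}, {25}}),
    // Returns Min, Max, Average, StandardDeviation, Count, NullCount,
    // and DistinctCount for each column of the input table
    Profile = Table.Profile(Source)
in
    Profile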