Prepare - Clean, transform, and load data in Power BI Flashcards
What is shaping the data?
Shaping data is about cleaning, simplifying and organising data to meet your needs and tell your story
What are the 7 steps in shaping data?
Step 1 - Identify column headers and names
Check to make sure column heads are correct
Remove undescriptive columns
Step 2 - Promote headers
Power Query assumes that all data belongs in rows and sometimes does not recognise rows as headers
Step 3 - Rename Columns
Check to make sure columns are named correctly
Adopt name conventions
Step 4 - Remove top rows
Remove rows to make sure table has no blanks
Step 5 - Remove columns
Remove unnecessary columns
Step 6 - Unpivot columns (many to 2)
Changes headers to rows
Step 7 - Pivot columns (2 to many)
Change rows to headers
What is best practice in shaping data?
Best practice
Remove unnecessary columns as early as possible to retain the most important information (do this before data model)
Best steps are
- Removing Columns
- Unpivot
- Pivot
What is simplifying the data structure?
When you import data from multiple sources into Power BI Desktop, the data retains its predefined table and column names. You might want to change some of these names so that they are in a consistent format, easier to work with, and more meaningful to a user.
What are the 4 steps in simplifying the data structure?
Rename a query
It’s good practice to change uncommon or unhelpful query names to names that are more obvious or that the user is more familiar with.
Remove unnecessary suffix’s and prefix’s
Replace values
Some values are not very informative or misspelled
Replace null values
Often data sources may contain a null (nothing entered) instead of a zero value
This is very important because otherwise calculations will not work
Remove duplicates
Remove duplicates to keep unique values
What is best practice for naming tables, columns, and values?
General
Use language and abbreviations commonly used within your organisation
Replace underscores with spaces
Give tables, columns, measures descriptive business terms
Be consistent with abbreviations and prefaces
Short abbreviations can cause confusion
Remove prefixes and suffixes
Values
Think about how values might appear on a report or visual
Values that are too long might be difficult to read on a visual
Values that are too short might be difficult to interpret
Try to avoid acronyms as much as possible
Why evaluate column data types?
When you import a table from any data source, Power BI Desktop automatically starts scanning the first 1,000 rows (default setting) and tries to detect the type of data in the columns.
Some situations might occur where Power BI Desktop does not detect the correct data type. Where incorrect data types occur, you will experience performance issues.
When are data detection problems worst?
Non-Database: You have a higher chance of getting data type errors when you are dealing with flat files, such as comma-separated values (.CSV) files and Excel workbooks (.XLSX), because data was entered manually into the worksheets and mistakes were made.
Databases: Conversely, in databases, the data types are predefined when tables or views are created.
What are the implications of incorrect data types?
When Power BI does not detect the correct type various problems can arise:
Prevent certain calculations
Deriving hierarchies
Creating proper relationships with tables
Inability to create a date hierarchy (e.g. recognised as a list of dates)
What are the best practice ways to evaluate columns?
Datatype
Evaluate column data types before loading data into a Power BI Data Model
Format data
Apply format in the column (e.g., currency, decimal etc)
Change the summarisation default
Date
However, it is a best practice to use a date table and turn off the auto date/time to get rid of the auto generated hierarchy.
Why combine multiple tables into a single table?
There are numerous reasons why you might need to combine queries or tables together.
Data model: too many tables exist, making it difficult to navigate an overly-complicated data model.
Redundant tables: a) several tables have a similar role OR b) table has only a column or two that can fit into a different table.
Single source of truth: reports and dashboards are derived from one-source-of-truth
Custom Column: You want to use several columns from different tables in a custom column
What are the two ways to combine tables?
Append (horizontal): add rows to the bottom of the table
Merge (vertical): adding columns to the side of the table
How can you append tables?
Select the tables you want to append from “Available tables” and move to “Tables to append“. Can append queries as new table or as part of existing.
How can you merge tables?
Select the tables you want to merge and the column that is matching between the tables. Then select one of the following options:
Left Outer - Displays all rows from the first table and only the matching rows from the second.
Full Outer - Displays all rows from both tables.
Inner - Displays the matched rows between the two tables.
What is best practice when combining tables?
(1) Step 1: Edit/Format first
- Delete all extraneous columns
- Keep only the pertinent columns
- Rename so the different tables have the same named columns
(2) Step 2: Join tables
What is profiling data in Power BI?
Profiling data is about evaluating data quality: determining anomalies, examining and developing the underlying data structures, and querying data statistics such as row counts, value distributions, minimum and maximum values, averages, and so on.
Why is profiling data important?
This concept is important because it allows you to shape and organize the data by Remove outliers Find anomalies Check distributions Investigate specific data points
Explain the different profiling tools?
Column Distribution
Distribution visualisation at the top of columns
Displays the counts of distinct and unique values
Column Quality
Gives the percentage of valid, error and empty values
Column Profile
Column profile gives you a more in-depth look into the statistics within the columns for the first 1,000 rows of data.
Count of rows can indicate whether the import is successful or not
Column statistics = a more in-depth look into the statistics within the columns for the first 1,000 rows of data including, summary statics, zero counts etc
The Value distribution = the count for each distinct value within the column
What is a common troubleshoot with profile tools?
When you are not using the correct data type (e.g. numbers)
What is advanced editor?
- Each time you shape data in Power Query, you create a step in the Power Query process.
- Each cleaning step that you made was likely created by using the graphical interface, but Power Query uses the M language behind the scenes.
These steps are available to read by using the Power Query Advanced Editor
How is M code written?
- Each step is one or two lines of code
- M code is written top-down (e.g. later steps refer to earlier steps to the left of the equal sign)
- It is case sensitive
- Starts with let and ends with in
What is the difference between DAX and M code?
DAX
- App = Power BI desktop
- Purpose = used to transform and create new columns within Power BI desktop
M-code (Mashup)
- App = language used by Power Query
- Purpose = (a) used to connect data (b) simplify, combine and transform data in Power Query Editor
What are table transforms?
Transformation on whole tables, think about the column vs table icon in DAX
Name the table transforms?
Promote or demote column headers Filter by row position Filter by values Choose or remove columns Grouping or summarising rows Unpivot or Pivot Transpose Reverse rows Data types Dealing with errors Working with duplicates
What are column transforms?
Transforms values within columns
Name the column transforms?
Fill values Sort columns Rename columns Move columns Replace Values Parse text Extract Text
How do you record profile statistics?
- Column quality = copy quality metrics
- Column distribution = copy distribution
- Column profile = copy value distribution
What is default profile behaviour and what is the alternative?
- To check the first 1000 rows (however this isn’t a full reflection of data)
- Change to entire dataset so you can see a more complete picture (however this could slow Power Query editor down)
What does the column chart show in column profile?
- Shows shape of data in that column
- Is the distribution uniform
What is table schema?
- Prefixes and suffixes associated with particular database tables
Name some of the value distribution features?
- Can select (…) and group the values
What does simplifying the data mean?
- Resolve inconsistencies
- Unexpected or null values
- Data quality issues
- User-friendly values
Name some complex query types?
Nested cells:
- Errors
- Lists
- Records
- Attributes
- Tables
Selecting any of these links adds another query step
Why might Power Query use its autodetect feature?
- When the data source does not communicate the data type
How can you disable auto-detect?
- Options & Settings > options > CURRENT FILE > Data Loaded > Type Detection
Why might you disable auto-detect?
- Sometimes it can lead to incorrect formatting which has downstream implications
- Replace errors otherwise Power Query may convert to text
What are the drawbacks to data formatting in Power BI?
- Not all data types are available (e.g., percentage, currency etc)
- Further formatting can occur in Data view within Power BI desktop
What is a general rule of thumb when it comes to data formatting?
- If you are not going to perform mathematical calculations on it then it is not worth being numeric (e.g., Postcodes have numbers but will not be used in a calculations)
Why might you see errors even if you have corrected the data type?
- Power Query caches output rather than updating immediately
- Click on another query after data conversion to force changes
Why is locale important?
- Change date format to match local area
Explain the problem with incorrect locale?
- Incorrect locale can result in faulty data (e.g., 9/11 in US and 11/9 in UK)
- Power BI cannot detect this nuance
How might we use column quality?
- Allows us to check if the queried data has loaded correctly
- Get an initial check on data quality
- Nulls are not included in error count (e.g., null not recognised as text)
Are column profile tools always reliable?
- No because it shows the first 1000 rows
- Right-Click and change to entire dataset
What does the column distribution chart show?
- The shape of the data
Name an advanced feature of the column distribution tool?
- Group values to profile different values
How can you use replace values?
- Match entire cell: will only replace if the value is a complete match rather than just a part match
- Replace using special characters: null is an example of special character
How can you identify the root cause of query errors in Power Query?
- Checking nested contents (e.g., errors, lists, records and tables)
How can you use keys in Power BI?
Can combine tables in several ways:
- Combine tables (in Power Query or DAX)
- Create relationships (in Power BI in Model view)
Explain the similarities and differences of keys for joins and relationships?
Keys for joins
- Not necessary to create composite key
- Column data type important
- Can use multiple columns
Keys for relationships
- More forgiving of data type
- Only creates relationships on single pair of columns
- May need to merge columns first if you have composite key
Explain how to combine columns?
Two options
- 1) Create new column
- 2) Merge in place
n.b. column selection order important (order of text follows order of column)
How can you get an accurate row count?
- Data view will give an exact answer once loaded
What is a common way to break queries?
- When you change query steps
- The last steps query the previous steps
Explain what pivoting does?
- Creates a new column for each value
Explain what unpivoting does?
- Selected columns are converted into 2 columns
What does transpose do?
- Swaps the rows and columns
Explain the features of adding columns?
- Column from examples - Power Query attempts to write a transformation sequence that reflects your actions
- Custom column - type own M formula
- Invoke custom function - invokes a custom function for every row of a table
- Conditional column - specify if-then-else logic
- Index column - creates a sequential column that’s starts and increments the value you specify
- Duplicate column - copy of column you select
What are the table transformations?
- Reduce or keep rows
- Pivot or unpivot
- Adding columns
Name a troubleshoot with custom column?
- Creates a column without a data type so you have to specify the data type manually before loading to ensure it loads correctly
What does the double-arrows in a header do for Lists?
Expand list
- Expand to new rows: creates a new for each row inside the row’s list
- Extract values: keep row numbers the same and concatenate list rows into one cell with a separator you specify
How can you preview a nested cell?
Select the cell with the hyperlink
Name some nested function in M code?
- Braces = a list
- Double period = range
Explain append?
- Joins when there are multiple matching columns
- Columns which don’t line up with new table(s) will be filled with null values (table gets taller)
Explain how to use merge?
- Need a set of keys or matching values
- Specify which rows of the first table should be combined with which rows of the second table
Explain the different joins in merge? Give an example of how this works?
- Left outer = all from first match from second table
- Right outer = all from second matching from first table
- Full outer = all rows from both
- Inner = only matching rows
- Left anti = rows only in first
- Right anti = rows only in second
The merge joins are all about the a) columns that have matching values and b) the order in which you select the tables. So if there is 1112 in the first table and 1234 in the second table we will only see the additional columns for 1(3 times) and 2 (once) for a first outer join…Conversely a right outer join would be about which columns in the second match from columns in the first
What does the double-arrows in a header do for Tables?
Expands the table
- Expand: select the columns from the joined table that will be added to the current table (matching rows duplicated)
- Aggregate: will aggregate rows and won’t duplicate any rows…aggregating means you can apply statistical functions to the columns that have joined
What is camel case?
When words are lowercase to uppercase….camelCase
Why are naming conventions important?
- Report builders may want to browse data model fields and will need to have some intuitive understanding of what each field means (aim to make the fields understandable to a non-technical person),
Why use Advanced Editor?
- Although you can use the pre-built transformation buttons in Power Query Editor there is a lot more flexibility when you write M code
Explain the M code around the first initial query steps?
- Source = is the only step that is the same
- Navigation = all navigation steps are grouped in Power Query
- Dealing with spaces = when you have special characters (e.g., a space) for a command M uses quotation marks and a prefixes them with a hashtag (e.g., #”Changed Type”)
How can you see the M code for one query step at a time?
- View > Formula bar