Prepare - Clean, transform, and load data in Power BI Flashcards

1
Q

What is shaping the data?

A

Shaping data is about cleaning, simplifying and organising data to meet your needs and tell your story

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the 7 steps in shaping data?

A

Step 1 - Identify column headers and names
Check to make sure column heads are correct
Remove undescriptive columns

Step 2 - Promote headers
Power Query assumes that all data belongs in rows and sometimes does not recognise rows as headers

Step 3 - Rename Columns
Check to make sure columns are named correctly
Adopt name conventions

Step 4 - Remove top rows
Remove rows to make sure table has no blanks

Step 5 - Remove columns
Remove unnecessary columns

Step 6 - Unpivot columns (many to 2)
Changes headers to rows

Step 7 - Pivot columns (2 to many)
Change rows to headers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is best practice in shaping data?

A

Best practice
Remove unnecessary columns as early as possible to retain the most important information (do this before data model)

Best steps are

  • Removing Columns
  • Unpivot
  • Pivot
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is simplifying the data structure?

A

When you import data from multiple sources into Power BI Desktop, the data retains its predefined table and column names. You might want to change some of these names so that they are in a consistent format, easier to work with, and more meaningful to a user.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are the 4 steps in simplifying the data structure?

A

Rename a query
It’s good practice to change uncommon or unhelpful query names to names that are more obvious or that the user is more familiar with.
Remove unnecessary suffix’s and prefix’s

Replace values
Some values are not very informative or misspelled

Replace null values
Often data sources may contain a null (nothing entered) instead of a zero value
This is very important because otherwise calculations will not work

Remove duplicates
Remove duplicates to keep unique values

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is best practice for naming tables, columns, and values?

A

General
Use language and abbreviations commonly used within your organisation
Replace underscores with spaces
Give tables, columns, measures descriptive business terms
Be consistent with abbreviations and prefaces
Short abbreviations can cause confusion
Remove prefixes and suffixes

Values
Think about how values might appear on a report or visual
Values that are too long might be difficult to read on a visual
Values that are too short might be difficult to interpret
Try to avoid acronyms as much as possible

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Why evaluate column data types?

A

When you import a table from any data source, Power BI Desktop automatically starts scanning the first 1,000 rows (default setting) and tries to detect the type of data in the columns.

Some situations might occur where Power BI Desktop does not detect the correct data type. Where incorrect data types occur, you will experience performance issues.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

When are data detection problems worst?

A

Non-Database: You have a higher chance of getting data type errors when you are dealing with flat files, such as comma-separated values (.CSV) files and Excel workbooks (.XLSX), because data was entered manually into the worksheets and mistakes were made.
Databases: Conversely, in databases, the data types are predefined when tables or views are created.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are the implications of incorrect data types?

A

When Power BI does not detect the correct type various problems can arise:

Prevent certain calculations
Deriving hierarchies
Creating proper relationships with tables
Inability to create a date hierarchy (e.g. recognised as a list of dates)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are the best practice ways to evaluate columns?

A

Datatype
Evaluate column data types before loading data into a Power BI Data Model

Format data
Apply format in the column (e.g., currency, decimal etc)
Change the summarisation default

Date
However, it is a best practice to use a date table and turn off the auto date/time to get rid of the auto generated hierarchy.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Why combine multiple tables into a single table?

A

There are numerous reasons why you might need to combine queries or tables together.

Data model: too many tables exist, making it difficult to navigate an overly-complicated data model.
Redundant tables: a) several tables have a similar role OR b) table has only a column or two that can fit into a different table.
Single source of truth: reports and dashboards are derived from one-source-of-truth
Custom Column: You want to use several columns from different tables in a custom column

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are the two ways to combine tables?

A

Append (horizontal): add rows to the bottom of the table

Merge (vertical): adding columns to the side of the table

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How can you append tables?

A

Select the tables you want to append from “Available tables” and move to “Tables to append“. Can append queries as new table or as part of existing.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How can you merge tables?

A

Select the tables you want to merge and the column that is matching between the tables. Then select one of the following options:
Left Outer - Displays all rows from the first table and only the matching rows from the second.
Full Outer - Displays all rows from both tables.
Inner - Displays the matched rows between the two tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is best practice when combining tables?

A

(1) Step 1: Edit/Format first
- Delete all extraneous columns
- Keep only the pertinent columns
- Rename so the different tables have the same named columns

(2) Step 2: Join tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is profiling data in Power BI?

A

Profiling data is about evaluating data quality: determining anomalies, examining and developing the underlying data structures, and querying data statistics such as row counts, value distributions, minimum and maximum values, averages, and so on.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Why is profiling data important?

A
This concept is important because it allows you to shape and organize the data by
 Remove outliers 
 Find anomalies 
 Check distributions
 Investigate specific data points
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Explain the different profiling tools?

A

Column Distribution
Distribution visualisation at the top of columns
Displays the counts of distinct and unique values

Column Quality
Gives the percentage of valid, error and empty values

Column Profile
Column profile gives you a more in-depth look into the statistics within the columns for the first 1,000 rows of data.
Count of rows can indicate whether the import is successful or not
Column statistics = a more in-depth look into the statistics within the columns for the first 1,000 rows of data including, summary statics, zero counts etc
The Value distribution = the count for each distinct value within the column

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is a common troubleshoot with profile tools?

A

When you are not using the correct data type (e.g. numbers)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What is advanced editor?

A
  • Each time you shape data in Power Query, you create a step in the Power Query process.
  • Each cleaning step that you made was likely created by using the graphical interface, but Power Query uses the M language behind the scenes.

These steps are available to read by using the Power Query Advanced Editor

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

How is M code written?

A
  • Each step is one or two lines of code
  • M code is written top-down (e.g. later steps refer to earlier steps to the left of the equal sign)
  • It is case sensitive
  • Starts with let and ends with in
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What is the difference between DAX and M code?

A

DAX

  • App = Power BI desktop
  • Purpose = used to transform and create new columns within Power BI desktop

M-code (Mashup)

  • App = language used by Power Query
  • Purpose = (a) used to connect data (b) simplify, combine and transform data in Power Query Editor
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What are table transforms?

A

Transformation on whole tables, think about the column vs table icon in DAX

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Name the table transforms?

A
Promote or demote column headers
 Filter by row position 
 Filter by values 
 Choose or remove columns 
 Grouping or summarising rows
 Unpivot or Pivot
 Transpose 
 Reverse rows 
 Data types
 Dealing with errors 
 Working with duplicates
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

What are column transforms?

A

Transforms values within columns

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

Name the column transforms?

A
Fill values 
 Sort columns 
 Rename columns 
 Move columns 
 Replace Values 
 Parse text 
 Extract Text
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

How do you record profile statistics?

A
  • Column quality = copy quality metrics
  • Column distribution = copy distribution
  • Column profile = copy value distribution
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What is default profile behaviour and what is the alternative?

A
  • To check the first 1000 rows (however this isn’t a full reflection of data)
  • Change to entire dataset so you can see a more complete picture (however this could slow Power Query editor down)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

What does the column chart show in column profile?

A
  • Shows shape of data in that column

- Is the distribution uniform

30
Q

What is table schema?

A
  • Prefixes and suffixes associated with particular database tables
31
Q

Name some of the value distribution features?

A
  • Can select (…) and group the values
32
Q

What does simplifying the data mean?

A
  • Resolve inconsistencies
  • Unexpected or null values
  • Data quality issues
  • User-friendly values
33
Q

Name some complex query types?

A

Nested cells:

  • Errors
  • Lists
  • Records
  • Attributes
  • Tables

Selecting any of these links adds another query step

34
Q

Why might Power Query use its autodetect feature?

A
  • When the data source does not communicate the data type
35
Q

How can you disable auto-detect?

A
  • Options & Settings > options > CURRENT FILE > Data Loaded > Type Detection
36
Q

Why might you disable auto-detect?

A
  • Sometimes it can lead to incorrect formatting which has downstream implications
  • Replace errors otherwise Power Query may convert to text
37
Q

What are the drawbacks to data formatting in Power BI?

A
  • Not all data types are available (e.g., percentage, currency etc)
  • Further formatting can occur in Data view within Power BI desktop
38
Q

What is a general rule of thumb when it comes to data formatting?

A
  • If you are not going to perform mathematical calculations on it then it is not worth being numeric (e.g., Postcodes have numbers but will not be used in a calculations)
39
Q

Why might you see errors even if you have corrected the data type?

A
  • Power Query caches output rather than updating immediately
  • Click on another query after data conversion to force changes
40
Q

Why is locale important?

A
  • Change date format to match local area
41
Q

Explain the problem with incorrect locale?

A
  • Incorrect locale can result in faulty data (e.g., 9/11 in US and 11/9 in UK)
  • Power BI cannot detect this nuance
42
Q

How might we use column quality?

A
  • Allows us to check if the queried data has loaded correctly
  • Get an initial check on data quality
  • Nulls are not included in error count (e.g., null not recognised as text)
43
Q

Are column profile tools always reliable?

A
  • No because it shows the first 1000 rows

- Right-Click and change to entire dataset

44
Q

What does the column distribution chart show?

A
  • The shape of the data
45
Q

Name an advanced feature of the column distribution tool?

A
  • Group values to profile different values
46
Q

How can you use replace values?

A
  • Match entire cell: will only replace if the value is a complete match rather than just a part match
  • Replace using special characters: null is an example of special character
47
Q

How can you identify the root cause of query errors in Power Query?

A
  • Checking nested contents (e.g., errors, lists, records and tables)
48
Q

How can you use keys in Power BI?

A

Can combine tables in several ways:

  • Combine tables (in Power Query or DAX)
  • Create relationships (in Power BI in Model view)
49
Q

Explain the similarities and differences of keys for joins and relationships?

A

Keys for joins

  • Not necessary to create composite key
  • Column data type important
  • Can use multiple columns

Keys for relationships

  • More forgiving of data type
  • Only creates relationships on single pair of columns
  • May need to merge columns first if you have composite key
50
Q

Explain how to combine columns?

A

Two options

  • 1) Create new column
  • 2) Merge in place

n.b. column selection order important (order of text follows order of column)

51
Q

How can you get an accurate row count?

A
  • Data view will give an exact answer once loaded
52
Q

What is a common way to break queries?

A
  • When you change query steps

- The last steps query the previous steps

53
Q

Explain what pivoting does?

A
  • Creates a new column for each value
54
Q

Explain what unpivoting does?

A
  • Selected columns are converted into 2 columns
55
Q

What does transpose do?

A
  • Swaps the rows and columns
56
Q

Explain the features of adding columns?

A
  • Column from examples - Power Query attempts to write a transformation sequence that reflects your actions
  • Custom column - type own M formula
  • Invoke custom function - invokes a custom function for every row of a table
  • Conditional column - specify if-then-else logic
  • Index column - creates a sequential column that’s starts and increments the value you specify
  • Duplicate column - copy of column you select
57
Q

What are the table transformations?

A
  • Reduce or keep rows
  • Pivot or unpivot
  • Adding columns
58
Q

Name a troubleshoot with custom column?

A
  • Creates a column without a data type so you have to specify the data type manually before loading to ensure it loads correctly
59
Q

What does the double-arrows in a header do for Lists?

A

Expand list

  • Expand to new rows: creates a new for each row inside the row’s list
  • Extract values: keep row numbers the same and concatenate list rows into one cell with a separator you specify
60
Q

How can you preview a nested cell?

A

Select the cell with the hyperlink

61
Q

Name some nested function in M code?

A
  • Braces = a list

- Double period = range

62
Q

Explain append?

A
  • Joins when there are multiple matching columns

- Columns which don’t line up with new table(s) will be filled with null values (table gets taller)

63
Q

Explain how to use merge?

A
  • Need a set of keys or matching values

- Specify which rows of the first table should be combined with which rows of the second table

64
Q

Explain the different joins in merge? Give an example of how this works?

A
  • Left outer = all from first match from second table
  • Right outer = all from second matching from first table
  • Full outer = all rows from both
  • Inner = only matching rows
  • Left anti = rows only in first
  • Right anti = rows only in second

The merge joins are all about the a) columns that have matching values and b) the order in which you select the tables. So if there is 1112 in the first table and 1234 in the second table we will only see the additional columns for 1(3 times) and 2 (once) for a first outer join…Conversely a right outer join would be about which columns in the second match from columns in the first

65
Q

What does the double-arrows in a header do for Tables?

A

Expands the table

  • Expand: select the columns from the joined table that will be added to the current table (matching rows duplicated)
  • Aggregate: will aggregate rows and won’t duplicate any rows…aggregating means you can apply statistical functions to the columns that have joined
66
Q

What is camel case?

A

When words are lowercase to uppercase….camelCase

67
Q

Why are naming conventions important?

A
  • Report builders may want to browse data model fields and will need to have some intuitive understanding of what each field means (aim to make the fields understandable to a non-technical person),
68
Q

Why use Advanced Editor?

A
  • Although you can use the pre-built transformation buttons in Power Query Editor there is a lot more flexibility when you write M code
69
Q

Explain the M code around the first initial query steps?

A
  • Source = is the only step that is the same
  • Navigation = all navigation steps are grouped in Power Query
  • Dealing with spaces = when you have special characters (e.g., a space) for a command M uses quotation marks and a prefixes them with a hashtag (e.g., #”Changed Type”)
70
Q

How can you see the M code for one query step at a time?

A
  • View > Formula bar