2B. Clean, transform, and load data in Power BI Flashcards
What are some advantages of cleaning data?
- Measures and columns produce more accurate results when they perform aggregations and calculations.
- Tables are organized so that users can find data intuitively.
- Duplicates are removed, which makes data navigation simpler and produces columns that can be used in slicers and filters.
- A complicated column can be split into two simpler columns, and multiple columns can be combined into one column for readability.
- Codes and integers can be replaced with human-readable values.
What are some basic transformations you can do when you begin cleaning your data?
- Identify column headers and names
- Promote headers
- Rename columns
- Remove top rows
- Remove columns
- Unpivot columns
- Pivot columns
- Rename queries
- Replace values (e.g. spelling mistakes)
- Replace null values (only if nulls should be counted as zero, for example)
- Remove duplicates
- Change column data types
- Combine multiple tables into a single table
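Several of these steps can be pictured with a small sketch. It uses plain Python rather than Power Query, and the sample rows, column names, and the misspelling "Nroth" are invented for illustration only.

```python
# Hypothetical raw extract: a title row, a header row, then data rows.
raw = [
    ["Sales report", "", ""],        # title row that should be removed
    ["Region", "Product", "Units"],  # header row that should be promoted
    ["North", "Bike", "3"],
    ["Nroth", "Bike", "3"],          # spelling mistake in Region
    ["South", "Helmet", None],       # null that should count as zero here
]

body = raw[1:]                       # remove top rows
headers, data = body[0], body[1:]    # promote headers
rows = [dict(zip(headers, record)) for record in data]

for row in rows:
    if row["Region"] == "Nroth":     # replace values (spelling mistake)
        row["Region"] = "North"
    if row["Units"] is None:         # replace nulls, only because null means zero here
        row["Units"] = "0"
    row["Units"] = int(row["Units"]) # change the column data type from text to integer

# Remove duplicates while keeping the first occurrence of each row.
unique_rows = []
for row in rows:
    if row not in unique_rows:
        unique_rows.append(row)
```

After the spelling fix, the first two data rows become identical, so the duplicate is only detectable once the values are cleaned. The order of steps matters.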
What principle should guide you when thinking about which columns to keep and which to remove?
Examine each column and ask yourself if you really need the data that it contains. If you don’t plan on using that data in a report, the column adds no value to your semantic model. Therefore, the column should be removed. You can always add the column later, if your requirements change over time.
What is a benefit to unpivoting columns (i.e. a benefit of tall tables)?
It streamlines the process of creating DAX measures on the data later. With the values in one column and their labels in another, the data is also simpler to slice.
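As a rough illustration of what unpivoting does, here is a minimal Python sketch (not Power Query itself). The table, the column names "Month", "Year", and "Sales", and the `unpivot` helper are all hypothetical.

```python
# Hypothetical wide table: one column per year.
wide_rows = [
    {"Month": "Jan", "2018": 100, "2019": 120},
    {"Month": "Feb", "2018": 90, "2019": 110},
]

def unpivot(rows, id_columns, attribute_name="Attribute", value_name="Value"):
    """Turn every non-id column into its own (attribute, value) row."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns:
                continue
            new_row = {key: row[key] for key in id_columns}
            new_row[attribute_name] = column
            new_row[value_name] = value
            tall.append(new_row)
    return tall

tall_rows = unpivot(wide_rows, ["Month"], attribute_name="Year", value_name="Sales")

# One "Sales" column can now be aggregated and sliced by Year or by Month.
sales_2019 = sum(r["Sales"] for r in tall_rows if r["Year"] == "2019")
```

In the tall shape, a single sum over "Sales" covers every year, whereas the wide shape would need a separate expression per year column.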
What is a benefit to pivoting columns?
You can use the Pivot Column feature to convert your flat data into a table that contains an aggregate value for each unique value in a column. For example, you might want to use this feature to summarize data by using different math functions such as Count, Minimum, Maximum, Median, Average, or Sum.
In the Pivot Column window, select a column from the Values Column list, such as Subcategory name. Expand the advanced options, select an option from the Aggregate Value Function list, such as Count (All), and then select OK.
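The idea of one aggregate value per unique value can be sketched in plain Python. This is an analogy, not the Power Query feature; the sample rows, the "Amount" column, and the `pivot` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical flat table.
flat_rows = [
    {"Category": "Bikes", "Subcategory name": "Mountain Bikes", "Amount": 1200},
    {"Category": "Bikes", "Subcategory name": "Road Bikes", "Amount": 900},
    {"Category": "Clothing", "Subcategory name": "Gloves", "Amount": 25},
]

def pivot(rows, pivot_column, values_column, aggregate=len):
    """Group the values column by each unique value of the pivot column,
    then reduce each group with the aggregate function."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[pivot_column]].append(row[values_column])
    return {name: aggregate(values) for name, values in groups.items()}

# Counting rows per category, similar in spirit to Count (All).
subcategory_counts = pivot(flat_rows, "Category", "Subcategory name")

# Swapping the aggregate function gives a Sum-style summary instead.
amount_totals = pivot(flat_rows, "Category", "Amount", aggregate=sum)
```

Passing `min`, `max`, or `statistics.median` as the aggregate would mirror the other summary options in the same way.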
What are some best practices for naming tables, columns, and values?
Naming conventions for tables, columns, and values have no fixed rules. However, use the language and abbreviations that are common within your organization and that everyone agrees on as shared terminology.
A best practice is to give your tables, columns, and measures descriptive business terms and replace underscores (“_”) with spaces. Be consistent with abbreviations, prefixes, and words like “number” and “ID.” Excessively short abbreviations can cause confusion if they are not commonly used within the organization.
Removing prefixes or suffixes from table names and giving the tables simple names also helps avoid confusion.
When replacing values, try to imagine how those values will appear on the report. Values that are too long might be difficult to read and fit on a visual. Values that are too short might be difficult to interpret. Avoiding acronyms in values is also a good idea, provided that the text will fit on the visual.
What is the default number of rows that Power Query loads into preview (and therefore bases column analyses on)?
The top 1,000 rows
What files are more likely to contain data type errors?
Data type errors are more likely in flat files, such as comma-separated values (.CSV) files and Excel workbooks (.XLSX), because the data is often entered manually into worksheets, where mistakes are easy to make. In databases, by contrast, the data types are predefined when tables or views are created.
What are some implications of incorrect data types?
Incorrect data types can prevent you from creating certain calculations, deriving hierarchies, or creating proper relationships with other tables.
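A small Python sketch shows why a wrongly typed column blocks calculations. It is an analogy under stated assumptions: the CSV content, the "n/a" entry, and the `to_int_or_none` helper are made up for the example.

```python
import csv
import io

# Hypothetical CSV extract with one manually entered, invalid quantity.
raw_csv = "OrderID,Quantity\n1,10\n2,7\n3,n/a\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Every value read from a CSV file is text, so "10" + "7" would
# concatenate to "107" instead of adding to 17.
concatenated = rows[0]["Quantity"] + rows[1]["Quantity"]

def to_int_or_none(text):
    """Convert text to an integer, returning None when conversion fails."""
    try:
        return int(text)
    except ValueError:
        return None

typed_rows = [
    {"OrderID": int(r["OrderID"]), "Quantity": to_int_or_none(r["Quantity"])}
    for r in rows
]

# Rows whose Quantity could not be converted need attention before reporting.
error_order_ids = [r["OrderID"] for r in typed_rows if r["Quantity"] is None]
total_quantity = sum(r["Quantity"] for r in typed_rows if r["Quantity"] is not None)
```

Until the column holds real numbers, a sum is either impossible or silently wrong, which is the same problem a measure faces when a numeric column is typed as text.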
What are some reasons to combine tables into a single table?
- Too many tables exist, making it difficult to navigate an overly complicated semantic model.
- Several tables have a similar role.
- A table has only one or two columns, which could fit into a different table.
- You want to use several columns from different tables in a custom column.
What are the two ways of combining tables?
- Merging: adding columns from one table (or query) into another.
- Appending: adding rows of data to another table (or query).
What is required when appending tables?
The columns that you need in the combined table must have the same names in the original tables, so that their rows line up in one consolidated view.
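The effect of matching column names can be sketched in plain Python. This is an illustration, not Power Query; the tables, the mismatched "Amt" column, and the `append_tables` helper are hypothetical. The sketch assumes that a column missing from one table is filled with nulls.

```python
def append_tables(*tables):
    """Stack rows from all tables, using the union of their column names.
    Columns that a table lacks are filled with None."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]

sales_2023 = [{"Product": "A", "Amount": 10}]
sales_2024_misnamed = [{"Product": "B", "Amt": 20}]  # "Amt" does not match "Amount"
sales_2024_renamed = [{"Product": "B", "Amount": 20}]

mismatched = append_tables(sales_2023, sales_2024_misnamed)
consolidated = append_tables(sales_2023, sales_2024_renamed)
```

With mismatched names the result has two half-empty columns, while the renamed version yields a single "Amount" column that covers both years.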
What is required when merging tables?
You must have a column that is the key between the two tables. This process is similar to the JOIN clause in SQL.
What are the three main join options when merging tables?
- Left Outer - Displays all rows from the first table and only the matching rows from the second.
- Full Outer - Displays all rows from both tables.
- Inner - Displays only the rows that have a match in both tables.
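The three join options can be compared with a minimal Python sketch. It is a simplified analogy: the tables, the "CustomerID" key, and the `merge` helper are invented for illustration, and the result is flattened directly rather than expanded from a nested column.

```python
def merge(left, right, key, kind="left_outer"):
    """Join two lists of row dictionaries on a shared key column.
    kind is one of "left_outer", "full_outer", or "inner"."""
    left_columns = list(left[0]) if left else []
    right_columns = [c for c in (right[0] if right else {}) if c != key]
    result = []
    matched_right = set()
    for left_row in left:
        matches = [i for i, r in enumerate(right) if r[key] == left_row[key]]
        matched_right.update(matches)
        for i in matches:
            result.append({**left_row, **{c: right[i][c] for c in right_columns}})
        # Outer joins keep unmatched left rows, padded with nulls.
        if not matches and kind in ("left_outer", "full_outer"):
            result.append({**left_row, **{c: None for c in right_columns}})
    if kind == "full_outer":
        # A full outer join also keeps right rows that nothing matched.
        for i, right_row in enumerate(right):
            if i not in matched_right:
                row = {c: None for c in left_columns}
                row.update(right_row)
                result.append(row)
    return result

orders = [
    {"CustomerID": 1, "Order": "A"},
    {"CustomerID": 2, "Order": "B"},
    {"CustomerID": 4, "Order": "C"},  # no matching customer
]
customers = [
    {"CustomerID": 1, "Name": "Ana"},
    {"CustomerID": 2, "Name": "Ben"},
    {"CustomerID": 3, "Name": "Cy"},  # no matching order
]

left_outer = merge(orders, customers, "CustomerID", "left_outer")
full_outer = merge(orders, customers, "CustomerID", "full_outer")
inner = merge(orders, customers, "CustomerID", "inner")
```

Here the left outer join returns three rows, the full outer join four, and the inner join two, which matches the descriptions above.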
What is profiling data about, and why is it important?
It is about studying the nuances of the data: determining anomalies, examining and developing the underlying data structures, and querying data statistics such as row counts, value distributions, minimum and maximum values, averages, and so on.
This concept is important because it lets you shape and organize the data so that working with it and identifying its distribution are uncomplicated. That, in turn, makes developing report elements on the front end much easier.
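The statistics mentioned above can be computed for a single column with a short Python sketch. It is only an illustration of the idea; the sample values and the `profile` helper are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical numeric column containing one null.
unit_prices = [12, 7, None, 7, 30, 7]

def profile(column):
    """Return basic profiling statistics for one column of values."""
    present = [value for value in column if value is not None]
    return {
        "count": len(column),                   # row count
        "empty": len(column) - len(present),    # nulls
        "distinct": len(set(present)),          # distinct non-null values
        "min": min(present),
        "max": max(present),
        "average": mean(present),
        "distribution": Counter(present),       # how often each value occurs
    }

stats = profile(unit_prices)
```

A value such as 30, which is far from the rest, or a nonzero "empty" count is the kind of anomaly that profiling is meant to surface before the data reaches a report.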