Section 6.2 - Data Cleaning Flashcards

1
Q

You don’t always have control over the format and type of data that you import from an external data source, such as a database, text file, or a Web page. Before you can analyze the data, you often need to clean it up. Fortunately, Excel has many features to help you get data in the precise format that you want. Sometimes, the task is straightforward and there is a specific feature that does the job for you. For example, you can easily use Spell Checker to clean up misspelled words in columns that contain comments or descriptions. Or, if you want to remove duplicate rows, you can quickly do this by using the Remove Duplicates dialog box.

At other times, you may need to manipulate one or more columns by using a formula to convert the imported values into new values. For example, if you want to remove trailing spaces, you can create a new column to clean the data by using a formula, filling down the new column, converting that new column’s formulas to values, and then removing the original column.

The basic steps for cleaning data are as follows:

A

Import the data from an external data source.

Create a backup copy of the original data in a separate workbook.

Ensure that the data is in a tabular format of rows and columns with: similar data in each column, all columns and rows visible, and no blank rows within the range. For best results, use an Excel table.

Do tasks that don’t require column manipulation first, such as spell-checking or using the Find and Replace dialog box.

Next, do tasks that do require column manipulation. The general steps for manipulating a column are:

Insert a new column (B) next to the original column (A) that needs cleaning.

Add a formula that will transform the data at the top of the new column (B).

Fill down the formula in the new column (B). In an Excel table, a calculated column is automatically created with values filled down.

Select the new column (B), copy it, and then paste as values into the new column (B).

Remove the original column (A), which converts the new column from B to A.

To periodically clean the same data source, consider recording a macro or writing code to automate the entire process. There are also a number of external add-ins written by third-party vendors, listed in the Third-party providers section, that you can consider using if you don’t have the time or resources to automate the process on your own.
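For readers doing the same clean-up outside Excel, here is a minimal pandas sketch of the same column-manipulation workflow. The file names and the “name” column are illustrative, not part of the original steps.

import pandas as pd

df = pd.read_csv("imported_data.csv")        # import the data from an external source
df.to_csv("backup_copy.csv", index=False)    # keep a backup copy of the original data
df["name_clean"] = df["name"].str.strip()    # new helper column (B) built with a "formula"
df["name"] = df["name_clean"]                # overwrite the original column (A) with the cleaned values
df = df.drop(columns=["name_clean"])         # remove the helper column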

2
Q

Spell Checking

A

You can use a spell checker to not only find misspelled words, but to find values that are not used consistently, such as product or company names, by adding those values to a custom dictionary.

3
Q

Removing duplicate rows

A

Duplicate rows are a common problem when you import data. It is a good idea to filter for unique values first to confirm that the results are what you want before you remove duplicate values.

4
Q

Finding and replacing text

A

You may want to remove a common leading string, such as a label followed by a colon and space, or a suffix, such as a parenthetic phrase at the end of the string that is obsolete or unnecessary. You can do this by finding instances of that text and then replacing it with no text or other text.

5
Q

Changing the case of text

A

Sometimes text comes in a mixed bag, especially where the case of text is concerned. Using one or more of the three case functions (LOWER, UPPER, and PROPER), you can convert text to lowercase letters, such as e-mail addresses; uppercase letters, such as product codes; or proper case, such as names or book titles.
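A rough pandas equivalent, assuming a DataFrame df with illustrative column names:

df["email"] = df["email"].str.lower()                 # e-mail addresses to lowercase
df["product_code"] = df["product_code"].str.upper()   # product codes to uppercase
df["full_name"] = df["full_name"].str.title()         # names or book titles to proper case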

6
Q

Removing spaces and nonprinting characters from text

A

Sometimes text values contain leading, trailing, or multiple embedded space characters (Unicode character set values 32 and 160), or nonprinting characters (Unicode character set values 0 to 31, 127, 129, 141, 143, 144, and 157). These characters can sometimes cause unexpected results when you sort, filter, or search. For example, in the external data source, users may make typographical errors by inadvertently adding extra space characters, or imported text data from external sources may contain nonprinting characters that are embedded in the text. Because these characters are not easily noticed, the unexpected results may be difficult to understand. To remove these unwanted characters, you can use a combination of the TRIM, CLEAN, and SUBSTITUTE functions.
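A rough pandas sketch of the same clean-up. It is a simplification that only strips the common control characters, and the “description” column name is illustrative.

df["description"] = (
    df["description"]
    .str.replace("\u00a0", " ", regex=False)          # non-breaking spaces (character 160) to normal spaces
    .str.replace(r"[\x00-\x1f\x7f]", "", regex=True)  # remove nonprinting control characters
    .str.replace(r"\s+", " ", regex=True)             # collapse runs of embedded spaces
    .str.strip()                                      # trim leading and trailing spaces
)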

7
Q

Fixing numbers and number signs

A

There are two main issues with numbers that may require you to clean the data: the number was inadvertently imported as text, and the negative sign needs to be changed to the standard for your organization.

8
Q

Fixing dates and times

A

Because there are so many different date formats, and because these formats may be confused with numbered part codes or other strings that contain slash marks or hyphens, dates and times often need to be converted and reformatted.

9
Q

Merging and splitting columns

A

A common task after importing data from an external data source is to either merge two or more columns into one, or split one column into two or more columns. For example, you may want to split a column that contains a full name into a first and last name. Or, you may want to split a column that contains an address field into separate street, city, region, and postal code columns. The reverse may also be true. You may want to merge a First and Last Name column into a Full Name column, or combine separate address columns into one column. Additional common values that may require merging into one column or splitting into multiple columns include product codes, file paths, and Internet Protocol (IP) addresses.
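A minimal pandas sketch of both directions, assuming a DataFrame df with illustrative column names:

# split a full name into first and last name
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# merge separate columns back into one
df["full_name"] = df["first_name"] + " " + df["last_name"]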

10
Q

Transforming and rearranging columns and rows

A

Most of the analysis and formatting features in Office Excel assume that the data exists in a single, flat two-dimensional table. Sometimes you may want to make the rows become columns, and the columns become rows. At other times, data is not even structured in a tabular format, and you need a way to transform the data from a nontabular to a tabular format.
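A minimal pandas sketch of both operations; the “region” identifier and the per-month layout are illustrative assumptions.

transposed = df.T    # rows become columns and columns become rows

# unpivot a cross-tab layout (one column per month) into a tabular format
tidy = df.melt(id_vars=["region"], var_name="month", value_name="sales")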

11
Q

Reconciling table data by joining or matching

A

Occasionally, database administrators use Office Excel to find and correct matching errors when two or more tables are joined. This might involve reconciling two tables from different worksheets, for example, to see all records in both tables or to compare tables and find rows that don’t match.
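A minimal pandas sketch of this kind of reconciliation, assuming two tables table_a and table_b that share an illustrative "customer_id" key:

merged = table_a.merge(table_b, on="customer_id", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]   # rows that exist in only one of the two tables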

12
Q

Data cleansing or data cleaning

A

The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database; it involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.[1] Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting or a data quality firewall.

13
Q

After cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores.

A

Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data.

14
Q

The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code), or with fuzzy or approximate string matching (such as correcting records that partially match existing, known records).

A

Some data cleansing solutions will clean data by cross-checking with a validated data set. A common data cleansing practice is data enhancement, where data is made more complete by adding related information. For example, appending addresses with any phone numbers related to that address. Data cleansing may also involve harmonization (or normalization) of data, which is the process of bringing together data of “varying file formats, naming conventions, and columns”,[2] and transforming it into one cohesive data set; a simple example is the expansion of abbreviations (“st, rd, etc.” to “street, road, etcetera”).

15
Q

Motivation

A

Administratively incorrect, inconsistent data can lead to false conclusions and misdirect investments on both public and private scales. For instance, the government may want to analyze population census figures to decide which regions require further spending and investment on infrastructure and services. In this case, it will be important to have access to reliable data to avoid erroneous fiscal decisions. In the business world, incorrect data can be costly. Many companies use customer information databases that record data like contact information, addresses, and preferences. For instance, if the addresses are inconsistent, the company will suffer the cost of resending mail or even losing customers.

16
Q

Validity: The degree to which the measures conform to defined business rules or constraints (see also Validity (statistics)). When modern database technology is used to design data-capture systems, validity is fairly easy to ensure: invalid data arises mainly in legacy contexts (where constraints were not implemented in software) or where inappropriate data-capture technology was used (e.g., spreadsheets, where it is very hard to limit what a user chooses to enter into a cell, if cell validation is not used). Data constraints fall into the following categories:

A

Data-Type Constraints: values in a particular column must be of a particular data type, e.g., Boolean, numeric (integer or real), date, etc.

Range Constraints: typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.

Mandatory Constraints: Certain columns cannot be empty.

Unique Constraints: A field, or a combination of fields, must be unique across a dataset. For example, no two persons can have the same social security number.

Set-Membership constraints: The values for a column come from a set of discrete values or codes. For example, a person’s sex may be Female, Male or Non-Binary.

Foreign-key constraints: This is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values. For example, in a US taxpayer database, the “state” column is required to belong to one of the US’s defined states or territories: the set of permissible states/territories is recorded in a separate State table. The term foreign key is borrowed from relational database terminology.

Regular expression patterns: Occasionally, text fields will have to be validated this way. For example, phone numbers may be required to have the pattern (999) 999-9999.

Cross-field validation: Certain conditions that utilize multiple fields must hold. For example, in laboratory medicine, the sum of the components of the differential white blood cell count must be equal to 100 (since they are all percentages). In a hospital database, a patient’s date of discharge from the hospital cannot be earlier than the date of admission.
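A minimal sketch of checking a few of the constraint categories above with pandas. All column names and the valid_states set are illustrative assumptions, not part of the original text.

import pandas as pd

valid_states = {"CA", "NY", "TX"}   # illustrative subset of permissible codes

checks = pd.DataFrame({
    "bad_type":   pd.to_numeric(df["amount"], errors="coerce").isna(),             # data-type constraint
    "bad_range":  ~df["age"].between(0, 120),                                      # range constraint
    "missing":    df["customer_id"].isna(),                                        # mandatory constraint
    "duplicated": df["ssn"].duplicated(keep=False),                                # unique constraint
    "bad_state":  ~df["state"].isin(valid_states),                                 # set-membership / foreign-key constraint
    "bad_phone":  ~df["phone"].str.match(r"^\(\d{3}\) \d{3}-\d{4}$", na=True),     # regular-expression pattern
})
rows_with_errors = df[checks.any(axis=1)]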

17
Q

Accuracy

A

The degree of conformity of a measure to a standard or a true value - see also Accuracy and precision. Accuracy is very hard to achieve through data-cleansing in the general case because it requires accessing an external source of data that contains the true value: such “gold standard” data is often unavailable. Accuracy has been achieved in some cleansing contexts, notably customer contact data, by using external databases that match up zip codes to geographical locations (city and state) and also help verify that street addresses within these zip codes actually exist.

18
Q

Completeness

A

The degree to which all required measures are known. Incompleteness is almost impossible to fix with data cleansing methodology: one cannot infer facts that were not captured when the data in question was initially recorded. (In some contexts, e.g., interview data, it may be possible to fix incompleteness by going back to the original source of data, i.e. re-interviewing the subject, but even this does not guarantee success because of problems of recall - e.g., in an interview to gather data on food consumption, no one is likely to remember exactly what one ate six months ago. In the case of systems that insist certain columns should not be empty, one may work around the problem by designating a value that indicates “unknown” or “missing”, but the supplying of default values does not imply that the data has been made complete.)

19
Q

Consistency

A

The degree to which a set of measures is equivalent across systems (see also Consistency). Inconsistency occurs when two data items in the data set contradict each other: e.g., a customer is recorded in two different systems as having two different current addresses, and only one of them can be correct. Fixing inconsistency is not always possible: it requires a variety of strategies - e.g., deciding which data were recorded more recently, which data source is likely to be most reliable (the latter knowledge may be specific to a given organization), or simply trying to find the truth by testing both data items (e.g., calling up the customer).

20
Q

Uniformity

A

The degree to which a set of data measures is specified using the same units of measure in all systems (see also Unit of measure). In datasets pooled from different locales, weight may be recorded either in pounds or kilos and must be converted to a single measure using an arithmetic transformation.

21
Q

Integrity

A

encompasses accuracy, consistency and some aspects of validation (see also data integrity) but is rarely used by itself in data-cleansing contexts because it is insufficiently specific. (For example, “referential integrity” is a term used to refer to the enforcement of foreign-key constraints above.)

22
Q

Data auditing

A

The data is audited with the use of statistical and database methods to detect anomalies and contradictions: this eventually indicates the characteristics of the anomalies and their locations. Several commercial software packages will let you specify constraints of various kinds (using a grammar that conforms to that of a standard programming language, e.g., JavaScript or Visual Basic) and then generate code that checks the data for violations of these constraints. This process is referred to below in the bullets “workflow specification” and “workflow execution.” For users who lack access to high-end cleansing software, microcomputer database packages such as Microsoft Access or FileMaker Pro will also let you perform such checks, on a constraint-by-constraint basis, interactively, with little or no programming required in many cases.

23
Q

Workflow specification

A

The detection and removal of anomalies are performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high-quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered.

24
Q

Workflow execution

A

In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient, even on large sets of data, which inevitably poses a trade-off because the execution of a data-cleansing operation can be computationally expensive.

25
Q

Post-processing and controlling

A

After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during the execution of the workflow is manually corrected, if possible. The result is a new cycle in the data-cleansing process where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing.

26
Q

Good quality source data has to do with “Data Quality Culture” and must be initiated at the top of the organization. It is not just a matter of implementing strong validation checks on input screens, because no matter how strong these checks are, they can often still be circumvented by users. There is a nine-step guide for organizations that wish to improve data quality:

A

Declare a high-level commitment to a data quality culture

Drive process reengineering at the executive level

Spend money to improve the data entry environment

Spend money to improve application integration

Spend money to change how processes work

Promote end-to-end team awareness

Promote interdepartmental cooperation

Publicly celebrate data quality excellence

Continuously measure and improve data quality

27
Q

Parsing:

A

for the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification. This is similar to the way a parser works with grammars and languages.

28
Q

Data Transformation

A

Data transformation allows the mapping of the data from its given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values.

29
Q

Duplicate elimination

A

Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that would bring duplicate entries closer together for faster identification.
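A minimal pandas sketch of this, where the "customer_id" key is an illustrative choice of sort/duplicate key:

df = df.sort_values("customer_id")                                  # sort by a key so duplicates sit next to each other
candidates = df[df.duplicated(subset=["customer_id"], keep=False)]  # inspect the candidate duplicates first
df = df.drop_duplicates(subset=["customer_id"], keep="first")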

30
Q

Statistical methods

A

By analyzing the data using the values of mean, standard deviation, range, or clustering algorithms, it is possible for an expert to find values that are unexpected and thus erroneous. Although the correction of such data is difficult since the true value is not known, it can be resolved by setting the values to an average or other statistical value. Statistical methods can also be used to handle missing values which can be replaced by one or more plausible values, which are usually obtained by extensive data augmentation algorithms.
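A small sketch of the mean/standard-deviation approach in pandas; the "amount" column and the 3-sigma threshold are illustrative choices.

mean, std = df["amount"].mean(), df["amount"].std()
suspect = (df["amount"] - mean).abs() > 3 * std    # values far from the mean are flagged as suspect
df.loc[suspect, "amount"] = mean                   # one option: replace them with a statistical value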

31
Q

System

A

The essential job of this system is to find a suitable balance between fixing dirty data and maintaining the data as close as possible to the original data from the source production system. This is a challenge for the extract, transform, load (ETL) architect. The system should offer an architecture that can cleanse data, record quality events, and measure/control the quality of data in the data warehouse. A good start is to perform a thorough data profiling analysis that will help define the required complexity of the data cleansing system and also give an idea of the current data quality in the source system(s).

32
Q

Part of the data cleansing system is a set of diagnostic filters known as quality screens. They each implement a test in the data flow that, if it fails, records an error in the Error Event Schema. Quality screens are divided into three categories:

A

Column screens. Testing the individual column, e.g. for unexpected values like NULL values; non-numeric values that should be numeric; out of range values; etc.

Structure screens. These are used to test for the integrity of different relationships between columns (typically foreign/primary keys) in the same or different tables. They are also used for testing that a group of columns is valid according to some structural definition to which it should adhere.

Business rule screens. The most complex of the three tests. They test to see if data, maybe across multiple tables, follow specific business rules. An example could be that if a customer is marked as a certain type of customer, the business rules that define this kind of customer should be adhered to.

33
Q

When a quality screen records an error, it can either stop the dataflow process, send the faulty data somewhere other than the target system, or tag the data.

A

The latter option is considered the best solution, because the first requires that someone manually deal with the issue each time it occurs, and the second implies that data are missing from the target system (harming integrity) and it is often unclear what should happen to these data.

34
Q

Most data cleansing tools have limitations in usability:

A

Project costs: costs typically in the hundreds of thousands of dollars

Time: mastering large-scale data-cleansing software is time-consuming

Security: cross-validation requires sharing information, giving an application access across systems, including sensitive legacy systems

35
Q

Error event schema

A

The error event schema holds records of all error events thrown by the quality screens. It consists of an error event fact table with foreign keys to three dimension tables that represent the date (when), the batch job (where), and the screen (which screen produced the error). It also holds information about exactly when the error occurred and the severity of the error. There is also an error event detail fact table, with a foreign key to the main table, that records in which table, record, and field the error occurred, along with the error condition.

36
Q

Validity

A

The degree to which the data conform to defined business rules or constraints.

Data-Type Constraints: values in a particular column must be of a particular datatype, e.g., boolean, numeric, date, etc.

Range Constraints: typically, numbers or dates should fall within a certain range.

Mandatory Constraints: certain columns cannot be empty.

Unique Constraints: a field, or a combination of fields, must be unique across a dataset.

Set-Membership constraints: values of a column come from a set of discrete values, e.g. enum values. For example, a person’s gender may be male or female.

Foreign-key constraints: as in relational databases, a foreign key column can’t have a value that does not exist in the referenced primary key.

Regular expression patterns: text fields that have to be in a certain pattern. For example, phone numbers may be required to have the pattern (999) 999-9999.

Cross-field validation: certain conditions that span across multiple fields must hold. For example, a patient’s date of discharge from the hospital cannot be earlier than the date of admission.

37
Q

Accuracy

A

The degree to which the data is close to the true values.

While defining all possible valid values allows invalid values to be easily spotted, it does not mean that they are accurate.

A valid street address might not actually exist. A person’s eye colour, say blue, might be a valid value, but not true (it doesn’t represent reality).

Another thing to note is the difference between accuracy and precision. Saying that you live on Earth is actually true, but not precise. Where on Earth? Saying that you live at a particular street address is more precise.

38
Q

Completeness

A

The degree to which all required data is known.

Missing data is going to happen for various reasons. One can mitigate this problem by questioning the original source if possible, say re-interviewing the subject.

Chances are, the subject is either going to give a different answer or will be hard to reach again.

39
Q

Consistency

A

The degree to which the data is consistent, within the same data set or across multiple data sets.

Inconsistency occurs when two values in the data set contradict each other.

A valid age, say 10, might not match the marital status, say divorced. A customer might be recorded in two different tables with two different addresses.

Which one is true?

40
Q

Uniformity

A

The degree to which the data is specified using the same unit of measure.

The weight may be recorded either in pounds or kilos. The date might follow the USA format or the European format. The currency is sometimes in USD and sometimes in yen.

So the data must be converted to a single unit of measure.

41
Q

The workflow

A

The workflow is a sequence of four steps aiming at producing high-quality data and taking into account all the criteria we’ve talked about.

Inspection: Detect unexpected, incorrect, and inconsistent data.

Cleaning: Fix or remove the anomalies discovered.

Verifying: After cleaning, the results are inspected to verify correctness.

Reporting: A report about the changes made and the quality of the currently stored data is recorded.

What you see as a sequential process is, in fact, an iterative, endless process. One can go from verifying to inspection when new flaws are detected.

42
Q

Inspection

A

Inspecting the data is time-consuming and requires using many methods for exploring the underlying data for error detection.

43
Q

Data profiling

A

Summary statistics about the data, known as a data profile, are really helpful for getting a general idea about the quality of the data.

For example, check whether a particular column conforms to particular standards or patterns. Is the column recorded as a string or a number?

How many values are missing? How many unique values are in a column, and what is their distribution? Is this data set linked to, or does it have a relationship with, another?
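A few pandas one-liners answer exactly these profiling questions, assuming a DataFrame df (the "status" column is illustrative):

df.describe(include="all")     # summary statistics for every column
df.dtypes                      # is each column stored as a string or a number?
df.isna().sum()                # how many values are missing per column?
df.nunique()                   # how many unique values per column?
df["status"].value_counts()    # the distribution of one column's values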

44
Q

Visualizations

A

By analyzing and visualizing the data using statistical methods such as mean, standard deviation, range, or quantiles, one can find values that are unexpected and thus erroneous.

For example, by visualizing the average income across countries, one might see there are some outliers. Some countries have people who earn much more than anyone else. Those outliers are worth investigating and are not necessarily incorrect data.

45
Q

Software packages

A

Several software packages or libraries available in your language will let you specify constraints and check the data for violations of these constraints.

Moreover, they can not only generate a report of which rules were violated and how many times but also create a graph of which columns are associated with which rules.

The age, for example, can’t be negative, and neither can the height. Other rules may involve multiple columns in the same row, or across datasets.

46
Q

Cleaning

A

Data cleaning involves different techniques based on the problem and the data type. Different methods can be applied, each with its own trade-offs.

Overall, incorrect data is either removed, corrected, or imputed.

47
Q

Irrelevant data

A

Irrelevant data are those that are not actually needed, and don’t fit under the context of the problem we’re trying to solve.

For example, if we were analyzing data about the general health of the population, the phone number wouldn’t be necessary (column-wise).

Similarly, if you were interested in only one particular country, you wouldn’t want to include all the other countries. Or, if we were studying only those patients who underwent surgery, we wouldn’t include everyone (row-wise).

Drop a piece of data only if you are sure it is unimportant. Otherwise, explore the correlation matrix between feature variables.

And even if you notice no correlation, you should ask someone who is a domain expert. You never know: a feature that seems irrelevant could be very relevant from a domain perspective, such as a clinical perspective.

48
Q

Duplicates

A

Duplicates are data points that are repeated in your dataset.

It often happens when, for example:

Data are combined from different sources
The user hits the submit button twice, thinking the form wasn’t actually submitted.
An online booking request is submitted twice to correct information that was entered incorrectly the first time.
A common symptom is two users having the same identity number, or the same article being scraped twice.

Therefore, duplicates should simply be removed.

49
Q

Type conversion

A

Make sure numbers are stored as numerical data types. A date should be stored as a date object, or a Unix timestamp (number of seconds), and so on.

Categorical values can be converted into and from numbers if needed.

This can be spotted quickly by taking a peek at the data types of each column in the summary (discussed above).

A word of caution: values that can’t be converted to the specified type should be converted to an NA value (or similar), with a warning being displayed. This indicates the value is incorrect and must be fixed.
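A minimal pandas sketch of such conversions; errors="coerce" turns non-convertible values into NA, as the caution above suggests. Column names are illustrative.

import pandas as pd

df["price"] = pd.to_numeric(df["price"], errors="coerce")             # non-convertible values become NaN
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")  # same idea for dates
df["category_code"] = df["category"].astype("category").cat.codes     # categorical values to numbers, if needed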

50
Q

Syntax errors

A

Remove white spaces: Extra white spaces at the beginning or the end of a string should be removed.

" hello world " => "hello world"

51
Q

Pad strings

A

Strings can be padded with spaces or other characters to a certain width. For example, some numerical codes are often represented with prepending zeros to ensure they always have the same number of digits.

313 => 000313 (6 digits)
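In pandas this might look like the following, with an illustrative "product_code" column:

df["product_code"] = df["product_code"].astype(str).str.zfill(6)   # 313 -> 000313
# the more general form: df["product_code"].str.pad(width=6, side="left", fillchar="0")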

52
Q

Fix typos

A

Strings can be entered in many different ways and, no wonder, can contain mistakes.

Gender
m
Male
fem.
FemalE
Femle
This categorical variable is considered to have 5 different classes, and not the 2 expected (male and female), since each value is different.

A bar plot is useful to visualize all the unique values. One can notice some values are different but do mean the same thing i.e. “information_technology” and “IT”. Or, perhaps, the difference is just in the capitalization i.e. “other” and “Other”.

Therefore, our duty is to recognize from the above data whether each value is male or female. How can we do that?

The first solution is to manually map each value to either “male” or “female”.

dataframe['gender'].map({'m': 'male', 'fem.': 'female', …})

The second solution is to use pattern match. For example, we can look for the occurrence of m or M in the gender at the beginning of the string.

re.sub(r"^m$", "male", "m", flags=re.IGNORECASE)  # replaces a lone "m" or "M" with "male"

The third solution is to use fuzzy matching: An algorithm that identifies the distance between the expected string(s) and each of the given one. Its basic implementation counts how many operations are needed to turn one string into another.

Gender male female
m 3 5
Male 1 3
fem. 5 3
FemalE 3 2
Femle 3 1
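A small sketch of the fuzzy-matching approach using Python’s standard-library difflib, applied to the same dataframe variable as above. The cutoff value is an illustrative choice, and very short codes such as "m" may still be better handled by the manual mapping shown earlier.

import difflib

expected = ["male", "female"]

def standardize_gender(value):
    match = difflib.get_close_matches(str(value).strip().lower(), expected, n=1, cutoff=0.3)
    return match[0] if match else None   # None marks values that could not be resolved

dataframe["gender"] = dataframe["gender"].apply(standardize_gender)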

Watch out for values like “0”, “Not Applicable”, “NA”, “None”, “Null”, or “INF”; they might all mean the same thing: the value is missing.

53
Q

Standardize

A

Our duty is to not only recognize the typos but also put each value in the same standardized format.

For strings, make sure all values are either in lower or upper case.

For numerical values, make sure all values have a certain measurement unit.

The height, for example, can be recorded in meters or in centimetres; if the units are mixed, a difference of 1 meter would be treated the same as a difference of 1 centimetre. So, the task here is to convert the heights to one single unit.

For dates, the USA version is not the same as the European version. Recording the date as a timestamp (a number of milliseconds) is not the same as recording the date as a date object.

54
Q

Scaling/ Transformation

A

Scaling means to transform your data so that it fits within a specific scale, such as 0–100 or 0–1.

For example, exam scores of a student can be re-scaled to be percentages (0–100) instead of GPA (0–5).

It can also help in making certain types of data easier to plot. For example, we might want to reduce skewness to assist in plotting (when there are many outliers). The most commonly used functions are log, square root, and inverse.

Scaling can also take place on data that has different measurement units.

Student scores on different exams, say the SAT and ACT, can’t be compared directly since the two exams are on different scales: a difference of 1 SAT point would otherwise be treated the same as a difference of 1 ACT point. In this case, we need to re-scale SAT and ACT scores to numbers, say, between 0–1.

By scaling, we can plot and compare different scores.
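A minimal min-max scaling sketch in pandas, with illustrative score columns:

for col in ["sat_score", "act_score"]:
    df[col + "_scaled"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())   # rescale to the 0-1 range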

55
Q

Normalization

A

While normalization also rescales the values into a range of 0–1, the intention here is to transform the data so that it is normally distributed. Why?

In most cases, we normalize the data if we’re going to be using statistical methods that rely on normally distributed data. How?

One can use the log function, or one of several other rescaling methods.

Depending on the scaling method used, the shape of the data distribution might change. For example, the standard z-score and Student’s t-statistic preserve the shape, while the log function might not.

56
Q

Missing values

A

The fact that missing values are unavoidable leaves us with the question of what to do when we encounter them. Ignoring the missing data is like digging holes in a boat; it will sink.

There are three, or perhaps more, ways to deal with them.

57
Q

Drop.

A

If the missing values in a column rarely happen and occur at random, then the easiest and most straightforward solution is to drop observations (rows) that have missing values.

If most of the column’s values are missing, and occur at random, then a typical decision is to drop the whole column.

This is particularly useful when doing statistical analysis, since filling in the missing values may yield unexpected or biased results.
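A minimal pandas sketch of both kinds of dropping; the "age" column and the 50% threshold are illustrative.

df = df.dropna(subset=["age"])                       # drop the few rows where this column is missing
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))    # drop columns with more than 50% missing values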

58
Q

Impute.

A

It means to calculate the missing value based on other observations. There are quite a lot of methods to do that.

59
Q

using statistical values like mean, median. However, none of these guarantees unbiased data, especially if there are many missing values.

A

Mean is most useful when the original data is not skewed, while the median is more robust, not sensitive to outliers, and thus used when data is skewed.

In normally distributed data, most values lie within 2 standard deviations of the mean. One option is to fill in the missing values with random numbers generated between (mean - 2 * std) and (mean + 2 * std):

# average_age, std_age and count_nan_age are assumed to be computed beforehand
rand = np.random.randint(int(average_age - 2 * std_age), int(average_age + 2 * std_age), size=count_nan_age)
dataframe.loc[dataframe["age"].isna(), "age"] = rand

60
Q

Using a linear regression.

A

Based on the existing data, one can calculate the best fit line between two variables, say, house price vs. size m².

It is worth mentioning that linear regression models are sensitive to outliers.

61
Q

Hot-deck: Copying values from other similar records. This is only useful if you have enough available data. And, it can be applied to numerical and categorical data.

A

One can take the random approach where we fill in the missing value with a random value. Taking this approach one step further, one can first divide the dataset into two groups (strata), based on some characteristic, say gender, and then fill in the missing values for different genders separately, at random.

In sequential hot-deck imputation, the column containing missing values is sorted according to auxiliary variable(s) so that records that have similar auxiliaries occur sequentially. Next, each missing value is filled in with the value of the first following available record.

More interestingly, k-nearest-neighbour imputation, which groups similar records together, can also be used. A missing value is then filled in by first finding the k records closest to the record with the missing value. Next, a value is chosen from (or computed out of) the k nearest neighbours. In the case of computing, statistical methods like the mean (as discussed before) can be used.
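A small k-nearest-neighbour imputation sketch using scikit-learn’s KNNImputer; the numeric column names and k=5 are illustrative assumptions.

import pandas as pd
from sklearn.impute import KNNImputer

numeric_cols = ["age", "height", "weight"]
imputer = KNNImputer(n_neighbors=5)                 # fill each missing value from the 5 closest records
df[numeric_cols] = pd.DataFrame(
    imputer.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)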

62
Q

Flag

A

Some argue that filling in the missing values leads to a loss of information, no matter what imputation method we use.

That’s because saying that the data is missing is informative in itself, and the algorithm should know about it. Otherwise, we’re just reinforcing the pattern that already exists in the other features.

This is particularly important when the missing data doesn’t happen at random. Take for example a conducted survey where most people from a specific race refuse to answer a certain question.

Missing numeric data can be filled in with, say, 0, but these zeros must be ignored when calculating any statistical value or plotting the distribution.

Categorical data can be filled in with, say, “Missing”: a new category which indicates that this piece of data is missing.
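A minimal flagging sketch in pandas; the column names and placeholder values are illustrative.

df["age_was_missing"] = df["age"].isna()                   # keep the fact that the value was missing
df["age"] = df["age"].fillna(0)                            # numeric placeholder; ignore these zeros in statistics
df["income_band"] = df["income_band"].fillna("Missing")    # a new "Missing" category for categorical data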

63
Q

Missing values are not the same as default values. For instance, zero can be interpreted as either missing or default, but not both.

A

Missing values are not the same as “unknown” values. In a survey where some people didn’t remember whether or not they had been bullied at school, those answers should be treated and labelled as unknown, not missing.

Every time we drop or impute values we are losing information. So, flagging might come to the rescue.

64
Q

Outliers

A

They are values that are significantly different from all other observations. Any data value that lies more than (1.5 * IQR) away from the Q1 and Q3 quartiles is considered an outlier.

Outliers are innocent until proven guilty. With that being said, they should not be removed unless there is a good reason for that.

For example, one might notice some weird, suspicious values that are unlikely to happen and decide to remove them. They are worth investigating before removal, though.

It is also worth mentioning that some models, like linear regression, are very sensitive to outliers. In other words, outliers might throw the model off from where most of the data lie.
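A minimal sketch of the IQR rule above in pandas, with an illustrative "amount" column:

q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]   # flag for review, don't blindly drop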

65
Q

In-record & cross-dataset errors

A

These errors result from having two or more values in the same row, or across datasets, that contradict each other.

For example, in a dataset about the cost of living in cities, the total column must equal the sum of the rent, transportation, and food columns.

city    rent  transportation  food  total
lisbon  500   20              40    560
paris   750   40              60    850

Similarly, a child can’t be married. An employee’s salary can’t be less than the calculated taxes.

The same idea applies to related data across different datasets.
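A one-line cross-field check for the example above, with column names following the sample table:

mismatched = df[df["total"] != df["rent"] + df["transportation"] + df["food"]]   # rows whose totals don't add up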

66
Q

Verifying

A

When done, one should verify correctness by re-inspecting the data and making sure its rules and constraints do hold.

For example, after filling in the missing data, the new values might violate some of the rules and constraints.

Verification might involve some manual correction when an automatic fix is not possible.

67
Q

Reporting

A

Reporting how healthy the data is, is just as important as cleaning it.

As mentioned before, software packages or libraries can generate reports of the changes made, which rules were violated, and how many times.

In addition to logging the violations, the causes of these errors should be considered. Why did they happen in the first place?

68
Q

How was the data collected, and under what conditions?

A

The environment where the data was collected does matter. The environment includes, but is not limited to, the location, timing, weather conditions, and so on.

Questioning subjects about their opinions while they are on their way to work is not the same as questioning them while they are at home. Patients in a study who have difficulty using tablets to answer a questionnaire might throw off the results.

69
Q

What does the data represent?

A

Does it include everyone? Only the people in the city? Or perhaps only those who opted to answer because they had a strong opinion about the topic?

70
Q

What are the methods used to clean the data and why?

A

Different methods can be better in different situations or with different data types.

71
Q

Do you invest the time and money in improving the process?

A

Investing in people and the process is as critical as investing in the technology.