Course-4 Process data from dirty to clean Flashcards
Data Analysis Rule of thumb
- A strong analysis depends on the integrity of the data.
- Its important to check that the data you use aligns with the business objective.
Data integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle.
Data replication
The Process of storing data in multiple locations
Data Transfer
The process of copying data from a storage device to memory, or from one computer to another
Data manipulation
The process of changing data to make it more organised and easier to read
Other threats to data integrity
- Human error
- Viruses
- Malware
- Hacking
- System failures
Types of insufficient data
-Data from only one source
- Data that keeps updating
- Outdated data
- Geographically- Limited data
Ways to address insufficient data
- identify trends with the available data
- Wait for more data if time allows
- Talk with stakeholders and adjust your objective.
- Look for a new dataset
Ways to address insufficient data
- identify trends with the available data
- Wait for more data if time allows
- Talk with stakeholders and adjust your objective.
- Look for a new dataset
Population
All possible data values in a certain dataset
Sample size
A part of a population that is representative of the population
Sampling bias
A sample isn’t representative of the population as a whole
Random sampling
A way of selecting a sample from a population so that every possible sample type has an equal chance of being chosen.
Margin of error
Since the sample size is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population.
Statistical Power
The probability of getting meaningful results from a test.
Hypothesis testing
A way to see if a survey or experiment has meaningful results.
Statistically significant
If a test is statistically significant, it means the results of the best are real and not an error caused by a random chance.
Example
Usually, you need a statistical power of at least 0.8% or 80% to consider your results statistically significant.
Confidence level
The probability that your sample size accurately reflects the greater population.
Example
Having a 99% confidence level is ideal, but most industries hope for at least a 90% or 95% per cent confidence level.
Margin of error
The maximum amount that the sample results are expected to differ from those of the actual population.
Estimated response rate
If you are running a survey of individuals, this is the percentage of people you expect will compete for your survey out of those who received the survey.
To calculate margin of error you need
- Population size
- Sample size
- Confidence level
DATEIF
A spreadsheet function that calcualtes the number of days, months, or years between two dates
Dirty data
Data that is incomplete, incorrect, or irrelevant to the problem you’re trying to solve.
Clean data
Data that is complete, correct, and relevant to the problem your trying to solve
Data engineers
Transform data into a useful format for analysis and give it a reliable infrastructure
Data warehousing specialists
Develop processes and procedures to effectively store and organise data.
Null
An indication that a value does not exist in a dataset.
Field
A single piece of information from a row or column of a spreadsheet
Field length
A tool for determining how many characters can be keyed into a field
Data validation
A tool for checking the accuracy and quality of data before adding or importing it.
Validity
The Concept of using data integrity principles ton ensure measures conform to defined business rules or constraints.
Validity examples
Data collected five years ago used technology that is not approved or supported by the business.
Accuracy
The degree of conformity of a measure to a standard or a true value.
Accuracy examples
Addresses in the business database are identified as incorrect when compared to the public postal service database.
Completeness
The degree to which all required measures are known.
Completeness example
Null/missing value for the item number of employees per store.
Consistency
The degree to which a set of measures is equivalent across systems.
Consistency example
Date of store opening stored in both MM/DD/YYYY and MM/YY formats.
Merger
An agreement that unites two organisations into a single new one.
Data merging
The process of combining two or more datasets into a single dataset.
Compatibility
How well two or more datasets are able to work together.
Questions analysts ask while merging two data bases.
- Do I have all the data I need?
- Does the data I need exist within these datasets?
- Does the data need to be cleaned, or are they ready for me to use?
- Are the datasets cleaned to t he same standard?
Transposing
The user converts the data from the current long format (more rows than columns) to the wide format (more columns than rows).
Conditional formatting
A spreadsheet tool that changes how cells appear when values meet specific conditions.
Remove duplicates
A tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
Text String
A group of characters within a cell, most often composed of letters, numbers or both.
Split
A tool that divides text around a specified character and puts each fragment into a new or separate cell.
Specified text separator
Delimiter
Concatenate
A function that joins multiple text strings into a single string
Function
A set of instructions that performs a specific calculation using the data in a spreadsheet
COUNTIF
A function that returns the number of cells that match a specified value
Syntax
A predetermined structure that incudes all required information and its proper placement
COUNTIF Function example
= COUNTIF( range, “value”)
LEN
A function that tells you the length of a text string by counting the number of characters it contains.
LEN function
= LEN (range)
LEFT
A function that gives you a set number of characters from the left side of the text string.
RIGHT
A function that gives you a set number of characters from the right side of a text string.
Left function example
= Left ( range, number of characters)
Right function example
=Right (range,number of characters)
MID
A function that gives you a segment from the middle of a text string.
MID function example
=MID ( range, reference starting point, number of middle characters)
CONCATENATE
= CONCATENATE( item-1, Item 2)
Trim
A function that removes leading, trailing, and repeated spaces in data.
Trim function syntax
=Trim(range)
Sorting
Arranging data into a meaningful order makes it easier to understand, analyze, and visualise.
Filtering
Showing only the data that meets a specific criteria while hiding the rest.
Pivot table
A data summarization tool that is used in data processing.
VLOOKUP
Vertical Lookup
VLOOKUP
A function that searches for a particular value in a column to return a corresponding piece of information.
VLOOKUP Syntax
=VLOOKUP (data to look up, ‘where to look’! Range , column, false)
Data Mapping
The process of matching fields from one data source to another.
Schema
A way of describing how something is organised
Data Cleaning Tools
- Data validation
- Conditional formatting
- COUNTIF
- Sorting
- Filtering
Week-3 Content
- Different data cleaning functions in spreadsheets and SQL
- How SQL can be used to clean large data sets
- Apply basic SQL functions for transforming data and cleaning strings
Spreadsheets VS SQL
Spreadsheets
- Generated with a program
- Access to the data you input
- Stored locally
-Small datasets
- Working independently
- Built-in functionalities
SQL
-A language used to interact with database programs
-Can pull information from different sources in the database
-Stored across a database
- Larger datasets
- Tracks changes across the team
- Useful across multiple programs
CAST
Can be used to convert anything from one data type to another
Float
A number that contains a decimal
Typecasting
Converting data from one type to another
CONCAT()
Adds strings together to create new text strings that can be used as unique keys
COALESCE
Can be used to return non-null values in a list.
Verification
A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable.
Changelog
A file containing a chronologically ordered list of modifications made to a project.
See the big picture when verifiying data-cleaning
1) Consider the business problem
2) Consider the goal
3) Consider the data
Remove duplicates
A tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
Find and replace
A tool that looks for a specified search term in a spreadsheet and allows you to replace it with something else.
COUNTA
A function that counts the total number of values within a specified range
CASE statement
The CASE statement goes through one or more conditions and returns a value as soon a condition is met
Documentation
The process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort.
Documentation
The process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort.
Data documentation benefits
-Recover data-cleaning errors
- Inform other users of changes
- Determine the quality of data
-
PAR
Problem
Action
Result
Transferable skills
Skills and qualities that can transfer from one job or industry to another
PAR Example
- Problem: Previously-absent workflow procedures.
- Action: Implemented and communicated daily workflow procedures.
Result: 15% Increase in productivity.
Soft skills
Non-Technical skills traits and behaviors that relate to how you work.
Junior or associate data analysts
- Healthcare analyst
- Marketing analyst
- Business intelligence analyst
- ## Financial analyst