Course-4 Process data from dirty to clean Flashcards
Data Analysis Rule of thumb
- A strong analysis depends on the integrity of the data.
- It's important to check that the data you use aligns with the business objective.
Data integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle.
Data replication
The process of storing data in multiple locations.
Data transfer
The process of copying data from a storage device to memory, or from one computer to another
Data manipulation
The process of changing data to make it more organised and easier to read
Other threats to data integrity
- Human error
- Viruses
- Malware
- Hacking
- System failures
Types of insufficient data
- Data from only one source
- Data that keeps updating
- Outdated data
- Geographically limited data
Ways to address insufficient data
- Identify trends with the available data
- Wait for more data if time allows
- Talk with stakeholders and adjust your objective.
- Look for a new dataset
Population
All possible data values in a certain dataset
Sample size
The number of items in a sample, which is a part of a population chosen to be representative of that population.
Sampling bias
A situation in which a sample isn’t representative of the population as a whole.
Random sampling
A way of selecting a sample from a population so that every possible sample has an equal chance of being chosen.
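As a rough illustration of random sampling, the sketch below draws a simple random sample with Python's standard library. The function name, the customer ID range, and the seed are hypothetical and chosen only for this example.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Return k distinct items drawn uniformly at random from population."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(list(population), k)

# Hypothetical population: customer IDs 1 through 1000.
customer_ids = range(1, 1001)
sample = simple_random_sample(customer_ids, k=50, seed=42)
```

Because the draw is made without replacement, no customer ID can appear twice in the sample.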
Margin of error
Because a sample is used to represent a population, the sample’s results are expected to differ from what the results would have been if you had surveyed the entire population. This difference is the margin of error.
Statistical power
The probability of getting meaningful results from a test.
Hypothesis testing
A way to see if a survey or experiment has meaningful results.
Statistically significant
If a test is statistically significant, it means the results of the test are real and not an error caused by random chance.
Example
Usually, you need a statistical power of at least 0.8, or 80%, to consider your results statistically significant.
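One way to make hypothesis testing concrete is a two-proportion z-test, sketched below with Python's standard library. This is only one of many possible tests. The function name, the survey counts, and the 0.05 cutoff are assumptions for illustration and are not taken from the course material.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(successes_a, total_a, successes_b, total_b):
    """Two-sided p-value for the difference between two observed rates."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 1.0  # all successes or all failures: no evidence of a difference
    z = (rate_a - rate_b) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical survey: 100 of 1000 vs. 150 of 1000 respondents said yes.
p_value = two_proportion_p_value(100, 1000, 150, 1000)
is_significant = p_value < 0.05  # 0.05 is a common, but assumed, cutoff
```

A small p-value suggests the gap between the two groups is unlikely to be caused by random chance alone.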
Confidence level
The probability that your sample accurately reflects the greater population.
Example
Having a 99% confidence level is ideal, but most industries hope for at least a 90% or 95% confidence level.
Margin of error
The maximum amount that the sample results are expected to differ from those of the actual population.
Estimated response rate
If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey.
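The estimated response rate can be used to work out how many surveys to send. The sketch below assumes you already know how many completed responses you need. The function name and the example numbers are hypothetical.

```python
import math

def surveys_to_send(required_responses, response_rate):
    """Number of surveys to send so the expected completions meet the target."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be a fraction between 0 and 1")
    # Round up, because you cannot send a fraction of a survey.
    return math.ceil(required_responses / response_rate)

# Hypothetical target: 400 completed surveys at a 25% response rate.
needed = surveys_to_send(400, 0.25)
```

With these numbers, `needed` is 1600, since 25% of 1600 is 400.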
To calculate margin of error you need
- Population size
- Sample size
- Confidence level
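The three inputs above can be combined as in the following sketch. It assumes the common formula for a proportion with a finite population correction and the most conservative proportion of 0.5. Spreadsheet or online calculators may use a slightly different formula, so treat the result as approximate. The function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

def margin_of_error(population_size, sample_size, confidence_level, proportion=0.5):
    """Approximate margin of error for a surveyed proportion, as a fraction."""
    if not 0 < sample_size <= population_size or population_size < 2:
        raise ValueError("need 0 < sample_size <= population_size and population_size >= 2")
    # z-score that leaves (1 - confidence_level) split across both tails.
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    # Finite population correction: shrinks the error as the sample nears the population.
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc

# Hypothetical survey: 100 responses from a population of 500 at 95% confidence.
moe = margin_of_error(500, 100, 0.95)
```

For this example the margin of error comes out to roughly 0.088, or about 8.8 percentage points. A larger sample lowers it, and a higher confidence level raises it.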
DATEDIF
A spreadsheet function that calculates the number of days, months, or years between two dates.
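For readers working outside a spreadsheet, the date-difference calculation above can be approximated in Python. This is a sketch, not a reimplementation of the spreadsheet function: the name `date_diff` and the unit codes "D", "M", and "Y" are assumptions, and end-of-month edge cases may differ from what a spreadsheet returns.

```python
from datetime import date

def date_diff(start, end, unit):
    """Whole days ("D"), complete months ("M"), or complete years ("Y") from start to end."""
    if end < start:
        raise ValueError("end date must not be before start date")
    if unit == "D":
        return (end - start).days
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    if unit == "M":
        return months
    if unit == "Y":
        return months // 12
    raise ValueError(f"unsupported unit: {unit!r}")

# Hypothetical dates for illustration.
start, end = date(2020, 1, 15), date(2021, 3, 10)
days = date_diff(start, end, "D")
months = date_diff(start, end, "M")
years = date_diff(start, end, "Y")
```

For these dates the results are 420 days, 13 complete months, and 1 complete year.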
Dirty data
Data that is incomplete, incorrect, or irrelevant to the problem you’re trying to solve.
Clean data
Data that is complete, correct, and relevant to the problem you’re trying to solve.
Data engineers
Transform data into a useful format for analysis and give it a reliable infrastructure
Data warehousing specialists
Develop processes and procedures to effectively store and organise data.
Null
An indication that a value does not exist in a dataset.
Field
A single piece of information from a row or column of a spreadsheet
Field length
A tool for determining how many characters can be keyed into a field
Data validation
A tool for checking the accuracy and quality of data before adding or importing it.
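A minimal sketch of what a validation check might do before data is added or imported is shown below. It combines a missing-value check with a field-length check. The function name, the rules, and the example postal code are hypothetical.

```python
def validate_field(value, max_length, required=True):
    """Return a list of problems found for one field value (empty if none)."""
    problems = []
    # Treat None and blank strings as missing.
    if value is None or str(value).strip() == "":
        if required:
            problems.append("missing value")
        return problems
    if len(str(value)) > max_length:
        problems.append(f"longer than {max_length} characters")
    return problems

# Hypothetical rule: a postal code field may hold at most 5 characters.
ok = validate_field("94105", max_length=5)
too_long = validate_field("941055", max_length=5)
```

An empty list means the value passed every check, so only values with problems need attention before import.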
Validity
The concept of using data integrity principles to ensure measures conform to defined business rules or constraints.
Validity examples
Data collected five years ago used technology that is not approved or supported by the business.
Accuracy
The degree of conformity of a measure to a standard or a true value.
Accuracy examples
Addresses in the business database are identified as incorrect when compared to the public postal service database.
Completeness
The degree to which all required measures are known.
Completeness example
A null or missing value for the item "number of employees per store".
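Completeness can be measured as the share of rows in which a required field is known. The sketch below assumes rows are stored as dictionaries. The store records and field names are made up for illustration.

```python
def completeness(rows, field):
    """Fraction of rows where field is present and not None or empty."""
    if not rows:
        raise ValueError("rows must not be empty")
    known = sum(1 for row in rows if row.get(field) not in (None, ""))
    return known / len(rows)

# Hypothetical store records; two are missing the employee count.
stores = [
    {"store": "A", "employees": 12},
    {"store": "B", "employees": None},
    {"store": "C", "employees": 8},
    {"store": "D"},
]
share_known = completeness(stores, "employees")
```

Here `share_known` is 0.5, because only two of the four stores report a number of employees.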