Course-4 Process data from dirty to clean Flashcards
Data Analysis Rule of thumb
- A strong analysis depends on the integrity of the data.
- It's important to check that the data you use aligns with the business objective.
Data integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle.
Data replication
The process of storing data in multiple locations
Data Transfer
The process of copying data from a storage device to memory, or from one computer to another
Data manipulation
The process of changing data to make it more organised and easier to read
Other threats to data integrity
- Human error
- Viruses
- Malware
- Hacking
- System failures
Types of insufficient data
- Data from only one source
- Data that keeps updating
- Outdated data
- Geographically limited data
Ways to address insufficient data
- Identify trends with the available data
- Wait for more data if time allows
- Talk with stakeholders and adjust your objective.
- Look for a new dataset
Population
All possible data values in a certain dataset
Sample size
The number of items in a sample, where the sample is a part of a population that is representative of the population
Sampling bias
When a sample isn’t representative of the population as a whole
Random sampling
A way of selecting a sample from a population so that every possible sample type has an equal chance of being chosen.
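As a rough illustration of random sampling, the sketch below draws a simple random sample without replacement using Python's standard library. The customer IDs, the sample size of 50, and the fixed seed are made-up assumptions for the example, not values from the course.

```python
import random

# Hypothetical population: 1,000 customer IDs (illustrative only).
population = list(range(1, 1001))

# A seeded generator makes the draw repeatable; the seed value is arbitrary.
rng = random.Random(42)

# random.sample draws without replacement, so each member of the
# population has the same chance of ending up in the sample.
sample = rng.sample(population, k=50)

print(len(sample), "customers selected at random")
```

Drawing without replacement means no customer can appear twice, which is usually what you want when choosing survey recipients.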
Margin of error
Since a sample is used to represent a population, the sample’s results are expected to differ from what the results would have been if you had surveyed the entire population.
Statistical Power
The probability of getting meaningful results from a test.
Hypothesis testing
A way to see if a survey or experiment has meaningful results.
Statistically significant
If a test is statistically significant, it means the results of the test are real and not an error caused by random chance.
Example
Usually, you need a statistical power of at least 0.8, or 80%, to consider your results statistically significant.
Confidence level
The probability that your sample accurately reflects the greater population.
Example
Having a 99% confidence level is ideal, but most industries hope for at least a 90% or 95% confidence level.
Margin of error
The maximum amount that the sample results are expected to differ from those of the actual population.
Estimated response rate
If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey.
To calculate margin of error you need
- Population size
- Sample size
- Confidence level
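One common way to combine these three inputs is sketched below. It assumes a survey measuring a proportion, uses the most conservative proportion of 0.5, and applies a finite population correction. The function name margin_of_error and the example numbers are illustrative assumptions; online calculators may use slightly different formulas.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(population_size, sample_size, confidence_level, proportion=0.5):
    """Approximate margin of error for a surveyed proportion (a sketch)."""
    if population_size < 2 or not 0 < sample_size <= population_size:
        raise ValueError("need population_size >= 2 and 0 < sample_size <= population_size")
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be between 0 and 1, e.g. 0.95")

    # z-score for a two-sided interval, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)

    # Standard error of a proportion; 0.5 gives the widest (safest) margin.
    standard_error = sqrt(proportion * (1 - proportion) / sample_size)

    # Finite population correction shrinks the margin when the sample
    # is a large share of the population.
    correction = sqrt((population_size - sample_size) / (population_size - 1))

    return z * standard_error * correction


# Hypothetical survey: 100 responses from a population of 500 at 95% confidence.
moe = margin_of_error(population_size=500, sample_size=100, confidence_level=0.95)
print(f"Margin of error: {moe:.1%}")
```

A larger sample or a lower confidence level gives a smaller margin of error, which matches the trade-off described in the cards above.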
DATEDIF
A spreadsheet function that calculates the number of days, months, or years between two dates
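To make the idea concrete, here is a minimal Python sketch of the same kind of calculation. The helper name datedif and its "D", "M", and "Y" unit codes are assumptions chosen to mirror the spreadsheet function. It counts complete months and years only, and spreadsheet applications may treat end-of-month dates differently.

```python
from datetime import date


def datedif(start, end, unit):
    """Whole days ("D"), complete months ("M"), or complete years ("Y") between two dates."""
    if start > end:
        raise ValueError("start date must not be after end date")

    if unit == "D":
        return (end - start).days

    # Count calendar months, then subtract one if the final month is incomplete.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1

    if unit == "M":
        return months
    if unit == "Y":
        return months // 12
    raise ValueError('unit must be "D", "M", or "Y"')


# Hypothetical dates for illustration.
start, end = date(2020, 1, 15), date(2021, 3, 10)
print(datedif(start, end, "D"), datedif(start, end, "M"), datedif(start, end, "Y"))
```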