Course-4 Process data from dirty to clean Flashcards

1
Q

Data Analysis Rule of thumb

A
  • A strong analysis depends on the integrity of the data.
  • It's important to check that the data you use aligns with the business objective.
2
Q

Data integrity

A

The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle.

3
Q

Data replication

A

The process of storing data in multiple locations.

4
Q

Data Transfer

A

The process of copying data from a storage device to memory, or from one computer to another

5
Q

Data manipulation

A

The process of changing data to make it more organised and easier to read

6
Q

Other threats to data integrity

A
  • Human error
  • Viruses
  • Malware
  • Hacking
  • System failures
7
Q

Types of insufficient data

A

- Data from only one source
- Data that keeps updating
- Outdated data
- Geographically limited data

8
Q

Ways to address insufficient data

A
  • Identify trends with the available data
  • Wait for more data if time allows
  • Talk with stakeholders and adjust your objective.
  • Look for a new dataset
9
Q

Population

A

All possible data values in a certain dataset

10
Q

Sample size

A

The number of items in a sample, where a sample is a part of a population that is representative of that population.

11
Q

Sampling bias

A

A sample isn’t representative of the population as a whole

12
Q

Random sampling

A

A way of selecting a sample from a population so that every possible sample type has an equal chance of being chosen.
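
A minimal Python sketch of simple random sampling, assuming a hypothetical population of 1,000 customer IDs. `random.sample` draws without replacement, and the seed is there only to make the draw repeatable.

```python
import random

# Hypothetical population: customer IDs 1..1000 (illustrative only).
population = list(range(1, 1001))

# Seeded generator so the same sample can be reproduced later.
rng = random.Random(42)

# Draw 50 distinct members; no member can be picked twice.
sample = rng.sample(population, 50)

print(len(sample), "members drawn from", len(population))
```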

13
Q

Margin of error

A

Because a sample is used to represent a population, the sample's results are expected to differ from what the results would have been if you had surveyed the entire population.

14
Q

Statistical Power

A

The probability of getting meaningful results from a test.

15
Q

Hypothesis testing

A

A way to see if a survey or experiment has meaningful results.

16
Q

Statistically significant

A

If a test is statistically significant, it means the results of the test are real and not an error caused by random chance.

17
Q

Example

A

Usually, you need a statistical power of at least 0.8 (80%) to consider your results statistically significant.

18
Q

Confidence level

A

The probability that your sample accurately reflects the greater population.

19
Q

Example

A

Having a 99% confidence level is ideal, but most industries hope for at least a 90% or 95% confidence level.

20
Q

Margin of error

A

The maximum amount that the sample results are expected to differ from those of the actual population.

21
Q

Estimated response rate

A

If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received it.

22
Q

To calculate margin of error you need

A
  • Population size
  • Sample size
  • Confidence level
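
One common way to combine these three inputs is the margin of error for a proportion with a finite population correction. The sketch below assumes that formula and the most conservative proportion of 0.5; the function name and example figures are illustrative, and online calculators may round or adjust slightly differently.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(population_size, sample_size, confidence_level, proportion=0.5):
    """Approximate margin of error for a surveyed proportion."""
    # z-score for a two-sided interval at the given confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Standard error of a proportion; 0.5 is the most conservative assumption
    standard_error = sqrt(proportion * (1 - proportion) / sample_size)
    # Finite population correction: the error shrinks as the sample
    # becomes a larger share of the population
    fpc = sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc

moe = margin_of_error(population_size=500, sample_size=100, confidence_level=0.95)
print(f"{moe:.1%}")
```

With a population of 500, a sample of 100, and a 95% confidence level, this comes out to a little under 9%.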
23
Q

DATEDIF

A

A spreadsheet function that calculates the number of days, months, or years between two dates
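
For comparison, a rough Python analog of DATEDIF for the "D", "M", and "Y" units. The helper name `datedif` and the dates are hypothetical, and edge cases such as month-end dates may not match every spreadsheet program exactly.

```python
from datetime import date

def datedif(start, end, unit):
    """Rough analog of the spreadsheet DATEDIF for units "D", "M", "Y"."""
    if end < start:
        raise ValueError("end date must not precede start date")
    if unit == "D":
        return (end - start).days
    # Count complete months: drop one if the end day-of-month
    # has not yet reached the start day-of-month
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    if unit == "M":
        return months
    if unit == "Y":
        return months // 12
    raise ValueError(f"unsupported unit: {unit!r}")

start, end = date(2020, 1, 15), date(2021, 3, 10)
print(datedif(start, end, "D"), datedif(start, end, "M"), datedif(start, end, "Y"))
```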

24
Dirty data
Data that is incomplete, incorrect, or irrelevant to the problem you're trying to solve.
25
Clean data
Data that is complete, correct, and relevant to the problem you're trying to solve.
26
Data engineers
Transform data into a useful format for analysis and give it a reliable infrastructure
27
Data warehousing specialists
Develop processes and procedures to effectively store and organise data.
28
Null
An indication that a value does not exist in a dataset.
29
Field
A single piece of information from a row or column of a spreadsheet
30
Field length
A tool for determining how many characters can be keyed into a field
31
Data validation
A tool for checking the accuracy and quality of data before adding or importing it.
32
Validity
The concept of using data integrity principles to ensure measures conform to defined business rules or constraints.
33
Validity examples
Data collected five years ago used technology that is not approved or supported by the business.
34
Accuracy
The degree of conformity of a measure to a standard or a true value.
35
Accuracy examples
Addresses in the business database are identified as incorrect when compared to the public postal service database.
36
Completeness
The degree to which all required measures are known.
37
Completeness example
Null/missing value for the item number of employees per store.
38
Consistency
The degree to which a set of measures is equivalent across systems.
39
Consistency example
Date of store opening stored in both MM/DD/YYYY and MM/YY formats.
40
Merger
An agreement that unites two organisations into a single new one.
41
Data merging
The process of combining two or more datasets into a single dataset.
42
Compatibility
How well two or more datasets are able to work together.
43
Questions analysts ask while merging two databases
- Do I have all the data I need?
- Does the data I need exist within these datasets?
- Does the data need to be cleaned, or is it ready for me to use?
- Are the datasets cleaned to the same standard?
44
Transposing
The user converts the data from the current long format (more rows than columns) to the wide format (more columns than rows).
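
A small Python sketch of the same idea, assuming a hypothetical table stored as a list of rows. `zip(*rows)` pairs up the i-th item of every row, so rows become columns.

```python
# Hypothetical long-format table: header row first, then one row per month.
long_table = [
    ["month", "sales"],
    ["Jan", 100],
    ["Feb", 120],
    ["Mar", 90],
]

# Transpose: each original column becomes a row in the wide table.
wide_table = [list(column) for column in zip(*long_table)]

for row in wide_table:
    print(row)
```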
45
Conditional formatting
A spreadsheet tool that changes how cells appear when values meet specific conditions.
46
Remove duplicates
A tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
47
Text String
A group of characters within a cell, most often composed of letters, numbers or both.
48
Split
A tool that divides text around a specified character and puts each fragment into a new or separate cell.
49
Specified text separator
Delimiter
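
As a sketch of how a split around a delimiter behaves, here is the Python equivalent on a hypothetical cell value. Each fragment would land in its own cell in a spreadsheet.

```python
# Hypothetical cell containing three values joined by commas.
cell = "Ada,Lovelace,London"
delimiter = ","

# Divide the text around the delimiter; the delimiter itself is discarded.
fragments = cell.split(delimiter)

print(fragments)
```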
50
Concatenate
A function that joins multiple text strings into a single string
51
Function
A set of instructions that performs a specific calculation using the data in a spreadsheet
52
COUNTIF
A function that returns the number of cells that match a specified value
53
Syntax
A predetermined structure that includes all required information and its proper placement
54
COUNTIF Function example
=COUNTIF(range, "value")
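
To show what the function computes, here is a rough Python analog over a hypothetical column of values. The helper `countif` is illustrative, not a real library call; it ignores letter case because spreadsheet COUNTIF treats text criteria as case-insensitive.

```python
# Hypothetical column of membership tiers.
cells = ["gold", "silver", "gold", "bronze", "Gold"]

def countif(values, target):
    """Count the cells equal to target, ignoring letter case."""
    wanted = str(target).lower()
    return sum(1 for value in values if str(value).lower() == wanted)

gold_count = countif(cells, "gold")
print(gold_count)
```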
55
LEN
A function that tells you the length of a text string by counting the number of characters it contains.
56
LEN function
=LEN(range)
57
LEFT
A function that gives you a set number of characters from the left side of the text string.
58
RIGHT
A function that gives you a set number of characters from the right side of a text string.
59
Left function example
=LEFT(range, number of characters)
60
Right function example
=RIGHT(range, number of characters)
61
MID
A function that gives you a segment from the middle of a text string.
62
MID function example
=MID(range, reference starting point, number of middle characters)
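
The four text functions above map closely onto Python string slicing. This sketch uses a hypothetical product code; the main thing to watch is that MID counts positions from 1 while Python indexes from 0.

```python
text = "US-2024-0042"  # hypothetical product code in cell A1

length = len(text)      # like =LEN(A1)
left_part = text[:2]    # like =LEFT(A1, 2)
right_part = text[-4:]  # like =RIGHT(A1, 4)

# MID starts counting at 1, so spreadsheet position 4 is Python index 3.
start, count = 4, 4
middle = text[start - 1 : start - 1 + count]  # like =MID(A1, 4, 4)

print(length, left_part, right_part, middle)
```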
63
CONCATENATE
=CONCATENATE(item 1, item 2)
64
Trim
A function that removes leading, trailing, and repeated spaces in data.
65
Trim function syntax
=TRIM(range)
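
A minimal Python sketch of the same cleanup, assuming only ordinary space characters need handling. The helper name `trim` is illustrative.

```python
def trim(text):
    """Remove leading and trailing spaces and collapse repeated spaces to one."""
    # Splitting on a single space yields empty strings wherever spaces repeat;
    # dropping those and rejoining leaves exactly one space between words.
    return " ".join(part for part in text.split(" ") if part)

cleaned = trim("  Maria   Lopez ")
print(repr(cleaned))
```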
66
Sorting
Arranging data into a meaningful order makes it easier to understand, analyze, and visualise.
67
Filtering
Showing only the data that meets specific criteria while hiding the rest.
68
Pivot table
A data summarization tool that is used in data processing.
69
VLOOKUP
Vertical Lookup
70
VLOOKUP
A function that searches for a particular value in a column to return a corresponding piece of information.
71
VLOOKUP Syntax
=VLOOKUP(data to look up, 'where to look'!Range, column, FALSE)
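
To illustrate the exact-match behaviour that the final FALSE argument requests, here is a rough Python analog. The table contents and the helper `vlookup` are hypothetical.

```python
# Hypothetical lookup range: each row is (product_id, name, price).
table = [
    ("P001", "Notebook", 3.50),
    ("P002", "Pen", 1.20),
    ("P003", "Stapler", 7.99),
]

def vlookup(key, rows, column, default=None):
    """Find key in the first column and return the value in the 1-based column."""
    for row in rows:
        if row[0] == key:
            return row[column - 1]
    return default  # a spreadsheet would show #N/A for a missing key

price = vlookup("P002", table, 3)
print(price)
```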
72
Data Mapping
The process of matching fields from one data source to another.
73
Schema
A way of describing how something is organised
74
Data Cleaning Tools
- Data validation
- Conditional formatting
- COUNTIF
- Sorting
- Filtering
75
Week-3 Content
- Different data cleaning functions in spreadsheets and SQL
- How SQL can be used to clean large datasets
- Apply basic SQL functions for transforming data and cleaning strings
76
Spreadsheets VS SQL
Spreadsheets
- Generated with a program
- Access to the data you input
- Stored locally
- Small datasets
- Working independently
- Built-in functionalities
SQL
- A language used to interact with database programs
- Can pull information from different sources in the database
- Stored across a database
- Larger datasets
- Tracks changes across the team
- Useful across multiple programs
77
CAST
Can be used to convert anything from one data type to another
78
Float
A number that contains a decimal
79
Typecasting
Converting data from one type to another
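
A small sketch of typecasting with CAST, run against an in-memory SQLite database through Python's built-in `sqlite3` module. The `purchases` table and its values are hypothetical, and SQLite names the decimal type REAL where other databases use FLOAT or FLOAT64.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Prices were (wrongly) stored as text, so they sort alphabetically.
conn.execute("CREATE TABLE purchases (item TEXT, price TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("desk", "89.85"), ("chair", "120.5"), ("lamp", "9.99")],
)

# Sorting the raw text compares character by character.
as_text = [row[0] for row in conn.execute(
    "SELECT price FROM purchases ORDER BY price DESC")]

# Casting to a number first gives the expected numeric order.
as_number = [row[0] for row in conn.execute(
    "SELECT CAST(price AS REAL) FROM purchases ORDER BY CAST(price AS REAL) DESC")]

print(as_text)
print(as_number)
```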
80
CONCAT()
Adds strings together to create new text strings that can be used as unique keys
81
COALESCE
Returns the first non-null value in a list.
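
A minimal sketch of COALESCE falling back through a list of columns, again using an in-memory SQLite database from Python. The `products` table and the fallback text 'unknown' are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, product_code TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Desk", "D-100"), (None, "C-200"), (None, None)],
)

# Use the name if present, otherwise the code, otherwise a fixed label.
rows = conn.execute(
    "SELECT COALESCE(product_name, product_code, 'unknown') "
    "FROM products ORDER BY rowid"
).fetchall()
labels = [row[0] for row in rows]

print(labels)
```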
82
Verification
A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable.
83
Changelog
A file containing a chronologically ordered list of modifications made to a project.
84
See the big picture when verifying data cleaning
1) Consider the business problem
2) Consider the goal
3) Consider the data
85
Remove duplicates
A tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
86
Find and replace
A tool that looks for a specified search term in a spreadsheet and allows you to replace it with something else.
87
COUNTA
A function that counts the total number of values within a specified range
88
CASE statement
The CASE statement goes through one or more conditions and returns a value as soon as a condition is met.
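
As an illustration, CASE can tidy up known misspellings in a query's output without changing the stored data. This sketch runs in an in-memory SQLite database from Python; the `customers` table and the names in it are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, first_name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Tony"), (2, "Tnoy"), (3, "Tmmy"), (4, "Ana")],
)

# The first matching WHEN wins; ELSE keeps every other name unchanged.
query = """
SELECT customer_id,
       CASE
           WHEN first_name = 'Tnoy' THEN 'Tony'
           WHEN first_name = 'Tmmy' THEN 'Tommy'
           ELSE first_name
       END AS cleaned_name
FROM customers
ORDER BY customer_id
"""
cleaned = conn.execute(query).fetchall()

print(cleaned)
```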
89
Documentation
The process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort.
90
Data documentation benefits
- Recover data-cleaning errors
- Inform other users of changes
- Determine the quality of data
91
PAR
Problem Action Result
92
Transferable skills
Skills and qualities that can transfer from one job or industry to another
93
PAR Example
- Problem: Previously absent workflow procedures.
- Action: Implemented and communicated daily workflow procedures.
- Result: 15% increase in productivity.
94
Soft skills
Non-technical traits and behaviors that relate to how you work.
95
Junior or associate data analysts
- Healthcare analyst
- Marketing analyst
- Business intelligence analyst
- Financial analyst