Course-4 Process data from dirty to clean Flashcards

1
Q

Data Analysis Rule of thumb

A
  • A strong analysis depends on the integrity of the data.
  • It's important to check that the data you use aligns with the business objective.
2
Q

Data integrity

A

The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle.

3
Q

Data replication

A

The process of storing data in multiple locations.

4
Q

Data Transfer

A

The process of copying data from a storage device to memory, or from one computer to another

5
Q

Data manipulation

A

The process of changing data to make it more organised and easier to read

6
Q

Other threats to data integrity

A
  • Human error
  • Viruses
  • Malware
  • Hacking
  • System failures
7
Q

Types of insufficient data

A

  • Data from only one source
  • Data that keeps updating
  • Outdated data
  • Geographically limited data

8
Q

Ways to address insufficient data

A
  • Identify trends with the available data
  • Wait for more data if time allows
  • Talk with stakeholders and adjust your objective.
  • Look for a new dataset
9
Q

Population

A

All possible data values in a certain dataset

10
Q

Sample size

A

The number of items in a sample, where a sample is a part of a population chosen to be representative of that population.

11
Q

Sampling bias

A

A sample isn’t representative of the population as a whole

12
Q

Random sampling

A

A way of selecting a sample from a population so that every possible sample type has an equal chance of being chosen.
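
A minimal sketch of a simple random sample in Python, assuming a hypothetical population of 100 customer IDs and an arbitrary seed so the draw can be repeated. The names `population` and `sample` are illustrative only.

```python
import random

# Hypothetical population: customer IDs 1 through 100.
population = list(range(1, 101))

# A fixed seed (an arbitrary choice) makes the draw repeatable.
rng = random.Random(42)

# Draw 10 distinct members without replacement.
# Every member of the population is equally likely to be picked.
sample = rng.sample(population, k=10)
print(sample)
```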

13
Q

Margin of error

A

Because a sample is used to represent a population, the sample's results are expected to differ from what the result would have been if you had surveyed the entire population. The margin of error measures how large that difference may be.

14
Q

Statistical Power

A

The probability of getting meaningful results from a test.

15
Q

Hypothesis testing

A

A way to see if a survey or experiment has meaningful results.

16
Q

Statistically significant

A

If a test is statistically significant, it means the results of the test are real and not an error caused by random chance.

17
Q

Example

A

Usually, you need a statistical power of at least 0.8, or 80%, to consider your results statistically significant.

18
Q

Confidence level

A

The probability that your sample accurately reflects the greater population.

19
Q

Example

A

Having a 99% confidence level is ideal, but most industries hope for at least a 90% or 95% confidence level.

20
Q

Margin of error

A

The maximum amount that the sample results are expected to differ from those of the actual population.

21
Q

Estimated response rate

A

If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received it.

22
Q

To calculate margin of error you need

A
  • Population size
  • Sample size
  • Confidence level
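
One common way to combine these three inputs, sketched in Python. This is not necessarily the calculator the course uses. It assumes a survey proportion, the normal approximation, a worst-case proportion of 0.5, and a finite population correction. The function name `margin_of_error` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(population_size, sample_size, confidence_level, proportion=0.5):
    """Approximate margin of error for a proportion (assumptions noted above)."""
    # z-score that matches the two-sided confidence level, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Standard error of a proportion; 0.5 is the most conservative assumption.
    standard_error = sqrt(proportion * (1 - proportion) / sample_size)
    # Finite population correction: shrinks the error when the sample
    # is a large share of the population.
    correction = sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * correction


# Hypothetical survey: 100 responses from a population of 500 at 95% confidence.
moe = margin_of_error(population_size=500, sample_size=100, confidence_level=0.95)
print(f"{moe:.1%}")
```

A larger sample or a lower confidence level gives a smaller margin of error.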
23
Q

DATEDIF

A

A spreadsheet function that calculates the number of days, months, or years between two dates.
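
A rough Python analogue of the date difference described above, as a sketch only. The helper `datedif` is hypothetical, covers just the units "D", "M", and "Y", and may not match every spreadsheet edge case (for example, end-of-month dates).

```python
from datetime import date


def datedif(start, end, unit):
    """Whole days ("D"), months ("M"), or years ("Y") between two dates."""
    if end < start:
        raise ValueError("end date must not be before start date")
    if unit == "D":
        return (end - start).days
    # Count calendar months, then drop the last one if it is not complete.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    if unit == "M":
        return months
    if unit == "Y":
        return months // 12
    raise ValueError("unit must be 'D', 'M', or 'Y'")


print(datedif(date(2023, 1, 1), date(2023, 3, 1), "D"))
print(datedif(date(2020, 1, 15), date(2021, 3, 10), "M"))
```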

24
Q

Dirty data

A

Data that is incomplete, incorrect, or irrelevant to the problem you’re trying to solve.

25
Q

Clean data

A

Data that is complete, correct, and relevant to the problem you're trying to solve.

26
Q

Data engineers

A

Transform data into a useful format for analysis and give it a reliable infrastructure

27
Q

Data warehousing specialists

A

Develop processes and procedures to effectively store and organise data.

28
Q

Null

A

An indication that a value does not exist in a dataset.

29
Q

Field

A

A single piece of information from a row or column of a spreadsheet

30
Q

Field length

A

A tool for determining how many characters can be keyed into a field

31
Q

Data validation

A

A tool for checking the accuracy and quality of data before adding or importing it.

32
Q

Validity

A

The concept of using data integrity principles to ensure measures conform to defined business rules or constraints.

33
Q

Validity examples

A

Data collected five years ago used technology that is not approved or supported by the business.

34
Q

Accuracy

A

The degree of conformity of a measure to a standard or a true value.

35
Q

Accuracy examples

A

Addresses in the business database are identified as incorrect when compared to the public postal service database.

36
Q

Completeness

A

The degree to which all required measures are known.

37
Q

Completeness example

A

A null or missing value for the item "number of employees per store".

38
Q

Consistency

A

The degree to which a set of measures is equivalent across systems.

39
Q

Consistency example

A

Date of store opening stored in both MM/DD/YYYY and MM/YY formats.

40
Q

Merger

A

An agreement that unites two organisations into a single new one.

41
Q

Data merging

A

The process of combining two or more datasets into a single dataset.

42
Q

Compatibility

A

How well two or more datasets are able to work together.

43
Q

Questions analysts ask while merging two databases

A
  • Do I have all the data I need?
  • Does the data I need exist within these datasets?
  • Does the data need to be cleaned, or are they ready for me to use?
  • Are the datasets cleaned to the same standard?
44
Q

Transposing

A

The user converts the data from the current long format (more rows than columns) to the wide format (more columns than rows).
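
A small Python sketch of the same idea, assuming a hypothetical table stored as a list of rows. `zip(*rows)` swaps rows and columns, which is what transposing does in a spreadsheet.

```python
# Hypothetical long-format table: more rows than columns.
long_rows = [
    ["month", "sales"],
    ["Jan", 100],
    ["Feb", 120],
    ["Mar", 90],
]

# zip(*rows) pairs up the first items of every row, then the second items,
# so each original column becomes a row.
wide_rows = [list(column) for column in zip(*long_rows)]

for row in wide_rows:
    print(row)
```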

45
Q

Conditional formatting

A

A spreadsheet tool that changes how cells appear when values meet specific conditions.

46
Q

Remove duplicates

A

A tool that automatically searches for and eliminates duplicate entries from a spreadsheet.
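
For comparison, a minimal Python sketch of removing duplicate rows while keeping the first occurrence of each. The rows are made-up sample data, and the spreadsheet tool itself may offer options (such as choosing which columns to compare) that this sketch does not.

```python
# Hypothetical contact list with one repeated entry.
rows = [
    ("Ana", "ana@example.com"),
    ("Ben", "ben@example.com"),
    ("Ana", "ana@example.com"),
]

# dict keys are unique and keep insertion order,
# so this drops later repeats and preserves the original order.
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)
```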

47
Q

Text String

A

A group of characters within a cell, most often composed of letters, numbers or both.

48
Q

Split

A

A tool that divides text around a specified character and puts each fragment into a new or separate cell.
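
The same operation sketched in Python with `str.split`, assuming made-up values. The specified character (the delimiter) is passed as the argument, and each fragment ends up as a separate list item instead of a separate cell.

```python
# Hypothetical cell values.
full_name = "Doe, Jane"
order_date = "2023-05-17"

# Split around ", " and unpack the two fragments.
last_name, first_name = full_name.split(", ")

# Split around "-" to get year, month, and day as separate strings.
date_parts = order_date.split("-")

print(first_name, last_name, date_parts)
```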

49
Q

Specified text separator

A

Delimiter

50
Q

Concatenate

A

A function that joins multiple text strings into a single string

51
Q

Function

A

A set of instructions that performs a specific calculation using the data in a spreadsheet

52
Q

COUNTIF

A

A function that returns the number of cells that match a specified value

53
Q

Syntax

A

A predetermined structure that includes all required information and its proper placement

54
Q

COUNTIF Function example

A

=COUNTIF(range, "value")
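
A minimal Python sketch of what this formula counts, assuming a hypothetical helper named `countif` and a made-up column of payment statuses. It handles exact matches only, not criteria expressions such as ">10".

```python
def countif(cells, value):
    """Count how many cells are equal to the given value."""
    return sum(1 for cell in cells if cell == value)


# Hypothetical column of payment statuses, including one blank cell.
statuses = ["paid", "unpaid", "paid", "", "paid"]

paid_count = countif(statuses, "paid")
blank_count = countif(statuses, "")
print(paid_count, blank_count)
```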

55
Q

LEN

A

A function that tells you the length of a text string by counting the number of characters it contains.

56
Q

LEN function

A

=LEN(range)

57
Q

LEFT

A

A function that gives you a set number of characters from the left side of the text string.

58
Q

RIGHT

A

A function that gives you a set number of characters from the right side of a text string.

59
Q

Left function example

A

=LEFT(range, number of characters)

60
Q

Right function example

A

=RIGHT(range, number of characters)

61
Q

MID

A

A function that gives you a segment from the middle of a text string.

62
Q

MID function example

A

=MID(range, reference starting point, number of middle characters)
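
LEN, LEFT, RIGHT, and MID map closely onto Python's `len` and string slicing. The sketch below uses hypothetical helper names and a made-up value. The main difference to watch for is that spreadsheet positions start at 1 while Python indexes start at 0.

```python
def left(text, count):
    """First `count` characters of the text."""
    return text[:count]


def right(text, count):
    """Last `count` characters of the text."""
    return text[-count:] if count > 0 else ""


def mid(text, start, count):
    """`count` characters beginning at 1-based position `start`."""
    return text[start - 1:start - 1 + count]


code = "CA-94043"  # hypothetical cell value
print(len(code), left(code, 2), right(code, 5), mid(code, 4, 3))
```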

63
Q

CONCATENATE

A

=CONCATENATE(item 1, item 2)

64
Q

Trim

A

A function that removes leading, trailing, and repeated spaces in data.

65
Q

Trim function syntax

A

=TRIM(range)
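
A Python sketch of the same clean-up, assuming a hypothetical helper named `trim` that only deals with ordinary space characters.

```python
def trim(text):
    """Remove leading, trailing, and repeated spaces."""
    # Splitting on a single space leaves empty strings where spaces repeat;
    # dropping those and re-joining leaves exactly one space between words.
    return " ".join(word for word in text.split(" ") if word)


messy = "  Jane   Doe "
print(repr(trim(messy)))
```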

66
Q

Sorting

A

Arranging data into a meaningful order to make it easier to understand, analyse, and visualise.

67
Q

Filtering

A

Showing only the data that meets specific criteria while hiding the rest.

68
Q

Pivot table

A

A data summarization tool that is used in data processing.

69
Q

VLOOKUP

A

Vertical Lookup

70
Q

VLOOKUP

A

A function that searches for a particular value in a column to return a corresponding piece of information.

71
Q

VLOOKUP Syntax

A

=VLOOKUP(data to look up, 'where to look'!range, column, FALSE)
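
An exact-match lookup (the FALSE case above) can be sketched in Python as follows. The helper `vlookup` and the product table are hypothetical, and returning `None` for a missing key is a choice made here, not the spreadsheet's behaviour.

```python
def vlookup(search_key, table, column_index):
    """Return the value in the 1-based column of the first row whose
    first column equals search_key, or None when nothing matches."""
    for row in table:
        if row[0] == search_key:
            return row[column_index - 1]
    return None


# Hypothetical lookup range: product code, name, unit price.
products = [
    ["P-100", "Notebook", 2.50],
    ["P-200", "Pen", 1.20],
]

print(vlookup("P-200", products, 3))
```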

72
Q

Data Mapping

A

The process of matching fields from one data source to another.

73
Q

Schema

A

A way of describing how something is organised

74
Q

Data Cleaning Tools

A
  • Data validation
  • Conditional formatting
  • COUNTIF
  • Sorting
  • Filtering
75
Q

Week-3 Content

A
  • Different data cleaning functions in spreadsheets and SQL
  • How SQL can be used to clean large data sets
  • Apply basic SQL functions for transforming data and cleaning strings
76
Q

Spreadsheets VS SQL

A

Spreadsheets
  • Generated with a program
  • Access to the data you input
  • Stored locally
  • Small datasets
  • Working independently
  • Built-in functionalities
SQL
  • A language used to interact with database programs
  • Can pull information from different sources in the database
  • Stored across a database
  • Larger datasets
  • Tracks changes across the team
  • Useful across multiple programs

77
Q

CAST

A

A SQL function that can be used to convert a value from one data type to another.
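
A small runnable illustration, assuming SQLite through Python's built-in `sqlite3` module and a made-up `purchases` table. Type names differ between database systems (SQLite uses REAL for decimal numbers), so treat the exact syntax as an assumption. It shows why the conversion matters: prices stored as text sort alphabetically, not numerically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table where prices were stored as text by mistake.
conn.execute("CREATE TABLE purchases (purchase_price TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?)",
    [("9.5",), ("89.25",), ("100.0",)],
)

# Sorting the text values compares them character by character.
as_text = [row[0] for row in conn.execute(
    "SELECT purchase_price FROM purchases ORDER BY purchase_price DESC"
)]

# CAST converts each value to a number before sorting.
as_number = [row[0] for row in conn.execute(
    "SELECT CAST(purchase_price AS REAL) AS price "
    "FROM purchases ORDER BY price DESC"
)]

print(as_text)
print(as_number)
conn.close()
```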

78
Q

Float

A

A number that contains a decimal

79
Q

Typecasting

A

Converting data from one type to another

80
Q

CONCAT()

A

Adds strings together to create new text strings that can be used as unique keys

81
Q

COALESCE

A

Returns the first non-null value in a list of values.
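
A short sketch of COALESCE in use, assuming SQLite through Python's `sqlite3` module and a hypothetical `products` table in which some product names are missing. When the name is null, the query falls back to the product code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the second row has no product name.
conn.execute("CREATE TABLE products (product TEXT, product_code TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Notebook", "P-100"), (None, "P-200")],
)

# COALESCE checks its arguments left to right and returns the first non-null one.
product_info = [row[0] for row in conn.execute(
    "SELECT COALESCE(product, product_code) AS product_info "
    "FROM products ORDER BY product_code"
)]

print(product_info)
conn.close()
```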

82
Q

Verification

A

A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable.

83
Q

Changelog

A

A file containing a chronologically ordered list of modifications made to a project.

84
Q

See the big picture when verifying data-cleaning

A

1) Consider the business problem
2) Consider the goal
3) Consider the data

86
Q

Find and replace

A

A tool that looks for a specified search term in a spreadsheet and allows you to replace it with something else.

87
Q

COUNTA

A

A function that counts the total number of values within a specified range

88
Q

CASE statement

A

The CASE statement goes through one or more conditions and returns a value as soon as a condition is met.
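
A sketch of a CASE statement used for cleaning, assuming SQLite through Python's `sqlite3` module and a hypothetical `customers` table that contains one misspelled first name. The stored data is left unchanged; only the query result is corrected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a misspelling in row 2.
conn.execute("CREATE TABLE customers (customer_id INTEGER, first_name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Tony"), (2, "Tnoy"), (3, "Maria")],
)

# The first matching WHEN wins; ELSE keeps every other name as it is.
cleaned = conn.execute(
    "SELECT customer_id, "
    "CASE WHEN first_name = 'Tnoy' THEN 'Tony' ELSE first_name END AS cleaned_name "
    "FROM customers ORDER BY customer_id"
).fetchall()

print(cleaned)
conn.close()
```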

89
Q

Documentation

A

The process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort.

90
Q

Data documentation benefits

A

  • Recover from data-cleaning errors
  • Inform other users of changes
  • Determine the quality of data

91
Q

PAR

A

Problem
Action
Result

92
Q

Transferable skills

A

Skills and qualities that can transfer from one job or industry to another

93
Q

PAR Example

A
  • Problem: Previously absent workflow procedures.
  • Action: Implemented and communicated daily workflow procedures.
  • Result: 15% increase in productivity.
94
Q

Soft skills

A

Non-technical traits and behaviors that relate to how you work.

95
Q

Junior or associate data analysts

A
  • Healthcare analyst
  • Marketing analyst
  • Business intelligence analyst
  • Financial analyst