5 Data Wrangling and Manipulation Flashcards

1
Q

What is the primary focus of a data analyst’s work?

A

Preparing data for use

2
Q

What is data wrangling?

A

The process of cleaning and shaping data into a specific format

3
Q

What does the term ‘manipulation’ mean in the context of data?

A

Handling and managing data in a skillful way

4
Q

List the main topics covered in the data-wrangling process.

A
  • Merging data
  • Calculating derived and reduced variables
  • Parsing your data
  • Recoding variables
  • Shaping data with common functions
5
Q

What is a key variable?

A

A variable that is present in both tables being merged, allowing rows to be matched

6
Q

What is an inner join?

A

A join that keeps only the rows whose key values appear in both of the original tables

7
Q

True or False: Inner joins include all data points from both tables.

A

False

8
Q

What is an outer join?

A

A join that includes every data point from both tables, regardless of matches

9
Q

What is a left join?

A

A join that contains all data points from the left table and matching values from the right table

10
Q

What is a right join?

A

A join that contains all data points from the right table and matching values from the left table
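
The four join types on the cards above can be compared side by side with a minimal sketch. The tables, the `runner_id` key, and the `join` helper are hypothetical, made up for illustration; a real project would more likely use a database or a data-frame library.

```python
# Hypothetical sample tables, both keyed by runner_id (the key variable).
left = {1: "Ana", 2: "Ben", 3: "Cy"}          # runner_id -> name
right = {2: "5K", 3: "10K", 4: "Marathon"}    # runner_id -> race


def join(left, right, how):
    """Return {key: (left_value, right_value)}; None marks a missing match."""
    if how == "inner":
        keys = left.keys() & right.keys()    # keys found in both tables
    elif how == "outer":
        keys = left.keys() | right.keys()    # every key from either table
    elif how == "left":
        keys = left.keys()                   # every key from the left table
    elif how == "right":
        keys = right.keys()                  # every key from the right table
    else:
        raise ValueError(f"unknown join type: {how}")
    return {k: (left.get(k), right.get(k)) for k in sorted(keys)}


inner = join(left, right, "inner")        # keys 2 and 3 only
outer = join(left, right, "outer")        # keys 1 to 4, with None for gaps
left_join = join(left, right, "left")     # keys 1 to 3
right_join = join(left, right, "right")   # keys 2 to 4
```

The inner join gives the smallest result, while the outer join gives the largest and is the one that introduces the most `None` (null) values.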

11
Q

What is data blending?

A

A temporary link between two tables through a left join without creating a new table

12
Q

What is the difference between concatenation and appending?

A

Concatenation joins two or more series end to end into one; appending adds a single new value to the end of an existing series
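
A short sketch of the distinction, using plain Python lists as stand-ins for series. The variable names and values are hypothetical.

```python
# Two hypothetical series of run distances in kilometres.
morning = [5.0, 6.2]
evening = [4.8, 7.1]

# Concatenation: join the two series end to end into a new series.
combined = morning + evening

# Appending: add one new value to the end of an existing series.
combined.append(3.3)
```

Concatenation here builds a new list and leaves `morning` and `evening` unchanged, whereas `append` modifies `combined` in place.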

13
Q

Define derived variables.

A

Variables generated based on observed data using logic

14
Q

What are metrics in the context of derived variables?

A

Derived variables that calculate a number to gauge the status of a data point

15
Q

What are flags in the context of derived variables?

A

Categorical variables that summarize the status of another variable or data point

16
Q

Fill in the blank: The majority of Key Performance Indicators (KPIs) are _______.

A

[metrics]

17
Q

What is the consequence of using both derived variables and the variables used to calculate them in an analytical model?

A

It can cause multicollinearity

18
Q

What is the purpose of a key table in a database schema?

A

To store key variables that allow for merging of other tables

19
Q

What is the most conservative join type that results in a smaller final table?

A

Inner join

20
Q

Explain the concept of pairwise deletion.

A

A missing-data technique, used in conjunction with outer joins, that leaves a row out of only the calculations that need its missing values, maximizing data use from small datasets
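
A rough sketch of the idea, contrasted with dropping every incomplete row (often called listwise deletion). The rows, column names, and `available` helper are hypothetical, and `None` stands for the nulls an outer join can produce.

```python
# Hypothetical result of an outer join; None marks a value with no match.
rows = [
    {"distance_km": 5.0, "heart_rate": 150},
    {"distance_km": 10.0, "heart_rate": None},
    {"distance_km": None, "heart_rate": 160},
]


def available(rows, column):
    """Return the non-missing values of one column."""
    return [r[column] for r in rows if r[column] is not None]


# Pairwise deletion: each calculation uses every value it can.
distances = available(rows, "distance_km")
pairwise_mean_distance = sum(distances) / len(distances)

# Listwise deletion: any row with a missing value is dropped entirely.
complete = [r for r in rows if None not in r.values()]
listwise_mean_distance = sum(r["distance_km"] for r in complete) / len(complete)
```

With these made-up rows, the pairwise mean draws on two distance values while the listwise mean is left with only one complete row.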

21
Q

What does parsing data involve?

A

Breaking down chunks of text into usable formats

22
Q

What is recoding in the data-wrangling process?

A

The process of changing variable values for clarity or analysis purposes

23
Q

What is the visual representation of an inner join?

A

A Venn diagram showing the overlap between two datasets

24
Q

What is a common characteristic of outer joins?

A

They can produce many null values when data points do not match

25
Q

How is Speed calculated?

A

Speed is calculated by dividing Distance by Time.

26
Q

What does the Speed variable represent in the context of training for a race?

A

The Speed variable is a KPI that shows performance and progress towards a goal.

27
Q

What are metrics in data analytics?

A

Metrics are derived variables that show quantitative data.

28
Q

What are flags in data analytics?

A

Flags are derived variables that show qualitative data.

29
Q

What are flags used for in data analytics?

A

Flags are categorical variables that summarize the status of another variable or data point.
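
The Speed metric and a flag built on it can be sketched together. The run data, the `GOAL_KMH` target, and the `on_pace` flag name are assumptions made for illustration.

```python
# Hypothetical training log: observed distance and time for each run.
runs = [
    {"distance_km": 5.0, "time_h": 0.5},
    {"distance_km": 10.0, "time_h": 1.25},
]

GOAL_KMH = 9.0  # assumed target speed, not taken from the cards

for run in runs:
    # Metric: a derived number, Speed = Distance / Time.
    run["speed_kmh"] = run["distance_km"] / run["time_h"]
    # Flag: a derived category summarizing the status of the metric.
    run["on_pace"] = run["speed_kmh"] >= GOAL_KMH
```

The metric carries the quantitative detail, while the flag reduces it to a yes/no status that is quicker to scan.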

30
Q

What are reduction variables?

A

Reduction variables, or aggregate variables, reduce the volume of data by summarizing multiple variables.

31
Q

List some basic methods of aggregation.

A
  • Average
  • Sum
  • Maximum
  • Minimum
  • Count
  • Distinct Count
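
The aggregation methods listed above can each be computed with the Python standard library. This is a minimal sketch over a hypothetical list of speeds.

```python
from statistics import mean

# Hypothetical speeds (km/h) from four training runs.
speeds = [10.0, 8.0, 9.5, 8.0]

# Each entry reduces the whole list to a single summary value.
summary = {
    "average": mean(speeds),
    "sum": sum(speeds),
    "maximum": max(speeds),
    "minimum": min(speeds),
    "count": len(speeds),
    "distinct_count": len(set(speeds)),  # duplicates counted once
}
```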
32
Q

What is parsing in data analytics?

A

Parsing is breaking a single large piece of data into several smaller pieces that can be easily identified and processed.

33
Q

What is tokenization in Natural Language Processing (NLP)?

A

Tokenization is the process of breaking up text into words, with each becoming its own object or token.
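
A minimal tokenizer sketch using a regular expression. The `tokenize` name and the rule of treating each punctuation mark as its own token are assumptions; NLP libraries apply more elaborate rules.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("This is a sentence?")
```

For the sentence above, `tokens` holds the four words followed by the question mark as a separate token.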

34
Q

What is recoding in data analytics?

A

Recoding is turning variables into a different format, such as translating quantitative variables into qualitative variables.

35
Q

How can numeric variables be recoded into categories?

A

Numeric variables can be recoded into categories based on ranges.
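
A sketch of range-based recoding in the spirit of a SpeedCategory variable. The cut-off values and category labels are hypothetical.

```python
def speed_category(speed_kmh):
    """Recode a numeric speed into a qualitative category by range."""
    # Cut-offs of 8 and 10 km/h are assumed for illustration only.
    if speed_kmh < 8.0:
        return "slow"
    if speed_kmh < 10.0:
        return "moderate"
    return "fast"


speeds = [7.2, 8.0, 9.9, 10.0]
categories = [speed_category(s) for s in speeds]
```

Each range is closed on its lower bound, so a value that sits exactly on a cut-off falls into the higher category.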

36
Q

What is dummy coding?

A

Dummy coding creates a new binary variable for every possible category in the original variable.

37
Q

Why should you drop one dummy-coded variable from a model?

A

Dropping one dummy-coded variable prevents multicollinearity, because the dropped category can be perfectly predicted from the remaining dummy variables.
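
Dummy coding and the drop-one rule can be sketched as follows. The `shoes` variable, the `is_` prefix, and the `dummy_code` helper are hypothetical.

```python
# Hypothetical categorical variable: shoe type worn on each run.
shoes = ["trail", "road", "track", "road"]
categories = sorted(set(shoes))  # ['road', 'track', 'trail']


def dummy_code(values, categories, drop_first=False):
    """Create one binary variable per category (optionally dropping one)."""
    kept = categories[1:] if drop_first else categories
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]


full = dummy_code(shoes, categories)
reduced = dummy_code(shoes, categories, drop_first=True)
```

In `full`, the dummy variables in every row sum to 1, so any one of them can be predicted from the rest. In `reduced`, a row of all zeros implies the dropped category (`road` here).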

38
Q

What is the challenge with date variables in data analytics?

A

Date variables are handled differently by every program, making them difficult to work with.
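
One way to cope with mixed date formats is to try a list of known formats in turn. This is a sketch only: the sample strings and the format list are assumptions, and month names such as "March" are parsed according to the current locale.

```python
from datetime import datetime

# Hypothetical formats the same date might arrive in from different tools.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]


def parse_date(text):
    """Try each known format until one parses; return a date object."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")


raw = ["2024-03-09", "09/03/2024", "March 9, 2024"]
parsed = [parse_date(r) for r in raw]
```

Ambiguous strings such as `09/03/2024` are read here as day/month/year because that format is listed; a source that uses month/day/year would need a different list.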

39
Q

What are conditional operators in programming?

A

Conditional operators are code snippets that allow for the creation of conditional logic.

40
Q

List the four basic conditional operators.

A
  • IF
  • AND
  • OR
  • NOT
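
The four operators above map onto Python's `if`, `and`, `or`, and `not`. The function names and thresholds in this sketch are hypothetical.

```python
def should_rest(distance_km, speed_kmh, injured):
    """IF the run was long AND fast, OR the runner is injured, rest."""
    if (distance_km > 15 and speed_kmh > 10) or injured:
        return True
    return False


def can_train(distance_km, speed_kmh, injured):
    """NOT inverts the rest decision."""
    return not should_rest(distance_km, speed_kmh, injured)
```

The parentheses make the grouping explicit, although `and` already binds more tightly than `or` in Python.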
41
Q

What does transposing data involve?

A

Transposing data involves changing the axis of the data, turning columns into rows and vice versa.
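
A minimal sketch of transposing a small table stored as a list of rows. The data is hypothetical.

```python
# Hypothetical wide table: each row is a variable, each column a runner.
rows = [
    ["runner", "Ana", "Ben"],
    ["speed_kmh", 10.0, 8.0],
]

# zip(*rows) pairs up the i-th item of every row, turning columns into rows.
transposed = [list(column) for column in zip(*rows)]
```

Transposing `transposed` again with the same expression returns the original layout.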

42
Q

What are system functions in data analytics?

A

System functions provide information about file paths and the local environment during data-wrangling.
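
In Python, this kind of information is available from the standard library. The `data/runs.csv` path below is a hypothetical example and is not expected to exist.

```python
import os
import platform
from pathlib import Path

cwd = Path.cwd()                        # current working directory
data_file = cwd / "data" / "runs.csv"   # hypothetical input file path

info = {
    "working_dir": str(cwd),
    "operating_system": platform.system(),
    "path_separator": os.sep,
    "file_exists": data_file.exists(),  # check before trying to load
}
```

Building paths with `pathlib` rather than string concatenation keeps the script portable across operating systems with different path separators.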

43
Q

What is the primary focus of derived variables?

A

Derived variables focus on summarizing data.

44
Q

What is the importance of parsing data for NLP?

A

Parsing data is necessary for translating language into actionable data in NLP.

45
Q

Fill in the blank: The process of breaking a sentence into words is known as _______.

A

tokenization

46
Q

True or False: Reduction variables are used to increase the volume of data.

A

False

47
Q

Fill in the blank: Recoding variables can help to translate quantitative variables into _______.

A

qualitative variables

48
Q

What is the purpose of creating a SpeedCategory variable?

A

The SpeedCategory variable groups speeds based on average performance during a race.

49
Q

The following picture represents what kind of join?

A

Inner join

An inner join adds only the data that both datasets have in common to the new table.

50
Q

Only using the Distinct Count of a dataset is an example of what?

A

Reduction

Distinct Count is used to summarize data and reduces the amount of data to process.

51
Q

The following is an example of what concept?
Data = ‘This is a sentence?’
Data = [‘This’, ‘is’, ‘a’, ‘sentence’, ‘?’]

A

Parsing

Parsing involves breaking down large chunks of data into smaller, processable pieces.

52
Q

The following is an example of what concept?

A

Dummy coding

Dummy coding creates a new variable for every possible outcome of a categorical variable.

53
Q

Which of the following is a logical operator?
  A. IF
  B. NOT
  C. OR
  D. All of the above are logical operators

A

All of the above are logical operators

Common logical operators include IF, AND, OR, and NOT.

54
Q

Fill in the blank: Distinct Count is an example of _______.

A

reduction

55
Q

Fill in the blank: Parsing is the concept of breaking down large chunks of data into _______.

A

smaller pieces

56
Q

Fill in the blank: Dummy coding is a specific type of recoding that creates a new variable for every possible _______ of a categorical variable.

A

outcome

57
Q

True or False: An inner join includes all data from both datasets.

A

False

An inner join includes only data that both datasets have in common.

58
Q

What is the main purpose of using Distinct Count in data analysis?

A

To summarize data and reduce processing load