5 Data Wrangling and Manipulation Flashcards
What is the primary focus of a data analyst’s work?
Preparing data for use
What is data wrangling?
The process of cleaning and shaping data into a specific format
What does the term ‘manipulation’ mean in the context of data?
Handling and managing data in a skillful way
List the main topics covered in the data-wrangling process.
- Merging data
- Calculating derived and reduced variables
- Parsing your data
- Recoding variables
- Shaping data with common functions
What is a key variable?
A variable that is present in both tables being merged, allowing rows to be matched
What is an inner join?
A join that keeps only the rows whose key values are found in both of the original tables
True or False: Inner joins include all data points from both tables.
False
What is an outer join?
A join that includes every data point from both tables, regardless of matches
What is a left join?
A join that contains all data points from the left table and matching values from the right table
What is a right join?
A join that contains all data points from the right table and matching values from the left table
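The four join types above can be compared side by side in a small sketch. It uses plain Python dictionaries rather than any particular database or library, and the tables (`left`, `right`), the key `runner_id`, and the helper `join` are hypothetical names chosen for illustration.

```python
# Hypothetical example tables, both keyed on runner_id (the key variable).
left = {1: "Ana", 2: "Ben", 3: "Cy"}      # runner_id -> name
right = {2: 5.0, 3: 10.0, 4: 21.1}        # runner_id -> distance_km


def join(left, right, how):
    """Return rows of (key, left_value, right_value) for the given join type."""
    if how == "inner":
        keys = left.keys() & right.keys()   # keys present in both tables
    elif how == "outer":
        keys = left.keys() | right.keys()   # keys present in either table
    elif how == "left":
        keys = left.keys()                  # every key from the left table
    elif how == "right":
        keys = right.keys()                 # every key from the right table
    else:
        raise ValueError(f"unknown join type: {how}")
    # A missing match becomes None, the analogue of a null value.
    return [(k, left.get(k), right.get(k)) for k in sorted(keys)]


inner_rows = join(left, right, "inner")
outer_rows = join(left, right, "outer")
left_rows = join(left, right, "left")
right_rows = join(left, right, "right")
```

In this sketch the inner join yields the smallest table (two rows), while the outer join yields four rows, two of which contain a `None`.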
What is data blending?
A temporary link between two tables through a left join without creating a new table
What is the difference between concatenation and appending?
Concatenation merges data in a series; appending adds a new value to the end of an existing series
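As a rough illustration of the distinction, assuming Python lists stand in for the data series (the variable names are hypothetical):

```python
first = [5.0, 10.0]
second = [21.1, 42.2]

# Concatenation: merge two series end to end into a new series.
combined = first + second

# Appending: add one new value to the end of an existing series.
first.append(15.0)
```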
Define derived variables.
Variables generated based on observed data using logic
What are metrics in the context of derived variables?
Derived variables that calculate a number to gauge the status of a data point
What are flags in the context of derived variables?
Categorical variables that summarize the status of another variable or data point
Fill in the blank: The majority of Key Performance Indicators (KPIs) are _______.
[metrics]
What is the consequence of using both derived variables and the variables used to calculate them in an analytical model?
It can cause multicollinearity
What is the purpose of a key table in a database schema?
To store key variables that allow for merging of other tables
What is the most conservative join type that results in a smaller final table?
Inner join
Explain the concept of pairwise deletion.
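A technique that uses every available data point for each individual calculation instead of discarding whole rows that have a missing value; used in conjunction with outer joins to maximize data use from small datasets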
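One way to picture pairwise deletion is to compare it with dropping every incomplete row. The sketch below is only an illustration under assumed data: the rows, the column names, and the helper `pairwise_mean` are hypothetical, and `None` stands in for the nulls an outer join can produce.

```python
# Hypothetical rows such as an outer join might produce; None marks a null.
rows = [
    {"speed": 10.0, "heart_rate": 150},
    {"speed": 8.0, "heart_rate": None},
    {"speed": None, "heart_rate": 140},
]


def pairwise_mean(rows, column):
    """Average one column using every row where that column is present."""
    values = [r[column] for r in rows if r[column] is not None]
    return sum(values) / len(values)


# Pairwise deletion: each calculation keeps all rows it can use.
mean_speed = pairwise_mean(rows, "speed")
mean_heart_rate = pairwise_mean(rows, "heart_rate")

# For contrast, dropping every incomplete row leaves a single usable row.
complete_rows = [r for r in rows if None not in r.values()]
```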
What does parsing data involve?
Breaking down chunks of text into usable formats
What is recoding in the data-wrangling process?
The process of changing variable values for clarity or analysis purposes
What is the visual representation of an inner join?
A Venn diagram showing the overlap between two datasets
What is a common characteristic of outer joins?
They can produce many null values when data points do not match
How is Speed calculated?
Speed is calculated by dividing Distance by Time.
What does the Speed variable represent in the context of training for a race?
The Speed variable is a KPI that shows performance and progress towards a goal.
What are metrics in data analytics?
Metrics are derived variables that show quantitative data.
What are flags in data analytics?
Flags are derived variables that show qualitative data.
What are flags used for in data analytics?
Flags are categorical variables that summarize the status of another variable or data point.
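A minimal sketch of a metric and a flag derived from the same observed data, assuming a small list of training runs. The field names and the goal pace `GOAL_KMH` are hypothetical values chosen for illustration.

```python
# Hypothetical training log: observed distance and time for each run.
runs = [
    {"distance_km": 5.0, "time_h": 0.5},
    {"distance_km": 10.0, "time_h": 1.25},
]
GOAL_KMH = 9.0  # assumed target pace, for illustration only

for run in runs:
    # Metric: a derived number, Speed = Distance / Time.
    run["speed_kmh"] = run["distance_km"] / run["time_h"]
    # Flag: a derived category summarizing the status of the metric.
    run["met_goal"] = "yes" if run["speed_kmh"] >= GOAL_KMH else "no"
```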
What are reduction variables?
Reduction variables, or aggregate variables, reduce the volume of data by summarizing multiple variables.
List some basic methods of aggregation.
- Average
- Sum
- Maximum
- Minimum
- Count
- Distinct Count
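The aggregation methods listed above can be sketched with Python built-ins and the standard `statistics` module. The `speeds` values are made up for illustration.

```python
from statistics import mean

# Hypothetical speeds (km/h) from four runs.
speeds = [10.0, 8.0, 10.0, 12.0]

# Each aggregate reduces the whole list to a single summary value.
summary = {
    "average": mean(speeds),
    "sum": sum(speeds),
    "maximum": max(speeds),
    "minimum": min(speeds),
    "count": len(speeds),
    "distinct_count": len(set(speeds)),  # duplicates are counted once
}
```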
What is parsing in data analytics?
Parsing is breaking a single large piece of data into several smaller pieces that can be easily identified and processed.
What is tokenization in Natural Language Processing (NLP)?
Tokenization is the process of breaking up text into words, with each becoming its own object or token.
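A simple tokenizer can be sketched with a regular expression. This is one possible approach, not the method of any particular NLP library, and the function name `tokenize` is hypothetical.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("This is a sentence?")
```

Here `tokens` holds five items: the four words and the question mark, each as its own token.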
What is recoding in data analytics?
Recoding is turning variables into a different format, such as translating quantitative variables into qualitative variables.
How can numeric variables be recoded into categories?
Numeric variables can be recoded into categories based on ranges.
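A sketch of range-based recoding, assuming speeds in km/h. The cut points (8.0 and 11.0) and the category labels are hypothetical and would depend on the data in practice.

```python
def speed_category(speed_kmh):
    """Recode a numeric speed into a qualitative category by range."""
    # Assumed cut points, for illustration only.
    if speed_kmh < 8.0:
        return "slow"
    if speed_kmh < 11.0:
        return "average"
    return "fast"


categories = [speed_category(s) for s in [6.5, 8.0, 10.9, 12.0]]
```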
What is dummy coding?
Dummy coding creates a new binary variable for every possible category in the original variable.
Why should you drop one dummy-coded variable from a model?
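Dropping one dummy-coded variable prevents multicollinearity: with the full set, any one dummy variable is perfectly predicted by the others, because the dropped category is implied whenever all remaining dummy variables are 0.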
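Dummy coding with one category dropped might look like the sketch below. The helper `dummy_code`, its `drop_first` parameter, and the `is_` prefix are hypothetical names; dropping the alphabetically first category is an arbitrary choice made here for illustration.

```python
def dummy_code(values, drop_first=True):
    """Turn a categorical variable into one binary variable per category."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the reference level: it is
        # implied when every remaining dummy variable is 0.
        categories = categories[1:]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


rows = dummy_code(["slow", "fast", "average"])
```

In this example "average" is the reference category, so its row contains only zeros.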
What is the challenge with date variables in data analytics?
Date variables are handled differently by every program, making them difficult to work with.
What are conditional operators in programming?
Conditional operators are code snippets that allow for the creation of conditional logic.
List the four basic conditional operators.
- IF
- AND
- OR
- NOT
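The four operators above map onto Python's `if`, `and`, `or`, and `not`. The rule below is a made-up example of conditional logic, and the function name and thresholds are hypothetical.

```python
def label_run(speed_kmh, distance_km, injured):
    """Label a run using IF, AND, OR, and NOT in one condition."""
    # IF the runner is NOT injured AND (the run was fast OR long) ...
    if not injured and (speed_kmh >= 10.0 or distance_km >= 20.0):
        return "strong"
    return "regular"


labels = [
    label_run(10.5, 5.0, injured=False),   # fast
    label_run(8.0, 21.1, injured=False),   # long
    label_run(12.0, 5.0, injured=True),    # fast but injured
    label_run(8.0, 5.0, injured=False),    # neither fast nor long
]
```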
What does transposing data involve?
Transposing data involves changing the axis of the data, turning columns into rows and vice versa.
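For a table stored as a list of rows, transposing can be sketched with the built-in `zip`. The table contents are hypothetical.

```python
# Hypothetical table: each inner list is one row.
table = [
    ["runner", "Ana", "Ben"],
    ["speed_kmh", 10.0, 8.0],
]

# zip(*table) groups the first items of every row, then the second
# items, and so on, so rows become columns and vice versa.
transposed = [list(row) for row in zip(*table)]
```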
What are system functions in data analytics?
System functions provide information about file paths and the local environment during data-wrangling.
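In Python, functions of this kind live in the standard `os` module. The sketch below assumes a hypothetical file `data/runs.csv` under the current working directory; the file need not exist for the code to run.

```python
import os

# Where is the script running? (the local environment)
cwd = os.getcwd()

# Build a file path in a way that works across operating systems.
# "data" and "runs.csv" are hypothetical names for illustration.
data_path = os.path.join(cwd, "data", "runs.csv")

# Check for the file before trying to read it.
file_exists = os.path.exists(data_path)
```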
What is the primary focus of derived variables?
Derived variables focus on summarizing data.
What is the importance of parsing data for NLP?
Parsing data is necessary for translating language into actionable data in NLP.
Fill in the blank: The process of breaking a sentence into words is known as _______.
tokenization
True or False: Reduction variables are used to increase the volume of data.
False
Fill in the blank: Recoding variables can help to translate quantitative variables into _______.
qualitative variables
What is the purpose of creating a SpeedCategory variable?
The SpeedCategory variable groups speeds based on average performance during a race.
The following picture represents what kind of join?
Inner join
An inner join adds only the data that both datasets have in common to the new table.
Only using the Distinct Count of a dataset is an example of what?
Reduction
Distinct Count is used to summarize data and reduces the amount of data to process.
The following is an example of what concept? Data = ‘This is a sentence?’ Data = [‘This’, ‘is’, ‘a’, ‘sentence’, ‘?’]
Parsing
Parsing involves breaking down large chunks of data into smaller, processable pieces.
The following is an example of what concept?
Dummy coding
Dummy coding creates a new variable for every possible outcome of a categorical variable.
Which of the following is a logical operator? A. IF B. NOT C. OR D. All of the above are logical operators
All of the above are logical operators
Common logical operators include IF, AND, OR, and NOT.
Fill in the blank: Distinct Count is an example of _______.
Reduction
Fill in the blank: Parsing is the concept of breaking down large chunks of data into _______.
smaller pieces
Fill in the blank: Dummy coding is a specific type of recoding that creates a new variable for every possible _______ of a categorical variable.
outcome
True or False: An inner join includes all data from both datasets.
False
An inner join includes only data that both datasets have in common.
What is the main purpose of using Distinct Count in data analysis?
To summarize data and reduce processing load