Data Wrangling Flashcards
What is Data Wrangling?
Data wrangling is the process of retrieving data and conforming it into a form we can use. This involves finding data in files, databases, and web APIs, and then cleaning it.
What are some common data formats?
CSV, JSON, XML.
What is a JSON file?
JSON is a text file format that resembles nested Python dictionaries and lists. When a file holds a series of JSON objects, each object corresponds to a single data point (a row in CSV terms), each key corresponds to a column name, and each value is that data point’s value for that column.
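To make the JSON-to-CSV correspondence concrete, here is a minimal sketch using only Python's standard library. The sample records and the column names `name` and `age` are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data: a JSON array of objects, one object per data point.
raw = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'

# json.loads turns the text into a list of Python dicts.
records = json.loads(raw)

# Write the same records as CSV: keys become the header, values become rows.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "age"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(csv_text)
```

Each dict in `records` ends up as one CSV row, and the dict keys line up with the header.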
What is a Schema?
A schema is essentially the blueprint for how we want to organize our database.
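As one possible illustration of a schema acting as a blueprint, the sketch below declares a table in an in-memory SQLite database. The table name `people` and its columns are hypothetical and chosen only for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema: which columns exist, their types, and the rules they must obey.
conn.execute(
    """
    CREATE TABLE people (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age  INTEGER CHECK (age >= 0)
    )
    """
)

# A row that follows the blueprint is accepted.
conn.execute("INSERT INTO people (name, age) VALUES (?, ?)", ("Ada", 36))

# A row that breaks the blueprint (missing name) is rejected by the database.
try:
    conn.execute("INSERT INTO people (name, age) VALUES (?, ?)", (None, 20))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(rejected, row_count)
```

Declaring constraints in the schema means the database itself refuses rows that do not fit the intended structure.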
What is Sanity Checking?
Sanity checking is the process of checking whether the data we’re using makes sense. This can be done by asking a few questions:
- Does the data make sense?
- Is there a problem?
- Does the data look like we expect it to?
What are the three questions for Sanity Checking?
- Does the data make sense?
- Is there a problem?
- Does the data look like we expect it to?
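The three questions can be turned into simple automated checks. The following is a rough sketch, not a complete validator: the function name `sanity_check`, the expected columns, and the 0 to 120 age range are all assumptions made for the example.

```python
def sanity_check(rows, expected_columns):
    """Return a list of (row_index, problem) pairs for rows that look wrong."""
    problems = []
    for i, row in enumerate(rows):
        # Does the data look like we expect it to? (same columns in every row)
        if set(row) != set(expected_columns):
            problems.append((i, "unexpected columns"))
            continue
        age = row["age"]
        # Is there a problem? (missing value)
        if age is None:
            problems.append((i, "missing age"))
        # Does the data make sense? (assumed plausible range for a human age)
        elif not 0 <= age <= 120:
            problems.append((i, "age out of range"))
    return problems


# Hypothetical sample rows, three of which should be flagged.
rows = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": -5},
    {"name": "Alan", "age": None},
    {"name": "Edsger"},
]
problems = sanity_check(rows, ["name", "age"])
print(problems)
```

Collecting problems into a list, rather than stopping at the first one, gives a full picture of what needs cleaning.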
Why are values missing from a dataset?
Values can be missing because systems malfunction while recording data, or because people don’t enter it, whether through basic human error or because survey respondents choose not to answer.
What is imputation?
Imputation is the act of filling in missing values with substitutes estimated by one of a variety of methods (for example, the mean or median of the observed values), so that the dataset stays complete and consistent.
Why should we impute?
Imputation is especially useful for maintaining representation in our data, and when we have so little data to work with that discarding incomplete records isn’t an option.
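As a small illustration of imputation, the sketch below fills missing values with the median of the observed ones. Median imputation is only one of many possible methods, and the `ages` list is made-up sample data.

```python
from statistics import median

# Hypothetical column with two missing values, represented as None.
ages = [36, None, 45, 29, None]

# Compute the fill value from the values we actually observed.
observed = [a for a in ages if a is not None]
fill_value = median(observed)

# Replace each missing entry; keep every row instead of dropping it.
imputed = [fill_value if a is None else a for a in ages]

print(imputed)
```

All five rows survive, which matters when the dataset is too small to afford dropping incomplete records.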