Data Wrangling Flashcards

Question 1

Q

What is Data Wrangling?

Answer

A

Data wrangling is the process of retrieving data and conforming it into a form we can use. This involves finding data in files, databases, and web APIs, and then cleaning it.

Question 2

Q

What are some common data formats?

Answer

A

CSV, JSON, XML.
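
As a rough illustration, the same record can be written in each of the three formats and read with Python's standard library. The sample record (a name and an age) is made up for this sketch and is not from any real dataset.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample record, written three ways.
csv_text = "name,age\nAda,36\n"
json_text = '{"name": "Ada", "age": 36}'
xml_text = "<person><name>Ada</name><age>36</age></person>"

# CSV: the header row supplies the keys; every value is read as a string.
from_csv = next(csv.DictReader(io.StringIO(csv_text)))

# JSON: parsed straight into a dict; numbers keep their numeric type.
from_json = json.loads(json_text)

# XML: each child element becomes a key; element text is always a string.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}
```

Note that CSV and XML hand back the age as the string "36", while JSON keeps it as the number 36, so type conversion is often part of the wrangling step.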

Question 3

Q

What is a JSON file?

Answer

A

JSON is a text-based file format that resembles a series of Python dictionaries. Each “dict” (JSON object) holds a single data point (a row, in CSV terms): each key corresponds to a column name, and each value is that data point’s value for that column.
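
The comparison with Python dictionaries can be seen directly with the standard `json` module. This is a minimal sketch; the two records below are invented sample data.

```python
import json

# Hypothetical sample data: a JSON array holding two objects with the same keys.
text = '[{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]'

records = json.loads(text)  # parsed into a list of Python dicts

# Each dict is one "row"; its keys play the role of CSV column names.
columns = list(records[0].keys())
ages = [row["age"] for row in records]
```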

Question 4

Q

What is a Schema?

Answer

A

A schema is essentially the blueprint for how we want to organize our database.
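
One way to picture a schema is a table definition in SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `people` table and its columns are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The schema: which columns exist, their types, and their constraints.
conn.execute(
    """
    CREATE TABLE people (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age  INTEGER
    )
    """
)

# Ask SQLite to describe the table; index 1 of each row is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(people)")]

# The schema is enforced: a row without a name is rejected.
try:
    conn.execute("INSERT INTO people (age) VALUES (30)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```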

Question 5

Q

What is Sanity Checking?

Answer

A

Sanity checking is the process of checking whether the data we’re using makes sense. This can be done by asking a few questions:

Does the data make sense?
Is there a problem?
Does the data look like we expect it to?
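
In code, these questions usually turn into simple checks on each record. The following is only a sketch: the `sanity_check` helper, the `age` field, and the 0 to 120 range are assumptions made for the example, not rules from any particular dataset.

```python
def sanity_check(rows):
    """Return a list of human-readable problems found in rows (hypothetical helper)."""
    problems = []
    for i, row in enumerate(rows):
        age = row.get("age")
        if age is None:
            # Is there a problem? A missing value is one.
            problems.append(f"row {i}: age is missing")
        elif not isinstance(age, (int, float)):
            # Does the data look like we expect it to? Ages should be numbers.
            problems.append(f"row {i}: age is not a number")
        elif not 0 <= age <= 120:
            # Does the data make sense? The range here is an assumed plausible one.
            problems.append(f"row {i}: age {age} is out of range")
    return problems


# Invented sample rows: one fine, one impossible, one missing the field.
rows = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": -5}, {"name": "Cy"}]
problems = sanity_check(rows)
```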

Question 6

Q

What are the three questions for Sanity Checking?

Answer

A

Does the data make sense?
Is there a problem?
Does the data look like we expect it to?

Question 7

Q

Why are values missing from a dataset?

Answer

A

Values can go missing because of system malfunctions during recording, basic human error during data entry, or people choosing not to provide the data, such as survey respondents declining to answer.

Question 8

Q

What is imputation?

Answer

A

Imputation is the act of filling in missing values, using a variety of methods, to help keep the data accurate and consistent.
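
Mean imputation is one of the simplest such methods: each missing value is replaced with the average of the observed values. The sketch below assumes missing values are stored as `None`; the ages are invented sample data, and the mean is only one of several possible fill strategies.

```python
from statistics import mean

# Hypothetical column with two missing entries (None).
ages = [36, None, 41, None, 25]

# Compute the fill value from the observed entries only.
observed = [a for a in ages if a is not None]
fill = mean(observed)

# Replace each missing entry with the mean; leave observed values untouched.
imputed = [fill if a is None else a for a in ages]
```

Every row is kept, at the cost of the filled-in values being estimates rather than measurements.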

Question 9

Q

Why should we impute?

Answer

A

Imputation is especially useful for maintaining representation in our data, and when we have so little data to work with that we cannot afford to discard any of it.