Data Wrangling Flashcards

1
Q

What is Data Wrangling?

A

Data wrangling is the process of retrieving data and conforming it into a form we can use. It involves finding data in files, databases, and web APIs, and then cleaning it.

2
Q

What are some common data formats?

A

CSV, JSON, XML.
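
As a rough illustration, the sketch below writes one made-up record in each of the three formats and reads it back with Python's standard library. The record, the field names, and the XML tag names are assumptions chosen for the example, not part of the card.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample record, written three ways.
csv_text = "name,age\nAda,36\n"
json_text = '[{"name": "Ada", "age": 36}]'
xml_text = "<people><person><name>Ada</name><age>36</age></person></people>"

# CSV: the header row supplies the keys; every value comes back as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: objects become dicts and numbers keep their numeric type.
json_rows = json.loads(json_text)

# XML: each <person> element becomes a dict of child tag -> text (strings).
xml_rows = [
    {child.tag: child.text for child in person}
    for person in ET.fromstring(xml_text).findall("person")
]

print(csv_rows, json_rows, xml_rows)
```

Note that only JSON preserves the number type here; the CSV and XML values would still need converting before analysis.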

3
Q

What is a JSON file?

A

JSON is a file format that resembles a series of Python dictionaries. When it holds tabular data, each JSON object (or "dict") represents a single data point (a row in CSV terms): each key is a column name, and each value is that data point's value for the column.
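
To make the comparison with CSV concrete, here is a minimal sketch that loads a small JSON array into Python dictionaries and writes the same records out as CSV rows. The city and temperature data are invented for the example.

```python
import csv
import io
import json

# Hypothetical JSON text: one object per data point.
json_text = '[{"city": "Oslo", "temp_c": 4}, {"city": "Lima", "temp_c": 19}]'

# json.loads turns the array into a list of Python dicts.
records = json.loads(json_text)

# Each dict key becomes a CSV column; each dict becomes a CSV row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "temp_c"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(csv_text)
```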

4
Q

What is a Schema?

A

A schema is essentially the blueprint for how we want to organize our database.
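
One way to picture a schema is a table definition in SQLite, sketched below with Python's built-in sqlite3 module. The table name, column names, and constraints are assumptions made up for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# The schema: which columns exist, their types, and their constraints.
conn.execute(
    """
    CREATE TABLE readings (
        id INTEGER PRIMARY KEY,
        city TEXT NOT NULL,
        temp_c REAL
    )
    """
)

# A row that fits the schema is accepted.
conn.execute("INSERT INTO readings (city, temp_c) VALUES (?, ?)", ("Oslo", 4.0))

# A row that violates the NOT NULL constraint on city is rejected.
try:
    conn.execute("INSERT INTO readings (city, temp_c) VALUES (?, ?)", (None, 19.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(rejected, row_count)
```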

5
Q

What is Sanity Checking?

A

Sanity checking is the process of checking whether the data we're using makes sense. It can be guided by a few questions:

  • Does the data make sense?
  • Is there a problem?
  • Does the data look like we expect it to?
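
The questions above could be turned into simple automated checks, as in this sketch. The rows, the expected column names, and the plausible temperature range are all assumptions picked for the example; real checks depend on the dataset.

```python
# Hypothetical rows to check.
rows = [
    {"city": "Oslo", "temp_c": 4.0},
    {"city": "Lima", "temp_c": 190.0},  # implausible value
    {"city": "Cairo", "temp_c": None},  # missing value
]


def sanity_problems(row, low=-90.0, high=60.0):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    # Does the data look like we expect it to? (assumed column names)
    if set(row) != {"city", "temp_c"}:
        problems.append("unexpected columns")
    temp = row.get("temp_c")
    # Is there a problem? (a missing value)
    if temp is None:
        problems.append("missing temp_c")
    # Does the data make sense? (assumed plausible range in Celsius)
    elif not (low <= temp <= high):
        problems.append("temp_c out of range")
    return problems


report = {row["city"]: sanity_problems(row) for row in rows}
print(report)
```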
6
Q

What are the three questions for Sanity Checking?

A
  • Does the data make sense?
  • Is there a problem?
  • Does the data look like we expect it to?
7
Q

Why are values missing from a dataset?

A

Values can be missing because a system malfunctions while recording data, or because people do not enter data, whether through basic human error or because survey respondents choose not to answer.
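
Whatever the cause, a useful first step is to count how many values are missing in each column. The sketch below does this for a small made-up CSV, assuming that a missing value shows up as an empty field.

```python
import csv
import io

# Hypothetical survey data: Bob skipped age, Cy skipped income.
csv_text = "name,age,income\nAda,36,52000\nBob,,48000\nCy,41,\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Count empty fields per column (assumption: empty string means missing).
missing_counts = {
    column: sum(1 for row in rows if row[column].strip() == "")
    for column in rows[0]
}

print(missing_counts)
```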

8
Q

What is imputation?

A

Imputation is the act of filling in missing values, using any of a variety of methods, to keep the data as accurate and consistent as possible.
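
Mean imputation is one of the simplest of these methods: each missing value is replaced with the average of the observed values. The sketch below is a minimal example with invented ages; it is only one possible approach, and the card does not name a specific method.

```python
from statistics import mean

# Hypothetical ages; None marks a missing value.
ages = [36, None, 41, 28, None]

# Average of the values that were actually observed.
observed = [age for age in ages if age is not None]
fill_value = mean(observed)

# Replace each missing value with that average.
imputed = [age if age is not None else fill_value for age in ages]

print(imputed)
```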

9
Q

Why should we impute?

A

Imputation is especially useful for maintaining representation in our data, and for cases where we have so little data to work with that discarding rows with missing values isn't an option.
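
The trade-off can be seen in a small sketch that compares dropping incomplete rows with imputing them. The groups and scores are made up, and the median is used as the fill value purely as an assumption for the example.

```python
from statistics import median

# Hypothetical data: every score for group B happens to be missing.
rows = [
    {"group": "A", "score": 70},
    {"group": "A", "score": None},
    {"group": "B", "score": None},
    {"group": "B", "score": None},
    {"group": "A", "score": 90},
]

# Option 1: drop rows with a missing score.
dropped = [row for row in rows if row["score"] is not None]

# Option 2: fill missing scores with the median of the observed scores.
fill = median(row["score"] for row in dropped)
imputed = [
    {**row, "score": fill if row["score"] is None else row["score"]}
    for row in rows
]

# Dropping removes group B entirely; imputing keeps both groups.
groups_after_drop = {row["group"] for row in dropped}
groups_after_impute = {row["group"] for row in imputed}

print(len(dropped), len(imputed), groups_after_drop, groups_after_impute)
```

Here dropping leaves only two of the five rows and loses group B altogether, while imputing keeps every row at the cost of using estimated scores.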
