Cleaning Data in Python Flashcards

1
Q

Print first 5 rows of DataFrame

Print last 5 rows of DataFrame

A

print(dataframename.head())

print(dataframename.tail())

2
Q

Get a list of columns of the DataFrame

A

dataframename.columns

3
Q

Check dimensions of a DataFrame

A

dataframename.shape

4
Q

Get information about DataFrame

A

dataframename.info()

# Shows the number of rows and columns, the column names, the number of non-missing values in each column, and the data type of each column

5
Q

Count frequency in a DataFrame column (2 equivalent ways)

A

dataframename.column1.value_counts(dropna=False)
# Attribute access works only if the column name is a valid Python identifier (no spaces or special characters) and does not clash with an existing DataFrame attribute or method

dataframename['column1'].value_counts(dropna=False)

# dropna=False makes it count missing values (NaN) as well
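
A minimal sketch of the two access styles and the effect of dropna=False. The DataFrame, the column name "city", and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value among six rows.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo", np.nan, "Lima", "Oslo"]})

# By default, missing values are left out of the counts.
counts_without_nan = df["city"].value_counts()

# dropna=False adds a row for the missing values.
counts_with_nan = df["city"].value_counts(dropna=False)

# Attribute access gives the same result because "city" is a valid identifier.
attribute_counts = df.city.value_counts(dropna=False)

print(counts_with_nan)
```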

6
Q

Chain together the value_counts() and head() methods

A

dataframename.column1.value_counts(dropna=False).head()

7
Q

Summary statistics for a DataFrame

A

dataframename.describe()

# Will produce summary statistics of numeric data including count (number of non-missing values), mean, std, min, 25%, 50% (median), 75%, max

8
Q

Make a histogram from a DataFrame column

A

import matplotlib.pyplot as plt

dataframename.columnname.plot(kind='hist')
plt.show()

9
Q

Make a boxplot from a DataFrame column

A

import matplotlib.pyplot as plt

dataframename.boxplot(column='columnname', by='groupingvariable')
plt.show()

10
Q

Make a scatter plot from two DataFrame columns

A

dataframename.plot(kind='scatter', x='column1', y='column2')

11
Q

What are the 3 principles of tidy data?

A
  1. Columns represent separate variables
  2. Rows represent individual observations
  3. Observational units form tables
12
Q

Melt together two columns of a DataFrame

A

pd.melt(frame=dataframename, id_vars='column1', value_vars=['column2', 'column3'], var_name='columnname', value_name='valuesname')

  • # id_vars are the columns to keep as they are. For multiple id_vars, pass a list
  • # value_vars are the columns to melt. If none are specified, all columns except those in id_vars are melted
  • # var_name is the name of the new column that holds the melted column names
  • # value_name is the name of the new column that holds the values
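
The parameters above can be sketched on a small wide table. The column names ("name", "treatment_a", "treatment_b") and the values are hypothetical.

```python
import pandas as pd

# Hypothetical wide data: one column per treatment.
wide = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "treatment_a": [10, 20],
    "treatment_b": [30, 40],
})

# Melt the two treatment columns into variable/value pairs, keeping "name" fixed.
long = pd.melt(
    frame=wide,
    id_vars="name",
    value_vars=["treatment_a", "treatment_b"],
    var_name="treatment",
    value_name="result",
)

print(long)
```

Each original row becomes one row per melted column, so the 2-row table ends up with 4 rows.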
13
Q

Pivot one column into two columns

A

new_df = dataframename.pivot(index='column1', columns='column2', values='column3')

  • # index – columns to keep as they are. For multiple entries, pass a list
  • # columns – column whose unique values become the new column names
  • # values – column whose values fill the new columns
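
A small sketch of pivot on long data with no duplicate index/column pairs. The column names ("date", "element", "value") and the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical long data: one row per (date, element) pair.
long = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [20, 10, 25, 12],
})

# Each unique value in "element" becomes its own column.
wide = long.pivot(index="date", columns="element", values="value")

print(wide)
```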
14
Q

Pivot one column into two columns when there are duplicate values for a variable you’re trying to pivot. Aggregate the duplicate values by taking their average

A

new_df = dataframename.pivot_table(index='column1', columns='column2', values='column3', aggfunc='mean')

  • # index – columns to keep as they are. For multiple entries, pass a list
  • # columns – column to pivot into separate columns
  • # values – values used to fill the columns after pivoting
  • # aggfunc – tells pandas how to aggregate duplicate values (the mean is the default)
15
Q

Flatten the columns of a pivoted DataFrame

A

dataframename.reset_index()
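
As a rough illustration of this card and the previous one together: pivot_table averages duplicate entries, and reset_index then turns the index back into an ordinary column. The data and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical long data in which (d1, tmax) appears twice.
long = pd.DataFrame({
    "date": ["d1", "d1", "d1", "d2", "d2"],
    "element": ["tmax", "tmax", "tmin", "tmax", "tmin"],
    "value": [20.0, 22.0, 10.0, 25.0, 12.0],
})

# pivot() would raise an error on the duplicate; pivot_table() averages it.
wide = long.pivot_table(index="date", columns="element", values="value", aggfunc="mean")

# Move "date" out of the index and back into a regular column.
flat = wide.reset_index()

print(flat)
```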

16
Q

What string method can you use to split a string at a delimiter?

A

.split('delimiter')

With no argument, it splits on any run of whitespace

17
Q

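Which method can you use to retrieve a value from a list?

A

listname.str.get(index)

# Here listname is a pandas Series whose elements are lists, e.g. the result of .str.split(). For a plain Python list, use listname[index]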
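
A short sketch combining this card with the previous one, assuming the values live in a pandas Series. The Series contents and the variable names are made up.

```python
import pandas as pd

# Hypothetical codes that pack two pieces of information into one string.
codes = pd.Series(["m_014", "f_1524", "m_65"])

# .str.split() turns each string into a list of parts.
parts = codes.str.split("_")

# .str.get() picks one element out of each list.
sex = parts.str.get(0)
age_group = parts.str.get(1)

# On a plain Python string, split() with no argument splits on whitespace.
words = "a b  c".split()
```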

18
Q

How do you concatenate rows of data?

A

concatenated = pd.concat([dataframe1, dataframe2])

19
Q

How do you concatenate columns of data?

A

concatenated = pd.concat([dataframe1, dataframe2], axis=1)

#axis = 1 defines column-wise concatenation. Default is axis=0
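
A minimal sketch of both directions of concatenation on two small, made-up DataFrames. The ignore_index=True variant is not on the cards; it is shown because row-wise concatenation otherwise keeps the original index labels.

```python
import pandas as pd

# Hypothetical DataFrames with the same columns and the same index.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# Row-wise (default axis=0): the index labels 0, 1 are repeated.
rows = pd.concat([df1, df2])

# ignore_index=True builds a fresh 0..n-1 index instead.
rows_reindexed = pd.concat([df1, df2], ignore_index=True)

# Column-wise: rows are aligned on the index, columns sit side by side.
cols = pd.concat([df1, df2], axis=1)
```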

20
Q

How do you merge data in cases where concatenating won’t work?

A

left_on and right_on define the common columns (keys) between the two DataFrames

pd.merge(left=dataframe1, right=dataframe2, on=None, left_on='keydf1', right_on='keydf2')

# on= can be used instead when the key columns have the same name in both DataFrames. If the names differ, omit it and use left_on and right_on
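
A small sketch of a merge where the key columns have different names. The two tables, their column names, and their values are hypothetical.

```python
import pandas as pd

# Hypothetical tables: the site name is called "name" in one and "site" in the other.
sites = pd.DataFrame({"name": ["DR-1", "MSK-4"], "lat": [-49.85, -48.87]})
visits = pd.DataFrame({"ident": [619, 734, 837], "site": ["DR-1", "DR-1", "MSK-4"]})

# Match rows where sites["name"] equals visits["site"].
merged = pd.merge(left=sites, right=visits, left_on="name", right_on="site")

print(merged)
```

Because "DR-1" appears twice in visits, its row from sites is repeated in the result, which is something concatenation cannot do.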

21
Q

How do you view the datatypes of the columns in a DataFrame?

A

print(dataframename.dtypes)

22
Q

How do you convert a DataFrame column into string (object dtype)?

A

dataframename['column1'] = dataframename['column1'].astype(str)

23
Q

How do you convert a DataFrame column into category dtype?

Why might it be beneficial to use this data type?

A

dataframename['column1'] = dataframename['column1'].astype('category')

Saves memory when the column has few distinct values relative to its length
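
One way to see the memory saving, sketched on a made-up column with only two distinct values. The exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 10,000 rows but only two distinct strings.
df = pd.DataFrame({"sex": ["Male", "Female"] * 5000})

# deep=True includes the memory taken by the strings themselves.
bytes_before = df["sex"].memory_usage(deep=True)

df["sex"] = df["sex"].astype("category")
bytes_after = df["sex"].memory_usage(deep=True)

print(bytes_before, bytes_after)
```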

24
Q

How do you convert a column containing missing values described by non-numeric characters into numeric data?

A

dataframename['column1'] = pd.to_numeric(dataframename['column1'], errors='coerce')

# errors='coerce' tells pandas to turn invalid values (e.g. a dash) into missing values (NaN)
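
A minimal sketch of the coercion on a made-up column in which a dash and the word "missing" stand in for missing values.

```python
import pandas as pd

# Hypothetical column read in as strings because of the placeholder values.
df = pd.DataFrame({"total_bill": ["16.99", "-", "21.01", "missing"]})

# Values that cannot be parsed as numbers become NaN instead of raising an error.
df["total_bill"] = pd.to_numeric(df["total_bill"], errors="coerce")

print(df["total_bill"])
```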

25
Q

How do you use regular expressions to match the pattern of a string?

A

import re

pattern = re.compile('pattern')

result = pattern.match('string')

bool(result) is True when the pattern matches at the start of the string, otherwise False

or

re.match(pattern='pattern', string='string')
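
A short sketch with a made-up phone-number pattern. It also shows that match() only anchors at the start of the string; fullmatch(), which is not on the card, is one way to require that the whole string fits.

```python
import re

# Hypothetical pattern: three digits, three digits, four digits, separated by dashes.
pattern = re.compile(r"\d{3}-\d{3}-\d{4}")

ok = pattern.match("123-456-7890")           # matches
bad = pattern.match("call 123-456-7890")     # None: the match must start at position 0
prefix_only = pattern.match("123-456-78901") # still matches: trailing text is ignored
strict = pattern.fullmatch("123-456-78901")  # None: the whole string must fit

print(bool(ok), bool(bad), bool(prefix_only), bool(strict))
```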

26
Q

How do you extract multiple numbers from a string using regular expressions?

A

matches = re.findall('pattern', 'string')
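
For instance, with the pattern \d+ (one or more digits) and a made-up sentence:

```python
import re

# findall returns every non-overlapping match as a list of strings.
matches = re.findall(r"\d+", "the recipe calls for 10 strawberries and 1 banana")

# The matches are strings; convert them if numbers are needed.
numbers = [int(m) for m in matches]
```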

27
Q

How do you apply a function to all columns of a DataFrame?

A

dataframename.apply(functionname, axis=0)

28
Q

How do you apply a function to all rows of a DataFrame?

A

dataframename.apply(functionname,axis=1)
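
A minimal sketch contrasting the two axes, using a made-up DataFrame and a hypothetical helper value_range.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def value_range(values):
    """Hypothetical helper: difference between the largest and smallest value."""
    return values.max() - values.min()

# axis=0: the function receives each column in turn (one result per column).
per_column = df.apply(value_range, axis=0)

# axis=1: the function receives each row in turn (one result per row).
per_row = df.apply(value_range, axis=1)
```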