3.1 Dataframe Basics Flashcards

1
Q

What is “pandas”?

A

Pandas is a third‑party Python library that gives you types and
functions for working with tabular data.

2
Q

What is a “DataFrame”?

A

A DataFrame is a container type, like a list or dict, that holds a single data table.

One column of a DataFrame is its own type, called a Series, which you’ll also sometimes use.
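
As a rough illustration of the container idea, the sketch below builds a small DataFrame from a dict of columns and checks the type of the table and of one column. The rows are made up for illustration and are not taken from any real shots file.

```python
import pandas as pd

# Hypothetical data: these rows are made up for illustration.
shots = pd.DataFrame({
    "name": ["A. Player", "B. Player", "C. Player"],
    "foot": ["right", "left", "right"],
    "goal": [True, False, False],
})

table_type = type(shots)           # the whole table is a DataFrame
column_type = type(shots["name"])  # one column is a Series

print(table_type.__name__)
print(column_type.__name__)
```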

3
Q

Why does “path” have to be imported?

A

Though path (the os.path module, typically brought in with from os import path) is part of the standard library (i.e. no third-party installation necessary), we still have to import it in order to use it.

4
Q

How do you import “pandas”?

A

“import pandas as pd”

When importing Pandas, the convention is to import it under the name pd.

This lets us use any Pandas function by typing pd. (pd followed by a period) and then the name of the function.

5
Q

What is the function used to read a CSV file?

A

“pd.read_csv”
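
A minimal sketch of pd.read_csv in action. So that it runs without a file on disk, it reads from an in-memory string via io.StringIO; the CSV content and column names are hypothetical stand-ins for a real shots file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """shot_id,name,foot,goal,period
101,A. Player,right,True,1H
102,B. Player,left,False,2H
103,C. Player,right,False,E1
"""

# pd.read_csv accepts a file path or any file-like object.
shots = pd.read_csv(io.StringIO(csv_text))
print(shots)
```

With a real file you would pass its path instead, e.g. pd.read_csv('shots.csv').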

6
Q

What path method can be used as a clean way of reading files?

A

First grab the path to the folder containing the files you want to load, and put it in a variable (e.g. DATA_DIR).

Then use path.join(), which combines the DATA_DIR path with a string (e.g. ‘shots.csv’), the name of the particular file you want to open, into one full file path.
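
The two steps above might look like the sketch below. Here DATA_DIR is a temporary folder created on the spot, and the tiny shots.csv is written with made-up rows, purely so the example runs anywhere; in practice DATA_DIR would point at your own data folder.

```python
from os import path
import tempfile

import pandas as pd

# DATA_DIR would normally be a folder you already have; a temporary
# folder stands in for it here so the sketch runs anywhere.
DATA_DIR = tempfile.mkdtemp()

# Build the full path from the folder and the file name.
shots_path = path.join(DATA_DIR, "shots.csv")

# Write a tiny made-up file so there is something to load.
with open(shots_path, "w") as f:
    f.write("shot_id,name,goal\n101,A. Player,True\n102,B. Player,False\n")

shots = pd.read_csv(shots_path)
print(shots_path)
print(shots)
```

Using path.join rather than gluing strings together means the correct separator for the operating system is inserted for you.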

7
Q

How do you check the “type”?

A

print(type(shots))

(where “shots” is the variable whose type you want to check)

8
Q

What method allows you to print the first rows of data (default = 5)?

A

shots.head() = first 5 rows
shots.head(x) = first x rows

(where “shots” is the var where data is loaded)
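
A short sketch of head with and without an argument, using a hypothetical eight-row table so the default of 5 is visible.

```python
import pandas as pd

# Hypothetical table with eight made-up rows.
shots = pd.DataFrame({
    "shot_id": range(101, 109),
    "goal": [False, True, False, False, False, True, False, False],
})

first_five = shots.head()    # default: first 5 rows
first_three = shots.head(3)  # first 3 rows

print(first_five)
print(first_three)
```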

9
Q

What method allows you to output all the columns?

A

print(shots.columns)

(where “shots” is the var where data is loaded; columns is an attribute rather than a method, so it takes no parentheses)

10
Q

What method allows you to output the number of rows and columns?

A

print(shots.shape)

(where “shots” is the var where the data is loaded; shape is an attribute holding a (rows, columns) tuple, so it takes no parentheses)
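
A small sketch of inspecting columns and shape together, on a hypothetical three-row, four-column table with made-up values.

```python
import pandas as pd

# Hypothetical data: three made-up rows, four columns.
shots = pd.DataFrame({
    "name": ["A. Player", "B. Player", "C. Player"],
    "foot": ["right", "left", "right"],
    "goal": [True, False, False],
    "period": ["1H", "2H", "E1"],
})

column_names = list(shots.columns)  # attribute, no parentheses
n_rows, n_cols = shots.shape        # tuple of (rows, columns)

print(column_names)
print(n_rows, n_cols)
```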

11
Q

How do you refer to a single column in a DataFrame?

A

print(shots['name'].head())

where ‘name’ is a column, and “shots” is the var where the data is loaded

Referring to a single column in a DataFrame is similar to looking up a value in a dictionary: you put the name of the column (usually a string) in brackets.

12
Q

What is the type() when you retrieve a single column from a DataFrame?

A

A single column is a Series, not a DataFrame (quite technical).

You can check by using type(shots['name'])

Where ‘name’ is a single column in the dataframe “shots”

13
Q

How can a series be turned into a one-column DataFrame?

A

Calling the to_frame method turns any Series into a one-column DataFrame.

In:

type(shots['name'].to_frame().head())

Out:

pandas.core.frame.DataFrame
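
The conversion can be sketched on a small hypothetical table (made-up rows) like this:

```python
import pandas as pd

# Hypothetical data: made-up rows for illustration.
shots = pd.DataFrame({
    "name": ["A. Player", "B. Player", "C. Player"],
    "goal": [True, False, False],
})

name_series = shots["name"]          # a Series
name_frame = name_series.to_frame()  # a one-column DataFrame

print(type(name_series).__name__)
print(type(name_frame).__name__)
print(name_frame.shape)
```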

14
Q

How do you refer to multiple columns in a DataFrame?

A

To refer to multiple columns in a DataFrame, you pass it a list. The result, unlike the single-column case, is another DataFrame.

shots[['name', 'foot', 'goal', 'period']].head()

Where ‘name’, ‘foot’, ‘goal’ and ‘period’ are columns in the dataframe “shots”

15
Q

What is important to remember when calling a list of columns from a DataFrame?

A

One column:
shots['name']

Multiple columns:
shots[['name', 'foot', 'goal', 'period']]

Selecting multiple columns takes double brackets because you’re putting a list of column names inside another pair of brackets.
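
The bracket difference can be checked directly. The sketch below uses a hypothetical table with made-up rows, and also shows that a list containing just one column name still gives back a DataFrame.

```python
import pandas as pd

# Hypothetical data: made-up rows for illustration.
shots = pd.DataFrame({
    "name": ["A. Player", "B. Player", "C. Player"],
    "foot": ["right", "left", "right"],
    "goal": [True, False, False],
})

one_col = shots["name"]            # single brackets: a Series
several = shots[["name", "foot"]]  # a list inside brackets: a DataFrame
one_in_list = shots[["name"]]      # a one-item list still gives a DataFrame

print(type(one_col).__name__)
print(type(several).__name__)
print(type(one_in_list).__name__)
```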

16
Q

What is an index?

A

You can think of the index as a built‑in column of row IDs.

Pandas lets you specify which column to use as the index when loading your data.

If you don’t, the default is a series of numbers starting from 0 and going up to one less than the number of rows.
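
A sketch of both cases, assuming a hypothetical CSV with a shot_id column (the content is made up and read from memory). The index_col parameter of pd.read_csv is one way to choose the index at load time.

```python
import io

import pandas as pd

# Hypothetical CSV content, read from memory instead of from disk.
csv_text = (
    "shot_id,name,goal\n"
    "101,A. Player,True\n"
    "102,B. Player,False\n"
    "103,C. Player,False\n"
)

# No index specified: pandas numbers the rows from 0.
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col picks a column to use as the index while loading.
custom_index = pd.read_csv(io.StringIO(csv_text), index_col="shot_id")

print(list(default_index.index))
print(list(custom_index.index))
```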

17
Q

How do you change the index of a DataFrame?

A

shots.set_index('shot_id').head()

Where ‘shot_id’ is a single column in the dataframe “shots” that you want to set as the index

18
Q

What is the problem with setting a new index?

A

set_index returns a new copy of the DataFrame with the index we want. It does not actually do anything to our original shots DataFrame.

If you run shots.head() afterwards, it will still show the old index.

To make it permanent, we have to set the inplace argument to True:

shots.set_index('shot_id', inplace=True)
–> Now shot_id is the new index, even when you run shots.head() again
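
The copy-versus-original behaviour can be seen in a short sketch on a hypothetical table with made-up rows. Reassigning the result, as noted in the final comment, is a common alternative to inplace=True.

```python
import pandas as pd

# Hypothetical data: made-up rows for illustration.
shots = pd.DataFrame({
    "shot_id": [101, 102, 103],
    "name": ["A. Player", "B. Player", "C. Player"],
})

# Without inplace, set_index returns a new DataFrame...
indexed_copy = shots.set_index("shot_id")
# ...and the original keeps its default 0, 1, 2 index.
index_before = list(shots.index)

# With inplace=True the original itself is changed.
shots.set_index("shot_id", inplace=True)
index_after = list(shots.index)

print(index_before)
print(index_after)
# Alternative with the same effect: shots = shots.set_index("shot_id")
```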

19
Q

What does set_index do? And what is the opposite of it?

A

set_index sets a particular column as the index for the DataFrame.

The opposite of set_index is reset_index, which sets the index back to 0, 1, 2… and so on, and turns the previous index into a regular column.
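
A round trip through both methods, sketched on a hypothetical table with made-up rows:

```python
import pandas as pd

# Hypothetical data, already indexed by shot_id.
shots = pd.DataFrame({
    "shot_id": [101, 102, 103],
    "name": ["A. Player", "B. Player", "C. Player"],
}).set_index("shot_id")

# reset_index restores 0, 1, 2... and moves shot_id back to a column.
restored = shots.reset_index()

print(list(shots.index))
print(list(restored.index))
print(list(restored.columns))
```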

20
Q

What does the .loc property do?

A

The .loc property of the DataFrame object allows the return of specified rows and/or columns from that DataFrame.

shots_ot = shots.loc[((shots['period'] == 'E1') |
                      (shots['period'] == 'E2')),
                     ['name', 'goal', 'period']]

–> creates a subset of the DataFrame containing only the overtime shots (periods E1 and E2) and only the name, goal and period columns.
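
The same selection, made runnable on a hypothetical five-row table (the rows are made up). The row filter is pulled out into a named variable for readability, and isin is shown as an equivalent way to write the "E1 or E2" test.

```python
import pandas as pd

# Hypothetical data: made-up rows for illustration.
shots = pd.DataFrame({
    "name": ["A. Player", "B. Player", "C. Player", "D. Player", "E. Player"],
    "foot": ["right", "left", "right", "left", "right"],
    "goal": [True, False, False, True, False],
    "period": ["1H", "2H", "E1", "E2", "E1"],
})

# Boolean Series: True for rows whose period is E1 or E2.
is_overtime = (shots["period"] == "E1") | (shots["period"] == "E2")

# .loc[rows, columns]: keep the overtime rows and three columns.
shots_ot = shots.loc[is_overtime, ["name", "goal", "period"]]

# Equivalent row filter written with isin.
shots_ot_alt = shots.loc[shots["period"].isin(["E1", "E2"]),
                         ["name", "goal", "period"]]

print(shots_ot)
```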

21
Q

Which DataFrame method allows you to sort data by name?

A

shots_ot.sort_values('name', inplace=True)

where ‘name’ is a column in the “shots_ot” dataframe. Setting inplace=True makes the sort permanent.
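
A brief sketch of sorting with and without inplace, on a hypothetical table whose made-up names start out of order:

```python
import pandas as pd

# Hypothetical data: names deliberately out of alphabetical order.
shots_ot = pd.DataFrame({
    "name": ["C. Player", "A. Player", "B. Player"],
    "goal": [False, True, False],
})

# Without inplace, a sorted copy is returned and the original is untouched.
sorted_copy = shots_ot.sort_values("name")
original_order = list(shots_ot["name"])

# With inplace=True, shots_ot itself is reordered.
shots_ot.sort_values("name", inplace=True)
new_order = list(shots_ot["name"])

print(original_order)
print(new_order)
```

Note that the rows keep their original index labels after sorting, so the index is no longer in ascending order.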

22
Q

Whilst “pd.read_csv” is the method to read a CSV file, what is the method to save a CSV file?

A

shots_ot.to_csv(path.join(DATA_DIR, 'shots_ot.csv'))

–> the output methods are called on the DataFrame itself (shots_ot)

23
Q

What must be done to ignore the index when saving from Pandas to a CSV file?

A

Set the index argument to False!

shots_ot.to_csv(path.join(DATA_DIR, 'shots_ot_no_index.csv'),
                index=False)

This is mainly used when the index is just a default range of numbers and you might not want to write it.
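
To see what index=False changes, the sketch below saves the same hypothetical table both ways and compares the header line of each file. DATA_DIR is a temporary folder here and the rows are made up, purely so the example runs anywhere.

```python
from os import path
import tempfile

import pandas as pd

# A temporary folder stands in for a real DATA_DIR.
DATA_DIR = tempfile.mkdtemp()

# Hypothetical data: made-up rows for illustration.
shots_ot = pd.DataFrame({
    "name": ["A. Player", "B. Player"],
    "goal": [True, False],
})

with_index_path = path.join(DATA_DIR, "shots_ot.csv")
no_index_path = path.join(DATA_DIR, "shots_ot_no_index.csv")

shots_ot.to_csv(with_index_path)             # index written as first column
shots_ot.to_csv(no_index_path, index=False)  # index left out

# Read back just the header line of each file to compare.
with open(with_index_path) as f:
    header_with_index = f.readline().strip()
with open(no_index_path) as f:
    header_no_index = f.readline().strip()

print(header_with_index)
print(header_no_index)
```

The file saved with the index has an extra, unnamed first column, which is why its header starts with a comma.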