Pandas Flashcards

https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html

1
Q

What is Pandas?

A

Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame. DataFrames are essentially multi-dimensional arrays with attached row and column labels, often holding heterogeneous types and/or missing data.

2
Q

What are advantages of Pandas over NumPy?

A

Pandas provides efficient access to the "data munging" tasks that occupy much of a data scientist's time and for which NumPy alone is not sufficient.

NumPy's limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots). Each of these is an important part of analyzing the less structured data found in many forms in the world around us.

3
Q

What are the three fundamental Pandas data structures?

A

Series, DataFrame and Index.

4
Q

What is a Pandas Series?

A

A Pandas Series is a one-dimensional array of indexed data values. The Series wraps a sequence of values and indices.

5
Q

How do you create a Pandas Series?

A

It can be created from a list or array as follows:

From a list:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])

From an array:
import numpy as np
x = np.arange(4, 10)
data = pd.Series(x)

6
Q

What does the values attribute of a Series return?

A

A NumPy array of the data values

7
Q

What does the index attribute of a Series return?

A

An array-like object of type pd.Index containing the index of the Series.
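
As a minimal sketch of the values and index attributes, assuming a Series built with the default integer index (the variable names here are illustrative only):

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

values = data.values  # the underlying data as a NumPy array
index = data.index    # a pd.Index object (a RangeIndex for the default index)

print(type(values))
print(index)
```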

8
Q

How can you access elements of a Series?

A

Use the associated index with square-bracket notation, e.g. for a Series x with the default integer index:

x[0] # returns the first element
x[0:3] # returns the first 3 elements along with their index

9
Q

What is the essential difference between a Series and a NumPy array?

A

The presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

10
Q

Does a Series index need to be an integer?

A

No, it can consist of values of any desired type, e.g.

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

11
Q

Can a Series be thought of as a specialization of a Python dictionary?

A

Yes. A dictionary maps arbitrary keys to arbitrary values, whereas a Series maps typed keys to a set of typed values. This type information makes a Series more efficient than a Python dictionary for certain operations.
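
A short sketch of the dictionary analogy, assuming a small dictionary with made-up keys and values. It shows a Series built directly from a dictionary, plus label slicing, which a plain dictionary does not support:

```python
import pandas as pd

# Hypothetical data, used only for illustration
plain_dict = {'a': 1, 'b': 2, 'c': 3}
series = pd.Series(plain_dict)  # keys become the index

item = series['b']        # dictionary-style lookup by key
subset = series['a':'b']  # label slice; the end label is included

print(item)
print(subset)
```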

12
Q

What is a Pandas DataFrame?

A

A Pandas DataFrame is a two-dimensional array with flexible row indices and flexible column names.

A DataFrame can be thought of as a sequence of aligned Series objects, where "aligned" means that they share the same index.

13
Q

What does the index attribute of a DataFrame return?

A

An Index object containing the row labels (the index) of the DataFrame.

14
Q

What does the columns attribute of a DataFrame return?

A

An Index object containing the column labels of the DataFrame.
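
A minimal sketch of the index and columns attributes, assuming a DataFrame assembled from two small Series with made-up labels and values:

```python
import pandas as pd

# Hypothetical Series sharing the same index labels
area = pd.Series({'x': 10, 'y': 20, 'z': 30})
population = pd.Series({'x': 100, 'y': 200, 'z': 300})

states = pd.DataFrame({'population': population, 'area': area})

row_labels = states.index       # Index of row labels
column_labels = states.columns  # Index of column labels

print(row_labels)
print(column_labels)
```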

15
Q

How can a Pandas DataFrame be created from a Series?

A

pd.DataFrame(population, columns=['population'])

where population is a Series.

16
Q

How can a Pandas DataFrame be created from a list of dictionaries?

A

data = [{'a': i, 'b': 2 * i}
        for i in range(3)]

pd.DataFrame(data)

NB. Even if some keys are missing from some of the dictionaries, Pandas will fill in the gaps with NaN.
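
To illustrate the NaN filling, here is a small sketch assuming two made-up records whose keys only partly overlap:

```python
import pandas as pd

# Hypothetical records: 'a' is missing from the second, 'c' from the first
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
df = pd.DataFrame(records)

# Columns are the union of all keys; absent entries become NaN
print(df)
```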

17
Q

How can a Pandas DataFrame be created from a Dictionary of Series objects?

A

pd.DataFrame({'population': population,
              'area': area})

where population and area are Series.

18
Q

How can a Pandas DataFrame be created from a 2-D NumPy array?

A

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

If the columns and index arguments are omitted, an integer index will be used for each.

19
Q

How is a Pandas Series like a Dictionary?

A

The Series object provides a mapping from a collection of keys to a collection of values (where the index is the key).

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

data['b'] # returns 0.5

20
Q

In what way can you extend a Series?

A

Assign to a new index value, just as you would assign to a new key in a dictionary:

data['e'] = 1.25

where 'e' is the new index label and 1.25 is the value.

21
Q

Why is there potential confusion with slicing of Series objects?

A

Confusion can arise when a Series has an explicit integer index, because indexing and slicing then follow different conventions. With a Series object called data:

Accessing data[1] uses the explicit index and returns the element whose label is 1.
Accessing data[1:3] uses the implicit Python-style positions and returns the 2nd and 3rd elements.

22
Q

What is the preferred method of accessing Series for indexing and slicing?

A

The .loc and .iloc indexer attributes

data.loc[1] # the element with explicit index label 1
data.loc[1:3] # the explicit slice from label 1 to label 3, end label included

data.iloc[1] # the element at implicit Python-style position 1 (the 2nd element)
data.iloc[1:3] # the implicit Python-style slice: the 2nd and 3rd elements
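
The difference is easiest to see on a Series whose integer labels do not match the positions. A minimal sketch, assuming made-up values and the labels 1, 3, 5:

```python
import pandas as pd

# Hypothetical Series with an explicit, non-consecutive integer index
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1]           # label 1 -> 'a'
label_slice = data.loc[1:3]      # labels 1 through 3, inclusive -> 'a', 'b'

by_position = data.iloc[1]       # position 1 -> 'b'
position_slice = data.iloc[1:3]  # positions 1 and 2 -> 'b', 'c'

print(label_slice)
print(position_slice)
```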

23
Q

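How do we access an individual column (Series) of a DataFrame?

A

Use dictionary-style indexing with the column name, e.g.

data = pd.DataFrame({'area': area, 'pop': pop})
data['area']

NB. Attribute-style access such as data.area also works, but only when the column name is a string that is a valid Python identifier and does not clash with a DataFrame method (data.pop, for example, refers to the pop() method rather than the 'pop' column).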

24
Q

What does the values attribute of a DataFrame return?

A

A two-dimensional NumPy array of the data.

25
Q

How do we use .iloc to access DataFrames?

A

Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

data.iloc[:3, :2]

26
Q

How do we use .loc to access DataFrames?

A

Using the loc indexer, we can index the underlying array as if it is a simple NumPy array (using the explicit index and column names), but the DataFrame index and column labels are maintained in the result:

data.loc[:'Illinois', :'pop']
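
A small sketch comparing label-based and position-based selection on a DataFrame. The row labels, column names, and values here are made up for illustration. Note that a .loc slice includes its end label, while an .iloc slice excludes its end position:

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
data = pd.DataFrame(
    {'area': [1.0, 2.0, 3.0, 4.0],
     'pop': [10, 20, 30, 40],
     'density': [10.0, 10.0, 10.0, 10.0]},
    index=['w', 'x', 'y', 'z'],
)

by_label = data.loc[:'y', :'pop']  # rows up to and including 'y', columns up to and including 'pop'
by_position = data.iloc[:3, :2]    # first 3 rows, first 2 columns

print(by_label)
print(by_label.equals(by_position))
```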

27
Q

How do we use .ix to access DataFrames?

A

The .ix indexer was a hybrid of .loc and .iloc:

data.ix[:3, :'pop']

NB. .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so use .loc or .iloc instead.

28
Q

What does data.loc[data.density > 100, ['pop', 'density']] demonstrate?

A

This combines masking (data.density > 100) and fancy indexing (['pop', 'density']) within the .loc indexer to select data values from a DataFrame.
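
A minimal sketch of this pattern, assuming a small DataFrame with made-up values and a derived density column:

```python
import pandas as pd

# Hypothetical data used only for illustration
data = pd.DataFrame({'pop': [100, 200, 300], 'area': [10, 1, 2]},
                    index=['x', 'y', 'z'])
data['density'] = data['pop'] / data['area']  # 10.0, 200.0, 150.0

# Boolean mask picks the rows; the list of names picks the columns
dense = data.loc[data['density'] > 100, ['pop', 'density']]

print(dense)
```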

29
Q

Are Pandas object indices preserved when using NumPy ufuncs?

A

Yes
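
For example, a sketch assuming a small Series with string labels: applying np.exp returns another Series with the same index.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

result = np.exp(ser)  # still a Series; the index labels are carried over

print(result)
```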

30
Q

What does NaN mean?

A

NaN (Not a Number) is the special floating-point value that Pandas uses to indicate missing data.

Missing values are filled in with NaN by default.

31
Q

What happens when combining (e.g., adding) DataFrames or Series objects with different indices/column headings?

A

The indices of the Pandas objects are effectively unioned, and any matching index/column values are brought together. Elements that don't match are populated with NaN.
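
A short sketch of this alignment for addition, assuming two made-up Series whose indices only partly overlap:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# The result index is the union {0, 1, 2, 3}; labels present in only
# one operand (0 and 3) produce NaN
total = A + B

print(total)
```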

32
Q

When adding Series/DataFrames with non-matching index/columns what keyword can be used to replace NaN values?

A

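Use the fill_value keyword of the corresponding object method (e.g., A.add(B, fill_value=0)) to specify the value substituted for missing entries before the operation is applied.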
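
As a sketch, assuming two made-up Series with partly overlapping indices, the add method with fill_value=0 treats a label missing from one operand as 0 instead of producing NaN:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Label 0 is missing from B and label 3 is missing from A;
# each missing entry is replaced by 0 before adding
filled = A.add(B, fill_value=0)

print(filled)
```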

33
Q

Is the alignment of Series/DataFrames indices and columns preserved when performing operations in Pandas?

A

Yes. In operations between a DataFrame and a Series, index and column alignment is maintained, and the operation is broadcast row-wise by default, following rules similar to NumPy's broadcasting.
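
A minimal sketch of this behaviour, assuming a small made-up DataFrame. Subtracting a row operates row-wise by default; to operate column-wise, the subtract method can be used with axis=0:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['Q', 'R'])

# Row-wise (default): the first row is subtracted from every row,
# matched on column labels
row_diff = df - df.iloc[0]

# Column-wise: column 'R' is subtracted from every column,
# matched on the row index
col_diff = df.subtract(df['R'], axis=0)

print(row_diff)
print(col_diff)
```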