Python Pandas Flashcards

1
Q

What is pandas (Python pandas)?

A

pandas is an open-source library that provides high-performance data manipulation in Python.

The name pandas is derived from "panel data", an econometrics term for multidimensional data. It is used for data analysis in Python and was developed by Wes McKinney in 2008.

2
Q

Mention the different types of Data Structures in Pandas?

A

pandas provides two main data structures, Series and DataFrame. Both are built on top of NumPy.

A Series is a one-dimensional data structure in pandas, whereas the DataFrame is the two-dimensional data structure in pandas.

2
Q

Define Series in Pandas?

A

A Series is a one-dimensional labeled array that can store various data types. The row labels of a Series are called the index. With the pd.Series() constructor, we can easily convert a list, tuple, or dictionary into a Series. A Series cannot contain multiple columns.

3
Q

How can we calculate the standard deviation from the Series?

A

The pandas std() method calculates the standard deviation of a Series, or of the columns or rows of a DataFrame. By default it uses ddof=1, which gives the sample standard deviation.

Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
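
As a minimal sketch of the call above, the following compares the default sample standard deviation with the population version. The sample values are hypothetical and were chosen only so that the population result is a round number.

```python
import pandas as pd

# Hypothetical sample data (mean 5, population variance 4).
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()            # ddof=1 by default: sample standard deviation
population_std = s.std(ddof=0)  # ddof=0: population standard deviation

print(sample_std, population_std)
```

Setting ddof=0 is what makes the result match NumPy's default np.std() behaviour.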

4
Q

Define DataFrame in Pandas?

A

A DataFrame is a widely used pandas data structure: a two-dimensional array with labeled axes (rows and columns). It is a standard way to store tabular data and has two indexes, a row index and a column index. It has the following properties:

The columns can be of heterogeneous types, such as int and bool.
It can be seen as a dictionary of Series in which both the rows and the columns are indexed. The column labels are called “columns” and the row labels are called “index”.

5
Q

What are the significant features of the pandas Library?

A

The key features of the pandas library are as follows:

Memory Efficient
Data Alignment
Reshaping
Merge and join
Time Series
6
Q

Explain Reindexing in pandas?

A

df.reindex()

Reindexing conforms a DataFrame to a new index, with optional filling logic. Labels in the new index that were not present in the previous index get NA/NaN values. reindex() returns a new object unless the new index is equivalent to the current one and copy=False. It can change the index of both the rows and the columns of the DataFrame.
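
A small sketch of this behaviour, using a hypothetical three-row DataFrame: a label missing from the original index yields NaN unless fill_value is given.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"score": [10, 20, 30]}, index=["a", "b", "c"])

# "d" is not in the original index, so its row is filled with NaN.
reindexed = df.reindex(["a", "c", "d"])

# fill_value supplies a replacement for missing labels instead of NaN.
filled = df.reindex(["a", "c", "d"], fill_value=0)

print(reindexed)
print(filled)
```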

7
Q

What is the name of Pandas library tools used to create a scatter plot matrix?

A

scatter_matrix, available as pandas.plotting.scatter_matrix()

8
Q

Define the different ways a DataFrame can be created in pandas?

A

We can create a DataFrame in the following ways:

Lists
Dict of ndarrays
Example-1: Create a DataFrame using List:

import pandas as pd    
# a list of strings    
a = ['Python', 'Pandas']    
# Calling DataFrame constructor on list    
info = pd.DataFrame(a)    
print(info)   

Output:

        0
0  Python
1  Pandas

Example-2: Create a DataFrame from dict of ndarrays:

import pandas as pd
info = {'ID': [101, 102, 103], 'Department': ['B.Sc', 'B.Tech', 'M.Tech']}
info = pd.DataFrame(info)
print(info)

Output:

    ID Department
0  101       B.Sc
1  102     B.Tech
2  103     M.Tech
9
Q

Explain Categorical data in Pandas?

A

Categorical data is a pandas data type that corresponds to a categorical variable in statistics. A categorical variable takes a limited, and usually fixed, number of possible values. Examples: gender, country affiliation, blood type, social class, observation time, or rating via Likert scales. All values of categorical data are either in the categories or np.nan.

This data type is useful in the following cases:

It is useful for a string variable that consists of only a few different values. Converting such a string variable to a categorical variable saves memory.

It is useful when the lexical order of a variable is not the same as its logical order ('one', 'two', 'three'). After converting to a categorical and specifying an order on the categories, sorting and min/max use the logical order instead of the lexical order.
It is useful as a signal to other Python libraries that this column should be treated as a categorical variable.
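
The logical-versus-lexical ordering point can be sketched as follows. The values and variable names are hypothetical; the example only assumes pd.CategoricalDtype with ordered=True.

```python
import pandas as pd

# Hypothetical string data whose alphabetical order differs from its logical order.
sizes = pd.Series(["two", "three", "one", "two"])

# Declare the logical order explicitly.
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
cat = sizes.astype(order)

sorted_lexical = sizes.sort_values().tolist()  # plain strings: alphabetical
sorted_logical = cat.sort_values().tolist()    # categorical: declared order

print(sorted_lexical)
print(sorted_logical)
print(cat.min(), cat.max())
```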

10
Q

How will you create a series from dict in Pandas?

A

A Series is a one-dimensional labeled array that can store various data types.

We can create a pandas Series directly from a dictionary. If a dictionary is passed as input and no index is specified, the dictionary keys become the index. Older pandas versions sorted the keys; current versions (pandas 0.23 or later on Python 3.6 or later) keep the dictionary's insertion order.

If an index is passed, the values corresponding to the labels in the index are pulled out of the dictionary; labels without a matching key get NaN.

import pandas as pd
info = {'x': 0., 'y': 1., 'z': 2.}
a = pd.Series(info)
print(a)

Output:

x    0.0
y    1.0
z    2.0
dtype: float64

11
Q

How can we create a copy of the series in Pandas?

A

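Use the Series.copy() method. With deep=True (the default), the data and the index are copied, so changes to the copy do not affect the original:

some_series.copy(deep=True)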

12
Q

How will you create an empty DataFrame in Pandas?

A

import pandas as pd  # import the pandas library

info = pd.DataFrame()  # a DataFrame with no rows and no columns

13
Q

How will you add a column to a pandas DataFrame?

A

import pandas as pd  # import the pandas library

info = {'one': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
        'two': pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}

info = pd.DataFrame(info)

print("Add new column by passing series")
info['three'] = pd.Series([20, 40, 60], index=['a', 'b', 'c'])
print(info)
print("Add new column using existing DataFrame columns")
info['four'] = info['one'] + info['three']
print(info)

14
Q

How to Rename the Index or Columns of a Pandas DataFrame?

A

You can use the .rename method to give different values to the columns or the index values of DataFrame.
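
A minimal sketch of .rename with hypothetical column and index labels. It maps old labels to new ones and returns a new DataFrame, leaving the original unchanged.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# Map old labels to new labels; labels not mentioned are kept as they are.
renamed = df.rename(columns={"A": "alpha"}, index={"x": "first"})

print(renamed)
```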

15
Q

How to iterate over a Pandas DataFrame?

A

You can iterate over the rows of a DataFrame with a for loop over df.iterrows(), which yields (index, row) pairs where each row is a Series.

for index, row in df.iterrows():
    print(index, row)

16
Q

How to get the items not common to both series A and series B?

A

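We can get all the items of p1 and p2 that are not common to both by removing their intersection from their union:

import pandas as pd
import numpy as np
p1 = pd.Series([2, 4, 6, 8, 10])
p2 = pd.Series([8, 10, 12, 14, 16])
p_u = pd.Series(np.union1d(p1, p2))      # union of both series
p_i = pd.Series(np.intersect1d(p1, p2))  # items common to both
p_u[~p_u.isin(p_i)]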

Output:

0     2
1     4
2     6
5    12
6    14
7    16
dtype: int64
17
Q

How to get frequency counts of unique items of a series?

A
import numpy as np
import pandas as pd
p = pd.Series(np.take(list('pqrstu'), np.random.randint(6, size=17)))
p.value_counts()

np.random.randint(6, size=17) generates an array of 17 random integers between 0 and 5 (inclusive).

np.take(list('pqrstu'), ...) maps these integers to the characters 'p', 'q', 'r', 's', 't', and 'u'.
So p contains a random sequence of these characters, and value_counts() returns how often each one occurs.

Output (varies between runs because the input is random):

s    4
r    4
q    3
p    3
u    3
18
Q

How to convert a numpy array to a dataframe of given shape?

A

We can reshape the values of the series p (a NumPy array of 35 elements) into a DataFrame with 7 rows and 5 columns, as in the example below:

import numpy as np
import pandas as pd
p = pd.Series(np.random.randint(1, 7, 35))
info = pd.DataFrame(p.values.reshape(7, 5))
print(info)

Output (values vary between runs):

   0  1  2  3  4
0  3  2  5  5  1
1  3  2  5  5  5
2  1  3  1  2  6
3  1  1  1  2  2
4  3  5  3  3  3
5  2  5  3  6  4
6  3  6  6  6  5
19
Q

How can we convert a Series to DataFrame?

A

The Pandas Series.to_frame() function is used to convert the series object to the DataFrame.

Series.to_frame(name=None)
name: The name to use for the resulting column. Its default value is None. If a name is passed, it substitutes for the Series name (if the Series has one).

s = pd.Series(["a", "b", "c"],
              name="vals")
s.to_frame()

Output:

  vals
0    a
1    b
2    c
20
Q

What is a NumPy array in the context of pandas?

A

Numerical Python (NumPy) is a Python package for numerical computation and for processing multidimensional and single-dimensional arrays. Calculations on NumPy arrays are faster than on plain Python lists. pandas data structures are built on top of NumPy arrays.

21
Q

How can we convert DataFrame into a NumPy array?

A

To apply some high-level mathematical functions, we can convert a pandas DataFrame to a NumPy array with the DataFrame.to_numpy() function.

DataFrame.to_numpy() is called on the DataFrame and returns a NumPy ndarray.

DataFrame.to_numpy(dtype=None, copy=False)
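
A short sketch with hypothetical data. Because the columns mix integers and floats, the resulting array uses a single common dtype.

```python
import pandas as pd

# Hypothetical DataFrame with one integer column and one float column.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

arr = df.to_numpy()  # 2-D ndarray; mixed int/float columns become float64

print(arr)
print(arr.dtype, arr.shape)
```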

22
Q

How can we convert DataFrame into an excel file?

A

We can export a DataFrame to an Excel file with the to_excel() function.

To write a single object to an Excel file, we only have to specify the target file name. To write to multiple sheets, we need to create an ExcelWriter object with the target file name and specify the sheet in the file to write to.

23
Q

How can we sort the DataFrame?

A

We can sort a DataFrame in two ways:

By label
By actual value

By label

The DataFrame can be sorted with the sort_index() method, by passing the axis argument and the order of sorting. By default, sorting is done on the row labels in ascending order.

By actual value

The DataFrame can also be sorted by its values with the sort_values() method.

The 'by' argument specifies the name of the column (or columns) whose values are used for sorting.
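
Both kinds of sorting can be sketched on a small hypothetical DataFrame whose index is deliberately out of order:

```python
import pandas as pd

# Hypothetical data with a shuffled integer index.
df = pd.DataFrame(
    {"name": ["Kate", "Jane", "Nik"], "age": [34, 10, 35]},
    index=[2, 0, 1],
)

by_label = df.sort_index()                            # row labels, ascending
by_value = df.sort_values(by="age", ascending=False)  # 'age' column, descending

print(by_label)
print(by_value)
```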

24
Q

What is Time Series in Pandas?

A

Time series data is a sequence of data points indexed by time, such as daily stock prices or hourly sensor readings. It is an important source of information in many fields, from the conventional finance industry to the education industry.

Time series forecasting is the machine learning task of predicting future values from time series data.

25
Q

What is Time Offset?

A

A time offset (DateOffset) represents a calendar-aware increment of time, such as a month or a business day. Adding a DateOffset to a date moves it forward to the corresponding valid date.
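
As a rough illustration, the sketch below adds a one-month offset and a one-business-day offset to hypothetical dates. Both results land on valid calendar dates.

```python
import pandas as pd

# 31 January plus one month is clipped to the last valid day of February 2024.
ts = pd.Timestamp("2024-01-31")
next_month = ts + pd.DateOffset(months=1)

# 5 January 2024 is a Friday; one business day later is the following Monday.
friday = pd.Timestamp("2024-01-05")
next_bday = friday + pd.offsets.BDay(1)

print(next_month)
print(next_bday)
```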

26
Q

Define Time Periods?

A

A time period represents a time span, e.g., a day, a month, a quarter, or a year. pandas provides the Period class, which ties a span to a frequency and allows converting between frequencies.
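
A minimal sketch of a monthly Period, using a hypothetical month. The period covers the whole span, and integer arithmetic steps by one period.

```python
import pandas as pd

# A Period covering all of March 2024 (monthly frequency).
p = pd.Period("2024-03", freq="M")

start, end = p.start_time, p.end_time  # first and last instants of the span
following = p + 1                      # the next monthly period

print(start, end)
print(following)
```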

27
Q

How to convert String to date?

A

The below code demonstrates how to convert the string to date:

from datetime import datetime

# Define dates as strings
dmy_str1 = 'Saturday, July 14, 2018'
dmy_str2 = '14/7/17'
dmy_str3 = '14-07-2017'
# Convert the strings to datetime objects (day comes before month in the last two)
dmy_dt1 = datetime.strptime(dmy_str1, '%A, %B %d, %Y')
dmy_dt2 = datetime.strptime(dmy_str2, '%d/%m/%y')
dmy_dt3 = datetime.strptime(dmy_str3, '%d-%m-%Y')
# Print the converted dates
print(dmy_dt1)
print(dmy_dt2)
print(dmy_dt3)

Output:

2018-07-14 00:00:00
2017-07-14 00:00:00
2017-07-14 00:00:00

28
Q

What is Data Aggregation?

A

Data aggregation applies an aggregation function to one or more columns to produce a summary value. Common aggregations include:

sum: Returns the sum of the values for the requested axis.
min: Returns the minimum of the values for the requested axis.
max: Returns the maximum of the values for the requested axis.
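
The three aggregations can be applied in one call with DataFrame.agg. The DataFrame below is hypothetical.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# One row per aggregation, one column per original column.
summary = df.agg(["sum", "min", "max"])

# axis=1 aggregates across the columns of each row instead.
row_sums = df.sum(axis=1)

print(summary)
print(row_sums)
```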

29
Q

What is Pandas Index?

A

A pandas Index is the set of labels that identifies the rows or columns of a DataFrame. It organizes the data, provides fast access to it, and makes it possible to select particular rows and columns, which is also called subset selection.

30
Q

Define Multiple Indexing?

A

Multiple indexing (hierarchical indexing with a MultiIndex) is important for data analysis and manipulation, especially when working with higher-dimensional data. It enables us to store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures like Series and DataFrame.
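
A sketch of the idea with hypothetical sales figures: a one-dimensional Series carries two index levels, and unstack() spreads one level into columns.

```python
import pandas as pd

# Hypothetical two-level index: (year, quarter).
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1")],
    names=["year", "quarter"],
)
sales = pd.Series([100, 120, 130], index=index)

year_2023 = sales.loc["2023"]        # select by the outer level only
single = sales.loc[("2024", "Q1")]   # select by both levels
wide = sales.unstack("quarter")      # quarters become columns; gaps become NaN

print(year_2023)
print(wide)
```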

31
Q

Define ReIndexing?

A

Reindexing changes the index of the rows and columns of a DataFrame. We can reindex single or multiple rows with the reindex() method. Labels in the new index that are not present in the DataFrame are assigned NaN by default.

DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)

32
Q

How to Set the index?

A

We can set the index column when creating a DataFrame. But sometimes a DataFrame is built from two or more DataFrames, and then the index can be changed afterwards with the set_index() method.

33
Q

How to Reset the index?

A

The reset_index() method resets the index of a DataFrame to the default integer index. If the DataFrame has a MultiIndex, this method can remove one or more levels.
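
Setting and resetting the index can be sketched together on a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Name": ["Jane", "Nik"], "Age": [10, 35]})

indexed = df.set_index("Name")            # 'Name' becomes the row index
restored = indexed.reset_index()          # 'Name' moves back to a column
dropped = indexed.reset_index(drop=True)  # the old index is discarded

print(indexed)
print(restored)
print(dropped)
```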

34
Q

Describe Data Operations in Pandas?

A

In Pandas, there are different useful data operations for DataFrame, which are as follows:

Row and column selection
We can select any row or column of the DataFrame by passing its name. A single selected row or column is one-dimensional and is returned as a Series.

Filter Data
We can filter the data by passing a boolean expression to the DataFrame.

Null values
A null value occurs when no data is provided for an item. Missing values are usually represented as NaN.
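
The three operations above can be sketched on one hypothetical DataFrame that contains a missing age:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the last age is missing.
df = pd.DataFrame(
    {"name": ["Jane", "Nik", "Kate"], "age": [10.0, 35.0, np.nan]}
)

ages = df["age"]              # column selection returns a Series
adults = df[df["age"] > 18]   # boolean filter keeps matching rows only
missing = df["age"].isna()    # True where the value is NaN
filled = df["age"].fillna(0)  # replace NaN with a chosen value

print(adults)
print(missing)
print(filled)
```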

35
Q

Define GroupBy in Pandas?

A

In pandas, the groupby() function splits the data into groups based on some criteria, so that a function can be applied to each group and the results combined. Objects can be split along any of their axes.

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
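
A minimal split-and-aggregate sketch, using hypothetical departments and marks:

```python
import pandas as pd

# Hypothetical data: marks for students in three departments.
df = pd.DataFrame(
    {
        "dept": ["B.Sc", "B.Tech", "B.Sc", "M.Tech"],
        "marks": [60, 70, 80, 90],
    }
)

# Split rows by 'dept', then compute the mean of 'marks' within each group.
mean_marks = df.groupby("dept")["marks"].mean()

print(mean_marks)
```

The result is a Series indexed by the group keys, sorted by key because sort=True is the default.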

36
Q

Where did these questions come from?

A

Questions were sourced from:

https://www.javatpoint.com/python-pandas-interview-questions

37
Q

df = pd.DataFrame(np.arange(25).reshape(5, 5),
                  index=list('abcde'),
                  columns=['x', 'y', 'z', 8, 9])

    x   y   z   8   9
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24

–> Get rows c to e for columns x to z, and get all values from the fourth column

A

>>> df.loc['c':, :'z']  # rows 'c' and onwards AND columns up to 'z'
    x   y   z
c  10  11  12
d  15  16  17
e  20  21  22

>>> df.iloc[:, 3]  # all rows, but only the column at index location 3
a     3
b     8
c    13
d    18
e    23

38
Q

python remove index from pd dataframe

A

Use df.reset_index(drop=True). This resets the index to the default integer index and discards the existing index instead of inserting it as a column.

39
Q

Create a sample DataFrame with an index name from a dict

A

df = pd.DataFrame.from_dict({
    'Name': ['Jane', 'Nik', 'Kate', 'Melissa'],
    'Age': [10, 35, 34, 23]
}).set_index('Name')

40
Q

drop a column from a pd df by name and by index. e.g. remove the 2nd column (index 1)

A

df = df.drop(columns=['column_name'])
df = df.drop(df.columns[1], axis=1)

41
Q

remove duplicate values from a column in a Pandas DataFrame

A

df.drop_duplicates(subset=['Name']) (drops rows whose value in the Name column duplicates an earlier row)

42
Q

loc and ix

A

.loc[] (Label-Based Indexing):
.loc[] allows you to access rows and columns using labels (such as column names or index labels).
Syntax: df.loc[row_label, column_label]
Examples:
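To select a specific row by label: df.loc[2] (selects the row whose index label is 2, which is the third row with the default integer index)
To filter rows based on a condition: df.loc[df['Age'] > 30]
To update a specific cell value: df.loc[1, 'Name'] = 'Kate'

.ix[] was deprecated in pandas 0.20 and removed in pandas 1.0; use .loc[] or .iloc[] instead.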

43
Q

iloc

A

.iloc[] (Position-Based Indexing):
.iloc[] is used for integer-based indexing. It allows you to access rows and columns by their position (integer index).
Syntax: df.iloc[row_index, column_index]
Examples:
To select the second row: df.iloc[1]
To slice rows and columns: df.iloc[1:4, 0:2]
To update a specific cell value: df.iloc[0, 2] = 42

44
Q

Return the sorted, unique values that are in BOTH of the input arrays.

A

numpy.intersect1d(ar1, ar2, assume_unique=False, return_indices=False)

44
Q

Return the unique, sorted array of values that are in EITHER of the two input arrays.

A

numpy.union1d(ar1, ar2)

45
Q

Compute the q-th percentile of the data along the specified axis in np

A

numpy.percentile(a, q, axis=None, out=None, overwrite_input=False, method='linear', keepdims=False, *, interpolation=None)

46
Q

Generator vs. RandomState in np

A

The Generator provides access to a wide range of distributions and serves as a replacement for RandomState. The main difference between the two is that Generator relies on an additional BitGenerator to manage state and generate the random bits, which are then transformed into random values from useful distributions.
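
A short sketch of the Generator interface. np.random.default_rng() is the usual way to construct one, and the seed value here is arbitrary.

```python
import numpy as np

# default_rng builds a Generator on top of a BitGenerator (PCG64 by default).
rng = np.random.default_rng(seed=42)

ints = rng.integers(low=0, high=6, size=17)       # integers in [0, 6)
floats = rng.random(3)                            # floats in [0.0, 1.0)
normals = rng.normal(loc=0.0, scale=1.0, size=5)  # normal distribution samples

print(type(rng.bit_generator).__name__)
print(ints)
```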

47
Q

create a numpy array

A

(1) np.array
(2) np.zeros((x, y)) / np.ones((x, y))
(3) np.arange(x) / np.arange(x).reshape(y, z)

48
Q

force NumPy to print the entire array

A

np.set_printoptions(threshold=sys.maxsize) # sys module should be imported

49
Q

upcasting in numpy

A

When operating with arrays of different types, the type of the resulting array corresponds to the more general or precise one
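
Upcasting can be observed by checking the dtype of a mixed-type result. The arrays below are hypothetical.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int32)
floats = np.array([0.5, 1.5, 2.5], dtype=np.float64)
complexes = np.array([1j, 2j, 3j])

mixed = ints + floats           # int32 + float64 is upcast to float64
comp = floats + complexes       # float64 + complex128 is upcast to complex128

print(mixed.dtype, comp.dtype)
```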

50
Q

How are many unary operations implemented in Numpy?

A

as methods, e.g. a.sum() (a is a np array)

51
Q

NumPy provides familiar mathematical functions such as sin, cos, and exp. In NumPy - how are these called?

A

“universal functions” (ufunc).

B = np.arange(3)
np.exp(B)
--> array([1.        , 2.71828183, 7.3890561 ])
np.sqrt(B)
--> array([0.        , 1.        , 1.41421356])
52
Q

Multidimensional array: get each column in the second and third row of b

A

b[1:3, :]

53
Q

Can one-dimensional arrays in numpy be indexed, sliced and iterated over?

A

Yes, much like lists and other Python sequences.

54
Q

Produce a complete indexing tuple for e.g. an array with 5 axes - “shortcut”

A

dots (...) –> represent as many colons as needed to produce a complete indexing tuple. For example, if x is an array with 5 axes, then

x[1, 2, ...] is equivalent to x[1, 2, :, :, :].

55
Q

iterate over each element in a multidimensional array b

A

for element in b.flat:
    print(element)

56
Q

get the shape of array a, flatten it, and transpose it

A

a.shape
a.ravel()
a.T

57
Q

difference between reshape and resize

A

The reshape function returns its argument with a modified shape, whereas the ndarray.resize method modifies the array itself.
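
The difference can be sketched with two hypothetical arrays. A separate array is used for resize because resizing in place can fail when other arrays reference the same data.

```python
import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)   # returns a reshaped array; a itself keeps its shape

c = np.arange(6)
result = c.resize((2, 3))  # changes c in place and returns None

print(a.shape, b.shape)
print(c.shape, result)
```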

58
Q

stack arrays horizontally and vertically

A

hstack and vstack

59
Q

a
array([[6., 7., 6., 9., 0., 5., 4., 0., 6., 8., 5., 2.],
[8., 5., 5., 7., 1., 8., 6., 7., 1., 8., 1., 0.]])

–> Split a after the third and the fourth column

A

np.hsplit(a, (3, 4))
–>
[array([[6., 7., 6.],
[8., 5., 5.]]), array([[9.],
[7.]]), array([[0., 5., 4., 0., 6., 8., 5., 2.],
[1., 8., 6., 7., 1., 8., 1., 0.]])]

60
Q

a = np.array([4., 2.]) –> view a as a 2D column vector

A

a[:, np.newaxis]
–>
array([[4.],
[2.]])

61
Q

Does slicing an array return a view of it?

A

Yes

62
Q

view vs base

A

c = a.view()
c is a –> False
c.base is a –> True

63
Q

how does concatenate for a and b work?

A

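np.concatenate((a, b), axis=0) –> for 2-D arrays, axis=0 joins vertically (along rows) and axis=1 joins horizontally (along columns); 1-D arrays (such as the values of a Series) only have axis 0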
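
A small sketch of both axes, using hypothetical 2 x 2 arrays plus a 1-D case:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

vertical = np.concatenate((a, b), axis=0)    # rows of b go below rows of a
horizontal = np.concatenate((a, b), axis=1)  # columns of b go right of a

# 1-D arrays have only axis 0, so they are joined end to end.
flat = np.concatenate((np.array([1, 2]), np.array([3])))

print(vertical.shape, horizontal.shape, flat)
```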

64
Q

Singular Value Decomposition

What is SVD?

A
  • Singular Value Decomposition (SVD) is a fundamental matrix factorization technique used in linear algebra and numerical computing.
  • It decomposes a given matrix A into three matrices: U, Σ, and V^H (the conjugate transpose of V).
  • For a 2D matrix A, the SVD factorization is expressed as A = U Σ V^H, where:
  • U is a unitary matrix (with orthonormal columns).
  • Σ (Sigma) is a diagonal matrix containing the singular values of A.
  • V^H is the conjugate transpose of another unitary matrix V.
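
The factorization can be checked numerically with np.linalg.svd, which returns the singular values as a 1-D array rather than a diagonal matrix. The matrix A below is a hypothetical example.

```python
import numpy as np

# Hypothetical 3 x 2 real matrix.
A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])

# full_matrices=False gives the reduced form: U is 3 x 2, Vh is 2 x 2.
U, s, Vh = np.linalg.svd(A, full_matrices=False)

# Rebuild A from the three factors: U @ diag(s) @ Vh.
reconstructed = U @ np.diag(s) @ Vh

print(s)
print(np.allclose(reconstructed, A))
```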