Pandas Interview Questions Flashcards
Define Pandas/Python pandas?
Pandas is defined as an open-source library that provides high-performance data manipulation in Python.
The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets. It is used for data analysis in Python and was developed by Wes McKinney in 2008.
It can perform five significant steps that are required for processing and analysis of data irrespective of the origin of the data:
- load
- manipulate
- prepare
- model
- analyze
Mention the different types of Data Structures in Pandas?
Pandas provides two main data structures, Series and DataFrame. Both of these data structures are built on top of NumPy.
A Series is a one-dimensional data structure in pandas, whereas the DataFrame is the two-dimensional data structure in pandas.
Define Series in Pandas?
A Series is defined as a one-dimensional labeled array that is capable of storing various data types. The row labels of a Series are called the index. By using the pd.Series constructor, we can easily convert a list, tuple, or dictionary into a Series. A Series cannot contain multiple columns.
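As a minimal sketch of the conversions described above, the following builds a Series from a list, a tuple, and a dictionary. The variable names and sample values are illustrative only.

```python
import pandas as pd

# From a list: pandas assigns a default integer index (0, 1, 2)
from_list = pd.Series([10, 20, 30])

# From a tuple: same behaviour as a list
from_tuple = pd.Series((1.5, 2.5))

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"a": 1, "b": 2})

print(from_list.index.tolist())
print(from_dict.index.tolist())
```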
How can we calculate the standard deviation from the Series?
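The pandas std() method calculates the standard deviation of a Series or of the columns or rows of a DataFrame. With the default ddof=1 it returns the sample standard deviation.
Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
This signature comes from older pandas releases; the level argument was removed in pandas 2.0.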
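A short sketch of how ddof changes the result, using made-up sample values. The default ddof=1 divides by n - 1 (sample standard deviation), while ddof=0 divides by n (population standard deviation).

```python
import pandas as pd

# Illustrative data: mean is 5, sum of squared deviations is 32
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()            # ddof=1: sqrt(32 / 7)
population_std = s.std(ddof=0)  # ddof=0: sqrt(32 / 8) = 2.0

print(sample_std, population_std)
```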
Define DataFrame in Pandas?
A DataFrame is a widely used pandas data structure: a two-dimensional array with labeled axes (rows and columns). It is a standard way to store data and has two different indexes, i.e., a row index and a column index. It has the following properties:
- The columns can be of heterogeneous types, such as int and bool.
- It can be seen as a dictionary of Series in which both the rows and the columns are indexed. The column labels are exposed as “columns” and the row labels as “index”.
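The properties above can be sketched with a small DataFrame built from a dictionary of Series. The column names and row labels here are hypothetical.

```python
import pandas as pd

# Two Series with different dtypes that share the same row labels
df = pd.DataFrame({
    "count": pd.Series([1, 2], index=["r1", "r2"]),
    "flag": pd.Series([True, False], index=["r1", "r2"]),
})

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(df.dtypes)            # one dtype per column

# Selecting a column returns a Series, as with a dictionary lookup
count_column = df["count"]
```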
What are the significant features of the pandas Library?
The key features of the pandas library are as follows:
- Memory efficient
- Data alignment
- Reshaping
- Merge and join
- Time series
Explain Reindexing in pandas?
Reindexing conforms a DataFrame to a new index with optional filling logic. It places NA/NaN in the locations whose labels were not present in the previous index. It returns a new object unless the new index is equivalent to the current one and copy=False. It can be used to change the index of both the rows and the columns of the DataFrame.
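A minimal sketch of reindexing, assuming a small DataFrame with made-up labels. The label "d" is absent from the original index, so it receives NaN unless a fill_value is supplied.

```python
import pandas as pd

df = pd.DataFrame({"score": [10, 20, 30]}, index=["a", "b", "c"])

# "d" is not in the old index, so its row is filled with NaN
reindexed = df.reindex(["a", "c", "d"])

# Optional filling logic: use 0 for labels that were missing
filled = df.reindex(["a", "c", "d"], fill_value=0)

print(reindexed)
print(filled)
```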
What is the name of Pandas library tools used to create a scatter plot matrix?
scatter_matrix (available as pandas.plotting.scatter_matrix)
Define the different ways a DataFrame can be created in pandas?
We can create a DataFrame in the following ways:
- Lists
- Dict of ndarrays
Example-1: Create a DataFrame using List:
import pandas as pd
# a list of strings
a = ['Python', 'Pandas']
# Calling DataFrame constructor on list
info = pd.DataFrame(a)
print(info)
Output:
        0
0  Python
1  Pandas
Example-2: Create a DataFrame from dict of ndarrays:
import pandas as pd
info = {'ID': [101, 102, 103], 'Department': ['B.Sc', 'B.Tech', 'M.Tech']}
info = pd.DataFrame(info)
print(info)
Output:
    ID Department
0  101       B.Sc
1  102     B.Tech
2  103     M.Tech
Explain Categorical data in Pandas?
Categorical data is a pandas data type that corresponds to a categorical variable in statistics. A categorical variable takes a limited and usually fixed number of possible values. Examples: gender, country affiliation, blood type, social class, observation time, or rating via Likert scales. All values of categorical data are either in the categories or np.nan.
This data type is useful in the following cases:
- It is useful for a string variable that consists of only a few different values. If we want to save some memory, we can convert a string variable to a categorical variable.
- It is useful when the lexical order of a variable is not the same as its logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max use the logical order instead of the lexical order.
- It is useful as a signal to other Python libraries that this column should be treated as a categorical variable.
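The lexical versus logical ordering case can be sketched as follows. The sample values and the variable names are illustrative; pd.CategoricalDtype with ordered=True is what gives the categories their logical order.

```python
import pandas as pd

sizes = pd.Series(["two", "three", "one", "two"])

# Declare the logical order of the categories explicitly
size_order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
ordered = sizes.astype(size_order)

# Plain strings sort alphabetically; the categorical sorts by category order
lexical_sorted = sizes.sort_values().tolist()
logical_sorted = ordered.sort_values().tolist()

lexical_max = sizes.max()    # alphabetical maximum
logical_max = ordered.max()  # last category in the declared order

print(lexical_sorted, logical_sorted)
```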
How will you create a series from dict in Pandas?
A Series is defined as a one-dimensional array that is capable of storing various data types.
We can create a pandas Series from a dictionary. If a dictionary is passed as input and the index is not specified, the dictionary keys are used to construct the index. Older pandas releases sorted the keys; current releases keep the insertion order of the dictionary.
If an index is passed, the values that correspond to the labels in the index are extracted from the dictionary.
import pandas as pd
info = {'x': 0., 'y': 1., 'z': 2.}
a = pd.Series(info)
print(a)
Output:
x 0.0
y 1.0
z 2.0
dtype: float64
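The case where an index is passed can be sketched like this. The label "w" is a hypothetical key that is not in the dictionary, so its value comes out as NaN.

```python
import pandas as pd

info = {"x": 0.0, "y": 1.0, "z": 2.0}

# Only the labels listed in index are pulled from the dictionary,
# in the order given; "w" has no matching key and becomes NaN
selected = pd.Series(info, index=["z", "x", "w"])

print(selected)
```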
How can we create a copy of the series in Pandas?
We can create the copy of series by using the following syntax:
pandas.Series.copy
Series.copy(deep=True)
The above call makes a deep copy, which includes a copy of the data and the indices. If we set deep to False, neither the indices nor the data are copied.
Note: With deep=True the data is copied, but the actual Python objects are not copied recursively; only the references to the objects are copied.
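A small sketch of both points above, with illustrative values: changing a deep copy leaves the original untouched, while Python objects stored inside a Series (lists here) are still shared between the original and its deep copy.

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Deep copy (the default): the data is duplicated
deep = original.copy()
deep.iloc[0] = 100

# Objects inside the Series are not copied recursively:
# both Series hold references to the same list objects
nested = pd.Series([[1, 2], [3, 4]])
nested_copy = nested.copy()
nested.iloc[0].append(99)

print(original.tolist(), deep.tolist())
print(nested_copy.iloc[0])
```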
How will you create an empty DataFrame in Pandas?
A DataFrame is a widely used pandas data structure: a two-dimensional array with labeled axes (rows and columns). It is a standard way to store data and has two different indexes, i.e., a row index and a column index.
The code below shows how to create an empty DataFrame in pandas:
# importing the pandas library
import pandas as pd
info = pd.DataFrame()
print(info)
Output:
Empty DataFrame
Columns: []
Index: []
How will you add a column to a pandas DataFrame?
We can add a new column to an existing DataFrame, either by assigning a Series or by combining existing columns. The code below demonstrates both:
# importing the pandas library
import pandas as pd
info = {'one': pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']),
        'two': pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f'])}
info = pd.DataFrame(info)
# Add a new column to an existing DataFrame object
print("Add new column by passing series")
info['three'] = pd.Series([20, 40, 60], index=['a', 'b', 'c'])
print(info)
print("Add new column using existing DataFrame columns")
info['four'] = info['one'] + info['three']
print(info)
Output:
Add new column by passing series
   one  two  three
a  1.0    1   20.0
b  2.0    2   40.0
c  3.0    3   60.0
d  4.0    4    NaN
e  5.0    5    NaN
f  NaN    6    NaN
Add new column using existing DataFrame columns
   one  two  three  four
a  1.0    1   20.0  21.0
b  2.0    2   40.0  42.0
c  3.0    3   60.0  63.0
d  4.0    4    NaN   NaN
e  5.0    5    NaN   NaN
f  NaN    6    NaN   NaN
How to add an Index, row, or column to a Pandas DataFrame?
Adding an Index to a DataFrame
Pandas lets you pass values to the index argument when you create a DataFrame, which ensures that you have the desired index. If you don't specify an index, the DataFrame gets a default integer index that starts at 0 and ends at the number of rows minus one.
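A minimal sketch of the default index versus an explicit one. The column name and labels are made up for illustration.

```python
import pandas as pd

# No index argument: pandas uses 0, 1, 2
default_idx = pd.DataFrame({"ID": [101, 102, 103]})

# Explicit index argument: the given labels are used instead
custom_idx = pd.DataFrame({"ID": [101, 102, 103]}, index=["a", "b", "c"])

print(default_idx.index.tolist())
print(custom_idx.index.tolist())
```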
Adding Rows to a DataFrame
We can use .loc to insert rows in the DataFrame. The .iloc indexer can only address rows that already exist, and .ix is no longer available in current pandas.
The loc indexer works on the labels of the index. For example, loc[4] refers to the rows of the DataFrame whose index label is 4. Assigning to a label that does not exist yet adds a new row.
The iloc indexer works on the positions in the index. For example, iloc[4] refers to the row at position 4, that is, the fifth row. It cannot enlarge a DataFrame, so assigning to a position beyond the last row raises an error.
The ix indexer was a more complex case: if the index was integer-based, ix[4] was treated as a lookup of the label 4; if the index was not purely integer-based, ix dealt with positions like iloc. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc or iloc instead.
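A short sketch of adding a row by label and then reading it back by label and by position. The data reuses the ID/Department example from earlier in these notes; assigning to a label that is not yet in the index is what appends the row.

```python
import pandas as pd

df = pd.DataFrame({"ID": [101, 102], "Department": ["B.Sc", "B.Tech"]})

# Label 2 does not exist yet, so this assignment adds a new row
df.loc[2] = [103, "M.Tech"]

# loc looks up by label, iloc by position (here the last row)
row_by_label = df.loc[2]
row_by_position = df.iloc[-1]

print(df)
```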
Adding Columns to a DataFrame
If we want to add a column to the DataFrame, we can follow a similar procedure and assign to a new column label with loc or with plain bracket indexing. The iloc indexer cannot create a new column; it can only overwrite an existing one.
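The following sketch shows three common ways to add a column, using hypothetical column names: assignment through loc, plain bracket assignment, and DataFrame.insert for a specific position.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3]}, index=["a", "b", "c"])

# Assigning to a column label that does not exist creates the column
df.loc[:, "two"] = [10, 20, 30]

# Bracket assignment, here derived from existing columns
df["three"] = df["one"] + df["two"]

# insert() places the new column at a given position (0 = first)
df.insert(0, "zero", 0)

print(df)
```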
How to Delete Indices, Rows or Columns From a Pandas Data Frame?
Deleting an Index from Your DataFrame
If you want to remove the index from the DataFrame, you can do one of the following:
- Reset the index of the DataFrame with reset_index().
- Remove the index name, for example by setting df.index.name = None; del df.index.name appears in older tutorials and may not work in current pandas.
- Remove duplicate index values by resetting the index and dropping the duplicate values from the index column.
- Remove an index together with its row.
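The options above can be sketched as follows, assuming a small DataFrame whose index is named "key" and contains a duplicate label. All names and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.Index(["a", "b", "b"], name="key"),
)

# Reset the index: the old index becomes a regular column
flat = df.reset_index()

# Reset and discard the old index entirely
positional = df.reset_index(drop=True)

# Drop duplicate index values: reset, deduplicate, restore the index
deduped = (
    df.reset_index()
    .drop_duplicates(subset="key", keep="first")
    .set_index("key")
)

# Remove only the index name, working on a copy
unnamed = df.copy()
unnamed.index.name = None

print(flat)
print(deduped)
```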
Deleting a Column from Your DataFrame
You can use the drop() method for deleting a column from the DataFrame.
The axis argument that is passed to the drop() method is 0 to drop rows and 1 to drop columns.
You can pass the argument inplace and set it to True to delete the column without reassigning the DataFrame.
You can also delete the duplicate values from the column by using the drop_duplicates() method.
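A minimal sketch of the column operations above, using made-up columns A, B and C. The in-place variant is applied to a copy so that the original DataFrame stays available for comparison.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 4], "C": [5, 6, 7]})

# axis=1 tells drop() that "C" is a column label; returns a new DataFrame
without_c = df.drop("C", axis=1)

# inplace=True modifies the object directly and returns None
trimmed = df.copy()
trimmed.drop(columns=["B"], inplace=True)

# Remove duplicate values from a single column
unique_a = df["A"].drop_duplicates()

print(without_c.columns.tolist(), trimmed.columns.tolist())
```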
Removing a Row from Your DataFrame
By using df.drop_duplicates(), we can remove duplicate rows from the DataFrame.
You can use the drop() method and specify the index labels of the rows that you want to remove from the DataFrame.
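Both row removals can be sketched on a small DataFrame with illustrative labels x, y and z, where rows x and y hold identical values.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 4]}, index=["x", "y", "z"])

# Rows x and y are identical; by default the first occurrence is kept
no_dupes = df.drop_duplicates()

# Remove a row by its index label (axis=0 is the default)
without_z = df.drop("z")

print(no_dupes)
print(without_z)
```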