W3 Flashcards
What is Pandas?
Pandas is a third-party Python library for data analysis. It provides tools for importing, cleaning and aggregating tabular data.
- Main object: heterogeneous DataFrame
How do you install pandas on your computer?
conda install pandas (or, without conda: pip install pandas)
How do you import the pandas module?
import pandas as pd
How can you check the version of any module/library?
You can check the version of almost any library through its __version__ attribute.
EXAMPLE:
pd.__version__
What is tabular data?
Tabular data is data in a two-dimensional rectangular table structured with rows and columns
What is the data layout in tables?
- Rows:
* Each row represents one record or observation of an object or event
* Each row can have multiple pieces of information
* Each row has the same structure
- Columns:
* Each column represents an attribute or property of the observations
* Each column contains only one type of data
* Each column is labeled with a header
NOTE: Each table contains a set of observations of the same kind of object or event.
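NOTE: A minimal sketch of this layout, using a small made-up table (the names observations, name, age and height_m are illustrative only). Each row is one observation, and each column holds one type of data:

```python
import pandas as pd

# Made-up observations: one row per person, one column per attribute
observations = pd.DataFrame({
    "name": ["ann", "bob", "cy"],
    "age": [31, 45, 27],
    "height_m": [1.62, 1.80, 1.75],
})

print(observations.shape)    # (number of rows, number of columns)
print(observations.dtypes)   # one data type per column
```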
What are CSV files?
Often tabular data is stored in CSV files.
- CSV stands for comma-separated values
- Values are separated by the delimiter: ,
- The file has the extension .csv
How do we load CSV files into Python?
We can do so with pd.read_csv() by providing the path to the file. The data is loaded as a pd.DataFrame.
EXAMPLE:
auto = pd.read_csv('data/auto-mpg.csv')
How can we show the first 10 rows of the data set 'auto'?
auto = pd.read_csv('data/auto-mpg.csv')
auto.head(10)
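NOTE: A self-contained sketch of read_csv() and head() that runs without any file on disk. It assumes a few made-up rows in place of data/auto-mpg.csv; the variable names csv_text and cars are illustrative only.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file such as data/auto-mpg.csv
csv_text = """mpg,weight,model year,car name
18.0,3504,70,car a
15.0,3693,70,car b
24.0,2372,70,car c
"""

# read_csv accepts a file path or a file-like object
cars = pd.read_csv(io.StringIO(csv_text))

print(type(cars))      # the result is a pandas DataFrame
print(cars.head(2))    # first 2 rows; head() with no argument shows 5
```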
What are other file types for tabular data?
Text files with extension .txt and Excel files (e.g. with extension .xlsx) are other common file types for tabular data.
NOTE: In the text file used here, each field is separated by whitespace
When whitespace is used to separate the data, what additional argument do you need to load it?
When whitespace is used to separate the data, we can still load the data into a pandas DataFrame using read_csv().
- But we need to provide an additional argument: sep=r'\s+' (a regular expression matching one or more whitespace characters)
EXAMPLE1:
salary = pd.read_csv('data/Auto.txt', sep=r'\s+')
salary.head()
EXAMPLE2 (Excel file):
bitcoin = pd.read_excel('data/BTC-USD.xlsx')
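NOTE: A small sketch of loading whitespace-separated text, assuming made-up content in place of a real .txt file (the names text and table, and the columns, are illustrative only). The raw string r"\s+" keeps the backslash from being treated as a Python escape.

```python
import io
import pandas as pd

# Made-up text: fields separated by a varying number of spaces
text = """name   hours   rate
ann    40      12.5
bob    35      15.0
"""

# sep=r"\s+" treats any run of whitespace as a single delimiter
table = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(table)
```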
What is the structure of Pandas data?
- DataFrame: 2D data structure for tabular data
- Similar to R data.frame
- Heterogeneous: Different columns can have different types
- Series: 1D homogeneous data, can be considered as the “columns”
- Index: Sequence of row labels
NOTE: DataFrame can be considered as a dictionary of Series that all share the same index. Series is similar to 1D np.ndarray but with row labels (Index).
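NOTE: The "dictionary of Series" view can be sketched as follows, with made-up values and illustrative names (labels, mpg, name, frame):

```python
import pandas as pd

labels = ["a", "b", "c"]   # row labels shared by every column

# Two Series (1D, one type each) that share the same Index
mpg = pd.Series([18.0, 15.0, 24.0], index=labels)
name = pd.Series(["car a", "car b", "car c"], index=labels)

# A DataFrame built from a dictionary of Series
frame = pd.DataFrame({"mpg": mpg, "car name": name})

print(frame.index)                              # the shared row labels
print(frame["mpg"].index.equals(frame.index))   # True
```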
What are the similarities and differences between Pandas DataFrame/Series and NumPy’s ndarray?
Similarities:
* Syntax is similar
* Fast vectorised operations
Differences:
* Pandas is for heterogeneous data
* Pandas is for 1 and 2-dimensional data only
* Pandas data are labelled by row labels
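NOTE: A short side-by-side sketch of these points, using made-up marks (the names arr and df are illustrative only). The vectorised arithmetic looks the same, but pandas values are reached by label and NumPy values by position:

```python
import numpy as np
import pandas as pd

arr = np.array([[70, 88], [100, 92]])
df = pd.DataFrame(arr, index=["Harry", "Hermione"], columns=["ps_1", "ps_2"])

# Similar syntax, both vectorised
doubled_arr = arr * 2
doubled_df = df * 2

# NumPy: position only. pandas: row and column labels
print(arr[1, 0])                     # 100
print(df.loc["Hermione", "ps_1"])    # 100
```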
How can you find the index of a data set in pandas?
We can use .index to get the Index.
auto.head()
auto.index
OUT: RangeIndex(start=0, stop=398, step=1)
type(auto.index)
OUT: pandas.core.indexes.range.RangeIndex
How can we get a column from a pandas DataFrame?
We can use [] with the column label to get a “column”; the result is a Series with the same Index as the DataFrame.
EXAMPLE1:
auto['mpg']
OUT:
0 18.0
1 15.0
2 18.0
3 16.0
4 17.0
…
393 27.0
394 44.0
395 32.0
396 28.0
397 31.0
Name: mpg, Length: 398, dtype: float64
EXAMPLE2:
type(auto['mpg'])
OUT: pandas.core.series.Series
EXAMPLE3:
auto['mpg'].index
OUT: RangeIndex(start=0, stop=398, step=1)
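NOTE: The same checks on a small made-up DataFrame (cars and col are illustrative names), as a sketch that runs without the auto data set:

```python
import pandas as pd

# Made-up values; no index given, so pandas creates a RangeIndex
cars = pd.DataFrame({"mpg": [18.0, 15.0, 24.0], "weight": [3504, 3693, 2372]})

print(cars.index)                      # RangeIndex(start=0, stop=3, step=1)

col = cars["mpg"]                      # [] with a column label
print(isinstance(col, pd.Series))      # True
print(col.index.equals(cars.index))    # True: same row labels
```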
Do the row labels have to be unique in a pandas index?
The row labels that constitute an index do not have to be unique or numerical
EXAMPLE:
Use the column “model year” as the index:
auto_idx_by_year = auto.set_index('model year')
auto_idx_by_year.head()
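NOTE: A sketch of a non-unique index on made-up rows (cars and by_year are illustrative names). Selecting a repeated label with loc[] returns every row that carries it:

```python
import pandas as pd

# Made-up rows; two cars share the model year 70
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0],
    "model year": [70, 70, 71],
    "car name": ["car a", "car b", "car c"],
})

by_year = cars.set_index("model year")   # returns a new DataFrame

print(by_year.index.is_unique)   # False: the label 70 appears twice
print(by_year.loc[70])           # both rows labelled 70
```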
How can we select multiple columns in pandas and what do we get?
We can select multiple columns by providing a list of column labels
EXAMPLE:
auto[['mpg', 'weight']].head()
NOTE: What we get is a DataFrame
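NOTE: A quick sketch of the difference between a single label and a list of labels, on made-up values (all variable names are illustrative):

```python
import pandas as pd

cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0],
    "weight": [3504, 3693, 2372],
    "model year": [70, 70, 71],
})

two_cols = cars[["mpg", "weight"]]   # list of labels -> DataFrame
one_col_frame = cars[["mpg"]]        # list with one label -> still a DataFrame
one_col = cars["mpg"]                # single label -> Series

print(type(two_cols), type(one_col_frame), type(one_col))
```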
What are the different techniques for selecting rows in Pandas?
- We can use slicing-like syntax with row numbers.
EXAMPLE1: For the first 2 rows:
auto[:2]
NOTE: Neither auto[0] nor auto[[0, 1]] works, because that syntax is for column selection and there are no columns called 0 or 1 in auto.
- We can select rows based on a condition with df[condition] (similar to NumPy)
EXAMPLE2: Select all rows where mpg is 21:
auto[auto['mpg'] == 21]
- We can combine multiple conditions with & (and) and | (or), as with np.ndarray. Each condition must be wrapped in parentheses.
EXAMPLE3: Select all rows where mpg is 21 and model year is 70:
auto[(auto['mpg'] == 21) & (auto['model year'] == 70)]
EXAMPLE4: Select all rows where the car name is either 'datsun pl510' or 'datsun 1200':
auto[(auto['car name'] == 'datsun pl510') | (auto['car name'] == 'datsun 1200')]
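NOTE: The row-selection techniques above, sketched on a small made-up table so the results can be counted by hand (the values and the name cars are illustrative only):

```python
import pandas as pd

# Made-up rows; column names follow the auto data set
cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 18.0, 27.0],
    "model year": [70, 71, 70, 71],
    "car name": ["car a", "car b", "car c", "car d"],
})

first_two = cars[:2]                   # slicing selects rows by position
mpg_21 = cars[cars["mpg"] == 21.0]     # boolean mask keeps rows where True

# Parentheses are needed: & and | bind more tightly than ==
both = cars[(cars["mpg"] == 21.0) & (cars["model year"] == 70)]
either = cars[(cars["car name"] == "car a") | (cars["car name"] == "car d")]

print(len(first_two), len(mpg_21), len(both), len(either))   # 2 2 1 2
```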
How can you match multiple values in Pandas?
The isin() method makes it more convenient to find rows that match one of many possible values.
EXAMPLE1: Select all rows where the car name is either 'datsun pl510' or 'datsun 1200':
auto[auto['car name'].isin(['datsun pl510', 'datsun 1200'])]
NOTE: To match text that starts with the same word, consider using .str.startswith()
EXAMPLE2: Select all rows with the car name starting with “vw”:
auto[auto['car name'].str.startswith('vw')]
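NOTE: A runnable sketch of isin() and .str.startswith(). The mpg values and the names 'vw model a', 'vw model b' and 'other car' are made up for illustration:

```python
import pandas as pd

cars = pd.DataFrame({
    "car name": ["datsun pl510", "vw model a", "datsun 1200",
                 "vw model b", "other car"],
    "mpg": [27.0, 29.0, 35.0, 26.0, 19.0],   # made-up values
})

# isin() returns True where the value is one of the listed options
datsuns = cars[cars["car name"].isin(["datsun pl510", "datsun 1200"])]

# .str.startswith() tests the beginning of each string
vws = cars[cars["car name"].str.startswith("vw")]

print(datsuns)
print(vws)
```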
How can you specify the selection for both rows and columns?
To specify the selection for both rows and columns, or to select just a row or multiple columns by slicing, use loc[] and iloc[]
What is loc[]?
loc[] allows you to select rows and columns by:
- Labels
- Boolean array
EXAMPLE1: Selecting rows by conditions, and columns by slicing on labels:
auto.loc[auto['car name'] == 'datsun pl510', 'model year':'car name']
NOTE: As with a NumPy 2D array, the part before the comma selects the rows and the part after the comma selects the columns. When slicing with loc[], both endpoints are included!
EXAMPLE2: selecting rows by a list of row labels:
auto.loc[[0,3]]
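NOTE: A sketch of loc[] on a small made-up table (values and variable names are illustrative), showing a Boolean row selection, a label slice on columns, and the inclusive endpoint:

```python
import pandas as pd

cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 21.0],
    "weight": [3504, 3693, 2372, 2587],
    "model year": [70, 70, 71, 71],
    "car name": ["car a", "car b", "car c", "car d"],
})

# Rows by condition, columns by label slice (both endpoints included)
heavy = cars.loc[cars["weight"] > 3000, "model year":"car name"]

picked = cars.loc[[0, 3]]   # rows by a list of row labels
sliced = cars.loc[0:2]      # label slice: rows 0, 1 AND 2

print(heavy)
print(len(sliced))          # 3
```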
What is iloc[]?
iloc[] allows you to select rows and columns by integer position (starting from 0), regardless of the labels
EXAMPLE:
auto_idx_by_year.head()
auto_idx_by_year.iloc[[1,3]]
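NOTE: A sketch of iloc[] on a made-up table indexed by model year (values and variable names are illustrative). Positions ignore the labels, and position slices exclude the end, as in ordinary Python slicing:

```python
import pandas as pd

cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 24.0, 21.0],
    "model year": [70, 70, 71, 71],
    "car name": ["car a", "car b", "car c", "car d"],
})
by_year = cars.set_index("model year")

rows = by_year.iloc[[1, 3]]        # second and fourth rows by position
corner = by_year.iloc[0:2, 0:1]    # first two rows, first column only

print(rows)
print(corner.shape)                # (2, 1)
```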
What are the advantages of loc[] over iloc[]?
Advantages of loc[] over iloc[]:
- Easier to read
- Harder to make mistakes
- May still work even if the order of rows or columns changes
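NOTE: The last point can be sketched with a small marks table (the name shuffled is illustrative). After the rows and columns are reordered, the same loc[] lookup still finds the same value, while the same iloc[] lookup does not:

```python
import pandas as pd

marks = pd.DataFrame({"ps_1": [70, 100, 82], "ps_2": [88, 92, 83]},
                     index=["Harry", "Hermione", "Ron"])

# Same data with rows and columns in a different order
shuffled = marks.loc[["Ron", "Harry", "Hermione"], ["ps_2", "ps_1"]]

# Label-based lookup gives the same answer in both
print(marks.loc["Ron", "ps_1"], shuffled.loc["Ron", "ps_1"])   # 82 82

# Position-based lookup now points at a different cell
print(marks.iloc[2, 0], shuffled.iloc[2, 0])                   # 82 92
```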
How can we do element-wise calculations on pandas Series and DataFrames?
Arithmetic operators apply element by element, with values matched by label.
marks = pd.DataFrame({'ps_1': [70, 100, 82], 'ps_2': [88, 92, 83]}, index=['Harry', 'Hermione', 'Ron'])
EXAMPLE1:
marks['ps_1'] + marks['ps_2']
OUT: Harry 158
Hermione 192
Ron 165
EXAMPLE2: Alignment is based on the row label (Index), not the position:
ps_3 = pd.Series([100, 70, 90], index=['Hermione', 'Ron', 'Harry'])
(marks['ps_1'] + ps_3)/2
OUT: Harry 80.0
Hermione 100.0
Ron 76.0
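NOTE: A sketch of two related cases, using the same marks table. The Series ps_3_partial, which has no entry for Harry, is a made-up illustration. An operation on a whole DataFrame applies to every element, and a label present on only one side of an aligned operation produces a missing value (NaN):

```python
import pandas as pd

marks = pd.DataFrame({"ps_1": [70, 100, 82], "ps_2": [88, 92, 83]},
                     index=["Harry", "Hermione", "Ron"])

# Element-wise on the whole DataFrame: every mark as a fraction of 100
fractions = marks / 100

# Hypothetical Series with no label for Harry
ps_3_partial = pd.Series([100, 70], index=["Hermione", "Ron"])

total = marks["ps_1"] + ps_3_partial   # Harry has no match -> NaN

# add() with fill_value treats the missing entry as 0 instead
filled = marks["ps_1"].add(ps_3_partial, fill_value=0)

print(fractions)
print(total)
print(filled)
```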