W3 Flashcards

1
Q

What is Pandas?

A

Pandas is a third-party library for data analysis, providing high-level tools for tasks such as importing, cleaning and aggregating tabular data.

  • Main object: heterogeneous DataFrame
2
Q

How do you install pandas to your computer?

A

conda install pandas

(or, if you use pip: pip install pandas)

3
Q

How do you import the pandas module?

A

import pandas as pd

4
Q

How can you check the version of any module/library?

A

You can always check the version of almost any library using the __version__ attribute:

EXAMPLE:

pd.__version__

5
Q

What is tabular data?

A

Tabular data is data in a two-dimensional rectangular table, structured with rows and columns.

6
Q

What is the data layout in tables?

A
  1. Rows:
    * Each row represents one record or observation of an object or event
    * Each row can have multiple pieces of information
    * Each row has the same structure
  2. Columns:
    * Each column represents an attribute or property of the observations
    * Each column contains only one type of data
    * Each column is labeled with a header

NOTE: Each table contains a set of observations of the same kind of object or event.

7
Q

What are CSV files?

A

Often tabular data is stored in CSV files.

  • CSV stands for comma-separated values
  • Values are separated by the delimiter: ,
  • The file has the extension .csv
8
Q

How do we load CSV files into Python?

A

We can do so very easily with pd.read_csv(), providing the path to the file. The loaded data is returned as a pd.DataFrame.

EXAMPLE:
auto = pd.read_csv('data/auto-mpg.csv')

9
Q

How can we make the code show the first 10 rows of a data set ‘auto’?

A

auto = pd.read_csv('data/auto-mpg.csv')

auto.head(10)

10
Q

What are other file types for tabular data?

A

Text files with extension .txt and Excel files (e.g. with extension .xlsx) are other common file types for tabular data.

NOTE: In plain text files, each field is often separated by whitespace

11
Q

When whitespace is used to separate data, what additional argument do you need to load it?

A

When whitespace is used to separate the data, we can still load the data into pandas DataFrame using read_csv().

  • But we need to provide some additional arguments, i.e. sep='\s+'

EXAMPLE1:
salary = pd.read_csv('data/Auto.txt', sep='\s+')

salary.head()

EXAMPLE2:
bitcoin = pd.read_excel(open('data/BTC-USD.xlsx', 'rb'))

12
Q

What is the structure of Pandas data?

A
  • DataFrame: 2D data structure for tabular data
  • Similar to R data.frame
  • Heterogeneous: Different columns can have different types
  • Series: 1D homogeneous data, can be considered as the “columns”
  • Index: Sequence of row labels

NOTE: DataFrame can be considered as a dictionary of Series that all share the same index. Series is similar to 1D np.ndarray but with row labels (Index).
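The relationships in the note above can be checked directly. A minimal sketch with made-up column names and values, purely for illustration:

```python
import pandas as pd

# A DataFrame behaves like a dictionary of Series that share one Index.
# The data here is invented for illustration.
df = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0],                        # float64 column
    'car name': ['chevrolet', 'buick', 'plymouth'],   # object (text) column
})

col = df['mpg']                    # selecting a "column" gives a Series
print(type(col))                   # <class 'pandas.core.series.Series'>
print(col.index.equals(df.index))  # True: the Series shares the DataFrame's Index
print(df.dtypes)                   # heterogeneous: float64 and object
```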

13
Q

What are the similarities and differences between Pandas DataFrame/Series and NumPy’s ndarray?

A

Similarities:
* Syntax is similar
* Fast vectorised operations

Differences:
* Pandas is for heterogeneous data
* Pandas is for 1 and 2-dimensional data only
* Pandas data are labelled by row labels
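One practical consequence of the row labels: pandas aligns operands by label, while NumPy works purely by position. A small sketch with invented values:

```python
import numpy as np
import pandas as pd

# Two Series with the same labels in a different order (made-up numbers)
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

aligned = a + b    # pandas matches labels: x -> 1+30, y -> 2+20, z -> 3+10
positional = np.array([1, 2, 3]) + np.array([10, 20, 30])  # element-by-position

print(aligned.sort_index().tolist())   # [31, 22, 13]
print(positional.tolist())             # [11, 22, 33]
```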

14
Q

How can you find the index of a dataset in Pandas?

A

We can use .index to get the Index.

auto.head()

auto.index
OUT: RangeIndex(start=0, stop=398, step=1)

type(auto.index)
OUT: pandas.core.indexes.range.RangeIndex

15
Q

How can we get the column label from a pandas data structure?

A

We can use [] with the column label to get a “column” and we can see that it is a Series with the same Index.

EXAMPLE1:
auto['mpg']

OUT:
0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ...
393    27.0
394    44.0
395    32.0
396    28.0
397    31.0
Name: mpg, Length: 398, dtype: float64

EXAMPLE2:
type(auto['mpg'])

OUT: pandas.core.series.Series

EXAMPLE3:
auto['mpg'].index

OUT: RangeIndex(start=0, stop=398, step=1)

16
Q

Do the row labels in a pandas index have to be unique?

A

The row labels that constitute an index do not have to be unique nor numerical

EXAMPLE:
use the column “model year” as index:

auto_idx_by_year = auto.set_index('model year')
auto_idx_by_year.head()
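Since the real auto-mpg.csv may not be to hand, here is a toy stand-in (with invented values) showing how a non-unique index behaves:

```python
import pandas as pd

# Toy stand-in for the auto data: 'model year' becomes the index,
# and the label 70 appears twice.
cars = pd.DataFrame({
    'model year': [70, 70, 71],
    'mpg': [18.0, 15.0, 24.0],
})
by_year = cars.set_index('model year')

print(by_year.index.is_unique)   # False: the label 70 is duplicated
print(by_year.loc[70])           # selecting a duplicated label returns both rows
```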

17
Q

How can we select multiple columns in pandas and what do we get?

A

We can select multiple columns by providing a list of column labels

EXAMPLE:
auto[['mpg', 'weight']].head()

NOTE: What we get is a DataFrame

18
Q

What are the different techniques for selecting rows in Pandas?

A
  1. We can use slicing-like syntax with row numbers.
    EXAMPLE1: For the first 2 rows:

auto[:2]

NOTE: Neither auto[0] nor auto[[0,1]] works, as that syntax is for column selection and there are no columns called 0, 1 in auto.

  2. We can select rows based on some conditions with df[condition] (similar to NumPy)
    EXAMPLE2: Select all rows where mpg is 21:

auto[auto['mpg'] == 21]

  3. We can chain up multiple conditions with & and/or | like np.ndarray
    EXAMPLE3: Select all rows where mpg is 21 and model year is 70:

auto[(auto['mpg'] == 21) & (auto['model year'] == 70)]

EXAMPLE4: Select all rows where the car name is either 'datsun pl510' or 'datsun 1200':

auto[(auto['car name'] == 'datsun pl510') | (auto['car name'] == 'datsun 1200')]

19
Q

How can you match multiple values in Pandas?

A

The isin() method makes it more convenient to find rows that match one of many possible values.

EXAMPLE1: Select all rows where the car name is either 'datsun pl510' or 'datsun 1200':

auto[auto['car name'].isin(['datsun pl510', 'datsun 1200'])]

NOTE: If you want to match text that starts with the same word, consider using .str.startswith()

EXAMPLE2: Select all rows with the car name starting with "vw":

auto[auto['car name'].str.startswith('vw')]

20
Q

How can you specify the selection for both rows and columns?

A

To specify the selection for both rows and columns, or to select rows and multiple columns by slicing, use loc[] and iloc[]

21
Q

What is loc[]?

A

loc[] allows you to select rows and columns by:
- Labels
- Boolean array

EXAMPLE1: Selecting rows by conditions, and columns by slicing on labels:

auto.loc[auto['car name'] == 'datsun pl510', 'model year':'car name']

NOTE: Similar to NumPy 2D array, before the comma represents the rows to select, and after the comma represents the columns to select. When slicing with loc[], endpoints are included!

EXAMPLE2: selecting rows by a list of row labels:
auto.loc[[0,3]]

22
Q

What is iloc[]?

A

iloc[] allows you to select rows and columns by the position indexes

EXAMPLE:
auto_idx_by_year.head()

auto_idx_by_year.iloc[[1,3]]

23
Q

What are the advantages of loc[] over iloc[]?

A

Advantages of loc[] over iloc[]:
- Easier to read
- Harder to make mistakes
- May work even if the order of rows or columns changed

24
Q

How can we do element-wise calculations on Pandas Series and DataFrames?

A

marks = pd.DataFrame({'ps_1': [70, 100, 82], 'ps_2': [88, 92, 83]}, index=['Harry', 'Hermione', 'Ron'])

EXAMPLE1:
marks['ps_1'] + marks['ps_2']

OUT:
Harry 158
Hermione 192
Ron 165

EXAMPLE2: Alignment is based on the row label (Index), not the position:

ps_3 = pd.Series([100, 70, 90], index=['Hermione', 'Ron', 'Harry'])

(marks['ps_1'] + ps_3)/2

OUT:
Harry 80.0
Hermione 100.0
Ron 76.0

25
Q

What is data exploration and what are some common steps?

A

Data exploration helps to develop a sound understanding of the data before doing the actual analysis. For example:

  • What does the data contain?
  • How many observations are there? What attributes do we have?
  • Do we need some more data to answer the questions that we have?
  • What kind of questions can we ask?
  • Are there anomalies or egregious issues?
  • Any interesting patterns? E.g. relationships between variables?
  • What models are appropriate?

Some common steps of data exploration:

  • Take a quick look at the data and use metadata to learn about the size of the data and its attributes
  • Check if there is any issue with the data (e.g. missing or wrong data)
  • Use descriptive statistics/graphs to get some insights about the data
26
Q

How can we import and load the data into pandas and have a quick look at it?

A

Use pd.read_csv(), then head() or tail() to look at the first/last few rows:

import pandas as pd

dc = pd.read_csv('data/dc-wikia-data.csv')
display(dc.head(2))
display(dc.tail(2))
27
Q

What is metadata?

A

Metadata is "data about data", i.e. data that provides information about the main data but is not part of the main data.

EXAMPLE: For the DC character dataset:
  • Introduction of the data
  • Data source (including data range)
  • Information about the columns

NOTE: If metadata is available, please always read it; do not assume!
28
Q

What are some common possible issues with data?

A

  • Missing data: NaN hints that the corresponding data is not available in the dataset, and at potential data quality issues
  • Some pre-processing may be needed:
    - YEAR is currently stored as floating point numbers
    - Depending on the analysis, we may want to separate the information of actual name and universe from the column "name"
29
Q

How can you get the dimensions of a pd.DataFrame?

A

For a pd.DataFrame, we can quickly check the dimensions of the data with .shape

EXAMPLE:
dc.shape

OUT: (6896, 13)

The shape is a tuple with the number of rows and the number of columns.
30
Q

How can you check the types of data in a pd.DataFrame?

A

Check if the type of the data is what you expected with dtypes or info():

EXAMPLE:
dc.dtypes

NOTE:
  • object is for "Python object". It may be used for textual data
  • For the variables of interest, we may want to convert "YEAR" and "APPEARANCES" to int and "SEX" to category
  • If we also work on other variables, we may for example want to convert "ALIVE" to bool
  • You can also use info() to check the type information
31
Q

How can we check the amount of missing data in a pd.DataFrame?

A

Check the amount of missing data: if there are too many missing values, the dataset may not be useful.

We can check the number of missing values per column in Pandas with:

dc.isnull().sum()
32
Q

What are descriptive statistics and how can we generate them for each variable in Pandas?

A

A descriptive statistic quantitatively describes or summarises features from a collection of data.

We can use some simple descriptive statistics to get more of an idea about the data:
  • Central tendency: what is the "common" or "representative" value?
  • Data dispersion: the spread of the data, e.g. do we have a wide range of values?
  • Relationship between variables

We can use describe() to generate a selection of descriptive statistics on each variable (univariate):

EXAMPLE:
dc.describe(include='all')
33
Q

RECAP: What are the different types of attributes/variables?

A

  1. Quantitative (i.e. numerical) data, which can be either:
    a) Discrete: a finite number of values are possible in any bounded interval. Example: number of appearances of characters
    b) Continuous: an infinite number of values are possible in any bounded interval. Example: height of people
  2. Categorical (or qualitative) data:
    a) Ordinal: non-numerical but has a ranking. Example: level of Python of students
    b) Nominal: no inherent order among the values. Example: eye colours of the characters
34
Q

How can you attain the measures of central tendency in Pandas (mean, median, mode)?

A

Apart from getting the central tendency statistics from describe(), we can use methods like mean(), median() and mode(), similar to how it is done with np.ndarray.

NOTE: By default, the calculation is along axis 0 (i.e. aggregating each column), and missing data is ignored in the calculation.

EXAMPLE1: To attain the mean of all columns with numerical data:
dc.mean(numeric_only=True)

EXAMPLE2: To find the median of one column:
dc['APPEARANCES'].median()

EXAMPLE3: The mode of one column:
dc['APPEARANCES'].mode()
35
Q

How do the mean and median differ in terms of central tendency?

A

  • The mean is more sensitive to extreme values (outliers) than the median, or we say the median is more robust
  • With outliers, the median provides a better measure of the "central" or "representative" value than the mean
  • The difference between the mean and median also gives us some idea about the skewness of the data
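A quick numeric sketch of this point, using made-up salary figures:

```python
import pandas as pd

# Identical data except for one extreme value (numbers invented for illustration)
salaries = pd.Series([30, 32, 35, 38, 40])
with_outlier = pd.Series([30, 32, 35, 38, 400])

print(salaries.mean(), salaries.median())          # 35.0 35.0
print(with_outlier.mean(), with_outlier.median())  # 107.0 35.0 -- the mean moved, the median did not
```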
36
Q

What are the measures of dispersion for quantitative variables?

A

The dispersion of a sample of observations measures the variation of the data. Common examples of measures of dispersion are:

  • Range = maximum value - minimum value
  • Interquartile range (IQR): the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data, i.e. IQR = Q3 - Q1
  • Variance
  • Standard deviation: the square root of the variance
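The measures above can be computed on a small invented sample:

```python
import pandas as pd

# Small made-up sample, purely for illustration
x = pd.Series([2, 4, 4, 4, 6, 8, 8])

data_range = x.max() - x.min()              # maximum - minimum
iqr = x.quantile(0.75) - x.quantile(0.25)   # Q3 - Q1
variance = x.var()                          # sample variance (ddof=1 by default)
std = x.std()                               # square root of the variance

print(data_range, iqr)                  # 6 3.0
print(abs(std**2 - variance) < 1e-12)   # True: std is the square root of the variance
```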
37
Q

How can you attain measures of dispersion in Pandas?

A

You can get some values from describe() for measuring dispersion. Alternatively, you can use methods like max(), min(), quantile(), etc. for quantitative variables:

  • Range:
dc.APPEARANCES.max() - dc.APPEARANCES.min()

  • IQR:
dc.APPEARANCES.quantile(0.75) - dc.APPEARANCES.quantile(0.25)
38
Q

How do you attain the standard deviation in Pandas, and how can you find the proportion of data within 1 or 2 standard deviations of the mean?

A

Use .std() for the standard deviation. To calculate the proportion of data within 1 or 2 standard deviations of the mean:

EXAMPLE:
mu = dc.APPEARANCES.mean()
sd = dc.APPEARANCES.std()

# within 1 sd
((mu - sd <= dc.APPEARANCES) & (dc.APPEARANCES <= mu + sd)).mean()

# within 2 sd
((mu - 2*sd <= dc.APPEARANCES) & (dc.APPEARANCES <= mu + 2*sd)).mean()
39
Q

How can you get insights from categorical data?

A

We can use the number of distinct values and counts (or relative frequencies) to get some idea about the variability and distribution of categorical data.

  1. Find the number of distinct values with nunique():

dc['SEX'].nunique()

  2. Find the unique values themselves with unique():

dc['SEX'].unique()

OUT: array(['Male Characters', 'Female Characters', nan, 'Genderless Characters', 'Transgender Characters'], dtype=object)
40
Q

How can we count the frequency of each category?

A

To count the appearances of each category, we can use value_counts():

EXAMPLE:
dc['SEX'].value_counts()

To get a tabulation of the frequencies, we use to_frame():

EXAMPLE:
dc['SEX'].value_counts().to_frame()
41
Q

How do you calculate the percentage count for the frequencies?

A

To calculate percentage counts, use the additional argument normalize=True:

dc['SEX'].value_counts(normalize=True).to_frame().mul(100).round(1)
42
Q

What happens if you read a CSV file without additional arguments?

A

If we read the file using read_csv() without providing additional arguments:

aex = pd.read_csv('data/AEX.csv')

then the default index is used, the first line in the file is treated as the header, and the "date" column has the type object.
43
Q

How can you read a file via its URL?

A

url = 'https://www.hkex.com.hk/eng/dwrc/search/dwFullList.csv'
df = pd.read_csv(url, encoding='utf-16', sep='\t', skiprows=1, skipfooter=3, engine='python')
44
Q

What is data wrangling?

A

Data wrangling (or data cleaning, data munging) refers to a variety of processes designed to transform raw data into more readily used formats.

EXAMPLES:
  • Filtering: removing unnecessary or irrelevant data
  • Formatting the data
  • Handling extreme outliers, missing, duplicate or wrong values in the data
  • Merging multiple data sources

NOTE: How to perform data wrangling depends on the data you are working on and the goal you are trying to achieve.
45
Q

DC EXAMPLE: How do you load in the data set and view it?

A

import pandas as pd

dc = pd.read_csv('data/dc-wikia-data.csv')
dc.head()
46
Q

DC EXAMPLE: How can you view the data types?

A

dc.dtypes
47
Q

DC EXAMPLE: How can you find descriptive statistics for the dataset?

A

dc.describe(include='all')
48
Q

How can you discard rows or columns?

A

Discard rows or columns by using drop()
49
Q

DC EXAMPLE: How can you keep only the columns name, SEX, APPEARANCES, YEAR and ALIGN?

A

METHOD1: Drop the unwanted column(s):

dc.drop('page_id', axis='columns', inplace=True)
dc.head(2)

axis='columns' means we drop the column(s); inplace=True means we change the original data.

METHOD2: Alternatively, you can select the columns you want, BUT NOTE that reassignment is needed:

dc = dc[['name', 'SEX', 'APPEARANCES', 'YEAR', 'ALIGN']]
display(dc.head(3))
display(dc.tail(3))
50
Q

How can you filter data fulfilling certain conditions?

A

Use .between() or .isin(). Note that .between() includes the endpoints.

EXAMPLE1: Select data with YEAR between 2000 and 2010:

dc[dc['YEAR'].between(2000, 2010)].head()

EXAMPLE2: Select characters that are not male or female (note that characters with no gender information will also be included):

dc[~dc['SEX'].isin(['Male Characters', 'Female Characters'])].head()
51
Q

How can you rename columns?

A

Use .rename(). NOTE: we have name in lowercase but the other columns are in capital letters. We can unify this by renaming the name column using rename():

EXAMPLE1:
dc.rename(columns=str.upper).head()

NOTE: This uses the string function upper(), i.e. str.upper('name').

EXAMPLE2: Alternatively, we can provide columns={'name': 'NAME'}, which indicates we want to change the column name from 'name' to 'NAME':

dc.rename(columns={'name': 'NAME'}, inplace=True)
dc.head()
52
Q

How can you convert variables to the desired type?

A

Use astype():

EXAMPLE:
dc['SEX'] = dc['SEX'].astype('category')
dc['ALIGN'] = dc['ALIGN'].astype('category')

NOTE: We may not be able to convert columns to int, as non-finite values (NA or inf) cannot be converted to int (this raises an error).
53
Q

How do you sort the data by column or index?

A

SORT BY COLUMN:
dc.sort_values(by='APPEARANCES', ascending=False).head()

SORT BY INDEX:
cases.sort_index().head(3)
54
Q

How can you remove duplicated data?

A

Use drop_duplicates(). Sometimes we may have duplicate values in the data. For example, some students handed in the pre-course survey twice as they thought the survey was not submitted properly the first time.

EXAMPLE:
students.drop_duplicates(inplace=True)
55
Q

What is missing data?

A

Missing data occurs when no data value is stored for a variable in an observation. Missing data is a common problem and can arise in various ways:

  • Survey data:
    - Participants randomly miss some questions
    - A respondent chooses not to respond to questions like "Have you ever used generative AI tools for summative coursework?"
  • Study/experiment over time:
    - Participants drop out of the study in medical research
    - A new variable is collected partway through the data collection of a study
  • Others:
    - Corrupted results or measurements
    - Movie reviews: each user only explicitly expresses his or her preferences for a small subset of movies
56
Q

How is missing data represented in Pandas?

A

NaN is used to represent missing data
57
Q

How do you find missing data in Pandas?

A

isnull() and notnull() can be used to check whether the data has missing values. isnull() gives the value True where the corresponding data is missing.

NOTE: Often, by default, Pandas ignores the missing data for you when calculating statistics.
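A self-contained sketch of both checks on an invented Series, also showing how statistics skip NaN by default:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])   # made-up data with one missing value

print(s.isnull().tolist())    # [False, True, False] -- True where data is missing
print(s.notnull().tolist())   # [True, False, True]
print(s.isnull().sum())       # 1 missing value in total
print(s.mean())               # 2.0 -- the NaN is ignored in the calculation
```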
58
Q

What are simple ways to handle missing data?

A

  1. Drop the observations that have any missing values
  2. Fill the missing data with some substituted values
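Both strategies side by side on a toy DataFrame (values invented; the mean is used here only as one possible substitute):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [10.0, np.nan, 30.0]})   # one missing value

dropped = df.dropna()                    # 1. discard incomplete observations
filled = df.fillna(df['score'].mean())   # 2. substitute a value (here the mean, 20.0)

print(len(dropped))                # 2 rows remain
print(filled['score'].tolist())    # [10.0, 20.0, 30.0]
```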
59
Q

What is an example of a model that requires the given data to be free from missing data?

A

We can fit a linear regression model when there is no missing data, but not when there is missing data.
60
Q

How do you drop missing observations in Pandas?

A

We can use the method dropna() to drop NaN data. By default it keeps only the rows with all attributes present.

EXAMPLE (here we explicitly create a copy):
dc_no_na = dc.dropna().copy()
display(dc_no_na.head(3))
display(dc_no_na.tail(3))
61
Q

What should you be cautious of when using dropna()?

A

By default, only rows with all attributes present are kept. If there is a lot of missing data in some particular columns, many rows will be dropped.
62
Q

What argument can you include to drop only the rows where all values are missing?

A

You can use the additional argument how='all' in dropna():

import numpy as np

students = pd.DataFrame({'names': ['Harry', 'Ron', np.nan, 'Hermione'], 'python_level': ['High', 'Low', np.nan, np.nan]})
students.dropna(how='all')
63
Q

What are the implications of dropping observations?

A

  1. Can result in losing a lot of data
  2. May cause bias

For example, we have students' marks and we ask students to tell us the number of hours they spent revising the course materials. Assume:
  • Students who spent a low number of hours on revision are more likely not to answer the question
  • The number of hours and the mark are positively correlated

By discarding the observations where the number of hours is missing, the average mark from the data will be different.

See notes for EXAMPLES
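The bias can be demonstrated with a tiny invented marks/hours table in which exactly the low-hours students left the question blank:

```python
import numpy as np
import pandas as pd

# Hypothetical data: low-hours students did not report their hours,
# and hours are positively related to marks.
marks = pd.DataFrame({
    'mark':  [40, 45, 55, 70, 80, 90],
    'hours': [np.nan, np.nan, 5, 10, 15, 20],
})

true_mean = marks['mark'].mean()               # average over all students
dropped_mean = marks.dropna()['mark'].mean()   # average after discarding missing rows

print(round(true_mean, 2), dropped_mean)   # 63.33 73.75 -- dropping biased the mean upwards
```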
64
Q

Should you remove rows with missing data?

A

  • Sometimes we need to fit data into a model that does not allow missing data. Then removing missing data is the simplest way to allow us to fit the model
  • Consider whether the data is missing at random
  • Sometimes we may want to fill in the missing data instead. For example:
    - There is too much missing data, and removing those rows will leave too few data points
    - We know how we can fill the missing data, so it is better to fill it in than to remove it, e.g. missing UK coronavirus vaccine data in 2020 can be filled with 0
65
Q

What is imputation?

A

Imputation is the process of replacing missing data with substituted values. We will talk about two ways in this course:

  • Impute the mean/median (if quantitative) or the most common class (if nominal) for all missing values
  • Fill the missing data using data points before and after the missing data point (e.g. for time series)
66
Q

How can you fill in missing data?

A

With Pandas, you can fill the missing data using fillna() and provide the values/methods to fill the missing data:

EXAMPLE:
dc_fill_na = pd.read_csv('data/dc-wikia-data.csv', usecols=['name', 'SEX', 'APPEARANCES', 'YEAR', 'ALIGN'])
dc_fill_na.isnull().sum()
67
Q

How can you replace missing data by the mean/median/mode?

A

# assume 1 appearance if missing
dc_fill_na['APPEARANCES'] = dc_fill_na['APPEARANCES'].fillna(1)

# fill with the mean for quantitative data
dc_fill_na['YEAR'] = dc_fill_na['YEAR'].fillna(round(dc_fill_na['YEAR'].mean()))

SEE NOTES for filling with the mode, or with a new category representing missing data, for categorical data.
68
Q

What are some of the implications of imputation with the mean/median/mode?

A

  1. May cause bias
  2. Filling with the mode may not be appropriate
  3. We are likely to "overestimate" or "underestimate"
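The "underestimate" point can be seen with mean imputation on an invented Series: the filled-in points sit exactly at the mean, so the measured spread shrinks:

```python
import numpy as np
import pandas as pd

x = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])   # made-up data
imputed = x.fillna(x.mean())                        # both NaNs become 30.0

print(x.std())         # 20.0, from the three observed values
print(imputed.std())   # ~14.14: the imputed points add no spread
```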
69
Q

What is time series data?

A

A time series is a series of data points indexed in time order, at successive equally spaced points in time (i.e. it can be visualised via a line plot)
70
Q

Filling missing data: how can we forward fill missing data and what does it mean?

A

We can use ffill() to 'forward fill' with the previous value (useful with time series data if there is a missing value). It is a sensible way to fill the data here, as the data is missing because the market was not open on those days.

EXAMPLE:
ffill_price = price.ffill()
pd.DataFrame({'original': price, 'ffill': ffill_price}).head(7)
71
Q

Filling missing data: how can we linearly interpolate, and what does it mean?

A

Another way to fill missing data is linear interpolation, using interpolate(method='linear'). Linear interpolation is a simple way to fill missing data if there are some (simple) patterns in the data.

EXAMPLE:
linear_fill_price = price.interpolate(method='linear')
pd.DataFrame({'original': price, 'linear_fill': linear_fill_price}).head(7)
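Since the price data itself is not shown, here is a self-contained comparison of the two fills on an invented Series with a two-day gap:

```python
import numpy as np
import pandas as pd

price = pd.Series([100.0, np.nan, np.nan, 106.0])   # toy price series

print(price.ffill().tolist())
# [100.0, 100.0, 100.0, 106.0] -- each gap repeats the last known value

print(price.interpolate(method='linear').tolist())
# [100.0, 102.0, 104.0, 106.0] -- the gap is filled along a straight line
```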