W3 Flashcards

1
Q

What is Pandas?

A

Pandas is a third-party library for data analysis, providing high-level tools for tasks such as importing, cleaning and aggregating tabular data.

  • Main object: heterogeneous DataFrame
2
Q

How do you install pandas to your computer?

A

conda install pandas

(or, if you use pip: pip install pandas)

3
Q

How do you import the pandas module?

A

import pandas as pd

4
Q

How can you check the version of any module/library?

A

You can always check the version of almost any library using the __version__ attribute:

EXAMPLE:

pd.__version__

5
Q

What is tabular data?

A

Tabular data is data in a two-dimensional rectangular table, structured with rows and columns.

6
Q

What is the data layout in tables?

A
  1. Rows:
    * Each row represents one record or observation of an object or event
    * Each row can have multiple pieces of information
    * Each row has the same structure
  2. Columns:
    * Each column represents an attribute or property of the observations
    * Each column contains only one type of data
    * Each column is labeled with a header

NOTE: Each table contains a set of observations of the same kind of object or event.

7
Q

What are CSV files?

A

Often tabular data is stored in CSV files.

  • CSV stands for comma-separated values
  • Values are separated by the delimiter: ,
  • The file has the extension .csv
8
Q

How do we load CSV files into Python?

A

We can do so very easily with pd.read_csv(), providing the path to the file. The loaded data is returned as a pd.DataFrame.

EXAMPLE:
auto = pd.read_csv('data/auto-mpg.csv')

9
Q

How can we make the code show the first 10 rows of a data set ‘auto’?

A

auto = pd.read_csv('data/auto-mpg.csv')

auto.head(10)

10
Q

What are other file types for tabular data?

A

Text files with extension .txt and Excel files (e.g. with extension .xlsx) are other common file types for tabular data.

NOTE: In plain text files, each field is often separated by whitespace

11
Q

When whitespace is used to separate data, what additional argument do you need to load it?

A

When whitespace is used to separate the data, we can still load the data into pandas DataFrame using read_csv().

  • But we need to provide some additional arguments, i.e. sep='\s+'

EXAMPLE1:
salary = pd.read_csv('data/Auto.txt', sep='\s+')

salary.head()

EXAMPLE2:
bitcoin = pd.read_excel(open('data/BTC-USD.xlsx', 'rb'))

12
Q

What is the structure of Pandas data?

A
  • DataFrame: 2D data structure for tabular data
  • Similar to R data.frame
  • Heterogeneous: Different columns can have different types
  • Series: 1D homogeneous data, can be considered as the “columns”
  • Index: Sequence of row labels

NOTE: DataFrame can be considered as a dictionary of Series that all share the same index. Series is similar to 1D np.ndarray but with row labels (Index).
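The relationships in the note above can be checked directly. A minimal sketch with made-up column names and values, purely for illustration:

```python
import pandas as pd

# A DataFrame behaves like a dictionary of Series that share one Index.
# The data here is invented for illustration.
df = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0],                        # float64 column
    'car name': ['chevrolet', 'buick', 'plymouth'],   # object (text) column
})

col = df['mpg']                    # selecting a "column" gives a Series
print(type(col))                   # <class 'pandas.core.series.Series'>
print(col.index.equals(df.index))  # True: the Series shares the DataFrame's Index
print(df.dtypes)                   # heterogeneous: float64 and object
```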

13
Q

What are the similarities and differences between Pandas DataFrame/Series and NumPy’s ndarray?

A

Similarities:
* Syntax is similar
* Fast vectorised operations

Differences:
* Pandas is for heterogeneous data
* Pandas is for 1 and 2-dimensional data only
* Pandas data are labelled by row labels
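One practical consequence of the row labels: pandas aligns operands by label, while NumPy works purely by position. A small sketch with invented values:

```python
import numpy as np
import pandas as pd

# Two Series with the same labels in a different order (made-up numbers)
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

aligned = a + b    # pandas matches labels: x -> 1+30, y -> 2+20, z -> 3+10
positional = np.array([1, 2, 3]) + np.array([10, 20, 30])  # element-by-position

print(aligned.sort_index().tolist())   # [31, 22, 13]
print(positional.tolist())             # [11, 22, 33]
```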

14
Q

How can you find the index of a dataset in Pandas?

A

We can use .index to get the Index.

auto.head()

auto.index
OUT: RangeIndex(start=0, stop=398, step=1)

type(auto.index)
OUT: pandas.core.indexes.range.RangeIndex

15
Q

How can we get the column label from a pandas data structure?

A

We can use [] with the column label to get a “column” and we can see that it is a Series with the same Index.

EXAMPLE1:
auto['mpg']

OUT:
0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ...
393    27.0
394    44.0
395    32.0
396    28.0
397    31.0
Name: mpg, Length: 398, dtype: float64

EXAMPLE2:
type(auto['mpg'])

OUT: pandas.core.series.Series

EXAMPLE3:
auto['mpg'].index

OUT: RangeIndex(start=0, stop=398, step=1)

16
Q

Do the row labels in a pandas index have to be unique?

A

The row labels that constitute an index do not have to be unique nor numerical

EXAMPLE:
use the column “model year” as index:

auto_idx_by_year = auto.set_index('model year')
auto_idx_by_year.head()
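Since the real auto-mpg.csv may not be to hand, here is a toy stand-in (with invented values) showing how a non-unique index behaves:

```python
import pandas as pd

# Toy stand-in for the auto data: 'model year' becomes the index,
# and the label 70 appears twice.
cars = pd.DataFrame({
    'model year': [70, 70, 71],
    'mpg': [18.0, 15.0, 24.0],
})
by_year = cars.set_index('model year')

print(by_year.index.is_unique)   # False: the label 70 is duplicated
print(by_year.loc[70])           # selecting a duplicated label returns both rows
```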

17
Q

How can we select multiple columns in pandas and what do we get?

A

We can select multiple columns by providing a list of column labels

EXAMPLE:
auto[['mpg', 'weight']].head()

NOTE: What we get is a DataFrame

18
Q

What are the different techniques for selecting rows in Pandas?

A
  1. We can use slicing-like syntax with row numbers.
    EXAMPLE1: For the first 2 rows:

auto[:2]

NOTE: Neither auto[0] nor auto[[0,1]] works, as that syntax is for column selection and there are no columns called 0, 1 in auto.

  2. We can select rows based on some conditions with df[condition] (similar to NumPy)
    EXAMPLE2: Select all rows where mpg is 21:

auto[auto['mpg'] == 21]

  3. We can chain up multiple conditions with & and/or | like np.ndarray
    EXAMPLE3: Select all rows where mpg is 21 and model year is 70:

auto[(auto['mpg'] == 21) & (auto['model year'] == 70)]

EXAMPLE4: Select all rows where the car name is either 'datsun pl510' or 'datsun 1200':

auto[(auto['car name'] == 'datsun pl510') | (auto['car name'] == 'datsun 1200')]

19
Q

How can you match multiple values in Pandas?

A

The isin() method makes it more convenient to find rows that match one of many possible values.

EXAMPLE1: Select all rows where the car name is either 'datsun pl510' or 'datsun 1200':

auto[auto['car name'].isin(['datsun pl510', 'datsun 1200'])]

NOTE: If you want to match text that starts with the same word, consider using .str.startswith()

EXAMPLE2: Select all rows with the car name starting with "vw":

auto[auto['car name'].str.startswith('vw')]

20
Q

How can you specify the selection for both rows and columns?

A

To specify the selection for both rows and columns, or to select rows and multiple columns by slicing, use loc[] and iloc[]

21
Q

What is loc[]?

A

loc[] allows you to select rows and columns by:
- Labels
- Boolean array

EXAMPLE1: Selecting rows by conditions, and columns by slicing on labels:

auto.loc[auto['car name'] == 'datsun pl510', 'model year':'car name']

NOTE: Similar to NumPy 2D array, before the comma represents the rows to select, and after the comma represents the columns to select. When slicing with loc[], endpoints are included!

EXAMPLE2: selecting rows by a list of row labels:
auto.loc[[0,3]]

22
Q

What is iloc[]?

A

iloc[] allows you to select rows and columns by the position indexes

EXAMPLE:
auto_idx_by_year.head()

auto_idx_by_year.iloc[[1,3]]

23
Q

What are the advantages of loc[] over iloc[]?

A

Advantages of loc[] over iloc[]:
- Easier to read
- Harder to make mistakes
- May work even if the order of rows or columns changed

24
Q

How can we do element-wise calculations on Pandas Series and DataFrames?

A

marks = pd.DataFrame({'ps_1': [70, 100, 82], 'ps_2': [88, 92, 83]}, index=['Harry', 'Hermione', 'Ron'])

EXAMPLE1:
marks['ps_1'] + marks['ps_2']

OUT:
Harry 158
Hermione 192
Ron 165

EXAMPLE2: Alignment is based on the row label (Index), not the position:

ps_3 = pd.Series([100, 70, 90], index=['Hermione', 'Ron', 'Harry'])

(marks['ps_1'] + ps_3)/2

OUT:
Harry 80.0
Hermione 100.0
Ron 76.0

25
Q

What is data exploration and what are some common steps?

A

Data exploration helps to develop a sound understanding of the data before doing the actual analysis. For example:

  • What does the data contain?
  • How many observations are there? What attributes do we have?
  • Do we need some more data to answer the questions that we have?
  • What kind of questions can we ask?
  • Are there anomalies or egregious issues?
  • Any interesting patterns? E.g. relationships between variables?
  • What models are appropriate?

Some common steps of data exploration:

  • Take a quick look at the data and use metadata to learn about the size of the data and its attributes
  • Check if there is any issue with the data (e.g. missing or wrong data)
  • Use descriptive statistics/graphs to get some insights about the data
26
Q

How can we import and load the data into pandas and have a quick look at it?

A

Use pd.read_csv(), then head() or tail() to look at the first/last few rows:

import pandas as pd

dc = pd.read_csv('data/dc-wikia-data.csv')
display(dc.head(2))
display(dc.tail(2))
27
Q

What is metadata?

A

Metadata is "data about data", i.e. data that provides information about the main data but is not part of the main data.

EXAMPLE: For the DC character dataset:
  • Introduction of the data
  • Data source (including data range)
  • Information about the columns

NOTE: If metadata is available, please always read it; do not assume!
28
Q

What are some common possible issues with data?

A

  • Missing data: NaN hints that the corresponding data is not available in the dataset, and at potential data quality issues
  • Some pre-processing may be needed:
    - YEAR is currently stored as floating point numbers
    - Depending on the analysis, we may want to separate the information of actual name and universe from the column "name"
29
Q

How can you get the dimensions of a pd.DataFrame?

A

For a pd.DataFrame, we can quickly check the dimensions of the data with .shape

EXAMPLE:
dc.shape

OUT: (6896, 13)

The shape is a tuple with the number of rows and the number of columns.
30
Q

How can you check the types of data in a pd.DataFrame?

A

Check if the type of the data is what you expected with dtypes or info():

EXAMPLE:
dc.dtypes

NOTE:
  • object is for "Python object". It may be used for textual data
  • For the variables of interest, we may want to convert "YEAR" and "APPEARANCES" to int and "SEX" to category
  • If we also work on other variables, we may for example want to convert "ALIVE" to bool
  • You can also use info() to check the type information
31
Q

How can we check the amount of missing data in a pd.DataFrame?

A

Check the amount of missing data: if there are too many missing values, the dataset may not be useful.

We can check the number of missing values per column in Pandas with:

dc.isnull().sum()
32
Q

What are descriptive statistics and how can we generate them for each variable in Pandas?

A

A descriptive statistic quantitatively describes or summarises features from a collection of data.

We can use some simple descriptive statistics to get more of an idea about the data:
  • Central tendency: what is the "common" or "representative" value?
  • Data dispersion: the spread of the data, e.g. do we have a wide range of values?
  • Relationship between variables

We can use describe() to generate a selection of descriptive statistics on each variable (univariate):

EXAMPLE:
dc.describe(include='all')
33
Q

RECAP: What are the different types of attributes/variables?

A

  1. Quantitative (i.e. numerical) data, which can be either:
    a) Discrete: a finite number of values are possible in any bounded interval. Example: number of appearances of characters
    b) Continuous: an infinite number of values are possible in any bounded interval. Example: height of people
  2. Categorical (or qualitative) data:
    a) Ordinal: non-numerical but has a ranking. Example: level of Python of students
    b) Nominal: no inherent order among the values. Example: eye colours of the characters
34
Q

How can you attain the measures of central tendency in Pandas (mean, median, mode)?

A

Apart from getting the central tendency statistics from describe(), we can use methods like mean(), median() and mode(), similar to how it is done with np.ndarray.

NOTE: By default, the calculation is along axis 0 (i.e. aggregating each column), and missing data is ignored in the calculation.

EXAMPLE1: To attain the mean of all columns with numerical data:
dc.mean(numeric_only=True)

EXAMPLE2: To find the median of one column:
dc['APPEARANCES'].median()

EXAMPLE3: The mode of one column:
dc['APPEARANCES'].mode()
35
Q

How do the mean and median differ in terms of central tendency?

A

  • The mean is more sensitive to extreme values (outliers) than the median, or we say the median is more robust
  • With outliers, the median provides a better measure of the "central" or "representative" value than the mean
  • The difference between the mean and median also gives us some idea about the skewness of the data
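A quick numeric sketch of this point, using made-up salary figures:

```python
import pandas as pd

# Identical data except for one extreme value (numbers invented for illustration)
salaries = pd.Series([30, 32, 35, 38, 40])
with_outlier = pd.Series([30, 32, 35, 38, 400])

print(salaries.mean(), salaries.median())          # 35.0 35.0
print(with_outlier.mean(), with_outlier.median())  # 107.0 35.0 -- the mean moved, the median did not
```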
36
Q

What are the measures of dispersion for quantitative variables?

A

The dispersion of a sample of observations measures the variation of the data. Common examples of measures of dispersion are:

  • Range = maximum value - minimum value
  • Interquartile range (IQR): the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data, i.e. IQR = Q3 - Q1
  • Variance
  • Standard deviation: the square root of the variance
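The measures above can be computed on a small invented sample:

```python
import pandas as pd

# Small made-up sample, purely for illustration
x = pd.Series([2, 4, 4, 4, 6, 8, 8])

data_range = x.max() - x.min()              # maximum - minimum
iqr = x.quantile(0.75) - x.quantile(0.25)   # Q3 - Q1
variance = x.var()                          # sample variance (ddof=1 by default)
std = x.std()                               # square root of the variance

print(data_range, iqr)                  # 6 3.0
print(abs(std**2 - variance) < 1e-12)   # True: std is the square root of the variance
```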
37
Q

How can you attain measures of dispersion in Pandas?

A

You can get some values from describe() for measuring dispersion. Alternatively, you can use methods like max(), min(), quantile(), etc. for quantitative variables:

  • Range:
dc.APPEARANCES.max() - dc.APPEARANCES.min()

  • IQR:
dc.APPEARANCES.quantile(0.75) - dc.APPEARANCES.quantile(0.25)
38
Q

How do you attain the standard deviation in Pandas, and how can you find the proportion of data within 1 or 2 standard deviations of the mean?

A

Use .std() for the standard deviation. To calculate the proportion of data within 1 or 2 standard deviations of the mean:

EXAMPLE:
mu = dc.APPEARANCES.mean()
sd = dc.APPEARANCES.std()

# within 1 sd
((mu - sd <= dc.APPEARANCES) & (dc.APPEARANCES <= mu + sd)).mean()

# within 2 sd
((mu - 2*sd <= dc.APPEARANCES) & (dc.APPEARANCES <= mu + 2*sd)).mean()
39
Q

How can you get insights from categorical data?

A

We can use the number of distinct values and counts (or relative frequencies) to get some idea about the variability and distribution of categorical data.

  1. Find the number of distinct values with nunique():

dc['SEX'].nunique()

  2. Find the unique values themselves with unique():

dc['SEX'].unique()

OUT: array(['Male Characters', 'Female Characters', nan, 'Genderless Characters', 'Transgender Characters'], dtype=object)
40
Q

How can we count the frequency of each category?

A

To count the appearances of each category, we can use value_counts():

EXAMPLE:
dc['SEX'].value_counts()

To get a tabulation of the frequencies, we use to_frame():

EXAMPLE:
dc['SEX'].value_counts().to_frame()
41
Q

How do you calculate the percentage count for the frequencies?

A

To calculate percentage counts, use the additional argument normalize=True:

dc['SEX'].value_counts(normalize=True).to_frame().mul(100).round(1)
42
Q

What happens if you read a CSV file without additional arguments?

A

If we read the file using read_csv() without providing additional arguments:

aex = pd.read_csv('data/AEX.csv')

then the default index is used, the first line in the file is treated as the header, and the "date" column has the type object.
43
Q

How can you read a file via its URL?

A

url = 'https://www.hkex.com.hk/eng/dwrc/search/dwFullList.csv'
df = pd.read_csv(url, encoding='utf-16', sep='\t', skiprows=1, skipfooter=3, engine='python')
44
Q

What is data wrangling?

A

Data wrangling (or data cleaning, data munging) refers to a variety of processes designed to transform raw data into more readily used formats.

EXAMPLES:
  • Filtering: removing unnecessary or irrelevant data
  • Formatting the data
  • Handling extreme outliers, missing, duplicate or wrong values in the data
  • Merging multiple data sources

NOTE: How to perform data wrangling depends on the data you are working on and the goal you are trying to achieve.
45
Q

DC EXAMPLE: How do you load in the data set and view it?

A

import pandas as pd

dc = pd.read_csv('data/dc-wikia-data.csv')
dc.head()
46
Q

DC EXAMPLE: How can you view the data types?

A

dc.dtypes
47
Q

DC EXAMPLE: How can you find descriptive statistics for the dataset?

A

dc.describe(include='all')
48
Q

How can you discard rows or columns?

A

Discard rows or columns by using drop()
49
Q

DC EXAMPLE: How can you keep only the columns name, SEX, APPEARANCES, YEAR and ALIGN?

A

METHOD1: Drop the unwanted column(s):

dc.drop('page_id', axis='columns', inplace=True)
dc.head(2)

axis='columns' means we drop the column(s); inplace=True means we change the original data.

METHOD2: Alternatively, you can select the columns you want, BUT NOTE that reassignment is needed:

dc = dc[['name', 'SEX', 'APPEARANCES', 'YEAR', 'ALIGN']]
display(dc.head(3))
display(dc.tail(3))
50
Q

How can you filter data fulfilling certain conditions?

A

Use .between() or .isin(). Note that .between() includes the endpoints.

EXAMPLE1: Select data with YEAR between 2000 and 2010:

dc[dc['YEAR'].between(2000, 2010)].head()

EXAMPLE2: Select characters that are not male or female (note that characters with no gender information will also be included):

dc[~dc['SEX'].isin(['Male Characters', 'Female Characters'])].head()
51
Q

How can you rename columns?

A

Use .rename(). NOTE: we have name in lowercase but the other columns are in capital letters. We can unify this by renaming the name column using rename():

EXAMPLE1:
dc.rename(columns=str.upper).head()

NOTE: This uses the string function upper(), i.e. str.upper('name').

EXAMPLE2: Alternatively, we can provide columns={'name': 'NAME'}, which indicates we want to change the column name from 'name' to 'NAME':

dc.rename(columns={'name': 'NAME'}, inplace=True)
dc.head()
52
Q

How can you convert variables to the desired type?

A

Use astype():

EXAMPLE:
dc['SEX'] = dc['SEX'].astype('category')
dc['ALIGN'] = dc['ALIGN'].astype('category')

NOTE: We may not be able to convert columns to int, as non-finite values (NA or inf) cannot be converted to int (this raises an error).
53
Q

How do you sort the data by column or index?

A

SORT BY COLUMN:
dc.sort_values(by='APPEARANCES', ascending=False).head()

SORT BY INDEX:
cases.sort_index().head(3)
54
Q

How can you remove duplicated data?

A

Use drop_duplicates(). Sometimes we may have duplicate values in the data. For example, some students handed in the pre-course survey twice as they thought the survey was not submitted properly the first time.

EXAMPLE:
students.drop_duplicates(inplace=True)
55
Q

What is missing data?

A

Missing data occurs when no data value is stored for a variable in an observation. Missing data is a common problem and can arise in various ways:

  • Survey data:
    - Participants randomly miss some questions
    - A respondent chooses not to respond to questions like "Have you ever used generative AI tools for summative coursework?"
  • Study/experiment over time:
    - Participants drop out of the study in medical research
    - A new variable is collected partway through the data collection of a study
  • Others:
    - Corrupted results or measurements
    - Movie reviews: each user only explicitly expresses his or her preferences for a small subset of movies
56
Q

How is missing data represented in Pandas?

A

NaN is used to represent missing data
57
Q

How do you find missing data in Pandas?

A

isnull() and notnull() can be used to check whether the data has missing values. isnull() gives the value True where the corresponding data is missing.

NOTE: Often, by default, Pandas ignores the missing data for you when calculating statistics.
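A self-contained sketch of both checks on an invented Series, also showing how statistics skip NaN by default:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])   # made-up data with one missing value

print(s.isnull().tolist())    # [False, True, False] -- True where data is missing
print(s.notnull().tolist())   # [True, False, True]
print(s.isnull().sum())       # 1 missing value in total
print(s.mean())               # 2.0 -- the NaN is ignored in the calculation
```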
58
Q

What are simple ways to handle missing data?

A

  1. Drop the observations that have any missing values
  2. Fill the missing data with some substituted values
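Both strategies side by side on a toy DataFrame (values invented; the mean is used here only as one possible substitute):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [10.0, np.nan, 30.0]})   # one missing value

dropped = df.dropna()                    # 1. discard incomplete observations
filled = df.fillna(df['score'].mean())   # 2. substitute a value (here the mean, 20.0)

print(len(dropped))                # 2 rows remain
print(filled['score'].tolist())    # [10.0, 20.0, 30.0]
```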
59
Q

What is an example of a model that requires the given data to be free from missing data?

A

We can fit a linear regression model when there is no missing data, but not when there is missing data.
60
Q

How do you drop missing observations in Pandas?

A

We can use the method dropna() to drop NaN data. By default it keeps only the rows with all attributes present.

EXAMPLE (here we explicitly create a copy):
dc_no_na = dc.dropna().copy()
display(dc_no_na.head(3))
display(dc_no_na.tail(3))
61
Q

What should you be cautious of when using dropna()?

A

By default, only rows with all attributes present are kept. If there is a lot of missing data in some particular columns, many rows will be dropped.
62
Q

What argument can you include to drop only the rows where all values are missing?

A

You can use the additional argument how='all' in dropna():

import numpy as np

students = pd.DataFrame({'names': ['Harry', 'Ron', np.nan, 'Hermione'], 'python_level': ['High', 'Low', np.nan, np.nan]})
students.dropna(how='all')
63
Q

What are the implications of dropping observations?

A

  1. Can result in losing a lot of data
  2. May cause bias

For example, we have students' marks and we ask students to tell us the number of hours they spent revising the course materials. Assume:
  • Students who spent a low number of hours on revision are more likely not to answer the question
  • The number of hours and the mark are positively correlated

By discarding the observations where the number of hours is missing, the average mark from the data will be different.

See notes for EXAMPLES
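The bias can be demonstrated with a tiny invented marks/hours table in which exactly the low-hours students left the question blank:

```python
import numpy as np
import pandas as pd

# Hypothetical data: low-hours students did not report their hours,
# and hours are positively related to marks.
marks = pd.DataFrame({
    'mark':  [40, 45, 55, 70, 80, 90],
    'hours': [np.nan, np.nan, 5, 10, 15, 20],
})

true_mean = marks['mark'].mean()               # average over all students
dropped_mean = marks.dropna()['mark'].mean()   # average after discarding missing rows

print(round(true_mean, 2), dropped_mean)   # 63.33 73.75 -- dropping biased the mean upwards
```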
64
Q

Should you remove rows with missing data?

A

  • Sometimes we need to fit data into a model that does not allow missing data. Then removing missing data is the simplest way to allow us to fit the model
  • Consider whether the data is missing at random
  • Sometimes we may want to fill in the missing data instead. For example:
    - There is too much missing data, and removing those rows will leave too few data points
    - We know how we can fill the missing data, so it is better to fill it in than to remove it, e.g. missing UK coronavirus vaccine data in 2020 can be filled with 0
65
Q

What is imputation?

A

Imputation is the process of replacing missing data with substituted values. We will talk about two ways in this course:

  • Impute the mean/median (if quantitative) or the most common class (if nominal) for all missing values
  • Fill the missing data using data points before and after the missing data point (e.g. for time series)
66
Q

How can you fill in missing data?

A

With Pandas, you can fill the missing data using fillna() and provide the values/methods to fill the missing data:

EXAMPLE:
dc_fill_na = pd.read_csv('data/dc-wikia-data.csv', usecols=['name', 'SEX', 'APPEARANCES', 'YEAR', 'ALIGN'])
dc_fill_na.isnull().sum()
67
Q

How can you replace missing data by the mean/median/mode?

A

# assume 1 appearance if missing
dc_fill_na['APPEARANCES'] = dc_fill_na['APPEARANCES'].fillna(1)

# fill with the mean for quantitative data
dc_fill_na['YEAR'] = dc_fill_na['YEAR'].fillna(round(dc_fill_na['YEAR'].mean()))

SEE NOTES for filling with the mode, or with a new category representing missing data, for categorical data.
68
Q

What are some of the implications of imputation with the mean/median/mode?

A

  1. May cause bias
  2. Filling with the mode may not be appropriate
  3. We are likely to "overestimate" or "underestimate"
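The "underestimate" point can be seen with mean imputation on an invented Series: the filled-in points sit exactly at the mean, so the measured spread shrinks:

```python
import numpy as np
import pandas as pd

x = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])   # made-up data
imputed = x.fillna(x.mean())                        # both NaNs become 30.0

print(x.std())         # 20.0, from the three observed values
print(imputed.std())   # ~14.14: the imputed points add no spread
```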
69
Q

What is time series data?

A

A time series is a series of data points indexed in time order, at successive equally spaced points in time (i.e. it can be visualised via a line plot)
70
Q

Filling missing data: how can we forward fill missing data and what does it mean?

A

We can use ffill() to 'forward fill' with the previous value (useful with time series data if there is a missing value). It is a sensible way to fill the data here, as the data is missing because the market was not open on those days.

EXAMPLE:
ffill_price = price.ffill()
pd.DataFrame({'original': price, 'ffill': ffill_price}).head(7)
71
Q

Filling missing data: how can we linearly interpolate, and what does it mean?

A

Another way to fill missing data is linear interpolation, using interpolate(method='linear'). Linear interpolation is a simple way to fill missing data if there are some (simple) patterns in the data.

EXAMPLE:
linear_fill_price = price.interpolate(method='linear')
pd.DataFrame({'original': price, 'linear_fill': linear_fill_price}).head(7)
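Since the price data itself is not shown, here is a self-contained comparison of the two fills on an invented Series with a two-day gap:

```python
import numpy as np
import pandas as pd

price = pd.Series([100.0, np.nan, np.nan, 106.0])   # toy price series

print(price.ffill().tolist())
# [100.0, 100.0, 100.0, 106.0] -- each gap repeats the last known value

print(price.interpolate(method='linear').tolist())
# [100.0, 102.0, 104.0, 106.0] -- the gap is filled along a straight line
```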