Data analysis code Flashcards
What does axis=1 do in a DataFrame operation?
It applies the operation across the columns of each row (horizontally), giving one result per row.
What does axis=0 do in a DataFrame operation?
It applies the operation down the rows of each column (vertically), giving one result per column.
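The two axis cards can be checked on a tiny frame. This is a minimal sketch; the DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical 2x3 frame used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})

col_totals = df.sum(axis=0)  # collapses the rows: one total per column (a=3, b=30, c=300)
row_totals = df.sum(axis=1)  # collapses the columns: one total per row (111 and 222)

print(col_totals)
print(row_totals)
```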
What is the purpose of the Shapiro-Wilk test?
It tests whether the data is normally distributed.
What is the interpretation of a p-value greater than 0.05 in the Shapiro-Wilk test?
There is no significant evidence against normality, so we fail to reject the null hypothesis and treat the data as approximately normally distributed.
What does df.shape return?
A tuple with the number of rows and columns: (number of rows, number of columns).
What statistics does df.describe() provide?
It provides summary statistics: Count, Mean, Standard Deviation, Minimum, Quartiles (25%, 50%, 75%), and Maximum.
What does df.mean() do in a DataFrame?
It computes the mean of each column.
How do you access a specific column in a DataFrame?
Use the column name: df['column_name'].
What does the code df['column_name'].max() do?
It returns the maximum value in the specified column.
What assumptions does the paired t-test make?
It assumes a continuous dependent variable, paired observations (each pair independent of the other pairs), no major outliers in the differences, and approximately normally distributed differences between the pairs.
How do you read a CSV file into a DataFrame in Python?
df = pd.read_csv('../folder/name.filetype')
How do you read an Excel file into a DataFrame in Python?
df = pd.read_excel('file_path')
How do you read a tab-separated file into a DataFrame?
Use the sep='\t' parameter in pd.read_csv().
How do you handle missing data when reading a file?
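Use the na_values parameter to name the strings that should be read as NaN, e.g. na_values='marker', where 'marker' is whatever placeholder the file uses for missing values. Empty fields are already read as NaN by default.
How do you specify the data type for integer columns in a DataFrame?
Use dtype=pd.Int64Dtype() (or the string 'Int64'). This nullable integer type keeps whole numbers as integers even when the column contains missing values, instead of converting them to floats.
How do you rename columns when reading a file into a DataFrame?
Use names=['column1', 'column2', …] when reading the file. Add header=None if the file has no header row, or header=0 to replace an existing header row.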
How do you skip rows from the top when reading a file?
Use the skiprows=… parameter.
How do you skip rows from the bottom when reading a file?
Use the skipfooter=… parameter together with engine='python' (the default C parser does not support skipfooter).
How do you set a specific column as the index when reading a file?
Use index_col=1 to set the second column as the index
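Several of the reading options above can be combined in one call. The sketch below reads from an in-memory string rather than a file; the text, the "-" missing-value marker, and the column names are all invented for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated text: a title line, a header row, two data rows, a footer line.
raw = "site report\nid\tdepth\tph\nA\t10\t6.5\nB\t-\t7.1\ntotal rows: 2"

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",          # tab-separated values
    skiprows=1,        # skip the title line at the top
    skipfooter=1,      # skip the summary line at the bottom
    engine="python",   # skipfooter needs the Python parser
    na_values="-",     # treat "-" as missing (assumed marker)
    index_col=0,       # first column becomes the index
)

print(df)
```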
How do you update a specific value in a DataFrame?
Use df.at['row_name', 'column_name'] = new_value.
What does df.info() display?
It displays the number of entries, index range, columns, and non-null count per column.
What does the data type float64 represent?
It represents decimal numbers.
What does the data type int64 represent?
It represents whole numbers (integers).
What does the data type object represent?
It usually represents text (strings); columns holding mixed types are also stored as object.
How do you display the last 5 rows of a DataFrame?
Use df.tail().
How do you sort the values in a DataFrame by a specific column?
Use df.sort_values(by=['column_name']). Rows are sorted in ascending order by default; add ascending=False for descending order.
How do you drop rows with missing values from a DataFrame?
Use df.dropna(inplace=True).
What is the effect of using inplace=True in df.dropna()?
It modifies the original DataFrame.
What is the effect of using inplace=False in df.dropna()?
It creates a new DataFrame and leaves the original unchanged.
How do you drop columns with missing values from a DataFrame?
Use df.dropna(axis='columns', inplace=True).
How do you drop rows that have fewer than 2 non-missing values?
Use df.dropna(thresh=2, inplace=True).
How do you drop rows based on missing values in specific columns?
Use df.dropna(subset=['column1', 'column2'], inplace=True).
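The dropna variants above can be compared side by side. A minimal sketch on a made-up frame, using the default inplace=False so that the original stays intact:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in different places.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": [1.0, 2.0, np.nan, np.nan],
})

complete_rows = df.dropna()                 # rows with no NaN at all
at_least_two = df.dropna(thresh=2)          # rows with at least 2 non-missing values
x_present = df.dropna(subset=["x"])         # only column "x" is checked
complete_cols = df.dropna(axis="columns")   # columns with no NaN at all

print(len(df), len(complete_rows), len(at_least_two), len(x_present))
```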
How do you save a DataFrame as a CSV file?
Use df.to_csv('data_name.filetype', index=False).
How do you save a DataFrame as an Excel file?
Use df.to_excel('data_name.filetype', index=False).
How do you exclude the index when saving a DataFrame to a file?
Use index=False in the to_csv() or to_excel() methods.
How do you add a new column to a DataFrame?
df['Column name'] = ['column contents', 'next value', …]
How do you forward-fill NaN values in a column?
df['Column name'] = df['Column name'].ffill() (the older fillna(method='ffill') form is deprecated in recent pandas versions)
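Forward-filling copies the last seen value down into the gaps that follow it. A small sketch with an invented column:

```python
import numpy as np
import pandas as pd

# Hypothetical column where a label is written once and left blank below it.
df = pd.DataFrame({"plot": ["A", np.nan, np.nan, "B", np.nan]})

df["plot"] = df["plot"].ffill()  # each NaN takes the nearest value above it

print(df["plot"].tolist())
```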
How do you remove a single column from a DataFrame?
df = df.drop(columns='column name')
How do you remove multiple columns from a DataFrame?
df = df.drop(columns=['name 1', 'name 2'])
How do you remove a specific row by its index?
df = df.drop(row_number)
How do you rearrange columns into a custom order?
cols = df.columns.tolist()
cols_new = [cols[1], cols[3], cols[2], cols[0]]
df = df[cols_new]
How do you sort the columns alphabetically?
cols_new = sorted(cols) then df = df[cols_new]
How do you sort the columns in reverse alphabetical order?
cols_new = sorted(cols, reverse=True) then df = df[cols_new]
How do you sort values by multiple columns in descending order?
df = df.sort_values(by=['col1', 'col2'], ascending=False)
How do you create a new column by multiplying an existing column by a scalar?
df['new column'] = df['old column'] * 50
How do you apply a logarithmic transformation to a column?
df['new column'] = np.log(df['old column'])
How do you apply a square root transformation to a column?
df['new column'] = np.sqrt(df['old column'])
How do you concatenate two string columns in a DataFrame?
df['Soil_Drainage'] = df['Soil'] + '_' + df['Drainage']
How do you convert numerical data to strings before concatenating?
df['new_column'] = df['text col'] + '_' + df['numeric col'].astype(str)
How do you split a column into multiple columns based on a delimiter?
df[['new col1', 'new col2']] = df['old_column'].str.split('_', expand=True)
How do you limit the number of splits to just once when splitting a column?
df[['new col1', 'new col2']] = df['old_column'].str.split('_', n=1, expand=True)
How do you use a specific substring for splitting a column?
df[['new col1', 'new col2']] = df['old_column'].str.split('_20', expand=True)
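Concatenating and splitting are inverse operations, which a round trip makes visible. A sketch with invented column names; note that the split pieces come back as strings, not numbers.

```python
import pandas as pd

# Hypothetical columns: one text, one numeric.
df = pd.DataFrame({"site": ["north", "south"], "year": [2019, 2020]})

# The numeric column must be cast to str before it can be joined to text.
df["site_year"] = df["site"] + "_" + df["year"].astype(str)

# Split back into two columns at the first underscore only.
df[["site_again", "year_text"]] = df["site_year"].str.split("_", n=1, expand=True)

print(df)
```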
How do you remove duplicate rows from a DataFrame?
data = data.drop_duplicates()
How do you reset the index after dropping duplicates from a DataFrame?
data = data.drop_duplicates().reset_index(drop=True)
How do you transpose a DataFrame to swap rows and columns?
df = df.T (This makes the index the column headers and the column headers the index).
How do you reset the index and move it to a column in a DataFrame?
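Use df.reset_index(inplace=True). The old index is moved into a regular column (named 'index' if the index had no name) and replaced by a default 0, 1, 2, … index. To discard the old index instead of keeping it as a column, use df.reset_index(drop=True, inplace=True).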
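The difference between keeping and discarding the old index is easiest to see on a small frame. A minimal sketch with invented labels:

```python
import pandas as pd

# Hypothetical frame with a labelled index.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

kept = df.reset_index()              # old index becomes a column named "index"
dropped = df.reset_index(drop=True)  # old index is thrown away

print(kept.columns.tolist())
print(dropped.columns.tolist())
```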
How do you select a range of rows using iloc in a DataFrame?
df.iloc[1:2] selects rows from position 1 up to but not including position 2.
How do you check the data types of each column in a DataFrame?
Use df.dtypes to check the data types of each column.
How do you change the data type of a column in a DataFrame?
Use df['column_name'] = df['column_name'].astype(new_type), e.g. astype(float) or astype(str). astype() returns a new Series, so assign the result back to the column.
What is the purpose of melting a DataFrame, and how do you do it?
Melting transforms data from wide form into long form, which is useful for certain types of analysis and visualization. Use pd.melt(dataframe, id_vars, value_vars, var_name='new_var_name', value_name='new_value_name').
What parameters do you need to specify when melting a DataFrame?
id_vars: Columns to keep as is (typically identifiers).
value_vars: Columns containing the values to melt.
var_name: The name for the new variable column.
value_name: The name for the new value column.
What is the benefit of melting a DataFrame?
Melting makes the dataset more suitable for graphing and further analysis, especially for visualizing or performing operations on variables in a consistent, long format.
What is casting (pivoting) in DataFrame manipulation, and how do you do it?
Pivoting (casting) reverses melting, converting a long-form DataFrame into a wide-form DataFrame by spreading values into new columns. Use pd.pivot(dataframe, columns='column header', index=['index header', …], values='data column').
What parameters do you need to specify when pivoting a DataFrame?
columns: The column whose unique values become the new column headers.
index: The column(s) that become the index of the DataFrame.
values: The column containing the data to populate the new table.
How do you reset the index after pivoting a DataFrame?
Use df_cast.reset_index(inplace=True) to convert any multi-level index back into regular columns.
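Melting and pivoting can be tried as a round trip from wide to long and back. A sketch on invented measurements; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per plant, one column per week.
wide = pd.DataFrame({
    "plant": ["p1", "p2"],
    "week1": [2.0, 3.0],
    "week2": [4.0, 5.0],
})

# Wide -> long: one row per (plant, week) combination.
long = pd.melt(wide, id_vars="plant", value_vars=["week1", "week2"],
               var_name="week", value_name="height")

# Long -> wide again, then turn the index back into a regular column.
back = pd.pivot(long, index="plant", columns="week", values="height")
back = back.reset_index()

print(long)
print(back)
```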
How do you create a relational plot to show the relationship between two numerical variables?
Use sns.relplot(x='sepal_length', y='petal_length', data=iris).
How do you turn a relational plot into a line plot with a confidence interval?
Add kind='line' to the sns.relplot function. The shaded area represents the 95% confidence interval.
How do you differentiate points in a relational plot by a categorical variable using color?
Add hue='column_name' to the plot, where 'column_name' is the categorical variable.
What does sns.lmplot do?
It creates a scatter plot with a linear regression line and a shaded 95% confidence interval.
How do you create a composite plot with both categorical and numerical variables?
Use sns.jointplot(x='sepal_length', y='petal_length', data=iris) or sns.pairplot(data=iris).
How do you create a heatmap in Seaborn?
Use sns.heatmap(data, annot=True, fmt=".2f", cmap="viridis", cbar=True).
How do you create multiple plots side by side?
Use fig, axes = plt.subplots(1, 2, figsize=(10, 5)) for two plots in one row
How do you create a distribution plot (histogram) for numerical variables?
Use sns.displot(x='sepal_length', data=iris). Add hue or col='column_name' for splitting based on a categorical variable.
How do you explore the relationship between a categorical and a numerical variable using a categorical plot?
Use sns.catplot(x='Species', y='petal_length', data=iris). Add kind='swarm' or kind='strip' (the default) for different plot styles.
How do you customize violin plots to display both categories of a variable on the same plot?
Add split=True to sns.violinplot, together with a hue variable that has two categories, to draw one half-violin per category on the same plot.
How do you set different Seaborn plot styles?
Use sns.set_style('style'). Options:
whitegrid: White background with grid
darkgrid: Grey background with grid
dark: Grey background without grid
white: White background without grid
How do you suppress the top text on a Seaborn plot?
Always add a semicolon ; at the end of the Seaborn plotting code to suppress text like <seaborn.axisgrid.JointGrid at 0x7fa572bf5b50>.
How do you set the context for a Seaborn plot?
Use sns.set_context('context'). Options:
paper: Smaller text for publication
notebook: Default for notebooks
talk: Larger for presentations
poster: Largest for posters
How do you view and set color palettes in Seaborn?
To view: sns.color_palette('color name')
To set: sns.set_palette('color name') (e.g., sns.set_palette('tab10'), which matches Matplotlib's default color cycle)
How do you reset Seaborn’s default settings?
Use sns.reset_defaults() to reset color and figure size to Seaborn’s default settings
How do you control the size of figure-level plots (e.g., relplot, displot)?
Use sns.relplot(x='', y='', data=…, hue='', height=6, aspect=1.5).
height: Height of each facet in inches
aspect: Aspect ratio of each facet (width = height * aspect)
How do you control the size of axes-level plots (e.g., scatterplot, boxplot)?
Use plt.figure(figsize=(width, height)). Example: plt.figure(figsize=(9, 6)) for 9x6 inches.
How do you customize titles and axis labels for multiple plots in Seaborn?
Title: g.fig.suptitle('Title', fontsize=…, y=…)
Axis labels: g.set_axis_labels('x label', 'y label', fontsize=…)
Set y-axis limits: g.set(ylim=(0, 8))
How do you customize individual plots and legends in Seaborn and Matplotlib?
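Title: plt.title('Title', fontsize=…)
Remove legend: pass legend=False to the Seaborn plotting call where it is supported, or remove an existing legend with plt.gca().get_legend().remove()
Axis labels: plt.xlabel('x label', fontsize=…), plt.ylabel('y label', fontsize=…)
Customize legend: plt.legend(loc='', title='', frameon=False)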
How do you save a plot as an image?
Use plt.savefig('filename.png') to save the plot as an image (you can specify other formats like .jpg, .svg, etc.).
What are the four types of joins in Pandas?
Outer Join: Combines all rows from both dataframes, including non-overlapping rows.
Inner Join: Includes only rows common to both dataframes.
Left Join: Includes all rows from the left dataframe and matching rows from the right.
Right Join: Includes all rows from the right dataframe and matching rows from the left.
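The four join types can be compared by counting the rows each one returns. A sketch with two invented frames that overlap on the keys "b" and "c" only:

```python
import pandas as pd

# Hypothetical frames sharing a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Number of rows produced by each join type.
sizes = {
    how: len(left.merge(right, how=how, on="key"))
    for how in ["inner", "outer", "left", "right"]
}

print(sizes)  # inner: 2, outer: 4, left: 3, right: 3
```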
How do you combine multiple merge() commands in a single line?
df = dataframe1.merge(dataframe2).merge(dataframe3)
How do you specify custom keys for merging dataframes in Pandas?
df = df1.merge(df2, on='column_name').merge(df3, left_on='left_column', right_on='df3_column') (in the second merge, left_on refers to the result of the first merge and right_on to df3)
What does pd.concat() do, and what are its key parameters?
Combines multiple datasets into one structure.
Key Parameters:
axis: 0 (vertical) or 1 (horizontal).
join: Type of join (default = outer).
keys: Labels to identify data sources.
What is the syntax for a vertical concatenation of dataframes?
df = pd.concat([df1, df2], axis=0)
How do you concatenate dataframes horizontally with an inner join?
df = pd.concat([df1.set_index('col_name'), df2.set_index('col_name')], join='inner', axis=1)
How do you merge two dataframes on a common column?
df = dataframe1.merge(dataframe2, how='join_type', on='common_column')
Join types: inner, outer, left, right.
How do you merge dataframes with custom join columns?
df = df1.merge(df2, left_on='col_df1', right_on='col_df2', suffixes=['_df1', '_df2'])
What happens when non-key columns have the same name in both dataframes during merging?
Pandas appends _x (from dataframe1) and _y (from dataframe2) to tell them apart. Use suffixes to customize these labels.
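The suffix behaviour shows up whenever both frames carry a non-key column with the same name. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical frames that both contain a "score" column besides the key.
df1 = pd.DataFrame({"id": [1, 2], "score": [5, 6]})
df2 = pd.DataFrame({"id": [1, 2], "score": [7, 8]})

default = df1.merge(df2, on="id")                              # score_x, score_y
custom = df1.merge(df2, on="id", suffixes=("_df1", "_df2"))    # score_df1, score_df2

print(default.columns.tolist())
print(custom.columns.tolist())
```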
What is the difference between merge() and join() in Pandas?
merge(): Combines dataframes based on common columns.
join(): Combines dataframes on their indexes.
What is the default behavior of pd.concat() when the axis parameter is not specified?
By default, pd.concat() stacks dataframes vertically (axis=0).
How do you add labels to identify the source of data in concatenation?
Use the keys parameter:
df = pd.concat([df1, df2], keys=['data1', 'data2'], axis=0)
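With keys, the source label becomes an outer index level, so each source can be pulled back out by its label. A sketch with two invented frames:

```python
import pandas as pd

# Hypothetical frames to stack.
df1 = pd.DataFrame({"v": [1, 2]})
df2 = pd.DataFrame({"v": [3]})

stacked = pd.concat([df1, df2], keys=["data1", "data2"], axis=0)

# The outer index level holds the source label, the inner level the original index.
print(stacked)
print(stacked.loc["data2"])  # only the rows that came from df2
```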
What is the syntax for an outer merge?
df = df1.merge(df2, how='outer', on='common_column')
How do you combine dataframes using join()?
df = dataframe1.join(dataframe2, how='join_type', lsuffix='_left', rsuffix='_right')
Default join type: Left join.
How do you merge dataframes when the key columns have different names?
Use left_on and right_on parameters:
df = df1.merge(df2, left_on='col1_df1', right_on='col2_df2')
How do you differentiate overlapping columns from merged dataframes?
Use the suffixes parameter:
suffixes=['_df1', '_df2']
What happens when concatenating dataframes with different indexes or columns?
Outer join (default): Includes all indexes or columns.
Inner join: Keeps only matching indexes or columns.
How do you align dataframes on a shared key column for horizontal concatenation?
Set that column as the index with .set_index() before concatenation:
df = pd.concat([df1.set_index('col'), df2.set_index('col')], axis=1)
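Horizontal concatenation lines rows up by index label, which is why the key column is moved into the index first. A sketch with invented frames, comparing the default outer join with an inner join:

```python
import pandas as pd

# Hypothetical frames sharing an "id" column; only "b" and "c" appear in both.
df1 = pd.DataFrame({"id": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"id": ["b", "c", "d"], "y": [20, 30, 40]})

parts = [df1.set_index("id"), df2.set_index("id")]

outer = pd.concat(parts, axis=1)                # all ids, NaN where a frame has no row
inner = pd.concat(parts, axis=1, join="inner")  # only ids present in both frames

print(outer)
print(inner)
```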
What is a key advantage of chaining merge() commands?
Efficiency and clarity when combining multiple datasets in a single line:
df = df1.merge(df2).merge(df3)
What is the difference between axis=0 and axis=1 in concatenation?
axis=0: Stacks dataframes vertically (rows).
axis=1: Stacks dataframes horizontally (columns).
What does the len() function do in Python when applied to a DataFrame or list?
It returns the number of rows (for a DataFrame) or items (for a list). Every row or item is counted, including the one at index 0; the header row of a DataFrame is not counted.
How do you access a column’s values in a DataFrame by its name?
Use df['column_name'].
How can you isolate specific columns in a DataFrame?
Use double square brackets, e.g., df[['column1', 'column2']]. The columns will appear in the order specified.
How do you create a subset of rows from rows 1 to 4 using indexing?
Use df[1:5] (row 5 is excluded).
How would you display the first 7 rows of a DataFrame?
Use df[:7].
How would you display only the last row of a DataFrame?
Use df[-1:].
How do you access a specific value in a DataFrame using iloc?
Use integer positions, e.g., df.iloc[2, 3] for the value in the third row and fourth column (positions start at 0).
How do you select all rows but only the last column using iloc?
Use df.iloc[:, -1].
How does loc differ from iloc?
loc uses labels (row/column names), while iloc uses integer positions.
How do you retrieve a specific value at row 2 and column ‘field’ using loc?
Use df.loc[2, 'field'].
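The difference between loc and iloc only becomes visible when index labels and positions disagree. A sketch using a deliberately shuffled, invented index:

```python
import pandas as pd

# Hypothetical frame whose index labels (2, 0, 1) do not match row positions (0, 1, 2).
df = pd.DataFrame({"field": ["x", "y", "z"], "n": [5, 6, 7]}, index=[2, 0, 1])

by_label = df.loc[2, "field"]      # row labelled 2, which is the first row
by_position = df.iloc[2, 0]        # third row, first column
last_column = df.iloc[:, -1]       # all rows, last column

print(by_label, by_position, last_column.tolist())
```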
How can you filter rows based on a column value being greater than 10?
Use df[df['column_name'] > 10].
How do you extract rows with specific values in a column using .isin()?
Use df[df['column_name'].isin(['value1', 'value2'])].
How do you filter rows based on a string condition using .query()?
Use syntax like df.query('Column == "Value"').
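The three filtering styles above can be run against the same frame. A minimal sketch; the soil and depth values are invented.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "soil": ["clay", "sand", "loam", "clay"],
    "depth": [5, 12, 20, 15],
})

deep = df[df["depth"] > 10]                      # boolean mask on a numeric column
picked = df[df["soil"].isin(["clay", "loam"])]   # membership in a list of values
clay = df.query('soil == "clay"')                # string condition via query()

print(deep, picked, clay, sep="\n\n")
```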
What is the purpose of the groupby function in Pandas?
It splits a DataFrame into groups based on column(s) and performs operations on each group (e.g., grouping rows by soil types or drainage levels).
What are some common aggregation functions used with groupby?
mean: Calculates the average for each group.
max: Finds the maximum value for each group.
min: Finds the minimum value for each group.
sum: Adds up values for each group.
How do you avoid warnings when using groupby on columns with mixed data types?
Pass numeric_only=True to the aggregation, e.g. df.groupby('col_name').mean(numeric_only=True), to limit it to numeric columns. Recent pandas versions raise an error rather than a warning when non-numeric columns cannot be aggregated.
How can you calculate the mean of grouped data using groupby?
Syntax: df.groupby(['set']).mean()
Example: Groups by the "set" column and calculates the mean, producing a DataFrame with one row each for "control" and "experiment."
How do you count the occurrences of unique values in a column using groupby?
Use df.groupby('col_name').size() to get a list of unique values and their counts.
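Grouped means and counts can be sketched on a frame with one text column and one numeric column. The data are invented; "set" follows the control/experiment example above.

```python
import pandas as pd

# Hypothetical data with a grouping column, a text column, and a numeric column.
df = pd.DataFrame({
    "set": ["control", "control", "experiment", "experiment"],
    "note": ["a", "b", "c", "d"],
    "height": [1.0, 3.0, 5.0, 7.0],
})

means = df.groupby("set").mean(numeric_only=True)  # the text column "note" is skipped
counts = df.groupby("set").size()                  # number of rows per group

print(means)
print(counts)
```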
What is the purpose of splitting columns in a DataFrame?
To split values in a column into multiple new columns based on a delimiter or character.
What is the syntax for splitting a column into multiple columns?
df[['new_col1', 'new_col2']] = df['original_col'].str.split('delimiter', n=number_of_splits, expand=True)
delimiter: Character where the split occurs.
n: Maximum number of splits to perform.
expand=True: Ensures output is split into separate columns.
How do you split the string "coding" at the letter d?
df[['col1', 'col2']] = df['coding'].str.split('d', n=1, expand=True)
Result for the value "coding" (the delimiter itself is removed):
col1: "co"
col2: "ing"
What are two key notes about splitting columns?
Always use expand=True to create multiple columns.
Use the n parameter to control how many splits occur if the delimiter appears multiple times.
What is the purpose of increasing sample size in correlation analysis?
Increasing the sample size improves the precision of the correlation estimate (a narrower confidence interval), increases the power to detect a real correlation, and makes the results more reliable.
How do you calculate the Pearson correlation coefficient in Python?
from scipy.stats import pearsonr
result = pearsonr(df['col1'], df['col2'])
print(f'r = {result.statistic:.2f}')
What does a Pearson correlation coefficient r value of 0.46 and a p-value of 0.03 indicate?
r = 0.46 indicates a moderate positive correlation in the sample. The p-value of 0.03 means that, if there were no real correlation (null hypothesis), a correlation at least this strong would be observed in about 3% of samples. Since the p-value is low (< 0.05), we may reject the null hypothesis and consider the correlation statistically significant.
What is the difference between a statistic and a parameter?
Statistic: A numerical property of a sample (e.g., sample correlation coefficient r).
Parameter: A numerical property of the population (e.g., population correlation coefficient ρ).
What is the purpose of the p-value in hypothesis testing?
The p-value helps assess the significance of the observed correlation. It represents the probability of observing a test statistic at least as extreme as the observed one, assuming the null hypothesis is true.
How do you interpret a p-value in correlation analysis?
If the p-value < 0.05, we reject the null hypothesis (no correlation) and consider the correlation significant.
If the p-value > 0.05, the correlation might be due to random chance and we fail to reject the null hypothesis.
How can you calculate a 95% confidence interval for a Pearson correlation?
ci = result.confidence_interval()
ci.low, ci.high
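The correlation cards above can be run end to end on a small sample. This sketch assumes SciPy 1.9 or later, where the pearsonr result has the statistic, pvalue, and confidence_interval() members. The numbers are invented.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical paired measurements, invented for illustration.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "col2": [2.1, 1.9, 3.5, 3.0, 5.2, 4.8, 6.9, 7.5],
})

result = pearsonr(df["col1"], df["col2"])
ci = result.confidence_interval(confidence_level=0.95)  # 0.95 is also the default

print(f"r = {result.statistic:.2f}, p = {result.pvalue:.3f}")
print(f"95% CI: ({ci.low:.2f}, {ci.high:.2f})")
```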
What is the purpose of the Shapiro-Wilk test?
The Shapiro-Wilk test assesses if the data follows a normal distribution. The null hypothesis is that the data comes from a normally distributed population.
How do you apply the Shapiro-Wilk test in Python?
from scipy.stats import shapiro
result = shapiro(df['col_name'])
How do you handle non-normally distributed data in Python?
To transform non-normally distributed (e.g., right-skewed) data, apply a log transformation; all values must be positive:
df['log-variable'] = np.log10(df['col_name'])
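The test-then-transform workflow can be sketched on simulated data. The sample below is drawn from a log-normal distribution with a fixed seed, purely as an assumed example of right-skewed data; real data may not become normal after a log transformation.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Simulated right-skewed (log-normal) sample; the seed only makes the run repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"col_name": rng.lognormal(mean=0.0, sigma=1.0, size=200)})

raw = shapiro(df["col_name"])                   # test the untransformed values
df["log_variable"] = np.log10(df["col_name"])   # values must be positive
logged = shapiro(df["log_variable"])            # test again after the transformation

print(f"raw:    W = {raw.statistic:.3f}, p = {raw.pvalue:.3g}")
print(f"logged: W = {logged.statistic:.3f}, p = {logged.pvalue:.3g}")
```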
What is the purpose of a t-test for independent samples?
A t-test for independent samples tests whether the means of two independent samples differ.
How do you perform a t-test for independent samples in Python?
result = ttest_ind(subset['col1'], subset['col2'])
How do you modify the y-axis scale for large values in a scatter plot?
ax.set_yscale('log')
What is the probability of observing a correlation of at least 0.46, assuming no real correlation exists?
The p-value gives the probability of observing a correlation at least as strong as 0.46 by chance alone, assuming no real correlation exists. If the p-value is small (e.g., < 0.05), it suggests that the observed correlation is statistically significant.
What is the purpose of a swarm plot?
Swarm plots show every observation of a numerical variable as a point, grouped by category, allowing us to visually compare the distributions (and means) between categories.
What is the best use of a scatterplot (lmplot)?
sns.lmplot is ideal for visualizing the relationship between two numerical variables: it combines a scatter plot with a fitted regression line.
What is an explanatory variable in regression?
The explanatory variable is the independent variable (cause). It is expected to explain the variation in the response variable.
What is a response variable in regression?
The response variable is the dependent variable (effect). It is expected to change in response to the explanatory variable.
What is the model formula for regression?
The relationship between the response and explanatory variables is expressed as:
response_variable ~ explanatory_variable
What does the t-value represent in regression?
The t-value measures how many standard errors the coefficient is away from zero.
Example: t-value = 0.79.
What does the p-value indicate in regression?
The p-value is the probability of obtaining a t-value at least as extreme as the observed one if the true coefficient in the population were zero.
Example: p-value = 0.433. Since p > 0.05, we fail to reject the null hypothesis (no significant effect).
What is a 95% Confidence Interval (CI) in regression?
The 95% CI is a range of plausible values for the true population parameter; intervals built this way capture the true value in 95% of repeated samples.
Example: CI = (-0.25, 0.109). Since the CI contains 0, we cannot reject the null hypothesis.
How do you interpret the relationship between the y-axis and x-axis in linear models?
The mean of the response variable (y-axis) for a given value of x is given by the equation:
Mean of y = intercept + slope * x
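The intercept and slope in this equation can be estimated with a least-squares line. The sketch below uses numpy.polyfit as a stand-in for a formula-based regression (response ~ explanatory); the data are invented and lie exactly on y = 2 + 3x so the result is easy to check.

```python
import numpy as np

# Hypothetical explanatory values and a response that follows y = 2 + 3x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# A degree-1 fit returns the coefficients highest power first: [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Mean of y = intercept + slope * x, evaluated at a new x value.
predicted_mean = intercept + slope * 5.0

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}, mean of y at x=5: {predicted_mean:.2f}")
```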
What must be mentioned in the conclusion when interpreting relationships?
It is important to mention whether the relationship is linear or logarithmic and state if there is a statistically significant relationship.