Data analysis code Flashcards

1
Q

What does axis=1 do in a DataFrame operation?

A

It moves along each row (horizontally, across the columns), so an aggregation returns one result per row.

2
Q

What does axis=0 do in a DataFrame operation?

A

It moves down each column (vertically, across the rows), so an aggregation returns one result per column.
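
A minimal sketch of the two axis values, using a made-up two-column DataFrame (the data and variable names are illustrative assumptions, not part of the original cards):

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = df.sum(axis=0)  # works down each column: one total per column
row_totals = df.sum(axis=1)  # works along each row: one total per row

print(col_totals)
print(row_totals)
```

With axis=0 the result is indexed by column name; with axis=1 it is indexed by row label.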

3
Q

What is the purpose of the Shapiro-Wilk test?

A

It tests whether the data is normally distributed.

4
Q

What is the interpretation of a p-value greater than 0.05 in the Shapiro-Wilk test?

A

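We fail to reject the null hypothesis: there is no significant evidence that the data deviate from a normal distribution. This is consistent with normality but does not prove it.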

5
Q

What does df.shape return?

A

A tuple with the number of rows and columns: (number of rows, number of columns).

6
Q

What statistics does df.describe() provide?

A

It provides summary statistics: Count, Mean, Standard Deviation, Minimum, Quartiles (25%, 50%, 75%), and Maximum.

7
Q

What does df.mean() do in a DataFrame?

A

It computes the mean of each column.

8
Q

How do you access a specific column in a DataFrame?

A

Use the column name: df['column_name'].

9
Q

What does the code df['column_name'].max() do?

A

It returns the maximum value in the specified column.

10
Q

What assumptions does the paired t-test make?

A

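It assumes a continuous dependent variable, pairs of observations that are independent of one another, no major outliers in the paired differences, and approximately normally distributed differences between the paired measurements.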

11
Q

How do you read a CSV file into a DataFrame in Python?

A

df = pd.read_csv('../folder/name.filetype')

12
Q

How do you read an Excel file into a DataFrame in Python?

A

df = pd.read_excel('file_path')

13
Q

How do you read a tab-separated file into a DataFrame?

A

Use the sep='\t' parameter in pd.read_csv().

14
Q

How do you handle missing data when reading a file?

A

Use na_values='' to replace '' with NaN.

15
Q

How do you specify the data type for integer columns in a DataFrame?

A

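Use dtype=pd.Int64Dtype() (pandas' nullable integer type) so that integer columns containing missing values are stored as integers rather than being converted to floats.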

16
Q

How do you rename columns when reading a file into a DataFrame?

A

Use names=['column1', 'column2', …] when reading the file. Add header=None if the file has no header row, or header=0 to replace an existing header row.

17
Q

How do you skip rows from the top when reading a file?

A

Use the skiprows=… parameter.

18
Q

How do you skip rows from the bottom when reading a file?

A

Use the skipfooter=… parameter.

19
Q

How do you set a specific column as the index when reading a file?

A

Use index_col=1 to set the second column as the index

20
Q

How do you update a specific value in a DataFrame?

A

Use df.at['row_name', 'column_name'] = new_value.

21
Q

What does df.info() display?

A

It displays the number of entries and the index range, plus each column's name, non-null count, and data type, along with the DataFrame's memory usage.

22
Q

What does the data type float64 represent?

A

It represents decimal numbers.

23
Q

What does the data type int64 represent?

A

It represents whole numbers (integers).

24
Q

What does the data type object represent?

A

It represents strings or words.

25
Q

How do you display the last 5 rows of a DataFrame?

A

Use df.tail().

26
Q

How do you sort the values in a DataFrame by a specific column?

A

Use df.sort_values(by=['column_name'], ascending=False).

27
Q

How do you drop rows with missing values from a DataFrame?

A

Use df.dropna(inplace=True).

28
Q

What is the effect of using inplace=True in df.dropna()?

A

It modifies the original DataFrame.

29
Q

What is the effect of using inplace=False in df.dropna()?

A

It creates a new DataFrame and leaves the original unchanged.

30
Q

How do you drop columns with missing values from a DataFrame?

A

Use df.dropna(axis='columns', inplace=True).

31
Q

How do you drop rows that have less than 2 non-missing values?

A

Use df.dropna(thresh=2, inplace=True).
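
A small sketch comparing the dropna variants from the cards above on a made-up DataFrame (the column names and values are illustrative assumptions). It uses the non-inplace form so the original stays available for comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical data with different amounts of missingness per row.
df = pd.DataFrame({
    "x": [1.0, 4.0, np.nan, np.nan],
    "y": [2.0, 5.0, 6.0, np.nan],
    "z": [3.0, np.nan, np.nan, np.nan],
})

complete_rows = df.dropna()             # keep rows with no missing values
at_least_two = df.dropna(thresh=2)      # keep rows with >= 2 non-missing values
y_present = df.dropna(subset=["y"])     # only check column "y" for missing values
full_columns = df.dropna(axis="columns")  # keep columns with no missing values

print(at_least_two)
```

Here every column contains at least one NaN, so the column-wise drop leaves no columns at all.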

32
Q

How do you drop rows based on missing values in specific columns?

A

Use df.dropna(subset=['column1', 'column2'], inplace=True).

33
Q

How do you save a DataFrame as a CSV file?

A

Use df.to_csv('data_name.filetype', index=False).

34
Q

How do you save a DataFrame as an Excel file?

A

Use df.to_excel('data_name.filetype', index=False).

35
Q

How do you exclude the index when saving a DataFrame to a file?

A

Use index=False in the to_csv() or to_excel() methods.

36
Q

How do you add a new column to a DataFrame?

A

df['Column name'] = ['column contents', 'next value', …]

37
Q

How do you forward-fill NaN values in a column?

A

df['Column name'] = df['Column name'].ffill() (the older fillna(method='ffill') form is deprecated in recent pandas versions)
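
A minimal sketch of forward filling on a made-up column (the column name and values are illustrative assumptions). It uses Series.ffill(), since recent pandas versions deprecate the method argument of fillna:

```python
import pandas as pd

# Hypothetical column where a label is only written on the first row of each block.
df = pd.DataFrame({"site": ["A", None, None, "B", None]})

# Each missing value is replaced by the last non-missing value above it.
df["site"] = df["site"].ffill()

print(df)
```

A missing value in the very first row would stay missing, because there is nothing above it to carry forward.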

38
Q

How do you remove a single column from a DataFrame?

A

df = df.drop(columns='column name')

39
Q

How do you remove multiple columns from a DataFrame?

A

df = df.drop(columns=['name 1', 'name 2'])

40
Q

How do you remove a specific row by its index?

A

df = df.drop(row_number)

41
Q

How do you rearrange columns into a custom order?

A

cols = df.columns.tolist()
cols_new = [cols[1], cols[3], cols[2], cols[0]]
df = df[cols_new]

42
Q

How do you sort the columns alphabetically?

A

cols_new = sorted(cols) then df = df[cols_new]

43
Q

How do you sort the columns in reverse alphabetical order?

A

cols_new = sorted(cols, reverse=True) then df = df[cols_new]

44
Q

How do you sort values by multiple columns in descending order?

A

df = df.sort_values(by=[col1, col2], ascending=False)

45
Q

How do you create a new column by multiplying an existing column by a scalar?

A

df['new column'] = df['old column'] * 50

46
Q

How do you apply a logarithmic transformation to a column?

A

df['new column'] = np.log(df['old column'])

47
Q

How do you apply a square root transformation to a column?

A

df['new column'] = np.sqrt(df['old column'])

48
Q

How do you concatenate two string columns in a DataFrame?

A

df['Soil_Drainage'] = df['Soil'] + '_' + df['Drainage']

49
Q

How do you convert numerical data to strings before concatenating?

A

df['new_column'] = df['text col'] + '_' + df['numeric col'].astype(str)

50
Q

How do you split a column into multiple columns based on a delimiter?

A

df[['new col1', 'new col2']] = df['old_column'].str.split('_', expand=True)

51
Q

How do you limit the number of splits to just once when splitting a column?

A

df[['new col1', 'new col2']] = df['old_column'].str.split('_', n=1, expand=True)
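
A short sketch of what n=1 changes, using made-up sample labels that contain the delimiter twice (the column names and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical labels with two underscores each.
df = pd.DataFrame({"sample": ["soil_2021_a", "clay_2022_b"]})

# Without n, every underscore produces a split: three pieces per value.
all_pieces = df["sample"].str.split("_", expand=True)

# With n=1, only the first underscore is used: exactly two pieces per value.
df[["kind", "rest"]] = df["sample"].str.split("_", n=1, expand=True)

print(all_pieces.shape)
print(df)
```

Limiting the splits keeps the number of resulting columns equal to the number of new column names, which the assignment requires.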

52
Q

How do you use a specific substring for splitting a column?

A

df[['new col1', 'new col2']] = df['old_column'].str.split('_20', expand=True)

53
Q

How do you remove duplicate rows from a DataFrame?

A

data = data.drop_duplicates()

54
Q

How do you reset the index after dropping duplicates from a DataFrame?

A

data = data.drop_duplicates().reset_index(drop=True)
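
A minimal sketch of why the index reset matters after dropping duplicates, on a made-up DataFrame (the names and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data in which rows 1 and 3 repeat the rows before them.
data = pd.DataFrame({"id": [1, 1, 2, 2, 3], "v": ["a", "a", "b", "b", "c"]})

deduped = data.drop_duplicates()                       # keeps original labels 0, 2, 4
clean = data.drop_duplicates().reset_index(drop=True)  # relabels rows 0, 1, 2

print(deduped.index.tolist())
print(clean.index.tolist())
```

drop=True discards the old labels instead of storing them in a new column.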

55
Q

How do you transpose a DataFrame to swap rows and columns?

A

df = df.T (This makes the index the column headers and the column headers the index).

56
Q

How do you reset the index and move it to a column in a DataFrame?

A

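Use df.reset_index(inplace=True): this moves the old index into a regular column (named 'index' if the index had no name) and installs a default integer index. If you do not want to keep that column, remove it with df = df.drop(columns='index'), or use df.reset_index(drop=True, inplace=True) to discard the old index in one step.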

57
Q

How do you select a range of rows using iloc in a DataFrame?

A

df.iloc[1:2] selects rows from position 1 up to, but not including, position 2 (here, just the row at position 1).

58
Q

How do you check the data types of each column in a DataFrame?

A

Use df.dtypes to check the data types of each column.

59
Q

How do you change the data type of a column in a DataFrame?

A

Use df['column_name'] = df['column_name'].astype(new_type), e.g. astype(float). astype returns a converted copy, so assign the result back to the column.

60
Q

What is the purpose of melting a DataFrame, and how do you do it?

A

Melting transforms data from wide form into long form, which is useful for certain types of analysis and visualization. Use pd.melt(dataframe, id_vars, value_vars, var_name='new_var_name', value_name='new_value_name').

61
Q

What parameters do you need to specify when melting a DataFrame?

A

id_vars: Columns to keep as is (typically identifiers).
value_vars: Columns containing the values to melt.
var_name: The name for the new variable column.
value_name: The name for the new value column.
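
The four parameters above can be sketched on a made-up wide table (the column names and counts are illustrative assumptions):

```python
import pandas as pd

# Hypothetical wide-form data: one row per plot, one column per season.
wide = pd.DataFrame({
    "plot": ["p1", "p2"],
    "spring": [3, 5],
    "autumn": [4, 6],
})

long = pd.melt(
    wide,
    id_vars=["plot"],                 # identifier column, repeated on every row
    value_vars=["spring", "autumn"],  # columns whose values get stacked
    var_name="season",                # new column holding the old column names
    value_name="count",               # new column holding the stacked values
)

print(long)
```

Two plots times two seasons gives four rows, each holding one observation.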

62
Q

What is the benefit of melting a DataFrame?

A

Melting makes the dataset more suitable for graphing and further analysis, especially for visualizing or performing operations on variables in a consistent, long format.

63
Q

What is casting (pivoting) in DataFrame manipulation, and how do you do it?

A

Pivoting (casting) reverses melting, converting a long-form DataFrame into a wide-form DataFrame by spreading values into new columns. Use pd.pivot(dataframe, columns='column header', index=['index header', …], values='data column').

64
Q

What parameters do you need to specify when pivoting a DataFrame?

A

columns: The column whose unique values become the new column headers.
index: The column(s) that become the index of the DataFrame.
values: The column containing the data to populate the new table.
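
A minimal sketch of these three parameters on a made-up long table (the column names and counts are illustrative assumptions):

```python
import pandas as pd

# Hypothetical long-form data: one row per plot and season.
long = pd.DataFrame({
    "plot": ["p1", "p2", "p1", "p2"],
    "season": ["spring", "spring", "autumn", "autumn"],
    "count": [3, 5, 4, 6],
})

wide = pd.pivot(
    long,
    index="plot",      # one output row per plot
    columns="season",  # one output column per unique season
    values="count",    # cell contents
)

# Turn the "plot" index back into an ordinary column.
wide = wide.reset_index()

print(wide)
```

pd.pivot raises an error if any index/columns combination occurs more than once, so it suits data with exactly one value per cell.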

65
Q

How do you reset the index after pivoting a DataFrame?

A

Use df_cast.reset_index(inplace=True) to convert any multi-level index back into regular columns.

66
Q

How do you create a relational plot to show the relationship between two numerical variables?

A

Use sns.relplot(x='sepal_length', y='petal_length', data=iris).

67
Q

How do you turn a relational plot into a line plot with a confidence interval?

A

Add kind='line' to the sns.relplot function. The shaded area represents the 95% confidence interval.

68
Q

How do you differentiate points in a relational plot by a categorical variable using color?

A

Add hue='column_name' to the plot, where 'column_name' is the categorical variable.

69
Q

What does sns.lmplot do?

A

It creates a scatter plot with a linear regression line and a shaded 95% confidence interval.

70
Q

How do you create a composite plot with both categorical and numerical variables?

A

Use sns.jointplot(x='sepal_length', y='petal_length', data=iris) or sns.pairplot(data=iris).

71
Q

How do you create a heatmap in Seaborn?

A

Use sns.heatmap(data, annot=True, fmt=".2f", cmap="viridis", cbar=True).

72
Q

How do you create multiple plots side by side?

A

Use fig, axes = plt.subplots(1, 2, figsize=(10, 5)) for two plots in one row

73
Q

How do you create a distribution plot (histogram) for numerical variables?

A

Use sns.displot(x='sepal_length', data=iris). Add hue='column_name' or col='column_name' to split the plot by a categorical variable.

74
Q

How do you explore relationships between categorical variables using a categorical plot?

A

Use sns.catplot(x='Species', y='petal_length', data=iris). Add kind='swarm' or kind='strip' for different plot styles.

75
Q

How do you customize violin plots to display both categories of a variable on the same plot?

A

Add split=True to sns.violinplot, together with a hue variable that has two levels, to draw one category on each half of every violin.

76
Q

How do you set different Seaborn plot styles?

A

Use sns.set_style('style'). Options:

whitegrid: White background with grid
darkgrid: Grey background with grid
dark: Grey background without grid
white: White background without grid

77
Q

How do you suppress the top text on a Seaborn plot?

A

Always add a semicolon ; at the end of the Seaborn plotting code to suppress text like <seaborn.axisgrid.JointGrid at 0x7fa572bf5b50>.

78
Q

How do you set the context for a Seaborn plot?

A

Use sns.set_context('context'). Options:

paper: Smaller text for publication
notebook: Default for notebooks
talk: Larger for presentations
poster: Largest for posters

79
Q

How do you view and set color palettes in Seaborn?

A

To view: sns.color_palette('color name')
To set: sns.set_palette('color name') (e.g., sns.set_palette('tab10') for the default color set)

80
Q

How do you reset Seaborn’s default settings?

A

Use sns.reset_defaults() to restore all plot settings (rc parameters such as colors and figure size) to their defaults.

81
Q

How do you control the size of figure-level plots (e.g., relplot, displot)?

A

Use sns.relplot(x='', y='', data=…, hue='', height=6, aspect=1.5).

height: Plot height in inches
aspect: Aspect ratio (width/height)

82
Q

How do you control the size of axes-level plots (e.g., scatterplot, boxplot)?

A

Use plt.figure(figsize=(width, height)). Example: plt.figure(figsize=(9, 6)) for 9x6 inches.

83
Q

How do you customize titles and axis labels for multiple plots in Seaborn?

A

Title: g.fig.suptitle('Title', fontsize=…, y=…)
Axis labels: g.set_axis_labels('x label', 'y label', fontsize=…)
Set y-axis limits: g.set(ylim=(0, 8))

84
Q

How do you customize individual plots and legends in Seaborn and Matplotlib?

A

Title: plt.title('Title', fontsize=…)
Remove legend: pass legend=False to the Seaborn plotting call (or remove an existing legend with ax.get_legend().remove())
Axis labels: plt.xlabel('x label', fontsize=…), plt.ylabel('y label', fontsize=…)
Customize legend: plt.legend(loc='', title='', frameon=False)

85
Q

How do you save a plot as an image?

A

Use plt.savefig('filename.png') to save the plot as an image (you can specify other formats like .jpg, .svg, etc.).

86
Q

What are the four types of joins in Pandas?

A

Outer Join: Combines all rows from both dataframes, including non-overlapping rows.
Inner Join: Includes only rows common to both dataframes.
Left Join: Includes all rows from the left dataframe and matching rows from the right.
Right Join: Includes all rows from the right dataframe and matching rows from the left.
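
The four join types can be compared on two made-up tables that share only some keys (the names and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical tables: keys "b" and "c" appear in both, "a" and "d" in only one.
left = pd.DataFrame({"key": ["a", "b", "c"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "r": [20, 30, 40]})

# Number of rows produced by each join type.
sizes = {
    how: len(left.merge(right, how=how, on="key"))
    for how in ["inner", "outer", "left", "right"]
}

print(sizes)
```

Rows without a match on the other side are kept by outer, left, or right joins and padded with NaN in the columns that came from the other table.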

87
Q

How do you combine multiple merge() commands in a single line?

A

df = dataframe1.merge(dataframe2).merge(dataframe3)

88
Q

How do you specify custom keys for merging dataframes in Pandas?

A

df = df1.merge(df2, on='column_name').merge(df3, left_on='merged_column', right_on='df3_column')

89
Q

What does pd.concat() do, and what are its key parameters?

A

Combines multiple datasets into one structure.
Key Parameters:
axis: 0 (vertical) or 1 (horizontal).
join: Type of join (default = outer).
keys: Labels to identify data sources.

90
Q

What is the syntax for a vertical concatenation of dataframes?

A

df = pd.concat([df1, df2], axis=0)

91
Q

How do you concatenate dataframes horizontally with an inner join?

A

df = pd.concat([df1.set_index('col_name'), df2.set_index('col_name')], join='inner', axis=1)

92
Q

How do you merge two dataframes on a common column?

A

df = dataframe1.merge(dataframe2, how='join_type', on='common_column')
Join types: inner, outer, left, right.

93
Q

How do you merge dataframes with custom join columns?

A

df = df1.merge(df2, left_on='col_df1', right_on='col_df2', suffixes=['_df1', '_df2'])

94
Q

What happens when matching columns overlap during merging?

A

When both dataframes contain non-key columns with the same name, Pandas appends _x (from dataframe1) and _y (from dataframe2) to tell them apart. Use suffixes to customize these labels.
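
A minimal sketch of the default and custom suffixes, using two made-up tables that both have a column called "value" (the names and data are illustrative assumptions):

```python
import pandas as pd

# Hypothetical tables sharing the key "id" and an overlapping column "value".
df1 = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
df2 = pd.DataFrame({"id": [1, 2], "value": [0.1, 0.2]})

default = df1.merge(df2, on="id")                            # value_x, value_y
custom = df1.merge(df2, on="id", suffixes=("_df1", "_df2"))  # value_df1, value_df2

print(default.columns.tolist())
print(custom.columns.tolist())
```

The key column itself is not suffixed, because it is shared by both sides of the merge.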

95
Q

What is the difference between merge() and join() in Pandas?

A

merge(): Combines dataframes based on common columns.
join(): Combines dataframes on their indexes.

96
Q

What is the default behavior of pd.concat() when the axis parameter is not specified?

A

By default, pd.concat() stacks dataframes vertically (axis=0).

97
Q

How do you add labels to identify the source of data in concatenation?

A

Use the keys parameter:
df = pd.concat([df1, df2], keys=['data1', 'data2'], axis=0)
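
A short sketch of what keys adds to the result, using two made-up dataframes (the names and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data sources.
df1 = pd.DataFrame({"v": [1, 2]})
df2 = pd.DataFrame({"v": [3]})

# keys adds an outer index level recording which source each row came from.
stacked = pd.concat([df1, df2], keys=["data1", "data2"], axis=0)

print(stacked)
print(stacked.loc["data2"])  # select only the rows that came from df2
```

The result has a two-level index: the source label, then the original row label.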

98
Q

What is the syntax for an outer merge?

A

df = df1.merge(df2, how='outer', on='common_column')

99
Q

How do you combine dataframes using join()?

A

df = dataframe1.join(dataframe2, how='join_type', lsuffix='_left', rsuffix='_right')
Default join type: Left join.

100
Q

How do you merge dataframes when they share no common column?

A

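Use the left_on and right_on parameters to name the key column in each dataframe (the columns hold matching values but have different names):
df = df1.merge(df2, left_on='col1_df1', right_on='col2_df2')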

101
Q

How do you differentiate overlapping columns from merged dataframes?

A

Use the suffixes parameter:
suffixes=['_df1', '_df2']

102
Q

What happens when concatenating dataframes with different indexes or columns?

A

Outer join (default): Includes all indexes or columns.
Inner join: Keeps only matching indexes or columns.

103
Q

How do you align dataframes by their column names for concatenation?

A

Set indexes with .set_index() before concatenation:
df = pd.concat([df1.set_index('col'), df2.set_index('col')], axis=1)

104
Q

What is a key advantage of chaining merge() commands?

A

Efficiency and clarity when combining multiple datasets in a single line:
df = df1.merge(df2).merge(df3)

105
Q

What is the difference between axis=0 and axis=1 in concatenation?

A

axis=0: Stacks dataframes vertically (rows).
axis=1: Stacks dataframes horizontally (columns).

107
Q

What does the len() function do in Python when applied to a DataFrame or list?

A

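It returns the total number of rows (for a DataFrame, not counting the header row) or items (for a list). Because positions are counted from 0, the last row or item is at position len() - 1.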

108
Q

How do you access a column’s values in a DataFrame by its name?

A

Use count['column_name'] (here count is the name of the DataFrame).

109
Q

How can you isolate specific columns in a DataFrame?

A

Use double square brackets, e.g., df[['column1', 'column2']]. The columns will appear in the order specified.

110
Q

How do you create a subset of rows from rows 1 to 4 using indexing?

A

Use count[1:5] (row 5 is excluded).

111
Q

How would you display the first 7 rows of a DataFrame?

A

Use count[:7].

112
Q

How would you display only the last row of a DataFrame?

A

Use count[-1:].

113
Q

How do you access a specific value in a DataFrame using iloc?

A

Use integer positions, e.g., df.iloc[2, 3] for the value at row position 2, column position 3 (the third row and fourth column, since counting starts at 0).

114
Q

How do you select all rows but only the last column using iloc?

A

Use df.iloc[:, -1].

115
Q

How does loc differ from iloc?

A

loc uses labels (row/column names), while iloc uses integer positions.

116
Q

How do you retrieve a specific value at row 2 and column ‘field’ using loc?

A

Use df.loc[2, 'field'].
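
A minimal sketch contrasting loc and iloc, using a made-up DataFrame whose row labels deliberately differ from the row positions (the names and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data with row labels 10, 20, 30 (positions are 0, 1, 2).
df = pd.DataFrame(
    {"field": ["north", "south", "east"], "yield": [4.2, 3.9, 5.1]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "field"]  # row labelled 20, column named "field"
by_position = df.iloc[2, 0]     # third row, first column

print(by_label, by_position)
```

With the default integer index, labels and positions coincide, which is why df.loc[2, 'field'] and df.iloc[2, ...] can appear to select the same row.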

117
Q

How can you filter rows based on a column value being greater than 10?

A

Use df[df['column_name'] > 10].

118
Q

How do you extract rows with specific values in a column using .isin()?

A

Use df[df['column_name'].isin(['value1', 'value2'])].

119
Q

How do you filter rows based on a string condition using .query()?

A

Use syntax like df.query('Column == "Value"').
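
The three filtering styles from the cards above, sketched on a made-up DataFrame (the column names and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical soil samples.
df = pd.DataFrame({
    "soil": ["clay", "sand", "loam", "clay"],
    "depth": [5, 12, 20, 15],
})

deep = df[df["depth"] > 10]                       # boolean mask on a numeric column
picked = df[df["soil"].isin(["clay", "loam"])]    # membership in a list of values
queried = df.query('soil == "clay"')              # the condition written as a string

print(deep)
print(picked)
print(queried)
```

All three return a new DataFrame that keeps the original row labels of the matching rows.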

120
Q

What is the purpose of the groupby function in Pandas?

A

It splits a DataFrame into groups based on column(s) and performs operations on each group (e.g., grouping rows by soil types or drainage levels).

121
Q

What are some common aggregation functions used with groupby?

A

mean: Calculates the average for each group.
max: Finds the maximum value for each group.
min: Finds the minimum value for each group.
sum: Adds up values for each group.

122
Q

How do you avoid warnings when using groupby on columns with mixed data types?

A

Pass numeric_only=True to the aggregation, e.g. df.groupby('col_name').mean(numeric_only=True), to limit the operation to numeric columns.

123
Q

How can you calculate the mean of grouped data using groupby?

A

Syntax: df.groupby(['set']).mean()
Example: Groups by the “set” column and calculates the mean, producing a matrix with rows for “control” and “experiment.”

124
Q

How do you count the occurrences of unique values in a column using groupby?

A

Use df.groupby('col_name').size() to get each unique value and its count (returned as a Series).
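
A minimal sketch of grouped means and group counts on a made-up DataFrame that mixes numeric and text columns (the names and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical measurements for two groups, plus a non-numeric column.
df = pd.DataFrame({
    "set": ["control", "control", "experiment", "experiment", "experiment"],
    "height": [1.0, 3.0, 2.0, 4.0, 6.0],
    "note": ["a", "b", "c", "d", "e"],
})

# numeric_only=True skips the text column instead of failing on it.
means = df.groupby("set").mean(numeric_only=True)

# size() counts the rows in each group.
counts = df.groupby("set").size()

print(means)
print(counts)
```

The grouping column becomes the index of both results.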

125
Q

What is the purpose of splitting columns in a DataFrame?

A

To split values in a column into multiple new columns based on a delimiter or character.

126
Q

What is the syntax for splitting a column into multiple columns?

A

df[['new_col1', 'new_col2']] = df['original_col'].str.split('delimiter', n=number_of_splits, expand=True)
delimiter: Character where the split occurs.
n: Number of splits to perform.
expand=True: Ensures output is split into separate columns.

127
Q

How do you split the string “coding” at the letter d?

A

df[['col1', 'col2']] = df['coding'].str.split('d', n=1, expand=True)
Result (for a value of "coding"; the delimiter itself is dropped):
col1: "co"
col2: "ing"

128
Q

What are two key notes about splitting columns?

A

Always use expand=True to create multiple columns.
Use the n parameter to control how many splits occur if the delimiter appears multiple times.

129
Q

What is the purpose of increasing sample size in correlation analysis?

A

Increasing the sample size improves the precision of the correlation estimate, reduces uncertainty in the p-value, and makes the results more reliable.

130
Q

How do you calculate the Pearson correlation coefficient in Python?

A

from scipy.stats import pearsonr
result = pearsonr(df['col1'], df['col2'])
print(f'r = {result.statistic:.2f}')

131
Q

What does a Pearson correlation coefficient r value of 0.46 and a p-value of 0.03 indicate?

A

An r of 0.46 indicates a moderate positive correlation in the sample. The p-value of 0.03 means that, if there were no real correlation (the null hypothesis), there would be only a 3% chance of observing a correlation at least this strong. Since the p-value is low (< 0.05), we reject the null hypothesis and consider the correlation statistically significant.

132
Q

What is the difference between a statistic and a parameter?

A

Statistic: A numerical property of a sample (e.g., sample correlation coefficient r).
Parameter: A numerical property of the population (e.g., population correlation coefficient ρ).

133
Q

What is the purpose of the p-value in hypothesis testing?

A

The p-value helps assess the significance of the observed correlation. It represents the probability of observing a test statistic at least as extreme as the observed one, assuming the null hypothesis is true.

134
Q

How do you interpret a p-value in correlation analysis?

A

If the p-value < 0.05, we reject the null hypothesis (no correlation) and consider the correlation significant.
If the p-value > 0.05, the correlation might be due to random chance and we fail to reject the null hypothesis.

135
Q

How can you calculate a 95% confidence interval for a Pearson correlation?

A

ci = result.confidence_interval()
ci.low, ci.high
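
A self-contained sketch of the correlation and its confidence interval, using simulated data in place of real DataFrame columns (the data, seed, and variable names are illustrative assumptions; result.statistic and confidence_interval() require a reasonably recent SciPy):

```python
import numpy as np
from scipy.stats import pearsonr

# Simulated data with a built-in positive relationship (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

result = pearsonr(x, y)
ci = result.confidence_interval(confidence_level=0.95)

print(f"r = {result.statistic:.2f}, p = {result.pvalue:.3f}")
print(f"95% CI: ({ci.low:.2f}, {ci.high:.2f})")
```

A larger sample would narrow the interval, which is the gain in precision described in the card on sample size.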

136
Q

What is the purpose of the Shapiro-Wilk test?

A

The Shapiro-Wilk test assesses if the data follows a normal distribution. The null hypothesis is that the data comes from a normally distributed population.

137
Q

How do you apply the Shapiro-Wilk test in Python?

A

from scipy.stats import shapiro
result = shapiro(df['col_name'])

138
Q

How do you handle non-normally distributed data in Python?

A

To transform non-normally distributed data (for example right-skewed, strictly positive values), apply a log transformation:

df['log-variable'] = np.log10(df['col_name'])

139
Q

What is the purpose of a t-test for independent samples?

A

A t-test for independent samples tests whether the means of two independent samples differ.

140
Q

How do you perform a t-test for independent samples in Python?

A

from scipy.stats import ttest_ind
result = ttest_ind(subset['col1'], subset['col2'])

141
Q

How do you modify the y-axis scale for large values in a scatter plot?

A

ax.set_yscale('log')

142
Q

What is the probability of observing a correlation of at least 0.46, assuming no real correlation exists?

A

The p-value gives this probability: the chance of observing a correlation at least as strong as 0.46 if no real correlation exists. If the p-value is small (e.g., < 0.05), it suggests that the observed correlation is statistically significant.

143
Q

What is the purpose of a swarm plot?

A

Swarm plots are used to compare a numerical variable across the categories of a categorical variable (e.g., two groups). They show every individual observation, allowing us to visually assess the difference in means between those categories.

144
Q

What is the best use of a scatterplot (lmplot)?

A

An lmplot is ideal for visualizing the relationship between two numerical variables: it combines a scatter plot with a fitted regression line.

145
Q

What is an explanatory variable in regression?

A

The explanatory variable is the independent variable (cause). It is expected to explain the variation in the response variable.

146
Q

What is a response variable in regression?

A

The response variable is the dependent variable (effect). It is expected to change in response to the explanatory variable.

147
Q

What is the model formula for regression?

A

The relationship between the response and explanatory variables is expressed as:
response_variable ~ explanatory_variable

148
Q

What does the t-value represent in regression?

A

The t-value measures how many standard errors the coefficient is away from zero.
Example: t-value = 0.79.

149
Q

What does the p-value indicate in regression?

A

The p-value is the probability of observing a t-value at least as extreme as the one obtained if the true coefficient in the population were zero.
Example: p-value = 0.433. Since p > 0.05, we fail to reject the null hypothesis (no significant effect).

150
Q

What is a 95% Confidence Interval (CI) in regression?

A

The 95% CI is a range of plausible values for the true population parameter; with repeated sampling, about 95% of such intervals would contain it.
Example: CI = (-0.25, 0.109). Since the CI contains 0, we cannot reject the null hypothesis.

151
Q

How do you interpret the relationship between the y-axis and x-axis in linear models?

A

In a linear model, the mean of the response (y-axis) variable at a given value of x is given by the equation:
Mean of y = intercept + slope * x
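
A worked sketch of this equation using np.polyfit on made-up, exactly linear data. This is a stand-in chosen for brevity, not the formula-based regression tool the cards refer to, and the numbers are illustrative assumptions:

```python
import numpy as np

# Made-up data lying exactly on the line y = 2.0 + 0.5 * x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 0.5 * x

# A degree-1 fit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Mean of y predicted by the model at x = 2.5.
predicted_mean = intercept + slope * 2.5

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"mean of y at x = 2.5: {predicted_mean:.2f}")
```

With real, noisy data the fitted line would not pass through every point, and the equation would describe the average y expected at each x.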

152
Q

What must be mentioned in the conclusion when interpreting relationships?

A

It is important to mention whether the relationship is linear or logarithmic and state if there is a statistically significant relationship.