Data analysis code Flashcards
What does axis=1 do in a DataFrame operation?
It applies the operation across the columns, along each row (horizontally), giving one result per row.
What does axis=0 do in a DataFrame operation?
It applies the operation down the rows of each column (vertically), giving one result per column.
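A small sketch of the two axis values, using a made-up DataFrame (the column names a and b are illustrative only):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: collapse down the rows, giving one result per column
col_sums = df.sum(axis=0)

# axis=1: collapse across the columns, giving one result per row
row_sums = df.sum(axis=1)

print(col_sums.tolist())  # [6, 60]
print(row_sums.tolist())  # [11, 22, 33]
```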
What is the purpose of the Shapiro-Wilk test?
It tests whether the data is normally distributed.
What is the interpretation of a p-value greater than 0.05 in the Shapiro-Wilk test?
There is no significant evidence that the data departs from a normal distribution; fail to reject the null hypothesis of normality.
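One way to run the test is scipy.stats.shapiro. The sketch below assumes SciPy is installed and uses a seeded random sample as stand-in data; the 0.05 threshold is the conventional choice, not a fixed rule.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 draws from a normal distribution (seeded for repeatability)
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=50)

# shapiro returns the W statistic and the p-value
statistic, p_value = stats.shapiro(sample)

alpha = 0.05
if p_value > alpha:
    print("No evidence against normality (fail to reject H0)")
else:
    print("Evidence against normality (reject H0)")
```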
What does df.shape return?
A tuple with the number of rows and columns: (number of rows, number of columns).
What statistics does df.describe() provide?
It provides summary statistics: Count, Mean, Standard Deviation, Minimum, Quartiles (25%, 50%, 75%), and Maximum.
What does df.mean() do in a DataFrame?
It computes the mean of each column.
How do you access a specific column in a DataFrame?
Use the column name: df['column_name'].
What does the code df['column_name'].max() do?
It returns the maximum value in the specified column.
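The inspection calls from the preceding cards can be tried together on a tiny DataFrame. This is a sketch with invented column names (height, weight):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0],
    "weight": [10.0, 20.0, 30.0, 40.0],
})

n_rows, n_cols = df.shape          # (rows, columns)
summary = df.describe()            # count, mean, std, min, quartiles, max
column_means = df.mean()           # one mean per column
tallest = df["height"].max()       # maximum of a single column

print(n_rows, n_cols)              # 4 2
print(column_means["height"])      # 2.5
print(tallest)                     # 4.0
```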
What assumptions does the paired t-test make?
It assumes no major outliers, independent pairs of observations, a continuous dependent variable, and approximately normally distributed differences between the paired values.
How do you read a CSV file into a DataFrame in Python?
df = pd.read_csv('../folder/name.filetype')
How do you read an Excel file into a DataFrame in Python?
df = pd.read_excel('file_path')
How do you read a tab-separated file into a DataFrame?
Use the sep='\t' parameter in pd.read_csv().
How do you handle missing data when reading a file?
Use na_values='' to read '' as NaN; other placeholder strings, or a list of them, can be passed the same way.
How do you specify the data type for integer columns in a DataFrame?
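Use dtype=pd.Int64Dtype() (the nullable integer type) so that integer columns containing missing values are kept as integers instead of being read as floats.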
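A sketch of the difference, reading a small in-memory CSV (the id and count columns are invented) with and without the nullable integer type:

```python
import io
import pandas as pd

# Hypothetical CSV text with one missing value in the count column
raw = "id,count\n1,5\n2,\n3,7\n"

# Default behaviour: the missing value forces the column to float64
default = pd.read_csv(io.StringIO(raw))

# Nullable integer dtype: values stay integers, the gap becomes <NA>
nullable = pd.read_csv(io.StringIO(raw), dtype={"count": pd.Int64Dtype()})

print(default["count"].dtype)   # float64
print(nullable["count"].dtype)  # Int64
```

Passing a dict to dtype limits the conversion to the named columns.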
How do you rename columns when reading a file into a DataFrame?
Use header=None, names=['column1', 'column2', …] when reading the file.
How do you skip rows from the top when reading a file?
Use the skiprows=… parameter.
How do you skip rows from the bottom when reading a file?
Use the skipfooter=… parameter.
How do you set a specific column as the index when reading a file?
Use index_col=1 to set the second column as the index (positions count from 0).
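The read_csv options from the preceding cards can be combined in one call. The sketch below reads in-memory text instead of a real file; the contents, the column names, and the "missing" placeholder are all invented for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk
raw = "exported by instrument\nA\t1\tmissing\nB\t2\t3.5\ntotal rows: 2"

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                       # tab-separated values
    skiprows=1,                     # skip the note at the top
    skipfooter=1,                   # skip the summary line at the bottom
    engine="python",                # skipfooter requires the Python parser
    header=None,                    # the data has no header row
    names=["site", "count", "ph"],  # supply the column names
    index_col=0,                    # use the first column as the index
    na_values="missing",            # read this placeholder as NaN
)

print(df)
```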
How do you update a specific value in a DataFrame?
Use df.at['row_name', 'column_name'] = new_value.
What does df.info() display?
It displays the number of entries, index range, column names, non-null count and data type per column, and memory usage.
What does the data type float64 represent?
It represents decimal numbers.
What does the data type int64 represent?
It represents whole numbers (integers).
What does the data type object represent?
It typically represents strings (text); it is also used for columns that hold mixed types.
How do you display the last 5 rows of a DataFrame?
Use df.tail().
How do you sort the values in a DataFrame by a specific column?
Use df.sort_values(by=['column_name'], ascending=False).
How do you drop rows with missing values from a DataFrame?
Use df.dropna(inplace=True).
What is the effect of using inplace=True in df.dropna()?
It modifies the original DataFrame.
What is the effect of using inplace=False in df.dropna()?
It creates a new DataFrame and leaves the original unchanged.
How do you drop columns with missing values from a DataFrame?
Use df.dropna(axis='columns', inplace=True).
How do you drop rows that have fewer than 2 non-missing values?
Use df.dropna(thresh=2, inplace=True).
How do you drop rows based on missing values in specific columns?
Use df.dropna(subset=['column1', 'column2'], inplace=True).
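The dropna variants can be compared side by side. This sketch uses invented data and leaves out inplace=True, so each call returns a new DataFrame and the original stays unchanged:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in different places
df = pd.DataFrame({
    "w": [1, 2, 3, 4],
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": [1.0, 2.0, np.nan, 4.0],
    "z": [1.0, 2.0, np.nan, np.nan],
})

rows_complete = df.dropna()                # keep rows with no missing values
cols_complete = df.dropna(axis="columns")  # keep columns with no missing values
at_least_two = df.dropna(thresh=2)         # keep rows with at least 2 non-missing values
x_present = df.dropna(subset=["x"])        # keep rows where x is not missing

print(rows_complete.index.tolist())    # [0]
print(cols_complete.columns.tolist())  # ['w']
print(at_least_two.index.tolist())     # [0, 1, 3]
print(x_present.index.tolist())        # [0, 3]
```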
How do you save a DataFrame as a CSV file?
Use df.to_csv('data_name.filetype', index=False).
How do you save a DataFrame as an Excel file?
Use df.to_excel('data_name.filetype', index=False).
How do you exclude the index when saving a DataFrame to a file?
Use index=False in the to_csv() or to_excel() methods.
How do you add a new column to a DataFrame?
df['Column name'] = ['column contents', 'next value', …]
How do you forward-fill NaN values in a column?
df['Column name'] = df['Column name'].ffill() (the older fillna(method='ffill') form is deprecated in recent pandas versions)
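A sketch of forward-filling with Series.ffill(), using an invented depth column; each NaN takes the last valid value above it:

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps
df = pd.DataFrame({"depth": [1.0, np.nan, np.nan, 2.0, np.nan]})

# Propagate the last valid value forward into each gap
df["depth"] = df["depth"].ffill()

print(df["depth"].tolist())  # [1.0, 1.0, 1.0, 2.0, 2.0]
```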
How do you remove a single column from a DataFrame?
df = df.drop(columns='column name')
How do you remove multiple columns from a DataFrame?
df = df.drop(columns=['name 1', 'name 2'])
How do you remove a specific row by its index?
df = df.drop(row_number) (row_number is the row's index label, which is not necessarily its position)
How do you rearrange columns into a custom order?
cols = df.columns.tolist()
cols_new = [cols[1], cols[3], cols[2], cols[0]]
df = df[cols_new]
How do you sort the columns alphabetically?
cols_new = sorted(cols) then df = df[cols_new]
How do you sort the columns in reverse alphabetical order?
cols_new = sorted(cols, reverse=True) then df = df[cols_new]
How do you sort values by multiple columns in descending order?
df = df.sort_values(by=[col1, col2], ascending=False)
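A sketch of multi-column sorting on invented data. Passing a list to ascending sets the direction per column, which is an option beyond the single ascending=False shown above:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "site": ["b", "a", "b", "a"],
    "year": [2020, 2021, 2021, 2020],
})

# Both columns descending: site first, then year within each site
both_desc = df.sort_values(by=["site", "year"], ascending=False)

# Mixed directions: site ascending, year descending
mixed = df.sort_values(by=["site", "year"], ascending=[True, False])

print(both_desc.index.tolist())  # [2, 0, 1, 3]
print(mixed.index.tolist())      # [1, 3, 2, 0]
```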
How do you create a new column by multiplying an existing column by a scalar?
df['new column'] = df['old column'] * 50
How do you apply a logarithmic transformation to a column?
df['new column'] = np.log(df['old column'])
How do you apply a square root transformation to a column?
df['new column'] = np.sqrt(df['old column'])
How do you concatenate two string columns in a DataFrame?
df['Soil_Drainage'] = df['Soil'] + '_' + df['Drainage']
How do you convert numerical data to strings before concatenating?
df['new_column'] = df['text col'] + '_' + df['numeric col'].astype(str)
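A sketch of joining a text column to a numeric one, with invented column names (Soil, Plot). Without astype(str), adding a string to an integer column raises a TypeError.

```python
import pandas as pd

# Hypothetical data: one text column and one integer column
df = pd.DataFrame({"Soil": ["clay", "sand"], "Plot": [1, 2]})

# Convert the numeric column to strings, then concatenate
df["Soil_Plot"] = df["Soil"] + "_" + df["Plot"].astype(str)

print(df["Soil_Plot"].tolist())  # ['clay_1', 'sand_2']
```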
How do you split a column into multiple columns based on a delimiter?
df[['new col1', 'new col2']] = df['old_column'].str.split('_', expand=True)
How do you limit the number of splits to just once when splitting a column?
df[['new col1', 'new col2']] = df['old_column'].str.split('_', n=1, expand=True)
How do you use a specific substring for splitting a column?
df[['new col1', 'new col2']] = df['old_column'].str.split('_20', expand=True)
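A sketch of how the n argument changes the result, using an invented sample column. With two delimiters per value, an unlimited split yields three columns, while n=1 yields two.

```python
import pandas as pd

# Hypothetical data: each value contains two underscores
df = pd.DataFrame({"sample": ["oak_north_2021", "pine_south_2022"]})

# Unlimited split: one column per piece
all_parts = df["sample"].str.split("_", expand=True)

# Split once only: everything after the first underscore stays together
df[["species", "rest"]] = df["sample"].str.split("_", n=1, expand=True)

print(all_parts.shape)       # (2, 3)
print(df["rest"].tolist())   # ['north_2021', 'south_2022']
```

The number of names on the left of the assignment has to match the number of columns the split produces.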
How do you remove duplicate rows from a DataFrame?
data = data.drop_duplicates()
How do you reset the index after dropping duplicates from a DataFrame?
data = data.drop_duplicates().reset_index(drop=True)
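A sketch showing why the index reset is useful, on invented data. After drop_duplicates() the surviving rows keep their original labels, which leaves gaps.

```python
import pandas as pd

# Hypothetical data with repeated rows
data = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "value": ["a", "a", "b", "c", "c"],
})

deduped = data.drop_duplicates()
deduped_reset = data.drop_duplicates().reset_index(drop=True)

print(deduped.index.tolist())        # [0, 2, 3]
print(deduped_reset.index.tolist())  # [0, 1, 2]
```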
How do you transpose a DataFrame to swap rows and columns?
df = df.T (This makes the index the column headers and the column headers the index).
How do you reset the index and move it to a column in a DataFrame?
Use df.reset_index(inplace=True) to reset the index; the old index is moved into a new column (named 'index' by default). To discard it instead, follow with df = df.drop(columns='index') or pass drop=True to reset_index().
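A sketch of the two reset_index behaviours on an invented DataFrame with an unnamed string index:

```python
import pandas as pd

# Hypothetical data with a non-default index
df = pd.DataFrame({"v": [1, 2]}, index=["x", "y"])

# Default: the old index becomes a column named "index"
kept = df.reset_index()

# drop=True: the old index is discarded
dropped = df.reset_index(drop=True)

print(kept.columns.tolist())     # ['index', 'v']
print(dropped.columns.tolist())  # ['v']
```

If the index has a name, the new column takes that name instead of "index".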
How do you select a range of rows using iloc in a DataFrame?
df.iloc[1:2] selects rows from position 1 up to but not including position 2.
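A sketch of position-based slicing on invented data. iloc counts positions from 0 and ignores the index labels.

```python
import pandas as pd

# Hypothetical data with string labels, to show that iloc uses positions
df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

one_row = df.iloc[1:2]   # position 1 only
two_rows = df.iloc[1:3]  # positions 1 and 2

print(one_row.index.tolist())   # ['b']
print(two_rows.index.tolist())  # ['b', 'c']
```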
How do you check the data types of each column in a DataFrame?
Use df.dtypes to check the data types of each column.
How do you change the data type of a column in a DataFrame?
Use df['column_name'] = df['column_name'].astype(new_type) to cast a column to a different type, for example astype(float) or astype(str).
What is the purpose of melting a DataFrame, and how do you do it?
Melting transforms data from wide form into long form, which is useful for certain types of analysis and visualization. Use pd.melt(dataframe, id_vars, value_vars, var_name='new_var_name', value_name='new_value_name').
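A sketch of a wide-to-long melt on invented data, where each year is a separate column in the wide form (the names site, year, and yield are illustrative only):

```python
import pandas as pd

# Hypothetical wide-form data: one column per year
wide = pd.DataFrame({
    "site": ["A", "B"],
    "2020": [1.0, 2.0],
    "2021": [3.0, 4.0],
})

# Long form: one row per (site, year) combination
long_df = pd.melt(
    wide,
    id_vars=["site"],
    value_vars=["2020", "2021"],
    var_name="year",
    value_name="yield",
)

print(long_df.shape)             # (4, 3)
print(long_df["year"].tolist())  # ['2020', '2020', '2021', '2021']
```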