Feature Engineering with PySpark Flashcards
What is feature engineering?
The process of using domain knowledge to create new features that help our models perform better.
What are the 6 basic steps of a data science project?
Project Understanding and Data Collection
Exploratory Data Analysis (EDA)
Data Cleaning
Feature engineering
Model training
Project Delivery
What is one of the performance advantages of Parquet files in terms of importing and datatype selection?
Parquet stores data in a columnar format, so individual columns can be imported on their own; with CSV, the whole file has to be read at once.
Parquet fields also have enforced datatypes, saving users time searching for the correct dtypes.
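A minimal sketch of the columnar advantage; the file path 'listings.parquet' and the column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parquet is columnar: only the selected columns are read, and dtypes come from the file's schema
df = spark.read.parquet('listings.parquet').select('DAYSONMARKET', 'LISTPRICE')
df.printSchema()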
Similar to research projects, what is the first step of every data science project?
Understanding what question is driving the analysis.
What kind of python function would you write to verify data load?
def check_load(df, num_records, num_columns):
    # Takes a dataframe and compares record and column counts to input
    # Message to return if the criteria below aren't met
    message = 'Validation Failed'
    # Check number of records
    if num_records == df.count():
        # Check number of columns
        if num_columns == len(df.columns):
            # Success message
            message = 'Validation Passed'
    return message
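Example call; the expected record and column counts below are hypothetical:

print(check_load(df, num_records=5000, num_columns=74))  # expected counts are made up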
What kind of function would you write to verify the datatypes?
# Create a list of actual dtypes to check
actual_dtypes_list = df.dtypes
print(actual_dtypes_list)

# Iterate through the list of actual dtype tuples
for attribute_tuple in actual_dtypes_list:
    # Check if the column name is in the dictionary of expected dtypes
    col_name = attribute_tuple[0]
    if col_name in validation_dict.keys():
        # Compare attribute types
        col_type = attribute_tuple[1]
        if col_type == validation_dict[col_name]:
            print(col_name + ' has expected dtype.')
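This snippet assumes a validation_dict mapping column names to expected dtype strings, for example (column names and dtypes are hypothetical):

validation_dict = {'DAYSONMARKET': 'int', 'LISTPRICE': 'int', 'CITY': 'string'}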
When aggregating results, how would you write the calculation of the mean?
df.agg({'col_name': 'mean'}).collect()
OR
import pyspark.sql.functions as F
df.agg(F.mean('col_name')).collect()
What is the issue with converting pyspark dataframes to pandas?
PySpark is built for big data; converting a large dataframe to pandas pulls all of it into memory at once, which may cause pandas to crash.
If you want to plot a pyspark dataframe using non-big-data tools (e.g. seaborn), what strategy could be used? Please write the code as an example.
We could sample from the dataframe and plot just a portion of the data.
df.sample(withReplacement=False, fraction=0.5, seed=42)
# fraction=0.5 -> fraction of the data to keep
# seed=42 -> ensures reproducibility
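A sketch of the full pattern, sampling and then converting to pandas so seaborn can plot it; the column name 'LISTPRICE' is hypothetical:

import seaborn as sns
import matplotlib.pyplot as plt

# Sample a fraction of the data, pull it to the driver as pandas, then plot
sample_pdf = df.sample(withReplacement=False, fraction=0.5, seed=42).toPandas()
sns.histplot(sample_pdf['LISTPRICE'])  # 'LISTPRICE' is a hypothetical column
plt.show()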
Write an example using where and like to filter a pyspark dataframe.
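A minimal sketch, assuming a hypothetical POSTCODE column:

# Keep only rows whose POSTCODE starts with '55' (column name hypothetical)
df.where(df['POSTCODE'].like('55%')).show()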
What is the difference between .drop and .dropna()?
.drop() removes columns; do not forget to put a star before a list of column names to unpack it.
.dropna() removes rows that contain null values.
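For example (the column names are hypothetical):

# Drop several columns at once; the * unpacks the list into separate arguments
cols_to_drop = ['STREET', 'FIREPLACES']
df = df.drop(*cols_to_drop)

# Drop any row that contains a null value
df = df.dropna()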
What is the formula for min/max normalization?
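x_scaled = (x - x_min) / (x_max - x_min), which rescales the values of x to the range [0, 1].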
What is the Z transformation of the data? What is the formula?
The process of standardizing the data to have mean 0 and expressing each value in units of standard deviations from the mean.
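z = (x - mean) / std, i.e. subtract the column mean and divide by the column's standard deviation.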
What is a log transformation and where could it be useful?
Taking the logarithm of a feature; it is useful when trying to standardize data that is highly skewed, such as data following an exponential distribution.
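A quick PySpark sketch; the column name 'SalesClosePrice' is hypothetical:

import pyspark.sql.functions as F

# Create a log-transformed version of a skewed column
df = df.withColumn('log_SalesClosePrice', F.log(df['SalesClosePrice']))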
Write a pyspark function that applies a min/max scaling procedure to a df and columns of choice, as arguments.
def min_max_scaler(df, cols_to_scale):
    # Takes a dataframe and a list of columns to min/max scale. Returns a dataframe.
    for col in cols_to_scale:
        # Define min and max values and collect them
        max_val = df.agg({col: 'max'}).collect()[0][0]
        min_val = df.agg({col: 'min'}).collect()[0][0]
        new_column_name = 'scaled_' + col
        # Create a new column based off the scaled data
        df = df.withColumn(new_column_name,
                           (df[col] - min_val) / (max_val - min_val))
    return df

df = min_max_scaler(df, cols_to_scale)

# Show that our data is now between 0 and 1
df[['DAYSONMARKET', 'scaled_DAYSONMARKET']].show()