Feature Engineering with PySpark Flashcards
What is feature engineering?
The process of using domain knowledge to create new features that help our models perform better.
What are the 6 basic steps of a data science project?
Project Understanding and Data Collection
Exploratory Data Analysis (EDA)
Data Cleaning
Feature engineering
Model training
Project Delivery
What is one of the performance advantages of Parquet files in terms of importing and datatype selection?
Parquet stores data in a columnar format, so individual columns can be imported on their own; with CSV, the whole file has to be read at once.
Parquet fields also have enforced datatypes, saving users time searching for the correct dtypes.
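A minimal sketch of the columnar advantage; the file path 'listings.parquet' and the column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parquet is columnar: only the selected columns are read, and dtypes come from the file's schema
df = spark.read.parquet('listings.parquet').select('DAYSONMARKET', 'LISTPRICE')
df.printSchema()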
Similar to research projects, what is the first step of every data science project?
Understanding what question is driving the analysis.
What kind of python function would you write to verify data load?
def check_load(df, num_records, num_columns):
    # Takes a dataframe and compares record and column counts to input
    # Message to return if the criteria below aren't met
    message = 'Validation Failed'
    # Check number of records
    if num_records == df.count():
        # Check number of columns
        if num_columns == len(df.columns):
            # Success message
            message = 'Validation Passed'
    return message
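Example call; the expected record and column counts below are hypothetical:

print(check_load(df, num_records=5000, num_columns=74))  # expected counts are made up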
What kind of function would you write to verify the datatypes?
# Create a list of actual dtypes to check
actual_dtypes_list = df.dtypes
print(actual_dtypes_list)

# Iterate through the list of actual dtype tuples
for attribute_tuple in actual_dtypes_list:
    # Check if the column name is in the dictionary of expected dtypes
    col_name = attribute_tuple[0]
    if col_name in validation_dict.keys():
        # Compare attribute types
        col_type = attribute_tuple[1]
        if col_type == validation_dict[col_name]:
            print(col_name + ' has expected dtype.')
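This snippet assumes a validation_dict mapping column names to expected dtype strings, for example (column names and dtypes are hypothetical):

validation_dict = {'DAYSONMARKET': 'int', 'LISTPRICE': 'int', 'CITY': 'string'}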
When aggregating results, how would you write the calculation of the mean?
df.agg({'col_name': 'mean'}).collect()
OR
import pyspark.sql.functions as F
df.agg(F.mean('col_name')).collect()
What is the issue with converting pyspark dataframes to pandas?
PySpark is built for big data; converting a large dataframe to pandas pulls all of it into memory at once, which may cause pandas to crash.
If you want to plot a pyspark dataframe using non-big-data tools (e.g. seaborn), what strategy could be used? Please write the code as an example.
We could sample from the dataframe and plot just a portion of the data.
df.sample(withReplacement=False, fraction=0.5, seed=42)
# fraction=0.5 -> fraction of the data to keep
# seed=42 -> ensures reproducibility
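A sketch of the full pattern, sampling and then converting to pandas so seaborn can plot it; the column name 'LISTPRICE' is hypothetical:

import seaborn as sns
import matplotlib.pyplot as plt

# Sample a fraction of the data, pull it to the driver as pandas, then plot
sample_pdf = df.sample(withReplacement=False, fraction=0.5, seed=42).toPandas()
sns.histplot(sample_pdf['LISTPRICE'])  # 'LISTPRICE' is a hypothetical column
plt.show()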
Write an example using where and like to filter a pyspark dataframe.
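A minimal sketch, assuming a hypothetical POSTCODE column:

# Keep only rows whose POSTCODE starts with '55' (column name hypothetical)
df.where(df['POSTCODE'].like('55%')).show()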
What is the difference between .drop and .dropna()?
.drop() removes columns; do not forget to put a star before a list of column names to unpack it.
.dropna() removes rows that contain null values.
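For example (the column names are hypothetical):

# Drop several columns at once; the * unpacks the list into separate arguments
cols_to_drop = ['STREET', 'FIREPLACES']
df = df.drop(*cols_to_drop)

# Drop any row that contains a null value
df = df.dropna()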
What is the formula for min/max normalization?
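x_scaled = (x - x_min) / (x_max - x_min), which rescales the values of x to the range [0, 1].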
What is the Z transformation of the data? What is the formula?
The process of standardizing the data to have mean 0 and expressing each value in units of standard deviations from the mean.
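z = (x - mean) / std, i.e. subtract the column mean and divide by the column's standard deviation.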
What is a log transformation and where could it be useful?
Taking the logarithm of a feature; it is useful when trying to standardize data that is highly skewed, such as data following an exponential distribution.
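A quick PySpark sketch; the column name 'SalesClosePrice' is hypothetical:

import pyspark.sql.functions as F

# Create a log-transformed version of a skewed column
df = df.withColumn('log_SalesClosePrice', F.log(df['SalesClosePrice']))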
Write a pyspark function that applies a min/max scaling procedure to a df and columns of choice, as arguments.
def min_max_scaler(df, cols_to_scale):
    # Takes a dataframe and a list of columns to min/max scale. Returns a dataframe.
    for col in cols_to_scale:
        # Define min and max values and collect them
        max_val = df.agg({col: 'max'}).collect()[0][0]
        min_val = df.agg({col: 'min'}).collect()[0][0]
        new_column_name = 'scaled_' + col
        # Create a new column based off the scaled data
        df = df.withColumn(new_column_name,
                           (df[col] - min_val) / (max_val - min_val))
    return df

df = min_max_scaler(df, cols_to_scale)

# Show that our data is now between 0 and 1
df[['DAYSONMARKET', 'scaled_DAYSONMARKET']].show()