EDA Flashcards
What data exploration activities does a variable's distribution help with?
-To decide how to fill missing values in a column/variable
-To choose which measure of central tendency (the mean, the median, or the mode) gives imputed values that fit best within the variable's distribution
When do we impute by the median?
We impute by the median when the distribution is skewed, because the median is less affected by extreme values than the mean.
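As a rough illustration of median imputation, here is a minimal sketch using pandas. The DataFrame, the "income" column, and its values are made up for the example; only skew(), mean(), median(), and fillna() are standard pandas calls.

```python
import pandas as pd

# Hypothetical data: one extreme value (400) and one missing entry.
df = pd.DataFrame({"income": [30, 32, 35, 38, None, 400]})

skewness = df["income"].skew()        # positive value suggests a right-skewed column
mean_value = df["income"].mean()      # pulled upward by the extreme value
median_value = df["income"].median()  # stays near the bulk of the data

# Fill the missing entry with the median rather than the mean.
df["income_filled"] = df["income"].fillna(median_value)

print(skewness, mean_value, median_value)
print(df)
```

Here the mean (107) sits far from most of the values, while the median (35) does not, which is why the median is the safer fill value for a skewed column.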
What does iloc accomplish in EDA?
It selects rows and/or columns of a dataset by integer position. Rows and columns are specified by integer position (a single integer, a list of integers, or a slice), counting from 0.
What does this iloc example do: df.iloc[:, [1, 2, 5, 6, 10]]
It selects all rows (the bare colon) and only the columns at integer positions 1, 2, 5, 6, and 10. Positions are zero-based, so these are the 2nd, 3rd, 6th, 7th, and 11th columns.
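To see the iloc call in action, here is a small sketch on a made-up DataFrame. The column names c0 through c10 and the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame with 11 columns (c0 .. c10) and 3 rows.
df = pd.DataFrame({f"c{i}": [i, i + 100, i + 200] for i in range(11)})

# All rows, columns at integer positions 1, 2, 5, 6, and 10.
subset = df.iloc[:, [1, 2, 5, 6, 10]]
print(subset.columns.tolist())  # ['c1', 'c2', 'c5', 'c6', 'c10']
print(subset.shape)             # (3, 5)

# Slices work too; with iloc the end position is excluded.
corner = df.iloc[0:2, 0:3]      # first 2 rows, first 3 columns
print(corner.shape)             # (2, 3)
```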
What does loc accomplish in EDA?
It selects rows and/or columns of a dataset by label. Columns can be given by their string names instead of integer positions.
What does this loc example do: df.loc[:, ['gender', 'age']]
It selects all rows (the bare colon) and only the columns named 'gender' and 'age'.
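The loc call can be tried on a small made-up DataFrame, as in the sketch below. The "name" column and all values are assumptions for illustration; "gender" and "age" come from the card above.

```python
import pandas as pd

# Hypothetical frame with a default integer index (0, 1, 2).
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "gender": ["F", "M", "M"],
    "age": [34, 29, 41],
})

# All rows, only the columns labeled 'gender' and 'age'.
subset = df.loc[:, ["gender", "age"]]
print(subset.columns.tolist())  # ['gender', 'age']
print(subset.shape)             # (3, 2)

# Label slices with loc include both endpoints, unlike iloc.
first_two = df.loc[0:1, "gender":"age"]  # index labels 0 and 1
print(first_two.shape)                   # (2, 2)
```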
What function can be used to create a filter in a dataset?