Pandas Missing Values Flashcards
NaN (acronym for Not a Number)
The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
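As a rough illustration of the floating-point behaviour described above, the sketch below builds a small array containing NaN. The array `vals2` and its contents are assumptions chosen for the example, not data from the original cards.

```python
import numpy as np

# Hypothetical sample array; the NaN entry forces a floating-point dtype.
vals2 = np.array([1, np.nan, 3, 4])
array_dtype = vals2.dtype

# Any arithmetic that touches NaN yields NaN.
total = vals2.sum()

# NaN never compares equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

print(array_dtype, total, nan_equals_itself)
```

Because NaN propagates through ordinary aggregations, a single missing entry is enough to make the plain sum uninformative.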
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
NumPy does provide some special aggregations that will ignore these missing values:
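A minimal sketch of the NaN-aware aggregations, assuming `vals2` is a small float array with one missing entry (the values are illustrative only):

```python
import numpy as np

# Hypothetical sample array with one missing value.
vals2 = np.array([1, np.nan, 3, 4])

# The nan* variants skip NaN entries instead of propagating them.
nan_total = np.nansum(vals2)
nan_low = np.nanmin(vals2)
nan_high = np.nanmax(vals2)

print(nan_total, nan_low, nan_high)
```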
data.isnull()
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example:
data[data.notnull()]
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example, the mask from notnull() can be used directly as an index to keep only the non-null entries:
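The two detection methods can be sketched together as follows. The Series `data` and its mixed contents are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical Series mixing numbers, a string, NaN and None.
data = pd.Series([1, np.nan, 'hello', None])

# isnull() marks both NaN and None as missing.
mask = data.isnull()

# notnull() is the inverse mask; indexing with it keeps the valid entries.
kept = data[data.notnull()]

print(mask.tolist())
print(kept.tolist())
```

Note that the filtered result keeps the original index labels (here 0 and 2) rather than renumbering them.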
data.dropna()
In addition to the masking used before, there are the convenience methods, dropna() (which removes NA values) and fillna() (which fills in NA values). For a Series, the result is straightforward:
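For a Series, dropna() can be sketched as below. The sample values are assumptions; the point is only that the call returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries.
data = pd.Series([1.0, np.nan, 3.0, None])

# dropna() returns a new Series without the NA values.
cleaned = data.dropna()

print(cleaned.tolist())
print(len(data))  # the original Series still has all four entries
```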
df.dropna(axis='columns')
Alternatively, you can drop NA values along a different axis; axis=1 drops all columns containing a null value:
df.dropna(axis='columns', how='all')
The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values:
df.dropna(axis='rows', thresh=3)
For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:
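The axis, how and thresh options on the preceding cards can be compared on one small DataFrame. The frame `df` below is an assumed example, with an all-NaN column added so that how='all' has something to remove.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame with two missing cells.
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Drop every column that contains at least one null (how='any' is the default).
cols_any = df.dropna(axis='columns')

# Add a column that is entirely null, then drop only all-null columns.
df[3] = np.nan
cols_all = df.dropna(axis='columns', how='all')

# Keep only rows that have at least 3 non-null values.
rows_thresh = df.dropna(axis='rows', thresh=3)

print(list(cols_any.columns))
print(list(cols_all.columns))
print(list(rows_thresh.index))
```

Here only column 2 survives the default column drop, how='all' removes just the added column 3, and thresh=3 keeps only the middle row.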
data.fillna(0)
We can fill NA entries with a single value, such as zero:
data.fillna(method='ffill')  # forward-fill
data.fillna(method='bfill')  # back-fill
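We can specify a forward-fill to propagate the previous value forward, or a back-fill to propagate the next value backward. In recent pandas versions the method keyword of fillna() is deprecated, and data.ffill() and data.bfill() are the preferred spellings: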
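The three fill strategies can be compared side by side. This is a sketch on an assumed Series, and it uses the ffill() and bfill() methods rather than fillna(method=...), since the method keyword is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two gaps.
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Replace every NA entry with a constant.
zero_filled = data.fillna(0)

# Forward-fill: each gap takes the previous valid value.
forward = data.ffill()

# Back-fill: each gap takes the next valid value.
backward = data.bfill()

print(zero_filled.tolist())
print(forward.tolist())
print(backward.tolist())
```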
df.fillna(method='ffill', axis=1)
For DataFrames, the options are similar, but we can also specify an axis along which the fills take place:
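A short sketch of filling along an axis, using an assumed DataFrame and the ffill() method (equivalent in effect to the fillna(method='ffill', axis=1) call on the card):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the last column is entirely missing.
df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, 3, 5, np.nan],
                   [np.nan, 4, 6, np.nan]])

# axis=1 fills across each row, left to right.
filled = df.ffill(axis=1)

print(filled)
```

A gap with no earlier value in its row, such as the first cell of the last row, stays NaN because there is nothing to carry forward.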