Exploratory Data Analysis Flashcards
Correct syntax for numpy, pandas, matplotlib and seaborn
Select columns (.loc) from DataFrame with ALL non-zeros
df.loc[:, df.all()]
Select columns (.loc) from DataFrame with ANY non-zeros
df.loc[:, df.any()]
Select columns (.loc) from DataFrame with ANY NaNs
df.loc[:, df.isnull().any()]
Select columns (.loc) from DataFrame with NO NaNs
df.loc[:, df.notnull().all()]
Drop rows with ANY NaNs
df.dropna(how='any')
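The selection idioms above can be checked on a tiny hypothetical frame (the column names a/b/c are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'a' is all non-zero, 'b' contains a zero, 'c' contains a NaN
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [0, 5, 6],
                   'c': [7.0, np.nan, 9.0]})

all_nonzero = df.loc[:, df.all()]          # drops 'b' (it has a zero)
with_nans = df.loc[:, df.isnull().any()]   # keeps only 'c'
no_nans = df.loc[:, df.notnull().all()]    # keeps 'a' and 'b'
clean_rows = df.dropna(how='any')          # drops the row containing the NaN
```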
Import local file.xlsx using pandas (as data)
- data = pd.ExcelFile('file.xlsx')
- print(data.sheet_names)
- df = data.parse('sheetname')  # or data.parse(0)
Initial GET requests using urllib.
6 lines : import statements, url, request, response, read and close.
- from urllib.request import urlopen, Request
- url = 'https://www.wikipedia.org'
- request = Request(url)
- response = urlopen(request)
- html = response.read()
- response.close()
Initial GET requests using requests.
4 lines : import, url, request, read.
- import requests
- url = 'https://www.wikipedia.org'
- r = requests.get(url)
- text = r.text
Tidy Data: Principles (3)
- Columns represent separate variables containing values
- Rows represent individual observations
- Observational units form tables
Tidy Data: Melting and Pivoting
Melting turns report-friendly (wide) data into analysis-friendly (long) data; pivoting does the reverse.
Melting: turn columns into rows.
Pivoting: turn unique values into separate columns.
Tidy Data: Melting syntax
pd.melt(frame=df, id_vars='col-2b-fixed', value_vars=[' ', ' '], var_name='name', value_name='name')
Tidy Data: Pivoting syntax
df.pivot_table(values=' ', index=' ', columns=' ', aggfunc=np.mean)
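A round-trip sketch of both calls on a made-up wide table (the city/year names are hypothetical):

```python
import numpy as np
import pandas as pd

# Report-friendly wide table: one column per year
wide = pd.DataFrame({'city': ['NYC', 'LA'],
                     'y2019': [10, 20],
                     'y2020': [30, 40]})

# Melt: year columns become rows (analysis-friendly)
long = pd.melt(frame=wide, id_vars='city', value_vars=['y2019', 'y2020'],
               var_name='year', value_name='count')

# Pivot: unique 'year' values become columns again
back = long.pivot_table(values='count', index='city', columns='year',
                        aggfunc=np.mean)
```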
Change column-type from ‘object’ to ‘numeric’
df['object_col'] = pd.to_numeric(df['object_col'], errors='coerce')
Change column-type to ‘category’
df['column'] = df['column'].astype('category')
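A short check of the coercion behaviour (the 'bad' entry is a deliberately unparseable value):

```python
import pandas as pd

df = pd.DataFrame({'object_col': ['1', '2', 'bad', '4']})
df['object_col'] = pd.to_numeric(df['object_col'], errors='coerce')
# the unparseable 'bad' becomes NaN; the rest become numeric
```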
Plot idioms for DataFrames (3)
df.plot(kind='hist')
plt.hist(df['column'])
df.hist()
Syntax for .loc accessor
df.loc['Row_Label', 'Col_Label']  # single call preferred over chained df.loc['Row_Label']['Col_Label']
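A minimal sketch with made-up labels showing that the single call and the chained lookup return the same scalar:

```python
import pandas as pd

df = pd.DataFrame({'Col_Label': [1, 2]}, index=['Row_Label', 'Other_Row'])

chained = df.loc['Row_Label']['Col_Label']  # works, but chained indexing
single = df.loc['Row_Label', 'Col_Label']   # preferred: one call, row and column labels
```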
Syntax: sns barplot - use.
sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
Syntax: sns countplot - use.
sns.countplot(x='column', data=df, hue='category')
Syntax: sns histogram plot - use.
sns.distplot(df['continuous'], kde=False, bins=30)
Syntax: sns scatterplot - use.
sns.lmplot(x='numerical', y='numerical', data=df, hue='category', fit_reg=False)
Syntax: sns pairplot - use.
sns.pairplot(df, hue='categorical', palette='coolwarm')
sns categorical plots (6)
sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
sns.boxplot(x='categorical', y='numerical', data=df, hue='categorical')
sns.countplot(x='categorical', data=df)
sns.factorplot(x='categorical', y='numerical', data=df, kind='bar')  # or kind='point' / kind='violin'
sns.stripplot(x='categorical', y='numerical', data=df, hue='categorical', jitter=True, dodge=True)
sns.violinplot(x='categorical', y='numerical', data=df, hue='categorical', split=True)
Syntax: sns heatmap - use.
sns.heatmap(PivotTable, cmap=' ', linewidths=' ', linecolor=' ')
sns distribution plots (4)
sns.distplot(df['continuous'], kde=False, bins=30)
sns.jointplot(x='continuous', y='numerical', data=df)
sns.pairplot(df, hue='categorical', palette='coolwarm')
sns.rugplot(df['continuous'])
sns matrix plots (2)
sns.heatmap(PivotTable, cmap=' ', linewidths=' ', linecolor=' ')
sns.heatmap(df.corr(), annot=True)
sns.clustermap(PivotTable, cmap=' ')
sns PairGrid
g = sns.PairGrid(df)
g.map(plt.scatter)
g.map_diag(sns.distplot)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
sns FacetGrid
g = sns.FacetGrid(data=df, col='category', row='category')
g.map(sns.distplot, 'numerical')
sns lmplot (regression)
sns.lmplot(x='numerical', y='numerical', data=df, hue='category', markers=['o', 'v'], scatter_kws={'s': 100})
Scatter plots
df.plot.scatter(x='col_A', y='col_B', c='col_C', s=df['col_C'] * 100)
plt.scatter(x, y)
sns.lmplot(x='numerical', y='numerical', data=df, hue='category', fit_reg=False)
df.iplot(kind='scatter', x='col_A', y='col_B', mode='markers')  # cufflinks/plotly
Data Quality Control
- find missing data
- drop columns
- drop rows
df.isnull()  # returns bool DataFrame; NaN = True
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Histogram plots
df['continuous'].hist(bins=20)
df[df['column1'] == 1]['column2'].hist(bins=30, color='blue', label='label', alpha=0.5)
Reproducible Data Analysis 1/10
Get data from web
Jake Vanderplas
- from urllib.request import urlretrieve
- URL = 'https://data.seattle.gov/api....'
- urlretrieve(URL, 'file.csv')
- data = pd.read_csv('file.csv', index_col='Date', parse_dates=True)
- data.resample('W').sum().plot()
Reproducible Data Analysis 2/10
Exploratory Data Analysis
Jake Vanderplas
data.columns = ['West', 'East']
data['Total'] = data['West'] + data['East']
ax = data.resample('D').sum().rolling(365).sum().plot()
ax.set_ylim(0, None)
data.groupby(data.index.time).mean().plot()
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
pivoted.plot(legend=False, alpha=0.01)  # line for every day
Reproducible Data Analysis 3/10
Version control with Git & GitHub
Jake Vanderplas
- https://github.com : create a new repository (name, description, README, .gitignore (Python), MIT License)
- copy the "Clone or download" link
- in a terminal window:
- git clone [pasted link]
- mv JupyterNotebook.ipynb into the cloned folder
- git status
- git add JupyterNotebook.ipynb
- git commit -m "Add initial analysis"
- git push origin master
- open JupyterNotebook.ipynb from the cloned location
- git status now shows file.csv as untracked
- vim .gitignore : add a "# data" comment line and file.csv so the data file is not tracked
Reproducible Data Analysis 4/10
Working with Data and GitHub
Jake Vanderplas
import os
from urllib.request import urlretrieve
URL = 'https://data.seattle.gov/api....'
def get_file(filename='file.csv', url=URL, force_download=False):
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
    data = pd.read_csv(filename, index_col='Date', parse_dates=True)
    data.columns = ['West', 'East']
    data['Total'] = data['West'] + data['East']
    return data
data = get_file()
Reproducible Data Analysis 5/10
Creating a Python package
Jake Vanderplas
Terminal window
mkdir jupyterworkflow
touch jupyterworkflow/__init__.py
vim jupyterworkflow/data.py
""" Download and cache the data Parameters: filename : string (optional) location to save the data url : string (optional) web location force_download : bool (optional) Returns """ < replace 4/10 with following > from jupyterworkflow.data import get_file
Confusion matrix
True pos (tp). False pos (fp).
False neg (fn ). True neg (tn).
Confusion matrix
Accuracy =
Fraction of correct predictions
Accuracy = correct / total
= (tp + tn) / (tp + fp + tn + fn)
Confusion matrix
Precision =
How accurate positive predictions were.
tp / (tp + fp)
Confusion matrix
Recall
What fraction of positives the model identified.
tp / (tp + fn)
Confusion matrix
F1 score
The harmonic mean of precision and recall; it lies between them.
2 * precision * recall / (precision + recall)
Model trade-off between precision and recall.
Predicting "yes" too often gives high fp: high recall, low precision.
Predicting "yes" too rarely gives high fn: low recall, high precision.
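The four metric formulas above, computed from hypothetical confusion-matrix counts:

```python
# Hypothetical counts from a confusion matrix
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + tn + fn)          # fraction of correct predictions
precision = tp / (tp + fp)                          # accuracy of positive predictions
recall = tp / (tp + fn)                             # fraction of positives identified
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, lies between them
```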
Input feature categories Naive Bayes classifier
Suited to yes or no features
Input feature categories Regression models
Numerical features
Input feature categories Decision tree
Numeric or categorical features
Input feature categories SVM
Numerical features
Common way to analyze the relationship between a categorical feature and a continuous feature
Boxplot
Check for null values in the dataset.
print(df.isnull().values.sum())
Check column-wise distribution of null values
print(df.isnull().sum())
Frequency distribution of categories within a feature
print(df['category_col'].value_counts())
Dictionary comprehension to map category strings to numeric values.
e.g.
{'carrier': {'AA': 1, 'OO': 7, 'DL': 4, 'F9': 5, 'B6': 3, 'US': 9, 'AS': 2, 'WN': 11, 'VX': 10, 'HA': 6, 'UA': 8}}
labels = df['category_col'].astype('category').cat.categories.tolist()
replace = {'category_col': {k: v for k, v in zip(labels, range(1, len(labels) + 1))}}
print(replace)
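The same mapping applied to a small made-up column (categories sort alphabetically, so 'a' maps to 1):

```python
import pandas as pd

df = pd.DataFrame({'category_col': ['b', 'a', 'c', 'a']})
labels = df['category_col'].astype('category').cat.categories.tolist()  # ['a', 'b', 'c']
replace = {'category_col': {k: v for k, v in zip(labels, range(1, len(labels) + 1))}}
df = df.replace(replace)  # 'a' -> 1, 'b' -> 2, 'c' -> 3
```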
Categorical Data - 3 types
Nominal: no intrinsic order.
Ordinal: ordered or ranked.
Dichotomous: nominal with only two categories.
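A sketch of encoding an ordinal feature in pandas (the size labels are hypothetical); a nominal feature would use ordered=False:

```python
import pandas as pd

# Ordinal: the categories carry an explicit order
sizes = pd.Series(['small', 'large', 'medium']).astype(
    pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True))
# ordered categoricals support comparisons and min/max
```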