Exploratory Data Analysis Flashcards
Correct syntax for numpy, pandas, matplotlib and seaborn
Select columns (.loc) from DataFrame with ALL non-zeros
df.loc[:, df.all()]
Select columns (.loc) from DataFrame with ANY non-zeros
df.loc[:, df.any()]
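A minimal sketch of the two cards above. The DataFrame and its column names (a, b, c) are invented for illustration only.

```python
import pandas as pd

# Hypothetical data: "a" has no zeros, "b" has some zeros, "c" is all zeros.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [0, 5, 0],
    "c": [0, 0, 0],
})

# df.all() is True for a column only if every value is non-zero (truthy).
all_nonzero = df.loc[:, df.all()]

# df.any() is True for a column if at least one value is non-zero.
any_nonzero = df.loc[:, df.any()]

print(list(all_nonzero.columns))
print(list(any_nonzero.columns))
```

Both df.all() and df.any() reduce along the rows by default, so each returns one boolean per column, which .loc then uses as a column mask.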
Select columns (.loc) from DataFrame with ANY NaNs
df.loc[:, df.isnull().any()]
Select columns (.loc) from DataFrame with NO NaNs
df.loc[:, df.notnull().all()]
Drop rows with ANY NaNs
df.dropna(how='any')
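The three NaN cards above can be tried together on a small frame. This is only a sketch; the columns x and y are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "x" contains one NaN, "y" is complete.
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [4.0, 5.0, 6.0],
})

# Columns with at least one NaN.
with_nans = df.loc[:, df.isnull().any()]

# Columns with no NaNs at all.
without_nans = df.loc[:, df.notnull().all()]

# Rows that contain no NaN in any column (returns a new DataFrame).
complete_rows = df.dropna(how="any")

print(list(with_nans.columns))
print(list(without_nans.columns))
print(list(complete_rows.index))
```

Note that dropna does not modify df in place unless asked to, so the result has to be assigned to a name.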
Import local file.xlsx using pandas (as data)
data = pd.ExcelFile('file.xlsx') # print(data.sheet_names) # df = data.parse('sheetname') or data.parse(0)
Initial GET requests using urllib.
6 lines: import statements, url, request, response, read and close.
- from urllib.request import urlopen, Request
- url = "https://www.wikipedia.org"
- request = Request(url)
- response = urlopen(request)
- html = response.read()
- response.close()
Initial GET requests using requests.
4 lines: import, url, request, read.
- import requests
- url = "https://www.wikipedia.org"
- r = requests.get(url)
- text = r.text
Tidy Data: Principles (3)
- Columns represent separate variables containing values
- Rows represent individual observations
- Observational units form tables
Tidy Data: Melting and Pivoting
Melting: turn columns into rows (report-friendly wide data becomes analysis-friendly long data).
Pivoting: turn unique values into separate columns (analysis-friendly long data becomes report-friendly wide data).
Tidy Data: Melting syntax
pd.melt(frame=df, id_vars='col-2b-fixed', value_vars=[' ', ' '], var_name='name', value_name='name')
Tidy Data: Pivoting syntax
df.pivot_table(values=' ', index=' ', columns=' ', aggfunc=np.mean)
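A small round trip may make the melt and pivot cards easier to remember. This is a sketch on invented data (the city, jan, jul, month and temp names are all hypothetical), and it passes aggfunc as the string 'mean', which is equivalent here to np.mean.

```python
import pandas as pd

# Hypothetical wide (report-friendly) table: one column per month.
wide = pd.DataFrame({
    "city": ["Oslo", "Rome"],
    "jan": [-3, 8],
    "jul": [17, 25],
})

# Melt: the month columns become rows, giving one observation per row.
long = pd.melt(
    frame=wide,
    id_vars="city",              # column to keep fixed
    value_vars=["jan", "jul"],   # columns to turn into rows
    var_name="month",            # name for the former column headers
    value_name="temp",           # name for the former cell values
)

# Pivot: unique values of "month" become separate columns again.
back = long.pivot_table(values="temp", index="city", columns="month", aggfunc="mean")

print(long)
print(back)
```

pivot_table aggregates duplicates with aggfunc, which is why it needs one; with a single value per city and month the mean is simply that value.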
Change column-type from 'object' to 'numeric'
df['object_col'] = pd.to_numeric(df['object_col'], errors='coerce')
Change column-type to 'category'
df['column'] = df['column'].astype('category')
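Both type conversions can be checked on a toy frame. The column names price and size and their values are assumptions made up for this sketch.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, plus a repeated label column.
df = pd.DataFrame({
    "price": ["10", "12.5", "n/a"],
    "size": ["S", "M", "S"],
})

# errors="coerce" turns values that cannot be parsed (here "n/a") into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# astype returns a new Series, so assign it back to keep the change.
df["size"] = df["size"].astype("category")

print(df.dtypes)
print(df["price"].isna().sum())
```

Without the assignment, astype('category') leaves the original column unchanged.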
Plot idioms for DataFrames (3)
df.plot(kind='hist')
plt.hist(df['column'])
df.hist()
Syntax for .loc accessor
df.loc['Row_Label', 'Col_Label']
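A short sketch of label-based lookup with .loc, using an invented frame (the row labels ann and bob and the columns height and weight are hypothetical).

```python
import pandas as pd

# Hypothetical data indexed by row label.
df = pd.DataFrame(
    {"height": [1.7, 1.8], "weight": [65, 80]},
    index=["ann", "bob"],
)

# Row label first, then column label, inside a single pair of brackets.
value = df.loc["ann", "weight"]

# The same form works for assignment.
df.loc["ann", "weight"] = 66

print(value)
print(df)
```

Passing both labels in one call avoids chained indexing such as df.loc[row][col], which can behave unexpectedly when assigning.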
Syntax: sns barplot - use.
sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
Syntax: sns countplot - use.
sns.countplot(x='column', data=df, hue='category')
Syntax: sns histogram plot - use.
sns.histplot(df['continuous'], kde=False, bins=30) # sns.distplot is deprecated since seaborn 0.11
Syntax: sns scatterplot - use.
sns.lmplot(x='numerical', y='numerical', data=df, hue='category', fit_reg=False)
Syntax: sns pairplot - use.
sns.pairplot(df, hue='categorical', palette='coolwarm')