Python & Plots Flashcards

1
Q

Package

A

A collection of modules

2
Q

Library

A

A collection of packages

3
Q

Module

A

A bunch of related code saved in a file

4
Q

Framework

A

A collection of modules and packages that provides the basic flow and architecture of an application

5
Q

Pandas

A

Open-source Python package used to manipulate and analyse tabular data. Built on NumPy.

6
Q

Scatter plots

A

Great for viewing unordered data points

inflation_unemploy.plot(kind='scatter', x='unemployment_rate', y='cpi')

sns.scatterplot(x = "age", y = "value", size = "mpg", data = valuation)

sns.jointplot(x = 'age', y = 'value', data = valuation)

7
Q

Line plots

A

Great for viewing ordered data points

dow_bond.plot(kind='line', x='date',y=['close_dow', 'close_bond'], rot=90)
8
Q

Bar Charts

A

Great for viewing categorical data

  • Bar plots are a poor fit for a logarithmic axis: bars need to start at 0, and the log of 0 is undefined.

Horizontal Bar Plots

df.plot.barh(x='lab', y='val')

OR

sns.barplot(x="val", y="lab", data=df)

9
Q

Histogram plot

A

Great for visualising the distribution of values in a data set.

The data is split into bins, and each bar shows how many values fall into that bin.

dog_pack[dog_pack["sex"] == "F"]["height_cm"].hist()

To draw multiple histograms
~~~
dogs[["height_cm", "weight_kg"]].hist()
~~~

10
Q

Series

A

A one-dimensional labelled array; a DataFrame is made up of one or more Series (its columns)

11
Q

Pandas LOC

A

df.loc[row_label]

df.loc[[rows], [cols]]

A single bracket gives you a Series and a double bracket gives you a DataFrame
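
The single-versus-double bracket rule can be checked on a tiny table. This is a minimal sketch; the DataFrame and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame(
    {"height_cm": [22, 59], "weight_kg": [10, 25]},
    index=["Ginger", "Scout"],
)

row_as_series = df.loc["Ginger"]                  # single brackets give a Series
row_as_frame = df.loc[["Ginger"]]                 # double brackets give a DataFrame
cell_block = df.loc[["Ginger"], ["weight_kg"]]    # row list and column list give a DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
print(cell_block)
```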

12
Q

Pandas iloc

A

df.iloc[[1]]
Used for integer-location based indexing.

print(df.iloc[:, 1:])
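
A small sketch of position-based selection, assuming a made-up three-column DataFrame. Positions start at 0, so `1:` skips the first column.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

second_row = df.iloc[[1]]            # DataFrame holding the row at position 1
without_first_col = df.iloc[:, 1:]   # all rows, columns from position 1 onward

print(second_row)
print(without_first_col)
```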

13
Q

Box plot

A

Used to compare the distribution of continuous variables for each category 

  • Answers questions about the spread of variables.
  • In a box plot, sorting by the IQR makes it easier to answer questions about how much variation there was among the “typical” population.
14
Q

Numpy Comparisons

A
* np.logical_and()
* np.logical_or()
* np.logical_not()

np.logical_and(bmi > 21, bmi < 22)
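
The boolean array that comes back is usually used as a mask to select elements. A minimal sketch, with invented `bmi` values:

```python
import numpy as np

# Hypothetical BMI values, used only for illustration
bmi = np.array([21.85, 20.97, 21.75, 24.74, 21.44])

mask = np.logical_and(bmi > 21, bmi < 22)   # element-wise "and" of two boolean arrays
selected = bmi[mask]                        # keep only the values where the mask is True

print(mask)
print(selected)
```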

15
Q

Enumerate a list

A
fam = [1.73, 1.68, 1.71, 1.89]
for index, height in enumerate(fam):
    print("index " + str(index) + ": " + str(height))
16
Q

Looping in dictionaries

A
world = {"afghanistan": 30.55,
         "albania": 2.77,
         "algeria": 39.21}

for key, value in world.items():
    print(key + " -- " + str(value))
17
Q

Looping 2d arrays

A
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
meas = np.array([np_height, np_weight])

for val in np.nditer(meas):
    print(val)
18
Q

Looping pandas df

A
import pandas as pd
brics = pd.read_csv("brics.csv", index_col=0)

for lab, row in brics.iterrows():
    print(lab)
    print(row)
19
Q

Pandas apply

A
  • Can be used to add a new column by applying a function to each value; it is usually more efficient than an explicit loop
import pandas as pd
brics = pd.read_csv("brics.csv", index_col=0)

brics["name_length"] = brics["country"].apply(len)
print(brics)
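
To see that `apply` gives the same result as a loop, here is a small sketch that builds the table in memory instead of reading `brics.csv`. The rows and index labels are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for brics.csv
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Loop version: visit every row and collect the lengths by hand
lengths = []
for lab, row in brics.iterrows():
    lengths.append(len(row["country"]))
brics["name_length_loop"] = lengths

# apply version: one call, same result
brics["name_length"] = brics["country"].apply(len)

print(brics)
```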
20
Q

In pandas, what do the following methods and attributes do?

  • .head()
  • .info()
  • .shape
  • .describe()
A
  • .head() returns the first few rows (the “head” of the DataFrame).
  • .info() shows information on each of the columns, such as the data type and number of missing values.
  • .shape returns the number of rows and columns of the DataFrame.
  • .describe() calculates a few summary statistics for each column.
21
Q

In pandas, what do the following attributes return?

  • .values
  • .columns
  • .index
A
  • .values: A two-dimensional NumPy array of values.
  • .columns: An index of columns: the column names.
  • .index: An index for the rows: either row numbers or row names.
22
Q

How do you drop duplicates in pandas?

A
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
23
Q

How do you count values in a column in pandas?

A
unique_dogs["breed"].value_counts()
unique_dogs["breed"].value_counts(sort=True)
s.value_counts(normalize=True)  # returns each value's proportion of the total
s.value_counts(normalize=True).sort_index()

normalize=True turns the counts into proportions that sum to 1; multiply by 100 to get percentages.
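
A minimal sketch of counts versus proportions, using an invented Series of breeds:

```python
import pandas as pd

# Hypothetical data, used only for illustration
breeds = pd.Series(["Labrador", "Poodle", "Labrador", "Chow Chow", "Labrador", "Poodle"])

counts = breeds.value_counts()                      # absolute counts, largest first
proportions = breeds.value_counts(normalize=True)   # fractions of the total
percentages = proportions * 100                     # same information as percentages

print(counts)
print(percentages)
```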

24
Q

How do you sort values in a column in pandas?

A
df.sort_values("breed")
df.sort_values(["breed", "weight_kg"])

result = df.sort_values('salary', ascending=True)

DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
25
Q

How to filter rows in pandas

A
test = test[test['state'].isin(canu)]
26
Q

How to use the agg function in pandas

A
def pct30(column):
    return column.quantile(0.3)

def pct40(column):
    return column.quantile(0.4)

dogs["weight_kg"].agg(pct30)
dogs["weight_kg"].agg([pct30, pct40])
dogs[["weight_kg", "height_cm"]].agg(pct30)

Group the results by title then count the number of accounts
counted_df = licenses_owners.groupby('title').agg({'account': 'count'})
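
The calls above can be tried end to end on a small table. This is a sketch only; the `dogs` rows are invented for illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration
dogs = pd.DataFrame({
    "breed": ["Poodle", "Poodle", "Beagle", "Beagle", "Boxer", "Boxer"],
    "weight_kg": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

def pct30(column):
    # 30th percentile of a column
    return column.quantile(0.3)

def pct40(column):
    # 40th percentile of a column
    return column.quantile(0.4)

single = dogs["weight_kg"].agg(pct30)              # one custom statistic
several = dogs["weight_kg"].agg([pct30, pct40])    # several at once, labelled by function name
per_breed = dogs.groupby("breed").agg({"weight_kg": "count"})  # dict maps column to aggregation

print(single)
print(several)
print(per_breed)
```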
27
Q

subset + aggregation in pandas

A

Count the number of rows in the budget column that are missing
number_of_missing_fin = movies_financials['budget'].isnull().sum()

sales_C = sales[sales['type'] == "C"]['weekly_sales'].sum()

subset without aggregation:
avocados[avocados["type"] == "conventional"]["avg_price"]
Returns only the avg_price column for rows of the matching type

28
Q

How to group by in pandas

A
avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
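
A short sketch of the same group-by on invented data, to show that the result is a Series indexed by the grouping column:

```python
import pandas as pd

# Hypothetical data, used only for illustration
dog_pack = pd.DataFrame({
    "breed": ["Beagle", "Beagle", "Boxer", "Boxer"],
    "weight_kg": [10.0, 12.0, 30.0, 34.0],
})

# Split by breed, pick one column, then reduce each group to its mean
avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()

print(avg_weight_by_breed)
```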
29
Q

How do you detect missing values in a df

A
df.isna().any()

Plotting missing values
~~~
import matplotlib.pyplot as plt
dogs.isna().sum().plot(kind="bar")
plt.show()
~~~

Dealing with missing values
~~~
dogs.dropna()
dogs.fillna(0)
~~~
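
The detection and handling calls can be compared side by side on a small table with gaps. A minimal sketch; the `dogs` rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values
dogs = pd.DataFrame({
    "name": ["Ginger", "Scout", "Bella"],
    "weight_kg": [10.0, np.nan, 25.0],
    "height_cm": [22.0, 59.0, np.nan],
})

has_missing = dogs.isna().any()          # True for each column with at least one NaN
missing_per_column = dogs.isna().sum()   # number of NaNs per column
complete_rows = dogs.dropna()            # drop every row that has any NaN
filled = dogs.fillna(0)                  # replace NaNs with 0

print(missing_per_column)
print(complete_rows)
```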

30
Q

Create a df by using list of dictionaries - row by row

A
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
     "weight_kg": 10, "date_of_birth": "2019-03-14"},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59,
     "weight_kg": 25, "date_of_birth": "2019-05-09"}
]

new_dogs = pd.DataFrame(list_of_dicts)
print(new_dogs)
31
Q

Create a df by using dictionary of lists - by column

A
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10, 25],
    "date_of_birth": ["2019-03-14",
                      "2019-05-09"]
}
new_dogs = pd.DataFrame(dict_of_lists)
32
Q

Join/Merging in Pandas

A
wards_census = wards.merge(census, on='ward', suffixes=('_ward', '_cen'))

Multiple tables
~~~
grants_licenses_ward = grants.merge(licenses, on=['address', 'zip']) \
    .merge(wards, on='ward', suffixes=('_bus', '_ward'))
grants_licenses_ward.head()
~~~

Merge with left join
~~~
movies_taglines = movies.merge(taglines, on='id', how='left')
print(movies_taglines.head())
~~~
Different column names

movies_and_scifi_only = movies.merge(scifi_only, how='inner',
                                     left_on='id', right_on='movie_id')

An indicator column (_merge) shows the source of each row and what the tables have in common (left_only, both, right_only)
~~~
genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)
~~~
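
A small sketch of a left join with the indicator switched on. The two tables are invented for illustration; only `merge`, `how` and `indicator` come from the card.

```python
import pandas as pd

# Hypothetical tables, used only for illustration
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
taglines = pd.DataFrame({"id": [1, 3, 4], "tagline": ["x", "y", "z"]})

inner = movies.merge(taglines, on="id")                               # default: inner join
left = movies.merge(taglines, on="id", how="left", indicator=True)    # keep every movie

# Rows that exist only in the left table have no tagline
left_only = left[left["_merge"] == "left_only"]

print(left)
```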

33
Q

Concat Dataframes

A
pd.concat([dfA, dfB], ignore_index=True)

Concat tables with different column names
~~~
pd.concat([inv_jan, inv_feb], sort=True)
~~~

OR
pd.concat([inv_jan, inv_feb], join='inner')
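
The difference between the two calls shows up when the tables do not share all columns. A minimal sketch with invented invoice tables (the column names are assumptions):

```python
import pandas as pd

# Hypothetical tables: inv_feb has one extra column
inv_jan = pd.DataFrame({"iid": [1, 2], "total": [10.0, 20.0]})
inv_feb = pd.DataFrame({"iid": [3], "total": [30.0], "bill_ctry": ["US"]})

# Default outer join keeps every column; sort=True orders the column names
outer = pd.concat([inv_jan, inv_feb], sort=True, ignore_index=True)

# join='inner' keeps only the columns present in every table
inner = pd.concat([inv_jan, inv_feb], join="inner", ignore_index=True)

print(outer)
print(inner)
```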

34
Q

Appending Tables

A
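inv_jan.append([inv_feb, inv_mar],
               ignore_index=True, sort=True)

DataFrame.append was removed in pandas 2.0; on current versions use pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=True, sort=True) instead.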
35
Q

Validating merges in pandas

A

Validating merges
.merge(validate=None)
one_to_one, one_to_many, many_to_one, many_to_many

Verifying concatenations
pd.concat(verify_integrity=False):
* Check whether the new concatenated index contains duplicates
* Default value is False
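
Both checks raise an error instead of silently producing a bad table. A minimal sketch with invented tables; the outcome strings are just labels for this example.

```python
import pandas as pd

# Hypothetical tables: ward 2 appears twice in census
wards = pd.DataFrame({"ward": [1, 2], "alderman": ["A", "B"]})
census = pd.DataFrame({"ward": [1, 2, 2], "pop": [100, 200, 250]})

try:
    wards.merge(census, on="ward", validate="one_to_one")
    merge_outcome = "merged"
except pd.errors.MergeError:
    merge_outcome = "MergeError"   # the right table has duplicate keys

# The relationship really is one-to-many, so this check passes
one_to_many = wards.merge(census, on="ward", validate="one_to_many")

# Two frames that share index label 0
a = pd.DataFrame({"x": [1]}, index=[0])
b = pd.DataFrame({"x": [2]}, index=[0])
try:
    pd.concat([a, b], verify_integrity=True)
    concat_outcome = "concatenated"
except ValueError:
    concat_outcome = "ValueError"  # duplicate index values detected

print(merge_outcome, concat_outcome)
```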

36
Q

How to check if a column in dfA has the same values of a column in dfB - Filtering

A

popular_classic = classic_18_19[classic_18_19['tid'].isin(classic_pop['tid'])]

37
Q

merge_ordered()

A
  • Great for time series data
  • Column(s) to join on
    • on, left_on, and right_on
  • Type of join
    • how (left, right, inner, outer)
    • default: outer
  • Overlapping column names
    • suffixes
  • Calling the function
    • pd.merge_ordered(df1, df2)
  • fill_method
    • 'ffill' (forward fill) fills a missing value with the value from the previous row
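
The points above can be sketched on two tiny time series. The tables and values are invented for illustration.

```python
import pandas as pd

# Hypothetical quarterly and monthly series
gdp = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-04-01"]),
    "gdp": [100.0, 98.0],
})
sp500 = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-04-01"]),
    "close": [3200.0, 3300.0, 2900.0],
})

# Default outer join, sorted by the key: February has no gdp value
merged = pd.merge_ordered(gdp, sp500, on="date")

# Forward fill carries the January gdp value into February
filled = pd.merge_ordered(gdp, sp500, on="date", fill_method="ffill")

print(merged)
print(filled)
```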
38
Q

Using merge_asof()

A
  • Similar to a merge_ordered() left join
  • Matches on the nearest key value rather than requiring exact matches.
  • Merged “on” columns must be sorted.
pd.merge_asof(visa, ibm, on='date_time',
              suffixes=('_visa', '_ibm'), direction='forward')

The default direction is 'backward', but you can also choose 'forward' or 'nearest'
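
A sketch of how the three directions pick different rows. The timestamps and prices are invented for illustration.

```python
import pandas as pd

# Hypothetical price tables, both sorted by date_time
visa = pd.DataFrame({
    "date_time": pd.to_datetime(["2017-11-11 10:00", "2017-11-11 10:05"]),
    "close": [110.0, 111.0],
})
ibm = pd.DataFrame({
    "date_time": pd.to_datetime(["2017-11-11 09:58", "2017-11-11 10:03", "2017-11-11 10:06"]),
    "close": [150.0, 151.0, 152.0],
})

def asof(direction):
    # Match each visa row to an ibm row according to the chosen direction
    return pd.merge_asof(visa, ibm, on="date_time",
                         suffixes=("_visa", "_ibm"), direction=direction)

backward = asof("backward")   # last ibm row at or before each visa time
forward = asof("forward")     # first ibm row at or after each visa time
nearest = asof("nearest")     # closest ibm row in either direction

print(backward)
```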

39
Q

Query method in pandas

A
  • .query('SOME SELECTION STATEMENT')
  • Accepts an input string
  • Input string used to determine what rows are returned
  • Input string similar to statement after WHERE clause in SQL statement

stocks.query('nike >= 90')

stocks_long.query('stock=="disney" or (stock=="nike" and close < 90)')

Accessing a date column that is the index
recent_gdp_pop = gdp_pivot.query('date <= "1991-01-01"')
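
A minimal sketch of `query` on an invented long-format table. The `@threshold` line uses pandas' syntax for referring to a Python variable inside the query string; the variable name is an assumption for this example.

```python
import pandas as pd

# Hypothetical data, used only for illustration
stocks_long = pd.DataFrame({
    "stock": ["disney", "nike", "nike", "disney"],
    "close": [100, 85, 95, 120],
})

# Same shape as a SQL WHERE clause, with and/or and parentheses
picked = stocks_long.query('stock == "disney" or (stock == "nike" and close < 90)')

# Prefix a Python variable with @ to use it inside the string
threshold = 90
above = stocks_long.query("close >= @threshold")

print(picked)
print(above)
```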

40
Q

Pivot

A
import numpy as np
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

Filling missing values in pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)
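
A small sketch of both pivot tables on invented data. It passes the aggregation by name ("median") rather than as `np.median`, which pandas also accepts.

```python
import pandas as pd

# Hypothetical data: there is no black Poodle, so that cell would be NaN
dogs = pd.DataFrame({
    "color": ["Brown", "Brown", "Black", "Black"],
    "breed": ["Labrador", "Poodle", "Labrador", "Labrador"],
    "weight_kg": [20.0, 24.0, 30.0, 32.0],
})

# One row per color, aggregated with the median
median_by_color = dogs.pivot_table(values="weight_kg", index="color", aggfunc="median")

# Colors as rows, breeds as columns; the default aggregation is the mean
by_color_breed = dogs.pivot_table(values="weight_kg", index="color",
                                  columns="breed", fill_value=0)

print(median_by_color)
print(by_color_breed)
```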
41
Q

Melt

A
  • To make analysis of data in table easier, we can reshape the data into a more computer-friendly form
  • pandas.melt() unpivots a DataFrame from wide format to long format.

social_fin_tall = social_fin.melt(id_vars=['financial', 'company'])

You can restrict which columns are unpivoted with value_vars. Here only the 2018 and 2017 columns are melted; var_name and value_name rename the resulting columns
~~~
social_fin_tall = social_fin.melt(id_vars=['financial', 'company'],
                                  value_vars=['2018', '2017'],
                                  var_name='year', value_name='dollars')
~~~

Think of id_vars as the columns you want to keep as they are; the remaining columns are stacked into rows
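
A minimal sketch of the reshape on an invented wide table. The company name and figures are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per year
social_fin = pd.DataFrame({
    "financial": ["revenue", "net_income"],
    "company": ["acme", "acme"],
    "2018": [100, 20],
    "2017": [90, 15],
})

# Keep financial and company as they are; stack the year columns into rows
social_fin_tall = social_fin.melt(
    id_vars=["financial", "company"],
    value_vars=["2018", "2017"],
    var_name="year",
    value_name="dollars",
)

print(social_fin_tall)
```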

42
Q

uniform distribution

A

`from scipy.stats import uniform`

`uniform.cdf(7, 0, 12)`

It's used with continuous distributions, where a probability is the area under the density curve. Here it returns P(X <= 7) for a uniform distribution on the interval from 0 to 12.
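
To make the "area under the graph" idea concrete, here is a hand-written sketch of the same calculation without scipy. `uniform_cdf` is a hypothetical helper written for this card; it follows the `(x, loc, scale)` argument order used in the call above, where the distribution covers `loc` to `loc + scale`.

```python
def uniform_cdf(x, loc, scale):
    """P(X <= x) for X uniform on [loc, loc + scale] (illustrative helper)."""
    if x <= loc:
        return 0.0
    if x >= loc + scale:
        return 1.0
    # The density is a flat line of height 1/scale, so the area is a rectangle
    return (x - loc) / scale

p_at_most_7 = uniform_cdf(7, 0, 12)                            # area from 0 to 7
p_between_4_and_7 = uniform_cdf(7, 0, 12) - uniform_cdf(4, 0, 12)  # area from 4 to 7

print(p_at_most_7)
print(p_between_4_and_7)
```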

43
Q

seaborn plot

A
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x="total_bill", y="tip", data=tips,
                hue="smoker", hue_order=["Yes", "No"])
plt.show()

OR
~~~
import matplotlib.pyplot as plt
import seaborn as sns
hue_colors = {"Yes": "black",
              "No": "red"}
sns.scatterplot(x="total_bill", y="tip", data=tips,
                hue="smoker", palette=hue_colors, size="size")
plt.show()
~~~

Relationship plot and a few extra variables that it takes
~~~
import seaborn as sns
import matplotlib.pyplot as plt

sns.relplot(x="total_bill", y="tip", data=tips, kind="scatter",
            col="day", col_wrap=2,
            col_order=["Thur", "Fri", "Sat", "Sun"])
plt.show()
~~~

Different point style and transparency
~~~
# Set alpha to be between 0 and 1
sns.relplot(x="total_bill", y="tip", data=tips, kind="scatter", alpha=0.4)
~~~

44
Q

Seaborn line plots

A

Subgroups by location
~~~
import matplotlib.pyplot as plt
import seaborn as sns
sns.relplot(x="hour", y="NO_2_mean",
            data=air_df_loc_mean,
            kind="line",
            style="location",
            hue="location")
plt.show()
~~~

Multiple observations per x-value; this automatically plots a confidence interval
~~~
import matplotlib.pyplot as plt
import seaborn as sns
sns.relplot(x="hour", y="NO_2", data=air_df, kind="line")
plt.show()
~~~

Plotting standard deviation
~~~
import matplotlib.pyplot as plt
import seaborn as sns
# Newer seaborn releases use errorbar="sd" in place of ci="sd"
sns.relplot(x="hour", y="NO_2", data=air_df, kind="line", ci="sd")
plt.show()
~~~

45
Q

seaborn categorical plots

A

Count plot
~~~
import matplotlib.pyplot as plt
import seaborn as sns
category_order = ["No answer",
                  "Not at all",
                  "Not very",
                  "Somewhat",
                  "Very"]

sns.catplot(x="how_masculine",
            data=masculinity_data,
            kind="count",
            order=category_order)
plt.show()
~~~

Bar plots show the mean of a quantitative variable per category

Box plots
sym changes the appearance of outliers; sym="" hides them
~~~
g = sns.catplot(x="time", y="total_bill", data=tips, kind="box", sym="")
~~~

To change the whiskers (default is 1.5 * IQR):
whis=2, or whis=[5, 95] for the 5th and 95th percentiles

46
Q

Seaborn point plots

A
  • Points show mean of quantitative variable
  • Line plot has quantitative variable (usually time) on x-axis
  • Point plot has categorical variable on x-axis
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import median
sns.catplot(x="age",
            y="masculinity_important",
            data=masculinity_data,
            hue="feel_masculine",
            kind="point",
            join=False,
            estimator=median,  # the median is more robust to outliers
            capsize=0.2)
plt.show()
47
Q

Seaborn Styles and colors

A

sns.set_palette("RdBu")
sns.set_style('whitegrid')

Changing the scale
* Figure “context” changes the scale of the plot elements and labels
* sns.set_context()
* Smallest to largest: “paper”, “notebook”, “talk”, “poster”

FacetGrid vs. AxesSubplot objects
Seaborn plots create two different types of objects: FacetGrid and AxesSubplot
g = sns.scatterplot(x="height", y="weight", data=df)
type(g)
For a FacetGrid (relplot, catplot): g.fig.suptitle('title')
For an AxesSubplot (scatterplot, countplot, etc.): g.set_title('title')

Titles for subplots
~~~
g = sns.catplot(x="Region", y="Birthrate", data=gdp_data, kind="box", col="Group")
g.fig.suptitle("New Title", y=1.03)
g.set_titles("This is {col_name}")
~~~

Adding axis labels - works for both
~~~
g = sns.catplot(x="Region", y="Birthrate", data=gdp_data, kind="box")
g.set(xlabel="New X Label", ylabel="New Y Label")
plt.show()
~~~

Rotating x-axis tick labels
~~~
g = sns.catplot(x="Region", y="Birthrate", data=gdp_data, kind="box")
plt.xticks(rotation=90)
plt.show()
~~~

48
Q

Convert Series to string

A
jobs['roles'] = jobs['roles'].str.lower()
print(contact.email.str.split('@', expand = True))

~~~
print(s.str.startswith('re'))
~~~

49
Q

Pandas Duplicated

A

Return boolean Series denoting duplicate rows.

df.duplicated()

50
Q

Jittered scatter plots

A

age = brfss['AGE'] + np.random.normal(0, 2.5, size=len(brfss))

weight = brfss['WTKG3']

plt.plot(age, weight, 'o', markersize=5, alpha=0.2)
plt.show()

If it's a small sample of data, use a larger marker size.

51
Q

Violin Plot

A

data = brfss.dropna(subset=['AGE', 'WTKG3'])

sns.violinplot(x='AGE', y='WTKG3', data=data, inner=None)

plt.show()

52
Q

Bootstrapping

A

Bootstrapping is resampling with replacement: every bootstrapped sample is the same size as the original, and a statistic is computed on each one
~~~
# Bootstrapping coffee mean flavor
import numpy as np
mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )
~~~

Bootstrapping does not correct for bias in the original sample
Bootstrapping is better for estimating standard deviations than for estimating the mean
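
A runnable sketch of the same loop on an invented sample, with a seed per resample so the result is repeatable. The flavor scores and the use of `random_state=i` are assumptions made for illustration. The spread of the bootstrap means estimates the standard error of the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, used only for illustration
coffee_sample = pd.DataFrame({"flavor": [7.5, 7.8, 8.0, 7.2, 7.9, 8.3, 7.6, 7.7]})

mean_flavors_1000 = []
for i in range(1000):
    # Resample the whole table with replacement, then compute the statistic
    resample = coffee_sample.sample(frac=1, replace=True, random_state=i)
    mean_flavors_1000.append(resample["flavor"].mean())

bootstrap_mean = np.mean(mean_flavors_1000)            # stays close to the sample mean
standard_error = np.std(mean_flavors_1000, ddof=1)     # spread of the bootstrap means

print(bootstrap_mean, standard_error)
```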

53
Q

Convert series string to lowercase

A

print(data.str.lower())

54
Q

random sample

A

print(chess.sample(n=5, random_state=42))

55
Q

Correlation and plots

A

heatmaps

Pairplots
sns.pairplot(data=divorce)
plt.show()

Pairplot of selected variables
sns.pairplot(data=divorce, vars=["income_man", "income_woman", "marriage_duration"])
plt.show()

Correlation heatmap
sns.heatmap(planes.corr(), annot=True)
plt.show()

56
Q

Kernel density estimate (KDE)

A

Kernel Density Estimate (KDE) plots
sns.kdeplot(data=divorce, x="marriage_duration", hue="education_man", cut=0)
plt.show()

Cumulative KDE
sns.kdeplot(data=divorce, x="marriage_duration", hue="education_man", cut=0, cumulative=True)
plt.show()

57
Q

pandas class imbalance

A

value_counts()

OR

Aggregated values with pd.crosstab()
pd.crosstab(planes["Source"], planes["Destination"],
            values=planes["Price"], aggfunc="median")

OR

pd.crosstab(planes["Source"], planes["Destination"])

58
Q

pd.cut

A

Labels and bins
labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, twenty_fifth, median, seventy_fifth, maximum]

planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels,