Python Flashcards
df['Column name'].map()
applies a dictionary to a column to replace its values (values not found in the dictionary become NaN)
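A minimal sketch of mapping with a dictionary. The column name, codes, and labels are made-up illustration data, not from the notes.

```python
import pandas as pd

# Hypothetical data: short codes to be replaced with readable labels.
df = pd.DataFrame({"Gender": ["M", "F", "F", "X"]})

# Dictionary of old value -> new value.
labels = {"M": "Male", "F": "Female"}

# Values that are not keys in the dictionary (here "X") become NaN.
df["Gender"] = df["Gender"].map(labels)
print(df)
```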
.astype()
changes the data type of a column
use with a dataframe column and assign the result back:
df['Column'] = df['Column'].astype('boolean')
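A small sketch of type conversion, assuming made-up columns "Column" and "Price". Note that astype returns a new Series rather than changing the column in place.

```python
import pandas as pd

# Hypothetical data: 0/1 flags stored as integers, prices stored as strings.
df = pd.DataFrame({"Column": [1, 0, 1], "Price": ["1.5", "2.0", "3.25"]})

# Assign the result back, otherwise the original column is unchanged.
df["Column"] = df["Column"].astype("boolean")  # pandas nullable boolean dtype
df["Price"] = df["Price"].astype(float)        # strings -> floats

print(df.dtypes)
```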
.apply()
use to call a function on every value of a specific column. Ex:
def gender(x)…. etc
df['Gender'] = df['Gender'].apply(gender)
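A runnable version of the pattern above. The body of gender() is not given in the notes, so the cleaning rule below is purely an assumption for illustration.

```python
import pandas as pd

# Hypothetical messy data.
df = pd.DataFrame({"Gender": ["m", "F", "female", "Male"]})

def gender(x):
    # Assumed rule (not from the notes): classify by the first letter.
    first = str(x).strip().lower()[:1]
    if first == "m":
        return "Male"
    if first == "f":
        return "Female"
    return "Unknown"

# apply calls gender() once per value in the column.
df["Gender"] = df["Gender"].apply(gender)
print(df)
```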
df.head()
returns the first 5 rows of the dataframe by default (df.head(n) returns the first n rows)
df.info()
summarizes the dataframe: non-null counts and data types for each column, plus memory usage
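A quick sketch of both inspection calls on a made-up dataframe with some missing values.

```python
import pandas as pd

# Hypothetical data: 10 rows, every other value in "b" is missing.
df = pd.DataFrame({"a": range(10), "b": [1.0, None] * 5})

print(df.head())   # first 5 rows (the default)
print(df.head(3))  # first 3 rows

# Prints column names, non-null counts, dtypes, and memory usage.
df.info()
```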
.sort_values()
sorts a dataframe by the values in one or more columns
df.sort_values(by='column', ascending=False)
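A short sketch with made-up data. sort_values returns a sorted copy, so the original dataframe keeps its order unless the result is assigned back.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [2, 9, 5]})

# Highest score first; df itself is left untouched.
sorted_df = df.sort_values(by="score", ascending=False)
print(sorted_df)
```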
Correlation values
0 - 0.25 = Very low
0.26 - 0.49 = Low
0.5 - 0.69 = Moderate
0.7 - 0.89 = High
0.90 - 1.0 = Very High
these ranges describe the r value (correlation coefficient)
r can also be negative: the absolute value gives the strength, the sign gives the direction
what is the r squared value?
r squared tells you how much of the variance in y is explained by x in a regression analysis
for a standard least-squares fit, r squared ranges from 0 to 1 (never negative) and is usually read as a percentage
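A minimal sketch of computing r and r squared with NumPy, using made-up numbers. It relies on the fact that in simple linear regression fitted with an intercept, r squared equals the square of the correlation coefficient.

```python
import numpy as np

# Hypothetical data with a strong positive linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation coefficient r (between -1 and 1).
r = np.corrcoef(x, y)[0, 1]

# Share of variance in y explained by x in a simple linear regression.
r_squared = r ** 2

# Flipping the trend flips the sign of r but leaves r squared unchanged.
r_neg = np.corrcoef(x, -y)[0, 1]

print(r, r_squared, r_neg)
```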
what is a regression analysis (and what are the 2 main types)?
regression analysis examines the relationship between variables to assist with prediction/forecasting
simple regression: one dependent and one independent variable
multiple regression: two or more independent variables with one dependent variable
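A sketch of both types using plain NumPy least squares. The data is made up and generated exactly from y = 1 + 2*x1 + 3*x2, so the multiple regression should recover those coefficients; NumPy is just one possible tool here, not something the notes prescribe.

```python
import numpy as np

# Hypothetical independent variables and a dependent variable built from them.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1 + 2 * x1 + 3 * x2

# Simple regression: one independent variable (x1) and one dependent (y).
slope, intercept = np.polyfit(x1, y, deg=1)

# Multiple regression: two independent variables (x1, x2) and one dependent (y).
# The column of ones lets the fit estimate an intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(slope, intercept)
print(coef)  # order: intercept, x1 coefficient, x2 coefficient
```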
.rename()
df.rename(columns={'current name': 'new name'}, inplace=True)
change the name of a column in a df
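A small sketch with made-up column names. Assigning the result back is an alternative to inplace=True and gives the same outcome.

```python
import pandas as pd

# Hypothetical dataframe.
df = pd.DataFrame({"current name": [1, 2], "other": [3, 4]})

# Only the columns listed in the dictionary are renamed.
df = df.rename(columns={"current name": "new name"})
print(df.columns.tolist())
```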
what type of chart is best when comparing 2 categorical variables?
A. stacked bar chart (e.g. pd.crosstab(...).plot(kind='bar', stacked=True)) - shows the composition of each category
B. grouped bar chart (sns.countplot or sns.catplot with hue) - shows a side by side comparison of categories
C. heatmap of the category counts (though heatmaps are most often used for correlations)
what is the syntax for countplot?
sns.countplot(x='category1', hue='category2', data=df)
can do without the hue part if looking at one category only
what are the two main types of data?
categorical and numerical
what is the syntax of catplot? what is great about this tool?
sns.catplot(x='category1', hue='category2', kind='count', data=df)
the kind can be changed to many different plots including box, point, bar, strip (scatter), and swarm
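A sketch of countplot and catplot on the same made-up data, assuming seaborn and matplotlib are installed. The Agg backend line is only there so the script runs without a display; drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook

import pandas as pd
import seaborn as sns

# Hypothetical data with two categorical columns.
df = pd.DataFrame({
    "category1": ["a", "a", "b", "b", "b", "c"],
    "category2": ["x", "y", "x", "x", "y", "y"],
})

# Axes-level plot: returns a matplotlib Axes with one bar per category1/category2 pair.
ax = sns.countplot(x="category1", hue="category2", data=df)

# Figure-level plot: returns a FacetGrid; changing kind swaps the plot type.
g = sns.catplot(x="category1", hue="category2", kind="count", data=df)
```

With hue set, both calls draw the bars side by side (grouped) rather than stacked.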
what is the best chart when looking at categorical data vs boolean?
A. stacked bar chart - shows the distribution/proportions
B. grouped bar chart - separate bars for each bool value
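A sketch of the numbers behind both charts, using pd.crosstab on made-up data (the column names "plan" and "churned" are hypothetical).

```python
import pandas as pd

# Hypothetical data: one categorical column and one boolean column.
df = pd.DataFrame({
    "plan": ["basic", "basic", "pro", "pro", "pro", "basic"],
    "churned": [True, False, False, False, True, True],
})

# Counts per category and boolean value (what a grouped bar chart shows).
counts = pd.crosstab(df["plan"], df["churned"])

# Row-wise proportions (what a 100% stacked bar chart shows).
proportions = pd.crosstab(df["plan"], df["churned"], normalize="index")

print(counts)
print(proportions)
```

With matplotlib installed, counts.plot(kind="bar", stacked=True) draws the stacked version and counts.plot(kind="bar") draws the grouped one.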