Introduction to Data Science Flashcards
Dive into Python
Modules: group related tools together and make it easy to know where to look for a particular tool
Common examples:
matplotlib - for creating charts
pandas - for loading tabular data
scikit-learn - for performing ML
scipy - contains statistics functions
nltk - used to work with text data
Creating variables
Must start with a letter (usually lowercase)
After the first letter, can use letters/numbers/underscores
No spaces or special characters
Case sensitive (my_var is different from MY_VAR)
float: represents a number with a decimal point; the value can be whole (2.0) or fractional (2.5)
string: represents text; can contain letters, numbers, spaces, and special characters
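A minimal sketch of the naming rules and the two data types above. The variable names and values are made up for illustration.

```python
# Names start with a letter and may contain letters, numbers, and underscores.
height_cm = 24.5          # float: a number with a decimal point
suspect_name = "Fred F."  # string: text inside matching quotes

# Names are case sensitive: my_var and MY_VAR are two separate variables.
my_var = 1.0
MY_VAR = 2.0

print(type(height_cm).__name__)     # float
print(type(suspect_name).__name__)  # str
print(my_var == MY_VAR)             # False
```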
Common string mistakes
Don’t forget to use quotes! Without quotes, you’ll get a NameError.
Use the same type of quotation mark. If you start with a single quote and end with a double quote, you’ll get a SyntaxError.
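Both mistakes can be reproduced safely, as in the sketch below. The name Ransom and the snippet text are illustrative; compile() is used only so the SyntaxError can be caught instead of stopping the script.

```python
# Mistake 1: no quotes, so Python looks for a variable called Ransom.
try:
    message = Ransom  # intended: "Ransom"
except NameError as error:
    name_error_text = str(error)
    print("NameError:", name_error_text)

# Mistake 2: the string starts with a single quote and ends with a double quote.
# compile() parses the snippet without running it.
bad_source = "message = 'Ransom\""
try:
    compile(bad_source, "<example>", "exec")
    mismatched_quotes_ok = True
except SyntaxError:
    mismatched_quotes_ok = False
    print("SyntaxError: quotes do not match")
```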
Fun with functions
Functions perform actions:
pd.read_csv() turns a CSV file into a table (a DataFrame) in Python
plt.plot() turns data into a line plot
plt.show() displays the plot in a new window
Function Name:
Starts with the module that the function “lives” in (plt)
Followed by the name of the function (plot)
Function name is always followed by parentheses ()
Positional Arguments:
These are inputs to a function; they tell the function how to do its job.
Order matters!
Keyword Arguments:
Must come after positional arguments
Start with the name of the argument (label), then an equals sign (=)
Followed by the value of the argument ("Ransom")
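The parts of a function call described above can be seen in one plt.plot() call. This is a sketch with made-up data; the Agg backend line is an assumption so the code runs without a display, and can be dropped in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
from matplotlib import pyplot as plt

letters = [1, 2, 3]     # hypothetical x values
frequency = [10, 4, 7]  # hypothetical y values

plt.figure()
# plt is the module, plot is the function name, parentheses hold the arguments.
# Positional arguments come first and their order matters: x values, then y values.
# label is a keyword argument: name, equals sign, then the value.
lines = plt.plot(letters, frequency, label="Ransom")

# Swapping the positional arguments draws a different line.
swapped = plt.plot(frequency, letters, label="Swapped")

print(lines[0].get_label())  # Ransom
# plt.show()  # uncomment to display the plot in an interactive session
```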
Common function errors
Missing commas between arguments
Missing closing parenthesis
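Both errors show up as a SyntaxError. The sketch below checks three illustrative snippets built on the built-in round() function; compile() only parses the text, so the broken calls cannot stop the script.

```python
snippets = {
    "correct call": "round(3.14159, 2)",
    "missing comma": "round(3.14159 2)",
    "missing closing parenthesis": "round(3.14159, 2",
}

results = {}
for description, source in snippets.items():
    try:
        # Parse the snippet without running it.
        compile(source, "<example>", "eval")
        results[description] = "ok"
    except SyntaxError:
        results[description] = "SyntaxError"
    print(description, "->", results[description])
```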
What is pandas?
Pandas is a module for working with tabular data - data with columns and rows - such as spreadsheets or database tables.
Pandas helps to:
Load tabular data from different sources
Search for particular rows or columns
Calculate aggregate statistics
Combine data from multiple sources
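The four tasks above can be sketched in a few lines. The tables, column names, and values here are hypothetical, and io.StringIO stands in for CSV files on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for two files.
purchases_csv = io.StringIO(
    "suspect,item,price\n"
    "Fred,dog treats,12.50\n"
    "Gertrude,leash,25.00\n"
    "Fred,collar,8.00\n"
)
locations_csv = io.StringIO(
    "suspect,city\n"
    "Fred,Town A\n"
    "Gertrude,Town B\n"
)

# Load tabular data (pd.read_csv also accepts a file path).
purchases = pd.read_csv(purchases_csv)
locations = pd.read_csv(locations_csv)

# Search for particular rows.
fred_rows = purchases[purchases.suspect == "Fred"]

# Calculate an aggregate statistic.
total_spent = purchases.price.sum()

# Combine data from multiple sources using the shared column.
combined = purchases.merge(locations, on="suspect")

print(len(fred_rows), total_spent)  # 2 45.5
print(combined)
```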
Inspecting a DataFrame
df.info()
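print(df.info()) (also works, but info() prints its summary itself and returns None, so an extra line reading None appears)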
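A small sketch of inspecting a table. The DataFrame is made up, and df.head() is an addition not listed on the card; it returns the first rows. Passing buf to info() captures the summary as text, which makes it easy to see that info() itself returns None.

```python
import io
import pandas as pd

# Hypothetical table for illustration.
df = pd.DataFrame({"letter": ["A", "B", "C"], "frequency": [7, 2, 4]})

first_rows = df.head(2)  # first two rows as a new DataFrame
print(first_rows)

# info() writes its summary itself and returns None.
buffer = io.StringIO()
returned = df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
print(returned)  # None
```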
Selecting columns
Use columns in a calculation
e.g. credit_records.price.sum()
Plot data
e.g. plt.plot(ransom['letter'], ransom['frequency'])
Selecting with brackets and string (if column names contain spaces or special characters)
suspect = credit_records['suspect']
Selecting with a dot (if column names contain only letters, numbers and underscores)
price = credit_records.price
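The two selection styles side by side, as a sketch. The data is hypothetical, and the "purchase date" column is included only to show a name that the dot style cannot reach.

```python
import pandas as pd

# Hypothetical data; "purchase date" contains a space on purpose.
credit_records = pd.DataFrame({
    "suspect": ["Fred", "Gertrude"],
    "price": [12.5, 25.0],
    "purchase date": ["2019-01-05", "2019-01-07"],
})

suspect = credit_records["suspect"]      # brackets work for any column name
price = credit_records.price             # dot works for simple names
dates = credit_records["purchase date"]  # a space in the name requires brackets

print(price.sum())  # 37.5
```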
Selecting rows with logic
Uses Booleans: True and False
Other types of logic: > and < test greater than and less than, respectively.
>= and <= test greater than or equal to and less than or equal to, respectively.
Using logic with DataFrames
credit_records.price > 20.00 returns a True / False value for each row
credit_records[credit_records.price > 20.00] returns the full rows where the condition is True
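A sketch of both steps with made-up records: first the True / False mask, then the rows it selects.

```python
import pandas as pd

# Hypothetical records for illustration.
credit_records = pd.DataFrame({
    "suspect": ["Fred", "Gertrude", "Ronald"],
    "price": [12.5, 25.0, 20.0],
})

# Comparing a column to a value gives one True / False per row.
is_expensive = credit_records.price > 20.00
print(is_expensive.tolist())  # [False, True, False]

# Putting that mask inside brackets keeps only the True rows.
expensive = credit_records[is_expensive]

# >= also keeps rows that are exactly equal to the value.
at_least_20 = credit_records[credit_records.price >= 20.00]
print(len(expensive), len(at_least_20))  # 1 2
```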
Creating line plots
Introducing Matplotlib
from matplotlib import pyplot as plt
plt.plot(x_values, y_values)
plt.show()
Multiple Lines (add the plot details and use plt.show() to finish off)
plt.plot(data1.x_values, data1.y_values)
plt.plot(data2.x_values, data2.y_values)
plt.show()
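A runnable sketch of drawing two lines on the same plot. The two tables are made up, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical data with the column names used above.
data1 = pd.DataFrame({"x_values": [1, 2, 3], "y_values": [2, 4, 6]})
data2 = pd.DataFrame({"x_values": [1, 2, 3], "y_values": [6, 3, 1]})

plt.figure()
# Each plt.plot() call adds one line to the current figure.
plt.plot(data1.x_values, data1.y_values)
plt.plot(data2.x_values, data2.y_values)

line_count = len(plt.gca().get_lines())
print(line_count)  # 2
# plt.show()  # uncomment to display the plot in an interactive session
```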
Adding Texts to Plots
Axes and title labels
plt.xlabel("Letter")
plt.ylabel("Frequency")
plt.title("Ransom Note Letters")
Labels can be added anywhere before plt.show()
Legends
plt.plot(aditya.days, aditya.cases, label="Aditya")
plt.plot(deshaun.days, deshaun.cases, label="Deshaun")
plt.plot(mengfei.days, mengfei.cases, label="Mengfei")
plt.legend()
Arbitrary text
plt.text(xcoord, ycoord, "Text Message")
plt.text(5, 9, "Unusually low H frequency!")
Modifying text
Change font size
plt.title("Plot title", fontsize=20)
Change font color
plt.legend(labelcolor="green") (requires Matplotlib 3.3 or later)
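The text options above combined in one sketch. The two tables, the label wording, and the annotation position are hypothetical; the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical data for two officers.
aditya = pd.DataFrame({"days": [1, 2, 3], "cases": [2, 5, 3]})
deshaun = pd.DataFrame({"days": [1, 2, 3], "cases": [1, 2, 4]})

plt.figure()
plt.plot(aditya.days, aditya.cases, label="Aditya")
plt.plot(deshaun.days, deshaun.cases, label="Deshaun")

# Axis labels and title; fontsize and color change how the text looks.
plt.xlabel("Day")
plt.ylabel("Cases")
plt.title("Cases per day", fontsize=20, color="green")

# The legend uses the label given to each plt.plot() call.
plt.legend()

# Arbitrary text placed at data coordinates (x=2, y=5).
plt.text(2, 5, "Peak for Aditya")
# plt.show()  # uncomment to display the plot in an interactive session
```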
Adding some style
Changing line color
plt.plot(x, y1, color="tomato")
plt.plot(x, y2, color="orange")
plt.plot(x, y3, color="goldenrod")
plt.plot(x, y4, color="seagreen")
plt.plot(x, y5, color="dodgerblue")
plt.plot(x, y6, color="violet")
Changing line width
plt.plot(x, y1, linewidth=1)
plt.plot(x, y2, linewidth=2)
plt.plot(x, y3, linewidth=3)
plt.plot(x, y4, linewidth=4)
plt.plot(x, y5, linewidth=5)
plt.plot(x, y6, linewidth=6)
plt.plot(x, y7, linewidth=7)
Changing line style
plt.plot(x, y1, linestyle='-')
plt.plot(x, y2, linestyle='--')
plt.plot(x, y3, linestyle='-.')
plt.plot(x, y4, linestyle=':')
Adding markers
plt.plot(x, y1, marker='x')
plt.plot(x, y2, marker='s')
plt.plot(x, y3, marker='o')
plt.plot(x, y4, marker='d')
plt.plot(x, y5, marker='*')
plt.plot(x, y6, marker='h')
Changing the overall style (before any other plotting code):
plt.style.use('fivethirtyeight')
plt.style.use('ggplot')
plt.style.use('seaborn-v0_8') (named 'seaborn' before Matplotlib 3.6)
plt.style.use('default')
Run print(plt.style.available) in the console to see all available styles
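The style options can be combined in a single plt.plot() call, as in this sketch. The data is made up, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
from matplotlib import pyplot as plt

# Set the overall style before any other plotting code.
plt.style.use("ggplot")

x = [1, 2, 3, 4]   # hypothetical data
y1 = [1, 2, 3, 4]
y2 = [2, 3, 4, 5]

plt.figure()
# Color, width, line style, and marker are all keyword arguments.
plt.plot(x, y1, color="tomato", linewidth=1, linestyle="-", marker="x")
plt.plot(x, y2, color="dodgerblue", linewidth=3, linestyle="--", marker="o")

first, second = plt.gca().get_lines()
print(second.get_color(), second.get_linestyle(), second.get_marker())

# Go back to the default look for later plots.
plt.style.use("default")
# plt.show()  # uncomment to display the plot in an interactive session
```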
Making a scatter plot
Scatter plots help to visualize unordered data points in a grid.
Creating a scatter plot
plt.scatter(df.age, df.height)
plt.xlabel('Age (in months)')
plt.ylabel('Height (in inches)')
plt.show()
Keyword arguments
plt.scatter(df.age, df.height,
color='green',
marker='s')
Changing marker transparency
plt.scatter(df.x_data,
df.y_data,
alpha=0.1)
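A sketch that puts the scatter options together. The ages and heights are invented, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical measurements.
df = pd.DataFrame({
    "age": [6, 12, 18, 24, 30],
    "height": [26.0, 29.5, 32.0, 34.5, 36.0],
})

plt.figure()
# alpha runs from 0 (invisible) to 1 (solid); low values help when points overlap.
points = plt.scatter(df.age, df.height, color="green", marker="s", alpha=0.5)
plt.xlabel("Age (in months)")
plt.ylabel("Height (in inches)")

print(points.get_alpha())  # 0.5
# plt.show()  # uncomment to display the plot in an interactive session
```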
Making a bar chart
Creating a bar chart
plt.bar(df.precinct, df.pets_abducted)
plt.ylabel('Pet Abductions')
plt.show()
Horizontal bar charts
plt.barh(df.precinct, df.pets_abducted)
plt.xlabel('Pet Abductions')
plt.show()
Adding error bars
plt.bar(df.precinct, df.pets_abducted, yerr=df.error)
plt.ylabel('Pet Abductions')
plt.show()
Stacked bar charts
plt.bar(df.precinct, df.dog, label='Dog')
plt.bar(df.precinct, df.cat, bottom=df.dog, label='Cat')
plt.legend()
plt.show()
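A sketch of a stacked bar chart with error bars on the lower series. The precinct names and counts are made up, and the Agg backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical counts per precinct.
df = pd.DataFrame({
    "precinct": ["North", "South", "East"],
    "dog": [3, 5, 2],
    "cat": [1, 2, 4],
    "error": [0.5, 1.0, 0.3],
})

plt.figure()
# yerr draws an error bar on each dog bar.
dog_bars = plt.bar(df.precinct, df.dog, yerr=df.error, label="Dog")
# bottom=df.dog starts each cat bar at the top of the matching dog bar.
cat_bars = plt.bar(df.precinct, df.cat, bottom=df.dog, label="Cat")
plt.ylabel("Pet Abductions")
plt.legend()

cat_starts = [float(bar.get_y()) for bar in cat_bars]
print(cat_starts)  # [3.0, 5.0, 2.0]
# plt.show()  # uncomment to display the plot in an interactive session
```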
Making a histogram
A histogram visualizes the distribution of values in a dataset.
Histograms with matplotlib
plt.hist(gravel.mass)
plt.show()
Changing bins
plt.hist(data, bins=nbins)
plt.hist(gravel.mass, bins=40)
Changing range
plt.hist(data, range=(xmin, xmax))
plt.hist(gravel.mass, range=(50, 100))
Normalizing
Unnormalized histogram (bar heights are counts)
plt.hist(male_weight)
plt.hist(female_weight)
Normalized histogram (sum of bar areas = 1)
plt.hist(male_weight, density=True)
plt.hist(female_weight, density=True)
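A sketch of the bins, range, and density options. The gravel masses are random numbers generated for illustration, and the Agg backend line is an assumption so the code runs without a display. plt.hist() returns the bar heights and bin edges, which makes the "area = 1" claim easy to check.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical measurements drawn from a normal distribution.
rng = np.random.default_rng(0)
gravel = pd.DataFrame({"mass": rng.normal(loc=75, scale=10, size=500)})

plt.figure()
# 40 bins covering only the values between 50 and 100.
counts, edges, _ = plt.hist(gravel.mass, bins=40, range=(50, 100))

plt.figure()
# density=True rescales the bars so that their total area is 1.
density, density_edges, _ = plt.hist(gravel.mass, bins=40, density=True)
total_area = float(np.sum(density * np.diff(density_edges)))

print(len(counts), round(total_area, 6))  # 40 1.0
# plt.show()  # uncomment to display the plots in an interactive session
```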
Plot types summary
plt.scatter() shows individual data points
plt.bar() creates bar charts
plt.barh() creates horizontal bar charts
plt.hist() visualizes distributions