Lecture Notes Flashcards
Define programming.
Programming means giving a computer a list of tasks, which it then runs in order to solve a problem.
What are some advantages of computer programming?
- Computers don’t get bored - automate repetitive tasks
- Computers don’t get tired
- Computers are calculators
- Computer code is reproducible
What can’t computers do?
- Computers are not creative
- Computers are not ethical
- Computers only know what you tell them
What are some advantages of python?
- High-level language
- Emphasises readability, making use of white space and indentation
- Dynamically typed
- Interpreted language
- Assigns memory automatically
- Supports multiple approaches to programming
- Extensive functionality
- Portable
- Open source
- Very popular
What are some disadvantages of python?
- Slower than compiled languages
- Can be memory-intensive
What are the different types of cells in a Jupyter notebook?
- Code cells - interpreted as Python code
- Markdown cells - for adding formatted text
How do you add a comment to a Jupyter notebook?
#
Why are comments important?
- Allow you to keep track of what your code does
- Avoids repetition and mistakes
- Easy for other people to follow
What steps should you take for debugging?
- Always read error messages carefully
- Comment your code thoroughly
- Tell your code to print outputs for intermediate steps
- Use the internet
How do you print in python?
print()
Prints whatever is in the brackets.
Useful for displaying results and testing purposes.
What does a variable have?
A name and a value.
The name is fixed, the value can change.
What are the different types of variables in Python?
- Numeric: integers, floats or complex numbers
- Text: string, always marked by quotation marks
- Boolean: True or False
- Sequences: lists or arrays of numbers/letters
How do you change the string x = ‘33.3’ to a float?
float(x)
How do you check the type of a variable?
type(x)
How do you change a float to an integer?
int(x) - this truncates it to a whole number (the fractional part is dropped, not rounded)
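A minimal sketch of the type conversions on these cards. The variable names are illustrative only. It also shows that int() drops the fractional part, while round() goes to the nearest whole number.

```python
x = "33.3"               # a string, because of the quotation marks
as_float = float(x)      # string -> float
as_int = int(as_float)   # float -> int: the fractional part is dropped

print(type(x), type(as_float), type(as_int))

# int() truncates toward zero, whereas round() goes to the nearest whole number
truncated = int(2.9)
rounded = round(2.9)
print(truncated, rounded)
```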
How do you get an input from the user?
variable = input("Enter your name: ")
What is an expression?
Any group of variables or constants that together result in a value.
What are the common symbols used in basic math expressions?
+ (add)
- (subtract)
* (multiply)
/ (divide)
% (remainder)
** (raise to the power of)
How do you concatenate two strings together?
String1 + String2
= String1String2
String1 * 3
String1String1String1
How is python indexed?
Zero-based indexing
What is string slicing?
Extracting certain characters from a string.
How do you access specific parts of a string?
Using the index with square bracket notation
- string[0]
Can we change a part of string in place?
We can access parts of a string to see their value, but we cannot change them in place - strings are immutable.
How do we access a sequence (sub-string) of any length?
By specifying a range to slice. Ranges use a : notation eg [1:10]
The slice occurs before each index (eg between 0 and 1, and between 9 and 10), returning the characters at indexes 1-9. The start index is included; the end index is not.
How can we create a new string with slicing?
We can store our sub-string as a new variable (then this can be manipulated)
string2 = string1[:8]
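The indexing and slicing cards can be tried out with a short sketch like the one below. The example string is an assumption chosen for illustration.

```python
string1 = "Hello, world"

first = string1[0]       # zero-based: the first character
middle = string1[1:10]   # indexes 1 up to, but not including, 10
string2 = string1[:8]    # everything before index 8, stored as a new string

# Strings are immutable, so assigning to an index raises a TypeError
try:
    string1[0] = "J"
    changed = True
except TypeError:
    changed = False

print(first, middle, string2, changed)
```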
What is string splitting?
String splitting is a very useful method for manipulating strings - it involves breaking a string into multiple parts.
string.split(" ")
What is a tuple?
A tuple is a type which holds an arbitrary sequence of items, which can be of different types.
They are used to store multiple items in a single variable.
Think multiple
How can you declare a tuple?
my_tuple = ("A", "tuple", "of", 5, "entries")
How can you access a variable in a tuple?
Similar notation as characters in a string
my_tuple[0]
What is the advantage of a tuple over a list?
Tuples only use a small amount of memory but once created, the items cannot be changed.
Tuples are immutable, like strings
A list is a similar but more flexible data type compared to a tuple.
Lists are also comma-separated, but use square brackets
Give examples of immutable data types.
Tuples
Strings
What is the difference in declaring a list vs a tuple?
Both are comma-separated lists.
Tuples - ()
Lists - []
Tuples are immutable
Lists support assignment - you can access an item and change its value
Lists support assignment - what does that mean?
You can access an item and change its value.
How do you access/change items in a list?
list[index]
for a list of lists
list[i][j]
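A small sketch contrasting lists and tuples, under the assumption that the example values are arbitrary. Assignment works on the list and on a list of lists, but fails on the tuple.

```python
my_tuple = ("A", "tuple", "of", 5, "entries")
my_list = ["A", "list", "of", 5, "entries"]

my_list[1] = "changed"            # lists support assignment

try:
    my_tuple[1] = "changed"       # tuples are immutable
    message = ""
except TypeError as err:
    message = str(err)

grid = [[1, 2, 3], [4, 5, 6]]     # a list of lists
grid[1][2] = 60                   # second sub-list, third item

print(my_list, message, grid)
```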
How do you get the length of a list?
len(list)
How do you compute the sum of values in a list?
sum(list)
How do you find the minimum value in a list?
min(list)
How do you find the maximum value in a list?
max(list)
How do you make a copy of a list?
Store it as another variable
copied = list.copy()
How do you add an element to a list?
list.append(value)
What is the standard indent in python?
Four spaces - can usually tab in most editors
What is a dictionary?
A handy way to store and access data.
A dictionary is a set of keyword and value pairs. You use the keyword to access the value. The value can be of any type, including another dictionary.
dict = { x:y, a:b }
Keys are often strings, which need quotation marks, but any hashable value (eg a number or a tuple of numbers) can also be a key.
How do you define a dictionary?
dict = { x:y, a:b }
Keys are often strings, which need quotation marks, but any hashable value (eg a number or a tuple of numbers) can also be a key.
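A minimal sketch of defining and using a dictionary. The names and numbers are illustrative values, not data from the lecture.

```python
# keys on the left of each colon, values on the right (illustrative numbers)
masses = {"Mercury": 0.33, "Venus": 4.87, "Earth": 5.97}

earth = masses["Earth"]      # look a value up by its key
masses["Mars"] = 0.642       # add a new key and value pair
masses["Earth"] = 6.0        # change an existing value

# a value can itself be a dictionary
planet = {"name": "Earth", "moons": {"count": 1}}
moon_count = planet["moons"]["count"]

print(earth, len(masses), moon_count)
```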
What is program flow?
Controlling which parts of your code get executed when, in what order, how many times, under what conditions, where to start and stop etc.
It is essential to making sure your program actually does what you want it to do.
Flow is controlled mainly by using conditional logic and loops.
What is the advantage of a dictionary?
We don’t need to care about where the value we want is, we just have to remember what we called it.
What is an if statement?
A block of code which first checks if a specified condition is true, and only in that case will it carry out the task
if condition:
    # body
It will only be applied to the indented code which follows the :
What is an if-else statement?
If statements only execute if the condition is true.
The else statement executes if the condition is false.
if condition:
    # code
else:
    # code
What is the elif statement?
If-elif-else
if condition_1:
    # code
elif condition_2:
    # code
else:
    # code
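A runnable version of the if-elif-else pattern. The variable temperature and the thresholds are made up for illustration.

```python
temperature = 15   # hypothetical value to test

if temperature < 0:
    label = "freezing"      # runs only if the first condition is true
elif temperature < 20:
    label = "mild"          # checked only if the first condition was false
else:
    label = "warm"          # runs if none of the conditions were true

print(label)
```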
What is a loop?
A block of code that will iterate (execute consecutively) multiple times.
What is a for loop?
A for loop requires something to iterate over, ie an “iterable” like a list (do something for every item in the list) or a string (do something for every character in the string)
for var in iterable:
    # code
for i in range(10):
    # code
Which is the simplest kind of loop?
For loop
How do you get a list of integers of length x, starting with 0?
range(x)
list(range(x))
What are the key words used for control in the flow of a loop?
pass - do nothing
continue - stop this iteration of the loop early, and go on to the next one
break - end the loop entirely
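The three keywords can be seen together in one short loop. This is a sketch with arbitrary conditions chosen to trigger each keyword.

```python
collected = []

for i in range(10):
    if i == 2:
        pass          # do nothing; execution carries on below
    if i % 2 == 1:
        continue      # odd number: skip the rest of this iteration
    if i > 6:
        break         # leave the loop entirely (happens when i is 8)
    collected.append(i)

print(collected)
```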
How do we open a file in python?
open() function
r - reading only
w - for writing, if the file exists it overwrites it, otherwise it creates a new file
a - opens for file appending only, if it doesn’t exist, it creates the file
x - creates a new file, if the file exists it fails
+ - opens a file for updating (reading and writing); it is added to another mode, eg r+
syntax:
f = open("zen_of_python.txt", "r")
What does “f = open(‘zen_of_python.txt’, ‘r’)” do?
‘r’ - opens a file for reading only.
What does “f = open(‘zen_of_python.txt’, ‘w’)” do?
‘w’ - opens a file for writing. If the file exists, it overwrites it. Otherwise, it creates a new file.
What does “f = open(‘zen_of_python.txt’, ‘a’)” do?
‘a’ - opens a file for appending only. If the file doesn’t exist, it creates the file.
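What does "f = open('zen_of_python.txt', 'r+')" do?
'+' - opens a file for updating (reading and writing). It must be combined with another mode, eg 'r+'; '+' on its own raises an error.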
When are changes to a file saved?
When the file is closed
Use the .close() method if not using with/as
What does “f = open(‘zen_of_python.txt’, ‘x’)” do?
‘x’ - creates a new file. If the file exists, it fails.
What do you have to do once you are finished with a file?
Close it, to release memory used in opening the file.
When writing to a file, the changes are not saved until the file is closed.
Use the .close() method
What is the basic way to read from a file?
f = open("file_name.txt", "r")
then use
print(f.read()) or
print(f.readline())
What arguments does the open function take?
The name of the file you want to look at and the mode with which you want to interact with the file
What is the difference between .read(),.readline() and .readlines()?
.read() reads the entire contents of the file
.readline() reads only the next line, it can be called repeatedly until the entire file has been read
.readlines() is the most useful, it reads each line, one line at a time and then stores it all into a single list
What happens if you run print(f.read()) twice?
The first output will print the entire contents of the file.
The second output will be blank. Once the file object has been read to the end, any subsequent calls return an empty string.
What happens if you try f.read() from a closed file?
Results in an error
How do you read each line of a file and store all the lines in a list?
.readlines()
f = open("file_name.txt", "r")
lines = f.readlines()
f.close()
print(lines)
The file is closed but we have the contents written to a variable, we can then get the lines we want by indexing
What is the safe way to open files?
We can make sure that files are only open for as long as we need them by using a with statement
with open("file_name.txt", "r") as f:
    # put file operations in here
    print(f.read())
What happens if you try print(f.read()) after a with/as statement?
An error will be produced - the with/as syntax closes the file automatically at the end.
This is important for file writing, less important for file reading.
How do you write to a file?
with open("file_name.txt", "w") as f:
    f.write("String")
Basic input and output only reads and writes strings. Passing a non-string value (eg a number) to f.write() will cause an error and result in an empty file.
What happens to the contents when you open a file in write mode?
It erases any previous contents
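A sketch of writing and then reading a file with the with/as pattern. Using a temporary folder is an assumption made here so the example does not touch any real files.

```python
import os
import tempfile

# temporary folder so the sketch leaves real files alone (an assumption)
path = os.path.join(tempfile.mkdtemp(), "file_name.txt")

with open(path, "w") as f:            # "w" creates or overwrites the file
    f.write("first line\nsecond line\n")

with open(path, "r") as f:
    lines = f.readlines()             # every line, stored in one list

closed_after = f.closed               # the with block closed the file for us

try:
    f.read()                          # reading a closed file fails
    error_message = ""
except ValueError as err:
    error_message = str(err)

print(lines, closed_after, error_message)
```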
How do you format a string?
%s - string
%d - integer
%f - float
%e - float, but using scientific notation
eg print("%f" % length)
or print("This is a %d word %s" % (length, datatype)) - can include as many variables as you want by putting several % signs in the string, and providing a tuple after the string.
The first % (inside the string) indicates that we are writing a variable. The letter that follows indicates what type of variable.
The second % sign (after the string) tells your code which variable to write at the first % sign.
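A short sketch of % formatting with each of the codes listed above. The variables length and datatype follow the card; the other values are arbitrary.

```python
length = 5
datatype = "sentence"

as_int = "%d" % length                                     # integer
combined = "This is a %d word %s" % (length, datatype)     # tuple of values
two_places = "%.2f" % 3.14159                              # float, 2 decimals
scientific = "%e" % 12345.678                              # scientific notation

print(as_int)
print(combined)
print(two_places)
print(scientific)
```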
How can you add a tab into the string?
“\t”
How can you add a new line into the string?
“\n”
What is a JSON file?
A JSON file is structured like a Python dictionary.
JavaScript Object Notation (JSON) is a standard text-based format for representing structured data based on JavaScript object syntax. It is commonly used for transmitting data in web applications.
How is it best to read CSV or JSON files?
Using specialised modules
How do we write JSON?
Using the JSON module
Use json.dump to write to file
import json
# define a dictionary, eg masses
with open("planets.json", "w") as f:
    json.dump(masses, f)
- the arguments are the thing you want to dump and the file you want to dump it into
How do we read JSON?
Using the JSON module
Use json.load to read from file
import json
with open("planets.json", "r") as f:
    new_dictionary = json.load(f)
print(new_dictionary) - to check that we have successfully read the JSON dictionary
What are the two calls to read and write json?
import json
write - json.dump()
read - json.load()
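The two calls can be combined into a round trip. This is a sketch: the dictionary contents are illustrative and the temporary folder is an assumption made to keep the example self-contained.

```python
import json
import os
import tempfile

masses = {"Earth": 5.97, "Mars": 0.642}     # illustrative values

path = os.path.join(tempfile.mkdtemp(), "planets.json")

with open(path, "w") as f:
    json.dump(masses, f)                    # what to dump, where to dump it

with open(path, "r") as f:
    new_dictionary = json.load(f)           # read it back as a dictionary

print(new_dictionary)
```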
What quotation marks does JSON use as standard?
Double quotes
You can define the dictionary with single quotes - Python doesn't care but JSON does, so the json module converts them, eg so that all keys are in double quotes
When might a dictionary be a string?
Dictionaries may be stored as a string if the dictionary is one entry within a larger database
How do we turn a dictionary into a string?
Simply add quotation marks
Can check the type with print(type(item))
If there are “” used in the string, then we create the overall string with ‘ ‘ - if we try to use the same type of quote both around and within the string, it would end the string early
How can we turn a string into a dictionary?
json.loads()
pronounce load-S
the extra s is for string
eg dict = json.loads(string)
print(type(dict)) to check it was successful
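A minimal sketch of converting between a string and a dictionary. json.dumps (dump-S) is the json module's counterpart for going from a dictionary back to a string; it is not mentioned on the cards and is shown here as an extra. The example record is made up.

```python
import json

# single quotes outside, because JSON needs double quotes inside
text = '{"name": "Earth", "moons": 1}'

record = json.loads(text)      # string -> dictionary
back = json.dumps(record)      # dictionary -> string

print(type(text), type(record), type(back))
```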
What are the two cases we want to allow code to fail gracefully?
Errors - a fundamental issue where python cannot understand your code (syntax error)
Exceptions - code is written in valid Python syntax, but an operation cannot be completed successfully
What is the syntax used to predict and catch exceptions under some circumstances?
The try/except code
try:
    # code
except:
    # fallback code
The except prevents the code from crashing and implements an emergency fallback option.
Why do you need to be cautious about using a generic except statement?
It will catch all exceptions - even if the error is not what you think it is.
You should try to catch specific errors.
What is a ValueError exception?
Raised when an operation or function receives an argument that has the right type but an inappropriate value, and the situation is not described by a more precise exception such as IndexError
How do you extend the exception-handling block with additional steps to execute after the try…except?
try:
    # code
except:
    # code
else:
    # code - do if no exception
finally:
    # code - always do this at the end
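A sketch of the full try/except/else/finally block, catching a specific exception rather than using a generic except. The function name to_float and the steps list are hypothetical, added only to record which branches run.

```python
def to_float(text):
    steps = []                     # records which branches were executed
    try:
        value = float(text)
    except ValueError:             # catch only the error we expect
        steps.append("except")
        value = None               # emergency fallback
    else:
        steps.append("else")       # runs only if no exception occurred
    finally:
        steps.append("finally")    # always runs at the end
    return value, steps

print(to_float("3.5"))
print(to_float("abc"))
```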
What is the difference between a type error and value error?
Passing arguments of the wrong type (e.g. passing a list when an int is expected) should result in a TypeError , but passing arguments with the wrong value (e.g. a number outside expected boundaries) should result in a ValueError.
What is the benefit of using a function?
Functions are re-usable.
We often want to do the same operation at different times or with different data.
What is a function?
A separate, named block of code for a specific purpose.
The code inside a function is “walled off” from the main code.
What is required for a function?
Every function has a name, a list of (required and optional) inputs in parentheses, and returns something at the end.
def my_function():
    return
What is the syntax for defining a function?
def my_function():
    return
You should give your function a meaningful name
What are inputs of a function called?
Arguments (also called parameters). When an argument is passed by name, eg b=1, it is called a keyword argument.
How do you call a function?
Call the function using its name, including the brackets (and any arguments required to be passed in)
eg hello_world()
If a function requires an argument to be provided, but we don’t provide it, what happens?
We get an error message
When we call a function and assign it to a variable, what happens?
eg sum = my_sum(5, 6)
The variable will be assigned the value returned by the function
What are global and local variables?
A global variable is a variable defined in the main body of the code. Any code executed after the variable has been defined is able to “see” the variable.
A local variable is a variable defined inside the function or other object. Its value is only accessible within the function or object (ie cannot be accessed outside of the function)
If we want to make an input of a function optional, what do we need to do?
Give it a default value
def my_sum(a, b=1):
    return a + b
- If you provide a value for b, it will override the default
- If you don’t provide a value for b, it will use b = 1 as a default
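The my_sum example can be run as below. The variable names on the left are illustrative; the last part shows the error raised when a required argument is missing.

```python
def my_sum(a, b=1):
    return a + b

with_default = my_sum(5)         # b falls back to its default of 1
with_value = my_sum(5, 6)        # b is provided, so the default is not used
with_keyword = my_sum(5, b=10)   # b passed by name

try:
    my_sum()                     # a is required, so this fails
    missing = False
except TypeError:
    missing = True

print(with_default, with_value, with_keyword, missing)
```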
How do you reverse a list?
list.reverse()
How do you declare a function with an arbitrary number of variables?
def arb_function(*nums):
    # code
Within the code, you then loop over nums
Why might you want to declare an arbitrary number of variables for a function?
You may not know in advance how much data you will need to work with
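A sketch of a function taking an arbitrary number of inputs. Adding the numbers up is just one assumed choice for the body; inside the function, nums behaves like a tuple that can be looped over.

```python
def arb_function(*nums):
    total = 0
    for n in nums:        # loop over however many values were passed in
        total += n
    return total

print(arb_function())
print(arb_function(1, 2, 3))
print(arb_function(10, 20, 30, 40, 50))
```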
What do all functions in Python have in common?
All functions in Python return something.
If you do not specify a value (or leave out the return statement entirely), the function will return a None value by default.
Otherwise it returns the value we specify
How many values can you return from a function?
What options do you have for these outputs?
You can return more than one value from a function, and return different types.
For the output:
- Provide the same number of variables as the number of values returned. Each returned value then goes to a separate variable.
- Provide a single variable, this will then contain a tuple of the values returned by the function
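Both options for collecting the outputs can be sketched as follows. The function min_and_max is a hypothetical example; when a single variable is used, Python packs the returned values together as a tuple.

```python
def min_and_max(values):
    return min(values), max(values)       # two values returned at once

low, high = min_and_max([3, 1, 4])        # one variable per returned value
both = min_and_max([3, 1, 4])             # single variable holds them all

print(low, high, both, type(both))
```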
What does the return statement do?
Returns variables, ends the function call and returns to the main code.
Therefore any code in the function after the return will not be executed.
This can be convenient if you want to put conditions for what to return.
What is a lambda function?
A quick way to make short functions that can be defined in one line.
They can take any number of arguments, but can only have one expression.
name = lambda vars : code
eg doubler = lambda x: x*2
How do you define a lambda function?
name = lambda vars : code
eg doubler = lambda x: x*2
When would it be most appropriate to use a lambda function?
If we need to create a function for temporary use eg within another function.
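A brief sketch of lambda functions, including temporary use inside another function. Passing a lambda as the key of sorted() is one common case, chosen here as an illustration; the word list is made up.

```python
doubler = lambda x: x * 2           # one argument, one expression
add = lambda a, b: a + b            # any number of arguments is allowed

words = ["pear", "fig", "banana"]
# temporary function used only to tell sorted() what to sort by
by_length = sorted(words, key=lambda w: len(w))

print(doubler(4), add(2, 3), by_length)
```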
How do you add an element to a list?
list.append(i)
How do you sort a list?
sorted(list)
What is a programming paradigm?
A paradigm is like a philosophy informing how we write code.
Usually there are many different ways to solve a problem with code. Different paradigms help to shape which approach we choose to use.
Procedural programming.
Object-oriented programming.
What are the most common paradigms in python?
Procedural programming - the code is organised as a sequence of instructions (do this, then this). Each block performs a prescribed task to solve the problem.
OOP - data are stored as “objects” belonging to pre-defined “classes”. These objects have a set of “attributes” stored internally, which can be updated using built-in “methods”.
What is a class?
A class is like a template designed in advance to handle a particular data structure, with a set of properties called attributes.
It also provides implementations of behaviour (member functions or methods). The syntax looks like class_name.function_name()
How do you reverse a list?
list.reverse()
How do you investigate all of the attributes and functions of a class or object?
dir(x)
or print(dir(x))
What are alternative names for attributes and methods?
Attributes - properties
Methods - functions
What do attributes with a double underscore represent?
Special attributes and methods used internally by Python; they are not normally meant to be changed directly.
How do you create a new list?
my_list = [x,y,z]
What is the relationship of an object and a class?
Any object is an instance of a class, created to represent a particular item of data.
An instance ie one specific example
What do methods of an object do?
Update the internal state of the object eg reversing the list
How can you check the class of an object?
object.__class__
How do you create a class?
eg
class Animal():
    # Can list attributes
    # Can define functions using def function():
How do you create an object? (ie particular instance of the class)
object = Class()
Passing in attributes as appropriate. In this case, the attributes would be set to their defaults.
What is creating an object (ie a particular instance of the class) called?
Instantiation
How do you check the value of an attribute for an object?
object.attribute
How do you update the attribute of an object?
object.attribute = value
How do you create a new attribute of an object?
object.attribute = value
We can add attributes to individual class instances; doing so does not change the parent class.
Why use classes?
Objects store data in a way where it is easy to update and display the internal state of that data, using built-in methods.
OOP allows you to put your methods next to the data.
Once we have defined useful classes and instantiated objects, an OO code will mainly interact with the data object through its built-in methods.
What function do we use when we create a class and we know that we will create many objects from that same class, with shared attributes and want to assign values when creating each object?
The __init__ function
What does the __init__ function do?
The python __init__ method is declared within a class and is used to initialise the attributes of an object as soon as the object is formed.
How do you use the __init__ function?
class Animal():
    def __init__(self, attribute1, attribute2):
        self.att1 = attribute1
        self.att2 = attribute2
Give the __init__ function a list of arguments; the first argument is always self. This is a special variable which represents the object itself once we have created it (a self-referential thing).
__init__ initialises the attributes of the class
What is the self parameter?
The self parameter is a reference to the current instance of the class, and is used to access variables that belong to the class.
What does self.x mean?
The attribute “x” belonging to the object “self”
When using the __init__ function, what input do the methods defined in the class take?
They take “self” as their first input; you can then access any attribute with self.x
What is a benefit of using __init__ when we create objects?
The object can be created and attributes defined in one line.
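A sketch pulling the __init__, self and attribute cards together. The attribute names (name, legs, colour) and the describe method are hypothetical choices for illustration.

```python
class Animal:
    def __init__(self, name, legs=4):
        # self is the object being created; these lines set its attributes
        self.name = name
        self.legs = legs

    def describe(self):
        # methods reach the attributes through self
        return f"{self.name} has {self.legs} legs"

dog = Animal("Rex")          # instantiation: attributes set in one line
bird = Animal("Tweety", 2)

dog.legs = 3                 # update an existing attribute
dog.colour = "brown"         # new attribute on this one instance only

print(dog.describe(), bird.describe(), hasattr(bird, "colour"))
```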
What is hierarchical inheritance?
Hierarchical inheritance is a type of inheritance in which multiple classes inherit from a single superclass.
“Parent” class - Animal
“Child” class - Cat, Dog etc.
Object - your pet
A child class inherits the attributes of its parent class, but we can also add new methods and attributes.
How do you create a child class of a parent class?
Put the name of the parent class in the brackets when creating the child.
eg class Cat(Animal):
    # put attributes from the parent class that should be fixed for the child class first
    # then use super().__init__(attribute1, attribute2) within __init__ for the attributes we want to specify new values for
How do we define the __init__ function for a child class?
def __init__(self, attribute1, attribute2):
    super().__init__(attribute1, attribute2)
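A minimal sketch of a parent and child class, assuming a simple Animal with a name and a sound. The attribute indoor and the method purr are hypothetical additions to show that the child can extend the parent.

```python
class Animal:
    def __init__(self, name, sound):
        self.name = name
        self.sound = sound

    def speak(self):
        return f"{self.name} says {self.sound}"

class Cat(Animal):
    def __init__(self, name):
        # fix the sound for every Cat, pass the name through to the parent
        super().__init__(name, "meow")
        self.indoor = True       # a new attribute that Animal does not have

    def purr(self):              # a new method that only cats have
        return f"{self.name} purrs"

pet = Cat("Tom")
print(pet.speak(), pet.purr())
```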
If you define a useful function or class and want to use it in many different codes, instead of copying and pasting the code what can you do?
Make the code into a Python module.
This is a python file (extension .py) containing one or more classes or functions.
You can then import the class or function from the module easily.
What is a python module?
A python file (extension .py) containing one or more classes or functions.
You can then import the class or function from the module easily.
How do you import a module?
Ensure the .py file is in the same directory as the current notebook.
eg for the class Dog from the animals.py module
from animals import Dog
From this import, you can create Dog objects
(can also define and import functions, they don’t need to be part of a class, making it very easy to reuse them)
What is a python package?
A collection of modules.
You can install (download) these packages and then have access to incredibly useful functions.
What is NumPy?
Numerical Python - a widely used third-party package (it is installed separately, not built into Python)
A module is a pre-defined collection of functions we can load into our programs.
NumPy arrays are multidimensional array objects.
We can use NumPy arrays for efficient computations on large data sets.
How do we import NumPy?
import numpy as np
Then call functions as eg np.sin(0)
Alternatives (don’t use):
import numpy - importing the entire numpy module
from numpy import sin - importing only a specific function
When might you only import a specific function required, rather than an entire module?
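If we only need one or two names from a library and want to call them without the module prefix. (Python still loads the whole module, so this does not save much memory.)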
We would need to know the name of the function/class that we want to import in advance.
The specific function is now a global name, we don’t need to specify the module.
This could cause issues when there are functions with the same name.
How do you call trigonometry functions?
sin(), cos() and tan()
The value passed in this function should be in radians.
How do we investigate the contents of a module?
import module first
dir()
eg dir(np)
What is an array?
A “grid” of values, all of the same dtype
ie all floats, or all strings etc.
What is the difference between a 1D array and a list?
They look similar at first.
- Arrays use less memory. An array is therefore much more efficient than a list, particularly for a large collection of elements.
- Lists are more flexible (can have mixed types)
How do you create a list of length x?
my_list = []
for i in range(x):
my_list.append(i)
How do you print the first x items of a list?
print(my_list[:x])
How do you create a NumPy array from a list?
my_array = np.array(my_list)
How do you print the first x items of an array?
print(my_array[:x])
How do you print the first array item?
print(my_array[0])
How do you print the last item of an array?
print(my_array[-1])
How do you print the time taken to execute a cell?
%%time
Why is there a difference in the time taken for an operation on a list vs an operation on an array?
Lists - operations can only be performed on items, so calculations have to be done one at a time.
Arrays - operation is performed on all elements with single function call - this is much quicker for large data sets.
Why are arrays more convenient for many mathematical operations?
You can write one line of code rather than a loop.
Eg looping over every element of a 3D array requires three nested for loops.
With NumPy, we don’t have to worry about the array shape (it is automatically preserved)
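The difference between looping over a list and operating on a whole array can be sketched like this. The values are arbitrary. Note that multiplying a plain list by 2 repeats the list instead of doubling its items.

```python
import numpy as np

my_list = [1, 2, 3, 4]
my_array = np.array(my_list)

# List: handle the items one at a time in a loop
doubled_list = []
for item in my_list:
    doubled_list.append(item * 2)

# Array: one expression acts on every element
doubled_array = my_array * 2

# A list times 2 is the list repeated, not doubled
repeated = my_list * 2

print(doubled_list, doubled_array, repeated)
```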
What are different ways to create an array?
From a list:
my_array = np.array(my_list)
Create an array of zeros (n = how many elements you want)
np.zeros(n)
Create an array of ones
np.ones(n)
Create an array of numbers from a to b, with spacing c
np.arange(1, 10, 1)
Create an evenly spaced array from a to b, with c points
np.linspace(1, 10, 19)
Create an array of random numbers from 0 to 1 of length n
np.random.random(n)
What does np.arange() do?
Creates an array of numbers from a to b with spacing c.
Pass in where to start, where to stop and the spacing that you want
Stops before the stop number ie if you want 1 - 10
np.arange(1, 11, 1)
What does np.linspace() do?
Creates an evenly spaced array from a to b with c points.
State how many points you want. This enables better precision and can be used to control the number of samples that you want.
What function creates an array of numbers from a to b with spacing c?
np.arange(a, b, c)
What does np.random.random(n) do?
Create an array of random numbers from 0 to 1 of length n
How do you create an array of random numbers of length n?
np.random.random(n)
This creates an array of random numbers ranging from 0 to 1. Can apply transformations to get it into a range you want.
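A sketch comparing the array-creation functions above. The rescaling of the random numbers to the range 5 to 15 is an assumed example of the kind of transformation mentioned on the card.

```python
import numpy as np

zeros = np.zeros(3)
ones = np.ones(3)

stepped = np.arange(1, 11, 1)     # stops before 11, so gives 1 to 10
spaced = np.linspace(1, 10, 19)   # 19 points, both end points included

rand = np.random.random(5)        # 5 random numbers in the range 0 to 1
scaled = 5 + 10 * rand            # shifted and stretched to the range 5 to 15

print(stepped)
print(spaced)
print(scaled)
```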
What is each dimension of an array called in NumPy?
An axis
How do you initialise a higher-dimension NumPy array?
You need to specify the data along each axis.
eg
my_2d_array = np.array(
[ [1, 2, 3],
[4, 5, 6] ]
)
This is a nested list. The items are the rows. Items at the same position (the same index) within each sub-list form the columns
eg
my_3d_array = np.array(
[ [ [1,2], [3,4], [5,6]],
[ [7,8], [9,10], [11,12] ] ]
)
How do you select an element from a 2D array?
We need to supply N indexes, equal to the number of axes (dimensions)
print(my_2d_array[0,0])
print(my_2d_array[0][0])
The first index is the row, the second is the column.
How do you select a whole row or column from a 2D array?
Get the first row, all columns
my_2d_array[0,:]
Get all rows, the first column
my_2d_array[:,0]
Get all rows, last two columns
my_2d_array[:,1:3]
The slice of the array is itself returned as an array.
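The element, row and column selections above can be tried on the small 2D array from the earlier card. The variable names on the left are illustrative.

```python
import numpy as np

my_2d_array = np.array([[1, 2, 3],
                        [4, 5, 6]])

element = my_2d_array[0, 0]        # first row, first column
first_row = my_2d_array[0, :]      # first row, all columns
first_col = my_2d_array[:, 0]      # all rows, first column
last_two = my_2d_array[:, 1:3]     # all rows, last two columns

print(element, first_row, first_col)
print(last_two, last_two.shape)
```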
How do you determine how many dimensions an array has?
Determine the dimensions an array has AND the size of each dimension
my_array.shape
What happens if you apply a simple expression to an array? eg array * 2
We can do this; the operation is applied to every element in the array. Here we are multiplying the array by a scalar (constant value).
Can do add, sub, mult, div
What happens if we multiply two arrays together?
We can add, sub, mult, div one array with another - but the behaviour is different to scalar mathematic expressions.
Each element of the array will operate on the element in the same position in the other array.
eg my_array * my_array - the array is squared
If arrays are different shapes/sizes you can get errors or unexpected behaviours.
How can you combine arrays?
Combine array with shape (m,n) with:
- Array with shape (1, n)
- Array with shape (m, 1)
ie a 1D array with the same number of rows or columns as the data.
When you add/multiply, it will repeat the 1D array as many times as needed, in order to match the rows/columns in your data.
The new array is “broadcast” to the shape of your data.
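A small sketch of broadcasting with m = 2 and n = 3. The numbers are arbitrary and chosen so the repetition is easy to see.

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])          # shape (2, 3)

row = np.array([[10, 20, 30]])        # shape (1, 3): repeated for each row
col = np.array([[100], [200]])        # shape (2, 1): repeated for each column

added = data + row
scaled = data * col

print(added)
print(scaled)
```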
What is masking?
Masking is the term used for selecting entries in arrays, e.g. depending on its content.
We can apply that mask to our data to retrieve a new array that only contains masked values.
We can specify conditions and return sub-sets of the array.
What is a mask for getting even numbers?
even_numbers = (my_array %2 == 0)
my_array[even_numbers]
Testing each element for the condition
What does (my_array %2 == 0) return?
The conditional statement returns an array of Boolean True/False, with the same shape as the array.
This can be used as a mask to pick out only the array elements where the condition is true.
We mask arrays using square bracket notation, similar to slicing.
How do you apply multiple masks at the same time?
Using &
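A sketch of building and applying masks, including two conditions joined with &. Each condition needs its own brackets when combined. The array and the second condition are assumptions for illustration.

```python
import numpy as np

my_array = np.arange(10)                       # 0 to 9

even_numbers = (my_array % 2 == 0)             # Boolean array, same shape
evens = my_array[even_numbers]                 # only the True positions

# two masks at once: even AND greater than 4
large_evens = my_array[(my_array % 2 == 0) & (my_array > 4)]

print(even_numbers)
print(evens, large_evens)
```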
Why do we often work with 2D arrays in data science?
They are good for holding tabular data.
How do you get pi in the Jupyter notebook?
import numpy as np
np.pi
How do you generate data to plot for a sin curve?
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
Combine the two arrays into a new array
data = np.column_stack([x,y])
How do you combine two arrays into a new array?
Using column_stack() or row_stack() functions (row_stack is an alias of vstack, which newer NumPy versions prefer)
Takes one argument - a list of arrays to stack
The arrays to stack must be the same length as each other
How can we change the shape of an array?
We transpose the array using .T
eg transposed_data = data.T
For more complicated manipulation we can use the shape and reshape methods
- data.shape to check the current shape
- rd = data.reshape(2,100) - instead of 100 rows and 2 columns, reshape to 2 rows and 100 columns
How do you transpose data?
data.T
The rows are now the columns
How do you reshape data?
reshaped_data = data.reshape(2,100)
The size of the new array (n rows * n columns) must match the original ie the product of the axis lengths is constant.
eg instead of 2 columns and 100 rows we can have 2 rows and 100 columns
OR three_d_data = data.reshape(2,50,2)
rows, columns, number of elements in each
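A sketch of transposing and reshaping the sin curve data from the earlier cards. One point worth checking for yourself: transposing and reshaping to the same shape do not put the values in the same order, which the last two lines illustrate.

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
data = np.column_stack([x, y])            # shape (100, 2)

transposed_data = data.T                  # shape (2, 100): rows become columns
reshaped_data = data.reshape(2, 100)      # same shape, values in reading order
three_d_data = data.reshape(2, 50, 2)     # 2 * 50 * 2 is still 200 values

print(data.shape, transposed_data.shape, reshaped_data.shape, three_d_data.shape)

# The first row of the transpose is x; the first row of the reshape is not
same_after_transpose = np.array_equal(transposed_data[0], x)
same_after_reshape = np.array_equal(reshaped_data[0], x)
print(same_after_transpose, same_after_reshape)
```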
How do you calculate the sum of elements of an array?
- OOP: data.sum()
- Procedural: np.sum(data)
How do you find the minimum and maximum of an array?
Method approach
data.min()
data.max()
How do you compute statistics on the slices of an array?
Possible because the slice is just another array.
eg mean of the first column
print(np.mean(data[:,0]))
How do you write an array to a file to save for later use?
Using the savetxt() function
Required arguments are the name of the file to save to (created if it does not exist, otherwise it will be overwritten by default) and the array to save. We can also specify the format of the data and the character to separate the data.
np.savetxt("name.csv", data, fmt="%.4f", delimiter=",")
How do you load data from a file?
loadtxt() or genfromtxt() functions
Required argument - file name.
You can also specify the delimiter and dtype to ensure desired behaviour
eg arr = np.genfromtxt("file.csv", delimiter=",", dtype="float")
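Saving and loading can be checked with a round trip. This is a sketch: the array values are made up and a temporary folder is assumed so that no real files are overwritten.

```python
import os
import tempfile

import numpy as np

data = np.array([[0.0, 1.0],
                 [0.5, 2.25]])

# temporary folder so the sketch leaves real files alone (an assumption)
path = os.path.join(tempfile.mkdtemp(), "name.csv")

# 4 decimal places, comma between values
np.savetxt(path, data, fmt="%.4f", delimiter=",")

arr = np.genfromtxt(path, delimiter=",", dtype="float")

print(arr)
```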
What is the standard plotting library in python?
Matplotlib
What is Matplotlib?
A comprehensive library for creating static, animated and interactive visualisations in Python.
It makes easy things easy and hard things possible.
How do you import the matplotlib module?
import matplotlib.pyplot as plt
What is pyplot?
A set of functions that can be used to create a figure with procedural programming.
For better control over plotting, it is recommended to use an OO approach with Matplotlib objects.
What are the fundamental objects used in Matplotlib?
Figures - the entire area of the figure
Axes - the area for plotting data
How do you create the axis and figure for a plot?
When should you do this?
fig, ax = plt.subplots()
Do this at the start of the plot
How do you obtain the size of the figure in pixels?
print(fig)
What is the default resolution of the figure in pixels?
The default resolution is 100 pixels per inch.
How can you specify the size of the figure?
Using the figsize argument
fig, ax = plt.subplots(figsize=(7,5))
Size in inches
What condition needs to be met to plot some simple lines?
The points along the lines can be given as a list of x and y coordinates which must be the same length.
What is the minimal code for plotting a line graph?
fig, ax = plt.subplots()
x = […]
y = […]
ax.plot(x,y)
ax.set_xlabel(“X”)
ax.set_ylabel(“Y”)
plt.show()
This is the OOP approach
How do you set labels on your plot?
ax.set_xlabel(“X”)
ax.set_ylabel(“Y”)
Procedural:
plt.xlabel(“X”)
plt.ylabel(“Y”)
How do you display the plot in Jupyter?
plt.show() is used in Python to display the plot, not always needed in Jupyter.
What differences are there between the OOP and procedural approaches in plotting?
Procedural approach - we call functions from pyplot
OOP approach - we call methods on the Axes object. These often start with set_ (eg ax.set_xlabel), whereas pyplot functions often do not (eg plt.xlabel).
In procedural, we don’t tell pyplot which axes to plot the data on, it infers which axes to use (the most recent one).
What kind of objects is matplotlib built to handle?
Numpy arrays
How do you plot two columns?
ax.plot(data[:,0],data[:,1])
What ways can you customise a plot?
- Changing the units
- Changing the upper and lower limits on the axes
- Changing the axes tick marks
- Adding another curve to a figure using the legends
- Changing line styles and colours
- Add arbitrary text labels
- Add a title
How can you customise the units of a plot?
Apply conversion in ax.plot. Can do numpy operations directly in the .plot as long as it produces another array.
eg ax.plot(data[:,0]/np.pi*180, data[:,1])
What is the conversion between degrees and radians?
degrees = radians * 180/pi
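As a quick check of the conversion, a small sketch comparing the manual calculation used in the plotting example with NumPy's built-in np.degrees. The test angles are chosen for illustration.

```python
import numpy as np

radians = np.array([0.0, np.pi / 2, np.pi, 2 * np.pi])

degrees_manual = radians / np.pi * 180   # same as multiplying by 180/pi
degrees_numpy = np.degrees(radians)      # NumPy's built-in conversion

print(degrees_manual)
print(degrees_numpy)
```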
How can we change the upper and lower limits on the axis?
ax.set_xlim()
eg ax.set_xlim(0,360)
How can we change the axes tick marks?
ax.set_xticks([…])
ax.set_yticks([…])
Pass in a list of the tick marks you want.
How do you add a second curve to a plot?
Call ax.plot() twice on the same axes, once for each curve.
How do you distinguish between two curves on the same plot?
Add a legend
ax.legend()
How do you change the location of a legend?
Using the loc keyword
- ‘lower’, ‘center’, ‘upper’ for vertical placement
- ‘left’, ‘center’, ‘right’ for horizontal placement
eg ax.legend(loc = ‘upper center’)
How do you add a box to the legend?
frameon=True
eg ax.legend(loc = ‘upper center’, frameon=True)
How do you alter the thickness and style of a line?
Add linestyle='-' and linewidth=2 to ax.plot
Available line styles include '-' (solid), '--' (dashed), ':' (dotted), '-.' (dash-dot)
eg ax.plot(data[:,0]/np.pi*180, cosine, label='cos(x)', color='deeppink', linestyle='--', linewidth=2)
How do you change the colour of a plotted line?
Add color=”” to ax.plot
How do you add an arbitrary text label to a plot?
ax.text(120, 1, “Maximum”, fontsize=20)
Providing the coordinates where you want to write the text and the string you want to put in.
How can you customise font size?
Using the fontsize keyword, eg fontsize=20
How do you add a title to a plot?
ax.set_title(“Title”)
How do we display multiple axes on the same figure?
This means showing different information on different panels of a single figure.
Using the plt.subplots() function we can specify how many axes in the vertical direction with nrows and the horizontal direction with ncols.
fig, axes = plt.subplots(figsize=(8,8), nrows=2, ncols=1)
ax1 = axes[0]
ax2 = axes[1]
now we access using ax1 and ax2 etc.
What is the keyword to generate a line graph?
ax.plot()
How do you create a scatter plot?
ax.scatter(x, y, marker=”o”)
How do you plot a scatter plot with error bars?
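plt.errorbar(x, y, xerr=xerr, yerr=yerr, fmt="o", color="r")
Pass xerr and yerr as keywords - if given by position, the third argument is read as yerr and the fourth as xerr.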
In what ways can we customise a scatter plot?
Shape and colour of the points (markers), eg for errorbar
Outline:
- ‘.’ : point
- ‘+’, ‘x’ : crosses
Filled:
- ‘o’ : circle
- ‘s’ : square
- ‘^’, ‘<’, ‘>’, ‘v’ : triangles in different directions
- ‘d’, ‘D’; ‘p’, ‘P’; ‘h’, ‘H’ : different types of diamond, pentagon or hexagon
- ‘*’ : star
Line plots:
- '-', '--', ':' etc
fmt='s'
color='gold'
markersize=6
markeredgewidth=2
markeredgecolor='k'
ecolor='k'
How do you control the shape of the errorbar plot?
fmt (format)
How do you create a histogram?
ax.hist(x)
Specifying the bins:
ax.hist(x, bins=20)
What is the default number of bins if not specified?
10 bins (of equal width)
What should you consider when choosing the size of your bin?
With finer bins, we can see more detail in the distribution.
But if we use too many bins we can overdo it and end up with lots of misleading gaps.
How can you further customise a histogram?
- Changing colour
- Changing from a filled histogram to an outline
- Normalise the histogram to plot the probability density rather than total frequency
ax.hist(heights, bins=20, color='teal', histtype='step', density=True)
How do we get the values of the bin edges and the numbers in each bin?
The hist function returns these already.
counts = ax.hist(x)
numbers in each bin - counts[0]
boundary edges - counts[1]
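A small sketch of unpacking what ax.hist returns. The data values are made up, and the Agg backend line is only an assumption so the example runs without a display (it is not needed in Jupyter).

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])   # hypothetical data

fig, ax = plt.subplots()
# hist returns three things: the counts, the bin edges and the drawn bars
counts, edges, patches = ax.hist(x, bins=4)

print(counts)   # number of values in each bin
print(edges)    # bin boundaries: always one more edge than there are bins
```

Unpacking into named variables does the same job as counts[0] and counts[1], but makes it clearer which array is which.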
What kind of plots are useful for categorical data?
Bar charts and pie charts
How do you create a bar chart?
ax.bar(categories, counts, color=bar_colors)
These are all lists to be passed in
How do you create a pie chart?
ax.pie(counts, labels=categories, colors=bar_colors, autopct=’%d’)
Include optional argument, auto percent to print the percentages - d means it prints as an integer
How do you display image data in Matplotlib?
Matplotlib has an easy way to make plots using images (eg a picture or photograph)
Data must be provided as a 2D NumPy array. Matplotlib will display the array as a grid of pixels, with the intensity of each pixel determined by the value of the array at that position.
image = np.genfromtxt("pixels.txt")
fig, ax = plt.subplots(figsize=(8,8))
ax.imshow(image, origin=’lower’, cmap=’Greys_r’, vmin=0, vmax=300)
- Origin determines which way up it will be printed
- cmap - the colour map used to display the pixel values
- Vmin and max are saturation points (less than 0 = fully black, above 300 = fully white, important for contrast)
What do you do if you don’t want to show any tick marks on a figure, eg for an image?
ax.set_xticks([])
ax.set_yticks([])
How do you “zoom in” on an important part of an image? How do you add a circle to highlight this?
Using array slicing to zoom in
ax.imshow(image[80:220,80:220], origin=’lower’, cmap=’Greys_r’, vmin=0, vmax=1000)
Highlighting key features
ax.scatter(70,70,marker=’o’,s=10000,c=’None’,edgecolors=’r’,label=’Supernova’)
How do you save a plot?
Reduce whitespace around your figure:
plt.tight_layout(pad=0.5)
Save your plot:
plt.savefig(‘image.png’)
What is Pandas?
Pandas builds on NumPy and introduces a new object called a data frame (or a series if one-dimensional)
What is the difference between a dataframe and pandas series?
Pandas series is one-dimensional (more similar to a list rather than a tabular structure)
What advantages do data frames provide for data science?
- A data frame looks like a table or spreadsheet, with convenient column and row labels
- A data frame includes methods for sorting, filtering and performing complex operations on data
- Columns can be of different data types (unlike an array)
- Provides some of the functionality of an array
How do you load a dataframe from a file?
import pandas as pd
df = pd.read_csv(“data.csv”)
When we load the data into Pandas, the first row is assumed to be the column headings. If we wanted to we could override this behaviour by providing a list of column names to an optional keyword, names=.
How do you import pandas?
import pandas as pd
How do you determine the number of rows in a data frame?
Length
len(df)
How do you examine the first few rows of a dataframe?
df.head(x) - where x is the number of rows to display
If you want to display the dataframe, what could you do?
print(df) but this isn’t very nice
can call df directly but this must be the last command in the cell
How are rows indexed?
The rows are given numerical indices by default.
Sometimes one of the columns in the data is already a convenient index. We can assign this as the index
df = pd.read_csv(‘titanic.csv’, index_col=’PassengerId’)
How do you assign an index within the read_csv() function?
index_col=”Column Name”
What should the first step of any data analysis be?
Clean up the data set
- removing unwanted data, missing values or duplicates
How do you drop a column?
df = df.drop(columns=["X"])
you can drop multiple columns at once by providing a list of column labels
How might missing values be represented in the dataset?
NaN
Not a number value - usually this represents a missing value
Check for missing values
df.isna()
eg df.isna().head(6)
How do we remove rows with NaN values?
df.dropna()
eg
df = df.dropna(subset=[‘Age’])
If we don’t specify a subset of columns to use, it will remove all rows that have a NaN in any column.
How can we remove any rows that appear more than once in the data set?
df.drop_duplicates()
df.drop_duplicates(subset=’Ticket’)
How do we slice data from a pandas data frame?
Using .loc and .iloc
NB: they use square brackets like in array indexing
loc gets rows (and/or columns) with particular labels.
iloc gets rows (and/or columns) at integer locations.
How do we get a single column from a data frame?
df[‘Age’]
Slicing syntax
How do we get the first row of a data frame?
df.iloc[0]
As this is only 1D it displays a series.
How do we get rows 100-110 of a dataframe?
df.iloc[100:110]
How do we get rows 100-110 and the first four columns of a dataframe?
df.iloc[100:110, :4]
This will return 5 columns in total - the index column and then the first 4 columns
Why might loc() be more useful than iloc()?
We may not know the specific index to search for but we do know the column title.
How do we return only the “Name” column?
df.loc[:,’Name’]
How do we retrieve the name, sex and age columns of the first 10 passengers?
df.loc[:10, ‘Name’:’Age’]
df.loc[:10, ['Name', 'Sex', 'Age']] - the same columns, given as a list
NB: you can retrieve non-consecutive rows/columns by providing a list. Therefore df.loc is very flexible.
How do you use loc or iloc to return only the rows or columns where a certain condition is met?
Masking - provide an array of T/F to loc.
eg df.loc[(df[‘Pclass’]==1)]
How do you check whether values in a data frame column are in a list of possible values?
Using the .isin() method
eg
df.loc[df[‘Pclass’].isin([1,2])]
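A minimal sketch of masking and isin() on a tiny made-up table. The names and values are hypothetical, only the column names follow the Titanic examples above.

```python
import pandas as pd

# Hypothetical miniature version of the Titanic table
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cara", "Dev"],
     "Pclass": [1, 3, 2, 1],
     "Age": [29.0, 40.0, None, 35.0]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

mask = df["Pclass"] == 1               # boolean series, one True/False per row
first_class = df.loc[mask]             # keeps only the rows where mask is True
upper_classes = df.loc[df["Pclass"].isin([1, 2])]

# Conditions can be combined with & (and) or | (or); each needs its own brackets
older_first = df.loc[(df["Pclass"] == 1) & (df["Age"] > 30)]

print(first_class)
print(older_first)
```

A missing Age compares as False, so rows with NaN ages are silently left out of the last selection.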
How do you compute summary statistics for numerical columns in a data frame?
df[‘Age’].mean()
mean / min / max - these calculations automatically ignore NaN values
How do you sort values for a column in a data frame?
df[‘Age’].sort_values()
How do you sort entries in a data frame by a particular column?
df.sort_values(by=’Age’)
df.sort_values(by=’Age’, ascending=False)
How can we get the length of a dataframe / column?
len() python function - number of rows
df.size pandas property - number of cells ie rows by columns
What is the result of adding two columns together?
A 1D data frame (a series) - the original indices are still present.
To get only the values, we can access .values - this is a property not a method, so no parentheses.
How do we get the values of a 1D dataframe / pandas series?
df.values
No brackets, it is a property not a method.
How do we get the values of a column?
df[“Name”].values
How do you add columns together?
df.add() method
eg
relatives = df[‘SibSp’].add(df[‘Parch’], fill_value=0)
store it in a new column:
df[‘Relatives’] = df[‘SibSp’].add(df[‘Parch’], fill_value=0)
What is the advantage of using the df.add() method to add columns, rather than using the + operator?
If one of the columns has a missing value, the + operator returns NaN for that row. With add() you can specify what value to use in place of the NaN by including a fill_value.
This makes it safer than simple addition when the data contain NaN values.
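A small sketch comparing the + operator with add() and fill_value. The three rows are made up to cover the cases of no missing value, one missing value and two missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"SibSp": [1, 0, np.nan], "Parch": [0, np.nan, np.nan]})

plain = df["SibSp"] + df["Parch"]                   # NaN wherever either value is missing
safe = df["SibSp"].add(df["Parch"], fill_value=0)   # a single missing value is treated as 0

print(plain.tolist())
print(safe.tolist())
```

When both values in a row are missing, add() still returns NaN for that row, since there is nothing to add the fill value to.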
What operations can you use on columns so that NaN values can be handled appropriately?
df.add()
df.subtract()
df.multiply()
df.divide()
How can you plot data from a data frame?
Matplotlib can naturally understand data frames just like Numpy arrays. You can pass the columns directly to plotting commands.
eg
ax.hist(df[‘Age’], bins=30)
How do we apply functions to an entire column?
The df.apply() method
Define the function needed if required.
df[“Name”].apply(function_name)
Returns a series of the function’s results, one per row
Quicker = lambda functions, defined as a temporary function inside apply()
df[‘Name’].apply(lambda x: x.split(‘,’)[0])
How do you split a string?
my_string.split(‘,’)
How do you get the first/last part of a split string?
my_string.split(‘,’)[0]
my_string.split(‘,’)[-1]
What is a benefit of using a lambda function in apply()
It is defined as a temporary function and avoids using memory for a function that is used only once
How do you group data in a data frame?
df.groupby()
classes = df.groupby(‘Pclass’)
This returns a grouped (GroupBy) object.
Its .groups attribute is a dictionary, where the keys are the groups and each contains a list of row indexes.
print(classes.groups) to see the index of the rows belonging to each key.
How do you see the keys (ie the groups) from grouped data?
new_groups = df.groupby(‘Embarked’)
new_groups.groups.keys()
Why is grouping useful?
We can quickly calculate statistics separately on each of the different groups.
Allows us to investigate aggregated data for each group rather than working on the whole data frame directly.
We can do this with any column of our grouped data using square bracket notation.
How do you calculate summary statistics for a grouped data frame?
classes = df.groupby(‘Pclass’)
classes[‘Fare’].mean()
This returns a value for each of the group keys
How do you determine how many entries fall into each group?
Look at the size of each group - classes.size()
NB: for a dataframe object, size is a property (no parentheses) but for a grouped object it is a method, so it requires parentheses
How do you determine and rank by how many entries fall into each group?
classes.size().sort_values(ascending=False)
How do you create a dataframe for a specific group?
Use get_group()
first = classes.get_group(“Group Name”)
The argument “Group Name” should match one of the keys in classes.groups.keys().
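The grouping steps above can be sketched on a small made-up table. The fares are hypothetical, only the column names follow the Titanic examples.

```python
import pandas as pd

df = pd.DataFrame(
    {"Pclass": [1, 3, 3, 2, 1, 3],
     "Fare": [80.0, 8.0, 7.0, 13.0, 60.0, 9.0]}
)

classes = df.groupby("Pclass")

print(classes.groups)                                 # group key -> row labels
mean_fare = classes["Fare"].mean()                    # one value per group
sizes = classes.size().sort_values(ascending=False)   # biggest group first
third = classes.get_group(3)                          # data frame with only the Pclass 3 rows

print(mean_fare)
print(sizes)
```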
How do you make an array of 12 random integers from 40 to 100?
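data = np.random.randint(low=40, high=100, size=12)
NB: the high value is exclusive, so this gives integers from 40 to 99. Use high=101 to include 100.
This will make a 1D array of 12 elements, with shape (12,)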
How do you convert a Numpy array into a dataframe?
Use the pd.DataFrame function
df = pd.DataFrame(data)
where data is a NumPy array
The shape and content are preserved, but the rows and columns now have explicit labels. By default these are the NumPy row and column indices.
How do you retrieve a column from a data frame?
Using familiar square bracket notation with the name of the column
df[“Name”] or df[1]
or more explicitly (better for more complex selections)
df.loc[:, “Name”]
When creating a data frame from an array, rather than using default indices, how can we create memorable column headings or row indices?
use the index= and columns= keyword arguments
eg
df = pd.DataFrame(data, index=[‘Matt’,’Jonathan’,’Fiona’,’Deepak’], columns=[‘DSA8001’,’DSA8002’,’DSA8003’])
How do we create a dataframe directly from a dictionary?
data_dict = {
“Module 1”: {“Matt”:80, “John”:60},
“Module 2”: {“Matt”:70, “John”:63},
“Module 3”: {“Matt”:82, “John”:76},
}
df = pd.DataFrame(data_dict)
Outer keys define the column headings ie modules will be the columns
Each nested dictionary defines one column, and its keys define the row indices ie students will be the rows
Why do we not need to specify labels when creating a data frame directly from a dictionary?
Pandas will use the dictionary keys
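A quick sketch showing which keys of the nested dictionary become columns and which become row labels, reusing the module scores from the card above.

```python
import pandas as pd

data_dict = {
    "Module 1": {"Matt": 80, "John": 60},
    "Module 2": {"Matt": 70, "John": 63},
    "Module 3": {"Matt": 82, "John": 76},
}
df = pd.DataFrame(data_dict)

print(df)
print(list(df.columns))   # outer keys
print(list(df.index))     # inner keys
```

If the students should be the columns instead, df.T swaps the two.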
How can you add a column to an existing dataframe?
Insert method
df.insert(loc=1, column=”Name”, value=data)
The length of the array data needs to match the number of rows in the df.
How do you add a new row to a data frame?
Concat function - pd.concat()
Concatenates a new data frame to the end of the current one
new_df = pd.DataFrame(new_student, index=['New Student'], columns=['DSA8001','DSA8002','DSA8003','DSA8021'])
df = pd.concat([df, new_df])
May need to reshape the data to be added (using reshape) so that it forms a single row.
Concatenation only works well if the column labels match. Otherwise it will fill in the gaps with NaN and may convert existing data (NaN is not an integer, so integer columns may be converted to float).
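A minimal sketch of adding a row with pd.concat when the column labels do not fully match. The students and scores are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    [[80, 70], [60, 63]],
    index=["Matt", "John"],
    columns=["DSA8001", "DSA8002"],
)

# The new row has one module that the original table lacks
new_row = pd.DataFrame(
    [[75, 68, 90]],
    index=["New Student"],
    columns=["DSA8001", "DSA8002", "DSA8003"],
)

combined = pd.concat([df, new_row])

print(combined)
print(combined.dtypes)   # DSA8003 is float because it now holds NaN
```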
How do you save a data frame to a file?
to_csv() - typically we save as a CSV file using this built-in dataframe method
df.to_csv(“file_name.csv”)
What might happen if we repeatedly read, edit and save CSV files with Pandas?
When you open a data frame with read_csv(), it adds a numerical index column by default.
We may end up doing this repeatedly, adding another index column each time.
Best to specify a particular column to use for the row indices when reading in the CSV
df = pd.read_csv("file.csv", index_col=0)
How should you read in a data frame from a file?
df = pd.read_csv("file.csv", index_col=0)
Some columns contain JSON data, how is this formatted?
JSON is a string, formatted like a dictionary.
It is very flexible for dataframe columns that need to contain complex information.
How is complex information stored in a data frame column?
JSON data - can be stored as a dictionary
Before working with complex data stored in a column in JSON format, what do we need to do?
import json
How do you create a data frame with JSON in a column?
eg
df = pd.DataFrame(index=[‘Matt’,’Jonathan’,’Fiona’,’Deepak’], columns=[‘module_scores’])
What function is used to retrieve information from a data frame with JSON data?
json.loads(x)
eg getting the “DSA8002” column info.
df[‘module_scores’].apply(lambda x: json.loads(x)[‘DSA8002’])
What does json.loads do?
The json.loads() method can be used to parse a valid JSON string and convert it into a Python Dictionary
How do you get the value of a specific row and column from JSON data in a data frame?
Index the series returned from apply() like any other series
eg
df[‘module_scores’].apply(lambda x: json.loads(x)[‘DSA8002’]).loc[‘Matt’]
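A short sketch of storing JSON strings in a column and reading one key back out. The scores are made up, and json.dumps is used here only to build the example strings.

```python
import json

import pandas as pd

# Hypothetical scores, stored as one JSON string per student
df = pd.DataFrame(
    {"module_scores": [
        json.dumps({"DSA8001": 80, "DSA8002": 70}),
        json.dumps({"DSA8001": 60, "DSA8002": 63}),
    ]},
    index=["Matt", "Jonathan"],
)

# json.loads turns each string into a dictionary, which we then index by key
dsa8002 = df["module_scores"].apply(lambda x: json.loads(x)["DSA8002"])

print(dsa8002)
print(dsa8002.loc["Matt"])
```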
How do we convert a column to a date time dtype?
pd.to_datetime
df[‘datetime’] = pd.to_datetime(df[‘datetime’])
How do we check the data type of a column?
df[‘datetime’].dtypes
How do we extract hours/years etc. from a date time object? (from the timestamp)
df[‘hour’] = df[‘datetime’].dt.hour
or dt.year etc.
How do we calculate eg the total sales for each category in a data frame?
Group by and then sum
spend_by_hour = df.groupby(‘hour’)
spend_by_hour[‘total’].sum()
When would data frame merging be more useful?
If two data frames contain only some columns in common, it is often more useful to merge rather than concatenate.
What is the theory of merging two databases?
We find the columns in common between the two databases and return a set of rows with those columns.
What is merging a pandas data frame equivalent to?
JOIN statements in SQL.
What are JOIN statements in SQL equivalent to in pandas data frame?
Merging
What function is used to merge data frames?
pd.merge(df1, df2, on=”Column”, how=”left”)
What are the different ways data frames can be merged?
- Left join - keep everything in the left table and what’s in the right table if available
- Right join - keep everything in the right table and what’s in the left table if available
- Inner join - return only entries that are present in both tables
- Outer merge - returns all rows across both tables
What is the opposite of the inner join?
The outer join - returns everything across both tables
Which type of merge is most likely to have lots of NaNs?
The outer merge / full join in SQL
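The four merge types can be compared on two tiny made-up tables that share only some keys. The table contents are hypothetical.

```python
import pandas as pd

citizens = pd.DataFrame({"Number": [1, 2, 3], "City": ["Belfast", "Derry", "Newry"]})
welfare = pd.DataFrame({"Number": [2, 3, 4], "Benefit": [100, 150, 200]})

left = pd.merge(citizens, welfare, on="Number", how="left")
right = pd.merge(citizens, welfare, on="Number", how="right")
inner = pd.merge(citizens, welfare, on="Number", how="inner")
outer = pd.merge(citizens, welfare, on="Number", how="outer")

# Count the rows and missing values produced by each type of merge
for name, result in [("left", left), ("right", right), ("inner", inner), ("outer", outer)]:
    print(name, len(result), "rows,", int(result.isna().sum().sum()), "NaN values")
```

The outer merge keeps every key from both tables, which is why it produces the most NaN values.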
Are data frames static?
No, new data can be inserted by adding rows or columns
What is SQL?
Structured Query Language
Used as a tool to search relational databases. Can search, filter, group or combine databases to return entries matching certain criteria.
Most popular language to manage relational databases
What is a relational database?
- Data are stored as a table or tables with rows (records or tuples) and columns (attributes)
- Each record has a unique key
- Each table represents a particular type of data eg one table to store information on customers, another to store products
What is the advantage of SQL?
It is written closer to natural language, so queries can be constructed more intuitively.
What are the different data types in SQL?
Numeric - eg INTEGER, FLOAT(p) with p digits of precision
String types - CHARACTER (L) with fixed length L, or VARCHAR(L) with a maximum length L
DATE, TIME
BOOLEAN (True, False)
Why are we able to use SQL to perform queries on pandas data frames?
Pandas data frames are tables of rows (records) and columns (attributes), like the tables in a relational database.
Before performing SQL queries directly in Python/Pandas, what do we need to do?
import pandas as pd
import pandasql as ps
What is pandasql?
A handy python module to query pandas data frames
If you don’t have pandasql, what should you do?
!pip install pandasql
What is the general syntax for writing and executing an SQL command in the Jupyter notebook?
query = ‘’’
’’’
ps.sqldf(query)
What is a simple query to fetch all data?
query = ‘’’
SELECT *
FROM dataframe
‘’’
ps.sqldf(query)
What are SQL queries composed of?
Combination of “clauses” with the names of tables and/or columns
What clause returns entries of interest?
SELECT
What does the SELECT clause do?
Returns entries
What does the FROM clause do?
Tells SQL which table to select the columns from
Why are SQL clauses written in capital letters?
They are not case sensitive.
Writing in capitals helps differentiate them from the names of tables etc.
What is returned when we use PandaSQL?
A Pandas DataFrame - which is very convenient for further database operations.
How do you select a specific column from a database?
query = ‘’’
SELECT “Column Name1”, “Column Name2”
FROM dataframe
‘’’
ps.sqldf(query)
How do we return entries that have a specific attribute?
Use conditional searches using the WHERE clause.
query = ‘’’
SELECT *
FROM dataframe
WHERE city = “Belfast”
‘’’
ps.sqldf(query)
NB: single = sign, not ==
NB: Need to put string of interest in different quotation marks to overall string query
What conditions can we apply with WHERE?
- =, <, <=, >, >=
- BETWEEN X AND Y - number in a specified range
- IN (‘X’, ‘Y’) - values in a given list
- LIKE ‘%YZ%’ - value matches a given pattern YZ where % is used to represent free text before and/or after the pattern
How can we apply multiple condition at the same time in SQL?
Use the AND operator to combine conditions in the WHERE clause
How can we sort data by column values in SQL?
query = ‘’’
SELECT *
FROM dataframe
ORDER BY column DESC
‘’’
ps.sqldf(query)
Can specify ASC (ascending, the default) or DESC (descending)
How can we sort data by multiple column values?
query = ‘’’
SELECT *
FROM dataframe
ORDER BY column DESC, column2 ASC
‘’’
ps.sqldf(query)
The ordering is applied one after the other
How do we modify the SQL query so that we only return a small number of rows?
The LIMIT clause. This is similar to the head() function in Pandas.
query = ‘’’
SELECT *
FROM dataframe
LIMIT 5
‘’’
ps.sqldf(query)
What kind of data aggregation computing statistics are usually performed in SQL?
- COUNT - returns the number of records
- MIN, MAX - returns the smallest/largest entries in a column
- SUM - sum of entries in a column
- AVG - average of entries in a column
How do you find out eg how many women are in a database using SQL?
query = ‘’’
SELECT COUNT(*)
FROM dataframe
WHERE Gender = “Female”
‘’’
ps.sqldf(query)
NB: you would get the same answer whether you count the whole database or a single column, provided that column has no missing values (COUNT on a column ignores NULLs) - the number of rows will be the same either way.
How do we use simple expressions in SQL to return a calculation?
query = ‘’’
SELECT Number, Income/1000 AS [Income ($k)]
FROM dataframe
‘’’
ps.sqldf(query)
NB: be aware of automatic rounding here. Because the column and the divisor are both integers, SQL performs integer division and the decimal part is dropped. Dividing by 1000.0 instead returns a float.
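A small sketch of the integer division behaviour. It uses Python's built-in sqlite3 module rather than pandasql so that it runs without extra packages, assuming the SQLite backend that pandasql uses by default. The table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citizens (Number INTEGER, Income INTEGER)")
conn.executemany("INSERT INTO citizens VALUES (?, ?)", [(1, 52500), (2, 48999)])

query = '''
SELECT Number,
       Income/1000   AS [Income ($k)],
       Income/1000.0 AS [Income exact ($k)]
FROM citizens
ORDER BY Number
'''
rows = conn.execute(query).fetchall()
conn.close()

# Second value in each row is the integer division, third is the float division
print(rows)
```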
How do we name a created column in SQL?
Using the AS clause
It should come right after the column or expression of interest.
Need to include square brackets if you want the column name to have a space.
What clause allows us to return different values depending on the content of a column?
The CASE clause
query = ‘’’
SELECT Number, Gender, Age, City,
CASE
WHEN City=’New York City’ THEN ‘North’
WHEN City=’Dallas’ THEN ‘South’
END AS Region
FROM citizens
‘’’
ps.sqldf(query)
What does the CASE clause do?
Allows us to return different values depending on the content of a column.
The CASE and END start and end the logical criteria
AS specifies the column name to store the result
How do we calculate summary statistics based on a categorical column?
Using the GROUP BY clause
query = ‘’’
SELECT Age, COUNT(*)
FROM citizens
GROUP BY Age
‘’’
ps.sqldf(query)
When do we use GROUP BY in SQL queries?
To calculate aggregated statistics (counts, averages etc.).
You can’t display other columns from grouped tables, you just see the first record in each group.
Only the column used for grouping and aggregated statistics should be included with the SELECT clause.
How do you group by multiple columns?
Select the two columns you want to group by, and select the aggregate statistic for the third column.
The order typed is the order they are grouped by
query = ‘’’
SELECT Gender, Age, AVG(Income)
FROM citizens
GROUP BY Age, Gender
‘’’
ps.sqldf(query)
When combining WHERE and GROUP BY clauses, what order should they be stated in?
Order matters
Need to apply the WHERE clause before grouping the data, so that undesired rows are not included in the grouping stage.
What clause do you need to apply conditions to grouped data?
HAVING clause
How do you use the HAVING clause?
query = ‘’’
SELECT Gender, Age, AVG(Income)
FROM citizens
GROUP BY Gender, Age
HAVING Age > 30
‘’’
ps.sqldf(query)
“Group by gender and age, but show me only the ones having ages over 30”
Why is the HAVING clause necessary?
It allows us to perform more complex conditions, by applying criteria to the statistics of each group.
Eg you only want groups with an average income over X. We need to do the grouping first before we can apply the condition.
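A sketch of WHERE and HAVING working together: WHERE filters rows before grouping, HAVING filters the groups afterwards. It uses the built-in sqlite3 module instead of pandasql (an assumption, to keep the example self-contained) and a made-up citizens table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citizens (Gender TEXT, Age INTEGER, Income INTEGER)")
conn.executemany(
    "INSERT INTO citizens VALUES (?, ?, ?)",
    [("Female", 25, 30000), ("Female", 25, 50000),
     ("Male", 40, 70000), ("Male", 40, 90000),
     ("Female", 60, 20000)],
)

query = '''
SELECT Age, AVG(Income) AS mean_income
FROM citizens
WHERE Gender = 'Female'
GROUP BY Age
HAVING AVG(Income) > 30000
'''
rows = conn.execute(query).fetchall()
conn.close()

# Only the age groups of women whose average income is above 30000 remain
print(rows)
```

The condition on AVG(Income) could not go in the WHERE clause, because the average does not exist until the rows have been grouped.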
What is the operation to join two tables in SQL?
JOIN
The first table is always the left table
The second table is always the right table
Joined based on a common column
If you do not specify the JOIN type in an SQL query, which type of JOIN is automatically performed?
Inner join
ie only returns records present in both tables.
What information do you need to provide in the SQL query when performing a JOIN?
You must provide the column to use for the join, otherwise the entire right table will be repeated for every record in the left table.
Specify columns to join on using ON
ON citizens.Number = welfare.IDNum
It is best to specify which table the column is coming from, to avoid any ambiguity if there is a column with the same name in each table.
How do you write a JOIN query in SQL?
query = ‘’’
SELECT *
FROM citizens
JOIN welfare
ON citizens.Number = welfare.IDNum
‘’’
ps.sqldf(query)
What can be handy to do in complex queries?
Give each table a shorthand name,
query = ‘’’
SELECT *
FROM citizens c
JOIN welfare w
ON c.Number = w.IDNum
‘’’
ps.sqldf(query)
When is an SQL sub-query used?
When it takes more than one query to get what we want.
Avoids hardcoding, which is not very efficient.
How do you specify a sub-query?
Specified using round brackets. The table or value the sub-query returns feeds directly into the overall query.
query = ‘’’
SELECT *
FROM citizens
WHERE Income >
(SELECT MAX(Income)/2
FROM citizens)
‘’’
ps.sqldf(query)
When might you need to use a sub query?
When you need to compare a column against a list of values.
Eg finding all citizens belonging to age groups with average incomes over 55000
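The age-group example can be sketched with a sub-query that returns a list of ages, which the outer query then matches with IN. This uses the built-in sqlite3 module rather than pandasql (an assumption, to keep it self-contained), and the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citizens (Number INTEGER, Age INTEGER, Income INTEGER)")
conn.executemany(
    "INSERT INTO citizens VALUES (?, ?, ?)",
    [(1, 30, 50000), (2, 30, 70000), (3, 45, 40000), (4, 45, 50000), (5, 60, 58000)],
)

query = '''
SELECT Number, Age, Income
FROM citizens
WHERE Age IN
    (SELECT Age
     FROM citizens
     GROUP BY Age
     HAVING AVG(Income) > 55000)
ORDER BY Number
'''
rows = conn.execute(query).fetchall()
conn.close()

# Citizens aged 30 and 60 are returned; the 45 group averages only 45000
print(rows)
```

Citizen 1 earns less than 55000 but is still returned, because the condition applies to the average of the age group and not to the individual.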
What function is used to extract information from a character string containing JSON?
JSON_EXTRACT
JSON_EXTRACT(table.column, “$[x].key”) AS new_column
[x] specifies which element of the JSON list we want to extract (0 is the first, 1 is the second etc.)
The key we want is optional
What is the syntax for an SQL query using JSON_EXTRACT?
query = ‘’’
SELECT title, JSON_EXTRACT(credits.cast, “$[0].name”) AS starring
FROM credits
‘’’
ps.sqldf(query)
If data has a very large dynamic range, what is it good to do?
Look at the logarithm of the data (convert to powers of 10)
How do you convert an axis to a logarithmic scale?
using np.log10() function
create new columns with the logarithmic conversions and then plot these.
OR
plot directly with
plt.xscale(‘log’)
How can you identify trends?
Looking for the slope of the relationship.
Using np.polyfit() to fit a polynomial to data.
How do you use the polyfit function?
It takes an x array, y array and a degree (first = linear, second = quadratic etc.)
f = np.polyfit(x=df[‘log_galaxy_mass’], y=df[‘log_bh_mass’], deg=1)
f[0] is the slope
f[1] is the intercept
We can then plot this using dummy data
x_arr = np.arange(x1,x2,0.1)
y_mod = f[0] * x_arr + f[1]
plt.plot(x_arr, y_mod, color=’r’)
How do you quantify the goodness of fit of the polynomial fitted?
Calculating the Mean Squared Error - the average squared difference between model and data
MSE = mean((data - model)**2)
Calculate the predicted y values
linear_prediction = f[0] * df[‘log_galaxy_mass’].values + f[1]
Calculate MSE
mse = np.mean( (linear_prediction - df[‘log_bh_mass’])**2 )
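A minimal end-to-end sketch of fitting a straight line and computing the MSE. The five data points are made up to lie close to y = 2x + 1, they are not the galaxy data from the lecture.

```python
import numpy as np

# Hypothetical data lying close to the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

f = np.polyfit(x, y, deg=1)            # f[0] is the slope, f[1] the intercept
prediction = f[0] * x + f[1]           # model values at the data points
mse = np.mean((prediction - y) ** 2)   # average squared difference

print(f)
print(mse)
```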
When is the MSE good for a data set?
When it is small compared with the spread of the data, eg an MSE of around 1 or less when the range of the y axis is several times bigger than 1 - so the fit is doing better than random
What is SciPy?
A Python scientific module which provides algorithms for many mathematical problems.
We use it for correlation in this module (does bigger x really mean bigger y)
How is correlation determined?
Using a statistic called Spearman’s Rank Correlation coefficient
from scipy.stats import spearmanr
print(spearmanr(df[‘log_galaxy_mass’], df[‘log_bh_mass’]))
How do you convert a list of strings to a single string?
" ".join(x)
How do you remove an item from a list?
list.remove(item)
How do you insert an item into a list?
list.insert(position, item)
How do you add something to the end of a list?
list.append(item)
How do you investigate the names of the keys in a dictionary?
dict.keys()
What is XeY shorthand for?
XeY is short-hand in Python for “X times 10 to the power of Y”
What does 1e6 represent?
1 x 1 000 000
How do you write a dictionary to a file?
Within the with open as - json.dump(data, f)
How do you read in a dictionary within a file?
Within the with open as - json.load(f)
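A short round-trip sketch of writing a dictionary to a file and reading it back. The dictionary, the file name and the temporary folder are assumptions for illustration.

```python
import json
import os
import tempfile

data = {"Matt": 80, "John": 60}   # hypothetical dictionary
path = os.path.join(tempfile.mkdtemp(), "scores.json")

with open(path, "w") as f:
    json.dump(data, f)            # dump writes to a file; dumps would return a string

with open(path) as f:
    loaded = json.load(f)         # load reads from a file; loads would parse a string

print(loaded == data)
```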
How do you check in a condition that a variable is of a certain type?
isinstance(string1, str)
What syntax for raising an exception can be used in a function?
def paint(self, colour):
    try:
        if isinstance(colour, str):
            self.colour = colour
        else:
            raise TypeError("Paint should be provided as a string")
    except TypeError:
        print(TypeError, "- the colour remains", self.colour)
How can you remove an item from a list?
del list[0]
How does python store a list?
In a simplified sense, the list itself is stored somewhere in your computer memory, and the variable x stores the address of that list, ie where the list is in memory.
This means that x does not actually contain all the list elements, rather it contains a reference to the list.
How can you create a new list from an original list, so that it is passed by value rather than reference?
y = list(x) - this is a more explicit copy of the list
rather than y = x
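A small sketch of the difference between assigning a list and copying it, using a made-up list.

```python
x = [1, 2, 3]
y = x            # y refers to the same list as x
z = list(x)      # z is a new list with the same elements

y.append(4)      # the change is also visible through x

print(x, y, z)
print(y is x, z is x)
```

Note that list(x) only copies the outer list. If x contained other lists, those inner lists would still be shared.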
How do you find the maximum value of a list?
max(list)
How can you round a value?
Round function
round(value, precision)
How can you look at python documentation?
help(function_name)
How do you find the length of a list?
len(list)
How do you sort a list?
list.sort(reverse=False) - sorts the list in place
sorted(list, reverse=False) - returns a new sorted list
How do you get the index of a specific item in a list?
list.index(item)
How do you count the number of time an element appears in a list/string?
list.count(element)
How do you capitalise the first letter of a string?
string.capitalize()
How do you replace part of a string with a different part?
string.replace("x", "y")
How do you convert an entire string to all caps?
string.upper()
How do you reverse the order of a list?
list.reverse() - this changes the list it is called on
What is the NumPy array an alternative to?
The Python list
How do you create a NumPy array from a list?
np_array = np.array(list)
Assumes the list contains elements of the same type
In a NumPy array, how are True and False treated?
As 1 and 0
How do you investigate the size of a numpy array?
array.shape
How can you subset a single element from a 2D NumPy array?
array[0][2]
or array[0,2]
How can you get the mean of a column of a 2D NumPy array?
np.mean(dataset[:,0])
How can you check if two columns of a 2D NumPy array are correlated?
np.corrcoef(dataset[:,0], dataset[:,1])
correlation coefficient
How can you calculate the standard deviation of a NumPy array column?
np.std(dataset[:,0])
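The column statistics above can be tried out together. This is a sketch on a tiny made-up dataset (column 0 = height in metres, column 1 = weight in kg), not real data.

```python
import numpy as np

# Hypothetical 2D array: each row is a person, columns are height and weight.
dataset = np.array([[1.60, 55.0],
                    [1.70, 65.0],
                    [1.80, 75.0],
                    [1.90, 85.0]])

mean_height = np.mean(dataset[:, 0])   # mean of the first column
std_height = np.std(dataset[:, 0])     # standard deviation of the first column

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation coefficient between the two columns.
corr = np.corrcoef(dataset[:, 0], dataset[:, 1])

print(mean_height)
print(std_height)
print(corr[0, 1])
```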
How do you generate random data points from a normal distribution?
data = np.round(np.random.normal(1.75, 0.2, 5000), 2)
mean = 1.75, std = 0.2, 5000 samples
What package do we use for data visualisation?
Matplotlib
import matplotlib.pyplot as plt
When is it appropriate to plot a line graph?
When time is on the x-axis
In a scatter plot, how do you set the size of the points?
s=numpy array
How can you add grid lines to your plot?
plt.grid(True)
How do you look at the keys of a dictionary?
dict.keys()
What type of values can dictionary keys be?
immutable objects
What are examples of immutable object types that can be used as dictionary keys?
Strings, Booleans, integers and floats
How can you check if a key is already in a dictionary?
“key” in dictionary - see if it returns True or False
How can you delete a value from a dictionary?
del(dictionary[“key”])
How can you manually check if two arrays are compatible for broadcasting?
np.broadcast_to(array, shape) - raises a ValueError if the array cannot be broadcast to that shape
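A sketch of such a manual check, assuming made-up arrays a, b and c. np.broadcast_to raises a ValueError when the shapes are incompatible, so a try/except turns it into a True/False answer.

```python
import numpy as np

a = np.ones((3, 4))
b = np.arange(4)   # shape (4,)
c = np.arange(3)   # shape (3,)

# b can be stretched to (3, 4) because its last dimension matches.
stretched = np.broadcast_to(b, a.shape)
print(stretched.shape)

# c cannot: the trailing dimensions (3 and 4) differ and neither is 1.
try:
    np.broadcast_to(c, a.shape)
    compatible = True
except ValueError:
    compatible = False
print(compatible)
```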
How do you find the maximum value of a numpy array?
np.max(array)
How do you find the index of the maximum value of a numpy array?
np.argmax(array)
How do you transform all values in a numpy array to positive?
np.absolute(array)
How do you find the base 10 logarithm of 1000?
np.log10(1000)
How do you find the exponential of 1?
np.exp(1)
What kinds of mathematical functions can you access with numpy?
np.sin(x)
np.cos(x)
np.pi
How do you count the number of occurrences of eg a City in a database?
Group by city then find the size
eg home_team = matches.groupby(“Home Team Name”).size()
How do you make a dataframe from a dictionary and change the names of the indexes?
pd.DataFrame(dictionary)
Indexes automatically given
df.index = list_of_strings
How do you select a column from a data frame and keep it in a data frame (rather than a pandas series)?
Use double square brackets
df[[“column”]]
How do you select multiple columns from a data frame by name?
df[[“column1”, “column2”]]
OR
df.loc[:, [“column1”, “column2”]]
To carry out the slicing function my_array[rows, columns] on pandas data frames, what do we need?
loc and iloc
How can you only select certain columns and certain rows of a data frame?
df.loc[["row1", "row2"], ["col1", "col2"]]
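A short sketch of selecting rows and columns at once, on a made-up data frame. Plain square brackets cannot take a row list and a column list together, so .loc (labels) or .iloc (integer positions) is used.

```python
import pandas as pd

# Hypothetical data frame with string row labels.
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]},
    index=["row1", "row2", "row3"],
)

by_label = df.loc[["row1", "row2"], ["col1", "col2"]]   # select by label
by_position = df.iloc[[0, 1], [0, 1]]                   # select by position

print(by_label)
print(by_label.equals(by_position))
```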
How do you apply multiple logical operators to a NumPy array / pandas series?
np.logical_and(array > 1, array < 5)
array[np.logical_and(array > 1, array < 5)]
np.logical_and()
np.logical_or()
np.logical_not()
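A quick sketch with a made-up array, showing that the logical functions return a Boolean mask which can then be used to subset the array.

```python
import numpy as np

array = np.array([0, 2, 4, 6, 8])

# Boolean mask: True where both conditions hold.
mask = np.logical_and(array > 1, array < 5)
print(mask)

# Use the mask to keep only the matching elements.
print(array[mask])

# logical_or keeps elements that satisfy at least one condition.
either = array[np.logical_or(array < 1, array > 7)]
print(either)
```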
How do you write a for loop to include access to the index?
for index, var in enumerate(seq):
    expression
How do you loop over a dictionary to access both key and value?
for key, value in dictionary.items():
    expression
How do you loop over an array to get each element?
To get every element of an array, you can use a NumPy function called nditer (ND iter)
for val in np.nditer(array):
    print(val)
When looping over a dataframe - what does the following print out?
for val in dataframe:
    print(val)
Prints out the column names
How do you iterate over the rows of a data frame?
In pandas, you need to explicitly say that you want to iterate over the rows.
Each iteration yields the row label and the row's data (as a pandas Series).
for label, row in dataframe.iterrows():
    print(label)
    print(row)
    dataframe.loc[label, "country_name_length"] = len(row["country"])
Can also select a specific column, eg print(row["column_name"]), or (as shown) create a new column
But this is inefficient - use .apply
eg dataframe["country_name_length"] = dataframe["country"].apply(len)
How can you create a column that contains a calculation based on another column?
Use .apply(function)
eg dataframe["country_name_length"] = dataframe["country"].apply(len)
What does .apply() do?
Allows you to apply a function on a particular column in an element-wise fashion.
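A sketch comparing the row-by-row loop with .apply on a tiny made-up data frame. The column names length_loop and length_apply are invented here just to hold the two results side by side.

```python
import pandas as pd

# Hypothetical data frame with one text column.
dataframe = pd.DataFrame({"country": ["Brazil", "India", "Peru"]})

# Row-by-row version: works, but slow on large data frames.
for label, row in dataframe.iterrows():
    dataframe.loc[label, "length_loop"] = len(row["country"])

# .apply version: calls len on every element of the column.
dataframe["length_apply"] = dataframe["country"].apply(len)

print(dataframe)
```

Both columns hold the same lengths; the .apply version avoids building a Series for every row.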
How do you generate random numbers, ensuring reproducibility?
Using a seed - generate pseudo random numbers
np.random.seed(123)
np.random.rand()
How do you randomly generate a 0 or 1?
np.random.randint(0,2)
This simulates a coin toss
How can you simulate a dice throw?
np.random.randint(1,7)
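A sketch showing why the seed matters: resetting the same seed replays the same sequence of "random" dice throws. The seed value 123 is arbitrary.

```python
import numpy as np

# Five dice throws after setting the seed.
np.random.seed(123)
first = [np.random.randint(1, 7) for _ in range(5)]

# Resetting the same seed reproduces exactly the same throws.
np.random.seed(123)
second = [np.random.randint(1, 7) for _ in range(5)]

print(first == second)

# A single coin toss: the upper bound 2 is excluded, so the result is 0 or 1.
coin = np.random.randint(0, 2)
print(coin)
```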
In functions with eg subtracting, how can you account for the fact you can’t have a negative number?
x = max(0, calculated_value)
this ensures x never goes below zero
How do you transpose a 2D NumPy array?
np.transpose(array)
How do you add a description of a defined function?
Use of docstrings - placed inside triple double quotation marks
def function(parameters):
    """Description of what the function does."""
How do you change the value of a global parameter inside a function?
use keyword global
global name
In a nested function, how can you change the value in an enclosing scope?
nonlocal keyword
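A minimal sketch of both keywords; the names counter, increment, outer and inner are made up for illustration.

```python
counter = 0  # global variable


def increment():
    """Change the global variable from inside a function."""
    global counter
    counter += 1


def outer():
    """Change a variable in the enclosing scope from a nested function."""
    total = 0

    def inner():
        nonlocal total
        total += 5

    inner()
    return total


increment()
result = outer()
print(counter, result)
```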
How do you allow for passing multiple arguments into a function?
*args
How do you allow for passing multiple keyword arguments into a function?
**kwargs
This turns the identifier keyword-pairs into a dictionary within the function body
Then, within the function body, we print all the key value pairs stored in the dictionary kwargs
for key, value in kwargs.items():
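A sketch of both forms, using the made-up functions add_all and describe. Inside the function, args is a tuple and kwargs is a dictionary.

```python
def add_all(*args):
    """Sum any number of positional arguments (args is a tuple)."""
    total = 0
    for number in args:
        total += number
    return total


def describe(**kwargs):
    """Return 'key: value' strings (kwargs is a dictionary)."""
    return [f"{key}: {value}" for key, value in kwargs.items()]


print(add_all(1, 2, 3))
print(describe(name="Ada", status="active"))
```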
How do we apply a lambda function to all elements of a list?
How do we print results of this lambda function?
We need to use map() to apply the lambda function to all elements of the sequence
eg result = map(lambda x: x * 2, seq) - the sequence is passed as the second argument
It returns a map object; convert it to a list with list(result) to print the results
How can you filter out elements of a list which don’t meet certain criteria?
result = filter(lambda x: len(x) > 6, list)
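A sketch of map() and filter() with lambda functions on a made-up list of words. Both return lazy objects, so they are wrapped in list() before printing.

```python
# Hypothetical list of words.
words = ["python", "dataframe", "loop", "dictionary"]

# map applies the lambda to every element.
lengths = list(map(lambda x: len(x), words))

# filter keeps only the elements for which the lambda returns True.
long_words = list(filter(lambda x: len(x) > 6, words))

print(lengths)
print(long_words)
```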
What kind of error is thrown when an operation or function is applied to an object of an inappropriate type?
TypeError
When should we raise an error (instead of catching in an except)?
eg if we don’t want our function to work for a particular set of values - such as don’t want to square root negative numbers
using an if statement, we can raise a value error for cases in which the user passes the function a negative number
if x < 0:
    raise ValueError("X must be non-negative")
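A sketch of the same check inside a complete function. The name safe_sqrt is made up; the caller catches the error with try/except.

```python
import math


def safe_sqrt(x):
    """Return the square root of x, refusing negative input."""
    if x < 0:
        raise ValueError("X must be non-negative")
    return math.sqrt(x)


print(safe_sqrt(9))

# The caller decides how to handle the error.
try:
    safe_sqrt(-4)
except ValueError as error:
    message = str(error)
    print(message)
```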
In an SQL query, how do you count unique values?
COUNT (DISTINCT “column_name”)
How do you determine the number of rows in a data frame?
len(df)
How can you quickly inspect a data frame?
df.info()
What does df.describe() do?
The describe() method computes some summary statistics for numerical columns like mean and median
What are the components of a data frame that you can access?
df.values - a 2D NumPy array
df.columns - column labels
df.index - row labels
How can you sort a data frame by multiple column values?
df.sort_values([col1, col2], ascending=[True,False])
How do you select multiple columns from a data frame?
Need double square brackets
df[[“col1”, “col2”]]
How do you compare dates in a logical comparison?
The dates are in quotes, written as year, month then day
This is the international standard date format
How can you filter a dataframe on multiple options of a categorical variable?
Using .isin()
dogs["colour"].isin(["Black", "Brown"])
What method allows you to calculate custom summary statistics?
Aggregate .agg()
def function(column):
    return column.quantile(0.3)
df[“column”].agg(function)
Can be used on multiple columns - pass in list [“col1”,”col2”]
Agg itself can also take a list of functions to apply at the same time
Can use .agg for the IQR
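A sketch of using .agg for the IQR, assuming a made-up data frame and a custom function named iqr. The function receives a whole column and returns one number.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "weight": [10, 20, 30, 40, 50],
    "height": [1, 2, 3, 4, 5],
})


def iqr(column):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return column.quantile(0.75) - column.quantile(0.25)


single = df["weight"].agg(iqr)            # one column gives one number
several = df[["weight", "height"]].agg(iqr)  # several columns give one number each

print(single)
print(several)
```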
How can you calculate the cumulative sum of a column?
Calling .cumsum() on a column returns not just one number, but a number for each row of the data frame
df[“column”].cumsum()
Can also have .cummax(), .cummin(), .cumprod()
These all return an entire column of a dataframe, rather than a single number
When counting in a dataframe, how do you ensure you only count each “thing” once?
use .drop_duplicates()
eg df.drop_duplicates(subset=["col1", "col2"])
After subsetting, how can you count the number of values in a table?
To count the dogs of each breed, we subset the breed column and use the value_counts() method
Can do .value_counts(sort=True)
How can you turn counts into proportions of the total?
df[“column”].value_counts(normalize=True)
How can you calculate the mean weight of each colour of dog?
dogs.groupby(“colour”)[“weight”].mean()
What does the .agg method allow you to do?
Pass in multiple summary statistics at once to calculate
df[“column”].agg([np.min, np.max, np.sum])
What are pivot tables?
A way of calculating grouped summary statistics
.pivot_table()
df.pivot_table(values="col", index="colour")
- The values argument is the column that you want to summarise
- The index argument is the column that you want to group by
Automatically calculates the mean; if you want another statistic, use aggfunc
df.pivot_table(values="col", index="colour", aggfunc=np.median)
To group by more than one variable, pass in columns
df.pivot_table(values="col", index="colour", columns="breed", fill_value=0, margins=True)
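A sketch of pivot tables on a tiny made-up dogs data frame. The aggfunc is given here as the string "median", which pandas also accepts in place of a function.

```python
import pandas as pd

# Hypothetical data.
dogs = pd.DataFrame({
    "colour": ["Black", "Black", "Brown", "Brown"],
    "breed": ["Lab", "Poodle", "Lab", "Lab"],
    "weight": [30, 20, 28, 32],
})

# Mean weight per colour (mean is the default statistic).
mean_by_colour = dogs.pivot_table(values="weight", index="colour")

# Median weight per colour.
median_by_colour = dogs.pivot_table(values="weight", index="colour",
                                    aggfunc="median")

# Group by colour (rows) and breed (columns); empty combinations become 0.
by_colour_breed = dogs.pivot_table(values="weight", index="colour",
                                   columns="breed", fill_value=0)

print(mean_by_colour)
print(by_colour_breed)
```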
How do you set the index of a data frame?
df.set_index(“column”)
can include multiple columns df.set_index([“col1”, “col2”])
How do you reset the index of a dataframe?
df.reset_index()
to get rid of it completely df.reset_index(drop=True)
How can you subset a data frame with row labels?
.loc
df.loc[[item1, item2]]
How do you subset rows at the outer level of an index vs the inner level, when there are two indexes?
Outer - df.loc[[item1, item2]] - with a list
Inner - df.loc[[(outeritem1, inneritem1), (outeritem2, inneritem2)]] - with a list of tuples
How can you sort values by their index?
.sort_index()
For multiple indexes, it sorts all index levels from outer to inner, in ascending order, by default. You can control this:
df.sort_index(level = [inner, outer], ascending=[True, False])
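A sketch of subsetting and sorting with two index levels, on a made-up dogs data frame indexed by breed (outer level) and colour (inner level).

```python
import pandas as pd

# Hypothetical data.
dogs = pd.DataFrame({
    "breed": ["Lab", "Lab", "Poodle", "Poodle"],
    "colour": ["Brown", "Black", "White", "Black"],
    "weight": [30, 28, 20, 22],
})
dogs_ind = dogs.set_index(["breed", "colour"]).sort_index()

# Outer level: pass a list of outer labels.
outer = dogs_ind.loc[["Lab"]]

# Inner level: pass a list of (outer, inner) tuples.
inner = dogs_ind.loc[[("Lab", "Brown"), ("Poodle", "Black")]]

# Sort breed ascending, then colour descending within each breed.
sorted_mixed = dogs_ind.sort_index(level=["breed", "colour"],
                                   ascending=[True, False])

print(outer)
print(inner)
print(sorted_mixed)
```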
What does slicing do?
Selects consecutive elements from objects
If a column contains a date type, how can you access the different elements of the date?
df[“columns”].dt.year /.dt.month etc
What is the simple way to plot?
eg df[“column”].hist()
avg_weight_by_breed.plot(kind="bar")
How do you rotate axis labels by 45 degrees?
pass in rot=45
How can you investigate if there are any missing values in your dataset?
Represented by NaN
df.isna().any() - tells you if there are any missing values in each column
df.isna().sum() - tells you how many missing values are in each column
What can you do with missing values in a dataframe?
Drop - df.dropna()
Fill with 0 - df.fillna(0)
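A sketch of finding and handling missing values on a small made-up data frame with a single NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in column "a".
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

has_missing = df.isna().any()   # True/False per column
n_missing = df.isna().sum()     # count of missing values per column

dropped = df.dropna()           # removes rows containing any NaN
filled = df.fillna(0)           # replaces NaN with 0

print(has_missing)
print(n_missing)
print(len(dropped))
print(filled)
```

Neither dropna() nor fillna() changes df itself; both return a new data frame.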
How do you convert a data frame to a CSV file?
df.to_csv(“new filename.csv”)
How do you find the value in column 1 based on a condition in column 2?
Incorrect - journalsBSA.iloc[journalsBSA["Rank"].idxmin()].loc["Rank"] (this returns the rank itself)
Correct - journalsBSA.loc[journalsBSA["Rank"].idxmin(), "Title"]
How do you change the range of the data shown on the axis?
Change the axis limits - ax.set_ylim()
What are the steps of calculating the MSE?
determine the y values based on the predicted model and compare to actual values in table
MSE = np.mean( (predicted_y - df[“column”])**2)
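A worked sketch of the MSE steps, assuming a made-up data frame and a hypothetical model y = 2x. Neither comes from the lecture data.

```python
import numpy as np
import pandas as pd

# Hypothetical observations.
df = pd.DataFrame({"x": [1, 2, 3], "y": [2.0, 4.5, 5.5]})

# Step 1: predicted y values from the (assumed) model y = 2x.
predicted_y = 2 * df["x"]

# Step 2: squared differences between predicted and actual values.
squared_errors = (predicted_y - df["y"]) ** 2

# Step 3: the mean of the squared errors is the MSE.
mse = np.mean(squared_errors)
print(mse)
```

Here the errors are 0, -0.5 and 0.5, so the MSE is (0 + 0.25 + 0.25) / 3.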
How do you count the number of occurrences in a grouped data frame?
phys_groups.size().sort_values(ascending=False)
In databases, what are rows and columns referred as?
In the world of databases, rows are referred to as records
Columns are referred to as fields
What SQL query do you use to only return unique values?
SELECT DISTINCT column1, column2
FROM dataframe
What does the distinct key word do?
Returns the unique combinations of multiple field values
What is an SQL view?
A view is a virtual table that is the result of a saved SQL SELECT statement
Creating a view does not return a result set
Once created, the view can be queried like a table
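A sketch of creating and querying a view, run from Python with the built-in sqlite3 module on an in-memory database. The films table, its columns and the view name countries are made up for illustration.

```python
import sqlite3

# In-memory database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("A", "UK"), ("B", "UK"), ("C", "France")],
)

# Creating the view saves the SELECT statement; no rows are returned here.
conn.execute("CREATE VIEW countries AS SELECT DISTINCT country FROM films")

# The view can then be queried like a table.
rows = conn.execute("SELECT country FROM countries ORDER BY country").fetchall()
print(rows)

# COUNT(DISTINCT ...) counts the unique values directly.
n_unique = conn.execute("SELECT COUNT(DISTINCT country) FROM films").fetchone()[0]
print(n_unique)
```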