Session 8 - Advanced Programming Techniques Flashcards
We know how to read data from a single CSV file.
But often we want to read data from
many files
For example, we might run the same experiment on 100 participants.
Each experiment generates a data file, and we now want to process them all in one script.
How do we know which files to read from?
One good way is to put all the files into a single directory like /home/alex/subject_data and then find all the files in that directory that match a certain pattern (e.g. ending in .csv).
What does this ‘real directory’ show? - (3)
Here, the directory stores lots of different files from a single experiment.
The ones we care about are .csv files but there are also some other ones in there as well (like the .log files).
In the analysis we want to find and load in all the .csv files and ignore the other ones.
Two ways of listing the contents of a directory - (2)
- glob
- listdir command (part of os)
One way to find out what is in a directory is with the
listdir command (part of os).
One way of listing contents of a directory is using
os.listdir
To download data from YNiC to practice on, we use the command ‘git’
Git is
a free protocol that allows you to manage files that are synchronized across the internet.
The very common use for ‘Git’ is for
distributing software source code and data files
One website that runs ‘git’ is called ‘github.com’, which is - (2)
a favourite place for people to store their software projects and has become almost synonymous with ‘git’ itself.
Currently, Github says that they host over 100 million developers and over 420 million software projects
YNiC runs its own git server, which we use to download
some useful data files like this
Example of downloading some useful data files from YNiC git server
What does this code do? - (6)
- This code is used to download a smaller version of a repository called ‘pin-material’ from a specific URL.
- The ‘!cd /content’ command changes the directory to ‘/content’.
- ‘!git clone --branch small --depth 1 https://vcs.ynic.york.ac.uk/cn/pin-material.git’ is the main command.
- It clones the ‘small’ branch of the repository with a depth of 1, meaning it only gets the latest version of the files, not the entire history.
- The comment explains that this smaller version doesn’t include neuroimaging data, making it much smaller than the full repository.
- The ‘!ls -lat’ command lists the contents of the current directory in detail, showing the latest changes first.
Git is a protocol for getting source files and text files from a server in a
particular order
os.listdir() is a function from the …. module
os
We can use the command listdir to see what is in a
particular directory
We can check the current working directory with the os module by using
os.getcwd()
We can list the contents of the current working directory using a function from os called
os.listdir()
We can list the contents of a different directory by passing its path to os.listdir(), e.g.,
os.listdir('/content/pin-material')
lists the contents of the ‘/content/pin-material’ directory.
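The calls described above can be tried out without the YNiC data. The following is a minimal sketch that builds a throwaway directory first; the file names in it are made up for illustration and are not part of pin-material.

```python
import os
import tempfile

# Build a temporary directory so the example runs anywhere.
# The file names are hypothetical stand-ins for experiment files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["sub01.csv", "sub02.csv", "run.log"]:
        open(os.path.join(demo_dir, name), "w").close()

    cwd = os.getcwd()            # the current working directory, as a string
    here = os.listdir()          # no argument: same as os.listdir('.')
    contents = sorted(os.listdir(demo_dir))  # a different directory, by path

print(cwd)
print(type(contents))  # os.listdir returns a list of names
print(contents)
```

Note that os.listdir returns bare names, not full paths, and in no particular order, which is why the list is sorted here before printing.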
Explain this code (using YNiC’s git server to download useful files like pin-material-git) - (8)
- This code uses the os module.
- os.getcwd() gives the current working directory, which is printed.
- os.listdir('.') lists the contents of the current directory and stores them in a variable called ‘contents’.
- '.' represents the current directory.
- type(contents) gives the type of the variable contents, which is printed.
- print(contents) prints the contents of the current directory.
- os.listdir('/content/pin-material') lists the contents of a different directory, ‘/content/pin-material’, and stores them in the variable ‘newcontents’.
- The contents of the ‘newcontents’ variable are then printed out.
Output of this code
Both os.listdir('.')
and os.listdir()
do the same thing:
listing the contents of the current directory.
Remember ‘..’ means
‘go up one directory’
‘.’ means
‘this directory’
This is what pin-material directory looks like in file format:
os.listdir()
includes hidden files, which start with a…
Hidden files may…
You may need to filter… - (3)
- Hidden file names start with a dot (e.g., .DS_Store).
- Hidden files may not be useful and can clutter the list.
- You may need to filter out hidden files from the list returned by os.listdir().
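One possible way to filter out hidden files is to skip any name that starts with a dot. This is a sketch on a temporary directory; the three file names are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    # Hypothetical directory contents, including one hidden file
    for name in [".DS_Store", "README.md", "fft_bw.jpg"]:
        open(os.path.join(demo_dir, name), "w").close()

    everything = sorted(os.listdir(demo_dir))

    # Keep only the names that do not start with a dot
    visible = []
    for name in everything:
        if not name.startswith("."):
            visible.append(name)

print(everything)
print(visible)
```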
Example of os.listdir() including hidden files (e.g., .DS_Store)
A more useful function than os.listdir is
glob function from glob module
What does ‘glob’ stand for?
It is short for ‘global pattern match’
- The glob function from the glob module is used to
find files and directories matching a specific pattern.
The ‘glob’ function from glob module allows you to use special characters such as ‘*’ and ‘?’ to
search for strings that match certain patterns.
Example of using glob on YNiC pin material
Explain the code - (5)
- from glob import glob imports the glob function.
- filelist = glob('/content/pin-material/*.jpg') finds all .jpg files in the ‘pin-material’ directory.
- print(filelist) displays the list of .jpg files found.
- pyFiles = glob('/content/pin-material/*.py') finds all Python script files.
- print(sorted(pyFiles)) prints the Python script files as a sorted list, in ascending order.
Output of this code:
We see in this code that glob returns paths in the same form as the pattern we passed as the argument.
Therefore, if we use the full path (as we did above), we get back a list of full paths.
In other words:
- When provided with the full path as an argument, glob returns a list of full paths.
We could then use this list in a loop to open multiple files and load the data from
each one in turn
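As a rough sketch of that idea, the example below creates a few small CSV files in a temporary directory, finds them with glob, and opens each one in a loop. The file names and their one-column contents are invented for the example.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as data_dir:
    # Hypothetical participant files: a header line plus one value each
    for subject in (1, 2, 3):
        path = os.path.join(data_dir, f"subject{subject}.csv")
        with open(path, "w") as f:
            f.write(f"rt\n{subject * 100}\n")
    # A file we want to ignore
    open(os.path.join(data_dir, "notes.log"), "w").close()

    # Full path in the pattern, so full paths come back
    file_list = sorted(glob(os.path.join(data_dir, "*.csv")))

    first_values = []
    for file_name in file_list:
        with open(file_name) as f:
            lines = f.read().splitlines()
        first_values.append(lines[1])   # the line after the header
        print(file_name, lines[1])
```

Because the results are full paths, each one can be passed straight to open (or to a loader such as pandas) without joining it to the directory name again.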
We can use the sorted function on the list from os.listdir so that the hidden files (whose names start with a dot) come first
What are wildcard characters in the context of glob?
Wildcard characters are special symbols used in glob patterns to match filenames or paths.
List all the wildcard characters used in the glob function - (4)
- * (an asterisk)
- ? (a question mark)
- [1234] (a list of characters)
- [1-9] (a range of characters)
Explain the wildcard ‘*’ in glob - (2)
- It matches any set of characters, including no characters at all.
- For example, ‘file*.txt’ matches ‘file.txt’ and ‘file123.txt’.
What does the ‘?’ wildcard match in glob? - (2)
- It matches any single character.
- For example, ‘file?.txt’ matches ‘file1.txt’, ‘fileA.txt’, but not ‘file12.txt’.
How does the wildcard ‘[1234]’ work in glob? - (2)
- ‘[1234]’ is a wildcard character in glob that matches any single character from the list [1234].
- For example, ‘file[1234].txt’ matches ‘file1.txt’, ‘file2.txt’, but not ‘file5.txt’.
Explain the ‘[1-9]’ wildcard in glob - (2)
- ‘[1-9]’ is a wildcard character in glob that matches any single character in the range from 1 to 9.
- For example, ‘file[1-9].txt’ matches ‘file1.txt’, ‘file2.txt’, but not ‘file10.txt’.
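The four wildcards can be checked on plain strings without touching the disk. This sketch uses fnmatchcase from the standard fnmatch module, which applies the same shell-style wildcard rules that glob uses for file names; the list of names is made up for the demonstration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names to test the patterns against
names = ["file.txt", "file1.txt", "fileA.txt",
         "file5.txt", "file10.txt", "file123.txt"]


def matches(pattern):
    """Return the names that match a shell-style wildcard pattern."""
    found = []
    for name in names:
        if fnmatchcase(name, pattern):
            found.append(name)
    return found


star = matches("file*.txt")         # * : any run of characters, even none
question = matches("file?.txt")     # ? : exactly one character
listed = matches("file[1234].txt")  # [1234] : one character from the list
ranged = matches("file[1-9].txt")   # [1-9] : one character from the range

for label, result in [("*", star), ("?", question),
                      ("[1234]", listed), ("[1-9]", ranged)]:
    print(label, result)
```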
Here are the files in the directory:
[‘pop2_tidy_script2.py’,
‘s3’,
‘README.md’,
‘app_headshape.xlsx’,
‘fft_colour.jpg’,
‘pop2_tidy_script1.py’,
‘fft_bw.jpg’,
‘.DS_Store’,
‘s4’,
‘app_headshape.bin’,
‘pop2_debug_script2.py’,
‘pop2_debug_script1.py’,
‘.git’,
‘pop3_test_script.py’]
What would glob('/content/pin-material/fft*') print? - (4)
- The glob pattern '/content/pin-material/fft*' matches everything in the ‘/content/pin-material’ directory whose name starts with ‘fft’.
- From the given list of files:
- ‘fft_colour.jpg’ and ‘fft_bw.jpg’ match the pattern.
- Because the pattern was given as a full path, glob returns full paths: ['/content/pin-material/fft_colour.jpg', '/content/pin-material/fft_bw.jpg'] (the order is not guaranteed unless the list is sorted).
What would glob(‘/content/pin-material/*md’) print? - (4)
- The glob pattern ‘/content/pin-material/*md’ matches all files in the ‘/content/pin-material’ directory that end with ‘md’.
- Based on the * wildcard, which matches any set of characters, it will find files ending with ‘md’.
- From the given list of files, ‘README.md’ matches the pattern.
- Therefore, glob('/content/pin-material/*md') would print the full path of that file: ['/content/pin-material/README.md'].
What would glob('/content/pin-material/pop?_*') print? - (7)
- The glob pattern '/content/pin-material/pop?_*' uses two wildcard characters: ‘?’ and ‘*’.
- ‘?’ matches any single character, allowing for flexibility in matching filenames.
- ‘*’ matches any set of characters, including no characters at all.
- Therefore, the pattern matches files in the ‘/content/pin-material’ directory that start with ‘pop’, followed by any single character, then an underscore, and then any set of characters.
- Based on this pattern:
- ‘pop2_tidy_script2.py’, ‘pop2_tidy_script1.py’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’ and ‘pop3_test_script.py’ all match, because ‘?’ matches the ‘3’ just as it matches the ‘2’.
- Therefore, glob('/content/pin-material/pop?_*') would print the full paths of these five files, e.g. '/content/pin-material/pop2_tidy_script2.py' (the order is not guaranteed unless the list is sorted).
What would glob(‘/content/pin-material/pop*’) print? - (4)
- The glob pattern ‘/content/pin-material/pop*’ matches all files in the ‘/content/pin-material’ directory that start with ‘pop’.
- Based on the ‘*’ wildcard, which matches any set of characters, it will find files that start with ‘pop’.
- From the given list of files, ‘pop2_tidy_script2.py’, ‘pop2_tidy_script1.py’, ‘pop2_debug_script2.py’, ‘pop2_debug_script1.py’, and ‘pop3_test_script.py’ match the pattern.
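- Therefore, glob('/content/pin-material/pop*') would print the full paths of these five files: ['/content/pin-material/pop2_tidy_script2.py', '/content/pin-material/pop2_tidy_script1.py', '/content/pin-material/pop2_debug_script2.py', '/content/pin-material/pop2_debug_script1.py', '/content/pin-material/pop3_test_script.py'] (in no guaranteed order).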
What would glob(‘/content/pin-material/pop?_tidy_script[1-2]*’) print? - (6)
- The glob pattern ‘/content/pin-material/pop?_tidy_script[1-2]*’ matches files in the ‘/content/pin-material’ directory that start with ‘pop’, followed by any single character, then ‘_tidy_script’, then either ‘1’ or ‘2’, and then any set of characters.
- ’?’ matches any single character, allowing flexibility in matching filenames.
- ‘[1-2]’ matches either ‘1’ or ‘2’.
- ‘*’ matches any set of characters, including no characters at all.
- From the given list of files, ‘pop2_tidy_script2.py’ and ‘pop2_tidy_script1.py’ match the pattern.
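- Therefore, glob('/content/pin-material/pop?_tidy_script[1-2]*') would print the full paths of these two files: ['/content/pin-material/pop2_tidy_script2.py', '/content/pin-material/pop2_tidy_script1.py'] (in no guaranteed order).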
What would glob('/content/pin-material/fft*.jpg') print? - (4)
- The glob pattern '/content/pin-material/fft*.jpg' matches files in the ‘/content/pin-material’ directory that start with ‘fft’, followed by any set of characters, and end with ‘.jpg’.
- ‘*’ matches any set of characters, including no characters at all.
- From the given list of files, ‘fft_colour.jpg’ and ‘fft_bw.jpg’ match the pattern.
- Therefore, glob('/content/pin-material/fft*.jpg') would print the full paths of these two files: ['/content/pin-material/fft_colour.jpg', '/content/pin-material/fft_bw.jpg'] (in no guaranteed order).
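The glob questions above can be checked by recreating the listed names in a temporary directory and running each pattern. This is only a sketch: the files are empty, and treating ‘s3’, ‘s4’ and ‘.git’ as sub-directories is an assumption.

```python
import os
import tempfile
from glob import glob

files = ["pop2_tidy_script2.py", "README.md", "app_headshape.xlsx",
         "fft_colour.jpg", "pop2_tidy_script1.py", "fft_bw.jpg",
         ".DS_Store", "app_headshape.bin", "pop2_debug_script2.py",
         "pop2_debug_script1.py", "pop3_test_script.py"]
subdirs = ["s3", "s4", ".git"]   # assumed to be directories

patterns = ["fft*", "*md", "pop?_*", "pop*",
            "pop?_tidy_script[1-2]*", "fft*.jpg"]

with tempfile.TemporaryDirectory() as root:
    for d in subdirs:
        os.mkdir(os.path.join(root, d))
    for f in files:
        open(os.path.join(root, f), "w").close()

    results = {}
    for pattern in patterns:
        # glob gives no guaranteed order, so sort before printing
        results[pattern] = sorted(glob(os.path.join(root, pattern)))
        print(pattern, "->", results[pattern])
```

Each result is a list of full paths starting with the temporary directory, and ‘pop?_*’ picks up pop3_test_script.py as well as the four pop2 scripts.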
There are cases where you might have full paths (e.g. from glob above) and need to split them up into directory and filename. You may also want to split out the extension of a file from the main part of it (i.e. turn myfile.txt into myfile and txt).
You are already thinking of the split() function, right? Well, that can work, but in addition there are three os.path functions that can help you with that - (3)
1) basename
2) dirname
3) splitext
How do we import the three functions from the os.path module that help with splitting full file paths?
from os.path import basename, dirname, splitext
Functions like basename, dirname, and splitext from the os.path module can help split full paths into directory, filename, and file extension.
- These functions provide a convenient way to
extract different parts of a file path.
Explain the basename function from the os.path module - (3)
- The basename function, from the os.path module, extracts the filename from a full path.
- It returns the last component of the path, excluding the directory.
- For example, basename(‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’) would return ‘s4_rt_data_part01.hdf5’.
Explain the dirname function from the os.path module - (3)
- The dirname function, from the os.path module, extracts the directory name from a full path.
- It returns the directory component of the path, excluding the filename.
- For example, dirname(‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’) would return ‘/content/pin-contents/s4’.
Explain the splitext function from the os.path module - (3)
- The splitext function, from the os.path module, splits a filename into its base name and extension.
- It returns a tuple containing the base name and the extension separately.
- For example, splitext(‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’) would return (‘/content/pin-contents/s4/s4_rt_data_part01’, ‘.hdf5’)
The splitext function returns a tuple (you can treat it as a list) of two items. - (2)
The first element is everything except the extension of the file and the second element is the extension (including the leading .).
We can use basename, dirname and splitext on variables - (3)
e.g., my_path = ‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’
dname = dirname(my_path)
fname = basename(my_path)
print(splitext(my_path))
What does the splitext function do when applied to the full path? - (3)
- The splitext function splits the full path into its base name and extension.
- When applied to the full path, it returns a tuple containing the base name and the extension separately.
- For example, splitext(‘/content/pin-contents/s4/s4_rt_data_part01.hdf5’) would return (‘/content/pin-contents/s4/s4_rt_data_part01’, ‘.hdf5’).
What does the splitext function do when applied to just the filename? - (3)
- When applied to just the filename, the splitext function splits the filename into its base name and extension.
- It returns a tuple containing the base name and the extension separately.
- For example, if fname is ‘s4_rt_data_part01.hdf5’, splitext(fname) would return (‘s4_rt_data_part01’, ‘.hdf5’).
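The three functions can be tried together on the example path used in these cards. This is a small sketch; the names stem and extension are just illustrative variable names.

```python
from os.path import basename, dirname, splitext

my_path = "/content/pin-contents/s4/s4_rt_data_part01.hdf5"

dname = dirname(my_path)     # everything before the last path separator
fname = basename(my_path)    # the last component of the path

# splitext returns a tuple of two items, which can be unpacked directly
stem, extension = splitext(fname)

print(dname)
print(fname)
print(stem, extension)
print(splitext(my_path))     # on the full path, the directory stays attached
```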
Write code that uses glob to find all of the files in /content/pin-material/s4 that end in .hdf5. Sort this list, loop over it and print out just the filename without the extension. Your output should look like:
Explain this code - (8)
The code first imports the glob function from the glob module and the basename, splitext and dirname functions from the os.path module.
Use glob to find files:
The glob function searches for all files ending with ‘.hdf5’ in the ‘/content/pin-material/s4/’ directory.
The resulting list of file paths is stored in the fileList variable.
A for loop iterates over each file path in fileList.
In each iteration, the current element of fileList is held in thisFileName.
fNameOnly stores the filename extracted from the full path thisFileName (e.g., if thisFileName is ‘/content/pin-material/s4/s4_rt_data_part04.hdf5’, then fNameOnly will store ‘s4_rt_data_part04.hdf5’).
parts = splitext(fNameOnly): splitext splits the filename stored in fNameOnly into its base name and extension. The base name is stored in parts[0] and the extension in parts[1].
For example, if fNameOnly is ‘s4_rt_data_part04.hdf5’, then after splitting:
parts[0] will store ‘s4_rt_data_part04’ (the base name).
print(parts[0]): Only the base name stored in parts[0] is printed. For example, if parts[0] is ‘s4_rt_data_part04’, then this base name will be printed.
The for loop continues until every element of fileList has been processed.
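A sketch of the code described above might look like the following. It uses the variable names from the explanation (fileList, thisFileName, fNameOnly, parts), but runs on a temporary directory of empty, hypothetical .hdf5 files so that it works without the YNiC data; with the real data, the pattern would be '/content/pin-material/s4/*.hdf5'.

```python
import os
import tempfile
from glob import glob
from os.path import basename, splitext

with tempfile.TemporaryDirectory() as s4_dir:
    # Empty stand-in files, created out of order on purpose
    for part in (3, 1, 2):
        name = f"s4_rt_data_part{part:02d}.hdf5"
        open(os.path.join(s4_dir, name), "w").close()
    open(os.path.join(s4_dir, "s4_meg_sensor_data.txt"), "w").close()

    # Find the .hdf5 files and sort the list of full paths
    fileList = sorted(glob(os.path.join(s4_dir, "*.hdf5")))

    printed = []   # kept only so the result can be checked afterwards
    for thisFileName in fileList:
        fNameOnly = basename(thisFileName)   # drop the directory
        parts = splitext(fNameOnly)          # (base name, extension)
        print(parts[0])
        printed.append(parts[0])
```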
Output of this code
There are two additional things we can do with lists which can make our code more concise and easier to read and write
These are list comprehensions and list enumeration.
We know how to make a list both by hand and by the range function
Explain this code - (3)
- list1=[0,1,2,3,4,5]: Defines a list named list1 containing integers 0 through 5, entered manually.
- list2=list(range(6)): Creates a list named list2 using the range() function to generate integers from 0 to 5.
- Prints both list1 and list2.
We often need to manipulate the contents of data in lists and have learned to do this by using
for loops
Example of manipulating the contents of data in lists using for loops:
Explain this code - (8)
- input_list = range(10): Creates a range object containing integers from 0 to 9 (not including 10), assigned to input_list.
- output_list = []: Initializes an empty list named output_list.
- for value in input_list: Iterates over each value in input_list.
- Inside the loop:
  - value takes on each value from input_list in sequence.
  - output_list.append(value * 2): Multiplies each value by 2 and appends the result to output_list.
- print(list(input_list)): Prints the contents of input_list, displaying integers from 0 to 9.
- print(output_list): Prints the contents of output_list, displaying each element multiplied by 2.
What would be its output?
For cases where we need to implement a simple transformation like this (such as multiplying by a number or calling a function on each member of a list), like in this example,
Python gives us an alternative: the list comprehension.
What does list comprehension mean in Python?
A list comprehension is simply a statement inside square brackets which tells Python how to construct the list.
How do we write the list ‘output_list’ as a list comprehension?
Explain this code - (2)
The example above therefore reads as (x * 2) for each value (x) in range(10). i.e., for each value in the list produced by range(10), put it in the variable x, then put the value x*2 into the list.
Note that the variable x is just a placeholder and could be called anything.
What would be output of this code?
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
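The loop version and the list comprehension described above can be put side by side. This is a short sketch; comprehension_list is a name chosen here for the comparison.

```python
# Loop version: start with an empty list and append to it
output_list = []
for value in range(10):
    output_list.append(value * 2)

# List comprehension: "x * 2 for each x in range(10)"
comprehension_list = [x * 2 for x in range(10)]

print(output_list)
print(comprehension_list)
print(output_list == comprehension_list)
```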
The trick with list comprehensions is to read them out
loud to yourself.
List comprehension works with
any sort of list and any sort of data, e.g.,
Explain this code - (6)
- original_data = ['Alex', 'Bob', 'Catherine', 'Dina']: Defines a list named original_data containing four strings.
- new_list = ['Hello ' + x for x in original_data]: Uses a list comprehension to create a new list named new_list.
  - For each element x in original_data, the expression 'Hello ' + x concatenates 'Hello ' with the value of x, which represents each name in original_data.
  - The resulting strings are added to new_list.
- print(original_data): Prints the contents of original_data, displaying the original list of names.
- print(new_list): Prints the contents of new_list, displaying each name prefixed with 'Hello '.
What would be output of this code?
We can also call functions in
list comprehension
e.g.,
Explain this code - (6)
- original_data = ['This', 'is', 'a', 'test']: Defines a list named original_data containing four strings.
- new_list = [len(x) for x in original_data]: Uses a list comprehension to create a new list named new_list.
  - For each element x in original_data, the expression len(x) calculates the length of the string x.
  - The resulting lengths are added to new_list.
- print(new_list): Prints the contents of new_list, displaying the length of each string in original_data.
- For example, ‘This’ has 4 characters, ‘is’ has 2 characters, ‘a’ has 1 character, and ‘test’ has 4 characters.
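Both kinds of comprehension discussed above, one building strings and one calling a function, can be run in a few lines. In this sketch the variable names names, greetings, words and lengths are chosen for clarity and differ from the original_data and new_list names used in the cards.

```python
names = ['Alex', 'Bob', 'Catherine', 'Dina']
# An expression applied to each element
greetings = ['Hello ' + x for x in names]

words = ['This', 'is', 'a', 'test']
# A function called on each element
lengths = [len(x) for x in words]

print(greetings)
print(lengths)
```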
What would be output of this code?
Use list comprehension to make the code fragment shorter:
What would be output of this code?
Use list comprehension to make the code fragment shorter:
What would be output of this code?
What is the purpose of the pass statement in Python? - (2)
The pass statement in Python serves as a placeholder and does nothing when executed.
It is often used to create empty loops, functions, or classes.
Example of using pass in an empty loop - (5)
- data1 = [10, 20, 30] creates a list containing the numbers 10, 20 and 30.
- The for loop iterates over each item in data1.
- Inside the loop, a comment explains that the loop is pointless but serves as a placeholder for future code.
- The pass statement is used to indicate that no action needs to be taken inside the loop.
- Essentially, pass allows the loop to exist without any executable code inside it, avoiding syntax errors in situations where a loop is required but no action is necessary.
What would be output of this code? - (3)
The pass statement itself does nothing when executed and serves as a placeholder.
- As a result, there are no print statements or other operations that would produce output
NO OUTPUT
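A minimal sketch of pass as a placeholder follows. The function name not_written_yet is hypothetical, and the compile call at the end is just one way to show, without crashing the script, that a loop with no body is rejected by Python.

```python
data1 = [10, 20, 30]

for item in data1:
    # This loop does nothing yet; pass keeps it syntactically valid
    pass


def not_written_yet():
    # pass also works as the body of an unfinished function
    pass


result = not_written_yet()   # a function with no return gives None

# Without pass, a loop with no body is a syntax error.
try:
    compile("for item in [10, 20, 30]:\n", "<demo>", "exec")
    empty_loop_ok = True
except SyntaxError:
    empty_loop_ok = False

print(empty_loop_ok)
```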
Explain the use of break in this code snippet - (5)
- It imports the randint function from the random module to generate random integer numbers.
- Inside a while loop that runs indefinitely (while True), random numbers between 0 and 5 (inclusive) are generated and printed.
- If a randomly generated number is equal to 0, the break statement is executed.
- The break statement immediately exits the while loop, so execution continues with whatever follows the loop (here, the program ends).
- This allows the program to stop generating random numbers once a 0 is encountered.
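The behaviour described above can be sketched as follows. The seed call and the drawn list are additions made here so that the run is repeatable and easy to inspect; they are assumptions, not part of the code being explained.

```python
from random import randint, seed

seed(0)       # assumption: fixed seed so repeated runs give the same numbers
drawn = []    # record of every number generated

while True:                    # would run forever without a break
    number = randint(0, 5)     # random integer from 0 to 5, inclusive
    print(number)
    drawn.append(number)
    if number == 0:
        break                  # leave the loop as soon as a 0 appears

print("Stopped after", len(drawn), "numbers")
```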
Python produces an error if a for loop has
no code to execute inside it
We are now looking at how to load different data formats
(other than .csv)
Files come in different
‘formats’
The format of a file means how the data
are stored inside it.
In some files the data are stored as ‘plain text’. You can open them in a text editor (like Spyder or Notepad, or you can look at them using cat on the command line) and read them, although they might not make a lot of sense.
Files like this include .csv files, files ending in .txt and most types of programming language ‘source code’, like Python files (.py) and web pages (.html).
The other ‘family’ of files stores its data in formats that you cannot easily read in a text editor. These files usually contain - (2)
‘numbers’ of some sort stored in a way that computers like to read.
We call them ‘binary’ files.
Example of binary files
Files like this include the image and video formats that you might be familiar with (.jpg, .gif, .mp4).
In neuroimaging, the brain data files we use tend to be in binary format (e.g. .nii, .mat), while the files describing other things like experiment structure and subject responses are in
‘plain text’ (.csv, .txt).
Python can read most file formats – you just need to hunt down the
right modules.
Previously, we started to use matplotlib for plotting our data.
We imported the pyplot submodule like this:
pyplot contains most of the functions which we will learn in this section.
Because we imported it ‘as’ plt
We will access the functions using - (2)
plt.FUNCTIONNAME,
e.g. plt.plot; the same way as, when using numpy, we use np.array.
Plain text files can be formatted in a number of ways, but a common way for numerical data is:
i.e., numbers separated - either by spaces, tabs or commas with one row per line of text.
Up until now we have used Pandas to load in .csv files. But numpy also knows how to load in data and sometimes we just want to have the data
appear directly in a numpy array.
MEG systems arrange their sensors in a
‘bowl’ over the subject’s head. Like in this picture.
Each MEG sensor measures the
magnetic field activity in a particular location.
Together, MEG sensors tell you
what is happening all across the subject’s head at any moment.
We are going to load some MEG data from the file s4_meg_sensor_data.txt in the
pin-materials s4 sub-directory using this code:
When you have a text-based data file (e.g., MEG data: s4_meg_sensor_data.txt), always start… - (2)
by having a look at it to understand the format.
Most often we just want to see a few lines of the file to get an idea of what is in it.
We can get an idea of what is inside a file using the shell command
head, which shows the first 10 lines of a file
Using the ‘!’ prefix, we can run shell commands. To
list the contents of the “s4” subdirectory, we use the ls
command:
Explain what this code does - (6)
!ls -lh pin-material/s4
- The ls command lists the contents of a directory.
- -lh is a combination of options:
  - -l lists detailed information about each file, including permissions, owner, size, and modification time.
  - -h displays file sizes in a human-readable format (e.g., kilobytes, megabytes).
- pin-material/s4 specifies the directory whose contents will be listed.
- Therefore, this command lists detailed information about the contents of the “s4” subdirectory within the “pin-material” directory.
Explain what this code does - (3)
!head pin-material/s4/s4_meg_sensor_data.txt
- The head command displays the beginning of a file.
- pin-material/s4/s4_meg_sensor_data.txt specifies the file whose beginning will be displayed.
- This command shows the first few lines of the “s4_meg_sensor_data.txt” file located within the “s4” subdirectory of the “pin-material” directory.
By executing this code:
It gives us this output
This output shows - (8)
We can see that the first line of the file contains some column headings.
We make a note of these as we will need them later on:
Column 0: Time
Column 1: Left Mean
Column 2: Left Lower CI
Column 3: Left Upper CI
Column 4: Right Mean
Column 5: Right Lower CI
Column 6: Right Upper CI
Executing this code line gives us this output:
We can use the tail command on the MEG data to check that the last 10 lines of the file really are numbers
The tail command
displays the last 10 lines of a file by default.
What does the following code snippet do, and why might it encounter an issue?
The code snippet uses NumPy’s “loadtxt” function to load numerical data from a text file.
- It attempts to load data from the file “s4_meg_sensor_data.txt” located within the “s4” subdirectory of the “pin-material” directory.
- However, it may encounter an issue if the first line of the file contains column names or non-numeric data instead of numerical values.
- By default, “loadtxt” expects numeric data and will raise a ValueError if it encounters non-numeric content in the first row.
We can correct this by:
telling numpy to skip the first line (skiprows=1), which is the header row containing the column names as strings (words)
What would be the shape of the data and type?
(400, 7) - 400 rows and 7 columns
float64
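The header problem and the skiprows fix can be reproduced on a tiny stand-in for the MEG file. This is a sketch: the three column names and the numbers are invented, and the text is read from memory with io.StringIO rather than from s4_meg_sensor_data.txt.

```python
import io
import numpy as np

# A made-up miniature file: one header line, then rows of numbers
text = ("Time LeftMean LeftLowerCI\n"
        "0.000 1.5e-13 1.0e-13\n"
        "0.001 2.5e-13 2.0e-13\n")

# Without skiprows, the header words cannot be converted to numbers
try:
    np.loadtxt(io.StringIO(text))
    header_failed = False
except ValueError:
    header_failed = True

# Skipping the first row leaves only numeric data
data = np.loadtxt(io.StringIO(text), skiprows=1)

print(header_failed)
print(data.shape, data.dtype)
```

With the real file, the first argument would be the path 'pin-material/s4/s4_meg_sensor_data.txt' instead of the in-memory text.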
After loading plain text files using np.loadtxt (e.g., ‘pin-material/s4/s4_meg_sensor_data.txt’, skiprows=1), we can save them out using
np.savetxt.
Explain the code - (9)
This code snippet uses NumPy’s “loadtxt” function to load numerical data from a text file.
- It loads data from the file “s4_meg_sensor_data.txt” located within the “s4” subdirectory of the “pin-material” directory, skipping the first row (which likely contains column names or non-numeric information).
- The loaded data is stored in the variable “importedData”.
- This line extracts a subset of the loaded data (“importedData”).
- It selects columns 1, 2, and 3 from the loaded data (the end index of a slice is exclusive).
- The extracted data, representing the timecourse and confidence intervals, is stored in the variable “ourdata”.
- The “savetxt” function from NumPy is used to save the extracted data (“ourdata”) to a new text file named “my_new_meg_data.txt”.
- This file is saved in the “/content/pin-material” directory.
- After saving the data, the “!ls -lth” command is executed to list the files in the directory, providing information about file sizes and modification times.
Explain what the np.savetxt line means in this code - (6)
np.savetxt: This function from the NumPy library is used to save data to a text file.
‘my_new_meg_data.txt’: Specifies the name of the text file where the data will be saved.
ourdata: Represents the data to be saved. This variable holds the extracted subset of the loaded data, which includes columns 1, 2, and 3.
header=’Mean UppcrCI LowerCI’: Defines a header string that will be written at the beginning of the file.
This header typically contains column names or other descriptive information. In this case, the header string specifies the column names as “Mean”, “UppcrCI”, and “LowerCI”.
fmt=’%1.4e’: Specifies the format string for writing the data. The %1.4e format specifier formats floating-point numbers with scientific notation (exponential format) and exactly 4 digits after the decimal point. This ensures that the data is written with a precision of 4 decimal places.
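The effect of the header and fmt arguments is easiest to see by saving a small array and reading the file back. This sketch uses invented numbers, writes to a temporary directory rather than /content/pin-material, and uses the header text 'Mean LowerCI UpperCI', which is an assumption chosen to follow the column order listed earlier.

```python
import os
import tempfile
import numpy as np

# Made-up data: column 0 is time, columns 1-3 are mean and CI bounds
importedData = np.array([[0.000, 1.5, 1.0, 2.0],
                         [0.001, 2.5, 2.0, 3.0]])
ourdata = importedData[:, 1:4]   # columns 1, 2 and 3 (4 is excluded)

with tempfile.TemporaryDirectory() as out_dir:
    out_path = os.path.join(out_dir, "my_new_meg_data.txt")
    np.savetxt(out_path, ourdata,
               header="Mean LowerCI UpperCI", fmt="%1.4e")

    # Look at the raw text that was written
    with open(out_path) as f:
        saved_lines = f.read().splitlines()

    # loadtxt ignores the header because savetxt prefixes it with '# '
    reloaded = np.loadtxt(out_path)

for line in saved_lines:
    print(line)
```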
The MEG data are
average timecourses measured from some sensors on the left and right side of the head after a ‘beep’.
Usually in MEG we present the same stimulus many times and combine the recordings from each presentation to get an
‘average’ response.
Previously we loaded or made small arrays and plotted them with matplotlib. We can pass data to matplotlib either as a
list or as a numpy array.
Plot data as a list - example code
Explain what is happening in plt.plot([2, 3, 4, 5]) - (4)
Since only one set of values is passed, these values will be used as the y-values, and the x-values will be automatically generated as the indices of the data points (0, 1, 2, 3).
In this example, we did not supply an ‘X’ axis. matplotlib assumed that we just wanted 0,1,2,3 and so on.
So we really plotted (0, 2), (1, 3), (2, 4) and (3, 5) and joined them up with a straight line.
When matplotlib draws a line in this way, it joins the points with straight lines by default.
We might, however, want to look at the individual data points. To do this, we can add a ‘format’ string - (2)
This is a string which describes how to format the data in the plot.
Here we do this by just adding an ‘o’ to the function call.
What does this line of code do?
plt.plot([2, 3, 4, 5], ‘o’)
The first line plots the data points as circles (‘o’) without joining lines, creating a scatter plot.
What does this line of code do?
plt.plot([2, 3, 4, 5], ‘og’)
The second line plots the same data points as green circles (‘og’) without joining lines.
What does this line of code do?
plt.plot([2, 3, 4, 5], 'o--')
The third line plots the data points as circles with a dashed line ('o--') joining them.
What are some commonly used markers in Matplotlib? - (8)
In Matplotlib, markers are symbols used to denote individual data points in plots.
Here are some commonly used markers:
- ’.’: Point marker
- ‘o’: Circle marker
- ‘v’: Downward triangle marker
- ’^’: Upward triangle marker
- ’+’: Plus marker
- ‘*’: Star marker
As well as choosing markers, you can also change the line type in matplotlib
For instance, ‘--’ means use a dashed line, whilst ‘-.’ means use a dash-dotted line
In matplotlib we can also
adjust the colour of our lines or markers using the format string.
By default, matplotlib will use its own colour cycling scheme to choose colours for us
but we can override this.
We can combine markers and colours in same format string:
Explain what this code does and output - (5)
- import matplotlib.pyplot as plt: Imports the Matplotlib plotting interface and aliases it as plt for convenience.
- plt.cla(): Clears the current axes to ensure a fresh plot.
- plt.plot([2, 3, 4, 5], 'r+'): Plots the data points [2, 3, 4, 5] with red plus markers ('r+').
- The 'r' indicates the color red, and the '+' specifies the marker style as a plus sign.
- The data points are not joined by a line.
How can we specify basic colors in Matplotlib? - (8)
- ‘b’: blue
- ‘g’: green
- ‘r’: red
- ‘c’: cyan
- ‘m’: magenta
- ‘y’: yellow
- ‘k’: black
- ‘w’: white
What does this code line do?
This code plots the points [2, 3, 4, 5] using blue stars at each data point with a dash-dotted line in between.
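The code for this card is not shown. A call consistent with the answer might look like the sketch below; the format string 'b*-.' is inferred from the description (blue, star marker, dash-dotted line) and is an assumption, not the original line.

```python
import matplotlib.pyplot as plt

plt.figure()
# 'b' = blue, '*' = star marker, '-.' = dash-dotted line (assumed format string)
(line,) = plt.plot([2, 3, 4, 5], 'b*-.')
```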
We can pass two lists/arrays of numbers: the first will be the x values and the second the y values, like this:
What does this code do: plt.plot([-2, 1.5, 2, 4], [2, 3, 4, 5], 'g*')?
This code plots a graph with green star markers at the coordinates (-2, 2), (1.5, 3), (2, 4), and (4, 5).
Why does this code not join the points together: plt.plot([-2, 1.5, 2, 4], [2, 3, 4, 5], 'g*')?
The absence of a line in the plot is due to the format string 'g*', where 'g' denotes green color and '*' denotes star markers, but no line style is specified.
The default behaviour of matplotlib is to add plots of new data to the
existing figure
The default behaviour of matplotlib is to add plots of new data to the existing figure.
This allows us to
create complex plots from multiple components
E.g.,
The default behaviour of matplotlib is to add plots of new data to the existing figure. This allows us to create complex plots from multiple components.
A simple example would be a line plot with many lines of data, or a scatter plot where we show different types of data with different symbols
What does plt.ylim do?
Sets the limits of the y axis; here the y axis runs from -3 to +4
What does plt.grid() do?
Adds a grid to the plot
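A minimal sketch of building up one figure from several plt.plot calls, then setting the y limits and adding a grid. The data values are reused from the earlier cards; the combination shown here is illustrative rather than the original example.

```python
import matplotlib.pyplot as plt

plt.figure()
plt.plot([2, 3, 4, 5])                          # first line, x generated automatically
plt.plot([-2, 1.5, 2, 4], [2, 3, 4, 5], 'g*')   # added to the same figure as green stars

plt.ylim(-3, 4)   # y axis runs from -3 to +4
plt.grid()        # draw a grid behind the data
```

Both data sets end up on the same axes because no new figure was created between the two plt.plot calls.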
What does plt.savefig('my_figure.png', dpi=300) do?
This code saves the current figure as an image file named “my_figure.png” with a resolution of 300 dots per inch (dpi).
You might want to save out your figures to include them in your papers or dissertation.
You can do this using the
savefig function.
There are two ways to use the savefig function - (2)
You can either select the figure and then use plt.savefig(),
or you can use the .savefig() method directly on the figure object.
In both cases, first generate a new figure and hold its 'ID' or 'handle' in a variable: figureHandle = plt.figure()
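Both ways of saving can be sketched as follows. The temporary output directory and the second file name are assumptions made so the example runs anywhere; in practice you would choose your own location.

```python
import os
import tempfile
import matplotlib.pyplot as plt

out_dir = tempfile.mkdtemp()   # hypothetical output folder for this sketch

figureHandle = plt.figure()    # generate a new figure and keep its handle
plt.plot([2, 3, 4, 5], 'o--')

# Way 1: select the figure, then use the pyplot function
plt.figure(figureHandle.number)
path_1 = os.path.join(out_dir, 'my_figure.png')
plt.savefig(path_1, dpi=300)

# Way 2: call the method directly on the figure object
path_2 = os.path.join(out_dir, 'my_figure_method.png')
figureHandle.savefig(path_2, dpi=300)
```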
What does plt.close(‘all’) do?
Close all existing figures
What does the legend part do in this code: import matplotlib.pyplot as plt? - (2)
The plt.legend() function adds a legend to the plot, which provides labels for the plotted lines.
In this code, it labels the first line as “A straight line” and the second line as “A wiggly line”.
Legends tell you what each line on a
plot means.
The ‘newline’ character (‘\n’) forces a new line inside a
string.
We can add a newline character inside the x-axis label and y-axis label via:
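The original code for these cards is not shown. A minimal sketch, assuming two lines labelled as in the legend card and made-up axis labels, might be:

```python
import matplotlib.pyplot as plt

plt.figure()
plt.plot([0, 1, 2, 3], [0, 1, 2, 3], label='A straight line')
plt.plot([0, 1, 2, 3], [0, 2, 1, 3], label='A wiggly line')
plt.legend()   # builds the legend from the label of each line

# '\n' forces a line break inside each axis label (label text is illustrative)
plt.xlabel('Time\n(s)')
plt.ylabel('Signal\n(arbitrary units)')
```

Passing the labels as a list, as in plt.legend(['A straight line', 'A wiggly line']), is an alternative to the label argument.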
Why is one line blue and one line green in this code and plot? - (3)
In the provided code, the line plt.plot([2, 3, 4, 5]) creates a plot with only y-values, so it auto-generates x-values as sequential integers starting from 0.
This line is plotted in the default color, which is blue.
The subsequent line plt.plot([-2, 1.5, 2, 4], [2, 3, 4, 5], 'g') plots the provided x and y values in green color as specified by the 'g' argument.
We can also add text to our plot by
using plt.title to add a title and plt.text to place text at arbitrary positions in our figure.
What does plt.text do in this code? - (2)
The plt.text() function in this code adds text to the plot at a specified location.
In this case, it adds the text “Hello World” at the position (2, 2) on the plot, with the color set to red.
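A short sketch of the title and text calls described above. The plotted data and the title string are placeholders; only the plt.text arguments follow the card.

```python
import matplotlib.pyplot as plt

plt.figure()
plt.plot([2, 3, 4, 5])
plt.title('My plot')                               # placeholder title
txt = plt.text(2, 2, 'Hello World', color='red')   # text at data position (2, 2)
```

The position passed to plt.text is in data coordinates, so (2, 2) means x = 2, y = 2 on the axes.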
Explain this code plotting MEG data - (6)
np.loadtxt() loads data from a text file, skipping the first row.
Time data is stored in the first column (t).
Sensor readings from columns 1 and 4 are extracted (plot_dat).
plot_dat.shape gives the size of the extracted data: (400, 2), i.e. 400 rows and 2 columns
plt.plot(t, plot_dat) plots sensor data over time.
Legends and gridlines are added for clarity.
Wait, in this code
there are two lines but only one plt.plot call! What has happened? - (3)
Until this point, each of our plt.plot() calls has only plotted a single line. If there are multiple columns in the data passed to plt.plot (here plot_dat), matplotlib will automatically plot one line per column - magic!
We have also passed in values for the x axis (t: the time variable).
Notice that we have time before 0s. Time 0 is the time we present the stimulus (in this case a ‘beep’). We see that we get a large deviation in the signals shortly after the presentation of the stimulus.
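The real MEG file is not reproduced here, so the sketch below builds a synthetic stand-in with the column layout the cards describe (time in column 0, sensor means in columns 1 and 4, a confidence interval in columns 2 and 3). The header names, signal shapes and legend labels are all assumptions.

```python
import io
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the data file: one header row, then 400 rows of numbers
t_syn = np.linspace(-0.1, 0.3, 400)
left = np.exp(-((t_syn - 0.10) / 0.02) ** 2)
right = 0.8 * np.exp(-((t_syn - 0.11) / 0.02) ** 2)
table = np.column_stack([t_syn, left, left - 0.1, left + 0.1, right])
buffer = io.StringIO()
np.savetxt(buffer, table, delimiter=',',
           header='t,left,left_lo,left_hi,right', comments='')
buffer.seek(0)

data = np.loadtxt(buffer, delimiter=',', skiprows=1)   # skip the header row
t = data[:, 0]                # time is in the first column
plot_dat = data[:, [1, 4]]    # sensor readings from columns 1 and 4
print(plot_dat.shape)         # (400, 2)

plt.figure()
plt.plot(t, plot_dat)         # one plt.plot call, one line per column
plt.legend(['Left', 'Right'])
plt.grid()
```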
The MEG signals we have plotted so far are the
mean response across multiple presentations of the same 'beep'
Someone has also computed a 95% Confidence Interval. We would like to visualise this. To do this, we will use the
plt.fill_between()
The plt.fill_between() function allows us to add 'error bars'
(perhaps an 'error envelope' is a better description) to our plots.
The plt.fill_between is a function in matplotlib for
filling the area between two curves
What does plt.fill_between show? - (4)
Plots the error envelope - the shaded area between left_lower_ci and left_upper_ci on the plot along the time axis
The t parameter is added to define the x-axis values
left_lower_ci = data[:, 2] - extracts the lower CI from column 2
left_upper_ci = data[:, 3] - extracts the upper CI from column 3
We would also like to plot the mean line (as well as the error envelope) by
first plotting the mean line in black using plt.plot() (color='k'), then plotting the error envelope over the top (plt.fill_between()).
When we plot over the top we have to set the color to be a bit transparent otherwise you will not see the line below.
Computers often refer to the transparency or 'solidness' of a color as its 'alpha' value. So if we set 'alpha' to 0.5, it will become 50% see-through. An alpha of 0.2 is even more see-through.
We can also provide a colour argument to - (3)
plt.plot and plt.fill_between() where we use 'shorthand' for colors where 'green' is 'g', 'blue' is 'b', 'black' is 'k' and so on.
e.g., plt.plot(data[:,0], color='r')
e.g., plt.fill_between(t, left_lower_ci, left_upper_ci, alpha=0.5, color='g')
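The mean line plus error envelope can be sketched as follows. The signal and the width of the confidence interval are synthetic, since the real MEG values are not shown here; only the plotting calls follow the cards.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic mean response and 95% CI bounds (made-up values)
t = np.linspace(-0.1, 0.3, 400)
left_mean = np.exp(-((t - 0.1) / 0.02) ** 2)
left_lower_ci = left_mean - 0.1
left_upper_ci = left_mean + 0.1

plt.figure()
plt.plot(t, left_mean, color='k')   # mean line in black
# Shaded envelope drawn over the top; alpha=0.5 keeps the line visible underneath
band = plt.fill_between(t, left_lower_ci, left_upper_ci, alpha=0.5, color='g')
```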
We can specify colours to plt.plot in different ways such as - (3)
Specifying a single letter, e.g. 'r' = red
Using red, green, blue (RGB) format
Using colour names like aquamarine, mediumseagreen
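The three ways of giving a colour might look like this. The RGB values are arbitrary, and the tuple-of-fractions form is one common way of writing RGB in matplotlib (each value between 0 and 1).

```python
import matplotlib.pyplot as plt

plt.figure()
(a,) = plt.plot([1, 2, 3], color='r')                # single-letter shorthand
(b,) = plt.plot([2, 3, 4], color=(0.2, 0.4, 0.8))    # (red, green, blue) fractions
(c,) = plt.plot([3, 4, 5], color='mediumseagreen')   # named colour
```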
To visualise a distribution of
numbers, we can use histograms (frequency plots) and boxplots.
What does plt.plot(data[:,0], color='r') mean? - (4)
data is a 2-D array with 10 rows and 4 columns filled with random numbers drawn from the interval 0 to 1 (0 included, 1 excluded)
data[:,0] selects all the rows of the first column (column 0) of the array, and these values are plotted on the y axis
Since no x axis values are given, indices are generated for the x axis: 0 to 9, since there are 10 rows in the data array
The values from data are plotted on the y axis against their indices on the x axis as a red line
What does data1 = np.random.randn(10000) mean? - (2)
This produces an array containing 10,000 random numbers drawn from the standard normal distribution (mean = 0, SD = 1) - unlike rand, which produces numbers from a flat distribution
Majority of numbers would fall around mean (0)
What does data2 = np.random.randn(10000) + 1.8 do?
Produces an array containing 10,000 random numbers drawn from the standard normal distribution and then shifted by 1.8, so the majority of numbers fall around 1.8 rather than 0
What do plt.figure() and plt.hist(data1, bins = 50) show?
plt.figure() - produces a figure for the histogram
plt.hist(data1, bins = 50)
x axis = range of values in the dataset data1, divided into 50 bins
y axis = frequency (count) of data points falling within each bin on the x axis
bins specifies the number of bins - how lumpy I want the histogram to be
How to produce transparent histograms? - (2)
Specifying the alpha parameter in plt.hist()
Alpha parameter controls the transparency of the histograms
What does plt.hist with alpha value 0.3 mean?
Alpha value of 0.3 means bars in histogram are somewhat transparent
The higher the alpha parameter is in plt.hist() the
less transparent the histogram will be
What does
binEdges=np.linspace(-10,10,100)
do, and what does it mean when specified in histograms?
binEdges = np.linspace(-10,10,100) produces an array of 100 evenly spaced values from -10 to 10 inclusive - used as bin edges, each neighbouring pair defines the boundaries of one bin of the histogram, giving 99 bins
In plt.hist(data1, bins = binEdges) - the bins parameter is used to specify the bin edges to use for the histogram
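The histogram cards can be combined into one sketch: two overlapping, semi-transparent histograms that share the same bin edges. The seed is an addition so that the example is repeatable, and the + 1.8 shift is inferred from the answer to the data2 card.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)                        # only so the sketch is repeatable
data1 = np.random.randn(10000)           # centred on 0
data2 = np.random.randn(10000) + 1.8     # centred on 1.8 (assumed shift)

binEdges = np.linspace(-10, 10, 100)     # 100 edges define 99 bins

plt.figure()
# alpha=0.3 makes the bars see-through so both distributions stay visible
counts1, edges1, _ = plt.hist(data1, bins=binEdges, alpha=0.3)
counts2, edges2, _ = plt.hist(data2, bins=binEdges, alpha=0.3)
```

Using the same bin edges for both calls keeps the bars aligned, which makes the two distributions easier to compare.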
Box and whisker plots are often used to illustrate data which may have a
skewed distribution.
A box and whisker plot makes it easy to see the interquartile range
(25-75% range) as well as the median (50% value). Outlier points (as defined by a multiple of the interquartile range) are plotted as individual points outside the whiskers of the plot.
Explain this code which produces a boxplot - (6)
- data = np.random.rand(1000,3) - produces a 1000x3 NumPy array filled with random numbers between 0 and 1 with a uniform distribution
- data[:,1] = data[:,1]*2 + 1 - multiplies the second column of the data array (all rows) by 2 and adds 1 to each value - scales and shifts the values in the second column
- data[:,2] = data[:,2]*1.5 - 1 - multiplies the values in the third column of the data array by 1.5 and subtracts 1 from each value - scales and shifts the values in the third column
- plt.figure() - produces a figure for the plot
- plt.boxplot(data) - produces a boxplot using the data array - each column is treated as a different dataset
- plt.show() - displays the figure with the boxplot
What is output of the boxplot?
What does plt.xticks([1,2,3], ['Set1', 'Set2', 'Set3']) do?
Adds labels for the 3 boxplots by naming the first boxplot Set1, the second boxplot Set2 and the third boxplot Set3
xticks does not use a
zero-referenced system.
What does it show here:
plt.xticks([1, 2, 3], ['Mouse', 'Elephant', 'Badger'])
The first argument is a list of numbers indicating the different categories.
The second argument is a list of strings saying what to call them.
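Putting the boxplot and xticks cards together gives the sketch below. The seed is an addition for repeatability; the other lines follow the code described in the cards, with plt.show() left out.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)                   # only so the sketch is repeatable
data = np.random.rand(1000, 3)      # three columns, uniform on [0, 1)
data[:, 1] = data[:, 1] * 2 + 1     # second column: scaled by 2, shifted up by 1
data[:, 2] = data[:, 2] * 1.5 - 1   # third column: scaled by 1.5, shifted down by 1

plt.figure()
box = plt.boxplot(data)             # one box per column
# Boxes sit at x = 1, 2, 3 (not 0, 1, 2), so the tick positions start at 1
plt.xticks([1, 2, 3], ['Set1', 'Set2', 'Set3'])
```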
The 'style' of your plots is the default way everything looks. Stuff like
the color of the background, the line thickness, the font.
Matplotlib has a default plotting style. It also has the ability to change this style: either by means of
individual tweaks to plotting layouts, colours etc, or by changing all of its settings in one go.
You can set the plotting style using the
plt.style.use() function
We can change the plotting style using
plt.style.use() function
using
plt.style.use('ggplot')
which shows:
We can change the plotting style using
plt.style.use() function
using
plt.style.use('fivethirtyeight')
which shows:
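A small sketch of switching styles. Listing plt.style.available and resetting with 'default' afterwards are additions not covered in the cards.

```python
import matplotlib.pyplot as plt

print(plt.style.available)   # names that plt.style.use() accepts

plt.style.use('ggplot')      # changes the defaults for all later figures
plt.figure()
plt.plot([2, 3, 4, 5], 'o--')

plt.style.use('default')     # go back to matplotlib's standard look
```

Because plt.style.use() changes the settings for everything plotted afterwards, it is usually called once near the top of a script.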
Question 1
Consider the following code snippet:
What is wrong with this code? How should it be corrected?
Technically this might execute but the file_path is not an absolute path as expected. Almost certainly it is missing the initial ‘/’
Identify the problem in this code and suggest a fix. - (2)
The plt.show() command fixes the image so the last two lines do not do anything.
Place them before the plt.show() command.
This code makes an assumption - what is it? - (2)
It assumes there is at least one line of data in the file as well as the header.
If this is not true, it will throw an exception.
This code has a similar bug to that in Q2. What is it? - (2)
Again - the plt.show() command stops the next line from working.
Change their order.
How might this code run into problems depending on the platform it runs on?
The 'directory' variable hard-codes the '/' separators. This might fail on Windows, where the separator should be '\'.
This code is designed to plot two random time series. What mistake does it make? - (3)
Data is defined as a 2 (down) by 10 (across) array.
So it will plot 10 random time series with two points each.
Change to rand(10,2) to make it work.
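The question's code is not shown, but the shape issue in the answer can be checked directly. This sketch assumes the original call was np.random.rand(2, 10), as the answer describes.

```python
import numpy as np
import matplotlib.pyplot as plt

data_wrong = np.random.rand(2, 10)   # 2 rows, 10 columns
data_right = np.random.rand(10, 2)   # 10 rows, 2 columns

plt.figure()
lines_wrong = plt.plot(data_wrong)   # one line per column: 10 lines of 2 points
plt.figure()
lines_right = plt.plot(data_right)   # one line per column: 2 lines of 10 points

print(len(lines_wrong), len(lines_right))   # 10 2
```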
Again, this code might work or it might not depending on the operating system, even if you are sure that the directory ‘plots’ exists. What line makes it so ‘fragile’ and how could you fix it? - (3)
Line of error: plt.savefig('/plots/parabola.png')
Windows uses \ instead of /, so a Windows-only fix would be plt.savefig(r'\plots\parabola.png')
Use os.path.join to glue together all the bits in a platform-independent manner.
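A sketch of the platform-independent version. The temporary base directory and the parabola data are assumptions so the example runs anywhere; the point is that os.path.join picks the right separator for the operating system.

```python
import os
import tempfile
import numpy as np
import matplotlib.pyplot as plt

base_dir = tempfile.mkdtemp()                # hypothetical stand-in for your project folder
plot_dir = os.path.join(base_dir, 'plots')   # no hard-coded '/' or '\'
os.makedirs(plot_dir, exist_ok=True)         # make sure the directory exists

x = np.linspace(-3, 3, 50)
plt.figure()
plt.plot(x, x ** 2)                          # a parabola

out_path = os.path.join(plot_dir, 'parabola.png')
plt.savefig(out_path)
```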
Which one of these boxplots will have the highest median value (as indicated by the bar across the middle)? - (2)
A
The second one (‘Group 2’). The offset is defined by the +2. The spread is defined by the *.5. So this one will have a median at +2 which is bigger than any of the others.
This is because offset (+2) directly adds a constant value to each data point, it has a more significant impact on the median compared to the spread (*.5).
Q10
A directory contains the following files - (2)
a: apple.txt, allFruit.csv, allFruit.xls, allFruit.tsv, apple.jpg
b: apple.jpg, banana.jpg