Python for Data Science Flashcards
True or False
The IPython Shell is typically used to work with Python interactively
TRUE
Which file extension is used for Python script files?
.py
Python scripts have the extension .py. my_analysis.py is an example of a script name.
You need to print the result of adding 3 and 4 inside a script. Which line of code should you write in the script?
print(3 + 4)
If you do a calculation in the IPython Shell, the result is immediately printed out. If you do the same thing in the script and run it, this printout will not occur.
In Python 3, you will need print(3 + 4). You need to explicitly include this print() function; otherwise the result will not be printed out when you run the script.
Python as a calculator
Python is perfectly suited to do basic calculations. Apart from addition, subtraction, multiplication and division, there is also support for more advanced operations such as:
Exponentiation: **. This operator raises the number to its left to the power of the number to its right: for example 4**2 will give 16.
Modulo: %. It returns the remainder of the division of the number to the left by the number on its right, for example 18 % 7 equals 4.
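Good to remember: the two operators above can be checked with a minimal sketch. The variable names squared and remainder are only illustrative.

```python
# Exponentiation: the left operand raised to the power of the right operand
squared = 4 ** 2

# Modulo: the remainder after dividing the left operand by the right operand
remainder = 18 % 7

print(squared)    # 16
print(remainder)  # 4
```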
Which line of code creates a variable x with the value 15?
x = 15
In Python, variables are used all the time. They make your code reproducible.
You use a single equals sign to create a variable and assign a value to it.
What is the value of the variable z after executing these commands?
x = 5
y = 7
z = x + y + 1
In the command z = x + y + 1, x has the value 5 and y has the value 7. 5 + 7 + 1 equals 13.
You execute the following two lines of Python code:
x = "test"
y = False
You can recognize strings from the quotes. Booleans can either be True or False.
What is a list?
A list is a way to give a single name to a collection of values. These values, or elements, can have any type; they can be floats, integers, booleans, strings, but also more advanced Python types, even lists.
Which of the following is a characteristic of a Python list?
It is a way to name a collection of values, instead of having to create separate variables for each element.
Which three Python data types does this list contain?
x = ["you", 2, "are", "so", True]
“you”, “are” and “so” are strings. 2 is an integer and True is a boolean.
Which command is invalid Python syntax to create a list x?
x = ["this", "is", "a" True "list"]
Only the first command will result in a so-called SyntaxError, because there are no commas before and after True.
Create list with different types
A list can contain any Python type.
Although it’s not really common, a list can also contain a mix of Python types including strings, floats, booleans, etc.
How does Python access elements in a list?
By using an index
What is slicing?
Apart from indexing, there’s also something called slicing, which allows you to select multiple elements from a list, thus creating a new list. You can do this by specifying a range, using a colon.
eg: x[2 : 8]
Which pair of symbols do you need to do list subsetting in Python?
You use square brackets to subset lists in Python.
What Python command should you use to extract the element with index 1 from a Python list x?
x[1]
You use square brackets for subsetting. Inside the square brackets, simply put the index of the element you want to access.
You want to slice a list x. The general syntax is:
x[begin:end]
List slicing is a very powerful technique to extract several list elements from a list at the same time.
In Python, the begin index is included in the slice, the end index is not.
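Good to remember: a small sketch of the begin-included, end-excluded rule. The list contents and the name middle are made up for illustration.

```python
x = ["a", "b", "c", "d", "e", "f"]

# begin index 1 is included, end index 4 is not
middle = x[1:4]
print(middle)   # ['b', 'c', 'd']

# Slicing builds a new list, so x still holds all six elements
print(len(x))   # 6
```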
What does manipulating lists consist of?
- Changing elements
- Adding elements
- Removing elements
You have a list x that is defined as follows:
x = ["a", "b", "b"]
You need to change the second “b” (the third element) to “c”.
Which command should you use?
x[2] = "c"
The third element has index 2. You want to change this element with a string, so you need “c” instead of c.
You have a list x that is defined as follows:
x = ["a", "b", "c"]
Which line of Python code do you need to add “d” at the end of the list x?
x = x + ["d"]
You basically have to create a single-element list containing “d” and add that to the list x.
Next you have to assign the result of this addition to x again to actually update x for the future.
You have a list x that is defined as follows:
x = ["a", "b", "c", "d"]
You decide to remove an element from it by using del:
del(x[3])
How does the list x look after this operation?
["a", "b", "c"]
The operation removed the element with index 3, so the fourth element in the list x.
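Good to remember: the three kinds of list manipulation from the cards above, chained together in one short sketch. The sequence of steps is only an illustration.

```python
x = ["a", "b", "b"]

x[2] = "c"       # change: replace the element at index 2
x = x + ["d"]    # add: concatenate a single-element list and reassign to x
del x[3]         # remove: delete the element at index 3 (the "d" just added)

print(x)         # ['a', 'b', 'c']
```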
But what is a function?
Simply put, a function is a piece of reusable code, aimed at solving a particular task.
You can call functions instead of having to write code yourself.
What is a Python function?
A piece of reusable Python code that solves a particular problem.
A function is a block of code that performs a specific, related action. Functions make your code more modular, so that you can reuse code without having to retype it over and over again.
What Python command opens up the documentation from inside the IPython Shell for the min function?
You can use help(min). Notice that help() is also a function!
What are methods?
You can think of methods as _functions_ that “belong to” Python objects.
A Python object of type string has methods, such as capitalize and replace, but also objects of type float and list have specific methods depending on the type.
In Python, everything is an object, and each object has specific ____________
methods associated with it.
Different objects may have the same methods but ____________
depending on the type of the object, the methods behave differently.
eg. index() exists for both strings and lists
Some methods can change ____________
the objects they are called on.
What is append() in Python?
append() is a method, and therefore also a function.
In Python, practically everything is an object. Every Python object can have functions associated with it. These functions are also called methods.
You have a string x defined as follows:
x = "monty python says hi!"
Which Python command should you use to capitalize this string x?
x.capitalize()
Use the dot notation to call a method on an object, x in this case. Make sure to include the parentheses at the end, even if you don’t pass any additional arguments.
How does the list x look after you execute the following two commands?
x = [4, 9, 5, 7]
x.append(6)
[4, 9, 5, 7, 6]
If you call append() on a list, you’re actually adding the element to the list you called append() on; there’s no need for an explicit assignment (with the = sign) in this case.
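Good to remember: a minimal sketch of how methods behave on different types, and which ones change the object they are called on. The variable names text, numbers and capitalized are only illustrative.

```python
text = "monty python says hi!"
numbers = [4, 9, 5, 7]

# index() exists for both types but behaves differently:
# it finds a substring in the string and an element in the list
print(text.index("python"))   # 6
print(numbers.index(5))       # 2

# capitalize() returns a new string; the original string is unchanged
capitalized = text.capitalize()
print(capitalized)            # Monty python says hi!
print(text)                   # monty python says hi!

# append() changes the list it is called on, no assignment needed
numbers.append(6)
print(numbers)                # [4, 9, 5, 7, 6]
```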
What are Packages?
You can think of a package as a directory of Python scripts. Each such script is a so-called module. These modules specify functions, methods and new Python types aimed at solving particular problems.
Are all packages available in Python by default?
Yes or No
No
How to use Python packages?
To use Python packages, you’ll first have to install them on your system, and then put code in your script to tell Python that you want to use these packages.
What are the main python packages for:
- Data Science
- Data Visualization
- Machine Learning
- data science: there’s numpy (to efficiently work with arrays)
- matplotlib for data visualization,
- scikit-learn for machine learning
Which of the following is a package installation and maintenance system for Python?
pip
pip is a very commonly used tool to install and maintain Python packages.
Which statement is the most common way to invoke the import machinery?
The “import” statement is arguably the easiest way to import packages and modules into Python.
You import Numpy as foo as follows:
import numpy as foo
Which Python command that uses the array() function from Numpy is valid if Numpy is imported as foo?
foo.array([1, 2, 3])
If Numpy is imported as np, you need np.array().
You want to use Numpy’s array() function.
You need to decide whether to import this function as follows:
from numpy import array
or by importing the entire numpy package:
import numpy
Select the two correct statements about these different import methods.
- The from numpy import array version will make it less clear in the code that you’re using Numpy’s array() function.
- Using import numpy will require you to use numpy.array(), making it clear that you’re using a Numpy function.
Importing a particular function makes your code shorter, because you don’t need to include the numpy. prefix. However, it becomes less clear that array() is a function from the numpy package.
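Good to remember: the import styles from the cards above, side by side. This is a sketch that assumes numpy is installed; the names a, b and c are only illustrative.

```python
import numpy
import numpy as np
from numpy import array

a = numpy.array([1, 2, 3])   # full package name as prefix
b = np.array([1, 2, 3])      # alias as prefix
c = array([1, 2, 3])         # no prefix, so the origin of array() is less obvious

# All three names refer to the same function
print(numpy.array is np.array is array)   # True
```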
What is one additional feature of a Numpy array?
A Numpy array is pretty similar to a regular Python list, but has one additional feature: you can perform calculations over entire arrays. It’s really easy, and super-fast as well.
Can Numpy arrays contain different data types?
NO.
A Numpy array can only contain values of a single type. It’s either an array of floats, or an array of booleans, and so on.
The Numpy array is another data type in Python.
TRUE or FALSE
TRUE.
Which Numpy function do you use to create an array?
To create a Numpy array, you use the array() function.
You typically pass a regular Python list as an input.
Which two statements describe the advantage of Numpy Package over regular Python Lists?
- The Numpy Package provides the array, a data type that can be used to do element-wise calculations.
- Because Numpy arrays can only hold elements of a single type, calculations on Numpy arrays can be carried out way faster than on regular Python lists.
Creating a Numpy array is not necessarily easier, but it is a great solution if you want to carry out element-wise calculations, something that regular Python lists aren’t capable of.
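Good to remember: a short sketch of what the + operator does on lists compared with Numpy arrays. The values in heights are made up, and numpy is assumed to be installed.

```python
import numpy as np

heights = [1.5, 2.0, 2.5]   # hypothetical values

# With regular lists, + pastes the lists together
list_result = heights + heights
print(list_result)              # [1.5, 2.0, 2.5, 1.5, 2.0, 2.5]

# With Numpy arrays, + adds the values element by element
array_result = np.array(heights) + np.array(heights)
print(array_result.tolist())    # [3.0, 4.0, 5.0]
```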
What is the resulting Numpy array z after executing the following lines of code?
import numpy as np
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = x + y
array([4, 4, 4])
In Numpy, calculations are performed element-wise. The first element of x and the first element of y are added, giving 4. Similar for the second and third element of x and y.
What happens when you put an integer, a Boolean, and a string in the same Numpy array using the array() function?
All array elements are converted to strings
Numpy arrays can only hold elements with the same basic type. The string is the most ‘general’ and free form to store data, so all other data types are converted to strings.
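Good to remember: the conversion to strings can be seen in a minimal sketch. The name mixed is only illustrative, and numpy is assumed to be installed.

```python
import numpy as np

# An integer, a boolean and a string in one array
mixed = np.array([1, True, "a"])

# Every element has been converted to a string
print(mixed.tolist())     # ['1', 'True', 'a']

# dtype kind 'U' marks a unicode string array
print(mixed.dtype.kind)   # U
```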
For Numpy specifically, you can also use boolean Numpy arrays:
TRUE
What does .ndarray stand for?
N-dimensional array
What does ‘np_2d.shape’ return?
shape is a so-called attribute of the np_2d array that can give you more information about what the data structure looks like. For a 2D array it is a tuple holding the number of rows and the number of columns.
You can think of the 2D numpy array as an _________________________________
improved list of lists
What characterizes multi-dimensional Numpy arrays?
You can create a 2D Numpy array from a regular list of lists.
Multi-dimensional Numpy arrays are natural extensions of the 1D Numpy array:
They can only hold a single type and can be created from a regular Python list structure.
The number N in these N-dimensional Numpy arrays is not limited.
You created the following 2D Numpy array, x:
import numpy as np
x = np.array([["a", "b", "c", "d"],
              ["e", "f", "g", "h"]])
x[1,2]
Apart from element-wise calculations, 2D Numpy arrays also offer more advanced ways of subsetting compared to regular Python lists of lists. To select the second row, use the index 1 before the comma. To select the third column, use the index 2 after the comma.
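Good to remember: a sketch of 2D subsetting on the array from the card above, together with the shape attribute. It assumes numpy is installed.

```python
import numpy as np

x = np.array([["a", "b", "c", "d"],
              ["e", "f", "g", "h"]])

print(x.shape)             # (2, 4): two rows, four columns
print(x[1, 2])             # g: second row, third column
print(x[1].tolist())       # ['e', 'f', 'g', 'h']: the entire second row
print(x[:, 2].tolist())    # ['c', 'g']: the entire third column
```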
What does the resulting array z contain after executing the following lines of Python code?
import numpy as np
x = np.array([[1, 2, 3], [1, 2, 3]])
y = np.array([[1, 1, 1], [1, 2, 3]])
z = x - y
array([[0, 1, 2],
       [0, 0, 0]])
Good Resource for Numpy Arrays
http://cs231n.github.io/python-numpy-tutorial/#numpy-arrays
What will provide you with a “sanity check” of the data?
summarizing statistics
Good to Remember
Numpy offers many functions to calculate basic statistics, such as np.mean(), np.median() and np.std().
Both the mean and median are interesting statistics to check out before you start your analysis. Visual inspection of your data is practically infeasible if you’re dealing with millions of data points.
Select the three statements that hold.
Numpy is a great alternative to the regular Python list if you want to do Data Science in Python.
Numpy arrays can only hold elements of the same basic type.
Next to an efficient data structure, Numpy also offers tools to calculate summary statistics and to simulate statistical distributions.
No matter the dimension of the Numpy array, element-wise calculations will always be possible.
You are writing code to measure your travel time and weather conditions to work each day.
The data is recorded in a Numpy array where each row specifies the measurements for a single day.
The first column specifies the temperature in Fahrenheit. The second column specifies the amount of travel time in minutes.
The following is a sample of the code.
import numpy as np
x = np.array([[28, 18],
[34, 14], [32, 16],
… [26, 23], [23, 17]])
Which Python command do you use to calculate the average travel time?
np.mean(x[:,1])
:,1 inside square brackets tells Python to get all the rows, and the second column. You can then use np.mean() to get the average of the resulting Numpy array.
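Good to remember: a sketch of the column selection, using only the five rows that are visible in the sample above (the elided rows are left out, so the numbers apply to this subset only). The names travel_time, average and middle are illustrative.

```python
import numpy as np

# Only the rows visible in the sample; column 0 is temperature, column 1 is travel time
x = np.array([[28, 18],
              [34, 14],
              [32, 16],
              [26, 23],
              [23, 17]])

travel_time = x[:, 1]            # all rows, second column
average = np.mean(travel_time)
middle = np.median(travel_time)

print(travel_time.tolist())      # [18, 14, 16, 23, 17]
print(average)                   # 17.6
print(middle)                    # 17.0
```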
How to get an overall hunch of your data set?
It’s always a good idea to check both the median and the mean, to get a first hunch for the overall distribution of the entire dataset.
The better your data visualizations the better you will be able to _______________________
extract insights and share with other people
The father of all visualization packages in Python is ____________
matplotlib
Inside the matplotlib package, there’s the ____________ subpackage.
pyplot
What is a scatter plot useful for?
A scatter plot is useful to see all the individual datapoints. Unlike in the line plot, these datapoints will not be connected by a line.
What is a characteristic of data visualization?
Visualization is a very powerful tool for exploring your data and reporting results.
Data visualization is useful in different stages of the data analysis pipeline. The type of visualization that is most appropriate depends on the problem at hand.
What is the conventional way of importing the pyplot sub-package from the matplotlib package?
import matplotlib.pyplot as plt
The general syntax is import package.subpackage as local_name.
You are creating a line plot using the following code:
a = [1, 2, 3, 4]
b = [3, 9, 2, 6]
plt.plot(a, b)
plt.show()
Which two options describe the result of your code?
The first argument corresponds to the horizontal, x-axis. The second argument is mapped onto the vertical, y-axis.
You are modifying the following code that calls the plot() function to create a line plot:
a = [1, 2, 3, 4]
b = [3, 9, 2, 6]
plt.plot(a, b)
plt.show()
What should you change in the code to create a scatter plot instead of a line plot?
Change plot() in plt.plot() to scatter()
To create a scatter plot, you’ll need plt.scatter().
Good to remember about matplotlib
When you have a time scale along the horizontal axis, the line plot is your friend. But in many other cases, when you’re trying to assess if there’s a correlation between two variables, for example, the scatter plot is the better choice.
What are the benefits of using a histogram?
The histogram is a type of visualization that’s particularly useful to explore your data set. It can help you to get an idea about the distribution.
What is a characteristic of a histogram?
Histogram is a great tool for getting a first impression about the distribution of your data.
Histograms are useful to display any distribution, and typically consist of non-overlapping bins. The matplotlib package contains functionality to build histograms very easily.
You are working with a Python list with 10 different values. You divide the values into 5 equally-sized bins.
How wide will these bins be if the lowest value in your list is 0 and the highest is 20?
The range of your values is 20. Dividing these values into 5 equally sized bins will result in bins with width 4.
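Good to remember: the bin width can be checked with a sketch based on np.histogram(), which computes bin edges and counts without drawing anything. The ten values are hypothetical, chosen only so that the lowest is 0 and the highest is 20; numpy is assumed to be installed.

```python
import numpy as np

# Hypothetical data: 10 different values between 0 and 20
values = [0, 2, 5, 7, 8, 11, 13, 16, 19, 20]

counts, edges = np.histogram(values, bins=5)
widths = np.diff(edges)   # distance between consecutive bin edges

print(edges.tolist())     # [0.0, 4.0, 8.0, 12.0, 16.0, 20.0]
print(widths.tolist())    # [4.0, 4.0, 4.0, 4.0, 4.0]
print(counts.tolist())    # [2, 2, 2, 1, 3]
```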
You write the following code:
import matplotlib.pyplot as plt
x = [1, 3, 6, 3, 2, 7, 3, 9, 7, 5, 2, 4]
plt.hist(x)
plt.show()
You need to extend the plt.hist() command to specifically set the number of bins to 4. What should you do?
Add a second argument to plt.hist():
plt.hist(x, bins = 4)
If you do not specify the number of bins the data has to be divided into, matplotlib chooses a suitable number of bins for you.
Setting the number of bins is as simple as specifying the bins argument appropriately.
Why is choosing the right number of bins important in a histogram?
The number of bins is pretty important. Too few bins oversimplifies reality and doesn’t show you the details. Too many bins overcomplicates reality and doesn’t give the bigger picture.
You are customizing a plot by labelling its axes. You need to do this by using matplotlib.
Which code should you use?
xlabel("x-axis title") and ylabel("y-axis title")
To set the axis title, use the functions xlabel() and ylabel()
Which matplotlib function do you use to build a line plot where the area under the graph is colored?
fill_between()
Typically, you place all customization commands between the plot() call and the show() call, as follows:
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [4, 5, 6]
plt.plot(x, y)
# customization here
plt.show()
What will happen if you place the customization code after the show() function instead?
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [4, 5, 6]
plt.plot(x, y)
plt.show()
# customization here
Python doesn’t throw an error, but you won’t see your customizations. The show() function displays the plot you’ve built up until then. If the customizations come afterwards, there is no effect on the shown output.
Therefore, you should place all customization commands between the plot() call and the show() call.
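Good to remember: a sketch of the recommended order, with the axis labels set between plot() and show(). It assumes matplotlib is installed; selecting the "Agg" backend is an assumption made only so the sketch also runs on a machine without a display, and x_label and y_label are illustrative names.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [4, 5, 6]

plt.plot(x, y)

# Customizations go here, before show()
plt.xlabel("x-axis title")
plt.ylabel("y-axis title")

# Read the labels back from the current axes to confirm they were set
x_label = plt.gca().get_xlabel()
y_label = plt.gca().get_ylabel()

plt.show()
```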
You write the following code:
x = 7
if x > 6:
    print("high")
elif x > 3:
    print("ok")
else:
    print("low")
What will be printed out if you execute the code?
high
If your control structures get more advanced, Python can take many different paths through your code.
As soon as Python encounters a condition that is True (x > 6 in this case), the corresponding code is executed and the control structure is abandoned. The elif and else parts are not considered anymore!
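Good to remember: the same control structure wrapped in a function, so the different paths can be tried with several values. The function name classify is hypothetical.

```python
def classify(x):
    # Only the first condition that is True is executed
    if x > 6:
        return "high"
    elif x > 3:
        return "ok"
    else:
        return "low"

print(classify(7))   # high
print(classify(5))   # ok
print(classify(2))   # low
```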
To check if two Python values, or variables, are equal, you can use_____________
==
To check for inequality, you need ____________________
!=
In pandas where do you store data?
DataFrame
Good to remember about PANDAS
You typically don’t build a pandas data frame manually. Instead, you import data from an external file that contains all this data.
How do you access a column in pandas?
To access a column, you typically use square brackets with the column label.
How do you access a row in pandas?
You’ll want to use loc.
eg. bric.loc["BR"]
How is a Pandas DataFrame different from a 2D Numpy array?
In Pandas, different columns can contain different types.
Both Pandas and Numpy offer many different ways of subsetting. 2D Numpy arrays can only contain values of the same basic type, a downside compared to Pandas if you’re working on typical Data Science problems.
What are two characteristics that describe Pandas DataFrame?
The rows correspond to observations.
The columns correspond to variables.
Which Pandas function do you use to import data from a comma-separated value (CSV) file into a Pandas DataFrame?
read_csv() is the function you need. You can specify a ton of other arguments to customize the way the data is imported.
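Good to remember: a sketch of read_csv() that reads from an in-memory string instead of a real file, so it runs on its own. The CSV content and the column names are made up for illustration; index_col=0 is one of the extra arguments and tells pandas to use the first column as row labels. It assumes pandas is installed.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real script would pass a file path instead
csv_text = """,country,capital
BR,Brazil,Brasilia
RU,Russia,Moscow
"""

bric = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(bric.shape)                  # (2, 2)
print(bric.loc["BR", "capital"])   # Brasilia
```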
Which technique should you use to select an entire row by its row label when accessing data in a Pandas DataFrame?
loc
Square brackets are used to get specific columns from a Pandas DataFrame. iloc is used if you want to select a row based on its position in the DataFrame, and not based on its row label.
cars['cars_per_cap']
cars[['cars_per_cap']]
What is the difference between these two methods of accessing a column in pandas?
The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.
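Good to remember: a sketch that shows both bracket versions, plus loc and iloc for rows. The cars data frame here is hypothetical: its row labels and values are made up for illustration. It assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data, for illustration only
cars = pd.DataFrame(
    {"cars_per_cap": [809, 731, 588], "drives_right": [True, False, False]},
    index=["US", "AUS", "JPN"],
)

as_series = cars["cars_per_cap"]     # single brackets
as_frame = cars[["cars_per_cap"]]    # double brackets

print(type(as_series).__name__)      # Series
print(type(as_frame).__name__)       # DataFrame

row_by_label = cars.loc["AUS"]       # row selected by its label
row_by_position = cars.iloc[1]       # row selected by its position

print(row_by_label.equals(row_by_position))   # True
```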