Data Input and Validation Flashcards

read_csv() shape head() tail() info()

1
Q

Which functions can you use to read data?

A
read_excel()
read_csv()
read_json()
read_sql_table()
Example:
import pandas as pd
oo = pd.read_csv('../data/olympics.csv', skiprows=4)
2
Q

Example of parameters to pass to pd.read_csv()

A

pd.read_csv(filepath, ..., skiprows=None, ...)
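
A minimal sketch of how skiprows behaves, assuming a small in-memory CSV instead of a real file. The text and column names below are made up for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV text: two non-data lines precede the real header row.
raw = (
    "report generated for demo\n"
    "source: hypothetical\n"
    "City,Edition\n"
    "Athens,1896\n"
    "Paris,1900\n"
)

# skiprows=2 drops the first two lines, so "City,Edition" becomes the header.
df = pd.read_csv(io.StringIO(raw), skiprows=2)
print(df)
```

Passing an integer skips that many lines from the top of the file; the first line that remains is treated as the header.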

3
Q

The shape attribute returns a tuple of …

A

(rows, columns), representing the dimensionality of the DataFrame.
df.shape

4
Q

By default, how many rows of the DataFrame do

df.head()
df.tail()

return?

A

5
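
A short sketch of shape, head(), and tail() on a toy DataFrame. The column name n is an assumption made for the example.

```python
import pandas as pd

# Toy DataFrame with 12 rows and a single column.
df = pd.DataFrame({"n": range(12)})

print(df.shape)        # (rows, columns) tuple
first = df.head()      # no argument: first 5 rows
last = df.tail()       # no argument: last 5 rows
last_three = df.tail(3)  # an explicit count overrides the default

print(len(first), len(last), len(last_three))
```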

5
Q

What does the info() method provide?

A

info() provides a summary of the DataFrame, including the number of entries, the data type of each column, and the number of non-null entries in each Series. This matters because real data sets often have missing data, and you need this view to decide how to handle it.
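
A minimal sketch of reading the non-null counts, assuming a toy DataFrame with one missing value. The data is invented; the buf argument is only used here to capture the summary as a string.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data with one missing medal.
df = pd.DataFrame({
    "Athlete": ["A", "B", "C"],
    "Medal": ["Gold", np.nan, "Bronze"],
})

# info() prints to stdout by default; a buffer lets us keep the text.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
print(summary)

# The same non-null counts, computed directly.
non_null = df.notna().sum()
print(non_null)
```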

6
Q

What does the value_counts() method return?

A

It returns a Series containing the count of each unique value. Two things to be aware of: the result is sorted by count, so the first entry is the most frequently occurring value, the second entry is the second most frequent, and so on. This order can be reversed by setting the ascending flag to True.

Note that value_counts() is most often used on a Series; DataFrame.value_counts() also exists in pandas 1.1 and later.

7
Q

What are some parameters you can pass to Series.value_counts()?

A

Series.value_counts(normalize = False, sort = True, ascending = False, bins = None, dropna = True)
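
A small sketch of a few of these parameters on an invented Series of medal names:

```python
import pandas as pd

# Hypothetical medal data.
medals = pd.Series(["Gold", "Silver", "Gold", "Bronze", "Gold", "Silver"])

counts = medals.value_counts()                     # most frequent first
least_first = medals.value_counts(ascending=True)  # least frequent first
shares = medals.value_counts(normalize=True)       # fractions instead of counts

print(counts)
print(least_first)
print(shares)
```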


8
Q

Series.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

What does the axis parameter mean?

A

With axis=0 (the default), you sort along the column, that is, down the rows; ascending order is also the default. If you picture a Series as a single column, you are sorting the contents of that column in ascending order. By default, NaNs (missing data) are placed at the end. With a DataFrame, sort_values() is particularly useful because you can sort by multiple Series, each in ascending or descending order.

9
Q

How would you sort a DataFrame?

A

DataFrame.sort_values(by=['Series1', 'Series2'])
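
A sketch of sorting by two columns with a different direction for each, plus the default NaN placement on a Series. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "NOC": ["USA", "FRA", "USA", "FRA"],
    "Edition": [1900, 1896, 1896, 1900],
})

# NOC ascending, then Edition descending within each NOC.
by_both = df.sort_values(by=["NOC", "Edition"], ascending=[True, False])
print(by_both)

# On a Series, missing values go last by default (na_position='last').
s = pd.Series([3.0, np.nan, 1.0])
sorted_s = s.sort_values()
print(sorted_s)
```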

10
Q

What is Boolean indexing?

Give a syntax example.

A

Boolean vectors, or conditions, can be used to filter data. Build a Series of True and False values from a condition and pass it to a Series or DataFrame to select the rows where it is True. Instead of and, or, and not, as in most programming languages, use the symbols & | ~. If you have more than one condition, each one must be wrapped in parentheses.

df[(df['col_Name'] > 0)]
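
A minimal sketch of combining conditions with & and ~, using an invented DataFrame:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "Athlete": ["A", "B", "C", "D"],
    "Gender": ["Men", "Women", "Women", "Men"],
    "Edition": [1896, 2008, 1996, 2008],
})

# Each condition sits in its own parentheses before being combined with &.
mask = (df["Gender"] == "Women") & (df["Edition"] >= 2000)
women_recent = df[mask]

# ~ negates a Boolean vector.
not_2008 = df[~(df["Edition"] == 2008)]

print(women_recent)
print(not_2008)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.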

11
Q

Give a couple of examples of string handling.

A

Series.str.contains()
Series.str.startswith()
Series.str.isnumeric()
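
A short sketch of these three string methods on an invented Series; each returns a Boolean Series that can be used for Boolean indexing.

```python
import pandas as pd

# Hypothetical values.
names = pd.Series(["LEWIS, Carl", "OWENS, Jesse", "LEWIS, Denise", "1984"])

has_lewis = names.str.contains("LEWIS")     # substring match
starts_owens = names.str.startswith("OWENS")  # prefix match
numeric = names.str.isnumeric()             # all characters numeric?

# The Boolean result filters the original Series.
print(names[has_lewis])
```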

12
Q

Syntax to create an indexed DataFrame from scratch

A

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

13
Q

How do you create a new DataFrame out of an existing one?
Example: a DataFrame called df
contains 5 columns: City, Edition, NOC, Athlete, Gender.
Create a DataFrame with only Edition and Athlete.

A

df[['Edition', 'Athlete']]

Note that [[ ]] creates a new DataFrame.
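
A sketch of the difference between double and single brackets, using a one-row DataFrame with the five columns named in the question. The values are made up.

```python
import pandas as pd

# Hypothetical single-row DataFrame.
df = pd.DataFrame({
    "City": ["Athens"],
    "Edition": [1896],
    "NOC": ["GRE"],
    "Athlete": ["X"],
    "Gender": ["Men"],
})

subset = df[["Edition", "Athlete"]]  # list of columns -> DataFrame
single = df["Edition"]               # single label -> Series

print(type(subset).__name__, type(single).__name__)
```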

14
Q

What do these two lines do?
import matplotlib.pyplot as plt
%matplotlib inline

A

The line "import matplotlib.pyplot as plt" lets you use the matplotlib.pyplot module under the abbreviation plt.

The IPython kernel works seamlessly with the Matplotlib plotting library to provide inline plotting. To set this up, you execute the second line, %matplotlib inline, which is known as a magic command. With the Matplotlib inline backend, the output of plotting commands is displayed inline, within the Jupyter Notebook, directly below the code cell that produced it. The resulting plots are also stored in the notebook document.

15
Q

What is the default kind of plot?

A

By default, the plot is a line plot, but you can specify another type, such as a bar chart, a horizontal bar chart (barh), or a pie chart.

plot(kind='line')
plot(kind='bar')
plot(kind='barh')
plot(kind='pie')

16
Q

How do you specify the figure size of a plot?

A

figsize is a parameter that takes a tuple giving the width and the height in inches.
plot(figsize=(width, height))
Example:

plot(kind='line', color='yellow', figsize=(5, 5))

17
Q

What are the classes of colormaps?

A

Sequential, Diverging, Qualitative.
Sequential colormaps should be used for information that has an ordering. They change in lightness, often over a single hue.

Diverging colormaps are used when the plotted information deviates around a middle value; they often use two different colors. Qualitative colormaps represent information that has no ordering or relationship, and often consist of miscellaneous colors.

18
Q

Name at least two colormaps.

A

Set1, magma, YlGn

19
Q

What is Seaborn?

A

Seaborn is a visualization library based on Matplotlib. One reason to use it is that it produces attractive statistical plots. Seaborn is a complement to Matplotlib, not a substitute for it. Another advantage is that it works very well with pandas.

import seaborn as sns

20
Q

Example for using Seaborn

A

sns.countplot(x='Gender', data=oo, hue='Sport')

sns.countplot(x='Medal', data=dfB, hue='Gender', palette='bwr', order=['Gold', 'Silver', 'Bronze'])

21
Q

What is an index in pandas?

A

The Index object is an immutable array, and indexing lets you access a row or a column using a label. This is a distinguishing feature of pandas, because arrays in many other programming languages cannot be accessed by label.

22
Q

What does set_index() do?

A

set_index() lets you choose which of the Series (columns) becomes the index.

23
Q

Example of using set_index()

A

DataFrame.copy().set_index('SeriesName')

24
Q

Use reset_index() to

A

return a DataFrame to its default, integer-based index:

DataFrame.reset_index(inplace=True)
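
A minimal round trip through set_index() and reset_index(), assuming an invented two-row DataFrame. This sketch uses the returned copies rather than inplace=True.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"Athlete": ["A", "B"], "Medal": ["Gold", "Silver"]})

# set_index returns a new DataFrame; df itself is unchanged.
indexed = df.set_index("Athlete")
medal_b = indexed.loc["B", "Medal"]  # label-based lookup via the new index

# reset_index moves the index back into a column and restores 0..n-1.
restored = indexed.reset_index()

print(indexed)
print(restored)
```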

25
Q

What does sort_index() do?

A

sort_index() sorts all the items by the index. The advantage is that with a particularly large data set, sorting the index reduces the time needed to access any subset of the data. You can sort objects by label along either axis.

DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, ...)

26
Q

How does loc[] work?

A

loc[] is a label-based indexer, which means you select by label. Note that loc[] uses square brackets, not parentheses. loc[] raises a KeyError when items are not found.

DataFrame.loc[]
DataFrame['Series'].loc[]

27
Q

How does iloc[] work?

A

iloc[] selects by integer position. One of its advantages is that it supports traditional Python slicing.

28
Q

How do you give multiple entries to iloc?

A

Use a list. For example, you might want the rows at integer positions 1542, 2390, 6000, and 15000; passing them as a list returns the corresponding rows.

df.iloc[[1, 4, 5, 10]]

29
Q

Give an example of using iloc with Python slicing.

A

df.iloc[1:4]
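
A sketch contrasting loc[] and iloc[] on a small DataFrame with string labels. The labels and values are assumptions for the example.

```python
import pandas as pd

# Hypothetical data with a non-integer index.
df = pd.DataFrame(
    {"Medal": ["Gold", "Silver", "Bronze", "Gold"]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["b":"c"]  # label slice: both endpoints are included
by_pos = df.iloc[1:3]       # position slice: the stop position is excluded
picked = df.iloc[[0, 3]]    # list of integer positions

# loc raises KeyError for a label that is not in the index.
missing = False
try:
    df.loc["z"]
except KeyError:
    missing = True

print(by_label)
print(picked)
```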

30
Q

How does groupby work?

A

Groupby does three things: it splits a DataFrame into groups based on some criteria, applies a function to each group independently, and combines the results into a DataFrame.

The GroupBy object isn't a DataFrame but rather a group of DataFrames in a dict-like structure. Each of those groups is itself a DataFrame.
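
The split, apply, and combine steps can be sketched on a toy DataFrame. The column names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"NOC": ["USA", "USA", "FRA"], "Medals": [2, 3, 1]})

# Split: one group per distinct NOC value.
grouped = df.groupby("NOC")
keys = [key for key, _ in grouped]

# Each group is itself a DataFrame.
usa = grouped.get_group("USA")

# Apply and combine: sum within each group, collected into one result.
totals = grouped["Medals"].sum()

print(keys)
print(usa)
print(totals)
```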

31
Q

What will type() return on df.groupby('ColName')?

A

pandas.core.groupby.generic.DataFrameGroupBy

32
Q

Example of how to apply groupby

A

for group_key, group_value in oo.groupby('Edition'):
    print(group_key)
    print(group_value)

33
Q

Groupby Computations

A
GroupBy.size()
GroupBy.count()
GroupBy.first() / GroupBy.last()
GroupBy.head() / GroupBy.tail()
GroupBy.mean()
GroupBy.max() / GroupBy.min()
34
Q

What does agg() do?

A

Instructions for aggregation are provided in the form of a Python dictionary or a list.

The dictionary keys specify which Series (columns) of the DataFrame to operate on, and the dictionary values specify the functions to run. You can also pass custom functions in the list of aggregations; each is passed the values from the column in your grouped data. Groupby is a very useful pandas feature, and it is worth the time to make sure you understand how to use it.

DataFrame.groupby(…).agg({…: […]})
DataFrame.groupby(…).agg([…])

35
Q

Example:

A

oo.loc[oo['Athlete'] == 'LEWIS, Carl'].groupby('Athlete').agg({'Edition': ['min', 'max', 'count']})
oo.groupby(['NOC']).agg({'Edition': ['min', 'max', 'size']})
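
A runnable sketch of the same pattern, assuming a tiny stand-in for the oo DataFrame (the rows are made up). It also shows a custom function passed to agg().

```python
import pandas as pd

# Hypothetical stand-in for the olympics data.
oo = pd.DataFrame({
    "NOC": ["USA", "USA", "FRA", "FRA", "FRA"],
    "Edition": [1984, 1988, 1896, 1900, 1924],
})

# Dictionary form: column name -> list of aggregation functions.
summary = oo.groupby("NOC").agg({"Edition": ["min", "max", "count"]})
print(summary)

# Custom function: years between first and last appearance per NOC.
span = oo.groupby("NOC")["Edition"].agg(lambda s: s.max() - s.min())
print(span)
```

With a list of functions, the result has two-level column labels such as ('Edition', 'min').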

36
Q

What do the stack() and unstack() functions do?

A

stack() and unstack() are very helpful, especially in conjunction with groupby. stack() moves the innermost column level into the row index of the DataFrame, and unstack() does the reverse.

Both functions help you reshape the DataFrame.

37
Q

What does the stack() function return?

A

stack() returns a DataFrame or a Series with a new innermost level of row labels. The inner levels of the result are sorted.
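
A sketch of unstack() and stack() applied to a grouped result, using invented data. Here stack() returns a Series because the wide table has only one level of column labels.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "NOC": ["USA", "USA", "FRA", "FRA"],
    "Gender": ["Men", "Women", "Men", "Women"],
    "Medals": [5, 3, 2, 1],
})

# Series with a two-level row index (NOC, Gender).
counts = df.groupby(["NOC", "Gender"])["Medals"].sum()

# unstack: move the Gender level from the rows to the columns.
wide = counts.unstack("Gender")

# stack: move the column labels back into the innermost row level.
long = wide.stack()

print(wide)
print(long)
```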

38
Q

Array Slicing: Accessing Subarrays

Multidimensional subarrays

A
x[start:stop:step]
x = np.arange(10)
# first five elements
print(x[:5])
# elements from index 5 onward
print(x[5:])
# middle subarray (indices 4 to 6)
print(x[4:7])
Multidimensional slices work in the same way, with multiple slices separated by commas. For example:
x2[:, 0] # first column of x2
x2[:3, ::2] # first three rows, every other column
39
Q

One important, and extremely useful, thing to know about array slices is that they
return views rather than copies of the array data.

A
print(x2)
[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]
x2_sub = x2[:, 0]   # a view onto the first column of x2
print(x2_sub)
[3 7 1]
x2_sub[:] = 0       # modifying the view also modifies x2
print(x2)
[[0 5 2 4]
 [0 6 8 8]
 [0 6 7 7]]
40
Q

Creating copies of arrays

A
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
[[99 5]
[ 7 6]]
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
[[42 5]
[ 7 6]]
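
A self-contained sketch contrasting a view with a copy. The array values are borrowed from the earlier card, and np.shares_memory is used only to make the difference visible.

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

view = x2[:2, :2]         # slice: a view onto x2's data
copy = x2[:2, :2].copy()  # explicit copy: independent data

view[0, 0] = 99   # changes x2 as well
copy[0, 1] = -1   # leaves x2 untouched

print(x2)
print(np.shares_memory(x2, view), np.shares_memory(x2, copy))
```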
41
Q

Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished
through the routines np.concatenate, np.vstack, and np.hstack. np.concatenate
takes a tuple or list of arrays as its first argument.

A
x = np.arange(10)
y = np.arange(10)
np.concatenate([x,y])
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
grid = np.arange(10).reshape(2,5)
np.concatenate([grid,grid]) # concatenate along the first axis
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9],
       [0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
42
Q

Concatenate along the second axis (axis=1)

A

grid = np.array([[0, 1, 2, 3, 4],
                 [5, 6, 7, 8, 9]])
np.concatenate([grid,grid], axis = 1)

array([[0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
[5, 6, 7, 8, 9, 5, 6, 7, 8, 9]])

43
Q

Concatenation of arrays

A

Concatenation, or joining of two arrays in NumPy, is primarily accomplished
through the routines np.concatenate, np.vstack, and np.hstack. np.concatenate
takes a tuple or list of arrays as its first argument, as we can see here:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
array([1, 2, 3, 3, 2, 1])

44
Q

You can also concatenate more than two arrays at once:

A

z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1 2 3 3 2 1 99 99 99]

45
Q

np.concatenate can also be used for two-dimensional arrays:

A

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

46
Q
# concatenate along the first axis
np.concatenate([grid, grid])
A
array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
47
Q

For working with arrays of mixed dimensions, it can be clearer to use the np.vstack
(vertical stack) and np.hstack (horizontal stack) functions:

A
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
# vertically stack the arrays
np.vstack([x, grid])
array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])
array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])
48
Q

Splitting of arrays

A

The opposite of concatenation is splitting, which is implemented by the functions
np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of indices
giving the split points:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
Notice that N split points lead to N + 1 subarrays. The related functions np.hsplit
and np.vsplit are similar.

49
Q

grid = np.arange(16).reshape((4, 4))

grid

A
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]
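
np.hsplit works the same way along the columns. A minimal sketch on the same 4 x 4 grid:

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

# One split point at column 2 gives two subarrays: columns 0-1 and 2-3.
left, right = np.hsplit(grid, [2])
print(left)
print(right)

# Stacking the halves side by side rebuilds the original grid.
rebuilt = np.hstack([left, right])
```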

50
Q

Arithmetic operators implemented in NumPy

A
Operator  Equivalent ufunc  Description
+         np.add            Addition (e.g., 1 + 1 = 2)
-         np.subtract       Subtraction (e.g., 3 - 2 = 1)
-         np.negative       Unary negation (e.g., -2)
*         np.multiply       Multiplication (e.g., 2 * 3 = 6)
/         np.divide         Division (e.g., 3 / 2 = 1.5)
//        np.floor_divide   Floor division (e.g., 3 // 2 = 1)
**        np.power          Exponentiation (e.g., 2 ** 3 = 8)
%         np.mod            Modulus/remainder (e.g., 9 % 4 = 1)
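
A quick sketch showing that each operator gives the same result as its ufunc when applied to an array:

```python
import numpy as np

x = np.arange(4)  # array([0, 1, 2, 3])

# Operators on arrays are wrappers around the corresponding ufuncs.
same_add = np.array_equal(x + 5, np.add(x, 5))
same_floor = np.array_equal(x // 2, np.floor_divide(x, 2))
same_pow = np.array_equal(x ** 2, np.power(x, 2))
same_mod = np.array_equal(x % 2, np.mod(x, 2))

print(x // 2)
print(x ** 2)
print(x % 2)
```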
51
Q

np.absolute(x)

A

x = np.array([-2, -1, 0, 1, 2])
np.absolute(x)
array([2, 1, 0, 1, 2])

52
Q

Trigonometric functions

A
theta = np.linspace(0, np.pi, 3)
print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
theta = [ 0. 1.57079633 3.14159265]
sin(theta) = [ 0.00000000e+00 1.00000000e+00 1.22464680e-16]
cos(theta) = [ 1.00000000e+00 6.12323400e-17 -1.00000000e+00]
tan(theta) = [ 0.00000000e+00 1.63312394e+16 -1.22464680e-16]
53
Q

from scipy import special

A
# Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x) =", special.gamma(x))
print("ln|gamma(x)| =", special.gammaln(x))
print("beta(x, 2) =", special.beta(x, 2))
54
Q

For large calculations, it is sometimes useful to be able to specify the array where the
result of the calculation will be stored. Rather than creating a temporary array, you
can use this to write computation results directly to the memory location where you'd
like them to be. For all ufuncs, you can do this using the out argument of the
function:

A

x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
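
The out argument also accepts array views. As a sketch, the following writes the results into every other element of a preallocated array; the array sizes are chosen only for illustration.

```python
import numpy as np

x = np.arange(5)
y = np.zeros(10)

# y[::2] is a view, so the powers of two land directly in y's even slots.
np.power(2, x, out=y[::2])
print(y)
```

Writing y[::2] = 2 ** x would give the same values, but it would first build a temporary array for 2 ** x and then copy it into y.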