Data Input and Validation Flashcards
read_csv() shape head() tail() info()
What functions can one use to read data?
read_excel() read_csv() read_json() read_sql_table()
Example:
import pandas as pd
oo = pd.read_csv('../data/olympics.csv', skiprows=4)
Example of parameters to pass to pd.read_csv()
pd.read_csv( filepath,..,skiprows = None,… )
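A minimal sketch of what skiprows does, assuming a made-up file whose first two lines are junk. The in-memory text stands in for a file on disk; its contents and the name demo are hypothetical.

```python
import io
import pandas as pd

# Hypothetical file contents: two junk lines precede the real header row.
raw = io.StringIO(
    "Exported by some tool\n"
    "Generated for illustration\n"
    "City,Edition,Athlete\n"
    "Athens,1896,A\n"
    "Paris,1900,B\n"
)

# skiprows=2 discards the first two lines, so the third line becomes the header.
demo = pd.read_csv(raw, skiprows=2)
print(demo.columns.tolist())
print(demo.shape)
```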
The shape attribute returns a tuple of …
(rows, columns), representing the dimensionality of the DataFrame
df.shape
By default,
df.head()
df.tail()
return how many rows of the df?
5
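A quick sketch of shape, head(), and tail() on a made-up 12-row frame (the column name n is hypothetical):

```python
import pandas as pd

# Hypothetical frame with 12 rows and a single column.
df = pd.DataFrame({"n": range(12)})

rows, cols = df.shape        # shape is an attribute, not a method call
head_default = df.head()     # first 5 rows by default
tail_default = df.tail()     # last 5 rows by default
head_three = df.head(3)      # pass a number to override the default

print(rows, cols)
print(tail_default)
```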
What does the info() method provide?
info() provides a summary of the DataFrame, including the number of entries, the data type, and the number of non-null entries for each Series in the DataFrame. This is important because real data sets often have missing data, and you need a view of it to decide how to handle it.
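A small sketch of reading the info() summary, assuming a made-up frame with one missing value. info() normally prints to the screen; here it is captured in a buffer so the text can be inspected. The isna().sum() line is an extra, not part of info(), that gives the missing count per column directly.

```python
import io
import pandas as pd

# Hypothetical frame: one athlete name is missing.
df = pd.DataFrame({
    "Athlete": ["A", "B", None],
    "Edition": [1896, 1900, 1904],
})

# Capture the summary instead of printing it straight to stdout.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Count of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```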
What does the value_counts() method return?
It returns a Series containing the count of each unique value. Two things in particular to be aware of: because the result is sorted by count, the first entry is the most frequently occurring value, the second is the second most frequent, and so on; and this order can be reversed by setting the ascending flag to True.
Note that this card refers to Series.value_counts(). Since pandas 1.1, DataFrame.value_counts() also exists and counts unique rows.
What are some parameters one can pass to Series.value_counts() ?
Series.value_counts(normalize = False, sort = True, ascending = False, bins = None, dropna = True)
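A short sketch of these parameters on a made-up Series of medals (the data is hypothetical and includes one missing value):

```python
import pandas as pd

# Hypothetical data: 3 Gold, 2 Silver, 1 Bronze, 1 missing.
medals = pd.Series(["Gold", "Silver", "Gold", "Bronze", "Gold", "Silver", None])

counts = medals.value_counts()                  # most frequent first, NaN dropped
reversed_counts = medals.value_counts(ascending=True)
shares = medals.value_counts(normalize=True)    # proportions instead of counts
with_missing = medals.value_counts(dropna=False)

print(counts)
print(shares)
```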
Series.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
What does the axis parameter mean?
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
With axis equal to zero, you are sorting along the column, in ascending order by default. So if you visualize a Series as a single column, you are sorting the contents of that column in ascending order. By default, the NaNs, or missing data, are put at the end. sort_values() is particularly useful with a DataFrame, because you can sort by multiple columns, each in ascending or descending order.
How would you sort a DataFrame?
DataFrame.sort_values(by=['Series1', 'Series2'])
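A sketch of both forms of sort_values(), assuming made-up data. The column names Edition and Athlete are borrowed from the other cards; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Series: ascending by default, NaN placed last unless na_position says otherwise.
s = pd.Series([3.0, np.nan, 1.0, 2.0])
ascending_sorted = s.sort_values()
nan_first = s.sort_values(na_position="first")

# DataFrame: sort by Edition ascending, then Athlete descending within each Edition.
df = pd.DataFrame({
    "Edition": [1900, 1896, 1900, 1896],
    "Athlete": ["B", "D", "A", "C"],
})
sorted_df = df.sort_values(by=["Edition", "Athlete"], ascending=[True, False])

print(ascending_sorted)
print(sorted_df)
```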
What is Boolean indexing?
Give a syntax example.
Boolean vectors or conditions can be used to filter data. Based on a condition, pass a Series of True and False values to a Series or DataFrame to select and display the rows where the Series has True values. Instead of using and, or, and not, as in most programming languages, use the symbols &, |, and ~. Remember that if you have more than one condition, or Boolean vector, each one must be grouped in parentheses.
df[df['col_Name'] > 0]
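A sketch of single and combined conditions on a made-up frame (all values are hypothetical). Note the parentheses around each condition when & or ~ is involved.

```python
import pandas as pd

df = pd.DataFrame({
    "Athlete": ["A", "B", "C", "D"],
    "Gender": ["Men", "Women", "Women", "Men"],
    "Edition": [1896, 1900, 2008, 2008],
})

# A single condition produces a Boolean Series that acts as a row filter.
mask = df["Edition"] > 1900
recent = df[mask]

# Two conditions: wrap each in parentheses and combine with & (and) or | (or).
women_2008 = df[(df["Gender"] == "Women") & (df["Edition"] == 2008)]

# ~ negates a condition.
not_men = df[~(df["Gender"] == "Men")]

print(recent)
print(women_2008)
```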
Give a couple of examples of string handling.
Series.str.contains()
Series.str.startswith()
Series.str.isnumeric()
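A sketch of the three methods on a made-up Series of strings. Each returns a Boolean Series, so the result can be used directly for Boolean indexing.

```python
import pandas as pd

# Hypothetical values for illustration.
s = pd.Series(["Phelps, Michael", "Bolt, Usain", "2008"])

has_bolt = s.str.contains("Bolt")
starts_with_phelps = s.str.startswith("Phelps")
is_number = s.str.isnumeric()

# The Boolean result doubles as a filter.
only_bolt = s[has_bolt]

print(has_bolt.tolist())
print(only_bolt.tolist())
```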
Syntax to create an indexed DataFrame from scratch
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
How do you create a new DataFrame out of an existing one?
Example: a DataFrame called df
contains 5 columns: City, Edition, NOC, Athlete, Gender.
Create a DataFrame with only Edition and Athlete:
df[['Edition', 'Athlete']]
Note that [[ ]] creates a new DataFrame
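A sketch contrasting double and single brackets, using the five column names from the card with made-up values:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the card.
df = pd.DataFrame({
    "City": ["Athens", "Paris"],
    "Edition": [1896, 1900],
    "NOC": ["X", "Y"],
    "Athlete": ["A", "B"],
    "Gender": ["Men", "Women"],
})

subset = df[["Edition", "Athlete"]]  # list of names -> DataFrame
single = df["Edition"]               # one name -> Series

print(type(subset).__name__, subset.columns.tolist())
print(type(single).__name__)
```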
What do these two lines do?
import matplotlib.pyplot as plt
%matplotlib inline
The line "import matplotlib.pyplot as plt" allows you to use the matplotlib.pyplot module using the abbreviation plt.
The IPython kernel works seamlessly with the Matplotlib plotting library to provide this functionality. To set this up, you must execute the second line, %matplotlib inline, which is known as a magic command. With the Matplotlib inline backend, the output of plotting commands is displayed inline, within the Jupyter Notebook, directly below the code cell that produced it. The resulting plots are then also stored in the notebook document.
What is the default kind of graph?
By default, the graph is a line plot, but you can specify another type of graph, such as a bar chart, a horizontal bar chart (barh), or a pie chart.
plot(kind='line')
plot(kind='bar')
plot(kind='barh')
plot(kind='pie')
How do you specify the figure size of the plot?
The figure size is a tuple where you specify the width and the height in inches.
plot(figsize=(width, height))
Example:
plot(kind='line', color='yellow', figsize=(5, 5))
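A sketch of kind and figsize together, assuming a made-up Series of medal counts. The Agg backend is selected here only so the sketch runs outside a notebook; in Jupyter you would rely on %matplotlib inline instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for running as a script
import pandas as pd

# Hypothetical counts for illustration.
counts = pd.Series([3, 2, 1], index=["Gold", "Silver", "Bronze"])

# kind picks the chart type; figsize is (width, height) in inches.
ax = counts.plot(kind="bar", color="yellow", figsize=(5, 5))

# The figure behind the returned axes reports its size in inches.
width, height = ax.figure.get_size_inches()
print(width, height)
```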
What are the classes of colormaps?
Sequential, Diverging, Qualitative.
The sequential should be used for representing information that has ordering. There is a change in lightness, often over a single hue.
Diverging is to be used when the information being plotted deviates around a middle value. Here two different colors are often used. Finally, the qualitative class is used to represent information that does not have any ordering or relationship, and it often uses miscellaneous colors.
Name at least two colormaps.
Set1 magma YlGn
What is Seaborn?
Seaborn is a visualization library based on Matplotlib. One reason to use Seaborn is that it produces beautiful statistical plots. It is important to realize that Seaborn is a complement to Matplotlib, not a substitute for it. Another advantage of Seaborn is that it works very well with pandas.
import seaborn as sns
Example for using Seaborn
sns.countplot(x='Gender', data=oo, hue='Sport')
sns.countplot(x='Medal', data=dfB, hue='Gender', palette='bwr', order=['Gold', 'Silver', 'Bronze'])
What is the index in pandas?
The Index object is an immutable array, and indexing allows you to access a row or a column using a label. This is what makes pandas special, because in other programming languages you typically cannot access an array using labels.
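A sketch of label-based access and index immutability, reusing the small indexed frame from the earlier card. The .loc accessor is used here as one common way to look up by label.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])

row_b = df.loc["b"]        # row selected by its label
cell = df.loc["c", "A"]    # single value by row label and column label

# The Index itself is immutable: assigning to one element raises TypeError.
try:
    df.index[0] = "z"
    mutated = True
except TypeError:
    mutated = False

print(row_b.tolist(), cell, mutated)
```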