# Chapter 1: Pandas Foundations Flashcards
What is a percentile?
A percentile is a measure that divides a set of observations into 100 equal parts, where each part represents one percent of the total number of observations. For example, the 75th percentile is the value below which 75% of the observations fall.
What is a quantile?
A quantile is a cut point that divides a set of observations into equal-sized parts, where each part holds a specified proportion of the total number of observations. Quantiles are usually expressed as a fraction between 0 and 1. For example, the 0.75 quantile is the value below which 75% of the observations fall, which is the same value as the 75th percentile.
When to use percentiles vs quantiles?
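Percentiles are quantiles expressed on a 0 to 100 scale, so the choice is mostly one of convention. In general, if you are interested in describing where a value falls in the overall distribution of a dataset, you may want to use percentiles.
If you are interested in dividing the dataset into equal-sized groups (for example, quartiles or deciles) or identifying specific cut points within the dataset, you may want to use quantiles.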
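A minimal sketch of the relationship between the two, assuming a made-up Series of the integers 1 through 101 (chosen so the cut points land on exact values). `Series.quantile` takes a fraction between 0 and 1, while `np.percentile` takes the same cut point on a 0 to 100 scale.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the integers 1 through 101.
values = pd.Series(range(1, 102))

# Series.quantile expects a fraction between 0 and 1.
q75 = values.quantile(0.75)

# np.percentile expects the same cut point on a 0-100 scale.
p75 = np.percentile(values, 75)

# Passing a list returns several cut points at once (here, the quartile boundaries).
quartiles = values.quantile([0.25, 0.5, 0.75])

print(q75, p75)
print(quartiles)
```

Both calls return the same value, which illustrates that the 0.75 quantile and the 75th percentile are the same cut point.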
How is the index used when two DataFrame columns are combined?
The index is used for alignment before any calculation occurs.
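To illustrate alignment, here is a small sketch with two made-up Series whose indexes only partly overlap. The names `a`, `b`, and `total` are illustrative only. Values are matched by label, not by position, and labels present on only one side produce a missing value.

```python
import pandas as pd

# Two hypothetical Series with partially overlapping labels.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])

# pandas aligns on the index labels first, then adds.
total = a + b

# "y" and "z" exist in both, so they are summed by label;
# "x" and "w" exist on one side only, so the result is NaN there.
print(total)
```

Two columns taken from the same DataFrame already share an index, so alignment is invisible in that case; it becomes visible when the indexes differ, as above.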
What do we call axis-0 and axis-1?
In a DataFrame, axis 0 is the index (the row labels) and axis 1 is the columns.
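A short sketch of how the axis argument behaves in practice, using a made-up two-column DataFrame. Reducing along axis 0 collapses the index and leaves one value per column; reducing along axis 1 collapses the columns and leaves one value per row.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (or axis="index"): move down the rows, one result per column.
col_sums = df.sum(axis=0)

# axis=1 (or axis="columns"): move across the columns, one result per row.
row_sums = df.sum(axis=1)

print(col_sums)
print(row_sums)
```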
What is the index comprised of?
All the index labels (the row labels).
What does the term columns refer to?
It refers to all the column names as a whole.
How are column names represented in Pandas?
Column names are represented as an Index object (not a Series), the same kind of object used for the row labels.
NaN
This is how Pandas represents missing values (Not a Number).
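A minimal sketch of working with NaN, assuming a made-up float Series. Because NaN never compares equal to anything, including itself, missing values are detected with `isna()` and not with `==`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

# isna() returns a Boolean mask marking the missing entries.
missing_mask = s.isna()

# NaN is not equal to itself, so equality checks cannot find it.
nan_equals_itself = np.nan == np.nan

# Most reductions skip NaN by default (skipna=True).
mean_skipping_nan = s.mean()

print(missing_mask.tolist(), nan_equals_itself, mean_skipping_nan)
```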
head(n)
- This function returns the first n rows for the object based on position.
- n: int, default 5
- For negative values of n, this function returns all rows except the last |n| rows, equivalent to df[:n].
tail(n)
- This function returns the last n rows from the object based on position.
- n : int, default 5
- For negative values of n, this function returns all rows except the first |n| rows, equivalent to df[|n|:].
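The positive and negative forms of `head` and `tail` can be sketched on a made-up ten-row DataFrame (the column name `n` is illustrative only):

```python
import pandas as pd

# Hypothetical DataFrame with rows numbered 0 through 9.
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)           # rows 0, 1, 2
all_but_last_three = df.head(-3)   # rows 0 through 6, same as df[:-3]

last_three = df.tail(3)            # rows 7, 8, 9
all_but_first_three = df.tail(-3)  # rows 3 through 9, same as df[3:]

print(all_but_last_three["n"].tolist())
print(all_but_first_three["n"].tolist())
```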
columns = movies.columns
It creates an object of the type pandas.core.indexes.base.Index. It is an immutable, list-like sequence holding all the column names.
index = movies.index
It creates an object with the index (row label) information.
- RangeIndex(start=0, stop=4916, step=1)
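A small sketch of both attributes. The `movies` DataFrame here is a hypothetical three-row stand-in with made-up column names, not the dataset the cards refer to, so its RangeIndex stops at 3 instead of 4916.

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame (made-up columns and rows).
movies = pd.DataFrame({"title": ["A", "B", "C"], "year": [2001, 2005, 2010]})

columns = movies.columns  # an Index holding the column names
index = movies.index      # a RangeIndex, because no explicit index was given

print(type(columns))
print(index)

# An Index can be converted to a plain Python list when one is needed.
column_list = columns.tolist()
```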
type() vs .dtype
- type() is a built-in Python function that returns the type of an object
- The type() of a Pandas object tells you that the object is a Pandas DataFrame or Series, but it doesn't provide information about the data type of the individual elements in the object.
- .dtype is an attribute of a Pandas Series (or Index) that returns the data type of its elements, such as float64, int64, object, or datetime64. A DataFrame has no .dtype attribute; use .dtypes to get the data type of each column.
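The difference can be sketched on a made-up DataFrame (the column names `price` and `qty` are illustrative only): `type()` describes the container, while `.dtype` and `.dtypes` describe what is stored inside it.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"price": [1.5, 2.5], "qty": [1, 2]})
series = df["price"]

# type() reports the Python class of the container.
container_types = (type(df), type(series))

# .dtype reports the element type of a Series.
price_dtype = series.dtype

# A DataFrame exposes .dtypes (plural), one entry per column.
column_dtypes = df.dtypes
has_dtype = hasattr(df, "dtype")  # False: no column or attribute has that name

print(container_types)
print(price_dtype)
print(column_dtypes)
```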
Continuous data
Continuous data is numerical data that can take on any value within a range or interval, and it can be measured or calculated to any level of precision. Examples of continuous data include height, weight, temperature, and time.
Categorical data
Categorical data, on the other hand, is non-numerical data that is divided into categories or groups based on a qualitative characteristic. Examples of categorical data include gender, color, race, and type of car.
Interval data (Continuous)
Interval data has equal intervals between values, but it does not have a true zero point.
Ratio data (Continuous)
- Ratio data has a true zero point, such as weight or distance.
- In ratio data, all four arithmetic operations (addition, subtraction, multiplication, and division) are meaningful, so statements such as "twice as heavy" make sense. With interval data, only differences are meaningful.
Examples of Interval data
Temperature measured in Celsius or Fahrenheit: The difference between 20°C and 30°C is the same as the difference between 30°C and 40°C, but 0°C does not represent a true absence of temperature.
Dates measured in years or months: The difference between January 2022 and January 2023 is the same as the difference between January 2023 and January 2024, but 0 years or 0 months does not represent a true absence of time.
IQ scores: The difference between an IQ score of 110 and 120 is the same as the difference between 120 and 130, but an IQ score of 0 does not represent a true absence of intelligence.
pH levels: The difference between a pH of 6 and 7 is the same as the difference between 7 and 8, but a pH of 0 does not represent a true absence of acidity.
Time of day on a clock: The difference between 3:00 and 4:00 is the same as the difference between 4:00 and 5:00, but 0:00 does not represent a true absence of time. (A duration, by contrast, is ratio data.)
Examples of ratio data
Height: A height of 0 indicates a complete absence of height, so height is an example of ratio data.
Weight: A weight of 0 indicates a complete absence of weight, so weight is an example of ratio data.
Distance: A distance of 0 indicates a complete absence of distance, so distance is an example of ratio data.
Time taken to complete a task: If a task takes 0 seconds to complete, that indicates a complete absence of time, so time taken to complete a task is an example of ratio data.
Money: If someone has 0 dollars, that indicates a complete absence of money, so money is an example of ratio data.
Nominal data (categorical)
Nominal data has categories without any order or hierarchy, such as colors.
Ordinal data (categorical)
Ordinal data has categories with a specific order or hierarchy, such as levels of education.
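In pandas, both kinds of categorical data can be stored with the categorical type; a minimal sketch, assuming made-up education levels and colors. Passing `ordered=True` is what distinguishes ordinal from nominal data and enables comparisons.

```python
import pandas as pd

# Ordinal: hypothetical education levels, listed from lowest to highest.
levels = ["high school", "bachelor", "master"]
education = pd.Series(
    pd.Categorical(
        ["bachelor", "high school", "master"],
        categories=levels,
        ordered=True,
    )
)

# Nominal: colors have no inherent order.
colors = pd.Series(["red", "blue", "red"], dtype="category")

# Order-based operations are only meaningful for the ordered categorical.
above_high_school = education > "high school"
highest = education.max()

print(above_high_school.tolist(), highest, colors.cat.ordered)
```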
Analyzing and Visualizing Continuous vs Categorical data
Continuous data is often analyzed using statistical methods such as regression analysis and t-tests, while categorical data is often analyzed using methods such as chi-squared tests and contingency tables. Visualizations for continuous data often include scatter plots and histograms, while visualizations for categorical data often include bar charts and pie charts.
float
The NumPy float type, which supports missing values
int
The NumPy integer type, which does not support missing values
'Int64'
pandas nullable integer type
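A short sketch contrasting the default behavior with the nullable integer type, using made-up values. Without an explicit dtype, a missing value forces integers to be stored as floats; with `'Int64'` (capital I) they stay integers.

```python
import pandas as pd

# Default inference: the missing value forces a float64 column (None becomes NaN).
with_float = pd.Series([1, 2, None])

# Nullable integer type: values stay integers and the gap is stored as <NA>.
with_nullable = pd.Series([1, 2, None], dtype="Int64")

print(with_float.dtype, with_nullable.dtype)
print(with_nullable)
```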
object
The NumPy type for storing strings (and mixed types)
'category'
pandas categorical type, which does support missing values
bool
The NumPy Boolean type, which does not support missing values (None becomes False, np.nan becomes True)
'boolean'
pandas nullable Boolean type
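The difference between the two Boolean types can be sketched as follows, with made-up values. Casting to the NumPy `bool` type silently turns missing entries into True or False, while the nullable `'boolean'` type keeps them missing.

```python
import numpy as np
import pandas as pd

# Hypothetical object column holding a real flag plus two kinds of missing value.
flags = pd.Series([True, None, np.nan], dtype="object")

# Casting to NumPy bool applies Python truthiness: None -> False, NaN -> True.
coerced = flags.astype(bool)

# The nullable Boolean type keeps the gap as <NA> instead.
nullable = pd.Series([True, False, None], dtype="boolean")

print(coerced.tolist())
print(nullable)
```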
datetime64[ns]
The NumPy date type, which does support missing values (NaT)
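A minimal sketch of a date column with a missing entry, using made-up date strings. The missing value is stored as `NaT`; note that the exact resolution shown in the dtype (for example `[ns]`) may differ between pandas versions, so the code below does not rely on it.

```python
import pandas as pd

# Hypothetical date strings with one missing entry.
raw = pd.Series(["2023-01-15", None, "2023-03-01"])

# to_datetime converts the strings and stores the gap as NaT.
dates = pd.to_datetime(raw)

print(dates.dtype)
print(dates)

# isna() treats NaT as missing, just like NaN.
missing_count = int(dates.isna().sum())
```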
DataFrame.dtypes
This returns a Series with the data type of each column.
Series.dtype or Series.dtypes
Returns the dtype object of the underlying data.
DataFrame.dtypes.value_counts()
Returns the number of columns of each data type.
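A short sketch of the three calls above on a made-up DataFrame with two float columns and one Boolean column (column names are illustrative only):

```python
import pandas as pd

# Hypothetical data: two float columns and one Boolean column.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [0.5, 1.5], "c": [True, False]})

# One dtype per column, returned as a Series indexed by column name.
per_column = df.dtypes

# The dtype of a single column.
a_dtype = df["a"].dtype

# How many columns have each dtype.
counts = df.dtypes.value_counts()

print(per_column)
print(counts)
```

This is a quick way to get an overview of a wide DataFrame before deciding which columns need conversion.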