# Chapter 1: Pandas Foundations Flashcards

1
Q

What is a percentile?

A

A percentile is a measure that divides a set of observations into 100 equal parts, where each part represents one percent of the total number of observations. For example, the 75th percentile is the value below which 75% of the observations fall.

2
Q

What is a quantile?

A

A quantile is a cut point that divides a set of observations into equal-sized parts, where each part contains a specified proportion of the observations. Quantiles are usually expressed as fractions between 0 and 1. For example, the 0.75 quantile is the value below which 75% of the observations fall; it is the same value as the 75th percentile.

3
Q

When to use percentiles vs quantiles?

A

Percentiles are quantiles expressed on a 0 to 100 scale, so the choice is mostly one of convention.
Percentiles are common when describing where a value sits in the overall distribution of a dataset (for example, "the 90th percentile"). Quantiles expressed as fractions (0.25, 0.5, 0.75) are common when dividing the dataset into equal-sized groups, and they are what the pandas .quantile() method expects.
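
As a minimal sketch of how the two scales relate, the example below computes the same cut point once as a quantile and once as a percentile. The `scores` data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the integers 1 through 101.
scores = pd.Series(range(1, 102))

# pandas expresses quantiles as fractions between 0 and 1.
q75 = scores.quantile(0.75)

# NumPy's percentile function uses the 0-100 scale for the same cut point.
p75 = np.percentile(scores, 75)

print(q75, p75)
```

Both calls use linear interpolation by default, so they agree on this data.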

4
Q

How is the index used when two DataFrame columns are combined?

A

The index is used for alignment, before any calculation occurs.
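
A small sketch of alignment, assuming two made-up Series whose labels only partly overlap. Values are matched by label rather than by position, and labels present on only one side produce NaN.

```python
import pandas as pd

# Two hypothetical Series with partly overlapping index labels.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Alignment happens first: "y" pairs with "y", "z" with "z".
# "x" and "w" have no partner, so their results are NaN.
total = a + b
print(total)
```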

5
Q

What do we call axis-0 and axis-1?

A

In a DataFrame, axis 0 is the index (the row labels) and axis 1 is the columns.

6
Q

What is the index comprised of?

A

All the index labels.

7
Q

What does the term columns refer to?

A

It refers to all the column names as a whole.

8
Q

How are column names represented in Pandas?

A

Column names are represented as an Index object.

9
Q

NaN

A

This is how Pandas represents missing values (Not a Number).

10
Q

head(n)

A
  • This function returns the first n rows for the object based on position.
  • n: int, default 5
  • For negative values of n, this function returns all rows except the last |n| rows, equivalent to df[:n].
11
Q

tail(n)

A
  • This function returns last n rows from the object based on position.
  • n : int, default 5
  • For negative values of n, this function returns all rows except the first |n| rows, equivalent to df[|n|:].
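
The slicing equivalences for head and tail can be checked with a short sketch. The ten-row DataFrame here is a hypothetical stand-in for any dataset.

```python
import pandas as pd

# Hypothetical DataFrame with ten rows labelled 0 through 9.
df = pd.DataFrame({"value": range(10)})

first_three = df.head(3)            # rows 0, 1, 2
all_but_last_three = df.head(-3)    # same rows as df[:-3]
last_three = df.tail(3)             # rows 7, 8, 9
all_but_first_three = df.tail(-3)   # same rows as df[3:]

print(all_but_last_three.equals(df[:-3]))
print(all_but_first_three.equals(df[3:]))
```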
12
Q

columns = movies.columns

A

It assigns the column labels to a variable. The object is of type pandas.core.indexes.base.Index: an immutable, array-like sequence of all the column names (not a plain Python list).

13
Q

index = movies.index

A

It assigns the DataFrame's index object to a variable. For the movies dataset this is a RangeIndex:
- RangeIndex(start=0, stop=4916, step=1)

14
Q

type() vs .dtype

A
  • type() is a built-in Python function that returns the type of an object.
  • Calling type() on a pandas object tells you that it is a DataFrame or a Series, but it doesn't provide information about the data type of the individual elements in the object.
  • .dtype is an attribute of a pandas Series that returns the data type of its elements, such as float64, int64, object, or datetime64. A DataFrame has the plural .dtypes, which returns one dtype per column.
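
A brief sketch of the difference, using a made-up `ratings` Series:

```python
import pandas as pd

# Hypothetical Series of floating-point values.
ratings = pd.Series([7.9, 7.1, 6.8])

container_type = type(ratings)   # the Python class of the container
element_dtype = ratings.dtype    # the data type of the stored values

print(container_type)
print(element_dtype)
```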
15
Q

Continuous data

A

Continuous data is numerical data that can take on any value within a range or interval, and it can be measured or calculated to any level of precision. Examples of continuous data include height, weight, temperature, and time.

16
Q

Categorical data

A

Categorical data, on the other hand, is non-numerical data that is divided into categories or groups based on a qualitative characteristic. Examples of categorical data include gender, color, race, and type of car.

17
Q

Interval data (Continuous)

A

Interval data has equal intervals between values, but it does not have a true zero point.

18
Q

Ratio data (Continuous)

A
  • Ratio data has a true zero point, such as weight or distance.
  • In ratio data, all arithmetic operations are meaningful, including ratios: a weight of 10 kg is twice a weight of 5 kg.
19
Q

Examples of Interval data

A

Temperature measured in Celsius or Fahrenheit: The difference between 20°C and 30°C is the same as the difference between 30°C and 40°C, but 0°C does not represent a true absence of temperature.

Calendar dates measured in years or months: The difference between January 2022 and January 2023 is the same as the difference between January 2023 and January 2024, but the calendar's zero point is an arbitrary reference and does not represent a true absence of time.

IQ scores: The difference between an IQ score of 110 and 120 is the same as the difference between 120 and 130, but an IQ score of 0 does not represent a true absence of intelligence.

pH levels: The difference between a pH of 6 and 7 is the same as the difference between 7 and 8, but a pH of 0 does not represent a true absence of acidity.

Clock time (time of day): The difference between 3:00 and 4:00 is the same as the difference between 4:00 and 5:00, but 0:00 does not represent a true absence of time. Elapsed time, by contrast, has a true zero and is ratio data.

20
Q

Examples of ratio data

A

Height: A height of 0 indicates a complete absence of height, so height is an example of ratio data.

Weight: A weight of 0 indicates a complete absence of weight, so weight is an example of ratio data.

Distance: A distance of 0 indicates a complete absence of distance, so distance is an example of ratio data.

Time taken to complete a task: If a task takes 0 seconds to complete, that indicates a complete absence of time, so time taken to complete a task is an example of ratio data.

Money: If someone has 0 dollars, that indicates a complete absence of money, so money is an example of ratio data.

21
Q

Nominal data (categorical)

A

Nominal data has categories without any order or hierarchy, such as colors.

22
Q

Ordinal data (categorical)

A

Ordinal data has categories with a specific order or hierarchy, such as levels of education.

23
Q

Analyzing and Visualizing Continuous vs Categorical data

A

Continuous data is often analyzed using statistical methods such as regression analysis and t-tests, while categorical data is often analyzed using methods such as chi-squared tests and contingency tables. Visualizations for continuous data often include scatter plots and histograms, while visualizations for categorical data often include bar charts and pie charts.

24
Q

float

A

The NumPy float type, which supports missing values

25
Q

int

A

The NumPy integer type, which does not support missing values

26
Q

‘Int64’

A

pandas nullable integer type

27
Q

object

A

The NumPy type for storing strings (and mixed types)

28
Q

‘category’

A

pandas categorical type, which does support missing values

29
Q

bool

A

The NumPy Boolean type, which does not support missing values (None becomes False, np.nan becomes True)

30
Q

‘boolean’

A

pandas nullable Boolean type

31
Q

datetime64[ns]

A

The NumPy date type, which does support missing values (NaT)

32
Q

DataFrame.dtypes

A

This returns a Series with the data type of each column.

33
Q

Series.dtype or Series.dtypes

A

Return the dtype object of the underlying data.

34
Q

DataFrame.dtypes.value_counts()

A

Returns how many columns have each data type.
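
As a rough illustration, the sketch below builds a hypothetical three-column DataFrame and counts its column dtypes. The column names are invented for the example.

```python
import pandas as pd

# Hypothetical DataFrame: one integer column and two float columns.
df = pd.DataFrame({
    "votes": [10, 20],
    "score": [7.5, 8.1],
    "budget": [1.2, 3.4],
})

column_dtypes = df.dtypes                  # Series: column name -> dtype
dtype_counts = df.dtypes.value_counts()    # Series: dtype -> number of columns

print(column_dtypes)
print(dtype_counts)
```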

35
Q

DataFrame.info()

A

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

36
Q

value_counts()

A

In pandas, the value_counts() method counts the frequency of each unique value in a Series, sorted in descending order of frequency. You can also call value_counts() on a DataFrame, where it counts the frequency of each unique row, that is, each combination of values across the selected columns.

37
Q

using value_counts() with a DataFrame

A

To use value_counts() with a Pandas DataFrame, you first need to select the column or columns you want to count the frequency of unique values for. You can do this using bracket notation or dot notation, depending on the column name.
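
A minimal sketch of both uses, assuming a tiny made-up `movies` DataFrame (the names and values are placeholders):

```python
import pandas as pd

# Hypothetical data standing in for the movies dataset.
movies = pd.DataFrame({
    "director": ["Ann", "Ann", "Bob", "Ann"],
    "color": ["Color", "Color", "Color", "Black and White"],
})

# Single column: select a Series, then count its unique values.
director_counts = movies["director"].value_counts()

# Several columns: count each unique combination of values.
pair_counts = movies[["director", "color"]].value_counts()

print(director_counts)
print(pair_counts)
```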

38
Q

Selecting a single column in a dataframe…

A

returns a Series (that has the same index as the DataFrame)

39
Q

Selecting a DataFrame column, Attribute access vs Indexing Operator

A

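Attribute access is DataFrame.column_name; the indexing operator is DataFrame['column_name']. The indexing operator works for any column name, while attribute access only works when the name is a valid Python identifier that does not clash with an existing DataFrame attribute or method.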
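
The sketch below compares the two access styles on a made-up DataFrame. The column "movie title" is a hypothetical name chosen because it contains a space.

```python
import pandas as pd

# Hypothetical DataFrame; "movie title" cannot be a Python attribute name.
movies = pd.DataFrame({
    "director": ["Ann", "Bob"],
    "movie title": ["X", "Y"],
})

via_attribute = movies.director       # attribute access
via_brackets = movies["director"]     # indexing operator, same result
spaced = movies["movie title"]        # only the indexing operator works here

print(via_attribute.equals(via_brackets))
```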

40
Q

.loc vs .iloc

A
  • .loc is used for label-based indexing, which means that you use row and column labels to select data. For example, if you have a DataFrame with a row labeled 'A' and a column labeled 'Name', you can select the data at that intersection using .loc['A', 'Name'].
  • .iloc is used for integer-based indexing, which means that you use integer positions to select data. For example, you can select the first row and second column of a DataFrame using .iloc[0, 1].
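
Both selections can be tried on a small DataFrame. The data below is invented to match the labels used in the card.

```python
import pandas as pd

# Hypothetical DataFrame with row labels "A" and "B".
df = pd.DataFrame(
    {"Name": ["Ann", "Bob"], "Age": [31, 45]},
    index=["A", "B"],
)

by_label = df.loc["A", "Name"]    # row label "A", column label "Name"
by_position = df.iloc[0, 1]       # first row, second column (Age)

print(by_label, by_position)
```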
41
Q

Series.size

A

Return the number of elements in the underlying data.

42
Q

series_attr_methods = set(dir(pd.Series))
len(series_attr_methods)

A

How many Series attributes and methods are there?

43
Q

len(series_attr_methods & dataframe_attr_methods)

A

How many attributes and methods do Series and DataFrames have in common?
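
A runnable sketch of the comparison, defining both sets first. The exact counts depend on the installed pandas version, so none are quoted here.

```python
import pandas as pd

# Collect every attribute and method name on each class.
series_attr_methods = set(dir(pd.Series))
dataframe_attr_methods = set(dir(pd.DataFrame))

# Set intersection keeps only the names found on both classes.
common = series_attr_methods & dataframe_attr_methods

print(len(series_attr_methods), len(dataframe_attr_methods), len(common))
```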

44
Q

director.sample(n=5, random_state=42)

A
  • Get 5 random items from the director Series.
  • random_state is the seed value for the random number generator used to sample the items (default is None).
  • Setting random_state to 42 means that the same 5 elements will be selected from the Series each time the code is run.
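
The reproducibility claim can be checked with a small sketch. The `director` Series here is a made-up stand-in for the real column.

```python
import pandas as pd

# Hypothetical Series of eight placeholder names.
director = pd.Series(["A", "B", "C", "D", "E", "F", "G", "H"])

# Sampling twice with the same seed selects the same rows.
first_draw = director.sample(n=5, random_state=42)
second_draw = director.sample(n=5, random_state=42)

print(first_draw.equals(second_draw))
```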
45
Q

The data type of the Series usually determines….

A

which of the methods will be the most useful.

46
Q

Counting the number of elements in the Series may be done with…

A
  • size
  • shape
  • len(series)
47
Q

Series.unique()

A

Return unique values of Series object.

48
Q

Series.count()

A

Return number of non-NA/null observations in the Series.
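
A short sketch contrasting the ways of counting elements, using a hypothetical Series with one missing value. Only .count() skips the missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

n_size = s.size        # total number of elements
n_shape = s.shape      # one-element tuple holding the length
n_len = len(s)         # same total, via the built-in len
n_count = s.count()    # number of non-missing elements only

print(n_size, n_shape, n_len, n_count)
```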

49
Q

Basic summary statistics are provided with …

A

.min, .max, .mean, .median, and .std

50
Q

series.describe()

A
  • Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
  • Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types.
51
Q

series.quantile()

A
  • Return value at the given quantile.
  • If you pass in a scalar, you get scalar output; if you pass in a list, the output is a pandas Series indexed by the requested quantiles.
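
The two return types can be seen in a small sketch. The `imdb_score` values are invented for the example.

```python
import pandas as pd

# Hypothetical scores.
imdb_score = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

median = imdb_score.quantile(0.5)                     # scalar in, scalar out
quartiles = imdb_score.quantile([0.25, 0.5, 0.75])    # list in, Series out

print(median)
print(quartiles)
```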
52
Q

Series.isna()

A

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.nan, get mapped to True.
Everything else gets mapped to False.
Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True).

53
Q

Series.fillna()

A

Fill NA/NaN values using the specified method.

54
Q

Series.dropna()

A

Return a new Series with missing values removed.
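
A minimal sketch tying the three missing-value methods together, on a hypothetical Series. It uses the default settings, under which infinity is not treated as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: None and np.nan are missing, np.inf is not.
s = pd.Series([1.5, None, np.nan, np.inf])

mask = s.isna()         # True where the value is missing
filled = s.fillna(0)    # replace missing values with 0
dropped = s.dropna()    # new Series without the missing values

print(mask.tolist())
print(filled.tolist())
print(dropped.tolist())
```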

55
Q

director.value_counts(normalize=True)

A

Passing normalize=True to value_counts() returns the relative frequency (proportion) of each unique value instead of its count. The resulting Series contains fractions between 0 and 1 that sum to 1; multiply by 100 to get percentages.
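
For illustration, assuming a made-up `director` Series with four entries:

```python
import pandas as pd

# Hypothetical Series: "A" appears three times, "B" once.
director = pd.Series(["A", "A", "A", "B"])

counts = director.value_counts()                  # absolute counts
shares = director.value_counts(normalize=True)    # fractions of the total

print(counts)
print(shares)
```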

56
Q

Relative Frequency

A

Relative frequencies, also known as proportions or percentages, are a way of expressing the frequency of an event or value in relation to the total number of events or values in a sample.

57
Q

How are relative frequencies calculated?

A

In statistics, a relative frequency is calculated as the number of times an event or value occurs divided by the total number of events or values in the sample. The resulting proportion represents the fraction or percentage of the sample that the event or value represents.

For example, if we have a sample of 100 people and 20 of them are male, the relative frequency of males in the sample would be 20/100 = 0.2, or 20%. This means that males make up 20% of the sample.

58
Q

How do we use relative frequencies?

A

Relative frequencies are useful for comparing the occurrence of different events or values in a sample or population, and for identifying patterns and trends in data. They can also be used to make predictions about future occurrences based on past data.

59
Q

Series.hasnans

A

Return True if there are any NaNs in the Series.

60
Q

Series.notna()

A

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.nan, get mapped to False.

61
Q

series_imdb_score + 1

A

Adds 1 to each element of the series.

62
Q

imdb_score * 2.5

A

Multiplies each element of the Series by 2.5.

63
Q

imdb_score // 7

A

The // operator performs floor division: it returns the largest integer that is less than or equal to the result of the division. For positive values this drops the decimal portion; for negative values it rounds toward negative infinity (for example, -7.5 // 2 is -4.0, not -3.0).
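
A quick sketch of floor division on a hypothetical Series with one positive and one negative value, showing that the result is floored rather than truncated toward zero:

```python
import pandas as pd

# Hypothetical values: one positive, one negative.
s = pd.Series([7.5, -7.5])

# 7.5 / 2 = 3.75 floors to 3.0; -7.5 / 2 = -3.75 floors to -4.0.
floored = s // 2

print(floored.tolist())
```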

64
Q

imdb_score % 2

A

The percent sign (%) is the modulus operator, which returns the remainder after a division.

65
Q

imdb_score > 7

A

Each comparison operator turns each value in the Series into True or False based on the outcome of the condition. The result is a Boolean Series with the same index as the original.

66
Q

imdb_score.add(1)

A

pandas addition method. The same as imdb_score + 1

67
Q

imdb_score.gt(7)

A

pandas > method. The same as imdb_score > 7

68
Q

Benefits of using Pandas methods vs operators?

A
  • Using the method rather than the operator can be useful when we chain methods together.
  • Methods, unlike operators, can have parameters that allow you to alter their default functionality.
69
Q

money - 15 vs money.sub(15, fill_value=0)

A

The .sub method allows you to specify a fill_value parameter to use in place of missing values, whereas the - operator leaves them as NaN.
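
The difference shows up as soon as the Series contains a missing value. The `money` data below is made up for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, one of them missing.
money = pd.Series([100.0, 20.0, np.nan])

with_operator = money - 15                  # the missing value stays NaN
with_method = money.sub(15, fill_value=0)   # the missing value is treated as 0

print(with_operator.tolist())
print(with_method.tolist())
```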

70
Q

Arithmetic Operators

A

+,-,*,/,//,%,**

71
Q

Arithmetic Series Method Names

A

.add, .sub, .mul, .div, .floordiv, .mod, .pow

72
Q

Comparison Operators

A

<,>,<=,>=,==,!=

73
Q

Comparison Series Method Name

A

.lt, .gt, .le, .ge, .eq, .ne

74
Q

the special method .__mul__ is called …

A

whenever the multiplication operator is used.

75
Q

What are special methods with double underscores called?

A

dunder methods

76
Q

an example of a dunder method?

A

.__mul__

77
Q

Is there a difference between calling the operator * and the dunder method __mul__?

A

No. The operator is just syntactic sugar for the special method.

78
Q

Is there a difference between calling mul() and the dunder method __mul__?

A

Yes, the mul() method has additional parameters.

79
Q

What is method chaining?

A

It is sequential invocation of methods using attribute access.

80
Q

How is method chaining possible?

A

Because in Python, every variable points to an object, and many attributes and methods return new objects.

81
Q

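Why does a column whose values look like integers have float64 as its data type?

A

Because the NumPy int type does not support missing values. When an integer column contains missing values, pandas stores it as float64 so that NaN can represent them.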

82
Q

How can we convert a column with missing values (float64) to Int64?

A
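  • We fill the missing values with zeros (for example, with .fillna(0)).
  • We use .astype('Int64') for the conversion. Because Int64 is nullable, the fill step is optional: without it, missing values become <NA>.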
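
A sketch of both routes, assuming a made-up `fb_likes` Series stored as float64 because of a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical like counts; the NaN forces a float64 dtype.
fb_likes = pd.Series([1000.0, np.nan, 250.0])

# Route 1: fill missing values first, then convert.
filled = fb_likes.fillna(0).astype("Int64")

# Route 2: convert directly; the missing value becomes <NA>.
nullable = fb_likes.astype("Int64")

print(filled)
print(nullable)
```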
83
Q

Which pandas integer data type supports missing values?

A

Int64

84
Q

How can you chain methods across multiple lines?

A

You wrap the chain in parentheses, like:
(
    fb_likes.fillna(0)
    .astype('Int64')
    .head()
)

85
Q

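How do you use the .pipe method to show an intermediate value when debugging chains?

A

The .pipe method on a Series must be passed a function that accepts a Series as input; that function can return anything. For debugging, pass a function that prints the Series and returns it unchanged, so the rest of the chain keeps working.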
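
One possible debugging helper is sketched below. The function name `debug_ser` and the `fb_likes` data are assumptions made for this example, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical like counts with one missing value.
fb_likes = pd.Series([1000.0, np.nan, 250.0])


def debug_ser(ser):
    """Print the intermediate Series and return it so the chain continues."""
    print("Intermediate value:")
    print(ser)
    return ser


result = (
    fb_likes
    .fillna(0)
    .pipe(debug_ser)     # inspect the Series after fillna, before astype
    .astype("Int64")
)

print(result)
```

Because the helper returns its input unchanged, it can be added to or removed from the chain without affecting the final result.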

86
Q

Renaming dataframe columns using a dictionary

A

Build a dictionary that maps old names to new names and pass it to the columns parameter of .rename:
col_map = {
    "director_name": "director",
    "num_critic_for_reviews": "critic_reviews",
}
movies.rename(columns=col_map)
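
A self-contained version of the same idea, assuming a one-row stand-in for the movies DataFrame. The "title" column is hypothetical and shows that unmapped columns keep their names.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset.
movies = pd.DataFrame({
    "director_name": ["Ann"],
    "num_critic_for_reviews": [10],
    "title": ["X"],
})

col_map = {
    "director_name": "director",
    "num_critic_for_reviews": "critic_reviews",
}

# rename returns a new DataFrame; the original is left unchanged.
renamed = movies.rename(columns=col_map)

print(list(renamed.columns))
```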