Priority 5 Flashcards
Which is faster, Python lists or NumPy arrays?
NumPy arrays
Why are NumPy arrays faster than Python lists?
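NumPy arrays store elements of a single data type in a contiguous block of memory, and NumPy operations run as compiled C loops over the whole array. Python lists hold references to separate Python objects, so element-wise work goes through the interpreter one object at a time.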
What are the differences between Python lists and tuples?
3 bullet points
- Lists are mutable whereas tuples are not.
- Lists are defined using square brackets [] whereas tuples are defined using parentheses ().
- Tuples are generally faster than lists because their immutability allows for code optimization.
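A minimal sketch of the mutability difference. The variable names and values are illustrative, not from the original card:

```python
# Illustrative values only.
numbers_list = [1, 2, 3]      # square brackets: list
numbers_tuple = (1, 2, 3)     # parentheses: tuple

numbers_list[0] = 99          # lists are mutable, so item assignment works

try:
    numbers_tuple[0] = 99     # tuples are immutable, so this raises TypeError
    tuple_changed = True
except TypeError:
    tuple_changed = False
```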
What are the similarities between Python lists and tuples?
3 bullet points
- Both are collections of objects.
- Both are written as comma-separated values.
- Both are ordered.
What is a Python set?
Unordered collection of unique objects
What is the typical use case of Python sets?
Often used to store a collection of distinct objects and perform membership tests (i.e., to check if an object is in the set).
How are Python sets defined?
Curly braces, {}, around a comma-separated list of values. An empty set must be created with set(), because {} creates an empty dictionary.
What are the key properties of Python sets?
5 bullet points
- Unordered
- Unique
- Mutable
- Not indexed/do not support slicing
- Not hashable (cannot be used as keys in dictionaries or as elements in other sets)
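A short sketch of these properties, using made-up values. The frozenset at the end is an addition not mentioned in the card; it is the built-in immutable, hashable counterpart of a set:

```python
colors = {"red", "green", "red"}   # the duplicate "red" is dropped
has_green = "green" in colors      # membership test
colors.add("blue")                 # sets are mutable

try:
    nested = {colors}              # a set is unhashable, so this raises TypeError
    nesting_worked = True
except TypeError:
    nesting_worked = False

frozen = frozenset(colors)         # hashable, so it can be a set element or dict key
set_of_sets = {frozen}
```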
What is the difference between Python split and join?
1 bullet point for each
- The split method creates a list from a string based on some delimiter (e.g., space).
- The join method concatenates an iterable of strings into a single string, placing a separator between them.
Syntax: Python split
Include definition of any class objects and/or parameters
string.split(separator, maxsplit)
- string: The string you want to split.
- separator: (optional): The delimiter used to split the string. If not specified, it defaults to whitespace.
- maxsplit: (optional): The maximum number of splits to perform. If not specified, it splits the string at all occurrences of the separator.
Syntax: Python join
Include definition of any class objects and/or parameters
separator.join(iterable)
- separator: The string that will be used to separate the elements of the iterable in the resulting string.
- iterable: An iterable object (e.g., a list, tuple, or string) whose elements will be joined together.
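A quick sketch of both methods on an illustrative string (the names are hypothetical):

```python
line = "alpha,beta,gamma,delta"

parts = line.split(",")            # split at every comma
one_split = line.split(",", 1)     # maxsplit=1: split only at the first comma
words = "  a  b   c ".split()      # no separator: split on runs of whitespace

rejoined = "-".join(parts)         # the separator goes between the elements
```

Every element of the iterable passed to join must be a string; otherwise join raises TypeError.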
What are the logical operators in Python? What are they used for?
- and, or, not
- Used to perform boolean operations on bool values.
Logical operators in Python: and
Returns True if both operands are True; otherwise, False.
Logical operators in Python: or
Returns True if either operand is True; returns False if both operands are False.
Logical operators in Python: not
Returns True if the operand is False; returns False if the operand is True.
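The three operators side by side, with illustrative variable names. The last line goes slightly beyond the cards: with non-bool operands, and/or return one of the operands rather than a bool:

```python
is_adult = True
has_ticket = False

can_enter = is_adult and has_ticket    # both must be True
can_ask = is_adult or has_ticket       # at least one must be True
needs_ticket = not has_ticket          # inverts the value

fallback = "" or "default"             # "" is falsy, so the second operand is returned
```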
What are the top 6 functions used for Python strings?
len()
strip()
split()
replace()
upper()
lower()
Top 6 functions used for Python strings: len()
Returns the length of a string.
Top 6 functions used for Python strings: strip()
Removes leading and trailing whitespace from a string.
Top 6 functions used for Python strings: split()
Splits a string into a list of substrings based on a delimiter.
Top 6 functions used for Python strings: replace()
Replaces all occurrences of a specified string with another string.
Top 6 functions used for Python strings: upper()
Converts a string to uppercase.
Top 6 functions used for Python strings: lower()
Converts a string to lowercase.
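All six functions applied to one illustrative string:

```python
raw = "  Hello, World  "

length = len(raw)                              # counts every character, spaces included
stripped = raw.strip()                         # removes leading/trailing whitespace
pieces = stripped.split(", ")                  # list of substrings
replaced = stripped.replace("World", "Python") # returns a new string
shouted = stripped.upper()
whispered = stripped.lower()
```

Strings are immutable, so each method returns a new string and leaves raw unchanged.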
What is the pass keyword in Python? What is it used for?
pass is a null statement that does nothing. It is often used as a placeholder where a statement is required syntactically, but no action needs to be taken.
What are some common use cases of the pass keyword in Python?
3 bullet points
- Empty functions or classes: When you define a function/class but haven’t implemented any logic yet. Use pass to avoid syntax errors.
- Conditional statements: If you need an if statement but don’t want to take any action in the if block, you can use pass.
- Loops: You can use pass in loops when you don’t want to perform any action in a specific iteration.
What is the use of the continue keyword in Python?
continue is used in a loop to skip over the current iteration and move on to the next one.
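A small sketch contrasting pass and continue. The function and variable names are hypothetical:

```python
def not_implemented_yet():
    pass  # placeholder body; calling the function simply returns None


odd_numbers = []
for n in range(6):
    if n % 2 == 0:
        continue          # skip the rest of this iteration for even n
    odd_numbers.append(n)
```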
Definition: immutable data type in Python
Object whose state cannot be modified after it is created.
Definition: mutable data type in Python
Object whose state can be modified after it is created.
Examples of immutable data types in Python
- Numbers: int, float, complex
- bool
- str
- Tuples
Examples of mutable data types in Python
- Lists
- Dictionaries
- Sets
Because numbers are immutable data types in Python, what happens when you change the value of a number variable?
The variable name is rebound to a new number object. The old value is garbage-collected once nothing else refers to it, freeing the memory assigned to storing the object.
Python variables versus objects
- Variables are names that refer to or hold references to concrete objects.
- Objects are concrete pieces of information that live in specific memory positions on the computer.
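A sketch of names versus objects, with illustrative variables. Rebinding a name to a new number leaves other names untouched, whereas mutating a list is visible through every name that refers to it:

```python
count = 10
alias = count          # a second name for the same int object
count = count + 1      # builds a new int object and rebinds count to it

items = [1, 2]
items_alias = items    # a second name for the same list object
items.append(3)        # mutates the list in place
```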
Can you use sort() on tuples? Why or why not?
No. Tuples are immutable. You would have to create a new sorted tuple from the original tuple.
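One way to build the new sorted tuple, using the built-in sorted() (the values are illustrative):

```python
scores = (3, 1, 2)

has_sort_method = hasattr(scores, "sort")   # tuples have no sort() method
sorted_scores = tuple(sorted(scores))       # sorted() returns a list; wrap it in tuple()
```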
What are try except blocks used for in Python?
Exception handling
try except blocks in Python: what is the try block?
Contains code that might cause an exception to be raised.
try except blocks in Python: what is the except block?
Contains code that is executed if an exception is raised during the execution of a try block.
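A minimal sketch of both blocks. safe_divide is a hypothetical helper, and returning None on failure is just one possible recovery choice:

```python
def safe_divide(numerator, denominator):
    try:
        # Code that might raise an exception.
        return numerator / denominator
    except ZeroDivisionError:
        # Runs only if the try block raised ZeroDivisionError.
        return None
```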
What are the similarities between Python functions and methods?
3 bullet points
- Both are blocks of code that perform a specific task.
- Both can take input parameters and return a value.
- Both are defined using the def keyword.
What are the key differences between Python functions and methods?
4 bullet points
- Functions are defined outside of classes; methods are functions that are associated with a specific object or class.
- Functions can be called on a standalone basis; methods are called using the dot notation on an object of a class.
- Functions perform general tasks; methods perform actions specific to the object they belong to.
- Parameters are optional for functions; for methods, the first parameter is usually self, which refers to the instance of the class.
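A sketch of the same task written as a function and as a method. Greeter and its attribute are hypothetical names:

```python
def greet(name):
    # Standalone function: called directly.
    return f"Hello, {name}!"


class Greeter:
    def __init__(self, greeting):
        self.greeting = greeting

    def greet(self, name):
        # Method: self refers to the instance the method is called on.
        return f"{self.greeting}, {name}!"


function_result = greet("Ada")
method_result = Greeter("Hi").greet("Ada")   # dot notation on an object
```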
How do functions help in code optimization?
4 high-level points
- Code reuse
- Improved readability
- Easier testing
- Improved performance
Functions + code optimization: Code reuse
Allow you to reuse code by encapsulating it in a single place and calling it multiple times from different parts of your program. Reduces redundancy, making code more concise and easier to maintain.
Functions + code optimization: Improved readability
Functions make your code more readable and easier to understand by dividing your code into logical blocks. This makes it easier to identify bugs and make changes.
Functions + code optimization: Easier testing
Functions allow you to test individual blocks of code separately, which can make it easier to find and fix bugs.
Functions + code optimization: Improved performance
Functions allow you to use optimized code libraries and/or allow the Python interpreter to optimize the code more effectively.
Why is NumPy often used for data science?
3 bullet points
- Fast and efficient operations on arrays and matrices of numerical data versus Python’s built-in data structures. This is because it uses optimized C and Fortran code behind the scenes.
- Large number of functions for performing mathematical and statistical operations on arrays and matrices.
- Integrates well with other scientific computing libraries in Python, such as SciPy and pandas.
Definition: list comprehension in Python
Shorter syntax when creating a new list based on the values of an existing list.
Syntax: Python list comprehension
new_list = [expression for item in iterable if condition]
Definition: dict comprehension in Python
Concise way of creating dictionaries in Python
Syntax: Python dict comprehension
{key: value for item in iterable}
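Both comprehension forms on an illustrative list:

```python
numbers = [1, 2, 3, 4, 5, 6]

# [expression for item in iterable if condition]
even_squares = [n * n for n in numbers if n % 2 == 0]

# {key: value for item in iterable}
square_by_number = {n: n * n for n in numbers}
```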
Definition: global variable in Python
A variable that is defined outside of any function or class
Definition: local variable in Python
A variable that is defined inside a function or class
Where can a Python global variable be accessed?
Can be accessed from anywhere in the program
Where can a Python local variable be accessed?
Can only be accessed within the function or class in which it is defined
What happens inside a Python function if you have a local variable and global variable with the same name?
The local variable will take precedence over the global variable within the function or class in which it is defined
What will this code output?
x = 10

def func():
    x = 5
    print(x)

func()
print(x)
5
10
Definition: Python ordered dictionary
Subclass of Python dictionary class that maintains the order of elements in which they were added
Python ordered dictionary class name
OrderedDict
How do Python ordered dictionaries maintain the order of elements in the dictionary?
A doubly linked list
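A brief sketch of OrderedDict from the collections module, with made-up keys. move_to_end and popitem(last=False) are order-aware methods that OrderedDict provides:

```python
from collections import OrderedDict

visits = OrderedDict()
visits["home"] = 1
visits["about"] = 2
visits["contact"] = 3

visits.move_to_end("home")              # "home" becomes the last key
order_after_move = list(visits)

oldest = visits.popitem(last=False)     # remove and return the first (oldest) item
```

Since Python 3.7, plain dict also preserves insertion order, so OrderedDict is mostly useful for these reordering operations.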
What do return and yield in Python have in common?
Both are keywords used to send values back from a function
What is the functionality of the return keyword in Python?
Terminates the function and returns a value to the caller
What is the functionality of the yield keyword in Python?
Pauses the function’s execution and returns a value to the caller but maintains the function’s state so that it can be resumed later
What is the use case of the return keyword in Python?
Used in regular functions when you want to compute a single result and return it
What is the use case of the yield keyword in Python?
Used to create generator functions that produce a sequence of values over time
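A sketch contrasting the two keywords. The function names are illustrative:

```python
def total(values):
    # return: computes one result and ends the function.
    return sum(values)


def running_total(values):
    # yield: hands back one value at a time and keeps its state in between.
    subtotal = 0
    for value in values:
        subtotal += value
        yield subtotal


single_result = total([1, 2, 3])

generator = running_total([1, 2, 3])
first = next(generator)      # runs until the first yield
rest = list(generator)       # resumes where it paused
```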
Definition: Python lambda function
Small anonymous function that can take any number of arguments but can only have one expression
Syntax: Python lambda function
lambda arguments : expression
What will this code output?
x = lambda a: a + 10
print(x(5))
15
How are Python lambda functions typically used in practice?
Often used in combination with higher-order functions, such as map(), filter(), and reduce()
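One lambda per higher-order function, on an illustrative list. Note that reduce lives in the functools module in Python 3:

```python
from functools import reduce

numbers = [1, 2, 3, 4]

doubled = list(map(lambda n: n * 2, numbers))          # apply to every element
evens = list(filter(lambda n: n % 2 == 0, numbers))    # keep elements where True
total = reduce(lambda acc, n: acc + n, numbers)        # fold into a single value
```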
What does the assert keyword in Python do?
Used to test a condition. If the condition is True, the program continues to execute. If the condition is False, then the program raises an AssertionError exception.
What is the assert keyword in Python used for?
Used for debugging purposes and is not intended to be used as a way to handle runtime errors
For exception handling within production Python code, should you use try-except or assert? Why?
try-except
* Allows recovery and custom actions versus termination with AssertionError
* Fully customizable exception messages versus limited to raising AssertionError
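A sketch of the two roles, with hypothetical helper names. The assert documents an internal assumption for debugging; the try-except recovers from bad input at runtime:

```python
def mean(values):
    # Debugging aid: states an assumption the caller is expected to meet.
    assert len(values) > 0, "values must not be empty"
    return sum(values) / len(values)


def parse_int(text, default=0):
    # Runtime error handling: recover with a fallback instead of terminating.
    try:
        return int(text)
    except ValueError:
        return default
```

Assertions are skipped entirely when Python runs with the -O flag, which is another reason not to rely on them for error handling.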
What are decorators in Python?
Used to modify or extend the functionality of a function, method, or class without changing its source code
Syntax: Python decorators
@decorator_function
def function_to_be_decorated():
    # Function code here
    pass
What does this code output?
def my_decorator(func):
    def wrapper():
        print("Something is happening before the function is called.")
        func()
        print("Something is happening after the function is called.")
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()
Something is happening before the function is called.
Hello!
Something is happening after the function is called.
What is univariate analysis?
Used to analyze and describe the characteristics of a single variable
Common steps when conducting univariate analysis on a numerical variable
4 bullet points
- Calculate descriptive statistics, such as mean, median, mode, and standard deviation, to summarize the distribution of the data.
- Visualize the distribution of the data using plots such as histograms, boxplots, or density plots.
- Check for outliers and anomalies in the data.
- Check for normality in the data using statistical tests or visualizations such as a Q-Q plot.
Common steps when conducting univariate analysis on a categorical variable
4 bullet points
- Calculate the frequency of each category in the data.
- Calculate the percentage of each category in the data.
- Visualize the distribution of the data using plots such as bar and pie charts.
- Check for imbalances or abnormalities in the distribution of the data.
Common ways to find outliers in a data set
3 bullet points
- Visual Inspection: Identification via visual inspection of data using plots such as histograms, scatterplots, or boxplots.
- Summary Statistics: Identification via calculating summary statistics, such as mean, median, or interquartile range. For example, if the mean is significantly different from the median, it could indicate the presence of outliers.
- Z-Score: z-score measures how many standard deviations a given data point is from the mean. Data points whose absolute z-score exceeds a threshold (e.g., 3 or 4) may be considered outliers.
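A minimal sketch of the z-score approach using the standard library. The data and the threshold of 3 are illustrative, and the population standard deviation (pstdev) is one of two reasonable choices:

```python
import statistics

# Illustrative data: 19 values near 11 and one extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13,
          12, 10, 11, 12, 13, 11, 10, 12, 11, 95]

mean = statistics.mean(values)
stdev = statistics.pstdev(values)      # population standard deviation
threshold = 3

z_scores = [(v - mean) / stdev for v in values]
outliers = [v for v, z in zip(values, z_scores) if abs(z) > threshold]
```

With very small samples a single point cannot reach a large z-score, because the outlier itself inflates the standard deviation, so a threshold of 3 may never trigger.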
What are common methods to handle the missing values in a data set?
5 main points
- Drop rows
- Drop columns
- Imputation with mean or median
- Imputation with mode
- Imputation with a predictive model
Drop rows
Common methods to handle the missing values in a data set
Explanation + Pro/Con
Drop rows with null values
* Pro: Simple and fast
* Con: Can significantly reduce sample size and impact the statistical power of the analysis
Drop columns
Common methods to handle the missing values in a data set
Explanation + Pro/Con
Drop columns with null values
* Pro: Can be a good option if many values are missing from column or column is irrelevant
* Con: Can result in omitted variable bias
Imputation with mean or median
Common methods to handle the missing values in a data set
Explanation + Pro/Con
Replace null values with the mean or median of the non-null values in the column
* Pro: Good option if the data are missing at random and mean/median is a reasonable representation of the data
* Con: Introduces bias if the data are not missing at random
Imputation with mode
Common methods to handle the missing values in a data set
Explanation + Pro/Con
Replace null values with the mode of the non-null values in the column
* Pro: Good option for categorical data where mode is a reasonable representation of the data
* Con: Introduces bias if the data are not missing at random
Imputation with a predictive model
Common methods to handle the missing values in a data set
Explanation + Pro/Con
Use a predictive model to estimate the missing values based on other available data
* Pro: Can be more accurate/less biased if the data are not missing at random and there is a strong relationship between the missing values and other data
* Con: More complex/time-consuming
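A plain-Python sketch of dropping and simple imputation, with None standing in for null values. The data are made up; in practice a data-frame library would usually handle this:

```python
import statistics

ages = [25, None, 31, 40, None, 28]          # numerical column with nulls
cities = ["Oslo", None, "Oslo", "Rome"]      # categorical column with a null

# Drop rows: keep only the non-null values.
ages_dropped = [a for a in ages if a is not None]

# Impute with the median of the non-null values.
median_age = statistics.median(ages_dropped)
ages_imputed = [a if a is not None else median_age for a in ages]

# Impute a categorical column with the mode.
mode_city = statistics.mode([c for c in cities if c is not None])
cities_imputed = [c if c is not None else mode_city for c in cities]
```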
Definition: skewness
Measure of the asymmetry of a distribution, i.e., how far it departs from a symmetric shape. A distribution is skewed if it is not symmetrical, with more data points concentrated on one side of the mean than the other.
What are the different types of skewness?
- Positive skewness
- Negative skewness
Positive skewness
Different types of skewness
3 bullet points
- Long tail on the right side
- Majority of data points concentrated on the left side of the mean
- A few extreme values on the right side of the distribution that are pulling the mean to the right
Negative skewness
Different types of skewness
3 bullet points
- Long tail on the left side
- Majority of data points concentrated on the right side of the mean
- A few extreme values on the left side of the distribution that are pulling the mean to the left
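A small sketch that computes the sign of the skewness for made-up data. The formula used is the moment-based coefficient (third central moment divided by the 1.5 power of the second), one of several common definitions:

```python
import statistics


def skewness(values):
    # Moment-based skewness: m3 / m2 ** 1.5
    n = len(values)
    mean = statistics.mean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]       # long tail on the right
left_skewed = [-v for v in right_skewed]       # mirror image: long tail on the left

right_mean = statistics.mean(right_skewed)
right_median = statistics.median(right_skewed)
```

In the right-skewed data the extreme value pulls the mean above the median.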
What are the three main measures of central tendency?
- Mean
- Median
- Mode
Mean
Three main measures of central tendency
3 bullet points
- Arithmetic average of a dataset
- Calculated by adding all the values in the dataset and dividing by the number of values
- Sensitive to outliers
Median
Three main measures of central tendency
3 bullet points
- Middle value of the dataset when the values are arranged in order from smallest to largest
- Arrange the values in order and find the middle value. If there is an odd number of values, the median is the middle value. If there is an even number of values, the median is the mean of the two middle values.
- Not sensitive to outliers
Mode
Three main measures of central tendency
3 bullet points
- Value that occurs most frequently in a dataset
- May have multiple modes or no modes at all
- Not sensitive to outliers
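The three measures on an illustrative data set with one extreme value, using the standard statistics module (multimode requires Python 3.8 or later):

```python
import statistics

salaries = [30, 32, 32, 35, 38, 200]     # 200 is an outlier

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # mean of the two middle values
mode_salary = statistics.mode(salaries)      # most frequent value

modes = statistics.multimode([1, 1, 2, 2])   # a data set can have several modes
```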
Definition: descriptive statistics
Used to summarize and describe a dataset by using measures of central tendency (mean, median, mode) and measures of spread (standard deviation, variance, range)
Definition: inferential statistics
Used to make inferences about a population based on sample data using statistical models, hypothesis testing, and estimation
What are the four key elements of an EDA report?
- Univariate analysis
- Bivariate analysis
- Missing data analysis
- Data visualization
Univariate analysis
Four key elements of an EDA report
How does it contribute to understanding a dataset?
Helps understand the distribution of individual variables
Bivariate analysis
Four key elements of an EDA report
How does it contribute to understanding a dataset?
Helps understand the relationship between variables
Missing data analysis
Four key elements of an EDA report
How does it contribute to understanding a dataset?
Helps understand the quality of the data
Data visualization
Four key elements of an EDA report
How does it contribute to understanding a dataset?
Provides a visual interpretation of the data
Definition: central limit theorem
2 bullet points
- As sample size increases, the distribution of the sample mean will approach a normal distribution
- True regardless of the underlying distribution from which the sample is drawn, provided that distribution has finite variance
What is the benefit of the central limit theorem?
Even if the individual data points in a sample are not normally distributed, we can use normal distribution-based methods to make inferences about the population by taking the average of a large enough number of data points
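A small simulation sketch of the idea. The exponential distribution, sample size of 50, and 2,000 repetitions are all arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(size):
    # Draw from a skewed (exponential) distribution with mean 1 and std 1.
    draws = [random.expovariate(1.0) for _ in range(size)]
    return statistics.mean(draws)


sample_means = [sample_mean(50) for _ in range(2000)]

mean_of_means = statistics.mean(sample_means)   # close to the population mean, 1
spread = statistics.stdev(sample_means)         # close to 1 / sqrt(50)
```

A histogram of sample_means would look roughly bell-shaped even though the individual draws are strongly right-skewed.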
Two main types of target variables for predictive modeling
- Numeric variable
- Categorical variable
Numeric variable
Main types of target variables for predictive modeling
2 bullet points
- Quantifiable characteristic whose values are numbers
- May be continuous or discrete
Categorical variable
Main types of target variables for predictive modeling
- Values can take on one of a limited, usually fixed, number of possible values
Definition: binary categorical variable
Categorical variable that can take on exactly two values
Definition: polytomous categorical variable
Categorical variable with more than two possible values
When will the mean, median, and mode be the same for a given dataset?
Symmetric unimodal distribution: symmetrically distributed with a single peak
Definition: model variance
Error from sensitivity to small fluctuations in training data
Definition: model bias
Error from overly simplistic assumptions (e.g., data is linear when it’s not, omitted variable bias)
What will be the result of a model with low bias and high variance?
Overfitting: model will be too sensitive to noise and random fluctuations in the data, failing to generalize well to new data
What will be the result of a model with high bias and low variance?
Underfitting: model will miss important relationships in the data
What are the types of errors in hypothesis testing?
- Type I error
- Type II error
Type I error
Types of errors in hypothesis testing
4 bullet points
- False positive
- Null hypothesis is true but is rejected
- Probability is denoted by the Greek letter α
- Usually set at a level of 0.05, meaning there is a 5% chance of making a Type I error when the null hypothesis is true
Type II error
Types of errors in hypothesis testing
4 bullet points
- False negative
- Null hypothesis is false but is not rejected
- Probability is denoted by the Greek letter β
- Related to the power of the test, 1 - β, which is the probability of correctly rejecting the null hypothesis when it is false.
Definition: confidence interval
Range of values expected to contain the true population parameter with a specific level of confidence
What is the most common confidence interval?
95%
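A rough sketch of a 95% confidence interval for a mean, using the normal approximation. The data are made up, and for a sample this small a t-distribution critical value would normally be used instead:

```python
import math
import statistics

sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]   # illustrative measurements
n = len(sample)

sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Two-sided 95% interval: 2.5% in each tail of the standard normal.
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * standard_error

lower, upper = sample_mean - margin, sample_mean + margin
```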
What is the primary difference between correlation and covariance?
- Correlation is the normalized version of covariance, meaning correlation adjusts for the scales of the variables
Definition: correlation
Strength and direction of a linear relationship between two variables
Equation: correlation
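Assuming the card refers to Pearson's correlation coefficient, a standard form is the covariance divided by the product of the standard deviations. The symbols below are conventional notation chosen for this sketch:

```latex
\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
```

Here \sigma_X and \sigma_Y are the standard deviations of X and Y.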
What is the range and meaning of different values of correlation?
Between -1 and 1, inclusive
* +1: Perfect positive linear relationship
* -1: Perfect negative linear relationship
* 0: No linear relationship
What are the units of correlation?
Unitless
Definition: covariance
Measures the degree to which two random variables change together. Indicates the direction of the linear relationship between variables
Equation: covariance
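A standard form, with notation chosen for this sketch, gives the population definition first and the usual sample estimate second:

```latex
\operatorname{Cov}(X, Y) = \mathbb{E}\big[(X - \mu_X)(Y - \mu_Y)\big]
\qquad
s_{XY} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

Here \mu_X and \mu_Y are the population means, and \bar{x} and \bar{y} are the sample means.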
What is the range and meaning of different values of covariance?
Any value, positive, negative, or zero
* Positive: When X increases, Y tends to increase
* Negative: When X increases, Y tends to decrease
* 0: No linear relationship
What are the units of covariance?
Product of the units of the two variables
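A sketch that computes both quantities by hand for made-up data, using the sample (n - 1) convention. The helper names are hypothetical:

```python
import math
import statistics


def covariance(xs, ys):
    # Sample covariance: average product of deviations, with n - 1 in the denominator.
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def correlation(xs, ys):
    # Covariance rescaled by both standard deviations, so the result is unitless.
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]        # y = 2x + 1, a perfect positive linear relationship

cov_xy = covariance(x, y)
corr_xy = correlation(x, y)

# Rescaling y changes the covariance but not the correlation.
y_scaled = [100 * value for value in y]
cov_scaled = covariance(x, y_scaled)
corr_scaled = correlation(x, y_scaled)
```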
Definition: hypothesis test
Statistical method to determine whether there is enough evidence in a sample of data to support or reject a stated assumption (hypothesis) about a population
What are some of the key reasons why hypothesis testing is useful for data science?
3 points
- Can make decisions based on statistical evidence, rather than relying on assumptions or opinions.
- Formal, standardized approach, making results interpretable and reproducible.
- Allows for clear and credible communication of findings.
What are some of the key use cases for hypothesis testing?
3 points
- A/B Testing: Evaluate if new feature, design, or change has a significant impact
- Feature Selection: Test the significance of variables in statistical or ML models
- Model Significance: Assess the significance of a predictive model
Definition: Contingency table
Tabular format used to display the frequencies (counts) of data points across two or more categorical variables
What is the chi-square test of independence?
Statistical test used to determine whether there is a significant association between two categorical variables in a contingency table
What are the null and alternative hypotheses in the chi-square test of independence?
- Null Hypothesis (H_0): The two variables are independent (no association).
- Alternative Hypothesis (H_a): The two variables are not independent (there is an association).
Equation: chi-square statistic
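The standard form of the statistic, summed over every cell of the contingency table. The index notation is an assumption of this sketch:

```latex
\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
```

Here O_{ij} is the observed frequency and E_{ij} the expected frequency in row i, column j.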
Equation: 𝐸 (Expected frequency, calculated under the assumption of independence)
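Under independence, the expected count of a cell is its row total times its column total, divided by the grand total. The symbols R_i, C_j, and N are notation assumed for this sketch:

```latex
E_{ij} = \frac{R_i \, C_j}{N}
```

Here R_i is the total of row i, C_j the total of column j, and N the total number of observations.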
Equation: Degrees of freedom (df) in a chi-square test of independence
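For a contingency table with r rows and c columns (notation assumed here), the standard form is:

```latex
df = (r - 1)(c - 1)
```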
How do you run the chi-square test of independence?
3 steps
- Calculate the chi-square statistic
- Compare the computed χ^2 statistic to a critical value from the chi-square distribution table (based on df and significance level, e.g., 0.05)
- If χ^2 exceeds the critical value, reject the null hypothesis
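The three steps can be sketched in plain Python for a made-up 2x2 table. The critical value 3.841 is the tabulated chi-square value for df = 1 at a significance level of 0.05:

```python
import math

# Illustrative observed counts: rows are groups, columns are outcomes.
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 1: expected counts under independence, then the chi-square statistic.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

# Step 2: degrees of freedom and the matching critical value.
df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 3.841      # df = 1, significance level 0.05

# Step 3: decision.
reject_null = chi_square > critical_value
```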
Definition: p-value
Probability, assuming the null hypothesis is true, of obtaining a test statistic as extreme as or more extreme than the one observed
Definition: alpha (hypothesis testing)
Significance level, or a predetermined threshold representing the maximum acceptable probability of making a Type I error (i.e., rejecting a true null hypothesis). Criterion against which the p-value is compared to decide whether to reject the null hypothesis
What are the most common types of sampling techniques?
4 main points
- Simple random sampling
- Stratified random sampling
- Cluster sampling
- Systematic sampling
Simple random sampling
Common types of sampling techniques
Each member of the population has an equal chance of being selected for the sample
Stratified random sampling
Common types of sampling techniques
Involves dividing the population into subgroups (or strata) based on certain characteristics and selecting a random sample from each stratum
Cluster sampling
Common types of sampling techniques
Involves dividing the population into smaller groups (or clusters) and then selecting a random sample of clusters
Systematic sampling
Common types of sampling techniques
Involves selecting every kth member of the population to be included in the sample
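A sketch of all four techniques on a toy population of 100 numbered members, using the standard random module. The strata, cluster size, and k are arbitrary choices for illustration:

```python
import random

random.seed(42)                      # fixed seed so the run is repeatable
population = list(range(1, 101))     # members numbered 1..100

# Simple random sampling: every member has an equal chance.
simple = random.sample(population, 10)

# Stratified random sampling: split into strata, then sample within each.
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

# Cluster sampling: split into clusters, then pick whole clusters at random.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [m for cluster in chosen_clusters for m in cluster]

# Systematic sampling: every kth member after a random start.
k = 10
start = random.randrange(k)
systematic = population[start::k]
```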
Equation: Bayes’ theorem
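The standard statement, for events A and B with P(B) > 0 (the event names are notation assumed here):

```latex
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}
```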