Midterm Flashcards
Define Logistic Regression
Logistic regression is a model for predicting yes-or-no outcomes from given data. It combines the numeric inputs into a weighted sum and passes that sum through the sigmoid function, which gives a probability between 0 and 1. Comparing that probability to a threshold (commonly 0.5) decides between two choices (like pass/fail or spam/not spam).
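A minimal sketch of the prediction step may help here. The weights, bias, and the 0.5 threshold below are hypothetical values chosen for illustration; in practice they would be learned from training data.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Weighted sum of the inputs, passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (e.g. pass) if the probability reaches the threshold, else 0 (e.g. fail)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical model: one feature (hours studied), made-up weight and bias.
weights, bias = [1.0], -1.0
label_long_study = predict_label([2.0], weights, bias)   # z = 1.0, probability above 0.5
label_no_study = predict_label([0.0], weights, bias)     # z = -1.0, probability below 0.5
```

The result between 0 and 1 is the output of `predict_proba`; the yes/no answer only appears once a threshold is applied.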
Define Sample
A small group chosen from a larger group (population) to study.
Example: Surveying 100 students from a school of 1,000 students.
Define Population
The entire group we are interested in studying.
Example: All the students in the school.
Independent Variable
A factor that we change or control in an experiment to see its effect.
Dependent Variable
The outcome that we measure, which depends on the independent variable.
Augmentation
Making more data by slightly modifying existing data to improve a model's performance.
Oversampling
Creating more copies of data from underrepresented groups to balance the dataset.
Undersampling
Reducing the amount of data from overrepresented groups to balance the dataset.
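Oversampling and undersampling can be sketched with the standard library alone. This is an illustrative sketch of the random variants only; the function names, the fixed seed, and the toy lists are assumptions, not a specific library's API.

```python
import random

def oversample(minority, target_size, seed=0):
    """Randomly duplicate minority items until the group reaches target_size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def undersample(majority, target_size, seed=0):
    """Randomly keep only target_size items from the majority group."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

# Hypothetical imbalanced dataset: 8 "not spam" rows versus 2 "spam" rows.
not_spam = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"]
spam = ["s1", "s2"]

spam_balanced = oversample(spam, len(not_spam))       # 2 rows grown to 8
not_spam_balanced = undersample(not_spam, len(spam))  # 8 rows cut to 2
```

Oversampling keeps every original row but repeats some of them; undersampling discards rows, so information from the larger group is lost.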
Nominal Data
Categories with no specific order.
Ordinal Data
Categories with a meaningful order, but the difference between them is not measurable.
Interval Data
Numbers with equal spacing between values, but no true zero point.
Ratio Data
Like interval data, but with a meaningful zero point, so you can compare ratios.
Qualitative Data
Qualitative data refers to non-numerical information that describes characteristics, attributes, or properties of something. It is used to capture insights, opinions, emotions, behaviors, and descriptions that cannot be easily measured or counted.
Probability Distribution
A probability distribution shows how likely different outcomes are. It assigns a probability (between 0 and 1) to each possible event.
Example: The probability of rolling a die and getting a 3 is 1/6 ≈ 0.1667.
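As a small sketch, the die example can be written as a table of outcomes and probabilities, assuming a fair six-sided die. Exact fractions are used so the probabilities sum to exactly 1.

```python
from fractions import Fraction

# Fair six-sided die (assumption): every face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

total = sum(die.values())   # a valid distribution sums to 1
p_three = float(die[3])     # probability of rolling a 3, as a decimal
```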
Frequency Distribution
A frequency distribution shows how many times each value appears in a dataset.
Example: If you survey 10 people about their favorite fruit:
Apple - 4
Banana - 3
Grapes - 2
Orange - 1
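The fruit survey can be tallied with a short sketch like the one below. The `responses` list is a hypothetical reconstruction of the 10 raw answers that would produce the counts shown.

```python
from collections import Counter

# Hypothetical raw survey answers matching the counts in the example.
responses = ["Apple"] * 4 + ["Banana"] * 3 + ["Grapes"] * 2 + ["Orange"]

# Counter tallies how many times each value appears.
frequency = Counter(responses)
```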
Cumulative Distribution
A cumulative distribution shows the total count or percentage as you move through the dataset.
Example (Cumulative Frequency)
Apple: 4
Apple + Banana: 4 + 3 = 7
Apple + Banana + Grapes: 7 + 2 = 9
Apple + Banana + Grapes + Orange: 9 + 1 = 10
Each cumulative value is a running total, and the last one equals the total number of observations (10).
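The running totals in this example can be reproduced with a minimal sketch, reusing the fruit counts from the frequency distribution card.

```python
from itertools import accumulate

# Counts from the fruit survey example, in the order listed.
frequency = {"Apple": 4, "Banana": 3, "Grapes": 2, "Orange": 1}

# accumulate() yields running totals: 4, 4+3, 4+3+2, 4+3+2+1.
cumulative = dict(zip(frequency, accumulate(frequency.values())))
```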
How to Convert a Frequency Distribution to a Probability Distribution (%)
To convert a frequency table into a probability distribution, follow these steps:
Find the total frequency (sum of all occurrences).
Divide each frequency by the total to get probability.
Multiply by 100 to get percentage.
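The three steps above can be sketched as follows, using the fruit survey counts as input. Rounding to two decimals is an assumption made here to keep floating-point noise out of the percentages.

```python
frequency = {"Apple": 4, "Banana": 3, "Grapes": 2, "Orange": 1}

# Step 1: total frequency.
total = sum(frequency.values())

# Step 2: divide each frequency by the total to get a probability.
probability = {item: count / total for item, count in frequency.items()}

# Step 3: multiply by 100 to get a percentage.
percentage = {item: round(p * 100, 2) for item, p in probability.items()}
```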
Cumulative Percentages
This is the cumulative sum of percentages as we move down the list.
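A sketch of cumulative percentages for the fruit survey is shown below. It accumulates the raw counts first and converts to a percentage at the end, which gives the same result as summing the individual percentages.

```python
from itertools import accumulate

frequency = {"Apple": 4, "Banana": 3, "Grapes": 2, "Orange": 1}
total = sum(frequency.values())

# Running count divided by the total, expressed as a percentage.
cumulative_percent = {
    item: 100 * running / total
    for item, running in zip(frequency, accumulate(frequency.values()))
}
```

The last entry always reaches 100%, because by then every observation has been counted.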
Relative Percentages
A relative percentage shows how each category compares to the total. It's the same as the percentage in the probability distribution.
Population Mean (μ) vs. Sample Mean (x̄)
Population mean: The average of all values in an entire population.
Sample mean: The average of values in a selected sample from the population.
Example
Population Mean: If we measure the height of all students in a school, we get the exact population mean (μ).
Sample Mean: If we measure the height of just 50 students, we estimate the mean using the sample mean (x̄).
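The difference can be sketched in a few lines. The heights below are made-up values for a hypothetical school of 12 students, and the sample size of 5 and the seed are arbitrary choices for illustration.

```python
import random
import statistics

# Hypothetical heights (cm) of every student: this is the whole population.
population = [150, 152, 155, 158, 160, 162, 165, 168, 170, 172, 175, 178]

# Population mean: uses every value, so it is exact.
mu = statistics.mean(population)

# Sample mean: uses only a random subset, so it is an estimate of mu.
rng = random.Random(42)
sample = rng.sample(population, 5)
x_bar = statistics.mean(sample)
```

A different random sample would generally give a different sample mean, while the population mean stays fixed.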
The Importance of Data Cleaning in Analysis
Data cleaning is the process of fixing or removing incorrect, incomplete, duplicate, or irrelevant data before analysis. It plays a vital role because poor-quality data leads to misleading results and bad decisions.
Improves Accuracy: Dirty data (errors, missing values, duplicates) can lead to wrong conclusions. Cleaning ensures the analysis is based on reliable data.
Enhances Efficiency: Clean data reduces processing time and improves the performance of machine learning models and statistical calculations.
Avoids Bias & Misinterpretation: Inconsistent or incomplete data can skew results, leading to biased decisions.
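A minimal sketch of two common cleaning steps, dropping incomplete rows and removing duplicates, is shown below. The records and the rule for what counts as a duplicate (same name ignoring case and surrounding spaces, same score) are assumptions for illustration.

```python
def clean(records):
    """Drop rows with missing values, then drop duplicates (keeping the first)."""
    seen = set()
    cleaned = []
    for name, score in records:
        if name is None or score is None:
            continue  # incomplete row: skip it
        key = (name.strip().lower(), score)  # normalize before comparing
        if key in seen:
            continue  # duplicate of an earlier row: skip it
        seen.add(key)
        cleaned.append((name.strip(), score))
    return cleaned

# Hypothetical raw data: one missing score and one duplicate with messy spacing.
raw = [("Ana", 82), ("Ben", None), ("ana ", 82), ("Cy", 75)]
cleaned = clean(raw)
```

Whether to drop an incomplete row or fill in the missing value depends on the analysis; dropping is simply the easiest option to show.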
Which type of mean is used to describe a portion of individuals in a given population?
A. sample mean
B. population mean
Ans: A
When would it be appropriate to calculate a population mean?
A. When data are measured for a portion of individuals from a population.
B. When the sample mean is not available.
C. When data are measured for all members of a population.
D. When it is not possible to measure all data in a population.
Ans: C
The ______ is the sum of all scores (in a sample or population) divided by the number of scores summed.
A. mode
B. median
C. mean
D. range
Ans: C