Exam 1 Flashcards
How does the conda package manager help?
It automatically resolves and installs the dependencies each package needs.
How does the conda environment manager help?
It allows us to make isolated environments with different dependencies for different projects
Two things data consists of
Objects and attributes
Properties of attributes
- Distinctness
- Order
- Meaningful differences
- Meaningful ratios
Types of attributes, in order
- Nominal
- Ordinal
- Interval
- Ratio
Characteristics of a dataset
Dimensionality – Number of attributes in a dataset.
Size – Number of objects (rows).
Sparsity – Fraction of values that are missing or zero.
Resolution – Level of detail in the data.
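Worked example: a minimal sketch of how size, dimensionality, and sparsity could be computed for a small dataset. The toy data and the describe helper are made up for illustration, and None stands in for a missing value.

```python
# Hypothetical toy dataset: each row is an object, each column an attribute.
data = [
    [1.0, 0.0, 3.5],
    [0.0, 0.0, 2.0],
    [4.0, None, 0.0],
    [2.5, 1.0, 0.0],
]

def describe(rows):
    """Return size, dimensionality, and sparsity of a list-of-lists dataset."""
    size = len(rows)                                # number of objects (rows)
    dimensionality = len(rows[0]) if rows else 0    # number of attributes (columns)
    total = size * dimensionality
    # Count entries that are missing (None) or zero.
    empty = sum(1 for row in rows for v in row if v is None or v == 0)
    sparsity = empty / total if total else 0.0
    return {"size": size, "dimensionality": dimensionality, "sparsity": sparsity}

summary = describe(data)
print(summary)
```

Here 6 of the 12 entries are missing or zero, so the sparsity comes out to 0.5.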
Issues with data
Noise – Random errors or meaningless variations.
Outliers – Values that are very different from the rest.
Missing values – Data that isn’t recorded.
Duplicate data – Repeated entries in the dataset.
Mapping distance measures
Euclidean distance – Used for continuous variables.
Manhattan distance – Used for grid-based movement (e.g., a taxi driving the street grid in Manhattan).
Chebyshev distance – Used in chess for king’s moves.
Mahalanobis distance – Used when attributes are correlated or on different scales; it accounts for the covariance of the data.
SMC (Simple Matching Coefficient) – Used for binary attributes.
Jaccard – Measures similarity of sets (e.g., common words in two documents); the Jaccard distance is 1 minus the Jaccard similarity.
Correlation – Measures the strength of the linear relationship between variables.
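Worked example: a rough sketch of a few of these measures in plain Python. The function names and sample inputs are illustrative only; Mahalanobis distance and correlation are left out to keep it short.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute difference along any single axis.
    return max(abs(x - y) for x, y in zip(a, b))

def smc(a, b):
    # Fraction of positions where two binary vectors agree (0-0 and 1-1 both count).
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def jaccard_similarity(s, t):
    # Size of the intersection divided by size of the union.
    s, t = set(s), set(t)
    union = s | t
    return len(s & t) / len(union) if union else 1.0

p, q = (0, 0), (3, 4)
d_euclid = euclidean(p, q)   # 5.0
d_manhat = manhattan(p, q)   # 7
d_cheby = chebyshev(p, q)    # 4

match = smc([1, 0, 0, 1], [1, 1, 0, 0])  # 2 of 4 positions agree
doc_a = {"data", "mining", "is", "fun"}
doc_b = {"data", "science", "is", "fun"}
sim = jaccard_similarity(doc_a, doc_b)   # 3 shared words out of 5 distinct
jaccard_distance = 1 - sim
```

Note that SMC counts 0-0 matches, while Jaccard ignores items absent from both sets, which is why Jaccard is usually preferred for sparse data such as word occurrences.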
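Minkowski distance
A general distance formula with a parameter r that includes several measures as special cases: r = 1 gives Manhattan distance, r = 2 gives Euclidean distance, and r → ∞ gives Chebyshev distance.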
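Worked example: the Minkowski distance of order r is d(x, y) = (sum over k of |x_k - y_k|^r)^(1/r). The sketch below assumes equal-length numeric sequences; the function name and the handling of r = infinity as a maximum are illustrative choices.

```python
def minkowski(a, b, r):
    """Minkowski distance of order r between two equal-length numeric sequences."""
    if r == float("inf"):
        # Limit as r grows without bound: the largest single-axis difference.
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)

p, q = (0, 0), (3, 4)
d1 = minkowski(p, q, 1)               # r = 1: Manhattan distance
d2 = minkowski(p, q, 2)               # r = 2: Euclidean distance
dinf = minkowski(p, q, float("inf"))  # r = infinity: Chebyshev distance
print(d1, d2, dinf)
```

For these two points the three special cases give 7, 5, and 4 respectively.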
Define machine learning
Machine learning is the process of making computers learn patterns from data without being explicitly programmed.
General Strategy of Machine Learning
Collect data.
Train a model using data.
Evaluate performance.
Use the model for predictions.
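Worked example: the four steps above, sketched with a tiny nearest-centroid classifier in plain Python. The data, labels, and the fit/predict helpers are hypothetical; a real project would more likely use a library model and a much larger dataset.

```python
import math

# 1. Collect data: hypothetical labeled 2-D points as (features, label) pairs.
train = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((8.0, 8.0), "b"), ((9.0, 7.5), "b")]
test = [((2.0, 1.0), "a"), ((8.5, 9.0), "b")]

# 2. Train a model: store the mean point (centroid) of each class.
def fit(samples):
    sums, counts = {}, {}
    for (x, y), label in samples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def predict(model, point):
    # Pick the class whose centroid is closest (Euclidean distance).
    return min(model, key=lambda label: math.dist(model[label], point))

model = fit(train)

# 3. Evaluate performance on held-out data the model has not seen.
accuracy = sum(predict(model, p) == label for p, label in test) / len(test)

# 4. Use the model for predictions on a new, unlabeled point.
new_label = predict(model, (7.0, 7.0))
print(accuracy, new_label)
```

Evaluating on a separate test set, rather than the training data, is what makes step 3 a fair estimate of how the model behaves on new inputs.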
Supervised vs. Unsupervised
Supervised Learning – Has labeled data (e.g., spam detection).
Unsupervised Learning – No labels; finds patterns (e.g., clustering customers).
Classification vs. Regression
Classification – Predicts categories (e.g., cat vs. dog).
Regression – Predicts continuous values (e.g., stock prices).
Explain decision trees
A model that makes decisions by splitting data based on attributes.
Used for classification and regression.
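Worked example: a one-level tree (a "decision stump") that picks a single split on one numeric attribute. This is a simplified sketch: it scores splits by counting misclassified points, whereas full decision tree learners typically use an impurity measure such as Gini or entropy and split recursively. The data and the best_split helper are made up.

```python
def best_split(values, labels):
    """Find the threshold on one numeric attribute that misclassifies the fewest points."""
    best = None
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must send objects to both sides
        # Each side predicts its majority label.
        left_label = max(set(left), key=left.count)
        right_label = max(set(right), key=right.count)
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    return best

# Hypothetical attribute (weight in kg) and class labels.
weights = [1, 2, 3, 10, 11, 12]
animals = ["cat", "cat", "cat", "dog", "dog", "dog"]

errors, threshold, left_label, right_label = best_split(weights, animals)
print(f"if weight <= {threshold}: {left_label}, else: {right_label} ({errors} errors)")
```

The chosen rule splits at weight <= 3, which separates the two classes with no errors on this toy data.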
Explain support vector machines (SVM)
A model that finds the boundary (hyperplane) separating different classes with the largest margin.
Used, for example, in image recognition.
Explain clustering
Groups similar objects together.
Example: Finding customer segments in a business.
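Worked example: a minimal sketch of k-means, one common clustering algorithm (the card above does not name a specific method, so this choice is an assumption). The customer spending values, starting centers, and the kmeans_1d helper are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Very small k-means for 1-D data, starting from the given centers."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer; note that no labels are provided.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
print(centers, clusters)
```

The algorithm ends with one low-spend and one high-spend segment, found from the values alone.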