More Exam Questions Flashcards
Explain the concept of “feature selection” in the context of machine learning. Why is it important?
Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
It is important because:
- Makes models easier to interpret by simplifying them
- Reduces overfitting by eliminating irrelevant features
- Improves model accuracy
- Reduces training time
- Effective feature selection can enhance the performance of a model by focusing on the most relevant data and reducing noise.
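For illustration, here is a minimal scikit-learn sketch of feature selection; the dataset, scoring function, and number of features kept are example choices rather than part of the answer above.

```python
# Minimal feature-selection sketch (illustrative choices of dataset, scorer, and k).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)            # 30 numeric features
selector = SelectKBest(score_func=f_classif, k=10)    # keep the 10 highest-scoring features
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)                 # (569, 30) -> (569, 10)
```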
Describe the difference between a parametric and a non-parametric machine learning model.
The difference between parametric and non-parametric machine learning models lies in their approach to the underlying structure of the data they model:
Parametric models
- Predetermined form that maps inputs to outputs
- Specific formula that describes the relationship between input and output data
- Examples would be:
- Linear regression
- Linear relationship between input and output variables, e.g. predicting house prices based on features like size, location, and number of rooms. The key point is that the model assumes these features contribute linearly to the price.
- Logistic regression
- Used for classification problems; assumes a logistic function to estimate the probability that an input belongs to a certain class
- Characteristics:
- Fixed number of parameters
- Regardless of the amount of data, the number of parameters doesn’t change. This makes the learning process easier and requires less processing power
- Limitations
- Because of their fixed structure, they might not capture real-world data effectively, since the underlying relationships may be non-linear
Non-parametric models
- Flexible structure
- No fixed form for the function that maps inputs to outputs; the model structure is determined by the data itself, making it more flexible.
- Examples:
- Decision trees
- Helps diagnose “diseases” based on symptoms by segmenting input into simple regions.
- Node: Represents a symptom
- Branch: Represents the absence or presence of a symptom
- K-nearest Neighbors(KNN)
- Classifies data points based on how their neighbours are classified. Good for things like recommendation systems: KNN can recommend products based on what users with a similar profile bought.
- KNN also shines when you are unsure how to group your data, since its grouping is entirely data-driven: data points are grouped by their similarity (proximity) in the feature space.
- Characteristics
- Flexibility
- Suitable for complex and non-linear data; very adaptable to a wide range of data structures
- Challenges
- Require more data to learn and are prone to overfitting; can also require a lot of processing power
In summary, parametric models are more straightforward and computationally efficient but less flexible, while non-parametric models are more adaptable to complex data patterns at the cost of needing more data and being prone to overfitting.
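A short sketch contrasting the two families, assuming scikit-learn and a synthetic, deliberately non-linear dataset; the models and numbers are example choices.

```python
# Parametric (linear regression) vs non-parametric (KNN) on the same non-linear data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)    # non-linear relationship

linear = LinearRegression().fit(X, y)                # parametric: learns slope + intercept only
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)   # non-parametric: keeps all 200 points

print("linear R^2:", round(linear.score(X, y), 2))   # limited by the fixed linear form
print("knn R^2:   ", round(knn.score(X, y), 2))      # adapts to the curve in the data
```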
Discuss the challenges and considerations in implementing a machine learning model in a real-world healthcare setting for predicting patient outcomes.
Implementing a machine learning model in healthcare poses several challenges and considerations:
- Data Quality and Availability: Healthcare data can be fragmented, incomplete, or inconsistent, requiring careful preprocessing and integration.
- Model Interpretability: Models must be interpretable to healthcare professionals for trust and practical application. Complex models like deep learning may offer high accuracy but lack transparency.
- Ethical Considerations and Bias: Ensuring the model doesn’t propagate biases present in historical data, such as those based on race, gender, or socioeconomic status.
- Regulatory Compliance: Adhering to legal standards like HIPAA in the US, which governs the use and sharing of personal health information.
- Model Validation and Reliability: Rigorous validation is required to ensure models are reliable and generalize well to different patient populations.
- Integration with Healthcare Systems: Models must integrate seamlessly with existing healthcare IT systems, requiring collaboration between data scientists, clinicians, and IT professionals.
Explain how the k-means clustering algorithm works and discuss its limitations.
The k-means clustering algorithm partitions data into k distinct clusters based on feature similarity. The process involves:
- Initialising k centroids
- A centroid is the mean/central position of all points in a cluster, i.e. a calculated average position that represents the center of the cluster.
- Random initialisation
- Choose ‘k’ points at random from the dataset to serve as the initial centroids
- ‘k’ represents the number of clusters you want to form
- They are the starting points of the clustering process
Basically the process is:
- Selection
- Select initial centroids (‘k’ data points at random)
- Assignment
- Each data point in the dataset is assigned to the nearest centroid, based on distance (usually Euclidean distance)
- Update
- After all points are assigned, recalculate each centroid’s position, moving it to the mean of all points in its cluster
- Iteration
- Repeat Assignment and Update step until centroids no longer move significantly.
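The loop above can be written in a few lines of NumPy. This is a minimal sketch of the assignment and update steps; the function name and stopping rule are illustrative choices, and empty clusters are not handled.

```python
# Minimal k-means sketch: random initialisation, then assignment/update until convergence.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # random initialisation
    for _ in range(n_iter):
        # Assignment: each point goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of the points assigned to it
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels
```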
Limitations/Considerations:
- Sensitivity to initialisation: different initial centroid placements can lead to different outcomes.
- Convergence to local minima: the algorithm can converge to a local minimum instead of the global minimum, leading to suboptimal clustering.
- Smart initialisation techniques: algorithms such as K-means++ are sometimes used for smarter initialisation that spreads out the starting centroids in a way that likely leads to better clustering.
- Assumption of spherical clusters of similar size, which may not fit all datasets.
- Difficulty in determining the optimal number of clusters (k).
- Poor performance with high-dimensional data due to the “curse of dimensionality”.
Describe each phase of the CRISP-DM (Cross-Industry Standard Process for Data Mining) process and explain its importance in a data mining project.
Business Understanding
- The initial phase of the project
- Focuses on understanding the project objectives & requirements from a business perspective
- Converts this knowledge into a data mining problem definition and a preliminary plan.
Data Understanding
- Involves collecting the data and becoming familiar with it
- Identifying data quality problems
- Discovering/detecting interesting subsets in the data to form hypotheses for hidden information.
Data Preparation
- Encompasses all activities needed to construct the final dataset from the initial raw data
- Such as: cleaning data, selecting cases, and transforming variables for the modeling tool.
Modeling:
- Modeling techniques are selected and applied
- Their parameters are calibrated for optimal prediction.
- Often requires iterative back-and-forth steps until the best model(s) are identified.
- This is the phase where algorithms such as K-Means or K-Nearest Neighbours would typically be applied
Evaluation
- Evaluation of the models is needed before deployment
- Models are evaluated in the context of the business objectives defined in the first phase (this might require going back to the data preparation or modeling phase)
Deployment
- Final phase, involves deploying the model into the operational environment
- Could be as simple as generating a report, or as complex as implementing a repeatable data mining process across the organization.
EXTRA NOTE:
Other Considerations of K-means and KNN usage:
- Data Preparation Phase: While not their primary function, these algorithms can be used for specific tasks like feature creation or missing value imputation, as previously discussed.
- Data Understanding Phase: They might also be useful for gaining insights into the structure and relationships in the data, which can inform subsequent modeling decisions.
What are some practical applications of K-means clustering, and how does the choice of ‘k’ affect the results?
Remember:
The goal of K-means is to discover inherent groupings in the data, not to classify data points.
Market Segmentation:
- K-means can segment customers into groups based on purchasing behavior, demographics, etc., for targeted marketing.
Document Clustering:
- Grouping similar documents for information retrieval or organizational purposes.
Image Segmentation:
- For dividing a digital image into multiple segments to simplify and/or change the representation of an image into something more meaningful.
The choice of ‘k’, the number of clusters, significantly affects the clustering results. If ‘k’ is too small, the algorithm may merge distinct groups into overly broad clusters. If ‘k’ is too large, clusters may be split unnecessarily, capturing noise in the data rather than useful patterns. Techniques like the Elbow Method or the Silhouette Coefficient can help determine a good value for ‘k’.
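A small sketch of the Elbow Method mentioned above, assuming scikit-learn; the synthetic blob data and the range of k values are illustrative.

```python
# Elbow Method sketch: fit K-Means for several k and compare inertia
# (within-cluster sum of squared distances).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # the "elbow" is where inertia stops dropping sharply
```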
Explain the concept of the kernel trick in Support Vector Machines and its benefits.
The kernel trick is a fundamental concept in Support Vector Machines (SVMs), particularly useful when dealing with non-linear data. It allows SVMs to operate in a high-dimensional space without explicitly mapping the data to that space, which can be computationally expensive.
Benefits of the kernel trick include:
- Ability to handle non-linear data effectively.
- Reduced computational cost: a kernel function computes the inner products of data points in the higher-dimensional space without actually performing the transformation, so the explicit mapping is avoided.
- Flexibility to adapt to different types of data through the choice of different kernel functions, like polynomial, radial basis function (RBF), or sigmoid.
–
Different types of data that kernel trick adapts to:
- Linear Kernel: No transformation, equivalent to the standard dot product in the original feature space.
- Polynomial Kernel: Allows for curved boundaries in the original feature space.
- Radial Basis Function (RBF) / Gaussian Kernel: Can handle complex non-linear relationships.
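To illustrate the effect of the kernel choice, here is a sketch using scikit-learn's SVC on a ring-shaped dataset that no straight line can separate; the dataset and parameters are example choices.

```python
# Compare a linear and an RBF kernel on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, round(clf.score(X, y), 2))   # RBF can fit the circular boundary; linear cannot
```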
Explain the term overfitting
Overfitting is a common issue in machine learning and statistical modeling where a model learns not only the underlying patterns in the training data but also the noise and random fluctuations. This results in a model that performs very well on the training data but poorly on new, unseen data. Here’s a detailed look at overfitting:
Characteristics of Overfitting:
High Performance on Training Data:
- The model shows very high accuracy or low error on the training dataset.
Poor Generalization:
- The model performs poorly on new or unseen data (test data), indicating that it has not learned the underlying patterns effectively.
Complex Models:
- Often occurs in models that are too complex for the amount or type of data available. Such models have too many parameters relative to the number of observations.
What are the causes of Overfitting, and how do you detect and prevent it?
Causes of Overfitting:
- Too Many Parameters: A model with excessively many parameters (e.g., a deep neural network with many layers) can learn detailed patterns and noise in the training data.
- Limited Training Data: If the training data is not representative of the general population or is too small, the model might learn specifics of that dataset rather than the general trend.
- Irrelevant Features: Including features that are not relevant to the prediction task can lead the model to learn associations that don’t generalize well.
- Training for Too Long: Especially in iterative models like neural networks, training for too many epochs can lead to overfitting.
How to Detect Overfitting:
- Validation Performance: A significant difference in performance between training and validation/test datasets is a classic sign of overfitting.
- Learning Curves: Plotting the performance on both training and validation sets over time. Overfitting is indicated if the training error decreases while the validation error starts to increase.
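A sketch of the learning-curve check described above, assuming scikit-learn; the unconstrained decision tree is chosen because it tends to overfit, and the dataset is synthetic.

```python
# Learning-curve sketch: compare training and validation scores as training size grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=None), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(n, round(tr, 2), round(va, 2))   # a large gap between the two scores signals overfitting
```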
Preventing Overfitting:
- Pruning (in Decision Trees): Removing parts of the tree that provide little power to classify instances.
- Early Stopping: In iterative models, stop training before the model has a chance to learn the noise in the data.
- Simplifying the Model: Using a simpler model with fewer parameters can help.
- More Data: Increasing the size of the training dataset can improve the model’s ability to generalize.
- Feature Selection: Reducing the number of irrelevant or redundant features.
- Cross-Validation: Using techniques like k-fold cross-validation helps ensure that the model performs well across different subsets of the data.
- Regularization: Techniques like L1 or L2 regularization add a penalty for more complex models, discouraging overfitting.
In summary, overfitting is when a model is too closely fitted to the specificities of the training data, leading to poor performance on new data. It’s a key challenge in machine learning, and various strategies are employed to prevent it, ensuring that models are both accurate and generalizable.
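As a concrete illustration of two of the prevention strategies above (cross-validation and L2 regularization), here is a hedged sketch assuming scikit-learn; the dataset, the Ridge penalty strength, and the models compared are example choices.

```python
# Use k-fold cross-validation to measure generalization, comparing an unregularized
# linear model with an L2-regularized (Ridge) one.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=50, noise=25.0, random_state=0)

for name, model in [("unregularized", LinearRegression()), ("ridge (L2)", Ridge(alpha=10.0))]:
    scores = cross_val_score(model, X, y, cv=5)   # out-of-fold R^2 estimates generalization
    print(name, round(scores.mean(), 2))
```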
What are the differences between K-means and K-nearest neighbour(KNN), give examples of their use
K-Nearest Neighbors (KNN)
- Useful For: Supervised learning tasks, specifically classification and regression.
- Example Scenario: Predicting whether a patient has a particular disease based on their medical records.
- How KNN Works Here: KNN can classify a new patient as having the disease or not by comparing their medical records to those of previous patients. If the ‘k’ nearest patients in the feature space (considering features like age, symptoms, blood tests) mostly have the disease, KNN would classify the new patient as likely having the disease, and vice versa.
K-Means Clustering
- Useful For: Unsupervised learning tasks, particularly for identifying groups or clusters in data.
- Example Scenario: A company wants to segment its customer base to tailor marketing strategies.
- How K-Means Works Here: K-Means can group customers into clusters based on features like purchasing habits, demographics, and preferences. Each cluster represents a segment of the market with similar characteristics. The company can then develop targeted marketing strategies for each segment, improving the efficiency and effectiveness of its marketing efforts.
In summary, KNN is used for supervised learning problems where the goal is to predict an output based on input data that is similar to known examples. K-Means, on the other hand, is used for unsupervised learning problems where the goal is to discover inherent groupings in the data.
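A side-by-side sketch of the two algorithms on the same feature matrix, assuming scikit-learn; the synthetic data simply stands in for the patient or customer records in the examples above.

```python
# KNN (supervised: needs labels) vs K-Means (unsupervised: ignores labels).
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN: predicts the class of new points from the classes of their k nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("KNN test accuracy:", round(knn.score(X_test, y_test), 2))

# K-Means: groups the points into k clusters with no labels involved
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
```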
What is SVM(Support vector machines)
It’s a type of supervised machine learning algorithm. It is primarily used for classification tasks but can also be adapted for regression.
What is a Support Vector Machine (SVM)?
Fundamental Concept:
- SVMs are based on the idea of finding a hyperplane that best divides a dataset into classes. In two-dimensional space, this hyperplane is a line dividing a plane into two parts where each class lies on either side.
Support Vectors:
- These are the data points nearest to the hyperplane, which are the critical elements of a data set.
- The SVM algorithm builds a model that assigns new data points a category based on these support vectors, hence the name.
Margin Maximization:
- SVM aims to find the hyperplane with the maximum margin, meaning it tries to maximize the distance between the data points of different classes. This helps in reducing the error of the classifier.
How SVMs Work:
- Linear SVMs: In their simplest form, SVMs are used for linear classification, dividing data points using a straight line (or hyperplane in higher dimensions).
- Non-Linear SVMs: Many real-world problems are not linearly separable, meaning a straight line cannot effectively separate the classes. This is where the kernel trick becomes essential.
What is a hyperplane in SVM?
A hyperplane is essentially a decision boundary that separates different classes in the dataset.
In two-dimensional space (like a flat sheet of paper), the hyperplane is a line; in three dimensions it is a plane, and in higher dimensions it is a flat subspace with one dimension fewer than the feature space.
Dividing the Dataset:
- The hyperplane’s purpose in SVM is to separate the data points into different classes as CLEARLY as possible.
- Imagine data points on a graph where each point belongs to one of two classes. The SVM’s goal is to separate the classes by drawing a line (hyperplane) that divides these points into two groups.
Best Separation:
- The “best” hyperplane is the one that represents the largest separation, or margin, between the two classes.
- The margin = distance between the hyperplane and the nearest data point from either class.
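The margin idea can be made concrete with a linear SVM fit, assuming scikit-learn: the learned coefficients define the hyperplane w·x + b = 0, and the margin width is 2/||w||. The blob dataset here is an illustrative, well-separated example.

```python
# Linear SVM sketch: inspect the hyperplane, its margin, and the support vectors.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)
clf = SVC(kernel="linear", C=1000).fit(X, y)   # large C ~= hard margin on separable data

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("margin width:", round(2 / np.linalg.norm(w), 3))
print("number of support vectors:", len(clf.support_vectors_))   # the points nearest the hyperplane
```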
What are the four V’s in big data and what do they mean in context? Illustrate by describing examples of data based on these four V’s.
Big data is commonly described using the framework of the “Four V’s,” which helps to understand its key characteristics. These are Volume, Velocity, Variety, and Veracity:
Volume:
- This refers to the vast amounts of data generated every second. For example, data from social media platforms like Twitter and Facebook, where millions of users generate text, images, and videos constantly, represent a large volume of data.
Velocity:
- This is about the speed at which new data is generated and the speed at which data moves around. For instance, stock market data, where prices fluctuate rapidly and data is updated in milliseconds, is an example of high-velocity data.
Variety:
- This points to the different types of data we can now use. Traditional data types were structured and fit neatly in a relational database. Today, data comes in new unstructured forms, like text, video, and images. An example is data from wearable devices, which collect a variety of data types including numerical (heart rate, steps), text (user inputs), and sometimes even images (photos of activities or meals).
Veracity:
- This refers to the quality of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with varying degrees of reliability). For example, customer reviews on websites can vary widely in terms of reliability, relevance, and accuracy, which affects the veracity of this data.
Data mining and retrieving data from a database using a SQL query are distinct processes, each with its own purpose and methods. Let’s explore the differences through examples:
Data Mining
Data mining involves extracting useful information and patterns from large datasets using algorithms, statistical methods, and machine learning techniques. It’s more about discovery and insights rather than just retrieval.
- Example: Imagine a retail company with a large database of customer transactions. Data mining could involve using clustering algorithms to segment customers into different groups based on purchasing behavior. This could reveal patterns like a group of customers who frequently buy organic products, or those who make large purchases during holiday seasons. The company could then use this information for targeted marketing campaigns.
Retrieving Data from a Database (e.g., SQL Query)
Retrieving data from a database using a SQL query is a process of accessing specific information by writing queries that specify exactly what data is needed. It’s more straightforward and involves direct queries to a structured database.
- Example: A hospital administrator wants to know how many patients were admitted for a specific condition last month. They could use a SQL query like `SELECT COUNT(*) FROM patients WHERE condition = 'X' AND admission_date BETWEEN '2023-01-01' AND '2023-01-31';`. This query would return the exact number of patients admitted with condition X in January 2023.
Key Differences
- Purpose: Data mining is about finding patterns and insights that are not explicitly stated in the data, whereas SQL queries are for retrieving specific information based on known criteria.
- Methodology: Data mining uses complex algorithms and statistical methods to analyze data, while SQL queries involve specific syntax to extract data from databases.
- Nature of Data: Data mining is often used with large, complex datasets that may be unstructured or semi-structured, whereas SQL queries are used on structured data in relational databases.
- Results: The results from data mining are often predictive models, patterns, or new insights, while SQL queries yield specific data points or subsets of the database.
In summary, data mining is about uncovering hidden patterns and relationships in large datasets, while retrieving data using SQL is about fetching specific data from a database based on known queries.
Briefly describe each of the basic steps to transform data to knowledge in the KDD process (Knowledge Discovery in Databases, as described by Fayyad et al) and support your answer with an example
KDD is a framework for transforming raw data into useful knowledge. This process includes several key steps:
- Selection: Identifying and gathering relevant data from various sources.
- Example: A retail company may select sales data from its database, including transaction details, customer information, and product data.
- Preprocessing: Cleaning and transforming the selected data to correct inaccuracies, handle missing values, and prepare it for analysis.
- Example: The retail company cleans the data by removing incomplete records, correcting errors, and standardizing the format of the date and time fields.
- Transformation: Reducing and transforming the preprocessed data into forms suitable for mining. This can involve dimensionality reduction, aggregation, and other methods to focus on important variables.
- Example: The company aggregates sales data by categories (like electronics, clothing), and uses dimensionality reduction techniques to focus on key factors affecting sales.
- Data Mining: Applying algorithms to extract patterns and models from the transformed data. This step involves selecting the appropriate mining tasks like classification, clustering, or association rule mining.
- Example: The company uses clustering algorithms to identify customer segments based on purchasing behavior and association rule mining to find commonly co-purchased products.
- Interpretation/Evaluation: Interpreting the mined patterns and evaluating their relevance and usefulness. This step often involves domain knowledge to make sense of the results and assess their value.
- Example: The company analyzes the customer segments and product associations to understand purchasing trends and preferences, evaluating the potential for targeted marketing strategies.
- Knowledge Utilization: Applying the discovered knowledge to make decisions or take actions. This is where the insights gained from the data mining process are used to achieve business objectives.
- Example: Based on the insights, the company launches targeted marketing campaigns for specific customer segments and adjusts its product placement and inventory according to the identified purchasing patterns.
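To tie the steps together, here is a hedged sketch of the retail example as code, assuming pandas and scikit-learn; the file name, column names, number of clusters, and aggregation choices are hypothetical and serve only to illustrate each KDD step.

```python
# KDD steps for the retail example (all names are hypothetical).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Selection: gather the relevant columns from the raw transaction data
raw = pd.read_csv("transactions.csv")                       # hypothetical source file
data = raw[["customer_id", "category", "amount", "date"]]   # hypothetical columns

# Preprocessing: remove incomplete records and standardize the date field
data = data.dropna()
data["date"] = pd.to_datetime(data["date"], errors="coerce")

# Transformation: aggregate spend per customer and category, then scale
per_customer = data.pivot_table(index="customer_id", columns="category",
                                values="amount", aggfunc="sum", fill_value=0)
features = StandardScaler().fit_transform(per_customer)

# Data Mining: cluster customers into segments
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

# Interpretation/Evaluation and Knowledge Utilization: inspect the segments,
# then hand them to marketing for targeted campaigns
print(pd.Series(segments).value_counts())
```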