Tentafrågor (Exam Questions) HT22 Flashcards
When comparing two sets of observations (data), describe how the control group differs from the treatment group.
In experimental research, a control group and a treatment group are essential components for studying the effects of an intervention. Consider the example of testing a blood-pressure-lowering medicine at a hospital; the distinction between these two groups is crucial.
Control Group:
- Definition: The control group consists of participants who do not receive the experimental treatment, in this case the blood-pressure-lowering medicine.
- Role: The control group serves as a baseline for comparison, providing a reference point to understand what would happen in the absence of the experimental intervention.
- Treatment Comparison: Members of the control group may receive a placebo or no treatment, allowing researchers to isolate and measure the specific effects of the medicine by contrasting the outcomes with those of the treatment group.
Treatment Group:
- Definition: The treatment group comprises participants who receive the actual blood pressure lowering medicine being tested.
- Role: This group is exposed to the experimental treatment, and their outcomes are observed and analyzed to evaluate the impact of the intervention.
- Comparison with Control Group: By comparing the results of the treatment group with those of the control group, researchers can attribute any observed effects to the administered medicine, discerning its efficacy.
Using a control group in this way ensures that observed changes can be reasonably attributed to the experimental treatment rather than external factors. Random assignment helps minimize pre-existing differences between participants, enhancing the validity of the study.
- Summary: The control group provides a baseline for comparison, while the treatment group allows researchers to assess the specific effects of the intervention, creating a robust experimental design to draw meaningful conclusions about the tested medicine.
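To make the comparison concrete, here is a minimal simulation sketch in Python. Every number in it (group sizes, baseline blood pressure, effect size, noise level) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical systolic blood pressure (mmHg) after the study period.
# Assumed values: the control group stays near its baseline mean, while
# the treatment group's mean is lowered by the medicine.
control = rng.normal(loc=150, scale=10, size=50)    # placebo / no treatment
treatment = rng.normal(loc=138, scale=10, size=50)  # receives the medicine

# With random assignment, the difference in group means is a reasonable
# estimate of the medicine's effect.
print(f"Control mean:     {control.mean():.1f} mmHg")
print(f"Treatment mean:   {treatment.mean():.1f} mmHg")
print(f"Estimated effect: {control.mean() - treatment.mean():.1f} mmHg")
```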
Consider data that may be taken from a variety of sources. Describe 3 different potential problems with the quality of the data. (2 points per answer).
Incomplete Data:
- Description: Incomplete data occurs when certain observations or variables are missing, either entirely or for specific cases.
- Problematic Impact: Missing data can lead to biased analyses and inaccurate conclusions. The absence of critical information may hinder the ability to understand patterns, relationships, or trends in the data.
Inconsistent Data:
- Description: Inconsistent data refers to discrepancies or variations in the format, units, or definitions of variables across different sources or within the same dataset.
- Problematic Impact: Inconsistencies make it challenging to integrate and analyze the data accurately. They can lead to misinterpretation of results, as the meaning of variables may differ and comparisons become unreliable.
Data Entry Errors:
- Description: Data entry errors occur when inaccurate information is recorded during the data collection or input phase. This can include typos, miscalculations, or misinterpretations of the data.
- Problematic Impact: Errors in data entry can introduce noise and distort the true representation of the data. They may lead to incorrect statistical analyses and conclusions, affecting the overall reliability of findings.
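As a brief illustration, here is a minimal pandas sketch showing how each of the three problems might surface. The tiny DataFrame, its column names, and the plausibility range are all invented for this example:

```python
import pandas as pd

# Invented dataset exhibiting all three quality problems at once.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "weight": [70.0, None, 68.0, 7100.0],  # missing value; 7100 looks like a typo
    "unit": ["kg", "kg", "lbs", "kg"],     # inconsistent units across rows
})

# 1. Incomplete data: count missing values per column.
print(df.isna().sum())

# 2. Inconsistent data: the same variable recorded in different units.
print(df["unit"].value_counts())

# 3. Data entry errors: implausible values can be flagged with a range check.
print(df[(df["weight"] < 20) | (df["weight"] > 300)])
```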
Briefly describe the 6 stages of the Cross-Industry Standard Process for Data Mining (CRISP-DM) and how each stage in CRISP-DM relates to the previous stage.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely used framework for guiding data mining projects. It consists of six stages, each building upon the previous one.
- Business Understanding:
- Description: In this initial stage, the focus is on understanding the business problem, objectives, and requirements from a data mining perspective. This involves defining the goals of the project, understanding the business context, and determining what success looks like.
- Relation to Previous Stage: As the first stage, business understanding has no predecessor; instead, it sets the foundation for the entire data mining process. It helps identify the key factors that need to be addressed and ensures alignment with organizational goals.
- Data Understanding:
- Description: This stage involves exploring and understanding the available data. It includes data collection, initial data inspection, and a preliminary assessment of its quality. The goal is to familiarize the data mining team with the characteristics of the dataset.
- Relation to Previous Stage: The data understanding stage is informed by the business understanding stage. It helps to identify the data sources relevant to the business problem and provides insights into the nature of the data that will be used for analysis.
- Data Preparation:
- Description: Data preparation involves cleaning, transforming, and formatting the data to make it suitable for analysis. This stage also includes handling missing values, outliers, and other data quality issues.
- Relation to Previous Stage: The data preparation stage is directly influenced by the findings of the data understanding stage. It addresses any data quality issues identified during exploration and prepares the data for modeling.
- Modeling:
- Description: In the modeling stage, various data mining techniques are applied to build and assess models that address the business objectives. This involves selecting appropriate modeling techniques, creating models, and fine-tuning them to achieve the desired results.
- Relation to Previous Stage: The modeling stage relies on the prepared and cleaned data from the data preparation stage. The choice of modeling techniques is informed by the understanding of the business problem and the characteristics of the data.
- Evaluation:
- Description: The evaluation stage assesses the models’ performance in meeting the business objectives. It involves validating the models using independent datasets and evaluating their effectiveness based on predefined criteria.
- Relation to Previous Stage: The evaluation stage depends on the models developed in the previous stage. It provides feedback on the success of the modeling efforts and helps decide whether the models are suitable for deployment.
- Deployment:
- Description: The final stage involves deploying the data mining results into the business environment. This could include implementing the models into operational systems, creating reports, or integrating the findings into decision-making processes.
- Relation to Previous Stage: The deployment stage is the culmination of the entire CRISP-DM process. It puts the insights gained from data mining into practical use, ensuring that the business can benefit from the models and analysis.
Explain, on a conceptual level, how linear regression works as a prediction technique and explain its advantages / disadvantages.
Linear regression involves fitting a straight line through the data to predict numerical values; the line is typically chosen by minimizing the sum of squared residuals (ordinary least squares). It can also be applied to some extent to classification tasks. For instance, if you have a dataset with two attributes (x and y), linear regression can help predict one attribute given the other.
Consider a scenario where the number of hours worked is represented by x and the corresponding salary by y. Linear regression enables the prediction of salary based on the hours worked. One of the advantages of linear regression is its simplicity and ease of use. However, its accuracy might be limited, and the Mean Absolute Error (MAE) and Mean Squared Error (MSE) could still be substantial.
- Linear regression is sensitive to outliers, particularly when working with a small dataset. If outliers are present, they can significantly impact the accuracy of the regression. Moreover, linear regression might provide misleading results if there is no discernible trend in the data.
- When it comes to classification, linear regression is only effective when there is a clear separation between classes. If the classes overlap or exhibit complex relationships, linear regression may not be suitable for classification tasks.
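To illustrate the hours-worked / salary example, here is a minimal scikit-learn sketch; the data points are invented and the units arbitrary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented example data: hours worked (x) vs. salary (y).
hours = np.array([[10], [20], [30], [40], [50]])
salary = np.array([150, 290, 460, 590, 750])

# Ordinary least squares fits the line that minimizes squared residuals.
model = LinearRegression().fit(hours, salary)
print(f"slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")

# Predict the salary for someone who worked 35 hours.
print(f"predicted salary for 35 h: {model.predict(np.array([[35]]))[0]:.2f}")
```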
Explain, on a conceptual level, how support vector regression works as a prediction technique and explain its advantages / disadvantages when compared to linear regression.
Support Vector Regression (SVR) shares its conceptual foundation with support vector machines but is designed for predicting numerical values rather than for classification. Like linear regression, SVR aims to draw a line (or, with kernels, a curve) through the data.
The key concept of SVR is a margin of tolerance around the fitted line (the epsilon-insensitive tube): points whose prediction error is smaller than epsilon incur no penalty, so a certain degree of error is allowed by design. This flexibility is a notable advantage, especially when dealing with noisy datasets where an exact fit is impossible. SVR can accommodate some level of mislabelling, making it robust in scenarios where linear regression might struggle due to overlapping or complex data patterns.
Advantages of SVR over linear regression:
- Handling Non-Linearity: One of the significant advantages of SVR is its ability to capture non-linear relationships in the data. It employs a kernel trick that transforms the input space, enabling the identification of complex patterns that linear regression might miss.
- Robustness to Outliers: SVR is generally more robust to outliers compared to linear regression. The algorithm focuses on support vectors, which are the most critical data points for defining the regression line. Outliers that are not support vectors have less impact on the overall model.
- Flexibility in High-Dimensional Spaces: SVR can effectively operate in high-dimensional spaces, making it suitable for datasets with numerous features. This flexibility is particularly advantageous when dealing with complex and multi-dimensional data.
Disadvantages of SVR compared to linear regression:
- Computational Complexity: SVR can be computationally intensive, especially when dealing with large datasets or complex kernel functions. This can result in longer training times compared to linear regression.
- Model Interpretability: While linear regression provides a straightforward interpretation of the model coefficients, SVR, especially when using non-linear kernels, can be more challenging to interpret. Understanding the impact of individual features on the prediction may be less intuitive.
- Parameter Sensitivity: SVR has parameters, such as the choice of kernel and regularization parameters, that need to be tuned appropriately for optimal performance. The sensitivity of SVR to parameter choices can be a disadvantage when compared to the simplicity of linear regression.
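A minimal sketch contrasting the two techniques on invented non-linear data; the RBF kernel and the parameter values (C, epsilon) are assumptions picked for this toy example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Invented non-linear data: a sine curve with a little noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 60).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

linear = LinearRegression().fit(X, y)
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)  # kernel trick handles non-linearity

# R^2 scores: the straight line cannot follow the curve, the kernel SVR can.
print(f"linear regression R^2: {linear.score(X, y):.2f}")
print(f"SVR (RBF kernel) R^2:  {svr.score(X, y):.2f}")
```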
Explain, on a conceptual level, how k-means clustering works as a clustering technique and explain its advantages / disadvantages.
k-means places k centroids into the data, assigns every data point to its nearest centroid, and then moves each centroid to the mean of the points assigned to it. The process is iterative: assignments and centroid positions are updated again and again, with the goal of making the mean distance within each cluster as low as possible. Eventually the centroids stop moving and the algorithm has converged.
The advantages are that it is a fairly simple concept to understand and it can find clusters that are relatively clearly separated into groups.
The main disadvantage is that it cannot find patterns in the data if the points are not organized in roughly circular, well-separated groups. If, as with an example in the lectures, the data visualised in a scatterplot forms a different shape (such as circles within circles), the k-means method cannot identify the pattern.
It also assigns ALL data points to some cluster, even the few points that do not actually fit in any of the clusters.
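A minimal scikit-learn sketch of the favourable case, using invented blob data where the clusters really are well-separated groups (note that k has to be chosen in advance):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Invented example: three well-separated blobs, the case k-means handles well.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# We assume k = 3 is already known; in practice choosing k is part of the problem.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)       # final centroid positions after convergence
print(np.bincount(kmeans.labels_))   # every single point is assigned to a cluster
```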
Explain, on a conceptual level, how DBSCAN works as a clustering technique and explain its advantages / disadvantages when compared to k-means.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm that groups together data points based on their density in the feature space. Unlike k-means, DBSCAN does not assume that clusters have a spherical shape and can discover clusters of arbitrary shapes.
DBSCAN starts from a data point and counts how many other points lie within a given radius (eps). If there are enough neighbours (min_samples), the point is a core point and a cluster is started; the cluster then grows by repeatedly adding every point that lies within the radius of a core point already in the cluster. Points that cannot be reached from any core point are labelled as noise.
The main advantage is that it can, as previously mentioned, identify clusters that are not organized in groups that fit easily inside a circle; it can identify patterns that are not as easily distinguished. Another advantage is that it allows data points that do not fit into any cluster to be left out as noise. However, this can also be seen as a disadvantage, since it of course means that some data will be left out of the clusters and might be disregarded.
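A minimal sketch of the circles-within-circles case mentioned above, with invented data; the eps and min_samples values are assumptions tuned to this toy example:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles

# Invented example: two concentric rings, a shape k-means cannot separate.
X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)    # eps = neighbourhood radius
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# DBSCAN recovers the two rings; any points labelled -1 are treated as noise,
# whereas k-means splits the data with a straight cut through both rings.
print("DBSCAN labels:", sorted(set(db.labels_)))
print("noise points:", int(np.sum(db.labels_ == -1)))
print("k-means label counts:", np.bincount(km.labels_))
```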
Is the sentiment analysis task predictive or descriptive? Justify your answer.
Sentiment analysis is primarily a predictive task. In sentiment analysis, the goal is to predict the sentiment or emotion expressed in a piece of text, such as a review, comment, or tweet. The task involves classifying the sentiment as positive, negative, or neutral, or even more fine-grained sentiment categories.
Here are some reasons why sentiment analysis is considered a predictive task:
- Outcome Prediction: The main objective is to predict the sentiment or emotional tone conveyed in a given text. It involves assigning a label or score to indicate whether the sentiment is positive, negative, or neutral.
- Machine Learning Models: Sentiment analysis is often approached using machine learning models, where the algorithm is trained on labeled data to learn patterns and relationships between features and sentiment labels. The trained model is then used to make predictions on new, unseen text.
- Classification Problem: Sentiment analysis is formulated as a classification problem, where the task is to classify text into predefined sentiment categories. Classification is inherently a predictive task, as it involves assigning a class label to input instances.
- Generalization to New Data: A predictive task involves building a model that generalizes well to new, unseen data. In sentiment analysis, the goal is to create a model that can accurately predict sentiment in texts it has not encountered during training.
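To make the predictive framing concrete, here is a minimal train-then-predict sketch; the four labelled texts are invented and far too few for a real model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented, tiny labelled dataset; a real model would need far more data.
texts = [
    "great product, works perfectly",
    "terrible, broke after a day",
    "absolutely love it",
    "waste of money, very disappointed",
]
labels = ["positive", "negative", "positive", "negative"]

# Train on labelled examples, then predict on unseen text:
# this train-then-predict pattern is exactly what makes the task predictive.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["this purchase was great"]))  # -> ['positive']
```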
What potential problems can you identify with taking the rule-based approach described for the sentiment analyser above?
For one, it will be a tedious task to define all the words to be tagged as positive or negative, even when using stemming.
It will also be insensitive to subtleties such as irony, idioms, or turns of phrase, which means there is a risk of mislabelling data.
A person who is in fact very happy might, for example, use words that out of context would be clearly marked as negative alongside a single word deemed positive, and the sentence would then be incorrectly categorized as negative (see the sketch below).
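A minimal sketch of such a rule-based scorer, with invented word lists, showing exactly this mislabelling problem:

```python
import re

# Invented, tiny sentiment lexicons; a real analyser would need far larger lists.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "sick", "killing"}

def rule_based_sentiment(text: str) -> str:
    """Count positive minus negative words; no notion of context or irony."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# An enthusiastic sentence using slang: two "negative" words outweigh one
# "positive" word, so a clearly happy statement is labelled negative.
print(rule_based_sentiment("this band is sick, they are killing it, so good"))
```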
If we assume that sentiment analysis is predictive, should the problem be formulated as a classification task or a regression task? Justify your answer.
Classification. Regression is mainly used to predict numerical values. Even though regression can be used for classification as well, it only works when the data points are clearly separated into classes where we can easily draw a line between them, and it is not a leap to assume that sentiment data would not be that cleanly separated. Labelling text as either positive or negative, however, fits quite neatly into the realm of classification.
Consider the following linear regression model equation and accompanying data.
y = 3x + 10
| x | y  |
|---|----|
| 1 | 14 |
| 2 | 17 |
| 3 | 25 |
| 4 | 20 |
| 5 | 25 |
Calculate the mean absolute error (MAE) and mean squared error (MSE) for the linear model using the data above. Show each step of your calculations.
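A worked answer, since the card itself gives none: the model's predictions are ŷ = 3x + 10, i.e. 13, 16, 19, 22 and 25 for x = 1 to 5. The errors y - ŷ are 1, 1, 6, -2 and 0. The absolute errors are therefore 1, 1, 6, 2, 0 (sum 10), and the squared errors are 1, 1, 36, 4, 0 (sum 42). With n = 5, MAE = 10/5 = 2.0 and MSE = 42/5 = 8.4. A short NumPy check of the same steps:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([14, 17, 25, 20, 25])

y_pred = 3 * x + 10            # model predictions: 13, 16, 19, 22, 25
errors = y - y_pred            # 1, 1, 6, -2, 0

mae = np.mean(np.abs(errors))  # (1 + 1 + 6 + 2 + 0) / 5
mse = np.mean(errors ** 2)     # (1 + 1 + 36 + 4 + 0) / 5
print(f"MAE = {mae}")          # 2.0
print(f"MSE = {mse}")          # 8.4
```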