More Exam Questions Flashcards

1
Q

Explain the concept of “feature selection” in the context of machine learning. Why is it important?

A

Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.

It is important because:

  • Makes models simpler and therefore easier to interpret
  • Reduces overfitting by eliminating irrelevant features
  • Improves model accuracy
  • Reduces training time
  • Effective feature selection enhances model performance by focusing on the most relevant data and reducing noise (see the sketch below).
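A minimal sketch of one way to do this, using univariate feature selection in scikit-learn (the synthetic dataset and the choice of k=5 are purely illustrative):

```python
# Illustrative sketch: keep only the k most informative features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, of which only 5 are actually informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Score each feature with an ANOVA F-test and keep the best 5.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)            # (500, 20)
print("Reduced shape:", X_selected.shape)    # (500, 5)
print("Kept feature indices:", selector.get_support(indices=True))
```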
2
Q

Describe the difference between a parametric and a non-parametric machine learning model.

A

The difference between parametric and non-parametric machine learning models lies in their approach to the underlying structure of the data they model:

Parametric models

  • Predetermined form that maps inputs to outputs
  • Specific formula that describes the relationship between input and output data
  • Examples would be:
    • Linear regression
      • Linear relationship between input and output variables, e.g. predicting house prices based on features like size, location, and number of rooms. The key point is that the model assumes these features contribute linearly to the price.
    • Logistic regression
      • Used for classification problems; it assumes a logistic function to estimate the probability that an input belongs to a certain class
  • Characteristics:
    • Fixed number of parameters
      • Regardless of the amount of data, the number of parameters doesn’t change. This makes the learning process easier and requires less processing power
    • Limitations
      • Because of their fixed structure, they might not capture real-world data effectively, as the underlying relationships may not be linear

Non-parametric models

  • Flexible structure
    • No fixed form for the function that maps inputs to outputs; the model structure is determined by the data itself, which makes it more flexible.
  • Ex.
    • Decision trees
      • Helps diagnose “diseases” based on symptoms by segmenting input into simple regions.
        • Node: Represents a symptom
        • Branch: Represents the absence or presence of a symptom
    • K-nearest Neighbors(KNN)
      • Classifies data points based on how their neighbours are classified. Good for things like recommendation systems, where KNN can recommend products based on what users with a similar profile bought.
      • KNN also shines when you are unsure how to model the decision boundary, since its predictions are entirely data-driven: a new point is labelled according to its nearest neighbours, i.e. by proximity in the feature space.
  • Characteristics
    • Flexibility
      • Suitable for complex and non-linear data, very adaptable to wide range of data structures
    • Challenges
      • Require more data to learn, are prone to overfitting, and can also require a lot of processing power

In summary, parametric models are more straightforward and computationally efficient but less flexible, while non-parametric models are more adaptable to complex data patterns at the cost of needing more data and being prone to overfitting.
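As a hedged illustration of this trade-off, a parametric linear model and a non-parametric KNN model can be compared on synthetic non-linear data (a sketch using scikit-learn; the data and parameters are made up):

```python
# Sketch: parametric (linear regression) vs non-parametric (KNN) regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)   # non-linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)               # fixed functional form
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)  # form driven by the data

print("Linear regression R^2:", linear.score(X_test, y_test))   # struggles with sin(x)
print("KNN regression R^2:   ", knn.score(X_test, y_test))      # adapts to the curve
```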

3
Q

Discuss the challenges and considerations in implementing a machine learning model in a real-world healthcare setting for predicting patient outcomes.

A

Implementing a machine learning model in healthcare poses several challenges and considerations:

  • Data Quality and Availability: Healthcare data can be fragmented, incomplete, or inconsistent, requiring careful preprocessing and integration.
  • Model Interpretability: Models must be interpretable to healthcare professionals for trust and practical application. Complex models like deep learning may offer high accuracy but lack transparency.
  • Ethical Considerations and Bias: Ensuring the model doesn’t propagate biases present in historical data, such as those based on race, gender, or socioeconomic status.
  • Regulatory Compliance: Adhering to legal standards like HIPAA in the US, which governs the use and sharing of personal health information.
  • Model Validation and Reliability: Rigorous validation is required to ensure models are reliable and generalize well to different patient populations.
  • Integration with Healthcare Systems: Models must integrate seamlessly with existing healthcare IT systems, requiring collaboration between data scientists, clinicians, and IT professionals.
4
Q

Explain how the k-means clustering algorithm works and discuss its limitations.

A

The k-means clustering algorithm partitions data into k distinct clusters based on feature similarity. The process involves:

  • Initialising k centroids
    • The mean (central) position of all points in a cluster; a calculated average position that represents the centre of the cluster.
  • Random initialisation
    • Choosing ‘k’ points at random from the dataset, to serve as the initial centroids
    • ‘k’ represents the number of clusters you want to form.
    • They are the starting points of the clustering process

Basically the process is:

  • Selection
    • Select initial centroids (‘k’ data points at random)
  • Assignment
    • Each data point in the dataset is then assigned to the nearest centroid, based on distance (usually Euclidean distance)
  • Update
    • After all points are assigned, recalculate each centroid’s position, adjusting it to the mean of all points in its cluster
  • Iteration
    • Repeat Assignment and Update step until centroids no longer move significantly.
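The selection/assignment/update/iteration loop above can be sketched in a few lines of NumPy (a toy illustration that ignores edge cases such as empty clusters; in practice scikit-learn's KMeans would be used):

```python
# Toy sketch of the k-means loop: selection, assignment, update, iteration.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Selection: pick k random data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Iteration: stop when the centroids no longer move significantly.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly [0, 0] and [5, 5]
```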

Limitations/Considerations:

  • Sensitivity to initialization: Different initial centroid placements can result in different outcomes.
  • Convergence to local minima: Could lead to suboptimal clustering because of converging to a local minimum instead of a global minimum
  • Smart initialization techniques: Algorithms such as k-means++ are sometimes used for smarter initialisation, spreading out the starting centroids in a way that typically leads to better clustering
  • Assumption of spherical clusters of similar size, which may not fit all datasets.
  • Difficulty in determining the optimal number of clusters (k).
  • Poor performance with high-dimensional data due to the “curse of dimensionality”.
5
Q

Describe each phase of the CRISP-DM (Cross-Industry Standard Process for Data Mining) process and explain its importance in a data mining project.

A

Business Understanding

  • The initial phase of the project
  • Focus on understanding the project objectives & requirements from a business perspective
  • Convert this knowledge into a data mining problem and define a preliminary plan.

Data Understanding

  • Involves collecting the data and becoming familiar with it
  • Identifying data quality problems
  • Discovering/detecting interesting subsets in the data to form hypotheses for hidden information.

Data Preparation

  • Encompasses all activities needed to construct the final dataset from the initial raw data
  • Such as: cleaning data, selecting cases, and transforming variables for the modeling tool.

Modeling:

  • Modeling techniques are selected and applied
  • Their parameters are calibrated for optimal prediction.
  • Often requires iterative back-and-forth steps until the best model(s) are identified.
  • This is the phase where algorithms such as K-means or K-nearest neighbours would typically be applied

Evaluation

  • Evaluation of the models is needed before deployment
  • Models are evaluated in the context of the business objectives defined in the first phase (this might require going back to the data preparation or modeling phase)

Deployment

  • Final phase, involves deploying the model into the operational environment
  • Could be as simple as generating a report, or as complex as implementing a repeatable data mining process across the organization.

EXTRA NOTE:
Other Considerations of K-means and KNN usage:

  • Data Preparation Phase: While not their primary function, these algorithms can be used for specific tasks like feature creation or missing value imputation, as previously discussed.
  • Data Understanding Phase: They might also be useful for gaining insights into the structure and relationships in the data, which can inform subsequent modeling decisions.
6
Q

What are some practical applications of K-means clustering, and how does the choice of ‘k’ affect the results?

A

Remember:
The goal of K-means is to discover inherent groupings in the data, not to classify data points.

Market Segmentation:

  • K-means can segment customers into groups based on purchasing behavior, demographics, etc., for targeted marketing.

Document Clustering:

  • Grouping similar documents for information retrieval or organizational purposes.

Image Segmentation:

  • For dividing a digital image into multiple segments to simplify and/or change the representation of an image into something more meaningful.

The choice of ‘k’, the number of clusters, significantly affects the clustering results. If ‘k’ is too small, the algorithm might merge distinct groups into overly broad clusters. If ‘k’ is too large, clusters may be split unnecessarily, capturing noise in the data rather than useful patterns. Techniques like the Elbow Method or the Silhouette Coefficient can help determine a good value for ‘k’ (see the sketch below).
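A minimal sketch of the Elbow Method with scikit-learn (synthetic data with four “true” clusters; the range of k values is illustrative):

```python
# Sketch of the Elbow Method: print inertia for a range of k values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # 4 "true" clusters

for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia (within-cluster sum of squares) drops sharply until k is about 4,
    # then flattens out; the "elbow" suggests a good value for k.
    print(k, round(model.inertia_, 1))
```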

7
Q

Explain the concept of the kernel trick in Support Vector Machines and its benefits.

A

The kernel trick is a fundamental concept in Support Vector Machines (SVMs), particularly useful when dealing with non-linear data. It allows SVMs to operate in a high-dimensional space without explicitly mapping the data to that space, which can be computationally expensive.

Benefits of the kernel trick include:

  • Ability to handle non-linear data effectively.
  • Reduced computational cost: the kernel function computes the inner products of data points as if they were in the higher-dimensional space, without ever performing the explicit transformation
  • Flexibility to adapt to different types of data through the choice of different kernel functions, like polynomial, radial basis function (RBF), or sigmoid.


Common kernel functions used by the trick:

  • Linear Kernel: No transformation, equivalent to the standard dot product in the original feature space.
  • Polynomial Kernel: Allows for curved boundaries in the original feature space.
  • Radial Basis Function (RBF) / Gaussian Kernel: Can handle complex non-linear relationships.
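A hedged sketch comparing kernels with scikit-learn's SVC on a dataset that is not linearly separable (the two-moons data and parameters are illustrative):

```python
# Sketch: linear vs polynomial vs RBF kernels on non-linearly separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    # The RBF kernel typically separates the interleaving "moons" best.
    print(kernel, round(clf.score(X_test, y_test), 3))
```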
8
Q

Explain the term overfitting

A

Overfitting is a common issue in machine learning and statistical modeling where a model learns not only the underlying patterns in the training data but also the noise and random fluctuations. This results in a model that performs very well on the training data but poorly on new, unseen data. Here’s a detailed look at overfitting:

Characteristics of Overfitting:

High Performance on Training Data:

  • The model shows very high accuracy or low error on the training dataset.

Poor Generalization:

  • The model performs poorly on new or unseen data (test data), indicating that it has not learned the underlying patterns effectively.

Complex Models:

  • Often occurs in models that are too complex for the amount or type of data available. Such models have too many parameters relative to the number of observations.
9
Q

What are the causes of Overfitting, and how do you detect and prevent it?

A

Causes of Overfitting:

  • Too Many Parameters: A model with excessively many parameters (e.g., a deep neural network with many layers) can learn detailed patterns and noise in the training data.
  • Limited Training Data: If the training data is not representative of the general population or is too small, the model might learn specifics of that dataset rather than the general trend.
  • Irrelevant Features: Including features that are not relevant to the prediction task can lead the model to learn associations that don’t generalize well.
  • Training for Too Long: Especially in iterative models like neural networks, training for too many epochs can lead to overfitting.

How to Detect Overfitting:

  • Validation Performance: A significant difference in performance between training and validation/test datasets is a classic sign of overfitting.
  • Learning Curves: Plotting the performance on both training and validation sets over time. Overfitting is indicated if the training error decreases while the validation error starts to increase.

Preventing Overfitting:

  • Pruning (in Decision Trees): Removing parts of the tree that provide little power to classify instances.
  • Early Stopping: In iterative models, stop training before the model has a chance to learn the noise in the data.
  • Simplifying the Model: Using a simpler model with fewer parameters can help.
  • More Data: Increasing the size of the training dataset can improve the model’s ability to generalize.
  • Feature Selection: Reducing the number of irrelevant or redundant features.
  • Cross-Validation: Using techniques like k-fold cross-validation helps ensure that the model performs well across different subsets of the data.
  • Regularization: Techniques like L1 or L2 regularization add a penalty for more complex models, discouraging overfitting.

In summary, overfitting is when a model is too closely fitted to the specificities of the training data, leading to poor performance on new data. It’s a key challenge in machine learning, and various strategies are employed to prevent it, ensuring that models are both accurate and generalizable.
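As one concrete example of prevention, L2 regularization (Ridge) penalizes large coefficients. A sketch with invented data and an illustrative alpha value, comparing an over-flexible polynomial model with and without the penalty:

```python
# Sketch: L2 regularization (Ridge) vs unregularized regression on noisy data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = 2 * X.ravel() + rng.normal(scale=0.3, size=40)   # simple linear truth + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A degree-12 polynomial can easily overfit 30 training points...
overfit = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
# ...while the L2 penalty (alpha) shrinks the coefficients and tames it.
regularized = make_pipeline(PolynomialFeatures(degree=12), Ridge(alpha=1.0))

for name, model in [("no regularization", overfit), ("ridge (L2)", regularized)]:
    model.fit(X_train, y_train)
    print(name,
          "train R^2:", round(model.score(X_train, y_train), 2),
          "test R^2:", round(model.score(X_test, y_test), 2))
```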

10
Q

What are the differences between K-means and K-nearest neighbours (KNN)? Give examples of their use.

A

K-Nearest Neighbors (KNN)

  • Useful For: Supervised learning tasks, specifically classification and regression.

Example Scenario: Predicting whether a patient has a particular disease based on their medical records.

  • How KNN Works Here: KNN can classify a new patient as having the disease or not by comparing their medical records to those of previous patients. If the ‘k’ nearest patients in the feature space (considering features like age, symptoms, blood tests) mostly have the disease, KNN would classify the new patient as likely having the disease, and vice versa.

K-Means Clustering

  • Useful For: Unsupervised learning tasks, particularly for identifying groups or clusters in data.

Example Scenario: A company wants to segment its customer base to tailor marketing strategies.
  • How K-Means Works Here: K-Means can group customers into clusters based on features like purchasing habits, demographics, and preferences. Each cluster represents a segment of the market with similar characteristics. The company can then develop targeted marketing strategies for each segment, improving the efficiency and effectiveness of its marketing efforts.

In summary, KNN is used for supervised learning problems where the goal is to predict an output based on input data that is similar to known examples. K-Means, on the other hand, is used for unsupervised learning problems where the goal is to discover inherent groupings in the data.
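An illustrative sketch of the KNN scenario (the patient features, values, and labels are entirely made up, and real data would also need scaling):

```python
# Sketch: KNN classification on made-up patient records.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features: [age, blood_pressure, cholesterol]
X = np.array([[45, 130, 220], [50, 140, 250], [35, 120, 180],
              [60, 150, 260], [28, 115, 170], [55, 145, 240]])
y = np.array([1, 1, 0, 1, 0, 1])   # 1 = disease, 0 = no disease (invented labels)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
new_patient = np.array([[48, 135, 230]])
# The new patient gets the majority label of the 3 nearest patients.
print(knn.predict(new_patient))
```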

11
Q

What is an SVM (Support Vector Machine)?

A

It’s a type of supervised machine learning algorithm. It is primarily used for classification tasks but can also be adapted for regression.

What is a Support Vector Machine (SVM)?

Fundamental Concept:

  • SVMs are based on the idea of finding a hyperplane that best divides a dataset into classes. In two-dimensional space, this hyperplane is a line dividing a plane into two parts where each class lies on either side.

Support Vectors:

  • These are the data points nearest to the hyperplane, which are the critical elements of a data set.
  • The SVM algorithm builds a model that assigns new data points a category based on these support vectors, hence the name.

Margin Maximization:

  • SVM aims to find the hyperplane with the maximum margin, meaning it tries to maximize the distance between the data points of different classes. This helps in reducing the error of the classifier.

How SVMs Work:

  • Linear SVMs: In their simplest form, SVMs are used for linear classification, dividing data points using a straight line (or hyperplane in higher dimensions).
  • Non-Linear SVMs: Many real-world problems are not linearly separable, meaning a straight line cannot effectively separate the classes. This is where the kernel trick becomes essential.
12
Q

What is a hyperplane in SVM?

A

A hyperplane is essentially a decision boundary that separates different classes in the dataset.
In two-dimensional space (like a flat sheet of paper), this hyperplane is a line; in three dimensions it is a plane, and in higher dimensions it is the equivalent flat decision boundary.

Dividing the Dataset:

  • The hyperplane’s purpose in SVM is to separate the data points into different classes as clearly as possible.
  • Imagine data points on a graph where each point belongs to one of two classes. The SVM’s goal is to separate the classes by drawing a line (hyperplane) that divides these points into two groups.

Best Separation:

  • The “best” hyperplane is the one that represents the largest separation, or margin, between the two classes.
  • The margin = distance between the hyperplane and the nearest data point from either class.
13
Q

What are the four V’s in big data and what do they mean in context? Illustrate by describing examples of data based on these four V’s.

A

Big data is commonly described using the framework of the “Four V’s,” which helps to understand its key characteristics. These are Volume, Velocity, Variety, and Veracity:

Volume:

  • This refers to the vast amounts of data generated every second. For example, data from social media platforms like Twitter and Facebook, where millions of users generate text, images, and videos constantly, represent a large volume of data.

Velocity:

  • This is about the speed at which new data is generated and the speed at which data moves around. For instance, stock market data, where prices fluctuate rapidly and data is updated in milliseconds, is an example of high-velocity data.

Variety:

  • This points to the different types of data we can now use. Traditional data types were structured and fit neatly in a relational database. Today, data comes in new unstructured forms, like text, video, and images. An example is data from wearable devices, which collect a variety of data types including numerical (heart rate, steps), text (user inputs), and sometimes even images (photos of activities or meals).

Veracity:

  • This refers to the quality of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with varying degrees of reliability). For example, customer reviews on websites can vary widely in terms of reliability, relevance, and accuracy, which affects the veracity of this data.
14
Q

Data mining and retrieving data from a database using a SQL query are distinct processes, each with its own purpose and methods. Let’s explore the differences through examples:

A

Data Mining

Data mining involves extracting useful information and patterns from large datasets using algorithms, statistical methods, and machine learning techniques. It’s more about discovery and insights rather than just retrieval.

  • Example: Imagine a retail company with a large database of customer transactions. Data mining could involve using clustering algorithms to segment customers into different groups based on purchasing behavior. This could reveal patterns like a group of customers who frequently buy organic products, or those who make large purchases during holiday seasons. The company could then use this information for targeted marketing campaigns.

Retrieving Data from a Database (e.g., SQL Query)

Retrieving data from a database using a SQL query is a process of accessing specific information by writing queries that specify exactly what data is needed. It’s more straightforward and involves direct queries to a structured database.

  • Example: A hospital administrator wants to know how many patients were admitted for a specific condition last month. They could use a SQL query like SELECT COUNT(*) FROM patients WHERE condition = 'X' AND admission_date BETWEEN '2023-01-01' AND '2023-01-31'; This query would return the exact number of patients admitted with condition X in January 2023.

Key Differences

  • Purpose: Data mining is about finding patterns and insights that are not explicitly stated in the data, whereas SQL queries are for retrieving specific information based on known criteria.
  • Methodology: Data mining uses complex algorithms and statistical methods to analyze data, while SQL queries involve specific syntax to extract data from databases.
  • Nature of Data: Data mining is often used with large, complex datasets that may be unstructured or semi-structured, whereas SQL queries are used on structured data in relational databases.
  • Results: The results from data mining are often predictive models, patterns, or new insights, while SQL queries yield specific data points or subsets of the database.

In summary, data mining is about uncovering hidden patterns and relationships in large datasets, while retrieving data using SQL is about fetching specific data from a database based on known queries.
15
Q

Briefly describe each of the basic steps to transform data to knowledge in the KDD process (Knowledge Discovery in Databases, as described by Fayyad et al) and support your answer with an example

A

KDD is a framework for transforming raw data into useful knowledge. This process includes several key steps:

  • Selection: Identifying and gathering relevant data from various sources.
    • Example: A retail company may select sales data from its database, including transaction details, customer information, and product data.
  • Preprocessing: Cleaning and transforming the selected data to correct inaccuracies, handle missing values, and prepare it for analysis.
    • Example: The retail company cleans the data by removing incomplete records, correcting errors, and standardizing the format of the date and time fields.
  • Transformation: Reducing and transforming the preprocessed data into forms suitable for mining. This can involve dimensionality reduction, aggregation, and other methods to focus on important variables.
    • Example: The company aggregates sales data by categories (like electronics, clothing), and uses dimensionality reduction techniques to focus on key factors affecting sales.
  • Data Mining: Applying algorithms to extract patterns and models from the transformed data. This step involves selecting the appropriate mining tasks like classification, clustering, or association rule mining.
    • Example: The company uses clustering algorithms to identify customer segments based on purchasing behavior and association rule mining to find commonly co-purchased products.
  • Interpretation/Evaluation: Interpreting the mined patterns and evaluating their relevance and usefulness. This step often involves domain knowledge to make sense of the results and assess their value.
    • Example: The company analyzes the customer segments and product associations to understand purchasing trends and preferences, evaluating the potential for targeted marketing strategies.
  • Knowledge Utilization: Applying the discovered knowledge to make decisions or take actions. This is where the insights gained from the data mining process are used to achieve business objectives.
    • Example: Based on the insights, the company launches targeted marketing campaigns for specific customer segments and adjusts its product placement and inventory according to the identified purchasing patterns.
16
Q

Recall that data science involves finding associations and establishing causality. What is the difference between association and causality?

A

Association and causality are two fundamental concepts in data science.

Association

  • Definition: Association refers to a relationship where changes in one variable are related to changes in another variable, but this does not necessarily imply that one causes the other.
    • Characteristics:
      • Correlation: Variables show a tendency to change together, which can be measured statistically.
      • No Implied Direction: The relationship doesn’t specify which variable influences the other.
  • Example: There might be an association between the number of ice cream sales and the number of drowning incidents. As ice cream sales increase, drowning incidents also tend to increase. However, this does not mean that buying ice cream causes drowning incidents. Both are likely associated because they increase during warmer weather.

Causality

  • Definition: Causality implies that a change in one variable is responsible for a change in another. It establishes a cause-and-effect relationship.
    • Characteristics:
      • Directional: There is a clear direction from cause to effect.
      • Implies Mechanism: There is an underlying mechanism explaining why one variable affects the other.
  • Example: A classic example is the relationship between smoking and lung cancer. Extensive research has shown that smoking causes an increase in the risk of developing lung cancer. This is not merely an association; there are biological mechanisms at play where chemicals in cigarettes cause changes in lung tissue leading to cancer.
17
Q

How does identifying a control group and a treatment group then help us establish causality?

A

In machine learning and data science, establishing a control group and a treatment group helps in determining causality, particularly in A/B testing or randomized controlled trials. Here’s a more concise explanation with an example:

  • Control Group: This group does not receive the new feature or intervention. It acts as a baseline, continuing with the standard conditions.
  • Treatment Group: This group is subjected to the new feature or intervention.

Example: Consider an e-commerce website testing a new user interface. The control group continues using the current interface, while the treatment group uses the new interface. By comparing metrics like conversion rates or time spent on site between the two groups, the site can assess whether the new interface causes any significant changes in user behavior.

The key is that any significant difference in outcomes between these groups can be attributed to the new feature, assuming all other factors are constant. This approach allows for a clear comparison to determine the impact of the intervention, thereby helping establish causality.

18
Q

What do we mean by distance in a machine learning context? How can distance be measured?
Provide an example. Formulas are not required, but you should provide an explanation.

A

In a machine learning context, “distance” refers to a measure of how similar or dissimilar two data points are. It is a crucial concept in many machine learning algorithms, particularly in clustering, classification, and recommendation systems. The idea is to quantify the difference between data points in a dataset, usually represented in a multi-dimensional space.

There are several ways to measure distance, each with its own use cases:

  1. Euclidean Distance: The most common and intuitive; it’s the straight-line distance between two points in Euclidean space.
  2. Manhattan Distance: Measures the sum of the absolute differences between points across all dimensions. It’s like walking along a grid of streets and avenues.
  3. Cosine Similarity: Measures the cosine of the angle between two vectors. This is particularly useful in text analysis where the magnitude of the vector might not be as relevant as the direction (or angle).
  4. Hamming Distance: Used for categorical data, it counts the number of positions at which the corresponding symbols are different.
  5. Jaccard Similarity: Used for comparing the similarity and diversity of sample sets, measuring how many elements are shared between sets.

Example
Imagine a movie recommendation system where each movie is represented by a vector of features (such as genre, length, director, etc.). To recommend movies similar to a user’s favorite, the system calculates the distance between the user’s favorite movie and all other movies in the database.

  • Euclidean Distance: If features are numerical (like movie length, budget), the system might use Euclidean distance to find movies with similar numerical attributes.
  • Cosine Similarity: If features are more text-based (like descriptions or tags), cosine similarity could be used to find movies with similar content, regardless of the length of their descriptions.

The chosen distance measure depends on the nature of the data and the specific requirements of the machine learning task. The goal is always to quantify how similar or dissimilar the items (data points) are to each other in the context of the given problem.
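A sketch computing a few of these measures between two hypothetical movie feature vectors (the values are made up):

```python
# Sketch: common distance/similarity measures between two feature vectors.
import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean

# Hypothetical numerical features for two movies: [length_min, budget_musd, year]
movie_a = np.array([120, 50, 2015])
movie_b = np.array([95, 20, 2019])

print("Euclidean:", euclidean(movie_a, movie_b))      # straight-line distance
print("Manhattan:", cityblock(movie_a, movie_b))      # sum of absolute differences
print("Cosine distance:", cosine(movie_a, movie_b))   # 1 - cosine similarity
```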

19
Q

Explain, on a conceptual level, how linear regression works as a prediction technique and explain its advantages / disadvantages

A

Linear regression is a predictive modeling technique used to establish a linear relationship between a dependent variable and one or more independent variables. It involves fitting a linear equation to the observed data, aiming to minimize the differences between the predicted and actual values.

Advantages:

  • Simplicity: Easy to understand and implement.
  • Interpretability: The model’s results and the impact of each variable are clear.
  • Efficiency: Requires less data to produce a reliable model.

Disadvantages:

  • Assumes Linearity: Only effective if the relationship between variables is linear.
  • Sensitive to Outliers: Outliers can significantly skew results.
  • Assumes Independence: Requires that independent variables are not highly correlated.

Example:
In real estate, linear regression might predict house prices based on variables like size, age, and number of bedrooms, assuming these factors linearly affect the price.

In summary, while linear regression is straightforward and interpretable, its effectiveness is limited by its assumptions about linearity and variable independence.
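A sketch of the real-estate example with scikit-learn (the houses, features, and prices are invented):

```python
# Sketch: predicting house prices from size, age and bedrooms (invented data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [size_m2, age_years, bedrooms]
X = np.array([[70, 30, 2], [120, 10, 4], [90, 20, 3], [150, 5, 5], [60, 40, 2]])
y = np.array([200_000, 420_000, 300_000, 550_000, 180_000])   # prices

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)                    # estimated linear effect per feature
print("Prediction:", model.predict([[100, 15, 3]]))    # price estimate for a new house
```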

20
Q

Describe 3 advantages and 3 disadvantages of applying text pre-processing, such as stop-word removal and stemming, in content analysis? Justify your answer.

A

Applying text pre-processing techniques like stop-word removal and stemming in content analysis can significantly impact the efficiency and accuracy of the analysis. Here are three advantages and three disadvantages of using these techniques:

Advantages

  • Reduces Complexity: Removing stop words (commonly used words like ‘the’, ‘is’, ‘at’, etc.) reduces the dataset’s complexity. This simplification can speed up the analysis process as there are fewer and more meaningful words to process.
  • Focuses on Relevant Words: By eliminating stop words, the analysis can focus more on the relevant words that contribute more significantly to the content’s meaning, improving the accuracy of topic detection or sentiment analysis.
  • Normalizes Word Forms (Stemming): Stemming reduces words to their base or root form, which helps in consolidating different forms of a word into a single representation. This unification can lead to more accurate analysis by treating different forms of a word (like ‘running’, ‘ran’, ‘runs’) as the same word (‘run’).

Disadvantages

  • Loss of Context: Removing stop words can sometimes alter the meaning of the text or remove important contextual cues. For instance, the phrase “to be or not to be” would lose its meaning without stop words.
  • Over-Simplification (Stemming): Stemming can sometimes be too crude, leading to the oversimplification of words. For example, words like ‘university’ and ‘universe’ might be incorrectly stemmed to a common root, despite being different in meaning.
  • Misinterpretation of Negations: Removing stop words can sometimes remove negations like ‘not’, which can completely change the sentiment or meaning of a text. For example, “I do not like this product” might become “I like this product” if ‘not’ is removed.

In summary, while text pre-processing can streamline and focus the analysis, it can also oversimplify the text or remove important contextual information. The decision to use these techniques should consider the specific requirements and nuances of the content analysis task.
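A sketch of both pre-processing steps with NLTK (assumes the NLTK stop-word list can be downloaded; the sentence is illustrative):

```python
# Sketch: stop-word removal and stemming with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)   # one-time download of the stop-word list

text = "The runners were running quickly through the universities"
tokens = text.lower().split()

stop_words = set(stopwords.words("english"))
no_stops = [t for t in tokens if t not in stop_words]
print(no_stops)                             # ['runners', 'running', 'quickly', 'universities']

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in no_stops])  # e.g. ['runner', 'run', 'quickli', 'univers']
```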

21
Q

What are the potential problems for using a rule-based approach for content analysis?

A

Using a rules-based approach for content analysis, where specific rules are defined to classify or interpret text, can present several potential problems:

  • Rigidity: Rules-based systems lack flexibility and may not adapt well to variations or nuances in the text.
  • Contextual Misunderstanding: These systems often struggle with understanding context, leading to misinterpretation of idioms, sarcasm, or metaphors.
  • Scalability Challenges: As text complexity and variety increase, updating and maintaining the rules becomes cumbersome.
  • Language Limitations: Rules might not be effective across different languages or dialects and can quickly become outdated with new language usage.
  • Over-Specialization: Such systems can be too specific, leading to accurate but limited detections, missing broader relevant content.
  • Binary Outcomes: The approach usually yields a yes-or-no outcome, lacking the subtlety for probabilities or degrees of relevance.

In essence, while rules-based content analysis can be precise for specific tasks, it often struggles with adaptability, context sensitivity, and handling language variations.
22
Q

Describe how supervised learning might be used for content analysis. Support your answer with at least 1 example.

A

Supervised learning in content analysis involves training a model on a dataset where each piece of content is labeled with the correct category. The model learns to associate features of the content with these labels and then applies this learning to new, unlabeled content.

Example: Sentiment Analysis

  • Training: A model is trained on a dataset of social media posts, each labeled as ‘positive’, ‘neutral’, or ‘negative’.
  • Feature Extraction: The model learns patterns associated with each sentiment, like specific words or phrases.
  • Application: Once trained, the model can analyze new posts and classify their sentiment based on its learned patterns.

This approach allows for automated, scalable analysis of content, useful for tasks like sentiment analysis, topic detection, and spam filtering. Its effectiveness hinges on the quality of the training data and the model’s ability to generalize from it.
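A minimal sketch of such a supervised pipeline with scikit-learn (the labelled posts are invented and far too few for a real model):

```python
# Sketch: tiny sentiment classifier (TF-IDF features + logistic regression).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = ["I love this phone", "great battery life", "terrible screen",
         "worst purchase ever", "really happy with it", "awful support"]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)                     # training on labelled examples

print(model.predict(["the battery is great", "awful screen"]))  # applies learned patterns
```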

23
Q

Describe how unsupervised learning might be used for content analysis. Support your answer with at least 1 example.

A

Unsupervised learning, such as using the K-means clustering algorithm, can be effectively utilized for content analysis to identify inherent structures or patterns in the data without pre-defined labels.

Example with K-means:

  • Application: Imagine analyzing a large collection of articles without any predefined categories.
  • Process: K-means algorithm is applied to group these articles into clusters based on similarities in their textual features (like word frequencies, TF-IDF scores).
  • Outcome: The algorithm might identify clusters representing different themes or topics, such as technology, health, sports, etc., based on the commonalities in the content of the articles.

This method is valuable for discovering natural groupings in text data, providing insights into the main themes or topics present in a dataset without needing any prior categorization.
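A sketch of that process on a handful of invented article snippets (a real analysis would use thousands of documents and a tuned k):

```python
# Sketch: clustering short texts with TF-IDF features and K-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

articles = ["new smartphone chip announced", "laptop GPU benchmark results",
            "team wins the championship final", "star striker scores twice",
            "new vaccine trial shows promise", "doctors report drop in flu cases"]

X = TfidfVectorizer(stop_words="english").fit_transform(articles)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)   # articles about the same theme tend to share a cluster id
```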

24
Q

Consider data that may be taken from a variety of sources. Describe 3 different potential problems with the quality of the data.

A

When collecting data from various sources, three key quality issues commonly arise that can significantly impact its usefulness and accuracy:

  • Reliability and Validity: Data may not be reliable or valid, especially if sourced from unverified or biased platforms. For example, data from social media may contain subjective opinions or inaccuracies compared to data from official records or academic studies.
  • Incomplete or Inaccurate Data: Datasets often have missing or incorrect values. Missing data can occur in fields that aren’t mandatory, while inaccuracies might arise from human errors like typos. These issues can lead to challenges in analysis, and handling them (e.g., by deleting rows with missing values) might result in the loss of valuable information.
  • Contextual Misinterpretation: Data taken out of context can be misleading. For instance, web-scraped data might lose its meaning when detached from its original environment, leading to misinterpretations or skewed analysis.
25
Q

How does Support Vector Regression (SVR) differ from SVM?

A

Both are based on the principle of support vector learning, but they are used for different types of machine learning problems:

SVM is used primarily for classification, while SVR is used for regression.

  • Margin Concept: Unlike SVM which focuses on the margin between classes, SVR focuses on fitting the best line within a threshold error margin (ε-tube). It tries to include as many data points as possible within this margin while minimizing the error.
  • Output: Continuous values (e.g., house prices, temperatures).

While both SVM and SVR are based on similar principles of support vectors and margins, they are applied to fundamentally different types of problems – SVM for classification and SVR for regression.
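A sketch of SVR with scikit-learn (synthetic data; the C and epsilon values are illustrative):

```python
# Sketch: Support Vector Regression with an epsilon-insensitive tube.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# epsilon sets the width of the tube within which errors are ignored.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(model.predict([[1.0], [3.0]]))   # continuous outputs, not class labels
```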

26
Q

Explain DBSCAN and its advantages & disadvantages over K-means

A

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering technique that groups points together based on density.

How DBSCAN Works:

  • Core Concept: It identifies clusters as high-density areas separated by areas of low density.
  • Process: DBSCAN begins by categorizing data points as ‘core’, ‘border’, or ‘noise’, based on the number of neighbors and a given distance threshold (ε).
  • Cluster Formation: Points within a specified radius (ε) of a core point are part of the same cluster. Clusters form as more points are added based on density criteria.

Advantages over K-means:

  • No Need to Specify Number of Clusters: DBSCAN automatically determines the number of clusters based on the data, unlike K-means, which requires pre-specifying the number of clusters (k).
  • Handles Noise and Outliers: DBSCAN can effectively deal with noise and outliers, identifying them as separate from any cluster.
  • Works with Arbitrary Shapes: DBSCAN can find clusters of any shape, not just spherical, as is often the case with K-means.

Disadvantages compared to K-means:

  • Parameter Sensitivity: Choosing the right ε and minPts parameters can be challenging and greatly affects the outcome.
  • Difficulty with Varying Densities: DBSCAN can struggle with data having clusters of varying densities.
  • Less Efficient with Large Datasets: The algorithm can be less efficient and slower on very large datasets compared to K-means.

In short, DBSCAN offers advantages in flexibility, noise handling, and cluster-shape variety, but it is more sensitive to parameter selection and can be less efficient on large datasets than K-means.
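A sketch contrasting DBSCAN and K-means on non-spherical clusters (the eps/min_samples values are illustrative and usually need tuning):

```python
# Sketch: DBSCAN vs K-means on two interleaving "moon"-shaped clusters.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)    # label -1 marks noise points
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("DBSCAN clusters found:", len(set(db_labels) - {-1}))  # typically follows the moon shapes
print("K-means clusters:", len(np.unique(km_labels)))        # always 2, but cuts across the moons
```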

27
Q

How do you calculate

  • Precision
  • Recall
  • F1 score
A
  • Precision: It’s the number of true positives divided by the total number of predicted positives.
    TP / (TP + FP)
  • Recall: It’s the number of true positives divided by the total number of actual positives.
    TP / (TP + FN)
  • F1 score: It’s the harmonic mean of precision and recall, which balances the two by considering both false positives and false negatives.
    (2 * (Precision * Recall) / (Precision + Recall))
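A quick worked example from hypothetical confusion-matrix counts:

```python
# Sketch: precision, recall and F1 score from hypothetical counts.
tp, fp, fn = 40, 10, 20   # true positives, false positives, false negatives

precision = tp / (tp + fp)                            # 40 / 50 = 0.80
recall = tp / (tp + fn)                               # 40 / 60 ≈ 0.67
f1 = 2 * (precision * recall) / (precision + recall)  # ≈ 0.73

print(round(precision, 2), round(recall, 2), round(f1, 2))
```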
28
Q

What is Supervised machine learning?

A
  • Includes the target outcome.
  • Trained to recognize input patterns that lead to a certain outcome based on examples i.e. historical data.
  • The algorithm is trained on already labeled datasets, meaning that the input data is paired with corresponding output labels.
  • Example: credit evaluation, where we use customers from the past and learn how to label them as good or bad customers.
29
Q

What is unsupervised machine learning?

A

Unsupervised:

  • No known outcomes (it could be that we don’t know what the outcome should be, other than winning at tic-tac-toe).
  • Learns to recognize patterns in the form of similarities
  • The algorithm is given data without explicit instructions on what to do with it

Example: customer segmentation, meaning you divide the customers into groups based on common characteristics. Customers can be grouped by, for example, purchase behaviour; the algorithm then outputs several groups in which the individuals have characteristics in common. This is done using K-means or another unsupervised ML method.

30
Q

Describe classification.

A
  • Classification:
    Identifying a class into which a sample fits. You look at some attributes about an object and decide how to label it (classify it). This is a key part of AI. It’s also deeply useful for making sense of big data.
31
Q

Describe ANN(Artificial Neural Networks)

A

Artificial Neural Networks (ANNs) are computational models inspired by the structure and functioning of the human brain. They are a subset of machine learning algorithms designed to recognize patterns, make predictions, and perform tasks that require learning from data.

32
Q

Why is the initialization of weights in a neuron crucial for learning in perceptrons, and how does the adjustment of weights contribute to the learning process?

A
  • The initialization of weights in a neuron is crucial for learning in perceptrons because, initially, the network “knows” nothing about the relationships in the data.
  • Weights in a neural network represent the strengths of connections between neurons and are essential for making accurate predictions.
  • During the learning process, the network adjusts these weights based on the success or failure of its predictions.
  • Increasing weights makes the output more active, while decreasing weights makes it more inactive.
  • The adjustment of weights is a fundamental mechanism through which the network aligns its outputs with the expected/desired outputs, ultimately improving its ability to learn and generalize from the data.
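A toy sketch of this weight-adjustment loop, training a single perceptron on the logical AND function (the learning rate and initialization are illustrative):

```python
# Toy sketch: perceptron weight updates on the logical AND function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                 # AND truth table

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)          # small random initial weights: "knows nothing" yet
b = 0.0
lr = 0.1                                   # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = int(np.dot(w, xi) + b > 0)  # step activation
        error = target - pred
        w += lr * error * xi               # strengthen or weaken connections
        b += lr * error

print(w, b)                                          # weights that implement AND
print([int(np.dot(w, xi) + b > 0) for xi in X])      # expected: [0, 0, 0, 1]
```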
33
Q

Describe Natural Language Processing (NLP).

A
  • Natural Language Processing (NLP):
    Is a field at the intersection of computer science, artificial intelligence, and linguistics. It focuses on the interaction between computers and human language, specifically how to program computers to process and analyze large amounts of natural language data.

The goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

  • Sentiment Analysis: Identifying and extracting opinions within text data.
  • Speech Recognition: Converting spoken language into text.
  • Natural Language Generation: Producing text from computer data.
34
Q

What is stemming?

A

Stemming is a text normalization technique in natural language processing that involves reducing words to their root or base form. It aims to remove affixes (prefixes, suffixes) from words, leaving only the core meaning.

  • If used correctly, it can reduce data volume and processing time.
  • If used incorrectly, it can remove relevant information.
35
Q

What are the Ethical dilemmas of AI?
(potential essay question)

A

Ethical Dilemmas of AI:

  • What’s “right” and “wrong”?
    We don’t always agree on the “right answer” or the right choice. The decision is not up to the autonomous system; it has to be left to a human. Who do we blame if something goes wrong?
  • Bias and Fairness:
    Example: If a facial recognition system is trained predominantly on data from a specific demographic, it may exhibit bias and inaccuracies when identifying individuals from underrepresented groups, leading to unfair treatment.
  • Privacy Concerns:
    Example: Smart devices and AI-driven systems collecting and processing personal data may raise concerns about user privacy, especially if the information is used without explicit consent or is vulnerable to hacking.
  • Transparency and Explainability:
    Example: Complex AI models, such as deep neural networks, often operate as “black boxes,” making it challenging to explain their decision-making processes. This lack of transparency raises questions about accountability and trust.
  • Job Displacement and Economic Impact:
    Example: Automation and AI-driven technologies replacing certain jobs can lead to unemployment and economic disparities. Addressing the ethical dilemma involves finding ways to reskill and support affected workers.
  • Autonomous Systems and Decision-Making:
    Example: Autonomous vehicles making split-second decisions in critical situations may pose ethical challenges. For instance, deciding between prioritizing the safety of the vehicle’s occupants or pedestrians raises moral dilemmas.
  • Accountability and Liability:
    Example: Determining responsibility for AI-driven actions, especially in scenarios where decisions lead to unintended consequences, raises questions about legal and ethical accountability.
  • Data Handling and Consent:
    Example: Collection and use of personal data without clear consent or transparent privacy policies can lead to ethical concerns. This is particularly relevant in AI applications that heavily rely on extensive datasets.
36
Q

Big data is a very large amount of data that mixes different kinds of data structures. What are they?

A
  • Structured: tables with columns and meaningful rows of records.
  • Semi-structured: data from web sources; has some structure, giving a way of extracting data.
  • Unstructured: things like images, videos, and audio.