Lectures (ALL lectures) Flashcards
I wrote the questions myself.
Explain the 4 V's characteristics of big data:
- Volume: the size of the data.
- Velocity: the speed at which the data is being created. High velocity can result in a high volume of data.
- Variety: the type of data. How do we combine data from different sources to process it? For example, say we have two different sources that collect weather data such as temperature from around the world; some of them work with Celsius and some with Fahrenheit.
- Veracity: the accuracy. How do we deal with accuracy when we're combining data? If we collect temperature data, there might be one or two sensors that are wrong and give us incorrect numbers.
What does data science typically involve?
- Exploration: identifying patterns in information. For example, collecting data on the prices in a supermarket; if you collect data for a year you might have enough data to show that the prices change each day, and you can use it as evidence. Uses visualisations.
- Inference: quantifying whether the identified patterns are reliable. Uses randomization, which considers what would have happened under all possible random assignments, not just the one that happened to be selected for the experiment. This reduces bias in sample data.
- Prediction: making informed guesses; reliably making an educated guess and predicting with confidence. Uses previously gathered data to make informed guesses, which can be done with several different techniques. Uses machine learning.
What is Causality?
Why one thing impacts another: cause and effect. Measured by thinking of the data in terms of an experiment.
What does Association mean?
- Association:
Identifying and observing an effect. Example: Is there any relation between chocolate consumption and heart disease?
What types of groups have been discussed in the course when it comes to making comparisons?
When making comparisons you have a treatment/target group and a control/reference group (those who don’t receive the treatment).
Explain User DNA
User DNA is created when you link together click data, social media (like LinkedIn for jobs), advertisements, online shopping and Google searches. The data collected from, for example, Google is then combined with data from other services, so we can do even more with it.
This is a specific characteristic of big data and enables us to understand the users even better.
Once you’ve collected the data and want to combine it, what problems might you face?
The V's come in here: we have a dataset that is combined with other data, which is a volume problem, and there might be data streaming in from elsewhere, which is a velocity problem.
Can we trust the accuracy of the data (veracity)? We might have to do some data quality testing.
How do we load data into a dataframe?
To load the data into a DataFrame object we use pandas and store it in a variable. When we inspect the variable we get the full table with all the data. To inspect the column names, we use the columns attribute.
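A minimal sketch, assuming a CSV file (the file name data.csv and its contents are hypothetical):

```python
import pandas as pd

# Load the data into a DataFrame object and store it in a variable.
dataset = pd.read_csv("data.csv")  # hypothetical file name

# Inspecting the variable shows the full table with all the data.
print(dataset)

# The columns attribute gives the column names.
print(dataset.columns)
```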
What do we use head(), shape and describe() for?
- The head() function can be used to view the first rows of a dataset.
- The shape attribute gives us the length and width of the dataset, e.g. dataset.shape.
- The describe() function gives us summary statistics of the dataset.
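For example (the DataFrame values below are made up for illustration):

```python
import pandas as pd

dataset = pd.DataFrame({"height": [170, 165, 180, 175],
                        "weight": [65, 60, 80, 72]})

print(dataset.head())      # the first rows (first 5 by default)
print(dataset.shape)       # (rows, columns) -- an attribute, so no parentheses
print(dataset.describe())  # summary statistics: count, mean, std, min, quartiles, max
```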
To get general information about the data set, such as how many values are not empty, what function can we use?
To get general information about the data set, such as how many values are not empty, use the info() function. With dataset.info() we get information about each column's data type, such as object, int or float.
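A small sketch (the data is made up; the missing name shows up in the non-null counts):

```python
import pandas as pd

dataset = pd.DataFrame({"name": ["Ada", "Bo", None],
                        "age": [34, 28, 41]})

# info() prints the number of non-null entries per column and each
# column's data type (object, int64, float64, ...).
dataset.info()
```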
What is Big data?
- Big data:
Has to do with computing hardware, data storage and data collection. When we handle a very large amount of data, we refer to it as big data. This enables us to look beyond the data in our own business and try to combine it with other types of data. Big data has had a massive impact on the whole industry, and a lot of companies and applications work solely with it.
How is big data any different from regular data?
The data in big data isn't just data that comes in nice tables; it's a mix of different kinds of structures. Big data is a mix of structured, semi-structured and unstructured data:
- Structured data includes tables with columns and meaningful rows of records.
- Semi-structured data can be data from web sources, where there is some structure and a way of extracting the data if you want to.
- Unstructured data (a misleading term, since there's always some structure to data) includes things like images, videos and audio. So big data is about how we can try to solve business problems by combining these different types of data.
Big data is usually characterised by the 4 V's. Name and describe them.
- Volume: the size of the data we're trying to handle; the amount of data is unprecedented. The amount is always changing, so it's difficult to say how big "big" is.
- Velocity: the speed at which the data is being created, i.e. the rate at which new data is generated by a computer system. Think of how Amazon logs every single click a user makes on the website; scale that up to a million users and that is a lot of data. This can vary and result in a high volume of data.
- Variety: has to do with different types of data; there is a variety of data formats, sources and systems. This brings a whole lot of problems: how do we combine data from different sources to process it? For example, say we have two different sources that collect weather data such as temperature from around the world, and some of them work with Celsius while others work with Fahrenheit. There are different kinds of scales, so how do we deal with them when they report completely different numbers?
- Veracity: is about accuracy. If we collect temperature data, there might be one or two sensors that are wrong and give us incorrect numbers. So how do we deal with accuracy when we combine all the data? Maybe it doesn't matter if only two sensors report wrong numbers?
What is CRISP-DM (CRoss-Industry Standard Process for Data Mining)?
An open standard which can be freely used. Modelled as an ongoing, iterative cycle, as follows:
1. Business Understanding
- Determine business objectives (background, goals, criteria).
- Assess the situation (resources, requirements, risks).
- Determine data mining goals.
- Produce a plan (assessment of tools and methods).
2. Data Understanding
- Collect initial data (sample data, integration).
- Describe the data (types, quantities, properties).
- Explore the data (initial analysis, statistics, visualisation).
- Verify data quality (outliers, corrupted data, missing data).
3. Data Preparation
- Select data (which data and why).
- Clean data (handle missing data).
- Integrate data (merge from different sources).
- Format data (according to requirements).
4. Modeling
- Select modeling techniques (depending on the goals).
- Decision tree modeling (classification, k-nearest neighbor for clustering).
- Generate a test design (how to test the result).
- Build the model.
5. Evaluation
- Evaluate the results (against the success criteria).
- Review the process (was anything missed, did anything fail, were there problems?).
- Determine the next steps.
6. Deployment
- Plan deployment (strategy).
- Plan monitoring and maintenance (e.g. changed requirements).
- Produce a final report (documentation).
- Evaluate the project (what went well/badly?).
What is information and what is data?
- Data: raw facts with no meaning and no context; we haven't interpreted them yet.
- Information: think of data that has been transformed into something more useful. Maybe we've asked questions: what does the data concern, what things have been measured, where was it collected from? Adding more contextual data around the raw data.
Describe what data is.
Data refers to raw facts, information, or observations that are typically collected, stored, and analyzed for a specific purpose. It can take various forms, including numbers, text, images, audio, video, and more. Data is the foundation of information, knowledge, and decision-making processes.
Attributes can be categorised based upon the mathematical operations they support. What are the two categories?
- Qualitative:
Distinctiveness: =, ≠
Order: <, <=, >, >=
- Quantitative:
Addition/subtraction: +, -
Multiplication/division: *, /
Describe the nominal (qualitative) scale.
Giving something a name (labelling things). Nominal scales are categorical data for grouping data objects.
- For example, we might label the colour of someone's hair as black, blonde, gray, etc. We can say something about distinctiveness: black hair is not the same as blonde hair. But we can't say anything about order, such as black hair being "heavier" than blonde hair or brown hair being "better" than black hair. We are just saying that they differ, not how they differ. Distinct values can be counted, like frequency. Binary is a special case of nominal scale data with only two possible categories, e.g. yes or no, true or false.
Describe the ordinal (qualitative) scale.
Ordered data with meaningful ranking, but distances are not necessarily uniform. We can say something about the order but not about HOW different they are. E.g. grades, opinion data.
Distinct and ordered, so order and count-based operations can be used, in addition to those for the nominal scale. Operations like rank order, median, percentiles, rank correlation.
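A small sketch of the two scales in pandas (the hair colours and grades are made up for illustration):

```python
import pandas as pd

# Nominal: distinct categories with no meaningful order (hair colour).
hair = pd.Series(["black", "blonde", "gray", "black"], dtype="category")
print(hair.value_counts())  # count-based operations, e.g. frequency

# Ordinal: ordered categories (grades), but the distances are not uniform.
grades = pd.Series(pd.Categorical(["C", "A", "B", "A"],
                                  categories=["C", "B", "A"], ordered=True))
print(grades.min(), grades.max())  # order-based operations now make sense
```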
What is data exhaust?
- Data exhaust:
Data exhaust is the trail of activity, or residual data, left behind by some other kind of business or computing process. Examples include mobile phone data (calls and locations), financial data (transactions), residuals of Internet users' activity (online searches, server access logs) and administrative data (organisational transactions, record keeping). You could, for example, do data mining on how many phone calls someone makes every month to try to infer their financial status: if you make many calls, perhaps you have a better economy?
What is linear regression? (Give the formula.)
- Linear Regression:
LinearRegression is a built-in model from the sklearn Python package, which we use to build a linear regression model. With linear regression we predict the value of one variable based on the value of another variable. In the first example of the lab, we predict the variable weight based on the variable height. In the lecture we predict someone's debt based on their income.
The linear regression equation is of the form: y = mx + b
Where:
- y is the dependent variable (e.g. examination score).
- x is the independent variable (e.g. hours studied).
- m is the slope.
- b is the y-intercept.
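A minimal sketch with sklearn, using the lab's height/weight example (the numbers are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: heights in cm, weights in kg.
heights = np.array([[160], [170], [180], [190]])  # X must be two-dimensional
weights = np.array([55, 65, 75, 85])

model = LinearRegression()
model.fit(heights, weights)

print(model.coef_[0])    # the slope m
print(model.intercept_)  # the y-intercept b

# Predict the weight of a person who is 175 cm tall.
print(model.predict([[175]]))
```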
What is a pandas Series?
- Series:
A pandas Series is like a column in a table; it is a one-dimensional array holding data of any type.
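For example:

```python
import pandas as pd

# A one-dimensional array of values with an index.
s = pd.Series([1.75, 1.68, 1.82], name="height")
print(s)

# A DataFrame column is itself a Series.
df = pd.DataFrame({"height": [1.75, 1.68, 1.82]})
print(type(df["height"]))  # <class 'pandas.core.series.Series'>
```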
What is the purpose of the intercept in data mining?
- The intercept is the constant term b in the regression equation y = mx + b: the predicted value of the dependent variable when the independent variable is zero. It anchors the regression line; without it, the line would be forced through the origin, which would distort the slope for most datasets. In sklearn it is available as model.intercept_ after fitting, as in the sketch above.
What is the classification method?
- Classification:
This is a supervised machine learning method where the model tries to predict the correct label of a given input data. The model gets fully trained using the training data, and then it is evaluated on test data before being used to perform prediction on new unseen data. These two datasets (training and test) are kept separate during the training process. The content of the test data set should not be included in the training process.
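A minimal sketch of the workflow (the iris dataset and the k-nearest neighbors classifier are chosen here for illustration, not taken from the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Keep training and test data separate: the model never sees the
# test set during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier()
model.fit(X_train, y_train)          # train on the training data only
print(model.score(X_test, y_test))   # evaluate on unseen test data
```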
What is mean imputation?
- Mean imputation:
When you replace a missing observation with the mean value of its column. The mean acts like a default value: we take the mean (average) of the other entries in the column. Imputation means replacing missing values.
- Here's a step-by-step explanation (see the sketch after this list):
- Identify missing values: find the entries in your dataset that contain missing values.
- Calculate the mean: for each variable or column with missing values, calculate the mean of the available data points in that column.
- Replace missing values: substitute the missing values in each column with the mean calculated for that column.
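A minimal sketch in pandas (the temperature values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing temperature reading.
df = pd.DataFrame({"temperature": [21.0, 23.0, np.nan, 22.0]})

# Replace the missing value with the mean of the available entries.
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())
print(df)  # the NaN is now 22.0, the mean of 21.0, 23.0 and 22.0
```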
When should you use the mean and when the median?
There is no single rule that always tells you to use one or the other; it depends on the data.
If you have a few rows of data with big differences, so that the mean and the median are very different from each other, this could give us distorted results.
Symmetry of the Data:
- Use the mean when the data is approximately symmetrically distributed and does not have extreme outliers.
- Use the median when the data is skewed or contains outliers. The median is less affected by extreme values.
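A quick illustration (the income values are made up):

```python
import pandas as pd

# Made-up incomes with one extreme outlier.
incomes = pd.Series([30_000, 32_000, 35_000, 31_000, 1_000_000])

print(incomes.mean())    # 225600.0 -- dragged up by the outlier
print(incomes.median())  # 32000.0  -- barely affected by the outlier
```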
Terminology: describe the terms sample, feature, label and model.
- Sample: some incoming data that will be analysed, for example a JPEG picture.
- Feature: some kind of quantifiable data from the sample, in the JPEG picture example this could be colour, height, width, pixel data, etc.
- Label: some useful information about the sample that we wish to categorise, e.g. looking at a picture we can tell whether it shows a person, a cat, a dog, etc.
- Model: the output of some learning algorithm. Machine learning programs start out with uninitialized parameters; they are blank. As the algorithm learns, it adjusts these parameters until the model starts giving us the predictions we want (the desired output).