Lectures (ALLA föreläsningar) Flashcards

I wrote the questions myself.

1
Q

Explain the 4 V's that characterise big data:

A
  • Volume: the size of the data.
  • Velocity: the speed at which the data is being created. High velocity can result in a high volume of data.
  • Variety: the type of data. How do we combine data from different sources to process it? For example, say we have two different sources collecting weather data such as temperature from around the world; some work with Celsius and some work with Fahrenheit.
  • Veracity: the accuracy of the data. How do we deal with accuracy when we're combining data? If we collect data regarding temperature, there might be one or two sensors that are faulty and give us wrong numbers.
2
Q

What does data science typically involve?

A
  • Exploration:
    Identifying patterns in information. For example, collecting data on the prices in a supermarket: if you collect data for a year, you might have enough to show that the prices change each day, and you can use it as evidence. Uses visualisations.
  • Inference:
    Quantifying whether the patterns that have been identified are reliable. Uses randomization, which considers what would have happened under all possible random assignments, not just the one that happened to be selected for the experiment. This reduces bias in sample data.
  • Prediction:
    Making informed guesses with confidence, i.e. reliably making an educated guess. Uses the data that we gathered before to make informed guesses, which can be done with several different techniques. Uses machine learning.
3
Q

What is Causality?

A

Why one thing impacts another: cause and effect. Studied by thinking of the data in terms of an experiment.

4
Q

What does Association mean?

A
  • Association:
    Identifying and observing an effect. Example: Is there any relation between chocolate consumption and heart disease?
5
Q

What types of groups have been discussed in the course when it comes to making comparisons?

A

When making comparisons you have a treatment/target group and a control/reference group (those who don’t receive the treatment).

6
Q

Explain User DNA

A

User DNA is created when you link together click data, social media (like LinkedIn for jobs), advertisements, online shopping and Google searches. The data collected from, for example, Google is combined with data from other services, so we can do even more with the data.
This is a specific characteristic of big data and enables us to understand the users even better.
Once you've collected the data and want to combine it, what problems might you face?
The V's come in here: a dataset combined with other data is a volume problem, and there might be data coming in from somewhere else, which is a velocity problem.

Can we trust the accuracy of the data (veracity)? We might have to do some data quality testing.

7
Q

How do we load data into a dataframe?

A

To load the data into a DataFrame object we use pandas and store it in a variable. When we inspect the variable we get the full table with all the data. To inspect the column names, we use the columns attribute.
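A minimal sketch (the file name and its contents are hypothetical):

```
import pandas as pd

# Load a CSV file into a DataFrame object and store it in a variable
dataset = pd.read_csv("data.csv")

# Inspecting the variable shows the full table;
# the columns attribute lists the column names
print(dataset)
print(dataset.columns)
```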

8
Q

What do we use head(), shape and describe() for?

A
  • The head() function can be used to view the first rows of a dataset (five by default).
  • shape is an attribute (not a function) that gives us the number of rows and columns of the dataset, e.g. dataset.shape.
  • The describe() function gives us summary statistics of the dataset.
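A quick sketch of all three (hypothetical file):

```
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file

print(df.head())      # first 5 rows by default
print(df.shape)       # (rows, columns) tuple; an attribute, no parentheses
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
```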
9
Q

To get general information about the data set, such as how many values are not empty, what function can we use?

A

To get general information about the data set, such as how many values are not empty, use the info() function. With dataset.info() we get info about the datatypes, such as object, int, float…
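For example (hypothetical file):

```
import pandas as pd

dataset = pd.read_csv("data.csv")  # hypothetical file
dataset.info()  # per column: non-null count and dtype (object, int64, float64, ...)
```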

10
Q

What is Big data?

A
  • Big data:

Has to do with computing hardware, data storage and data collection. When we handle a very large amount of data, we refer to it as big data. This enables us to look beyond the data in our own business and try to combine it with other types of data. Big data has had a massive impact on the whole industry, and a lot of companies and applications work solely with it.

11
Q

How is big data any different from regular data?

A

The data in Big data isn’t just data that comes in nice tables, it’s a mix of different kinds of structures of data. Big data is a mix of structured, semi-structured and unstructured data:

  • Structured data includes tables with columns and meaningful rows of records.
  • Semi-structured data can be data from web sources, where there's some structure and there's a way of extracting the data if you want to.
  • Unstructured data (misleading, since there's always some structure to data) are things like images, videos and audio. So big data is about how we can try to solve business problems by combining these different types of data.
12
Q

Big data is usually characterised by the 4 V's. Name and describe them.

A
  • Volume:
    is the size of the data we're trying to handle; the amount of data is unprecedented (unheard of). The amount is always changing, so it's difficult to say how big "big" is.
  • Velocity:
    is the speed at which the data is being created; we characterise the rate at which new data is created by a computer system. Think of how Amazon logs every single click a user makes on the website; if you scale that up to a million users, that is a lot of data. This can vary and result in a high volume of data.
  • Variety:
    has to do with different types of data; there's a variety of data formats, sources and systems. This brings a whole lot of problems: how do we combine data from different sources to process it? For example, say we have two different sources collecting weather data such as temperature from around the world, and some work with Celsius while others work with Fahrenheit. There are different kinds of scales, so how do we deal with them when they report completely different numbers?
  • Veracity:
    is about accuracy. If we collect data regarding temperature, there might be one or two sensors that are faulty and give us wrong numbers. So how do we deal with accuracy when we combine all the data? Maybe it doesn't matter if only two sensors report wrong numbers?
13
Q

What is CRISP-DM (CRoss-Industry Standard Process for Data Mining)?

A

An open standard which can be freely used. Modelled as an ongoing, iterative cycle as follows:
1. Business Understanding

  • Determine business objectives (background, goals, criteria).
  • Assess the situation (resources, requirements, risks).
  • Determine data mining goals.
  • Produce a project plan (assessment of tools and techniques).

2. Data Understanding

  • Collect initial data (sample data, integration).
  • Describe the data (types, quantities, properties).
  • Explore the data (initial analysis, statistics, visualisation).
  • Verify data quality (outliers, corrupted data, missing data).

3. Data Preparation

  • Select data (which data and why).
  • Clean data (handle missing data).
  • Integrate data (merge from different sources).
  • Format data (according to requirements).

4. Modeling

  • Select modelling techniques (depending on the goal).
  • Decision tree modelling (classification), k-nearest neighbour (for clustering).
  • Generate a test design (how to test the result).
  • Build the model.

5. Evaluation

  • Evaluate results (against the success criteria).
  • Review the process (was anything missed? did anything fail? any problems?).
  • Determine the next steps.

6. Deployment

  • Plan deployment (strategy).
  • Plan monitoring and maintenance (e.g. changed requirements).
  • Produce a final report (documentation).
  • Review the project (what went well/badly?).
14
Q

What is information and what is data?

A
  • Data:
    Raw facts, subjective facts; no meaning, no context, we haven't interpreted it yet.
  • Information:
    Think of data that has been transformed into something more useful. Maybe we've asked questions: what does the data concern, what was measured, where was it collected from? Adding more contextual data around the raw data.
15
Q

Describe what data is.

A

Data refers to raw facts, information, or observations that are typically collected, stored, and analyzed for a specific purpose. It can take various forms, including numbers, text, images, audio, video, and more. Data is the foundation of information, knowledge, and decision-making processes.

16
Q

Attributes can be categorised based on the mathematical operations they support. What are the two categories of mathematical operations?

A
  • Qualitative
    Distinctiveness: =, ≠
    Order: <, <=, >, >=
  • Quantitative
    Addition/subtraction: +, -
    Multiplication/division: *, /
17
Q

Describe the nominal (qualitative) scale.

A

Giving something a name (labelling things). Nominal scales: categorical data for grouping data objects.

  • For example, we might label the colour of someone's hair as black, blonde or grey. We can say something about distinctiveness: black hair is not the same as blonde hair. But we can't say anything about order, e.g. that black hair is heavier than blonde hair or that brown hair is better than black hair. We are just saying that they differ, not how they differ. Distinct, can be counted, like frequency. Binary is a special case of nominal scale data with only two possible categories, e.g. yes or no, true or false.
18
Q

Describe the ordinal (qualitative) scale.

A

Ordered data with meaningful ranking, but distances are not necessarily uniform. We can say something about the order but not about HOW different they are. E.g. grades, opinion data.
Distinct and ordered, so order and count-based operations can be used, in addition to those for the nominal scale. Operations like rank order, median, percentiles, rank correlation.

19
Q

What is data exhaust?

A
  • Data exhaust:

Data exhaust is the trail of activity, or residual data, left behind by some other business or computing process: mobile phone data (e.g. calls and locations), financial data (e.g. transactions), residuals of Internet users' activity (e.g. online searches, server access logs), and administrative data (e.g. organisational transactions, record keeping). You could do data mining on how many phone calls someone makes every month to guess their financial status; if you make many calls, perhaps you are better off financially?

20
Q

What is linear regression? (give the formula)

A
  • Linear Regression:

LinearRegression is a built-in model from the sklearn Python package, with which we build a linear regression model. With linear regression we predict the value of one variable based on the value of another variable. In the first example of the lab, we predict the variable weight based on the variable height. In the lecture we predict someone's debt based on their income.

The linear regression equation is of the form: y=mx+b

Where:

  • y is the dependent variable (e.g. examination score).
  • x is the independent variable (e.g. hours studied).
  • m is the slope.
  • b is the y-intercept.
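A minimal sketch with sklearn (the hours/score numbers are made up):

```
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours studied (x) vs. examination score (y)
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 58, 63, 71, 75])

model = LinearRegression()
model.fit(x, y)

print(model.coef_[0])        # slope m
print(model.intercept_)      # y-intercept b
print(model.predict([[6]]))  # predicted score after 6 hours of study
```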
21
Q

What is a pandas Series?

A
  • Series:
A pandas Series is like a column in a table: a one-dimensional array holding data of any type.
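For example:

```
import pandas as pd

# A one-dimensional labelled array, like a single column in a table
s = pd.Series([7, 13, 42], name="example")
print(s)        # the values together with their index
print(s.max())  # a Series supports the usual operations
```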
22
Q

What is the purpose of the intercept in data mining?

A
  • In a linear model such as y = mx + b, the intercept (b) is the predicted value of y when x is 0. It anchors the regression line at a baseline level, so the model is not forced through the origin.
23
Q

What is the classification method?

A
  • Classification:
    This is a supervised machine learning method where the model tries to predict the correct label of a given input data. The model gets fully trained using the training data, and then it is evaluated on test data before being used to perform prediction on new unseen data. These two datasets (training and test) are kept separate during the training process. The content of the test data set should not be included in the training process.
24
Q

What is Mean imputation?

A
  • Mean imputation:

When you replace the missing observations with the mean value of the column. The mean acts like a default value: we take the mean of the other entries in the column. Imputation is when we replace the missing values.

Here's a step-by-step explanation:

  1. Identify Missing Values: Identify the entries in your dataset that contain missing values.
  2. Calculate Mean: For each variable or column with missing values, calculate the mean of the available data points in that column.
  3. Replace Missing Values: Substitute the missing values in each column with the mean value calculated for that column.
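A small sketch of the three steps with pandas (hypothetical column):

```
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [170, np.nan, 182, 165]})

# 1. Identify missing values
print(df["height"].isna().sum())  # 1 missing entry

# 2-3. Calculate the column mean and substitute it for the missing values
df["height"] = df["height"].fillna(df["height"].mean())
print(df)
```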
25
Q

When should you use the mean or the median?

A

There is no single rule for when you should use one or the other.
If you have a few rows of data with big differences, so that the mean and the median differ a lot from each other, the choice could potentially give us a distorted picture of the data.

Symmetry of the Data:

  • Use the mean when the data is approximately symmetrically distributed and does not have extreme outliers.
  • Use the median when the data is skewed or contains outliers. The median is less affected by extreme values.
26
Q

Terminology: describe the terms
Sample, Feature, Label and Model.

A
  • Sample: some incoming data that will be analysed, for example a JPEG picture.
  • Feature: some kind of quantifiable data from the sample, in the JPEG picture example this could be colour, height, width, pixel data, etc.
  • Label: some useful information about the sample that we wish to categorise, e.g. looking at a picture we can tell whether it is of a person, cat, dog, etc.
  • Model: the output of some learning algorithm. Machine learning programs start out as uninitialized parametric spaces; they are blank. As the algorithm learns, it adjusts these blank parameters until the model starts giving us the predictions we want (the desired output).
27
Q

What is Supervised machine learning?

A
  • Includes the target outcome.
  • Trained to recognize input patterns that lead to a certain outcome based on examples i.e. historical data.
  • The algorithm is trained on already labeled datasets, meaning that the input data is paired with corresponding output labels.
  • Example: credit evaluation, where we use customers from the past and figure out how to label them as good or bad customers.
28
Q

What is unsupervised machine learning?

A

Unsupervised:

  • No known outcomes (it could be that we don't know what the outcome should be, other than winning at tic-tac-toe).
  • Learns to recognize patterns in the form of similarities.
  • The algorithm is given data without explicit instructions on what to do with it.

Example: customer segmentation, meaning you divide the customers into groups based on common characteristics. These can be grouped by, for example, purchase behaviour, and by using the algorithm we get an output with several groups where the individuals have things in common with each other. This is done by using K-means or another unsupervised ML method.

29
Q

What is reinforced machine learning?

A

Reinforcement learning:

  • No specific target given, will explore many solutions (loops) to find the best reward, based upon feedback (rewards or punishments) from the environment.
  • Simply: a lot of trial and error involved in this method.
  • It’s based upon agents, states, actions, goals, rewards, environment.
  • Analogous/comparable to playing a game many times; you end up learning from interaction.
  • Example: self-driving cars or game-playing computers.
30
Q

How to do supervised learning?

A

Learning by example, i.e. from historical data. You need a training dataset that describes training examples. Learning algorithms analyse the training data and produce a predictor function that can be used for mapping new examples to outputs.

31
Q

How do we know when to stop training the algorithm?

A

When we get our training data set we separate some data to use as test data, to test the accuracy of the algorithm.
We can then calculate an accuracy score based on the test data.

Example:
1. Define specific performance metrics (e.g., accuracy, precision, recall) that are relevant to the problem.
2. Stop training when these metrics reach a satisfactory level or plateau.
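A sketch of the split-and-score idea with sklearn (a built-in dataset stands in for the course data):

```
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Separate some data to use as test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = SVC().fit(X_train, y_train)

# Accuracy score calculated on the held-out test data
print(accuracy_score(y_test, model.predict(X_test)))
```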

32
Q

How do we calculate Accuracy?

A

Accuracy = number of correct classifications / total number of test cases
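A worked example (the numbers are made up):

```
from sklearn.metrics import accuracy_score

# 90 correct classifications out of 100 test cases
accuracy = 90 / 100  # 0.9

# The same idea with sklearn, given true and predicted labels
print(accuracy_score([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 correct -> 0.75
```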

33
Q

How to do unsupervised learning?

A

Learning from data without examples. So there are no target outputs, but the unsupervised learner has to interpret the outputs. Learning algorithms can use various strategies to identify commonalities in data and react to the absence of these commonalities. Useful when we don’t know what we’re looking for.

  • Example: K-means (Clustering).
34
Q

Describe classification.

A
  • Classification:
    Identifying a class into which a sample fits. You look at some attributes about an object and decide how to label it (classify it). This is a key part of AI. It’s also deeply useful for making sense of big data.
35
Q

What is a cluster analysis?

A

With cluster analysis we group objects by some selected attributes so that each object is similar to the other objects in the cluster and different from objects in all other clusters. This is used for classification, simplifying data and identifying relationships.

  • There are many different methods such as k-means, k-nearest-neighbour, mean-shift, DBSCAN..
36
Q

Describe ANN(Artificial Neural Networks).

A

Artificial Neural Networks (ANNs) are computational models inspired by the structure and functioning of the human brain. They are a subset of machine learning algorithms designed to recognize patterns, make predictions, and perform tasks that require learning from data.

37
Q

Why is the initialization of weights in a neuron crucial for learning in perceptrons, and how does the adjustment of weights contribute to the learning process?

A
  • The initialization of weights in a neuron is crucial for learning in perceptrons because, initially, the network “knows” nothing about the relationships in the data.
  • Weights in a neural network represent the strengths of connections between neurons and are essential for making accurate predictions.
  • During the learning process, the network adjusts these weights based on the success or failure of its predictions.
  • Increasing weights makes the output more active, while decreasing weights makes it more inactive.
  • The adjustment of weights is a fundamental mechanism through which the network aligns its outputs with the expected/desired outputs, ultimately improving its ability to learn and generalize from the data.
38
Q

What are the advantages and disadvantages of ANN?

A

Advantages of ANN:

  • Can learn complex patterns, the more layers, the more complex.
  • Can handle non-linear data.
  • Can handle redundant attributes, learns importance through weighting.

Disadvantages of ANN:

  • Very susceptible to overfitting, the more layers, the higher the risk.
  • Missing values must be imputed or the corresponding records removed.
  • Sensitive to noise (they can easily be influenced or disrupted by random variations or irrelevant information in the input data).
  • Increased complexity of network requires more data and increased training time.
39
Q

Describe Natural Language Processing (NLP).

A
  • Natural Language Processing (NLP):
Is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. Basically, we use computational techniques to extract insights and understanding from large corpora of natural language text.
40
Q

How can Density (number of non-empty cells / total number of cells) be used to validate data?

A

Density provides information about how concentrated the data points are. This helps to understand the distribution/spread of data.

  • High density indicates that data points are closely clustered together, they are concentrated.
  • Low density tells us that the data points are more spread out.

In a table, the density could indicate how many non-empty values there are.
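A sketch of the table interpretation with pandas (hypothetical table):

```
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan]})

# Density = number of non-empty cells / total number of cells
density = df.notna().sum().sum() / df.size
print(density)  # 4 / 6 ≈ 0.67
```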

41
Q

Describe linear regression:

A
  • Linear regression: fitting a line to describe the relationship between variables. A good line gives us good predictions.
42
Q

How can we find a better model/make a better linear regression line?

A

We can change the slope to create a new line and then compare it to the one we already have. This is the basics of machine learning: we have a model, we measure how well it works (by measuring the error), and then we create a new model and compare it to the existing one.
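A minimal sketch of that loop (made-up data; lines through the origin for simplicity):

```
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])  # hypothetical observations, roughly y = 2x

def mse(slope):
    # Measure how well a candidate line works via its mean squared error
    return np.mean((y - slope * x) ** 2)

# Compare the current model against one with a new slope; keep the better line
print(mse(1.5), mse(2.0))
```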

43
Q

Describe Support Vector Machines(SVM) and how it can be useful for regression:

A

If we have two lines that give us good accuracy, we could say that they're equivalent; they both give us good results. But we can instead use support vector machines (SVM), which try to maximise the margin between classes.

  • SVM for regression:

A linear regression that sets the margin (slack) so as to consider as many, but not all, data points. It is like linear regression, but we can fit reasonably to noisy data by ignoring some data points outside the margins. In regression, the lines lie on top of the data points.

44
Q

What is a Decision tree in machine learning?

A

Decision trees learn multiple rules on how to split the data by analysing the attributes of a dataset. A decision tree is a tree-like model that makes decisions based on the features of the input data.
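A minimal sketch with sklearn (a built-in dataset stands in for real data):

```
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The tree learns if/else split rules from the attributes of the dataset
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict(X[:5]))  # predicted classes for the first five samples
```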

45
Q

What is stemming?

A

Stemming is a text normalization technique in natural language processing that involves reducing words to their root or base form. It aims to remove affixes (prefixes, suffixes) from words, leaving only the core meaning.

  • If used right it can reduce data volume and data velocity.
  • If used wrong it can lead to removing relevant information.
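For example, with NLTK's PorterStemmer (assumes the nltk package is installed):

```
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cheering", "ruined", "studies"]:
    # Reduce each word towards its root form, e.g. "cheering" -> "cheer"
    print(word, "->", stemmer.stem(word))
```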
46
Q

What is the oracle problem?

A

The problem arises when we don't know the correct answer, e.g. the travelling salesman problem: how do we know we've found the optimal route?

47
Q

What are the Ethical dilemmas of AI?
(potential essay question)

A

Ethical Dilemmas of AI:

  • What's “right” and “wrong”?
    We don't always agree on the “right answer” or the right choice. The decision is not up to the autonomous system; it has to be left to a human. Who do we blame if something goes wrong?
  • Bias and Fairness:
    Example: If a facial recognition system is trained predominantly on data from a specific demographic, it may exhibit bias and inaccuracies when identifying individuals from underrepresented groups, leading to unfair treatment.
  • Privacy Concerns:
    Example: Smart devices and AI-driven systems collecting and processing personal data may raise concerns about user privacy, especially if the information is used without explicit consent or is vulnerable to hacking.
  • Transparency and Explainability:
    Example: Complex AI models, such as deep neural networks, often operate as “black boxes,” making it challenging to explain their decision-making processes. This lack of transparency raises questions about accountability and trust.
  • Job Displacement and Economic Impact:
    Example: Automation and AI-driven technologies replacing certain jobs can lead to unemployment and economic disparities. Addressing the ethical dilemma involves finding ways to reskill and support affected workers.
  • Autonomous Systems and Decision-Making:
    Example: Autonomous vehicles making split-second decisions in critical situations may pose ethical challenges. For instance, deciding between prioritizing the safety of the vehicle’s occupants or pedestrians raises moral dilemmas.
  • Accountability and Liability:
    Example: Determining responsibility for AI-driven actions, especially in scenarios where decisions lead to unintended consequences, raises questions about legal and ethical accountability.
  • Data Handling and Consent:
    Example: Collection and use of personal data without clear consent or transparent privacy policies can lead to ethical concerns. This is particularly relevant in AI applications that heavily rely on extensive datasets.
48
Q

What are the 6 steps of CRISP-DM (Cross-Industry Standard Process for Data Mining)?

A
  • Step 1: Business understanding, identify the key processes, goals and actors, what does the business need?
  • Step 2: Data understanding, what data do we have, does it give insight about the business model, give context.. How clean is the data, visualize the data..
  • Step 3: Data preparation, how do we organize the data for modeling? Select data, clean data, integrate data, format data (convert values to other data types)..
  • Step 4: Modeling, choosing what modeling technique to use (regression, etc), developing, training, testing.
  • Step 5: Evaluation, which model best meets the demands? Creating reports, writing documentation for maintenance and modification.
  • Step 6: Deployment, planning and performing the deployment (distribution) of the model into the business. How do stakeholders access the results?
49
Q

Big data is a very large amount of data that is a mix of different kinds of structures of data. What are they?

A
  • Structured: tables w/ columns w/ meaningful rows of records.
  • Semi-structured: data from web sources, some structure, a way of extracting data.
  • Unstructured: things like images, videos, audio..
50
Q

Explain User DNA and how it relates to Big data.

A

You link together click data, social media (like LinkedIn for jobs), advertisements, online shopping and Google searches. The data collected from, for example, Google is combined with data from other services, so we can do even more with the data. This is a specific characteristic of big data and enables us to understand the users even better.

51
Q

Explain each layer of the DIKW pyramid.

A
  • Data:
    Raw facts, subjective facts; no meaning, no context, we haven't interpreted it yet.
  • Information:
    Think of data that has been transformed into something more useful. Maybe we've asked questions: what does the data concern, what was measured, where was it collected from? Adding more contextual data around the raw data.
  • Knowledge:
    Actionable information; noticing patterns in the data.
  • Wisdom:
    The idea that you have learned so much from the data that you can make decisions from it.
52
Q

Explain the method MSE:

A

The Mean Squared Error (MSE) is a metric used to measure the average squared difference between the predicted values and the actual values in a dataset. It provides a way to quantify the overall magnitude of errors and emphasizes larger errors more than smaller ones.

  • Formula: MSE = sum of squared errors / number of data points.

Here’s a short and simple explanation:

  • Calculate Squared Differences: For each data point, find the squared difference between the predicted value (from your model) and the actual value (from your dataset).
  • Average the Squared Differences: Sum up all the squared differences and then divide by the number of data points. This gives you the average squared difference, which is the MSE.
  • Interpretation: A lower MSE indicates that the model’s predictions are closer to the actual values, while a higher MSE suggests larger discrepancies between predictions and actual outcomes.
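A worked example (made-up values):

```
import numpy as np

actual = np.array([3.0, 5.0, 8.0])     # real values from the dataset
predicted = np.array([2.5, 5.5, 7.0])  # values from the model

# Sum of squared errors divided by the number of data points
mse = np.mean((actual - predicted) ** 2)
print(mse)  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```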
53
Q

Explain the method MAE:

A

The Mean Absolute Error (MAE) is essentially the average of the absolute differences between the predicted and actual values. It provides a straightforward way to quantify the average magnitude of errors without considering their direction.

  • Formula: MAE = sum of absolute errors / number of data points.

Here's a simple explanation of the components of the formula:

  • n (number of data points): the size of your dataset, i.e. how many observations or instances it contains.
  • ŷ (predicted value): the value predicted by your linear model equation. In the context of regression analysis, it's the value the model estimates for a given input.
  • y (real value): the actual value from your dataset. It represents the true outcome or target value associated with a particular input.
  • Absolute value (|…|): the vertical bars give the distance between two values without considering their direction; if the result is negative, taking the absolute value makes it positive.
  • Mean (average): the formula calculates the average absolute difference over all data points.
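The same made-up values as in the MSE example:

```
import numpy as np

actual = np.array([3.0, 5.0, 8.0])
predicted = np.array([2.5, 5.5, 7.0])

# Average of the absolute differences |y - y_hat|
mae = np.mean(np.abs(actual - predicted))
print(mae)  # (0.5 + 0.5 + 1.0) / 3 ≈ 0.67
```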
54
Q

Explain clustering. What are its potential strengths and weaknesses?

A

Clustering is a technique in data mining that involves grouping similar data points together based on certain characteristics, with the goal of discovering inherent patterns, structures, or relationships within the data. The primary objective is to create clusters or groups in such a way that data points within the same cluster are more similar to each other than to those in other clusters.

Strengths:

  • Pattern Discovery: Identifies natural groupings and patterns.
  • Unsupervised Learning: Doesn’t require labeled data.
  • Anomaly Detection: Highlights outliers.
  • Data Reduction: Reduces dimensionality.

Weaknesses:

  • Sensitivity: Results can vary with initial conditions.
  • Subjectivity: Interpretation may be subjective.
  • Scalability: May struggle with large datasets.
  • Similarity Assumption: Results depend on chosen similarity metrics.
  • Handling Noise: Sensitive to noise and outliers.
55
Q

Explain K-means (clustering). What are the benefits and downside of using it?

A

K-means iteratively works towards finding the optimal clusters. It's a bit like randomly drawing a line through our data, except that we instead place two imaginary points at random.

Advantages:

  • Computationally efficient compared with hierarchical clustering (if k is small).
  • Simple and well-known method.
  • Can be used for a wide range of data types.

Disadvantages:

  • The choice of K is important.
  • Can produce empty clusters, e.g. if you choose a K larger than the number of data points or a poor initial position for the centroids. Can be solved by forcing the initial centroids to be actual data points.
  • Problems detecting certain types of clusters, e.g. clusters with unusual shapes.
  • Sensitive to outliers when using SSE; requires preprocessing.
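A minimal sketch with sklearn (the points are made up to form two obvious groups):

```
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [1, 0.5],
                   [8, 8], [8.5, 9], [9, 8]])

# The choice of K matters; here K=2 matches the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment per point
print(kmeans.cluster_centers_)  # the two centroids
```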
56
Q

How do you calculate Recall, and what does the measurement indicate?

A

Recall is a metric used in binary classification to measure the ability of a model to correctly identify all relevant instances, specifically the actual positives. It answers the question: “Of all the actual positives, how many did the model correctly predict?”

  • Formula: Recall = True Positives / (True Positives + False Negatives)
  • True Positives (TP): Instances correctly predicted as positive (correctly identified counterfeit bills).
  • False Negatives (FN): Instances incorrectly predicted as negative (counterfeit bills that were missed).
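A worked example with made-up counts from the counterfeit-bill scenario:

```
# 40 counterfeits correctly caught (TP), 10 counterfeits missed (FN)
tp, fn = 40, 10
recall = tp / (tp + fn)
print(recall)  # 0.8 -> the model finds 80% of all actual positives
```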
57
Q

How do you calculate Precision, and what does the measurement indicate?

A

Precision is the percentage of predicted positives (predicted counterfeit bills) that are actual positives (counterfeit bills).

  • Formula: Precision = True Positives / (True Positives + False Positives).
  • True Positives (TP): Instances correctly predicted as positive (correctly identified counterfeit bills).
  • False Positives (FP): Instances incorrectly predicted as positive (incorrectly identified non-counterfeit bills as counterfeit).
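Both metrics via sklearn, on made-up labels (1 = counterfeit):

```
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 0, 1, 0, 0]  # actual labels
y_pred = [1, 0, 0, 1, 1, 0]  # model predictions

print(precision_score(y_true, y_pred))  # TP=2, FP=1 -> 2/3
print(recall_score(y_true, y_pred))     # TP=2, FN=1 -> 2/3
```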
58
Q

What is SVC?

A

Support Vector Classification (SVC) is a type of machine learning algorithm used for classification tasks. The algorithm works by finding a hyperplane in a high-dimensional space that best separates the data points of one class from those of the other classes.

  • Example from lab:

```
from sklearn.svm import SVC

classifier = SVC()
classifier.fit(x_train.values, y_train.values)
```

classifier is an instance of the SVC model.
fit is used to train the model on the training data (x_train and y_train).

x_train represents the feature values of your training instances.

y_train represents the corresponding class labels for each training instance.

This trained model (classifier) can then be used to make predictions on new data or evaluate its performance on the test data.
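A hypothetical continuation of the lab snippet, assuming x_test and y_test exist:

```
# Make predictions on the held-out test data and check the hit rate
predictions = classifier.predict(x_test.values)
print((predictions == y_test.values).mean())  # fraction of correct predictions
```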

59
Q

Discuss Data ethics related to data mining and machine learning.
(essay question)

A

Data ethics in the context of data mining and machine learning involves the responsible and ethical handling of data throughout the entire data lifecycle, from collection to processing and analysis. Here are key considerations and discussions related to data ethics:

Privacy Protection:

  • Challenge: The collection and analysis of large datasets can potentially lead to the identification of individuals, raising concerns about privacy.
  • Ethical Principle: Anonymizing and de-identifying data, obtaining informed consent, and ensuring compliance with privacy regulations (e.g., GDPR, HIPAA) are essential for protecting individuals’ privacy.

Bias and Fairness:

  • Challenge: Biases in data can result in unfair or discriminatory outcomes, especially when the data used for training machine learning models reflects existing societal biases.
  • Ethical Principle: Striving for fairness and equity in algorithmic decision-making by identifying and mitigating biases, promoting diversity in data sources, and transparently communicating about potential biases.

Accountability and Responsibility:

  • Challenge: Determining accountability for decisions made by machine learning models can be challenging, especially when multiple stakeholders are involved.
  • Ethical Principle: Defining clear lines of responsibility, ensuring accountability for the impact of algorithmic decisions, and establishing mechanisms for addressing unintended consequences.
60
Q

Explain KDD (Knowledge Discovery in Databases).

A

The KDD process is an umbrella term describing the transformation from raw data to useful information, similar to CRISP-DM. The process consists of:

  • Step 1: Data Selection:
    From various sources, you select which data to use. => Target data
  • Step 2: Preprocessing:
    You clean and organise the data by, among other things, removing errors and outliers and integrating different datasets. Preprocessing is the most time-consuming part. => Processed data
  • Step 3: Transformation:
    You transform the data into the format required by the specific data mining method you are going to use. => Transformed data
  • Step 4: Data Mining:
    The actual data mining method is applied and the data is mined. => Patterns
  • Step 5: Interpretation and analysis:
    You interpret and analyse the results obtained from the data mining. => Knowledge
61
Q

What is Euclidean distance?

A

Euclidean distance: also known as straight-line distance or Euclidean norm, is a measure of the straight-line distance between two points in Euclidean space, representing the length of the shortest path between them.

Formula:

  • d = sqrt((x1 - x2)^2 + (y1 - y2)^2)
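A worked example (made-up points):

```
import math

p1, p2 = (1, 2), (4, 6)
d = math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)
print(d)  # 5.0 (a 3-4-5 triangle)
```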
62
Q

What is Manhattan distance?

A

Manhattan distance: also known as L1 distance or taxicab distance, is a measure of the distance between two points in a grid-based system (like a city grid) calculated along the grid lines. It is named “Manhattan distance” because it resembles the way the streets in Manhattan are arranged in a grid pattern.

Formula:

  • d = |x1 - x2| + |y1 - y2|
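The same made-up points as in the Euclidean example:

```
p1, p2 = (1, 2), (4, 6)
d = abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])
print(d)  # 3 + 4 = 7
```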
63
Q

Explain Recall and give formula.

A
  • Recall (Sensitivity or True Positive Rate):

Recall is the percentage of actual positives (relevant instances) that were correctly predicted as positives by the model. It measures the ability of the model to capture all the relevant instances.

Formula: Recall = TP / (TP + FN)

64
Q

Explain Precision and give formula.

A
  • Precision (Positive Predictive Value):

Precision is the percentage of predicted positives (instances predicted as positive by the model) that are actually positives (relevant instances). It measures the accuracy of the positive predictions made by the model.

Formula: Precision = TP / (TP + FP)

65
Q

Is the sentiment analysis task predictive or descriptive?

A

Think of Twitter analysis.
Sentiment analysis is usually regarded as a predictive task. The goal of sentiment analysis is to predict the sentiment or emotional tone expressed in a text. The sentiment can be positive, negative, neutral, or even categorised into more specific emotions.

Descriptive tasks, on the other hand, involve summarising or describing existing data without making predictions about future outcomes. Sentiment analysis focuses on predicting the sentiment of a text based on the words and expressions used, which enables classification into a sentiment category.

66
Q

What potential problems can you identify with taking the rule-based approach described for the sentiment analyser?

A

Several problems can arise when we use a rule-based technique like this to classify our data as positive or negative.

One problem is that we must predefine rules, e.g. if "cheer" and "best" –> positive, or if "sorry" and "ruin" –> negative.

But what happens if all four of these words occur in one sentence? What do we do then? Since this is a classification task, a decision tree could be used to classify the data; the problem that can then arise is that the model has to ask the same question several times, producing a very complex decision tree. Another problem is deciding which question is the most "relevant" one to place at the top of the tree. Some words/questions can be irrelevant on their own and thus add nothing when we analyse the sentence.