Exam Flashcards

1
Q

What is data science?

A

Data science is an interdisciplinary field focused on extracting knowledge and insights from data. It combines computation, statistics, domain knowledge, and scientific methods. (Domain knowledge is the understanding of a specific industry, discipline, or activity.)

Findings in data science come from many different domains and are often used to drive business decisions.

2
Q

What is Exploration, Inference and Prediction in Data science?

A

Exploration
The initial phase, where the focus is on gaining a preliminary understanding of the dataset.
Identifying patterns in information.
Uses visualizations.

Inference
Using data analysis and statistics to draw conclusions from observed data. **Quantifying whether those patterns are reliable.**
Uses randomization.

Prediction
Making informed guesses.
Uses machine learning.

3
Q

Relations: what is Association?

A

“Any relation” == association
If phenomenon x has any relation to y there is an association.

4
Q

Relations: what is Causality?

A

“lead to” == Causality

If phenomenon x leads to y, there is a causal relationship. Correlation indicates a statistical relationship, but it does not necessarily imply causation. Causation implies a direct influence of one variable on another. Data science is very much about looking for cause and effect.

5
Q

What are the 4 V’s of big data

A

Big data is usually characterized by the 4 V’s:

Volume
The size or amount of data. It exceeds the processing capacity of conventional databases. The ability to handle and process large volumes of data efficiently is a fundamental aspect of big data analytics.

Velocity
The speed at which data is generated, collected and processed.

Variety
The diversity of data types and sources. Is the data structured, unstructured, or semi-structured? Solutions must be designed to handle this diversity.

Veracity
The quality and reliability of the data.

6
Q

what is data mining?

A

“Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis. Data mining techniques and tools enable enterprises to predict future trends and make more-informed business decisions.”

7
Q

KDD refers to extraction of implicit previously unknown and potentially useful information from data. What are the 5 steps of KDD?

A
  1. Data selection
    Selecting appropriate data from various sources.
  2. Data pre-processing
    Cleaning, removing errors, removing irrelevant data.
  3. Transformation
    Transforming the data into a format usable by the data mining method, for example through normalization. Normalization is a technique used in DM to transform the values of a dataset onto a common scale. If a dataset has multiple attributes with values on different scales, this may lead to poor data models when performing data mining operations.
  4. Data mining
    The actual application of the appropriate DM methods.
  5. Interpretation and analysis
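
To illustrate the normalization mentioned in the transformation step, here is a minimal Python sketch of min-max scaling. Min-max scaling is an assumption here (the card does not name a specific normalization method), and the function name `min_max_normalize` and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attributes on very different scales.
ages = [20, 30, 40]
incomes = [20000, 50000, 80000]

print(min_max_normalize(ages))     # [0.0, 0.5, 1.0]
print(min_max_normalize(incomes))  # [0.0, 0.5, 1.0]
```

After scaling, both attributes lie on the same scale, so neither dominates a distance-based method simply because of its units.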

8
Q

what does it mean if a task is predictive?

A

Predictive
Predicting the value of an attribute (the target variable) based on the values of other attributes. Methods for predictive tasks usually fall under the category of supervised learning.

9
Q

what does it mean if a task is Descriptive?

A

Descriptive
Detects patterns that summarize (describe) the underlying relationships in the data. Methods for descriptive tasks usually fall under the category of unsupervised learning.

10
Q

what is unsupervised learning?

A

Unsupervised learning explores patterns and relationships in unlabeled data. While primarily descriptive (it uncovers hidden structures), its findings can indirectly be used in predictive tasks. There is no explicit feedback on the correctness of predictions. Objective: to explore the inherent structure of the data, finding patterns, groupings, or relationships without relying on predefined output labels. Use cases include clustering and dimensionality reduction.

11
Q

what is supervised learning?

A

In supervised learning, the algorithm is trained on a labeled dataset, where each input is paired with the corresponding correct output or label. **The algorithm learns the mapping between inputs and outputs.** The algorithm receives feedback during training. Objective: to learn a mapping or relationship between inputs and outputs from the labeled training data, so that the learned model can make predictions on new, unseen data. Use cases include classification and regression (predicting a continuous output).

12
Q

What is CRISP-DM

A

CRoss-Industry Standard Process for Data Mining. An open standard that can be used freely, intended as a model for best practice.
The model outlines the stages involved in a typical data mining project and provides a structured framework for guiding organizations and data scientists through the data mining process. It is modeled as an ongoing, iterative cycle.

13
Q

what are the 6 steps of CRISP-DM

A
  1. Business Understanding:
    Understand the business objectives and goals of the DM project. Defining the problem, understanding the requirements and scope. Produce plan.
  2. Data Understanding
    Understanding the available data. Exploring the nature of the data, its relationships and potential issues. This includes initial data preprocessing. Verify data quality.
  3. Data Preparation
    Data preparation involves transforming the data to make it suitable for analysis. Merge data from different sources. Deal with missing data.
  4. Modeling
    Various DM techniques are applied to build and train models based on the dataset.
  5. Evaluation
    Assessing the performance of the models based on the business objectives. This point is closely tied to what we learned in 1. Business Understanding.
  6. Deployment
    The successful models are implemented into the operational environment. Plan monitoring and maintenance. Documentation.
14
Q

what is the DIKW pyramid?

A

“Refers loosely to a class of models for representing purported structural and/or functional relationships between data, information, knowledge, and wisdom. ‘Typically information is defined in terms of data, knowledge in terms of information, and wisdom in terms of knowledge.’” (Wikipedia)

15
Q

what is Ordinal (qualitative) data?

A

Ordinal attributes represent categories with a clear order or ranking, but the intervals between the categories may not be uniform or meaningful.

16
Q

what is Nominal (qualitative) data?

A

Nominal (qualitative)
Examples are labels or names, even when denoted by numerical values. Arithmetic operations are not applicable here. Can also be binary, e.g. TRUE / FALSE. When transforming the data, labels can be freely changed, e.g. green = 1, blue = 2.

17
Q

Datasets. Terminology:

Dimensionality

A

The number of attributes. Dimensionality reduction may occur during some processing steps.

18
Q

Datasets. Terminology:

Sparsity

A

Sparsity:
Asymmetric features -> Refers to the proportion of zero or empty values in a dataset. The more values that are missing, the higher the sparsity.

19
Q

Datasets. Terminology:

Resolution

A

Resolution:
Level of detail -> E.g. higher resolution could mean more decimal places, which enable more precise calculations.

20
Q

Datasets. Terminology:

Record Data

A

Record Data:
A dataset in which each record represents a distinct unit of information.

21
Q

Datasets. Terminology:

Web scraping

A

Web scraping: Web scraping allows us to programmatically extract data from public web pages. It usually provides semi-structured data. “Social listening” can potentially give early warnings of events that are not yet reported in the media, by scraping user-generated information off the internet.

22
Q

Datasets. Terminology:

Data exhaust:

A

Data exhaust: The trail of activity, or residual data, left behind by some other kind of business or computing process. Examples: transactions, calls, locations.

23
Q

Terminology
Document term matrix:

A

A matrix that represents the frequency of terms / words in a collection of documents.
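
As a rough illustration, the following Python sketch builds a small document-term matrix with one row per document and one column per term. The function name `document_term_matrix`, the whitespace tokenization, and the sample documents are assumptions made for this example.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["data science is fun", "data mining finds patterns in data"]
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```

Most entries are zero for realistic vocabularies, which is why document-term matrices are a typical example of sparse data.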

24
Q

what is
Machine learning:

A

Machine learning is about methods that can be used to improve the performance of an intelligent agent over time, based on stimuli (data) from the environment.

25
Q

what is the difference between supervised and unsupervised learning?

A

In supervised learning, the algorithm is trained on a labeled dataset, where each input has a corresponding output. The goal is to learn a mapping from inputs to outputs. In unsupervised learning, the algorithm is given unlabeled data and must find patterns or relationships within the data without explicit guidance on the output.

26
Q

what is Linear Regression (supervised learning):

A

Fitting a line to describe the relationship between variables. “The goal of linear regression is to find the best-fitting linear relationship that can be used for making predictions.”

Main idea: a linear model describes the data with a line. (If classes of points can be separated by a line, a linear model can also be used to classify data points.)
Linear regression itself is best suited for problems where the goal is to predict a continuous value.

27
Q

what is Support Vector Machines (supervised learning)?

A

An algorithm used for classification tasks. Like linear regression, an SVM draws a line, but here the line separates data objects from different groups. The line is called a decision boundary, and it is drawn so that the distance between the boundary and the nearest data point from either group is as large as possible.

28
Q

what is the difference between linear regression and SVM?

A

Linear regression predicts a continuous output, for example prices, while an SVM predicts a categorical label (is this email spam or not?). SVM is designed to find the optimal decision boundary to separate different classes.

29
Q

what is Clustering? (Unsupervised learning)

A

Grouping data points into clusters so that the points within a cluster usually have similar values.

30
Q

what is K-means clustering:

A

K-means clustering is an unsupervised learning algorithm that falls under descriptive modeling.

Iteratively works towards finding the optimal cluster centers (centroids) for a specified number of clusters / groups. Each data point belongs to the cluster defined by its closest centroid.
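
The iteration can be sketched in a few lines of Python. This is a simplified one-dimensional version under stated assumptions: the function name `k_means_1d`, the fixed number of iterations, and the hand-picked starting centroids are all hypothetical choices for illustration, not part of the card.

```python
def k_means_1d(points, centroids, iterations=10):
    """Simplified k-means on 1-D data with given starting centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1, 2, 3, 10, 11, 12]
centroids, clusters = k_means_1d(points, centroids=[0, 5])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

Real implementations usually pick starting centroids randomly and stop once the assignments no longer change.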

31
Q

what is DBSCAN?

A

Density-Based Spatial Clustering of Applications with Noise. Groups data points that are close together (dense regions) and treats points in sparse regions as noise.

32
Q

what is Hierarchical clustering?

A

Hierarchical clustering is a type of clustering algorithm used in unsupervised machine learning to group similar items into clusters. The term “hierarchical” is used because the algorithm creates a hierarchy of clusters. This clustering technique builds a tree-like structure of clusters, known as a dendrogram, which visually represents the relationships and similarities between different data points.

33
Q

what is Association rule mining (Unsupervised)?

A

A DM technique that identifies interesting relationships, patterns / associations among a set of items in large datasets. For example: which products are frequently purchased together?

34
Q

what is Support within association rule mining?

A

Support == A measure of how frequently a set of items appear in the dataset.

35
Q

what is Confidence in association rule mining?

A

Confidence
Confidence describes how often a rule holds. If the rule {Beef, Chicken} → {Apple} has a confidence of 33%, it means that when beef and chicken are bought together, there is a 33% chance that apples are also in the shopping cart.

36
Q

what is lift in association rule mining?

A

Lift gives us a metric of how good a rule is. If the lift is > 1, the rule is better than guessing. If the lift is about 1, the rule is no better than guessing, and a lift < 1 means the items appear together less often than would be expected by chance.
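
Support, confidence, and lift can all be computed by counting transactions. The Python sketch below uses the standard definitions; the function names and the `baskets` data are hypothetical, chosen so that the rule {beef, chicken} → {apple} gets the 33% confidence used as an example in the confidence card.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence of the rule relative to the consequent's overall support."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical shopping baskets.
baskets = [
    {"beef", "chicken", "apple"},
    {"beef", "chicken"},
    {"beef", "chicken", "milk"},
    {"apple", "milk"},
    {"milk"},
    {"apple"},
]

print(support(baskets, {"beef", "chicken"}))                # 0.5
print(confidence(baskets, {"beef", "chicken"}, {"apple"}))  # about 0.33
print(lift(baskets, {"beef", "chicken"}, {"apple"}))        # about 0.67
```

In this made-up data the lift is below 1: apples appear in half of all baskets but in only a third of the baskets containing beef and chicken.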

37
Q

Is regression or classification a predictive or descriptive task?

A

Predictive

38
Q

is Clustering or rules a descriptive or a predictive task?

A

Descriptive

39
Q

What is the difference between Ordinal and Nominal attributes?

A

Nominal Attributes:

Nominal attributes are categorical variables that represent different categories or groups with no inherent order or ranking among them.

Nominal data can be represented by labels or names, and mathematical operations like addition or subtraction are not meaningful in this context.

Ordinal Attributes:

Ordinal attributes, on the other hand, represent categories with a clear order or ranking, but the intervals between the categories may not be uniform or meaningful. While there is a meaningful order among the categories, the differences between them may not be consistent.

40
Q

What are the two types of quantitative attributes?

A

Interval:
The distance between each step is the same size but with no absolute zero. Zero is arbitrary. (0 Celsius is not absence of temperature)

Ratio:
Like interval, but zero is meaningful and indicates the absence of the property.

41
Q

What is cross-validation?

A

Cross-validation is a technique used to assess the performance of a predictive model. In a typical ML workflow the dataset is divided once into a training set and a test set. Cross-validation instead divides the data into several subsets and repeats the procedure, using each subset in turn as the test set while training on the rest.
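
A minimal Python sketch of how the subsets (folds) could be formed is shown below. It assumes plain k-fold cross-validation without shuffling; the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    No shuffling is done here; real workflows usually shuffle first.
    """
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(6, 3):
    print("train:", train, "test:", test)
```

Each sample ends up in the test set exactly once, and the model's scores on the k test folds are typically averaged.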

42
Q

What is the advantage of cross-validation?

A

The advantage of using cross-validation is a more reliable estimate of a model's performance. It provides a more accurate assessment of how well a model will do on unseen data. This helps detect “overfitting”, which is when a model does well on seen data and badly on unseen data.

43
Q

What is the training set?

A

The training set is used to train the model. During training the model learns patterns and relationships among the attributes. Predictions are made iteratively on the training data, and the difference between each prediction and the actual outcome is used to update the model's parameters (optimization). The training data is typically 70-80% of the dataset.

44
Q

what is test set?

A

The test set is used to evaluate the performance of the trained model on new, unseen data. It simulates how the model will perform when applied to real-world scenarios.

45
Q

What is Stratification?

A

The objective is to ensure that the proportion between classes is maintained correctly in different subsets of the data. This is done to prevent biases.

For example, in cross-validation each subset that is created aims to have the same proportional distribution of classes as the whole dataset. The same is true when splitting the data into training and test sets.

46
Q

What is re-weighing?

A

Re-Weighing adjusts the weights assigned to different instances / classes in a dataset. This is done to address class imbalances. This makes underrepresented classes more influential during learning and helps the model prioritize learning patterns from the minority class.

47
Q

What is SVR and what is it used for?

A

Support Vector Regression (SVR) extends the concept of SVM to predict continuous outcomes rather than handle classification problems. The objective of SVR is the same as that of linear regression.

48
Q

What are some of the advantages of using SVR instead of linear regression?

A

Linear regression assumes a straight-line relationship whereas SVR is more flexible and can capture non-linear relationships.

SVR is also less sensitive to outliers, data points that deviate significantly from the overall pattern.

49
Q

What are some of the advantages of using linear regression over SVR?

A

Computationally it is less expensive. If real-time prediction is needed in a large dataset Linear regression might be preferable.

Easy to implement and interpret. Works fine if the data is not too noisy, is truly linear, or has few features.

50
Q

How does identifying a control group and a treatment group help us establish causality?

A

The treatment group is a subset of individuals exposed to something whose effect we want to know: a new interface, a new drug, etc. The goal is to observe the impact of this “treatment”.

The control group serves as a comparison / baseline for the treatment group. It is a subset of individuals who are not exposed to the “treatment”.

Comparing the outcomes of the treatment group with those of the control group makes it possible to infer whether the treatment caused the difference.

51
Q

what is K-nearest neighbour?

A

An algorithm that can be used for both regression and classification.

52
Q

how does K-nearest neighbour work for classification?

A

In k-NN classification, the goal is to assign an object to a specific group or category. Imagine the object asking its nearby neighbours for advice on which group it belongs to. The “k” in k-NN represents how many neighbours it asks. The object then joins the group that the majority of its closest neighbours belong to. If it only asks one neighbour (k = 1), it simply joins the same group as that single closest neighbour.

53
Q

What do we mean by distance in a machine learning context? How can distance be measured?
Provide an example.

A

In a machine learning context, “distance” generally refers to a measure of dissimilarity or similarity between two data points in a feature space. The idea is to quantify how far apart or close together two instances are in the context of the given features.
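
The question also asks how distance can be measured. Two common measures are Euclidean distance (straight-line distance) and Manhattan distance (sum of absolute differences per feature). The Python sketch below uses hypothetical function names and sample points.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute differences along each feature."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (1, 2), (4, 6)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7
```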

54
Q

Advantages:
Stemming

A
  1. Reduces variation. This helps group together variations of the same word. This is the main objective of stemming.
  2. Simplifies analysis. It simplifies the analysis of text data by focusing on the core meaning of words.
  3. Computational efficiency. Stemming can improve computational efficiency since the reduced dimensionality makes subsequent processing faster.

55
Q

Disadvantages:
Stemming

A
  1. Over-Stemming and Under-Stemming: Stemming algorithms may sometimes over-stem (remove too many letters, leading to loss of meaning) or under-stem (leave too many letters, failing to reduce related words to the same stem).
  2. Loss of Interpretability: The stemmed words may not be easily interpretable, making it challenging to understand the original context.
  3. Language Dependence: Stemming algorithms are language-dependent, and the effectiveness may vary across different languages.
56
Q

what is Mean and Median Imputation?

A

Mean and median imputation are methods used to replace missing values in a dataset with the mean or median of the observed values for that variable. These techniques are commonly employed in data preprocessing to handle missing data, ensuring that the dataset remains suitable for analysis.
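
A minimal Python sketch of both variants follows. It assumes missing values are represented as `None`; the function name `impute` and the sample data are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing values and one extreme value.
ages = [20, None, 30, 100, None]
print(impute(ages, "mean"))    # missing values become 50
print(impute(ages, "median"))  # missing values become 30
```

The extreme value 100 pulls the mean up to 50 while the median stays at 30, which is why the median is often preferred for skewed data.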

57
Q

MAE?

A

Mean Absolute Error. The average absolute difference between prediction and outcome.

58
Q

MSE?

A

Mean Square Error
Take the difference between the actual and predicted values for each data point.
Square each of these differences.
Take the average of the squared differences.
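
The three steps above, together with the corresponding calculation for MAE, can be sketched in Python as follows. The function names and the sample values are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3, 5, 8]
predicted = [2, 5, 12]  # errors: 1, 0, -4

print(mean_absolute_error(actual, predicted))  # (1 + 0 + 4) / 3, about 1.67
print(mean_squared_error(actual, predicted))   # (1 + 0 + 16) / 3, about 5.67
```

The single large error of 4 contributes 16 to the squared sum, which shows how MSE weights large errors more heavily than MAE.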

59
Q

Calc recall

A

TP / (TP + FN)

60
Q

Calc Precision

A

TP / (TP + FP)

61
Q

F1

A

F1 = 2 * (Recall * Precision) / (Recall + Precision)
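
As a worked example, the Python sketch below computes precision, recall, and F1 from confusion-matrix counts. The function names and the counts (TP = 8, FP = 2, FN = 4) are hypothetical.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * (p * r) / (p + r)


tp, fp, fn = 8, 2, 4
p = precision(tp, fp)  # 8 / 10 = 0.8
r = recall(tp, fn)     # 8 / 12, about 0.67
print(p, r, f1_score(p, r))  # F1 is about 0.73
```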

62
Q

Manhattan distance.

A

Manhattan distance.
abs(x1 - x2) + abs(y1 - y2)

63
Q

what are Decision Trees (DTs)

A

Decision Trees (DTs) are a predictive supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

64
Q

When is MAE and MSE used and why?

A

Mean Square Error (MSE) and Mean Absolute Error (MAE) are commonly used metrics for evaluating the performance of regression models. These metrics are useful when the goal is to predict a continuous numerical value, as opposed to a classification task.

MAE is less sensitive to outliers compared to MSE because it doesn’t square the differences.

65
Q

How is KNN calculated

A

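Compute the distances from the new data point to all the other points.
Choose the k shortest.
Assign the class held by the majority of those k neighbours.
If the vote is tied, e.g. [yes, no, yes, no], calculate the average distance for yes and for no.
Assign the class with the shortest average distance.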
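
A minimal Python sketch of this procedure is given below. It assumes Euclidean distance (via `math.dist`) and reads the card's average-distance step as a tie-break for an even vote; the function name `knn_classify` and the training data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training, new_point, k):
    """Classify new_point from a list of (features, label) pairs."""
    # Distance from the new point to every training point, shortest first.
    distances = sorted(
        (math.dist(features, new_point), label) for features, label in training
    )
    # Keep the k shortest and count the votes per label.
    nearest = distances[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    tied = [label for label, count in votes.items() if count == top]
    if len(tied) == 1:
        return tied[0]

    # Tie: fall back to the label with the shortest average distance.
    def average_distance(label):
        ds = [d for d, lab in nearest if lab == label]
        return sum(ds) / len(ds)

    return min(tied, key=average_distance)


training = [
    ((1, 1), "yes"), ((2, 1), "yes"), ((1, 2), "yes"),
    ((6, 5), "no"), ((7, 6), "no"),
]
print(knn_classify(training, (2, 2), k=3))  # yes
print(knn_classify(training, (6, 6), k=3))  # no
```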

66
Q

Listwise deletion

A

Remove the entire record if any of its values is missing.
