AI Flashcards

Question

Applications?

Answer 1

* Hemoglobin levels and the amount of blood transfused can be estimated with less error than before thanks to ML
* Supervised machine learning methods trained on SNPs and total baseline depression scores predicted remission and response at 8 weeks with an area under the receiver operating characteristic curve (AUC) of 0.7 and 70% prediction accuracy
* Assessment of drug-drug interactions in polypharmacy using graph convolutional networks
* AI performs as well as doctors in university tests

Answer 2

Unsupervised Machine Learning: In unsupervised learning, the algorithm is given a dataset without explicit instructions on what to do with it. The algorithm must find patterns, structure, or relationships within the data on its own. Unsupervised learning techniques include clustering, dimensionality reduction, and association rule learning. Clustering algorithms like K-means or hierarchical clustering group similar data points together based on their inherent patterns or similarities. Dimensionality reduction techniques such as Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) aim to reduce the number of features in a dataset while preserving its important characteristics. Unsupervised learning is often used for tasks such as anomaly detection, data compression, and exploratory data analysis.
Deep Learning: Deep learning is a subset of machine learning that involves neural networks with multiple layers (deep neural networks). These networks can automatically learn hierarchical representations of data. Deep learning models are typically trained using large amounts of labeled data, and they learn to extract features directly from the raw data without the need for manual feature engineering. Deep learning has shown remarkable success in various tasks such as image recognition, natural language processing, speech recognition, and recommendation systems. Common architectures in deep learning include Convolutional Neural Networks (CNNs) for image-related tasks, Recurrent Neural Networks (RNNs) for sequence data, and Transformers for natural language processing tasks. While deep learning models can be used for unsupervised learning tasks (e.g., autoencoders for dimensionality reduction or generative adversarial networks for generating synthetic data), they are more commonly associated with supervised learning where they learn to map inputs to outputs.

Answer 3

Personalized Medicine: Machine learning and AI techniques enable the development of personalized medicine approaches in pharmacogenomics. By analyzing an individual's genetic information, along with other relevant clinical data, these techniques can predict a patient's response to a particular drug or dosage regimen.
Predictive Modeling: Machine learning algorithms can build predictive models that identify genetic markers or signatures associated with drug response or adverse reactions. These models can be used to stratify patient populations and guide treatment decisions, ultimately leading to more effective and safer drug therapies.
Drug Discovery and Development: AI algorithms can accelerate drug discovery and development processes by analyzing vast amounts of genomic and chemical data. These techniques help identify potential drug targets, predict drug-drug interactions, optimize drug candidates, and design more effective clinical trials.
Genomic Data Analysis: Machine learning methods are instrumental in analyzing large-scale genomic datasets, including genome-wide association studies (GWAS) and next-generation sequencing data. These techniques can uncover genetic variants associated with drug metabolism, pharmacokinetics, and pharmacodynamics.
Drug Repurposing: AI-driven approaches facilitate drug repurposing efforts by identifying new therapeutic indications for existing drugs based on their genomic and pharmacological profiles. This approach can expedite the development of novel treatments for various diseases.
Adverse Drug Reaction Prediction: Machine learning models can predict the likelihood of adverse drug reactions based on genetic factors, enabling proactive measures to mitigate risks and improve patient safety.
Clinical Decision Support Systems: AI-powered clinical decision support systems integrate genomic data with electronic health records (EHRs) to provide healthcare professionals with personalized treatment recommendations and dosage adjustments tailored to individual patients.

Answer 4

Linear regression, K-nearest neighbours (KNN), Random Forest

Answer 5

K-means clustering, Hierarchical clustering, Principal component analysis (PCA)

Answer 6

Machine learning (ML) is a subfield of AI, or a path to AI. It uses algorithms to learn insights and recognise patterns from data. Deep Learning and Neural Networks are methods of ML. Deep Learning structures algorithms in Neural Networks, with the aim of teaching them to make decisions.

Answer 7

In supervised machine learning (SML), algorithms learn from labelled data
* Regression is used to understand the relationship between dependent and independent variables
* Classification assigns test data to categories based on specific variables

Answer 8

Used to predict (forecast) the value of the dependent variable based on the independent variable
* Linear regression is applied to continuous variables, whilst logistic regression is applied to discrete variables

Answer 9

* Residuals can be used to validate the model by checking that they are independent and normally distributed
* As the number of independent variables increases, multiple linear regression is applied
$y = a + bx + \varepsilon$
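A minimal sketch of simple linear regression and its residuals, assuming ordinary least squares as the fitting method (the card does not name one). The function name and the toy data are hypothetical and only illustrate the idea.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical toy data, not taken from the flashcards
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_simple_linear_regression(x, y)

# Residual = observed value minus the value predicted by the fitted line
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(f"a = {a:.3f}, b = {b:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
```

With an intercept in the model, least-squares residuals sum to zero; checking independence and normality would need a plot or a statistical test on top of this.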

Answer 10

* Builds a model to describe Y in the best way using Xn
* Use independent variables to predict the dependent variable. Example: → Total Cholesterol = a + b1*BMI + b2*Time exercising + b3*Shoe size + … + ε
* But is shoe size relevant?
$y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + \varepsilon$
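The cholesterol example can be sketched on simulated data. Everything here is an assumption made for illustration: the coefficient values, the noise level, the sample size, and the choice to fit by solving the normal equations. Shoe size is simulated with no effect, so its fitted coefficient should come out close to zero.

```python
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_multiple_regression(rows, y):
    """Least-squares coefficients [a, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept column
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)


# Simulated (hypothetical) data: cholesterol depends on BMI and exercise only
rng = random.Random(1)
rows, chol = [], []
for _ in range(200):
    bmi = rng.uniform(18, 35)
    exercise = rng.uniform(0, 10)
    shoe = rng.uniform(36, 46)
    rows.append((bmi, exercise, shoe))
    chol.append(3.0 + 0.1 * bmi - 0.05 * exercise + rng.gauss(0, 0.1))

a, b_bmi, b_exercise, b_shoe = fit_multiple_regression(rows, chol)
print(f"BMI: {b_bmi:.3f}, exercise: {b_exercise:.3f}, shoe size: {b_shoe:.3f}")
```

A coefficient near zero is only a hint that a variable is irrelevant; a real analysis would also look at its standard error or p-value.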

Answer 11

* Parametric test based on assumptions:
- Linear relationship between Y and X
- Xi are not highly correlated with each other
- The variance of the residuals is constant
- Independence of observations
- Residuals are normally distributed

Answer 12

* Model can be tested with Root Mean Square Error (RMSE), the standard deviation of the residuals: sum the squared residuals, divide by the sample size, and take the square root
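The RMSE recipe can be written as a short function. This is a sketch that divides by the sample size, as the card describes; the function name and the example numbers are made up for illustration.

```python
import math


def rmse(observed, predicted):
    """Root Mean Square Error: square the residuals, average them, take the root."""
    squared_residuals = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_residuals) / len(squared_residuals))


# Hypothetical observed and predicted values
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(rmse(observed, predicted))  # square root of (1 + 0 + 4) / 3
```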

Answer 13

1. Create a random 80/20 split of the data, generating training data (80%) and test data (20%)
2. Train a regression model on the training data
3. Apply the model on the test data
4. Calculate RMSE of the training data (in-sample RMSE) and test data (out-of-sample RMSE)
* Compare the RMSE values. The out-of-sample RMSE indicates how well the model performs on new data.
* More complex model → Decreasing in-sample RMSE → Overfitting
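The four steps above can be sketched end to end. The data are simulated and the model is a simple least-squares line; both are assumptions chosen to keep the example short, and the helper names are hypothetical.

```python
import math
import random
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - b * x_bar, b


def rmse(x, y, a, b):
    """RMSE of the line y = a + b*x on the given data."""
    return math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(x))


# Simulated (hypothetical) data: y = 2 + 3x plus Gaussian noise
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

# 1. Random 80/20 split
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
train_x, train_y = [xs[i] for i in train_idx], [ys[i] for i in train_idx]
test_x, test_y = [xs[i] for i in test_idx], [ys[i] for i in test_idx]

# 2. Train on the training data only
a, b = fit_line(train_x, train_y)

# 3-4. Apply to both sets and compare the two RMSE values
in_sample = rmse(train_x, train_y, a, b)
out_of_sample = rmse(test_x, test_y, a, b)
print(f"in-sample RMSE: {in_sample:.3f}, out-of-sample RMSE: {out_of_sample:.3f}")
```

An out-of-sample RMSE much larger than the in-sample RMSE is the usual warning sign of overfitting.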

Answer 14

Pros:
* Can be used on continuous (linear) and discrete (logistic) data
* Determine influence of independent variables on the dependent variable
* Identifying outliers
Cons:
* No mixed data (continuous & discrete)
* Many assumptions
* Requires complete data with no missing values

Answer 15

* Non-parametric algorithm i.e. no strong assumptions
* Often used for classification, predicting the group of a data point
* Applies majority voting based on:
- Distance metrics
- Number of K’s
1. Calculate the distances, usually with Euclidean distance
2. Find the nearest neighbours by ranking the distances
3. Majority vote on the predicted class label based on the K nearest neighbours
K is the number of nearest neighbours taken into account
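The three KNN steps map directly onto a few lines of code. This is a minimal sketch with Euclidean distance; the function name, the toy points, and the labels "A" and "B" are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. Calculate the Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # 2. Rank the distances and keep the k nearest neighbours
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # 3. Majority vote on the class label
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data with two groups
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2, 2), k=3))
print(knn_predict(points, labels, (6, 5), k=3))
```

There is no training step: the "model" is the stored data, and all the work happens at prediction time.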

Answer 16

Pros:
* It is easy to implement
* No need to train a model
* Versatile, distance algorithms can handle different types of data
Cons:
* Data should be on the same scale, which can be difficult with large datasets
* Setting the K can be challenging
Tips:
* Test different K’s
* K should be an odd number to avoid ties

Answer 17

Random forest is based on decision trees
* Generates many decision trees that together form the random forest used to classify unlabeled data → A single tree is not accurate
* Can use both categorical and continuous variables
Random forest
1. Create a bootstrapped dataset that is the same size as the original → Randomly selected data, where duplicates are allowed
2. Create a decision tree from the bootstrapped data using a random subset of variables
3. Repeat 1 and 2 multiple times
4. Input your unlabeled data and let the random forest’s many classifiers label it
5. Majority vote classifies the unlabeled data
Random forest validation with Out-of-Bag
* The Random forest model can be validated using the Out-of-bag error
* The Random forest is used to predict labels of data not selected for the bootstrapped data (test set)
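Two pieces of the procedure are easy to show in isolation: drawing a bootstrapped dataset (which also determines the out-of-bag samples) and taking the majority vote across trees. This sketch deliberately leaves out tree building; the function names and example labels are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (duplicates allowed)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(tree_predictions):
    """Return the label predicted by most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
n_samples = 10

# Step 1: bootstrapped dataset, same size as the original
in_bag = bootstrap_indices(n_samples, rng)

# Rows never drawn are "out-of-bag" and can act as a test set for this tree
out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
print("in-bag rows:", in_bag)
print("out-of-bag rows:", out_of_bag)

# Step 5: each (hypothetical) tree casts one vote for an unlabeled sample
votes = ["responder", "non-responder", "responder"]
print(majority_vote(votes))
```

The out-of-bag error is then the share of out-of-bag rows that are misclassified by the trees that did not see them during training.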

Answer 18

Pros:
* Can be used on many types and mixes of data
* Can be applied to both classification and regression problems
* Can be applied to data with missing values
* Less prone to overfitting and the curse of dimensionality
Cons:
* Very complex and you can’t follow the decisions of the trees
* Training the model takes time and computing power

Answer 19

* In unsupervised machine learning (UML), algorithms are used to analyze and cluster unlabelled data
→ Data grouping based on patterns
→ Similarities and differences of the data
* Clustering is applied to raw data and groups it based on similarities and differences between the structure and/or patterns of the data
* Dimensionality reduction can be applied to reduce the complexity of data whilst preserving the structure, to reduce "noise" and overfitting in ML algorithms.

Answer 20

* Not to be confused with KNN
* Groups similar data points in clusters
* K is the number of clusters and means generated
1. Set the number of K’s with an Elbow plot
2. Generate K random centroids
3. Create K clusters by assigning each data point to the closest centroid
4. Calculate new centroids for each cluster
5. Reassign points to the new centroids. If there are new assignments, repeat 4. If there are no new assignments, terminate the algorithm
Elbow plot determines number of K’s
* First step of K-means clustering is to set the K
* The Elbow method is common
* Distortion is the sum of squared distances of data points from their cluster centers
- Decreases as K increases
- 0 when K = number of points
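Steps 2 to 5 and the distortion used in the Elbow plot can be sketched as below. One assumption is made loudly: the K random starting centroids are drawn from the data points themselves, which is a common choice but not something the card specifies. Function names and toy data are hypothetical.

```python
import math
import random


def k_means(points, k, rng, max_iter=100):
    """Return (centroids, assignments) after running K-means on `points`."""
    # Step 2: K random centroids (assumption: sampled from the data points)
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each point to its closest centroid
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no new assignments, terminate
        assignments = new_assignments
        # Step 4: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


def distortion(points, centroids, assignments):
    """Sum of squared distances of points from their cluster centers."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))


# Hypothetical toy data with two well-separated groups
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
rng = random.Random(0)

# Distortion for each K is what an Elbow plot would show
for k in range(1, len(points) + 1):
    centroids, assignments = k_means(points, k, rng)
    print(k, round(distortion(points, centroids, assignments), 3))
```

Because the starting centroids are random, different runs can give different clusterings on harder data; repeating the run and keeping the lowest distortion is a common remedy.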

Answer 21

* Groups similar data points into clusters
* Defines clusters that are distinct from each other and whose data points are similar
* Creates clusters by ordering them:
- Bottom-up (Agglomerative)
- Top-down (Divisive)
* The length of a branch in the dendrogram shows how similar the data points are. → Long branch = dissimilar, short branch = similar
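A bottom-up (agglomerative) run can be sketched in a few lines. This sketch assumes single linkage (distance between the closest members of two clusters), which the card does not specify; the recorded merge distance plays the role of the branch length in a dendrogram. Names and data are hypothetical.

```python
import math


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    clusters = [[i] for i in range(len(points))]  # start: every point is a cluster
    merges = []  # (members of cluster 1, members of cluster 2, merge distance)
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    math.dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


# Hypothetical toy data: two tight pairs and one distant point
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative(points, n_clusters=2)

for left, right, height in merges:
    print(f"merge {left} + {right} at distance {height:.2f}")
print("final clusters:", clusters)
```

Small merge distances correspond to short dendrogram branches (similar points), large ones to long branches (dissimilar groups).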

Answer 22

Pros:
* Easy to use
* The dendrogram gives information about the data structure
* Can be used to set number of clusters
Cons:
* Sensitive to outliers
* Does not work well with missing data or mixed data
* In complex data, difficult to determine number of relevant clusters

Answer 23

Common and versatile method used for:
* Analysing the structure of data features
* Pre-processing for other ML algorithms
* Visualisation
Summarises large multi-dimensional datasets into a smaller number of dimensions (ideally 2) that can be visualised
1. Plot the data. Gene 1 & 2 are higher in sample 1 & 2…
2. Calculate the average of gene 1 and 2 (and n) to find the center of the data.
3. Center the data at the origin (0,0)
4. Find the line, through the origin, with the best fit. The best fit is found by projecting each point onto the line and minimizing the distance from the point to the line. The line is called Principal Component 1 (PC1)
5. The eigenvectors are calculated. A higher loading indicates more influence on the PC, e.g. Gene 1 (0.82) influences PC1 more than Gene 2 (0.57).
Multi-dimensions and PC n
* PC2 is perpendicular to PC1. PC3 is perpendicular to PC1 and PC2 etc.
* There are as many PCs as genes
* PC1 explains most of the variance in the data, PC2 the second most, etc.
* For a projection in 2D, two PCs are used
* The data points are projected onto the PCs.
* Hopefully, we see some clustering…
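For two genes, the PCA steps can be worked through with the closed-form eigen-decomposition of the 2x2 covariance matrix. This is a sketch under that assumption; the expression values are invented and are not meant to reproduce the loadings quoted on the card.

```python
import math
from statistics import mean


def pca_2d(gene1, gene2):
    """PCA for two variables; returns PC1, PC2, projected scores, variance shares."""
    # Steps 2-3: find the center of the data and move it to the origin
    m1, m2 = mean(gene1), mean(gene2)
    x = [v - m1 for v in gene1]
    y = [v - m2 for v in gene2]
    n = len(x)
    # Covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(v * v for v in x) / (n - 1)
    syy = sum(v * v for v in y) / (n - 1)
    sxy = sum(a * b for a, b in zip(x, y)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix (variance along PC1 and PC2)
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector for the largest eigenvalue = direction of best fit (PC1)
    if sxy != 0:
        v = (sxy, lam1 - sxx)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    pc1 = (v[0] / norm, v[1] / norm)  # the loadings of gene 1 and gene 2 on PC1
    pc2 = (-pc1[1], pc1[0])  # perpendicular to PC1
    # Project every centered data point onto PC1 and PC2
    scores = [
        (a * pc1[0] + b * pc1[1], a * pc2[0] + b * pc2[1]) for a, b in zip(x, y)
    ]
    explained = (lam1 / (lam1 + lam2), lam2 / (lam1 + lam2))
    return pc1, pc2, scores, explained


# Hypothetical expression values for two genes in six samples
gene1 = [10, 11, 8, 3, 2, 1]
gene2 = [6, 4, 5, 3, 2.8, 1]

pc1, pc2, scores, explained = pca_2d(gene1, gene2)
print("PC1 loadings:", tuple(round(v, 2) for v in pc1))
print("variance explained:", tuple(round(v, 3) for v in explained))
```

With more than two genes the same idea applies, but the eigenvectors are normally computed with a linear algebra library rather than by hand.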

Answer 24

Pros:
* Can remove noise (correlated features)
* Improve ML algorithms by removing noise → Reduces overfitting
* Visualisation
Cons:
* PCA turns independent variables into PCs, which can be hard to interpret
* Requires standardised data and therefore does not work well on mixed data
tSNE and UMAP are advancements of PCA that project the data better, making clustering easier
(Karolinska Institutet, 03/02/2023)

Answer 25

* ML algorithms are used together
* Nested in networks or parts of pipelines
* Used as tools, from a ML toolbox
* Important to know when and why to use it

Answer 26

Supervised classifiers are often used in image analysis, for example when diagnosing rare diseases. Here, KNN is nested into a Deep Neural Network. The data points in the KNN are other patients' phenotypes

Answer 27

* Most used method in analysing bulk RNA-sequencing data
* Other methods are limma and edgeR. The common aim is to find differentially expressed genes (proteins, lipids etc.)

(51 cards)