Models & Elements Flashcards
Attention Mechanism
Enables the model to selectively focus on specific, relevant parts of the input data while making predictions or generating text.
A component commonly used in neural network architectures, particularly in natural language processing (NLP) and computer vision tasks. It enables models to focus on specific parts of input data dynamically, assigning different weights to different parts of the input sequence. This process allows the model to selectively attend to relevant information, enhancing its ability to capture long-range dependencies and handle variable-length input sequences effectively. Here’s how it works:
Encoding Input Features: The input sequence is encoded into a sequence of feature vectors using an encoder network. Each feature vector represents a specific part of the input sequence and contains information about that part’s content.
Calculating Attention Weights: Next, the attention mechanism calculates attention weights for each feature vector in the input sequence. These weights determine the importance or relevance of each part of the input sequence with respect to the current context.
Weighted Sum: The attention weights are applied to the corresponding feature vectors, producing a weighted sum of the input sequence. This weighted sum emphasizes the parts of the input sequence that are deemed most relevant or informative for the current task or context.
Context Vector: The weighted sum of the input sequence, known as the context vector, is then passed to the subsequent layers of the neural network for further processing. This context vector encapsulates the most relevant information from the input sequence, enabling the model to make more informed predictions or decisions.
Training and Learning: During training, the attention mechanism learns to dynamically adjust the attention weights based on the input sequence and the task at hand. This learning process enables the model to adaptively focus on different parts of the input sequence as needed, improving its performance on various tasks such as machine translation, text summarization, and image captioning.
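A minimal sketch of the steps above in numpy, assuming the feature vectors are already encoded and using simple dot-product scoring (real architectures use various scoring functions; all names here are illustrative):
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

features = np.random.randn(5, 8)        # 5 encoded input positions, 8-dim each
context = np.random.randn(8)            # hypothetical current context (e.g., decoder state)

scores = features @ context             # one relevance score per input position
weights = softmax(scores)               # attention weights, summing to 1
context_vector = weights @ features     # weighted sum of the input features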
Attention RNN
A specialized architecture within recurrent neural networks (RNNs) that selectively focuses on different parts of the input sequence during processing. Unlike traditional RNNs, which process sequences with a fixed-size internal state, Attention RNN dynamically adjusts its attention weights, allowing it to give more importance to relevant inputs while suppressing irrelevant ones. It is commonly used in natural language processing tasks such as machine translation and text summarization, where understanding context and relevance within sequences is crucial. For example, in machine translation, Attention RNN enables the model to align words from the source language to the target language more effectively by attending to specific words in the source sentence during translation.
Auto-Encoders
An autoencoder is a feed-forward NN with an encoder-decoder architecture. It is trained to reconstruct its input.
Autoencoders are a special type of neural network used in machine learning for unsupervised learning tasks. They are essentially designed to learn efficient representations of data by trying to compress the data and then recreate it.
- Structure: An autoencoder consists of two main parts: an encoder and a decoder.
The encoder takes the input data (like an image or a text snippet) and compresses it into a lower-dimensional representation, often called the “latent space” or “code.” This code captures the essential features of the input data.
The decoder then receives this compressed code and tries to reconstruct the original input data from it.
- Training: During training, the autoencoder is given a set of input data. It then tries to encode this data and decode it back, minimizing the difference between the original data and the reconstructed data. This forces the encoder to learn a good representation of the data in the latent space, since it needs this information to create an accurate reconstruction.
- Applications: Autoencoders have various applications because of their ability to learn data representations. Here are a few examples:
a) Dimensionality reduction: By compressing data into a lower-dimensional latent space, autoencoders can be used to reduce the storage space needed for data or improve the efficiency of other algorithms.
b) Denoising: Autoencoders can be trained to ignore noise in the data by focusing on reconstructing the underlying clean patterns. This can be useful for tasks like image denoising or filtering audio data.
c) Anomaly detection: Since autoencoders learn what “normal” data looks like, they can identify deviations from the norm. This can be helpful for anomaly detection in areas like fraud detection or system health monitoring.
By learning compressed representations of data, autoencoders offer a powerful tool for various tasks in machine learning, especially when dealing with unlabeled data.
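A minimal sketch of the encoder-decoder structure in PyTorch, assuming 784-dimensional inputs (e.g., flattened 28x28 images) and a 32-dimensional latent code; the layer sizes are illustrative:
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))   # compress to latent code
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))   # reconstruct input
    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
loss_fn = nn.MSELoss()              # reconstruction error
x = torch.randn(16, 784)            # dummy batch
loss = loss_fn(model(x), x)         # minimized with any optimizer during training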
Bagging (Ensemble models)
Bagging involves training multiple instances of the same learning algorithm on different subsets of the training data and then averaging the predictions to make the final prediction.
Process:
Bootstrap Sampling: Randomly sample subsets (with replacement) from the training data.
Model Training: Train a base model (e.g., decision tree) on each bootstrap sample.
Prediction Aggregation: Combine predictions from all base models, often by averaging (for regression) or voting (for classification).
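A short sketch of this process with scikit-learn on toy data; BaggingClassifier handles the bootstrap sampling and vote aggregation internally:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)   # toy data for illustration
model = BaggingClassifier(
    DecisionTreeClassifier(),   # base model trained on each bootstrap sample
    n_estimators=50,            # number of bootstrap samples / base models
    random_state=0,
)
model.fit(X, y)
print(model.predict(X[:5]))     # majority vote over the 50 trees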
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer-based language model developed by Google. It leverages bidirectional context and transformer architecture to learn rich contextual representations of words and sentences from large corpora of text data. BERT has achieved state-of-the-art performance on various natural language processing (NLP) tasks, including question answering, sentiment analysis, named entity recognition, and language understanding tasks.
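A brief sketch of obtaining contextual representations from a pre-trained BERT checkpoint, assuming the Hugging Face transformers library is installed:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT builds bidirectional context.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # one contextual vector per token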
Bi-directional RNN
Type of recurrent neural network architecture that processes input sequences in both forward and backward directions. By utilizing information from past and future timesteps simultaneously, bi-directional RNNs can capture contextual dependencies more effectively than traditional RNNs, making them well-suited for tasks such as sequence labeling, machine translation, and sentiment analysis.
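A short sketch assuming PyTorch: setting bidirectional=True runs the recurrence in both directions and concatenates the forward and backward hidden states at each time step.
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)
x = torch.randn(4, 10, 16)    # 4 sequences, 10 time steps, 16 features
output, _ = rnn(x)
print(output.shape)           # (4, 10, 64): forward and backward states concatenated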
Bias direction (NLM)
Bias direction in the context of word embeddings refers to the vector direction between the position of a word in the embedding space and the position where it ideally should be. This direction signifies the deviation of the word’s representation in the embedding space from a bias-free position. Identifying bias direction involves analyzing the displacement of word vectors in relation to desired unbiased representations. Techniques for addressing bias direction include debiasing algorithms, which aim to adjust word embeddings to minimize biases while preserving semantic information.
It can also be described as a specific orientation or subspace within the embedding where certain biases are encoded. Word embeddings, created through techniques like Word2Vec or GloVe, map words to high-dimensional vectors in a continuous vector space. Bias direction arises when these embeddings exhibit systematic biases toward certain concepts, genders, races, or other social categories present in the training data. For example, if certain occupations are predominantly associated with one gender in the training data, the embedding space might reflect this bias by positioning words related to those occupations closer to gender-specific words. Identifying bias direction involves analyzing the spatial relationships between word vectors and understanding which dimensions in the embedding space correspond to biased attributes.
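A common illustration of finding such a direction, using a hypothetical emb dictionary of word vectors (in practice these would be loaded from GloVe or Word2Vec): approximate the gender direction as the difference of a paired set of vectors and measure each word’s projection onto it.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in ["he", "she", "doctor", "nurse"]}   # placeholder vectors

bias_direction = emb["he"] - emb["she"]
bias_direction /= np.linalg.norm(bias_direction)

for word in ["doctor", "nurse"]:
    projection = emb[word] @ bias_direction   # signed component along the bias direction
    print(word, round(float(projection), 3))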
Bias in word embeddings
Tendency of word embedding models to encode and reinforce societal biases present in the training data. Word embeddings are vector representations of words learned from large text corpora using techniques such as Word2Vec or GloVe. However, these embeddings may inadvertently capture stereotypes, prejudices, or cultural biases present in the text data, leading to biased representations of certain concepts or groups.
Depending on the attributes with which each word is associated, a word can be represented as a point in multidimensional space. The distance of a word from certain features can be measured as a vector. This vector has a magnitude and direction that tell us the strength and type of relation between two words. Having this information, we can determine similar meanings between words, or biases.
Biases (NN)
A very simple way to understand it is:
Each neuron has a function which takes inputs and multiplies each by a specific weight; the bias is an added value that tells us how much we want to “favor” the function created by the neuron. The higher the bias, the more likely it is that the function created in the neuron will be activated.
Each neuron in a neural network has its own bias value, independent of the weights. Bias is essentially an intercept term similar to the intercept in linear regression. While weights determine the strength of the connections between neurons, biases allow neurons to adjust the output independently of the input. The bias term acts as an adjustable parameter that allows the neuron to output non-zero values even when all inputs are zero. This means the neuron can still fire and contribute to the network’s output even if the weighted sum of inputs is low or zero. Without a bias, the model’s output might always be confined to a certain range. Think of neurons as making decisions by ‘activating’ when the weighted sum plus bias exceeds a threshold. The bias can make it easier or harder for a neuron to activate, controlling its baseline behavior.
In a neuron, biases are added to the weighted sum of inputs before applying the activation function:
weighted_sum = (w1 * x1) + (w2 * x2) + (w3 * x3) + b
While weights control the slope of the relationship between a neuron’s input and output, the bias determines the intercept, or where the line crosses the y-axis. In other words: biases shift the activation function (left or right in relation to y-axis). This is critical for increasing the flexibility to where the activation function can trigger. Without a bias, the model might be forced to pass a decision boundary through the origin. Biases allow the boundary to shift, enabling the model to fit different patterns. Think of predicting temperatures – even if all your input values are zero, a bias can allow you to model non-zero temperatures.
Biases are also learned parameters. They are initialized (usually to small values) and then updated during training through backpropagation, just like weights. Biases can sometimes aid in faster convergence during the training process. By starting with biases already set, the model might start closer to a good solution. Common initialization techniques include initializing biases to zeros, small random values, or using techniques like Xavier or He initialization.
Boosting (Ensemble models)
Boosting sequentially trains a series of weak learners (e.g., shallow decision trees) and adjusts the weights of data points to emphasize the mistakes made by previous models.
Process:
Iterative Training: Train a series of weak models, each focusing on the mistakes of the previous ones.
Weight Adjustment: Assign higher weights to misclassified data points to make them more influential in subsequent iterations.
Model Combination: Combine predictions from all weak models, often by weighted averaging.
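A minimal scikit-learn sketch on toy data; AdaBoost re-weights misclassified points between iterations and combines the weak learners by weighted voting:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)
model = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),   # a shallow "stump" as the weak learner
    n_estimators=100,
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))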
Cascade classifier
A machine learning model used for object detection in images or video streams. It consists of a sequence of stages, each containing a classifier trained to detect specific features or patterns of interest (e.g., Haar-like features in Viola-Jones algorithm). Cascade classifiers are designed to efficiently reject negative samples at early stages and focus computation on regions likely to contain objects, enabling real-time performance in applications such as face detection and pedestrian detection.
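A short example assuming OpenCV is installed and an image file face.jpg exists (the filename is illustrative); OpenCV ships pre-trained Haar cascades in the Viola-Jones style:
import cv2

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
img = cv2.imread("face.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)   # list of (x, y, w, h) boxes
print(len(faces), "face(s) detected")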
Cell state
In Long Short-Term Memory (LSTM) networks, the cell state is a vector of information carried forward from one time step to the next within the LSTM cell; it holds what the network has deemed important to retain for future time steps.
The cell state serves as a long-term memory store that enables the LSTM network to capture dependencies and patterns over extended sequences of data. This mechanism allows the network to retain relevant information and discard irrelevant information as it processes sequential data, helping it to make more accurate predictions or classifications.
Conditional Random Fields (CRF)
A type of probabilistic graphical model used for modeling structured prediction tasks in machine learning and natural language processing. CRFs model the conditional probability distribution of output variables given input features and capture dependencies among neighboring variables in structured data, such as sequences, graphs, or grids. They are widely used for tasks like sequence labeling, named entity recognition, part-of-speech tagging, and information extraction.
Cost-sensitive learning
A machine learning paradigm that incorporates the differential costs of errors or misclassifications into the training process. In cost-sensitive learning, the objective is to minimize a loss function that considers the varying costs associated with different types of errors or outcomes. Cost-sensitive learning is particularly relevant in imbalanced classification problems, where the classes have unequal costs or misclassification penalties. It enables models to prioritize accurate predictions for minority classes or critical outcomes, enhancing their performance and applicability in real-world scenarios.
It is a field of study that is closely related to the field of imbalanced learning that is concerned with classification on datasets with a skewed class distribution.
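One common way to apply this in practice, sketched with scikit-learn: the class_weight argument scales the loss contribution of each class so that errors on the rare class cost more (the weights and data below are illustrative).
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

model = LogisticRegression(class_weight={0: 1, 1: 10})   # misclassifying class 1 is 10x more costly
model.fit(X, y)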
Definitional word (NLP)
Understanding the relationship between definitional words of concepts helps structure knowledge bases and ontologies. The words used in a definition can help determine the correct meaning of a word in a specific context. Automatically identifying definitional words can help extract definitions of terms from large amounts of text.
Definitional word refers to the core words or phrases that explain the essence of a concept and can be used to create its definition.
Example: If the concept is “dog”, definitional words might include “mammal”, “pet”, “bark”, “loyal”.
This focuses on the kinds of words typically found in formal definitions:
Genus: The broader category the term belongs to (“dog” is a type of “animal”).
Differentia: What distinguishes it from other members of the category (dogs “bark”, have “fur”).
Dense layer
Also known as a fully connected layer, is a type of layer where each neuron or node is connected to every neuron in the previous layer. Dense layers play a fundamental role in feedforward neural networks, where they perform linear transformations and apply activation functions to input data. They enable neural networks to learn complex mappings between input and output data by capturing non-linear relationships and hierarchical features. Dense layers are commonly used in deep learning architectures for tasks such as classification, regression, and function approximation.
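A one-layer sketch in Keras, assuming TensorFlow is installed (the layer size is arbitrary); every one of the 64 output neurons is connected to all 32 inputs:
import tensorflow as tf

layer = tf.keras.layers.Dense(units=64, activation="relu")   # fully connected layer
x = tf.random.normal((1, 32))    # one sample with 32 features from the previous layer
y = layer(x)                     # shape (1, 64)
print(y.shape)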
Feeding-forward or forward pass (in NN)
Process of passing input data through the network’s layers in a forward direction, from the input layer through the hidden layers to the output layer. During this process, each layer performs a series of computations, such as linear transformations and activation functions, to generate predictions or representations of the input data. Feeding-forward is the basic operation performed during both training and inference in neural networks.
Taking an input and passing it through the network’s layers from input to output.
Performing the calculations at each neuron (weighted sums, activation functions).
Producing the final output (prediction, classification, etc.).
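A rough numpy sketch of a forward pass through one hidden layer and an output layer (shapes and activation functions are illustrative):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.random.randn(4)                       # input layer (4 features)
W1, b1 = np.random.randn(8, 4), np.zeros(8)
W2, b2 = np.random.randn(1, 8), np.zeros(1)

hidden = np.tanh(W1 @ x + b1)                # hidden layer: weighted sum + activation
output = sigmoid(W2 @ hidden + b2)           # output layer: prediction
print(output)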
Gated recurrent unit
Gated units capture long-range dependencies and handle vanishing or exploding gradient problems. They incorporate gating mechanisms that control the flow of information within the network, allowing it to selectively retain or discard information at each time step.
Input Gate (i): Controls new input information’s influence on the cell state. It decides which parts of the current input should be used to update the cell state.
Forget Gate (f) (in LSTM): Decides which previous cell state information to keep or discard. Controls whether information from the previous time step should be forgotten or retained in the cell state.
Output Gate (o): Determines which cell state information should be output or used as the current output.
Gates
In neural network architectures like LSTMs and GRUs, gates are specialized components that regulate the flow of information within the network by controlling how much information is passed along and retained at each time step. The most common types of gates include:
Input Gates: Input gates determine how much new information is added to the memory cell at each time step. They regulate the update of the memory cell state based on the current input and the previous hidden state.
Forget Gates: Forget gates control how much information from the previous memory cell state is retained or forgotten at each time step. They decide which information is relevant to retain and which can be discarded.
Output Gates: Output gates determine how much information from the current memory cell state is passed to the output at each time step. They control the information flow from the memory cell to the output of the network.
The gates output values between 0 and 1 due to the sigmoid function. A value of 0 means “block this information” and a value of 1 means “let this information pass through”. The gates perform element-wise multiplication with both the previous cell state and new information, allowing fine-grained control over what’s kept and what’s added.
Gates play a crucial role in addressing the challenges of capturing long-term dependencies and mitigating the vanishing gradient problem in recurrent neural networks, enabling them to effectively model sequential data and time series.
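A toy illustration of the element-wise gating described above, with small made-up vectors; the sigmoid output acts as a per-element dial between 0 (block) and 1 (pass):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

prev_cell_state = np.array([0.5, -1.2, 2.0])
gate_activation = sigmoid(np.array([4.0, 0.0, -4.0]))   # roughly [0.98, 0.5, 0.02]

gated = gate_activation * prev_cell_state   # element-wise: keep, halve, mostly block
print(gated)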
Generative Adversarial Networks (GANs)
Used in unsupervised machine learning (GANs learn directly from the structure of real data without requiring explicit labels), particularly for generating synthetic data samples that resemble real data. GANs consist of two neural networks: a generator and a discriminator, which are trained simultaneously in a competitive setting. The generator learns to generate realistic-looking data samples from random noise, gradually refining its output to resemble real data. The discriminator learns to distinguish between real data samples and fake ones generated by the generator, improving its ability to spot the differences over time. Through adversarial training, GANs learn to generate high-quality, diverse data samples across various domains, such as images, text, and music, with applications in image synthesis, data augmentation, and creative AI.
Discriminator Training: The discriminator is exposed to both real data samples and fake samples generated by the generator. It learns to classify them as “real” or “fake” based on their features.
Generator Training: The generator attempts to fool the discriminator by generating increasingly realistic samples. It receives feedback from the discriminator, helping it improve its output quality.
GloVe (Global Vectors)
GloVe (Global Vectors for Word Representation) is a word embedding technique used to represent words as dense vectors in a continuous vector space. GloVe learns word embeddings by analyzing the global co-occurrence statistics of words in large text corpora. It captures semantic relationships between words by considering their contextual usage patterns across the entire corpus. GloVe embeddings encode semantic similarities and syntactic relationships between words, making them useful for natural language processing tasks such as word similarity calculation, document classification, and sentiment analysis. GloVe embeddings are pre-trained on large corpora and are widely used in various machine learning applications.
The key principle is capturing the global co-occurrence statistics of words within a large text corpus. Here’s how it works in steps:
- Co-occurrence Matrix: GloVe starts by constructing a matrix where each row represents a word and each column represents another word. The value in each cell indicates how often the words appear together within a certain window size in the corpus.
- Focusing on Ratios: Instead of focusing solely on raw occurrence counts, GloVe considers the ratios of how often words co-occur with each other. This emphasizes the meaningful relationships between words beyond just their frequency.
- Loss Function: GloVe defines a loss function that aims to minimize the difference between the dot product of two word vectors and the logarithm of their co-occurrence probability in the corpus. Training the model tries to find vector representations that respect these global relationships observed in the data.
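In symbols, this weighted least-squares objective is usually written as
J = \sum_{i,j} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
where w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j their bias terms, X_{ij} the co-occurrence count, and f a weighting function that damps very rare and very frequent pairs.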
GloVe aims to create a vector space where the distances and angles between word vectors reflect their semantic relationships (similar words are closer, related words have meaningful angles between them). If the words “ice” and “steam” frequently co-occur with the word “water”, their respective vectors will end up being quite similar after GloVe training.
Hidden layer
Hidden layers are the layers within a neural network that sit between the input layer (where your data enters) and the output layer (where the predictions or results are produced). What makes them “hidden” is that they have no direct contact with the outside: they don’t directly receive data or send outputs outside the network. Hidden layers are where the network discovers intricate patterns and relationships within the data. What occurs in these layers is often hard to interpret, hence the “hidden” aspect.
Like other layers, hidden layers consist of artificial neurons. These neurons perform calculations:
- Take in data (either from the input layer or a previous hidden layer).
- Apply weights, biases, and an activation function.
- Pass on the transformed data to the next layer.
Each hidden layer builds upon the work of the previous one. This hierarchical structure allows the network to learn increasingly complex representations of the data. Hidden layers help break a problem into smaller, more manageable sub-problems that the network can solve incrementally. Activation functions in hidden layers introduce non-linearity, which is crucial for neural networks to model more than just simple linear relationships in data.
Hidden state 𝑎
The state (the information accumulated so far) of the neurons or units in the hidden layers of the network at a given time step.
Hyperbolic Tangent function (TanH)
A mathematical function that smoothly squashes input values between -1 and 1. It looks like a smoother, flattened version of the sigmoid function. TanH is a common activation function often used in hidden layers. TanH is sometimes preferred in RNNs due to its ability to mitigate vanishing/exploding gradient problems a bit better compared to sigmoid.
Why It’s Used in Neural Networks:
Zero-Centered Outputs: Unlike the sigmoid function (0 to 1 range), tanh conveniently outputs both positive and negative values. This can be useful for certain layers or problems. Some problems require the model to signal both increase and decrease relative to something. For example, predicting stock price movement (up or down) would benefit from negative outputs.
Stronger Gradients: Especially near zero, TanH tends to have larger gradients than the sigmoid function. This can sometimes lead to faster training convergence.
Mitigating Vanishing Gradients: While not totally immune, TanH can be a bit better at tackling vanishing gradient issues in some cases compared to sigmoid.
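For reference, the function and its output range:
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad -1 < \tanh(x) < 1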
K-Nearest Neighbours
A simple and intuitive machine learning algorithm used for classification and regression tasks. Given a new data point, KNN predicts its class label or numerical value based on the majority vote or average of its k nearest neighbors in the training dataset. KNN relies on the assumption that similar data points tend to belong to the same class or have similar target values. It is a non-parametric and lazy learning algorithm that does not require training a model explicitly.
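A minimal scikit-learn sketch on a built-in dataset; k (n_neighbors) is the main hyperparameter:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)   # majority vote over the 5 closest points
model.fit(X, y)                               # "lazy": essentially just stores the training data
print(model.predict(X[:3]))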
Linear regression (model)
A supervised learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the input features and the target variable.
Linear regression searches for the line (or hyperplane in higher dimensions) that minimizes the sum of the squared distances (or residuals) between the observed data points and the predicted values on the line. This method is known as the method of least squares.
Linear regression models are commonly used for prediction and inference tasks, and they provide interpretable coefficients that indicate the strength and direction of the relationships between variables.
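A short scikit-learn sketch with made-up data; the fitted coefficient and intercept are the interpretable quantities mentioned above:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one independent variable
y = np.array([2.1, 4.1, 5.9, 8.2])            # roughly y = 2x

model = LinearRegression().fit(X, y)          # least-squares fit
print(model.coef_, model.intercept_)          # slope and intercept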
Long Short-Term Memory (LSTM) unit
Traditional neural networks struggle to handle data with a sequential nature (e.g., text, time series). RNNs address this by having a “memory” mechanism to retain information from previous steps. But long sequences make it hard for RNNs to learn long-range dependencies. Information from earlier steps can fade away as it’s propagated.
LSTMs are an advanced type of RNN cell designed to overcome the vanishing gradient problem. They do not just remember previous states; they have the ability to decide which parts are worth remembering.
Key Components of an LSTM Unit:
1. Cell State: This is the “long-term memory” of the LSTM. It runs through the entire chain, with only minor interactions, keeping information flowing.
2. Gates: These are what make LSTMs special:
Forget Gate: Selectively decides what information from the previous cell state should be discarded.
Input Gate: Determines what new information from the current input should be added to the cell state.
Output Gate: Controls which parts of the updated cell state become part of the output.
How it Works (Simplified)
a) The forget gate looks at the previous hidden state and current input and decides what old information to keep.
b) The input gate processes the current input and creates a “candidate” for updating the cell state.
c) The cell state is updated by combining parts of the old state (what the forget gate didn’t discard) and the new candidate values.
d) The output gate selects relevant parts of the cell state to generate an output.
At their core, LSTMs are neural network layers with a complex internal structure. This includes the cell state and the three gates (forget, input, and output). The gates contain sigmoid and hyperbolic tangent activation functions. Each gate and the calculations for updating the cell state involve sets of weights and biases. These are just like the weights and biases found in other parts of a neural network. LSTMs are trained as part of the overall neural network using the same principles of gradient descent and backpropagation.
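A small usage sketch assuming PyTorch; the returned h_n and c_n are the final hidden state and cell state described above:
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 10, 16)                  # 4 sequences, 10 time steps, 16 features each
output, (h_n, c_n) = lstm(x)                # output: hidden state at every time step
print(output.shape, h_n.shape, c_n.shape)   # (4, 10, 32), (1, 4, 32), (1, 4, 32)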
Naive Bayes Classifier
A probabilistic machine learning model based on Bayes’ theorem and the assumption of conditional independence between features. It calculates the probability of each class label given a set of input features and selects the class label with the highest probability as the predicted label for the input. Despite its simplicity and the naive assumption of feature independence, Naive Bayes classifiers are widely used for text classification, spam filtering, and other classification tasks.
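A tiny text-classification sketch with scikit-learn, using made-up example messages; MultinomialNB applies Bayes’ theorem over word-count features:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["free money now", "meeting at noon", "win a free prize", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                        # 1 = spam, 0 = not spam (illustrative)

vec = CountVectorizer()
X = vec.fit_transform(texts)                 # word-count features
model = MultinomialNB().fit(X, labels)
print(model.predict(vec.transform(["free prize inside"])))   # likely classified as spam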
Named Entity Recognition (NER)
NER is a subfield of Natural Language Processing (NLP) focused on automatically identifying and classifying specific entities within a body of text. These are predefined categories like: People, Organizations (e.g., Google), Locations (e.g., France), Dates & Times (e.g., July 4th, 2023), Quantities (e.g., $1 Million), … and even custom entity types for your specific application.
Various ML algorithms are used for NER, including Traditional ML like Conditional Random Fields (CRFs), Support Vector Machines (SVMs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Transformers. The NER model predicts the type of entity each word or group of words represents (or indicates that it’s not an entity).
Why NER is important:
* NER extracts structured data from unstructured text, which unlocks many applications. It helps tasks like machine translation, question answering, and text summarization.
* Business Applications include: Customer support chatbots can identify key issues and people mentioned. Analyzing legal documents to extract contract terms. Monitoring news feeds for relevant company or market trends.
Challenges
* Ambiguity: Words can belong to different categories depending on context (e.g., ‘Apple’ could be a company or a fruit).
* New Entities: Models need to be adaptable to handle previously unseen entities.
Typically, the output of an NER system might look like this:
Original Text: “John Doe visited Paris on July 4th, 2023 and met with the CEO of Acme Inc.”
NER Output:
* John Doe (Person)
* Paris (Location)
* July 4th, 2023 (Date)
* Acme Inc. (Organization)
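A short example with spaCy, assuming the library and its small English model en_core_web_sm are installed; other NER toolkits expose similar interfaces:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Doe visited Paris on July 4th, 2023 and met with the CEO of Acme Inc.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # labels such as PERSON, GPE (location), DATE, ORG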
Neuron (NN)
A neuron is the most basic processing unit within an artificial neural network. The concept of artificial neurons in neural networks is loosely inspired by biological neurons in the brain. Biological neurons receive signals (inputs) through connections called dendrites, process them, and send an output signal through the axon if a certain threshold is met.
Neural networks learn by adjusting the weights and biases during training. The goal is to find the optimal values that produce the desired output given a specific input. Artificial neural networks are organized into layers: an input layer, one or more hidden layers, and an output layer. Neurons in one layer are connected to neurons in the next, creating a complex network of calculations.
In a neural network, a neuron is a mathematical function that performs the following:
1) Inputs: A neuron receives multiple input values. These inputs could come from raw data (e.g., pixel values of an image) or be the outputs of neurons from a previous layer in the neural network.
2) Weights: Each input is multiplied by a corresponding weight. Weights are like knobs that determine how much influence each input has on the neuron’s output.
3) Summation: The weighted inputs are summed together.
4) Bias: A bias term is added to the sum. The bias acts as an adjustment that controls how readily the neuron activates, independent of its inputs.
5) Activation Function: The result of the summation (and bias) is passed through a non-linear activation function. This function introduces non-linearity into the model, which is essential for neural networks to learn complex patterns. Common activation functions include:
- Sigmoid
- Tanh
- ReLU (Rectified Linear Unit)
6) Output: The output of the activation function is the final output of the neuron. This output can then be sent to neurons in the next layer of the neural network.
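The six steps above in a few lines of numpy (the numbers are arbitrary):
import numpy as np

inputs = np.array([0.5, -1.0, 2.0])      # 1) inputs
weights = np.array([0.8, 0.2, 0.5])      # 2) weights
bias = 0.1                               # 4) bias
z = np.dot(weights, inputs) + bias       # 3) summation, plus the bias
output = max(0.0, z)                     # 5) ReLU activation -> 6) output
print(output)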
Simple Analogy
Imagine a neuron like a decision-maker. Consider the decision of whether to wear a coat outside:
1) Inputs: Temperature, wind speed, likelihood of rain.
2) Weights: How heavily you weigh each factor (you might care more about temperature than wind, etc.)
3) Bias: Your general predisposition towards wearing a coat (some people are more likely to get cold).
4) Activation Function: Your mental model deciding if the combined factors cross a threshold for putting on a coat.
5) Output: The decision – coat or no coat.
Neutralization (bias)
Sometimes called debiasing. Real-world data often contains biases reflecting social prejudices or historical patterns of discrimination.
ML models trained on this biased data learn and perpetuate these biases, resulting in unfair or harmful predictions. Neutralization is a collection of techniques aimed at reducing the influence of these unwanted biases in ML models.
Approaches to Neutralization
Pre-processing: Modifying the training data to be more balanced or remove sensitive attributes.
In-processing: Changing the model’s training process:
Regularization terms to penalize reliance on biased features.
Adversarial learning setups where a part of the model tries to identify biases to help another part counteract them.
Post-processing: Adjusting model outputs to ensure fairness according to specific metrics.
Most importantly, in NLP: word embeddings.
These vector representations of words, which are foundational for many NLP tasks, can capture societal biases. For example, “doctor” might be closer to “man” and “nurse” closer to “woman” in the embedding space.
Debiasing Techniques in NLP
Data Pre-processing
Balanced Corpora: Curating datasets that have more balanced representation of different groups or perspectives.
Data Augmentation: Generating synthetic examples to counterbalance underrepresented groups or viewpoints.
Embedding Debiasing
Geometric Techniques: Realigning word embeddings in the vector space to mitigate biased associations.
Contextualized Embeddings: Instead of static word vectors, using models like BERT that dynamically generate embeddings based on the surrounding sentence, reducing some forms of bias.
Model-level Adjustments
Adversarial Training: Using a setup where one part of the model tries to predict a protected attribute (like gender) from the text, and the other part tries to perform the main task without relying on that protected attribute.
Fairness-aware Regularization: Adding terms to the loss function that penalize biased predictions across groups.
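As a concrete sketch of the geometric approach, one common step (in the style of hard-debiasing methods) removes a word vector’s component along an estimated bias direction; the vectors below are hypothetical placeholders rather than real embeddings:
import numpy as np

rng = np.random.default_rng(0)
word_vec = rng.standard_normal(50)               # e.g. the embedding of "doctor"
bias_dir = rng.standard_normal(50)               # e.g. the "he" - "she" direction
bias_dir /= np.linalg.norm(bias_dir)

projection = (word_vec @ bias_dir) * bias_dir    # component of the word along the bias direction
neutralized = word_vec - projection              # remove it, keeping the rest of the meaning
print(round(float(neutralized @ bias_dir), 6))   # ~0: no remaining component along the bias direction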