Terminology Flashcards
Watermarking
embedding unique, identifiable signals that are invisible to humans into AI-generated content
System card
explains how a group of models works together to form a system (similar to model cards)
Synthetic data
artificially created data that mimics the statistical properties of real-world data and minimizes privacy risks
Retrieval-augmented generation (RAG)
framework that enhances LLMs by supplementing the prompt with reference material that is generally not included in the training data, yielding more accurate outputs
(think uploading a doc for summary)
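A minimal sketch of the retrieve-then-generate flow; retrieve and llm are hypothetical placeholders for a document search step and a language model, not real library APIs:

```python
# Hypothetical RAG flow: look up reference material, add it to the prompt, then generate.
def answer_with_rag(question, retrieve, llm):
    context = retrieve(question)           # 1. fetch reference material (e.g. an uploaded doc)
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {question}")      # 2. supplement the prompt with that material
    return llm(prompt)                      # 3. generate a grounded answer
```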
Prompt engineering
intentional process of structuring detailed instructions, sequences, and keywords to obtain specific outputs
Prompt
user input or instruction to generate an output
Adaptive learning
ML model that learns a student's strengths and weaknesses to tailor personalized instruction and content
Variance
statistical measure of the spread of numbers from the average value
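As a quick reference (population variance; sample variance divides by n - 1 instead), for values x_1 … x_n with mean x̄:

```latex
\mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
```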
Random forest
Supervised ML algorithm that builds multiple decision trees from random subsets of the data and merges them to get more accurate and stable predictions
(useful for data sets with missing data)
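A minimal sketch assuming scikit-learn is available; the iris dataset and settings are just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 100 decision trees, each built on a random bootstrap sample of rows and a random
# subset of features; their votes are merged into a single, more stable prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```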
Greedy algorithm
Makes the locally optimal choice at each step for the immediate objective, ignoring whether that leads to the best long-term solution
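A toy illustration (hypothetical coin-change example): the greedy choice happens to be optimal for these denominations, but greedy strategies are not guaranteed to find the global optimum in general.

```python
# Greedy coin change: always take the largest coin that still fits (the locally optimal choice).
def greedy_change(amount, coins=(25, 10, 5, 1)):
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            picked.append(coin)
    return picked

print(greedy_change(68))  # [25, 25, 10, 5, 1, 1, 1]
```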
Entropy
Measure of unpredictability or randomness in an ML dataset
(Higher entropy == greater uncertainty in predictions)
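For reference, the Shannon entropy of a variable with class probabilities p_1 … p_k (0 when one class is certain, maximal when all classes are equally likely):

```latex
H = -\sum_{i=1}^{k} p_i \log_2 p_i
```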
Bootstrap aggregating
ML method that aggregates multiple versions of a model trained on random subsets of data to make it more stable and accurate
Also called “bagging”
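A small sketch of the bootstrap step (assuming NumPy); each resample draws rows with replacement, and one model would be trained per resample before aggregating their predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for training rows

# Three bootstrap samples: same size as the original data, drawn with replacement.
bootstrap_samples = [rng.choice(data, size=len(data), replace=True) for _ in range(3)]
print(bootstrap_samples)
```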
Active learning
Subfield of ML where the algorithm chooses the data it learns from
Also called “query learning”
Algorithm
set of instructions and rules designed to perform a task
Corpus
large collection of texts and data AI uses to find patterns and make predictions
Inference
process by which a trained ML model produces outputs (predictions or decisions) from new data
Input data
data provided to the model, which is the basis of ML “learning”
Labeled data
data with labels, tags, or classes that provide context or meaning for the model
ML model
learned representation of patterns and relationships underlying the data
Training data
subset of data used to train an ML model, from which the model learns patterns and relationships it can use to make predictions
typically 60-80% of the data set
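A common way to carve out the split (a sketch assuming scikit-learn; the synthetic dataset is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

# Keep 80% of the rows for training; hold out the remaining 20% for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
```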
Supervised learning
training a model on pre-labeled data
Example: spam or ham
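A toy spam/ham sketch (hypothetical example texts, assuming scikit-learn): the labels are known in advance, and the model learns to map text to them.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 3pm tomorrow",
         "claim your free reward", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]  # pre-labeled training data

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free prize waiting"]))  # likely "spam"
```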
Data labeling
Enriching data with labels for training, validating, and testing
Semi-supervised learning
using both labeled and unlabeled datasets to train a model to improve reliability while keeping costs down
Unsupervised learning
no pre-labeled data; the model extracts features and groups similar data points on its own
Example: group animals by type, color, or tails
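A minimal clustering sketch (assuming scikit-learn; the animal features are made up): no labels are given, so the algorithm groups points purely by similarity.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical animals described by [weight_kg, tail_length_cm]
animals = np.array([[4, 25], [5, 30], [300, 90], [320, 100]])

# KMeans groups the rows into 2 clusters without being told what the groups mean.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(animals)
print(clusters)  # e.g. small animals vs. large animals
```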