AI moves fast, and the language around it moves even faster. This glossary covers the terms you actually need, from agent orchestration and MCP to RAG, function calling, and evals, ordered by what matters most in practice.
AI has developed its own language.
Not just the obvious terms like LLM or prompt, but a growing layer of insider vocabulary: agents, RAG, MCP, function calling, fine-tuning, evals, context engineering, and dozens more. The problem is that these terms are often used loosely, inconsistently, and sometimes in ways that are flat-out wrong.
That creates a gap. People want to understand how modern AI actually works, but the language around it is fragmented, technical, and filled with jargon that gets repeated faster than it gets explained.
This glossary is meant to close that gap.
It starts with the terms that matter most in today’s AI products, especially agents, tools, orchestration, and retrieval, then moves into the deeper model, training, evaluation, and infrastructure concepts underneath them. The goal is simple: make the vocabulary of AI clear enough that you can actually follow the conversation.
1. Agents, Tools, and Orchestration
AI Agent
An AI agent is a system that uses a model plus tools, memory, and decision logic to pursue a goal and take actions. A chatbot that only answers questions is not really an agent. A system that can read your inbox, decide what matters, call tools, and follow through on a task is.
Agent Orchestration
Agent orchestration is the coordination layer that manages how models, tools, memory, and workflows work together. It is what keeps an agent from being just a smart text generator and turns it into a system that can actually complete tasks.
Agent Loop
The agent loop is the repeating cycle of observing the situation, reasoning about what to do next, taking an action, and updating context. This loop is the core operating pattern of most agent systems.
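The loop can be sketched in a few lines. Everything below is a stand-in, not any particular framework: the `decide` function plays the role of the model's reasoning, and the `tools` dict plays the role of real integrations.

```python
# A minimal sketch of the agent loop: observe -> reason -> act -> update.
# `decide` and `tools` are hypothetical stand-ins for a model and real tools.

def run_agent(goal, decide, tools, max_steps=5):
    """Repeat observe/reason/act until the decision logic says we're done."""
    context = [f"goal: {goal}"]
    for _ in range(max_steps):
        action, arg = decide(context)           # reason about what to do next
        if action == "finish":
            return arg
        observation = tools[action](arg)        # take an action via a tool
        context.append(f"{action}({arg}) -> {observation}")  # update context
    return None

# Toy decision logic: look something up once, then finish with the result.
def decide(context):
    if len(context) == 1:
        return ("lookup", "capital of France")
    return ("finish", context[-1].split("-> ")[1])

tools = {"lookup": lambda q: "Paris" if "France" in q else "unknown"}
print(run_agent("answer a question", decide, tools))  # -> Paris
```

The important part is the shape, not the toy logic: each pass through the loop adds what just happened back into the context, so the next reasoning step sees the updated state of the task.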
Tool Use
Tool use means the model can call external systems instead of relying only on its own internal knowledge. This could mean searching the web, querying a calendar, writing to a CRM, or sending an email.
Function Calling
Function calling is a structured way for a model to choose a tool and pass arguments to it in a format software can reliably execute. It is one of the key bridges between language models and real product functionality.
Tool Schema
A tool schema is the formal definition of what a tool does and what inputs it accepts. It tells the model how a tool should be called.
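Here is what the two ideas look like together. The schema below uses the JSON-Schema-like shape several APIs have converged on, but the field names and the `get_weather` tool are invented for illustration, not any specific vendor's format.

```python
import json

# A hypothetical tool schema: name, description, and typed parameters.
get_weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# A function call as a model might emit it: a tool name plus JSON arguments.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'

def execute(call_json, registry):
    """Parse the model's structured call and dispatch it to real code."""
    call = json.loads(call_json)
    return registry[call["name"]](**call["arguments"])

# Canned stub standing in for a real weather integration.
registry = {"get_weather": lambda city, unit="celsius": f"12 {unit} in {city}"}
print(execute(model_output, registry))  # -> 12 celsius in Oslo
```

The schema is what the model reads; the registry is what the software runs. Function calling is the handshake between the two.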
Tool Router
A tool router is the logic that decides which tool or toolset should be used for a task. In simple systems this is rule-based. In more advanced systems a model may help make the decision.
Planner
A planner is the part of an agent system that breaks a task into steps. If a user asks something complex, the planner decides how to approach it rather than trying to do everything in one jump.
Executor
The executor is the part of the system that carries out a planned step, often by calling tools, running prompts, or triggering sub-agents.
Controller
The controller governs the overall flow of the agent. It decides when to continue, when to stop, when to retry, and when to escalate to a human.
Workflow
A workflow is a defined sequence of steps used to complete a task. Some workflows are fixed. Others are dynamic and adapt based on what happens along the way.
Multi-Agent System
A multi-agent system is a setup where multiple specialized agents collaborate. One agent might handle planning, another retrieval, another calendar operations, and another messaging.
Sub-Agent
A sub-agent is a specialized agent invoked by another agent for a narrower responsibility. It is like delegating a task to a specialist.
Human in the Loop (HITL)
Human in the loop means a human reviews, approves, or corrects certain AI actions. This is especially important for sensitive tasks like sending emails, making purchases, or changing important records.
Computer Use
Computer use refers to an agent's ability to interact with graphical interfaces the way a person would: clicking buttons, filling out forms, or navigating websites.
Toolformer Pattern
This is the general pattern where models learn or are designed to decide when and how to use tools during task execution.
Context Bridge
A context bridge is the mechanism that passes relevant information between the model and external systems. It helps the model stay aware of the state of the world outside its own context window.
Model Context Protocol (MCP)
MCP is an open protocol for connecting models to external tools and data sources in a standardized way. It is basically an interoperability layer for model-powered systems.
MCP Server
An MCP server exposes tools or resources over the MCP standard so a model-enabled client can use them.
MCP Client
An MCP client is the application layer that connects a model runtime to one or more MCP servers.
Agentic RAG
Agentic RAG is a retrieval-augmented system where the agent dynamically decides what information to retrieve, when to retrieve it, and how to use it across multiple steps.
Context Engineering
Context engineering is the design of how instructions, tools, memory, retrieved information, and system state are assembled so an agent can perform reliably. It is one of the most important ideas in modern AI product building.
Harness
In AI, a harness usually refers to an evaluation or testing framework used to run prompts, tasks, and measurements against models or agents in a repeatable way.
2. Prompting and Interaction Terms
Prompt
A prompt is the input given to a model. It can include instructions, examples, context, and formatting constraints.
Prompt Engineering
Prompt engineering is the practice of designing prompts to improve output quality, reliability, structure, or behavior.
System Prompt
The system prompt contains high-priority instructions that define the model's role, rules, and behavioral constraints.
User Prompt
The user prompt is the direct input from the user.
Assistant Message
The assistant message is the model's reply in a conversational exchange.
Prompt Template
A prompt template is a reusable prompt structure with variable placeholders.
Few-Shot Prompting
Few-shot prompting means giving the model a small number of examples to show it what kind of answer or behavior you want.
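A few-shot prompt is usually built from a template plus a handful of labeled examples placed before the real query. The sentiment task and examples below are invented for illustration.

```python
# Assembling a few-shot prompt: two labeled examples show the model the
# task format, then the real query is appended in the same shape.

examples = [
    ("I love this product!", "positive"),
    ("Terrible support experience.", "negative"),
]

def few_shot_prompt(query):
    shots = "\n\n".join(f"Review: {text}\nSentiment: {label}"
                        for text, label in examples)
    return f"{shots}\n\nReview: {query}\nSentiment:"

print(few_shot_prompt("Shipping was fast and easy."))
```

Ending the prompt mid-pattern, right after "Sentiment:", nudges the model to complete it with a label in the same format as the examples.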
Zero-Shot Prompting
Zero-shot prompting means asking the model to perform a task without giving examples.
Chain-of-Thought Prompting
Chain-of-thought prompting encourages the model to reason through intermediate steps. In practice, teams often want the benefits of structured reasoning without exposing all of it directly to users.
Role Prompting
Role prompting assigns the model a role or perspective, such as "act like a recruiter" or "act like a product strategist."
Structured Prompting
Structured prompting means organizing a prompt into clear sections, such as task, context, constraints, tools, and output format.
Delimiter
A delimiter is a marker used to separate parts of a prompt, such as XML tags, headings, or clear section breaks.
Output Schema
An output schema is a predefined response structure the model is expected to follow.
Structured Outputs
Structured outputs are responses constrained to a defined schema so the result is predictable and machine-readable.
JSON Mode
JSON mode is a generation mode that pushes the model to return valid JSON.
Grounding
Grounding means tying the model's answer to trusted context, documents, or external sources rather than letting it guess freely.
Hallucination
A hallucination is when a model generates false, fabricated, or unsupported information as if it were true.
Prompt Injection
Prompt injection is when malicious or unintended instructions are hidden inside user input or retrieved content to manipulate the model's behavior.
Jailbreak
A jailbreak is a prompt designed to bypass safety policies or behavioral constraints.
Stop Sequence
A stop sequence is a token pattern that tells the model when to stop generating.
Streaming
Streaming means returning output incrementally as the model generates it instead of waiting for the full answer.
Temperature
Temperature controls how random or conservative the output is. Lower temperature is usually more predictable. Higher temperature is usually more varied.
Top-k Sampling
Top-k sampling limits the next-token choices to the top k most likely options before selecting one.
Top-p Sampling
Top-p sampling limits the next-token choices to the smallest set of tokens whose combined probability passes a threshold.
Greedy Decoding
Greedy decoding always picks the single most likely next token.
Decoding
Decoding is the process of turning model probability distributions into actual output tokens.
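The last few terms fit together in one picture: softmax (scaled by temperature) turns logits into probabilities, and a decoding strategy picks the next token from them. The vocabulary and logits below are made up; real models work over tens of thousands of tokens.

```python
import math
import random

vocab  = ["the", "cat", "sat", "on", "mat"]
logits = [2.0, 1.0, 0.5, 0.2, -1.0]   # invented raw scores for each token

def softmax(scores, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Always pick the single most likely token."""
    return vocab[probs.index(max(probs))]

def top_k(probs, k=2):
    """Sample from only the k most likely tokens."""
    ranked = sorted(zip(vocab, probs), key=lambda x: -x[1])[:k]
    tokens, weights = zip(*ranked)
    return random.choices(tokens, weights=weights)[0]

def top_p(probs, p=0.9):
    """Sample from the smallest set whose combined probability reaches p."""
    ranked = sorted(zip(vocab, probs), key=lambda x: -x[1])
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights)[0]

probs = softmax(logits, temperature=0.7)
print(greedy(probs))   # -> the (deterministic: highest probability)
print(top_k(probs))    # stochastic: varies between the top 2 tokens
```

Greedy decoding is deterministic; top-k and top-p are stochastic, which is why the same prompt can produce different outputs across runs.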
Determinism
Determinism refers to how consistently the same input produces the same output.
Stochastic Output
Stochastic output means the answer may vary across runs because of randomness in sampling.
Max Tokens
Max tokens is the limit on how many tokens the model can generate in a response.
Token Budget
The token budget is the total amount of input and output text that can fit within a request.
Context Window
The context window is the maximum amount of text a model can consider at once.
Context Truncation
Context truncation happens when part of the prompt or conversation is cut off to fit the context window.
Context Compaction
Context compaction means summarizing or compressing information so more relevant context fits into the model's window.
3. Retrieval, Search, and Knowledge Systems
Retrieval-Augmented Generation (RAG)
RAG is an approach where a model retrieves external information and uses it to generate a more grounded answer.
Retriever
The retriever is the component that finds relevant chunks of information for a query.
Re-Ranker
A re-ranker reorders retrieved results to improve relevance before passing them to the model.
Knowledge Base
A knowledge base is the collection of indexed information an AI system can retrieve from.
Vector Database
A vector database stores embeddings and supports semantic search across them.
Vector Search
Vector search finds semantically similar items based on embeddings rather than exact keyword matches.
Embedding Model
An embedding model converts text, images, or other data into vectors that capture meaning.
Embedding
An embedding is the numeric vector representation of content.
Semantic Search
Semantic search finds results based on meaning rather than exact word overlap.
Similarity Score
A similarity score measures how close two vectors or pieces of content are.
Cosine Similarity
Cosine similarity is a common metric used to compare the directional similarity of vectors.
Nearest Neighbor Search
Nearest neighbor search finds the most similar vectors to a query vector.
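Cosine similarity and exact nearest neighbor search fit in a few lines. The 3-dimensional "embeddings" below are invented; real embeddings have hundreds or thousands of dimensions, but the math is identical.

```python
import math

def cosine_similarity(a, b):
    """Directional similarity of two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy index of embeddings; in practice this lives in a vector database.
index = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car":   [0.0, 0.1, 0.9],
}

def nearest_neighbor(query_vec):
    """Exact (brute-force) search: compare against every stored vector."""
    return max(index, key=lambda k: cosine_similarity(query_vec, index[k]))

print(nearest_neighbor([0.7, 0.3, 0.1]))  # -> puppy
```

Brute-force search compares the query against every vector, which is why large indexes switch to approximate nearest neighbor methods that trade a little exactness for speed.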
Approximate Nearest Neighbor (ANN)
ANN is a faster form of nearest neighbor search that trades a bit of exactness for speed.
Hybrid Search
Hybrid search combines semantic search with keyword or lexical search.
Chunking
Chunking is the process of splitting documents into smaller pieces before embedding or retrieval.
Chunk Size
Chunk size is the amount of text included in each chunk.
Chunk Overlap
Chunk overlap repeats some text between adjacent chunks to preserve continuity.
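Chunk size and overlap show up directly as the two parameters of a chunking function. This is a character-based sketch; production systems often chunk by tokens, sentences, or document structure instead.

```python
def chunk_text(text, chunk_size=20, overlap=5):
    """Split text into chunks of `chunk_size` characters, repeating the
    last `overlap` characters of each chunk at the start of the next."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

text = "Retrieval systems split long documents into pieces before embedding."
chunks = chunk_text(text)
print(chunks[0][-5:] == chunks[1][:5])  # True: adjacent chunks share 5 chars
```

The overlap means a sentence cut at a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.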
Metadata Filtering
Metadata filtering narrows retrieval using attributes like date, author, source, or file type.
Ground Truth
Ground truth is the verified reference data used to judge whether retrieval or output is correct.
Context Hydration
Context hydration means loading the most relevant retrieved information into the prompt at the right time.
4. LLM and Transformer Basics
Artificial Intelligence (AI)
AI is the broad field focused on building systems that perform tasks associated with human intelligence.
Machine Learning (ML)
Machine learning is the part of AI where systems learn patterns from data instead of being explicitly programmed with every rule.
Deep Learning
Deep learning is a branch of machine learning that uses multi-layer neural networks.
Foundation Model
A foundation model is a large pretrained model that can be adapted to many tasks.
Large Language Model (LLM)
An LLM is a model trained on large text datasets to understand and generate language.
Generative AI
Generative AI refers to systems that create new content such as text, images, code, audio, or video.
Transformer
The transformer is the neural network architecture behind most modern language models.
Neural Network
A neural network is a model made of interconnected computational units that transform inputs into outputs.
Model
A model is a trained system that maps inputs to predictions or generated outputs.
Base Model
A base model is the pretrained model before additional alignment or instruction tuning.
Instruction-Tuned Model
An instruction-tuned model has been further trained to follow prompts and tasks more reliably.
Chat Model
A chat model is a model tuned for conversational interaction.
Multimodal Model
A multimodal model can work with more than one data type, such as text and images.
Autoregressive Model
An autoregressive model predicts the next token based on previous tokens.
Next-Token Prediction
Next-token prediction is the core training objective used by many language models.
Token
A token is a chunk of text processed by the model.
Tokenizer
The tokenizer converts raw text into tokens and back again.
Vocabulary
The vocabulary is the full set of tokens recognized by the tokenizer.
Attention
Attention is the mechanism that lets a model focus on relevant parts of the input.
Attention Mask
An attention mask controls which tokens the model is allowed to pay attention to.
Positional Encoding
Positional encoding gives the model information about token order.
Hidden State
A hidden state is an internal vector representation created as the model processes input.
Parameter
A parameter is a learned weight inside the model.
Logit
A logit is the raw score the model assigns to each possible next token before normalization.
Log Probability (logprob)
Log probability is the logarithm of the probability the model assigned to a token or sequence.
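The chain from logits to probabilities to logprobs is short enough to show directly. The logits below are invented scores over a tiny made-up vocabulary; softmax turns them into probabilities, and taking the log gives the logprobs many APIs expose for scoring and debugging.

```python
import math

logits = {"paris": 5.0, "london": 2.0, "rome": 1.0}  # invented raw scores

# Softmax: exponentiate and normalize so the scores sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {t: math.exp(v) / total for t, v in logits.items()}
logprobs = {t: math.log(p) for t, p in probs.items()}

print(max(probs, key=probs.get))  # -> paris (highest logit wins)
print(logprobs["paris"] > logprobs["london"])  # True: likelier tokens
                                               # have logprobs closer to 0
```

Logprobs are always negative (or zero), and a value near zero means the model was confident in that token.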
5. Training, Alignment, and Adaptation
Training
Training is the process of adjusting model parameters using data.
Inference
Inference is the process of running a trained model on new input to get an output.
Dataset
A dataset is the collection of examples used for training, validation, or testing.
Feature
A feature is an input variable or representation used by a model.
Label
A label is the target output a supervised model is trained to predict.
Supervised Learning
Supervised learning uses labeled examples with known outputs.
Unsupervised Learning
Unsupervised learning finds structure in data without labels.
Semi-Supervised Learning
Semi-supervised learning combines labeled and unlabeled data.
Self-Supervised Learning
Self-supervised learning creates training signals from the data itself.
Reinforcement Learning (RL)
Reinforcement learning optimizes behavior based on reward from interacting with an environment.
Classification
Classification predicts a category.
Regression
Regression predicts a continuous value.
Clustering
Clustering groups similar data points without predefined labels.
Dimensionality Reduction
Dimensionality reduction compresses data into fewer variables while preserving useful structure.
Pretraining
Pretraining is the initial large-scale training stage on broad data.
Fine-Tuning
Fine-tuning is additional training on task-specific data.
Supervised Fine-Tuning (SFT)
SFT fine-tunes a model using prompt-response examples with known good answers.
Post-Training
Post-training refers to optimization steps after the main pretraining phase.
Transfer Learning
Transfer learning reuses knowledge learned in one domain or task for another.
Domain Adaptation
Domain adaptation improves model performance for a specific domain such as finance, law, or medicine.
Distillation
Distillation trains a smaller model to imitate a larger one.
Alignment
Alignment is the effort to make model behavior match intended goals, values, or rules.
RLHF
RLHF stands for Reinforcement Learning from Human Feedback, where human preference data is used to improve model behavior.
RLAIF
RLAIF stands for Reinforcement Learning from AI Feedback, where AI-generated judgments are used in the training loop.
Constitutional AI
Constitutional AI is an alignment approach where a model critiques and revises outputs according to explicit principles.
Preference Model
A preference model scores which outputs humans are more likely to prefer.
Reward Model
A reward model estimates how good a given output is for reinforcement learning.
Policy Model
The policy model is the model being optimized to choose actions or outputs.
Hyperparameter
A hyperparameter is a setting chosen by developers rather than learned during training.
Learning Rate
The learning rate controls how large each training update is.
Epoch
An epoch is one full pass through the training dataset.
Checkpoint
A checkpoint is a saved snapshot of model weights.
Generalization
Generalization is how well a model performs on unseen data.
Overfitting
Overfitting happens when a model learns the training data too specifically and performs poorly on new data.
Underfitting
Underfitting happens when a model fails to learn enough from the training data.
6. Efficient Tuning and Model Optimization
PEFT
PEFT stands for Parameter-Efficient Fine-Tuning, a family of methods that adapts a model while changing only a small subset of its weights.
LoRA
LoRA stands for Low-Rank Adaptation, a PEFT method that learns small update matrices instead of changing the full model.
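The arithmetic behind the idea fits in a toy example. Instead of updating a full d x d weight matrix W, LoRA learns a d x r matrix A and an r x d matrix B with r much smaller than d, and uses W + A @ B as the effective weights. The matrices and dimensions below are invented and tiny; this is the shape of the trick, not a training implementation.

```python
# LoRA sketch with pure-Python matrices (no ML libraries).

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def matadd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

d, r = 4, 1  # full dimension vs. low rank

# Frozen pretrained weights (identity matrix for illustration).
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

A = [[0.1] for _ in range(d)]   # d x r, trainable
B = [[0.2, 0.0, 0.0, 0.0]]      # r x d, trainable

W_adapted = matadd(W, matmul(A, B))  # effective weights after adaptation
print(W_adapted[0][0])  # 1.0 shifted by the low-rank update (0.1 * 0.2)

# Full fine-tuning would update d*d = 16 values; LoRA trains d*r + r*d = 8.
# At realistic sizes (d in the thousands, r around 8-64) the savings
# are dramatic, which is why LoRA is the most widely used PEFT method.
```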
Adapter
An adapter is a lightweight trainable module inserted into a model for task-specific adaptation.
Quantization
Quantization reduces numeric precision to make models smaller and faster.
Post-Training Quantization (PTQ)
PTQ quantizes a trained model without fully retraining it.
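A sketch of symmetric 8-bit quantization, the idea behind many PTQ schemes: map floats onto integers in [-127, 127] with a single scale factor, then dequantize and see how much precision was lost. The weights are made-up values.

```python
def quantize(weights):
    """Map floats to int8-range integers using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [v * scale for v in q]

weights = [0.513, -1.27, 0.034, 0.98]   # toy weight values
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)                     # small integers instead of 32-bit floats
print(max_err <= scale / 2)  # True: rounding error stays within half a step
```

Storing each weight as one byte instead of four is where the memory savings come from; the price is the small rounding error measured above.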
Pruning
Pruning removes less important weights or connections to reduce size or cost.
Sparsity
Sparsity refers to a model where many weights or activations are zero or near zero.
Compression
Compression is the general process of reducing model or data size.
Model Parallelism
Model parallelism splits a model across multiple devices.
Data Parallelism
Data parallelism replicates the model across devices and splits the training data between them.
Mixed Precision
Mixed precision uses different numeric precisions together for better efficiency.
KV Cache
The KV cache stores the attention keys and values computed for earlier tokens so they do not have to be recomputed at every generation step, which makes responses faster.
Context Length Scaling
Context length scaling refers to techniques that help models handle longer inputs more effectively.
Inference Optimization
Inference optimization includes techniques that reduce runtime cost or latency.
Serving Stack
The serving stack is the software layer that hosts and delivers model inference.
Model Serving
Model serving means running the model in production so applications can call it reliably.
GPU
A GPU is a highly parallel processor commonly used for training and inference.
TPU
A TPU is Google's specialized hardware for machine learning workloads.
Accelerator
An accelerator is any hardware optimized for AI or high-performance compute.
Memory Footprint
Memory footprint is the amount of memory the system needs during execution.
Latency
Latency is how long it takes to get a response.
Throughput
Throughput is how much inference work can be completed in a given amount of time.
Batching
Batching means processing multiple requests together for efficiency.
Deployment
Deployment is the act of releasing a model or AI system into production.
Production
Production is the live environment where real users depend on the system.
Canary Release
A canary release rolls out a change to a small subset of traffic first.
A/B Test
An A/B test compares two versions of a model, workflow, or experience to see which performs better.
Rollback
A rollback reverts to an earlier version after a bad change.
Versioning
Versioning tracks different versions of models, prompts, workflows, or datasets.
Rate Limit
A rate limit caps how many requests can be made in a given time period.
Quota
A quota is the allowed amount of usage, such as tokens, requests, compute, or storage.
7. Evaluation, Reliability, and Observability
Evaluation (Eval)
An eval is a test used to measure whether a model or AI system performs as intended.
Benchmark
A benchmark is a standardized task set used to compare systems or models.
Offline Eval
An offline eval is run on stored datasets instead of live traffic.
Online Eval
An online eval is performed in production or with live users.
Human Eval
A human eval uses people to judge quality, correctness, or preference.
Model Grader
A model grader is a model used to score or critique another model's output.
Pass Rate
Pass rate is the percentage of cases that meet a defined success standard.
Precision
Precision measures how many predicted positives or retrieved results are actually correct.
Recall
Recall measures how many of the truly relevant items were successfully found.
Accuracy
Accuracy measures the fraction of overall predictions that are correct.
F1 Score
F1 score balances precision and recall into a single metric.
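Precision, recall, and F1 come straight from true/false positive and negative counts. The numbers below are made up: imagine a retriever that returned 8 results of which 6 were relevant (2 false positives) and missed 4 relevant items (false negatives).

```python
def precision(tp, fp):
    """Of everything returned, what fraction was correct?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything relevant, what fraction was found?"""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

p = precision(tp=6, fp=2)   # 0.75
r = recall(tp=6, fn=4)      # 0.6
print(round(f1(p, r), 3))   # 0.667
```

The harmonic mean punishes imbalance: a system with 1.0 precision but 0.1 recall scores far worse on F1 than one with 0.5 of each.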
Reliability
Reliability is how consistently a system behaves as expected.
Regression
A regression is a performance drop introduced by a change.
Failure Mode
A failure mode is a repeatable way a system breaks.
Trace
A trace is a record of the steps, tool calls, and decisions taken during execution.
Observability
Observability is the ability to inspect system behavior through logs, metrics, traces, and events.
Guardrail
A guardrail is a control layer that constrains outputs, tool use, or behavior.
Fallback
A fallback is a backup behavior used when the preferred path fails.
Latency SLA
A latency SLA is the target or guarantee for response time.
8. Safety, Governance, and Deployment
Safety Policy
A safety policy defines what the system is allowed or not allowed to generate or do.
Moderation
Moderation is the detection and handling of harmful or policy-violating content.
Content Filter
A content filter screens unsafe or disallowed inputs or outputs.
Red Teaming
Red teaming is adversarial testing designed to expose weaknesses or abuse paths.
Prompt Security
Prompt security is the practice of defending prompts, tools, and context against attacks such as injection or data exfiltration.
Data Leakage
Data leakage is the unintended exposure of sensitive or private information.
PII
PII stands for Personally Identifiable Information, meaning data that can identify a person.
Privacy-Preserving AI
Privacy-preserving AI uses methods that reduce exposure or misuse of sensitive data.
Access Control
Access control governs who or what can use a model, tool, or dataset.
Policy Enforcement
Policy enforcement automatically applies rules during model operation.
Alignment Risk
Alignment risk is the risk that a model optimizes for something other than the intended objective or values.
Governance
Governance is the broader system of policies, controls, accountability, and oversight used to manage AI responsibly.
9. Core AI and ML Foundations
Artificial Neural Network
Another name for a neural network, especially in more formal ML language.
Foundation Model Stack
The set of components around a foundation model, including serving, safety, orchestration, retrieval, and evaluation layers.
Model Stack
The broader application stack that includes the model plus all the systems around it.
Feature Engineering
Feature engineering is the process of selecting or transforming inputs for classical machine learning models.
Training Split
The training split is the portion of the data used to fit the model.
Validation Split
The validation split is used during development to tune the model and compare options.
Test Split
The test split is used at the end to estimate real-world performance.
Loss Function
The loss function measures how wrong the model's predictions are during training.
Optimization
Optimization is the process of updating model parameters to reduce loss.
Gradient
A gradient tells the model how to adjust its parameters to reduce error.
Backpropagation
Backpropagation is the algorithm used to send error information backward through a neural network during training.
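Loss, gradient, epoch, and learning rate all appear in even the smallest training loop. The example below fits a one-parameter model y = w * x to toy data by gradient descent on mean squared error; real backpropagation computes these gradients automatically through every layer of a network.

```python
# Toy data following the true relationship y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0               # the single learnable parameter
learning_rate = 0.05  # how large each update step is

for epoch in range(100):  # one full pass over the data per epoch
    # Gradient of mean squared error loss with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad  # step against the gradient to reduce loss

print(round(w, 3))  # converges to roughly 2.0, the true slope
```

Each update moves w a small step in the direction that reduces the loss; over many epochs the parameter settles near the value that best explains the data.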
Weights
Weights are the learned numeric values that shape model behavior.
Bias Term
A bias term is an additional learned value that helps the model shift outputs.
Activation Function
An activation function determines how a neural network node transforms its input.
Regularization
Regularization is a technique used to reduce overfitting.
Data Drift
Data drift happens when the input data in production changes from what the model saw during training.
Concept Drift
Concept drift happens when the relationship between inputs and outputs changes over time.
Distribution Shift
Distribution shift is the general change between training-time and real-world data distributions.
Calibration
Calibration is how well a model's confidence matches actual correctness.
Ensemble
An ensemble combines multiple models or methods to improve performance or robustness.
Baseline
A baseline is the simple reference system you compare improvements against.
Oracle
In evaluation, an oracle is the ideal or best-possible reference outcome.
Latent Space
Latent space is the internal representational space where models encode patterns and meaning.
Representation Learning
Representation learning is the process of learning useful internal encodings of data automatically.
Foundation Model Adaptation
Foundation model adaptation is the broader set of methods used to tailor a base model to a specific use case.
Why this glossary matters
Most people think AI is mainly about the model.
That is outdated.
In practice, the most important part of modern AI products is not just the model. It is the system around the model: the orchestration layer, the retrieval layer, the tool layer, the evaluation layer, and the safety layer.
That is why terms like agent orchestration, MCP, RAG, function calling, context engineering, and evals matter so much right now. They are the language of turning raw model capability into something useful, reliable, and commercially valuable.
If you understand those terms first, the rest of the field starts to make a lot more sense.