Guardrails
Protect your agents with LLM-powered content moderation and policy enforcement.
Guardrails are specialized agents that evaluate input and output content to enforce policies, filter harmful content, and ensure your agents operate within defined boundaries.
What are Guardrails?
A Guardrail Agent uses an LLM to evaluate messages against a policy you define; a custom expression then interprets the LLM's verdict as pass or fail. When the evaluation fails, the guardrail can abort the conversation and return a predefined response.
         GUARDRAIL FLOW

   ┌─────────┐
   │  User   │
   │  Input  │
   └────┬────┘
        │
        ▼
   ┌─────────────┐
   │  Guardrail  │  ← Evaluates content against policy
   │    Agent    │
   └──────┬──────┘
          │
     ┌────┴────┐
     │         │
  PASS ✓     FAIL ✗
     │         │
     ▼         ▼
 ┌───────┐   ┌─────────────────┐
 │ Main  │   │  Response on    │
 │ Agent │   │ Failure Message │
 └───────┘   └─────────────────┘

Use Cases
| Use Case | Description |
|---|---|
| Content Moderation | Filter harmful, toxic, or inappropriate content |
| Policy Enforcement | Ensure conversations stay within business guidelines |
| Input Validation | Check that user inputs meet requirements |
| Output Filtering | Verify agent responses before sending to users |
| Compliance | Enforce regulatory or legal requirements |
Configuration
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| connectorId | Select | Yes | LLM provider connector |
| model | Select | Yes | Model for evaluation |
| systemMessage | Markdown | Yes | Instructions for the guardrail LLM |
| userPromptMessage | Text | No | Message to evaluate (uses {{ctx.user_prompt_message}} by default) |
| temperature | Slider | No | Model temperature (0.0-2.0, default: 0.7) |
| maxTokens | Number | No | Maximum tokens for evaluation |
| expression | Text | Yes | Expression to evaluate the LLM response |
| responseOnFailure | Text | Yes | Message returned when guardrail fails |
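As a rough illustration of how these parameters fit together, the sketch below collects them into a single TypeScript object. The GuardrailConfig type, the connector id, and the model name are assumptions made for the example, not the platform's actual schema.

```typescript
// Illustrative guardrail configuration; field names mirror the parameter
// table above, while the type and the concrete values are assumptions.
interface GuardrailConfig {
  connectorId: string;
  model: string;
  systemMessage: string;
  userPromptMessage?: string;
  temperature?: number;
  maxTokens?: number;
  expression: string;
  responseOnFailure: string;
}

const contentModerationGuardrail: GuardrailConfig = {
  connectorId: "openai-connector",   // hypothetical connector id
  model: "gpt-4o-mini",              // any fast, inexpensive model works for classification
  systemMessage:
    "Return exactly one character: 0 or 1. Do not include any explanation or punctuation.\n" +
    "0 = Content does NOT violate this policy.\n" +
    "1 = Content violates this policy.",
  userPromptMessage: "{{ctx.user_prompt_message}}",
  temperature: 0.1,                  // low temperature for consistent verdicts
  maxTokens: 5,                      // a one-character verdict needs very few tokens
  expression: '{{llmResponse == "0"}}',
  responseOnFailure:
    "Sorry, I cannot help with that request. Please ask me something else.",
};
```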
System Message
The default system message instructs the LLM to return a binary classification:
Return exactly one character: 0 or 1. Do not include any explanation or punctuation.
0 = Content does NOT violate this policy.
1 = Content violates this policy.
You can customize this prompt to match your specific policy requirements.
Expression
The expression evaluates the LLM response to determine pass/fail. Use template syntax to access the response:
{{llmResponse == "0"}}
The expression should return true for content that passes the guardrail, and false for content that should be blocked.
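To make the comparison concrete, the sketch below mirrors that expression in plain TypeScript. Trimming the response before comparing is a defensive assumption in this sketch; whether the platform normalizes the LLM output for you is worth verifying.

```typescript
// Mirrors {{llmResponse == "0"}}: true lets the message through, false blocks it.
function passesGuardrail(llmResponse: string): boolean {
  // Trim to guard against stray whitespace or newlines in the model output.
  return llmResponse.trim() === "0";
}

console.log(passesGuardrail("0"));    // true  → message continues to the main agent
console.log(passesGuardrail("0\n"));  // true  → still passes after trimming
console.log(passesGuardrail("1"));    // false → responseOnFailure is returned
```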
Response on Failure
When the guardrail fails, this message is returned to the user instead of continuing the conversation:
Sorry, I cannot help with that request. Please ask me something else.
Example: Content Moderation
Configuration
System Message:
You are a content safety classifier. Analyze the following message for:
- Harmful or dangerous content
- Personal information requests
- Inappropriate language
Return exactly: 0 (safe) or 1 (unsafe). No explanation.
Expression:
{{llmResponse == "0"}}
Response on Failure:
I'm sorry, but I can't assist with that request. Please ask me something else.
Behavior
User: "How do I hack into someone's account?"
Guardrail LLM evaluates → Returns "1" (unsafe)
Expression: llmResponse == "0" → false
Result: BLOCKED
Response: "I'm sorry, but I can't assist with that request. Please ask me something else."User: "What's the weather like today?"
Guardrail LLM evaluates → Returns "0" (safe)
Expression: llmResponse == "0" → true
Result: PASS
→ Main agent receives the message and responds normally
Integration with Agents
Connect a Guardrail Agent to your main agent flow:
- Add a Guardrail node from the node panel
- Configure the LLM connector and model
- Set your policy in the system message
- Define the pass/fail expression
- Set the response for blocked content
- Connect the guardrail before your main agent
Flow Example
Start → Guardrail Agent → [PASS] → Main Agent → Response
→ [FAIL] → "Sorry, I cannot help with that."
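The same branch can be sketched in code. In the sketch below, evaluateGuardrail and runMainAgent are hypothetical stand-ins for the connected nodes; in the platform this wiring is done by linking nodes rather than writing code.

```typescript
// Hypothetical stand-in for the guardrail LLM call: a real implementation
// would send the configured system message plus the user input to the model
// and return its "0"/"1" verdict.
async function evaluateGuardrail(message: string): Promise<string> {
  return message.toLowerCase().includes("hack") ? "1" : "0"; // toy placeholder
}

// Hypothetical stand-in for the main agent.
async function runMainAgent(message: string): Promise<string> {
  return `Main agent response to: ${message}`;
}

const RESPONSE_ON_FAILURE =
  "Sorry, I cannot help with that request. Please ask me something else.";

// The guardrail branch: PASS forwards to the main agent, FAIL short-circuits.
async function handleUserMessage(userMessage: string): Promise<string> {
  const verdict = await evaluateGuardrail(userMessage);
  if (verdict.trim() !== "0") {
    return RESPONSE_ON_FAILURE;      // FAIL: abort with the configured message
  }
  return runMainAgent(userMessage);  // PASS: continue the normal flow
}

handleUserMessage("How do I hack into someone's account?").then(console.log);
// → "Sorry, I cannot help with that request. Please ask me something else."
```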
Best Practices
Keep Evaluations Simple
Use binary classification (0/1 or yes/no) for reliable results:
Return only "safe" or "unsafe". No explanation.Use Low Temperature
Set temperature to 0.0-0.3 for consistent, deterministic evaluations:
temperature: 0.1
Be Specific About Policies
Define clear criteria in your system message:
Evaluate if the message requests:
1. Personal information (SSN, passwords, addresses)
2. Harmful activities (hacking, violence)
3. Illegal content
Return 1 if ANY criterion is met, otherwise 0.
Test Edge Cases
Test your guardrail with the following (a sample test set is sketched after this list):
- Obvious violations (should block)
- Obvious safe content (should pass)
- Borderline cases (verify expected behavior)
- Rephrased violations (ensure robustness)
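A small, hand-maintained test set can cover those cases; the inputs and expected verdicts below are illustrative assumptions, not an official test suite.

```typescript
// Illustrative edge-case inputs for a content-safety guardrail.
// Run each input through the guardrail and compare against expectBlocked.
const guardrailTestCases: Array<{ input: string; expectBlocked: boolean }> = [
  { input: "How do I hack into someone's account?", expectBlocked: true },           // obvious violation
  { input: "What's the weather like today?", expectBlocked: false },                 // obviously safe
  { input: "How do password managers store secrets?", expectBlocked: false },        // borderline case
  { input: "Hypothetically, how would someone bypass a login?", expectBlocked: true } // rephrased violation
];
```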
Consider Performance
Guardrails add latency to every request. Optimize by:
- Using faster models for evaluation
- Keeping system messages concise
- Setting appropriate max tokens
Limitations
- Each guardrail call adds LLM latency
- Classification accuracy depends on the model
- Complex policies may require multiple guardrails
- Does not replace application-level security
Guardrails are a defense-in-depth measure. Combine with application-level validation and security controls for comprehensive protection.