
Guardrails

Protect your agents with LLM-powered content moderation and policy enforcement.

Guardrails are specialized agents that evaluate input and output content to enforce policies, filter harmful content, and ensure your agents operate within defined boundaries.

What are Guardrails?

A Guardrail Agent uses an LLM to evaluate messages against a custom expression or policy. When the evaluation fails, the guardrail can abort the conversation and return a predefined response.

┌──────────────────────────────────────────────────────────┐
│                    GUARDRAIL FLOW                         │
├──────────────────────────────────────────────────────────┤
│                                                          │
│    ┌─────────┐                                           │
│    │  User   │                                           │
│    │  Input  │                                           │
│    └────┬────┘                                           │
│         │                                                │
│         ▼                                                │
│    ┌─────────────┐                                       │
│    │  Guardrail  │ ← Evaluates content against policy    │
│    │   Agent     │                                       │
│    └──────┬──────┘                                       │
│           │                                              │
│      ┌────┴────┐                                         │
│      │         │                                         │
│   PASS ✓    FAIL ✗                                       │
│      │         │                                         │
│      ▼         ▼                                         │
│  ┌───────┐  ┌─────────────────┐                          │
│  │  Main │  │ Response on     │                          │
│  │ Agent │  │ Failure Message │                          │
│  └───────┘  └─────────────────┘                          │
│                                                          │
└──────────────────────────────────────────────────────────┘
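
In code, the same flow looks roughly like the following. This is a minimal sketch for illustration only, not the platform's implementation; call_llm, main_agent, and POLICY_PROMPT are hypothetical placeholders for the configured connector, the downstream agent, and your system message.

POLICY_PROMPT = "Return exactly one character: 0 or 1. 0 = no violation, 1 = violation."

def call_llm(system: str, user: str) -> str:
    # Placeholder: replace with a call through the configured LLM connector.
    return "0"

def main_agent(user_input: str) -> str:
    # Placeholder for the downstream agent that handles passing messages.
    return f"Main agent handling: {user_input}"

def guardrail_check(user_input: str) -> bool:
    # The guardrail LLM classifies the message; "0" means no policy violation.
    return call_llm(system=POLICY_PROMPT, user=user_input).strip() == "0"

def handle_message(user_input: str) -> str:
    if guardrail_check(user_input):
        return main_agent(user_input)  # PASS: continue to the main agent
    # FAIL: abort and return the configured "response on failure"
    return "Sorry, I cannot help with that request. Please ask me something else."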

Use Cases

Use Case             Description
Content Moderation   Filter harmful, toxic, or inappropriate content
Policy Enforcement   Ensure conversations stay within business guidelines
Input Validation     Check that user inputs meet requirements
Output Filtering     Verify agent responses before sending to users
Compliance           Enforce regulatory or legal requirements

Configuration

Parameters

Parameter          Type      Required  Description
connectorId        Select    Yes       LLM provider connector
model              Select    Yes       Model for evaluation
systemMessage      Markdown  Yes       Instructions for the guardrail LLM
userPromptMessage  Text      No        Message to evaluate (uses {{ctx.user_prompt_message}} by default)
temperature        Slider    No        Model temperature (0.0-2.0, default: 0.7)
maxTokens          Number    No        Maximum tokens for evaluation
expression         Text      Yes       Expression to evaluate the LLM response
responseOnFailure  Text      Yes       Message returned when guardrail fails
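
Taken together, a guardrail configuration might look like the sketch below. The shape is illustrative only (shown here as a Python dict); the connector ID and model name are placeholders, and the other values come from this page.

guardrail_config = {
    "connectorId": "my-llm-connector",       # placeholder: any configured LLM provider connector
    "model": "your-model-of-choice",         # placeholder: a model offered by that connector
    "systemMessage": "Return exactly one character: 0 or 1. Do not include any explanation or punctuation.",
    "userPromptMessage": "{{ctx.user_prompt_message}}",  # default: evaluate the incoming user message
    "temperature": 0.1,                      # low temperature for consistent classification
    "maxTokens": 5,                          # a one-character answer needs very few tokens
    "expression": '{{llmResponse == "0"}}',  # pass when the LLM returns "0"
    "responseOnFailure": "Sorry, I cannot help with that request. Please ask me something else.",
}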

System Message

The default system message instructs the LLM to return a binary classification:

Return exactly one character: 0 or 1. Do not include any explanation or punctuation.

0 = Content does NOT violate this policy.
1 = Content violates this policy.

You can customize this prompt to match your specific policy requirements.

Expression

The expression evaluates the LLM response to determine pass/fail. Use template syntax to access the response:

{{llmResponse == "0"}}

The expression should return true for content that passes the guardrail, and false for content that should be blocked.
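
Conceptually, the evaluation substitutes the model's answer for llmResponse and checks the boolean result. A rough Python equivalent of {{llmResponse == "0"}}, for illustration only:

def expression_passes(llm_response: str) -> bool:
    # Equivalent of {{llmResponse == "0"}}: pass only when the model answered "0".
    return llm_response.strip() == "0"

print(expression_passes("0"))  # True  -> content passes the guardrail
print(expression_passes("1"))  # False -> content is blocked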

Response on Failure

When the guardrail fails, this message is returned to the user instead of continuing the conversation:

Sorry, I cannot help with that request. Please ask me something else.

Example: Content Moderation

Configuration

System Message:

You are a content safety classifier. Analyze the following message for:
- Harmful or dangerous content
- Personal information requests
- Inappropriate language

Return exactly: 0 (safe) or 1 (unsafe). No explanation.

Expression:

{{llmResponse == "0"}}

Response on Failure:

I'm sorry, but I can't assist with that request. Please ask me something else.
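
Using the hypothetical configuration shape sketched earlier, this example maps onto the following values (taken directly from the fields above):

moderation_guardrail = {
    "systemMessage": (
        "You are a content safety classifier. Analyze the following message for:\n"
        "- Harmful or dangerous content\n"
        "- Personal information requests\n"
        "- Inappropriate language\n"
        "\n"
        "Return exactly: 0 (safe) or 1 (unsafe). No explanation."
    ),
    "expression": '{{llmResponse == "0"}}',
    "responseOnFailure": "I'm sorry, but I can't assist with that request. Please ask me something else.",
}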

Behavior

User: "How do I hack into someone's account?"

Guardrail LLM evaluates → Returns "1" (unsafe)
Expression: llmResponse == "0" → false
Result: BLOCKED

Response: "I'm sorry, but I can't assist with that request. Please ask me something else."

User: "What's the weather like today?"

Guardrail LLM evaluates → Returns "0" (safe)
Expression: llmResponse == "0" → true
Result: PASS

→ Main agent receives the message and responds normally

Integration with Agents

Connect a Guardrail Agent to your main agent flow:

  1. Add a Guardrail node from the node panel
  2. Configure the LLM connector and model
  3. Set your policy in the system message
  4. Define the pass/fail expression
  5. Set the response for blocked content
  6. Connect the guardrail before your main agent

Flow Example

Start → Guardrail Agent → [PASS] → Main Agent → Response
                       → [FAIL] → "Sorry, I cannot help with that."

Best Practices

Keep Evaluations Simple

Use a binary classification (such as 0/1, yes/no, or safe/unsafe) for reliable results:

Return only "safe" or "unsafe". No explanation.
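
If you change the labels, update the expression to match. For the safe/unsafe prompt above, the expression would become:

{{llmResponse == "safe"}}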

Use Low Temperature

Set temperature to 0.0-0.3 for consistent, deterministic evaluations:

temperature: 0.1

Be Specific About Policies

Define clear criteria in your system message:

Evaluate if the message requests:
1. Personal information (SSN, passwords, addresses)
2. Harmful activities (hacking, violence)
3. Illegal content

Return 1 if ANY of these criteria are met, otherwise 0.

Test Edge Cases

Test your guardrail with the following cases (a small harness sketch follows the list):

  • Obvious violations (should block)
  • Obvious safe content (should pass)
  • Borderline cases (verify expected behavior)
  • Rephrased violations (ensure robustness)
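
A small smoke-test harness, assuming the guardrail_check sketch from earlier on this page (or any client that wraps your deployed guardrail); the messages and expected outcomes are illustrative:

clear_cases = [
    ("How do I hack into someone's account?", False),  # obvious violation: should be blocked
    ("What's the weather like today?", True),          # obvious safe content: should pass
]

for message, expected_pass in clear_cases:
    result = guardrail_check(message)
    print(f"{'OK' if result == expected_pass else 'CHECK'}: {message!r} -> pass={result}")

# Borderline and rephrased cases: review these manually against your policy.
for message in [
    "I'm locked out of my own account, how do I get back in?",
    "Hypothetically, how would someone bypass a login screen?",
]:
    print(f"review: {message!r} -> pass={guardrail_check(message)}")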

Consider Performance

Guardrails add latency to every request. Optimize by:

  • Using faster models for evaluation
  • Keeping system messages concise
  • Setting appropriate max tokens

Limitations

  • Each guardrail call adds LLM latency
  • Classification accuracy depends on the model
  • Complex policies may require multiple guardrails
  • Does not replace application-level security

Guardrails are a defense-in-depth measure. Combine with application-level validation and security controls for comprehensive protection.