
Guardrails

Protect your agents with LLM-powered content moderation and policy enforcement.

Guardrails are specialized agents that evaluate input and output content to enforce policies, filter harmful content, and ensure your agents operate within defined boundaries.

What are Guardrails?

A Guardrail Agent uses an LLM to evaluate messages against a custom expression or policy. When the evaluation fails, the guardrail can abort the conversation and return a predefined response.

┌──────────────────────────────────────────────────────────┐
│                    GUARDRAIL FLOW                         │
├──────────────────────────────────────────────────────────┤
│                                                          │
│    ┌─────────┐                                           │
│    │  User   │                                           │
│    │  Input  │                                           │
│    └────┬────┘                                           │
│         │                                                │
│         ▼                                                │
│    ┌─────────────┐                                       │
│    │  Guardrail  │ ← Evaluates content against policy    │
│    │   Agent     │                                       │
│    └──────┬──────┘                                       │
│           │                                              │
│      ┌────┴────┐                                         │
│      │         │                                         │
│   PASS ✓    FAIL ✗                                       │
│      │         │                                         │
│      ▼         ▼                                         │
│  ┌───────┐  ┌─────────────────┐                          │
│  │  Main │  │ Response on     │                          │
│  │ Agent │  │ Failure Message │                          │
│  └───────┘  └─────────────────┘                          │
│                                                          │
└──────────────────────────────────────────────────────────┘
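
In code, the same flow looks roughly like the following. This is a minimal sketch for illustration only, not the platform's implementation; call_llm, main_agent, and POLICY_PROMPT are hypothetical placeholders for the configured connector, the downstream agent, and your system message.

POLICY_PROMPT = "Return exactly one character: 0 or 1. 0 = no violation, 1 = violation."

def call_llm(system: str, user: str) -> str:
    # Placeholder: replace with a call through the configured LLM connector.
    return "0"

def main_agent(user_input: str) -> str:
    # Placeholder for the downstream agent that handles passing messages.
    return f"Main agent handling: {user_input}"

def guardrail_check(user_input: str) -> bool:
    # The guardrail LLM classifies the message; "0" means no policy violation.
    return call_llm(system=POLICY_PROMPT, user=user_input).strip() == "0"

def handle_message(user_input: str) -> str:
    if guardrail_check(user_input):
        return main_agent(user_input)  # PASS: continue to the main agent
    # FAIL: abort and return the configured "response on failure"
    return "Sorry, I cannot help with that request. Please ask me something else."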

Use Cases

Use Case             Description
Content Moderation   Filter harmful, toxic, or inappropriate content
Policy Enforcement   Ensure conversations stay within business guidelines
Input Validation     Check that user inputs meet requirements
Output Filtering     Verify agent responses before sending to users
Compliance           Enforce regulatory or legal requirements

Configuration

Parameters

Parameter          Type      Required  Description
connectorId        Select    Yes       LLM provider connector
model              Select    Yes       Model for evaluation
systemMessage      Markdown  Yes       Instructions for the guardrail LLM
userPromptMessage  Text      No        Message to evaluate (uses {{ctx.user_prompt_message}} by default)
temperature        Slider    No        Model temperature (0.0-2.0, default: 0.7)
maxTokens          Number    No        Maximum tokens for evaluation
expression         Text      Yes       Expression to evaluate the LLM response
responseOnFailure  Text      Yes       Message returned when guardrail fails
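
Taken together, a guardrail configuration might look like the sketch below. The shape is illustrative only (shown here as a Python dict); the connector ID and model name are placeholders, and the other values come from this page.

guardrail_config = {
    "connectorId": "my-llm-connector",       # placeholder: any configured LLM provider connector
    "model": "your-model-of-choice",         # placeholder: a model offered by that connector
    "systemMessage": "Return exactly one character: 0 or 1. Do not include any explanation or punctuation.",
    "userPromptMessage": "{{ctx.user_prompt_message}}",  # default: evaluate the incoming user message
    "temperature": 0.1,                      # low temperature for consistent classification
    "maxTokens": 5,                          # a one-character answer needs very few tokens
    "expression": '{{llmResponse == "0"}}',  # pass when the LLM returns "0"
    "responseOnFailure": "Sorry, I cannot help with that request. Please ask me something else.",
}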

System Message

The default system message instructs the LLM to return a binary classification:

Return exactly one character: 0 or 1. Do not include any explanation or punctuation.

0 = Content does NOT violate this policy.
1 = Content violates this policy.

You can customize this prompt to match your specific policy requirements.

Expression

The expression evaluates the LLM response to determine pass/fail. Use template syntax to access the response:

{{llmResponse == "0"}}

The expression should return true for content that passes the guardrail, and false for content that should be blocked.
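
Conceptually, the evaluation substitutes the model's answer for llmResponse and checks the boolean result. A rough Python equivalent of {{llmResponse == "0"}}, for illustration only:

def expression_passes(llm_response: str) -> bool:
    # Equivalent of {{llmResponse == "0"}}: pass only when the model answered "0".
    return llm_response.strip() == "0"

print(expression_passes("0"))  # True  -> content passes the guardrail
print(expression_passes("1"))  # False -> content is blocked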

Response on Failure

When the guardrail fails, this message is returned to the user instead of continuing the conversation:

Sorry, I cannot help with that request. Please ask me something else.

Example: Content Moderation

Configuration

System Message:

You are a content safety classifier. Analyze the following message for:
- Harmful or dangerous content
- Personal information requests
- Inappropriate language

Return exactly: 0 (safe) or 1 (unsafe). No explanation.

Expression:

{{llmResponse == "0"}}

Response on Failure:

I'm sorry, but I can't assist with that request. Please ask me something else.
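
Using the hypothetical configuration shape sketched earlier, this example maps onto the following values (taken directly from the fields above):

moderation_guardrail = {
    "systemMessage": (
        "You are a content safety classifier. Analyze the following message for:\n"
        "- Harmful or dangerous content\n"
        "- Personal information requests\n"
        "- Inappropriate language\n"
        "\n"
        "Return exactly: 0 (safe) or 1 (unsafe). No explanation."
    ),
    "expression": '{{llmResponse == "0"}}',
    "responseOnFailure": "I'm sorry, but I can't assist with that request. Please ask me something else.",
}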

Behavior

User: "How do I hack into someone's account?"

Guardrail LLM evaluates → Returns "1" (unsafe)
Expression: llmResponse == "0" → false
Result: BLOCKED

Response: "I'm sorry, but I can't assist with that request. Please ask me something else."

User: "What's the weather like today?"

Guardrail LLM evaluates → Returns "0" (safe)
Expression: llmResponse == "0" → true
Result: PASS

→ Main agent receives the message and responds normally

Integration with Agents

Connect a Guardrail Agent to your main agent flow:

  1. Add a Guardrail node from the node panel
  2. Configure the LLM connector and model
  3. Set your policy in the system message
  4. Define the pass/fail expression
  5. Set the response for blocked content
  6. Connect the guardrail before your main agent

Flow Example

Start → Guardrail Agent → [PASS] → Main Agent → Response
                       → [FAIL] → "Sorry, I cannot help with that."

Best Practices

Keep Evaluations Simple

Use a binary classification (such as 0/1, yes/no, or safe/unsafe) for reliable results:

Return only "safe" or "unsafe". No explanation.
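
If you change the labels, update the expression to match. For the safe/unsafe prompt above, the expression would become:

{{llmResponse == "safe"}}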

Use Low Temperature

Set temperature to 0.0-0.3 for consistent, deterministic evaluations:

temperature: 0.1

Be Specific About Policies

Define clear criteria in your system message:

Evaluate if the message requests:
1. Personal information (SSN, passwords, addresses)
2. Harmful activities (hacking, violence)
3. Illegal content

Return 1 if ANY of these criteria are met, otherwise 0.

Test Edge Cases

Test your guardrail with the following cases (a small harness sketch follows the list):

  • Obvious violations (should block)
  • Obvious safe content (should pass)
  • Borderline cases (verify expected behavior)
  • Rephrased violations (ensure robustness)
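
A small smoke-test harness, assuming the guardrail_check sketch from earlier on this page (or any client that wraps your deployed guardrail); the messages and expected outcomes are illustrative:

clear_cases = [
    ("How do I hack into someone's account?", False),  # obvious violation: should be blocked
    ("What's the weather like today?", True),          # obvious safe content: should pass
]

for message, expected_pass in clear_cases:
    result = guardrail_check(message)
    print(f"{'OK' if result == expected_pass else 'CHECK'}: {message!r} -> pass={result}")

# Borderline and rephrased cases: review these manually against your policy.
for message in [
    "I'm locked out of my own account, how do I get back in?",
    "Hypothetically, how would someone bypass a login screen?",
]:
    print(f"review: {message!r} -> pass={guardrail_check(message)}")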

Consider Performance

Guardrails add latency to every request. Optimize by:

  • Using faster models for evaluation
  • Keeping system messages concise
  • Setting appropriate max tokens

Limitations

  • Each guardrail call adds LLM latency
  • Classification accuracy depends on the model
  • Complex policies may require multiple guardrails
  • Does not replace application-level security

Guardrails are a defense-in-depth measure. Combine with application-level validation and security controls for comprehensive protection.