Overview
The ensemble layer is the “brain” of Guardian API, intelligently combining outputs from three independent models into a single, actionable moderation decision.

Ensemble Algorithm
Step 1: Weighted Score Fusion
The ensemble uses a weighted average of the three model scores:
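A minimal sketch of the fusion step, assuming each model returns a score in [0, 1]; the function name and dictionary shape are illustrative, not the actual backend/app/core/ensemble.py API:

```python
# Illustrative sketch only: the constant and function names are assumptions,
# not the actual backend/app/core/ensemble.py implementation.
MODEL_WEIGHTS = {
    "sexism": 0.35,    # custom-trained model, strong on the target domain
    "toxicity": 0.35,  # general transformer-based toxicity model
    "rules": 0.30,     # rule-based detector, high precision
}

def fuse_scores(scores: dict[str, float]) -> float:
    """Return the weighted average of per-model scores (each assumed to be in [0, 1])."""
    return sum(weight * scores[name] for name, weight in MODEL_WEIGHTS.items())

# Example: fuse_scores({"sexism": 0.05, "toxicity": 0.02, "rules": 0.0}) ≈ 0.0245
```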
Weight Rationale
Why 35-35-30?
- Sexism (35%): Custom-trained, high accuracy on target domain
- Toxicity (35%): State-of-the-art transformer, broad coverage
- Rules (30%): High precision but limited recall, perfect for critical issues
Rule Score Calculation
The rule score is computed based on detected flags:

| Flag Detected | Rule Score |
|---|---|
| Self-harm | 0.95 (highest) |
| Slur | 0.90 |
| Threat | 0.85 |
| Profanity | 0.40 |
| None | 0.00 |

If multiple flags are detected, the maximum score is used. Additionally, if any critical rule (slur, self-harm, threat) is detected, the rule score is set to at least 0.70.
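A sketch of the logic described above; the flag names and function signature are assumptions about the rule engine, not its real interface:

```python
# Illustrative sketch of the rule-score mapping above; flag names and the
# function signature are assumptions, not the real rule engine API.
FLAG_SCORES = {
    "self_harm": 0.95,
    "slur": 0.90,
    "threat": 0.85,
    "profanity": 0.40,
}
CRITICAL_FLAGS = {"self_harm", "slur", "threat"}

def rule_score(flags: set[str]) -> float:
    """Take the maximum score across detected flags; critical flags enforce a 0.70 floor."""
    score = max((FLAG_SCORES.get(flag, 0.0) for flag in flags), default=0.0)
    if flags & CRITICAL_FLAGS:
        score = max(score, 0.70)
    return score
```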
Step 2: Conflict Resolution
Rule-based detections can override the weighted average for critical issues:

1. Slur or Self-Harm Detected: if slurs or self-harm phrases are detected, the score is overridden upward. This ensures high-risk content is never underestimated.
2. Threat Detected: if threat patterns are detected, the score is elevated even if the ML models missed it.
3. No Critical Flags: the weighted average is used as-is.
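Putting the three cases together, one way to express the override step is shown below; the floor values are placeholders for the tunable thresholds (see "Tuning the Ensemble"), not documented defaults:

```python
# Illustrative sketch of the conflict-resolution step. The floor values are
# placeholders for the tunable thresholds, not Guardian API's shipped defaults.
SLUR_SELF_HARM_FLOOR = 0.85  # hypothetical minimum when slurs/self-harm are flagged
THREAT_FLOOR = 0.75          # hypothetical minimum when threats are flagged

def resolve_conflicts(weighted_score: float, flags: set[str]) -> float:
    """Apply rule-based overrides on top of the weighted average."""
    if flags & {"slur", "self_harm"}:
        return max(weighted_score, SLUR_SELF_HARM_FLOOR)
    if "threat" in flags:
        return max(weighted_score, THREAT_FLOOR)
    return weighted_score  # no critical flags: use the weighted average as-is
```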
Why Override?
Rule-based detection has high precision (low false positives) for critical issues. When rules detect slurs, threats, or self-harm, we trust them even if ML models disagree.
Step 3: Primary Issue Identification
The ensemble identifies the main concern based on individual model scores.
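One plausible implementation is to pick the highest-scoring signal; the sketch below assumes exactly that, plus a hypothetical cutoff below which no primary issue is reported:

```python
# Sketch only: assumes the primary issue is simply the strongest individual signal.
def primary_issue(scores: dict[str, float], floor: float = 0.1) -> str | None:
    """Return the name of the highest-scoring signal, or None if nothing stands out.

    The 0.1 floor is a hypothetical cutoff, not a documented value.
    """
    name, score = max(scores.items(), key=lambda item: item[1])
    return name if score >= floor else None

# Example: primary_issue({"sexism": 0.20, "toxicity": 0.70, "rules": 0.40}) -> "toxicity"
```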
Step 4: Summary Classification

The final score is mapped to a human-readable summary:

| Score Range | Summary | Meaning |
|---|---|---|
| 0.6 - 1.0 | highly_harmful | Strong evidence of harmful content |
| 0.3 - 0.6 | likely_harmful | Probable harmful content |
| 0.1 - 0.3 | potentially_harmful | Some harmful indicators |
| 0.0 - 0.1 | likely_safe | No significant harmful content |
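A sketch of this mapping; whether each boundary is inclusive or exclusive is an assumption here:

```python
# Sketch of the table above; treating each lower bound as inclusive is an
# assumption about how boundaries are handled.
def summarize(final_score: float) -> str:
    if final_score >= 0.6:
        return "highly_harmful"
    if final_score >= 0.3:
        return "likely_harmful"
    if final_score >= 0.1:
        return "potentially_harmful"
    return "likely_safe"
```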
Step 5: Severity Calculation
Severity is computed from the final score.
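The severity level names and cutoffs below are placeholders; the real boundaries live in backend/app/config.py (see "Tuning the Ensemble"):

```python
# Hypothetical severity mapping: the level names and cutoffs are placeholders;
# the real boundaries are configured in backend/app/config.py.
def severity(final_score: float) -> str:
    if final_score >= 0.8:
        return "critical"
    if final_score >= 0.6:
        return "high"
    if final_score >= 0.3:
        return "medium"
    return "low"
```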
Example Scenarios

- Agreement (Safe)
- Agreement (Harmful)
- Conflict (Rule Override)
- Disagreement
Agreement (Safe)

Text: “I love this product!”

Model Outputs:
- Sexism: 0.05
- Toxicity: 0.02
- Rules: 0.00 (no flags)

Result: All three models agree the content is safe. With the 35/35/30 weights, the fused score is roughly 0.35 × 0.05 + 0.35 × 0.02 + 0.30 × 0.00 ≈ 0.02, which falls in the likely_safe range.
Design Principles
1. Conservative for Critical Issues
Fail-Safe Approach
When in doubt about threats, slurs, or self-harm, err on the side of caution. Better to flag for human review than miss dangerous content.
2. Balanced for Common Cases
Nuanced Scoring
For typical moderation cases (toxicity, rudeness), use weighted fusion to capture nuance. Not everything needs maximum severity.
3. Explainable Decisions
Primary Issue
Always identify the main reason for flagging. Users need to understand why content was moderated.
4. Configurable Weights
Tunable System
Weights can be adjusted based on use case. Social media might weight toxicity higher; professional forums might weight profanity lower.
Tuning the Ensemble
Adjusting Weights
Modify the weights in backend/app/core/ensemble.py:
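For example, the weights might be exposed as a module-level mapping like the one below; the exact variable name is an assumption:

```python
# backend/app/core/ensemble.py (illustrative; the actual variable name may differ)
MODEL_WEIGHTS = {
    "sexism": 0.35,   # raise for communities where sexism is the primary concern
    "toxicity": 0.35, # raise for general-purpose social platforms
    "rules": 0.30,    # raise to lean harder on the high-precision rules
}
# Keep the weights summing to 1.0 so the fused score stays in [0, 1].
```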
Adjusting Override Thresholds
Modify the conflict resolution thresholds:
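The constant names and values below are illustrative placeholders for the override floors, not the shipped defaults:

```python
# backend/app/core/ensemble.py (illustrative; names and values are assumptions)
SLUR_SELF_HARM_FLOOR = 0.85  # minimum final score when slurs or self-harm are detected
THREAT_FLOOR = 0.75          # minimum final score when threats are detected
```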
Adjusting Severity Boundaries

Modify the thresholds in backend/app/config.py:
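A sketch of what such a setting could look like; the setting name, level names, and default values are assumptions:

```python
# backend/app/config.py (illustrative; the real setting names and defaults may differ)
SEVERITY_THRESHOLDS = {
    "critical": 0.8,
    "high": 0.6,
    "medium": 0.3,
    "low": 0.0,
}
```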