Overview
Guardian API uses four models working in parallel to provide comprehensive content moderation. Each model specializes in different aspects of harmful content detection.
Model 1: Sexism Classifier
LASSO Regression Model
Custom-trained binary classifier for sexism detection
Technical Details
| Attribute | Value |
|---|---|
| Algorithm | LASSO (Least Absolute Shrinkage and Selection Operator) |
| Training Data | ~40,000 labeled tweets |
| Features | 2,500 n-gram counts (unigrams and bigrams) + 3 additional features |
| Threshold | 0.400 (optimized for F1 score) |
| Performance | ~82% F1 score on test set |
| Version | sexism_lasso_v1 |
Feature Engineering
The model uses a combination of text and numerical features:
Text Features
CountVectorizer Configuration (a sketch follows below):
- max_features: 2,500
- ngram_range: (1, 2)
- min_df: 2 (minimum document frequency)
- max_df: 0.8 (maximum document frequency)
- stop_words: English, with gendered words preserved
  - Pronouns: he, him, she, her, etc.
  - Nouns: man, woman, men, women, boy, girl
  - Preserving these words is important for detecting sexist language patterns
Additional Features
Three additional numerical features are combined with the n-gram counts (see the Technical Details table above).
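As a rough sketch only (not the repository's actual code), this configuration could be expressed with scikit-learn as follows; the GENDERED_WORDS name and the stop-word handling are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Assumption: gendered words are removed from the English stop-word list so the
# classifier can still use them as features.
GENDERED_WORDS = {"he", "him", "she", "her", "man", "woman", "men", "women", "boy", "girl"}
custom_stop_words = list(ENGLISH_STOP_WORDS - GENDERED_WORDS)

vectorizer = CountVectorizer(
    max_features=2500,    # cap the vocabulary at 2,500 n-grams
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=2,             # ignore terms seen in fewer than 2 documents
    max_df=0.8,           # ignore terms seen in more than 80% of documents
    stop_words=custom_stop_words,
)
```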
Prediction Process
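The prediction code itself is not reproduced on this page. A minimal sketch of the flow, assuming an L1-regularized (LASSO-style) logistic model with a `predict_proba` interface and a fitted vectorizer, might look like:

```python
import numpy as np

SEXISM_THRESHOLD = 0.400  # decision threshold optimized for F1

def predict_sexism(text, vectorizer, model, extra_features):
    """Sketch: n-gram counts + 3 extra features -> probability -> thresholded label."""
    text_features = vectorizer.transform([text]).toarray()                      # 1 x 2,500 n-gram counts
    features = np.hstack([text_features, np.reshape(extra_features, (1, -1))])  # append the 3 additional features
    score = float(model.predict_proba(features)[0, 1])                          # probability of the positive class
    return score, score >= SEXISM_THRESHOLD
```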
Output Format
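The exact response schema is not shown here; a hypothetical output with placeholder field names could look like:

```python
# Hypothetical field names -- the real schema may differ.
example_sexism_output = {
    "score": 0.87,
    "is_sexist": True,
    "threshold": 0.400,
    "model_version": "sexism_lasso_v1",
}
```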
Model 2: Toxicity Transformer
HuggingFace Transformer
Multi-label toxicity detection using RoBERTa
Technical Details
| Attribute | Value |
|---|---|
| Architecture | RoBERTa (Robustly Optimized BERT) |
| Model Name | unitary/unbiased-toxic-roberta |
| Type | Multi-label classification |
| Device | CUDA (GPU) if available, CPU fallback |
| Max Length | 512 tokens |
| Version | toxic_roberta_v1 |
Toxicity Categories
The model detects 7 categories of toxicity:
- Overall Toxicity: general toxic language score
- Severe Toxicity: extremely harmful content
- Obscene: vulgar or obscene language
- Threat: threatening language
- Insult: personal insults and attacks
- Identity Attack: attacks on identity groups
- Sexual Explicit: sexually explicit content
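A minimal sketch of querying this checkpoint with the HuggingFace pipeline API (a generic usage pattern, not the repository's own wrapper; the sigmoid scoring and label handling are assumptions about the checkpoint):

```python
from transformers import pipeline

# Multi-label scoring: return every label and apply a per-label sigmoid.
toxicity = pipeline(
    "text-classification",
    model="unitary/unbiased-toxic-roberta",
    top_k=None,
    function_to_apply="sigmoid",
)

results = toxicity(["example input text"], truncation=True, max_length=512)
# results[0] is a list of {"label": ..., "score": ...} entries for the first input.
scores = {entry["label"]: entry["score"] for entry in results[0]}
```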
Device Management
GPU Acceleration
The toxicity model automatically uses the GPU if available (see the sketch below).
Performance Comparison:
- Latency: ~10-15ms per request on GPU (RTX 4050), ~40-60ms per request on CPU
- Memory: ~2GB VRAM on GPU, ~1GB RAM on CPU
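A sketch of the device selection described above, using PyTorch (the repository's actual loading code may differ):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Prefer CUDA when a GPU is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("unitary/unbiased-toxic-roberta")
model = AutoModelForSequenceClassification.from_pretrained("unitary/unbiased-toxic-roberta")
model.to(device)
model.eval()
```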
Output Format
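A hypothetical response shape with placeholder field names, illustrating the note below on how the overall score relates to the sub-category scores:

```python
# Hypothetical field names -- the real schema may differ.
example_toxicity_output = {
    "toxicity": 0.91,          # overall score: at least the maximum of the sub-categories
    "severe_toxicity": 0.12,
    "obscene": 0.64,
    "threat": 0.03,
    "insult": 0.91,
    "identity_attack": 0.08,
    "sexual_explicit": 0.02,
    "model_version": "toxic_roberta_v1",
}
```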
The overall score is automatically set to at least the maximum of all sub-category scores.
Model 3: Rule Engine
Heuristic-Based System
Pattern matching and rule-based detection
Technical Details
| Attribute | Value |
|---|---|
| Type | Rule-based heuristics |
| Rules | JSON configuration files |
| Pattern Matching | Regex + exact matching |
| Extensible | Easy to add new rules |
| Version | rules_v1 |
Rule Categories
- Slurs
- Threats
- Self-Harm
- Profanity
- Style Checks
File: backend/app/models/rules/slurs.json
Detection: Exact word matching (case-insensitive)
Purpose: Identify hate speech and slurs
Format: a JSON array of words (see the sketch below)
Output Format
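For illustration only, a hypothetical slurs.json (a plain JSON array of words) and a sketch of case-insensitive exact matching; the function name and output fields are assumptions, not the repository's code:

```python
import json
import re

# Assumption: slurs.json is a plain JSON array, e.g. ["word1", "word2", ...]
with open("backend/app/models/rules/slurs.json") as f:
    SLUR_WORDS = {word.lower() for word in json.load(f)}

def check_slurs(text):
    """Exact word matching, case-insensitive."""
    tokens = re.findall(r"[\w']+", text.lower())
    matches = sorted(SLUR_WORDS.intersection(tokens))
    # Hypothetical output fields -- the real schema may differ.
    return {"rule": "slurs", "flagged": bool(matches), "matches": matches, "model_version": "rules_v1"}
```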
Customization
Adding new rules is straightforward:
1. Edit JSON File: navigate to backend/app/models/rules/ and edit the appropriate JSON file.
2. Add Your Rules:
   - For slurs/profanity: add words to the array
   - For threats: add regex patterns
   - For self-harm: add phrases
3. Restart API: the API will automatically load the new rules on startup.
Model 4: Ensemble
Aggregation Layer
Combines outputs from all three models
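How the ensemble combines the three outputs is not detailed on this page; one plausible, purely illustrative aggregation is to take the worst signal across models:

```python
def aggregate(sexism_score, toxicity_scores, rule_flags):
    """Illustrative ensemble: flag content on the strongest signal from any model."""
    worst_toxicity = max(toxicity_scores.values()) if toxicity_scores else 0.0
    rule_hit = any(rule_flags.values())
    overall = max(sexism_score, worst_toxicity, 1.0 if rule_hit else 0.0)
    # Hypothetical field names -- the real schema may differ.
    return {
        "overall_score": overall,
        "flagged": rule_hit or overall >= 0.5,
        "sources": {
            "sexism_lasso_v1": sexism_score,
            "toxic_roberta_v1": worst_toxicity,
            "rules_v1": rule_hit,
        },
    }
```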
Model Comparison
| Feature | Sexism Classifier | Toxicity Model | Rule Engine |
|---|---|---|---|
| Type | ML (LASSO) | ML (Transformer) | Heuristic |
| Speed | Fast (~5ms) | Medium (~15ms GPU) | Fast (~2ms) |
| Accuracy | High (82% F1) | High | Rule-dependent |
| Extensibility | Requires retraining | Requires retraining | Easy (JSON) |
| Resource | Low | Medium-High | Very Low |
| False Positives | Low | Low-Medium | Medium |