Overview

The Guardian API returns a structured JSON response containing the outputs of the three detection models, a combined ensemble decision, and request metadata.

Response Schema

{
  "text": string,           // Original or sanitized input
  "label": {                // Individual model outputs
    "sexism": {...},
    "toxicity": {...},
    "rules": {...}
  },
  "ensemble": {...},        // Combined decision
  "meta": {...}             // Request metadata
}
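
For client code, the schema above maps cleanly onto Python type hints. This is a minimal sketch covering only the fields documented on this page; the class names are illustrative and not part of the API:

from typing import List, TypedDict

class SexismLabel(TypedDict):
    score: float            # 0-1 confidence
    severity: str           # "low", "moderate", or "high"
    model_version: str
    threshold_met: bool

class ToxicityLabel(TypedDict):
    overall: float
    insult: float
    threat: float
    identity_attack: float
    profanity: float
    model_version: str

class RulesLabel(TypedDict):
    slur_detected: bool
    threat_detected: bool
    self_harm_flag: bool
    profanity_flag: bool
    caps_abuse: bool
    character_repetition: bool
    model_version: str

class Labels(TypedDict):
    sexism: SexismLabel
    toxicity: ToxicityLabel
    rules: RulesLabel

class Ensemble(TypedDict):
    summary: str            # e.g. "likely_safe", "highly_harmful"
    primary_issue: str      # e.g. "none", "sexism", "threat"
    score: float
    severity: str

class Meta(TypedDict):
    processing_time_ms: int
    models_used: List[str]

class GuardianResponse(TypedDict):
    text: str
    label: Labels
    ensemble: Ensemble
    meta: Meta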

Top-Level Fields

text (string)

The input text that was moderated. May be truncated if very long.
{
  "text": "Your input text here"
}
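
If you need to detect whether the service echoed back something other than your exact input (for example after truncation), compare the two. A minimal sketch; the truncation limit itself is not documented here:

def text_was_modified(submitted: str, response: dict) -> bool:
    """True if the returned text differs from the exact submitted input."""
    return response.get("text", "") != submitted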

label (object)

Outputs from all three detection models.

ensemble (object)

Final moderation decision combining all models.

meta (object)

Request metadata including processing time and models used.

Label Structure

Sexism Label

{
  "label": {
    "sexism": {
      "score": 0.724,                    // Confidence score (0-1)
      "severity": "high",                // "low", "moderate", or "high"
      "model_version": "sexism_lasso_v1",  // Model identifier
      "threshold_met": true              // Whether score exceeds threshold
    }
  }
}
Field           Type      Description
score           float     LASSO model confidence (0.0 to 1.0)
severity        string    Severity level based on score
model_version   string    Model identifier for tracking
threshold_met   boolean   True if score ≥ 0.400
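
Since threshold_met already encodes the 0.400 cutoff, most clients can key decisions off that flag and the severity field rather than re-comparing the raw score. A minimal sketch; the helper names are illustrative:

def sexism_flagged(response: dict) -> bool:
    """True when the LASSO score crossed the documented 0.400 threshold."""
    return response["label"]["sexism"]["threshold_met"]

def sexism_severity(response: dict) -> str:
    """Return the reported severity: "low", "moderate", or "high"."""
    return response["label"]["sexism"]["severity"]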

Toxicity Label

{
  "label": {
    "toxicity": {
      "overall": 0.742,                  // Overall toxicity score
      "insult": 0.631,                   // Insult score
      "threat": 0.123,                   // Threat score
      "identity_attack": 0.412,          // Identity attack score
      "profanity": 0.584,                // Profanity score
      "model_version": "toxic_roberta_v1"  // Model identifier
    }
  }
}
Field            Type     Description
overall          float    Maximum toxicity across all categories
insult           float    Personal insult score (0-1)
threat           float    Threatening language score (0-1)
identity_attack  float    Identity-based attack score (0-1)
profanity        float    Profane language score (0-1)
model_version    string   Model identifier
The overall score is automatically set to at least the maximum of all sub-category scores.
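
That invariant can be verified when validating responses client-side. A minimal sketch using the categories from the table above:

TOXICITY_CATEGORIES = ("insult", "threat", "identity_attack", "profanity")

def overall_is_consistent(toxicity: dict) -> bool:
    """Check that overall is at least the maximum sub-category score."""
    return toxicity["overall"] >= max(toxicity[c] for c in TOXICITY_CATEGORIES)

# Example: overall_is_consistent(response["label"]["toxicity"])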

Rules Label

{
  "label": {
    "rules": {
      "slur_detected": false,            // Slur flag
      "threat_detected": true,           // Threat pattern flag
      "self_harm_flag": false,           // Self-harm phrase flag
      "profanity_flag": true,            // Profanity flag
      "caps_abuse": false,               // Excessive caps flag
      "character_repetition": false,     // Repeated chars flag
      "model_version": "rules_v1"        // Model identifier
    }
  }
}
Field                 Type     Description
slur_detected         boolean  True if slurs found
threat_detected       boolean  True if threat patterns matched
self_harm_flag        boolean  True if self-harm phrases found
profanity_flag        boolean  True if profanity detected
caps_abuse            boolean  True if >70% uppercase
character_repetition  boolean  True if 3+ repeated characters
model_version         string   Model identifier
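
Because every rule output is a plain boolean, a client can list exactly which heuristics fired and surface them in review tooling. A minimal sketch; the helper name is illustrative:

RULE_FLAGS = (
    "slur_detected", "threat_detected", "self_harm_flag",
    "profanity_flag", "caps_abuse", "character_repetition",
)

def triggered_rules(response: dict) -> list:
    """Return the names of the rule flags set to True in this response."""
    rules = response["label"]["rules"]
    return [flag for flag in RULE_FLAGS if rules.get(flag)]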

Ensemble Structure

The ensemble object provides the final moderation decision:
{
  "ensemble": {
    "summary": "highly_harmful",       // Overall assessment
    "primary_issue": "sexism",         // Main concern
    "score": 0.612,                    // Combined score (0-1)
    "severity": "high"                 // "low", "moderate", or "high"
  }
}

Summary Values

Value                Score Range  Meaning
likely_safe          0.0 - 0.1    No significant harmful content
potentially_harmful  0.1 - 0.3    Some harmful indicators present
likely_harmful       0.3 - 0.6    Probable harmful content
highly_harmful       0.6 - 1.0    Strong evidence of harmful content
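
If you only have the numeric ensemble score (for example when re-bucketing stored results), the bands above can be reproduced in a few lines. This sketch follows the published ranges; whether each upper bound is inclusive is not documented, so the strict-less-than cutoffs are an assumption:

def summary_from_score(score: float) -> str:
    """Map an ensemble score onto the summary bands documented above."""
    if score < 0.1:
        return "likely_safe"
    elif score < 0.3:
        return "potentially_harmful"
    elif score < 0.6:
        return "likely_harmful"
    return "highly_harmful"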

Primary Issue Values

Value              Description
"none"             No significant issues detected
"sexism"           Sexist content is the main concern
"toxicity"         Toxic language is the main concern
"slur"             Slur detected by rules
"threat"           Threat detected by rules
"self_harm"        Self-harm content detected
"harmful_content"  Generic harmful content

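A common pattern is to route flagged content to different review queues based on primary_issue. A sketch under the assumption that the queue names are defined by your application; they are not part of the API:

from typing import Optional

# Hypothetical application-side queue names, keyed by the documented primary_issue values.
REVIEW_QUEUES = {
    "sexism": "gender-policy-review",
    "toxicity": "abuse-review",
    "slur": "abuse-review",
    "threat": "safety-escalation",
    "self_harm": "safety-escalation",
    "harmful_content": "general-review",
}

def route_for_review(response: dict) -> Optional[str]:
    """Return a review queue for the response, or None when no routing is needed."""
    return REVIEW_QUEUES.get(response["ensemble"]["primary_issue"])  # "none" maps to None
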
Meta Structure

Metadata about the request and processing:
{
  "meta": {
    "processing_time_ms": 24,          // Total processing time
    "models_used": [                   // Models that ran
      "sexism_lasso_v1",
      "toxic_roberta_v1",
      "rules_v1"
    ]
  }
}
Field               Type     Description
processing_time_ms  integer  Total time to process the request (milliseconds)
models_used         array    List of model versions used
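
processing_time_ms is the natural hook for latency monitoring. A minimal sketch; the 100 ms alert threshold is an arbitrary example, not a documented SLA:

import logging

SLOW_REQUEST_MS = 100  # example alert threshold; tune to your own latency budget

def record_latency(response: dict) -> None:
    """Log slow Guardian requests using the latency reported in meta."""
    elapsed = response["meta"]["processing_time_ms"]
    if elapsed > SLOW_REQUEST_MS:
        logging.warning("Slow moderation request: %d ms (models: %s)",
                        elapsed, ", ".join(response["meta"]["models_used"]))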

Complete Example

Harmful Content
Request:
{
  "text": "Women belong in the kitchen"
}
Response:
{
  "text": "Women belong in the kitchen",
  "label": {
    "sexism": {
      "score": 0.847,
      "severity": "high",
      "model_version": "sexism_lasso_v1",
      "threshold_met": true
    },
    "toxicity": {
      "overall": 0.621,
      "insult": 0.543,
      "threat": 0.087,
      "identity_attack": 0.621,
      "profanity": 0.124,
      "model_version": "toxic_roberta_v1"
    },
    "rules": {
      "slur_detected": false,
      "threat_detected": false,
      "self_harm_flag": false,
      "profanity_flag": false,
      "caps_abuse": false,
      "character_repetition": false,
      "model_version": "rules_v1"
    }
  },
  "ensemble": {
    "summary": "highly_harmful",
    "primary_issue": "sexism",
    "score": 0.689,
    "severity": "high"
  },
  "meta": {
    "processing_time_ms": 27,
    "models_used": [
      "sexism_lasso_v1",
      "toxic_roberta_v1",
      "rules_v1"
    ]
  }
}
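
Pulling the key fields out of a response like the one above takes only a few lines. A sketch that condenses it into a log-friendly summary line; the formatting is illustrative:

import json

def summarize(raw_json: str) -> str:
    """Condense a Guardian response into a single log line."""
    r = json.loads(raw_json)
    ensemble = r["ensemble"]
    sexism = r["label"]["sexism"]
    return (f"{ensemble['summary']} (issue={ensemble['primary_issue']}, "
            f"score={ensemble['score']:.3f}, sexism={sexism['score']:.3f})")

# For the example above this yields:
# "highly_harmful (issue=sexism, score=0.689, sexism=0.847)"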

Using Response Data

Decision Making

1. Check ensemble.summary

Use the summary for quick decisions:
  • "likely_safe": Allow content
  • "potentially_harmful": Flag for review
  • "likely_harmful" or "highly_harmful": Block or moderate

2. Check ensemble.primary_issue

Understand why content was flagged:
  • Show specific feedback to users
  • Route to appropriate moderators
  • Apply category-specific rules

3. Use individual labels for detail

Access specific model outputs for:
  • Detailed reporting
  • Custom threshold logic
  • Audit trails

4. Monitor meta.processing_time_ms

Track performance:
  • Identify slow requests
  • Optimize infrastructure
  • Set appropriate timeouts

Example Implementation

def handle_moderation_result(response):
    ensemble = response["ensemble"]

    if ensemble["summary"] == "highly_harmful":
        # Block immediately
        return "BLOCK"

    elif ensemble["summary"] == "likely_harmful":
        # Check specific issues
        if ensemble["primary_issue"] in ["threat", "self_harm"]:
            return "BLOCK_AND_ALERT"
        else:
            return "HOLD_FOR_REVIEW"

    elif ensemble["summary"] == "potentially_harmful":
        # Allow with monitoring
        return "ALLOW_WITH_FLAG"

    else:
        # Safe to publish
        return "ALLOW"
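
Wiring the handler to a live request is then a matter of posting the text and passing the parsed JSON through. The endpoint path and X-API-Key header below are illustrative placeholders, not documented values; substitute the real endpoint and authentication scheme from the API reference:

import requests

GUARDIAN_URL = "https://api.example.com/moderate"  # hypothetical endpoint
API_KEY = "your-api-key"                           # hypothetical auth scheme

def moderate(text: str) -> str:
    resp = requests.post(
        GUARDIAN_URL,
        headers={"X-API-Key": API_KEY},
        json={"text": text},
        timeout=5,
    )
    resp.raise_for_status()
    return handle_moderation_result(resp.json())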
