Agreement Calculation

qualcode.ai uses two independent AI raters to classify each response. Understanding how agreement is measured helps you interpret your results and identify potential issues with your coding guide.

What is Inter-Rater Agreement?

Inter-rater agreement measures how often two independent raters assign the same category to a response. In traditional research, this involves two human coders. qualcode.ai uses two different AI models (one from OpenAI, one from Anthropic) to provide independent classifications.

Why two raters? Using two independent AI raters from different providers reduces the risk of systematic bias and provides a reliability check similar to traditional human coding methods.

How Agreement Rate is Calculated

The agreement rate is the percentage of responses where both raters assigned the same category:

Agreement Rate = (Agreed Responses / Total Classified Responses) × 100%

What Counts as "Classified"

Not all responses are included in agreement calculations:

Status           Included in Agreement?   Reason
Agreed           Yes                      Both raters assigned the same category
Disagreed        Yes                      Raters assigned different categories
Reconciled       Yes                      Human resolved a disagreement
Rejected         No                       Pre-filtered as invalid (empty, spam, etc.)
Unclassifiable   No                       Neither rater could classify (system issue)

Excluded responses don't affect agreement: Rejected and unclassifiable responses are excluded because they don't represent genuine classification decisions. This keeps your agreement rate focused on meaningful data.
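The calculation above can be sketched in Python. This is an illustrative sketch, not qualcode.ai's implementation; it assumes reconciled responses count toward the denominator (they were genuine classification decisions) but not the numerator (they began as disagreements):

```python
# Sketch of the agreement-rate calculation described above.
# Status labels mirror the table; the data is illustrative.
AGREEMENT_STATUSES = {"agreed", "disagreed", "reconciled"}  # count toward the rate
EXCLUDED_STATUSES = {"rejected", "unclassifiable"}          # filtered out first

def agreement_rate(statuses):
    """Percentage of classified responses where both raters agreed.

    Assumption: 'reconciled' counts as classified (denominator) but
    not as agreement (numerator), since it began as a disagreement.
    """
    classified = [s for s in statuses if s in AGREEMENT_STATUSES]
    if not classified:
        return 0.0
    agreed = sum(1 for s in classified if s == "agreed")
    return agreed / len(classified) * 100

statuses = ["agreed", "agreed", "disagreed", "reconciled",
            "rejected", "unclassifiable"]
print(agreement_rate(statuses))  # 50.0 — 2 agreed out of 4 classified
```

Note that the two excluded statuses change neither the numerator nor the denominator, which is why they cannot drag your agreement rate down.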

Understanding Cohen's Kappa

While agreement rate tells you how often raters agree, it doesn't account for agreement that would occur by chance. Cohen's Kappa (κ) corrects for this.

Why Chance Agreement Matters

If your coding guide has only 2 categories, two raters guessing at random with equal probability would agree 50% of the time. Kappa adjusts for this expected chance agreement:

κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)

Interpreting Kappa Values

Kappa       Interpretation          Action
≤0.20       Poor agreement          Review your coding guide - categories may be ambiguous
0.21-0.40   Fair agreement          Consider refining category descriptions
0.41-0.60   Moderate agreement      Acceptable for exploratory research
0.61-0.80   Substantial agreement   Good reliability for most research purposes
0.81-1.00   Almost perfect          Excellent - your coding guide is well-designed

Kappa can be low even with high agreement: If one category dominates (e.g., 90% of responses), Kappa will be lower than the agreement rate suggests because chance agreement is high. This is statistically correct - it's harder to demonstrate reliability when there's little variation.
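A quick numeric illustration of that effect, using hypothetical counts for a 100-response run with one dominant category:

```python
# Hypothetical counts: 85 responses both raters coded "Yes",
# 5 both coded "No", and 10 split disagreements (5 each way).
n = 100
p_o = (85 + 5) / n                # observed agreement = 0.90
yes_a = yes_b = 90 / n            # each rater said "Yes" 90% of the time
p_e = yes_a * yes_b + (1 - yes_a) * (1 - yes_b)  # chance agreement = 0.82
kappa = (p_o - p_e) / (1 - p_e)
print(p_o, round(kappa, 3))       # 0.9 0.444 — high agreement, moderate kappa
```

With 90% of labels in one category, chance alone already produces 82% agreement, so the observed 90% earns a kappa of only about 0.44.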

Understanding Krippendorff's Alpha

Krippendorff's Alpha (α) is another measure of inter-rater reliability, developed by Klaus Krippendorff. It has two key advantages over Cohen's Kappa:

  • Handles missing data: Alpha can include responses where one rater failed to classify, treating them as missing values rather than excluding them entirely.
  • Generalizes to more raters: While qualcode.ai uses two raters, Alpha's formula works for any number of raters.
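For two-rater nominal data, Alpha can be sketched from the coincidence matrix. This is a generic textbook formulation, not qualcode.ai's exact implementation; `None` marks a missing classification:

```python
from collections import defaultdict

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of per-response label lists; use None for a
    missing classification. Units with fewer than two labels after
    dropping None are skipped, so missing data is handled gracefully.
    """
    coincidence = defaultdict(float)  # (c, k) -> weighted pair count
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue
        for i, c in enumerate(values):
            for j, k in enumerate(values):
                if i != j:
                    coincidence[(c, k)] += 1 / (m - 1)
    n_c = defaultdict(float)
    for (c, _), count in coincidence.items():
        n_c[c] += count
    n = sum(n_c.values())
    observed = sum(v for (c, k), v in coincidence.items() if c != k) / n
    expected = sum(
        n_c[c] * n_c[k] for c in n_c for k in n_c if c != k
    ) / (n * (n - 1))
    return 1 - observed / expected

units = [["a", "a"], ["a", "b"], ["b", "b"], ["a", None]]  # last unit: one rater missing
print(round(krippendorff_alpha_nominal(units), 3))  # 0.444
```

The last unit contributes nothing here because only one rater classified it, but with three or more raters a partially missing unit would still count.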

Interpreting Alpha Values

Krippendorff proposed stricter thresholds than the Landis-Koch scale used for Kappa:

Alpha       Interpretation          Action
<0.67       Unreliable              Results should be discarded or used only tentatively
0.67-0.79   Tentative conclusions   Acceptable for exploratory purposes only
≥0.80       Reliable                Suitable for drawing firm conclusions

More conservative thresholds: Alpha's interpretation scale is stricter than Kappa's Landis-Koch scale. A Kappa of 0.65 would be "Substantial," but an Alpha of 0.65 would be "Unreliable." This doesn't mean Alpha is better - they're measuring the same concept with different assumptions.

Which Metric Should I Report?

Both metrics measure the same underlying concept (inter-rater reliability), so they will typically be similar. The choice depends on your field and publication venue:

Situation                       Recommendation
Standard content analysis       Report Kappa with Landis-Koch interpretation
Communication research          Report Alpha (preferred in the field since Krippendorff, 2004)
Psychology journals             Report Kappa (more widely recognized)
Thorough methodology section    Report both with their respective interpretation scales
Many unclassifiable responses   Report Alpha (handles missing data)

When in doubt, report both: qualcode.ai calculates both metrics automatically. Reporting both demonstrates methodological rigor and lets readers interpret using the scale they're familiar with.

When Kappa and Alpha Differ

In qualcode.ai, Kappa and Alpha are calculated on slightly different data:

  • Kappa: Calculated only on responses where both raters successfully classified (excludes Unclassifiable)
  • Alpha: Includes responses where one or both raters returned empty (treats as missing data)

This means Alpha may be based on more responses than Kappa, particularly if you have many unclassifiable responses. If the metrics differ significantly, check how many responses were unclassifiable - that's likely the cause.

What Affects Agreement?

Several factors influence how well the two AI raters agree:

Coding Guide Quality

  • Clear descriptions: Explicit, detailed category descriptions increase agreement
  • Distinct categories: Overlapping categories cause confusion
  • Good examples: Include what belongs AND what doesn't belong

Response Characteristics

  • Response length: Very short responses are harder to classify accurately
  • Ambiguity: Some responses genuinely fit multiple categories
  • Off-topic content: Responses that don't address the question

Category Distribution

  • Balanced categories: More variation = more meaningful agreement stats
  • Number of categories: More categories = lower expected chance agreement

Low Confidence Responses

Each rater provides a confidence score (0-1) with their classification. Responses where either rater has low confidence are flagged for review, even if both raters agreed.

Why Review Low Confidence Agreements?

Two raters might agree on a category but both be uncertain. This often indicates:

  • The response is borderline between categories
  • The response is unusual or edge-case
  • Category descriptions need improvement

Confidence threshold: By default, responses where either rater's confidence is below 0.6 are flagged for review. You can adjust this threshold when starting a coding run.
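The flagging rule can be sketched as follows. The function name and the assumption that disagreements are always routed to review are illustrative; only the 0.6 default threshold comes from the documentation:

```python
DEFAULT_THRESHOLD = 0.6  # documented default; adjustable per coding run

def needs_review(conf_a, conf_b, agreed, threshold=DEFAULT_THRESHOLD):
    """Flag a response for human review (illustrative sketch).

    Assumption: disagreements always need review; agreements are
    still flagged when either rater's confidence is below threshold.
    """
    return (not agreed) or conf_a < threshold or conf_b < threshold

print(needs_review(0.9, 0.55, agreed=True))   # True — one rater below 0.6
print(needs_review(0.8, 0.7, agreed=True))    # False — confident agreement
```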

Multi-Label Agreement

In multi-label mode, agreement is more complex because each response can have multiple categories assigned:

  • Agreement is calculated per response based on the set of assigned categories
  • Partial overlap (some categories match) is treated as disagreement
  • Both raters must assign the exact same set of categories for agreement
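The exact-match rule above amounts to a set comparison. A minimal sketch (category names are made up):

```python
def multilabel_agrees(labels_a, labels_b):
    """Exact-match multi-label agreement.

    Order and duplicates don't matter; partial overlap counts as
    disagreement, per the rules above.
    """
    return set(labels_a) == set(labels_b)

print(multilabel_agrees(["price", "quality"], ["quality", "price"]))  # True
print(multilabel_agrees(["price", "quality"], ["price"]))             # False
```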

Next: Learn about the different export formats available for your coded data.