Agreement Calculation

qualcode.ai uses two independent AI raters to classify each response. Understanding how agreement is measured helps you interpret your results and identify potential issues with your coding guide.

What is Inter-Rater Agreement?

Inter-rater agreement measures how often two independent raters assign the same category to a response. In traditional research, this involves two human coders. qualcode.ai uses two different AI models (one from OpenAI, one from Anthropic) to provide independent classifications.

Why two raters? Using two independent AI raters from different providers reduces the risk of systematic bias and provides a reliability check similar to traditional human coding methods.

How Agreement Rate is Calculated

The agreement rate is the percentage of responses where both raters assigned the same category:

Agreement Rate = (Agreed Responses / Total Classified Responses) × 100%

What Counts as "Classified"

Not all responses are included in agreement calculations:

Status           Included in Agreement?   Reason
Agreed           Yes                      Both raters assigned the same category
Disagreed        Yes                      Raters assigned different categories
Reconciled       Yes                      Human resolved a disagreement
Rejected         No                       Pre-filtered as invalid (empty, spam, etc.)
Unclassifiable   No                       Neither rater could classify (system issue)

Excluded responses don't affect agreement: Rejected and unclassifiable responses are excluded because they don't represent genuine classification decisions. This keeps your agreement rate focused on meaningful data.
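The calculation above can be sketched in Python. This is an illustrative sketch, not qualcode.ai's implementation; it assumes reconciled responses count toward the denominator (they were genuine classification decisions) but not the numerator (they began as disagreements):

```python
# Sketch of the agreement-rate calculation described above.
# Status labels mirror the table; the data is illustrative.
AGREEMENT_STATUSES = {"agreed", "disagreed", "reconciled"}  # count toward the rate
EXCLUDED_STATUSES = {"rejected", "unclassifiable"}          # filtered out first

def agreement_rate(statuses):
    """Percentage of classified responses where both raters agreed.

    Assumption: 'reconciled' counts as classified (denominator) but
    not as agreement (numerator), since it began as a disagreement.
    """
    classified = [s for s in statuses if s in AGREEMENT_STATUSES]
    if not classified:
        return 0.0
    agreed = sum(1 for s in classified if s == "agreed")
    return agreed / len(classified) * 100

statuses = ["agreed", "agreed", "disagreed", "reconciled",
            "rejected", "unclassifiable"]
print(agreement_rate(statuses))  # 50.0 — 2 agreed out of 4 classified
```

Note that the two excluded statuses change neither the numerator nor the denominator, which is why they cannot drag your agreement rate down.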

Understanding Cohen's Kappa

While agreement rate tells you how often raters agree, it doesn't account for agreement that would occur by chance. Cohen's Kappa (κ) corrects for this.

Why Chance Agreement Matters

If your coding guide has only 2 categories, two raters guessing at random with equal probability would agree 50% of the time. Kappa adjusts for this expected chance agreement:

κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)

Interpreting Kappa Values

Kappa       Interpretation          Action
≤0.20       Poor agreement          Review your coding guide - categories may be ambiguous
0.21-0.40   Fair agreement          Consider refining category descriptions
0.41-0.60   Moderate agreement      Acceptable for exploratory research
0.61-0.80   Substantial agreement   Good reliability for most research purposes
0.81-1.00   Almost perfect          Excellent - your coding guide is well-designed

Kappa can be low even with high agreement: If one category dominates (e.g., 90% of responses), Kappa will be lower than the agreement rate suggests because chance agreement is high. This is statistically correct - it's harder to demonstrate reliability when there's little variation.
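A quick numeric illustration of that effect, using hypothetical counts for a 100-response run with one dominant category:

```python
# Hypothetical counts: 85 responses both raters coded "Yes",
# 5 both coded "No", and 10 split disagreements (5 each way).
n = 100
p_o = (85 + 5) / n                # observed agreement = 0.90
yes_a = yes_b = 90 / n            # each rater said "Yes" 90% of the time
p_e = yes_a * yes_b + (1 - yes_a) * (1 - yes_b)  # chance agreement = 0.82
kappa = (p_o - p_e) / (1 - p_e)
print(p_o, round(kappa, 3))       # 0.9 0.444 — high agreement, moderate kappa
```

With 90% of labels in one category, chance alone already produces 82% agreement, so the observed 90% earns a kappa of only about 0.44.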

Understanding Krippendorff's Alpha

Krippendorff's Alpha (α) is another measure of inter-rater reliability, developed by Klaus Krippendorff. It has two key advantages over Cohen's Kappa:

  • Handles missing data: Alpha can include responses where one rater failed to classify, treating them as missing values rather than excluding them entirely.
  • Generalizes to more raters: While qualcode.ai uses two raters, Alpha's formula works for any number of raters.
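For two-rater nominal data, Alpha can be sketched from the coincidence matrix. This is a generic textbook formulation, not qualcode.ai's exact implementation; `None` marks a missing classification:

```python
from collections import defaultdict

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of per-response label lists; use None for a
    missing classification. Units with fewer than two labels after
    dropping None are skipped, so missing data is handled gracefully.
    """
    coincidence = defaultdict(float)  # (c, k) -> weighted pair count
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue
        for i, c in enumerate(values):
            for j, k in enumerate(values):
                if i != j:
                    coincidence[(c, k)] += 1 / (m - 1)
    n_c = defaultdict(float)
    for (c, _), count in coincidence.items():
        n_c[c] += count
    n = sum(n_c.values())
    observed = sum(v for (c, k), v in coincidence.items() if c != k) / n
    expected = sum(
        n_c[c] * n_c[k] for c in n_c for k in n_c if c != k
    ) / (n * (n - 1))
    return 1 - observed / expected

units = [["a", "a"], ["a", "b"], ["b", "b"], ["a", None]]  # last unit: one rater missing
print(round(krippendorff_alpha_nominal(units), 3))  # 0.444
```

The last unit contributes nothing here because only one rater classified it, but with three or more raters a partially missing unit would still count.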

Interpreting Alpha Values

Krippendorff proposed stricter thresholds than the Landis-Koch scale used for Kappa:

Alpha       Interpretation          Action
<0.67       Unreliable              Results should be discarded or used only tentatively
0.67-0.79   Tentative conclusions   Acceptable for exploratory purposes only
≥0.80       Reliable                Suitable for drawing firm conclusions

More conservative thresholds: Alpha's interpretation scale is stricter than Kappa's Landis-Koch scale. A Kappa of 0.65 would be "Substantial," but an Alpha of 0.65 would be "Unreliable." This doesn't mean Alpha is better - they're measuring the same concept with different assumptions.

Which Metric Should I Report?

Both metrics measure the same underlying concept (inter-rater reliability), so they will typically be similar. The choice depends on your field and publication venue:

Situation                       Recommendation
Standard content analysis       Report Kappa with Landis-Koch interpretation
Communication research          Report Alpha (preferred in the field since Krippendorff, 2004)
Psychology journals             Report Kappa (more widely recognized)
Thorough methodology section    Report both with their respective interpretation scales
Many unclassifiable responses   Report Alpha (handles missing data)

When in doubt, report both: qualcode.ai calculates both metrics automatically. Reporting both demonstrates methodological rigor and lets readers interpret using the scale they're familiar with.

When Kappa and Alpha Differ

In qualcode.ai, Kappa and Alpha are calculated on slightly different data:

  • Kappa: Calculated only on responses where both raters successfully classified (excludes Unclassifiable)
  • Alpha: Includes responses where one or both raters returned empty (treats as missing data)

This means Alpha may be based on more responses than Kappa, particularly if you have many unclassifiable responses. If the metrics differ significantly, check how many responses were unclassifiable - that's likely the cause.

What Affects Agreement?

Several factors influence how well the two AI raters agree:

Coding Guide Quality

  • Clear descriptions: Explicit, detailed category descriptions increase agreement
  • Distinct categories: Overlapping categories cause confusion
  • Good examples: Include what belongs AND what doesn't belong

Response Characteristics

  • Response length: Very short responses are harder to classify accurately
  • Ambiguity: Some responses genuinely fit multiple categories
  • Off-topic content: Responses that don't address the question

Category Distribution

  • Balanced categories: More variation = more meaningful agreement stats
  • Number of categories: More categories = lower expected chance agreement

Low Confidence Responses

Each rater provides a confidence score (0-1) with their classification. Responses where either rater has low confidence are flagged for review, even if both raters agreed.

Why Review Low Confidence Agreements?

Two raters might agree on a category but both be uncertain. This often indicates:

  • The response is borderline between categories
  • The response is unusual or edge-case
  • Category descriptions need improvement

Confidence threshold: By default, responses where either rater's confidence is below 0.6 are flagged for review. You can adjust this threshold when starting a coding run.
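The flagging rule can be sketched as follows. The function name and the assumption that disagreements are always routed to review are illustrative; only the 0.6 default threshold comes from the documentation:

```python
DEFAULT_THRESHOLD = 0.6  # documented default; adjustable per coding run

def needs_review(conf_a, conf_b, agreed, threshold=DEFAULT_THRESHOLD):
    """Flag a response for human review (illustrative sketch).

    Assumption: disagreements always need review; agreements are
    still flagged when either rater's confidence is below threshold.
    """
    return (not agreed) or conf_a < threshold or conf_b < threshold

print(needs_review(0.9, 0.55, agreed=True))   # True — one rater below 0.6
print(needs_review(0.8, 0.7, agreed=True))    # False — confident agreement
```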

Multi-Label Agreement

In multi-label mode, agreement is more complex because each response can have multiple categories assigned:

  • Agreement is calculated per response based on the set of assigned categories
  • Partial overlap (some categories match) is treated as disagreement
  • Both raters must assign the exact same set of categories for agreement
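The exact-match rule above amounts to a set comparison. A minimal sketch (category names are made up):

```python
def multilabel_agrees(labels_a, labels_b):
    """Exact-match multi-label agreement.

    Order and duplicates don't matter; partial overlap counts as
    disagreement, per the rules above.
    """
    return set(labels_a) == set(labels_b)

print(multilabel_agrees(["price", "quality"], ["quality", "price"]))  # True
print(multilabel_agrees(["price", "quality"], ["price"]))             # False
```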

Next: Learn about the different export formats available for your coded data.