
Improving Docstats: A Journey in Quality Hill Climbing

Docstats is a multi-protocol server designed to provide scalable readability metrics for AI agents and developers across web, PDF, and cloud sources. This project demonstrates high-assurance software engineering through modular refactoring and the implementation of a data-driven "Golden Set" for rigorous performance benchmarking.

Readability metrics are often the unsung heroes of content strategy. Whether you're writing technical documentation, legal contracts, or children's stories, knowing the linguistic complexity of your text is vital.

Docstats is a tool I created about six months ago to provide these metrics at scale, supporting everything from direct text input to web URLs and Google Cloud Storage (GCS) PDF files. Recently, I revisited the codebase with Gemini CLI to move it beyond simple functionality and toward high-assurance quality.

The Purpose of Docstats

Docstats calculates a comprehensive set of readability scores, including:

  • Flesch Reading Ease and Flesch-Kincaid Grade
  • Gunning Fog Index
  • SMOG Index
  • Automated Readability Index
  • Coleman-Liau Index
  • Linsear Write Formula
  • Dale-Chall and Spache readability scores
  • A consensus "text standard" grade, plus word, sentence, and syllable counts

It's not just a library; it's a multi-protocol server. You can run it as a FastAPI HTTP server or as an MCP (Model Context Protocol) server, allowing AI agents to "read" and analyze documents autonomously.
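
To make that concrete, here is a minimal sketch of what the HTTP side could look like, assuming FastAPI routing, a Pydantic request model, and the textstat library for the scores (the response field names match its API). The ScoreRequest model and the inline fetch are my simplifications, not the project's actual code; PDF, GCS, and error handling are omitted.

# Illustrative only, not the project's actual code. It assumes the textstat
# library and a single web_url input; real extraction lives in a separate module.
import httpx
import textstat
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScoreRequest(BaseModel):
    web_url: str  # the real service also accepts raw text and GCS PDF sources

@app.post("/scores/")
def scores(req: ScoreRequest) -> dict:
    # Naive fetch; PDF detection and HTML stripping are handled elsewhere.
    text = httpx.get(req.web_url, follow_redirects=True).text
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "gunning_fog": textstat.gunning_fog(text),
        "smog_index": textstat.smog_index(text),
        "text_standard": textstat.text_standard(text, float_output=True),
        "word_count": textstat.lexicon_count(text),
        "sentence_count": textstat.sentence_count(text),
        "syllable_count": textstat.syllable_count(text),
        # ...remaining metrics omitted for brevity
    }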

Quick Example: Evaluating a Research Paper

Evaluating a complex document is now as simple as a single API call. For example, to analyze a recent paper from arXiv:

curl -X POST "http://localhost:8000/scores/" \
     -H "Content-Type: application/json" \
     -d '{ "web_url": "https://arxiv.org/pdf/2503.05244" }'

Docstats automatically detects the PDF format, extracts the text, and returns a detailed JSON breakdown of its complexity.

{
  "flesch_reading_ease": 25.879137533818948,
  "flesch_kincaid_grade": 14.568110824527679,
  "gunning_fog": 17.588858807286314,
  "smog_index": 15.764899803559697,
  "automated_readability_index": 16.818193221887633,
  "coleman_liau_index": 16.24712440516655,
  "linsear_write_formula": 19,
  "dale_chall_readability_score": 14.554056901348915,
  "text_standard": "15.0",
  "spache": 10.00798599597112,
  "syllable_count": 27970,
  "word_count": 14710,
  "sentence_count": 743
}

Or, in my more common workflow, through Gemini CLI with Docstats registered as an MCP server:

Use docstats to get readability scores for this PDF: https://arxiv.org/pdf/2503.05244 and explain the results to me.


Readability Analysis: arXiv 2503.05244

This document presents the readability scores for the paper 2503.05244.

Readability Summary

Metric                        Score   Interpretation
Flesch Reading Ease           25.88   Very Difficult. Typical for academic papers.
Flesch-Kincaid Grade          14.57   College Level. Requires ~15 years of formal education.
Gunning Fog Index             17.59   Graduate Level. Written for a post-graduate audience.
SMOG Index                    15.76   Graduate Level.
Automated Readability Index   16.82   Graduate Level.
Text Standard                 15.0    Consensus Grade Level.

Text Statistics

  • Word Count: 14,710
  • Sentence Count: 743
  • Syllable Count: 27,970
  • Average Sentence Length: ~19.8 words
  • Average Syllables per Word: ~1.9

Analysis

The scores indicate a highly technical academic document. The low Flesch Reading Ease and high Grade Level indices (Flesch-Kincaid, Gunning Fog, SMOG) are consistent with professional research papers that employ specialized terminology and complex syntactic structures.

The Quality Hill Climb: Recent Improvements

We didn't just add features; we refactored for the long haul. Here's how we raised the "quality ceiling" of the project:

1. Modular Refactoring

The original main.py was a classic "God Object", handling extraction, logic, routing, and server management. Gemini CLI and I decomposed it into domain-specific modules:

  • extraction.py: Handles the nuances of httpx for web content and pypdf for PDF documents (see the sketch after this list).
  • metrics.py: The pure business logic for readability calculations.
  • models.py: Pydantic models for strict data validation.
  • mcp_server.py: Dedicated Model Context Protocol implementations.
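
To make the split concrete, here is a rough, hypothetical sketch of the kind of logic extraction.py owns. The function name and the detection heuristics are my illustration rather than the project's code, and GCS-hosted PDFs are left out.

# Hypothetical sketch of extraction logic; names and heuristics are illustrative.
import io

import httpx
from pypdf import PdfReader

def extract_text_from_url(url: str) -> str:
    """Fetch a URL and return plain text, whether it serves HTML or a PDF."""
    response = httpx.get(url, follow_redirects=True)
    response.raise_for_status()

    content_type = response.headers.get("content-type", "")
    if "application/pdf" in content_type or response.content[:5] == b"%PDF-":
        # PDF: pull text out of every page with pypdf.
        reader = PdfReader(io.BytesIO(response.content))
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    # HTML/plain text: the real module strips markup before scoring;
    # returning the raw body keeps this sketch short.
    return response.text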

2. Test Coverage & Mocking

Integration tests are great, but they are brittle when they rely on live websites or GCS buckets. We introduced a unit testing layer using pytest-mock. This allows us to verify extraction logic and MCP tool handlers in milliseconds without an internet connection.
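
As an illustration of the approach, a test in that style might look like the sketch below. It assumes the module layout from the previous section is importable as docstats.extraction and reuses the hypothetical extract_text_from_url function; the real suite also exercises the MCP tool handlers.

# A sketch of the mocking style, not the project's actual tests. The import path
# and the function under test are assumptions carried over from the sketch above.
from docstats import extraction

def test_extracts_html_text_without_network(mocker):
    fake_response = mocker.Mock(
        headers={"content-type": "text/html"},
        content=b"<p>A short readable sentence.</p>",
        text="A short readable sentence.",
    )
    # Patch httpx.get as used inside extraction.py so no request is ever made.
    mocker.patch("docstats.extraction.httpx.get", return_value=fake_response)

    text = extraction.extract_text_from_url("https://example.com/page")

    assert "readable" in text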

3. The 'Golden Set' & Benchmarking

The most significant addition was the Readability Golden Set. We created a collection of reference texts representing different complexity levels:

  • Primary: Simple sentences, 1st-3rd grade level.
  • Middle: Descriptive prose, 6th-8th grade level.
  • Academic: Specialized vocabulary, graduate level.
  • Legal: Complex clauses and high "legalese."

We built a baseline_analysis.py script that captures the "ground truth" for these files. When run, it produces a clear calibration map:

Sample               Grade Standard   Word Count
level_primary.txt    -1.0             133
level_middle.txt     15.0             153
level_academic.txt   23.0             140
level_legal.txt      25.0             109

Now, if we change how we extract text or normalize punctuation, we can instantly see if the readability scores shift.
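
Conceptually, the script just loops over the golden files and prints the consensus grade and word count for each. The sketch below is my reconstruction: the tests/golden_set directory and the use of textstat are assumptions, not details confirmed by the repo.

# Roughly how a baseline script like this can work; directory layout and the
# use of textstat here are assumptions, not the repo's exact code.
from pathlib import Path

import textstat

GOLDEN_SET_DIR = Path("tests/golden_set")  # hypothetical location

def main() -> None:
    print(f"{'Sample':<22} {'Grade Standard':>14} {'Word Count':>11}")
    for sample in sorted(GOLDEN_SET_DIR.glob("level_*.txt")):
        text = sample.read_text(encoding="utf-8")
        grade = textstat.text_standard(text, float_output=True)
        words = textstat.lexicon_count(text)
        print(f"{sample.name:<22} {grade:>14.1f} {words:>11}")

if __name__ == "__main__":
    main()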

Why this matters: Hill Climbing for Quality

In this context, hill climbing is an iterative optimization process. We use the baseline to understand where a text stands, identify the linguistic "levers" (such as sentence length or syllable counts), and make targeted adjustments. By re-running the metrics after each change, we "climb" toward the desired readability peak for the target audience. The same loop applies to the code itself: because the Golden Set pins down expected scores, every change to extraction or normalization either holds the baseline steady or moves it in a direction we chose deliberately.

Conclusion

By focusing on modularity, robust testing, and a data-driven "Golden Set," Docstats has evolved from a simple script into a reliable engineering tool. Whether you're a developer building an AI agent or a content creator aiming for clarity, Docstats provides the foundation you need.

Check out the code on GitHub.