Improving Docstats: A Journey in Quality Hill Climbing
Docstats is a multi-protocol server designed to provide scalable readability metrics for AI agents and developers across web, PDF, and cloud sources. This project demonstrates high-assurance software engineering through modular refactoring and the implementation of a data-driven "Golden Set" for rigorous performance benchmarking.
Readability metrics are often the unsung heroes of content strategy. Whether you're writing technical documentation, legal contracts, or children's stories, knowing the linguistic complexity of your text is vital.
Docstats is a tool I created about six months ago to provide these metrics at scale, supporting everything from direct text input to web URLs and Google Cloud Storage (GCS) PDF files. Recently, I revisited the codebase with Gemini CLI to move beyond simple functionality and toward high-assurance quality.
The Purpose of Docstats
Docstats calculates a comprehensive set of readability scores, including:
- Flesch Reading Ease & Flesch-Kincaid Grade Level
- Gunning Fog Index & SMOG Index
- Automated Readability Index (ARI)
- Coleman-Liau Index
- Spache Readability Score
It's not just a library; it's a multi-protocol server. You can run it as a FastAPI HTTP server or as an MCP (Model Context Protocol) server, allowing AI agents to "read" and analyze documents autonomously.
Quick Example: Evaluating a Research Paper
Evaluating a complex document is now as simple as a single API call. For example, to analyze a recent paper from arXiv:
```bash
curl -X POST "http://localhost:8000/scores/" \
  -H "Content-Type: application/json" \
  -d '{ "web_url": "https://arxiv.org/pdf/2503.05244" }'
```
Docstats automatically detects the PDF format, extracts the text, and returns a detailed JSON breakdown of its complexity.
```json
{
  "flesch_reading_ease": 25.879137533818948,
  "flesch_kincaid_grade": 14.568110824527679,
  "gunning_fog": 17.588858807286314,
  "smog_index": 15.764899803559697,
  "automated_readability_index": 16.818193221887633,
  "coleman_liau_index": 16.24712440516655,
  "linsear_write_formula": 19,
  "dale_chall_readability_score": 14.554056901348915,
  "text_standard": "15.0",
  "spache": 10.00798599597112,
  "syllable_count": 27970,
  "word_count": 14710,
  "sentence_count": 743
}
```
Or, as I use it more often, through Gemini CLI with Docstats registered as an MCP server:
Use docstats to get readability scores for this PDF: https://arxiv.org/pdf/2503.05244 and explain the results to me.
Readability Analysis: arXiv 2503.05244
This document presents the readability scores for the paper 2503.05244.
Readability Summary
| Metric | Score | Interpretation |
|---|---|---|
| Flesch Reading Ease | 25.88 | Very Difficult. Typical for academic papers. |
| Flesch-Kincaid Grade | 14.57 | College Level. Requires ~15 years of formal education. |
| Gunning Fog Index | 17.59 | Graduate Level. Written for a post-graduate audience. |
| SMOG Index | 15.76 | Graduate Level. |
| Automated Readability Index | 16.82 | Graduate Level. |
| Text Standard | 15.0 | Consensus Grade Level. |
Text Statistics
- Word Count: 14,710
- Sentence Count: 743
- Syllable Count: 27,970
- Average Sentence Length: ~19.8 words
- Average Syllables per Word: ~1.9
Analysis
The scores indicate a highly technical academic document. The low Flesch Reading Ease and high Grade Level indices (Flesch-Kincaid, Gunning Fog, SMOG) are consistent with professional research papers that employ specialized terminology and complex syntactic structures.
The Quality Hill Climb: Recent Improvements
We didn't just add features; we refactored for the long haul. Here's how we improved the "quality ceiling" of the project:
1. Modular Refactoring
The original main.py was a classic "God Object", handling extraction, logic, routing, and server management. Gemini CLI and I decomposed it into domain-specific modules:
- extraction.py: Handles the nuances of httpx for web sources and pypdf for documents.
- metrics.py: The pure business logic for readability calculations.
- models.py: Pydantic models for strict data validation.
- mcp_server.py: The dedicated Model Context Protocol implementation.
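To make the split concrete, here's an illustrative, stubbed-out sketch of how the layers could connect after such a refactor. The class and function names are assumptions rather than the actual Docstats interfaces, and the module boundaries are shown as inline comments.

```python
# Illustrative sketch only (not the actual Docstats code): stubbed-out
# versions of the refactored modules, to show how the pieces connect.
from fastapi import FastAPI
from pydantic import BaseModel

# models.py -- strict request/response schemas (names assumed)
class ScoreRequest(BaseModel):
    text: str | None = None
    web_url: str | None = None

class ScoreResponse(BaseModel):
    flesch_reading_ease: float
    word_count: int

# extraction.py -- httpx / pypdf / GCS handling lives here in the real project
def extract_text(request: ScoreRequest) -> str:
    return request.text or ""  # placeholder for web/PDF/GCS extraction

# metrics.py -- pure readability math, easy to unit test in isolation
def compute_scores(text: str) -> ScoreResponse:
    return ScoreResponse(flesch_reading_ease=0.0, word_count=len(text.split()))

# main.py -- the route stays thin and just wires the layers together
app = FastAPI()

@app.post("/scores/", response_model=ScoreResponse)
def scores(request: ScoreRequest) -> ScoreResponse:
    return compute_scores(extract_text(request))
```

The payoff of this shape is that the HTTP route and the MCP tool handler can share the same extraction and metrics code instead of each carrying their own copy.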
2. Test Coverage & Mocking
Integration tests are great, but they are brittle when they rely on live websites or GCS buckets. We introduced a unit testing layer using pytest-mock. This allows us to verify extraction logic and MCP tool handlers in milliseconds without an internet connection.
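To illustrate the pattern (this is not the project's actual test suite), a unit test with pytest-mock can patch the network boundary so extraction logic runs entirely offline. The fetch_web_text helper below is a toy stand-in for the real extraction code.

```python
# Hypothetical example of the mocking pattern, not Docstats' own tests.
import re

import httpx


def fetch_web_text(url: str) -> str:
    """Toy stand-in for the real extraction helper -- not Docstats' code."""
    response = httpx.get(url)
    response.raise_for_status()
    return re.sub(r"<[^>]+>", " ", response.text).strip()


def test_fetch_web_text_strips_html(mocker):
    # pytest-mock's `mocker` fixture wraps unittest.mock: patch httpx.get
    # so no request ever leaves the test process.
    fake = mocker.Mock(text="<p>Plain words.</p>")
    mocker.patch("httpx.get", return_value=fake)

    assert fetch_web_text("https://example.com/page") == "Plain words."
```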
3. The 'Golden Set' & Benchmarking
The most significant addition was the Readability Golden Set. We created a collection of reference texts representing different complexity levels:
- Primary: Simple sentences, 1st-3rd grade level.
- Middle: Descriptive prose, 6th-8th grade level.
- Academic: Specialized vocabulary, graduate level.
- Legal: Complex clauses and high "legalese."
We built a baseline_analysis.py script that captures the "ground truth" for these files. When run, it produces a clear calibration map:
| Sample | Grade Standard | Word Count |
|---|---|---|
| level_primary.txt | -1.0 | 133 |
| level_middle.txt | 15.0 | 153 |
| level_academic.txt | 23.0 | 140 |
| level_legal.txt | 25.0 | 109 |
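For context, here is a minimal sketch of what a baseline script like this could look like. It assumes the scores come from the textstat library (the field names in the JSON output above suggest it, but treat that as an assumption) and uses an illustrative tests/golden_set/ layout; the real baseline_analysis.py may differ.

```python
# Minimal sketch of a golden-set baseline script, assuming textstat
# and an illustrative tests/golden_set/ directory of level_*.txt files.
import json
from pathlib import Path

import textstat

GOLDEN_DIR = Path("tests/golden_set")

def build_baseline() -> dict:
    baseline = {}
    for sample in sorted(GOLDEN_DIR.glob("level_*.txt")):
        text = sample.read_text(encoding="utf-8")
        baseline[sample.name] = {
            "text_standard": textstat.text_standard(text, float_output=True),
            "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
            "word_count": textstat.lexicon_count(text),
        }
    return baseline

if __name__ == "__main__":
    # Persist the ground truth so later runs can be compared against it.
    print(json.dumps(build_baseline(), indent=2))
```

Emitting plain JSON keeps the baseline easy to diff in version control.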
Now, if we change how we extract text or normalize punctuation, we can instantly see if the readability scores shift.
Why this matters: Hill Climbing for Quality
In this context, Hill Climbing is an iterative optimization process. We use the baseline to understand where a text stands, identify the linguistic "levers" (like sentence length or syllable counts), and then make targeted adjustments. By re-running the metrics after each change, we "climb" toward the desired readability peak for our target audience. This ensures every code change moves the needle in the right direction.
Conclusion
By focusing on modularity, robust testing, and a data-driven "Golden Set," Docstats has evolved from a simple script into a reliable engineering tool. Whether you're a developer building an AI agent or a content creator aiming for clarity, Docstats provides the foundation you need.
Check out the code on GitHub.