
Improving Docstats: A Journey in Quality Hill Climbing

Docstats is a multi-protocol server designed to provide scalable readability metrics for AI agents and developers across web, PDF, and cloud sources. This project demonstrates high-assurance software engineering through modular refactoring and the implementation of a data-driven "Golden Set" for rigorous performance benchmarking.

Readability metrics are often the unsung heroes of content strategy. Whether you're writing technical documentation, legal contracts, or children's stories, knowing the linguistic complexity of your text is vital.

Docstats is a tool I created about six months ago to provide these metrics at scale, supporting everything from direct text input to web URLs and Google Cloud Storage (GCS) PDF files. Recently, I revisited the codebase with Gemini CLI to move it beyond simple functionality and toward high-assurance quality.

The Purpose of Docstats

Docstats calculates a comprehensive set of readability scores, including:

  • Flesch Reading Ease and Flesch-Kincaid Grade
  • Gunning Fog Index
  • SMOG Index
  • Automated Readability Index
  • Coleman-Liau Index
  • Linsear Write Formula
  • Dale-Chall and Spache readability scores
  • A consensus "text standard" grade, plus word, sentence, and syllable counts

It's not just a library; it's a multi-protocol server. You can run it as a FastAPI HTTP server or as an MCP (Model Context Protocol) server, allowing AI agents to "read" and analyze documents autonomously.
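
To make that concrete, here is a minimal sketch of what the HTTP side could look like, assuming FastAPI routing, a Pydantic request model, and the textstat library for the scores (the response field names match its API). The ScoreRequest model and the inline fetch are my simplifications, not the project's actual code; PDF, GCS, and error handling are omitted.

# Illustrative only, not the project's actual code. It assumes the textstat
# library and a single web_url input; real extraction lives in a separate module.
import httpx
import textstat
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ScoreRequest(BaseModel):
    web_url: str  # the real service also accepts raw text and GCS PDF sources

@app.post("/scores/")
def scores(req: ScoreRequest) -> dict:
    # Naive fetch; PDF detection and HTML stripping are handled elsewhere.
    text = httpx.get(req.web_url, follow_redirects=True).text
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "gunning_fog": textstat.gunning_fog(text),
        "smog_index": textstat.smog_index(text),
        "text_standard": textstat.text_standard(text, float_output=True),
        "word_count": textstat.lexicon_count(text),
        "sentence_count": textstat.sentence_count(text),
        "syllable_count": textstat.syllable_count(text),
        # ...remaining metrics omitted for brevity
    }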

Quick Example: Evaluating a Research Paper

Evaluating a complex document is now as simple as a single API call. For example, to analyze a recent paper from arXiv:

curl -X POST "http://localhost:8000/scores/" \
     -H "Content-Type: application/json" \
     -d '{ "web_url": "https://arxiv.org/pdf/2503.05244" }'

Docstats automatically detects the PDF format, extracts the text, and returns a detailed JSON breakdown of its complexity.

{
  "flesch_reading_ease": 25.879137533818948,
  "flesch_kincaid_grade": 14.568110824527679,
  "gunning_fog": 17.588858807286314,
  "smog_index": 15.764899803559697,
  "automated_readability_index": 16.818193221887633,
  "coleman_liau_index": 16.24712440516655,
  "linsear_write_formula": 19,
  "dale_chall_readability_score": 14.554056901348915,
  "text_standard": "15.0",
  "spache": 10.00798599597112,
  "syllable_count": 27970,
  "word_count": 14710,
  "sentence_count": 743
}

Or, in my more common workflow, through Gemini CLI with Docstats registered as an MCP server:

Use docstats to get readability scores for this PDF: https://arxiv.org/pdf/2503.05244 and explain the results to me.


Readability Analysis: arXiv 2503.05244

This document presents the readability scores for the paper 2503.05244.

Readability Summary

Metric                        Score   Interpretation
Flesch Reading Ease           25.88   Very Difficult. Typical for academic papers.
Flesch-Kincaid Grade          14.57   College Level. Requires ~15 years of formal education.
Gunning Fog Index             17.59   Graduate Level. Written for a post-graduate audience.
SMOG Index                    15.76   Graduate Level.
Automated Readability Index   16.82   Graduate Level.
Text Standard                 15.0    Consensus Grade Level.

Text Statistics

  • Word Count: 14,710
  • Sentence Count: 743
  • Syllable Count: 27,970
  • Average Sentence Length: ~19.8 words
  • Average Syllables per Word: ~1.9

Analysis

The scores indicate a highly technical academic document. The low Flesch Reading Ease and high Grade Level indices (Flesch-Kincaid, Gunning Fog, SMOG) are consistent with professional research papers that employ specialized terminology and complex syntactic structures.

The Quality Hill Climb: Recent Improvements

We didn't just add features; we refactored for the long haul. Here's how we raised the "quality ceiling" of the project:

1. Modular Refactoring

The original main.py was a classic "God Object", handling extraction, logic, routing, and server management. Gemini CLI and I decomposed it into domain-specific modules:

  • extraction.py: Handles the nuances of httpx for web content and pypdf for PDF documents (see the sketch after this list).
  • metrics.py: The pure business logic for readability calculations.
  • models.py: Pydantic models for strict data validation.
  • mcp_server.py: Dedicated Model Context Protocol implementations.
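
To make the split concrete, here is a rough, hypothetical sketch of the kind of logic extraction.py owns. The function name and the detection heuristics are my illustration rather than the project's code, and GCS-hosted PDFs are left out.

# Hypothetical sketch of extraction logic; names and heuristics are illustrative.
import io

import httpx
from pypdf import PdfReader

def extract_text_from_url(url: str) -> str:
    """Fetch a URL and return plain text, whether it serves HTML or a PDF."""
    response = httpx.get(url, follow_redirects=True)
    response.raise_for_status()

    content_type = response.headers.get("content-type", "")
    if "application/pdf" in content_type or response.content[:5] == b"%PDF-":
        # PDF: pull text out of every page with pypdf.
        reader = PdfReader(io.BytesIO(response.content))
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    # HTML/plain text: the real module strips markup before scoring;
    # returning the raw body keeps this sketch short.
    return response.text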

2. Test Coverage & Mocking

Integration tests are great, but they are brittle when they rely on live websites or GCS buckets. We introduced a unit testing layer using pytest-mock. This allows us to verify extraction logic and MCP tool handlers in milliseconds without an internet connection.
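
As an illustration of the approach, a test in that style might look like the sketch below. It assumes the module layout from the previous section is importable as docstats.extraction and reuses the hypothetical extract_text_from_url function; the real suite also exercises the MCP tool handlers.

# A sketch of the mocking style, not the project's actual tests. The import path
# and the function under test are assumptions carried over from the sketch above.
from docstats import extraction

def test_extracts_html_text_without_network(mocker):
    fake_response = mocker.Mock(
        headers={"content-type": "text/html"},
        content=b"<p>A short readable sentence.</p>",
        text="A short readable sentence.",
    )
    # Patch httpx.get as used inside extraction.py so no request is ever made.
    mocker.patch("docstats.extraction.httpx.get", return_value=fake_response)

    text = extraction.extract_text_from_url("https://example.com/page")

    assert "readable" in text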

3. The 'Golden Set' & Benchmarking

The most significant addition was the Readability Golden Set. We created a collection of reference texts representing different complexity levels:

  • Primary: Simple sentences, 1st-3rd grade level.
  • Middle: Descriptive prose, 6th-8th grade level.
  • Academic: Specialized vocabulary, graduate level.
  • Legal: Complex clauses and high "legalese."

We built a baseline_analysis.py script that captures the "ground truth" for these files. When run, it produces a clear calibration map:

Sample               Grade Standard   Word Count
level_primary.txt    -1.0             133
level_middle.txt     15.0             153
level_academic.txt   23.0             140
level_legal.txt      25.0             109

Now, if we change how we extract text or normalize punctuation, we can instantly see if the readability scores shift.
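
Conceptually, the script just loops over the golden files and prints the consensus grade and word count for each. The sketch below is my reconstruction: the tests/golden_set directory and the use of textstat are assumptions, not details confirmed by the repo.

# Roughly how a baseline script like this can work; directory layout and the
# use of textstat here are assumptions, not the repo's exact code.
from pathlib import Path

import textstat

GOLDEN_SET_DIR = Path("tests/golden_set")  # hypothetical location

def main() -> None:
    print(f"{'Sample':<22} {'Grade Standard':>14} {'Word Count':>11}")
    for sample in sorted(GOLDEN_SET_DIR.glob("level_*.txt")):
        text = sample.read_text(encoding="utf-8")
        grade = textstat.text_standard(text, float_output=True)
        words = textstat.lexicon_count(text)
        print(f"{sample.name:<22} {grade:>14.1f} {words:>11}")

if __name__ == "__main__":
    main()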

Why this matters: Hill Climbing for Quality

In this context, hill climbing is an iterative optimization process. We use the baseline to understand where a text stands, identify the linguistic "levers" (such as sentence length or syllable counts), and make targeted adjustments. By re-running the metrics after each change, we "climb" toward the desired readability peak for the target audience. The same loop applies to the code itself: because the Golden Set pins down expected scores, every change to extraction or normalization either holds the baseline steady or moves it in a direction we chose deliberately.

Conclusion

By focusing on modularity, robust testing, and a data-driven "Golden Set," Docstats has evolved from a simple script into a reliable engineering tool. Whether you're a developer building an AI agent or a content creator aiming for clarity, Docstats provides the foundation you need.

Check out the code on GitHub.