Quick Start
Three commands to go from repo to a live docs site with an optional AI chatbot.
Install the package
# Python 3.10+ required
pip install deepdoc

# With chatbot dependencies (FAISS, FastAPI, fastembed)
pip install "deepdoc[chatbot]"
The base install pulls in LiteLLM, GitPython, Click, Rich, PyYAML, and Jinja2.
The [chatbot] extra adds faiss-cpu, fastapi, uvicorn, httpx, and fastembed for local embeddings.
Initialise against your repo
# cd into your target codebase first
cd /path/to/your-repo

# Initialize: writes .deepdoc.yaml with sensible defaults
deepdoc init --provider anthropic

# Add --with-chatbot to enable the /ask Q&A interface
deepdoc init --provider openai --with-chatbot

# Set your API key (matches llm.api_key_env in .deepdoc.yaml)
export ANTHROPIC_API_KEY=sk-ant-...
init is safe to re-run — it only adds keys that are missing. The generated
.deepdoc.yaml lives in your repo root and should be committed.
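The "only adds keys that are missing" behaviour can be pictured as a recursive merge of defaults into an existing config. The sketch below is illustrative only: `merge_missing` and the sample `defaults` dict are hypothetical names and values, not DeepDoc's actual implementation or its real `DEFAULT_CONFIG`.

```python
from copy import deepcopy


def merge_missing(existing: dict, defaults: dict) -> dict:
    """Return a copy of `existing` with default keys added only where absent."""
    merged = deepcopy(existing)
    for key, default_value in defaults.items():
        if key not in merged:
            # Key is missing entirely: take the default.
            merged[key] = deepcopy(default_value)
        elif isinstance(merged[key], dict) and isinstance(default_value, dict):
            # Both sides are mappings: fill gaps one level down.
            merged[key] = merge_missing(merged[key], default_value)
        # Otherwise the user's value wins and is left untouched.
    return merged


# Hypothetical defaults for illustration, not DeepDoc's real DEFAULT_CONFIG.
defaults = {"llm": {"provider": "anthropic", "temperature": 0.2}, "output_dir": "docs"}
existing = {"llm": {"provider": "openai"}, "site_dir": "public"}

merged = merge_missing(existing, defaults)
print(merged)
```

Under these assumptions, a value the user already set (here `llm.provider`) survives a re-run, while absent keys such as `llm.temperature` and `output_dir` are filled in.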
Generate and preview
# Full pipeline: scan, plan, generate, build
deepdoc generate
✓ Scan complete — 312 files · 47 endpoints · 8 integrations
✓ Plan complete — 34 buckets across 6 sections
✓ Generate complete — 34 pages · 2 MDX fixes applied
✓ Chatbot index built — 1,247 chunks · 4.2 MB FAISS index
✓ Site scaffold written → site/
Phase timings: scan 4s · plan 18s · generate 72s · build 3s

# Preview locally (Node 18+ required, deps auto-installed)
deepdoc serve
Docs → http://localhost:3000
Chat → http://localhost:3000/ask
Architecture
DeepDoc is a stateful, five-phase pipeline. Each phase has a single responsibility and
a well-defined output contract. Nothing is hand-edited inside docs/,
site/, or .deepdoc/ — all
fixes go into generators, builders, or prompts.
| Path | Role |
|---|---|
| deepdoc/cli.py | Click entry-point. init, generate, update, serve, deploy commands. |
| deepdoc/pipeline_v2.py | PipelineV2 orchestrator. Drives all five phases end-to-end. |
| deepdoc/v2_models.py | DocBucket + DocPlan dataclasses, the central data contract. |
| deepdoc/config.py | .deepdoc.yaml schema, DEFAULT_CONFIG dict, load/save helpers. |
| deepdoc/scanner/ | Static repo analysis: endpoints, runtime, integrations, DB, artifacts. |
| deepdoc/planner/engine.py | Multi-step LLM planner orchestration and bucket-scan entrypoint. |
| deepdoc/planner/heuristics.py | Public planning API: _merge_plan, _build_heuristic_assignment, _llm_step. |
| deepdoc/generator/generation.py | PageGenerator + BucketGenerationEngine. Batched parallel generation. |
| deepdoc/generator/mdx_compile_gate.py | Node.js MDX validator + LLM-reprompt + JSX-strip fallback. |
| deepdoc/chatbot/service.py | ChatbotQueryService. query / deep_research / code_deep modes. |
| deepdoc/chatbot/retrieval_mixin.py | All retrieval: FAISS + FTS, rerank, expansion, hit-selection. |
| deepdoc/site/builder/engine.py | SiteBuilderEngine, which writes the full Fumadocs scaffold. |
| deepdoc/persistence_v2.py | Ledger, sync state, manifest, plan storage under .deepdoc/. |
| deepdoc/smart_update_v2.py | Incremental update logic: diff-based replan and regen. |

Five-Phase Pipeline
Each phase runs sequentially. The output of one is the sole input to the next — there are no cross-phase side effects.
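One way to picture "the output of one is the sole input to the next" is as plain function composition. The sketch below uses three toy stand-in phases with made-up dict payloads; the function names and data shapes are assumptions for illustration and do not mirror PipelineV2's real interfaces.

```python
from typing import Callable

# A phase maps the previous phase's output to its own output.
Phase = Callable[[dict], dict]


def scan(repo: dict) -> dict:
    # Toy scan: just list the files in a stable order.
    return {"files": sorted(repo["files"])}


def plan(scan_result: dict) -> dict:
    # Toy plan: one bucket that owns every scanned file.
    return {"buckets": [{"slug": "overview", "owned_files": scan_result["files"]}]}


def generate(doc_plan: dict) -> dict:
    # Toy generate: one page body per bucket slug.
    return {"pages": {b["slug"]: f"# {b['slug']}" for b in doc_plan["buckets"]}}


def run_pipeline(initial: dict, phases: list[Phase]) -> dict:
    """Feed each phase only the result of the phase before it."""
    result = initial
    for phase in phases:
        result = phase(result)
    return result


site = run_pipeline({"files": ["b.py", "a.py"]}, [scan, plan, generate])
print(site)
```

Because each stand-in phase sees only its predecessor's return value, any phase can be tested in isolation by handing it a hand-built input.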
Scan
Traverse the repo without any LLM calls. Detects HTTP endpoints (Django, FastAPI, Express, NestJS, Laravel, Go, Falcon), runtime surfaces (async workers, schedulers, Celery), third-party integrations, database schemas (SQL, Knex, GraphQL), OpenAPI specs, and config artifacts.
Plan
Multi-step LLM bucket planner (3 sequential LLM calls). Step 1: classify the repo type. Step 2: propose documentation buckets. Step 3: inject specialised buckets (database groups, runtime pages, integration pages). Then heuristics refine ownership: assign files/symbols, decompose oversized buckets, consolidate singletons.
Generate
For each bucket: assemble an evidence pack (source snippets, artifact extracts, scan hits), call the LLM, receive MDX. Inline validation runs immediately — on failure the LLM is reprompted with the exact error. If still broken, JSX is stripped and a safe Markdown fallback is written. Nothing broken ever hits disk.
API Ref
When an OpenAPI spec is found during scan, stage the raw JSON/YAML alongside a Fumadocs-compatible manifest.json so the /api/* page tree renders interactive endpoint reference pages automatically.
Build
Write the complete Fumadocs Next.js scaffold: fumadocs.config.ts, root layout.tsx, global.css, the page tree JSON, search route, and one page.tsx per MDX file. When chatbot is enabled, also emit the React chatbot widget, sidebar toggle, and inline chat components.
After each LLM generation call, DeepDoc spawns a lightweight Node.js validator that compiles the MDX in isolation. If it throws, the exact compile error is injected back into a reprompt call. If the reprompt still fails, JSX is stripped and a safe Markdown fallback is written instead. Nothing broken ever reaches disk.
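The validate, reprompt, fallback sequence described above can be sketched as a small control-flow function. Everything here is a stand-in: `call_llm` and `compile_mdx` are injected callables, `strip_jsx` is a crude regex, and stripping the retried draft (rather than the first one) is an assumption. The real gate lives in deepdoc/generator/mdx_compile_gate.py and may differ.

```python
import re
from typing import Callable, Optional


def strip_jsx(mdx: str) -> str:
    """Crude fallback: remove anything that looks like a JSX/HTML tag."""
    return re.sub(r"</?[A-Za-z][^>]*>", "", mdx)


def generate_page(
    prompt: str,
    call_llm: Callable[[str], str],
    compile_mdx: Callable[[str], Optional[str]],  # returns an error message, or None on success
) -> str:
    draft = call_llm(prompt)
    error = compile_mdx(draft)
    if error is None:
        return draft
    # One reprompt that carries the exact compile error back to the model.
    retry = call_llm(f"{prompt}\n\nThe previous output failed to compile:\n{error}")
    if compile_mdx(retry) is None:
        return retry
    # Still broken: fall back to tag-free Markdown.
    return strip_jsx(retry)


def fake_compile(mdx: str) -> Optional[str]:
    # Toy validator: complains only about an unclosed <Card>.
    if "<Card>" in mdx and "</Card>" not in mdx:
        return "Unclosed tag <Card>"
    return None


def always_broken_llm(prompt: str) -> str:
    return "# Title\n<Card>Body"


page = generate_page("Document checkout", always_broken_llm, fake_compile)
print(page)
```

With a model stub that never fixes its output, the function still returns plain Markdown, which is the property the pipeline relies on.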
Core Data Models
The entire pipeline communicates through two dataclasses defined in
deepdoc/v2_models.py.
DocBucket is the atomic planning unit. One bucket = one documentation page. The planner creates, assigns, and refines buckets; the generator consumes them.
from dataclasses import dataclass

@dataclass
class DocBucket:
    # Identity
    bucket_type: str            # system | feature | endpoint | integration | database
    title: str                  # page title shown in nav
    slug: str                   # URL slug, e.g. "checkout-flow"
    section: str                # nav section grouping
    description: str            # one-line description for planning context

    # Ownership: what evidence goes into this page
    owned_files: list[str]      # file paths relative to repo root
    owned_symbols: list[str]    # "module.Class.method" qualified names
    artifact_refs: list[str]    # config keys, env vars, OpenAPI ops

    # Generation instructions
    generation_hints: dict      # prompt_style, is_introduction_page, is_endpoint_ref
    required_sections: list[str]
    required_diagrams: list[str]

    # Topology
    depends_on: list[str]       # slugs this bucket depends on
    parent_slug: str | None

    # Metadata
    priority: int               # generation order within section
    publication_tier: str       # core | advanced | deprecated
    source_kind_summary: str    # summary of what kind of code is in owned_files

DocPlan is the complete output of the planner. It is consumed by the generator and site builder.
@dataclass
class DocPlan:
    buckets: list[DocBucket]          # ordered list, in generation order
    nav_structure: dict[str, list]    # section → [slug, ...] for sidebar nav
    skipped_files: list[str]          # intentionally excluded (tests, fixtures…)
    orphaned_files: list[str]         # files not assigned to any bucket
    classification: str               # repo type: "api_service" | "monolith" | …
    integration_candidates: list[str]

    # Shim: adapts DocBucket to legacy DocPage interface
    @property
    def pages(self) -> list[_BucketAsPage]: ...

Configuration
.deepdoc.yaml lives in your repo root.
deepdoc init writes sensible defaults; you only need to set
llm.provider and your API key env var.
project_name: My API
description: "Internal API service for orders and fulfilment"

llm:
  provider: anthropic              # required
  model: claude-opus-4-7           # optional; uses provider default if omitted
  api_key_env: ANTHROPIC_API_KEY

generation_mode: feature_buckets   # v2 pipeline (default)
max_parallel_workers: 6
output_dir: docs
site_dir: site

# Optional chatbot
chatbot:
  enabled: true
  embeddings:
    provider: fastembed            # 100% local, no API key needed
    model: nomic-ai/nomic-embed-text-v1.5
| Key | Default | Description |
|---|---|---|
| project_name | (repo dir name) | Human-readable project name used in site title and nav header. |
| description | "" | One-line summary shown in hero and meta description. |
| output_dir | "docs" | Where generated MDX pages are written. |
| site_dir | "site" | Where the Fumadocs Next.js scaffold is written. |
| generation_mode | "feature_buckets" | v2 bucket-based pipeline. Set "file_centric" for legacy v1 mode. |
| max_pages | 0 | Cap on total pages generated. 0 = unlimited. |
| max_parallel_workers | 6 | Concurrent LLM calls during generate phase. |
| llm.provider | (required) | LLM provider: anthropic, openai, gemini, azure, ollama, … |
| llm.model | (provider default) | Model name. Passed directly to LiteLLM. |
| llm.api_key_env | auto-detected | Env var name holding the API key. Inferred from provider if omitted. |
| llm.temperature | 0.2 | Generation temperature. Lower = more deterministic output. |
| chatbot.enabled | false | Enable chatbot indexing and the /ask route in the built site. |
| chatbot.answer.provider | (inherits llm) | Separate provider for chatbot answer calls. |
| chatbot.embeddings.provider | "fastembed" | Embedding provider. fastembed runs 100% locally, with no API key. |
| chatbot.embeddings.model | nomic-embed-text-v1.5 | Local embedding model for fastembed. Downloaded once on first run. |
| large_file_lines | 500 | Files above this threshold get chunked before evidence assembly. |
| giant_file_lines | 2000 | Files above this threshold are split into symbol clusters. |
| source_context_budget | 200000 | Max chars of source included in a single evidence pack. |

CLI Reference
All commands accept --help for the full flag list.
Run python -m deepdoc.cli --help to verify installation.
deepdoc init
Scaffold a .deepdoc.yaml config file in the current directory. Safe to re-run: only missing keys are added.
| Flag | Value | Description |
|---|---|---|
| --provider | anthropic, openai, gemini, … | LLM provider (required). Sets llm.provider in .deepdoc.yaml. |
| --model | string | Override the default model for the chosen provider. |
| --with-chatbot | flag | Enable the chatbot module. Adds chatbot config block with sensible defaults. |
| --output-dir | path | Where generated MDX pages land. Default: docs/ |
| --site-dir | path | Where the Fumadocs scaffold is written. Default: site/ |
deepdoc generate
Run the full five-phase pipeline. Reads .deepdoc.yaml. On success prints per-phase timings and a page-quality summary.
| Flag | Value | Description |
|---|---|---|
| --force | flag | Ignore existing state and regenerate everything from scratch. |
| --reconcile | flag | Re-run planning then regenerate pages whose bucket assignment changed. |
| --dry-run | flag | Run scan + plan only, print the proposed bucket list, exit without generating. |
| --workers | int | Override max_parallel_workers for this run. |
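The effect of --workers (or max_parallel_workers) can be illustrated with a bounded thread pool. This is a sketch under assumptions: `generate_bucket` is a stub standing in for an LLM call, and DeepDoc's actual batching in deepdoc/generator/generation.py may use a different concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_bucket(slug: str) -> str:
    # Stand-in for a network-bound LLM call that returns an MDX page.
    return f"# {slug}\n"


def generate_all(slugs: list[str], workers: int = 6) -> dict[str, str]:
    """Generate pages with at most `workers` calls in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so slugs and pages line up.
        pages = pool.map(generate_bucket, slugs)
        return dict(zip(slugs, pages))


pages = generate_all(["checkout-flow", "order-api", "payment-integration"], workers=2)
print(list(pages))
```

Threads suit this kind of workload because the time is spent waiting on the provider rather than on CPU; lowering the worker count is the usual lever when a provider rate-limits.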
deepdoc update
Incremental update. Diffs the repo against the last synced commit. Regenerates only affected buckets; triggers a targeted replan when new routes or integrations appear.
| Flag | Value | Description |
|---|---|---|
| --since | git-ref | Compare against this ref instead of the stored sync baseline. |
| --replan | flag | Force a full replan even if the changed file set is small. |
deepdoc serve
Changes into site/ and runs next dev. Requires Node 18+ and an initial npm install (done automatically on first run).
| Flag | Value | Description |
|---|---|---|
| --port | int | HTTP port for the Next.js dev server. Default: 3000. |
| --open | flag | Open the browser automatically after starting. |
deepdoc deploy
Build the Fumadocs site (next build) and push the output to GitHub Pages. Requires git remote origin to be set.
| Flag | Value | Description |
|---|---|---|
| --branch | string | Target GitHub Pages branch. Default: gh-pages. |
| --force | flag | Force-push to the deploy branch. |
Bucket Types
Every documentation page belongs to a bucket. The planner picks the right type based on the repo classification and scanned signals. Each type maps to a different generation prompt and page template.
- system: Top-level architecture, setup guides, auth flows, observability, and ops runbooks. Every repo gets at least one.
- feature: Business-logic workflows: checkout, refunds, onboarding, notifications. Scoped to how engineers think about the domain, not how files are laid out.
- endpoint: Endpoint-family API docs. One bucket per logical resource (e.g. all /orders/* routes). Enriched with scanned route metadata.
- endpoint_ref: Single-endpoint page backed by an OpenAPI spec. Rendered as an interactive Fumadocs API reference page at /api/{slug}.
- integration: Third-party system docs: payment gateways, delivery providers, webhook consumers, external APIs. Auto-detected from import patterns.
- database: Schema and data-layer docs. Grouped by model family when a repo has many models; a single overview page for smaller schemas.
Supported Languages
DeepDoc parses source files natively (no subprocess or tree-sitter compilation required for most languages). Route detection is framework-aware.
| Extension | Frameworks | Parser |
|---|---|---|
| .py | Django · FastAPI · Falcon · Flask (generic) | deepdoc/parser/python_parser.py |
| .js, .mjs | Express · Fastify · NestJS (JS mode) | deepdoc/parser/js_ts_parser.py |
| .ts, .tsx | Express · Fastify · NestJS · Next.js API routes | deepdoc/parser/js_ts_parser.py |
| .go | net/http · Gin · Echo (heuristic) | deepdoc/parser/go_parser.py |
| .php | Laravel (routes/web.php + routes/api.php) | deepdoc/parser/php_parser.py |
| .vue | Vue Router (composition API) | deepdoc/parser/vue_parser.py |

Chatbot
An opt-in RAG chatbot that answers questions about your codebase. Embeddings run 100% locally via fastembed — no extra API key or paid embedding service required.
Enable the chatbot
# Enable during init
deepdoc init --provider anthropic --with-chatbot

# Or add to .deepdoc.yaml manually
chatbot:
  enabled: true

# The chatbot model can differ from the doc generation model
chatbot:
  enabled: true
  answer:
    provider: openai
    model: gpt-4o
  embeddings:
    provider: fastembed
    model: nomic-ai/nomic-embed-text-v1.5   # downloaded once, ~270 MB
Query modes
- /query: Single-pass BM25 + semantic retrieval. Best for quick look-ups. Returns in ~1s.
- /deep-research: Multi-hop synthesis. Bounded live-file fallback when the index lacks coverage. Streams via SSE.
- /code-deep: Strict source-first mode. Reads exact file ranges. Includes call trace and full file inventory in response.
Retrieval architecture
Query
│
├─ FAISS vector search (nomic-embed-text-v1.5, 768-dim)
│    └─ top-K semantic candidates
│
├─ SQLite FTS5 search (BM25 lexical)
│    └─ top-K keyword candidates
│
├─ Graph neighbour expansion
│    └─ chunks adjacent to strong hits
│
├─ Optional: LLM rerank (cross-encoder scoring)
│    └─ Hit selection → evidence[] + references[]
│
└─ Answer generation (answer_mixin.py)
     ├─ evidence-first prompt
     ├─ exact file:line citations
     └─ SSE streaming response
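The diagram does not say how the semantic and lexical candidate lists are combined before hit selection. One common option, shown here purely as an assumption and not as DeepDoc's method, is reciprocal rank fusion (RRF), which needs only the rank positions from each retriever. The chunk ids and the `k` constant are illustrative.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[str]:
    """Fuse several ranked lists of chunk ids into one list using RRF scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


semantic = ["c3", "c1", "c7"]   # e.g. ids from a vector search
lexical = ["c1", "c9", "c3"]    # e.g. ids from a BM25 search
fused = reciprocal_rank_fusion([semantic, lexical], top_n=3)
print(fused)
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks found by only one retriever, which is the behaviour a hybrid search is usually after.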
| File | Contents |
|---|---|
| .deepdoc/chatbot/faiss.index | FAISS flat L2 index. Rebuilt on each generate run. |
| .deepdoc/chatbot/fts.db | SQLite FTS5 database for BM25 lexical search. |
| .deepdoc/chatbot/chunks.json | Chunk metadata: text, file, line range, type. |
| .deepdoc/chatbot/source_archive/ | Compressed source files for exact line-range citation in code-deep mode. |
| .deepdoc/chatbot/symbol_index.json | Symbol name → file:line map for fast symbol look-up. |

Incremental Updates
After the first full generate, use deepdoc update instead of
regenerating everything. It only spends LLM tokens on buckets affected by your changes.
# Diff against last sync, regenerate affected buckets only
deepdoc update
✓ Diff: 8 files changed since 3f2a91b
✓ Affected buckets: checkout-flow, order-api, payment-integration
✓ Replan triggered: new route detected in orders/views.py
✓ Regenerated 4 pages in 18s (saved ~54s vs full run)

# Force a full replan even on small changes
deepdoc update --replan

# Compare against a specific git ref
deepdoc update --since main
When a targeted replan triggers
- New HTTP endpoint detected
- New third-party integration import
- New database model found
- A bucket's owned files are deleted
- Bucket count changes significantly
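The trigger list above amounts to a boolean decision over a handful of change signals. The sketch below encodes it directly; `ChangeSignals`, `needs_replan`, and the 20% drift threshold for "changes significantly" are all hypothetical, since the source does not define them.

```python
from dataclasses import dataclass


@dataclass
class ChangeSignals:
    new_endpoints: int = 0
    new_integrations: int = 0
    new_models: int = 0
    buckets_with_deleted_files: int = 0
    bucket_count_before: int = 0
    bucket_count_after: int = 0


def needs_replan(s: ChangeSignals, max_bucket_drift: float = 0.2) -> bool:
    """Return True when any of the listed triggers fires."""
    if s.new_endpoints or s.new_integrations or s.new_models:
        return True
    if s.buckets_with_deleted_files:
        return True
    if s.bucket_count_before:
        # Assumed meaning of "significantly": relative change above the threshold.
        drift = abs(s.bucket_count_after - s.bucket_count_before) / s.bucket_count_before
        return drift > max_bucket_drift
    return False


print(needs_replan(ChangeSignals(new_endpoints=1)))
```

Keeping the decision in one pure function makes each trigger easy to unit-test without touching a repository.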
State files that power updates
- .deepdoc/sync_state.json: Last-generated commit hash
- .deepdoc/ledger.json: Per-page quality + timestamp
- .deepdoc/plan.json: Serialised DocPlan from last run
- .deepdoc/manifest.json: File → hash → page mapping
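A file → hash → page mapping is enough to work out which pages are stale. The sketch below assumes a manifest shaped as `{path: {"hash": ..., "pages": [...]}}` and SHA-256 content hashes; both are guesses for illustration, and the real manifest.json layout and file paths may differ.

```python
import hashlib


def file_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def affected_pages(manifest: dict[str, dict], current_files: dict[str, bytes]) -> set[str]:
    """Collect pages whose source files changed or disappeared since the manifest was written."""
    pages: set[str] = set()
    for path, entry in manifest.items():
        content = current_files.get(path)
        if content is None or file_hash(content) != entry["hash"]:
            pages.update(entry["pages"])
    return pages


# Hypothetical manifest recorded at the last run.
manifest = {
    "orders/views.py": {"hash": file_hash(b"v1"), "pages": ["order-api"]},
    "payments/gateway.py": {"hash": file_hash(b"v1"), "pages": ["payment-integration"]},
}
# Current working tree: only orders/views.py has new content.
current = {"orders/views.py": b"v2", "payments/gateway.py": b"v1"}

stale = affected_pages(manifest, current)
print(stale)
```

Only the page backed by the edited file is reported, so an update run would leave the payment page untouched.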
Deployment
The generated site/ directory is a standard Next.js project.
Deploy it anywhere that runs Next.js, or use deepdoc deploy
for one-command GitHub Pages deployment.
# Builds Next.js static export and pushes to gh-pages branch
deepdoc deploy

# .deepdoc.yaml: GitHub Pages config block
github_pages:
  base_path: /your-repo-name   # required if repo is not at root
  branch: gh-pages
# The site/ directory is a standard Next.js app
cd site
npm install
npm run build   # or: next build
npm run start   # production server

# Or point Vercel/Netlify at the site/ subdirectory
# Build command: next build
# Output directory: site/.next
# Root directory: site
name: Regenerate docs
on:
  push:
    branches: [main]
jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: {fetch-depth: 0}
      - uses: actions/setup-python@v5
        with: {python-version: "3.11"}
      - uses: actions/setup-node@v4
        with: {node-version: "20"}
      - run: pip install deepdoc
      - run: deepdoc update   # incremental: only changed buckets
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - run: deepdoc deploy
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Module Reference
All modules follow the rule: extend _v2 variants, never
hand-edit generated output, route fixes go in repo_resolver.py.
deepdoc/scanner/
Phase 1: static analysis
Zero LLM calls. Extracts every signal that the planner and generator consume as evidence.
- discover_integrations()
- discover_database_schema()
- discover_runtime_surfaces()
- discover_artifacts()
- cluster_giant_file()

deepdoc/planner/
Phase 2: bucket planning
heuristics.py is the public planning API. Tests mock at this path. engine.py orchestrates the multi-step planner.
- plan_docs(repo_root, cfg) → DocPlan
- scan_repo(repo_root, cfg) → RepoScan
- build_flow_candidates()
- _merge_plan()
- _build_heuristic_assignment()
- _llm_step()

deepdoc/generator/
Phase 3: MDX generation
post_processors.py handles MDX escaping, fence repair, and Mermaid diagram fixups after generation.
- PageGenerator
- BucketGenerationEngine
- AssembledEvidence
- apply_mdx_compile_gate()

deepdoc/parser/
Symbol & route extraction
routes/ sub-package has per-framework detectors: Django, FastAPI, Express, NestJS, Laravel, Falcon, Go. repo_resolver.py auto-detects the framework.
- parse_file(path, language) → ParsedFile
- ParsedFile
- Symbol
- supported_extensions()

deepdoc/chatbot/
RAG indexing + query service
retrieval_mixin.py owns all search logic. answer_mixin.py owns LLM prompting. service.py wires query(), deep_research(), code_deep().
- ChatbotIndexer
- ChatbotQueryService
- create_fastapi_app()

deepdoc/site/builder/
Phase 5: scaffold generation
scaffold_files.py emits the Fumadocs project. chatbot_components.py emits React chatbot widget when chatbot.enabled = true.
- SiteBuilderEngine

deepdoc/llm/
LLM provider abstraction
Thin wrapper around LiteLLM. Supports every provider LiteLLM supports: OpenAI, Anthropic, Gemini, Azure, Mistral, Ollama, …
- LLMClient(cfg)

deepdoc/persistence_v2.py
State persistence
All state lives under .deepdoc/. Ledger tracks per-page quality. Sync state stores the last-generated git commit hash.
- load_ledger()
- save_ledger()
- load_sync_state()
- save_sync_state()

deepdoc/smart_update_v2.py
Incremental update strategy
Diffs changed files against the sync baseline, classifies impact (new route? new integration?), decides which buckets to regenerate.
- compute_update_plan()
- run_smart_update()