DeepDoc
Documentation

Engineering Reference.

Everything you need to install, configure, run, and extend DeepDoc — from the five-phase pipeline internals to chatbot retrieval and GitHub Pages deployment.

v2.3.0 · Python 3.10+ · Node 18+ · MIT
01

Quick Start

Three steps take you from a repo to a live docs site with an optional AI chatbot.

1

Install the package

bash
# Python 3.10+ required
pip install deepdoc

# With chatbot dependencies (FAISS, FastAPI, fastembed)
pip install "deepdoc[chatbot]"

The base install pulls in LiteLLM, GitPython, Click, Rich, PyYAML, and Jinja2. The [chatbot] extra adds faiss-cpu, fastapi, uvicorn, httpx, and fastembed for local embeddings.

2

Initialise against your repo

bash
# cd into your target codebase first
cd /path/to/your-repo

# Initialize — writes .deepdoc.yaml with sensible defaults
deepdoc init --provider anthropic

# Or add --with-chatbot to also enable the /ask Q&A interface
deepdoc init --provider anthropic --with-chatbot

# Set your API key (matches llm.api_key_env in .deepdoc.yaml)
export ANTHROPIC_API_KEY=sk-ant-...

deepdoc init is safe to re-run: it only adds keys that are missing and leaves existing values untouched. The generated .deepdoc.yaml lives in your repo root and should be committed.
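The "only adds keys that are missing" behaviour can be pictured as a recursive merge of defaults into an existing config. The sketch below is a minimal illustration under that assumption; merge_missing and the sample dictionaries are hypothetical and are not taken from DeepDoc's source.

```python
from copy import deepcopy


def merge_missing(existing: dict, defaults: dict) -> dict:
    """Return a copy of `existing` with keys from `defaults` added only where absent."""
    merged = deepcopy(existing)
    for key, default_value in defaults.items():
        if key not in merged:
            # Key is missing entirely: take the default.
            merged[key] = deepcopy(default_value)
        elif isinstance(merged[key], dict) and isinstance(default_value, dict):
            # Both sides are mappings: fill in missing nested keys only.
            merged[key] = merge_missing(merged[key], default_value)
        # Otherwise the user's value wins and is left untouched.
    return merged


# Hypothetical defaults and a user-edited config
defaults = {"output_dir": "docs", "llm": {"provider": "anthropic", "temperature": 0.2}}
existing = {"llm": {"provider": "openai"}}

merged = merge_missing(existing, defaults)
print(merged)
```

In this example the user's llm.provider stays "openai", while the missing llm.temperature and output_dir are filled in from the defaults.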

3

Generate and preview

bash
# Full pipeline — scan, plan, generate, build
deepdoc generate

 Scan complete        — 312 files · 47 endpoints · 8 integrations
 Plan complete        — 34 buckets across 6 sections
 Generate complete    — 34 pages · 2 MDX fixes applied
 Chatbot index built  — 1,247 chunks · 4.2 MB FAISS index
 Site scaffold written → site/

  Phase timings:  scan 4s · plan 18s · generate 72s · build 3s

# Preview locally (Node 18+ required, deps auto-installed)
deepdoc serve

  Docs   → http://localhost:3000
  Chat   → http://localhost:3000/ask
02

Architecture

DeepDoc is a stateful, five-phase pipeline. Each phase has a single responsibility and a well-defined output contract. Nothing is hand-edited inside docs/, site/, or .deepdoc/ — all fixes go into generators, builders, or prompts.

System overview
[Diagram: your repo enters deepdoc generate, which runs five phases in order:
01 Scan (no LLM) → RepoScan → 02 Plan (3×LLM) → DocPlan → 03 Generate (N×LLM) → MDX pages → 04 API Ref (OpenAPI) → manifest.json → 05 Build.
Outputs: site/app/ (Fumadocs scaffold), .deepdoc/ (state ledger), docs/*.mdx (MDX pages).
With chatbot.enabled: true, ChatbotIndexer chunks and embeds the content into a FAISS + SQLite (vector + FTS) index, served at /ask as a FastAPI SSE stream.]
Key components
deepdoc/cli.py                         Click entry-point. init, generate, update, serve, deploy commands.
deepdoc/pipeline_v2.py                 PipelineV2 orchestrator. Drives all five phases end-to-end.
deepdoc/v2_models.py                   DocBucket + DocPlan dataclasses: the central data contract.
deepdoc/config.py                      .deepdoc.yaml schema, DEFAULT_CONFIG dict, load/save helpers.
deepdoc/scanner/                       Static repo analysis: endpoints, runtime, integrations, DB, artifacts.
deepdoc/planner/engine.py              Multi-step LLM planner orchestration and bucket-scan entrypoint.
deepdoc/planner/heuristics.py          Public planning API: _merge_plan, _build_heuristic_assignment, _llm_step.
deepdoc/generator/generation.py        PageGenerator + BucketGenerationEngine. Batched parallel generation.
deepdoc/generator/mdx_compile_gate.py  Node.js MDX validator + LLM-reprompt + JSX-strip fallback.
deepdoc/chatbot/service.py             ChatbotQueryService. query / deep_research / code_deep modes.
deepdoc/chatbot/retrieval_mixin.py     All retrieval: FAISS + FTS, rerank, expansion, hit-selection.
deepdoc/site/builder/engine.py         SiteBuilderEngine, which writes the full Fumadocs scaffold.
deepdoc/persistence_v2.py              Ledger, sync state, manifest, plan storage under .deepdoc/.
deepdoc/smart_update_v2.py             Incremental update logic: diff-based replan and regen.
03

Five-Phase Pipeline

Each phase runs sequentially. The output of one is the sole input to the next — there are no cross-phase side effects.
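The "output of one phase is the sole input to the next" contract amounts to a simple fold over a list of functions. The following sketch illustrates that shape only; run_phases and the three toy phases are hypothetical stand-ins, not PipelineV2 itself.

```python
from typing import Any, Callable

Phase = Callable[[Any], Any]


def run_phases(initial: Any, phases: list[tuple[str, Phase]]) -> tuple[Any, list[str]]:
    """Feed each phase's output into the next and record the order they ran in."""
    artifact, trace = initial, []
    for name, phase in phases:
        artifact = phase(artifact)  # a phase only ever sees the previous output
        trace.append(name)
    return artifact, trace


# Toy stand-ins for the real phases (hypothetical, for illustration only)
phases = [
    ("scan", lambda repo: {"files": sorted(repo)}),
    ("plan", lambda scan: {"buckets": [{"slug": "overview", "files": scan["files"]}]}),
    ("generate", lambda plan: [f"{b['slug']}.mdx" for b in plan["buckets"]]),
]

pages, trace = run_phases({"b.py", "a.py"}, phases)
print(pages, trace)
```

Because no phase reads anything except its argument, each one can be tested in isolation by handing it a hand-built input.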

01

Scan

Traverse the repo without any LLM calls. Detects HTTP endpoints (Django, FastAPI, Express, NestJS, Laravel, Go, Falcon), runtime surfaces (async workers, schedulers, Celery), third-party integrations, database schemas (SQL, Knex, GraphQL), OpenAPI specs, and config artifacts.

output RepoScan — endpoints, symbols, routes, integrations, artifacts
02

Plan

Multi-step LLM bucket planner (3 sequential LLM calls). Step 1: classify the repo type. Step 2: propose documentation buckets. Step 3: inject specialised buckets (database groups, runtime pages, integration pages). Then heuristics refine ownership: assign files/symbols, decompose oversized buckets, consolidate singletons.

output DocPlan — ordered bucket list, nav structure, orphaned files
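Two of the heuristic refinements named above, decomposing oversized buckets and consolidating singletons, can be sketched on a plain slug-to-files mapping. This is one possible reading of those steps; the function names, the max_files threshold, the "-part-N" naming, and the "misc" slug are all assumptions made for illustration.

```python
def decompose_oversized(buckets: dict[str, list[str]], max_files: int) -> dict[str, list[str]]:
    """Split any bucket owning more than `max_files` files into numbered parts."""
    result: dict[str, list[str]] = {}
    for slug, files in buckets.items():
        if len(files) <= max_files:
            result[slug] = list(files)
            continue
        for part, start in enumerate(range(0, len(files), max_files), start=1):
            result[f"{slug}-part-{part}"] = files[start:start + max_files]
    return result


def consolidate_singletons(buckets: dict[str, list[str]], merged_slug: str = "misc") -> dict[str, list[str]]:
    """Merge buckets that own exactly one file into a single shared bucket."""
    singles = {slug: files for slug, files in buckets.items() if len(files) == 1}
    if len(singles) < 2:
        # Nothing to consolidate: return an unchanged copy.
        return {slug: list(files) for slug, files in buckets.items()}
    result = {slug: list(files) for slug, files in buckets.items() if slug not in singles}
    result.setdefault(merged_slug, []).extend(files[0] for files in singles.values())
    return result


plan = {
    "orders": ["a.py", "b.py", "c.py", "d.py"],
    "auth": ["auth.py"],
    "mail": ["mail.py"],
}
refined = consolidate_singletons(decompose_oversized(plan, max_files=2))
print(refined)
```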
03

Generate

For each bucket: assemble an evidence pack (source snippets, artifact extracts, scan hits), call the LLM, receive MDX. Inline validation runs immediately — on failure the LLM is reprompted with the exact error. If still broken, JSX is stripped and a safe Markdown fallback is written. Nothing broken ever hits disk.

output MDX pages on disk, per-page quality status, compile-gate log
04

API Ref

When an OpenAPI spec is found during scan, stage the raw JSON/YAML alongside a Fumadocs-compatible manifest.json so the /api/* page tree renders interactive endpoint reference pages automatically.

output openapi.json + manifest.json in site/app/api/
05

Build

Write the complete Fumadocs Next.js scaffold: fumadocs.config.ts, root layout.tsx, global.css, the page tree JSON, search route, and one page.tsx per MDX file. When chatbot is enabled, also emit the React chatbot widget, sidebar toggle, and inline chat components.

output site/ — a production-ready Next.js Fumadocs project
MDX Compile Gate

After each LLM generation call, DeepDoc spawns a lightweight Node.js validator that compiles the MDX in isolation. If it throws, the exact compile error is injected back into a reprompt call. If the reprompt still fails, JSX is stripped and a safe Markdown fallback is written instead. Nothing broken ever reaches disk.
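The validate, reprompt, fallback sequence described above can be sketched as a small control-flow function. This is a minimal illustration, not DeepDoc's implementation: compile_gate, strip_jsx, and the validate/reprompt callables are hypothetical names, the regex-based JSX stripping is deliberately crude, and the real gate runs a Node.js compiler where this sketch takes a plain callable.

```python
import re
from typing import Callable, Optional


def strip_jsx(mdx: str) -> str:
    """Crude fallback: remove anything that looks like a JSX/HTML tag."""
    return re.sub(r"</?[A-Za-z][^>]*>", "", mdx)


def compile_gate(
    mdx: str,
    validate: Callable[[str], Optional[str]],  # returns an error message, or None if valid
    reprompt: Callable[[str, str], str],       # (broken_mdx, error) -> new attempt
) -> tuple[str, str]:
    """Return (content_to_write, status) where status is ok | reprompted | fallback."""
    error = validate(mdx)
    if error is None:
        return mdx, "ok"
    # First failure: hand the exact error back to the model once.
    fixed = reprompt(mdx, error)
    if validate(fixed) is None:
        return fixed, "reprompted"
    # Still broken: degrade to plain Markdown rather than write invalid MDX.
    return strip_jsx(fixed), "fallback"


# Hypothetical stand-ins for the Node.js validator and the LLM call
def fake_validate(text: str) -> Optional[str]:
    return "Unclosed tag <Broken>" if "<Broken>" in text else None


def stubborn_reprompt(text: str, error: str) -> str:
    return text  # simulates a model that fails to fix the page


content, status = compile_gate("# Title\n<Broken>hello", fake_validate, stubborn_reprompt)
print(status, repr(content))
```

Passing the validator and the LLM call in as callables keeps the control flow testable without Node.js or network access.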

04

Core Data Models

The entire pipeline communicates through two dataclasses defined in deepdoc/v2_models.py.

DocBucket deepdoc/v2_models.py

The atomic planning unit. One bucket = one documentation page. The planner creates, assigns, and refines buckets; the generator consumes them.

python
from dataclasses import dataclass


@dataclass
class DocBucket:
    # Identity
    bucket_type: str          # system | feature | endpoint | integration | database
    title: str                # page title shown in nav
    slug: str                 # URL slug, e.g. "checkout-flow"
    section: str              # nav section grouping
    description: str          # one-line description for planning context

    # Ownership — what evidence goes into this page
    owned_files: list[str]    # file paths relative to repo root
    owned_symbols: list[str]  # "module.Class.method" qualified names
    artifact_refs: list[str]  # config keys, env vars, OpenAPI ops

    # Generation instructions
    generation_hints: dict    # prompt_style, is_introduction_page, is_endpoint_ref
    required_sections: list[str]
    required_diagrams: list[str]

    # Topology
    depends_on: list[str]     # slugs this bucket depends on
    parent_slug: str | None

    # Metadata
    priority: int             # generation order within section
    publication_tier: str     # core | advanced | deprecated
    source_kind_summary: str  # summary of what kind of code is in owned_files
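The depends_on field describes a dependency graph between slugs. One thing such a graph allows is deriving a dependencies-first ordering, or detecting cycles, with the standard library's graphlib. The sketch below shows that idea on a bare slug mapping; whether DeepDoc orders generation this way is not stated in this reference, and generation_order plus the sample slugs are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def generation_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Order slugs so that every slug comes after the slugs it depends on."""
    # TopologicalSorter expects {node: predecessors}, which matches depends_on.
    return list(TopologicalSorter(depends_on).static_order())


deps = {
    "checkout-flow": ["order-api", "payment-integration"],
    "order-api": ["architecture"],
    "payment-integration": ["architecture"],
    "architecture": [],
}
order = generation_order(deps)
print(order)

# A circular dependency is reported instead of silently producing a bad order.
try:
    generation_order({"a": ["b"], "b": ["a"]})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```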
DocPlan deepdoc/v2_models.py

The complete output of the planner. Consumed by the generator and site builder.

python
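from __future__ import annotations  # defers evaluation of the DocBucket / _BucketAsPage annotations
from dataclasses import dataclass


@dataclass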
class DocPlan:
    buckets: list[DocBucket]         # ordered list — generation order
    nav_structure: dict[str, list]   # section → [slug, ...] for sidebar nav
    skipped_files: list[str]         # intentionally excluded (tests, fixtures…)
    orphaned_files: list[str]        # files not assigned to any bucket
    classification: str              # repo type: "api_service" | "monolith" | …
    integration_candidates: list[str]

    # Shim — adapts DocBucket to legacy DocPage interface
    @property
    def pages(self) -> list[_BucketAsPage]: ...
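The relationship between owned, skipped, and orphaned files can be expressed as a set difference: an orphan is a scanned file that no bucket owns and that was not deliberately skipped. The sketch below assumes exactly that definition; find_orphans and the sample paths are hypothetical.

```python
def find_orphans(
    all_files: list[str],
    owned_by_bucket: dict[str, list[str]],
    skipped_files: list[str],
) -> list[str]:
    """Return files that are neither owned by a bucket nor intentionally skipped."""
    owned = {path for files in owned_by_bucket.values() for path in files}
    return sorted(set(all_files) - owned - set(skipped_files))


all_files = ["orders/views.py", "orders/models.py", "tests/test_orders.py", "scripts/seed.py"]
owned_by_bucket = {"order-api": ["orders/views.py"], "order-models": ["orders/models.py"]}
skipped = ["tests/test_orders.py"]

orphans = find_orphans(all_files, owned_by_bucket, skipped)
print(orphans)
```

A non-empty result is a useful planning signal: those files carry no documentation ownership until a bucket claims them.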
05

Configuration

.deepdoc.yaml lives in your repo root. deepdoc init writes sensible defaults; you only need to set llm.provider and your API key env var.

.deepdoc.yaml
project_name: My API
description: "Internal API service for orders and fulfilment"

llm:
  provider: anthropic           # required
  model: claude-opus-4-7        # optional — uses provider default if omitted
  api_key_env: ANTHROPIC_API_KEY

generation_mode: feature_buckets  # v2 pipeline (default)
max_parallel_workers: 6
output_dir: docs
site_dir: site

# Optional chatbot
chatbot:
  enabled: true
  embeddings:
    provider: fastembed           # 100% local — no API key needed
    model: nomic-ai/nomic-embed-text-v1.5
All configuration keys
Key                           Default                 Description
project_name                  (repo dir name)         Human-readable project name used in site title and nav header.
description                   ""                      One-line summary shown in hero and meta description.
output_dir                    "docs"                  Where generated MDX pages are written.
site_dir                      "site"                  Where the Fumadocs Next.js scaffold is written.
generation_mode               "feature_buckets"       v2 bucket-based pipeline. Set "file_centric" for legacy v1 mode.
max_pages                     0                       Cap on total pages generated. 0 = unlimited.
max_parallel_workers          6                       Concurrent LLM calls during generate phase.
llm.provider                  (required)              LLM provider: anthropic, openai, gemini, azure, ollama, …
llm.model                     (provider default)      Model name. Passed directly to LiteLLM.
llm.api_key_env               auto-detected           Env var name holding the API key. Inferred from provider if omitted.
llm.temperature               0.2                     Generation temperature. Lower = more deterministic output.
chatbot.enabled               false                   Enable chatbot indexing and the /ask route in the built site.
chatbot.answer.provider       (inherits llm)          Separate provider for chatbot answer calls.
chatbot.embeddings.provider   "fastembed"             Embedding provider. fastembed runs 100% locally, so no API key is needed.
chatbot.embeddings.model      nomic-embed-text-v1.5   Local embedding model for fastembed. Downloaded once on first run.
large_file_lines              500                     Files above this threshold get chunked before evidence assembly.
giant_file_lines              2000                    Files above this threshold are split into symbol clusters.
source_context_budget         200000                  Max chars of source included in a single evidence pack.
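The three size-related keys (large_file_lines, giant_file_lines, source_context_budget) can be read as two thresholds plus a character budget. The sketch below shows how such settings might be applied, using the documented defaults; size_tier and fit_budget are hypothetical helpers, and the "stop at the first snippet that does not fit" policy is an assumption.

```python
def size_tier(line_count: int, large_file_lines: int = 500, giant_file_lines: int = 2000) -> str:
    """Classify a file by line count using 'above this threshold' semantics."""
    if line_count > giant_file_lines:
        return "giant"   # would be split into symbol clusters
    if line_count > large_file_lines:
        return "large"   # would be chunked before evidence assembly
    return "normal"


def fit_budget(snippets: list[str], budget_chars: int = 200_000) -> list[str]:
    """Keep snippets in order until adding the next one would exceed the budget."""
    kept: list[str] = []
    used = 0
    for snippet in snippets:
        if used + len(snippet) > budget_chars:
            break
        kept.append(snippet)
        used += len(snippet)
    return kept


print(size_tier(120), size_tier(501), size_tier(2001))
print(fit_budget(["aaaa", "bbbb", "cc"], budget_chars=9))
```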
06

CLI Reference

All commands accept --help for the full flag list. Run python -m deepdoc.cli --help to verify installation.

deepdoc init

Scaffold a .deepdoc.yaml config file in the current directory. Safe to re-run — only missing keys are added.

Flag            Value                            Description
--provider      anthropic | openai | gemini | …  LLM provider (required). Sets llm.provider in .deepdoc.yaml.
--model         string                           Override the default model for the chosen provider.
--with-chatbot  flag                             Enable the chatbot module. Adds chatbot config block with sensible defaults.
--output-dir    path                             Where generated MDX pages land. Default: docs/
--site-dir      path                             Where the Fumadocs scaffold is written. Default: site/
deepdoc generate

Run the full five-phase pipeline. Reads .deepdoc.yaml. On success prints per-phase timings and a page-quality summary.

Flag         Value  Description
--force      flag   Ignore existing state and regenerate everything from scratch.
--reconcile  flag   Re-run planning then regenerate pages whose bucket assignment changed.
--dry-run    flag   Run scan + plan only, print the proposed bucket list, exit without generating.
--workers    int    Override max_parallel_workers for this run.
deepdoc update

Incremental update. Diffs the repo against the last synced commit. Regenerates only affected buckets; triggers a targeted replan when new routes or integrations appear.

Flag      Value    Description
--since   git-ref  Compare against this ref instead of the stored sync baseline.
--replan  flag     Force a full replan even if the changed file set is small.
deepdoc serve

cd into site/ and run next dev. Requires Node 18+ and an initial npm install (done automatically on first run).

Flag    Value  Description
--port  int    HTTP port for the Next.js dev server. Default: 3000.
--open  flag   Open the browser automatically after starting.
deepdoc deploy

Build the Fumadocs site (next build) and push the output to GitHub Pages. Requires git remote origin to be set.

Flag      Value   Description
--branch  string  Target GitHub Pages branch. Default: gh-pages.
--force   flag    Force-push to the deploy branch.
07

Bucket Types

Every documentation page belongs to a bucket. The planner picks the right type based on the repo classification and scanned signals. Each type maps to a different generation prompt and page template.

system

Top-level architecture, setup guides, auth flows, observability, and ops runbooks. Every repo gets at least one.

feature

Business-logic workflows: checkout, refunds, onboarding, notifications. Scoped to how engineers think about the domain, not how files are laid out.

endpoint

Endpoint-family API docs. One bucket per logical resource (e.g. all /orders/* routes). Enriched with scanned route metadata.

endpoint_ref

Single-endpoint page backed by an OpenAPI spec. Rendered as an interactive Fumadocs API reference page at /api/{slug}.

integration

Third-party system docs: payment gateways, delivery providers, webhook consumers, external APIs. Auto-detected from import patterns.

database

Schema and data-layer docs. Grouped by model family when a repo has many models; a single overview page for smaller schemas.
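The endpoint type groups routes by logical resource, for example all /orders/* routes in one bucket. A minimal sketch of that grouping is shown below, keyed on the first path segment. group_by_resource is a hypothetical helper, and real planning presumably uses richer scan metadata than the path alone.

```python
from collections import defaultdict


def group_by_resource(routes: list[str]) -> dict[str, list[str]]:
    """Group route paths by their first non-empty path segment."""
    groups: dict[str, list[str]] = defaultdict(list)
    for route in routes:
        segments = [segment for segment in route.split("/") if segment]
        resource = segments[0] if segments else "root"
        groups[resource].append(route)
    return dict(groups)


routes = ["/orders", "/orders/{id}", "/orders/{id}/refund", "/users/me", "/"]
grouped = group_by_resource(routes)
print(grouped)
```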

08

Supported Languages

DeepDoc parses source files natively (no subprocess or tree-sitter compilation required for most languages). Route detection is framework-aware.

Language    Extensions  Frameworks                                       Parser
Python      .py         Django · FastAPI · Falcon · Flask (generic)      deepdoc/parser/python_parser.py
JavaScript  .js, .mjs   Express · Fastify · NestJS (JS mode)             deepdoc/parser/js_ts_parser.py
TypeScript  .ts, .tsx   Express · Fastify · NestJS · Next.js API routes  deepdoc/parser/js_ts_parser.py
Go          .go         net/http · Gin · Echo (heuristic)                deepdoc/parser/go_parser.py
PHP         .php        Laravel (routes/web.php + routes/api.php)        deepdoc/parser/php_parser.py
Vue         .vue        Vue Router (composition API)                     deepdoc/parser/vue_parser.py
09

Chatbot

An opt-in RAG chatbot that answers questions about your codebase. Embeddings run 100% locally via fastembed — no extra API key or paid embedding service required.

Enable the chatbot

bash
# Enable during init
deepdoc init --provider anthropic --with-chatbot

.deepdoc.yaml
# Or enable it manually with the minimal block
chatbot:
  enabled: true

# Alternative, fuller block: the chatbot model can differ from the doc generation model
chatbot:
  enabled: true
  answer:
    provider: openai
    model: gpt-4o
  embeddings:
    provider: fastembed
    model: nomic-ai/nomic-embed-text-v1.5  # downloaded once, ~270 MB

Query modes

Fast
/query

Single-pass BM25 + semantic retrieval. Best for quick look-ups. Returns in ~1s.

Deep
/deep-research

Multi-hop synthesis. Bounded live-file fallback when the index lacks coverage. Streams via SSE.

Code-first
/code-deep

Strict source-first mode. Reads exact file ranges. Includes call trace and full file inventory in response.
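Deep mode streams its answer over SSE (Server-Sent Events). On the wire, each event is a block of "field: value" lines terminated by a blank line. The sketch below frames a token stream that way; the JSON payload shape, the "done" event name, and the sse_events helper are assumptions for illustration and not DeepDoc's actual wire format.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one SSE message, then emit a final 'done' event."""
    for token in tokens:
        # JSON-encoding keeps newlines inside a token from breaking the framing.
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "event: done\ndata: {}\n\n"


frames = list(sse_events(["Orders ", "are refunded\nvia ", "Stripe."]))
for frame in frames:
    print(frame, end="")
```

A client reading this stream can append each token as it arrives and stop when it sees the done event.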

Retrieval architecture

retrieval pipeline
Query
  │
  ├─ FAISS vector search    (nomic-embed-text-v1.5, 768-dim)
  │    └─ top-K semantic candidates
  │
  ├─ SQLite FTS5 search     (BM25 lexical)
  │    └─ top-K keyword candidates
  │
  ├─ Graph neighbour expansion
  │    └─ chunks adjacent to strong hits
  │
  ├─ Optional: LLM rerank    (cross-encoder scoring)
  │
  └─ Hit selection          → evidence[] + references[]
       │
       └─ Answer generation (answer_mixin.py)
            ├─ evidence-first prompt
            ├─ exact file:line citations
            └─ SSE streaming response
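The diagram shows two candidate lists (vector and lexical) feeding a single hit-selection step, but does not say how they are merged. One common technique for combining ranked lists is reciprocal rank fusion (RRF); the sketch below uses it purely as an illustration and should not be read as DeepDoc's actual scoring. The function name, chunk IDs, and the constant k = 60 are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked high in many lists score highest."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by chunk id for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


vector_hits = ["c3", "c1", "c7"]   # hypothetical top-K from the vector index
keyword_hits = ["c1", "c9", "c3"]  # hypothetical top-K from the lexical index

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists (c1 and c3 here) rise above chunks found by only one retriever, which is the usual motivation for hybrid search.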
index storage (under .deepdoc/)
.deepdoc/chatbot/faiss.index         FAISS flat L2 index. Rebuilt on each generate run.
.deepdoc/chatbot/fts.db              SQLite FTS5 database for BM25 lexical search.
.deepdoc/chatbot/chunks.json         Chunk metadata: text, file, line range, type.
.deepdoc/chatbot/source_archive/     Compressed source files for exact line-range citation in code-deep mode.
.deepdoc/chatbot/symbol_index.json   Symbol name → file:line map for fast symbol look-up.
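Chunk metadata that records text, file, and line range is what makes file:line citations possible. The sketch below splits a source string into fixed-size line windows and records a 1-based inclusive range for each. The field names, the window size, and chunk_lines itself are hypothetical; the real indexer may well chunk on symbol boundaries instead.

```python
def chunk_lines(path: str, source: str, max_lines: int = 40) -> list[dict]:
    """Split source text into chunks of at most `max_lines` lines with 1-based ranges."""
    lines = source.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        block = lines[start:start + max_lines]
        chunks.append({
            "file": path,
            "start_line": start + 1,          # 1-based, inclusive
            "end_line": start + len(block),   # 1-based, inclusive
            "text": "\n".join(block),
        })
    return chunks


source = "import os\n\ndef main():\n    print(os.getcwd())\n\nmain()"
chunks = chunk_lines("app/main.py", source, max_lines=2)
for chunk in chunks:
    print(f"{chunk['file']}:{chunk['start_line']}-{chunk['end_line']}")
```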
10

Incremental Updates

After the first full generate, use deepdoc update instead of regenerating everything. It only spends LLM tokens on buckets affected by your changes.

bash
# Diff against last sync, regenerate affected buckets only
deepdoc update

 Diff: 8 files changed since 3f2a91b
 Affected buckets: checkout-flow, order-api, payment-integration
 Replan triggered: new route detected in orders/views.py
 Regenerated 4 pages in 18s (saved ~54s vs full run)

# Force a full replan even on small changes
deepdoc update --replan

# Compare against a specific git ref
deepdoc update --since main

When a targeted replan triggers

  • New HTTP endpoint detected
  • New third-party integration import
  • New database model found
  • A bucket's owned files are deleted
  • Bucket count changes significantly

State files that power updates

  • .deepdoc/sync_state.json   Last-generated commit hash
  • .deepdoc/ledger.json       Per-page quality + timestamp
  • .deepdoc/plan.json         Serialised DocPlan from last run
  • .deepdoc/manifest.json     File → hash → page mapping
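A file → hash → page mapping is enough to work out which pages a change touches: any file whose current hash differs from the recorded one marks its pages as stale. The sketch below illustrates that idea with an assumed manifest shape ({path: {"hash": ..., "pages": [...]}}) and SHA-256. The real manifest.json layout and hash function are not specified in this reference, and new files absent from the manifest (which would need a replan) are outside this sketch.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Content hash used to detect changes (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def affected_pages(manifest: dict[str, dict], current_files: dict[str, bytes]) -> set[str]:
    """Return page slugs whose source files changed or disappeared."""
    pages: set[str] = set()
    for path, entry in manifest.items():
        content = current_files.get(path)
        if content is None or file_hash(content) != entry["hash"]:
            pages.update(entry["pages"])
    return pages


# Hypothetical manifest recorded at the last sync
manifest = {
    "orders/views.py": {"hash": file_hash(b"v1"), "pages": ["order-api", "checkout-flow"]},
    "billing/stripe.py": {"hash": file_hash(b"v1"), "pages": ["payment-integration"]},
}
# Current working tree: only orders/views.py has changed
current = {"orders/views.py": b"v2", "billing/stripe.py": b"v1"}

stale = sorted(affected_pages(manifest, current))
print(stale)
```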
11

Deployment

The generated site/ directory is a standard Next.js project. Deploy it anywhere that runs Next.js, or use deepdoc deploy for one-command GitHub Pages deployment.

GitHub Pages (built-in)
bash
# Builds Next.js static export and pushes to gh-pages branch
deepdoc deploy

# GitHub Pages config block (this part goes in .deepdoc.yaml, not the shell)
github_pages:
  base_path: /your-repo-name   # required when the site is served from a sub-path rather than the domain root
  branch: gh-pages
Vercel / Netlify / any Next.js host
bash
# The site/ directory is a standard Next.js app
cd site
npm install
npm run build          # or: next build
npm run start          # production server

# Or point Vercel/Netlify at the site/ subdirectory
# Root directory: site
# Build command: next build
# Output directory: .next (relative to the root directory)
CI/CD — auto-regenerate on push
.github/workflows/docs.yml
name: Regenerate docs
on:
  push:
    branches: [main]

jobs:
  docs:
    runs-on: ubuntu-latest
    permissions:
      contents: write   # lets GITHUB_TOKEN push to the gh-pages branch
    steps:
      - uses: actions/checkout@v4
        with: {fetch-depth: 0}

      - uses: actions/setup-python@v5
        with: {python-version: "3.11"}

      - uses: actions/setup-node@v4
        with: {node-version: "20"}

      - run: pip install deepdoc

      - run: deepdoc update         # incremental — only changed buckets
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - run: deepdoc deploy
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
12

Module Reference

All modules follow three rules: extend the _v2 variants, never hand-edit generated output, and put route-detection fixes in repo_resolver.py.

deepdoc/scanner/ Phase 1 — static analysis

Zero LLM calls. Extracts every signal that the planner and generator consume as evidence.

discover_integrations() · discover_database_schema() · discover_runtime_surfaces() · discover_artifacts() · cluster_giant_file()
deepdoc/planner/ Phase 2 — bucket planning

heuristics.py is the public planning API. Tests mock at this path. engine.py orchestrates the multi-step planner.

plan_docs(repo_root, cfg) → DocPlan · scan_repo(repo_root, cfg) → RepoScan · build_flow_candidates() · _merge_plan() · _build_heuristic_assignment() · _llm_step()
deepdoc/generator/ Phase 3 — MDX generation

post_processors.py handles MDX escaping, fence repair, and Mermaid diagram fixups after generation.

PageGenerator · BucketGenerationEngine · AssembledEvidence · apply_mdx_compile_gate()
deepdoc/parser/ Symbol & route extraction

routes/ sub-package has per-framework detectors: Django, FastAPI, Express, NestJS, Laravel, Falcon, Go. repo_resolver.py auto-detects the framework.

parse_file(path, language) → ParsedFile · ParsedFile · Symbol · supported_extensions()
deepdoc/chatbot/ RAG indexing + query service

retrieval_mixin.py owns all search logic. answer_mixin.py owns LLM prompting. service.py wires query(), deep_research(), code_deep().

ChatbotIndexer · ChatbotQueryService · create_fastapi_app()
deepdoc/site/builder/ Phase 5 — scaffold generation

scaffold_files.py emits the Fumadocs project. chatbot_components.py emits React chatbot widget when chatbot.enabled = true.

SiteBuilderEngine
deepdoc/llm/ LLM provider abstraction

Thin wrapper around LiteLLM. Supports every provider LiteLLM supports: OpenAI, Anthropic, Gemini, Azure, Mistral, Ollama, …

LLMClient(cfg)
deepdoc/persistence_v2.py State persistence

All state lives under .deepdoc/. Ledger tracks per-page quality. Sync state stores the last-generated git commit hash.

load_ledger() · save_ledger() · load_sync_state() · save_sync_state()
deepdoc/smart_update_v2.py Incremental update strategy

Diffs changed files against the sync baseline, classifies impact (new route? new integration?), decides which buckets to regenerate.

compute_update_plan() · run_smart_update()