Quick Start

Three commands to go from repo to a live docs site with an optional AI chatbot.

Install the package

bash

# Python 3.10+ required
pip install deepdoc

# With chatbot dependencies (FAISS, FastAPI, fastembed)
pip install "deepdoc[chatbot]"

The base install pulls in LiteLLM, GitPython, Click, Rich, PyYAML, and Jinja2. The [chatbot] extra adds faiss-cpu, fastapi, uvicorn, httpx, and fastembed for local embeddings.

Initialise against your repo

bash

# cd into your target codebase first
cd /path/to/your-repo

# Initialize — writes .deepdoc.yaml with sensible defaults
deepdoc init --provider anthropic

# Add --with-chatbot to enable the /ask Q&A interface
deepdoc init --provider openai --with-chatbot

# Set your API key (matches llm.api_key_env in .deepdoc.yaml)
export ANTHROPIC_API_KEY=sk-ant-...

init is safe to re-run — it only adds keys that are missing. The generated .deepdoc.yaml lives in your repo root and should be committed.

Generate and preview

bash

# Full pipeline — scan, plan, generate, build
deepdoc generate

✓ Scan complete        — 312 files · 47 endpoints · 8 integrations
✓ Plan complete        — 34 buckets across 6 sections
✓ Generate complete    — 34 pages · 2 Markdown fixes applied
✓ Chatbot index built  — 1,247 chunks · 4.2 MB FAISS index
✓ Site scaffold written → site/

  Phase timings:  scan 4s · plan 18s · generate 72s · build 3s

# Preview locally (pure Python — no Node.js required)
deepdoc serve

  Docs   → http://localhost:3000
  Chat   → http://localhost:3000/ask

Architecture

DeepDoc is a stateful, five-phase pipeline. Each phase has a single responsibility and a well-defined output contract. Nothing is hand-edited inside docs/, site/, or .deepdoc/ — all fixes go into generators, builders, or prompts.

System overview

Key components

deepdoc/cli.py Click entry-point. init, generate, update, serve, deploy commands.

deepdoc/pipeline_v2.py PipelineV2 orchestrator. Drives all five phases end-to-end.

deepdoc/v2_models.py DocBucket + DocPlan dataclasses — the central data contract.

deepdoc/config.py .deepdoc.yaml schema, DEFAULT_CONFIG dict, load/save helpers.

deepdoc/scanner/ Static repo analysis — endpoints, runtime, integrations, DB, artifacts.

deepdoc/planner/engine.py Multi-step LLM planner orchestration and bucket-scan entrypoint.

deepdoc/planner/heuristics.py Public planning API: _merge_plan, _build_heuristic_assignment, _llm_step.

deepdoc/generator/generation.py PageGenerator + BucketGenerationEngine. Batched parallel generation.

deepdoc/generator/validation.py PageValidator — file, route, symbol, and flow grounding checks.

deepdoc/chatbot/service.py ChatbotQueryService. query / deep_research / code_deep modes.

deepdoc/chatbot/retrieval_mixin.py All retrieval: FAISS + FTS, rerank, expansion, hit-selection.

deepdoc/site/builder/mkdocs_builder.py build_mkdocs_from_plan — writes the full MkDocs Material scaffold.

deepdoc/persistence_v2.py Ledger, sync state, manifest, plan storage under .deepdoc/.

deepdoc/smart_update_v2.py Incremental update logic — diff-based replan and regen.

Five-Phase Pipeline

Each phase runs sequentially. The output of one is the sole input to the next — there are no cross-phase side effects.
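The hand-off contract described above can be pictured as a chain of plain functions, each receiving only the previous phase's output. This is a minimal sketch: the function names and dictionary shapes are hypothetical stand-ins and do not reflect the actual PipelineV2 API.

```python
from typing import Any, Callable

# Hypothetical stand-ins for three of the phases; each sees only the prior output.
def scan(repo_root: str) -> dict:
    return {"repo_root": repo_root, "endpoints": ["/orders"]}

def plan(repo_scan: dict) -> dict:
    return {"buckets": [f"endpoint:{e}" for e in repo_scan["endpoints"]]}

def generate(doc_plan: dict) -> dict:
    return {"pages": [f"{b}.md" for b in doc_plan["buckets"]]}

def run_pipeline(initial: Any, phases: list[Callable[[Any], Any]]) -> Any:
    """Feed each phase's output into the next, in order."""
    result = initial
    for phase in phases:
        result = phase(result)
    return result

pages = run_pipeline("/path/to/your-repo", [scan, plan, generate])
```

Because no phase reaches back into an earlier one, any phase can be tested by handing it a recorded output from its predecessor.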

Scan

Traverse the repo without any LLM calls. Detects HTTP endpoints (Django, FastAPI, Express, NestJS, Laravel, Go, Falcon), runtime surfaces (async workers, schedulers, Celery), third-party integrations, database schemas (SQL, Knex, GraphQL), OpenAPI specs, and config artifacts.

output RepoScan — endpoints, symbols, routes, integrations, artifacts
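To give a feel for LLM-free endpoint detection, the sketch below pulls routes out of FastAPI-style decorators with a regular expression. It is a deliberate simplification and an assumption about the approach; the real scanner is framework-aware and likely more thorough, and `find_routes` is a hypothetical name.

```python
import re

# Matches decorators such as @app.get("/orders") or @router.post('/orders/{id}').
ROUTE_RE = re.compile(
    r"@\w+\.(get|post|put|patch|delete)\(\s*['\"]([^'\"]+)['\"]"
)

def find_routes(source: str) -> list[tuple[str, str]]:
    """Return (HTTP method, path) pairs found in decorator lines."""
    return [(m.group(1).upper(), m.group(2)) for m in ROUTE_RE.finditer(source)]

sample = '''
@app.get("/orders")
def list_orders(): ...

@router.post('/orders/{order_id}/refund')
def refund(order_id: int): ...
'''
routes = find_routes(sample)
```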

Plan

Multi-step LLM bucket planner (3 sequential LLM calls). Step 1: classify the repo type. Step 2: propose documentation buckets. Step 3: inject specialised buckets (database groups, runtime pages, integration pages). Then heuristics refine ownership: assign files/symbols, decompose oversized buckets, consolidate singletons.

output DocPlan — ordered bucket list, nav structure, orphaned files

Generate

For each bucket: assemble an evidence pack (source snippets, artifact extracts, scan hits), call the LLM, receive Markdown. Python-side repair runs immediately — on failure the LLM is reprompted with quality feedback. Nothing broken ever hits disk.

output Markdown pages on disk, per-page quality status, generation quality log
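One way to picture evidence-pack assembly is as filling a character budget (compare the source_context_budget setting). The sketch below keeps snippets in order and truncates the last one that fits; the truncation strategy and the name `assemble_evidence` are assumptions for illustration only.

```python
def assemble_evidence(
    snippets: list[tuple[str, str]], budget_chars: int
) -> list[tuple[str, str]]:
    """Keep (path, text) snippets in order until the character budget runs out.

    The last snippet kept may be truncated to fit the remaining budget.
    """
    pack = []
    remaining = budget_chars
    for path, text in snippets:
        if remaining <= 0:
            break
        kept = text[:remaining]
        pack.append((path, kept))
        remaining -= len(kept)
    return pack

pack = assemble_evidence(
    [("a.py", "x" * 120), ("b.py", "y" * 120), ("c.py", "z" * 50)],
    budget_chars=200,
)
```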

API Ref

When an OpenAPI spec is found during scan, stage the raw JSON/YAML and render a single interactive Swagger UI page via mkdocs-swagger-ui-tag.

output docs/api.md — Swagger UI page from staged OpenAPI specs

Build

Write the MkDocs Material scaffold: site/mkdocs.yml (Material theme, Mermaid superfence, pymdownx Blocks), extra.css brand stylesheet, landing page grid cards, and nav structure from the plan. When chatbot is enabled, the chatbot backend runs alongside via deepdoc serve.

output site/ — a configurable MkDocs Material project (pure Python)

Markdown Quality Gate

After each LLM call, DeepDoc runs a Python-side repair pipeline (fence fixup, Mermaid normalization, link rewriting) and a validation pass (sections, files, routes, symbols, flow grounding). On failure the LLM is reprompted with quality feedback. Pages that still fail hard checks — truncated output or hallucinated paths — are reported as invalid. Nothing broken ever reaches disk.
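The repair, validate, and reprompt loop can be sketched roughly as follows. The helper names, the single fence repair, the section check, and the attempt limit are all hypothetical; they stand in for the richer repair pipeline and PageValidator checks described above.

```python
from typing import Callable, Optional

FENCE = "`" * 3  # built indirectly so this example nests cleanly in Markdown

def repair(markdown: str) -> str:
    """Close an unterminated code fence, one of several possible repairs."""
    if markdown.count(FENCE) % 2 == 1:
        markdown = markdown.rstrip("\n") + "\n" + FENCE + "\n"
    return markdown

def validate(markdown: str, required_sections: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the page passed."""
    return [f"missing section: {name}" for name in required_sections
            if f"## {name}" not in markdown]

def generate_with_gate(
    call_llm: Callable[[list[str]], str],
    required_sections: list[str],
    max_attempts: int = 3,
) -> tuple[Optional[str], str]:
    """Reprompt with validation feedback until the page passes or attempts run out."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        draft = repair(call_llm(feedback))
        feedback = validate(draft, required_sections)
        if not feedback:
            return draft, "valid"
    return None, "invalid"  # no page content is handed back for writing

# Fake LLM: the first reply lacks the required section, the second includes it.
replies = iter(["# Checkout\n", "# Checkout\n\n## Overview\nText.\n"])
page, status = generate_with_gate(lambda feedback: next(replies), ["Overview"])
```

Returning the status alongside the page keeps the "reported as invalid" path explicit: the caller records the failure instead of writing a broken file.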

Core Data Models

The entire pipeline communicates through two dataclasses defined in deepdoc/v2_models.py.

DocBucket deepdoc/v2_models.py

The atomic planning unit. One bucket = one documentation page. The planner creates, assigns, and refines buckets; the generator consumes them.

python

@dataclass
class DocBucket:
    # Identity
    bucket_type: str          # system | feature | endpoint | integration | database
    title: str                # page title shown in nav
    slug: str                 # URL slug, e.g. "checkout-flow"
    section: str              # nav section grouping
    description: str          # one-line description for planning context

    # Ownership — what evidence goes into this page
    owned_files: list[str]    # file paths relative to repo root
    owned_symbols: list[str]  # "module.Class.method" qualified names
    artifact_refs: list[str]  # config keys, env vars, OpenAPI ops

    # Generation instructions
    generation_hints: dict    # prompt_style, is_introduction_page, is_endpoint_ref
    required_sections: list[str]
    required_diagrams: list[str]

    # Topology
    depends_on: list[str]     # slugs this bucket depends on
    parent_slug: str | None

    # Metadata
    priority: int             # generation order within section
    publication_tier: str     # core | advanced | deprecated
    source_kind_summary: str  # summary of what kind of code is in owned_files

DocPlan deepdoc/v2_models.py

The complete output of the planner. Consumed by the generator and site builder.

python

@dataclass
class DocPlan:
    buckets: list[DocBucket]         # ordered list — generation order
    nav_structure: dict[str, list]   # section → [slug, ...] for sidebar nav
    skipped_files: list[str]         # intentionally excluded (tests, fixtures…)
    orphaned_files: list[str]        # files not assigned to any bucket
    classification: str              # repo type: "api_service" | "monolith" | …
    integration_candidates: list[str]

    # Shim — adapts DocBucket to legacy DocPage interface
    @property
    def pages(self) -> list[_BucketAsPage]: ...
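As an illustration of how nav_structure relates to the bucket list, the sketch below groups slugs by section, ordered by priority. `MiniBucket` is a trimmed, hypothetical stand-in for DocBucket, and treating a lower priority value as "earlier" is an assumption.

```python
from dataclasses import dataclass

@dataclass
class MiniBucket:  # hypothetical subset of DocBucket's fields
    slug: str
    section: str
    priority: int

def build_nav(buckets: list[MiniBucket]) -> dict[str, list[str]]:
    """Map each section to its slugs, ordered by priority (stable for ties)."""
    nav: dict[str, list[str]] = {}
    for bucket in sorted(buckets, key=lambda b: b.priority):
        nav.setdefault(bucket.section, []).append(bucket.slug)
    return nav

nav = build_nav([
    MiniBucket("order-api", "API", 2),
    MiniBucket("overview", "Start", 1),
    MiniBucket("auth-api", "API", 1),
])
```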

Configuration

.deepdoc.yaml lives in your repo root. deepdoc init writes sensible defaults; you only need to set llm.provider and your API key env var.

.deepdoc.yaml

project_name: My API
description: "Internal API service for orders and fulfilment"

llm:
  provider: anthropic           # required
  model: claude-opus-4-7        # optional — uses provider default if omitted
  api_key_env: ANTHROPIC_API_KEY

generation_mode: feature_buckets  # v2 pipeline (default)
max_parallel_workers: 6
output_dir: docs
site_dir: site

# Optional chatbot
chatbot:
  enabled: true
  embeddings:
    provider: fastembed           # 100% local — no API key needed
    model: nomic-ai/nomic-embed-text-v1.5
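When llm.api_key_env is omitted, the variable name is inferred from the provider. A minimal sketch of that lookup is shown below; the mapping table and both function names are assumptions for illustration, and DeepDoc's real inference may cover more providers or behave differently.

```python
import os
from typing import Mapping, Optional

# Assumed provider -> env var mapping; the real table may differ.
DEFAULT_KEY_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

def resolve_api_key_env(llm_cfg: dict) -> str:
    """Prefer an explicit api_key_env, else fall back to the provider default."""
    explicit = llm_cfg.get("api_key_env")
    if explicit:
        return explicit
    provider = llm_cfg["provider"]
    if provider not in DEFAULT_KEY_ENV:
        raise ValueError(f"no default key env var known for provider {provider!r}")
    return DEFAULT_KEY_ENV[provider]

def missing_key_message(
    llm_cfg: dict, environ: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return a hint if the resolved env var is unset, otherwise None."""
    name = resolve_api_key_env(llm_cfg)
    return None if environ.get(name) else f"Set {name} before running deepdoc generate"
```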

All configuration keys

Key	Default	Description
project_name	(repo dir name)	Human-readable project name used in site title and nav header.
description	""	One-line summary shown in hero and meta description.
output_dir	"docs"	Where generated Markdown pages are written.
site_dir	"site"	Where the MkDocs site config is written.
generation_mode	"feature_buckets"	v2 bucket-based pipeline. Set "file_centric" for legacy v1 mode.
max_pages	0	Cap on total pages generated. 0 = unlimited.
max_parallel_workers	6	Concurrent LLM calls during generate phase.
llm.provider	(required)	LLM provider: anthropic, openai, gemini, azure, ollama, …
llm.model	(provider default)	Model name. Passed directly to LiteLLM.
llm.api_key_env	auto-detected	Env var name holding the API key. Inferred from provider if omitted.
llm.temperature	0.2	Generation temperature. Lower = more deterministic output.
chatbot.enabled	false	Enable chatbot indexing and the /ask route in the built site.
chatbot.answer.provider	(inherits llm)	Separate provider for chatbot answer calls.
chatbot.embeddings.provider	"fastembed"	Embedding provider. fastembed runs 100% locally; no API key.
chatbot.embeddings.model	nomic-embed-text-v1.5	Local embedding model for fastembed. Downloaded once on first run.
large_file_lines	500	Files above this threshold get chunked before evidence assembly.
giant_file_lines	2000	Files above this threshold are split into symbol clusters.
source_context_budget	200000	Max chars of source included in a single evidence pack.
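The two line-count thresholds imply three size tiers. The sketch below shows one reading of them, assuming "above" means strictly greater than; the function name and tier labels are hypothetical.

```python
def size_tier(
    line_count: int, large_file_lines: int = 500, giant_file_lines: int = 2000
) -> str:
    """Classify a file by line count using the documented default thresholds."""
    if line_count > giant_file_lines:
        return "giant"   # split into symbol clusters
    if line_count > large_file_lines:
        return "large"   # chunked before evidence assembly
    return "normal"      # included as-is
```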

CLI Reference

All commands accept --help for the full flag list. Run python -m deepdoc.cli --help to verify installation.

deepdoc init

Scaffold a .deepdoc.yaml config file in the current directory. Safe to re-run — only missing keys are added.

Flag	Value	Description
`--provider`	`anthropic | openai | gemini | …`	LLM provider (required). Sets llm.provider in .deepdoc.yaml.
`--model`	`string`	Override the default model for the chosen provider.
`--with-chatbot`	`flag`	Enable the chatbot module. Adds chatbot config block with sensible defaults.
`--output-dir`	`path`	Where generated Markdown pages land. Default: docs/
`--site-dir`	`path`	Where the MkDocs site config is written. Default: site/
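The "only missing keys are added" behaviour amounts to a non-destructive merge of defaults into the existing config. A minimal sketch, assuming nested dictionaries are merged recursively (the function name and sample values are illustrative, not DeepDoc's actual helper):

```python
import copy

def add_missing_keys(existing: dict, defaults: dict) -> dict:
    """Return a copy of `existing` with absent keys filled from `defaults`.

    Values already present are never overwritten; nested dicts are merged.
    """
    merged = copy.deepcopy(existing)
    for key, default in defaults.items():
        if key not in merged:
            merged[key] = copy.deepcopy(default)
        elif isinstance(merged[key], dict) and isinstance(default, dict):
            merged[key] = add_missing_keys(merged[key], default)
    return merged

defaults = {"output_dir": "docs", "llm": {"provider": "anthropic", "temperature": 0.2}}
existing = {"output_dir": "handbook", "llm": {"provider": "openai"}}
merged = add_missing_keys(existing, defaults)
```

Running the merge twice gives the same result as running it once, which is what makes re-running init safe.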

deepdoc generate

Run the full five-phase pipeline. Reads .deepdoc.yaml. On success prints per-phase timings and a page-quality summary.

Flag	Value	Description
`--force`	`flag`	Ignore existing state and regenerate everything from scratch.
`--reconcile`	`flag`	Re-run planning then regenerate pages whose bucket assignment changed.
`--dry-run`	`flag`	Run scan + plan only, print the proposed bucket list, exit without generating.
`--workers`	`int`	Override max_parallel_workers for this run.

deepdoc update

Incremental update. Diffs the repo against the last synced commit. Regenerates only affected buckets; triggers a targeted replan when new routes or integrations appear.

Flag	Value	Description
`--since`	`git-ref`	Compare against this ref instead of the stored sync baseline.
`--replan`	`flag`	Force a full replan even if the changed file set is small.

deepdoc serve

Run mkdocs serve against site/mkdocs.yml. Pure Python — no Node.js required.

Flag	Value	Description
`--port`	`int`	HTTP port for the MkDocs dev server. Default: 3000.
`--open`	`flag`	Open the browser automatically after starting.

deepdoc deploy

Build the MkDocs Material site (mkdocs build) and export static HTML to site/out/. Pure Python, no Node.js.

Flag	Value	Description
`--branch`	`string`	Target GitHub Pages branch. Default: gh-pages.
`--force`	`flag`	Force-push to the deploy branch.

Bucket Types

Every documentation page belongs to a bucket. The planner picks the right type based on the repo classification and scanned signals. Each type maps to a different generation prompt and page template.

system

Top-level architecture, setup guides, auth flows, observability, and ops runbooks. Every repo gets at least one.

feature

Business-logic workflows: checkout, refunds, onboarding, notifications. Scoped to how engineers think about the domain, not how files are laid out.

endpoint

Endpoint-family API docs. One bucket per logical resource (e.g. all /orders/* routes). Enriched with scanned route metadata.

endpoint_ref

Single-endpoint page backed by an OpenAPI spec. Rendered as part of the Swagger UI API reference page.

integration

Third-party system docs: payment gateways, delivery providers, webhook consumers, external APIs. Auto-detected from import patterns.

database

Schema and data-layer docs. Grouped by model family when a repo has many models; a single overview page for smaller schemas.
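For the endpoint type, "one bucket per logical resource" can be approximated by grouping routes on their first path segment. This is only a sketch of the idea; the planner's real grouping also involves LLM planning and heuristics, and `group_by_resource` is a hypothetical name.

```python
from collections import defaultdict

def group_by_resource(paths: list[str]) -> dict[str, list[str]]:
    """Group route paths by their first non-empty segment."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        segments = [s for s in path.split("/") if s]
        resource = segments[0] if segments else "root"
        groups[resource].append(path)
    return dict(groups)

families = group_by_resource(["/orders", "/orders/{id}", "/users/{id}/orders", "/"])
```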

Supported Languages

DeepDoc parses source files natively (no subprocess or tree-sitter compilation required for most languages). Route detection is framework-aware.

Language	Extensions	Frameworks	Parser
Python	.py	Django · FastAPI · Falcon · Flask (generic)	deepdoc/parser/python_parser.py
JavaScript	.js, .mjs	Express · Fastify · NestJS (JS mode)	deepdoc/parser/js_ts_parser.py
TypeScript	.ts, .tsx	Express · Fastify · NestJS · Next.js API routes	deepdoc/parser/js_ts_parser.py
Go	.go	net/http · Gin · Echo (heuristic)	deepdoc/parser/go_parser.py
PHP	.php	Laravel (routes/web.php + routes/api.php)	deepdoc/parser/php_parser.py
Vue	.vue	Vue Router (composition API)	deepdoc/parser/vue_parser.py

Chatbot

An opt-in RAG chatbot that answers questions about your codebase. Embeddings run 100% locally via fastembed — no extra API key or paid embedding service required.

Enable the chatbot

bash

# Enable during init
deepdoc init --provider anthropic --with-chatbot

# Or add to .deepdoc.yaml manually
chatbot:
  enabled: true

# The chatbot model can differ from the doc generation model
chatbot:
  enabled: true
  answer:
    provider: openai
    model: gpt-4o
  embeddings:
    provider: fastembed
    model: nomic-ai/nomic-embed-text-v1.5  # downloaded once, ~270 MB

Query modes

Fast

/query

Single-pass BM25 + semantic retrieval. Best for quick look-ups. Returns in ~1s.

Deep

/deep-research

Multi-hop synthesis. Bounded live-file fallback when the index lacks coverage. Streams via SSE.

Code-first

/code-deep

Strict source-first mode. Reads exact file ranges. Includes call trace and full file inventory in response.

Retrieval architecture

retrieval pipeline

Query
  │
  ├─ FAISS vector search    (nomic-embed-text-v1.5, 768-dim)
  │    └─ top-K semantic candidates
  │
  ├─ SQLite FTS5 search     (BM25 lexical)
  │    └─ top-K keyword candidates
  │
  ├─ Graph neighbour expansion
  │    └─ chunks adjacent to strong hits
  │
  ├─ Optional: LLM rerank    (cross-encoder scoring)
  │
  └─ Hit selection          → evidence[] + references[]
       │
       └─ Answer generation (answer_mixin.py)
            ├─ evidence-first prompt
            ├─ exact file:line citations
            └─ SSE streaming response
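The diagram does not say how the semantic and lexical candidate lists are combined before hit selection. One common option is reciprocal rank fusion (RRF), sketched below purely as an illustration; it is not confirmed to be what retrieval_mixin.py does, and the chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))

semantic = ["c3", "c1", "c7"]   # e.g. vector-search order
lexical = ["c1", "c9", "c3"]    # e.g. BM25 order
fused = reciprocal_rank_fusion([semantic, lexical])
```

RRF works on ranks alone, so it sidesteps the problem that L2 distances and BM25 scores are on unrelated scales.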

index storage (under .deepdoc/)

.deepdoc/chatbot/faiss.index FAISS flat L2 index. Rebuilt on each generate run.

.deepdoc/chatbot/fts.db SQLite FTS5 database for BM25 lexical search.

.deepdoc/chatbot/chunks.json Chunk metadata: text, file, line range, type.

.deepdoc/chatbot/source_archive/ Compressed source files for exact line-range citation in code-deep mode.

.deepdoc/chatbot/symbol_index.json Symbol name → file:line map for fast symbol look-up.
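A symbol name → file:line map like symbol_index.json could be built for Python sources with the standard-library ast module. The sketch below is an assumption about the shape of the data only: it uses bare names instead of qualified ones and is not DeepDoc's indexer.

```python
import ast

def index_symbols(source: str, path: str) -> dict[str, str]:
    """Map each function or class name to 'path:line' of its definition."""
    index: dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            index[node.name] = f"{path}:{node.lineno}"
    return index

source = "class Order:\n    def total(self):\n        return 0\n\ndef refund(order):\n    pass\n"
symbols = index_symbols(source, "orders/models.py")
```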

Incremental Updates

After the first full generate, use deepdoc update instead of regenerating everything. It only spends LLM tokens on buckets affected by your changes.

bash

# Diff against last sync, regenerate affected buckets only
deepdoc update

✓ Diff: 8 files changed since 3f2a91b
✓ Affected buckets: checkout-flow, order-api, payment-integration
✓ Replan triggered: new route detected in orders/views.py
✓ Regenerated 4 pages in 18s (saved ~54s vs full run)

# Force a full replan even on small changes
deepdoc update --replan

# Compare against a specific git ref
deepdoc update --since main
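Conceptually, finding the affected buckets is an intersection between the changed file set and each bucket's owned_files. A minimal sketch under that assumption (the function name and sample ownership map are hypothetical):

```python
def affected_buckets(changed_files: list[str], ownership: dict[str, list[str]]) -> list[str]:
    """Return slugs of buckets that own at least one changed file."""
    changed = set(changed_files)
    return sorted(slug for slug, files in ownership.items() if changed & set(files))

ownership = {
    "checkout-flow": ["checkout/service.py", "checkout/cart.py"],
    "order-api": ["orders/views.py"],
    "auth": ["auth/tokens.py"],
}
hit = affected_buckets(["orders/views.py", "checkout/cart.py", "README.md"], ownership)
```

Files that no bucket owns, such as README.md here, cost no LLM tokens.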

When a targeted replan triggers

New HTTP endpoint detected
New third-party integration import
New database model found
A bucket's owned files are deleted
Bucket count changes significantly

State files that power updates

.deepdoc/sync_state.json Last-generated commit hash
.deepdoc/ledger.json Per-page quality + timestamp
.deepdoc/plan.json Serialised DocPlan from last run
.deepdoc/manifest.json File → hash → page mapping
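A file → hash mapping such as the one in manifest.json makes change detection a dictionary comparison. The sketch below assumes SHA-256 content hashes and an in-memory view of the files; the real manifest format and hash algorithm are not specified here.

```python
import hashlib

def file_hash(content: bytes) -> str:
    """Content hash used as the manifest value (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()

def changed_paths(previous: dict[str, str], current_files: dict[str, bytes]) -> dict[str, list[str]]:
    """Compare stored hashes with current contents and classify each path."""
    current = {path: file_hash(data) for path, data in current_files.items()}
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "modified": sorted(
            path for path in current
            if path in previous and current[path] != previous[path]
        ),
    }

previous = {"a.py": file_hash(b"old"), "b.py": file_hash(b"same")}
delta = changed_paths(previous, {"a.py": b"new", "b.py": b"same", "c.py": b"x"})
```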

Deployment

The generated site/ directory is a standard MkDocs Material project (pure Python, no Node.js). Deploy the built static HTML to any static host, or use deepdoc deploy for one-command GitHub Pages deployment.

GitHub Pages (built-in)

bash

# Builds the MkDocs site (mkdocs build) and pushes to the gh-pages branch
deepdoc deploy

# .deepdoc.yaml — GitHub Pages config block
github_pages:
  base_path: /your-repo-name   # required if repo is not at root
  branch: gh-pages

Vercel / Netlify / any static host

bash

# The site/ directory is a standard MkDocs Material project
cd site
mkdocs build           # writes the static HTML site
mkdocs serve           # local preview server

# Or point Vercel/Netlify at the site/ subdirectory
# Build command: mkdocs build
# Output directory: as set by site_dir in site/mkdocs.yml
# Root directory: site

CI/CD — auto-regenerate on push

.github/workflows/docs.yml

name: Regenerate docs
on:
  push:
    branches: [main]

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: {fetch-depth: 0}

      - uses: actions/setup-python@v5
        with: {python-version: "3.11"}

      - uses: actions/setup-node@v4
        with: {node-version: "20"}

      - run: pip install deepdoc

      - run: deepdoc update         # incremental — only changed buckets
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - run: deepdoc deploy
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Module Reference

All modules follow three rules: extend the _v2 variants, never hand-edit generated output, and put route fixes in repo_resolver.py.

deepdoc/scanner/ Phase 1 — static analysis

Zero LLM calls. Extracts every signal that the planner and generator consume as evidence.

discover_integrations() · discover_database_schema() · discover_runtime_surfaces() · discover_artifacts() · cluster_giant_file()

deepdoc/planner/ Phase 2 — bucket planning

heuristics.py is the public planning API. Tests mock at this path. engine.py orchestrates the multi-step planner.

plan_docs(repo_root, cfg) → DocPlan · scan_repo(repo_root, cfg) → RepoScan · build_flow_candidates() · _merge_plan() · _build_heuristic_assignment() · _llm_step()

deepdoc/generator/ Phase 3 — Markdown generation

post_processors.py handles Markdown repair, fence fixes, and Mermaid diagram fixups after generation.

PageGenerator · BucketGenerationEngine · AssembledEvidence · apply_mdx_compile_gate()

deepdoc/parser/ Symbol & route extraction

routes/ sub-package has per-framework detectors: Django, FastAPI, Express, NestJS, Laravel, Falcon, Go. repo_resolver.py auto-detects the framework.

parse_file(path, language) → ParsedFile · ParsedFile · Symbol · supported_extensions()

deepdoc/chatbot/ RAG indexing + query service

retrieval_mixin.py owns all search logic. answer_mixin.py owns LLM prompting. service.py wires query(), deep_research(), code_deep().

ChatbotIndexer · ChatbotQueryService · create_fastapi_app()

deepdoc/site/builder/ Phase 5 — scaffold generation

mkdocs_builder.py emits the MkDocs Material project: site/mkdocs.yml, extra.css, and landing page.

build_mkdocs_from_plan()

deepdoc/llm/ LLM provider abstraction

Thin wrapper around LiteLLM. Supports every provider LiteLLM supports: OpenAI, Anthropic, Gemini, Azure, Mistral, Ollama, …

LLMClient(cfg)

deepdoc/persistence_v2.py State persistence

All state lives under .deepdoc/. Ledger tracks per-page quality. Sync state stores the last-generated git commit hash.

load_ledger() · save_ledger() · load_sync_state() · save_sync_state()

deepdoc/smart_update_v2.py Incremental update strategy

Diffs changed files against the sync baseline, classifies impact (new route? new integration?), decides which buckets to regenerate.

compute_update_plan() · run_smart_update()
