duckdb-hybrid-doc-search
A tool for hybrid indexing of internal documents managed in Markdown using DuckDB with full-text search (FTS) + vector search (VSS), and making them callable from AI coding agents as an MCP stdio server.
duckdb-hybrid-doc-search
A tool for hybrid indexing of internal documents managed in Markdown using DuckDB with full-text search (FTS) + vector search (VSS), and making them callable from AI coding agents as an MCP stdio server.
Features
- Advanced splitting and indexing of Markdown documents
- Hybrid search combining full-text search (FTS) and vector search (VSS)
- High-precision search using Japanese morphological analysis
- Search result re-ranking using CrossEncoder
- Integration with AI agents via MCP stdio server
Usage
Creating an Index
duckdb-hybrid-doc-search index docs/ handbook/ \\
--db index.duckdb
To use a different model, re-run the index command with a different model specified using the --embedding-model parameter.
duckdb-hybrid-doc-search index docs/ handbook/ \\
--db index.duckdb \\
--embedding-model cl-nagoya/ruri-v3-310m
You can also trim a prefix from file paths during indexing, which is useful when using Docker:
duckdb-hybrid-doc-search index docs/ handbook/ \\
--db index.duckdb \\
--trim-path-prefix "/app/"
Starting the Server
By default, the server uses STDIO transport:
duckdb-hybrid-doc-search serve --db index.duckdb
You can customize the MCP tool name and description:
duckdb-hybrid-doc-search serve --db index.duckdb \\
--tool-name "my_search" \\
--tool-description "Search my documentation"
HTTP Transport Support
The server also supports HTTP-based transport:
duckdb-hybrid-doc-search serve --db index.duckdb \\
--transport streamable-http \\
--host 127.0.0.1 \\
--port 8765 \\
--path /mcp
Changing the Model
Docker
You can also use the Docker image:
docker pull ghcr.io/upamune/duckdb-hybrid-doc-search:latest
Creating an Index with Docker
# Mount document directories and create an index
docker run -v /path/to/docs:/docs -v /path/to/output:/output \\
ghcr.io/upamune/duckdb-hybrid-doc-search:latest \\
index /docs --db /output/index.duckdb --embedding-model cl-nagoya/ruri-v3-310m
Starting the MCP Server with Docker
STDIO Transport (Default)
# Mount only the index.duckdb file and start the server with STDIO transport
docker run -v /path/to/index.duckdb:/app/index.duckdb \\
ghcr.io/upamune/duckdb-hybrid-doc-search:latest \\
serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m
# With custom tool name and description
docker run -v /path/to/index.duckdb:/app/index.duckdb \\
ghcr.io/upamune/duckdb-hybrid-doc-search:latest \\
serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m \\
--tool-name "my_search" --tool-description "Search my documentation"
HTTP Transport
# Using Streamable HTTP transport
docker run -v /path/to/index.duckdb:/app/index.duckdb -p 8765:8765 \\
ghcr.io/upamune/duckdb-hybrid-doc-search:latest \\
serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m \\
--transport streamable-http --host 0.0.0.0 --port 8765 --path /mcp
Note: When running in Docker, use
--host 0.0.0.0to make the server accessible from outside the container.
Searching Documents with Docker
# Direct search with a specific query
docker run -v /path/to/index.duckdb:/app/index.duckdb -it \\
ghcr.io/upamune/duckdb-hybrid-doc-search:latest \\
search --db /app/index.duckdb --query "your search query" --rerank-model cl-nagoya/ruri-v3-reranker-310m
# Interactive search mode (when --query is omitted)
docker run -v /path/to/index.duckdb:/app/index.duckdb -it \\
ghcr.io/upamune/duckdb-hybrid-doc-search:latest \\
search --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m
Path Manipulation in Search Results
You can manipulate file paths in search results using the following flags:
# Remove a prefix from file paths in search results
duckdb-hybrid-doc-search search --query "example" --remove-path-prefix "/app/"
# Add a prefix to file paths in search results
duckdb-hybrid-doc-search search --query "example" --add-path-prefix "docs/"
# Combine both: first remove, then add prefix
duckdb-hybrid-doc-search search --query "example" \\
--remove-path-prefix "/app/" \\
--add-path-prefix "docs/"
These flags are also available for the MCP server:
# Start server with path manipulation
duckdb-hybrid-doc-search serve --db index.duckdb \\
--remove-path-prefix "/app/" \\
--add-path-prefix "docs/"
Using as an MCP Server with VS Code and Cursor
VS Code Configuration
To use as an MCP server with VS Code:
STDIO Transport (Default)
Create a .vscode/mcp.json file in your workspace:
{
"servers": [
{
"name": "DuckDB Hybrid Doc Search",
"description": "Document search server for Markdown files",
"connection": {
"type": "stdio",
"command": "docker",
"args": [
"run",
"--rm",
"-i",
"-v", "${workspaceFolder}/index.duckdb:/app/index.duckdb",
"ghcr.io/upamune/duckdb-hybrid-doc-search:latest",
"serve",
"--db", "/app/index.duckdb",
"--rerank-model", "cl-nagoya/ruri-v3-reranker-310m",
"--tool-name", "search_documents",
"--tool-description", "Search for documentation"
]
}
}
]
}
HTTP Transport
For HTTP-based transport, use the http connection type:
{
"servers": [
{
"name": "DuckDB Hybrid Doc Search",
"description": "Document search server with HTTP transport",
"connection": {
"type": "http",
"url": "http://localhost:8765/mcp"
}
}
]
}
Then start the server separately with:
docker run -v ${workspaceFolder}/index.duckdb:/app/index.duckdb -p 8765:8765 \\
ghcr.io/upamune/duckdb-hybrid-doc-search:latest \\
serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m \\
--transport streamable-http --host 0.0.0.0 --port 8765 --path /mcp
Using a Pre-indexed Image
{
"servers": [
{
"name": "DuckDB Hybrid Doc Search",
"description": "Pre-indexed document search server",
"connection": {
"type": "stdio",
"command": "docker",
"args": [
"run",
"--rm",
"-i",
"your-org/doc-search-with-index:latest"
]
}
}
]
}
Cursor Configuration
To use as an MCP server with Cursor:
STDIO Transport (Default)
Create a mcp.json file in your workspace or add to your global configuration:
{
"mcpServers": {
"duckdb-doc-search": {
"command": "docker",
"args": [
"run",
"--rm",
"-i",
"-v", "${workspaceFolder}/index.duckdb:/app/index.duckdb",
"ghcr.io/upamune/duckdb-hybrid-doc-search:latest",
"serve",
"--db", "/app/index.duckdb",
"--rerank-model", "cl-nagoya/ruri-v3-reranker-310m",
"--tool-name", "search_documents",
"--tool-description", "Search for documentation"
]
}
}
}
HTTP Transport
For HTTP-based transport, use the url property:
{
"mcpServers": {
"duckdb-doc-search": {
"url": "http://localhost:8765/mcp"
}
}
}
Then start the server separately with:
docker run -v ${workspaceFolder}/index.duckdb:/app/index.duckdb -p 8765:8765 \\
ghcr.io/upamune/duckdb-hybrid-doc-search:latest \\
serve --db /app/index.duckdb --rerank-model cl-nagoya/ruri-v3-reranker-310m \\
--transport streamable-http --host 0.0.0.0 --port 8765 --path /mcp
Using a Pre-indexed Image
{
"mcpServers": {
"duckdb-doc-search": {
"command": "docker",
"args": [
"run",
"--rm",
"-i",
"your-org/doc-search-with-index:latest"
]
}
}
}
Practical Example: Creating and Distributing Docker Images with Pre-built Indexes
Here's a practical example for efficiently deploying document search within your organization by pre-building indexes and embedding them in Docker images.
Creating a Docker Image with Pre-built Index
When using Docker, file paths in the index often include the container path (like /app/docs/), but you might want search results to show just docs/. The --trim-path-prefix parameter solves this by removing the specified prefix from file paths during indexing.
Create a Dockerfile.with-index-args file:
# Use base image with build arguments
FROM ghcr.io/upamune/duckdb-hybrid-doc-search:latest AS builder
# Define build arguments with defaults
ARG DOCS_DIR=./docs
ARG MODEL=cl-nagoya/ruri-v3-310m
# Copy documents from specified directory
COPY ${DOCS_DIR} /docs
# Create index with specified model
RUN duckdb-hybrid-doc-search index /docs \\
--db /app/index.duckdb \\
--embedding-model ${MODEL} \\
--trim-path-prefix "/app/"
# Create final image
FROM ghcr.io/upamune/duckdb-hybrid-doc-search:latest
# Copy index file from builder
COPY --from=builder /app/index.duckdb /app/index.duckdb
# Set default command
CMD ["serve", "--db", "/app/index.duckdb", "--rerank-model", "cl-nagoya/ruri-v3-reranker-310m", "--tool-name", "search_documents", "--tool-description", "Search for documentation"]
Build and run:
# Build image for development documents
docker build -t your-org/doc-search-dev:latest \\
--build-arg DOCS_DIR=./docs-dev \\
--build-arg MODEL=cl-nagoya/ruri-v3-310m \\
-f Dockerfile.with-index-args .
# Build image for production documents
docker build -t your-org/doc-search-prod:latest \\
--build-arg DOCS_DIR=./docs-prod \\
--build-arg MODEL=cl-nagoya/ruri-v3-310m \\
-f Dockerfile.with-index-args .
# Push to your organization's registry
docker push your-org/doc-search-prod:latest
# Run with STDIO transport (default)
docker run your-org/doc-search-prod:latest
# Run with Streamable HTTP transport
docker run -p 8765:8765 -e TRANSPORT=streamable-http your-org/doc-search-prod:latest
# Run with custom HTTP settings
docker run -p 9000:9000 -e TRANSPORT=streamable-http -e PORT=9000 -e PATH=/api/mcp your-org/doc-search-prod:latest
This approach enables efficient deployment and management of document search systems within your organization.
Development
This project uses Task to manage build and development tasks.
Setting Up Development Environment
task dev:setup
source .venv/bin/activate
Running Tests
task test
Running Linters
task lint
Formatting Code
task format
Creating Document Index
task run:index DIRS="docs/ handbook/"
Starting the Server
task run:serve
Running Search
task run:search
Building and Running Docker Image
task docker:build
task docker:run CLI_ARGS="serve --db /app/index.duckdb"
Listing Available Tasks
task
Migration from Makefile
Previously, we used Makefile, but we've migrated to Task for more flexibility and features.
You can replace any make command with the corresponding task command.
License
MIT
Recommend MCP Servers 💡
@hugeicons/mcp-server
An MCP server for integrating Hugeicons into various platforms, providing tools and resources for AI assistants to offer accurate guidance on icon usage.
cartesia-mcp
The Cartesia MCP server enables clients like Cursor, Claude Desktop, and OpenAI agents to interact with Cartesia's API for speech localization, text-to-audio conversion, and voice infill.
mcp-bone
A Node.js MCP Server package to connect to MCP Bone and parse tool calls from LLM completions
@collaborne/mcp-server
A Model Context Protocol (MCP) server that provides tools for Large Language Models (LLMs) like Claude to interact with NEXT structured data, offering functionalities for getting highlights and clusters.
arxiv-latex-mcp
MCP server that uses arxiv-to-prompt to fetch and process arXiv LaTeX sources for precise interpretation of mathematical expressions in scientific papers.
shinichi-takayanagi/myweight-mcp-server
Connects to Health Planet API to provide weight data for MCP clients via SSE.