# Content Core

Extract, process, and summarize content from URLs, files, and text through a unified async Python API, CLI, or MCP server.
## Supported Formats
| Category | Formats |
|---|---|
| Web | URLs, HTML pages, YouTube videos, Reddit posts |
| Documents | PDF, DOCX, PPTX, XLSX, EPUB, Markdown, plain text |
| Media | MP3, WAV, M4A, FLAC, OGG (audio); MP4, AVI, MOV, MKV (video) |
## Quick Start

```bash
pip install content-core
```

```python
import asyncio

import content_core

async def main():
    result = await content_core.extract_content(url="https://example.com")
    print(result.content)

asyncio.run(main())
```

Or with zero install:

```bash
uvx content-core extract "https://example.com"
```
## CLI Usage

Content Core provides a unified `content-core` command with subcommands for extraction, summarization, and running the MCP server.
### Extract

```bash
# From a URL
content-core extract "https://example.com"

# From a file
content-core extract document.pdf

# With JSON output
content-core extract document.pdf --format json

# With a specific engine
content-core extract "https://example.com" --engine firecrawl

# From stdin
echo "some text" | content-core extract
```
### Summarize

```bash
# Summarize text
content-core summarize "Long article text here..."

# With context
content-core summarize "Long text" --context "bullet points"

# From stdin
cat article.txt | content-core summarize --context "explain to a child"
```
### MCP Server

```bash
content-core mcp
```
### Configuration

```bash
# Set persistent config
content-core config set llm_provider anthropic
content-core config set llm_model claude-sonnet-4-20250514

# List current config
content-core config list

# Delete a config value
content-core config delete llm_provider
```

Config is stored in `~/.content-core/config.toml`. Priority: command flags > env vars > config file > defaults.
### Zero-Install with uvx

All commands work without installation using `uvx`:

```bash
uvx content-core extract "https://example.com"
uvx content-core summarize "text" --context "one sentence"
uvx content-core mcp
```
## Python API

All extraction and summarization functions are async; the snippets below assume they run inside an async context (e.g. under `asyncio.run`).

### Extraction

```python
import content_core
from content_core import ContentCoreConfig

# From a URL
result = await content_core.extract_content(url="https://example.com")

# From a file
result = await content_core.extract_content(file_path="document.pdf")

# From text
result = await content_core.extract_content(content="some text")

# With an engine override
config = ContentCoreConfig(url_engine="firecrawl")
result = await content_core.extract_content(url="https://example.com", config=config)
```
### Summarization

```python
import content_core

summary = await content_core.summarize("long article text", context="bullet points")
```
### Configuration

```python
from content_core import ContentCoreConfig

config = ContentCoreConfig(
    url_engine="firecrawl",
    document_engine="docling",
    audio_concurrency=5,
)
result = await content_core.extract_content(url="https://example.com", config=config)
```
## MCP Integration

Content Core includes a Model Context Protocol (MCP) server for use with Claude Desktop and other MCP-compatible applications.

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": ["content-core", "mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```
The MCP server exposes two tools: `extract_content` and `summarize_content`. Both return plain text.
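As a sketch of what that looks like on the wire, an MCP client invokes a tool with a JSON-RPC `tools/call` request. The request shape follows the MCP specification; the exact argument name `url` is an assumption based on the CLI and Python API above:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "extract_content",
    "arguments": { "url": "https://example.com" }
  }
}
```

Per the MCP specification, the server's response carries the extracted plain text as a text item in the result's `content` array.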
For detailed setup, see the MCP documentation.
## Claude Code Skill

Content Core includes a SKILL.md that teaches AI agents how to use it for extracting content from external sources. To make it available in your Claude Code project, copy it to your skills directory:

```bash
# Download the skill
curl -o .claude/skills/content-core/SKILL.md --create-dirs \
  https://raw.githubusercontent.com/lfnovo/content-core/main/SKILL.md
```

Once installed, Claude Code can use content-core to extract content from URLs, documents, and media files, either via the CLI (`uvx content-core`) or MCP if configured.
## AI Providers

Content Core uses Esperanto to support multiple LLM and STT providers. Switch providers by changing the config; no code changes are needed:

```bash
# Use Anthropic for summarization
content-core config set llm_provider anthropic
content-core config set llm_model claude-sonnet-4-20250514

# Use Groq for transcription
content-core config set stt_provider groq
content-core config set stt_model whisper-large-v3
```

Supported providers include OpenAI, Anthropic, Google, Groq, DeepSeek, Ollama, and more. See the Esperanto documentation for the full list.
## Configuration

Content Core uses `ContentCoreConfig`, powered by pydantic-settings. Settings are resolved in priority order: constructor args > env vars (`CCORE_*`) > config file (`~/.content-core/config.toml`) > defaults.
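The resolution order can be pictured with a small stdlib sketch; the layer contents below are hypothetical, and `ContentCoreConfig` itself does the real work via pydantic-settings:

```python
from collections import ChainMap

# Hypothetical settings layers, highest priority first:
# constructor args > env vars > config file > defaults
constructor_args = {"url_engine": "firecrawl"}   # e.g. ContentCoreConfig(url_engine="firecrawl")
env_vars = {"audio_concurrency": 5}              # e.g. CCORE_AUDIO_CONCURRENCY=5
config_file = {"url_engine": "simple", "document_engine": "docling"}
defaults = {"url_engine": "auto", "document_engine": "auto", "audio_concurrency": 3}

# ChainMap returns the first hit when scanning the maps left to right
settings = ChainMap(constructor_args, env_vars, config_file, defaults)

print(settings["url_engine"])         # constructor arg wins over config file and default
print(settings["audio_concurrency"])  # env var wins over default
print(settings["document_engine"])    # only the config file sets this one
```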
### Environment Variables

| Variable | Description | Default |
|---|---|---|
| `CCORE_URL_ENGINE` | URL extraction engine (`auto`, `simple`, `firecrawl`, `jina`, `crawl4ai`) | `auto` |
| `CCORE_DOCUMENT_ENGINE` | Document extraction engine (`auto`, `simple`, `docling`) | `auto` |
| `CCORE_AUDIO_CONCURRENCY` | Concurrent audio transcriptions (1-10) | `3` |
| `CRAWL4AI_API_URL` | Crawl4AI Docker API URL (omit for local browser mode) | - |
| `FIRECRAWL_API_URL` | Custom Firecrawl API URL for self-hosted instances | - |
| `CCORE_FIRECRAWL_PROXY` | Firecrawl proxy mode (`auto`, `basic`, `stealth`) | `auto` |
| `CCORE_FIRECRAWL_WAIT_FOR` | Wait time in ms before extraction | `3000` |
| `CCORE_LLM_PROVIDER` | LLM provider for summarization | - |
| `CCORE_LLM_MODEL` | LLM model for summarization | - |
| `CCORE_STT_PROVIDER` | Speech-to-text provider | - |
| `CCORE_STT_MODEL` | Speech-to-text model | - |
| `CCORE_STT_TIMEOUT` | Speech-to-text timeout in seconds | - |
| `CCORE_YOUTUBE_LANGUAGES` | Preferred YouTube transcript languages | - |

API keys for external services are set via their standard environment variables (e.g., `OPENAI_API_KEY`, `FIRECRAWL_API_KEY`, `JINA_API_KEY`).
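For example, in your shell profile (the key values here are placeholders):

```shell
export OPENAI_API_KEY="sk-..."
export FIRECRAWL_API_KEY="your-firecrawl-key"
```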
### Proxy Configuration

Content Core reads the standard `HTTP_PROXY` / `HTTPS_PROXY` / `NO_PROXY` environment variables automatically. No additional configuration is needed.
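Python's stdlib picks these variables up the same way, so you can check what any Python HTTP client would see; this sketch uses a hypothetical proxy URL:

```python
import os
import urllib.request

# Hypothetical corporate proxy; NO_PROXY exempts internal hosts
os.environ["HTTPS_PROXY"] = "http://proxy.internal:3128"
os.environ["NO_PROXY"] = "localhost,.internal"

proxies = urllib.request.getproxies()
print(proxies["https"])  # the proxy URL set above

# proxy_bypass() is truthy for hosts matched by NO_PROXY
print(urllib.request.proxy_bypass("api.internal"))
```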
## Optional Dependencies

Quote the extras so they work in all shells (zsh treats unquoted brackets as glob patterns):

```bash
# Docling for advanced document parsing (PDF, DOCX, PPTX, XLSX)
pip install "content-core[docling]"

# Crawl4AI for local browser-based URL extraction
pip install "content-core[crawl4ai]"
python -m playwright install --with-deps

# LangChain tool wrappers
pip install "content-core[langchain]"

# All optional features
pip install "content-core[docling,crawl4ai,langchain]"
```
## Using with LangChain

When installed with the `langchain` extra, Content Core provides LangChain-compatible tool wrappers:

```python
from content_core.tools import extract_content_tool, summarize_content_tool

tools = [extract_content_tool, summarize_content_tool]
```
## Documentation
- Usage Guide -- Python API details, configuration, and examples
- Processors -- How content extraction works for each format
- MCP Server -- Claude Desktop and MCP integration
## Development

```bash
git clone https://github.com/lfnovo/content-core
cd content-core
uv sync --group dev

# Run tests
make test

# Lint
make ruff
```
## License

This project is licensed under the MIT License.

## Contributing

Contributions are welcome! Please see our Contributing Guide for details.