Skip to content

sec2md

Convert SEC EDGAR filings to clean, LLM-ready Markdown.

What is sec2md?

sec2md transforms messy SEC HTML filings into structured Markdown designed for AI systems. Unlike generic HTML-to-text converters, it preserves tables, tracks pages, detects section boundaries, and outputs clean text optimized for embeddings and retrieval.

Feature Description
🧭 Page-aware Preserves original pagination for citation traceability
🗂ïļ Section-aware Detects ITEM boundaries in 10-K/10-Q filings
📊 Table-preserving Converts HTML tables to clean Markdown pipe syntax
ðŸŠķ LLM-ready Outputs chunk-safe Markdown for RAG pipelines
🔗 Universal Works with filings, exhibits, notes, and press releases

Installation

pip install sec2md

Quick Example

import sec2md

# Convert any SEC filing to markdown
md = sec2md.convert_to_markdown(
    "https://www.sec.gov/Archives/edgar/data/.../10k.htm",
    user_agent="YourName you@example.com"
)

print(md)  # Clean, structured markdown

Output:

ITEM 1. Business

Apple Inc. designs, manufactures, and markets smartphones, personal computers,
tablets, wearables, and accessories worldwide...

| Product Category | Revenue (millions) |
|------------------|-------------------|
| iPhone           | $200,583          |
| Mac              | $29,357           |
...

What's Next?

Why Markdown?

SEC filings contain XBRL tags, inline CSS, absolute positioning, and nested tables. Standard HTML parsers produce garbage. sec2md rebuilds the document as semantic Markdown that LLMs can actually parse - preserving structure, tables, and metadata for retrieval.

License

MIT ÂĐ 2025