What is SoPDF
SoPDF is an open-source Python PDF processing library that covers the full workflow from rendering and text extraction to structural editing. You can use it to build production-grade parsing, retrieval, split/merge, and batch processing pipelines.
SoPDF is released under Apache 2.0. You can use it in personal projects, commercial products, and open-source libraries.
Why SoPDF
- High performance: official benchmarks show major speedups vs PyMuPDF in core scenarios (up to about 2.82x in rendering, 2.74x in plain text extraction, and 3.17x in full-text search).
- Feature complete: supports rendering, extraction, search, split/merge, compressed save, metadata, and outline handling.
- Clean API: designed for practical day-to-day engineering workflows.
- Permissive license: Apache 2.0 simplifies internal adoption and external distribution.
Core capabilities
| Capability | Description |
|---|---|
| Open documents | Open PDFs from file paths, bytes, or streams |
| Page rendering | Render to PNG/JPEG, including batch and parallel rendering |
| Text workflows | Extract plain text, extract text blocks with bounding boxes, and search keywords |
| Document editing | Split, merge, rotate pages, and save with compression |
| Operational readiness | Handle encrypted PDFs and auto-repair corrupted PDFs |
| Document metadata | Read/write metadata and read document outline (TOC) |
Architecture
SoPDF uses a dual-engine architecture:
- `pypdfium2` (Google PDFium): rendering, text extraction, and search.
- `pikepdf` (libqpdf): structure reads, writes, and save/compression.
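To make the division of labor concrete, here is a minimal sketch that calls the two underlying engines directly. It is not SoPDF's own API, which this page does not show. The function names `build_compressed_pdf` and `render_first_page` are hypothetical, and deferring each import into the function that needs it is an assumption made for illustration, not a statement about how SoPDF loads its engines.

```python
import io


def build_compressed_pdf(num_pages: int = 2) -> bytes:
    """Structure side (pikepdf / libqpdf): create pages and save with compression."""
    import pikepdf  # deferred so this engine loads only when structure work is needed

    pdf = pikepdf.Pdf.new()
    for _ in range(num_pages):
        pdf.add_blank_page(page_size=(612, 792))  # US Letter, in PDF points
    buffer = io.BytesIO()
    pdf.save(buffer, compress_streams=True)
    return buffer.getvalue()


def render_first_page(data: bytes) -> tuple[int, tuple[int, int]]:
    """Content side (pypdfium2 / PDFium): open from bytes and render page 1."""
    import pypdfium2 as pdfium  # deferred for the same reason as above

    pdf = pdfium.PdfDocument(data)
    try:
        page = pdf[0]
        bitmap = page.render(scale=1)  # scale=1 means one pixel per PDF point
        return len(pdf), (bitmap.width, bitmap.height)
    finally:
        pdf.close()


if __name__ == "__main__":
    pdf_bytes = build_compressed_pdf(2)
    page_count, size = render_first_page(pdf_bytes)
    print(page_count, size)
```

The bytes produced by the structure engine are handed to the content engine unchanged, which is the kind of handoff a dual-engine design relies on: each library handles the operations it is strongest at, and the PDF byte stream is the shared interface between them.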
Quick start
Requires Python 3.10+.
Resources
- Repository: SoMarkAI/SoPDF
- PyPI: sopdf
- Examples: `examples/` in the repository
- Benchmark suite: `tests/benchmark/` in the repository

