SoPDF

What is SoPDF

SoPDF is an open-source Python PDF processing library that covers the full workflow from rendering and text extraction to structural editing. You can use it to build production-grade parsing, retrieval, split/merge, and batch processing pipelines.

SoPDF is released under Apache 2.0. You can use it in personal projects, commercial products, and open-source libraries.

Why SoPDF

High performance: the official benchmarks report speedups over PyMuPDF in core scenarios of up to about 2.82x in rendering, 2.74x in plain text extraction, and 3.17x in full-text search.
Feature complete: supports rendering, extraction, search, split/merge, compressed save, metadata, and outline handling.
Clean API: designed for practical day-to-day engineering workflows.
Permissive license: Apache 2.0 simplifies internal adoption and external distribution.

Core capabilities

Capability	Description
Open documents	Open PDFs from file paths, bytes, or streams
Page rendering	Render to PNG/JPEG, including batch and parallel rendering
Text workflows	Extract plain text, extract text blocks with bounding boxes, and search keywords
Document editing	Split, merge, rotate pages, and save with compression
Operational readiness	Handle encrypted PDFs and auto-repair corrupted PDFs
Document metadata	Read/write metadata and read document outline (TOC)

Architecture

SoPDF uses a dual-engine architecture:

pypdfium2 (Google PDFium): rendering, text extraction, and search.
pikepdf (libqpdf): structure reads, writes, and save/compression.

A dirty-flag and hot-reload sync mechanism automatically reflects write operations in the read path, so you do not need to coordinate the two engines manually.
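To make the dirty-flag idea concrete, here is a minimal, library-free sketch of the general pattern. It is not SoPDF's actual implementation: `WriteEngine`, `ReadEngine`, `SyncedDocument`, and the `reloads` counter are hypothetical stand-ins, and plain bytes play the role of a PDF. Writes only mark the read-side snapshot as stale, and the next read reloads it once.

```python
class WriteEngine:
    """Hypothetical stand-in for the structural (write) engine."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def append(self, extra: bytes) -> None:
        self._data += extra

    def serialize(self) -> bytes:
        return self._data


class ReadEngine:
    """Hypothetical stand-in for the rendering/text (read) engine."""

    def __init__(self, data: bytes) -> None:
        self._snapshot = data

    def reload(self, data: bytes) -> None:
        self._snapshot = data

    def get_text(self) -> str:
        return self._snapshot.decode("utf-8")


class SyncedDocument:
    """Keeps the read engine in step with the write engine via a dirty flag."""

    def __init__(self, data: bytes) -> None:
        self._writer = WriteEngine(data)
        self._reader = ReadEngine(data)
        self._dirty = False
        self.reloads = 0  # illustration only: counts how often a reload happened

    def append(self, extra: bytes) -> None:
        self._writer.append(extra)
        self._dirty = True  # the read-side snapshot is now stale

    def _sync(self) -> None:
        # Reload lazily: only when a write happened since the last read.
        if self._dirty:
            self._reader.reload(self._writer.serialize())
            self._dirty = False
            self.reloads += 1

    def get_text(self) -> str:
        self._sync()
        return self._reader.get_text()


doc = SyncedDocument(b"hello")
doc.append(b" world")
doc.append(b"!")
print(doc.get_text())  # prints: hello world!
print(doc.reloads)     # prints: 1 (two writes, a single reload)
```

Deferring the reload to the next read means several consecutive edits cost one resynchronisation rather than one per edit.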

Quick start

pip install sopdf

Requirements: Python 3.10+.

import sopdf

with sopdf.open("document.pdf") as doc:
    # Render
    img_bytes = doc[0].render(dpi=150)

    # Extract text
    text = doc[0].get_text()
    blocks = doc[0].get_text_blocks()

    # Search
    hits = doc[0].search("invoice", match_case=False)

    # Split and merge
    part = doc.split(pages=[0, 1, 2], output="chapter1.pdf")
    sopdf.merge(["intro.pdf", "body.pdf"], output="book.pdf")

    # Save
    doc.append(part)
    doc.save("out.pdf", compress=True, garbage=True)
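For batch pipelines, the same calls can be wrapped in a small helper. The sketch below is illustrative only: it uses just `sopdf.open(...)` as a context manager and `doc[0].get_text()` from the quick start, and it assumes `get_text()` returns a `str`. The helper name `first_pages_mentioning` and the injectable `open_pdf` parameter are hypothetical, added so the logic can be exercised without real PDFs.

```python
from typing import Any, Callable, Iterable


def _default_open(path: str) -> Any:
    # Imported lazily so the helper can be tested without SoPDF installed.
    import sopdf

    return sopdf.open(path)


def first_pages_mentioning(
    paths: Iterable[str],
    keyword: str,
    open_pdf: Callable[[str], Any] = _default_open,
) -> list[str]:
    """Return the paths whose first page contains `keyword` (case-insensitive)."""
    needle = keyword.casefold()
    matches: list[str] = []
    for path in paths:
        with open_pdf(path) as doc:
            # Assumption: get_text() returns the page text as a str.
            text = doc[0].get_text()
        if needle in text.casefold():
            matches.append(path)
    return matches
```

With SoPDF installed, a call such as `first_pages_mentioning(["a.pdf", "b.pdf"], "invoice")` would open each file in turn. The file names here are placeholders.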

Resources

Repository: SoMarkAI/SoPDF
PyPI: sopdf
Examples: examples/ in the repository
Benchmark suite: tests/benchmark/ in the repository

License

Apache License 2.0
