
What is SoPDF

SoPDF is an open-source Python PDF processing library that covers the full workflow from rendering and text extraction to structural editing. You can use it to build production-grade parsing, retrieval, split/merge, and batch processing pipelines.
SoPDF is released under Apache 2.0. You can use it in personal projects, commercial products, and open-source libraries.

Why SoPDF

  • High performance: the project's official benchmarks report speedups over PyMuPDF in core scenarios (up to about 2.82x in rendering, 2.74x in plain text extraction, and 3.17x in full-text search).
  • Feature complete: supports rendering, extraction, search, split/merge, compressed save, metadata, and outline handling.
  • Clean API: designed for practical day-to-day engineering workflows.
  • Permissive license: Apache 2.0 simplifies internal adoption and external distribution.
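Speedup figures like the ones above are ratios of wall-clock times, so they are worth re-measuring on your own documents and hardware. The following is a minimal timing sketch using only the standard library. The helper names `median_seconds` and `speedup` are hypothetical and are not part of SoPDF or its benchmark suite.

```python
import statistics
import time
from typing import Callable


def median_seconds(task: Callable[[], object], repeats: int = 5) -> float:
    """Run task() `repeats` times and return the median wall-clock time."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds
```

To compare two libraries, pass each one's rendering or extraction call as `task` and divide the resulting medians with `speedup`. A value above 1 means the candidate was faster in that run.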

Core capabilities

Capability             Description
Open documents         Open PDFs from file paths, bytes, or streams
Page rendering         Render to PNG/JPEG, including batch and parallel rendering
Text workflows         Extract plain text, extract text blocks with bounding boxes, and search keywords
Document editing       Split, merge, rotate pages, and save with compression
Operational readiness  Handle encrypted PDFs and auto-repair corrupted PDFs
Document metadata      Read/write metadata and read document outline (TOC)

Architecture

SoPDF uses a dual-engine architecture:
  • pypdfium2 (Google PDFium): rendering, text extraction, and search.
  • pikepdf (libqpdf): structure reads, writes, and save/compression.
With a dirty-flag + hot-reload sync mechanism, write operations are automatically reflected in the read path, so you do not need to manually coordinate both engines.
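The general dirty-flag idea can be illustrated with a toy model. This is a sketch of the pattern only, not SoPDF's actual implementation: the class `DualEngineDocument` and its methods are hypothetical, and plain Python lists stand in for the two engines.

```python
class DualEngineDocument:
    """Toy model of a write side and a read side kept in sync lazily."""

    def __init__(self, pages):
        self._writer_pages = list(pages)   # stands in for the write engine's state
        self._reader_pages = tuple(pages)  # stands in for the read engine's snapshot
        self._dirty = False                # True when the snapshot is stale
        self.reload_count = 0              # exposed only to make the behavior visible

    def append_page(self, text):
        """A write: mutate the write side and mark the read side stale."""
        self._writer_pages.append(text)
        self._dirty = True

    def _sync(self):
        """Rebuild the read snapshot, but only if a write happened since the last read."""
        if self._dirty:
            self._reader_pages = tuple(self._writer_pages)
            self._dirty = False
            self.reload_count += 1

    def get_text(self, index):
        """A read: sync first, then answer from the read side."""
        self._sync()
        return self._reader_pages[index]
```

In this model, several writes in a row cost a single reload on the next read, and reads with no intervening write cost none. That is the usual reason to prefer a lazy flag over reloading after every write.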

Quick start

pip install sopdf

Requirements: Python 3.10+.

import sopdf

with sopdf.open("document.pdf") as doc:
	# Render
	img_bytes = doc[0].render(dpi=150)

	# Extract text
	text = doc[0].get_text()
	blocks = doc[0].get_text_blocks()

	# Search
	hits = doc[0].search("invoice", match_case=False)

	# Split and merge
	part = doc.split(pages=[0, 1, 2], output="chapter1.pdf")
	sopdf.merge(["intro.pdf", "body.pdf"], output="book.pdf")

	# Append the split-off pages and save
	doc.append(part)
	doc.save("out.pdf", compress=True, garbage=True)
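
For batch pipelines, the same calls can be wrapped in a small helper. The sketch below uses only `sopdf.open`, page indexing, and `get_text()` as shown above. The helper names, the `opener` parameter, and the use of threads are assumptions for illustration. Whether SoPDF objects are thread-safe is not stated here, so each worker opens its own document.

```python
from concurrent.futures import ThreadPoolExecutor


def first_page_text(path, opener=None):
    """Return the text of page 0 of one PDF."""
    if opener is None:
        import sopdf  # imported lazily so the helper can be exercised with a stub opener
        opener = sopdf.open
    with opener(path) as doc:
        return doc[0].get_text()


def extract_many(paths, opener=None, max_workers=4):
    """Map each path to its first-page text, or to the exception it raised."""
    paths = list(paths)  # allow any iterable, since it is traversed twice below

    def safe(path):
        try:
            return first_page_text(path, opener)
        except Exception as exc:  # keep one bad file from aborting the whole batch
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(safe, paths)))
```

Calling `extract_many(["a.pdf", "b.pdf"])` returns a dictionary keyed by path. Entries whose value is an exception instance mark files that could not be opened or read.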

Resources

  • Repository: SoMarkAI/SoPDF
  • PyPI: sopdf
  • Examples: examples/ in the repository
  • Benchmark suite: tests/benchmark/ in the repository

License

Apache License 2.0