Synchronous Parsing - SoMark Docs

POST /parse/sync

Python

import json
import requests

url = "https://somark.tech/api/v1/parse/sync"

data = {
    "output_formats": ["markdown", "json"],
    "api_key": "sk-***",
    "element_formats": json.dumps({
        "image": "url",
        "formula": "latex",
        "table": "html",
        "cs": "image",
    }),
    "feature_config": json.dumps({
        "enable_text_cross_page": False,
        "enable_table_cross_page": False,
        "enable_title_level_recognition": False,
        "enable_inline_image": False,
        "enable_table_image": True,
        "enable_image_understanding": True,
        "keep_header_footer": False,
    }),
}

# Open the file in a context manager so the handle is closed after the upload
with open("example.pdf", "rb") as f:
    files = {"file": ("example.pdf", f)}
    response = requests.post(url, data=data, files=files)

print(response.json())

{
  "code": 0,
  "message": "任务成功",
  "data": {
    "task_id": "a1b2c3d4e5f6",
    "error": null,
    "metadata": {
      "page_num": 10,
      "file_type": ".pdf"
    },
    "result": {
      "file_name": "document.pdf",
      "outputs": {
        "markdown": "# 第一章 引言\n\n本文档介绍了...",
        "json": {
          "pages": [
            {
              "page_num": 0,
              "blocks": [
                {
                  "idx": 0,
                  "type": "title",
                  "bbox": [
                    72,
                    50,
                    540,
                    80
                  ],
                  "content": "第一章 引言",
                  "format": "text",
                  "captions": [],
                  "img_url": "",
                  "title_level": 1
                },
                {
                  "idx": 1,
                  "type": "text",
                  "bbox": [
                    72,
                    100,
                    540,
                    200
                  ],
                  "content": "本文档介绍了...",
                  "format": "text",
                  "captions": [],
                  "img_url": ""
                }
              ],
              "page_size": {
                "h": 1684,
                "w": 1190
              },
              "merge_content_from_pre_page": false
            }
          ]
        }
      }
    }
  }
}
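As a rough illustration of how the `json` output in the sample response might be consumed, the sketch below walks `pages` and `blocks` and collects the title blocks. The helper name `collect_titles` is hypothetical, and the code assumes the payload has the same shape as the sample above; only fields shown there are used.

```python
def collect_titles(response_body):
    """Return (page_num, title_level, content) for every title block.

    Assumes the structure shown in the sample response. collect_titles is a
    hypothetical helper, not part of any SoMark SDK.
    """
    pages = response_body["data"]["result"]["outputs"]["json"]["pages"]
    titles = []
    for page in pages:
        for block in page["blocks"]:
            if block["type"] == "title":
                # In the sample, title_level appears only on title blocks,
                # so read it with .get() in case it is absent.
                titles.append(
                    (page["page_num"], block.get("title_level"), block["content"])
                )
    return titles


# Trimmed copy of the sample response, keeping only the fields used above
sample = {
    "data": {"result": {"outputs": {"json": {"pages": [{
        "page_num": 0,
        "blocks": [
            {"idx": 0, "type": "title", "content": "第一章 引言", "title_level": 1},
            {"idx": 1, "type": "text", "content": "本文档介绍了..."},
        ],
    }]}}}}
}
print(collect_titles(sample))
```

The same loop could filter on other `type` values, or use `bbox` and `page_size` to locate a block on the page.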

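Path change: this endpoint's path has changed from /extract/acc_sync to /parse/sync. The old path will be retired on 2026-12-31; migrate to the new path before then.

Parameter change: extract_config has been renamed to feature_config. Replace the extract_config field in your requests with feature_config.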
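For callers that still build requests the old way, the migration could be sketched as a small shim. This is only an illustration: `migrate_request` is a hypothetical helper, and the old full URL used in the usage line assumes the old path lived under the same `https://somark.tech/api/v1` base as the new one.

```python
OLD_PATH = "/extract/acc_sync"
NEW_PATH = "/parse/sync"


def migrate_request(url, data):
    """Return (url, data) updated for the new path and parameter name.

    Hypothetical helper; it only rewrites the path segment and renames the
    extract_config form field to feature_config.
    """
    new_url = url.replace(OLD_PATH, NEW_PATH)
    new_data = dict(data)  # copy so the caller's dict is left untouched
    if "extract_config" in new_data:
        # pop() removes the old key; setdefault() keeps an explicit
        # feature_config if the caller already supplied both keys.
        new_data.setdefault("feature_config", new_data.pop("extract_config"))
    return new_url, new_data


# Assumed old URL: same base as the new endpoint
url, data = migrate_request(
    "https://somark.tech/api/v1/extract/acc_sync",
    {"api_key": "sk-***", "extract_config": "{}"},
)
print(url)
print(data)
```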

output_formats

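Available output formats	Default	Description
`json` / `markdown` / `zip`	`["markdown", "json"]`	Multiple values allowed; the default is used when the parameter is omitted. `zip` packages the Markdown and all image files into an archive. When `output_formats` includes `zip`, `element_formats.image` must be `file`.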

element_formats

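Option	Available output formats	Default	Description
`image`	`url` / `base64` / `file` / `none`	`url`	Single choice. When `image` is set to `file`, `output_formats` must include `zip`; `none` means no images are returned.
`formula`	`latex` / `mathml` / `ascii`	`latex`	Single choice. Sets the output format for formulas.
`table`	`markdown` / `html` / `image`	`html`	Single choice. In `markdown` mode, merged cells are automatically split into individual cells, each filled with the same content.
`cs`	`image`	`image`	Single choice. Output format for chemical structures; support for the `smiles` format is coming soon.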
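The `zip` and `file` rules in the two tables above depend on each other, so it may be worth checking a combination locally before sending it. The following is a minimal sketch of such a check; `check_formats` and the two constant names are hypothetical, the allowed values are copied from the tables, and treating an omitted `image` key as the default `url` is an assumption (the tables do not say how the server handles that case together with `zip`).

```python
ALLOWED_OUTPUT = {"json", "markdown", "zip"}
ALLOWED_ELEMENT = {
    "image": {"url", "base64", "file", "none"},
    "formula": {"latex", "mathml", "ascii"},
    "table": {"markdown", "html", "image"},
    "cs": {"image"},
}


def check_formats(output_formats, element_formats):
    """Return a list of problems; an empty list means the combination looks valid."""
    problems = []
    for fmt in output_formats:
        if fmt not in ALLOWED_OUTPUT:
            problems.append(f"unknown output format: {fmt}")
    for key, value in element_formats.items():
        if key not in ALLOWED_ELEMENT:
            problems.append(f"unknown element key: {key}")
        elif value not in ALLOWED_ELEMENT[key]:
            problems.append(f"unsupported {key} format: {value}")
    # Assumption: an omitted image key behaves like the documented default "url"
    image = element_formats.get("image", "url")
    wants_zip = "zip" in output_formats
    if wants_zip and image != "file":
        problems.append("zip output requires element_formats.image == 'file'")
    if image == "file" and not wants_zip:
        problems.append("image == 'file' requires 'zip' in output_formats")
    return problems


print(check_formats(["markdown", "zip"], {"image": "file", "table": "html"}))
print(check_formats(["json"], {"image": "file"}))
```

Returning a list rather than raising on the first problem lets a caller report every issue at once.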

feature_config

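Option	Default	Description
`enable_text_cross_page`	`false`	Cross-page text merging: text segments that span pages are merged into continuous paragraphs
`enable_table_cross_page`	`false`	Cross-page table merging: tables that span pages are merged into a single complete table
`enable_title_level_recognition`	`false`	Title level recognition: detects the document's heading hierarchy (H1/H2/H3…)
`enable_inline_image`	`false`	Inline images: returns images embedded in text paragraphs
`enable_table_image`	`true`	Images in tables: returns images inside table cells
`enable_image_understanding`	`true`	Image understanding: produces semantic understanding and structured descriptions of images in the document
`keep_header_footer`	`false`	Keep headers and footers: headers and footers are filtered out by default; enable this option if you need them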
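Since the request example sends `feature_config` as a JSON string, one way to avoid retyping all seven flags is to start from the documented defaults and override only what differs. The sketch below assumes exactly that; `build_feature_config` and `FEATURE_DEFAULTS` are hypothetical names, and the default values are copied from the table above.

```python
import json

# Defaults copied from the feature_config table
FEATURE_DEFAULTS = {
    "enable_text_cross_page": False,
    "enable_table_cross_page": False,
    "enable_title_level_recognition": False,
    "enable_inline_image": False,
    "enable_table_image": True,
    "enable_image_understanding": True,
    "keep_header_footer": False,
}


def build_feature_config(**overrides):
    """Return feature_config as a JSON string, rejecting keys not in the table."""
    unknown = set(overrides) - set(FEATURE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown feature_config keys: {sorted(unknown)}")
    # Later keys win, so overrides replace the documented defaults
    return json.dumps({**FEATURE_DEFAULTS, **overrides})


# Example: merge cross-page tables and recognise heading levels
print(build_feature_config(
    enable_table_cross_page=True,
    enable_title_level_recognition=True,
))
```

Rejecting unknown keys catches typos such as a leftover `extract_config` option name before the request is sent.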

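For authentication, usage limits, and guidance on choosing between synchronous and asynchronous parsing, see the API Overview first. For large files, batch processing, and background jobs, use Async Parsing - Submit Task instead.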

Request body

multipart/form-data

file

Required

The file to parse. Supported formats: PDF, images, Word, PPT, and Excel

api_key

string

Required

API key, in the format sk-***

output_formats

enum<string>[]

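Output formats; multiple values allowed. Defaults to ["markdown", "json"] when omitted. Supports json / markdown / zip, where zip packages all output files into an archive

Available options: json, markdown, zip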

element_formats

object

Element format configuration; controls the output format of each element type


feature_config

object

Feature configuration (this parameter was renamed from extract_config to feature_config)


Response

200 - application/json

Parsing succeeded

code

integer

Status code; 0 means success, for non-zero values see Error Codes

Example:

0

message

string

Example:

"任务成功"
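Because success is signalled by `code` being 0, client code will usually want to check it before touching `data`. A minimal sketch of that check follows, using only the `code`, `message`, and `data` fields shown on this page; `ParseError` and `unwrap` are hypothetical names, and the meaning of specific non-zero codes is left to the Error Codes page.

```python
class ParseError(Exception):
    """Raised when the response carries a non-zero code (hypothetical class)."""

    def __init__(self, code, message):
        super().__init__(f"parse failed with code {code}: {message}")
        self.code = code
        self.message = message


def unwrap(body):
    """Return body["data"] when code is 0, otherwise raise ParseError."""
    if body.get("code") != 0:
        raise ParseError(body.get("code"), body.get("message"))
    return body["data"]


# Usage with a body shaped like the sample response
ok_body = {"code": 0, "message": "任务成功", "data": {"task_id": "a1b2c3d4e5f6"}}
print(unwrap(ok_body)["task_id"])
```

With `requests`, the argument would typically be `response.json()`.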

data

object

