This section covers the operational methods that most developers eventually need:
- extract(...)
- extract_with_registered(...)
The first, extract(...), is the method that takes the current routing rules, resolves the correct registered extractor, and runs it through the managed extractor runtime.
extract(path=None, data=None, filename=None, mime_type=None, config=None) -> dict | None¶
This method runs the active registered extractor selected by the configured MIME type binding.
Example with a materialized media path:
materialized = module_sdk.media.get_path(stored_path)
try:
    payload = module_sdk.extractors.extract(
        path=materialized.path,
        mime_type="application/pdf",
        config={"ocr_enabled": True},
    )
finally:
    materialized.cleanup()

Example with in-memory bytes:
payload = module_sdk.extractors.extract(
    data=file_bytes,
    filename="report.pdf",
    mime_type="application/pdf",
    config={"chunker": "markdown"},
)

What It Is For¶
Use this when your module wants the structured extraction output itself, not just the routing decision.
Typical examples:
- convert an uploaded PDF into markdown and chunks
- extract structure from a document before a chat or preview flow
- run the registered extractor configured for a MIME type
Supported Input Shapes¶
You can call extract(...) in two main ways.
Path-based extraction¶
Pass path when the file already exists at a readable local path.
If the source lives in media storage, first materialize it with
module_sdk.media.get_path(...) and pass the returned local path. Do not pass a
client media URL or proxy URL to the extractor.
In-memory extraction¶
Pass data when the file exists only as bytes in memory.
When you use data, you should also pass:
- filename
- and, when available, mime_type
That gives the resolver enough information to choose the extractor deterministically.
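To illustrate why a filename helps, here is a minimal sketch of MIME type inference built on Python's standard `mimetypes` module. Whether the SDK resolver uses the same mechanism is an assumption, and `guess_mime_type` is a hypothetical helper name, not part of the SDK.

```python
import mimetypes


def guess_mime_type(filename=None, mime_type=None):
    """Return the explicit MIME type if given, else guess it from the filename.

    Hypothetical helper: the SDK's own inference may work differently.
    """
    if mime_type:
        # An explicit MIME type always wins over a guess.
        return mime_type
    if not filename:
        # Raw bytes without a filename carry no hint at all.
        return None
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed
```

With bytes alone and neither hint, a helper like this has nothing to work with, which is why passing filename alongside data matters.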
What Happens Internally¶
The facade does more than call one extractor class directly.
The flow is:
- normalize path, filename, and mime_type
- infer mime type from path or filename when missing
- call resolve(...) to choose the extractor bound to that MIME type
- merge extractor row config with any runtime overrides passed in config
- prepare the input source list in the shape expected by the extractor runtime
- call the managed extractor runtime
- attach resolved_extractor metadata to the returned payload
That is why extract(...) is the right abstraction for modules. It preserves the full runtime contract instead of forcing every module to manually instantiate or route extractors itself.
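The flow above can be sketched as a toy facade. This is an illustration only, not the SDK's implementation: `REGISTRY`, `MIME_BINDINGS`, `run_extractor`, the source dictionary shape, and the keys inside `resolved_extractor` are all hypothetical stand-ins chosen for the example.

```python
import mimetypes

# Hypothetical in-memory stand-ins for the extractor registry and MIME bindings.
REGISTRY = {
    "pdf-extractor": {"config": {"ocr_enabled": False, "chunker": "plain"}},
}
MIME_BINDINGS = {"application/pdf": "pdf-extractor"}


def resolve(mime_type):
    """Return (extractor_id, row) for a bound MIME type, or None."""
    extractor_id = MIME_BINDINGS.get(mime_type)
    row = REGISTRY.get(extractor_id)
    if row is None:
        return None
    return extractor_id, row


def run_extractor(extractor_id, sources, config):
    """Placeholder runtime: a real one would parse the sources."""
    return {"extractor_id": extractor_id, "markdown_content": "", "chunks": []}


def extract(path=None, data=None, filename=None, mime_type=None, config=None):
    if path is None and data is None:
        raise ValueError("path or data is required")
    # Normalize inputs and infer the MIME type when it is missing.
    name = filename or (str(path) if path is not None else None)
    if not mime_type and name:
        mime_type = mimetypes.guess_type(name)[0]
    # Resolve the extractor bound to that MIME type; no binding means None.
    resolved = resolve(mime_type)
    if resolved is None:
        return None
    extractor_id, row = resolved
    # Runtime overrides win over the stored row config.
    merged = {**row["config"], **(config or {})}
    # Prepare the source list handed to the runtime.
    sources = [
        {"path": path, "data": data, "filename": name, "mime_type": mime_type}
    ]
    # Run the extractor, then attach the resolution metadata.
    payload = run_extractor(extractor_id, sources, merged)
    payload["resolved_extractor"] = {
        "extractor_id": extractor_id,
        "mime_type": mime_type,
        "config": merged,
    }
    return payload
```

Even in this reduced form, a caller that instantiated an extractor directly would have to repeat the inference, binding lookup, and config merge itself.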
Return Value¶
When extraction succeeds, the returned payload contains normalized structured fields from the extractor runtime:
- extractor_id
- structure
- markdown_content
- chunks
- index
- formulas
- tables
- images
- resolved_extractor
That last field is added by the resolver layer and is particularly useful because it tells you which registered extractor row was actually selected.
Why resolved_extractor Is Useful¶
When debugging an extraction result, it is often not enough to know only that extraction succeeded.
You also want to know:
- which extractor handled the file
- which MIME type binding selected it
- which config and registry metadata were attached to the chosen row
resolved_extractor gives you that context.
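One way to make that context visible is to log it next to each extraction result. The sketch below assumes only that `resolved_extractor` is a dictionary-like value; it does not depend on any particular keys inside it. `describe_resolution` and `log_resolution` are hypothetical helper names.

```python
import json
import logging

logger = logging.getLogger("extraction")


def describe_resolution(payload):
    """Return a compact JSON string describing which extractor was selected."""
    resolved = (payload or {}).get("resolved_extractor") or {}
    # default=str keeps the dump safe if the metadata holds non-JSON values.
    return json.dumps(resolved, sort_keys=True, default=str)


def log_resolution(payload):
    """Emit the resolution metadata at INFO level for later debugging."""
    logger.info("resolved_extractor=%s", describe_resolution(payload))
```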
Failure Behavior¶
This method has two important failure modes.
No extractor matches¶
If the MIME type has no configured binding, or the bound extractor is no longer
installed/active, the method returns None.
It does not silently invent a fallback extractor.
That makes None a meaningful outcome your module must handle deliberately.
No input was provided¶
If both path and data are missing, the lower resolver layer raises:
ValueError("path or data is required")
That is a programmer error rather than a normal runtime result.
Real Usage Pattern¶
The higher-level sdk.knowledge.extract_document_data(...) method delegates to this domain.
Its pattern is:
- resolve or materialize the file path
- call sdk.extractors.extract(...)
- fail explicitly if the result is None
That is a good example of the relationship between sdk.extractors and sdk.knowledge:
- sdk.extractors is the lower-level execution domain
- sdk.knowledge can build stricter document-oriented behavior on top of it
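A stricter wrapper in that spirit might look like the sketch below. It is not the actual sdk.knowledge implementation: `extract_document_strict` and `ExtractorNotFoundError` are hypothetical names, and the `sdk` argument is assumed to expose `media.get_path(...)` and `extractors.extract(...)` as shown earlier in this section.

```python
class ExtractorNotFoundError(RuntimeError):
    """Hypothetical error type for a missing or inactive extractor binding."""


def extract_document_strict(sdk, stored_path, mime_type=None, config=None):
    """Materialize a stored file, extract it, and fail loudly on None."""
    materialized = sdk.media.get_path(stored_path)
    try:
        payload = sdk.extractors.extract(
            path=materialized.path,
            mime_type=mime_type,
            config=config,
        )
    finally:
        # Always release the materialized local copy, even on failure.
        materialized.cleanup()
    if payload is None:
        raise ExtractorNotFoundError(
            f"no registered extractor for {mime_type!r}"
        )
    return payload
```

Turning None into an exception is a policy choice that belongs in the higher layer, which is why the lower-level domain returns None instead of raising.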
A Practical Module Example¶
Imagine a module that accepts uploaded attachments and wants to feed extracted text into a later workflow.
The pattern looks like this:
payload = module_sdk.extractors.extract(
    data=file_bytes,
    filename="proposal.pdf",
    mime_type="application/pdf",
    config={
        "ocr_enabled": True,
        "chunker_max_tokens": 800,
    },
)
if payload is None:
    raise RuntimeError("registered_extractor_not_found")
markdown = str(payload.get("markdown_content") or "").strip()
chunks = list(payload.get("chunks") or [])
tables = list(payload.get("tables") or [])

This is the right style for module authors:
- let the extractor domain choose the runtime handler
- treat None as a real routing failure
- consume normalized output fields from the returned payload
About config¶
The optional config parameter is not a separate extractor definition. It is a set of runtime overrides merged on top of the resolved extractor row config before execution.
That means it is appropriate for per-call behavior such as:
- enabling OCR
- selecting a chunker
- overriding token limits
It is not the same thing as permanently reconfiguring the extractor registry row.
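The difference between a per-call override and a registry change can be shown with a small merge sketch. A shallow merge is an assumption here (the SDK may merge nested keys differently), and `merge_config` and the sample row values are hypothetical.

```python
def merge_config(row_config, overrides=None):
    """Return a new dict where per-call overrides win over row defaults."""
    merged = dict(row_config or {})  # copy so the stored row is never mutated
    merged.update(overrides or {})
    return merged


# Hypothetical stored row config for an extractor.
row_config = {"ocr_enabled": False, "chunker": "plain", "chunker_max_tokens": 1200}

# Overrides apply to this call only; row_config itself stays unchanged.
effective = merge_config(
    row_config, {"ocr_enabled": True, "chunker_max_tokens": 800}
)
```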
When To Prefer sdk.knowledge¶
If your actual goal is not "run extraction" but rather:
- ingest a document into knowledge storage
- resolve or materialize media-backed document paths
- enforce strict failure when no extractor exists
then the higher-level sdk.knowledge facade may be the better entry point.
Use sdk.extractors directly when you want explicit control over extractor routing and extraction output.
extract_with_registered(extractor_id, path=None, data=None, filename=None, mime_type=None, config=None) -> dict | None¶
This method runs one installed extractor directly.
It bypasses MIME-type binding resolution and is intended for management and test flows where the user explicitly selected the extractor to exercise.
payload = module_sdk.extractors.extract_with_registered(
    extractor_id="pdf-extractor",
    path=materialized.path,
    filename="report.pdf",
    mime_type="application/pdf",
)

Use this for extractor detail pages, diagnostics, and runtime smoke tests.
Do not use it for normal ingestion routing. In normal document flows, call
extract(...) so the configured MIME binding remains the source of truth.
Failure behavior:
- returns None when the extractor id is missing from the registry or not installed/active
- raises ValueError("extractor_id is required") for an empty extractor id
- raises ValueError("path or data is required") when no input is provided
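For a diagnostics page or smoke test, these three outcomes can be folded into a single status value. The helper below is a sketch under stated assumptions: `smoke_test_extractor` and the status strings are hypothetical, and `extractors` is assumed to be an object exposing `extract_with_registered(...)` with the signature documented above.

```python
def smoke_test_extractor(extractors, extractor_id, path, mime_type=None):
    """Run one extractor directly and classify the outcome as a short status."""
    try:
        payload = extractors.extract_with_registered(
            extractor_id=extractor_id,
            path=path,
            mime_type=mime_type,
        )
    except ValueError as exc:
        # Empty extractor id or missing input: a caller bug, not a runtime result.
        return {"status": "invalid_request", "detail": str(exc)}
    if payload is None:
        # The extractor is unknown to the registry or not installed/active.
        return {"status": "extractor_unavailable", "detail": extractor_id}
    return {"status": "ok", "chunk_count": len(payload.get("chunks") or [])}
```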