This section covers the most directly useful high-level method in the domain:
extract_document_data(...)
If you are a module author who wants document content in a knowledge-oriented shape without manually orchestrating extractor selection, this is usually the first method to look at.
extract_document_data(path, force_path_resolved=False, chunker_override=None, chunker_max_tokens=None, ocr_enabled=None)
This method extracts structured content from a readable document path using the currently registered extractor flow.
Example:
payload = module_sdk.knowledge.extract_document_data(
    "/tmp/report.pdf",
    chunker_override="markdown",
    chunker_max_tokens=800,
    ocr_enabled=True,
)
Example with a persisted media payload:
payload = module_sdk.knowledge.extract_with_registered_extractor(
    data=module_sdk.media.view(document.media_path),
    filename="report.pdf",
)
What It Is For
Use this when your module wants structured document content and the source is already available as a readable local path.
Typical examples:
- preprocessing an uploaded document before attaching it to a chat flow
- extracting markdown and chunks from a stored PDF
- building a document-summary or preview flow
This method is intentionally document-oriented. It is a convenience layer over the extractor runtime, not a generic routing primitive.
What It Actually Does
Under the hood, the method:
- rejects legacy path-resolution mode if force_path_resolved=True
- guesses the mime type from the provided path
- calls sdk.extractors.extract(...)
- passes a config override dict containing:
  - chunker
  - chunker_max_tokens
  - ocr_enabled
- raises a runtime error if no registered extractor matches
That final step is important.
This method does not silently return None when no extractor is available. It converts that condition into an explicit failure because the method represents a knowledge-level expectation: if you asked to extract a document, the operation is supposed to succeed through a registered extractor.
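The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SDK's actual implementation: the `extract` callable stands in for sdk.extractors.extract(...), whose real signature is not shown here, so the keyword names `path`, `mime_type`, and `config_override` are hypothetical, and the stdlib `mimetypes` module is used as a plausible stand-in for the mime guess.

```python
import mimetypes
from typing import Callable, Optional

# Hypothetical stand-in type for sdk.extractors.extract(...).
ExtractFn = Callable[..., Optional[dict]]


def extract_document_data_sketch(
    extract: ExtractFn,
    path: str,
    force_path_resolved: bool = False,
    chunker_override: Optional[str] = None,
    chunker_max_tokens: Optional[int] = None,
    ocr_enabled: Optional[bool] = None,
) -> dict:
    # Step 1: reject the removed legacy path-resolution mode.
    if force_path_resolved:
        raise RuntimeError(
            "legacy media path resolution was removed from the public SDK; "
            "rewrite this extraction flow against the new media API"
        )

    # Step 2: guess the mime type from the path (assumed: stdlib guess, may be None).
    mime_type, _encoding = mimetypes.guess_type(path)

    # Step 3: build the config override dict with the three documented keys.
    config_override = {
        "chunker": chunker_override,
        "chunker_max_tokens": chunker_max_tokens,
        "ocr_enabled": ocr_enabled,
    }

    # Step 4: delegate to the extractor runtime (keyword names are assumptions).
    payload = extract(path=path, mime_type=mime_type, config_override=config_override)

    # Step 5: turn "no extractor matched" into an explicit failure.
    # The real message carries additional detail that is not reproduced here.
    if payload is None:
        raise RuntimeError("registered_extractor_not_found")
    return payload
```

Passing the extractor in as a parameter keeps the sketch runnable without the SDK and makes the None-to-error conversion easy to see in isolation.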
About force_path_resolved
This flag is no longer a supported path-resolution mechanism.
If force_path_resolved=True, the method raises:
RuntimeError("legacy media path resolution was removed from the public SDK; rewrite this extraction flow against the new media API")
The argument remains documented only because it is still present in the current function signature. New module code should leave it at the default False.
If your source lives in persisted media storage rather than at a readable local path, the correct pattern is to load the bytes through module_sdk.media.view(...) and call module_sdk.knowledge.extract_with_registered_extractor(...) instead.
Return Value
On success, the method returns the same normalized extraction payload produced by the registered extractor flow, including fields such as:
- extractor_id
- structure
- markdown_content
- chunks
- index
- formulas
- tables
- images
- resolved_extractor
That means the output is still extractor-derived, but the calling style is more document-oriented.
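As a rough illustration of consuming that payload, the sketch below reads a few of the listed fields defensively. It assumes the payload behaves like a mapping, that markdown_content is a string, and that chunks and tables are list-like; none of those types are confirmed by this section, and `summarize_payload` is a hypothetical helper, not part of the SDK.

```python
from typing import Any, Mapping


def summarize_payload(payload: Mapping[str, Any]) -> dict:
    """Build a small summary from an extraction payload (hypothetical helper)."""
    # Assumption: missing or empty fields are treated as empty values.
    markdown = payload.get("markdown_content") or ""
    chunks = payload.get("chunks") or []
    return {
        "extractor_id": payload.get("extractor_id"),
        "markdown_chars": len(markdown),
        "chunk_count": len(chunks),
        "has_tables": bool(payload.get("tables")),
    }


# Hand-written sample payload for illustration only; not real extractor output.
sample = {
    "extractor_id": "demo-extractor",
    "markdown_content": "# Report\n",
    "chunks": ["# Report"],
    "tables": [],
}
print(summarize_payload(sample))
```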
Failure Behavior
If no registered extractor matches the provided path and guessed mime type, the method raises:
RuntimeError("registered_extractor_not_found ...")
That makes this method stricter than sdk.extractors.extract(...), which returns None when no extractor matches.
Why This Difference Exists
The lower-level extractor domain is a routing/execution primitive. Returning None there makes sense because the caller may still want to decide what to do next.
The knowledge façade is a higher-level helper. Here, missing extractor support is generally a real operational problem rather than just a neutral routing outcome.
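A caller that does want the neutral None outcome at the knowledge level could wrap the strict call. The sketch below is one possible approach, assuming the failure can be recognized by the "registered_extractor_not_found" text in the RuntimeError message; `try_extract` and the `extract_strict` parameter are hypothetical names, with `extract_strict` standing in for a call such as module_sdk.knowledge.extract_document_data.

```python
from typing import Callable, Optional


def try_extract(extract_strict: Callable[[str], dict], path: str) -> Optional[dict]:
    """Return the payload, or None when no registered extractor matched."""
    try:
        return extract_strict(path)
    except RuntimeError as exc:
        # Only the "no extractor" condition is downgraded to None.
        if "registered_extractor_not_found" in str(exc):
            return None
        # Anything else (for example the legacy-flag error) still propagates.
        raise
```

Matching on the message text is brittle; it is used here only because this section documents no dedicated exception type.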
Real Usage Pattern
Document-oriented flows usually follow this split:
- if you already have a readable local path, call module_sdk.knowledge.extract_document_data(...)
- if you have a stored media reference, load the bytes with module_sdk.media.view(...)
- pass those bytes to module_sdk.knowledge.extract_with_registered_extractor(...)
- read the extracted markdown for downstream use
That keeps the extraction API aligned with the current public media contract.
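The split described above might look like the following in module code. This is a sketch under assumptions: `DocumentSource` and `extract_markdown` are hypothetical names, the payload is assumed to support key access for markdown_content, and the `fake_sdk` object exists only so the branching can run without the real module_sdk.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class DocumentSource:
    """Hypothetical description of where a document lives."""
    filename: str
    local_path: Optional[str] = None
    media_path: Optional[str] = None


def extract_markdown(module_sdk: Any, source: DocumentSource) -> str:
    if source.local_path is not None:
        # Readable local path: use the path-based knowledge helper.
        payload = module_sdk.knowledge.extract_document_data(source.local_path)
    elif source.media_path is not None:
        # Stored media reference: load bytes first, then extract from bytes.
        data = module_sdk.media.view(source.media_path)
        payload = module_sdk.knowledge.extract_with_registered_extractor(
            data=data, filename=source.filename
        )
    else:
        raise ValueError("source has neither a local path nor a media reference")
    # Assumption: the payload exposes markdown_content via key access.
    return payload["markdown_content"]


# Fake SDK used only to exercise the branching; it is not the real module_sdk.
fake_sdk = SimpleNamespace(
    knowledge=SimpleNamespace(
        extract_document_data=lambda path: {"markdown_content": f"from path {path}"},
        extract_with_registered_extractor=lambda data, filename: {
            "markdown_content": f"from {len(data)} bytes of {filename}"
        },
    ),
    media=SimpleNamespace(view=lambda media_path: b"%PDF-1.7"),
)

print(extract_markdown(fake_sdk, DocumentSource("report.pdf", local_path="/tmp/report.pdf")))
print(extract_markdown(fake_sdk, DocumentSource("report.pdf", media_path="media/report.pdf")))
```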
When To Use This Instead Of sdk.extractors.extract(...)
Prefer extract_document_data(...) when:
- you already have a readable local path
- your feature is document-centric
- failure to find an extractor should be treated as an error
Prefer sdk.extractors.extract(...) when:
- you need lower-level control
- you are dealing with bytes rather than a resolved path
- you want to inspect or handle the None case yourself
Prefer module_sdk.knowledge.extract_with_registered_extractor(...) when:
- the document is already in memory as bytes
- the source came from module_sdk.media.view(...)
- you want the knowledge-domain entrypoint without relying on a local filesystem path