This section covers the methods that help modules answer three questions:
- which extractors exist?
- which MIME types are configurable for installed extractors?
- which extractor would be chosen for this file?
Those are different questions, and the domain exposes different methods for each.
list_manifests() -> list[dict]¶
This method returns the extractor manifests discovered from the extractor roots on disk.
Example:
manifests = module_sdk.extractors.list_manifests()
What It Is For¶
Use this when you need the filesystem-discovered capability catalog rather than the operational registry state.
Typical use cases:
- an admin page that shows all extractor plugins discovered by the runtime
- debugging why an extractor is or is not being discovered
- tooling that wants to inspect declared file extensions, MIME types, or manifest metadata
What It Returns¶
The payload comes from valid manifest.json files found under the configured extractor roots.
In practice, each manifest can include things such as:
- id
- name
- entrypoint
- file_extensions
- mime_types
- priority
- runtime and install policy sections declared by the extractor
The manifest loader normalizes important fields such as:
- extractor id to lowercase
- file extensions to lowercase
- MIME types to lowercase
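The normalization listed above can be pictured with a minimal sketch. The function name `normalize_manifest` is hypothetical, and this is not the runtime's actual loader; it only mirrors the three lowercasing rules stated in this section.

```python
from __future__ import annotations

from typing import Any


def normalize_manifest(manifest: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with id, file extensions, and MIME types lowercased.

    Hypothetical helper: it illustrates the documented rules only.
    """
    normalized = dict(manifest)
    normalized["id"] = str(manifest.get("id", "")).lower()
    normalized["file_extensions"] = [
        str(ext).lower() for ext in manifest.get("file_extensions", [])
    ]
    normalized["mime_types"] = [
        str(mime).lower() for mime in manifest.get("mime_types", [])
    ]
    return normalized


# Sample manifest data, invented for illustration.
example = normalize_manifest(
    {"id": "Docling", "file_extensions": [".PDF"], "mime_types": ["Application/PDF"]}
)
```

Because of this normalization, comparisons against manifest fields are safest when the caller lowercases its own values first.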
Why You Would Use It¶
This method tells you what the runtime can discover from disk, regardless of current registry or activation state.
That is useful because discovery problems and registry problems are not the same.
If an extractor is missing from list_manifests(), the issue is usually about manifest discovery or validity.
If it appears in list_manifests() but not where you expect operationally, the issue is probably in registration or status state, which is where list_registered() becomes relevant.
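To tell a discovery problem from a registry problem, it can help to compare the two lists directly. The sketch below works on plain lists of dicts so it runs without the SDK. It assumes that a manifest's `id` corresponds to the `extractor_id` of its registry row, which this section does not state explicitly. The helper name and the sample extractor `example_extractor` are hypothetical.

```python
from __future__ import annotations

from typing import Any


def discovered_but_unregistered(
    manifests: list[dict[str, Any]],
    registered: list[dict[str, Any]],
) -> list[str]:
    """Return manifest ids that have no matching registry row.

    Assumption: manifest "id" matches registry "extractor_id".
    """
    registered_ids = {
        str(row.get("extractor_id", "")).lower() for row in registered
    }
    return sorted(
        str(m.get("id", "")).lower()
        for m in manifests
        if str(m.get("id", "")).lower() not in registered_ids
    )


# Sample data standing in for list_manifests() and list_registered() output.
manifests = [{"id": "docling"}, {"id": "example_extractor"}]
registered = [{"extractor_id": "docling", "status": "active"}]
missing = discovered_but_unregistered(manifests, registered)
```

In a module, the two arguments would come from `module_sdk.extractors.list_manifests()` and `module_sdk.extractors.list_registered()`.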
list_registered() -> list[dict]¶
This method returns the extractor rows currently present in the extractor registry model.
Example:
rows = module_sdk.extractors.list_registered()
What It Is For¶
Use this when you care about operational state rather than just discovery.
Typical examples:
- show which extractors are active, installed, or uninstalled
- inspect registry priority and effective matching configuration
- build an admin table of extractors actually known to the runtime
What It Returns¶
The method queries module_sdk.models.extractor_registry and returns serialized registry rows.
Those rows can include fields such as:
- id
- name
- extractor_id
- config
- file_extensions
- mime_types
- priority
- status
- supported
Why This Matters¶
list_manifests() answers "what extractor manifests were found on disk?"
list_registered() answers "what extractor rows does the runtime currently know about operationally?"
That second question is usually the one you need for admin UI and debugging real routing behavior.
list_configurable_mime_bindings() -> list[dict]¶
This method returns the MIME types that can be configured for extraction and the installed extractors that support each MIME type.
Example:
rows = module_sdk.extractors.list_configurable_mime_bindings()
Each row includes:
- id: the MIME type, suitable for table row ids
- mime_type: normalized MIME type
- configured_extractor_id: selected extractor id, or an empty string
- extractor_options: select options for compatible installed extractors
- configured: whether a binding currently exists
The first option is always the unconfigured state:
{"label": "Not configured", "value": ""}
Use this method for admin/configuration pages. Do not build compatible extractor lists in the UI by manually comparing manifest fields. The backend filters options to extractors that are installed/active and actually declare support for the MIME type.
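As an illustration of consuming one of these rows in an admin page, the sketch below looks up the label of the currently selected option. It assumes every entry in `extractor_options` has the same `label`/`value` shape as the unconfigured option shown above. The helper name and the "Docling" label in the sample row are hypothetical.

```python
from __future__ import annotations

from typing import Any


def current_option_label(row: dict[str, Any]) -> str:
    """Return the label of the option matching configured_extractor_id."""
    selected = row.get("configured_extractor_id", "")
    for option in row.get("extractor_options", []):
        if option.get("value") == selected:
            return str(option.get("label", ""))
    # Fall back to the unconfigured label when no option matches.
    return "Not configured"


# Sample row shaped like the fields described in this section.
row = {
    "id": "application/pdf",
    "mime_type": "application/pdf",
    "configured_extractor_id": "docling",
    "extractor_options": [
        {"label": "Not configured", "value": ""},
        {"label": "Docling", "value": "docling"},
    ],
    "configured": True,
}
label = current_option_label(row)
```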
list_ingestible_mime_types() -> list[str]¶
This method returns only MIME types that currently have an extractor binding.
Example:
allowed = set(module_sdk.extractors.list_ingestible_mime_types())
Use this as an upload filter when a feature should accept only files that can enter the extraction/ingestion flow. If a MIME type has no binding, it is not ingestible even if some extractor manifest declares theoretical support for it.
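A minimal sketch of such an upload filter follows. The helper name `is_ingestible` is hypothetical. The sketch also assumes the returned MIME types are lowercase and carry no parameters such as `; charset=utf-8`, so it normalizes the incoming value before comparing.

```python
from __future__ import annotations


def is_ingestible(mime_type: str | None, allowed: set[str]) -> bool:
    """Return True when the upload's MIME type is in the allowed set."""
    if not mime_type:
        return False
    # Drop parameters and normalize case before comparing.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return base_type in allowed


# Stand-in for set(module_sdk.extractors.list_ingestible_mime_types()).
allowed = {"application/pdf"}
accepted = is_ingestible("Application/PDF; charset=binary", allowed)
rejected = is_ingestible("image/png", allowed)
```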
set_mime_type_binding(mime_type, extractor_id) -> dict | None¶
This method sets or clears the extractor selected for one MIME type.
Example:
binding = module_sdk.extractors.set_mime_type_binding(
    mime_type="application/pdf",
    extractor_id="docling",
)
Clear a binding by passing None or an empty extractor id:
module_sdk.extractors.set_mime_type_binding(
    mime_type="application/pdf",
    extractor_id=None,
)
Validation happens server-side:
- mime_type is required
- the extractor must exist in the registry
- the extractor must be installed or active
- the extractor must declare support for that MIME type
On success, the method returns:
- id
- mime_type
- extractor_id
When a binding is cleared, the method returns None.
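The validation rules above can be sketched as a pre-check over registry rows. This is not the server-side implementation. The helper name `check_binding` and the error strings are hypothetical, and this section does not say how the real method reports a rejected binding.

```python
from __future__ import annotations

from typing import Any


def check_binding(
    mime_type: str,
    extractor_id: str | None,
    registry_rows: list[dict[str, Any]],
) -> str | None:
    """Return an illustrative error message, or None if the rules pass."""
    if not mime_type:
        return "mime_type is required"
    if not extractor_id:
        return None  # Clearing a binding needs no extractor checks.
    row = next(
        (r for r in registry_rows if r.get("extractor_id") == extractor_id),
        None,
    )
    if row is None:
        return "extractor is not in the registry"
    if row.get("status") not in {"installed", "active"}:
        return "extractor is not installed or active"
    if mime_type.lower() not in row.get("mime_types", []):
        return "extractor does not declare support for this MIME type"
    return None


# Sample registry rows, shaped like list_registered() output.
rows = [
    {"extractor_id": "docling", "status": "active", "mime_types": ["application/pdf"]},
]
ok = check_binding("application/pdf", "docling", rows)
```

The server remains the source of truth; a client-side check like this is only useful for giving earlier feedback in a form.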
resolve(path=None, filename=None, mime_type=None) -> dict | None¶
This method resolves the active extractor for a file-like source.
Example with path:
resolved = module_sdk.extractors.resolve(
    path="documents/report.pdf",
    mime_type="application/pdf",
)
Example with in-memory upload metadata:
resolved = module_sdk.extractors.resolve(
    filename="invoice.pdf",
    mime_type="application/pdf",
)
What It Is For¶
Use this when your module needs to know which registered extractor would be selected before actually running extraction.
Typical reasons:
- show the user which extractor will handle a file
- debug why a file is being routed to the wrong extractor
- fail early if no registered extractor matches
- inspect the chosen extractor configuration before running it
Resolution Inputs¶
The method can resolve using:
- path
- filename
- mime_type
If you do not pass mime_type, the facade will try to infer it from path or filename using guess_mime_type(...).
What The Resolver Actually Does¶
Under the hood, the resolver:
- normalizes the MIME type
- infers the MIME type from path or filename when the caller did not pass one
- looks up the explicit MIME binding
- verifies that the bound extractor still exists, is installed/active, and still supports that MIME type
- returns the bound extractor payload
There is no implicit extension-priority fallback in the current contract. A source is extractable only when its MIME type has an explicit binding.
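The resolution steps above can be mirrored over in-memory data as a rough sketch. The function name `resolve_sketch`, the `bindings` mapping (MIME type to extractor id), and the `registry` mapping (extractor id to row) are all hypothetical stand-ins. The real resolver reads its state from the runtime, and its internals may differ.

```python
from __future__ import annotations

import mimetypes
from typing import Any


def resolve_sketch(
    bindings: dict[str, str],
    registry: dict[str, dict[str, Any]],
    path: str | None = None,
    filename: str | None = None,
    mime_type: str | None = None,
) -> dict[str, Any] | None:
    """Return the bound extractor row, or None when nothing matches."""
    # Infer the MIME type from path or filename if the caller passed none.
    if not mime_type:
        source = path or filename
        mime_type = mimetypes.guess_type(source)[0] if source else None
    if not mime_type:
        return None
    mime_type = mime_type.strip().lower()

    # Only an explicit binding counts; there is no extension fallback.
    extractor_id = bindings.get(mime_type)
    if not extractor_id:
        return None

    # The bound extractor must still exist, be usable, and support the type.
    row = registry.get(extractor_id)
    if row is None or row.get("status") not in {"installed", "active"}:
        return None
    if mime_type not in row.get("mime_types", []):
        return None
    return row


# Sample state, invented for illustration.
bindings = {"application/pdf": "docling"}
registry = {
    "docling": {
        "extractor_id": "docling",
        "status": "active",
        "mime_types": ["application/pdf"],
    }
}
resolved = resolve_sketch(bindings, registry, filename="invoice.pdf")
```

Note that an unbound MIME type yields None here even when a registered extractor declares support for it, which matches the contract described above.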
Return Value¶
On success, the method returns the resolved extractor payload.
That payload includes fields such as:
- row_id
- name
- extractor_id
- config
- status
- priority
- file_extensions
- mime_types
- mime_match
- extension_match
If no extractor matches, the method returns None.
Why This Method Matters¶
A module should not guess extractor routing on its own by manually inspecting file extensions.
The runtime already knows:
- which extractor rows are eligible
- which ones are active
- which MIME type bindings are configured
Using resolve(...) keeps your routing logic aligned with the actual runtime decision.
resolve_for_mime_type(mime_type: str) -> dict | None¶
This is a convenience wrapper around resolve(...) for cases where the MIME type is the only signal you have.
Example:
resolved = module_sdk.extractors.resolve_for_mime_type("application/pdf")
What It Is For¶
Use this when your module already has a trusted MIME type and does not need path- or filename-based inference.
This is especially useful in flows where:
- the source is in memory
- the filename is unreliable
- the MIME type was already determined upstream
Why It Exists¶
The method is small, but it makes intent clearer.
Calling resolve_for_mime_type(...) tells the next developer that the MIME type is the routing source of truth in this flow.
guess_mime_type(path_or_name: str) -> str | None¶
This static helper guesses a MIME type from a path or filename using Python’s mimetypes module.
Example:
mime_type = module_sdk.extractors.guess_mime_type("report.pdf")
What It Is For¶
Use this when you only have a filename or path and want a reasonable MIME type hint before calling resolve(...) or extract(...).
Important Limitation¶
This is only an extension-based guess.
It does not inspect file content, so it should be treated as a convenience hint rather than authoritative content detection.
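The limitation is easy to see with the standard library directly. This sketch assumes the helper behaves like `mimetypes.guess_type`; the exact call it wraps is not stated in this section. The file names are invented.

```python
import mimetypes

# The guess depends only on the name, never on the file's bytes.
pdf_guess, _ = mimetypes.guess_type("report.pdf")

# A text file renamed to .pdf still gets a PDF guess.
renamed_guess, _ = mimetypes.guess_type("actually-plain-text.pdf")

# A name without an extension gives no guess at all.
no_extension_guess, _ = mimetypes.guess_type("notes")
```

Here the first two guesses are both "application/pdf" and the third is None, so a wrong or missing extension leads straight to a wrong or missing MIME type.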
Practical Guidance¶
If your module already knows the actual MIME type from the upload or media layer, pass it explicitly instead of relying on guessing.
That makes extractor resolution more deterministic.
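One way to apply this guidance is a small helper that prefers the upstream value and guesses only as a fallback. The name `effective_mime_type` is hypothetical, and the fallback uses `mimetypes.guess_type` as a stand-in for `guess_mime_type(...)`.

```python
from __future__ import annotations

import mimetypes


def effective_mime_type(declared: str | None, name: str | None) -> str | None:
    """Prefer the declared MIME type; guess from the name only as a fallback."""
    if declared:
        return declared.strip().lower()
    if name:
        return mimetypes.guess_type(name)[0]
    return None


from_upload = effective_mime_type("Application/PDF", "scan.bin")
from_name = effective_mime_type(None, "report.pdf")
```

The result can then be passed as mime_type to resolve(...) so the resolver does not have to infer anything.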
Choosing The Right Discovery Method¶
Use:
- list_manifests() when you want discovery from disk
- list_registered() when you want runtime registry state
- list_configurable_mime_bindings() when you want the admin/configuration matrix
- list_ingestible_mime_types() when you need an upload filter
- set_mime_type_binding(...) when an admin action changes the selected extractor for a MIME type
- resolve(...) when you want the actual extractor routing decision
- resolve_for_mime_type(...) when the MIME type is your primary signal
- guess_mime_type(...) only as a fallback hint