feat: add document-to-markdown MCP tool, improve KG structuring & retrieval, and extend blob storage #28
base: main
Conversation
Jurij89 commented Feb 4, 2026
- Add document-to-markdown MCP tool (.pdf, .docx, .pptx) for ingestion → markdown → JSON-LD → DKG publishing
- Improve DKG agent system prompt to structure raw data into knowledge graphs and retrieve data via SPARQL using schema.org and FOAF
- Extend blob storage to support subfolders for better organization
Pull request overview
This PR adds document conversion capabilities to the DKG essentials plugin, enhances the DKG agent's knowledge structuring abilities, and improves blob storage organization.
Changes:
- Adds a new `document-to-markdown` MCP tool that converts PDF, DOCX, and PPTX files to markdown using Mistral OCR
- Significantly expands the DKG agent system prompt with detailed instructions for structuring knowledge graphs, conducting SPARQL queries, and extracting deep knowledge from documents
- Extends blob storage to support subfolder organization through automatic parent directory creation (a minimal sketch follows below)
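The subfolder change in `createFsBlobStorage.ts` isn't shown in this view; as a rough illustration only, a filesystem blob store can support nested blob IDs by creating parent directories before writing. The `putBlob` helper below is a hypothetical sketch, not the PR's actual code:

```ts
import { mkdir, writeFile } from "node:fs/promises";
import { dirname, join } from "node:path";

// Hypothetical sketch: write a blob under a nested ID such as
// "document-to-markdown/<folderId>/img-0.jpeg", creating subfolders as needed.
async function putBlob(rootDir: string, blobId: string, data: Buffer): Promise<void> {
  const filePath = join(rootDir, blobId);
  await mkdir(dirname(filePath), { recursive: true }); // create parent directories
  await writeFile(filePath, data);
}
```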
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts | Implements the core document-to-markdown conversion tool with Mistral OCR integration, image extraction, and blob storage |
| packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts | Comprehensive test suite covering tool registration, input validation, file type validation, and blob integration |
| packages/plugin-dkg-essentials/src/index.ts | Exports and initializes the new document-to-markdown plugin |
| packages/plugin-dkg-essentials/src/createFsBlobStorage.ts | Adds parent directory creation to support nested blob paths |
| packages/plugin-dkg-essentials/package.json | Adds Mistral AI SDK and undici dependencies |
| packages/plugin-dkg-essentials/src/plugins/dkg-tools.ts | Updates tool description for clarity |
| apps/agent/src/shared/chat.ts | Major expansion of agent system prompt with detailed knowledge extraction, SPARQL query patterns, and user communication guidelines |
```ts
const imageFilename = `${image.id}`; // ID already includes extension (e.g., "img-0.jpeg")
const imageBlobId = `${BLOB_PREFIX}/${folderId}/${imageFilename}`;
const imageStream = Readable.toWeb(Readable.from(image.data));
await ctx.blob.put(imageBlobId, imageStream, {
  name: imageFilename,
```
Copilot (AI) commented Feb 4, 2026
The variable `imageFilename` is redundant and reduces code clarity. The comment indicates `image.id` already contains the complete filename. The variable should be removed and `image.id` used directly in subsequent code (line 418) to eliminate unnecessary indirection.
Suggested change:

```diff
-const imageFilename = `${image.id}`; // ID already includes extension (e.g., "img-0.jpeg")
-const imageBlobId = `${BLOB_PREFIX}/${folderId}/${imageFilename}`;
-const imageStream = Readable.toWeb(Readable.from(image.data));
-await ctx.blob.put(imageBlobId, imageStream, {
-  name: imageFilename,
+const imageBlobId = `${BLOB_PREFIX}/${folderId}/${image.id}`; // ID already includes extension (e.g., "img-0.jpeg")
+const imageStream = Readable.toWeb(Readable.from(image.data));
+await ctx.blob.put(imageBlobId, imageStream, {
+  name: image.id,
```
```ts
let timeoutId: NodeJS.Timeout;
const timeoutPromise = new Promise<never>((_, reject) => {
  timeoutId = setTimeout(() => {
```
Copilot (AI) commented Feb 4, 2026
The variable `timeoutId` is declared with `let` but assigned inside the Promise constructor, which means it may be undefined when `clearTimeout(timeoutId)` is called on line 345. This creates a race condition where, if the Promise rejects or resolves before the `setTimeout` callback assigns the value, the cleanup will fail. Declare and assign `timeoutId` before creating `timeoutPromise` to ensure it's always defined for cleanup.
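No suggestion was attached to this comment; as a rough sketch of the fix, the timer handle can be created outside the Promise executor so cleanup always sees a defined value. The `withTimeout` helper name and the surrounding `Promise.race` pattern are assumptions, not the PR's actual code:

```ts
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  // The executor runs synchronously, so rejectWithTimeout is assigned
  // before setTimeout is ever called below.
  let rejectWithTimeout!: (reason: Error) => void;
  const timeoutPromise = new Promise<never>((_, reject) => {
    rejectWithTimeout = reject;
  });
  // timeoutId is assigned before the race starts, so clearTimeout always works.
  const timeoutId = setTimeout(
    () => rejectWithTimeout(new Error(`Operation timed out after ${ms}ms`)),
    ms,
  );
  return Promise.race([work, timeoutPromise]).finally(() => clearTimeout(timeoutId));
}
```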
```ts
// Create a base64 string that decodes to > 50MB
// 50MB = 52,428,800 bytes, base64 encoding increases size by ~33%
// So we need about 70MB of base64 data
const largeSizeBytes = 51 * 1024 * 1024; // 51MB
```
Copilot (AI) commented Feb 4, 2026
The comment on lines 250-251 is misleading. The test creates a 51MB buffer and then base64-encodes it (line 255), which actually creates ~68MB of base64 data. However, the tool validates the decoded size (line 520 in document-to-markdown.ts), not the base64 size. The comment should clarify that the 51MB buffer will decode to 51MB, which exceeds the 50MB limit, making the base64 size discussion irrelevant to this validation.
Suggested change:

```diff
-// Create a base64 string that decodes to > 50MB
-// 50MB = 52,428,800 bytes, base64 encoding increases size by ~33%
-// So we need about 70MB of base64 data
-const largeSizeBytes = 51 * 1024 * 1024; // 51MB
+// Create a base64 string whose decoded content is > 50MB
+// 50MB = 52,428,800 bytes; the size check is performed on the decoded bytes
+// So a 51MB buffer is sufficient to exceed the 50MB limit when decoded
+const largeSizeBytes = 51 * 1024 * 1024; // 51MB decoded
```
I am not sure how I should feel about changing so many things in package.lock. Was this really necessary?
```ts
/**
 * Validate that MISTRAL_API_KEY is set
 * @throws Error if API key is missing
 */
function validateMistralApiKey(): string {
```
Why Mistral? Maybe let's have some generic LLM support, or an interface where we can implement multiple LLMs.
Mistral was selected because it is the best performing for PDF → markdown conversion.
But I agree, the way we create new LLM clients is very "all over the place".
```ts
// Apply page range filter if specified
let pages = response.pages;
if (options?.pageStart !== undefined || options?.pageEnd !== undefined) {
  const start = (options.pageStart ?? 1) - 1; // Convert to 0-indexed
```
Should 1 be a constant? e.g. DEFAULT_PAGE_START = 1.
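For illustration, the suggestion might look like the following; `DEFAULT_PAGE_START` is the hypothetical name proposed in the comment, and `toZeroIndexedStart` is an invented helper to keep the example self-contained:

```ts
const DEFAULT_PAGE_START = 1; // pages are 1-indexed in the tool's input

// Convert an optional 1-indexed page start into the 0-indexed value used internally.
function toZeroIndexedStart(pageStart?: number): number {
  return (pageStart ?? DEFAULT_PAGE_START) - 1;
}
```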
```ts
pageStart: z
  .number()
  .optional()
  .describe("First page to process (1-indexed)"),
pageEnd: z
```
We should add some validation for the input params, like page start not being bigger than page end, whether values can be negative, etc.
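One way to express these constraints is with zod's built-in checks plus a cross-field `refine`. This is a sketch under the assumption that the fields above live in a `z.object`, not the PR's actual schema:

```ts
import { z } from "zod";

const pageRangeSchema = z
  .object({
    pageStart: z.number().int().min(1).optional()
      .describe("First page to process (1-indexed)"),
    pageEnd: z.number().int().min(1).optional()
      .describe("Last page to process (1-indexed)"),
  })
  // Reject ranges where the start page comes after the end page.
  .refine(
    (v) => v.pageStart === undefined || v.pageEnd === undefined || v.pageStart <= v.pageEnd,
    { message: "pageStart must not be greater than pageEnd" },
  );
```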
```diff
@@ -194,31 +194,348 @@ export const DEFAULT_SYSTEM_PROMPT = `
 You are a DKG Agent that helps users interact with the OriginTrail Decentralized Knowledge Graph (DKG) using available Model Context Protocol (MCP) tools.
```
Perhaps we should revisit the whole prompt and try to narrow it down somewhat? This could influence the LLM's context quite a lot, especially for other tools which heavily rely on the LLM for final processing, like retrieval for example.
Tried the prompt quite a bit and it works pretty well. Also checked and it's only 4k tokens; with most models now having 200k or 400k context windows, I don't see this as an issue, I think?
Hmm, in this context I see document-to-markdown tool instructions, even though that is a plugin which is not always included? Or am I missing something?
This tool is part of the DKG essentials plugin, which contains the basic tools that enable the DKG agent (DKG interaction tools + this basic document processing), so yes, the plugin is always included and comes out of the box.
- Extract DocumentConversionProvider interface for pluggable OCR backends
- Move Mistral-specific code to providers/mistral.ts
- Add provider registry with factory pattern (createProvider, getAvailableProviders)
- Split into modular structure: types, validation, blob-integration, providers
- Support runtime provider selection via DOCUMENT_CONVERSION_PROVIDER env var
- Add createDocumentToMarkdownPlugin() factory for custom configuration
- Update tests with provider registry and mock provider tests
- Update documentation with provider abstraction usage
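The interface itself isn't shown in this view. A rough sketch of what the abstraction could look like, reusing the names from the commit message above but with assumed method signatures:

```ts
export interface DocumentConversionProvider {
  /** Provider identifier, e.g. "mistral". */
  name: string;
  /** Convert a document to markdown plus any extracted images (signature assumed). */
  convertToMarkdown(
    file: { data: Buffer; filename: string },
    options?: { pageStart?: number; pageEnd?: number },
  ): Promise<{ markdown: string; images: Array<{ id: string; data: Buffer }> }>;
}

// Registry with factory lookup; runtime selection via the env var named in the commit.
const registry = new Map<string, () => DocumentConversionProvider>();

export function registerProvider(
  name: string,
  factory: () => DocumentConversionProvider,
): void {
  registry.set(name, factory);
}

export function getAvailableProviders(): string[] {
  return [...registry.keys()];
}

export function createProvider(
  name = process.env.DOCUMENT_CONVERSION_PROVIDER ?? "mistral",
): DocumentConversionProvider {
  const factory = registry.get(name);
  if (!factory) {
    throw new Error(
      `Unknown document conversion provider "${name}". ` +
        `Available: ${getAvailableProviders().join(", ")}`,
    );
  }
  return factory();
}
```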
```ts
  `Document-to-markdown plugin initialized with provider: ${provider.name}`,
);

mcp.registerTool(
```
We could register an API endpoint too.
```ts
} catch (error) {
  // Log warning but don't fail plugin initialization
  // Provider errors will surface when the tool is actually used
  console.warn(
    `Document conversion provider "${providerName}" initialization deferred: ` +
      `${error instanceof Error ? error.message : String(error)}`,
  );
  // Create a lazy provider that initializes on first use
  provider = createLazyProvider(providerName);
}
```
Why load a lazy provider if the tool will fail without a valid provider later on? We should fail fast here with a constructive error; it's easier to debug and less code to maintain.
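As a sketch of the fail-fast alternative, initialization could surface a constructive error immediately instead of deferring it. Here `createProvider` and `DocumentConversionProvider` follow the sketch above, and the error wording is only illustrative:

```ts
function initProviderOrFail(providerName: string): DocumentConversionProvider {
  try {
    return createProvider(providerName);
  } catch (error) {
    // Fail fast at plugin initialization with an actionable message,
    // rather than registering a lazy provider that fails on first use.
    throw new Error(
      `Failed to initialize document conversion provider "${providerName}": ` +
        `${error instanceof Error ? error.message : String(error)}. ` +
        `Check that the provider is registered and its credentials are configured.`,
    );
  }
}
```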