feat: add document-to-markdown MCP tool, improve KG structuring & retrieval, and extend blob storage #28
base: main
Conversation
Jurij89 commented Feb 4, 2026
- Add document-to-markdown MCP tool (.pdf, .docx, .pptx) for ingestion → markdown → JSON-LD → DKG publishing
- Improve DKG agent system prompt to structure raw data into knowledge graphs and retrieve data via SPARQL using schema.org and FOAF
- Extend blob storage to support subfolders for better organization
Pull request overview
This PR adds document conversion capabilities to the DKG essentials plugin, enhances the DKG agent's knowledge structuring abilities, and improves blob storage organization.
Changes:
- Adds a new `document-to-markdown` MCP tool that converts PDF, DOCX, and PPTX files to markdown using Mistral OCR
- Significantly expands the DKG agent system prompt with detailed instructions for structuring knowledge graphs, conducting SPARQL queries, and extracting deep knowledge from documents
- Extends blob storage to support subfolder organization through automatic parent directory creation (a minimal sketch follows below)
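The subfolder change in `createFsBlobStorage.ts` isn't shown in this view; as a rough illustration only, a filesystem blob store can support nested blob IDs by creating parent directories before writing. The `putBlob` helper below is a hypothetical sketch, not the PR's actual code:

```ts
import { mkdir, writeFile } from "node:fs/promises";
import { dirname, join } from "node:path";

// Hypothetical sketch: write a blob under a nested ID such as
// "document-to-markdown/<folderId>/img-0.jpeg", creating subfolders as needed.
async function putBlob(rootDir: string, blobId: string, data: Buffer): Promise<void> {
  const filePath = join(rootDir, blobId);
  await mkdir(dirname(filePath), { recursive: true }); // create parent directories
  await writeFile(filePath, data);
}
```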
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/plugin-dkg-essentials/src/plugins/document-to-markdown.ts | Implements the core document-to-markdown conversion tool with Mistral OCR integration, image extraction, and blob storage |
| packages/plugin-dkg-essentials/tests/document-to-markdown.spec.ts | Comprehensive test suite covering tool registration, input validation, file type validation, and blob integration |
| packages/plugin-dkg-essentials/src/index.ts | Exports and initializes the new document-to-markdown plugin |
| packages/plugin-dkg-essentials/src/createFsBlobStorage.ts | Adds parent directory creation to support nested blob paths |
| packages/plugin-dkg-essentials/package.json | Adds Mistral AI SDK and undici dependencies |
| packages/plugin-dkg-essentials/src/plugins/dkg-tools.ts | Updates tool description for clarity |
| apps/agent/src/shared/chat.ts | Major expansion of agent system prompt with detailed knowledge extraction, SPARQL query patterns, and user communication guidelines |
```ts
const imageFilename = `${image.id}`; // ID already includes extension (e.g., "img-0.jpeg")
const imageBlobId = `${BLOB_PREFIX}/${folderId}/${imageFilename}`;
const imageStream = Readable.toWeb(Readable.from(image.data));
await ctx.blob.put(imageBlobId, imageStream, {
  name: imageFilename,
```
Copilot (AI) commented Feb 4, 2026
The variable `imageFilename` is redundant and reduces code clarity. The comment indicates `image.id` already contains the complete filename. The variable should be removed and `image.id` used directly in subsequent code (line 418) to eliminate unnecessary indirection.
Suggested change:

```diff
-const imageFilename = `${image.id}`; // ID already includes extension (e.g., "img-0.jpeg")
-const imageBlobId = `${BLOB_PREFIX}/${folderId}/${imageFilename}`;
-const imageStream = Readable.toWeb(Readable.from(image.data));
-await ctx.blob.put(imageBlobId, imageStream, {
-  name: imageFilename,
+const imageBlobId = `${BLOB_PREFIX}/${folderId}/${image.id}`; // ID already includes extension (e.g., "img-0.jpeg")
+const imageStream = Readable.toWeb(Readable.from(image.data));
+await ctx.blob.put(imageBlobId, imageStream, {
+  name: image.id,
```
```ts
let timeoutId: NodeJS.Timeout;
const timeoutPromise = new Promise<never>((_, reject) => {
  timeoutId = setTimeout(() => {
```
Copilot (AI) commented Feb 4, 2026
The variable `timeoutId` is declared with `let` but assigned inside the Promise constructor, which means it may be undefined when `clearTimeout(timeoutId)` is called on line 345. This creates a race condition where, if the Promise rejects or resolves before the `setTimeout` callback assigns the value, the cleanup will fail. Declare and assign `timeoutId` before creating `timeoutPromise` to ensure it's always defined for cleanup.
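No suggestion was attached to this comment; as a rough sketch of the fix, the timer handle can be created outside the Promise executor so cleanup always sees a defined value. The `withTimeout` helper name and the surrounding `Promise.race` pattern are assumptions, not the PR's actual code:

```ts
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  // The executor runs synchronously, so rejectWithTimeout is assigned
  // before setTimeout is ever called below.
  let rejectWithTimeout!: (reason: Error) => void;
  const timeoutPromise = new Promise<never>((_, reject) => {
    rejectWithTimeout = reject;
  });
  // timeoutId is assigned before the race starts, so clearTimeout always works.
  const timeoutId = setTimeout(
    () => rejectWithTimeout(new Error(`Operation timed out after ${ms}ms`)),
    ms,
  );
  return Promise.race([work, timeoutPromise]).finally(() => clearTimeout(timeoutId));
}
```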
```ts
// Create a base64 string that decodes to > 50MB
// 50MB = 52,428,800 bytes, base64 encoding increases size by ~33%
// So we need about 70MB of base64 data
const largeSizeBytes = 51 * 1024 * 1024; // 51MB
```
Copilot (AI) commented Feb 4, 2026
The comment on lines 250-251 is misleading. The test creates a 51MB buffer and then base64-encodes it (line 255), which actually creates ~68MB of base64 data. However, the tool validates the decoded size (line 520 in document-to-markdown.ts), not the base64 size. The comment should clarify that the 51MB buffer will decode to 51MB, which exceeds the 50MB limit, making the base64 size discussion irrelevant to this validation.
Suggested change:

```diff
-// Create a base64 string that decodes to > 50MB
-// 50MB = 52,428,800 bytes, base64 encoding increases size by ~33%
-// So we need about 70MB of base64 data
-const largeSizeBytes = 51 * 1024 * 1024; // 51MB
+// Create a base64 string whose decoded content is > 50MB
+// 50MB = 52,428,800 bytes; the size check is performed on the decoded bytes
+// So a 51MB buffer is sufficient to exceed the 50MB limit when decoded
+const largeSizeBytes = 51 * 1024 * 1024; // 51MB decoded
```
I am not sure how I should feel about changing so many things in package.lock. Was this really necessary?
```ts
/**
 * Validate that MISTRAL_API_KEY is set
 * @throws Error if API key is missing
 */
function validateMistralApiKey(): string {
```
Why Mistral? Maybe let's have some generic LLM support, or an interface where we can implement multiple LLMs.
Mistral was selected because it is the best performing for PDF → markdown conversion.
But I agree, the way we create new LLM clients is very "all over the place".
```ts
// Apply page range filter if specified
let pages = response.pages;
if (options?.pageStart !== undefined || options?.pageEnd !== undefined) {
  const start = (options.pageStart ?? 1) - 1; // Convert to 0-indexed
```
Should 1 be a constant? e.g. DEFAULT_PAGE_START = 1.
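For illustration, the suggestion might look like the following; `DEFAULT_PAGE_START` is the hypothetical name proposed in the comment, and `toZeroIndexedStart` is an invented helper to keep the example self-contained:

```ts
const DEFAULT_PAGE_START = 1; // pages are 1-indexed in the tool's input

// Convert an optional 1-indexed page start into the 0-indexed value used internally.
function toZeroIndexedStart(pageStart?: number): number {
  return (pageStart ?? DEFAULT_PAGE_START) - 1;
}
```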
```ts
pageStart: z
  .number()
  .optional()
  .describe("First page to process (1-indexed)"),
pageEnd: z
```
We should add some validation for the input params, like page start not being bigger than page end, whether values can be negative, etc.
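One way to express these constraints is with zod's built-in checks plus a cross-field `refine`. This is a sketch under the assumption that the fields above live in a `z.object`, not the PR's actual schema:

```ts
import { z } from "zod";

const pageRangeSchema = z
  .object({
    pageStart: z.number().int().min(1).optional()
      .describe("First page to process (1-indexed)"),
    pageEnd: z.number().int().min(1).optional()
      .describe("Last page to process (1-indexed)"),
  })
  // Reject ranges where the start page comes after the end page.
  .refine(
    (v) => v.pageStart === undefined || v.pageEnd === undefined || v.pageStart <= v.pageEnd,
    { message: "pageStart must not be greater than pageEnd" },
  );
```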
```diff
@@ -194,31 +194,348 @@ export const DEFAULT_SYSTEM_PROMPT = `
 You are a DKG Agent that helps users interact with the OriginTrail Decentralized Knowledge Graph (DKG) using available Model Context Protocol (MCP) tools.
```
Perhaps we should revisit the whole prompt and try to narrow it down somewhat? This could influence the LLM's context quite a lot, especially for other tools which heavily rely on the LLM for final processing, like retrieval for example.
Tried the prompt quite a bit and it works pretty well. Also checked and it's only 4k tokens; with most models now having 200k or 400k context windows, I don't see this as an issue, I think?
Hmm, in this context I see document-to-markdown tool instructions, even though that is a plugin which is not always included? Or am I missing something?
This tool is part of the DKG essentials plugin, which contains the basic tools that enable the DKG agent (DKG interaction tools + this basic document processing), so yes, the plugin is always included and comes out of the box.
- Extract DocumentConversionProvider interface for pluggable OCR backends
- Move Mistral-specific code to providers/mistral.ts
- Add provider registry with factory pattern (createProvider, getAvailableProviders)
- Split into modular structure: types, validation, blob-integration, providers
- Support runtime provider selection via DOCUMENT_CONVERSION_PROVIDER env var
- Add createDocumentToMarkdownPlugin() factory for custom configuration
- Update tests with provider registry and mock provider tests
- Update documentation with provider abstraction usage
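The interface itself isn't shown in this view. A rough sketch of what the abstraction could look like, reusing the names from the commit message above but with assumed method signatures:

```ts
export interface DocumentConversionProvider {
  /** Provider identifier, e.g. "mistral". */
  name: string;
  /** Convert a document to markdown plus any extracted images (signature assumed). */
  convertToMarkdown(
    file: { data: Buffer; filename: string },
    options?: { pageStart?: number; pageEnd?: number },
  ): Promise<{ markdown: string; images: Array<{ id: string; data: Buffer }> }>;
}

// Registry with factory lookup; runtime selection via the env var named in the commit.
const registry = new Map<string, () => DocumentConversionProvider>();

export function registerProvider(
  name: string,
  factory: () => DocumentConversionProvider,
): void {
  registry.set(name, factory);
}

export function getAvailableProviders(): string[] {
  return [...registry.keys()];
}

export function createProvider(
  name = process.env.DOCUMENT_CONVERSION_PROVIDER ?? "mistral",
): DocumentConversionProvider {
  const factory = registry.get(name);
  if (!factory) {
    throw new Error(
      `Unknown document conversion provider "${name}". ` +
        `Available: ${getAvailableProviders().join(", ")}`,
    );
  }
  return factory();
}
```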
```ts
  `Document-to-markdown plugin initialized with provider: ${provider.name}`,
);

mcp.registerTool(
```
We could register an API endpoint too.
```ts
} catch (error) {
  // Log warning but don't fail plugin initialization
  // Provider errors will surface when the tool is actually used
  console.warn(
    `Document conversion provider "${providerName}" initialization deferred: ` +
      `${error instanceof Error ? error.message : String(error)}`,
  );
  // Create a lazy provider that initializes on first use
  provider = createLazyProvider(providerName);
}
```
Why load a lazy provider if the tool will fail without a valid provider later on? We should fail fast here with a constructive error; it's easier to debug and less code to maintain.
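As a sketch of the fail-fast alternative, initialization could surface a constructive error immediately instead of deferring it. Here `createProvider` and `DocumentConversionProvider` follow the sketch above, and the error wording is only illustrative:

```ts
function initProviderOrFail(providerName: string): DocumentConversionProvider {
  try {
    return createProvider(providerName);
  } catch (error) {
    // Fail fast at plugin initialization with an actionable message,
    // rather than registering a lazy provider that fails on first use.
    throw new Error(
      `Failed to initialize document conversion provider "${providerName}": ` +
        `${error instanceof Error ? error.message : String(error)}. ` +
        `Check that the provider is registered and its credentials are configured.`,
    );
  }
}
```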