Add get-related: chunk-seeded similarity lookup (more-like-this) #395

@RobertLD

Summary

libscope has no way to ask "what else is like this chunk?" — you can only search by a text query. A get-related operation that seeds a similarity search from an existing chunk's stored embedding would enable document navigation, "related reading" suggestions, and richer MCP tool responses without requiring the user to formulate a query.

Current State

  • src/core/search.ts: vectorSearch(db, options, vecBuffer, limit, offset) already accepts a raw Buffer of Float32Array bytes, so the embedding doesn't have to come from provider.embed(). The ANN search at line 599 is fully decoupled from query text.
  • src/core/indexing.ts: embeddings are stored in chunk_embeddings(chunk_id, embedding) as sqlite-vec float vectors.
  • src/core/links.ts: getDocumentLinks(db, documentId) returns { outgoing, incoming }, which could optionally be blended into results.
  • No "more like this" path exists today: every search entry point embeds a text query string.
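
To illustrate why a stored embedding can be reused without calling an embedding provider, here is a minimal sketch of the Float32Array-to-Buffer round trip. The helper names toVecBuffer and fromVecBuffer are hypothetical and not part of libscope, and the sketch assumes the stored value is the raw float32 byte representation of the vector.

```typescript
// Hypothetical helpers (not part of libscope): convert between a Float32Array
// and the raw-bytes Buffer form that a vector search function could accept.

function toVecBuffer(vec: Float32Array): Buffer {
  // Share the underlying memory; honour byteOffset in case vec is a subarray view.
  return Buffer.from(vec.buffer, vec.byteOffset, vec.byteLength);
}

function fromVecBuffer(buf: Buffer): Float32Array {
  // Copy into a fresh, aligned ArrayBuffer first: a Node Buffer may sit at an
  // unaligned offset inside a shared pool, which Float32Array cannot view directly.
  const copy = new Uint8Array(buf.byteLength);
  copy.set(buf);
  return new Float32Array(copy.buffer);
}

const embedding = new Float32Array([0.25, -0.5, 1.0]);
const vecBuffer = toVecBuffer(embedding);      // 3 floats * 4 bytes each
const roundTripped = fromVecBuffer(vecBuffer); // same values as `embedding`
```

Because the bytes are identical in both directions, a buffer read back from storage can be handed to a vector search as-is, with no text query involved.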

Detailed Implementation Plan

Step 1 — Add getRelatedChunks to src/core/search.ts

New exported function:

export interface RelatedChunksOptions {
  chunkId: string;
  limit?: number;           // default 10
  excludeDocumentId?: string; // exclude the source document (default: auto-detected from chunkId)
  topic?: string;
  library?: string;
  tags?: string[];
  minScore?: number;        // default 0.0, filter out low-similarity results
  includeLinkedDocuments?: boolean; // blend in explicit document_links (default false)
}

export interface RelatedChunksResult {
  chunks: SearchResult[];   // reuse existing SearchResult shape
  sourceChunk: {
    id: string;
    documentId: string;
    content: string;
    chunkIndex: number;
  };
}

export async function getRelatedChunks(
  db: Database.Database,
  options: RelatedChunksOptions,
): Promise<RelatedChunksResult>

Implementation:

  1. Fetch source chunk + its embedding:
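// .get() returns an untyped row, so describe the selected columns explicitly.
const row = db.prepare(`
  SELECT c.id, c.document_id, c.content, c.chunk_index,
         ce.embedding
  FROM chunks c
  JOIN chunk_embeddings ce ON ce.chunk_id = c.id
  WHERE c.id = ?
`).get(options.chunkId) as
  | { id: string; document_id: string; content: string; chunk_index: number; embedding: Buffer }
  | undefined;

// The inner JOIN also yields no row when the chunk exists but has no embedding.
if (!row) throw new Error(`Chunk not found or has no stored embedding: ${options.chunkId}`);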
  2. Extract embedding buffer: the stored embedding is already a Buffer from sqlite-vec; pass it directly to vectorSearch() (same path as line 425 in searchDocuments):
const vecBuffer = Buffer.from(row.embedding); // raw float32 bytes, no re-embedding needed
  3. Build search options: reuse the SearchOptions shape but set query to an empty string (query is not used when vecBuffer is provided directly):
const searchOpts: SearchOptions = {
  query: '',
  topic: options.topic,
  library: options.library,
  tags: options.tags,
  limit: (options.limit ?? 10) + 10, // over-fetch to account for exclusions
};
  4. Call vectorSearch directly (bypasses provider.embed() entirely):
const vectorResults = vectorSearch(db, searchOpts, vecBuffer, searchOpts.limit!, 0);
  5. Exclude source document:
const excludeDocId = options.excludeDocumentId ?? row.document_id;
const filtered = vectorResults.results.filter(r => r.documentId !== excludeDocId);
  6. Optional: blend in explicitly linked documents when includeLinkedDocuments: true:
if (options.includeLinkedDocuments) {
  const links = getDocumentLinks(db, row.document_id);
  const linkedIds = new Set([
    ...links.outgoing.map(l => l.targetId),
    ...links.incoming.map(l => l.sourceId),
  ]);
  // Fetch top chunk from each linked document and boost its score
  for (const linkedDocId of linkedIds) {
    if (!filtered.some(r => r.documentId === linkedDocId)) {
      const linkedChunk = fetchTopChunkForDocument(db, linkedDocId); // new helper
      if (linkedChunk) {
        filtered.push({ ...linkedChunk, score: Math.max(linkedChunk.score, 0.6) });
      }
    }
  }
  // Re-sort by score descending
  filtered.sort((a, b) => b.score - a.score);
}
  7. Apply minScore filter and limit:
return {
  chunks: filtered
    .filter(r => r.score >= (options.minScore ?? 0))
    .slice(0, options.limit ?? 10),
  sourceChunk: {
    id: row.id,
    documentId: row.document_id,
    content: row.content,
    chunkIndex: row.chunk_index,
  },
};
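
The post-processing part of the plan (source-document exclusion, link blending, score filter, limit) can be sketched as a pure function over in-memory results, which makes it easy to unit test without a database. ScoredChunk, PostProcessOptions, and postProcessRelated are hypothetical names for illustration; the real SearchResult shape has more fields, and the 0.6 floor simply mirrors the value used in the plan above.

```typescript
// Hypothetical, simplified result shape; the real SearchResult carries more fields.
interface ScoredChunk {
  chunkId: string;
  documentId: string;
  score: number;
}

interface PostProcessOptions {
  excludeDocumentId: string;
  limit?: number;            // default 10
  minScore?: number;         // default 0
  linked?: ScoredChunk[];    // top chunk per linked document, already fetched
  linkedScoreFloor?: number; // default 0.6, mirroring the plan's boost value
}

function postProcessRelated(results: ScoredChunk[], opts: PostProcessOptions): ScoredChunk[] {
  const floor = opts.linkedScoreFloor ?? 0.6;

  // Drop every chunk that belongs to the source document.
  const merged = results.filter(r => r.documentId !== opts.excludeDocumentId);

  // Add linked documents that are not already represented, raising their score to the floor.
  for (const chunk of opts.linked ?? []) {
    // Added guard (an assumption, not in the plan): never re-add the excluded document.
    if (chunk.documentId === opts.excludeDocumentId) continue;
    if (!merged.some(r => r.documentId === chunk.documentId)) {
      merged.push({ ...chunk, score: Math.max(chunk.score, floor) });
    }
  }

  // Highest score first, then apply the threshold and the limit.
  merged.sort((a, b) => b.score - a.score);
  return merged
    .filter(r => r.score >= (opts.minScore ?? 0))
    .slice(0, opts.limit ?? 10);
}

const related = postProcessRelated(
  [
    { chunkId: "c1", documentId: "docA", score: 0.95 }, // source document, excluded
    { chunkId: "c2", documentId: "docB", score: 0.8 },
    { chunkId: "c3", documentId: "docC", score: 0.3 },  // below minScore, dropped
  ],
  {
    excludeDocumentId: "docA",
    minScore: 0.5,
    linked: [{ chunkId: "c4", documentId: "docD", score: 0.2 }], // boosted to 0.6
  },
);
```

Note that with this ordering a linked chunk boosted to 0.6 is still dropped when minScore is above 0.6; whether linked documents should bypass the threshold is a design choice the plan leaves open.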

Step 2 — Add get-related MCP tool

In src/mcp/tools/ (or src/mcp/server.ts if decomposed tools aren't merged yet):

server.tool(
  "get-related",
  "Find chunks semantically similar to a given chunk (more-like-this)",
  {
    chunkId: z.string().describe("ID of the source chunk"),
    limit: z.number().min(1).max(50).optional().describe("Number of results (default 10)"),
    topic: z.string().optional().describe("Filter results to a specific topic"),
    library: z.string().optional().describe("Filter results to a specific library"),
    tags: z.array(z.string()).optional().describe("Filter results to documents with these tags"),
    minScore: z.number().min(0).max(1).optional().describe("Minimum similarity score threshold"),
    includeLinkedDocuments: z.boolean().optional()
      .describe("Also include explicitly linked documents even if below similarity threshold"),
  },
  async ({ chunkId, limit, topic, library, tags, minScore, includeLinkedDocuments }) => {
    const result = await getRelatedChunks(db, {
      chunkId, limit, topic, library, tags, minScore, includeLinkedDocuments,
    });
    return {
      content: [{
        type: "text",
        text: JSON.stringify(result, null, 2),
      }],
    };
  }
);

Step 3 — Expose chunk IDs in existing search results

get-related is only useful if callers can get a chunkId to pass in. The existing SearchResult interface should already return chunkId — verify and ensure it's included in MCP search-docs responses. If not, add it to the serialized output.


Step 4 — CLI command

libscope related <chunkId> [--limit 10] [--topic <topic>] [--library <lib>] [--min-score 0.5]

Reuses the same getRelatedChunks function. Format output similar to libscope search.
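
As a rough sketch of the argument handling for this command, the following uses Node's built-in parseArgs as a stand-in. The CLI framework libscope actually uses is not stated here, so parseRelatedArgs and RelatedCliArgs are hypothetical names, and the defaults (limit 10, min-score 0) are assumptions taken from the option defaults described earlier.

```typescript
import { parseArgs } from "node:util";

// Hypothetical parsed-argument shape for `libscope related`.
interface RelatedCliArgs {
  chunkId: string;
  limit: number;
  topic?: string;
  library?: string;
  minScore: number;
}

function parseRelatedArgs(argv: string[]): RelatedCliArgs {
  const { values, positionals } = parseArgs({
    args: argv,
    allowPositionals: true, // <chunkId> is a positional argument
    options: {
      limit: { type: "string", default: "10" },
      topic: { type: "string" },
      library: { type: "string" },
      "min-score": { type: "string", default: "0" },
    },
  });

  const chunkId = positionals[0];
  if (!chunkId) throw new Error("Usage: libscope related <chunkId> [options]");

  // parseArgs only yields strings and booleans, so numeric flags are converted by hand.
  const limit = Number(values.limit);
  const minScore = Number(values["min-score"]);
  if (!Number.isInteger(limit) || limit < 1) throw new Error(`Invalid --limit: ${values.limit}`);
  if (Number.isNaN(minScore)) throw new Error(`Invalid --min-score: ${values["min-score"]}`);

  return { chunkId, limit, topic: values.topic, library: values.library, minScore };
}

const parsed = parseRelatedArgs(["abc123", "--limit", "5", "--min-score", "0.5"]);
```

The resulting object maps directly onto the options accepted by getRelatedChunks, so the command handler can stay a thin wrapper.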


Step 5 — Tests

  • Unit test: getRelatedChunks with a known chunk returns similar chunks, excludes source document
  • Unit test: minScore filter correctly drops low-similarity results
  • Unit test: includeLinkedDocuments: true includes explicitly linked document chunks
  • Integration test: index 3 documents with known content overlap → verify get-related returns the overlapping doc above the unrelated one
  • Test that passing an unknown chunkId throws a clear error
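
The ranking expectation in the integration test can be prototyped with a toy stand-in that uses brute-force cosine similarity over hand-written vectors instead of a real index. Everything here (ToyChunk, cosineSimilarity, toyGetRelated) is hypothetical, and cosine similarity is an assumption: the metric used by vectorSearch is not stated in this issue.

```typescript
// Toy stand-in for the integration test; no database or embedding provider involved.
interface ToyChunk {
  chunkId: string;
  documentId: string;
  embedding: Float32Array;
}

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("Dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat zero vectors as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

function toyGetRelated(chunks: ToyChunk[], chunkId: string) {
  const source = chunks.find(c => c.chunkId === chunkId);
  if (!source) throw new Error(`Chunk not found: ${chunkId}`);
  return chunks
    .filter(c => c.documentId !== source.documentId) // exclude the source document
    .map(c => ({
      chunkId: c.chunkId,
      documentId: c.documentId,
      score: cosineSimilarity(source.embedding, c.embedding),
    }))
    .sort((a, b) => b.score - a.score); // most similar first
}

// Three "documents": a source, one that points in nearly the same direction,
// and one that is orthogonal to the source.
const corpus: ToyChunk[] = [
  { chunkId: "a1", documentId: "source", embedding: new Float32Array([1, 0, 0]) },
  { chunkId: "b1", documentId: "overlapping", embedding: new Float32Array([0.9, 0.1, 0]) },
  { chunkId: "c1", documentId: "unrelated", embedding: new Float32Array([0, 0, 1]) },
];

const ranked = toyGetRelated(corpus, "a1");
```

A real integration test would index actual documents and call getRelatedChunks, but the assertions keep the same shape: the overlapping document ranks first, the source document is absent, and an unknown chunk ID throws.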

Why vectorSearch directly (not searchDocuments)?

searchDocuments always calls provider.embed(query) first — it's designed around text queries. getRelatedChunks skips that entirely by re-using the already-stored embedding buffer. This means:

  • No embedding API call needed (fast, free)
  • Works offline / without a provider configured
  • Guaranteed to use the same embedding space the chunk was indexed with

Files to Modify

| File | Change |
| --- | --- |
| src/core/search.ts | Add getRelatedChunks(), RelatedChunksOptions, RelatedChunksResult |
| src/mcp/server.ts or src/mcp/tools/search.ts | Add get-related tool |
| src/cli/index.ts or src/cli/commands/search.ts | Add related <chunkId> command |
| src/core/links.ts | Possibly add fetchTopChunkForDocument() helper if includeLinkedDocuments is implemented |
| tests/unit/search.test.ts | Unit tests for getRelatedChunks |
| tests/integration/ | Integration tests |
