Summary
libscope has no way to ask "what else is like this chunk?" — you can only search by a text query. A get-related operation that seeds a similarity search from an existing chunk's stored embedding would enable document navigation, "related reading" suggestions, and richer MCP tool responses without requiring the user to formulate a query.
Current State
- `src/core/search.ts` — `vectorSearch(db, options, vecBuffer, limit, offset)` already accepts a raw buffer (`Float32Array`) — the embedding doesn't have to come from `provider.embed()`. The ANN search at line 599 is fully decoupled from query text.
- `src/core/indexing.ts` — embeddings are stored in `chunk_embeddings(chunk_id, embedding)` as sqlite-vec float vectors.
- `src/core/links.ts` — `getDocumentLinks(db, documentId)` returns `{ outgoing, incoming }`, which could optionally be blended into results.
- No "more like this" path exists today — all search entry points embed a text query string.
Detailed Implementation Plan
Step 1 — Add getRelatedChunks to src/core/search.ts
New exported function:
```typescript
export interface RelatedChunksOptions {
  chunkId: string;
  limit?: number;                    // default 10
  excludeDocumentId?: string;        // exclude the source document (default: auto-detected from chunkId)
  topic?: string;
  library?: string;
  tags?: string[];
  minScore?: number;                 // default 0.0, filter out low-similarity results
  includeLinkedDocuments?: boolean;  // blend in explicit document_links (default false)
}

export interface RelatedChunksResult {
  chunks: SearchResult[]; // reuse existing SearchResult shape
  sourceChunk: {
    id: string;
    documentId: string;
    content: string;
    chunkIndex: number;
  };
}

export async function getRelatedChunks(
  db: Database.Database,
  options: RelatedChunksOptions,
): Promise<RelatedChunksResult>
```

Implementation:
- Fetch source chunk + its embedding:
```typescript
const row = db.prepare(`
  SELECT c.id, c.document_id, c.content, c.chunk_index,
         ce.embedding
  FROM chunks c
  JOIN chunk_embeddings ce ON ce.chunk_id = c.id
  WHERE c.id = ?
`).get(options.chunkId);
if (!row) throw new Error(`Chunk not found: ${options.chunkId}`);
```

- Extract the embedding buffer — the stored embedding is already a `Buffer` from sqlite-vec; pass it directly to `vectorSearch()` (same path as line 425 in `searchDocuments`):

```typescript
const vecBuffer = Buffer.from(row.embedding); // already Float32Array bytes
```

- Build search options — reuse the `SearchOptions` shape but set `query` to an empty string (the query is not used when `vecBuffer` is provided directly):
```typescript
const searchOpts: SearchOptions = {
  query: '',
  topic: options.topic,
  library: options.library,
  tags: options.tags,
  limit: (options.limit ?? 10) + 10, // over-fetch to account for exclusions
};
```

- Call `vectorSearch` directly (bypasses `provider.embed()` entirely):

```typescript
const vectorResults = vectorSearch(db, searchOpts, vecBuffer, searchOpts.limit!, 0);
```

- Exclude the source document:

```typescript
const excludeDocId = options.excludeDocumentId ?? row.document_id;
const filtered = vectorResults.results.filter(r => r.documentId !== excludeDocId);
```

- Optional: blend in explicitly linked documents when `includeLinkedDocuments: true`:
```typescript
if (options.includeLinkedDocuments) {
  const links = getDocumentLinks(db, row.document_id);
  const linkedIds = new Set([
    ...links.outgoing.map(l => l.targetId),
    ...links.incoming.map(l => l.sourceId),
  ]);
  // Fetch the top chunk from each linked document and boost its score
  for (const linkedDocId of linkedIds) {
    if (!filtered.some(r => r.documentId === linkedDocId)) {
      const linkedChunk = fetchTopChunkForDocument(db, linkedDocId); // new helper
      if (linkedChunk) {
        filtered.push({ ...linkedChunk, score: Math.max(linkedChunk.score, 0.6) });
      }
    }
  }
  // Re-sort by score descending
  filtered.sort((a, b) => b.score - a.score);
}
```

- Apply the `minScore` filter and limit:
```typescript
return {
  chunks: filtered
    .filter(r => r.score >= (options.minScore ?? 0))
    .slice(0, options.limit ?? 10),
  sourceChunk: {
    id: row.id,
    documentId: row.document_id,
    content: row.content,
    chunkIndex: row.chunk_index,
  },
};
```

Step 2 — Add get-related MCP tool
In `src/mcp/tools/` (or `src/mcp/server.ts` if the decomposed tools aren't merged yet):
```typescript
server.tool(
  "get-related",
  "Find chunks semantically similar to a given chunk (more-like-this)",
  {
    chunkId: z.string().describe("ID of the source chunk"),
    limit: z.number().min(1).max(50).optional().describe("Number of results (default 10)"),
    topic: z.string().optional().describe("Filter results to a specific topic"),
    library: z.string().optional().describe("Filter results to a specific library"),
    tags: z.array(z.string()).optional().describe("Filter results to documents with these tags"),
    minScore: z.number().min(0).max(1).optional().describe("Minimum similarity score threshold"),
    includeLinkedDocuments: z.boolean().optional()
      .describe("Also include explicitly linked documents even if below similarity threshold"),
  },
  async ({ chunkId, limit, topic, library, tags, minScore, includeLinkedDocuments }) => {
    const result = await getRelatedChunks(db, {
      chunkId, limit, topic, library, tags, minScore, includeLinkedDocuments,
    });
    return {
      content: [{
        type: "text",
        text: JSON.stringify(result, null, 2),
      }],
    };
  },
);
```

Step 3 — Expose chunk IDs in existing search results
`get-related` is only useful if callers can obtain a `chunkId` to pass in. The existing `SearchResult` interface should already carry `chunkId` — verify that it is included in MCP `search-docs` responses, and if not, add it to the serialized output.
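One way to make that check durable is to serialize the payload through a function that names `chunkId` explicitly. A minimal sketch; `toToolResult` and the `SearchResultLike` fields other than `chunkId` are hypothetical, not existing libscope code:

```typescript
// Hypothetical guard: serialize search results field-by-field so a
// refactor of SearchResult cannot silently drop chunkId from the
// search-docs MCP response. Field names besides chunkId are assumptions.
interface SearchResultLike {
  chunkId: string;
  documentId: string;
  content: string;
  score: number;
}

function toToolResult(results: SearchResultLike[]): string {
  return JSON.stringify(
    results.map(({ chunkId, documentId, content, score }) => ({
      chunkId, documentId, content, score,
    })),
    null,
    2,
  );
}
```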
Step 4 — CLI command
```
libscope related <chunkId> [--limit 10] [--topic <topic>] [--library <lib>] [--min-score 0.5]
```

Reuses the same `getRelatedChunks` function. Format output similarly to `libscope search`.
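The flag handling above can be sketched as a small pure parser (no CLI framework assumed; the real command would feed the parsed options into `getRelatedChunks()`):

```typescript
// Sketch of the flag parsing only; flag names mirror the command shown
// above. RelatedCliOptions is a local assumption, not existing code.
interface RelatedCliOptions {
  chunkId: string;
  limit?: number;
  topic?: string;
  library?: string;
  minScore?: number;
}

function parseRelatedArgs(argv: string[]): RelatedCliOptions {
  const [chunkId, ...rest] = argv;
  if (!chunkId) throw new Error("usage: libscope related <chunkId> [--limit n] ...");
  const opts: RelatedCliOptions = { chunkId };
  for (let i = 0; i < rest.length; i += 2) {
    const flag = rest[i];
    const value = rest[i + 1];
    if (value === undefined) throw new Error(`Missing value for ${flag}`);
    switch (flag) {
      case "--limit": opts.limit = Number(value); break;
      case "--topic": opts.topic = value; break;
      case "--library": opts.library = value; break;
      case "--min-score": opts.minScore = Number(value); break;
      default: throw new Error(`Unknown flag: ${flag}`);
    }
  }
  return opts;
}
```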
Step 5 — Tests
- Unit test: `getRelatedChunks` with a known chunk returns similar chunks and excludes the source document
- Unit test: the `minScore` filter correctly drops low-similarity results
- Unit test: `includeLinkedDocuments: true` includes explicitly linked document chunks
- Integration test: index 3 documents with known content overlap → verify `get-related` ranks the overlapping doc above the unrelated one
- Unit test: passing an unknown `chunkId` throws a clear error
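The `minScore` case can be exercised without a database by testing the pure post-filter step from Step 1; extracting it as a standalone helper here is an assumption for illustration, not existing libscope code:

```typescript
// Sketch: the minScore + limit tail of getRelatedChunks as a pure
// function, unit-testable with in-memory fixtures. Assumes results
// arrive already sorted by score descending, as vectorSearch returns them.
interface Scored { chunkId: string; score: number; }

function applyMinScoreAndLimit(results: Scored[], minScore = 0, limit = 10): Scored[] {
  return results
    .filter(r => r.score >= minScore)
    .slice(0, limit);
}

// Fixture mirroring the unit-test cases above
const scoredFixture: Scored[] = [
  { chunkId: "a", score: 0.92 },
  { chunkId: "b", score: 0.41 },
  { chunkId: "c", score: 0.12 },
];
```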
Why vectorSearch directly (not searchDocuments)?
`searchDocuments` always calls `provider.embed(query)` first — it's designed around text queries. `getRelatedChunks` skips that entirely by reusing the already-stored embedding buffer. This means:
- No embedding API call needed (fast, free)
- Works offline / without a provider configured
- Guaranteed to use the same embedding space the chunk was indexed with
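The "same embedding space" point rests on the stored bytes round-tripping losslessly. A self-contained sketch of the `Float32Array`/`Buffer` conversion the plan relies on (the variable names are illustrative):

```typescript
// A stored sqlite-vec embedding is just the raw Float32Array bytes, so
// Buffer.from(row.embedding) recovers the exact vector that was indexed:
// no re-embedding, no provider call.
const original = new Float32Array([0.1, -0.5, 0.9]);

// What indexing effectively stores: the raw float32 bytes
const stored = Buffer.from(original.buffer, original.byteOffset, original.byteLength);

// What getRelatedChunks would hand to vectorSearch(): a copy of those bytes
const vecBuffer = Buffer.from(stored);

// Decode back to floats to confirm the round trip is lossless
const roundTripped = new Float32Array(
  vecBuffer.buffer, vecBuffer.byteOffset, vecBuffer.byteLength / 4,
);
```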
Files to Modify
| File | Change |
|---|---|
| `src/core/search.ts` | Add `getRelatedChunks()`, `RelatedChunksOptions`, `RelatedChunksResult` |
| `src/mcp/server.ts` or `src/mcp/tools/search.ts` | Add `get-related` tool |
| `src/cli/index.ts` or `src/cli/commands/search.ts` | Add `related <chunkId>` command |
| `src/core/links.ts` | Possibly add `fetchTopChunkForDocument()` helper if `includeLinkedDocuments` is implemented |
| `tests/unit/search.test.ts` | Unit tests for `getRelatedChunks` |
| `tests/integration/` | Integration tests |
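For reference, the `fetchTopChunkForDocument()` helper named in the table might look like the sketch below. It is hypothetical: the column names follow the schema shown in Step 1, the `Database`/`Statement` typing is structural so the sketch does not hard-depend on better-sqlite3, and "top chunk" is read here as the document's first chunk (returned with score 0 so the caller's `Math.max(score, 0.6)` floor applies):

```typescript
// Structural typing: anything with prepare(sql).get(...) works,
// including a better-sqlite3 Database or a test stub.
interface StatementLike { get(...params: unknown[]): unknown; }
interface DatabaseLike { prepare(sql: string): StatementLike; }

interface ChunkHit {
  chunkId: string;
  documentId: string;
  content: string;
  score: number;
}

// Returns the document's first chunk, or null if the document has none.
function fetchTopChunkForDocument(db: DatabaseLike, documentId: string): ChunkHit | null {
  const row = db.prepare(`
    SELECT id, document_id, content
    FROM chunks
    WHERE document_id = ?
    ORDER BY chunk_index ASC
    LIMIT 1
  `).get(documentId) as { id: string; document_id: string; content: string } | undefined;
  if (!row) return null;
  return { chunkId: row.id, documentId: row.document_id, content: row.content, score: 0 };
}
```

Ranking linked-document chunks by actual similarity to the source embedding would be more precise, but requires another vector query per linked document; the flat 0.6 floor in Step 1 sidesteps that cost.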