Auto-detect document links from markdown, wikilinks, and HTML #394

@RobertLD

Summary

libscope has a document_links table and manual link-documents MCP tool, but no automatic detection of links from document content. The link graph stays empty unless users manually curate it, and the knowledge graph visualization ignores document_links entirely (using only embedding similarity edges).

Current State

  • src/core/link-extractor.ts: extractLinks(html, baseUrl): string[] exists for HTML <a> tags, but is used only by the web spider for crawling, not for link graph population
  • src/connectors/obsidian.ts: already parses [[wikilinks]] into a wikilinks: string[] array and converts them to markdown format in the body text, but never calls createLink()
  • src/core/links.ts: createLink(db, sourceId, targetId, linkType, label?) is ready to use
  • src/core/graph.ts: GraphEdge.type only supports "belongs_to_topic" | "has_tag" | "similar_to"; document_links rows are never read for graph edges

Detailed Implementation Plan

Step 1 — Extend link-extractor.ts to handle markdown and wikilinks

Add two new extraction functions alongside the existing extractLinks:

// Extract [text](url) markdown links — return absolute URLs or relative paths
export function extractMarkdownLinks(content: string): string[]

// Extract [[PageName]] and [[PageName|display]] wikilinks — return raw page names
export function extractWikilinks(content: string): string[]

The wikilink regex already exists in obsidian.ts (line 150): /(?<!!)\[\[([^\]|]+)(?:\|([^\]]*))?\]\]/g — move it here as the canonical implementation.

For markdown links, use a simple [text](href) regex, filtering out image links (![...](...)), mailto:, and anchor-only (#...) hrefs.
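The two extractors described above could look roughly like the following. This is a minimal sketch, not the final implementation: the markdown link regex, the optional-title handling, and the de-duplication are assumptions, and only the wikilink regex is taken from the issue text.

```typescript
// Sketch only. The markdown regex and the filtering rules are assumptions
// derived from the plan above, not existing libscope code.

// Extract [text](href) links, skipping images, mailto: and anchor-only hrefs.
function extractMarkdownLinks(content: string): string[] {
  // (?<!!) skips image links such as ![alt](pic.png).
  // An optional "title" after the href is tolerated and discarded.
  const re = /(?<!!)\[([^\]]*)\]\(([^)\s]+)(?:\s+"[^"]*")?\)/g;
  const hrefs = new Set<string>();
  let m: RegExpExecArray | null;
  while ((m = re.exec(content)) !== null) {
    const href = (m[2] ?? "").trim();
    if (!href) continue;
    if (href.startsWith("#")) continue; // anchor-only link
    if (href.toLowerCase().startsWith("mailto:")) continue;
    hrefs.add(href);
  }
  return Array.from(hrefs);
}

// Extract [[PageName]] and [[PageName|display]] wikilinks as raw page names.
function extractWikilinks(content: string): string[] {
  // Regex quoted in the issue; (?<!!) skips ![[embeds]].
  const re = /(?<!!)\[\[([^\]|]+)(?:\|([^\]]*))?\]\]/g;
  const names = new Set<string>();
  let m: RegExpExecArray | null;
  while ((m = re.exec(content)) !== null) {
    const name = (m[1] ?? "").trim();
    if (name) names.add(name);
  }
  return Array.from(names);
}
```

Returning de-duplicated arrays keeps the caller from attempting the same link twice for one document.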


Step 2 — Add link resolution to src/core/links.ts

Add two new functions, one that resolves a URL and one that resolves a title string, to a known document ID:

export function resolveDocumentByUrl(db: Database.Database, url: string): string | null
export function resolveDocumentByTitle(db: Database.Database, title: string): string | null

Both do a simple DB lookup:

  • resolveDocumentByUrl: exact match on documents.url (normalize trailing slash/fragment first)
  • resolveDocumentByTitle: case-insensitive match on documents.title; for wikilinks, also try slug matching (title.toLowerCase().replace(/\s+/g, '-'))

Add a combined helper:

export function resolveDocumentLink(
  db: Database.Database,
  linkTarget: string, // URL or wikilink name
  sourceUrl?: string,  // for relative URL resolution
): string | null
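The normalization and fallback order behind these resolvers might be sketched as below. Everything here is hypothetical: normalizeUrlForLookup, titleToSlug, and resolveLinkTarget are illustrative names, and the two lookups are passed in as plain functions so the sketch does not assume anything about the real database queries.

```typescript
// Sketch only: helper names are hypothetical, and the DB lookups are injected
// as functions instead of real queries against the documents table.

type Lookup = (key: string) => string | null;

// Resolve href against an optional base, drop the fragment and a trailing slash.
// Returns null when href is not a valid URL and no usable base is given.
function normalizeUrlForLookup(href: string, baseUrl?: string): string | null {
  let parsed: URL;
  try {
    parsed = baseUrl ? new URL(href, baseUrl) : new URL(href);
  } catch {
    return null;
  }
  parsed.hash = "";
  if (parsed.pathname.length > 1 && parsed.pathname.endsWith("/")) {
    parsed.pathname = parsed.pathname.slice(0, -1);
  }
  return parsed.toString();
}

// Slug form from the plan: lowercase, whitespace runs replaced by "-".
function titleToSlug(title: string): string {
  return title.trim().toLowerCase().replace(/\s+/g, "-");
}

// Try the URL lookup first, then fall back to the title lookup.
function resolveLinkTarget(
  target: string,
  byUrl: Lookup,
  byTitle: Lookup,
  sourceUrl?: string,
): string | null {
  const url = normalizeUrlForLookup(target, sourceUrl);
  if (url !== null) {
    const id = byUrl(url);
    if (id !== null) return id;
  }
  return byTitle(target);
}
```

Trying the URL form first means a wikilink name that happens to parse as a relative URL still falls through to the title lookup when no document has that URL.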

Step 3 — Wire link extraction into indexDocument in src/core/indexing.ts

After the transaction completes at line 470, add a post-index link extraction pass:

// After transaction() call, before return:
await extractAndStoreDocumentLinks(db, docId, input.content, input.url);

New function extractAndStoreDocumentLinks:

async function extractAndStoreDocumentLinks(
  db: Database.Database,
  docId: string,
  content: string,
  sourceUrl?: string,
): Promise<void> {
  const targets = new Set<string>();

  // Detect content type and extract accordingly
  if (content.trimStart().startsWith('<')) {
    // HTML content
    for (const url of extractLinks(content, sourceUrl ?? '')) targets.add(url);
  } else {
    // Markdown/plain — extract both markdown links and wikilinks
    for (const url of extractMarkdownLinks(content)) targets.add(url);
    for (const name of extractWikilinks(content)) targets.add(name);
  }

  for (const target of targets) {
    const targetId = resolveDocumentLink(db, target, sourceUrl);
    if (targetId && targetId !== docId) {
      try {
        createLink(db, docId, targetId, 'references');
      } catch {
        // UNIQUE constraint violation = link already exists, skip
      }
    }
  }
}

The pass runs for every indexed document. Targets that do not resolve to a known document are skipped silently, so content without resolvable links creates no rows.
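The bare catch in the loop above treats every failure as a duplicate link. A narrower variant could swallow only uniqueness violations and rethrow everything else. The sketch below assumes that createLink lets the SQLite driver error propagate and that the error carries a string code such as SQLITE_CONSTRAINT_UNIQUE. Both points are assumptions to verify against the real createLink, and the helper names are hypothetical.

```typescript
// Sketch only: assumes the error thrown on a duplicate link exposes a string
// `code` property in the style of SQLite result-code names.

function isUniqueConstraintError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return code === "SQLITE_CONSTRAINT_UNIQUE" || code === "SQLITE_CONSTRAINT_PRIMARYKEY";
}

// Runs `create`; returns true if a link was created, false if it already
// existed. Any other error is rethrown instead of being hidden.
function createLinkIgnoringDuplicates(create: () => void): boolean {
  try {
    create();
    return true;
  } catch (err) {
    if (isUniqueConstraintError(err)) return false;
    throw err;
  }
}
```

Inside the loop this would be called as createLinkIgnoringDuplicates(() => createLink(db, docId, targetId, 'references')).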


Step 4 — Add "references" link type

In src/core/links.ts, add "references" to VALID_LINK_TYPES and the LinkType union. This distinguishes auto-detected content links from manually curated semantic relationships (see_also, prerequisite, etc.).

A migration in src/db/schema.ts (next version after the current one) that updates a CHECK constraint is optional: link_type is a TEXT column, SQLite has no enforced TEXT enum type, and the app-level set is sufficient.
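One way the extended type and validity set could be declared is shown below. The four existing members are inferred from the list in Step 6 and may not match the actual contents of links.ts, and isLinkType is a hypothetical guard added for illustration.

```typescript
// Sketch only: the existing members are inferred from this issue, not read
// from the real src/core/links.ts.

const LINK_TYPES = [
  "see_also",
  "prerequisite",
  "supersedes",
  "related",
  "references", // new: auto-detected content links
] as const;

type LinkType = (typeof LINK_TYPES)[number];

const VALID_LINK_TYPES: ReadonlySet<string> = new Set<string>(LINK_TYPES);

// Hypothetical type guard for values read back from the TEXT column.
function isLinkType(value: string): value is LinkType {
  return VALID_LINK_TYPES.has(value);
}
```

Deriving the union from a single const array keeps the runtime set and the compile-time type from drifting apart.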


Step 5 — Wire Obsidian connector wikilinks

In src/connectors/obsidian.ts, after indexDocument() returns at line 340, create links from the already-parsed parsed.wikilinks array using the new title resolver:

// After indexDocument call:
for (const wikilink of parsed.wikilinks) {
  const targetId = resolveDocumentByTitle(db, wikilink);
  if (targetId && targetId !== indexed.id) {
    try { createLink(db, indexed.id, targetId, 'references'); } catch {}
  }
}

Note: wikilinks in Obsidian refer to other vault files by title/filename, so resolveDocumentByTitle is the right resolver here. Vault sync indexes files one at a time, so a wikilink whose target file has not been indexed yet cannot be resolved at that point. A second pass or re-index after full sync would resolve these. Consider adding a post-sync link resolution sweep.
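The post-sync sweep suggested above might be sketched as follows. PendingWikilink and resolvePendingWikilinks are hypothetical names, and the resolver and link creation are injected as callbacks so the sketch stays independent of the real database code.

```typescript
// Sketch only: types and function names are hypothetical.

interface PendingWikilink {
  sourceId: string; // document that contains the wikilink
  target: string;   // raw page name from [[...]]
}

// Tries to resolve every pending wikilink once all files are indexed.
// Returns the entries that still could not be resolved.
function resolvePendingWikilinks(
  pending: PendingWikilink[],
  resolveByTitle: (title: string) => string | null,
  create: (sourceId: string, targetId: string) => void,
): PendingWikilink[] {
  const unresolved: PendingWikilink[] = [];
  for (const link of pending) {
    const targetId = resolveByTitle(link.target);
    if (targetId === null) {
      unresolved.push(link);
      continue;
    }
    // Skip self-links, matching the check in the per-document loop.
    if (targetId !== link.sourceId) create(link.sourceId, targetId);
  }
  return unresolved;
}
```

During sync the connector would collect wikilinks that failed to resolve into the pending list, then call this once after the last file is indexed.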


Step 6 — Add document_links edges to the knowledge graph

In src/core/graph.ts, after building similar_to edges (line ~263), add:

// Add explicit document_links edges
const allLinks = listLinks(db); // existing function in links.ts
for (const link of allLinks) {
  if (nodeIds.has(link.sourceId) && nodeIds.has(link.targetId)) {
    edges.push({
      source: link.sourceId,
      target: link.targetId,
      type: link.linkType as GraphEdge['type'], // extend union
      weight: 1,
    });
  }
}

Extend GraphEdge.type to include "see_also" | "prerequisite" | "supersedes" | "related" | "references".
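A self-contained sketch of the extended union and the edge-building step is given below. The GraphEdge and DocumentLink shapes are assumptions modeled on the snippet above, and the runtime check on linkType is an addition that avoids the unchecked cast by skipping rows with an unknown type.

```typescript
// Sketch only: interface shapes are assumptions based on the snippet above.

type LinkEdgeType = "see_also" | "prerequisite" | "supersedes" | "related" | "references";
type GraphEdgeType = "belongs_to_topic" | "has_tag" | "similar_to" | LinkEdgeType;

interface GraphEdge {
  source: string;
  target: string;
  type: GraphEdgeType;
  weight: number;
}

interface DocumentLink {
  sourceId: string;
  targetId: string;
  linkType: string; // stored as TEXT, so validate before narrowing
}

const LINK_EDGE_TYPES = new Set<string>([
  "see_also", "prerequisite", "supersedes", "related", "references",
]);

function isLinkEdgeType(value: string): value is LinkEdgeType {
  return LINK_EDGE_TYPES.has(value);
}

// Build edges only for links whose endpoints are both nodes in the graph.
function buildLinkEdges(links: DocumentLink[], nodeIds: Set<string>): GraphEdge[] {
  const edges: GraphEdge[] = [];
  for (const link of links) {
    if (!nodeIds.has(link.sourceId) || !nodeIds.has(link.targetId)) continue;
    if (!isLinkEdgeType(link.linkType)) continue;
    edges.push({ source: link.sourceId, target: link.targetId, type: link.linkType, weight: 1 });
  }
  return edges;
}
```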


Step 7 — Tests

  • Unit test extractMarkdownLinks and extractWikilinks in tests/unit/
  • Unit test resolveDocumentByUrl and resolveDocumentByTitle
  • Integration test: index two documents where doc A links to doc B's URL → verify document_links row created with link_type = "references"
  • Integration test: Obsidian sync with two files where file A wikilinks to file B → verify link created
  • Integration test: graph includes document_links edges

Files to Modify

| File | Change |
| --- | --- |
| src/core/link-extractor.ts | Add extractMarkdownLinks(), extractWikilinks() |
| src/core/links.ts | Add "references" type, resolveDocumentByUrl(), resolveDocumentByTitle(), resolveDocumentLink() |
| src/core/indexing.ts | Call extractAndStoreDocumentLinks() after transaction |
| src/connectors/obsidian.ts | Wire parsed.wikilinks to createLink() after indexing |
| src/core/graph.ts | Add document_links edges; extend GraphEdge.type union |
| src/db/schema.ts | No schema change needed (link_type is TEXT, enforced at app level) |
| tests/unit/ | Tests for new extractor and resolver functions |
| tests/integration/ | End-to-end link detection tests |
