
release #372

Closed
RobertLD wants to merge 26 commits into main from development

@RobertLD RobertLD commented Mar 6, 2026

No description provided.

RobertLD and others added 25 commits March 3, 2026 13:25
* chore: add development branch workflow

- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
  Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: skip Vercel preview deployments on non-main branches

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: trigger check refresh

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix: comprehensive audit fixes — security, performance, resilience, API hardening

Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints

Closes #314

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test

- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite limit)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
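
The IN-clause chunking mentioned above could look roughly like this. It is a minimal sketch, not the code in bulk.ts: `chunk` and `placeholders` are hypothetical helper names, and 999 is simply the limit quoted in the bullet.

```typescript
// Hypothetical helpers; the names are illustrative, not taken from bulk.ts.
const MAX_SQL_PARAMS = 999; // limit quoted in the commit message

/** Split `items` into consecutive slices of at most `size` elements. */
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

/** Build "?, ?, ?" for a parameterised IN (...) clause. */
function placeholders(count: number): string {
  return Array.from({ length: count }, () => "?").join(", ");
}

const ids = Array.from({ length: 2500 }, (_, i) => `doc-${i}`);
const batches = chunk(ids, MAX_SQL_PARAMS);
// Each batch would then be bound to its own statement, for example:
//   SELECT ... WHERE id IN (<placeholders(batch.length)>)
```

Running one statement per batch keeps every query under the bound-parameter limit regardless of how many ids the caller passes.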

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: consolidate and fix CI/CD workflows

- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add HTML file parser for .html/.htm document indexing

Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.

Closes #317

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address CodeQL and review comments on HTML parser

- Replace regex-based tag stripping with node-html-markdown's native
  ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
  bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
  other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "fix: skip Vercel preview deployments on non-main branches"

This reverts commit eb48187.

* feat: add --from option to pack create for folder/URL sources

Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.

CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]

Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)

Closes #328

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* style: fix prettier formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add gzip support for pack files (.json.gz)

Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or magic bytes
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
- Auto-detects gzip by magic bytes even if extension is .json

5 new tests covering gzip write, install, magic byte detection, and round-trip.
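
The two detection paths described above can be sketched as follows. The helper names are hypothetical and not taken from the pack code; the only fact relied on is that a gzip stream begins with the bytes 0x1f 0x8b.

```typescript
// Hypothetical helpers; the names are illustrative and not taken from packs.ts.

/** A gzip stream starts with the two-byte magic number 0x1f 0x8b (RFC 1952). */
function hasGzipMagic(head: Uint8Array): boolean {
  return head.length >= 2 && head[0] === 0x1f && head[1] === 0x8b;
}

/** On write there are no bytes to inspect yet, so only the extension can decide. */
function wantsGzipOutput(fileName: string): boolean {
  return fileName.toLowerCase().endsWith(".gz");
}

const gzipHead = new Uint8Array([0x1f, 0x8b, 0x08]); // first bytes of a gzip stream
const jsonHead = new Uint8Array([0x7b, 0x22]); // '{"' from a plain JSON file
```

Checking the magic bytes on read is what lets a compressed pack that was renamed to .json still be detected as gzip.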

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add progress logging and fix dedup handling in pack install

- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
  (previously it always threw ValidationError regardless of dedup mode)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: auto-generate tags during pack creation and apply on install

- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing
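
As a rough illustration of DB-free TF-IDF tagging, the sketch below scores a document's terms against an in-memory corpus. It is not the implementation in tags.ts: `tokenizeSketch`, `suggestTagsSketch`, the minimum token length, and the smoothed IDF formula are all assumptions made for the example.

```typescript
// Hypothetical sketch; the real tokenize()/suggestTagsFromText() may differ.

/** Lowercase, split on non-alphanumerics, drop very short tokens. */
function tokenizeSketch(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 2);
}

/** Return the `limit` highest-scoring terms of `doc` by TF-IDF over `corpus`. */
function suggestTagsSketch(doc: string, corpus: readonly string[], limit = 3): string[] {
  const docTokens = tokenizeSketch(doc);
  const tf = new Map<string, number>();
  for (const t of docTokens) tf.set(t, (tf.get(t) ?? 0) + 1);

  const corpusSets = corpus.map((d) => new Set(tokenizeSketch(d)));
  const scored = Array.from(tf.entries()).map(([term, count]) => {
    const df = corpusSets.filter((s) => s.has(term)).length;
    const idf = Math.log((1 + corpus.length) / (1 + df)) + 1; // smoothed IDF
    return { term, score: (count / docTokens.length) * idf };
  });
  // Highest score first; ties are broken alphabetically for stable output.
  scored.sort((a, b) => b.score - a.score || a.term.localeCompare(b.term));
  return scored.slice(0, limit).map((s) => s.term);
}

const corpus = [
  "the cat sat on the mat",
  "the dog sat on the log",
  "sqlite vector search with sqlite",
];
const topTag = suggestTagsSketch(corpus[2] ?? "", corpus, 1);
```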

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)

---
updated-dependencies:
- dependency-name: eslint-config-prettier
  dependency-version: 10.1.8
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).


Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](lint-staged/lint-staged@v16.3.1...v16.3.2)

---
updated-dependencies:
- dependency-name: lint-staged
  dependency-version: 16.3.2
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: minor-and-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps the actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |


Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](actions/setup-node@v4...v6)

Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4...v7)

Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v5...v6)

Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](actions/setup-go@v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-node
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-go
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)

---
updated-dependencies:
- dependency-name: "@types/node"
  dependency-version: 25.3.3
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](WiseLibs/better-sqlite3@v11.10.0...v12.6.2)

---
updated-dependencies:
- dependency-name: better-sqlite3
  dependency-version: 12.6.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](eslint/eslint@v9.39.3...v10.0.2)

---
updated-dependencies:
- dependency-name: eslint
  dependency-version: 10.0.2
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Robert DeRienzo <rderienzo@voloridge.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* feat: add passthrough LLM mode for ask-question tool

Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).

- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
  retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
  directly; falls through to existing LLM path otherwise

Enable via config:  { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough
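
A minimal sketch of how the provider could be resolved from these two sources. The type and function names are hypothetical, the non-passthrough provider names are guesses, and the precedence (environment variable over config file) is an assumption rather than something this commit states.

```typescript
// Assumption: the provider names other than "passthrough" are guesses based on
// the OpenAI/Ollama providers mentioned above; the real union lives in config.ts.
type LlmProvider = "openai" | "ollama" | "passthrough";

interface ConfigSketch {
  llm?: { provider?: LlmProvider };
}

function isLlmProvider(value: string | undefined): value is LlmProvider {
  return value === "openai" || value === "ollama" || value === "passthrough";
}

/** Assumption: a valid LIBSCOPE_LLM_PROVIDER value wins over the config file. */
function resolveProvider(
  config: ConfigSketch,
  env: Record<string, string | undefined>,
): LlmProvider | undefined {
  const fromEnv = env["LIBSCOPE_LLM_PROVIDER"];
  return isLlmProvider(fromEnv) ? fromEnv : config.llm?.provider;
}

/** Hypothetical stand-in for the isPassthroughMode() helper named above. */
function isPassthroughSketch(
  config: ConfigSketch,
  env: Record<string, string | undefined>,
): boolean {
  return resolveProvider(config, env) === "passthrough";
}
```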

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: format config.ts and include passthrough in provider override

- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
  not spread into overrides.llm.provider

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address 9 audit findings from issue #332

Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
  mutation with per-request undici Agent to eliminate TLS race condition (#1)

Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
  instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
  the pre-joined aggregate in FTS and LIKE search paths (#4)

Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
  per-document getDocumentTags() calls with a single getDocumentTagsBatch()
  query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
  repeated on every request (#6)
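
The TTL cache idea can be sketched like this. `createTtlCache` is a hypothetical helper, not the code in config.ts, and the injectable clock exists only so the example runs without waiting 30 seconds.

```typescript
// Hypothetical helper; loadConfig()'s real cache may be structured differently.
function createTtlCache<T>(load: () => T, ttlMs: number, now: () => number = Date.now) {
  let cached: { value: T; expiresAt: number } | undefined;
  return {
    /** Return the cached value, reloading it once the TTL has elapsed. */
    get(): T {
      const t = now();
      if (cached === undefined || t >= cached.expiresAt) {
        cached = { value: load(), expiresAt: t + ttlMs };
      }
      return cached.value;
    },
    /** Drop the cached value so the next get() reloads (useful in tests). */
    invalidate(): void {
      cached = undefined;
    },
  };
}

let fakeNow = 0;
let diskReads = 0;
const configCache = createTtlCache(() => ({ read: ++diskReads }), 30000, () => fakeNow);

configCache.get(); // miss: loads
configCache.get(); // hit: served from cache
fakeNow = 30000;
configCache.get(); // TTL elapsed: reloads
configCache.invalidate();
configCache.get(); // explicit invalidation: reloads
```

The explicit `invalidate()` mirrors why the tests call invalidateConfigCache(): without it, a value cached by one test can be served to the next.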

Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
  the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
  Date objects back to ISO-8601 strings (#9)

Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
  add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
  and correct coverage threshold (80% → actual 75%/74%)

Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
  tests so TTL cache doesn't return stale results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: remove unused warnIfTlsBypassMissing function

Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: update tests for config cache and retry semantics

- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
  that were failing because the 30s TTL cache introduced in the config
  module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
  retries = 3 total calls (loop is attempt <= maxRetries)
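
The call count follows directly from the loop bound. The sketch below is a synchronous, hypothetical stand-in for the retry loop in http-utils (the real one is presumably asynchronous and may wait between attempts); only the `attempt <= maxRetries` condition is taken from the text above.

```typescript
// Hypothetical, synchronous stand-in for the retry loop described above.
function withRetries<T>(fn: (attempt: number) => T, maxRetries: number): T {
  let lastError: unknown;
  // attempt 0 is the initial call; attempts 1..maxRetries are the retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn(attempt);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

let calls = 0;
const outcome = withRetries((attempt) => {
  calls += 1;
  if (attempt < 2) throw new Error("transient failure");
  return "ok";
}, 2);
// With maxRetries = 2 the function ran 3 times: 1 initial call + 2 retries.
```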

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
… (#336)

- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
  bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
  `createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.

- Update `setupLogging` to default to "silent" in CLI mode (pretty
  reporter handles user-facing output). Verbose/`--log-level` flags still
  route to structured JSON pino logs. Fix duplicate `initLogger` calls in
  onenote connect/disconnect commands to use `setupLogging` consistently.

- Update `installPack` in `packs.ts` to support batch embedding and
  progress reporting:
  - New `InstallOptions` interface with `batchSize`, `resumeFrom`,
    `onProgress` fields
  - Batch documents: chunk all → single `provider.embedBatch` call per
    batch → single SQLite transaction per batch (avoids N embedding calls)
  - `resumeFrom` skips the first N documents (enables partial install
    resume after failure)
  - `InstallResult` now includes `errors` count
  - Add `--batch-size` and `--resume-from` CLI options to `pack install`

- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
  SilentReporter, isVerbose, env var detection); extended
  `tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
  batch efficiency, resumeFrom, embedBatch failure handling.
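
The interaction between `resumeFrom` and `batchSize` can be illustrated with a small planning function. This is a sketch under assumptions: `planBatches` is a hypothetical name, and the real installPack also chunks, embeds, and writes each batch, which is omitted here.

```typescript
// Hypothetical sketch of batch planning only; embedding and DB writes are omitted.
interface PlanOptions {
  batchSize: number;
  resumeFrom: number; // number of leading documents to skip
}

function planBatches<T>(docs: readonly T[], { batchSize, resumeFrom }: PlanOptions): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  if (!Number.isInteger(resumeFrom) || resumeFrom < 0) {
    throw new RangeError("resumeFrom must be a non-negative integer");
  }
  const remaining = docs.slice(resumeFrom);
  const batches: T[][] = [];
  for (let i = 0; i < remaining.length; i += batchSize) {
    batches.push(remaining.slice(i, i + batchSize));
  }
  return batches;
}

const docNumbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
// Resume after the first 3 documents, 4 documents per batch.
const plan = planBatches(docNumbers, { batchSize: 4, resumeFrom: 3 });
```

Each planned batch would correspond to one embedBatch call and one SQLite transaction, which is where the saving over per-document embedding comes from.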

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…dates (#337)

Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).


Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](honojs/node-server@v1.19.9...v1.19.10)

Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](honojs/hono@v4.12.3...v4.12.5)

---
updated-dependencies:
- dependency-name: "@hono/node-server"
  dependency-version: 1.19.10
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: hono
  dependency-version: 4.12.5
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)

**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.

An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.
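
A simplified sketch of the two pieces described above: a private-address check and request options that pin the connection to the validated IP. It is not the code in url-fetcher.ts. The function names are hypothetical, the check covers only a few IPv4 ranges (a real check would also need IPv6 and other reserved blocks), and the explicit Host header is an assumption about what an IP-pinned request needs.

```typescript
// Hypothetical sketch; IPv4-only and deliberately incomplete.
function isPrivateIpv4(ip: string): boolean {
  if (!/^\d{1,3}(\.\d{1,3}){3}$/.test(ip)) {
    throw new TypeError("not a dotted-quad IPv4 address: " + ip);
  }
  const parts = ip.split(".").map(Number);
  if (parts.some((p) => p > 255)) {
    throw new TypeError("octet out of range: " + ip);
  }
  const a = parts[0] ?? 0;
  const b = parts[1] ?? 0;
  return (
    a === 0 || // "this network"
    a === 10 || // 10.0.0.0/8
    a === 127 || // loopback
    (a === 169 && b === 254) || // link-local
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) // 192.168.0.0/16
  );
}

/** Connect to the validated IP while keeping the original name for TLS SNI. */
function pinnedRequestOptions(url: URL, pinnedIp: string) {
  if (isPrivateIpv4(pinnedIp)) {
    throw new Error("refusing to connect to private address " + pinnedIp);
  }
  return {
    hostname: pinnedIp, // TCP connection goes to the already-validated address
    servername: url.hostname, // TLS SNI and certificate check use the real name
    path: url.pathname + url.search,
    headers: { host: url.host }, // assumption: Host header set explicitly
  };
}
```

Because the address that was validated is the address that is dialled, a second DNS lookup cannot swap in a different target between the check and the request.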

**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]* —
two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine must try
O(n²) split points before concluding there is no match (polynomial backtracking).

Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* — a negative
lookahead prevents the quantifier from overlapping with the literal,
making backtracking structurally impossible.
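
The rewrite can be checked for equivalence on a few sample tags. The sketch below uses "toc" in place of X and wraps both patterns in an illustrative `<ac:structured-macro ...>` tag; the samples are made up, and the check only confirms that both forms accept and reject the same inputs, not how long either takes.

```typescript
// Original shape: two [^>]* runs around the literal.
const originalPattern = /<ac:structured-macro [^>]*ac:name="toc"[^>]*>/;
// Rewritten shape: the leading run may not step over the start of the literal.
const rewrittenPattern = /<ac:structured-macro (?:(?!ac:name="toc")[^>])*ac:name="toc"[^>]*>/;

const sampleTags = [
  '<ac:structured-macro ac:name="toc">',
  '<ac:structured-macro ac:schema-version="1" ac:name="toc" ac:macro-id="x">',
  '<ac:structured-macro ac:name="code">',
  '<ac:structured-macro ac:name="code" data="ac:name=toc">',
];

// Both regexes are non-global, so test() carries no state between calls.
const patternsAgree = sampleTags.every(
  (tag) => originalPattern.test(tag) === rewrittenPattern.test(tag),
);
const matched = sampleTags.filter((tag) => rewrittenPattern.test(tag));
```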

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers

The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?)
<\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels
(those were not part of the original security fix diff). These have the same
O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*?
scans O(n - pos) chars per attempt, totalling O(n²).

Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.
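
An indexOf-based scan of the kind described above might look like the sketch below. It is not the helper from confluence.ts: `extractTagBodies` is a hypothetical name, and the sketch ignores nesting, is case-sensitive, and assumes no `>` inside attribute values.

```typescript
// Hypothetical sketch; it does not handle nesting or ">" inside attribute values.
function extractTagBodies(input: string, tag: string): string[] {
  const open = "<" + tag;
  const close = "</" + tag + ">";
  const found: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    const start = input.indexOf(open, pos);
    if (start === -1) break;
    // Require a delimiter after the name so "<p" does not match "<pre".
    const next = input.charAt(start + open.length);
    if (next !== ">" && next !== " " && next !== "\n" && next !== "\t") {
      pos = start + open.length;
      continue;
    }
    const openEnd = input.indexOf(">", start + open.length);
    if (openEnd === -1) break;
    const end = input.indexOf(close, openEnd + 1);
    if (end === -1) break; // no closing tag anywhere ahead: stop, do not rescan
    found.push(input.slice(openEnd + 1, end));
    pos = end + close.length;
  }
  return found;
}

const bodies = extractTagBodies('<p>a</p><pre>x</pre><p class="k">b</p><p>unclosed', "p");
```

The scan position only moves forward, and a missing closing tag ends the loop instead of triggering a fresh scan from every opener, so the work stays linear in the input length.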

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…) (#339)

* feat: concurrent pack installation and -v verbose shorthand (issue #330)

Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.

Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
  calls run simultaneously; embedding is I/O-bound so parallelism directly
  reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
  semaphore-based scheduler to run up to `concurrency` embedBatch calls
  concurrently while inserting completed batches in-order (SQLite requires
  serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
  without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
  multiple embedBatch calls per install, concurrency limit enforcement,
  incremental progress reporting, and partial-failure error counting
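
A bounded-concurrency runner in the spirit of the scheduler described above. This sketch uses a fixed number of worker lanes pulling from a shared index, which is a simpler relative of a semaphore and not the scheduler in installPack; `mapWithConcurrency` is a hypothetical name. Results are stored by input index so the caller can consume them in order.

```typescript
// Hypothetical sketch: worker lanes instead of an explicit semaphore.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each lane claims the next unprocessed index until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i] as T, i);
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results; // same order as `items`, regardless of completion order
}
```

Rejecting a non-positive `limit` up front matters because a scheduler with zero lanes would never start any work.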

https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ

* fix: address all 4 Copilot review comments on PR #339

- Validate batchSize, concurrency, resumeFrom at the start of installPack
  and throw ValidationError for invalid values (comments 3 & 4). Concurrency
  <= 0 would silently hang the semaphore indefinitely.

- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
  error before ever calling installPack (comment 3).

- Lazy chunking: pre-chunking all documents upfront held chunks for the
  entire pack in memory simultaneously. Batches now store only the raw
  documents; resolveBatch() chunks on demand right before embedBatch
  is called, so only one batch's worth of chunks is in memory at a time
  (comment 2).

- Wrap provider.embedBatch() in try/catch so synchronous throws are
  converted to rejected Promises rather than escaping scheduleNext() and
  leaving the outer Promise permanently pending (comment 1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)

Closes #342

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…page limits (#315) (#343)

* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)

Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.

New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
  URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
  excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
  10-min total timeout, robots.txt (User-agent: * and libscope), and
  1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
  dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
  domain + path + pattern filtering, cycle detection, robots.txt, partial
  failure recovery, stats, and abortReason reporting.

Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
  contentType + finalUrl before HTML-to-markdown conversion, so the
  spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
  maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
  pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
  sameDomain, pathPrefix, excludePatterns parameters.

Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.
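
The BFS traversal with hard caps can be sketched over an in-memory link graph. This is not spider.ts: `crawlOrder` and `linksOf` are hypothetical, fetching, robots.txt, delays, and filters are left out, and treating the seed as depth 0 is an assumption. Only the cap values (200 pages, depth 5) come from the text above.

```typescript
// Hypothetical sketch; `linksOf` stands in for "fetch the page and extract links".
const HARD_MAX_PAGES = 200;
const HARD_MAX_DEPTH = 5;

function crawlOrder(
  seed: string,
  linksOf: (url: string) => readonly string[],
  opts: { maxPages?: number; maxDepth?: number } = {},
): string[] {
  // Caller-supplied limits are clamped so the hard caps cannot be exceeded.
  const maxPages = Math.min(opts.maxPages ?? HARD_MAX_PAGES, HARD_MAX_PAGES);
  const maxDepth = Math.min(opts.maxDepth ?? HARD_MAX_DEPTH, HARD_MAX_DEPTH);

  const visited = new Set<string>([seed]); // cycle detection
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const order: string[] = [];

  while (order.length < maxPages) {
    const item = queue.shift();
    if (item === undefined) break;
    order.push(item.url);
    if (item.depth >= maxDepth) continue; // do not expand beyond the depth limit
    for (const link of linksOf(item.url)) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ url: link, depth: item.depth + 1 });
      }
    }
  }
  return order;
}

const linkGraph: Record<string, string[]> = {
  a: ["b", "c"],
  b: ["d", "a"], // the link back to "a" is a cycle
  c: ["d"],
  d: ["e"],
};
const linksOfDemo = (url: string): readonly string[] => linkGraph[url] ?? [];
```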

Closes #315

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve CI lint errors in spider implementation

- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
  TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
  no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
  .resolves requires a Promise, not an async function

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address CodeQL security findings in spider/link-extractor

link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
  Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
  with a strict http/https allowlist check on the resolved URL protocol.
  An allowlist is exhaustive by definition; a blocklist can miss any scheme
  it does not enumerate, such as blob: or future additions.
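
A minimal sketch of the allowlist check, using the standard `URL` class. `resolveHttpLink` is a hypothetical name rather than the function in link-extractor.ts; it also strips the fragment, as the link extractor is described as doing.

```typescript
// Hypothetical helper; returns null for anything that is not plain http/https.
function resolveHttpLink(href: string, base: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(href, base); // resolves relative hrefs against the page URL
  } catch (err) {
    return null; // unparseable href
  }
  // Allowlist: the parser lowercases the scheme, so "JAVASCRIPT:" is caught too.
  if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
    return null;
  }
  resolved.hash = ""; // drop the fragment so #section variants deduplicate
  return resolved.href;
}

const pageUrl = "https://example.com/docs/api/";
const kept = resolveHttpLink("../guide#intro", pageUrl);
const rejected = ["javascript:alert(1)", "JAVASCRIPT:alert(1)", "blob:https://example.com/x", "mailto:a@example.com"]
  .map((href) => resolveHttpLink(href, pageUrl));
```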

spider.ts (CodeQL #31 — incomplete multi-character sanitization):
  Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
  an indexOf-based stripTags() function. The regex stops at the first >
  which can be inside a quoted attribute value (e.g. <img alt="a>b">),
  potentially leaving partial tag content in the extracted title.
  The new implementation walks quoted attribute values explicitly so no
  tag content leaks through regardless of its internal structure.
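
The difference is easy to reproduce. The sketch below is a character-walking tag stripper, not the indexOf-based stripTags() from spider.ts; `stripTagsSketch` is a hypothetical name, and it treats every `<` as the start of a tag, so a bare `<` in text would swallow what follows.

```typescript
// Hypothetical sketch; tracks quote state so ">" inside attribute values is ignored.
function stripTagsSketch(html: string): string {
  let out = "";
  let inTag = false;
  let quote = ""; // the quote character currently open inside a tag, if any
  for (let i = 0; i < html.length; i++) {
    const ch = html.charAt(i);
    if (!inTag) {
      if (ch === "<") inTag = true;
      else out += ch;
    } else if (quote !== "") {
      if (ch === quote) quote = ""; // closing quote ends the attribute value
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === ">") {
      inTag = false;
    }
  }
  return out;
}

const trickyTitle = '<img alt="a>b">Title';
const viaRegex = trickyTitle.replace(/<[^>]+>/g, ""); // stops at the ">" inside alt
const viaWalker = stripTagsSketch(trickyTitle);
```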

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address all Copilot review comments on spider PR (#343)

- link-extractor: add word-boundary check in extractHref to prevent
  matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats
  interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
  as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
  yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
  to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
  add SpiderResponse schema, update 201 response to oneOf

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix prettier formatting in spider files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: comprehensive documentation update for v1.3.0

- README: fix license (BUSL-1.1, not MIT), expand MCP tools table to
  all 26 tools, expand REST API table with all endpoints (webhooks,
  links, analytics, connectors status, suggest-tags, bulk ops), add
  webhooks section with HMAC signing example, add missing CLI commands
  (bulk ops, saved searches, document links, docs update)
- getting-started: fix Node.js requirement (20, not 18), add sections
  for web dashboard, organize/annotate features, REST API
- mcp-setup: expand available tools section to list all 26 tools
  grouped by category instead of just 4
- mcp-tools reference: add 5 missing tools — update-document,
  suggest-tags, link-documents, get-document-links, delete-link
- rest-api reference: add all missing endpoints, reorganize by category,
  add examples for update, bulk retag, webhooks, links, saved searches
- configuration guide: document passthrough LLM provider
- configuration reference: add passthrough LLM, llm.ollamaUrl key,
  expand config set examples to cover all settable keys
- cli reference: expand config set supported keys list

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: allow release-please PRs to pass merge gate and trigger CI

Two issues prevented PR #238 from getting CI runs:

1. merge-gate blocked release-please PRs — the gate only allowed
   'development' as the source branch, but release-please uses
   'release-please--branches--main--components--libscope'. Updated
   to allow any branch matching 'release-please--*'.

2. CI never ran on the PR — GitHub does not trigger workflows when
   GITHUB_TOKEN creates a PR (intentional security restriction to
   prevent infinite loops). Fixed by passing a PAT via secrets.GH_TOKEN
   to the release-please action so its PR creation triggers CI.

Note: requires a 'GH_TOKEN' secret in repo settings — a classic PAT
with repo and workflow scopes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: add development branch workflow (#327)

* chore: add development branch workflow

- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
  Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development



* fix: skip Vercel preview deployments on non-main branches



* chore: trigger check refresh

---------



* feat: create-pack from local folder or URL sources (#329)

* fix: comprehensive audit fixes — security, performance, resilience, API hardening

Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints

Closes #314



* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test

- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite's default bound-parameter limit before 3.32.0)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
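The IN-clause chunking mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in bulk.ts: the helper names, the `documents` table, and the `id` column are assumptions made for the example.

```typescript
// Hypothetical helper: split a list into chunks of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build one parameterised statement per chunk so that no single statement
// binds more than `maxParams` values. Table and column names are illustrative.
function buildInQueries(
  ids: readonly string[],
  maxParams = 999,
): { sql: string; params: string[] }[] {
  return chunk(ids, maxParams).map((part) => ({
    sql: `SELECT id FROM documents WHERE id IN (${part.map(() => "?").join(", ")})`,
    params: part,
  }));
}
```

Each returned statement can then be prepared and executed separately, with the row sets concatenated by the caller.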



* ci: consolidate and fix CI/CD workflows

- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)



* feat: add HTML file parser for .html/.htm document indexing

Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.

Closes #317



* fix: address CodeQL and review comments on HTML parser

- Replace regex-based tag stripping with node-html-markdown's native
  ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
  bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
  other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency



* Revert "fix: skip Vercel preview deployments on non-main branches"

This reverts commit eb48187.

* feat: add --from option to pack create for folder/URL sources

Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.

CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]

Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)

Closes #328



* style: fix prettier formatting



* feat: add gzip support for pack files (.json.gz)

Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or magic bytes
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
- Auto-detects gzip by magic bytes even if extension is .json

5 new tests covering gzip write, install, magic byte detection, and round-trip.
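Magic-byte detection of this kind might look like the sketch below. It assumes Node's built-in zlib module; `isGzip` and `decodePack` are hypothetical names, not the functions in packs.ts.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

// Gzip streams start with the two-byte magic number 0x1f 0x8b (RFC 1952).
function isGzip(buf: Uint8Array): boolean {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Hypothetical reader: decode a pack payload whether or not it is compressed,
// looking only at the content and ignoring the file extension.
function decodePack(buf: Buffer): unknown {
  const raw = isGzip(buf) ? gunzipSync(buf) : buf;
  return JSON.parse(raw.toString("utf8"));
}

// Usage: a compressed payload decodes the same way as a plain one.
const demoPack = decodePack(gzipSync(Buffer.from('{"name":"demo"}', "utf8")));
```

Checking the content instead of the extension is what lets a gzip file that happens to be named .json still install.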



* feat: add progress logging and fix dedup handling in pack install

- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
  (previously it always threw ValidationError regardless of dedup mode)



* feat: auto-generate tags during pack creation and apply on install

- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing



---------



* feat: add HTML file parser for .html/.htm document indexing (#318)




* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)

Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)

---
updated-dependencies:
- dependency-name: eslint-config-prettier
  dependency-version: 10.1.8
  dependency-type: direct:development
  update-type: version-update:semver-major
...




* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2

Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).


Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](lint-staged/lint-staged@v16.3.1...v16.3.2)

---
updated-dependencies:
- dependency-name: lint-staged
  dependency-version: 16.3.2
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: minor-and-patch
...




* build(deps): Bump the actions group with 5 updates

Bumps the actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |


Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](actions/setup-node@v4...v6)

Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4...v7)

Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v5...v6)

Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](actions/setup-go@v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-node
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-go
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
...




* build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3

Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)

---
updated-dependencies:
- dependency-name: "@types/node"
  dependency-version: 25.3.3
  dependency-type: direct:development
  update-type: version-update:semver-major
...




* build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2

Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](WiseLibs/better-sqlite3@v11.10.0...v12.6.2)

---
updated-dependencies:
- dependency-name: better-sqlite3
  dependency-version: 12.6.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...




* build(deps-dev): Bump eslint from 9.39.3 to 10.0.2

Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](eslint/eslint@v9.39.3...v10.0.2)

---
updated-dependencies:
- dependency-name: eslint
  dependency-version: 10.0.2
  dependency-type: direct:development
  update-type: version-update:semver-major
...






* feat: add passthrough LLM mode for ask-question tool (#335)

* feat: add passthrough LLM mode for ask-question tool

Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).

- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
  retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
  directly; falls through to existing LLM path otherwise

Enable via config:  { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough



* fix: format config.ts and include passthrough in provider override

- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
  not spread into overrides.llm.provider



---------



* fix: address 9 audit findings from issue #332 (#333)

* fix: address 9 audit findings from issue #332

Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
  mutation with per-request undici Agent to eliminate TLS race condition (#1)
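A constant-time key comparison could be sketched as below, assuming Node's built-in crypto module. `apiKeysMatch` is a hypothetical name, and hashing both sides first is one possible way to satisfy the equal-length requirement of timingSafeEqual; the middleware may handle that differently.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// timingSafeEqual throws when its inputs differ in byte length, so both keys
// are hashed to fixed-size SHA-256 digests before comparison.
function apiKeysMatch(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}
```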

Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
  instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
  the pre-joined aggregate in FTS and LIKE search paths (#4)

Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
  per-document getDocumentTags() calls with a single getDocumentTagsBatch()
  query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
  repeated on every request (#6)
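The TTL cache idea can be illustrated with a small wrapper. This is a sketch under assumptions: `cachedLoader` is a hypothetical name, `load` stands in for the disk read, and the injectable `now` clock exists only so the expiry logic can be exercised without waiting.

```typescript
// Hypothetical TTL wrapper around an expensive loader.
function cachedLoader<T>(load: () => T, ttlMs: number, now: () => number = Date.now) {
  let entry: { value: T; expiresAt: number } | undefined;
  return {
    get(): T {
      const t = now();
      // Reload when nothing is cached yet or the cached entry has expired.
      if (entry === undefined || t >= entry.expiresAt) {
        entry = { value: load(), expiresAt: t + ttlMs };
      }
      return entry.value;
    },
    // Explicit invalidation, useful in tests that change the environment.
    invalidate(): void {
      entry = undefined;
    },
  };
}
```

An explicit `invalidate()` is what lets tests that mutate environment variables avoid reading a stale entry.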

Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
  the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
  Date objects back to ISO-8601 strings (#9)

Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
  add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
  and correct coverage threshold (80% → actual 75%/74%)

Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
  tests so TTL cache doesn't return stale results



* fix: remove unused warnIfTlsBypassMissing function

Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).



* fix: update tests for config cache and retry semantics

- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
  that were failing because the 30s TTL cache introduced in the config
  module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
  retries = 3 total calls (loop is attempt <= maxRetries)
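The call count follows directly from the loop bounds. The sketch below uses a hypothetical synchronous `withRetries` helper for brevity; the helper in http-utils is presumably asynchronous and may add delays between attempts.

```typescript
// `attempt` runs from 0 to maxRetries inclusive, so the operation is tried
// maxRetries + 1 times before the last error is rethrown.
function withRetries<T>(op: () => T, maxRetries: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return op();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```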



---------



* feat: CLI logging improvements and pack installation performance (#330) (#336)

- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
  bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
  `createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.

- Update `setupLogging` to default to "silent" in CLI mode (pretty
  reporter handles user-facing output). Verbose/`--log-level` flags still
  route to structured JSON pino logs. Fix duplicate `initLogger` calls in
  onenote connect/disconnect commands to use `setupLogging` consistently.

- Update `installPack` in `packs.ts` to support batch embedding and
  progress reporting:
  - New `InstallOptions` interface with `batchSize`, `resumeFrom`,
    `onProgress` fields
  - Batch documents: chunk all → single `provider.embedBatch` call per
    batch → single SQLite transaction per batch (avoids N embedding calls)
  - `resumeFrom` skips the first N documents (enables partial install
    resume after failure)
  - `InstallResult` now includes `errors` count
  - Add `--batch-size` and `--resume-from` CLI options to `pack install`

- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
  SilentReporter, isVerbose, env var detection); extended
  `tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
  batch efficiency, resumeFrom, embedBatch failure handling.



* Claude/fix issue 331 s1qzu (#338)

* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)

Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).


Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](honojs/node-server@v1.19.9...v1.19.10)

Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](honojs/hono@v4.12.3...v4.12.5)

---
updated-dependencies:
- dependency-name: "@hono/node-server"
  dependency-version: 1.19.10
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: hono
  dependency-version: 4.12.5
  dependency-type: indirect
  dependency-group: npm_and_yarn
...




* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)

* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)

**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.

An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.

**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]* —
two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine must try
O(n²) splits before concluding there is no match (polynomial backtracking).

Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* so that a negative
lookahead stops the quantifier from consuming the literal. The two runs can
no longer overlap, which removes the quadratic backtracking.
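The two pattern shapes can be compared side by side. In this sketch the `<ac:structured-macro` prefix and the `toc` value are assumptions chosen for illustration; only the `[^>]*ac:name="X"[^>]*` core comes from the description above.

```typescript
// Original shape: both [^>]* runs are able to consume the same characters.
const ambiguous = /<ac:structured-macro[^>]*ac:name="toc"[^>]*>/;

// Rewritten shape: the leading run is not allowed to step over the literal.
const tempered = /<ac:structured-macro(?:(?!ac:name="toc")[^>])*ac:name="toc"[^>]*>/;

const hit = '<ac:structured-macro ac:schema-version="1" ac:name="toc" ac:macro-id="x">';
const miss = '<ac:structured-macro ac:schema-version="1" ac:name="code">';

// Both patterns accept and reject the same inputs; only the amount of
// backtracking on non-matching input differs.
const sameVerdicts =
  ambiguous.test(hit) === tempered.test(hit) && ambiguous.test(miss) === tempered.test(miss);
```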



* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers

The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?)
<\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels
(those were not part of the original security fix diff). These have the same
O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*?
scans O(n - pos) chars per attempt, totalling O(n²).

Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.
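An indexOf-based scan in the spirit of these helpers might look like the following. `extractTagContents` is a hypothetical, simplified stand-in: it is case-sensitive, ignores nesting, and treats any tag whose name merely starts with `tag` as a match.

```typescript
// Return the inner text of every <tag ...>...</tag> pair using indexOf only.
// Every search moves forward and a missing closing tag ends the scan, so the
// total work stays linear in the input length.
function extractTagContents(input: string, tag: string): string[] {
  const open = `<${tag}`;
  const close = `</${tag}>`;
  const out: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    const start = input.indexOf(open, pos);
    if (start === -1) break;
    const openEnd = input.indexOf(">", start + open.length);
    if (openEnd === -1) break;
    const end = input.indexOf(close, openEnd + 1);
    if (end === -1) break; // no closing tag ahead: stop instead of rescanning
    out.push(input.slice(openEnd + 1, end));
    pos = end + close.length;
  }
  return out;
}
```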



---------



* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)

* feat: concurrent pack installation and -v verbose shorthand (issue #330)

Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.

Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
  calls run simultaneously; embedding is I/O-bound so parallelism directly
  reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
  semaphore-based scheduler to run up to `concurrency` embedBatch calls
  concurrently while inserting completed batches in-order (SQLite requires
  serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
  without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
  multiple embedBatch calls per install, concurrency limit enforcement,
  incremental progress reporting, and partial-failure error counting
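Bounded concurrency with in-order commits can be sketched as below. This uses a worker-pool form of the idea rather than an explicit semaphore, and `runBatches`, `worker`, and `commit` are hypothetical names, not the installPack internals.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight, and hand
// results to `commit` strictly in input order.
async function runBatches<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
  commit: (result: R, index: number) => void,
): Promise<void> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const done = new Map<number, R>();
  let next = 0; // index of the next item to start
  let nextCommit = 0; // index of the next result allowed to commit

  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      const result = await worker(items[index] as T, index);
      done.set(index, result);
      // Flush every finished result that is contiguous with the last commit.
      while (done.has(nextCommit)) {
        commit(done.get(nextCommit) as R, nextCommit);
        done.delete(nextCommit);
        nextCommit++;
      }
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
}
```

Rejecting a non-positive `limit` up front avoids the situation where no lane ever starts and the caller waits forever.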

https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ

* fix: address all 4 Copilot review comments on PR #339

- Validate batchSize, concurrency, resumeFrom at the start of installPack
  and throw ValidationError for invalid values (comments 3 & 4). Concurrency
  <= 0 would silently hang the semaphore indefinitely.

- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
  error before ever calling installPack (comment 3).

- Lazy chunking: pre-chunking all documents upfront held chunks for the
  entire pack in memory simultaneously. Batches now store only the raw
  documents; resolveBatch() chunks on demand right before embedBatch
  is called, so only one batch's worth of chunks is in memory at a time
  (comment 2).

- Wrap provider.embedBatch() in try/catch so synchronous throws are
  converted to rejected Promises rather than escaping scheduleNext() and
  leaving the outer Promise permanently pending (comment 1).



---------



* fix: address 7 pre-release bugs from audit (#342) (#344)

- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)

Closes #342



* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)

* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)

Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.

New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
  URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
  excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
  10-min total timeout, robots.txt (User-agent: * and libscope), and
  1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
  dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
  domain + path + pattern filtering, cycle detection, robots.txt, partial
  failure recovery, stats, and abortReason reporting.
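The BFS traversal with page and depth caps can be illustrated over an in-memory link graph. `crawlOrder` is a hypothetical helper that leaves out fetching, robots.txt, delays, and filtering; it only shows how the visited set and the two limits interact.

```typescript
// Breadth-first visit order starting from `seed`.
// `links` maps a URL to the URLs found on that page.
function crawlOrder(
  links: ReadonlyMap<string, readonly string[]>,
  seed: string,
  maxPages: number,
  maxDepth: number,
): string[] {
  const visited = new Set<string>([seed]); // prevents cycles and duplicates
  const queue: { url: string; depth: number }[] = [{ url: seed, depth: 0 }];
  const order: string[] = [];
  for (let head = 0; head < queue.length && order.length < maxPages; head++) {
    const current = queue[head];
    if (current === undefined) break;
    order.push(current.url);
    if (current.depth >= maxDepth) continue; // do not follow links any deeper
    for (const next of links.get(current.url) ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push({ url: next, depth: current.depth + 1 });
      }
    }
  }
  return order;
}
```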

Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
  contentType + finalUrl before HTML-to-markdown conversion, so the
  spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
  maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
  pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
  sameDomain, pathPrefix, excludePatterns parameters.

Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.

Closes #315



* fix: resolve CI lint errors in spider implementation

- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
  TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
  no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
  .resolves requires a Promise, not an async function



* fix: address CodeQL security findings in spider/link-extractor

link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
  Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
  with a strict http/https allowlist check on the resolved URL protocol.
  An allowlist is exhaustive by definition; a blocklist can always miss
  obscure schemes such as blob: or ones added in the future.
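An allowlist check on the resolved protocol could be sketched like this, using the standard URL class. `resolveHttpLink` is a hypothetical name, not the function in link-extractor.ts.

```typescript
// Resolve `href` against `base` and keep it only if the resolved protocol is
// http: or https:. Values that cannot be parsed return null.
function resolveHttpLink(href: string, base: string): string | null {
  let url: URL;
  try {
    url = new URL(href, base);
  } catch {
    return null;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;
  url.hash = ""; // drop the fragment so #section variants deduplicate
  return url.href;
}
```

Because the check runs after parsing, mixed-case schemes such as `JaVaScRiPt:` are normalised before the comparison.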

spider.ts (CodeQL #31 — incomplete multi-character sanitization):
  Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
  an indexOf-based stripTags() function. The regex stops at the first >
  which can be inside a quoted attribute value (e.g. <img alt="a>b">),
  potentially leaving partial tag content in the extracted title.
  The new implementation walks quoted attribute values explicitly so no
  tag content leaks through regardless of its internal structure.
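A quote-aware stripper along these lines is sketched below. It is a simplified, hypothetical `stripTags`: every `<` is treated as the start of a tag, and comments or CDATA sections are not handled specially.

```typescript
// Remove tags while treating ">" inside a quoted attribute value as part of
// the tag instead of its end.
function stripTags(html: string): string {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const ch = html.charAt(i);
    if (ch !== "<") {
      out += ch;
      i++;
      continue;
    }
    // Inside a tag: skip to the closing ">" while honouring quotes.
    i++;
    let quote = "";
    while (i < html.length) {
      const c = html.charAt(i);
      if (quote !== "") {
        if (c === quote) quote = "";
      } else if (c === '"' || c === "'") {
        quote = c;
      } else if (c === ">") {
        break;
      }
      i++;
    }
    i++; // step past ">" (or past the end for an unterminated tag)
  }
  return out;
}
```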



* fix: address all Copilot review comments on spider PR (#343)

- link-extractor: add word-boundary check in extractHref to prevent
  matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats
  interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
  as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
  yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
  to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
  add SpiderResponse schema, update 201 response to oneOf



* style: fix prettier formatting in spider files



---------



---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: improve chunking and search retrieval quality (#362)

Chunking improvements:
- Replace HTML comment breadcrumbs with plain-text "Context:" prefixes
  for better embedding quality
- Add configurable inter-chunk overlap (~10% default) via ChunkOptions
- Add paragraph-boundary splitting for oversized sections instead of
  hard character cuts
- Prepend document metadata (title/library/version) to chunk text
  before embedding for richer semantic representations

Search improvements:
- Implement hybrid search via Reciprocal Rank Fusion (RRF) merging
  vector and FTS5 results when both are available
- Switch FTS5 to AND-by-default logic for better precision, with
  automatic OR fallback when AND yields no results
- Add title boosting (1.5x multiplier) when query terms match the
  document title
- Make count queries lazy — skip expensive COUNT when results fit in
  one page
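Reciprocal Rank Fusion itself is compact enough to sketch. The constant k = 60 is a commonly used default and an assumption here, as is the alphabetical tie-break; `rrfFuse` is a hypothetical name and the sketch leaves out title boosting.

```typescript
// score(d) = sum over lists of 1 / (k + rank), with rank starting at 1.
function rrfFuse(lists: readonly (readonly string[])[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([id]) => id);
}
```

Items that appear in both the vector list and the FTS list accumulate two terms, which is why they tend to rise above items found by only one method.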

Includes comprehensive tests for all new features plus a retrieval
quality benchmark test suite.

https://claude.ai/code/session_01BPZhGCyVUVmPyWvcfjdv1L

* test: add end-to-end retrieval quality benchmark with sqlite-vec

Integration test that indexes an 8-doc corpus with TF-IDF embeddings
into sqlite-vec, then validates: hybrid RRF fusion, title boosting,
FTS5 AND logic, and overall top-3 precision (6/6 queries correct).

* fix: correct lazyCount docstring to match implementation

The comment claimed -1 could be returned as a sentinel, but the
function always returns a non-negative count.

* fix: remove dead try/catch in fts5Search

The catch block never recovered from anything: rows is always undefined
when prepare().all() throws, so `rows! === undefined` was always true and
the catch always rethrew. An FTS5 AND query with no matches returns []
(not an exception), so the existing rows.length === 0 OR-fallback
already handles that case.

* fix: use OR query for lazyCount when FTS5 OR fallback is triggered

When the AND query yields no rows and the OR fallback runs, baseSql and
baseParams still referred to the AND query. This caused lazyCount to
return 0 even when the OR fallback produced results. Now baseSql/baseParams
are reassigned to the OR query before counting.

* fix: use createRequire for sqlite-vec in ESM test module

The project uses "type": "module" so bare require() is not available.
Use createRequire(import.meta.url) to load the native sqlite-vec
extension, matching the pattern already used in src/db/connection.ts.

* fix: skip retrieval-quality suite when sqlite-vec is unavailable

Other test modules use a mock vector table (tests/fixtures/test-db.ts)
to avoid depending on sqlite-vec at runtime. Gate this integration
suite with describe.runIf so CI environments without the native
extension skip gracefully instead of failing.

* fix: correct hybrid search pagination and totalCount

- searchDocuments now fetches offset+limit candidates from each source
  (vector, FTS5) so RRF fusion has enough results for any page
- vectorSearch documents that limit should already include offset and
  clarifies the ANN candidate pool sizing
- Hybrid totalCount uses max of vector/FTS counts instead of the
  capped merged array length
- appendFilters uses explicit !== undefined check for minRating so
  that minRating=0 is not skipped
- splitAtParagraphs docstring corrected to match implementation (no
  single-newline fallback exists)

* test: assert actual overlap content in chunk overlap test

The existing test only checked chunk count and that the first chunk
contained a heading. Now it verifies that chunk[1] is longer than
its no-overlap counterpart and that its prefix text comes from the
end of chunk[0].

* fix: improve hybrid search pagination stability

Three fixes to searchDocuments hybrid path:

1. Over-fetch candidates by 2x for RRF fusion: vector and FTS5 lists
   overlap, so RRF deduplication can shrink the fused unique set below
   offset+limit. Now fetches (offset+limit)*2 from each source.

2. totalCount uses Math.max(mergedResults.length, vectorCount, ftsCount)
   so the union count is never smaller than the actual fused unique set.

3. maxChunksPerDocument deduplication now runs on the full ranked list
   BEFORE pagination (slice), so page sizes stay stable and later pages
   aren't short-changed by dedup removing items from within the slice.
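The third point, capping per-document hits before slicing, can be shown with a small sketch. The `Hit` shape and the `paginate` helper are assumptions for illustration and do not mirror the types in searchDocuments.

```typescript
interface Hit {
  documentId: string;
  chunkId: string;
}

// Cap hits per document on the full ranked list first, then slice out the
// requested page. Slicing first would let the cap shrink a page after the fact.
function paginate(
  ranked: readonly Hit[],
  maxPerDoc: number,
  offset: number,
  limit: number,
): Hit[] {
  const perDoc = new Map<string, number>();
  const kept = ranked.filter((hit) => {
    const seen = perDoc.get(hit.documentId) ?? 0;
    if (seen >= maxPerDoc) return false;
    perDoc.set(hit.documentId, seen + 1);
    return true;
  });
  return kept.slice(offset, offset + limit);
}
```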

* fix: address Copilot review comments on retrieval quality PR

- Fix splitAtParagraphs oversized buffer bug: extract emit() helper that
  hard-splits any buffer exceeding maxSize before pushing, ensuring all
  non-final buffers are also bounded (not just the final one)
- Increase RRF candidate over-fetch from 2x to 3x (capped at 5000) to
  reduce page under-filling after RRF deduplication of shared chunks
- Gate all console.log calls in retrieval-quality integration test behind
  process.env.DEBUG to avoid polluting CI output
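
A rough sketch of the emit() idea described in the first bullet, assuming paragraphs are separated by blank lines and that maxSize is measured in characters. The function body is illustrative only; the real splitAtParagraphs is not shown in this PR.

```typescript
// Split text at blank lines, packing paragraphs into chunks of at most maxSize
// characters. emit() hard-splits any buffer that is still too large, so every
// chunk is bounded, not only the final one.
function splitAtParagraphs(text: string, maxSize: number): string[] {
  if (!Number.isInteger(maxSize) || maxSize < 1) {
    throw new RangeError("maxSize must be a positive integer");
  }
  const chunks: string[] = [];
  const emit = (buffer: string): void => {
    for (let i = 0; i < buffer.length; i += maxSize) {
      chunks.push(buffer.slice(i, i + maxSize));
    }
  };

  let buffer = "";
  for (const paragraph of text.split(/\n{2,}/)) {
    const candidate = buffer === "" ? paragraph : `${buffer}\n\n${paragraph}`;
    if (candidate.length <= maxSize) {
      buffer = candidate; // still fits, keep accumulating
      continue;
    }
    if (buffer !== "") emit(buffer);
    buffer = paragraph; // may itself exceed maxSize; emit() handles that later
  }
  if (buffer !== "") emit(buffer);
  return chunks;
}
```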

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
…368)

* fix: address security, performance, and documentation audit findings

Security (closes #364):
- Add sanitizeFtsWord() to strip FTS5 operators, column filters, and
  wildcards before query construction — prevents FTS5 injection
- Change CORS default from ["*"] to localhost-only origins
- Encrypt webhook secrets at rest using AES-256-GCM when
  LIBSCOPE_SECRET_KEY env var is set; graceful plaintext fallback

Performance (closes #365):
- Add migration v16: idx_documents_content_hash and idx_chunks_doc_idx
  indexes — eliminates full table scans on dedup and context fetching
- Add SQLite pragmas: synchronous=NORMAL, cache_size=32MB, temp_store=MEMORY
- Remove compounded ANN over-fetch in vectorSearch (was 3x*3x = 9x, now 3x total)
- Defer ratings AVG() join to post-pagination attachRatings() batch query
- Combine getStats() 5 sequential COUNTs into a single subquery SELECT

Documentation (closes #367):
- Fix VitePress footer license: MIT → Business Source License 1.1
- Fix MCP tool count on homepage: 17 → 26
- Add docs/guide/how-search-works.md: hybrid RRF pipeline, search methods,
  scoreExplanation, and tuning options
- Add docs/guide/troubleshooting.md: common issues and solutions
- Document dedup modes and scoreExplanation in MCP tools reference
- Add new pages to VitePress sidebar

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve pre-existing CodeQL alerts (SSRF suppression, ReDoS)

- url-fetcher.ts: move codeql[js/request-forgery] suppression comment
  onto the fetch() line itself so CodeQL recognises it as intentional
  (SSRF is mitigated by validateUrl() + DNS rebinding checks above)
- confluence.ts: cap [^>]* to [^>]{0,500} in ri:attachment regex to
  prevent polynomial ReDoS on maliciously crafted Confluence markup

Both alerts were pre-existing on the development branch (detected
2026-03-04/05) and are not introduced by this PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: move CodeQL suppression comment to correct line in url-fetcher.ts

Prettier moved the inline comment inside the object literal on the
previous commit; reposition it as a line comment directly above the
fetch() call so CodeQL recognises the suppression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: replace ReDoS-prone regex in sanitizeFtsWord with index scan

/^\*+|\*+$/g triggers CodeQL js/polynomial-redos on strings with
many consecutive '*' characters. Replace with a simple while-loop
index scan that strips leading/trailing asterisks without regex
backtracking.
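
For illustration, an index scan of the kind described might look like this. The helper name is hypothetical and the rest of sanitizeFtsWord is not reproduced here.

```typescript
// Strip leading and trailing '*' characters with two index scans.
// Each character is inspected at most once, so the cost is linear.
function stripEdgeAsterisks(word: string): string {
  let start = 0;
  let end = word.length;
  while (start < end && word[start] === "*") start++;
  while (end > start && word[end - 1] === "*") end--;
  return word.slice(start, end);
}
```

Interior asterisks are left untouched, matching the behaviour of the anchored regex it replaces.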

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…h search, TTL, SDK (#371)

* feat: implement issue #366 features

- Add MMR diversity scoring to search (diversity option in SearchOptions)
- Add title boost for query-matching document titles
- Add Anthropic Claude provider for RAG (anthropic provider type)
- Add EPUB and PPTX file parsers
- Add batch search API (POST /api/v1/batch-search, max 20 requests)
- Add document TTL/auto-expiry (expires_at field, pruneExpiredDocuments)
- Add fluent LibScope SDK wrapper class
- Add schema migrations 16+17 (composite indexes, expires_at column)
- Wire all new features into exports, parser registry, REST routes
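
As background for the MMR diversity option, here is a generic sketch of Maximal Marginal Relevance reranking. It assumes each candidate carries a relevance score and an embedding, and that a `lambda` weight trades relevance against similarity to already-selected results; how the `diversity` option maps onto `lambda` in LibScope is not stated in this PR.

```typescript
interface Candidate {
  id: string;
  relevance: number; // higher is better
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Greedy MMR: repeatedly pick the candidate maximising
// lambda * relevance - (1 - lambda) * (max similarity to anything already picked).
function mmrRerank(candidates: Candidate[], lambda: number, topK: number): Candidate[] {
  const remaining = candidates.slice();
  const selected: Candidate[] = [];
  while (selected.length < topK && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    remaining.forEach((candidate, index) => {
      let maxSimilarity = 0;
      for (const picked of selected) {
        maxSimilarity = Math.max(maxSimilarity, cosine(candidate.embedding, picked.embedding));
      }
      const score = lambda * candidate.relevance - (1 - lambda) * maxSimilarity;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = index;
      }
    });
    selected.push(...remaining.splice(bestIndex, 1));
  }
  return selected;
}
```

With `lambda = 1` the ranking is purely by relevance; lower values push near-duplicate chunks further down the list.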

Closes #366

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: update docs for issue #366 features

- Add EPUB/PPTX parsers to supported formats table
- Add anthropic provider to configuration docs and README
- Add LIBSCOPE_ANTHROPIC_API_KEY env var docs
- Add MMR diversity reranking section to how-search-works guide
- Add POST /api/v1/batch-search to REST API reference
- Add programmatic-usage.md covering LibScope SDK class, TTL, batch search
- Add Programmatic Usage sidebar entry to VitePress config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@vercel
vercel bot commented Mar 6, 2026

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| libscope | Ready | Preview, Comment | Mar 6, 2026 3:29am |

)

* Prepare for release (#345)

* chore: add development branch workflow (#327)

* chore: add development branch workflow

- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
  Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: skip Vercel preview deployments on non-main branches

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: trigger check refresh

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: create-pack from local folder or URL sources (#329)

* fix: comprehensive audit fixes — security, performance, resilience, API hardening

Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints

Closes #314

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test

- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite's default bound-parameter limit before 3.32.0)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
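
The IN-clause chunking in bulk.ts could be sketched like this. The helper names are hypothetical; only the 999-parameter bound comes from the commit message.

```typescript
const MAX_SQL_PARAMS = 999;

// Split a list of bound values into groups that each fit in one statement.
function chunkParams<T>(values: readonly T[], size: number = MAX_SQL_PARAMS): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < values.length; i += size) {
    chunks.push(values.slice(i, i + size));
  }
  return chunks;
}

// Build "column IN (?, ?, ...)" with one placeholder per bound value.
function inClause(column: string, count: number): string {
  const placeholders = Array.from({ length: count }, () => "?").join(", ");
  return `${column} IN (${placeholders})`;
}
```

A caller would run one prepared statement per chunk and concatenate the rows, so no single statement exceeds the parameter limit.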

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: consolidate and fix CI/CD workflows

- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add HTML file parser for .html/.htm document indexing

Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.

Closes #317

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address CodeQL and review comments on HTML parser

- Replace regex-based tag stripping with node-html-markdown's native
  ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
  bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
  other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "fix: skip Vercel preview deployments on non-main branches"

This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.

* feat: add --from option to pack create for folder/URL sources

Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.

CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]

Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)

Closes #328

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* style: fix prettier formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add gzip support for pack files (.json.gz)

Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or magic bytes
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
- Auto-detects gzip by magic bytes even if extension is .json
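
A minimal sketch of the detection rule, assuming the file's leading bytes are already in memory. The function names are illustrative; the gzip magic number (0x1f 0x8b) is part of the gzip format itself.

```typescript
// gzip streams always begin with the two magic bytes 0x1f 0x8b.
function isGzip(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Treat a pack as gzip when either the extension or the content says so.
function detectPackEncoding(fileName: string, bytes: Uint8Array): "gzip" | "json" {
  return fileName.endsWith(".gz") || isGzip(bytes) ? "gzip" : "json";
}
```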

5 new tests covering gzip write, install, magic byte detection, and round-trip.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add progress logging and fix dedup handling in pack install

- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
  (previously it always threw ValidationError regardless of dedup mode)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: auto-generate tags during pack creation and apply on install

- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add HTML file parser for .html/.htm document indexing (#318)

* feat: add HTML file parser for .html/.htm document indexing

Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.

Closes #317

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address CodeQL and review comments on HTML parser

- Replace regex-based tag stripping with node-html-markdown's native
  ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
  bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
  other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "fix: skip Vercel preview deployments on non-main branches"

This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)

Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)

---
updated-dependencies:
- dependency-name: eslint-config-prettier
  dependency-version: 10.1.8
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2

Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).


Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2)

---
updated-dependencies:
- dependency-name: lint-staged
  dependency-version: 16.3.2
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: minor-and-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump the actions group with 5 updates

Bumps the actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |


Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)

Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)

Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](https://github.com/actions/setup-go/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-node
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-go
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3

Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)

---
updated-dependencies:
- dependency-name: "@types/node"
  dependency-version: 25.3.3
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2

Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2)

---
updated-dependencies:
- dependency-name: better-sqlite3
  dependency-version: 12.6.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): Bump eslint from 9.39.3 to 10.0.2

Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2)

---
updated-dependencies:
- dependency-name: eslint
  dependency-version: 10.0.2
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Robert DeRienzo <rderienzo@voloridge.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add passthrough LLM mode for ask-question tool (#335)

* feat: add passthrough LLM mode for ask-question tool

Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).

- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
  retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
  directly; falls through to existing LLM path otherwise

Enable via config:  { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: format config.ts and include passthrough in provider override

- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
  not spread into overrides.llm.provider

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address 9 audit findings from issue #332 (#333)

* fix: address 9 audit findings from issue #332

Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
  mutation with per-request undici Agent to eliminate TLS race condition (#1)
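
For reference, a constant-time API key check with Node's `crypto.timingSafeEqual` might look like the sketch below. Hashing both sides first is one common way to satisfy the equal-length requirement; whether the middleware does this is an assumption, and `apiKeysMatch` is a hypothetical name.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// timingSafeEqual throws if the buffers differ in length, so both keys are
// hashed to fixed-size SHA-256 digests before the comparison.
function apiKeysMatch(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided, "utf8").digest();
  const b = createHash("sha256").update(expected, "utf8").digest();
  return timingSafeEqual(a, b);
}
```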

Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
  instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
  the pre-joined aggregate in FTS and LIKE search paths (#4)

Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
  per-document getDocumentTags() calls with a single getDocumentTagsBatch()
  query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
  repeated on every request (#6)
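
A TTL cache of the kind added to loadConfig() can be sketched as below. The injectable clock and the `createTtlCache` name are assumptions made so the example is testable; the real module exposes invalidateConfigCache() for the same purpose as `invalidate()` here.

```typescript
interface TtlCache<T> {
  get(): T;
  invalidate(): void;
}

// Cache the result of load() for ttlMs milliseconds.
function createTtlCache<T>(
  load: () => T,
  ttlMs: number,
  now: () => number = Date.now,
): TtlCache<T> {
  let entry: { value: T; expiresAt: number } | undefined;
  return {
    get(): T {
      const current = now();
      if (entry === undefined || current >= entry.expiresAt) {
        entry = { value: load(), expiresAt: current + ttlMs };
      }
      return entry.value;
    },
    invalidate(): void {
      entry = undefined;
    },
  };
}
```

Tests that change environment variables between calls need to invalidate the cache first, which is why later commits add invalidateConfigCache() calls to the config tests.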

Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
  the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
  Date objects back to ISO-8601 strings (#9)

Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
  add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
  and correct coverage threshold (80% → actual 75%/74%)

Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
  tests so TTL cache doesn't return stale results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: remove unused warnIfTlsBypassMissing function

Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: update tests for config cache and retry semantics

- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
  that were failing because the 30s TTL cache introduced in the config
  module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
  retries = 3 total calls (loop is attempt <= maxRetries)
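
The retry arithmetic can be seen in a stripped-down loop. This is a synchronous sketch without backoff, not the http-utils implementation; only the `attempt <= maxRetries` bound is taken from the commit message.

```typescript
// attempt runs from 0 to maxRetries inclusive: one initial call plus
// maxRetries retries, i.e. maxRetries + 1 calls when every attempt fails.
function withRetries<T>(fn: () => T, maxRetries: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```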

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: CLI logging improvements and pack installation performance (#330) (#336)

- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
  bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
  `createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.

- Update `setupLogging` to default to "silent" in CLI mode (pretty
  reporter handles user-facing output). Verbose/`--log-level` flags still
  route to structured JSON pino logs. Fix duplicate `initLogger` calls in
  onenote connect/disconnect commands to use `setupLogging` consistently.

- Update `installPack` in `packs.ts` to support batch embedding and
  progress reporting:
  - New `InstallOptions` interface with `batchSize`, `resumeFrom`,
    `onProgress` fields
  - Batch documents: chunk all → single `provider.embedBatch` call per
    batch → single SQLite transaction per batch (avoids N embedding calls)
  - `resumeFrom` skips the first N documents (enables partial install
    resume after failure)
  - `InstallResult` now includes `errors` count
  - Add `--batch-size` and `--resume-from` CLI options to `pack install`

- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
  SilentReporter, isVerbose, env var detection); extended
  `tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
  batch efficiency, resumeFrom, embedBatch failure handling.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Claude/fix issue 331 s1qzu (#338)

* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)

Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).


Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10)

Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5)

---
updated-dependencies:
- dependency-name: "@hono/node-server"
  dependency-version: 1.19.10
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: hono
  dependency-version: 4.12.5
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)

* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)

**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.

An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.
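
To make the pinning idea concrete, here is a sketch of a private-range check and the resulting request options. It covers IPv4 only and the range list is not exhaustive; the function names and the explicit Host header are assumptions, while `hostname: pinnedIp` and `servername` come from the description above.

```typescript
function parseIpv4(ip: string): number[] | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  const octets: number[] = [];
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const value = Number(part);
    if (value > 255) return null;
    octets.push(value);
  }
  return octets;
}

// True for loopback, link-local, and RFC 1918 private ranges.
// Unparseable input is rejected (fail closed).
function isBlockedIpv4(ip: string): boolean {
  const octets = parseIpv4(ip);
  if (octets === null) return true;
  const a = octets[0] ?? 0;
  const b = octets[1] ?? 0;
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

interface PinnedRequestOptions {
  hostname: string; // the validated IP, so no second DNS lookup happens
  servername: string; // original hostname, used for TLS SNI
  path: string;
  headers: { Host: string };
}

function pinnedRequestOptions(
  originalHostname: string,
  path: string,
  pinnedIp: string,
): PinnedRequestOptions {
  if (isBlockedIpv4(pinnedIp)) {
    throw new Error(`refusing to connect to ${pinnedIp}`);
  }
  return {
    hostname: pinnedIp,
    servername: originalHostname,
    path,
    headers: { Host: originalHostname }, // assumption: keep virtual hosting working
  };
}
```

Because the socket connects to the IP that was validated, a DNS answer that changes between the check and the request no longer matters.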

**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]* —
two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine must try
all O(n²) splits before concluding no match (polynomial backtracking).

Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* — a negative
lookahead prevents the quantifier from overlapping with the literal,
making backtracking structurally impossible.
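
A small before/after comparison of the rewrite, using "toc" as the literal. The `<ac:structured-macro` prefix and the variable names are illustrative; the two quantifier shapes are the ones described above.

```typescript
// Original shape: two [^>]* runs around the literal, so they can overlap.
const overlapping = /<ac:structured-macro[^>]*ac:name="toc"[^>]*>/;

// Rewritten shape: the leading run refuses to step over the literal itself,
// so there is only one place where the literal can start matching.
const guarded = /<ac:structured-macro(?:(?!ac:name="toc")[^>])*ac:name="toc"[^>]*>/;

const tocMacro = '<ac:structured-macro ac:macro-id="1" ac:name="toc" />';
const codeMacro = '<ac:structured-macro ac:name="code">';

// Both patterns accept and reject the same inputs; only the search cost differs.
const sameOnToc = overlapping.test(tocMacro) === guarded.test(tocMacro);
const sameOnCode = overlapping.test(codeMacro) === guarded.test(codeMacro);
```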

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers

The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?)
<\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels
(those were not part of the original security fix diff). These have the same
O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*?
scans O(n - pos) chars per attempt, totalling O(n²).

Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)

* feat: concurrent pack installation and -v verbose shorthand (issue #330)

Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.

Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
  calls run simultaneously; embedding is I/O-bound so parallelism directly
  reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
  semaphore-based scheduler to run up to `concurrency` embedBatch calls
  concurrently while inserting completed batches in-order (SQLite requires
  serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
  without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
  multiple embedBatch calls per install, concurrency limit enforcement,
  incremental progress reporting, and partial-failure error counting
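
The scheduling idea can be sketched with a small worker pool, used here as a stand-in for the semaphore-based scheduler. `mapWithConcurrency` is a hypothetical helper: it caps the number of in-flight calls and returns results in input order, which is the property the in-order SQLite inserts rely on.

```typescript
// Run worker() over items with at most `concurrency` calls in flight.
// Results are stored by index, so output order matches input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError("concurrency must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;
  const runWorker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  };
  const poolSize = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: poolSize }, () => runWorker()));
  return results;
}
```

Rejecting a non-positive concurrency up front matters: a pool of size zero would never start any work and the install would never complete.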

https://claude.ai/code/session_019hzhbEgV1ysnGmFVXBWzkZ

* fix: address all 4 Copilot review comments on PR #339

- Validate batchSize, concurrency, resumeFrom at the start of installPack
  and throw ValidationError for invalid values (comments 3 & 4). Concurrency
  <= 0 would silently hang the semaphore indefinitely.

- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
  error before ever calling installPack (comment 3).

- Lazy chunking: pre-chunking all documents upfront held chunks for the
  entire pack in memory simultaneously. Batches now store only the raw
  documents; resolveBatch() chunks on demand right before embedBatch
  is called, so only one batch's worth of chunks is in memory at a time
  (comment 2).

- Wrap provider.embedBatch() in try/catch so synchronous throws are
  converted to rejected Promises rather than escaping scheduleNext() and
  leaving the outer Promise permanently pending (comment 1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>

* fix: address 7 pre-release bugs from audit (#342) (#344)

- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)
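
The JSON.parse guards follow a pattern along these lines. `parseJsonOr` and `isStringArray` are hypothetical helpers, not the actual rowToWebhook code.

```typescript
// Parse a JSON column defensively: corrupted or unexpected data yields the
// fallback instead of throwing while a row is being mapped.
function parseJsonOr<T>(
  raw: string | null,
  fallback: T,
  isValid: (value: unknown) => value is T,
): T {
  if (raw === null) return fallback;
  try {
    const parsed: unknown = JSON.parse(raw);
    return isValid(parsed) ? parsed : fallback;
  } catch {
    return fallback;
  }
}

const isStringArray = (value: unknown): value is string[] =>
  Array.isArray(value) && value.every((item) => typeof item === "string");
```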

Closes #342

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)

* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)

Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.

New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
  URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
  excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
  10-min total timeout, robots.txt (User-agent: * and libscope), and
  1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
  dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
  domain + path + pattern filtering, cycle detection, robots.txt, partial
  failure recovery, stats, and abortReason reporting.

Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
  contentType + finalUrl before HTML-to-markdown conversion, so the
  spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
  maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
  pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
  sameDomain, pathPrefix, excludePatterns parameters.

Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.
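
The crawl order and hard limits can be illustrated with a synchronous BFS over an in-memory link graph. Fetching, robots.txt, delays, and filters are left out; `crawlOrder`, `linksOf`, and the convention that the seed is depth 0 are assumptions for the example, while the 200-page and depth-5 caps come from the description above.

```typescript
const HARD_MAX_PAGES = 200;
const HARD_MAX_DEPTH = 5;

interface CrawlLimits {
  maxPages: number;
  maxDepth: number;
}

// Breadth-first traversal; caller-supplied limits are clamped to the hard caps.
function crawlOrder(
  seed: string,
  linksOf: (url: string) => string[],
  limits: CrawlLimits,
): string[] {
  const maxPages = Math.min(limits.maxPages, HARD_MAX_PAGES);
  const maxDepth = Math.min(limits.maxDepth, HARD_MAX_DEPTH);
  const visited = new Set<string>([seed]);
  const queue: Array<{ url: string; depth: number }> = [{ url: seed, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0 && order.length < maxPages) {
    const current = queue.shift();
    if (current === undefined) break;
    order.push(current.url);
    if (current.depth >= maxDepth) continue; // do not follow links any deeper
    for (const link of linksOf(current.url)) {
      if (!visited.has(link)) {
        visited.add(link); // cycle detection
        queue.push({ url: link, depth: current.depth + 1 });
      }
    }
  }
  return order;
}
```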

Closes #315

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve CI lint errors in spider implementation

- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
  TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
  no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
  .resolves requires a Promise, not an async function

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address CodeQL security findings in spider/link-extractor

link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
  Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
  with a strict http/https allowlist check on the resolved URL protocol.
  An allowlist is exhaustive by definition; a blocklist will always miss
  obscure schemes such as blob: and any future additions.
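
A minimal sketch of the allowlist idea, assuming the check runs on the protocol of the already-resolved URL. The helper name `resolveHttpLink` is hypothetical and is not taken from link-extractor.ts.

```typescript
// Only these protocols are accepted; everything else is rejected without being enumerated.
const ALLOWED_PROTOCOLS = new Set(["http:", "https:"]);

// Hypothetical helper: resolve an href against its page URL and keep it only if it is http(s).
function resolveHttpLink(href: string, baseUrl: string): string | null {
  let resolved: URL;
  try {
    resolved = new URL(href, baseUrl);
  } catch {
    return null; // unparseable href
  }
  if (!ALLOWED_PROTOCOLS.has(resolved.protocol)) return null;
  resolved.hash = ""; // drop the fragment, as the link extractor description mentions
  return resolved.href;
}
```

Checking `resolved.protocol` rather than the raw string means case and leading whitespace are already normalised by the URL parser.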

spider.ts (CodeQL #31 — incomplete multi-character sanitization):
  Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
  an indexOf-based stripTags() function. The regex stops at the first >
  which can be inside a quoted attribute value (e.g. <img alt="a>b">),
  potentially leaving partial tag content in the extracted title.
  The new implementation walks quoted attribute values explicitly so no
  tag content leaks through regardless of its internal structure.
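
The difference can be sketched as follows. This is a hypothetical re-creation of a quote-aware stripper, not the project's actual stripTags(); it assumes that any `<` starts a tag and that an unterminated tag swallows the rest of the input.

```typescript
// Hypothetical quote-aware tag stripper built on indexOf and a character walk.
function stripTags(html: string): string {
  let out = "";
  let i = 0;
  while (i < html.length) {
    const open = html.indexOf("<", i);
    if (open === -1) {
      out += html.slice(i);
      break;
    }
    out += html.slice(i, open);
    // Walk to the closing '>' while skipping over quoted attribute values.
    let j = open + 1;
    let quote: string | null = null;
    while (j < html.length) {
      const ch = html[j];
      if (quote !== null) {
        if (ch === quote) quote = null; // end of the quoted value
      } else if (ch === '"' || ch === "'") {
        quote = ch; // start of a quoted value
      } else if (ch === ">") {
        break; // real end of the tag
      }
      j++;
    }
    i = j + 1; // an unterminated tag consumes the remaining input
  }
  return out;
}
```

For the input `A<img alt="a>b">B` this walk removes the whole tag, whereas `/<[^>]+>/g` stops at the `>` inside the attribute value.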

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address all Copilot review comments on spider PR (#343)

- link-extractor: add word-boundary check in extractHref to prevent
  matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats
  interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
  as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
  yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
  to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
  add SpiderResponse schema, update 201 response to oneOf
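
The word-boundary check from the first bullet could look roughly like this. `findHrefAttribute` is a hypothetical helper that inspects the source of a single tag; it does not skip quoted attribute values, which a fuller implementation would need to do.

```typescript
const WHITESPACE = " \t\n\r\f";

// Hypothetical helper: index of a real `href` attribute name in one tag's source, or -1.
function findHrefAttribute(tag: string): number {
  const lower = tag.toLowerCase();
  let from = 0;
  for (;;) {
    const at = lower.indexOf("href", from);
    if (at === -1) return -1;
    // The name must be preceded by whitespace, which rules out data-href and aria-href.
    const precededBySpace = at > 0 && WHITESPACE.includes(lower.charAt(at - 1));
    // After optional whitespace the next character must be '='.
    let next = at + 4;
    while (next < lower.length && WHITESPACE.includes(lower.charAt(next))) next++;
    if (precededBySpace && lower.charAt(next) === "=") return at;
    from = at + 4;
  }
}
```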

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix prettier formatting in spider files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bring main up to date with development for v1.3.0 (#353)

* chore: add development branch workflow (#327)

* chore: add development branch workflow

- Add merge-gate.yml: enforces only 'development' can merge into main
- Update CI/CodeQL/Docker workflows to run on both main and development
- Update dependabot.yml: target-branch set to development for all ecosystems
- Update copilot-instructions.md: document branch workflow convention
- Rulesets configured: Main (requires merge-gate + squash-only),
  Development (requires CI status checks + PR)
- Default branch set to development
- All open PRs retargeted to development

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: skip Vercel preview deployments on non-main branches

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: trigger check refresh

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: create-pack from local folder or URL sources (#329)

* fix: comprehensive audit fixes — security, performance, resilience, API hardening

Addresses findings from issue #314:
- SSRF protection for webhook URLs (CRITICAL)
- Scrub secrets from exports
- Stored XSS prevention on document URL
- O(n²) and N+1 fixes in bulk operations
- Rate limit cache eviction improvement
- SSE backpressure handling
- Replace raw Error() with typed errors
- Fetch timeouts on all network calls
- Input validation on API parameters
- Search query length limit
- Silent catch block logging
- DNS rebinding check fix
- N+1 in Slack user resolution
- Pagination on webhook/search list endpoints

Closes #314

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review comments: fix SSE listener leak, preserve caller signals, add SSRF validation to webhook test, chunk SQL params, use dynamic import in test

- SSE backpressure: create single disconnect promise, race against drain (no listener accumulation)
- http-utils.ts/onenote.ts: use AbortSignal.any() to combine caller signal with timeout
- Webhook test endpoint: validate URL with validateWebhookUrlSsrf before fetch
- bulk.ts: chunk IN clause to 999 params max (SQLite limit)
- webhooks.test.ts: dynamic import after vi.mock() for deterministic DNS mocking
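
A sketch of the IN-clause chunking from the bulk.ts item above. The 999 limit is the one quoted in the text; the function names, table, and columns are placeholders.

```typescript
const SQLITE_MAX_PARAMS = 999; // limit quoted in the commit text

// Split a list into consecutive slices of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build one parameterised statement per chunk; table and column names are placeholders.
function buildInQueries(ids: readonly string[]): Array<{ sql: string; params: string[] }> {
  return chunk(ids, SQLITE_MAX_PARAMS).map((part) => ({
    sql: `SELECT id, title FROM documents WHERE id IN (${part.map(() => "?").join(", ")})`,
    params: part,
  }));
}
```

Each returned statement carries its own parameter list, so the caller can run them one after another and concatenate the rows.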

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ci: consolidate and fix CI/CD workflows

- Merge lint + typecheck into single job (saves one npm ci)
- Add concurrency groups to ci, docker, codeql (cancel stale runs)
- Add dependency-review-action on PRs (block vulnerable deps)
- Add workflow_call trigger to ci.yml for reusability
- Remove duplicate npm publish from release.yml (release-please owns it)
- Fix SDK paths: sdk-go/ → sdk/go/, sdk-python/ → sdk/python/
- Fix Dependabot paths to match actual SDK directories
- Add github-actions ecosystem to Dependabot (keep actions up to date)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add HTML file parser for .html/.htm document indexing

Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.

Closes #317

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address CodeQL and review comments on HTML parser

- Replace regex-based tag stripping with node-html-markdown's native
  ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
  bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
  other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "fix: skip Vercel preview deployments on non-main branches"

This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.

* feat: add --from option to pack create for folder/URL sources

Adds createPackFromSource() that builds packs directly from local
folders, files, or URLs without requiring database interaction.

CLI: libscope pack create --name my-pack --from ~/docs/ [--extensions .md,.html] [--exclude pattern] [--no-recursive]

Features:
- Walks directories recursively using registered parsers
- Fetches URLs via fetchAndConvert
- Supports extension filtering, exclude patterns, progress callback
- Multiple --from sources supported
- Output format identical to DB export (pack install works unchanged)

Closes #328

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* style: fix prettier formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add gzip support for pack files (.json.gz)

Pack files can now be compressed with gzip for smaller distribution:
- writePackFile/readPackFile auto-detect gzip by extension or magic bytes
- installPack accepts both .json and .json.gz files
- createPackFromSource defaults to .json.gz output (source packs can be large)
- createPack (DB export) still defaults to .json
- Auto-detects gzip by magic bytes even if extension is .json

5 new tests covering gzip write, install, magic byte detection, and round-trip.
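
The detection rule can be sketched without touching the filesystem. gzip streams begin with the bytes 0x1f 0x8b (RFC 1952); the helper names here are hypothetical and not taken from packs.ts.

```typescript
// True when the buffer starts with the gzip magic bytes 0x1f 0x8b.
function hasGzipMagic(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Hypothetical decision helper: treat a file as gzip if either signal says so.
function looksGzipped(filePath: string, head: Uint8Array): boolean {
  return filePath.toLowerCase().endsWith(".gz") || hasGzipMagic(head);
}
```

Checking the magic bytes as well as the extension is what lets a compressed pack that was renamed to `.json` still be read.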

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add progress logging and fix dedup handling in pack install

- Log each document as it's indexed so large installs show progress
- Change pack install to use dedup: 'skip' for graceful duplicate handling
- Make title+content_length dedup check respect the dedup mode setting
  (previously it always threw ValidationError regardless of dedup mode)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: auto-generate tags during pack creation and apply on install

- Export tokenize() and add suggestTagsFromText() in tags.ts for DB-free tag generation
- createPackFromSource() now auto-generates tags per document via TF-IDF
- installPack() applies doc.tags via addTagsToDocument() after indexing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add HTML file parser for .html/.htm document indexing (#318)

* feat: add HTML file parser for .html/.htm document indexing

Adds HtmlParser using the existing node-html-markdown dependency.
Strips <script>, <style>, and <nav> tags before conversion.
Registered for .html and .htm extensions.
Includes 12 tests covering conversion, tag stripping, edge cases.

Closes #317

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: address CodeQL and review comments on HTML parser

- Replace regex-based tag stripping with node-html-markdown's native
  ignore option — eliminates all 3 CodeQL alerts (incomplete sanitization,
  bad HTML filtering regexp)
- Wrap translate() in try/catch, throw ValidationError (consistent with
  other parsers)
- Use trimEnd() instead of trim() to preserve leading indentation
- Reuse single NHM instance for efficiency

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "fix: skip Vercel preview deployments on non-main branches"

This reverts commit eb481870c883f77278291b72245b1ca0b890a78c.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* build(deps-dev): Bump eslint-config-prettier from 9.1.2 to 10.1.8 (#325)

Bumps [eslint-config-prettier](https://github.com/prettier/eslint-config-prettier) from 9.1.2 to 10.1.8.
- [Release notes](https://github.com/prettier/eslint-config-prettier/releases)
- [Changelog](https://github.com/prettier/eslint-config-prettier/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prettier/eslint-config-prettier/commits/v10.1.8)

---
updated-dependencies:
- dependency-name: eslint-config-prettier
  dependency-version: 10.1.8
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): Bump lint-staged from 16.3.1 to 16.3.2

Bumps the minor-and-patch group with 1 update: [lint-staged](https://github.com/lint-staged/lint-staged).


Updates `lint-staged` from 16.3.1 to 16.3.2
- [Release notes](https://github.com/lint-staged/lint-staged/releases)
- [Changelog](https://github.com/lint-staged/lint-staged/blob/main/CHANGELOG.md)
- [Commits](https://github.com/lint-staged/lint-staged/compare/v16.3.1...v16.3.2)

---
updated-dependencies:
- dependency-name: lint-staged
  dependency-version: 16.3.2
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: minor-and-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump the actions group with 5 updates

Bumps the actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [actions/setup-node](https://github.com/actions/setup-node) | `4` | `6` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4` | `7` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/setup-go](https://github.com/actions/setup-go) | `5` | `6` |


Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

Updates `actions/setup-node` from 4 to 6
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](https://github.com/actions/setup-node/compare/v4...v6)

Updates `actions/upload-artifact` from 4 to 7
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)

Updates `actions/setup-go` from 5 to 6
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](https://github.com/actions/setup-go/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-node
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/setup-go
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): Bump @types/node from 22.19.13 to 25.3.3

Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 22.19.13 to 25.3.3.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)

---
updated-dependencies:
- dependency-name: "@types/node"
  dependency-version: 25.3.3
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): Bump better-sqlite3 from 11.10.0 to 12.6.2

Bumps [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) from 11.10.0 to 12.6.2.
- [Release notes](https://github.com/WiseLibs/better-sqlite3/releases)
- [Commits](https://github.com/WiseLibs/better-sqlite3/compare/v11.10.0...v12.6.2)

---
updated-dependencies:
- dependency-name: better-sqlite3
  dependency-version: 12.6.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps-dev): Bump eslint from 9.39.3 to 10.0.2

Bumps [eslint](https://github.com/eslint/eslint) from 9.39.3 to 10.0.2.
- [Release notes](https://github.com/eslint/eslint/releases)
- [Commits](https://github.com/eslint/eslint/compare/v9.39.3...v10.0.2)

---
updated-dependencies:
- dependency-name: eslint
  dependency-version: 10.0.2
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Robert DeRienzo <rderienzo@voloridge.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: add passthrough LLM mode for ask-question tool (#335)

* feat: add passthrough LLM mode for ask-question tool

Adds llm.provider = "passthrough" so the ask-question MCP tool returns
retrieved context chunks directly to the calling LLM instead of requiring
a separate OpenAI/Ollama provider. This is the natural design for MCP tools
where the client already has an LLM (e.g. Claude Code).

- config.ts: add "passthrough" to llm.provider union type and env var handling
- rag.ts: add isPassthroughMode() helper and getContextForQuestion() which
  retrieves and formats context without an LLM call
- mcp/server.ts: ask-question checks passthrough first and returns context
  directly; falls through to existing LLM path otherwise

Enable via config:  { "llm": { "provider": "passthrough" } }
Enable via env var: LIBSCOPE_LLM_PROVIDER=passthrough

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: format config.ts and include passthrough in provider override

- Reformat long if-condition to satisfy prettier (printWidth: 100)
- Fix logic bug: passthrough provider was checked in outer condition but
  not spread into overrides.llm.provider

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address 9 audit findings from issue #332 (#333)

* fix: address 9 audit findings from issue #332

Security
- middleware: use timingSafeEqual for API key comparison (#2)
- url-fetcher, http-utils: replace process-wide NODE_TLS_REJECT_UNAUTHORIZED
  mutation with per-request undici Agent to eliminate TLS race condition (#1)
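
A minimal sketch of a constant-time API key comparison using Node's `crypto.timingSafeEqual`. Hashing both values first is an assumption made here so that the buffers always have equal length (the function throws otherwise); the commit does not say how the middleware handles length differences, and `apiKeysMatch` is a hypothetical name.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both sides to fixed-length digests, then compare them in constant time.
function apiKeysMatch(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided, "utf8").digest();
  const b = createHash("sha256").update(expected, "utf8").digest();
  return timingSafeEqual(a, b);
}
```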

Bugs
- indexing: re-throw unexpected embedding errors so transaction rolls back
  instead of silently committing chunks with no vector (#3)
- search: replace correlated minRating subquery with avg_r.avg_rating from
  the pre-joined aggregate in FTS and LIKE search paths (#4)

Performance
- bulk: replace O(n²) docs.find() loops with pre-built Map; replace
  per-document getDocumentTags() calls with a single getDocumentTagsBatch()
  query (#5)
- config: add 30-second TTL cache to loadConfig() so disk reads are not
  repeated on every request (#6)
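
The Map-based lookup from item (#5) can be sketched like this. The `Doc` shape and both function names are illustrative only.

```typescript
interface Doc {
  id: string;
  title: string;
}

// Build the index once: O(n).
function indexById(docs: readonly Doc[]): Map<string, Doc> {
  const byId = new Map<string, Doc>();
  for (const doc of docs) byId.set(doc.id, doc);
  return byId;
}

// Each lookup is O(1), so the whole pass is O(n + m) instead of O(n * m) with docs.find().
function titlesFor(ids: readonly string[], docs: readonly Doc[]): string[] {
  const byId = indexById(docs);
  const titles: string[] = [];
  for (const id of ids) {
    const doc = byId.get(id);
    if (doc !== undefined) titles.push(doc.title);
  }
  return titles;
}
```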

Code quality
- routes: check res.write() return value to handle SSE backpressure (#7)
- reindex: delegate to schema.createVectorTable() instead of duplicating
  the vec0 DDL inline (#8)
- obsidian: replace hand-rolled parseSimpleYaml() with js-yaml, normalise
  Date objects back to ISO-8601 strings (#9)

Docs
- agents.md: expand architecture tree to include src/api/ and src/connectors/;
  add Security Patterns section with correct undici examples
- CONTRIBUTING.md: fix check-suite command (npm test → npm run test:coverage)
  and correct coverage threshold (80% → actual 75%/74%)

Tests
- bulk.test: add dateFrom/dateTo filter coverage
- config.test: add cache-hit test; call invalidateConfigCache() before env-var
  tests so TTL cache doesn't return stale results
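
The interaction between the 30-second cache and the tests is easier to see in a sketch. This is a generic TTL wrapper with an injectable clock, not the project's loadConfig(); `createCachedLoader` and its parameters are hypothetical.

```typescript
const CONFIG_TTL_MS = 30000; // 30 seconds, as stated in the commit text

interface CacheEntry<T> {
  value: T;
  loadedAt: number;
}

// Wrap an expensive loader so repeated calls within the TTL reuse the last value.
function createCachedLoader<T>(
  load: () => T,
  now: () => number = Date.now,
  ttlMs: number = CONFIG_TTL_MS,
) {
  let entry: CacheEntry<T> | null = null;
  return {
    get(): T {
      const t = now();
      if (entry === null || t - entry.loadedAt >= ttlMs) {
        entry = { value: load(), loadedAt: t }; // miss or expired: reload
      }
      return entry.value;
    },
    // Tests that change the environment call this so the next get() reloads.
    invalidate(): void {
      entry = null;
    },
  };
}
```

Without the `invalidate()` call, a test that sets an environment variable after another test has already loaded the config would see the cached value until the TTL expires.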

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: remove unused warnIfTlsBypassMissing function

Dead code after conflict resolution chose the per-request undici Agent
approach (which doesn't need a warning about NODE_TLS_REJECT_UNAUTHORIZED).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: update tests for config cache and retry semantics

- Add invalidateConfigCache() before loadConfig() in 4 env-override tests
  that were failing because the 30s TTL cache introduced in the config
  module was returning stale results from the previous test's cache entry
- Update http-utils retry assertion: maxRetries=2 means 1 initial + 2
  retries = 3 total calls (loop is attempt <= maxRetries)
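
The call count follows directly from the loop bound. A synchronous sketch, assuming the loop shape quoted above (`attempt <= maxRetries`); the real http-utils code is asynchronous and `withRetries` is a hypothetical name.

```typescript
// Runs the operation once, then up to maxRetries more times if it throws.
function withRetries<T>(operation: (attempt: number) => T, maxRetries: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return operation(attempt);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

With `maxRetries = 2` the loop body runs for attempts 0, 1, and 2, which is the three calls the updated assertion expects.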

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: CLI logging improvements and pack installation performance (#330) (#336)

- Add `src/cli/reporter.ts`: PrettyReporter (ANSI colors + \r progress
  bar), SilentReporter (no-op for verbose/JSON mode), `isVerbose()` and
  `createReporter()` factory. `LIBSCOPE_VERBOSE=1` env var alternative.

- Update `setupLogging` to default to "silent" in CLI mode (pretty
  reporter handles user-facing output). Verbose/`--log-level` flags still
  route to structured JSON pino logs. Fix duplicate `initLogger` calls in
  onenote connect/disconnect commands to use `setupLogging` consistently.

- Update `installPack` in `packs.ts` to support batch embedding and
  progress reporting:
  - New `InstallOptions` interface with `batchSize`, `resumeFrom`,
    `onProgress` fields
  - Batch documents: chunk all → single `provider.embedBatch` call per
    batch → single SQLite transaction per batch (avoids N embedding calls)
  - `resumeFrom` skips the first N documents (enables partial install
    resume after failure)
  - `InstallResult` now includes `errors` count
  - Add `--batch-size` and `--resume-from` CLI options to `pack install`

- Tests: `tests/unit/reporter.test.ts` (17 tests covering PrettyReporter,
  SilentReporter, isVerbose, env var detection); extended
  `tests/unit/packs.test.ts` with 7 new tests for progress callbacks,
  batch efficiency, resumeFrom, embedBatch failure handling.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Claude/fix issue 331 s1qzu (#338)

* build(deps): Bump the npm_and_yarn group across 1 directory with 2 updates (#337)

Bumps the npm_and_yarn group with 2 updates in the / directory: [@hono/node-server](https://github.com/honojs/node-server) and [hono](https://github.com/honojs/hono).


Updates `@hono/node-server` from 1.19.9 to 1.19.10
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](https://github.com/honojs/node-server/compare/v1.19.9...v1.19.10)

Updates `hono` from 4.12.3 to 4.12.5
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.3...v4.12.5)

---
updated-dependencies:
- dependency-name: "@hono/node-server"
  dependency-version: 1.19.10
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: hono
  dependency-version: 4.12.5
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (#341)

* fix: eliminate SSRF TOCTOU and ReDoS vulnerabilities (closes #340)

**SSRF (CWE-918 — CodeQL alert #28)**
Replace the two-step validate-then-fetch approach in url-fetcher.ts with
IP-pinned requests using node:http / node:https directly. validateUrl()
resolves DNS and checks for private IPs, then the validated IP is passed
straight to the TCP connection (hostname: pinnedIp, servername: original
hostname for TLS SNI). There is now zero TOCTOU window between validation
and the actual network request. The redundant post-fetch DNS rebinding
check and the env-var-based TLS bypass wrapper are removed; rejectUnauthorized
is now passed directly to the request options.
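
As a rough illustration of the pinning described above, the request options might be assembled like this. The `hostname`, `servername`, and `rejectUnauthorized` fields follow the text; the `Host` header, the `PinnedRequestOptions` type, and the `buildPinnedOptions` name are assumptions, and the sketch only builds the options object without opening a connection.

```typescript
interface PinnedRequestOptions {
  hostname: string;
  servername: string;
  port: number;
  path: string;
  headers: Record<string, string>;
  rejectUnauthorized: boolean;
}

// Hypothetical builder: connect to the validated IP while presenting the original hostname.
function buildPinnedOptions(
  target: URL,
  pinnedIp: string,
  rejectUnauthorized: boolean = true,
): PinnedRequestOptions {
  const defaultPort = target.protocol === "https:" ? 443 : 80;
  return {
    hostname: pinnedIp, // TCP connection goes to the address that was validated
    servername: target.hostname, // original hostname for TLS SNI, per the text above
    port: target.port === "" ? defaultPort : Number(target.port),
    path: target.pathname + target.search,
    headers: { Host: target.host }, // assumption: keeps name-based virtual hosting working
    rejectUnauthorized,
  };
}
```

Because the IP passed to the socket is the same one that passed validation, a second DNS lookup never happens between the check and the request.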

An internal _setRequestImpl hook is exported for unit test injection so
tests can stub responses without touching node:http / node:https.
Tests are updated accordingly.

**ReDoS (CWE-1333 — CodeQL alert #24)**
Five regexes in confluence.ts used the pattern [^>]*ac:name="X"[^>]* —
two [^>]* quantifiers around a fixed literal. For input that contains a
large attribute blob without the target ac:name value, the engine must try
all O(n²) splits before concluding no match (polynomial backtracking).

Fix: rewrite the leading [^>]* as (?:(?!ac:name="X")[^>])* — a negative
lookahead prevents the quantifier from overlapping with the literal,
making backtracking structurally impossible.
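
The rewrite can be checked for equivalence on small inputs. In this sketch "toc" stands in for X, and anchoring both patterns on `<ac:structured-macro` is an assumption, since the commit only quotes the inner part of the five regexes. The code compares match results only and makes no timing claim.

```typescript
// Original shape: two [^>]* quantifiers around the literal.
const originalPattern = /<ac:structured-macro[^>]*ac:name="toc"[^>]*>/;
// Rewritten shape: the leading quantifier may not step onto the start of the literal.
const temperedPattern = /<ac:structured-macro(?:(?!ac:name="toc")[^>])*ac:name="toc"[^>]*>/;

// True when both patterns agree on whether the input contains a matching tag.
function sameVerdict(input: string): boolean {
  return originalPattern.test(input) === temperedPattern.test(input);
}
```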

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: replace remaining polynomial regexes in confluence.ts with indexOf helpers

The conflict resolution left the original `/<ac:structured-macro [^>]*>([\s\S]*?)
<\/ac:structured-macro>/gi` patterns in place for code blocks and info/tip panels
(those were not part of the original security fix diff). These have the same
O(n²) backtracking problem: with k opening tags and no closing tags, [\s\S]*?
scans O(n - pos) chars per attempt, totalling O(n²).

Replace the entire convertConfluenceStorage function with the indexOf-based
approach (replaceStructuredMacros / replaceTagPairs / extractTagContent helpers)
that eliminates all regex-based tag parsing. Also add removeSelfClosingMacros
to handle the self-closing TOC case without regex, since the previous self-closing
fix still used a [^>]*ac:name="toc"[^>]* pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: concurrent pack installation and -v verbose shorthand (issue #330) (#339)

* feat: concurrent pack installation and -v verbose shorthand (issue #330)

Add concurrent batch embedding to installPack for significant performance
improvement on large packs, plus CLI ergonomics improvements.

Key changes:
- `InstallOptions.concurrency` (default: 4): controls how many embedBatch
  calls run simultaneously; embedding is I/O-bound so parallelism directly
  reduces wall-clock installation time
- Refactor installPack to pre-chunk all documents upfront, then use a
  semaphore-based scheduler to run up to `concurrency` embedBatch calls
  concurrently while inserting completed batches in-order (SQLite requires
  serialised writes); progress callbacks fire after each batch as before
- `pack install --concurrency <n>` CLI flag exposes the new option
- `-v` shorthand for `--verbose` on the global program options
- Fix transaction install-count tracking: count committed docs accurately
  without relying on subtract-on-failure arithmetic
- Add 6 new tests covering concurrency=1 sequential, concurrency=4 parallel,
  multiple embedBatch calls per install, concurrency limit enforcement,
  incremental progress reporting, and partial-failure error counting
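
The bounded-concurrency, in-order pattern can be sketched with a small worker pool. This is a swapped-in technique for illustration, not the semaphore-based scheduler the commit describes, and `mapWithConcurrency` is a hypothetical name: `task` stands in for an embedBatch call and `onResult` for the serialised SQLite insert.

```typescript
// Run `task` over `items` with at most `limit` tasks in flight,
// delivering results to `onResult` strictly in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
  onResult: (result: R, index: number) => void,
): Promise<void> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const done = new Map<number, R>(); // finished results waiting for their turn
  let nextToStart = 0;
  let nextToDeliver = 0;

  async function worker(): Promise<void> {
    while (nextToStart < items.length) {
      const index = nextToStart++;
      const result = await task(items[index] as T, index);
      done.set(index, result);
      // Flush every result that is now contiguous with what was already delivered.
      while (done.has(nextToDeliver)) {
        onResult(done.get(nextToDeliver) as R, nextToDeliver);
        done.delete(nextToDeliver);
        nextToDeliver++;
      }
    }
  }

  const workers = Array.from({ length: Math.min(limit, items.length) }, () => worker());
  await Promise.all(workers);
}
```

Rejecting a non-positive `limit` up front mirrors the hang that the follow-up review fix below guards against. The sketch does not handle per-task failures.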

* fix: address all 4 Copilot review comments on PR #339

- Validate batchSize, concurrency, resumeFrom at the start of installPack
  and throw ValidationError for invalid values (comments 3 & 4). Concurrency
  <= 0 would silently hang the semaphore indefinitely.

- Add CLI lower-bound guard: --concurrency < 1 exits with a user-facing
  error before ever calling installPack (comment 3).

- Lazy chunking: pre-chunking all documents upfront held chunks for the
  entire pack in memory simultaneously. Batches now store only the raw
  documents; resolveBatch() chunks on demand right before embedBatch
  is called, so only one batch's worth of chunks is in memory at a time
  (comment 2).

- Wrap provider.embedBatch() in try/catch so synchronous throws are
  converted to rejected Promises rather than escaping scheduleNext() and
  leaving the outer Promise permanently pending (comment 1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>

* fix: address 7 pre-release bugs from audit (#342) (#344)

- Guard JSON.parse in rowToWebhook with try/catch, default to []
- Guard JSON.parse in rowToSavedSearch with try/catch, default to null
- Guard JSON.parse in loadDbConnectorConfig with try/catch, throw ConfigError
- Push dateFrom/dateTo filters into SQL WHERE in listDocuments (before LIMIT)
- Validate negative limit in resolveSelector, throw ValidationError
- Replace manual substring extension parsing with path.extname() in packs.ts
- Verified reporter.ts is already tracked on development (no action needed)
- Added tests for all fixes (corrupted JSON, SQL-level date filtering, negative limit)
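
The JSON.parse guards in the first two bullets follow a common pattern, sketched here for the "default to []" case. The helper name and the assumption that the stored value is a JSON array of strings are illustrative, not taken from rowToWebhook.

```typescript
// Hypothetical helper: parse a JSON column, falling back to [] on any corruption.
function parseStringArray(raw: string | null): string[] {
  if (raw === null) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    // Valid JSON of the wrong shape is treated the same as corrupted JSON.
    if (Array.isArray(parsed) && parsed.every((v) => typeof v === "string")) {
      return parsed as string[];
    }
    return [];
  } catch {
    return []; // corrupted JSON falls back to the default
  }
}
```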

Closes #342

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315) (#343)

* feat: URL spidering — crawl linked pages with configurable depth and page limits (#315)

Adds opt-in spidering to URL indexing. A single seed URL can now crawl
and index an entire documentation site or wiki section in one call.

New files:
- src/core/link-extractor.ts: indexOf-based <a href> extraction, relative
  URL resolution, fragment stripping, dedup, scheme filtering. No regex.
- src/core/spider.ts: BFS crawl engine with sameDomain, pathPrefix,
  excludePatterns (glob), maxPages (hard cap 200), maxDepth (hard cap 5),
  10-min total timeout, robots.txt (User-agent: * and libscope), and
  1s inter-request delay. Yields SpiderResult per page; returns SpiderStats.
- tests/unit/link-extractor.test.ts: 25 tests covering relative resolution,
  dedup, fragment stripping, scheme filtering, attribute order, edge cases.
- tests/unit/spider.test.ts: 20 tests covering BFS order, depth/page limits,
  domain + path + pattern filtering, cycle detection, robots.txt, partial
  failure recovery, stats, and abortReason reporting.

Modified:
- src/core/url-fetcher.ts: adds fetchRaw() export returning raw body +
  contentType + finalUrl before HTML-to-markdown conversion, so the
  spider can extract links from HTML before conversion.
- src/api/routes.ts: POST /api/v1/documents/url now accepts spider, maxPages,
  maxDepth, sameDomain, pathPrefix, excludePatterns. Returns { documents,
  pagesIndexed, pagesCrawled, pagesSkipped, errors, abortReason }.
- src/mcp/server.ts: submit-document tool gains spider, maxPages, maxDepth,
  sameDomain, pathPrefix, excludePatterns parameters.

Safety: all fetched URLs pass through the existing SSRF validation in
fetchRaw() (DNS resolution, private IP blocking, scheme allowlist).
Hard limits (200 pages, depth 5, 10min) cannot be overridden by callers.
robots.txt is fetched once per origin and Disallow rules are honoured.
Individual page failures do not abort the crawl.

Closes #315

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: resolve CI lint errors in spider implementation

- Remove unnecessary type assertions (routes.ts, mcp/server.ts) —
  TypeScript already narrows SpiderResult/SpiderStats from the generator
- Add explicit return type annotation on mock fetchRaw to satisfy
  no-unsafe-return rule in spider.test.ts
- Replace .resolves.not.toThrow() with a direct assertion — vitest
  .resolves requires a Promise, not an async function

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address CodeQL security findings in spider/link-extractor

link-extractor.ts (CodeQL #30 — incomplete URL scheme check):
  Replace the enumerated scheme blocklist (javascript:, vbscript:, etc.)
  with a strict http/https allowlist check on the resolved URL protocol.
  An allowlist is exhaustive by definition; a blocklist will always miss
  obscure schemes such as blob: and any future additions.

spider.ts (CodeQL #31 — incomplete multi-character sanitization):
  Replace the regex-based tag stripper /<[^>]+>/g in extractTitle() with
  an indexOf-based stripTags() function. The regex stops at the first >
  which can be inside a quoted attribute value (e.g. <img alt="a>b">),
  potentially leaving partial tag content in the extracted title.
  The new implementation walks quoted attribute values explicitly so no
  tag content leaks through regardless of its internal structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address all Copilot review comments on spider PR (#343)

- link-extractor: add word-boundary check in extractHref to prevent
  matching data-href, aria-href (false positives on non-href attributes)
- spider: rename pagesIndexed → pagesFetched throughout (SpiderStats
  interface already used pagesFetched; sync implementation + tests)
- spider: per-origin robots.txt cache (Map<origin, Set>) fetched lazily
  as new origins are encountered during crawl (was seed-only before)
- spider: normalize to raw.finalUrl after redirects — visited set,
  yielded URL, and link-extraction base all use the canonical URL
- routes: validate maxPages/maxDepth are finite positive integers
- routes: change conditional spread &&-patterns to ternaries
- routes: remove inner try/catch for spider fetch errors; add FetchError
  to top-level handler (consistent with single-URL mode → 502)
- mcp/server: replace conditional spreads with explicit if assignments
- mcp/server: validate spider=true requires url (throws ValidationError)
- openapi: document spider request fields in IndexFromUrlRequest schema,
  add SpiderResponse schema, update 201 response to oneOf
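
To illustrate the maxPages/maxDepth validation and the ternary spread pattern from the list above, here is a small sketch. The interfaces, the function names, and the use of RangeError are assumptions for illustration; the route handler's real types and error class are not shown in this PR text.

```typescript
// Hypothetical request shape; only maxPages and maxDepth are named in the commit message.
interface SpiderRequestBody {
  maxPages?: unknown;
  maxDepth?: unknown;
}

interface SpiderOptions {
  maxPages?: number;
  maxDepth?: number;
}

/** Return the value if it is a finite positive integer, undefined if absent. */
function parsePositiveInt(value: unknown, field: string): number | undefined {
  if (value === undefined) return undefined;
  // Number.isInteger rejects NaN, Infinity, and fractional values in one check.
  if (typeof value !== "number" || !Number.isInteger(value) || value <= 0) {
    throw new RangeError(`${field} must be a finite positive integer`);
  }
  return value;
}

function buildSpiderOptions(body: SpiderRequestBody): SpiderOptions {
  const maxPages = parsePositiveInt(body.maxPages, "maxPages");
  const maxDepth = parsePositiveInt(body.maxDepth, "maxDepth");
  // Ternaries always spread an object, unlike `cond && {...}`, which can spread a falsy value.
  return {
    ...(maxPages !== undefined ? { maxPages } : {}),
    ...(maxDepth !== undefined ? { maxDepth } : {}),
  };
}
```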

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix prettier formatting in spider files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: comprehensive documentation update for v1.3.0 (#347)

- README: fix license (BUSL-1.1, not MIT), expand MCP tools table to
  all 26 tools, expand REST API table with all endpoints (webhooks,
  links, analytics, connectors status, suggest-tags, bulk ops), add
  webhooks section with HMAC signing example, add missing CLI commands
  (bulk ops, saved searches, document links, docs update)
- getting-started: fix Node.js requirement (20, not 18), add sections
  for web dashboard, organize/annotate features, REST API
- mcp-setup: expand available tools section to list all 26 tools
  grouped by category instead of just 4
- mcp-tools reference: add 5 missing tools — update-document,
  suggest-tags, link-documents, get-document-links, delete-link
- rest-api reference: add all missing endpoints, reorganize by category,
  add examples for update, bulk retag, webhooks, links, saved searches
- configuration guide: document passthrough LLM provider
- configuration reference: add passthrough LLM, llm.ollamaUrl key,
  expand config set examples to cover all settable keys
- cli reference: expand config set supported keys list

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: allow release-please PRs to pass merge gate and trigger CI (#348)

* docs: comprehensive documentation update for v1.3.0

* fix: allow release-please PRs to pass merge gate and trigger CI

Two issues prevented PR #238 from getting CI runs:

1. merge-gate blocked release-please PRs — the gate only allowed
   'development' as the source branch, but release-please uses
   'release-please--branches--main--components--libscope'. Updated
   to allow any branch matching 'release-please--*'.

2. CI never ran on the PR — GitHub does not trigger workflows when
   GITHUB_TOKEN creates a PR (intentional security restriction to
   prevent infinite loops). Fixed by passing a PAT via secrets.GH_TOKEN
   to the release-please action so its PR creation triggers CI.

Note: requires a 'GH_TOKEN' secret in repo settings — a classic PAT
with repo and workflow scopes.
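
The branch rule from item 1 can be expressed as a small predicate. The real gate is a GitHub Actions workflow (merge-gate.yml) whose contents are not shown here, so the function below is only an illustration of the matching logic; the name isAllowedSourceBranch is hypothetical.

```typescript
// Prefix corresponding to the 'release-please--*' pattern named in the commit message.
const RELEASE_PLEASE_PREFIX = "release-please--";

/** True if a PR from this head branch may merge into main under the described rule. */
function isAllowedSourceBranch(headRef: string): boolean {
  return headRef === "development" || headRef.startsWith(RELEASE_PLEASE_PREFIX);
}
```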

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: trigger checks

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: merge development into main for v1.3.0 release (#354)

* chore: add development branch workflow (#327)

* feat: create-pack from local folder or URL sources (#329)

* fix: comprehensive audit fixes — security, performance, resilience, API hardening

Addresses findings from issue #31…