chore: add monorepo workspace skeleton (no behavior change) #89

Open
nicklamonov wants to merge 4 commits into master from chore/monorepo-skeleton

@nicklamonov nicklamonov commented May 29, 2026

Collaborator

Summary

PR 1 of the monorepo migration tracked in epic #90 — fold the URL-to-Markdown actor (currently the standalone apify/page-scraper repo) into this repo as a sibling actor sharing the scraping engine in-process instead of over HTTP.

This PR adds the workspace scaffolding only: the source layout, build, and runtime behavior of RAG Web Browser (referred to below as RAG) are unchanged.

Tracking: #90

What changes

  • package.json — adds "private": true, "packageManager": "pnpm@10.33.4" + a devEngines.packageManager block, and lerna + turbo as devDependencies. Workspaces are declared in pnpm-workspace.yaml, not the npm workspaces field. patch-package stays in dependencies (see Tooling note).
  • pnpm-workspace.yaml — workspace globs (packages/*, packages/actors/*), nodeLinker: hoisted, and onlyBuiltDependencies (esbuild, playwright). Mirrors apify/actor-scraper.
  • pnpm-lock.yaml — replaces package-lock.json.
  • lerna.json — independent per-package versioning, conventional commits, GitHub releases, npmClient: pnpm. Mirrors apify/actor-scraper.
  • turbo.json — standard build/test/lint/clean task graph (dependsOn: ["^build"], outputs dist/**).
  • tsconfig.base.json — shared base config that future workspace packages will extend. RAG's own tsconfig.json is untouched.
  • packages/.gitkeep — placeholder for the (currently empty) workspace dir.
  • .gitignore — ignore .turbo cache.
  • .github/workflows/checks.yml — pin Node to 22 (was 'latest', which now resolves to Node 26) and switch the package manager from npm to pnpm (install via apify/actions/pnpm-install, build/lint/test via pnpm), plus concurrency + cancel-in-progress. Fixes a CI hang: Playwright 1.46.0's installer only supports Node 18/20/22, and on Node 26 its post-download unzip step stalls silently until the run is cancelled. Node 22 also matches the production base image (apify/actor-node-playwright-firefox:22-*).
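
The Node pin in the last bullet can be restated as a small preflight check. This is only a sketch: the supported majors (18, 20, 22) come from the bullet above, while the helper names `parseNodeMajor`, `isSupportedNode`, and `assertSupportedNode` are hypothetical and not part of this PR or of Playwright.

```typescript
// Majors taken from the PR text: Playwright 1.46.0's installer supports Node 18/20/22.
const SUPPORTED_NODE_MAJORS: readonly number[] = [18, 20, 22];

// Extracts the major version from strings such as "22.11.0" or "v26.2.0".
function parseNodeMajor(version: string): number {
    const match = /^v?(\d+)\./.exec(version);
    if (!match) {
        throw new Error(`Unrecognized Node version string: ${version}`);
    }
    return Number(match[1]);
}

// Hypothetical helper: true when the version's major is in the supported set.
function isSupportedNode(version: string): boolean {
    const major = parseNodeMajor(version);
    return SUPPORTED_NODE_MAJORS.some((supported) => supported === major);
}

// Hypothetical preflight: throws up front instead of letting an installer stall later.
function assertSupportedNode(version: string): void {
    if (!isSupportedNode(version)) {
        throw new Error(
            `Node ${version} is outside the supported majors: ${SUPPORTED_NODE_MAJORS.join(', ')}`,
        );
    }
}
```

In a script, such a check could be fed `process.versions.node`. Pinning the version in the workflow, as this PR does, remains the actual fix.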

What is not changed

  • RAG's src/, .actor/, Dockerfile, tsconfig.json, build scripts, tests.
  • The current npm run build / npm run start:dev / apify push flows.
  • The production actor still builds with npm inside Docker (the Apify base image is npm-based, same as apify/actor-scraper); pnpm is only the dev/CI/workspace layer.

Verification

  • pnpm install completes (1119 packages); patch-package applies the playwright-core@1.46.0 patch under both pnpm and npm.
  • pnpm run build (tsc) and pnpm run lint succeed.
  • turbo run build runs cleanly with "0 packages in scope" — expected since no workspace packages exist yet.
  • Tests pass 9/11 locally (10/11 single-threaded). The remaining failures are a browser-launch timeout / vitest file-parallelism flake on the dev machine (the playwright crawler test passes in isolation) — identical to the pre-migration npm baseline, i.e. not caused by pnpm.
  • Production parity: a full docker build of .actor/Dockerfile succeeds, and the Firefox playwright-core patch is confirmed present in the final image (applied by patch-package's postinstall during the image's npm install).
  • Workflow: actionlint passes clean; an act dry-run resolves the full job graph, including the apify/actions/pnpm-install composite (which runs pnpm install and caches by pnpm-lock.yaml hash).
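
The "0 packages in scope" result is what the workspace globs predict. The sketch below uses a simplified matcher in which `*` matches exactly one path segment. It is not pnpm's real glob implementation, and `matchesGlob` and `packagesInScope` are hypothetical names. It also assumes that only directories containing a package.json are candidates.

```typescript
// Globs from pnpm-workspace.yaml as described in this PR.
const WORKSPACE_GLOBS: string[] = ['packages/*', 'packages/actors/*'];

// Simplified matcher: "*" matches exactly one non-empty path segment.
function matchesGlob(dir: string, glob: string): boolean {
    const dirParts = dir.split('/');
    const globParts = glob.split('/');
    if (dirParts.length !== globParts.length) return false;
    return globParts.every((part, i) => (part === '*' ? dirParts[i] !== '' : part === dirParts[i]));
}

// Input: directories that contain a package.json (assumption of this sketch).
// Output: the subset that falls under one of the workspace globs.
function packagesInScope(dirsWithManifest: string[]): string[] {
    return dirsWithManifest.filter((dir) => WORKSPACE_GLOBS.some((glob) => matchesGlob(dir, glob)));
}

// Today packages/ holds only .gitkeep, so there are no candidate directories.
const inScopeToday = packagesInScope([]);

// Directory names planned for later PRs; "src" is included to show a non-match.
const inScopeLater = packagesInScope([
    'packages/actors/rag-web-browser',
    'packages/scraping-engine',
    'packages/actors/url-to-markdown',
    'src',
]);
```

Under these assumptions `inScopeToday` is empty, and `inScopeLater` contains the three planned package directories but not `src`.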

Upcoming PRs (tracked in #90)

  • PR 2 — relocate RAG's src/ and .actor/ into packages/actors/rag-web-browser/.
  • PR 3 — extract the shared scraping engine into packages/scraping-engine/.
  • PR 4 — add packages/actors/url-to-markdown/ consuming the engine.
  • PR 5 — switch CI to a matrix push job for both actors.
  • PR 6 — point the new url-to-markdown actor's source at this repo.
  • PR 7 — deprecate the old standalone apify/page-scraper repo + actor.
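
Once PRs 2 to 4 land, the `dependsOn: ["^build"]` entry in turbo.json means a package's workspace dependencies are built before the package itself. The sketch below shows that ordering constraint as a depth-first topological sort. The dependency map is an assumption based on the plan above (both actors consuming the shared engine), and `buildOrder` is a hypothetical helper, not Turbo's scheduler, which also runs independent tasks in parallel.

```typescript
// Maps a package name to the workspace packages it depends on.
type DependencyGraph = Record<string, string[]>;

// Assumed graph: both actors depend on the shared scraping engine.
const plannedGraph: DependencyGraph = {
    'scraping-engine': [],
    'rag-web-browser': ['scraping-engine'],
    'url-to-markdown': ['scraping-engine'],
};

// Returns an order in which every package appears after all of its dependencies.
function buildOrder(graph: DependencyGraph): string[] {
    const order: string[] = [];
    const state: Record<string, 'visiting' | 'done' | undefined> = {};

    const visit = (name: string): void => {
        if (state[name] === 'done') return;
        if (state[name] === 'visiting') {
            throw new Error(`Dependency cycle detected at ${name}`);
        }
        state[name] = 'visiting';
        for (const dep of graph[name] ?? []) {
            visit(dep);
        }
        state[name] = 'done';
        order.push(name); // pushed only after all dependencies were pushed
    };

    for (const name of Object.keys(graph)) {
        visit(name);
    }
    return order;
}

const plannedOrder = buildOrder(plannedGraph);
```

With the assumed graph, `scraping-engine` always precedes both actors. A cycle between packages raises an error instead of producing an order.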

Tooling note

Matches apify/actor-scraper's stack — pnpm workspaces + Lerna (independent versioning) + Turbo. (An earlier draft of this plan said npm "to mirror the reference"; that was a misread — the reference monorepo is on pnpm [pnpm-lock.yaml, packageManager: pnpm@10.33.4], so this PR uses pnpm.)

The patch on playwright-core is intentionally kept on patch-package rather than migrated to pnpm's native patchedDependencies: the production actor image builds with npm, which does not understand pnpm patches — a native migration silently dropped the Firefox patch from the prod image. patch-package's postinstall runs under both npm (Docker) and pnpm (dev/CI), so both paths apply the patch.
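
The reasoning above can be written down as a small decision model. This sketch only encodes the claims made in this note (patch-package's postinstall runs under both installers, while pnpm's native patchedDependencies is honored only by pnpm). The type and function names are hypothetical, and the install-path labels are illustrative.

```typescript
type Installer = 'npm' | 'pnpm';
type PatchMechanism = 'patch-package' | 'pnpm-patchedDependencies';

// Models the note: patch-package applies under either installer,
// pnpm's native patches apply only when pnpm performs the install.
function isPatchApplied(mechanism: PatchMechanism, installer: Installer): boolean {
    if (mechanism === 'patch-package') return true;
    return installer === 'pnpm';
}

// The two install paths described in this PR.
const installPaths: Array<{ name: string; installer: Installer }> = [
    { name: 'dev/CI', installer: 'pnpm' },
    { name: 'production Docker image', installer: 'npm' },
];

// Lists the install paths on which the given mechanism would leave the patch out.
function pathsMissingPatch(mechanism: PatchMechanism): string[] {
    return installPaths
        .filter((path) => !isPatchApplied(mechanism, path.installer))
        .map((path) => path.name);
}
```

In this model `pathsMissingPatch('patch-package')` is empty, while `pathsMissingPatch('pnpm-patchedDependencies')` names the production Docker image, which is the regression described above.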

🤖 Generated with Claude Code

This is PR #1 of the planned migration to host the URL-to-Markdown
actor (formerly apify/page-scraper) as a sibling actor in this repo.

Adds the workspace scaffolding only — RAG Web Browser's source layout,
build, and runtime behavior are unchanged in this PR. Subsequent PRs
will:

  - PR #2: relocate RAG's src/ and .actor/ into packages/actors/rag-web-browser/
  - PR #3: extract the shared scraping engine into packages/scraping-engine/
  - PR #4: add packages/actors/url-to-markdown/ consuming the engine
  - PR #5: switch CI to a matrix push for both actors

What this PR changes:

  - package.json: add "private": true, "workspaces": ["packages/*",
    "packages/actors/*"], "packageManager": "npm@10.9.2", and lerna +
    turbo as devDependencies.
  - lerna.json: independent versioning, conventional commits, github
    releases (matching apify/actor-scraper's setup).
  - turbo.json: build / test / lint / clean tasks with the standard
    dependsOn:["^build"] graph and dist/** outputs.
  - tsconfig.base.json: shared base config (extends @apify/tsconfig)
    that future workspace packages will extend. RAG's own tsconfig.json
    is unchanged.
  - packages/.gitkeep: placeholder so the empty workspace dir is tracked.
  - .gitignore: ignore .turbo cache.

Verification:

  - npm install completes (1159 packages, patch-package runs).
  - npm run build (tsc) succeeds.
  - npx turbo run build runs cleanly with "0 packages in scope" (as
    expected — no workspace packages exist yet).
  - Non-Playwright tests pass (9/11). The 2 Playwright tests fail
    locally only because Playwright browsers aren't installed; this is
    independent of the workspace changes.

Tooling note: matches apify/actor-scraper's stack exactly — npm
workspaces + Lerna (independent versioning) + Turbo. The earlier draft
plan referenced pnpm; npm is the right call to mirror the reference
monorepo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added this to the 141st sprint - Tooling team milestone May 29, 2026
@github-actions github-actions Bot added the t-tooling Issues with this label are in the ownership of the tooling team. label May 29, 2026
@nicklamonov nicklamonov added the adhoc Ad-hoc unplanned task added during the sprint. label May 29, 2026
@nicklamonov nicklamonov removed this from the 141st sprint - Tooling team milestone May 29, 2026
@github-actions github-actions Bot added this to the 141st sprint - Tooling team milestone May 29, 2026
@nicklamonov nicklamonov force-pushed the chore/monorepo-skeleton branch from 8fc4ad0 to 00e585c Compare May 29, 2026 12:25
The previous `node-version: 'latest'` resolved to Node 26.2.0 on
current ubuntu-latest runners. Playwright 1.46.0's installer was
released August 2024 and only officially supports Node 18 / 20 / 22 —
on Node 26 its post-download `unzip` step hangs silently with no
progress output, causing the CI step to time out.

Pinning to Node 22:
- Inside Playwright 1.46.0's supported matrix
- Current Node LTS
- Matches the production base image (apify/actor-node-playwright-firefox:22-*)

Master's last successful CI run on 2026-05-01 happened to land on a
Node version that worked with Playwright; the implicit `'latest'`
pointer rolled over to Node 26 since then. This pin fixes that drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nicklamonov nicklamonov marked this pull request as ready for review May 29, 2026 13:16
Mirror the reference monorepo (apify/actor-scraper), which uses
pnpm + Lerna + Turbo — not npm. The earlier "corrected to npm to mirror
the reference" note was based on a misread; the reference is on pnpm.

- package.json: drop the npm `workspaces` field (pnpm reads
  pnpm-workspace.yaml), set packageManager to pnpm@10.33.4, add the
  devEngines block.
- pnpm-workspace.yaml: workspace globs + nodeLinker: hoisted +
  onlyBuiltDependencies (esbuild, playwright), matching actor-scraper.
- Regenerate the lockfile (package-lock.json -> pnpm-lock.yaml).
- lerna.json: npmClient: pnpm.
- checks.yml: install via apify/actions/pnpm-install, run via pnpm,
  add concurrency/cancel-in-progress. Node stays pinned to 22.

patch-package is intentionally kept (not migrated to pnpm's native
patchedDependencies): the production actor image builds with npm, which
does not understand pnpm patches, so a native migration silently dropped
the playwright-core Firefox patch from the prod image. patch-package's
postinstall runs under both npm and pnpm.

Verified: pnpm build/lint/test green; test results match the npm baseline
(9/11 local, browser-dependent failures unchanged); full docker build
succeeds and the Firefox patch is present in the final image.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nicklamonov nicklamonov requested a review from JuanGalilea June 1, 2026 08:21
@JuanGalilea

it all looks good but we won't know how good it actually is until we move stuff into subfolders in the following issues.
