chore: restructure repo as monorepo and create url-to-markdown actor by ruocco-l · Pull Request #94 · apify/actor-rag-web-browser

ruocco-l · 2026-06-04T16:47:36Z

This PR is not trivial and touches a lot of points. I tried my best to reuse what was already there; hopefully it is somewhat readable.

A totally human summary of what's in the PR:

Restructure the repository into a monorepo layout under actors/, with each mini-actor (apify_rag-web-browser and apify_url-to-markdown) having its own .actor/ directory (actor.json, input_schema.json, README, CHANGELOG).
Introduce apify_url-to-markdown, a new mini-actor that is a subset of RAG Web Browser: it only fetches a single URL and converts it to Markdown (no Google Search). It exposes a /fetch standby endpoint (vs /search for RAG Web Browser) and accepts a url parameter instead of query.
Both mini-actors share a single codebase and Dockerfile. The ACTOR_PATH_IN_DOCKER_CONTEXT Docker build arg determines which actor starts at runtime. Each actor.json points dockerfile and dockerContextDir back to the shared root Dockerfile, and the platform passes the appropriate build arg per actor (this is done by specifying the mini-actor folder in the Source tab).
Refactor the type system: split the Input type into CommonInput, RagWebBrowserInput, and UrlToMarkdownInput. The code now branches on selectedMiniActor (derived from the build arg) to apply actor-specific input validation and route registration.
The MCP server is now generic (McpServer class, was RagWebBrowserServer) and configures its tool name and input schema based on the selected mini-actor. (@jirispilka please help me figure out if what I did makes any lick of sense)
The search crawler is only started when the RAG Web Browser actor is selected, since URL to Markdown doesn't need Google Search.
Clean up unused fields: remove readableTextCharThreshold (dead code) and countryCode/languageCode (were defined in types but never used).
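
The input-type split described above could look roughly like the sketch below. Only the type names and the `query`, `url`, and `scrapingTool` fields come from this PR; typing `scrapingTool` as a plain string and the `isRagWebBrowserInput` / `describeTarget` helpers are illustrative assumptions (the PR itself branches on the selected mini-actor rather than on the input shape).

```typescript
// Fields shared by both mini-actors. Only `scrapingTool` is named in the PR;
// typing it as a plain string is an assumption made for this sketch.
interface CommonInput {
    scrapingTool: string;
}

// RAG Web Browser is driven by a `query` parameter.
interface RagWebBrowserInput extends CommonInput {
    query: string;
}

// URL to Markdown is driven by a `url` parameter instead.
interface UrlToMarkdownInput extends CommonInput {
    url: string;
}

type Input = RagWebBrowserInput | UrlToMarkdownInput;

// Hypothetical type guard: narrows the union by checking for `query`.
function isRagWebBrowserInput(input: Input): input is RagWebBrowserInput {
    return 'query' in input;
}

// Hypothetical helper showing how narrowed fields become available.
function describeTarget(input: Input): string {
    return isRagWebBrowserInput(input)
        ? `search: ${input.query}`
        : `fetch: ${input.url}`;
}
```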

Update after first feedback:

Replace the ACTOR_PATH_IN_DOCKER_CONTEXT Docker build arg with the platform-provided ACTOR_FULL_NAME environment variable for mini-actor selection. This removes the custom ARG/ENV block from the Dockerfile entirely: the platform already sets ACTOR_FULL_NAME at runtime, so the build-time plumbing was unnecessary. The MINI_ACTORS lookup keys are simplified accordingly (e.g. apify_rag-web-browser → rag-web-browser).
Trim the URL to Markdown input schema: remove fields that are now hardcoded (requestTimeoutSecs, removeCookieWarnings, maxRequestRetries, dynamicContentWaitSecs) and reorder the remaining fields to put scrapingTool right after url. RAG Web Browser validates these from its input schema as before; URL to Markdown now hardcodes sensible defaults (e.g. 40s timeout, 1 retry, cookie warnings on) since those fields are not exposed in its input schema.
Add checkAccess: false when creating the SERP proxy configuration to skip access validation on search proxies.
Wrap input processing in main.ts with an Actor.fail() call so the actor reports a clear failure message when input validation rejects bad input, instead of silently crashing.
Restructure the URL to Markdown output: add dataset storage views to its actor.json with a table displaying the page URL and extracted Markdown.
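
A minimal sketch of the ACTOR_FULL_NAME-based selection described above. It assumes the variable has the form `<owner>/<actor-name>`, which would match the simplified `rag-web-browser` lookup key; the shape of the table entries and the `resolveMiniActor` name are hypothetical and not taken from the repository.

```typescript
type MiniActorKey = 'rag-web-browser' | 'url-to-markdown';

interface MiniActor {
    route: '/search' | '/fetch'; // standby endpoint exposed by the mini-actor
    runsSearch: boolean;         // only RAG Web Browser needs the search crawler
}

const MINI_ACTORS: Record<MiniActorKey, MiniActor> = {
    'rag-web-browser': { route: '/search', runsSearch: true },
    'url-to-markdown': { route: '/fetch', runsSearch: false },
};

function isMiniActorKey(key: string): key is MiniActorKey {
    return Object.prototype.hasOwnProperty.call(MINI_ACTORS, key);
}

// Assumption: the full name looks like "<owner>/<actor-name>", so the part
// after the last slash is used as the lookup key.
function resolveMiniActor(fullName: string | undefined): MiniActor {
    const key = fullName?.split('/').pop() ?? '';
    if (!isMiniActorKey(key)) {
        throw new Error(`Unknown actor "${fullName ?? '(unset)'}".`);
    }
    return MINI_ACTORS[key];
}
```

In the actor this would presumably be called once as `resolveMiniActor(process.env.ACTOR_FULL_NAME)`; taking the value as a parameter keeps the function easy to test.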

jirispilka

Thanks @ruocco-l

Looks good to me! Pre-approving.

I've tested it locally and it works.

Regarding MCP. My immediate reaction was to delete it because we have mcp.apify.com

But what if some customer is using it? Can we log it somewhere and evaluate it later?

I left a couple of suggestions and nits.

Regarding mini-actors

Human: I was looking at the code. selectedMiniActor is passed all over, across many functions, but the decision is based solely on the
ACTOR_PATH_IN_DOCKER_CONTEXT env variable, which is available everywhere in the process. So I tasked Claude to simplify it:

CLAUDE:
The problem. selectedMiniActor is a string threaded through ~8 functions, and every receiver decides what it means by comparing it
(=== MINIACTORS.RAG_WEB_BROWSER). The same knowledge — "rag-web-browser searches, url-to-markdown doesn't" — is repeated across
input.ts, main.ts, and search.ts. Since the value comes purely from a process-wide env constant, there's no reason to thread it at
all.

The fix. Resolve it once, in a typed helper, and read named facts off it instead of comparing strings:

  // src/mini-actors.ts
  interface MiniActor {
      name: 'apify_rag-web-browser' | 'apify_url-to-markdown';
      inputSchema: typeof ragSchema | typeof url2mSchema;
      mcpServerName: string;
      runsSearch: boolean;   // the one fact call sites actually need
  }

  const MINI_ACTORS: Record<MiniActor['name'], MiniActor> = {
      'apify_rag-web-browser': {
          name: 'apify_rag-web-browser',
          inputSchema: ragSchema,
          mcpServerName: 'mcp-server-rag-web-browser',
          runsSearch: true,
      },
      'apify_url-to-markdown': {
          name: 'apify_url-to-markdown',
          inputSchema: url2mSchema,
          mcpServerName: 'mcp-server-url-to-markdown',
          runsSearch: false,
      },
  };

  export function getMiniActor(): MiniActor {
      const name = process.env.ACTOR_PATH_IN_DOCKER_CONTEXT?.split('/')[1] as MiniActor['name'] | undefined;
      if (!name || !(name in MINI_ACTORS)) throw new UserInputError(`Unknown actor context "${name ?? '(unset)'}".`);
      return MINI_ACTORS[name];
  }

Then remove the threading. Drop selectedMiniActor from every signature and return value; call getMiniActor() where needed and read
a field:

  if (getMiniActor().runsSearch) { ... }                   // main.ts + search.ts string compares
  const { inputSchema, mcpServerName } = getMiniActor();   // mcp/server.ts, no fallback

The one real fork (which input validator to run) stays as two functions, chosen once:

  const { input, searchCrawlerOptions } = getMiniActor().runsSearch
      ? await processRagWebBrowserInput(originalInput, standbyInit)
      : await processUrlToMarkdownInput(originalInput, standbyInit);

Result. selectedMiniActor leaves ~8 signatures, all three ! assertions go, the ?? TOOL_CONFIGS[...] fallback and throw await
Actor.fail() hack go, and MINIACTORS + TOOL_CONFIGS collapse into one MINI_ACTORS table. Adding a third actor = one new row, not
edits across 5 files. Tests already set the env var, so they keep working.

Trade-off. Functions now read process.env indirectly. For a fixed build-time constant that's fine; if you want it explicit,
resolve getMiniActor() once in main.ts and pass the object down — you still delete every string comparison.

jirispilka · 2026-06-08T15:02:39Z

Do we expect the schemas for url-to-markdown and rag-web-browser to share the same subset of fields?

Right now, validateAndFillInput expects that. If the fields diverge, we will need separate validateAndFillInput functions.

As of now yes, they are working on common input fields. Exclusive fields will go to dedicated processRagWebBrowserInput or processUrlToMarkdownInput (wrappers around validateAndFillInput)

ok 👍🏻

jirispilka · 2026-06-08T15:28:12Z

    const HELP_MESSAGE = `Send a GET request to ${process.env.ACTOR_STANDBY_URL}/search?query=hello+world`
-        + ` or to ${process.env.ACTOR_STANDBY_URL}/messages to use Model context protocol.`;
+        + ` or to ${process.env.ACTOR_STANDBY_URL}/message to use Model context protocol.`;


HELP_MESSAGE is not correct for url-to-markdown
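
One way to address this would be to build the help message per mini-actor. The sketch below is only an illustration: the `/search?query=` and `/fetch` + `url` pairs come from the PR description and the `/message` path from the diff above, while `EXAMPLE_REQUESTS`, `buildHelpMessage`, and the example URL are hypothetical.

```typescript
type MiniActorKey = 'rag-web-browser' | 'url-to-markdown';

// Example standby request per mini-actor: RAG Web Browser takes `query` on
// /search, URL to Markdown takes `url` on /fetch.
const EXAMPLE_REQUESTS: Record<MiniActorKey, string> = {
    'rag-web-browser': '/search?query=hello+world',
    'url-to-markdown': '/fetch?url=https%3A%2F%2Fexample.com',
};

// Hypothetical builder; `standbyUrl` would come from ACTOR_STANDBY_URL.
function buildHelpMessage(key: MiniActorKey, standbyUrl: string): string {
    return `Send a GET request to ${standbyUrl}${EXAMPLE_REQUESTS[key]}`
        + ` or to ${standbyUrl}/message to use Model context protocol.`;
}
```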

jirispilka · 2026-06-08T15:33:04Z

    "input": "./input_schema.json",
-    "dockerfile": "./Dockerfile",
+    "dockerContextDir": "../../..",
+    "changelog": "../../../CHANGELOG.md",


Is this path correct?
The CHANGELOG.md sits at the same path as actor.json

jirispilka · 2026-06-08T15:33:28Z

+    "version": "1.0",
+    "input": "./input_schema.json",
+    "dockerContextDir": "../../..",
+    "changelog": "../../../CHANGELOG.md",


The same as above

Is this path correct?
The CHANGELOG.md sits at the same path as actor.json

ruocco-l · 2026-06-09T10:20:44Z

@jirispilka thank you for the suggestion!

nikitachapovskii-dev

Thanks @ruocco-l , really solid work!
Architecture decisions look solid to me, no concerns there. Just left 2 small points inline. Approving so I'm not blocking.

nikitachapovskii-dev · 2026-06-09T15:36:33Z

Now that we got Routes.FETCH going through this same handler we could do

const params = parseParameters(request.url?.slice(getMiniActor().route.length) ?? ''); instead
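
To make the suggestion concrete, here is a rough sketch of slicing the request URL by the route length before parsing. The `parseParameters` below is a hypothetical stand-in built on `URLSearchParams`, not the repository's helper, and `parseStandbyRequest` is an invented name; the route would come from `getMiniActor().route` in the actual handler.

```typescript
// Hypothetical stand-in for the repository's parseParameters helper:
// turns "?a=1&b=2" into { a: '1', b: '2' }.
function parseParameters(queryString: string): Record<string, string> {
    const params: Record<string, string> = {};
    new URLSearchParams(queryString).forEach((value, key) => {
        params[key] = value;
    });
    return params;
}

// Drops the route prefix ("/search" or "/fetch") so that only the query
// string is parsed, regardless of which mini-actor is running.
function parseStandbyRequest(requestUrl: string | undefined, route: string): Record<string, string> {
    return parseParameters(requestUrl?.slice(route.length) ?? '');
}
```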

nikitachapovskii-dev · 2026-06-09T15:37:02Z

-        inputSchema.properties.requestTimeoutSecs.minimum,
-        inputSchema.properties.requestTimeoutSecs.maximum,
-        inputSchema.properties.requestTimeoutSecs.default,
+        ragWebBrowserInputSchema.properties.requestTimeoutSecs.minimum,


Following up on the thread with @jirispilka above

The day someone changes a default in the u2m schema it will be rewritten. Probably we can do getMiniActor().properties....

@nicklamonov waiting on your decision: if we remove the time out (and the other inputs) I'll just rework this

To track here - let's remove:

Cookie

Retries

Timeout

Debug mode (hide)

(maybe) HTML elements to remove. To be confirmed again

Done in 4c324bf. It's a bit ugly because we still need these values: they are still used, but defaulted, on RAG. Maybe in the near future we can deprecate them and then remove them.
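
A sketch of what hardcoding these values for URL to Markdown might look like. The 40 s timeout and single retry are quoted from the PR description; reading "cookie warnings on" as `removeCookieWarnings: true` is an assumption, and `CrawlSettings` and `resolveCrawlSettings` are hypothetical names rather than the code in 4c324bf.

```typescript
interface CrawlSettings {
    requestTimeoutSecs: number;
    maxRequestRetries: number;
    removeCookieWarnings: boolean;
}

// Values from the PR description (40s timeout, 1 retry). Mapping "cookie
// warnings on" to `true` is this sketch's interpretation.
const URL_TO_MARKDOWN_DEFAULTS: Readonly<CrawlSettings> = {
    requestTimeoutSecs: 40,
    maxRequestRetries: 1,
    removeCookieWarnings: true,
};

// URL to Markdown ignores user-provided values because the fields are not in
// its input schema; RAG Web Browser overlays user input on schema defaults.
function resolveCrawlSettings(
    runsSearch: boolean,
    userInput: Partial<CrawlSettings>,
    ragSchemaDefaults: CrawlSettings,
): CrawlSettings {
    if (!runsSearch) return { ...URL_TO_MARKDOWN_DEFAULTS };
    return { ...ragSchemaDefaults, ...userInput };
}
```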

metalwarrior665

Just a few nits

metalwarrior665 · 2026-06-09T16:56:50Z

+        'serpMaxRetries',
+    );
+
+    const proxySearch = await Actor.createProxyConfiguration({ groups: [input.serpProxyGroup] });


All plans have SERPs

Suggested change

const proxySearch = await Actor.createProxyConfiguration({ groups: [input.serpProxyGroup] });

const proxySearch = await Actor.createProxyConfiguration({ groups: [input.serpProxyGroup], checkAccess: false });

The input also allows SHADER proxies as a "serp proxy group" and we should check for it. Maybe we should remove the option to allow SHADER and just hardcode Google SERP?

Sorry my bad. But I think even if SHADER is an option, it is better to skip the check.

Very few users will select it. Those that do would get a runtime error, which is still understandable.

RAG is about latency; with the check you penalize everyone (on batch runs or cold starts) with a 250ms delay.

Not sure if anyone used it ever. It can be faster than SERP but only sometimes since it has to do retries so if it does more than like 1 retry, it will be slower. No strong opinion on my end.

Yeah, On input analysis SERP is chosen 99.9% of the time lol

metalwarrior665 · 2026-06-09T17:02:45Z

+ENV ACTOR_PATH_IN_DOCKER_CONTEXT="${ACTOR_PATH_IN_DOCKER_CONTEXT}"
+
+# log the ACTOR_PATH_IN_DOCKER_CONTEXT variable when building the actor
+RUN echo "ACTOR_PATH_IN_DOCKER_CONTEXT=${ACTOR_PATH_IN_DOCKER_CONTEXT}"


We migrated to using ACTOR_FULL_NAME (see example), but it only pays off if you have more miniactors; here it will have minimal impact. It might still be good to do it for the future.

metalwarrior665 · 2026-06-09T17:06:36Z

+const MINI_ACTORS: Record<string, MiniActor> = {
+    'apify_rag-web-browser': {
+        name: 'apify_rag-web-browser',
+        runsSearch: true,


The original idea of miniactors is that their input will be merged at the start and then the rest of the code doesn't know about them at all. You just add routes and options but it doesn't matter what miniactor added them. That assumption already broke in some repos like Instagram where we carry the miniactor names along mostly for event prices which partially defeated their original purpose.

I would consider whether you can still go that way: just pass runSearch: boolean as a standalone config.

But it doesn't have a big impact here, the number of uses of miniactors is pretty low
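
As a rough illustration of that idea, the mini-actor could be resolved once at startup into a plain config object, so downstream code only sees flags such as `runSearch` and never a mini-actor name. Everything named below (`RuntimeConfig`, `CONFIG_BY_MINI_ACTOR`, `planCrawlers`, the crawler labels) is hypothetical.

```typescript
// Plain options that the rest of the code consumes; no mini-actor name here.
interface RuntimeConfig {
    runSearch: boolean;
    route: string;
}

// The only place that knows about mini-actors: merged once at startup.
const CONFIG_BY_MINI_ACTOR: Record<string, RuntimeConfig> = {
    'rag-web-browser': { runSearch: true, route: '/search' },
    'url-to-markdown': { runSearch: false, route: '/fetch' },
};

// Downstream code decides what to start purely from the flag.
function planCrawlers(config: RuntimeConfig): string[] {
    const crawlers = ['content-crawler'];
    if (config.runSearch) crawlers.unshift('search-crawler');
    return crawlers;
}
```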

metalwarrior665 · 2026-06-09T17:08:54Z

+
+    // Throw an error if the query and is not provided and standbyInit is false.
+    if (!input.query && !standbyInit) {
+        throw new UserInputError('The `query` parameter must be provided and non-empty.');


We should probably Actor.fail somewhere
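
A sketch of how input processing might be wrapped so that a validation error becomes a clear failure message. The `fail` callback is injected here to keep the example self-contained; in the actor it would presumably be `Actor.fail` from the `apify` SDK. The local `UserInputError` class, `toFailureMessage`, and `processInputOrFail` are hypothetical names for illustration.

```typescript
// Local stand-in for the project's UserInputError class.
class UserInputError extends Error {
    constructor(message: string) {
        super(message);
        this.name = 'UserInputError';
        Object.setPrototypeOf(this, new.target.prototype);
    }
}

type FailFn = (statusMessage: string) => Promise<void>;

// Turns any thrown value into a message suitable for the run's status.
function toFailureMessage(error: unknown): string {
    if (error instanceof UserInputError) return `Invalid input: ${error.message}`;
    const detail = error instanceof Error ? error.message : String(error);
    return `Unexpected error while processing input: ${detail}`;
}

// Runs the input processing and reports a failure instead of letting the
// error escape unhandled. Returns undefined when processing failed.
async function processInputOrFail<T>(
    processInput: () => Promise<T>,
    fail: FailFn,
): Promise<T | undefined> {
    try {
        return await processInput();
    } catch (error) {
        await fail(toFailureMessage(error));
        return undefined;
    }
}
```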

metalwarrior665 · 2026-06-09T17:11:03Z

+    if (!input.serpProxyGroup || input.serpProxyGroup.length === 0) {
+        input.serpProxyGroup = ragWebBrowserInputSchema.properties.serpProxyGroup.default as SERPProxyGroup;
+    } else if (input.serpProxyGroup !== 'GOOGLE_SERP' && input.serpProxyGroup !== 'SHADER') {


We should just rely on input schemas; there is both a default and an enum, so these are enforced. Unless this also applies to standby (there it should be possible in the OpenAPI schema). The same goes for other checks.

The problem with this is that the input can also be passed as search parameters in the /search or /fetch http request on standby, we have to basically redo the work for it.
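
To illustrate why the check has to be repeated for standby requests: a value arriving as a raw query-string parameter never passes through input-schema validation, so the enum and default must be applied by hand. In this sketch the two enum values come from the diff above, while treating `GOOGLE_SERP` as the default and the `normalizeSerpProxyGroup` name are assumptions.

```typescript
const SERP_PROXY_GROUPS = ['GOOGLE_SERP', 'SHADER'] as const;
type SERPProxyGroup = (typeof SERP_PROXY_GROUPS)[number];

// Assumption: the schema default is GOOGLE_SERP (the actual code reads it
// from the input schema rather than hardcoding it).
const DEFAULT_SERP_PROXY_GROUP: SERPProxyGroup = 'GOOGLE_SERP';

// Validates a raw query-string value the way the input schema would:
// a missing or empty value falls back to the default, anything outside the
// enum is rejected.
function normalizeSerpProxyGroup(raw: string | undefined): SERPProxyGroup {
    if (raw === undefined || raw.length === 0) return DEFAULT_SERP_PROXY_GROUP;
    const match = SERP_PROXY_GROUPS.find((group) => group === raw);
    if (match === undefined) {
        throw new Error(`serpProxyGroup must be one of ${SERP_PROXY_GROUPS.join(', ')}, got "${raw}".`);
    }
    return match;
}
```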

ruocco-l added 8 commits June 3, 2026 16:12

chore: create actors folder and shared docker file

86e5ff0

chore: fix folder name and imports

85feb4c

url2m first implementation

8ff16be

update input schema

5ca2862

do not check url on standby actor

0ebf161

parse all possible properties

f490774

use proper tools for mcp server

0fa7031

create dedicated route

a24482c

ruocco-l requested review from jirispilka, metalwarrior665 and nikitachapovskii-dev June 4, 2026 16:47

jirispilka reviewed Jun 8, 2026

jirispilka approved these changes Jun 8, 2026

ruocco-l added 2 commits June 9, 2026 11:04

refactor miniactor selection

d4088ef

correct path for changelogs

798bdff

install only firefox browser

d25bcdd

nikitachapovskii-dev approved these changes Jun 9, 2026

metalwarrior665 reviewed Jun 9, 2026

ruocco-l and others added 11 commits June 10, 2026 10:07

slice url correctly

8493bd2

use ACTOR_FULL_NAME env

bafe9f6

actor fail on bad input processing

d8f8eca

restructure output

5541bc7

rework input

4c324bf

avoid checking access on serp proxies

d32e32f

further changes on input schema

1478072

Update dataset details and memory settings

f3aea79

add output schema

c033a88

lint

2bd047d

install browser with dependencies

9c3683e

nicklamonov and others added 3 commits June 10, 2026 15:16

Update texts in input_schema.json

afda079

Regex for domain name validation for URL

acbcdb6

use node 22 for tests

61b3c8b

ruocco-l force-pushed the chore/monorepo-structure branch from 9c3683e to 61b3c8b Compare June 10, 2026 13:21

ruocco-l and others added 15 commits June 10, 2026 14:23

use --with-deps flag

ba685ac

Merge pull request #96 from apify/actor-json-update

90e28b4

chore: update dataset details and memory settings

do not specify firefox

599b088

Merge branch 'chore/monorepo-structure' into regex-for-url

5128094

Merge pull request #98 from apify/regex-for-url

e2e9b97

chore: Regex for domain name validation for URL

Merge pull request #97 from apify/input-schema-update

5eeefb3

docs: Update texts in input_schema.json

fix domain regex

9943682

Merge pull request #99 from apify/fix-domain-regex

af21a61

fix: update domain regex

Typo fix in the actor.json

f58174a

provide PPE support

4b7f9e1

small fixes

d4c1390

update apify dependency

235e392

try catch charging

046ca56

add prenav hook to avoid waiting on navigation

798ad33

remove unused input

014e9f2

ruocco-l force-pushed the chore/monorepo-structure branch from 31c9574 to 014e9f2 Compare June 10, 2026 18:02

6 participants
