Skip to content

chore: restructure repo as monorepo and create url-to-markdown actor#94

Open
ruocco-l wants to merge 40 commits into
masterfrom
chore/monorepo-structure
Open

chore: restructure repo as monorepo and create url-to-markdown actor#94
ruocco-l wants to merge 40 commits into
masterfrom
chore/monorepo-structure

Conversation

@ruocco-l

@ruocco-l ruocco-l commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

This PR is not trivial and touches a lot of points. I tried my best to reuse what already was there, hopefully is somewhat readable.

A totally human summary of what's in the PR:

  • Restructure the repository into a monorepo layout under actors/, with each mini-actor (apify_rag-web-browser and apify_url-to-markdown) having its own
    .actor/ directory (actor.json, input_schema.json, README, CHANGELOG).
  • Introduce apify_url-to-markdown, a new mini-actor that is a subset of RAG Web Browser — it only fetches a single URL and converts it to Markdown (no Google
    Search). It exposes a /fetch standby endpoint (vs /search for RAG Web Browser) and accepts a url parameter instead of query.
  • Both mini-actors share a single codebase and Dockerfile. The ACTOR_PATH_IN_DOCKER_CONTEXT Docker build arg determines which actor starts at runtime — each
    actor.json points dockerfile and dockerContextDir back to the shared root Dockerfile, and the platform passes the appropriate build arg per actor (this is done by specifying the miniactor folder in the Source tab).
  • Refactor the type system: split the Input type into CommonInput, RagWebBrowserInput, and UrlToMarkdownInput. The code now branches on
    selectedMiniActor (derived from the build arg) to apply actor-specific input validation and route registration.
  • The MCP server is now generic (McpServer class, was RagWebBrowserServer) and configures its tool name and input schema based on the selected mini-actor. (@jirispilka please help me figure out if what I did makes any lick of sense)
  • The search crawler is only started when the RAG Web Browser actor is selected, since URL to Markdown doesn't need Google Search.
  • Clean up unused fields: remove readableTextCharThreshold (dead code) and countryCode/languageCode (were defined in types but never used).

Update after first feedback:

  • Replace the ACTOR_PATH_IN_DOCKER_CONTEXT Docker build arg with the platform-provided ACTOR_FULL_NAME environment variable for mini-actor selection. This removes the custom ARG/ENV block from the
    Dockerfile entirely — the platform already sets ACTOR_FULL_NAME at runtime, so the build-time plumbing was unnecessary. The MINI_ACTORS lookup keys are simplified accordingly (e.g. apify_rag-web-browser
    → rag-web-browser).
  • Trim the URL to Markdown input schema — remove fields that are now hardcoded (requestTimeoutSecs, removeCookieWarnings, maxRequestRetries, dynamicContentWaitSecs) and reorder the remaining fields to
    put scrapingTool right after url. RAG Web Browser validates these from its input schema as before; URL to Markdown now hardcodes sensible defaults (e.g. 40s timeout, 1 retry,
    cookie warnings on) since those fields are not exposed in its input schema.
  • Add checkAccess: false when creating the SERP proxy configuration to skip access validation on search proxies.
  • Wrap input processing in main.ts with an Actor.fail() call so the actor reports a clear failure message when input validation rejects bad input, instead of silently crashing.
  • Restructure the URL to Markdown output: add dataset storage views to its actor.json with a table displaying the page URL and extracted Markdown.

@jirispilka jirispilka left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ruocco-l

Looks good to me! Pre-approving.

I've tested it at my local and it works.

Regarding MCP. My immediate reaction was to delete it because we have mcp.apify.com

But what if some customer is using it? Can we log it somewhere and evaluate it later?

I left a couple of suggestions and nits.

Regarding mini-actors

Human: I was looking at the code, selectedMiniActor is passed all over, across many functions, but the decision is based solely on the
ACTOR_PATH_IN_DOCKER_CONTEXT env variable, which is available everywhere in the process. So I task Claude to simplify it:

CLAUDE:
The problem. selectedMiniActor is a string threaded through ~8 functions, and every receiver decides what it means by comparing it
(=== MINIACTORS.RAG_WEB_BROWSER). The same knowledge — "rag-web-browser searches, url-to-markdown doesn't" — is repeated across
input.ts, main.ts, and search.ts. Since the value comes purely from a process-wide env constant, there's no reason to thread it at
all.

The fix. Resolve it once, in a typed helper, and read named facts off it instead of comparing strings:

  // src/mini-actors.ts
  interface MiniActor {
      name: 'apify_rag-web-browser' | 'apify_url-to-markdown';
      inputSchema: typeof ragSchema | typeof url2mSchema;
      mcpServerName: string;
      runsSearch: boolean;   // the one fact call sites actually need
  }

  const MINI_ACTORS: Record<MiniActor['name'], MiniActor> = {
      'apify_rag-web-browser': { name: 'apify_rag-web-browser', inputSchema: ragSchema,  mcpServerName:
  'mcp-server-rag-web-browser',  runsSearch: true  },
      'apify_url-to-markdown': { name: 'apify_url-to-markdown', inputSchema: url2mSchema, mcpServerName:
  'mcp-server-url-to-markdown', runsSearch: false },
  };

  export function getMiniActor(): MiniActor {
      const name = process.env.ACTOR_PATH_IN_DOCKER_CONTEXT?.split('/')[1] as MiniActor['name'] | undefined;
      if (!name || !(name in MINI_ACTORS)) throw new UserInputError(`Unknown actor context "${name ?? '(unset)'}".`);
      return MINI_ACTORS[name];
  }

Then remove the threading. Drop selectedMiniActor from every signature and return value; call getMiniActor() where needed and read
a field:

  if (getMiniActor().runsSearch) { ... }                   // main.ts + search.ts string compares
  const { inputSchema, mcpServerName } = getMiniActor();   // mcp/server.ts, no fallback

The one real fork (which input validator to run) stays as two functions, chosen once:

  const { input, searchCrawlerOptions } = getMiniActor().runsSearch
      ? await processRagWebBrowserInput(originalInput, standbyInit)
      : await processUrlToMarkdownInput(originalInput, standbyInit);

Result. selectedMiniActor leaves ~8 signatures, all three ! assertions go, the ?? TOOL_CONFIGS[...] fallback and throw await
Actor.fail() hack go, and MINIACTORS + TOOL_CONFIGS collapse into one MINI_ACTORS table. Adding a third actor = one new row, not
edits across 5 files. Tests already set the env var, so they keep working.

Trade-off. Functions now read process.env indirectly. For a fixed build-time constant that's fine; if you want it explicit,
resolve getMiniActor() once in main.ts and pass the object down — you still delete every string comparison.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect the schema for url-to-markdown and rag-web-browser to the same sub-set of fields?

Right now, validateAndFillInput expects that. If there will be different fields, we will need to have different functions for validateAndFillInput

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now yes, they are working on common input fields. Exclusive fields will go to dedicated processRagWebBrowserInput or processUrlToMarkdownInput (wrappers around validateAndFillInput)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok 👍🏻

Comment thread src/server.ts Outdated
Comment on lines +14 to +15
const HELP_MESSAGE = `Send a GET request to ${process.env.ACTOR_STANDBY_URL}/search?query=hello+world`
+ ` or to ${process.env.ACTOR_STANDBY_URL}/messages to use Model context protocol.`;
+ ` or to ${process.env.ACTOR_STANDBY_URL}/message to use Model context protocol.`;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HELP_MESSAGE is not correct for url-to-markdown

"input": "./input_schema.json",
"dockerfile": "./Dockerfile",
"dockerContextDir": "../../..",
"changelog": "../../../CHANGELOG.md",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this path correct?
The CHANGELOG.md sits at the same path as actor.json

"version": "1.0",
"input": "./input_schema.json",
"dockerContextDir": "../../..",
"changelog": "../../../CHANGELOG.md",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above

Is this path correct?
The CHANGELOG.md sits at the same path as actor.json

@ruocco-l

ruocco-l commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

@jirispilka thank you for the suggestion!

@nikitachapovskii-dev nikitachapovskii-dev left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ruocco-l , really solid work!
Architecture decisions looks solid to me, no concerns there. Just left 2 small points inline. Approving so I'm not blocking

Comment thread src/search.ts Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we got Routes.FETCH going through this same handler we could do

const params = parseParameters(request.url?.slice(getMiniActor().route.length) ?? ''); instead

Comment thread src/input.ts Outdated
inputSchema.properties.requestTimeoutSecs.minimum,
inputSchema.properties.requestTimeoutSecs.maximum,
inputSchema.properties.requestTimeoutSecs.default,
ragWebBrowserInputSchema.properties.requestTimeoutSecs.minimum,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on the thread with @jirispilka above

The day someone changes a default in the u2m schema it will be rewriten. Probably we can do getMiniActor().properties....

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicklamonov waiting on your decision: if we remove the time out (and the other inputs) I'll just rework this

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To track here - let's remove:

  • Cookie
  • Retries
  • Timeout
  • Debug mode (hide)
  • (maybe) HTML elements to remove. To be confirmed again

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 4c324bf. It's a bit ugly because we do need this information that are still used but defaulted on RAG. Maybe in the near future we can deprecate them and then remove them.

@metalwarrior665 metalwarrior665 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just few nits

Comment thread src/input.ts Outdated
'serpMaxRetries',
);

const proxySearch = await Actor.createProxyConfiguration({ groups: [input.serpProxyGroup] });

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All plans have SERPs

Suggested change
const proxySearch = await Actor.createProxyConfiguration({ groups: [input.serpProxyGroup] });
const proxySearch = await Actor.createProxyConfiguration({ groups: [input.serpProxyGroup], checkAccess: false });

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input also allow for shader proxies as a "serp proxy group" and we should check for it. Maybe we should remove the option to allow shader and just hardcode google serp?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry my bad. But I think even if SHADER is an option, it is better to skip the check.

  • Minimum users will select it. Those that do would get a runtime error which is still understandable
  • RAG is about latency, with the check you penalize everyone (when batch or cold start) with 250ms delay.

Not sure if anyone used it ever. It can be faster than SERP but only sometimes since it has to do retries so if it does more than like 1 retry, it will be slower. No strong opinion on my end.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, On input analysis SERP is chosen 99.9% of the time lol

Comment thread Dockerfile Outdated
ENV ACTOR_PATH_IN_DOCKER_CONTEXT="${ACTOR_PATH_IN_DOCKER_CONTEXT}"

# log the ACTOR_PATH_IN_DOCKER_CONTEXT variable when building the actor
RUN echo "ACTOR_PATH_IN_DOCKER_CONTEXT=${ACTOR_PATH_IN_DOCKER_CONTEXT}"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We migrated into using ACTOR_FULL_NAME (see example) but only pays off if you have more miniactors, here it will have minimal impact. But might be good to do it for the future

Comment thread src/mini-actors.ts
const MINI_ACTORS: Record<string, MiniActor> = {
'apify_rag-web-browser': {
name: 'apify_rag-web-browser',
runsSearch: true,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original idea of miniactors is that their input will be merged at the start and then the rest of the code doesn't know about them at all. You just add routes and options but it doesn't matter what miniactor added them. That assumption already broke in some repos like Instagram where we carry the miniactor names along mostly for event prices which partially defeated their original purpose.

I would think if you can still go that way, just pass runSearch: boolean as standalone config.

But it doesn't have a big impact here, the number of uses of miniactors is pretty low

Comment thread src/input.ts

// Throw an error if the query and is not provided and standbyInit is false.
if (!input.query && !standbyInit) {
throw new UserInputError('The `query` parameter must be provided and non-empty.');

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably Actor.fail somewhere

Comment thread src/input.ts
Comment on lines +127 to +129
if (!input.serpProxyGroup || input.serpProxyGroup.length === 0) {
input.serpProxyGroup = ragWebBrowserInputSchema.properties.serpProxyGroup.default as SERPProxyGroup;
} else if (input.serpProxyGroup !== 'GOOGLE_SERP' && input.serpProxyGroup !== 'SHADER') {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just rely on input schemas, there is both default and enum so these are enforced. Unless this also applies to standby (there is hould be possible in the openapi schema). The same for other checks.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this is that the input can also be passed as search parameters in the /search or /fetch http request on standby, we have to basically redo the work for it.

@ruocco-l ruocco-l force-pushed the chore/monorepo-structure branch from 9c3683e to 61b3c8b Compare June 10, 2026 13:21
@ruocco-l ruocco-l force-pushed the chore/monorepo-structure branch from 31c9574 to 014e9f2 Compare June 10, 2026 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants