chore: restructure repo as monorepo and create url-to-markdown actor#94
chore: restructure repo as monorepo and create url-to-markdown actor#94ruocco-l wants to merge 40 commits into
Conversation
jirispilka
left a comment
There was a problem hiding this comment.
Thanks @ruocco-l
Looks good to me! Pre-approving.
I've tested it at my local and it works.
Regarding MCP. My immediate reaction was to delete it because we have mcp.apify.com
But what if some customer is using it? Can we log it somewhere and evaluate it later?
I left a couple of suggestions and nits.
Regarding mini-actors
Human: I was looking at the code, selectedMiniActor is passed all over, across many functions, but the decision is based solely on the
ACTOR_PATH_IN_DOCKER_CONTEXT env variable, which is available everywhere in the process. So I task Claude to simplify it:
CLAUDE:
The problem. selectedMiniActor is a string threaded through ~8 functions, and every receiver decides what it means by comparing it
(=== MINIACTORS.RAG_WEB_BROWSER). The same knowledge — "rag-web-browser searches, url-to-markdown doesn't" — is repeated across
input.ts, main.ts, and search.ts. Since the value comes purely from a process-wide env constant, there's no reason to thread it at
all.
The fix. Resolve it once, in a typed helper, and read named facts off it instead of comparing strings:
// src/mini-actors.ts
interface MiniActor {
name: 'apify_rag-web-browser' | 'apify_url-to-markdown';
inputSchema: typeof ragSchema | typeof url2mSchema;
mcpServerName: string;
runsSearch: boolean; // the one fact call sites actually need
}
const MINI_ACTORS: Record<MiniActor['name'], MiniActor> = {
'apify_rag-web-browser': { name: 'apify_rag-web-browser', inputSchema: ragSchema, mcpServerName:
'mcp-server-rag-web-browser', runsSearch: true },
'apify_url-to-markdown': { name: 'apify_url-to-markdown', inputSchema: url2mSchema, mcpServerName:
'mcp-server-url-to-markdown', runsSearch: false },
};
export function getMiniActor(): MiniActor {
const name = process.env.ACTOR_PATH_IN_DOCKER_CONTEXT?.split('/')[1] as MiniActor['name'] | undefined;
if (!name || !(name in MINI_ACTORS)) throw new UserInputError(`Unknown actor context "${name ?? '(unset)'}".`);
return MINI_ACTORS[name];
}Then remove the threading. Drop selectedMiniActor from every signature and return value; call getMiniActor() where needed and read
a field:
if (getMiniActor().runsSearch) { ... } // main.ts + search.ts string compares
const { inputSchema, mcpServerName } = getMiniActor(); // mcp/server.ts, no fallbackThe one real fork (which input validator to run) stays as two functions, chosen once:
const { input, searchCrawlerOptions } = getMiniActor().runsSearch
? await processRagWebBrowserInput(originalInput, standbyInit)
: await processUrlToMarkdownInput(originalInput, standbyInit);Result. selectedMiniActor leaves ~8 signatures, all three ! assertions go, the ?? TOOL_CONFIGS[...] fallback and throw await
Actor.fail() hack go, and MINIACTORS + TOOL_CONFIGS collapse into one MINI_ACTORS table. Adding a third actor = one new row, not
edits across 5 files. Tests already set the env var, so they keep working.
Trade-off. Functions now read process.env indirectly. For a fixed build-time constant that's fine; if you want it explicit,
resolve getMiniActor() once in main.ts and pass the object down — you still delete every string comparison.
There was a problem hiding this comment.
Do we expect the schema for url-to-markdown and rag-web-browser to the same sub-set of fields?
Right now, validateAndFillInput expects that. If there will be different fields, we will need to have different functions for validateAndFillInput
There was a problem hiding this comment.
As of now yes, they are working on common input fields. Exclusive fields will go to dedicated processRagWebBrowserInput or processUrlToMarkdownInput (wrappers around validateAndFillInput)
| const HELP_MESSAGE = `Send a GET request to ${process.env.ACTOR_STANDBY_URL}/search?query=hello+world` | ||
| + ` or to ${process.env.ACTOR_STANDBY_URL}/messages to use Model context protocol.`; | ||
| + ` or to ${process.env.ACTOR_STANDBY_URL}/message to use Model context protocol.`; |
There was a problem hiding this comment.
HELP_MESSAGE is not correct for url-to-markdown
| "input": "./input_schema.json", | ||
| "dockerfile": "./Dockerfile", | ||
| "dockerContextDir": "../../..", | ||
| "changelog": "../../../CHANGELOG.md", |
There was a problem hiding this comment.
Is this path correct?
The CHANGELOG.md sits at the same path as actor.json
| "version": "1.0", | ||
| "input": "./input_schema.json", | ||
| "dockerContextDir": "../../..", | ||
| "changelog": "../../../CHANGELOG.md", |
There was a problem hiding this comment.
The same as above
Is this path correct?
The CHANGELOG.md sits at the same path as actor.json
|
@jirispilka thank you for the suggestion! |
nikitachapovskii-dev
left a comment
There was a problem hiding this comment.
Thanks @ruocco-l , really solid work!
Architecture decisions looks solid to me, no concerns there. Just left 2 small points inline. Approving so I'm not blocking
There was a problem hiding this comment.
Now that we got Routes.FETCH going through this same handler we could do
const params = parseParameters(request.url?.slice(getMiniActor().route.length) ?? ''); instead
| inputSchema.properties.requestTimeoutSecs.minimum, | ||
| inputSchema.properties.requestTimeoutSecs.maximum, | ||
| inputSchema.properties.requestTimeoutSecs.default, | ||
| ragWebBrowserInputSchema.properties.requestTimeoutSecs.minimum, |
There was a problem hiding this comment.
Following up on the thread with @jirispilka above
The day someone changes a default in the u2m schema it will be rewriten. Probably we can do getMiniActor().properties....
There was a problem hiding this comment.
@nicklamonov waiting on your decision: if we remove the time out (and the other inputs) I'll just rework this
There was a problem hiding this comment.
To track here - let's remove:
- Cookie
- Retries
- Timeout
- Debug mode (hide)
- (maybe) HTML elements to remove. To be confirmed again
There was a problem hiding this comment.
Done in 4c324bf. It's a bit ugly because we do need this information that are still used but defaulted on RAG. Maybe in the near future we can deprecate them and then remove them.
| 'serpMaxRetries', | ||
| ); | ||
|
|
||
| const proxySearch = await Actor.createProxyConfiguration({ groups: [input.serpProxyGroup] }); |
There was a problem hiding this comment.
All plans have SERPs
| const proxySearch = await Actor.createProxyConfiguration({ groups: [input.serpProxyGroup] }); | |
| const proxySearch = await Actor.createProxyConfiguration({ groups: [input.serpProxyGroup], checkAccess: false }); |
There was a problem hiding this comment.
The input also allow for shader proxies as a "serp proxy group" and we should check for it. Maybe we should remove the option to allow shader and just hardcode google serp?
There was a problem hiding this comment.
Sorry my bad. But I think even if SHADER is an option, it is better to skip the check.
- Minimum users will select it. Those that do would get a runtime error which is still understandable
- RAG is about latency, with the check you penalize everyone (when batch or cold start) with 250ms delay.
Not sure if anyone used it ever. It can be faster than SERP but only sometimes since it has to do retries so if it does more than like 1 retry, it will be slower. No strong opinion on my end.
There was a problem hiding this comment.
Yeah, On input analysis SERP is chosen 99.9% of the time lol
| ENV ACTOR_PATH_IN_DOCKER_CONTEXT="${ACTOR_PATH_IN_DOCKER_CONTEXT}" | ||
|
|
||
| # log the ACTOR_PATH_IN_DOCKER_CONTEXT variable when building the actor | ||
| RUN echo "ACTOR_PATH_IN_DOCKER_CONTEXT=${ACTOR_PATH_IN_DOCKER_CONTEXT}" |
There was a problem hiding this comment.
We migrated into using ACTOR_FULL_NAME (see example) but only pays off if you have more miniactors, here it will have minimal impact. But might be good to do it for the future
| const MINI_ACTORS: Record<string, MiniActor> = { | ||
| 'apify_rag-web-browser': { | ||
| name: 'apify_rag-web-browser', | ||
| runsSearch: true, |
There was a problem hiding this comment.
The original idea of miniactors is that their input will be merged at the start and then the rest of the code doesn't know about them at all. You just add routes and options but it doesn't matter what miniactor added them. That assumption already broke in some repos like Instagram where we carry the miniactor names along mostly for event prices which partially defeated their original purpose.
I would think if you can still go that way, just pass runSearch: boolean as standalone config.
But it doesn't have a big impact here, the number of uses of miniactors is pretty low
|
|
||
| // Throw an error if the query and is not provided and standbyInit is false. | ||
| if (!input.query && !standbyInit) { | ||
| throw new UserInputError('The `query` parameter must be provided and non-empty.'); |
There was a problem hiding this comment.
We should probably Actor.fail somewhere
| if (!input.serpProxyGroup || input.serpProxyGroup.length === 0) { | ||
| input.serpProxyGroup = ragWebBrowserInputSchema.properties.serpProxyGroup.default as SERPProxyGroup; | ||
| } else if (input.serpProxyGroup !== 'GOOGLE_SERP' && input.serpProxyGroup !== 'SHADER') { |
There was a problem hiding this comment.
We should just rely on input schemas, there is both default and enum so these are enforced. Unless this also applies to standby (there is hould be possible in the openapi schema). The same for other checks.
There was a problem hiding this comment.
The problem with this is that the input can also be passed as search parameters in the /search or /fetch http request on standby, we have to basically redo the work for it.
9c3683e to
61b3c8b
Compare
chore: update dataset details and memory settings
chore: Regex for domain name validation for URL
docs: Update texts in input_schema.json
fix: update domain regex
31c9574 to
014e9f2
Compare
This PR is not trivial and touches a lot of points. I tried my best to reuse what already was there, hopefully is somewhat readable.
A totally human summary of what's in the PR:
actors/, with each mini-actor (apify_rag-web-browserandapify_url-to-markdown) having its own.actor/directory (actor.json, input_schema.json, README, CHANGELOG).apify_url-to-markdown, a new mini-actor that is a subset of RAG Web Browser — it only fetches a single URL and converts it to Markdown (no GoogleSearch). It exposes a
/fetchstandby endpoint (vs/searchfor RAG Web Browser) and accepts aurlparameter instead ofquery.ACTOR_PATH_IN_DOCKER_CONTEXTDocker build arg determines which actor starts at runtime — eachactor.json points
dockerfileanddockerContextDirback to the shared root Dockerfile, and the platform passes the appropriate build arg per actor (this is done by specifying the miniactor folder in theSourcetab).Inputtype intoCommonInput,RagWebBrowserInput, andUrlToMarkdownInput. The code now branches onselectedMiniActor(derived from the build arg) to apply actor-specific input validation and route registration.McpServerclass, wasRagWebBrowserServer) and configures its tool name and input schema based on the selected mini-actor. (@jirispilka please help me figure out if what I did makes any lick of sense)readableTextCharThreshold(dead code) andcountryCode/languageCode(were defined in types but never used).Update after first feedback:
ACTOR_PATH_IN_DOCKER_CONTEXTDocker build arg with the platform-providedACTOR_FULL_NAMEenvironment variable for mini-actor selection. This removes the custom ARG/ENV block from theDockerfile entirely — the platform already sets
ACTOR_FULL_NAMEat runtime, so the build-time plumbing was unnecessary. TheMINI_ACTORSlookup keys are simplified accordingly (e.g. apify_rag-web-browser→ rag-web-browser).
requestTimeoutSecs, removeCookieWarnings, maxRequestRetries, dynamicContentWaitSecs) and reorder the remaining fields toput scrapingTool right after url. RAG Web Browser validates these from its input schema as before; URL to Markdown now hardcodes sensible defaults (e.g. 40s timeout, 1 retry,
cookie warnings on) since those fields are not exposed in its input schema.
checkAccess: falsewhen creating the SERP proxy configuration to skip access validation on search proxies.main.tswith anActor.fail()call so the actor reports a clear failure message when input validation rejects bad input, instead of silently crashing.actor.jsonwith a table displaying the page URL and extracted Markdown.