Small fix for querying service
Add example for serving Llama 3 8B
Use L4 instead of A10G for LLM serving example
Can't actually use `pathlib.Path` because it's designed only for file paths, not URLs :/ But `urllib.parse.urljoin` works and is a bit prettier (it's part of the Python standard library).

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
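For reference, a quick illustration of the difference (the service URL and endpoint path below are made up, not taken from the example):

```python
from pathlib import Path
from urllib.parse import urljoin

base_url = "https://my-service.example.com/"

# pathlib mangles the scheme: the "//" in "https://" gets collapsed.
print(Path(base_url) / "v1/chat/completions")
# -> https:/my-service.example.com/v1/chat/completions

# urljoin understands URLs and keeps the scheme intact.
print(urljoin(base_url, "v1/chat/completions"))
# -> https://my-service.example.com/v1/chat/completions
```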
# Why this change

We previously pinned to vLLM v0 because of a bug in Ray 2.48.0 that blocked Hugging Face tokens passed through runtime dependencies. That isn't maintainable (vLLM v0 will be deprecated soon). It turns out we can pass the token directly through engine parameters, so we can now use vLLM v1 with gated models.

# Summary

* Switched the deployment from vLLM v0 to vLLM v1.
* Added a C compiler to the minimal Dockerfile, since vLLM v1 depends on Triton, which compiles C code at runtime (see vllm-project/vllm#2997).
* Resolved the Hugging Face token issue by passing the token directly via engine parameters instead of runtime dependencies (see the sketch below).
* Updated the model to [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), which is more widely used.

---

# Testing

* Tested the Anyscale Service on an AWS cloud.

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Robert Nishihara <rkn@anyscale.com>
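A minimal sketch of what the engine-parameter approach can look like with Ray Serve's LLM API. Treat the exact field values, and in particular the `hf_token` engine argument, as assumptions based on this description rather than a copy of `serve_llama_3_1_8b.py`:

```python
import os

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Sketch only: passing `hf_token` as a vLLM engine kwarg is an assumption
# based on the description above, not lifted from the example.
llm_config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    accelerator_type="L4",
    engine_kwargs={
        # Forward the gated-model token directly to the engine instead of
        # relying on runtime dependencies (which was broken in Ray 2.48.0).
        "hf_token": os.environ["HF_TOKEN"],
    },
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
```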
Multiple changes here...

**README**

Updated per Douglas's review ([https://github.com/anyscale/docs/pull/1464](https://github.com/anyscale/docs/pull/1464)):

* Added a description at the top.
* Applied code highlighting to bash commands.
* Made `HF_TOKEN` usage clearer (explicit `export` example) and added instructions for ungated models.
* Clarified where to set the token and endpoint before querying the service.
* Added the `pip install openai` requirement before querying.
* Rephrased future tense and passive voice.

**serve\_llama\_3\_1\_8b.py**

With Ray 2.49.0 we can now forward the Hugging Face token to vLLM v1 via runtime dependencies (see the sketch below):

* Pass `HF_TOKEN` as a runtime dependency instead of an engine parameter.
* Updated the ungated-model suggestion to Unsloth's Llama variant, so we stay within the Llama family instead of switching to Qwen.

**Dockerfile**

Simplified for Ray 2.49.0:

* No need to pin `transformers==4.53.3` anymore.
* Removed the `uv` installation; if the goal is to build a minimal image based on Ray, keeping `uv` might confuse users.

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Robert Nishihara <rkn@anyscale.com>
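To contrast with the earlier engine-parameter sketch, here is a hedged sketch of the runtime-dependency route described above. The `runtime_env` / `env_vars` shape follows standard Ray conventions, but treat the exact fields as assumptions rather than a copy of `serve_llama_3_1_8b.py`:

```python
import os

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Sketch only: with Ray 2.49.0 the token can reach the vLLM v1 workers as a
# runtime dependency (an environment variable) instead of an engine kwarg.
llm_config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    accelerator_type="L4",
    runtime_env={"env_vars": {"HF_TOKEN": os.environ["HF_TOKEN"]}},
)

serve.run(build_openai_app({"llm_configs": [llm_config]}), blocking=True)
```

Before deploying, `export HF_TOKEN=<your token>` in the shell, or switch to an ungated model such as the Unsloth variant mentioned above.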
Add a tutorial on deploying Llama 3.1 8B-Instruct. Follow the same format as the Llama 3.1 8B example for consistency (see this PR: [https://github.com/anyscale/examples/pull/12](https://github.com/anyscale/examples/pull/12)). As with the reasoning for choosing L4 GPUs in the Llama 3 8B example, here we use 8×A100 GPUs, which are available in both our AWS and GCP clouds.

---------

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Aydin Abiar <aydin@anyscale.com>
Co-authored-by: Robert Nishihara <rkn@anyscale.com>
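For context, a sketch of how "8×A100" typically shows up in the engine arguments of these serve scripts; the model ID is a placeholder and the accelerator label is an assumption, not copied from the tutorial:

```python
from ray.serve.llm import LLMConfig

# Sketch only: one engine replica spread across 8 GPUs via tensor parallelism.
llm_config = LLMConfig(
    model_loading_config={"model_id": "<model-id-for-this-tutorial>"},  # placeholder
    accelerator_type="A100-80G",                # assumed accelerator label
    engine_kwargs={"tensor_parallel_size": 8},  # one replica spans 8 GPUs
)
```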
A reviewer commented on `def download_webdataset(` in the diff (a few lines below `return asyncio.run(process_batch(batch, output_dir, batch_num))`):
This assumes the whole dataset fits on disk in one machine, right? (Fine for LAION since it is just URLs, but probably not in general.)
What's the best way to get data into NeMo Curator? E.g., would it make sense to use Ray Data to read the data and stream it in? Or does NeMo Curator have methods for this?
The author replied:
Since NeMo Curator uses NVIDIA DALI, I think the ideal data-loading story would be to have all the images in something like S3, partitioned into different tar shards. We can then mount the S3 bucket on each of the nodes, with each node accessing only the subset of tar shards it is processing. Would you like me to build this into the example?
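Not part of the example yet, but a rough sketch of the shard-assignment idea above; the mount path, shard count, and worker count are made up for illustration:

```python
import ray

# Sketch only: WebDataset tar shards (e.g. an S3 bucket mounted on every node),
# partitioned so that each worker only reads its own subset.
shard_paths = [f"/mnt/laion/shards/{i:05d}.tar" for i in range(1000)]  # illustrative

@ray.remote
def process_shards(shards: list[str]) -> int:
    # Hand this subset of shards to DALI / NeMo Curator on the local node.
    return len(shards)

num_workers = 8  # e.g. one worker per node (real code would pin tasks to nodes)
assignments = [shard_paths[rank::num_workers] for rank in range(num_workers)]
print(ray.get([process_shards.remote(s) for s in assignments]))
```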