25 May 07:28

mudler

1dcd1ae

v4.3.1 Latest


What's Changed

Other Changes

Fix kokoros backend build break from Backend trait drift by @Copilot in #9972
chore: ⬆️ Update antirez/ds4 to f91c12b50a1448527c435c028bfc70d1b00f6c33 by @localai-bot in #9975
chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9f7ba245ab41e118f03aa8dd5134d18a81159d02 by @localai-bot in #9973
chore: ⬆️ Update ggml-org/llama.cpp to 549b9d84330c327e6791fa812a7d60c0cf63572e by @localai-bot in #9974

Full Changelog: v4.3.0...v4.3.1

Contributors

localai-bot


24 May 20:25

mudler

v4.3.0

1a30020

v4.3.0

🎉 LocalAI 4.3.0 Release! 🚀

LocalAI 4.3.0 is out!

This release hardens the trust boundary and improves defaults for speed. Backend OCI images now ship with keyless cosign signatures and a per-gallery `verification:` policy, with an opt-in strict mode that fails closed.
The llama-cpp server-side prompt cache now works by default: repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) collapse from minutes to seconds without touching YAML. Distributed mode gets another round of optimizations. Usage tracking grows a per-API-key + per-user Sources view so admins can finally answer "who is burning the GPU?". And for everyone on a Jetson/DGX box, the L4T13 (cu130/aarch64) backends are back.

📌 TL;DR

Feature	Summary
🔐 Signed Backends	Keyless cosign + sigstore-go verification for backend OCI images, OCI 1.1 referrers, `not_before` revocation, opt-in strict mode.
⚡ Prompt Cache by Default	`llama-cpp` server-side prompt cache works out of the box. Repeated system prompts go from 5-8 min to seconds.
📊 Usage per API Key	New Sources tab attributes traffic to keys and users. Revoked keys stay readable in history.
🛰️ Distributed v3	Per-request replica routing, cached `probeHealth`, async per-node installs with streaming progress, unified backend-logs entry point.
🩺 Traces UI Stays Snappy	`LOCALAI_TRACING_MAX_BODY_BYTES` caps API + backend trace payloads. Admin Traces page stops drowning in 40 MB embeddings.
🧊 Nix Flake	Dockerless setup for NixOS users via `flake.nix` + dev shell.
🦾 Jetson Thor Restored	`vllm` / `sglang` / `vllm-omni` L4T13 backends switched to PyPI aarch64+cu130 wheels (torch 2.10 ABI fix).

🚀 New Features & Major Enhancements

🔐 Signed Backends with Keyless Cosign

LocalAI now verifies that backend OCI images came from our CI, not a compromised registry or MITM. This closes a real trust gap: the gallery YAML told LocalAI which image to pull, but nothing checked the bytes.

The producer side (.github/workflows/backend_merge.yml) signs every merged backend image (and every per-arch entry under the manifest list) with sigstore/cosign keyless via Fulcio + Rekor, using OCI 1.1 referrers (no legacy :tag.sig). The consumer side (pkg/oci/cosignverify, built on sigstore-go) verifies signatures against a per-gallery `verification:` policy:

verification:
  issuer_regex: "^https://token\\.actions\\.githubusercontent\\.com$"
  identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@.*$"
  not_before: "2026-05-22T00:00:00Z"

The TUF trusted root is cached process-wide, so N backends from one gallery trigger 1 fetch, not N.
not_before is the revocation lever: keyless Fulcio certs are ephemeral, so revocation is policy-side. Advance the date in the gallery YAML and every signature predating the cutoff is invalidated.
Digest pinning closes the TOCTOU window between verify and pull.
Strict mode: --require-backend-integrity (or LOCALAI_REQUIRE_BACKEND_INTEGRITY=true) escalates missing policy / empty SHA256 from warn to hard-fail.

Rollout is backward-compatible: until a gallery ships a verification: block, installs proceed with a warning. The default backend/index.yaml will be populated next, and strict mode is opt-in. See .agents/backend-signing.md for the full producer + consumer story.
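
As a rough illustration of how such a policy could be evaluated, here is a minimal Python sketch. It is not the code in pkg/oci/cosignverify: the names `VerificationPolicy`, `signature_acceptable` and `install_decision` are hypothetical, the choice of which timestamp counts as the signing time is an assumption, and the cryptographic verification itself (Fulcio certificate chain, Rekor inclusion) is left out entirely. Only the three policy fields and the warn vs. hard-fail behavior come from the notes above.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class VerificationPolicy:
    """Hypothetical in-memory form of a gallery `verification:` block."""
    issuer_regex: str
    identity_regex: str
    not_before: Optional[datetime] = None


def signature_acceptable(policy, issuer, identity, signed_at):
    """Policy checks only; assumes the signature already verified cryptographically."""
    if not re.search(policy.issuer_regex, issuer):
        return False  # certificate was not issued for the expected OIDC issuer
    if not re.search(policy.identity_regex, identity):
        return False  # signer is not the expected workflow
    if policy.not_before is not None and signed_at < policy.not_before:
        return False  # signature predates the revocation cutoff
    return True


def install_decision(policy, strict):
    """Missing-policy handling: 'verify', 'warn' (default) or 'fail' (strict mode)."""
    if policy is not None:
        return "verify"
    return "fail" if strict else "warn"


policy = VerificationPolicy(
    issuer_regex=r"^https://token\.actions\.githubusercontent\.com$",
    identity_regex=r"^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@.*$",
    not_before=datetime(2026, 5, 22, tzinfo=timezone.utc),
)
```

Advancing `not_before` in this sketch rejects every signature with an earlier timestamp, which is the revocation lever described above. The empty-SHA256 case that strict mode also escalates is not modeled here.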

🔗 PRs: #9823 (consumer + producer + plumbing), #9957 (fix for current cosign releases).

⚡ Prompt Cache: On by Default

llama-cpp ships with a server-side prompt cache, but until now LocalAI was not enabling it by default. Repeated system prompts (agents, Claude-Code-style coding assistants, OpenAI-compatible CLIs with long instructions) were re-prefilled on every call. With this release, the same workload collapses to seconds with no configuration on your side.

Two changes, one default flip each:

kv_unified=true by default in grpc-server.cpp. The previous default of false was silently force-disabling cache_idle_slots at server init (the host prompt cache was being allocated but never written across requests).
prompt_cache_all defaults to true at the YAML layer, matching upstream llama.cpp's own common.h default. The per-request cache_prompt knob is now on out of the box.

You can still opt out with options: ["kv_unified:false"] or prompt_cache_all: false, and there are new option keys (cache_idle_slots, checkpoint_every_nt) for tuning. Docs in docs/content/advanced/model-configuration.md got a worked example for the repeated-system-prompt workload and a proper explanation of how kv_unified, cache_ram, and cache_idle_slots interact.

🔗 PRs: #9925 (kv_unified + cache_idle_slots defaults + docs), #9951 (prompt_cache_all tristate default).
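
The "tristate default" idea can be sketched in a few lines of Python: an unset value resolves to the new default, while an explicit `true` or `false` is respected. This is only an illustration under assumptions. The function names are hypothetical, and the parsing of `key:value` option strings (including treating only the literal `true` as truthy) is a guess at the shape, not LocalAI's actual parser.

```python
from typing import List, Optional


def effective_prompt_cache_all(configured: Optional[bool]) -> bool:
    """None means 'not set in YAML' and resolves to the new default (True)."""
    return True if configured is None else configured


def effective_kv_unified(options: List[str]) -> bool:
    """Default True unless an option like 'kv_unified:false' overrides it."""
    value = True
    for opt in options:
        key, sep, raw = opt.partition(":")
        if sep and key.strip() == "kv_unified":
            # Assumption: only the literal string "true" enables the flag.
            value = raw.strip().lower() == "true"
    return value
```

The point of the third state is that a user who never mentioned `prompt_cache_all` gets the cache, while a user who wrote `prompt_cache_all: false` keeps it off.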

📊 Per-API-Key Usage Tracking

Closes #9862. The usage page now answers "who spent these tokens?", not just "how many tokens were spent".

usage_records gained Source (apikey / web / legacy), APIKeyID, APIKeyName, plus an idempotent backfill of pre-feature rows on InitDB.
Auth middleware plumbs the resolved *UserAPIKey and the request source through the Echo context. Usage middleware snapshots the key id + name, so revoked keys stay readable in history (rendered as (revoked)).
New endpoints: GET /api/auth/usage/sources (self, no legacy) and GET /api/auth/admin/usage/sources (admin, with user_id / api_key_id filters, 200-key truncation).
React Usage page gains a Sources tab with a source-mix ribbon, a top-7 + Other time chart, and a searchable/sortable table with drill-in chip.
Admin view (follow-up in #9935) also rolls up (source, user_id, user_name) so Web UI session traffic is split per user instead of lumped into one global "Web UI" row, and every named-key row shows the owning account.
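
The "top-7 + Other" grouping used by the time chart can be illustrated with a short Python sketch. The function name and the input shape (a mapping from source label to token count) are assumptions made for illustration; the real aggregation lives in the React UI and the usage endpoints and may differ.

```python
from typing import Dict, List, Tuple


def top_n_with_other(tokens_by_source: Dict[str, int], n: int = 7) -> List[Tuple[str, int]]:
    """Keep the n largest sources and fold the remainder into one 'Other' row."""
    # Sort by token count descending, then by name for a stable order on ties.
    ranked = sorted(tokens_by_source.items(), key=lambda kv: (-kv[1], kv[0]))
    head = ranked[:n]
    if len(ranked) > n:
        head.append(("Other", sum(count for _, count in ranked[n:])))
    return head
```

With nine sources, the result has eight rows: the seven largest plus a single "Other" row carrying the sum of the remaining two.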

Docs: features/authentication.md gained a full Usage Tracking section with the new tab, endpoints, response shape and migration notes.

🔗 PRs: #9920 (core + Sources tab), #9935 (per-user attribution in admin view).

🛰️ Distributed Mode v3

Distributed mode keeps hardening. This release fixes the two things that bit operators hardest in practice and lays the groundwork for the next round of UX.

Per-request routing across replicas (#9968) restores cross-node load balancing. The bug: ModelLoader.Load cached a *Model whose embedded InFlightTrackingClient was bound to a single (nodeID, replicaIndex). After the first request, every subsequent call reused that wrapper and pinned to whichever node won the first pick, even after the reconciler scaled the model out. The reproducer from the report:

dgx-spark1     loaded   in_flight=6
nvidia-thor1   loaded   in_flight=0       (← idle, never gets traffic)

Now SmartRouter.Route runs per request, the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually fires, and the replica-selection rule lives in one place (PickBestReplica) with a mirror spec asserting the SQL ORDER BY and the Go picker agree on a seeded dataset. probeHealth is now memoized per (nodeID, addr) with a 30s TTL and singleflight coalescing, so a burst of new requests doesn't stall on a HealthCheck that llama.cpp serializes against in-flight Predict.
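
The ordering rule above (`in_flight ASC, last_used ASC, available_vram DESC`) is easy to express as a sort key. The following Python sketch is not the Go `PickBestReplica` implementation; the `Replica` fields and the `pick_best_replica` name are hypothetical stand-ins used only to show the rule being applied on every request.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    node_id: str
    replica_index: int
    in_flight: int        # requests currently being served
    last_used: float      # unix timestamp of the last routed request
    available_vram: int   # bytes


def pick_best_replica(replicas: List[Replica]) -> Optional[Replica]:
    """Fewest in-flight first, then least recently used, then most free VRAM."""
    if not replicas:
        return None
    return min(replicas, key=lambda r: (r.in_flight, r.last_used, -r.available_vram))


replicas = [
    Replica("dgx-spark1", 0, in_flight=6, last_used=200.0, available_vram=64 << 30),
    Replica("nvidia-thor1", 0, in_flight=0, last_used=100.0, available_vram=32 << 30),
]
best = pick_best_replica(replicas)
```

For the reproducer shown earlier, this selects `nvidia-thor1`, the idle node. Because the choice is recomputed per call rather than cached with the loaded model, the selection follows the load as `in_flight` changes.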

Async per-node installs via the gallery job queue (#9928). POST /api/nodes/:id/backends/install used to block the request for up to 3 minutes while the worker pulled the image, freezing the React UI's Backends picker. It now returns HTTP 202 + jobID immediately, scoped to a one-element targetNodeIDs allowlist, with a node-scoped opcache row so concurrent installs on different nodes don't collide. The Operations panel surfaces a nodeID field for attribution.

Resilient backend installs with streaming progress (#9958). Two phases:

Phase 1: LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT env vars (default 15m, previously hardcoded 3m). A NATS round-trip timeout while the worker is still pulling no longer reports as a hard failure: per-node status becomes running_on_worker, the queue row stays alive without bumping Attempts, and ListBackends proactively clears install rows whose intent is satisfied (so the UI updates instantly instead of waiting up to 15m for the next reconciler tick).
Phase 2: workers publish debounced (~250ms) BackendInstallProgressEvent values on a transient nodes.<nodeID>.backend.install.<opID>.progress subject. The master subscribes for the duration of the request and forwards each event into OpStatus.UpdateStatus, so the admin UI gets per-byte progress for distributed installs the same way local-mode does, with no UI changes. Backward compatible: old workers stay silent, new masters tolerate silence.
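
To make Phase 2 more concrete, here is a small Python sketch of rate-limited progress publishing. The notes only say events are "debounced (~250ms)"; the leading-edge throttle below is one simple way to get roughly one event per interval and is an assumption, not the worker's actual strategy. `DebouncedPublisher`, `progress_subject` and the `final` flag are hypothetical names; only the subject layout is taken from the text above.

```python
import time
from typing import Any, Callable, Optional


def progress_subject(node_id: str, op_id: str) -> str:
    """Build the transient subject name described in the release notes."""
    return f"nodes.{node_id}.backend.install.{op_id}.progress"


class DebouncedPublisher:
    """Forward at most one event per min_interval seconds, plus the final one."""

    def __init__(self, publish: Callable[[Any], None],
                 min_interval: float = 0.25,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._publish = publish
        self._min_interval = min_interval
        self._clock = clock
        self._last_sent: Optional[float] = None

    def update(self, event: Any, final: bool = False) -> bool:
        """Return True if the event was published, False if it was dropped."""
        now = self._clock()
        due = self._last_sent is None or now - self._last_sent >= self._min_interval
        if final or due:
            self._publish(event)
            self._last_sent = now
            return True
        return False
```

The injectable `clock` keeps the sketch testable without sleeping. Always publishing the final event means the UI never gets stuck just short of 100%, even when the last update arrives inside the quiet window.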

Unified backend-logs entry point (#9949). /app/backend-logs/:modelId is now a single, mode-aware route. In standalone it's the local WebSocket view, unchanged. In distributed it probes nodesApi.getModels, filters by model_name, then routes: 0 hits → empty state with a link to Nodes; 1 hit → <Navigate replace> to the per-node logs URL preserving the ?from= deep-link timestamp; N hits → a picker listing each hosting worker with node id, replica index and load state. Every view that links to backend logs now points at the same URL.
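
The 0 / 1 / N routing decision can be summarized in a short Python sketch. The real logic lives in the React UI; everything here is illustrative. The `model_name` field comes from the description above, while the function name, the return shape, and the `per_node_url` callback (used so that no per-node URL layout has to be guessed) are assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urlencode


def resolve_backend_logs_view(
    model_id: str,
    node_models: List[Dict],
    per_node_url: Callable[[Dict], str],
    from_ts: Optional[str] = None,
) -> Tuple[str, object]:
    """Decide which view a mode-aware backend-logs route should render."""
    hits = [m for m in node_models if m.get("model_name") == model_id]
    if not hits:
        return ("empty", None)  # empty state with a link to Nodes
    if len(hits) == 1:
        url = per_node_url(hits[0])
        if from_ts is not None:
            # Preserve the ?from= deep-link timestamp across the redirect.
            url = f"{url}?{urlencode({'from': from_ts})}"
        return ("redirect", url)
    return ("picker", hits)  # let the user choose among hosting workers
```

Keeping the decision in one function mirrors the goal stated above: every link to backend logs targets the same route, and the route alone decides what to show.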

Bug-hunt harness. A new distributed test harness landed in tests/distributed/ to catch the kind of regressions the #9968 reproducer surfaced.

🔗 P...

Contributors

inquam, RinZ27, and Azteczek


16 May 21:12

mudler

v4.2.6

6a48157

v4.2.6

What's Changed

Other Changes

feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults by @localai-bot in #9852
docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9853
chore: ⬆️ Update antirez/ds4 to ef0a4905d05263df8e63689f2dd1efac618a752c by @localai-bot in #9857
chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3e573cfea6e0a332eff822ffbdb1dd3b112e9051 by @localai-bot in #9856
chore: ⬆️ Update leejet/stable-diffusion.cpp to bd17f53b7386fb5f60e8587b75e73c4b2fed3426 by @localai-bot in #9854

Full Changelog: v4.2.5...v4.2.6

Contributors

localai-bot


16 May 16:44

mudler

v4.2.5

661a0c3

v4.2.5

What's Changed

Bug fixes 🐛

fix(ollama): guard nil filter in galleryop.ListModels (#9817) by @localai-bot in #9836
realtime: honor output_modalities to skip TTS in text-only mode by @localai-bot in #9838
fix(ollama): accept float-encoded integer options (fixes #9837) by @localai-bot in #9849

Other Changes

chore: ⬆️ Update ggml-org/llama.cpp to 7f3f843c31cd32dc4adc10b393342dfee071c332 by @localai-bot in #9809
feat(llama-cpp): expose 12 missing common_params via options[] by @localai-bot in #9814
fix(streaming): comply with OpenAI usage / stream_options spec by @localai-bot in #9815
Close Hugging Face scan response body by @massy-o in #9818
Validate video image URLs before download by @massy-o in #9819
feat(swagger): update swagger by @localai-bot in #9824
chore: ⬆️ Update antirez/ds4 to 04b6fda2be395094cbf2d20d921e7a705a4166ef by @localai-bot in #9830
chore: ⬆️ Update ggml-org/whisper.cpp to 46ca43d6399fdeada1b49fb2126ba373bd9ebc38 by @localai-bot in #9829
chore: ⬆️ Update ikawrakow/ik_llama.cpp to 0fcffdb64d21e57f0778f342415754156e01adfa by @localai-bot in #9828
docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9825
chore: ⬆️ Update leejet/stable-diffusion.cpp to 0b8296915c4094090cff6bd2e09a5e98288c3c7d by @localai-bot in #9827
chore: ⬆️ Update ggml-org/llama.cpp to 834a243664114487f99520370a7a7b00fc7a486f by @localai-bot in #9826
Validate archive member paths before extraction by @massy-o in #9820
fix(deps): bump gomarkdown/markdown for GHSA-77fj-vx54-gvh7 by @richiejp in #9841
chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.21.0 by @localai-bot in #9846
chore: ⬆️ Update ikawrakow/ik_llama.cpp to 5cc0d86c760e9858e4bed4418400bb39dbe025f2 by @localai-bot in #9845
chore: ⬆️ Update antirez/ds4 to 950e8e6474a1c9fabe04e669d607606a7ef8824f by @localai-bot in #9844
chore: ⬆️ Update ggml-org/whisper.cpp to 968eebe77225d25e57a3f981da7c696310f0e881 by @localai-bot in #9843
chore: ⬆️ Update ggml-org/llama.cpp to 1348f67c58f561808136e8a152a9eddec168f221 by @localai-bot in #9842

New Contributors

@massy-o made their first contribution in #9818

Full Changelog: v4.2.4...v4.2.5

Contributors

richiejp, massy-o, and localai-bot


13 May 22:32

mudler

v4.2.4

42a8db3

v4.2.4

What's Changed

Bug fixes 🐛

fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status by @localai-bot in #9754
fix(http): honor X-Forwarded-Prefix when proxy strips the prefix by @Dennisadira in #9614
fix(agentpool): close truncate-then-read race in agent_jobs.json persistence by @localai-bot in #9811
fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions by @Anai-Guo in #9559

Exciting New Features 🎉

feat: also parse VRAM budget/usage from vulkaninfo by @eglia in #9800
feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page by @richiejp in #9801

Other Changes

chore: ⬆️ Update ggml-org/llama.cpp to a9883db8ee021cf16783016a60996d41820b5195 by @localai-bot in #9796
chore: ⬆️ Update TheTom/llama-cpp-turboquant to 5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403 by @localai-bot in #9740
docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9805
chore: ⬆️ Update antirez/ds4 to 0cba357ca1bc0e7510421cc26888e420ea942123 by @localai-bot in #9806
chore: ⬆️ Update ikawrakow/ik_llama.cpp to 949bb8f1d660fc1264c137a6f3dbd619375f6134 by @localai-bot in #9807
chore: ⬆️ Update ggml-org/whisper.cpp to 3e9b7d0fef3528ee2208da3cdb873a2c53d2ae2f by @localai-bot in #9808
ci(image): publish missing :latest-* and :v-* singleton image tags by @localai-bot in #9812

Full Changelog: v4.2.3...v4.2.4

Contributors

richiejp, Dennisadira, and 3 other contributors


12 May 22:41

mudler

v4.2.3

957619a

v4.2.3

What's Changed

Other Changes

chore: ⬆️ Update ggml-org/whisper.cpp to 338cce1e58133261753243802a0e7a430118866d by @localai-bot in #9793
chore: ⬆️ Update antirez/ds4 to f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f by @localai-bot in #9794
docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9792
chore: ⬆️ Update ikawrakow/ik_llama.cpp to f9a93c37e2fc021760c3c1aa99cf74c73b7591a7 by @localai-bot in #9795

Full Changelog: v4.2.2...v4.2.3

Contributors

localai-bot


12 May 15:39

mudler

v4.2.2

bc4cd3d

v4.2.2

What's Changed

Bug fixes 🐛

fix: parse vulkan VRAM from text by @eglia in #9669
fix(ollama): accept prompt alias on /api/embed for Ollama parity by @localai-bot in #9780

👒 Dependencies

chore(deps): bump node from 25-slim to 26-slim by @dependabot[bot] in #9769
chore(deps): bump actions/upload-artifact from 4 to 7 by @dependabot[bot] in #9770
chore(deps): bump actions/download-artifact from 4 to 8 by @dependabot[bot] in #9771
chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.27.0 to 1.42.0 by @dependabot[bot] in #9772
chore(deps): bump github.com/onsi/gomega from 1.39.1 to 1.40.0 by @dependabot[bot] in #9774
chore(deps): update transformers requirement from >=5.0.0 to >=5.8.0 in /backend/python/transformers by @dependabot[bot] in #9775
chore(deps): bump github.com/fsnotify/fsnotify from 1.9.0 to 1.10.1 by @dependabot[bot] in #9778
chore(deps): update charset-normalizer requirement from >=3.4.0 to >=3.4.7 in /backend/python/vllm by @dependabot[bot] in #9779
chore(deps): bump github.com/mudler/edgevpn from 0.31.1 to 0.32.2 by @dependabot[bot] in #9773
chore(deps): bump the npm_and_yarn group across 1 directory with 3 updates by @dependabot[bot] in #9728

Other Changes

ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 by @localai-bot in #9781
feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options by @localai-bot in #9765

Full Changelog: v4.2.1...v4.2.2

Contributors

eglia, dependabot, and localai-bot


11 May 22:47

mudler

v4.2.1

bc3fb16

v4.2.1

What's Changed

Exciting New Features 🎉

feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache by @localai-bot in #9758

👒 Dependencies

chore(deps): bump the go_modules group across 1 directory with 2 updates by @dependabot[bot] in #9759

Other Changes

docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9762
ci(bump-deps): register ds4 + move version pin into the Makefile by @localai-bot in #9761
chore: ⬆️ Update ikawrakow/ik_llama.cpp to eb570eb96689c235933b813693ca28ab9d3d26de by @localai-bot in #9764
feat(ollama): report model capabilities + details on /api/tags and /api/show by @localai-bot in #9766

Full Changelog: v4.2.0...v4.2.1

Contributors

dependabot and localai-bot


11 May 12:48

mudler

v4.2.0

b9e81db

v4.2.0

🎉 LocalAI 4.2.0 Release! 🚀

LocalAI 4.2.0 is out!

This release teaches LocalAI to see and hear. New /v1/voice/* and /v1/audio/diarization endpoints, a full face-recognition pipeline with antispoofing, word-level timestamps for faster-whisper, and a client-cancellable Whisper. There is also a drop-in Ollama API, video generation in stable-diffusion.ggml, a redesigned chat with i18n and admin-configurable branding, eleven new backends, an interactive model config editor with autocomplete, and a hardened distributed mode v2. vLLM finally hits feature parity with llama.cpp and gets tensor-parallel distributed workers.

📌 TL;DR

Feature	Summary
🎙️ Voice Recognition	New `/v1/voice/*`. Verify, identify, embed and analyze speakers.
👤 Face Recognition + Liveness	1:1 verify, 1:N identify, detect, analyze, embed, and reject spoofed photos.
🎬 Diarization	New `/v1/audio/diarization` endpoint, "who spoke when?" via sherpa-onnx + vibevoice.cpp.
🗣️ Better Transcriptions	Word-level timestamps, client-cancellable Whisper, segments + duration + language on the stream-done event.
🦙 Ollama API	Drop-in compatibility. Point your `ollama` client straight at LocalAI.
🎬 Video Generation	`stable-diffusion.ggml` now generates video (i2v, first-last-frame).
💬 Redesigned UI	Chat redesign, Nord palette, i18n (5 languages), admin-configurable branding.
✏️ Interactive Model Editor	Autocomplete-driven config editor in the UI.
📦 Universal Importer	Imports across most backends, not just llama.cpp.
🚦 Concurrency Groups	Per-model exclusive groups for safe backend loading.
🧪 11 New Backends	sglang, ik-llama-cpp, TurboQuant, sam.cpp, Kokoros, qwen3tts.cpp, tinygrad-multimodal, LocalVQE, vibevoice-cpp, insightface (liveness), voice-rec.
⚡ vLLM @ parity	Feature parity with llama.cpp + tensor-parallel distributed workers + full `engine_args`.
🛰️ Distributed v2	Hardened orchestrator, round-robin replicas, scoped Upgrade All, NATS install/upgrade split.

🚀 New Features & Major Enhancements

🎙️ Voice Recognition

LocalAI is now ears-on. New /v1/voice/* endpoints let you verify, identify, analyze and embed speakers, powered by a SpeechBrain + ONNX Python backend.

1:1 Verify, "is this the same speaker?"
1:N Identify, "who is talking, out of my enrolled users?"
Embeddings, voice fingerprints for your own pipelines
Analyze, age, gender, emotion attributes per segment

🔥 Pairs naturally with the new diarization endpoint for full speaker pipelines.

voice.mp4

👤 Face Recognition & Antispoofing

A complete face-biometrics pipeline, built on InsightFace + ONNX.

1:1 Verify, match two faces
1:N Identify, resolve a face against an enrolled set
Detection & Analysis, find faces, extract attributes (age, gender, emotion, race)
Embeddings, facial fingerprints for your own stack
🆕 Antispoofing (liveness), reject spoofed photos and videos

✅ Samples never leave your machine. They go only to the running backend.

face.mp4

🎬 Diarization & a smarter audio pipeline

Audio is a first-class citizen now.

/v1/audio/diarization, segments speech by speaker turn (sherpa-onnx + vibevoice.cpp)
Word-level timestamps for faster-whisper
Client cancellation for Whisper via the ggml abort_callback. Stop a transcription mid-flight and free the GPU.
Stream-done metadata on /v1/audio/transcriptions. segments, duration and language on the final event.
Audio transformations UI (LocalVQE), explore audio FX directly from the React UI
Transcription error visibility, handler errors land in the access log and on the client

🦙 Ollama drop-in API

Point your existing Ollama client at LocalAI. Everything keeps working. Another front door, same engine.

OLLAMA_HOST=http://localhost:8080 ollama run qwen3

🎬 Video Generation

The stable-diffusion.ggml backend now generates video, with curated gallery entries for Wan 2.1 FLF2V 14B 720P and Wan i2v 720p, plus a new stablediffusion-ggml-development meta backend to track the cutting edge.

🎨 React UI: total refresh

A massive UI cycle landed in 4.2:

💬 Chat redesign, cleaner layout, faster perceived latency, better message density
🎨 Editorial refresh with the Nord palette, calmer, more focused, dark-mode-first
🌍 Multilingual / i18n, English, Italiano, Español, Deutsch, 简体中文
🪪 Brandable instance, admin-configurable name, tagline, and assets (logo, favicon)
✏️ Interactive model config editor, autocomplete over known fields, live validation, automatic file-renaming on save
🧰 Backend management UX, revamped backend list with concrete versions
🛟 Better error UX, distributed backend management errors surface cleanly

💡 Self-host with your branding. The login page, sidebar, footer, and browser tab all pick up the instance name and logo.

chat.mp4

i18n.mp4

🔄 Backend & model lifecycle

Backend versioning with automatic upgrade detection
Pin models so they survive the reaper
On-demand toggle per model to control auto-load
Concurrency groups, per-model exclusive groups so heavy backends won't trample each other
Universal importer, single flow that imports across most backends, with clean multi-shard GGUF handling and dedicated importers for vibevoice-cpp and whisper.cpp HF repos

importer.mp4

model-editor.mp4

🧪 New Backends!

Backend	What it brings
sglang	High-throughput LLM serving + speculative decoding (EAGLE/EAGLE3/DFLASH/MTP)
ik-llama.cpp	ikawrakow's llama.cpp fork
TurboQuant	Quant-focused llama.cpp fork
sam.cpp	Segment Anything detection
Kokoros	Rust-native Kokoro TTS
qwen3tts.cpp	Qwen3 TTS
tinygrad-multimodal (experimental)	tinygrad-powered multimodal
vibevoice.cpp	Diarization-grade speech
LocalVQE	Audio transformations / FX
insightface	Face antispoofing
voice-rec	Speaker recognition / embeddings

⚡ vLLM at parity (and beyond)

vLLM parity with llama.cpp, same feature surface, same ergonomics
vLLM engine_args, the full AsyncEngineArgs exposed via a generic YAML map
Tensor-parallel distributed workers, fan a single model across nodes
CUDA 13 builds for vLLM, vLLM-omni and sglang
L4T arm64 (CUDA 13), vLLM/vLLM-omni/sglang variants for Jetson-class arm64
MLX backend refactored, shared helpers and enhanced functionality
llama.cpp split_mode for explicit multi-GPU placement
Speculative decoding wired through for llama.cpp, Gemma 4 thinking support added
Vision / mtmd marker propagated from the backend via ModelMetadata

🛰️ Distributed Mode v2

Distributed mode keeps maturing. This release was a hardening pass across the orchestration loop:

Orchestrator resilience, auto-upgrade routing, worker bind-wait, RAG-init crash, log-spam fixes
Round-robin across replicas of the same model
Upgrade All scoped to nodes that actually have the backend installed
NATS install / upgrade split, backend.upgrade no longer piggybacks on install
Cached-replica lookup honors NodeSelector, the reconciler no longer scales up empty backends
VRAM/RAM reporting correct on NVIDIA unified-memory hosts
Agent nodes, queue loops stop on teardown, dead-letter cap added
Autoscaling, load-model extracted from Route() and applied during autoscale

🔐 Auth & Security

Settings API, env-supplied ApiKeys are stripped before persisting (no accidental leaks)
grpc-server hardening, removed unsafe sprintf() in the C++ grpc server
OIDC, bumped go-oidc/v3 to 3.18.0
Security hardening pass across the codebase
AI coding assistants policy, LocalAI now follows the Linux kernel's DCO/attribution guidelines (Assisted-by: trailer, no AI co-authors)

🖥️ Hardware & deployment

CUDA 13 for vLLM, vLLM-omni, and sglang
NVIDIA L4T arm64 (CUDA 13) for Jetson-class boards
ROCm 7.x bumped to latest
gfx1151 (Strix Halo / Ryzen AI MAX) support, AMDGPU_TARGETS exposed as a build-arg
Intel GPU, latest oneapi-basekit (b70 support) across Intel images
arm64 CI, cpu-whisperx and cpu-faster-whisper now ship arm64 images
whisperx, ROCm/HIPBLAS target dropped (pinned to rocm6.4 wheels)

🛠️ Under the Hood

Better CLI errors with actionable guidance
golangci-lint baseline (new-from-merge-base) keeps drift in check
Coding-agent discoverability, new APIs let coding agents introspect and configure LocalAI
Autoparser, prefers backend-emitted chat deltas, correct logprob passthrough, strips partial reasoning tags during warm-up
Reasoning + tools, no more empty content from thinking models in retry loops
Streaming hygiene, deduped content, dedup...

Contributors

russell, egyptianbman, and 20 other contributors


06 Apr 23:05

mudler

v4.1.3

fdc9f7b

v4.1.3

What's Changed

Bug fixes 🐛

fix(token): login via legacy api keys by @mudler in #9249
fix(anthropic): do not emit empty tokens and fix SSE tool calls by @mudler in #9258
fix(gpu): better detection for MacOS and Thor by @mudler in #9263

👒 Dependencies

chore(deps): bump google.golang.org/grpc from 1.79.3 to 1.80.0 by @dependabot[bot] in #9253
chore(deps): bump github.com/jaypipes/ghw from 0.23.0 to 0.24.0 by @dependabot[bot] in #9250
chore(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.12 to 1.32.14 by @dependabot[bot] in #9256
chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.64.0 to 0.65.0 by @dependabot[bot] in #9254

Other Changes

chore: ⬆️ Update ggml-org/llama.cpp to d0a6dfeb28a09831d904fc4d910ddb740da82834 by @localai-bot in #9259
docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9260
chore: ⬆️ Update ace-step/acestep.cpp to e0c8d75a672fca5684c88c68dbf6d12f58754258 by @localai-bot in #9261
chore: ⬆️ Update leejet/stable-diffusion.cpp to 8afbeb6ba9702c15d41a38296f2ab1fe5c829fa0 by @localai-bot in #9262

Full Changelog: v4.1.2...v4.1.3

Contributors

mudler, dependabot, and localai-bot


Releases: mudler/LocalAI
