Releases: mudler/LocalAI
v4.3.1
What's Changed
Other Changes
- Fix kokoros backend build break from Backend trait drift by @Copilot in #9972
- chore: ⬆️ Update antirez/ds4 to f91c12b50a1448527c435c028bfc70d1b00f6c33 by @localai-bot in #9975
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9f7ba245ab41e118f03aa8dd5134d18a81159d02 by @localai-bot in #9973
- chore: ⬆️ Update ggml-org/llama.cpp to 549b9d84330c327e6791fa812a7d60c0cf63572e by @localai-bot in #9974
Full Changelog: v4.3.0...v4.3.1
v4.3.0
🎉 LocalAI 4.3.0 Release! 🚀
LocalAI 4.3.0 is out!
This release hardens the trust boundary and improves defaults for speed. Backend OCI images now ship with keyless cosign signatures and a per-gallery verification: policy, with an opt-in strict mode that fails closed.
The llama-cpp server-side prompt cache works by default: repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) collapse from minutes to seconds without touching YAML. Distributed mode gets another round of optimizations. Usage tracking grows a per-API-key + per-user Sources view so admins can finally answer "who is burning the GPU?". And, for everyone on a Jetson/DGX box, the L4T13 (cu130/aarch64) backends are back.
📌 TL;DR
| Feature | Summary |
|---|---|
| 🔐 Signed Backends | Keyless cosign + sigstore-go verification for backend OCI images, OCI 1.1 referrers, not_before revocation, opt-in strict mode. |
| ⚡ Prompt Cache by Default | llama-cpp server-side prompt cache works out of the box. Repeated system prompts go from 5-8 min to seconds. |
| 📊 Usage per API Key | New Sources tab attributes traffic to keys and users. Revoked keys stay readable in history. |
| 🛰️ Distributed v3 | Per-request replica routing, cached probeHealth, async per-node installs with streaming progress, unified backend-logs entry point. |
| 🩺 Traces UI Stays Snappy | LOCALAI_TRACING_MAX_BODY_BYTES caps API + backend trace payloads. Admin Traces page stops drowning in 40 MB embeddings. |
| 🧊 Nix Flake | Dockerless setup for NixOS users via flake.nix + dev shell. |
| 🦾 Jetson Thor Restored | vllm / sglang / vllm-omni L4T13 backends switched to PyPI aarch64+cu130 wheels (torch 2.10 ABI fix). |
🚀 New Features & Major Enhancements
🔐 Signed Backends with Keyless Cosign
LocalAI now verifies that backend OCI images came from our CI, not a compromised registry or MITM. This closes a real trust gap: the gallery YAML told LocalAI which image to pull, but nothing checked the bytes.
The producer side (.github/workflows/backend_merge.yml) signs every merged backend image (and every per-arch entry under the manifest list) with sigstore/cosign keyless via Fulcio + Rekor, using OCI 1.1 referrers (no legacy :tag.sig). The consumer side (pkg/oci/cosignverify, built on sigstore-go) verifies signatures against a per-gallery verification: policy:
```yaml
verification:
  issuer_regex: "^https://token\\.actions\\.githubusercontent\\.com$"
  identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@.*$"
  not_before: "2026-05-22T00:00:00Z"
```
- TUF trusted root cached process-wide, so N backends from one gallery do 1 fetch, not N.
- not_before is the revocation lever: keyless Fulcio certs are ephemeral, so revocation is policy-side. Advance the date in the gallery YAML and every signature predating the cutoff is invalidated.
- Digest pinning closes the TOCTOU window between verify and pull.
- Strict mode: --require-backend-integrity (or LOCALAI_REQUIRE_BACKEND_INTEGRITY=true) escalates missing policy / empty SHA256 from warn to hard-fail.
Rollout is backward-compatible: until a gallery ships a verification: block, installs proceed with a warning. The default backend/index.yaml will be populated next, and strict mode is opt-in. See .agents/backend-signing.md for the full producer + consumer story.
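To make the policy semantics concrete, here is a minimal Python sketch of the policy-matching step only. It is not the sigstore-go verifier in pkg/oci/cosignverify and performs no cryptographic checks; the `Policy` class, the `check_signature` function, and its `strict` parameter are hypothetical names used for illustration, and the behavior is inferred from the description above.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Policy:
    """Hypothetical mirror of the three keys in a gallery `verification:` block."""
    issuer_regex: str
    identity_regex: str
    not_before: datetime


def check_signature(policy, issuer, identity, signed_at, strict=False):
    """Return (ok, reason). `policy` is None when the gallery ships no block."""
    if policy is None:
        # Non-strict installs proceed (with a warning); strict mode fails closed.
        return (not strict, "no verification policy")
    if not re.search(policy.issuer_regex, issuer):
        return (False, "issuer mismatch")
    if not re.search(policy.identity_regex, identity):
        return (False, "identity mismatch")
    if signed_at < policy.not_before:
        # Advancing not_before invalidates every older signature.
        return (False, "signature predates not_before")
    return (True, "ok")


policy = Policy(
    issuer_regex=r"^https://token\.actions\.githubusercontent\.com$",
    identity_regex=r"^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@.*$",
    not_before=datetime(2026, 5, 22, tzinfo=timezone.utc),
)
```

Passing `policy=None` with `strict=True` models the fail-closed behavior of strict mode, while the default models the backward-compatible warning path.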
🔗 PRs: #9823 (consumer + producer + plumbing), #9957 (fix for current cosign releases).
⚡ Prompt Cache: On by Default
llama-cpp ships with a server-side prompt cache, but until now LocalAI was not enabling it by default. Repeated system prompts (agents, Claude-Code-style coding assistants, OpenAI-compatible CLIs with long instructions) were re-prefilled on every call. With this release, the same workload collapses to seconds with no specific configuration on your side.
Two changes, one default flip each:
- kv_unified=true by default in grpc-server.cpp. The previous false was silently force-disabling cache_idle_slots at server init (the host prompt cache was being allocated but never written across requests).
- prompt_cache_all defaults to true at the YAML layer, matching upstream llama.cpp's own common.h default. The per-request cache_prompt knob is now on out of the box.
You can still opt out with options: ["kv_unified:false"] or prompt_cache_all: false, and there are new option keys (cache_idle_slots, checkpoint_every_nt) for tuning. Docs in docs/content/advanced/model-configuration.md got a worked example for the repeated-system-prompt workload and a proper explanation of how kv_unified, cache_ram, and cache_idle_slots interact.
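As a rough intuition for why this matters, the following toy Python model counts how many tokens still need prefill when the previous prompt is remembered. It is a sketch of prefix reuse in general, not llama.cpp's cache implementation; `ToyPromptCache` and the token lists are made up for illustration.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPromptCache:
    """Remembers the last prompt and reports how many tokens still need prefill."""

    def __init__(self):
        self.cached = []

    def tokens_to_prefill(self, prompt):
        reused = common_prefix_len(self.cached, prompt)
        self.cached = list(prompt)
        return len(prompt) - reused


system = ["sys"] * 1000  # stand-in for a long, repeated system prompt
cache = ToyPromptCache()
first = cache.tokens_to_prefill(system + ["hello"])               # cold cache
second = cache.tokens_to_prefill(system + ["how", "are", "you"])  # warm cache
```

In this toy run the first call prefills all 1001 tokens, while the second only prefills the 3 tokens that follow the shared system prompt.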
🔗 PRs: #9925 (kv_unified + cache_idle_slots defaults + docs), #9951 (prompt_cache_all tristate default).
📊 Per-API-Key Usage Tracking
Closes #9862. The usage page now answers "who spent these tokens?", not just "how many tokens were spent".
- usage_records gained Source (apikey/web/legacy), APIKeyID, APIKeyName, plus an idempotent backfill of pre-feature rows on InitDB.
- Auth middleware plumbs the resolved *UserAPIKey and the request source through the Echo context. Usage middleware snapshots the key id + name, so revoked keys stay readable in history (rendered as (revoked)).
- New endpoints: GET /api/auth/usage/sources (self, no legacy) and GET /api/auth/admin/usage/sources (admin, with user_id/api_key_id filters, 200-key truncation).
- React Usage page gains a Sources tab with a source-mix ribbon, a top-7 + Other time chart, and a searchable/sortable table with drill-in chip.
- Admin view (follow-up in #9935) also rolls up (source, user_id, user_name) so Web UI session traffic is split per user instead of lumped into one global "Web UI" row, and every named-key row shows the owning account.
Docs: features/authentication.md gained a full Usage Tracking section with the new tab, endpoints, response shape and migration notes.
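The "top-7 + Other" rollup used by the Sources chart can be sketched in a few lines of Python. This is an illustration of the aggregation idea only: the record field names (`source`, `api_key_name`, `tokens`) and the `roll_up_sources` helper are assumptions, not the actual response shape, which is documented in features/authentication.md.

```python
from collections import Counter


def roll_up_sources(records, top_n=7):
    """Sum tokens per (source, key name) and fold the tail into one 'Other' row."""
    totals = Counter()
    for rec in records:
        totals[(rec["source"], rec["api_key_name"])] += rec["tokens"]
    ranked = totals.most_common()  # largest consumers first
    rows = ranked[:top_n]
    other = sum(tokens for _, tokens in ranked[top_n:])
    if other:
        rows.append((("other", "Other"), other))
    return rows


# Hypothetical usage rows; top_n=2 keeps the example short.
records = [
    {"source": "apikey", "api_key_name": "ci-bot", "tokens": 900},
    {"source": "web", "api_key_name": "", "tokens": 300},
    {"source": "apikey", "api_key_name": "ci-bot", "tokens": 100},
    {"source": "legacy", "api_key_name": "", "tokens": 50},
]
rows = roll_up_sources(records, top_n=2)
```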
🔗 PRs: #9920 (core + Sources tab), #9935 (per-user attribution in admin view).
🛰️ Distributed Mode v3
Distributed mode keeps hardening. This release fixes the two things that bit operators hardest in practice and lays the groundwork for the next round of UX.
Per-request routing across replicas (#9968) restores cross-node load balancing. The bug: ModelLoader.Load cached a *Model whose embedded InFlightTrackingClient was bound to a single (nodeID, replicaIndex). After the first request, every subsequent call reused that wrapper and pinned to whichever node won the first pick, even after the reconciler scaled the model out. The reproducer from the report:
dgx-spark1 loaded in_flight=6
nvidia-thor1 loaded in_flight=0 (← idle, never gets traffic)
Now SmartRouter.Route runs per request, the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually fires, and the replica-selection rule lives in one place (PickBestReplica) with a mirror spec asserting the SQL ORDER BY and the Go picker agree on a seeded dataset. probeHealth is now memoized per (nodeID, addr) with a 30s TTL and singleflight coalescing, so a burst of new requests doesn't stall on a HealthCheck that llama.cpp serializes against in-flight Predict.
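The ordering rule is easy to express as a sort key. The Python sketch below assumes a simplified `Replica` record with hypothetical field names; the real PickBestReplica lives in the Go codebase and is mirrored by a SQL ORDER BY, neither of which is reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    node_id: str
    in_flight: int        # requests currently being served
    last_used: float      # seconds since epoch; smaller means idle for longer
    available_vram: int   # bytes


def pick_best_replica(replicas):
    """Order by in_flight ASC, last_used ASC, available_vram DESC; take the first."""
    if not replicas:
        return None
    return min(replicas, key=lambda r: (r.in_flight, r.last_used, -r.available_vram))


# Values loosely modeled on the reproducer above.
replicas = [
    Replica("dgx-spark1", in_flight=6, last_used=100.0, available_vram=64 << 30),
    Replica("nvidia-thor1", in_flight=0, last_used=0.0, available_vram=96 << 30),
]
best = pick_best_replica(replicas)
```

Because the picker runs on every request, the idle node wins as soon as its in_flight count is lower, which is exactly what the cached wrapper prevented before the fix.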
Async per-node installs via the gallery job queue (#9928). POST /api/nodes/:id/backends/install used to block the request for up to 3 minutes while the worker pulled the image, freezing the React UI's Backends picker. It now returns HTTP 202 + jobID immediately, scoped to a one-element targetNodeIDs allowlist, with a node-scoped opcache row so concurrent installs on different nodes don't collide. The Operations panel surfaces a nodeID field for attribution.
Resilient backend installs with streaming progress (#9958). Two phases:
- Phase 1: LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT env vars (default 15m, previously hardcoded 3m). A NATS round-trip timeout while the worker is still pulling no longer reports as a hard failure: per-node status becomes running_on_worker, the queue row stays alive without bumping Attempts, and ListBackends proactively clears install rows whose intent is satisfied (so the UI updates instantly instead of waiting up to 15m for the next reconciler tick).
- Phase 2: workers publish debounced (~250ms) BackendInstallProgressEvent values on a transient nodes.<nodeID>.backend.install.<opID>.progress subject. The master subscribes for the duration of the request and forwards each event into OpStatus.UpdateStatus, so the admin UI gets per-byte progress for distributed installs the same way local-mode does, with no UI changes. Backward compatible: old workers stay silent, new masters tolerate silence.
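One plausible reading of "debounced (~250ms)" is a rate limiter that forwards at most one event per interval and flushes the newest suppressed event at the end. The Python sketch below shows that pattern under stated assumptions: `ProgressDebouncer` is a hypothetical name, the clock is passed in explicitly to keep the example deterministic, and the worker's actual implementation may differ.

```python
class ProgressDebouncer:
    """Forward at most one progress event per `interval` seconds, plus the final one."""

    def __init__(self, publish, interval=0.25):
        self.publish = publish      # callable that sends the event onward
        self.interval = interval
        self.last_sent = None
        self.pending = None

    def update(self, event, now):
        if self.last_sent is None or now - self.last_sent >= self.interval:
            self.publish(event)
            self.last_sent = now
            self.pending = None
        else:
            self.pending = event    # keep only the newest suppressed event

    def flush(self):
        """Call when the install finishes so the last state is never lost."""
        if self.pending is not None:
            self.publish(self.pending)
            self.pending = None


sent = []
deb = ProgressDebouncer(sent.append)
for i, t in enumerate([0.00, 0.05, 0.10, 0.30, 0.32]):
    deb.update({"bytes": (i + 1) * 100}, now=t)
deb.flush()
```

Five updates arrive, but only the first, the one at 0.30s, and the flushed final state are published.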
Unified backend-logs entry point (#9949). /app/backend-logs/:modelId is now a single, mode-aware route. In standalone it's the local WebSocket view, unchanged. In distributed it probes nodesApi.getModels, filters by model_name, then routes: 0 hits → empty state with a link to Nodes; 1 hit → <Navigate replace> to the per-node logs URL preserving the ?from= deep-link timestamp; N hits → a picker listing each hosting worker with node id, replica index and load state. Every view that links to backend logs now points at the same URL.
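The 0 / 1 / N decision can be summarized as a small pure function. This Python sketch only mirrors the branching described above; the real logic is in the React UI, and the dictionary keys (`model_name`, `node_id`, `replica_index`, `state`) and the `route_backend_logs` name are assumptions made for illustration.

```python
def route_backend_logs(model_id, node_models, from_ts=None):
    """Decide which view the unified backend-logs route shows in distributed mode."""
    hits = [m for m in node_models if m["model_name"] == model_id]
    if not hits:
        # No worker hosts the model: empty state with a link to Nodes.
        return {"view": "empty"}
    if len(hits) == 1:
        # Exactly one host: redirect, carrying the deep-link timestamp along.
        return {"view": "redirect", "node_id": hits[0]["node_id"], "from": from_ts}
    # Several hosts: let the user pick a worker.
    choices = [(m["node_id"], m["replica_index"], m["state"]) for m in hits]
    return {"view": "picker", "choices": choices}


# Hypothetical node listing.
node_models = [
    {"model_name": "qwen3", "node_id": "dgx-spark1", "replica_index": 0, "state": "loaded"},
    {"model_name": "qwen3", "node_id": "nvidia-thor1", "replica_index": 0, "state": "loaded"},
    {"model_name": "whisper", "node_id": "dgx-spark1", "replica_index": 0, "state": "loaded"},
]
```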
Bug-hunt harness. A new distributed test harness landed in tests/distributed/ to catch the kind of regressions the #9968 reproducer surfaced.
🔗 P...
v4.2.6
What's Changed
Other Changes
- feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults by @localai-bot in #9852
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9853
- chore: ⬆️ Update antirez/ds4 to ef0a4905d05263df8e63689f2dd1efac618a752c by @localai-bot in #9857
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3e573cfea6e0a332eff822ffbdb1dd3b112e9051 by @localai-bot in #9856
- chore: ⬆️ Update leejet/stable-diffusion.cpp to bd17f53b7386fb5f60e8587b75e73c4b2fed3426 by @localai-bot in #9854
Full Changelog: v4.2.5...v4.2.6
v4.2.5
What's Changed
Bug fixes 🐛
- fix(ollama): guard nil filter in galleryop.ListModels (#9817) by @localai-bot in #9836
- realtime: honor output_modalities to skip TTS in text-only mode by @localai-bot in #9838
- fix(ollama): accept float-encoded integer options (fixes #9837) by @localai-bot in #9849
Other Changes
- chore: ⬆️ Update ggml-org/llama.cpp to 7f3f843c31cd32dc4adc10b393342dfee071c332 by @localai-bot in #9809
- feat(llama-cpp): expose 12 missing common_params via options[] by @localai-bot in #9814
- fix(streaming): comply with OpenAI usage / stream_options spec by @localai-bot in #9815
- Close Hugging Face scan response body by @massy-o in #9818
- Validate video image URLs before download by @massy-o in #9819
- feat(swagger): update swagger by @localai-bot in #9824
- chore: ⬆️ Update antirez/ds4 to 04b6fda2be395094cbf2d20d921e7a705a4166ef by @localai-bot in #9830
- chore: ⬆️ Update ggml-org/whisper.cpp to 46ca43d6399fdeada1b49fb2126ba373bd9ebc38 by @localai-bot in #9829
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to 0fcffdb64d21e57f0778f342415754156e01adfa by @localai-bot in #9828
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9825
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 0b8296915c4094090cff6bd2e09a5e98288c3c7d by @localai-bot in #9827
- chore: ⬆️ Update ggml-org/llama.cpp to 834a243664114487f99520370a7a7b00fc7a486f by @localai-bot in #9826
- Validate archive member paths before extraction by @massy-o in #9820
- fix(deps): bump gomarkdown/markdown for GHSA-77fj-vx54-gvh7 by @richiejp in #9841
- chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.21.0 by @localai-bot in #9846
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to 5cc0d86c760e9858e4bed4418400bb39dbe025f2 by @localai-bot in #9845
- chore: ⬆️ Update antirez/ds4 to 950e8e6474a1c9fabe04e669d607606a7ef8824f by @localai-bot in #9844
- chore: ⬆️ Update ggml-org/whisper.cpp to 968eebe77225d25e57a3f981da7c696310f0e881 by @localai-bot in #9843
- chore: ⬆️ Update ggml-org/llama.cpp to 1348f67c58f561808136e8a152a9eddec168f221 by @localai-bot in #9842
New Contributors
Full Changelog: v4.2.4...v4.2.5
v4.2.4
What's Changed
Bug fixes 🐛
- fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status by @localai-bot in #9754
- fix(http): honor X-Forwarded-Prefix when proxy strips the prefix by @Dennisadira in #9614
- fix(agentpool): close truncate-then-read race in agent_jobs.json persistence by @localai-bot in #9811
- fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions by @Anai-Guo in #9559
Exciting New Features 🎉
- feat: also parse VRAM budget/usage from vulkaninfo by @eglia in #9800
- feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page by @richiejp in #9801
Other Changes
- chore: ⬆️ Update ggml-org/llama.cpp to a9883db8ee021cf16783016a60996d41820b5195 by @localai-bot in #9796
- chore: ⬆️ Update TheTom/llama-cpp-turboquant to 5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403 by @localai-bot in #9740
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9805
- chore: ⬆️ Update antirez/ds4 to 0cba357ca1bc0e7510421cc26888e420ea942123 by @localai-bot in #9806
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to 949bb8f1d660fc1264c137a6f3dbd619375f6134 by @localai-bot in #9807
- chore: ⬆️ Update ggml-org/whisper.cpp to 3e9b7d0fef3528ee2208da3cdb873a2c53d2ae2f by @localai-bot in #9808
- ci(image): publish missing :latest-* and :v-* singleton image tags by @localai-bot in #9812
Full Changelog: v4.2.3...v4.2.4
v4.2.3
What's Changed
Other Changes
- chore: ⬆️ Update ggml-org/whisper.cpp to 338cce1e58133261753243802a0e7a430118866d by @localai-bot in #9793
- chore: ⬆️ Update antirez/ds4 to f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f by @localai-bot in #9794
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9792
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to f9a93c37e2fc021760c3c1aa99cf74c73b7591a7 by @localai-bot in #9795
Full Changelog: v4.2.2...v4.2.3
v4.2.2
What's Changed
Bug fixes 🐛
- fix: parse vulkan VRAM from text by @eglia in #9669
- fix(ollama): accept prompt alias on /api/embed for Ollama parity by @localai-bot in #9780
👒 Dependencies
- chore(deps): bump node from 25-slim to 26-slim by @dependabot[bot] in #9769
- chore(deps): bump actions/upload-artifact from 4 to 7 by @dependabot[bot] in #9770
- chore(deps): bump actions/download-artifact from 4 to 8 by @dependabot[bot] in #9771
- chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.27.0 to 1.42.0 by @dependabot[bot] in #9772
- chore(deps): bump github.com/onsi/gomega from 1.39.1 to 1.40.0 by @dependabot[bot] in #9774
- chore(deps): update transformers requirement from >=5.0.0 to >=5.8.0 in /backend/python/transformers by @dependabot[bot] in #9775
- chore(deps): bump github.com/fsnotify/fsnotify from 1.9.0 to 1.10.1 by @dependabot[bot] in #9778
- chore(deps): update charset-normalizer requirement from >=3.4.0 to >=3.4.7 in /backend/python/vllm by @dependabot[bot] in #9779
- chore(deps): bump github.com/mudler/edgevpn from 0.31.1 to 0.32.2 by @dependabot[bot] in #9773
- chore(deps): bump the npm_and_yarn group across 1 directory with 3 updates by @dependabot[bot] in #9728
Other Changes
- ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 by @localai-bot in #9781
- feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options by @localai-bot in #9765
Full Changelog: v4.2.1...v4.2.2
v4.2.1
What's Changed
Exciting New Features 🎉
- feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache by @localai-bot in #9758
👒 Dependencies
- chore(deps): bump the go_modules group across 1 directory with 2 updates by @dependabot[bot] in #9759
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9762
- ci(bump-deps): register ds4 + move version pin into the Makefile by @localai-bot in #9761
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to eb570eb96689c235933b813693ca28ab9d3d26de by @localai-bot in #9764
- feat(ollama): report model capabilities + details on /api/tags and /api/show by @localai-bot in #9766
Full Changelog: v4.2.0...v4.2.1
v4.2.0
🎉 LocalAI 4.2.0 Release! 🚀
LocalAI 4.2.0 is out!
This release teaches LocalAI to see and hear. New /v1/voice/* and /v1/audio/diarization endpoints, a full face-recognition pipeline with antispoofing, word-level timestamps for faster-whisper, and a client-cancellable Whisper. There is also a drop-in Ollama API, video generation in stable-diffusion.ggml, a redesigned chat with i18n and admin-configurable branding, eleven new backends, an interactive model config editor with autocomplete, and a hardened distributed mode v2. vLLM finally hits feature parity with llama.cpp and gets tensor-parallel distributed workers.
📌 TL;DR
| Feature | Summary |
|---|---|
| 🎙️ Voice Recognition | New /v1/voice/*. Verify, identify, embed and analyze speakers. |
| 👤 Face Recognition + Liveness | 1:1 verify, 1:N identify, detect, analyze, embed, and reject spoofed photos. |
| 🎬 Diarization | New /v1/audio/diarization endpoint, "who spoke when?" via sherpa-onnx + vibevoice.cpp. |
| 🗣️ Better Transcriptions | Word-level timestamps, client-cancellable Whisper, segments + duration + language on the stream-done event. |
| 🦙 Ollama API | Drop-in compatibility. Point your ollama client straight at LocalAI. |
| 🎬 Video Generation | stable-diffusion.ggml now generates video (i2v, first-last-frame). |
| 💬 Redesigned UI | Chat redesign, Nord palette, i18n (5 languages), admin-configurable branding. |
| ✏️ Interactive Model Editor | Autocomplete-driven config editor in the UI. |
| 📦 Universal Importer | Imports across most backends, not just llama.cpp. |
| 🚦 Concurrency Groups | Per-model exclusive groups for safe backend loading. |
| 🧪 11 New Backends | sglang, ik-llama-cpp, TurboQuant, sam.cpp, Kokoros, qwen3tts.cpp, tinygrad-multimodal, LocalVQE, vibevoice-cpp, insightface (liveness), voice-rec. |
| ⚡ vLLM @ parity | Feature parity with llama.cpp + tensor-parallel distributed workers + full engine_args. |
| 🛰️ Distributed v2 | Hardened orchestrator, round-robin replicas, scoped Upgrade All, NATS install/upgrade split. |
🚀 New Features & Major Enhancements
🎙️ Voice Recognition
LocalAI is now ears-on. New /v1/voice/* endpoints let you verify, identify, analyze and embed speakers, powered by a SpeechBrain + ONNX Python backend.
- 1:1 Verify, "is this the same speaker?"
- 1:N Identify, "who is talking, out of my enrolled users?"
- Embeddings, voice fingerprints for your own pipelines
- Analyze, age, gender, emotion attributes per segment
🔥 Pairs naturally with the new diarization endpoint for full speaker pipelines.
👤 Face Recognition & Antispoofing
A complete face-biometrics pipeline, built on InsightFace + ONNX.
- 1:1 Verify, match two faces
- 1:N Identify, resolve a face against an enrolled set
- Detection & Analysis, find faces, extract attributes (age, gender, emotion, race)
- Embeddings, facial fingerprints for your own stack
- 🆕 Antispoofing (liveness), reject spoofed photos and videos
✅ Samples never leave your machine. They go only to the running backend.
🎬 Diarization & a smarter audio pipeline
Audio is a first-class citizen now.
- /v1/audio/diarization, segments speech by speaker turn (sherpa-onnx + vibevoice.cpp)
- Word-level timestamps for faster-whisper
- Client cancellation for Whisper via the ggml abort_callback. Stop a transcription mid-flight and free the GPU.
- Stream-done metadata on /v1/audio/transcriptions. segments, duration and language on the final event.
- Audio transformations UI (LocalVQE), explore audio FX directly from the React UI
- Transcription error visibility, handler errors land in the access log and on the client
🦙 Ollama drop-in API
Point your existing Ollama client at LocalAI. Everything keeps working. Another front door, same engine.
```bash
OLLAMA_HOST=http://localhost:8080 ollama run qwen3
```
🎬 Video Generation
The stable-diffusion.ggml backend now generates video, with curated gallery entries for Wan 2.1 FLF2V 14B 720P and Wan i2v 720p, plus a new stablediffusion-ggml-development meta backend to track the cutting edge.
🎨 React UI: total refresh
A massive UI cycle landed in 4.2:
- 💬 Chat redesign, cleaner layout, faster perceived latency, better message density
- 🎨 Editorial refresh with the Nord palette, calmer, more focused, dark-mode-first
- 🌍 Multilingual / i18n, English, Italiano, Español, Deutsch, 简体中文
- 🪪 Brandable instance, admin-configurable name, tagline, and assets (logo, favicon)
- ✏️ Interactive model config editor, autocomplete over known fields, live validation, automatic file-renaming on save
- 🧰 Backend management UX, revamped backend list with concrete versions
- 🛟 Better error UX, distributed backend management errors surface cleanly
💡 Self-host with your branding. The login page, sidebar, footer, and browser tab all pick up the instance name and logo.
🔄 Backend & model lifecycle
- Backend versioning with automatic upgrade detection
- Pin models so they survive the reaper
- On-demand toggle per model to control auto-load
- Concurrency groups, per-model exclusive groups so heavy backends won't trample each other
- Universal importer, single flow that imports across most backends, with clean multi-shard GGUF handling and dedicated importers for vibevoice-cpp and whisper.cpp HF repos
🧪 New Backends!
| Backend | What it brings |
|---|---|
| sglang | High-throughput LLM serving + speculative decoding (EAGLE/EAGLE3/DFLASH/MTP) |
| ik-llama.cpp | ikawrakow's llama.cpp fork |
| TurboQuant | Quant-focused llama.cpp fork |
| sam.cpp | Segment Anything detection |
| Kokoros | Rust-native Kokoro TTS |
| qwen3tts.cpp | Qwen3 TTS |
| tinygrad-multimodal (experimental) | tinygrad-powered multimodal |
| vibevoice.cpp | Diarization-grade speech |
| LocalVQE | Audio transformations / FX |
| insightface | Face antispoofing |
| voice-rec | Speaker recognition / embeddings |
⚡ vLLM at parity (and beyond)
- vLLM parity with llama.cpp, same feature surface, same ergonomics
- vLLM engine_args, the full AsyncEngineArgs exposed via a generic YAML map
- Tensor-parallel distributed workers, fan a single model across nodes
- CUDA 13 builds for vLLM, vLLM-omni and sglang
- L4T arm64 (CUDA 13), vLLM/vLLM-omni/sglang variants for Jetson-class arm64
- MLX backend refactored, shared helpers and enhanced functionality
- llama.cpp split_mode for explicit multi-GPU placement
- Speculative decoding wired through for llama.cpp, Gemma 4 thinking support added
- Vision / mtmd marker propagated from the backend via ModelMetadata
🛰️ Distributed Mode v2
Distributed mode keeps maturing. This release was a hardening pass across the orchestration loop:
- Orchestrator resilience, auto-upgrade routing, worker bind-wait, RAG-init crash, log-spam fixes
- Round-robin across replicas of the same model
- Upgrade All scoped to nodes that actually have the backend installed
- NATS install / upgrade split, backend.upgrade no longer piggybacks on install
- Cached-replica lookup honors NodeSelector, the reconciler no longer scales up empty backends
- VRAM/RAM reporting correct on NVIDIA unified-memory hosts
- Agent nodes, queue loops stop on teardown, dead-letter cap added
- Autoscaling, load-model extracted from Route() and applied during autoscale
🔐 Auth & Security
- Settings API, env-supplied ApiKeys are stripped before persisting (no accidental leaks)
- grpc-server hardening, removed unsafe sprintf() in the C++ grpc server
- OIDC, bumped go-oidc/v3 to 3.18.0
- Security hardening pass across the codebase
- AI coding assistants policy, LocalAI now follows the Linux kernel's DCO/attribution guidelines (Assisted-by: trailer, no AI co-authors)
🖥️ Hardware & deployment
- CUDA 13 for vLLM, vLLM-omni, and sglang
- NVIDIA L4T arm64 (CUDA 13) for Jetson-class boards
- ROCm 7.x bumped to latest
- gfx1151 (Strix Halo / Ryzen AI MAX) support, AMDGPU_TARGETS exposed as a build-arg
- Intel GPU, latest oneapi-basekit (b70 support) across Intel images
- arm64 CI, cpu-whisperx and cpu-faster-whisper now ship arm64 images
- whisperx, ROCm/HIPBLAS target dropped (pinned to rocm6.4 wheels)
🛠️ Under the Hood
- Better CLI errors with actionable guidance
- golangci-lint baseline (new-from-merge-base) keeps drift in check
- Coding-agent discoverability, new APIs let coding agents introspect and configure LocalAI
- Autoparser, prefers backend-emitted chat deltas, correct logprob passthrough, strips partial reasoning tags during warm-up
- Reasoning + tools, no more empty content from thinking models in retry loops
- Streaming hygiene, deduped content, dedup...
v4.1.3
What's Changed
Bug fixes 🐛
- fix(token): login via legacy api keys by @mudler in #9249
- fix(anthropic): do not emit empty tokens and fix SSE tool calls by @mudler in #9258
- fix(gpu): better detection for MacOS and Thor by @mudler in #9263
👒 Dependencies
- chore(deps): bump google.golang.org/grpc from 1.79.3 to 1.80.0 by @dependabot[bot] in #9253
- chore(deps): bump github.com/jaypipes/ghw from 0.23.0 to 0.24.0 by @dependabot[bot] in #9250
- chore(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.12 to 1.32.14 by @dependabot[bot] in #9256
- chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.64.0 to 0.65.0 by @dependabot[bot] in #9254
Other Changes
- chore: ⬆️ Update ggml-org/llama.cpp to d0a6dfeb28a09831d904fc4d910ddb740da82834 by @localai-bot in #9259
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9260
- chore: ⬆️ Update ace-step/acestep.cpp to e0c8d75a672fca5684c88c68dbf6d12f58754258 by @localai-bot in #9261
- chore: ⬆️ Update leejet/stable-diffusion.cpp to 8afbeb6ba9702c15d41a38296f2ab1fe5c829fa0 by @localai-bot in #9262
Full Changelog: v4.1.2...v4.1.3
