Releases: mudler/LocalAI

v4.3.1

25 May 07:28
1dcd1ae

What's Changed

Other Changes

  • Fix kokoros backend build break from Backend trait drift by @Copilot in #9972
  • chore: ⬆️ Update antirez/ds4 to f91c12b50a1448527c435c028bfc70d1b00f6c33 by @localai-bot in #9975
  • chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9f7ba245ab41e118f03aa8dd5134d18a81159d02 by @localai-bot in #9973
  • chore: ⬆️ Update ggml-org/llama.cpp to 549b9d84330c327e6791fa812a7d60c0cf63572e by @localai-bot in #9974

Full Changelog: v4.3.0...v4.3.1

v4.3.0

24 May 20:25

🎉 LocalAI 4.3.0 Release! 🚀

LocalAI 4.3.0 is out!

This release hardens the trust boundary and improves defaults for speed. Backend OCI images now ship with keyless cosign signatures and a per-gallery verification: policy, with an opt-in strict mode that fails closed.
The llama-cpp server-side prompt cache now works by default: repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) collapse from minutes to seconds without touching YAML. Distributed mode gets another round of optimizations. Usage tracking grows a per-API-key and per-user Sources view so admins can finally answer "who is burning the GPU?". And for everyone on a Jetson/DGX box, the L4T13 (cu130/aarch64) backends are back.


📌 TL;DR

| Feature | Summary |
|---|---|
| 🔐 Signed Backends | Keyless cosign + sigstore-go verification for backend OCI images, OCI 1.1 referrers, not_before revocation, opt-in strict mode. |
| ⚡ Prompt Cache by Default | llama-cpp server-side prompt cache works out of the box. Repeated system prompts go from 5-8 min to seconds. |
| 📊 Usage per API Key | New Sources tab attributes traffic to keys and users. Revoked keys stay readable in history. |
| 🛰️ Distributed v3 | Per-request replica routing, cached probeHealth, async per-node installs with streaming progress, unified backend-logs entry point. |
| 🩺 Traces UI Stays Snappy | LOCALAI_TRACING_MAX_BODY_BYTES caps API + backend trace payloads. Admin Traces page stops drowning in 40 MB embeddings. |
| 🧊 Nix Flake | Dockerless setup for NixOS users via flake.nix + dev shell. |
| 🦾 Jetson Thor Restored | vllm / sglang / vllm-omni L4T13 backends switched to PyPI aarch64+cu130 wheels (torch 2.10 ABI fix). |

🚀 New Features & Major Enhancements

🔐 Signed Backends with Keyless Cosign

LocalAI now verifies that backend OCI images came from our CI, not a compromised registry or MITM. This closes a real trust gap: the gallery YAML told LocalAI which image to pull, but nothing checked the bytes.

The producer side (.github/workflows/backend_merge.yml) signs every merged backend image (and every per-arch entry under the manifest list) with sigstore/cosign keyless via Fulcio + Rekor, using OCI 1.1 referrers (no legacy :tag.sig). The consumer side (pkg/oci/cosignverify, built on sigstore-go) verifies signatures against a per-gallery verification: policy:

verification:
  issuer_regex: "^https://token\\.actions\\.githubusercontent\\.com$"
  identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@.*$"
  not_before: "2026-05-22T00:00:00Z"

  • TUF trusted root cached process-wide, so N backends from one gallery do 1 fetch, not N.
  • not_before is the revocation lever: keyless Fulcio certs are ephemeral, so revocation is policy-side. Advance the date in the gallery YAML and every signature predating the cutoff is invalidated.
  • Digest pinning closes the TOCTOU window between verify and pull.
  • Strict mode: --require-backend-integrity (or LOCALAI_REQUIRE_BACKEND_INTEGRITY=true) escalates missing policy / empty SHA256 from warn to hard-fail.

Rollout is backward-compatible: until a gallery ships a verification: block, installs proceed with a warning. The default backend/index.yaml will be populated next, and strict mode is opt-in. See .agents/backend-signing.md for the full producer + consumer story.
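
The policy semantics described above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/oci/cosignverify: the `Policy`, `SignerInfo`, `policy_accepts`, and `install_decision` names are hypothetical, the cryptographic verification itself is assumed to have already happened, and comparing the signing time against not_before is an assumption about how the cutoff is applied.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Policy:
    """Hypothetical mirror of a gallery `verification:` block."""
    issuer_regex: str
    identity_regex: str
    not_before: datetime


@dataclass
class SignerInfo:
    """Hypothetical fields taken from a signature that already verified cryptographically."""
    issuer: str
    identity: str
    signed_at: datetime


def policy_accepts(policy: Policy, signer: SignerInfo) -> bool:
    """Return True only if issuer, identity, and signing time all satisfy the policy."""
    if not re.search(policy.issuer_regex, signer.issuer):
        return False
    if not re.search(policy.identity_regex, signer.identity):
        return False
    # Assumption: signatures that predate the cutoff count as revoked.
    return signer.signed_at >= policy.not_before


def install_decision(policy: Optional[Policy], signer: Optional[SignerInfo], strict: bool) -> str:
    """Return 'ok', 'warn', or 'fail' for an install attempt."""
    if policy is None:
        # No verification block yet: warn by default, hard-fail in strict mode.
        return "fail" if strict else "warn"
    if signer is None or not policy_accepts(policy, signer):
        return "fail"
    return "ok"
```

Advancing `not_before` in the policy is enough to make `policy_accepts` reject every older signature, which is the revocation lever the bullets describe.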

🔗 PRs: #9823 (consumer + producer + plumbing), #9957 (fix for current cosign releases).


⚡ Prompt Cache: On by Default

llama-cpp ships with a server-side prompt cache, but until now LocalAI was not enabling it by default. Repeated system prompts (agents, Claude-Code-style coding assistants, OpenAI-compatible CLIs with long instructions) were re-prefilled on every call. With this release, the same workload collapses to seconds without any configuration on your side.

Two changes, one default flip each:

  1. kv_unified=true by default in grpc-server.cpp. The previous default of false was silently force-disabling cache_idle_slots at server init (the host prompt cache was being allocated but never written across requests).
  2. prompt_cache_all defaults to true at the YAML layer, matching upstream llama.cpp's own common.h default. The per-request cache_prompt knob is now on out of the box.

You can still opt out with options: ["kv_unified:false"] or prompt_cache_all: false, and there are new option keys (cache_idle_slots, checkpoint_every_nt) for tuning. Docs in docs/content/advanced/model-configuration.md got a worked example for the repeated-system-prompt workload and a proper explanation of how kv_unified, cache_ram, and cache_idle_slots interact.
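
To make the "tristate default" idea concrete, here is a small sketch of how an unset value can fall back to the new default while an explicit setting still wins. The function names and the `key:value` parsing are hypothetical illustrations, not LocalAI's actual config loader.

```python
from typing import Dict, List, Optional


def resolve_prompt_cache_all(configured: Optional[bool]) -> bool:
    """Tristate: None means 'not set in YAML' and falls back to the new default (True)."""
    return True if configured is None else configured


def parse_options(options: List[str]) -> Dict[str, str]:
    """Split 'key:value' entries such as 'kv_unified:false' into a dict (hypothetical parser)."""
    parsed: Dict[str, str] = {}
    for entry in options:
        key, _, value = entry.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


def resolve_kv_unified(options: List[str]) -> bool:
    """kv_unified stays on unless an option explicitly turns it off."""
    value = parse_options(options).get("kv_unified", "true")
    return value.lower() != "false"
```

The point of the tristate is that "not set" and "set to false" are distinguishable, so flipping the default does not override users who opted out explicitly.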

🔗 PRs: #9925 (kv_unified + cache_idle_slots defaults + docs), #9951 (prompt_cache_all tristate default).


📊 Per-API-Key Usage Tracking

Closes #9862. The usage page now answers "who spent these tokens?", not just "how many tokens were spent".

  • usage_records gained Source (apikey / web / legacy), APIKeyID, APIKeyName, plus an idempotent backfill of pre-feature rows on InitDB.
  • Auth middleware plumbs the resolved *UserAPIKey and the request source through the Echo context. Usage middleware snapshots the key id + name, so revoked keys stay readable in history (rendered as (revoked)).
  • New endpoints: GET /api/auth/usage/sources (self, no legacy) and GET /api/auth/admin/usage/sources (admin, with user_id / api_key_id filters, 200-key truncation).
  • React Usage page gains a Sources tab with a source-mix ribbon, a top-7 + Other time chart, and a searchable/sortable table with drill-in chip.
  • Admin view (follow-up in #9935) also rolls up (source, user_id, user_name) so Web UI session traffic is split per user instead of lumped into one global "Web UI" row, and every named-key row shows the owning account.
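
As an illustration of the "top-7 + Other" grouping used by the time chart, the roll-up can be sketched like this. The function name and the input shape (a mapping from source label to token count) are assumptions for the example, not the actual response shape of the usage endpoints.

```python
from collections import Counter
from typing import Dict


def top_n_plus_other(tokens_by_source: Dict[str, int], n: int = 7) -> Dict[str, int]:
    """Keep the n largest sources and fold everything else into one 'Other' bucket."""
    ranked = Counter(tokens_by_source).most_common()  # sorted by count, descending
    result = dict(ranked[:n])
    rest = sum(count for _, count in ranked[n:])
    if rest:
        result["Other"] = rest
    return result
```

A real implementation would also need to handle a source that is literally named "Other"; the sketch ignores that case.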

Docs: features/authentication.md gained a full Usage Tracking section with the new tab, endpoints, response shape and migration notes.

🔗 PRs: #9920 (core + Sources tab), #9935 (per-user attribution in admin view).


🛰️ Distributed Mode v3

Distributed mode keeps hardening. This release fixes the two things that bit operators hardest in practice and lays the groundwork for the next round of UX.

Per-request routing across replicas (#9968) restores cross-node load balancing. The bug: ModelLoader.Load cached a *Model whose embedded InFlightTrackingClient was bound to a single (nodeID, replicaIndex). After the first request, every subsequent call reused that wrapper and pinned to whichever node won the first pick, even after the reconciler scaled the model out. The reproducer from the report:

dgx-spark1     loaded   in_flight=6
nvidia-thor1   loaded   in_flight=0       (← idle, never gets traffic)

Now SmartRouter.Route runs per request, the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually fires, and the replica-selection rule lives in one place (PickBestReplica) with a mirror spec asserting the SQL ORDER BY and the Go picker agree on a seeded dataset. probeHealth is now memoized per (nodeID, addr) with a 30s TTL and singleflight coalescing, so a burst of new requests doesn't stall on a HealthCheck that llama.cpp serializes against in-flight Predict.
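
The ordering rule above (in_flight ASC, last_used ASC, available_vram DESC) can be sketched as a single comparison key. This is an illustration in Python, not the Go PickBestReplica implementation; the `Replica` fields and their units are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    node_id: str
    in_flight: int
    last_used: float      # Unix seconds; smaller means idle for longer
    available_vram: int   # bytes


def pick_best_replica(replicas: List[Replica]) -> Optional[Replica]:
    """Fewest in-flight requests first, then least recently used, then most free VRAM."""
    if not replicas:
        return None
    return min(replicas, key=lambda r: (r.in_flight, r.last_used, -r.available_vram))
```

Run per request against fresh counters, this picks the idle node from the reproducer instead of staying pinned to the first winner.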

Async per-node installs via the gallery job queue (#9928). POST /api/nodes/:id/backends/install used to block the request for up to 3 minutes while the worker pulled the image, freezing the React UI's Backends picker. It now returns HTTP 202 + jobID immediately, scoped to a one-element targetNodeIDs allowlist, with a node-scoped opcache row so concurrent installs on different nodes don't collide. The Operations panel surfaces a nodeID field for attribution.

Resilient backend installs with streaming progress (#9958). Two phases:

  • Phase 1: LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT env vars (default 15m, previously hardcoded 3m). A NATS round-trip timeout while the worker is still pulling no longer reports as a hard failure: per-node status becomes running_on_worker, the queue row stays alive without bumping Attempts, and ListBackends proactively clears install rows whose intent is satisfied (so the UI updates instantly instead of waiting up to 15m for the next reconciler tick).
  • Phase 2: workers publish debounced (~250ms) BackendInstallProgressEvent values on a transient nodes.<nodeID>.backend.install.<opID>.progress subject. The master subscribes for the duration of the request and forwards each event into OpStatus.UpdateStatus, so the admin UI gets per-byte progress for distributed installs the same way local-mode does, with no UI changes. Backward compatible: old workers stay silent, new masters tolerate silence.
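
One common way to keep progress events from flooding a message bus is to rate-limit them and flush the final value, sketched below. Strictly speaking this is a throttle with a trailing flush rather than a pure debounce, and the class name, the injected clock, and the callback signature are all hypothetical; the worker's actual debouncing logic may differ.

```python
from typing import Callable, Optional


class DebouncedPublisher:
    """Forward at most one progress value per interval, always keeping the latest."""

    def __init__(self, publish: Callable[[int], None], interval_s: float = 0.25):
        self._publish = publish
        self._interval_s = interval_s
        self._last_sent_at: Optional[float] = None
        self._pending: Optional[int] = None

    def update(self, bytes_done: int, now: float) -> None:
        """Publish immediately if the interval has elapsed, otherwise remember the value."""
        if self._last_sent_at is None or now - self._last_sent_at >= self._interval_s:
            self._publish(bytes_done)
            self._last_sent_at = now
            self._pending = None
        else:
            self._pending = bytes_done

    def flush(self) -> None:
        """Send the last suppressed value, e.g. when the install finishes."""
        if self._pending is not None:
            self._publish(self._pending)
            self._pending = None
```

Passing the time in as an argument keeps the sketch deterministic; a worker would normally read a monotonic clock instead.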

Unified backend-logs entry point (#9949). /app/backend-logs/:modelId is now a single, mode-aware route. In standalone it's the local WebSocket view, unchanged. In distributed it probes nodesApi.getModels, filters by model_name, then routes: 0 hits → empty state with a link to Nodes; 1 hit → <Navigate replace> to the per-node logs URL preserving the ?from= deep-link timestamp; N hits → a picker listing each hosting worker with node id, replica index and load state. Every view that links to backend logs now points at the same URL.
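
The 0 / 1 / N routing decision for distributed mode can be sketched as a small pure function. The real logic lives in the React UI; the function name, the returned labels, and the dictionary shape of each node-model entry (beyond the model_name field mentioned above) are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def route_backend_logs(model_id: str, node_models: List[Dict[str, str]]) -> Tuple[str, List[Dict[str, str]]]:
    """Decide which view to show based on how many workers host the model."""
    hits = [m for m in node_models if m.get("model_name") == model_id]
    if not hits:
        return ("empty", [])        # empty state with a link to Nodes
    if len(hits) == 1:
        return ("redirect", hits)   # go straight to the per-node logs view
    return ("picker", hits)         # let the user choose a hosting worker
```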

Bug-hunt harness. A new distributed test harness landed in tests/distributed/ to catch the kind of regressions the #9968 reproducer surfaced.

🔗 P...

v4.2.6

16 May 21:12
6a48157

What's Changed

Other Changes

  • feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults by @localai-bot in #9852
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9853
  • chore: ⬆️ Update antirez/ds4 to ef0a4905d05263df8e63689f2dd1efac618a752c by @localai-bot in #9857
  • chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3e573cfea6e0a332eff822ffbdb1dd3b112e9051 by @localai-bot in #9856
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to bd17f53b7386fb5f60e8587b75e73c4b2fed3426 by @localai-bot in #9854

Full Changelog: v4.2.5...v4.2.6

v4.2.5

16 May 16:44
661a0c3

What's Changed

Other Changes

  • chore: ⬆️ Update ggml-org/llama.cpp to 7f3f843c31cd32dc4adc10b393342dfee071c332 by @localai-bot in #9809
  • feat(llama-cpp): expose 12 missing common_params via options[] by @localai-bot in #9814
  • fix(streaming): comply with OpenAI usage / stream_options spec by @localai-bot in #9815
  • Close Hugging Face scan response body by @massy-o in #9818
  • Validate video image URLs before download by @massy-o in #9819
  • feat(swagger): update swagger by @localai-bot in #9824
  • chore: ⬆️ Update antirez/ds4 to 04b6fda2be395094cbf2d20d921e7a705a4166ef by @localai-bot in #9830
  • chore: ⬆️ Update ggml-org/whisper.cpp to 46ca43d6399fdeada1b49fb2126ba373bd9ebc38 by @localai-bot in #9829
  • chore: ⬆️ Update ikawrakow/ik_llama.cpp to 0fcffdb64d21e57f0778f342415754156e01adfa by @localai-bot in #9828
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9825
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 0b8296915c4094090cff6bd2e09a5e98288c3c7d by @localai-bot in #9827
  • chore: ⬆️ Update ggml-org/llama.cpp to 834a243664114487f99520370a7a7b00fc7a486f by @localai-bot in #9826
  • Validate archive member paths before extraction by @massy-o in #9820
  • fix(deps): bump gomarkdown/markdown for GHSA-77fj-vx54-gvh7 by @richiejp in #9841
  • chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.21.0 by @localai-bot in #9846
  • chore: ⬆️ Update ikawrakow/ik_llama.cpp to 5cc0d86c760e9858e4bed4418400bb39dbe025f2 by @localai-bot in #9845
  • chore: ⬆️ Update antirez/ds4 to 950e8e6474a1c9fabe04e669d607606a7ef8824f by @localai-bot in #9844
  • chore: ⬆️ Update ggml-org/whisper.cpp to 968eebe77225d25e57a3f981da7c696310f0e881 by @localai-bot in #9843
  • chore: ⬆️ Update ggml-org/llama.cpp to 1348f67c58f561808136e8a152a9eddec168f221 by @localai-bot in #9842

Full Changelog: v4.2.4...v4.2.5

v4.2.4

13 May 22:32
42a8db3

What's Changed

Bug fixes 🐛

  • fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status by @localai-bot in #9754
  • fix(http): honor X-Forwarded-Prefix when proxy strips the prefix by @Dennisadira in #9614
  • fix(agentpool): close truncate-then-read race in agent_jobs.json persistence by @localai-bot in #9811
  • fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions by @Anai-Guo in #9559

Exciting New Features 🎉

  • feat: also parse VRAM budget/usage from vulkaninfo by @eglia in #9800
  • feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page by @richiejp in #9801

Other Changes

  • chore: ⬆️ Update ggml-org/llama.cpp to a9883db8ee021cf16783016a60996d41820b5195 by @localai-bot in #9796
  • chore: ⬆️ Update TheTom/llama-cpp-turboquant to 5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403 by @localai-bot in #9740
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9805
  • chore: ⬆️ Update antirez/ds4 to 0cba357ca1bc0e7510421cc26888e420ea942123 by @localai-bot in #9806
  • chore: ⬆️ Update ikawrakow/ik_llama.cpp to 949bb8f1d660fc1264c137a6f3dbd619375f6134 by @localai-bot in #9807
  • chore: ⬆️ Update ggml-org/whisper.cpp to 3e9b7d0fef3528ee2208da3cdb873a2c53d2ae2f by @localai-bot in #9808
  • ci(image): publish missing :latest-* and :v-* singleton image tags by @localai-bot in #9812

Full Changelog: v4.2.3...v4.2.4

v4.2.3

12 May 22:41
957619a

What's Changed

Other Changes

  • chore: ⬆️ Update ggml-org/whisper.cpp to 338cce1e58133261753243802a0e7a430118866d by @localai-bot in #9793
  • chore: ⬆️ Update antirez/ds4 to f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f by @localai-bot in #9794
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9792
  • chore: ⬆️ Update ikawrakow/ik_llama.cpp to f9a93c37e2fc021760c3c1aa99cf74c73b7591a7 by @localai-bot in #9795

Full Changelog: v4.2.2...v4.2.3

v4.2.2

12 May 15:39
bc4cd3d

What's Changed

Bug fixes 🐛

  • fix: parse vulkan VRAM from text by @eglia in #9669
  • fix(ollama): accept prompt alias on /api/embed for Ollama parity by @localai-bot in #9780

👒 Dependencies

  • chore(deps): bump node from 25-slim to 26-slim by @dependabot[bot] in #9769
  • chore(deps): bump actions/upload-artifact from 4 to 7 by @dependabot[bot] in #9770
  • chore(deps): bump actions/download-artifact from 4 to 8 by @dependabot[bot] in #9771
  • chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.27.0 to 1.42.0 by @dependabot[bot] in #9772
  • chore(deps): bump github.com/onsi/gomega from 1.39.1 to 1.40.0 by @dependabot[bot] in #9774
  • chore(deps): update transformers requirement from >=5.0.0 to >=5.8.0 in /backend/python/transformers by @dependabot[bot] in #9775
  • chore(deps): bump github.com/fsnotify/fsnotify from 1.9.0 to 1.10.1 by @dependabot[bot] in #9778
  • chore(deps): update charset-normalizer requirement from >=3.4.0 to >=3.4.7 in /backend/python/vllm by @dependabot[bot] in #9779
  • chore(deps): bump github.com/mudler/edgevpn from 0.31.1 to 0.32.2 by @dependabot[bot] in #9773
  • chore(deps): bump the npm_and_yarn group across 1 directory with 3 updates by @dependabot[bot] in #9728

Other Changes

  • ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 by @localai-bot in #9781
  • feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options by @localai-bot in #9765

Full Changelog: v4.2.1...v4.2.2

v4.2.1

11 May 22:47
bc3fb16

What's Changed

Exciting New Features 🎉

  • feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache by @localai-bot in #9758

👒 Dependencies

  • chore(deps): bump the go_modules group across 1 directory with 2 updates by @dependabot[bot] in #9759

Other Changes

  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9762
  • ci(bump-deps): register ds4 + move version pin into the Makefile by @localai-bot in #9761
  • chore: ⬆️ Update ikawrakow/ik_llama.cpp to eb570eb96689c235933b813693ca28ab9d3d26de by @localai-bot in #9764
  • feat(ollama): report model capabilities + details on /api/tags and /api/show by @localai-bot in #9766

Full Changelog: v4.2.0...v4.2.1

v4.2.0

11 May 12:48
b9e81db

🎉 LocalAI 4.2.0 Release! 🚀

LocalAI 4.2.0 is out!

This release teaches LocalAI to see and hear. New /v1/voice/* and /v1/audio/diarization endpoints, a full face-recognition pipeline with antispoofing, word-level timestamps for faster-whisper, and a client-cancellable Whisper. There is also a drop-in Ollama API, video generation in stable-diffusion.ggml, a redesigned chat with i18n and admin-configurable branding, eleven new backends, an interactive model config editor with autocomplete, and a hardened distributed mode v2. vLLM finally hits feature parity with llama.cpp and gets tensor-parallel distributed workers.


📌 TL;DR

| Feature | Summary |
|---|---|
| 🎙️ Voice Recognition | New /v1/voice/*. Verify, identify, embed and analyze speakers. |
| 👤 Face Recognition + Liveness | 1:1 verify, 1:N identify, detect, analyze, embed, and reject spoofed photos. |
| 🎬 Diarization | New /v1/audio/diarization endpoint, "who spoke when?" via sherpa-onnx + vibevoice.cpp. |
| 🗣️ Better Transcriptions | Word-level timestamps, client-cancellable Whisper, segments + duration + language on the stream-done event. |
| 🦙 Ollama API | Drop-in compatibility. Point your ollama client straight at LocalAI. |
| 🎬 Video Generation | stable-diffusion.ggml now generates video (i2v, first-last-frame). |
| 💬 Redesigned UI | Chat redesign, Nord palette, i18n (5 languages), admin-configurable branding. |
| ✏️ Interactive Model Editor | Autocomplete-driven config editor in the UI. |
| 📦 Universal Importer | Imports across most backends, not just llama.cpp. |
| 🚦 Concurrency Groups | Per-model exclusive groups for safe backend loading. |
| 🧪 11 New Backends | sglang, ik-llama-cpp, TurboQuant, sam.cpp, Kokoros, qwen3tts.cpp, tinygrad-multimodal, LocalVQE, vibevoice-cpp, insightface (liveness), voice-rec. |
| ⚡ vLLM @ parity | Feature parity with llama.cpp + tensor-parallel distributed workers + full engine_args. |
| 🛰️ Distributed v2 | Hardened orchestrator, round-robin replicas, scoped Upgrade All, NATS install/upgrade split. |

🚀 New Features & Major Enhancements

🎙️ Voice Recognition

LocalAI is now ears-on. New /v1/voice/* endpoints let you verify, identify, analyze and embed speakers, powered by a SpeechBrain + ONNX Python backend.

  • 1:1 Verify: "is this the same speaker?"
  • 1:N Identify: "who is talking, out of my enrolled users?"
  • Embeddings: voice fingerprints for your own pipelines
  • Analyze: age, gender, emotion attributes per segment

🔥 Pairs naturally with the new diarization endpoint for full speaker pipelines.
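
For readers building their own pipelines on top of the embeddings, 1:1 verification and 1:N identification are typically done by comparing embedding vectors. The sketch below uses cosine similarity with a made-up threshold of 0.7; whether the voice backend uses cosine similarity, and what threshold is appropriate, are assumptions, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verify(a: List[float], b: List[float], threshold: float = 0.7) -> bool:
    """1:1 check: do two embeddings look like the same speaker?"""
    return cosine_similarity(a, b) >= threshold


def identify(probe: List[float], enrolled: Dict[str, List[float]], threshold: float = 0.7) -> Optional[str]:
    """1:N check: return the best-matching enrolled name, or None if nobody clears the threshold."""
    best_name, best_score = None, threshold
    for name, embedding in enrolled.items():
        score = cosine_similarity(probe, embedding)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```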

👤 Face Recognition & Antispoofing

A complete face-biometrics pipeline, built on InsightFace + ONNX.

  • 1:1 Verify: match two faces
  • 1:N Identify: resolve a face against an enrolled set
  • Detection & Analysis: find faces, extract attributes (age, gender, emotion, race)
  • Embeddings: facial fingerprints for your own stack
  • 🆕 Antispoofing (liveness): reject spoofed photos and videos

✅ Samples never leave your machine. They go only to the running backend.

🎬 Diarization & a smarter audio pipeline

Audio is a first-class citizen now.

  • /v1/audio/diarization: segments speech by speaker turn (sherpa-onnx + vibevoice.cpp)
  • Word-level timestamps for faster-whisper
  • Client cancellation for Whisper via the ggml abort_callback. Stop a transcription mid-flight and free the GPU.
  • Stream-done metadata on /v1/audio/transcriptions: segments, duration and language on the final event.
  • Audio transformations UI (LocalVQE): explore audio FX directly from the React UI
  • Transcription error visibility: handler errors land in the access log and on the client

🦙 Ollama drop-in API

Point your existing Ollama client at LocalAI. Everything keeps working. Another front door, same engine.

OLLAMA_HOST=http://localhost:8080 ollama run qwen3

🎬 Video Generation

The stable-diffusion.ggml backend now generates video, with curated gallery entries for Wan 2.1 FLF2V 14B 720P and Wan i2v 720p, plus a new stablediffusion-ggml-development meta backend to track the cutting edge.


🎨 React UI: total refresh

A massive UI cycle landed in 4.2:

  • 💬 Chat redesign: cleaner layout, faster perceived latency, better message density
  • 🎨 Editorial refresh with the Nord palette: calmer, more focused, dark-mode-first
  • 🌍 Multilingual / i18n: English, Italiano, Español, Deutsch, 简体中文
  • 🪪 Brandable instance: admin-configurable name, tagline, and assets (logo, favicon)
  • ✏️ Interactive model config editor: autocomplete over known fields, live validation, automatic file-renaming on save
  • 🧰 Backend management UX: revamped backend list with concrete versions
  • 🛟 Better error UX: distributed backend management errors surface cleanly

💡 Self-host with your branding. The login page, sidebar, footer, and browser tab all pick up the instance name and logo.

🔄 Backend & model lifecycle

  • Backend versioning with automatic upgrade detection
  • Pin models so they survive the reaper
  • On-demand toggle per model to control auto-load
  • Concurrency groups: per-model exclusive groups so heavy backends won't trample each other
  • Universal importer: a single flow that imports across most backends, with clean multi-shard GGUF handling and dedicated importers for vibevoice-cpp and whisper.cpp HF repos

🧪 New Backends!

| Backend | What it brings |
|---|---|
| sglang | High-throughput LLM serving + speculative decoding (EAGLE/EAGLE3/DFLASH/MTP) |
| ik-llama.cpp | ikawrakow's llama.cpp fork |
| TurboQuant | Quant-focused llama.cpp fork |
| sam.cpp | Segment Anything detection |
| Kokoros | Rust-native Kokoro TTS |
| qwen3tts.cpp | Qwen3 TTS |
| tinygrad-multimodal (experimental) | tinygrad-powered multimodal |
| vibevoice.cpp | Diarization-grade speech |
| LocalVQE | Audio transformations / FX |
| insightface | Face antispoofing |
| voice-rec | Speaker recognition / embeddings |

⚡ vLLM at parity (and beyond)

  • vLLM parity with llama.cpp: same feature surface, same ergonomics
  • vLLM engine_args: the full AsyncEngineArgs exposed via a generic YAML map
  • Tensor-parallel distributed workers: fan a single model across nodes
  • CUDA 13 builds for vLLM, vLLM-omni and sglang
  • L4T arm64 (CUDA 13): vLLM/vLLM-omni/sglang variants for Jetson-class arm64
  • MLX backend refactored: shared helpers and enhanced functionality
  • llama.cpp split_mode for explicit multi-GPU placement
  • Speculative decoding wired through for llama.cpp; Gemma 4 thinking support added
  • Vision / mtmd marker propagated from the backend via ModelMetadata

🛰️ Distributed Mode v2

Distributed mode keeps maturing. This release was a hardening pass across the orchestration loop:

  • Orchestrator resilience: auto-upgrade routing, worker bind-wait, RAG-init crash, log-spam fixes
  • Round-robin across replicas of the same model
  • Upgrade All scoped to nodes that actually have the backend installed
  • NATS install / upgrade split: backend.upgrade no longer piggybacks on install
  • Cached-replica lookup honors NodeSelector; the reconciler no longer scales up empty backends
  • VRAM/RAM reporting correct on NVIDIA unified-memory hosts
  • Agent nodes: queue loops stop on teardown, dead-letter cap added
  • Autoscaling: load-model extracted from Route() and applied during autoscale

🔐 Auth & Security

  • Settings API: env-supplied ApiKeys are stripped before persisting (no accidental leaks)
  • grpc-server hardening: removed unsafe sprintf() in the C++ grpc server
  • OIDC: bumped go-oidc/v3 to 3.18.0
  • Security hardening pass across the codebase
  • AI coding assistants policy: LocalAI now follows the Linux kernel's DCO/attribution guidelines (Assisted-by: trailer, no AI co-authors)

🖥️ Hardware & deployment

  • CUDA 13 for vLLM, vLLM-omni, and sglang
  • NVIDIA L4T arm64 (CUDA 13) for Jetson-class boards
  • ROCm 7.x bumped to latest
  • gfx1151 (Strix Halo / Ryzen AI MAX) support: AMDGPU_TARGETS exposed as a build-arg
  • Intel GPU: latest oneapi-basekit (b70 support) across Intel images
  • arm64 CI: cpu-whisperx and cpu-faster-whisper now ship arm64 images
  • whisperx: ROCm/HIPBLAS target dropped (pinned to rocm6.4 wheels)

🛠️ Under the Hood

  • Better CLI errors with actionable guidance
  • golangci-lint baseline (new-from-merge-base) keeps drift in check
  • Coding-agent discoverability: new APIs let coding agents introspect and configure LocalAI
  • Autoparser: prefers backend-emitted chat deltas, correct logprob passthrough, strips partial reasoning tags during warm-up
  • Reasoning + tools: no more empty content from thinking models in retry loops
  • Streaming hygiene, deduped content, dedup...

v4.1.3

06 Apr 23:05
fdc9f7b

What's Changed

Bug fixes 🐛

  • fix(token): login via legacy api keys by @mudler in #9249
  • fix(anthropic): do not emit empty tokens and fix SSE tool calls by @mudler in #9258
  • fix(gpu): better detection for MacOS and Thor by @mudler in #9263

👒 Dependencies

  • chore(deps): bump google.golang.org/grpc from 1.79.3 to 1.80.0 by @dependabot[bot] in #9253
  • chore(deps): bump github.com/jaypipes/ghw from 0.23.0 to 0.24.0 by @dependabot[bot] in #9250
  • chore(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.12 to 1.32.14 by @dependabot[bot] in #9256
  • chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.64.0 to 0.65.0 by @dependabot[bot] in #9254

Other Changes

  • chore: ⬆️ Update ggml-org/llama.cpp to d0a6dfeb28a09831d904fc4d910ddb740da82834 by @localai-bot in #9259
  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9260
  • chore: ⬆️ Update ace-step/acestep.cpp to e0c8d75a672fca5684c88c68dbf6d12f58754258 by @localai-bot in #9261
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 8afbeb6ba9702c15d41a38296f2ab1fe5c829fa0 by @localai-bot in #9262

Full Changelog: v4.1.2...v4.1.3