
bug: Qwen3.5 Cannot Run with GPU Acceleration on Apple M4 #571

@FliPPeDround


Issue description

Qwen3.5-4B-Q4_K_M.gguf cannot be loaded when gpuLayers is set to any value other than 0; an InsufficientMemoryError is thrown regardless of the context size.

Actual Behavior

The issue is NOT related to context size. The error occurs whenever gpuLayers is set to any value other than 0:

  • gpuLayers: 0 → Works ✅
  • gpuLayers: 'auto' → InsufficientMemoryError ❌
  • gpuLayers: 40 → InsufficientMemoryError ❌

Error Message

InsufficientMemoryError: A context size of XXXXX is too large for the available VRAM
    at resolveContextContextSizeOption (file:///.../node-llama-cpp/dist/gguf/insights/utils/resolveContextContextSizeOption.js:28:19)
    at async GgufInsightsConfigurationResolver.resolveContextContextSize (file:///.../node-llama-cpp/dist/gguf/insights/GgufInsightsConfigurationResolver.js:235:16)
    at async LlamaContext._create (file:///.../node-llama-cpp/dist/evaluator/LlamaContext/LlamaContext.js:581:27)
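
The failure can be isolated to just the model load and context creation. A minimal sketch (assuming the same model file and node-llama-cpp APIs used in the full example below; it needs the local GGUF file and a Metal device to run):

```javascript
import path from 'node:path'
import { getLlama } from 'node-llama-cpp'

// Same quantized model as in the full example below
const MODEL_PATH = './models/Qwen3.5-4B-UD-Q4_K_XL.gguf'

const llama = await getLlama()

// Any non-zero gpuLayers value ('auto', 40, ...) triggers the failure;
// only gpuLayers: 0 loads and runs successfully on this machine.
const model = await llama.loadModel({
  modelPath: path.resolve(MODEL_PATH),
  gpuLayers: 'auto',
})

// With Metal offload enabled, this throws InsufficientMemoryError
// regardless of how small contextSize is set.
const context = await model.createContext()
console.warn('Context created:', context.contextSize)
```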

Additional Issue: Function Calling Causes Context Size Error

When using gpuLayers: 0 (CPU-only mode), there is a second problem: no matter how large contextSize is set, any prompt that passes functions throws "Error: The context size is too small to generate a response".

Code:

import path from 'node:path'
import {
  defineChatSessionFunction,
  getLlama,
  LlamaChatSession,
} from 'node-llama-cpp'

const MODEL_PATH = './models/Qwen3.5-4B-UD-Q4_K_XL.gguf'

const functions = {
  getCurrentWeather: defineChatSessionFunction({
    description: 'Gets the current weather in the provided location.',
    params: {
      type: 'object',
      properties: {
        location: {
          type: 'string',
          description: 'The city and state, e.g. San Francisco, CA',
        },
        format: {
          enum: ['celsius', 'fahrenheit'],
        },
      },
    },
    handler({ location, format }) {
      console.warn(`Getting current weather for "${location}" in ${format}`)
      return {
        temperature: format === 'celsius' ? 20 : 68,
        format,
      }
    },
  }),
}

const llama = await getLlama()
const model = await llama.loadModel({
  modelPath: path.resolve(MODEL_PATH),
  gpuLayers: 0,
})

console.warn('Creating context without explicit contextSize...')
const context = await model.createContext()
console.warn('Context created successfully')

const session = new LlamaChatSession({
  contextSequence: context.getSequence(),
})

const q1 = 'What is the weather like in SF?'
console.warn(`User: ${q1}`)

const a1 = await session.prompt(q1, { functions })
console.warn(`AI: ${a1}`)

My Environment

OS: macOS 25.3.0 (arm64)
Node: 22.21.1 (arm64)
TypeScript: 5.9.3

node-llama-cpp: 3.17.1
Prebuilt binaries: b8179

Metal: available

Metal device: Apple M4
Metal used VRAM: 0% (464KB/11.84GB)
Metal free VRAM: 99.99% (11.84GB/11.84GB)

CPU model: Apple M4
Math cores: 4
Used RAM: 99.18% (15.87GB/16GB)
Free RAM: 0.81% (133.38MB/16GB)
Used swap: 78.92% (3.16GB/4GB)
Max swap size: dynamic
mmap: supported

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Vulkan support
  • Grammar
  • Function calling

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

Metadata

Labels: bug (Something isn't working)
Status: In Progress
Milestone: No milestone
Development: No branches or pull requests