Issue description
Qwen3.5-4B-Q4_K_M.gguf cannot be loaded when gpuLayers is set to any value other than 0; an InsufficientMemoryError is thrown regardless of the context size.
Actual Behavior
The issue is NOT related to context size. The error occurs whenever gpuLayers is set to any value other than 0:
- gpuLayers: 0 → Works ✅
- gpuLayers: 'auto' → InsufficientMemoryError ❌
- gpuLayers: 40 → InsufficientMemoryError ❌
Error Message
InsufficientMemoryError: A context size of XXXXX is too large for the available VRAM
at resolveContextContextSizeOption (file:///.../node-llama-cpp/dist/gguf/insights/utils/resolveContextContextSizeOption.js:28:19)
at async GgufInsightsConfigurationResolver.resolveContextContextSize (file:///.../node-llama-cpp/dist/gguf/insights/GgufInsightsConfigurationResolver.js:235:16)
at async LlamaContext._create (file:///.../node-llama-cpp/dist/evaluator/LlamaContext/LlamaContext.js:581:27)
Additional Issue: Function Calling Causes Context Size Error
When using gpuLayers: 0 (CPU-only mode), there is a second issue: no matter how large contextSize is set, any prompt that passes functions throws "Error: The context size is too small to generate a response".
Code:
import path from 'node:path'
import {
  defineChatSessionFunction,
  getLlama,
  LlamaChatSession,
} from 'node-llama-cpp'

const MODEL_PATH = './models/Qwen3.5-4B-UD-Q4_K_XL.gguf'

const functions = {
  getCurrentWeather: defineChatSessionFunction({
    description: 'Gets the current weather in the provided location.',
    params: {
      type: 'object',
      properties: {
        location: {
          type: 'string',
          description: 'The city and state, e.g. San Francisco, CA',
        },
        format: {
          enum: ['celsius', 'fahrenheit'],
        },
      },
    },
    handler({ location, format }) {
      console.warn(`Getting current weather for "${location}" in ${format}`)
      return {
        temperature: format === 'celsius' ? 20 : 68,
        format,
      }
    },
  }),
}

const llama = await getLlama()
const model = await llama.loadModel({
  modelPath: path.resolve(MODEL_PATH),
  gpuLayers: 0,
})

console.warn('Creating context without explicit contextSize...')
const context = await model.createContext()
console.warn('Context created successfully')

const session = new LlamaChatSession({
  contextSequence: context.getSequence(),
})

const q1 = 'What is the weather like in SF?'
console.warn(`User: ${q1}`)
const a1 = await session.prompt(q1, { functions })
console.warn(`AI: ${a1}`)

My Environment
OS: macOS 25.3.0 (arm64)
Node: 22.21.1 (arm64)
TypeScript: 5.9.3
node-llama-cpp: 3.17.1
Prebuilt binaries: b8179
Metal: available
Metal device: Apple M4
Metal used VRAM: 0% (464KB/11.84GB)
Metal free VRAM: 99.99% (11.84GB/11.84GB)
CPU model: Apple M4
Math cores: 4
Used RAM: 99.18% (15.87GB/16GB)
Free RAM: 0.81% (133.38MB/16GB)
Used swap: 78.92% (3.16GB/4GB)
Max swap size: dynamic
mmap: supported
Additional Context
No response
Relevant Features Used
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, but I don't know how to start. I would need guidance.