The test suites exercise TensorSharp.Server's current public compatibility surface:
- Web UI SSE:
/api/chat - Ollama chat compatibility:
/api/chat/ollama - OpenAI Chat Completions compatibility:
/v1/chat/completions
The scripts auto-detect the loaded model architecture and skip thinking or tool-calling checks when the active model does not support those capabilities. They target autoregressive compatibility behavior; DiffusionGemma's Web UI whole-message replace preview frames are not covered until a dedicated diffusion suite is added.
| Surface | Coverage |
|---|---|
| Web UI SSE | Session-scoped streaming, queue-status compatibility events, done event metrics, abort handling |
| Ollama compatibility | Chat streaming/non-streaming, multi-turn history, thinking, tool-call request plumbing |
| OpenAI compatibility | Chat Completions streaming/non-streaming, tool calls, structured outputs, validation errors |
| Operational behavior | Continuous-batching concurrency, queue-status compatibility, mixed API handoff, architecture-aware skips |
| DiffusionGemma | Not covered by the current compatibility scripts beyond generic endpoint shape; live denoising previews need a dedicated Web UI SSE test |
- Start TensorSharp.Server:
./TensorSharp.Server --model ~/models/model.gguf --backend ggml_metalUse --backend cuda or --backend ggml_cuda on Windows/Linux NVIDIA machines, --backend ggml_metal or --backend mlx on macOS, or --backend ggml_cpu / --backend cpu for CPU runs.
- Run either suite:
# Bash suite (requires curl + jq)
bash test_multiturn.sh
# Python suite (standard library only)
python3 test_multiturn.py- Web UI multi-turn SSE streaming and done events
- Ollama chat multi-turn behavior in streaming and non-streaming modes
- OpenAI Chat Completions streaming and non-streaming behavior
- OpenAI structured outputs with both
response_format: {"type":"json_object"}andresponse_format.json_schema - Queue status endpoint shape
- Error handling for missing required fields
- Structured-output validation errors and documented request conflicts
- Thinking-mode tests run only on architectures that currently support thinking in TensorSharp: Gemma 4, Qwen 3, Qwen 3.5, GPT OSS, and Nemotron-H
- Tool-calling tests run only on architectures that currently support tool calling in TensorSharp: Gemma 4, Qwen 3, Qwen 3.5, and Nemotron-H
- GPT OSS thinking is exercised, but GPT OSS tool-call checks are currently skipped by these scripts even though the general parser/API surface supports Harmony tool-call framing.
Unsupported architectures are reported as SKIP, not FAIL.
- System-prompt persistence in the Web UI flow
- Concurrent requests through the continuous-batching engine
- Queue-status compatibility fields
- Long-conversation stress test
- Mixed Ollama/OpenAI handoff
- Abort mid-generation and request cleanup
- Ollama tool-call request plumbing
- Architecture-aware OpenAI tool-call validation
- Separate pass/fail/skip accounting with per-test payload dumps
- The OpenAI coverage in this folder targets Chat Completions compatibility. OpenAI's newer Responses API is not the compatibility surface TensorSharp.Server currently emulates here.
- Structured outputs follow the Chat Completions
response_formatcontract.json_schemarequests combined withtoolsorthinkare expected to return HTTP400. - The Ollama and OpenAI compatibility projects continue to evolve. These scripts are aligned with the server's current contract plus the current documented behavior around thinking, tool calling, and structured outputs.
- DiffusionGemma can return final text through append-oriented compatibility endpoints, but only Web UI
/api/chatexposes the live denoisingreplaceframes.
bash test_multiturn.sh [model_name] [base_url]Examples:
bash test_multiturn.sh
bash test_multiturn.sh gemma-4-E4B-it-Q8_0.gguf
bash test_multiturn.sh gemma-4-E4B-it-Q8_0.gguf http://host:5000python3 test_multiturn.py [--model MODEL] [--url URL] [--max-tokens N]Examples:
python3 test_multiturn.py
python3 test_multiturn.py --model gemma-4-E4B-it-Q8_0.gguf
python3 test_multiturn.py --max-tokens 120