Skip to content

Latest commit

 

History

History
115 lines (81 loc) · 4.47 KB

File metadata and controls

115 lines (81 loc) · 4.47 KB

TensorSharp.Server Integration Tests

English | 中文

The test suites exercise TensorSharp.Server's current public compatibility surface:

  • Web UI SSE: /api/chat
  • Ollama chat compatibility: /api/chat/ollama
  • OpenAI Chat Completions compatibility: /v1/chat/completions

The scripts auto-detect the loaded model architecture and skip thinking or tool-calling checks when the active model does not support those capabilities. They target autoregressive compatibility behavior; DiffusionGemma's Web UI whole-message replace preview frames are not covered until a dedicated diffusion suite is added.

Current Suite Status

Surface Coverage
Web UI SSE Session-scoped streaming, queue-status compatibility events, done event metrics, abort handling
Ollama compatibility Chat streaming/non-streaming, multi-turn history, thinking, tool-call request plumbing
OpenAI compatibility Chat Completions streaming/non-streaming, tool calls, structured outputs, validation errors
Operational behavior Continuous-batching concurrency, queue-status compatibility, mixed API handoff, architecture-aware skips
DiffusionGemma Not covered by the current compatibility scripts beyond generic endpoint shape; live denoising previews need a dedicated Web UI SSE test

Quick Start

  1. Start TensorSharp.Server:
./TensorSharp.Server --model ~/models/model.gguf --backend ggml_metal

Use --backend cuda or --backend ggml_cuda on Windows/Linux NVIDIA machines, --backend ggml_metal or --backend mlx on macOS, or --backend ggml_cpu / --backend cpu for CPU runs.

  1. Run either suite:
# Bash suite (requires curl + jq)
bash test_multiturn.sh

# Python suite (standard library only)
python3 test_multiturn.py

What The Suites Cover

Common coverage

  • Web UI multi-turn SSE streaming and done events
  • Ollama chat multi-turn behavior in streaming and non-streaming modes
  • OpenAI Chat Completions streaming and non-streaming behavior
  • OpenAI structured outputs with both response_format: {"type":"json_object"} and response_format.json_schema
  • Queue status endpoint shape
  • Error handling for missing required fields
  • Structured-output validation errors and documented request conflicts

Capability-gated coverage

  • Thinking-mode tests run only on architectures that currently support thinking in TensorSharp: Gemma 4, Qwen 3, Qwen 3.5, GPT OSS, and Nemotron-H
  • Tool-calling tests run only on architectures that currently support tool calling in TensorSharp: Gemma 4, Qwen 3, Qwen 3.5, and Nemotron-H
  • GPT OSS thinking is exercised, but GPT OSS tool-call checks are currently skipped by these scripts even though the general parser/API surface supports Harmony tool-call framing.

Unsupported architectures are reported as SKIP, not FAIL.

Bash-only operational checks

  • System-prompt persistence in the Web UI flow
  • Concurrent requests through the continuous-batching engine
  • Queue-status compatibility fields
  • Long-conversation stress test
  • Mixed Ollama/OpenAI handoff
  • Abort mid-generation and request cleanup
  • Ollama tool-call request plumbing

Python-specific compatibility checks

  • Architecture-aware OpenAI tool-call validation
  • Separate pass/fail/skip accounting with per-test payload dumps

Notes

  • The OpenAI coverage in this folder targets Chat Completions compatibility. OpenAI's newer Responses API is not the compatibility surface TensorSharp.Server currently emulates here.
  • Structured outputs follow the Chat Completions response_format contract. json_schema requests combined with tools or think are expected to return HTTP 400.
  • The Ollama and OpenAI compatibility projects continue to evolve. These scripts are aligned with the server's current contract plus the current documented behavior around thinking, tool calling, and structured outputs.
  • DiffusionGemma can return final text through append-oriented compatibility endpoints, but only Web UI /api/chat exposes the live denoising replace frames.

Usage

Bash

bash test_multiturn.sh [model_name] [base_url]

Examples:

bash test_multiturn.sh
bash test_multiturn.sh gemma-4-E4B-it-Q8_0.gguf
bash test_multiturn.sh gemma-4-E4B-it-Q8_0.gguf http://host:5000

Python

python3 test_multiturn.py [--model MODEL] [--url URL] [--max-tokens N]

Examples:

python3 test_multiturn.py
python3 test_multiturn.py --model gemma-4-E4B-it-Q8_0.gguf
python3 test_multiturn.py --max-tokens 120