Endee Migration Tool

Migrate vector collections from Qdrant or Milvus to Endee using a Dockerized producer-consumer pipeline with checkpoint resume support.


Supported Migrations

Migration Type                             Command
Qdrant (Dense) → Endee                     qdrant-to-endee-dense
Qdrant (Hybrid: Dense + Sparse) → Endee    qdrant-to-endee-hybrid
Milvus (Dense) → Endee                     milvus-to-endee-dense
Milvus (Hybrid: Dense + Sparse) → Endee    milvus-to-endee-hybrid

Project Structure

.
├── cmd.sh                          # Build and run script
├── .env                            # Configuration (copy from .env.example)
├── data/
│   └── checkpoints/               # Checkpoint files (auto-created, mount as volume)
├── scripts/
│   ├── qdrant_to_endee_dense_migration.py
│   ├── qdrant_to_endee_hybrid_migration.py
│   ├── milvus_to_endee_dense_migration.py
│   └── milvus_to_endee_hybrid_migration.py
├── entrypoint.sh                   # Docker entrypoint
└── Dockerfile

Quick Start

1. Configure your .env file

Copy and edit the environment file:

cp .env.example .env

Set your source database, target Endee credentials, and migration type. See Configuration Reference below for all options.

2. Run the migration

Run cmd.sh:

bash cmd.sh

Configuration Reference

All settings can be provided via the .env file or as environment variables passed to Docker.
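
For instance, a one-off run without compose might look like the following (a sketch, not a documented invocation; the image tag, network, and volume mount are inferred from cmd.sh and the checkpoint notes):

docker run --rm \
  --network vector-net \
  --env-file .env \
  -v "$(pwd)/data:/app/data" \
  vector-migration:latest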

Migration Type

# Choose one:
# qdrant-to-endee-dense
# qdrant-to-endee-hybrid
# milvus-to-endee-dense
# milvus-to-endee-hybrid
MIGRATION_TYPE=qdrant-to-endee-dense

Source — Qdrant

SOURCE_URL=http://your-qdrant-host
SOURCE_PORT=6333
SOURCE_API_KEY=                      # Leave empty if no auth
SOURCE_COLLECTION=your_collection
USE_HTTPS=false

Source — Milvus

SOURCE_URL=http://your-milvus-host
SOURCE_PORT=19530
SOURCE_API_KEY=your_milvus_token     # Leave empty if no auth
SOURCE_COLLECTION=your_collection
IS_MULTIVECTOR=false

Target — Endee

TARGET_URL=http://your-endee-host:8080   # Omit for Endee Cloud
TARGET_API_KEY=your_endee_api_key
TARGET_COLLECTION=your_index_name

Performance

BATCH_SIZE=1000        # Records fetched per batch from source
UPSERT_SIZE=1000       # Records upserted per chunk to Endee
MAX_QUEUE_SIZE=5       # Max batches buffered in memory between producer and consumer
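
As a rough sizing example: with the defaults above and, say, 1536-dimensional float32 vectors (about 6 KB of vector data each), at most MAX_QUEUE_SIZE × BATCH_SIZE = 5 × 1000 = 5,000 records sit in memory at once, on the order of 30 MB of vectors plus payloads.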

Index Parameters (Milvus only — auto-detected from schema)

# These are read automatically from Milvus collection schema.
# Override only if needed:
# SPACE_TYPE=cosine    # cosine | l2 | ip
# M=16
# EF_CONSTRUCT=128
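
(For reference, M and EF_CONSTRUCT are the standard HNSW build parameters: the maximum number of graph links per node and the size of the candidate list used during index construction.)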

Checkpoint / Resume

CHECKPOINT_FILE=/app/data/checkpoints/migration.json
# CLEAR_CHECKPOINT=true    # Uncomment to start fresh

Filter Fields

# Comma-separated list of payload fields to use as Endee filter fields.
# All other fields go to meta.
# Endee filter fields must be scalar types (str, int, float, bool).
# Lists and dicts must go to meta — do not include them here.
FILTER_FIELDS=company,region,sector,document_type

Debug

DEBUG=false    # Set true for verbose logging

Full .env Example

# ── Migration ────────────────────────────────────────────────────
MIGRATION_TYPE=qdrant-to-endee-hybrid

# ── Source (Qdrant) ──────────────────────────────────────────────
SOURCE_URL=
SOURCE_PORT=6333
SOURCE_API_KEY=
SOURCE_COLLECTION=
USE_HTTPS=false

# ── Target (Endee) ───────────────────────────────────────────────
TARGET_API_KEY=your_endee_api_key
TARGET_COLLECTION=my_endee_index

# ── Performance ──────────────────────────────────────────────────
BATCH_SIZE=1000
UPSERT_SIZE=1000
MAX_QUEUE_SIZE=5

# ── Filter fields ────────────────────────────────────────────────
FILTER_FIELDS=company,region,sector,document_type,page_number

# ── Checkpoint ───────────────────────────────────────────────────
CHECKPOINT_FILE=/app/data/checkpoints/migration.json

# ── Debug ────────────────────────────────────────────────────────
DEBUG=false

cmd.sh Reference

#!/bin/bash
docker network create vector-net 2>/dev/null || true
docker build -t vector-migration:latest .
docker compose up --build

Checkpoint & Resume

Migration progress is saved after every successfully upserted batch. If the migration is interrupted for any reason (network error, Ctrl+C, container restart), simply rerun the same command — it will resume from where it left off automatically.

# Resume from last checkpoint (default — just rerun):
bash cmd.sh

# Start fresh (discard checkpoint):
# Set in .env:
CLEAR_CHECKPOINT=true

The checkpoint file is stored at CHECKPOINT_FILE (default: /app/data/checkpoints/migration.json). Since /app/data is mounted as a volume, the checkpoint persists across container restarts.

Checkpoint file example:

{
  "processed_count": 50000,
  "last_offset": "abc123-uuid-...",
  "batch_number": 50
}
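
A minimal sketch of the save/resume logic behind this file (illustrative only; load_checkpoint and save_checkpoint are stand-in names, not the scripts' actual internals):

import json
import os

CHECKPOINT_FILE = os.environ.get("CHECKPOINT_FILE", "/app/data/checkpoints/migration.json")

def load_checkpoint():
    """Return saved progress, or a fresh state if none exists or CLEAR_CHECKPOINT is set."""
    if os.environ.get("CLEAR_CHECKPOINT") == "true" or not os.path.exists(CHECKPOINT_FILE):
        return {"processed_count": 0, "last_offset": None, "batch_number": 0}
    with open(CHECKPOINT_FILE) as f:
        return json.load(f)

def save_checkpoint(state):
    """Write via a temp file so an interrupt cannot leave a half-written checkpoint."""
    os.makedirs(os.path.dirname(CHECKPOINT_FILE), exist_ok=True)
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)  # atomic rename on POSIX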

Filter Fields vs Meta Fields

Endee has two payload buckets per record:

Bucket   Purpose                              Allowed types
filter   Used for filtering search results    str, int, float, bool only
meta     Stored metadata, not filterable      Any type, including list and dict

Important: If any field in FILTER_FIELDS contains a list or dict value, Endee will reject the record with MDBX_BAD_VALSIZE. Always use scalar values in filter fields.

# ✓ Safe — scalar fields
FILTER_FIELDS=company,region,sector,page_number

# ✗ Will fail — 'product' is a list in this dataset
FILTER_FIELDS=company,product

If FILTER_FIELDS is empty, all payload fields go to filter, which fails as soon as any field holds a non-scalar value. List your scalar filter fields explicitly; anything not listed (including all list and dict fields) automatically lands in meta.
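
A sketch of this routing rule (illustrative; the isinstance guard is defensive, since the scripts may simply trust FILTER_FIELDS, and the empty-FILTER_FIELDS branch mirrors the behavior noted above):

SCALAR_TYPES = (str, int, float, bool)

def split_payload(payload: dict, filter_fields: set):
    """Route listed scalar fields to filter; everything else lands in meta."""
    if not filter_fields:
        # Empty FILTER_FIELDS: every payload field goes to filter (risky, see above).
        return dict(payload), {}
    filter_part, meta_part = {}, {}
    for key, value in payload.items():
        if key in filter_fields and isinstance(value, SCALAR_TYPES):
            filter_part[key] = value
        else:
            meta_part[key] = value  # lists, dicts, and unlisted fields
    return filter_part, meta_part

# split_payload({"company": "ACME", "product": ["a", "b"]}, {"company", "product"})
# -> ({"company": "ACME"}, {"product": ["a", "b"]})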


Architecture

Each migration runs a producer-consumer pipeline inside asyncio:

migrate()                     ← sync setup: connect, detect schema, create index
    └── asyncio.run(async_migrate())
            ├── async_producer()   ← fetches batches from source into bounded queue
            └── async_consumer()   ← reads from queue, upserts to Endee, saves checkpoint
  • Bounded queue (MAX_QUEUE_SIZE=5) prevents memory overflow — producer pauses when queue is full.
  • All blocking SDK calls (Qdrant scroll, Milvus query, Endee upsert) run in loop.run_in_executor() so the event loop is never frozen.
  • Parallel upsert — chunks within a batch are upserted concurrently via asyncio.gather().
  • Exponential backoff retry — failed chunks are retried up to 3 times (1s, 2s, 4s).
  • Graceful shutdown — SIGINT/SIGTERM (Ctrl+C or docker stop) saves the checkpoint and exits cleanly.
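
The skeleton below illustrates that wiring (a simplified sketch, not the shipped scripts; fetch_batch and upsert_chunk stand in for the real SDK calls, and retry/shutdown handling is omitted for brevity):

import asyncio

MAX_QUEUE_SIZE = 5     # mirrors MAX_QUEUE_SIZE in .env
UPSERT_SIZE = 1000     # mirrors UPSERT_SIZE in .env
SENTINEL = None        # tells the consumer the producer is done

async def async_producer(queue, fetch_batch):
    loop = asyncio.get_running_loop()
    offset = None
    while True:
        # Blocking SDK call (Qdrant scroll / Milvus query) runs in the executor.
        batch, offset = await loop.run_in_executor(None, fetch_batch, offset)
        if not batch:
            break
        await queue.put(batch)  # blocks here while the queue is full
    await queue.put(SENTINEL)

async def async_consumer(queue, upsert_chunk, save_checkpoint):
    loop = asyncio.get_running_loop()
    while True:
        batch = await queue.get()
        if batch is SENTINEL:
            break
        chunks = [batch[i:i + UPSERT_SIZE] for i in range(0, len(batch), UPSERT_SIZE)]
        # Chunks of one batch are upserted concurrently.
        await asyncio.gather(*(loop.run_in_executor(None, upsert_chunk, c) for c in chunks))
        save_checkpoint()  # progress persists after every batch

async def async_migrate(fetch_batch, upsert_chunk, save_checkpoint):
    queue = asyncio.Queue(maxsize=MAX_QUEUE_SIZE)
    await asyncio.gather(
        async_producer(queue, fetch_batch),
        async_consumer(queue, upsert_chunk, save_checkpoint),
    )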

Troubleshooting

Migration hangs after a failure

The consumer hit a failure while the queue was full, leaving the producer blocked in queue.put(). Kill the container and rerun; the checkpoint will resume progress. The permanent fix is to drain the queue in the consumer's failure path (see the source code comments).

To debug a hang:

# Inside the container
pip install py-spy
py-spy dump --pid 1
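
If py-spy cannot attach inside the container, it likely lacks ptrace permissions; rerunning the container with --cap-add SYS_PTRACE usually resolves this.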

MDBX_BAD_VALSIZE error

A filter field contains a non-scalar value (usually a list). Remove it from FILTER_FIELDS — it will go to meta instead.

# If 'product' is a list:
FILTER_FIELDS=company,region,sector   # ← remove 'product'

Cannot allocate memory on index creation

The Endee server is out of memory — too many indexes open. Delete unused indexes on the Endee server before retrying.

free -h          # check available RAM on Endee server
docker stats     # check container memory usage

Qdrant client version warning

UserWarning: Qdrant client version 1.16.2 is incompatible with server version 1.13.6

Downgrade the client to match your server version, or add check_compatibility=False to the QdrantClient constructor. Migration will still work in most cases despite the warning.
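
For example (a sketch; the connection values mirror the Source section above):

from qdrant_client import QdrantClient

# Suppress the client/server compatibility check.
client = QdrantClient(
    url="http://your-qdrant-host",
    port=6333,
    check_compatibility=False,
)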

URL has trailing space

Failed to resolve 'your-host%20'

Check TARGET_URL or SOURCE_URL in .env for trailing whitespace.


Requirements

  • Docker
  • Source database accessible from the Docker network (--network vector-net or host network)
  • Endee instance running and accessible
  • Sufficient disk space for checkpoint file (tiny — JSON, a few KB)
  • Sufficient RAM for MAX_QUEUE_SIZE × BATCH_SIZE records in memory at once
