Improve our embedding splitting #35366

@wezell

Description

Current best practice still seems to be splitting content into chunks of roughly 500 tokens for embeddings. We already split on sentence boundaries, which has worked well. What we have not done is add overlap between chunks. Generally speaking, a 15 to 20% overlap between adjacent splits helps maintain context across chunk boundaries. This change adds that overlap.
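To make the idea concrete, here is a minimal sketch of sentence-based splitting with overlap. It is not the project's implementation: the function names (`split_sentences`, `count_tokens`, `chunk_with_overlap`) and parameters are hypothetical, the sentence splitter is a naive regex, and a whitespace word count stands in for a real tokenizer.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_tokens(sentence: str) -> int:
    # Stand-in for a real tokenizer (assumption): count whitespace-separated words.
    return len(sentence.split())


def total_tokens(sentences: List[str]) -> int:
    return sum(count_tokens(s) for s in sentences)


def chunk_with_overlap(text: str, max_tokens: int = 500,
                       overlap_ratio: float = 0.15) -> List[str]:
    """Group sentences into chunks of at most max_tokens, repeating the
    trailing sentences of each chunk (up to overlap_ratio * max_tokens)
    at the start of the next one."""
    budget = int(max_tokens * overlap_ratio)
    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built
    fresh = 0                # sentences in `current` not yet emitted in any chunk

    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if fresh and total_tokens(current) + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry over whole trailing sentences that fit in the overlap budget.
            carried: List[str] = []
            for prev in reversed(current):
                if total_tokens(carried) + count_tokens(prev) > budget:
                    break
                carried.insert(0, prev)
            # Shrink the overlap if the next sentence would not fit beside it.
            while carried and total_tokens(carried) + n > max_tokens:
                carried.pop(0)
            current, fresh = carried, 0
        current.append(sentence)
        fresh += 1

    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

Overlap here is measured in whole sentences, so the actual overlap can be smaller than the budget. A single sentence longer than `max_tokens` is emitted as its own oversized chunk rather than being cut mid-sentence.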

Acceptance Criteria

  • Only affects future embeddings
  • Existing embeddings continue to work

Priority

None

Additional Context

No response

Metadata

Assignees

No one assigned

    Labels

    No labels

    Type

    Projects

    Status

    New

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
