Adds literal decoding variant with a per-stream LDS cache to coalesce memory writes through transposition. #72
Draft
pm4rtx wants to merge 2 commits into microsoft:development from
… memory writes through transposition.
…e for consistency with other kernels.
Collaborator
Getting back LDS space is goodness all around. Looks great to me.
This PR makes literal decoding more memory friendly by avoiding scattered one-byte-per-thread writes into N destination locations, one per processed stream. Instead, it accumulates four decoded bytes with aligned destination addresses into a dword and then stores the dwords from each processed stream into LDS. When the LDS cache becomes full, or the last full dword is formed, the dwords are flushed from LDS to memory cooperatively by the entire threadgroup, making coalesced writes.
This new variant of the shader also halves the LDS used to store the Huffman table (from 2048 to 1024 dwords). This is still above the ideal of 768 dwords, but it is an improvement and frees some LDS space for the per-stream data cache.
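For reviewers who want a mental model of the accumulate-and-flush scheme, here is a minimal CPU-side sketch in Python. It is not the shader code from this PR. The names `decode_with_cache` and `CACHE_DWORDS`, the cache depth of 4 dwords, and the assumption that every stream's destination base is dword-aligned are all hypothetical and chosen only for illustration. A Python list stands in for LDS, and the flush loop stands in for the cooperative threadgroup write.

```python
import struct

CACHE_DWORDS = 4  # hypothetical per-stream cache depth, in dwords


def decode_with_cache(streams, bases, out):
    """Write each stream's bytes to out, starting at bases[s].

    streams: list of bytes objects standing in for decoded literals.
    bases:   destination byte offset per stream (assumed dword-aligned).
    out:     bytearray large enough to hold every stream's output.
    """
    n = len(streams)
    cache = [[] for _ in range(n)]  # stands in for the per-stream LDS cache
    flushed = [0] * n               # bytes already written per stream
    acc = [0] * n                   # dword under construction per stream
    count = [0] * n                 # bytes currently held in acc per stream

    def flush():
        # Each stream's cached dwords go out as one contiguous run,
        # which is what makes the writes coalescable on a GPU.
        for s in range(n):
            if cache[s]:
                data = struct.pack("<%dI" % len(cache[s]), *cache[s])
                start = bases[s] + flushed[s]
                out[start:start + len(data)] = data
                flushed[s] += len(data)
                cache[s].clear()

    longest = max((len(st) for st in streams), default=0)
    for step in range(longest):
        # One "thread" per stream produces one byte per step.
        for s in range(n):
            if step < len(streams[s]):
                acc[s] |= streams[s][step] << (8 * count[s])
                count[s] += 1
                if count[s] == 4:  # a full dword has been formed
                    cache[s].append(acc[s])
                    acc[s] = 0
                    count[s] = 0
        if any(len(c) == CACHE_DWORDS for c in cache):
            flush()

    flush()  # flush whatever full dwords remain
    # Trailing bytes that never filled a dword are written one by one.
    for s in range(n):
        for i in range(count[s]):
            out[bases[s] + flushed[s] + i] = (acc[s] >> (8 * i)) & 0xFF
    return out
```

In this model, byte-sized stores only happen for the tail of each stream; everything else leaves the cache as runs of whole dwords.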