Enable the bsddb deadlock detector #90
Draft
otargowski wants to merge 2 commits into sio2project:master from
Also wrap transactions to retry after terminations from the deadlock detector. The reason for these changes is bsddb's use of page-level locks. According to the underlying library's documentation: "[deadlock detection] is necessary for almost all applications in which more than a single thread of control will be accessing the database at one time. Even when Berkeley DB automatically handles database locking, it is normally possible for deadlock to occur." Source (with more info): https://docs.oracle.com/database/bdb181/html/programmer_reference/transapp_deadlock.html https://web.archive.org/web/20260227161711/https://docs.oracle.com/database/bdb181/html/programmer_reference/transapp_deadlock.html
Contributor
Author
Nvm, SZKOpuł's filetracker just went down, and at the start of the unusual part of the logs is: These WORKER TIMEOUTs look just like a deadlock... and they cause those processes to exit ungracefully, which corrupts the bsddb and requires a restart and an absurdly slow recovery. Start of previous downtime:
bsddb seems to use page-level locks, which makes running a deadlock detector necessary: two parallel transactions making, e.g., two writes each can deadlock even if no two of those four writes touch the same key, because different keys may still live on the same page.
From the documentation:
"[deadlock detection] is necessary for almost all applications in
which more than a single thread of control will be accessing the
database at one time. Even when Berkeley DB automatically handles
database locking, it is normally possible for deadlock to occur."
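To illustrate the page-level scenario described above, here is a minimal sketch in plain Python. It does not use bsddb at all: the key names, the two-keys-per-page layout, and the helper functions are all hypothetical, chosen only to show how two transactions touching four distinct keys can still end up waiting on each other.

```python
# Hypothetical layouts: with page-level locking two keys share each page;
# with key-level locking every key is its own lock unit.
PAGE_LEVEL = {"k1": 0, "k2": 0, "k3": 1, "k4": 1}
KEY_LEVEL = {"k1": "k1", "k2": "k2", "k3": "k3", "k4": "k4"}


def wait_for_edges(holds, wants, unit_of):
    """Return txn -> set of txns it waits for, given a key -> lock unit map."""
    edges = {}
    for txn, key in wants.items():
        wanted_unit = unit_of[key]
        for other, held_keys in holds.items():
            if other == txn:
                continue
            if any(unit_of[k] == wanted_unit for k in held_keys):
                edges.setdefault(txn, set()).add(other)
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the wait-for graph."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(visit(nxt) for nxt in edges.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(edges))


# Txn A has written k1 and now wants k3; txn B has written k4 and now wants k2.
holds = {"A": ["k1"], "B": ["k4"]}
wants = {"A": "k3", "B": "k2"}

page_level_deadlock = has_cycle(wait_for_edges(holds, wants, PAGE_LEVEL))
key_level_deadlock = has_cycle(wait_for_edges(holds, wants, KEY_LEVEL))
print(page_level_deadlock, key_level_deadlock)
```

With the page-level layout each transaction holds the page the other one needs, so the wait-for graph has a cycle. With one lock per key, the same four writes do not conflict.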
Almost all of my encounters with this issue were on fresh (a few days old) SIO2 instances.
My guess as to why it didn't happen on SZKOpuł (to the best of my knowledge) is that the database was already quite big when the migration from filetracker 1 was done, so the chance of two keys landing on the same page was quite low, especially since there are usually not many PUTs running in parallel.
I added a test demonstrating the issue. On my machine, it reproduced the deadlocks quite reliably.
The fix involves running a deadlock detector pass every time a lock isn't immediately available, which may abort one of the transactions involved. This required adding retries to transactions.
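The retry part can be sketched roughly as follows. This is not the code from this PR: `DeadlockError`, `FakeTxn` and `run_with_retries` are hypothetical stand-ins so that the example runs without bsddb. With the bsddb3 bindings, the corresponding pieces would presumably be `DBEnv.set_lk_detect(db.DB_LOCK_DEFAULT)` to enable the detector, `DBEnv.txn_begin()` to start a transaction, and `db.DBLockDeadlockError` as the exception to catch.

```python
class DeadlockError(Exception):
    """Stand-in for the error raised when the detector aborts a transaction."""


class FakeTxn:
    """Hypothetical transaction handle that only records how it ended."""

    def __init__(self):
        self.state = "open"

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_with_retries(begin, body, max_retries=5):
    """Run body(txn) in a fresh transaction, retrying after deadlock aborts."""
    for attempt in range(max_retries + 1):
        txn = begin()
        try:
            result = body(txn)
        except DeadlockError:
            # The detector picked this transaction as the victim: roll back
            # and start over, unless the retry budget is exhausted.
            txn.abort()
            if attempt == max_retries:
                raise
            continue
        except BaseException:
            # Any other failure is not retried.
            txn.abort()
            raise
        txn.commit()
        return result


# Usage: a body that is "killed" twice before it succeeds.
started = []
calls = {"count": 0}


def begin():
    txn = FakeTxn()
    started.append(txn)
    return txn


def flaky_body(txn):
    calls["count"] += 1
    if calls["count"] <= 2:
        raise DeadlockError("aborted by deadlock detector")
    return "stored"


outcome = run_with_retries(begin, flaky_body)
print(outcome, [t.state for t in started])
```

Each attempt gets a fresh transaction, since a transaction aborted by the detector cannot be reused. The body has to be safe to re-run from the start.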
The performance penalty from this should be negligible, especially since we don't have many conflicts anyway (thanks to using per-link and per-blob file locks).
Documentation of the needed setting: https://bsdwatch.net/docs/sharedocs/db5/api_reference/C/envset_lk_detect.html.