[APMSP-1961] Add fork support by dropping runtime #1056
Changes from all commits: 5b07491, 7e37d00, 2eabbed, 8b79aea, 590a60a, 736916b, efc38ea, f7e5c6e, 9a213ea, ad49ce7, da6dee5, 845781d, 104d270, 17ba3b8, e507732, f403d1e, 5b721f5, a85c21c, 87d96ad
New file (@@ -0,0 +1,172 @@):

```rust
// Copyright 2025-Present Datadog, Inc. https://www.datadoghq.com/
// SPDX-License-Identifier: Apache-2.0

//! Defines a pausable worker to be able to stop background processes before forks

use ddcommon::worker::Worker;
use std::fmt::Display;
use tokio::{
    runtime::Runtime,
    select,
    task::{JoinError, JoinHandle},
};
use tokio_util::sync::CancellationToken;

/// A pausable worker which can be paused and restarted on forks.
///
/// Used to allow a [`ddcommon::worker::Worker`] to be paused, saving its state when dropping
/// a tokio runtime so that it can be restarted with the same state on a new runtime. This is
/// used to stop all threads before a fork to avoid deadlocks in the child.
///
/// # Time-to-pause
/// The worker's main loop should yield regularly to reduce time-to-pause. See
/// [`tokio::task::yield_now`].
///
/// # Cancellation safety
/// The main loop can be interrupted at any yield point (`.await`ed call). The state of the
/// worker at that point will be saved and used to restart the worker. To be able to safely
/// restart, the worker must be in a valid state at every call to `.await`.
/// See [`tokio::select#cancellation-safety`] for more details.
#[derive(Debug)]
pub enum PausableWorker<T: Worker + Send + Sync + 'static> {
    Running {
        handle: JoinHandle<T>,
        stop_token: CancellationToken,
    },
    Paused {
        worker: T,
    },
    InvalidState,
}

#[derive(Debug)]
pub enum PausableWorkerError {
    InvalidState,
    TaskAborted,
}

impl Display for PausableWorkerError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            PausableWorkerError::InvalidState => {
                write!(f, "Worker is in an invalid state and must be recreated.")
            }
            PausableWorkerError::TaskAborted => {
                write!(f, "Worker task has been aborted and state has been lost.")
            }
        }
    }
}

impl core::error::Error for PausableWorkerError {}

impl<T: Worker + Send + Sync + 'static> PausableWorker<T> {
    /// Create a new pausable worker from the given worker.
    pub fn new(worker: T) -> Self {
        Self::Paused { worker }
    }

    /// Start the worker on the given runtime.
```
**Contributor:** For the public functions, should we have more thorough rustdoc comments with examples? Or perhaps just a more thorough rustdoc comment for the module or enum that explains why this is necessary and when it should be used? If someone outside of our team needs to implement a worker in the future, do they have enough information to do so independently?
```rust
    ///
    /// The worker's main loop will be run on the runtime.
    ///
    /// # Errors
    /// Fails if the worker is in an invalid state.
    pub fn start(&mut self, rt: &Runtime) -> Result<(), PausableWorkerError> {
        if let Self::Running { .. } = self {
            Ok(())
        } else if let Self::Paused { mut worker } = std::mem::replace(self, Self::InvalidState) {
            // Worker is temporarily in an invalid state, but since this block is failsafe it
            // will be replaced by a valid state.
            let stop_token = CancellationToken::new();
            let cloned_token = stop_token.clone();
            let handle = rt.spawn(async move {
                select! {
                    _ = worker.run() => { worker }
                    _ = cloned_token.cancelled() => { worker }
                }
            });

            *self = PausableWorker::Running { handle, stop_token };
            Ok(())
        } else {
            Err(PausableWorkerError::InvalidState)
        }
    }

    /// Pause the worker, saving its state so it can be restarted.
    ///
    /// # Errors
    /// Fails if the worker handle has been aborted, preventing the worker from being retrieved.
    pub async fn pause(&mut self) -> Result<(), PausableWorkerError> {
        match self {
            PausableWorker::Running { handle, stop_token } => {
                stop_token.cancel();
```
**Contributor:** Is there a possible race condition here? If the task is already shutting down and pause is called, can we wind up in an InvalidState when we don't want to?

**Contributor (Author):** I think we're safe since pause takes a mutable reference.

**Contributor:** Yes, which is why I wasn't concerned with
Looking at this some more, I agree that we are probably ok.
```rust
                if let Ok(worker) = handle.await {
```
**Contributor:** What happens if we have workers that don't frequently yield, or wind up deadlocked? Are we going to await forever? Or, if not forever, long enough that it causes problems?

**Contributor (Author):** Good catch, I'll add a warning in the doc and a timeout in

**Contributor (Author):** Actually, tokio's timeout won't work, as it will only be checked when the task yields, so we'll have the same issue. I think we can keep the warning for now and add the timeout as a follow-up item.

**Member:** For what it's worth, tokio supports "cancellation tokens", which is how in profiling we stop tokio if folks e.g. ctrl+c and the profiler is uploading a profile on a slow connection.

**Contributor (Author):** The pausable worker uses a cancellation token internally, but we still have to wait for the worker to yield. I don't think there's a way to force the worker to yield earlier, though.

**Contributor:** I think it's possible, but it would require a big refactor and add significant complexity to our workers. I don't think it's worth pursuing until we see evidence this solution isn't sufficient. I think the only real problem is high-frequency forking apps. We need to support that situation. I guess the only way to do that is to ensure the workers are as efficient as possible and yield appropriately? I don't think there is anything that can be practically done in
```rust
                    *self = PausableWorker::Paused { worker };
                    Ok(())
                } else {
                    // The task has been aborted and the worker can't be retrieved.
                    *self = PausableWorker::InvalidState;
                    Err(PausableWorkerError::TaskAborted)
                }
            }
            PausableWorker::Paused { .. } => Ok(()),
            PausableWorker::InvalidState => Err(PausableWorkerError::InvalidState),
        }
    }

    /// Wait for the run method of the worker to exit.
    pub async fn join(self) -> Result<(), JoinError> {
        if let PausableWorker::Running { handle, .. } = self {
            handle.await?;
        }
        Ok(())
    }
}

#[cfg(test)]
mod tests {
    use tokio::{runtime::Builder, time::sleep};

    use super::*;
    use std::{
        sync::mpsc::{channel, Sender},
        time::Duration,
    };

    /// Test worker incrementing the state and sending it with the sender.
    struct TestWorker {
        state: u32,
        sender: Sender<u32>,
    }

    impl Worker for TestWorker {
        async fn run(&mut self) {
            loop {
                let _ = self.sender.send(self.state);
                self.state += 1;
                sleep(Duration::from_millis(100)).await;
            }
        }
    }

    #[test]
    fn test_restart() {
        let (sender, receiver) = channel::<u32>();
        let worker = TestWorker { state: 0, sender };
        let runtime = Builder::new_multi_thread().enable_time().build().unwrap();
        let mut pausable_worker = PausableWorker::new(worker);

        pausable_worker.start(&runtime).unwrap();

        assert_eq!(receiver.recv().unwrap(), 0);
        runtime.block_on(async { pausable_worker.pause().await.unwrap() });
        // Empty the message queue and get the last message
        let mut next_message = 1;
        for message in receiver.try_iter() {
            next_message = message + 1;
        }
        pausable_worker.start(&runtime).unwrap();
        assert_eq!(receiver.recv().unwrap(), next_message);
    }
}
```
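For context on how this type is meant to be used around a fork, a non-compiling sketch follows; `MyWorker`, `new_runtime`, and the surrounding call sites are hypothetical, and the actual fork integration lives elsewhere in this PR:

```rust
// Hypothetical fork flow (sketch, not from this PR):
//
// let mut worker = PausableWorker::new(MyWorker::new());
// worker.start(&runtime)?;
//
// // Before forking: stop all background work and save worker state.
// runtime.block_on(worker.pause())?; // worker is now Paused { .. }
// drop(runtime); // no runtime threads survive into the child
//
// // After forking: fresh runtime, same saved state.
// let runtime = new_runtime()?;
// worker.start(&runtime)?;
```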
**Reviewer:** Do you need to impl `Drop`? What happens if `PausableWorker` goes out of scope? Are tokio tasks going to continue to run in the background? Should you be canceling the `stop_token`?

**Author:** The task is going to continue in the background until the runtime is dropped. I think it is fine not to cancel the token. Either way, this should be described in the doc.

**Reviewer:** I don't have a very strong opinion, but why is it ok for the task to continue? Is it not likely that workers can go out of scope but the runtime sticks around?

**Author:** I don't have a strong opinion either. The way I see it, the PausableWorker is more of a handle to the worker, which is running in the runtime.
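If eager cancellation on drop were ever wanted, one possible shape of the `Drop` impl discussed in this thread could be the following sketch (not included in this PR):

```rust
// Hypothetical best-effort cancellation on drop (not part of this PR).
// The runtime still owns the spawned task, so this only requests that the
// task stop at its next yield point; it does not wait for it to finish.
//
// impl<T: Worker + Send + Sync + 'static> Drop for PausableWorker<T> {
//     fn drop(&mut self) {
//         if let PausableWorker::Running { stop_token, .. } = self {
//             stop_token.cancel();
//         }
//     }
// }
```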