The History of Async Rust

February 27, 2024

I'm writing this post because I've seen a lot of backlash and misunderstanding regarding async/await in Rust, both about the syntax itself and as direct criticism of the implementation. I think it's important to understand the reasons behind the implementation details to understand why async Rust exists the way it does. It's certainly not perfect, and not the only way to implement async in modern languages, but my hope is that you'll understand why it had to be implemented in Rust the way that it was.

⚠️ Disclaimer: This post is a work in progress and is not complete. If you are reading this and catch any errors, please let me know! @ielm

Introduction

The async/await syntax was accepted into the Rust language in May 2018 after a massive effort by many brilliant people. It was introduced by RFC #2394 and RFC #2395, later revised in RFC #2418, and stabilized in Rust 1.39 in November 2019.

Asynchronous programming is an architecture that allows a process to initiate a long-running operation without blocking other operations from running at the same time. Non-blocking async is often implemented using the async/await syntactic pattern, which allows asynchronous functions to be defined in much the same way as synchronous ones. The details of the syntax vary slightly between languages, but almost all implementations use the notion of a "future" or a "promise", which acts as a proxy for the result of an async operation until it resolves.[1,2]¹

Async/await in Rust is implemented using a mechanism called cooperative scheduling[3], in contrast to the preemptive scheduling[4,5] used for OS threads. The basic difference between the two is that under the preemptive model, the OS thread scheduler can step in and hand control from one thread to another at any time, so tasks can be forcibly suspended. Under the cooperative model, once a task is given control, it continues to run until it explicitly yields control or blocks. This difference in scheduling models carries significant implications for async in Rust.
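Concretely, the places where an async Rust task can yield control are its .await points. As a small illustration, here is a sketch using Tokio's yield_now (an explicit cooperative yield; the surrounding work is elided):

async fn be_polite() {
    // ... do a chunk of work ...
    tokio::task::yield_now().await; // hand the thread back to the executor
    // ... do the next chunk of work ...
}

Between yield points the task has the thread to itself, which is why a task that never awaits can starve its neighbors.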

A naive approach to developing an application that handles multiple tasks concurrently is to create a new thread² for each task. This works fine for a small number of tasks, but it becomes problematic as the number of tasks increases: the more threads created, the greater the strain on system resources. Different programming languages tackle this issue in different ways, but the basic idea is the same: swap out the task that is running on each thread with tasks that are waiting to run, so that all tasks get the chance to make progress.
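As a sketch of that naive model, here's a thread-per-connection TCP server (the address and the connection handling are placeholders):

use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let stream = stream?;
        // One OS thread per connection: each thread reserves its own
        // stack, so thousands of mostly-idle connections get expensive.
        thread::spawn(move || {
            let _ = stream; // handle the connection here
        });
    }
    Ok(())
}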

The underlying mechanism for asynchronous computation in Rust is the notion of a "task": a lightweight thread of execution, many of which can be cooperatively scheduled onto a single operating system thread.³

The async/await syntax in Rust provides the following mechanism for function definitions:

async fn function(argument: &str) -> usize {
     // ...
}

This is equivalent to a function that returns an anonymous future type (what you would write as impl Future<Output = usize> + 'a in surface syntax):

fn function<'a>(argument: &'a str) -> _Anonymous<'a, usize> {
     // ...
}

As we'll see later, the goal of this syntax is for the compiler to transform async functions into functions that evaluate to futures so that users don't have to write futures themselves.

pub trait Future {
    /// The type of value produced on completion.
    type Output;

    fn poll(self: PinMut<Self>, cx: &mut task::Context) -> Poll<Self::Output>;
}

In order to cooperatively schedule tasks, a task that would otherwise block arranges to be woken up later and hands control back to the executor that's driving the future to completion. The executor drives a future by repeatedly polling it, and poll always returns a Poll value:

pub enum Poll<T> {
    /// Represents that a value is immediately ready.
    Ready(T),
    /// Represents that a value is not ready yet.
    Pending,
}
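To make this concrete, here's a minimal hand-written future against today's std API (note that the stabilized poll signature differs slightly from the RFC-era one quoted above). It returns Pending on the first poll, immediately requests a wakeup, and completes on the second poll:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // A real future would stash the waker and call wake() when the
            // underlying resource (a socket, a timer) becomes ready.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}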

Futures do nothing unless polled by an executor. There are many executors in the Rust ecosystem; the most popular is Tokio. In the RFC-era design, executors were defined as a trait in the standard library's core::task module (today they live in ecosystem crates instead):

trait Executor {
    fn spawn(
        &mut self,
        task: impl Future<Output = ()> + Send + 'static,
    ) -> Result<(), SpawnErrorKind> {
        self.spawn_obj(TaskObj::new(task))
    }
    // ...
}
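For comparison, here's what spawning a task looks like in practice today with Tokio (this is Tokio's real API; spawn lives in the tokio crate, not in std):

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        // ... some async work ...
        42
    });
    // Awaiting the JoinHandle yields a Result, since the task may panic.
    assert_eq!(handle.await.unwrap(), 42);
}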

Now that we have a basic idea of how async works, let's dive a little bit into how we got here.

Towards External Iterators

The origins of async Rust trace back to a 2013 mailing-list post on external iterators by Daniel Micay. External iterators and futures in Rust are deeply connected, with shared roots in the evolution of Rust's ownership and borrowing model. At that time, the borrowing model was still developing; references had only recently been promoted to typed entities that could be embedded in structs, following Niko Matsakis's first implementation of lifetime analysis in 2012. That change laid the groundwork for external iterators, and subsequently for futures.

Previously, Rust's iterators were internal: you handed the collection a closure, and the collection ran it on each item until told to stop. But there was a problem: the language couldn't guarantee that the iterator would actually stop when you wanted it to. This made it impossible to reason about the safety of the memory being used, ruling out things like returning references out of a loop, because the loop might not actually stop when expected. It also meant these iterators couldn't be used to implement generic combinators that interleaved multiple iterators, because you couldn't reliably alternate between them. Micay's proposal to shift towards external iterators solved these problems and led to the interface we're used to today:

trait Iterator {
    type Item;

    fn next(&mut self) -> Option<Self::Item>;
}
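That control inversion is what makes combinators like interleaving possible: the caller decides when each iterator advances. A sketch:

// Alternate items from two iterators until both are exhausted: exactly
// the kind of combinator that internal iterators couldn't express.
fn interleave<T>(
    mut a: impl Iterator<Item = T>,
    mut b: impl Iterator<Item = T>,
) -> Vec<T> {
    let mut out = Vec::new();
    loop {
        match (a.next(), b.next()) {
            (Some(x), Some(y)) => {
                out.push(x);
                out.push(y);
            }
            (Some(x), None) => out.push(x),
            (None, Some(y)) => out.push(y),
            (None, None) => break,
        }
    }
    out
}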

External iterators integrated perfectly with Rust's ownership and borrowing model, since each one compiles to a struct that encapsulates the state of iteration. That encapsulation not only made it possible to hold safe references to the data structures being traversed, it also enhanced compilation efficiency: thanks to monomorphization⁴, a complex iterator built by assembling multiple combinators compiles into a single struct, making it transparent to the optimizer.
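To see what that means in practice, consider a small combinator chain. Every adapter wraps the previous one in a concrete struct, so the whole pipeline is a single stack-allocated type the optimizer can see through:

fn squares_of_evens(v: &[i32]) -> Vec<i32> {
    // The type of this chain is roughly
    // Map<Filter<std::slice::Iter<'_, i32>, {closure}>, {closure}>:
    // one nested struct, no heap allocation, fully inlinable.
    v.iter()
        .filter(|x| **x % 2 == 0)
        .map(|x| x * x)
        .collect()
}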

Back to the Future

This same principle was later applied to futures, with Rust's type system ensuring that asynchronous computations can be composed and executed while maintaining memory safety. By conceptualizing asynchronous operations as state machines, Rust enables these operations to be polled for their readiness. This model eliminates the need for callbacks and unnecessary memory allocations, leading to more efficient and safer asynchronous code.

This representation of async processes as state machines led to one of the big criticisms of async in Rust: the language chose a readiness model instead of a completion model. The readiness (or demand-driven) approach essentially tells a process “let me know when you're ready to do this work,” as opposed to the completion (or callback-driven) approach, which says “let me know when you're done with this work.”

The criticism here is that the implementation of futures under the readiness model leads to downstream problems such as the “cancellation problem.” I'll write more deeply about the cancellation problem in a different post, but the basic idea is that an async process can be stopped at any await point, discarding whatever work was in flight. The issue: suppose you've locked a Mutex and taken the state out of it, intending to put an updated state back after an await. If the future is cancelled at that await, the guard's destructor runs as though the function had finished, unlocking the Mutex, but it is now guarding a None. When the next task comes along and locks it, it finds that the state was lost.

struct SomeState {
    state: Mutex<Option<SystemState>>, // an async Mutex, e.g. tokio::sync::Mutex
}

impl SomeState {
    async fn advance(&self) {
        let mut system_state = self.state.lock().await;
        let current_state = system_state
            .take()
            .expect("invariant violated: 'state' was None!");
        // If this future is cancelled at the await below, the guard drops
        // (unlocking the Mutex) while the state inside is still None.
        *system_state = Some(current_state.advance().await);
    }
}

While cancellation is a valid criticism of the readiness model, a lot of people don't realize that the initial implementation of futures in Rust was actually based on the completion model. Aaron Turon and Alex Crichton began by copying the futures API used in many other languages, based on what's known as “continuation-passing style.” These futures take a callback as an argument, called the continuation, and call it to signal that the future is complete. That kind of API would have looked similar to this:

trait Future {
    type Output;

    fn schedule(self, continuation: impl FnOnce(Self::Output));
}

In his blog post on the subject, Turon calls this iteration a “false start,” noting that “this approach nevertheless forces allocation at almost every point of future composition.” He gives the example of joining two futures so that they run concurrently:

fn join<F, G>(f: F, g: G) -> impl Future<Output = (F::Output, G::Output)>
where
    F: Future,
    G: Future,
{
    // ...
}

This example requires the joined future's continuation to be shared by both F and G, each of which also needs its own callback. For the join continuation to be callable from either F or G, it would have to be heap-allocated behind something like an Arc. This ran directly against the “zero-cost” composition of futures.
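Here's a sketch of where the allocation comes from (hypothetical types, not the historical API): either future may finish first, possibly on another thread, so the continuation and the first-arriving result have to live in shared, heap-allocated storage:

use std::sync::{Arc, Mutex};

// Shared state for a CPS-style join. Whichever side finishes second must
// find the other result and the continuation, so both live behind an Arc.
struct JoinState<A, B> {
    left: Option<A>,
    right: Option<B>,
    continuation: Option<Box<dyn FnOnce(A, B) + Send>>,
}

// One Arc (plus a Box for the continuation) per join: exactly the
// per-composition allocations Turon wanted to avoid.
type SharedJoinState<A, B> = Arc<Mutex<JoinState<A, B>>>;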

Thus, “after much soul-searching,” Turon and Crichton took inspiration from how non-blocking IO is implemented in C: building a state machine. This is what led to the readiness model, where instead of storing a continuation, a future is polled by an external executor to drive it to completion. When a future is pending, it stores a way to wake that executor, which is invoked when the future is ready to be polled again. epoll on Linux is based on this same readiness model. This shift from a callback approach to an external driver, along with the compilation of combinators into a single state machine, mirrors the earlier shift from internal to external iterators.

The Development of the Async/Await Syntax (WIP)

From 2017 to 2019, the development of the async/await syntax was led by Boats. Until that point, async/await was implemented as a macro library by Crichton. The macro implementation produced some gnarly errors whenever a user referenced the future's state across an await point, because a future could not hold a reference to its own state while awaiting. This is closely related to the self-referential problem that shaped the original implementation of futures.

The goal of the futures project was to transform async functions into functions that evaluate to futures, so that users wouldn't have to write those state machines themselves. One of the issues Boats and the team ran into was that a future compiled to a state machine amounts to a "perfectly-sized stack": a stack that never needs to be resized. But the state machine was represented as a struct, and structs in Rust are always safe to move, even though there's no need to move a future while it's being executed. This led to the requirement of marking a future as "immovable."

The original attempt at a solution was a Move trait, which would exclude coroutines from APIs that can move them. The next idea was to make the poll method unsafe, with the invariant that once you have started polling a future, you may not move it again; this would have made every future unsafe to use, and would certainly have generated a lot of controversy. The final attempt introduced what we now know as the Pin type, which makes it possible to enforce that the wrapped object cannot be moved. According to the std::pin documentation: "a Pin<P> ensures that the pointee of any pointer type P has a stable location in memory, meaning it cannot be moved elsewhere and its memory cannot be deallocated until it gets dropped."
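To illustrate why movability matters, here's a sketch (not real compiler output; read_into is a placeholder) of the state an async fn produces when it borrows one of its own locals across an await point:

// async fn example() {
//     let data = [0u8; 32];
//     let slice = &data[..];   // borrows a local...
//     read_into(slice).await;  // ...and holds it across the await
// }
//
// Conceptually, the compiler generates state like this:
struct ExampleFuture {
    data: [u8; 32],
    slice: *const u8, // points into `data` above: self-referential!
    // ... plus a discriminant for which await point we're suspended at
}
// Moving an ExampleFuture after `slice` is set would leave the pointer
// dangling, which is exactly what Pin rules out.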

Conclusion (TODO)

...


Footnotes

¹There is a difference between a future and a promise - a future is considered a read-only view of a variable, whereas a promise is a writable container that sets the value of the future.

²A simple mental model of threads is that they are an abstraction over the cores in a processor, each core being an independent processing unit that can work semi-independently. Obviously, threads and cores aren't 1:1; there are a finite number of cores, and threads are a virtual abstraction built on top of them.

³Threads are OS-specific; Windows and GNU/Linux treat thread spawning quite differently. (I know very little about Windows, sorry...) In Linux, threads are basically just processes with a shared address space, and Linux caps the number of threads per process using the following formula: number of threads = total virtual memory / (stack size * 1024 * 1024). Processes are usually created with fork, and threads (which can be considered lightweight processes) are usually created with clone. (There also exist 1:N threading models, but we won't get into those here.) Under the hood, Linux doesn't distinguish between threads and processes; there are only "tasks," which differ in how much context they share with their parent task. Set a ton of flags on the clone call and you get a task that works like a process; clear them all and you get a task that works like a thread; you can even do things in between. Both fork and clone map to the same kernel function, do_fork, internally. That function creates either a lightweight process that shares the address space with its parent or a separate process (among many other options), depending on the flags you feed it. The clone syscall is effectively a direct forwarding of that kernel function, whereas fork wraps it in broader functionality. The important difference is that fork guarantees a complete, separate copy of the address space, implemented via copy-on-write, while a new thread reuses the original address space and the same memory. Windows processes are a different story: there, processes really are the heaviest things in the universe compared to threads.

⁴Monomorphization is a compile-time process where the compiler creates a separate copy of the code of a generic function for each concrete type that's needed.


References

[1] Baker Jr, Henry C., and Carl Hewitt. "The incremental garbage collection of processes." ACM SIGART Bulletin 64 (1977): 55-59.

[2] Friedman, Daniel P., and David S. Wise. "Aspects of applicative programming for parallel processing." IEEE Transactions on Computers C-27.4 (1978): 289-296.

[3] Saewong, Saowanee, and Ragunathan Rajkumar. "Cooperative scheduling of multiple resources." Proceedings 20th IEEE Real-Time Systems Symposium (Cat. No. 99CB37054). IEEE, 1999.

[4] Gonzalez, Teofilo, and Sartaj Sahni. "Preemptive scheduling of uniform processor systems." Journal of the ACM (JACM) 25.1 (1978): 92-101.

[5] Zhao, Wei, Krithi Ramamritham, and John A. Stankovic. "Preemptive scheduling under time and resource constraints." IEEE Transactions on Computers C-36.8 (1987): 949-960.