Async stack traces in folly: Synchronous and asynchronous stack traces

September 23, 2021ByLee Howes

This article was written by Lee Howes and Lewis Baker from Facebook.

This is the second in a series of posts covering how we have used C++ coroutines at Facebook to regain stack traces for dependent chains of asynchronous waiting tasks. In the previous blog post we talked about the work we have done to implement stack traces for asynchronous coroutine code. Here we’ll go into more detail on the technical differences and challenges involved in implementing async stack traces on top of C++ coroutines, compared with traditional stack traces.

Normal synchronous stack traces

With normal stacks, when you call a function the compiler generates code that automatically maintains a linked list of stack frames and this list represents the call stack. At the start of each frame is a structure that (at least on Intel architectures) looks like this:

struct stack_frame {
  stack_frame* nextFrame;
  void* returnAddress;
};

This structure is usually filled out by specialised assembly instructions. e.g. in x86_64, a caller will execute a call instruction, which pushes the return address on the stack and jumps to the function entry point. Then the first instructions of the callee pushes the ebp register (which usually holds the pointer to the current stack_frame structure) onto the stack and then copies the esp register (which now contains the pointer to the stack_frame structure we just populated) into the ebp register.

For example:

caller:
  ...
  call callee         # Pushes address of next instruction onto stack,
                      # populating 'returnAddress' member of 'stack_frame'.
                      # Then jumps to 'callee' address. 
  mov rsp[-16], rax   # Save the result somewhere
  ...
 
callee:
  push rbp           # Push rbp (stack_frame ptr) onto stack (populates 'nextFrame' member)
  mov rbp, rsp       # Update rbp to point to new stack_frame
  sub rsp, 16        # Reserve an additional 16 bytes of stack-space
  ...
  mov rax, 42        # Set return-value to 42
  leave              # Copy rbp -> rsp, pop rbp from stack
  ret                # Pop return address from top of stack and jump to it

When a debugger or profiler captures a stack trace for a given thread, it obtains a pointer to the first stack frame from the thread's ebp register and then starts walking this linked list until it reaches the stack root, recording the return addresses it sees along the way in a buffer.

Subsequent profiling tools may then translate the addresses to function names and/or file+line numbers using a symbolizer that makes use of debug info for the binary, and this information may be logged or displayed as useful for the tool in question.

Usually these stack frames live in a single contiguous memory region and the data structure looks a bit like this:

If we want to walk the async-stack trace instead of the normal stack trace then we still want to first start walking normal stack frames just like we do for a normal stack trace. A coroutine may call normal functions and we want to include the frames for these normal function calls in the stack trace. However, when we get to the frame corresponding to the top-most coroutine (in this case coro_function_1) we do not want to follow the 'nextFrame' link into the coroutine_handle::resume method as the normal stack walking would. Instead we need a link to the waiting coroutine. At this point in the trace the histories for the normal stack trace and the async-stack trace diverge. To walk an async-stack trace involves answering a few questions:

How do we identify which stack frames correspond to async frames?
How do we find the address of the next async frame?
How and where do we allocate storage for the async frame data?

Before we can implement any of this, we need to understand a bit more about how coroutines are structured.

How do coroutine “stacks” differ from normal stacks?

When you call a coroutine in C++, this allocates storage for a coroutine frame. The allocation is usually obtained from the heap although the compiler is free to optimise out this allocation in some circumstances, for example by inlining the allocation into the frame of the caller.

The compiler uses the coroutine frame storage to store all of the state that needs to be preserved when the coroutine is suspended so that it is available when the coroutine is later resumed. This usually includes storage for function parameters, local variables, temporaries and any other state the compiler deems necessary, such as at which suspend-point the coroutine is suspended.

The coroutine frame also includes storage of a special object, the coroutine promise, which controls the behaviour of the coroutine. The compiler lowers your coroutine function into a sequence of calls to methods on the coroutine promise object at certain key points within the coroutine body, in addition to the user-written code of the coroutine. The coroutine promise controls the behaviour of the coroutine by implementing the desired behaviour in these methods. For more details about the promise type see the blog-post Understanding the promise type.

The promise type is determined based on the signature of the coroutine function and for most coroutine types is based solely on the return-type of the coroutine. This allows coroutines that return a given return type (eg. folly::coro::Task<T>) to store additional per-coroutine-frame data within the coroutine frame by adding data members to the promise type.

For the clang implementation, the layout of a coroutine frame for a given coroutine function looks a bit like this:

struct __foo_frame {
  using promise_type = typename std::coroutine_traits<Ret, Arg1, Arg2>::promise_type;
  void(*resumeFn)(void*);  // coroutine_handle::resume() function-pointer
  void(*destroyFn)(void*); // coroutine_handle::destroy() function-pointer
  promise_type promise;    // coroutine promise object
  int suspendPoint;        // keeps track of which suspend-point coroutine is suspended at
  char extra[458];         // extra storage space for local variables, parameters,
                           // temporaries, spilled registers, etc.
};

When a coroutine is suspended, all of the state for that coroutine invocation is stored in the coroutine frame and there is no corresponding stack frame. However, when a coroutine is resumed on a given thread, this activates a stack frame for that coroutine on that thread's stack (like a normal function) and this stack frame is used for any temporary storage whose lifetime does not span a suspend point.

When a coroutine body is currently executing, the pointer to the current coroutine frame is usually held in a register, which allows it to quickly reference state within the coroutine frame. However, this address may also be spilled into the stack frame in some cases, in particular when the coroutine is calling another function.

Thus, when a coroutine is active and has called another function, the memory layout will look something like this:

Note in particular that walking the asynchronous version of the stack means diverging into heap memory by following the framePointer in coro_function_1. Unlike the pointer to the current stack frame, which can generally be assumed to be stored in the rbp register, there is no standard location for the pointer to the coroutine frame. This has implications for how we are able to navigate from a stack-frame to its corresponding async-frame data.

Chaining asynchronous stack frames

To be able to produce a stack trace that represents the async-call chain instead of the normal synchronous call chain, we need to be able to walk the chain of coroutine frames, recording the return/continuation address of each coroutine as we walk the chain.

The first piece of the puzzle to be solved is storing the state needed to be able to walk the async stack-frames during an async stack trace. For each async-stack frame we need to be able to determine the address of the next async-stack frame and we need to be able to determine the return address of the current stack-frame.

One of the constraints to be aware of here is that the code that is going to be walking the stack trace is not necessarily going to have access to debug information for the program. Profiling tooling, for example, may want to sample function offsets only and symbolize later, as it does for synchronous stack traces. We must be able to walk the async stack without needing additional complex data structures which could make the stack walk overly expensive. For example, profiling tools built on the Linux eBPF facility must be able to execute in a deterministic, finite amount of time.

There is technically enough information in the folly::coro::Task‘s promise type, folly::coro::TaskPromise, to be able to walk to the next frame, as it's already storing the coroutine_handle for the continuation and the coroutine frame of that continuation already encodes information about which suspend point of which coroutine function is awaiting it in its resumeFn and suspendPoint members. However, there are some challenges with trying to use this information directly in walking an async-stack trace.

Representing a chain of async stack-frames

If we have a pointer to a coroutine-frame stored in a coroutine_handle then in theory, if we know the layout of the promise object, we can calculate the address of the 'continuation' member of the promise which contains the address of the next coroutine-frame simply by adding a constant offset to the coroutine-frame pointer.

One approach is that we might require that all promise types store a coroutine_handle with the continuation as their first data-member:

template<typename T>
struct TaskPromise {
  std::coroutine_handle<void> continuation;
  Try<T> result;
  ...
};
 
struct __some_coroutine_frame {
  void(*resumeFn)(void*);
  void(*destroyFn)(void*);
  TaskPromise<int> promise;
  int suspendPoint;
};

Then, even if we do not know the concrete promise type, we know that its first member is a coroutine_handle and that the promise is placed immediately after the two function pointers. From the perspective of a debugger walking the async-stack trace it could assume that coroutine frames look like:

struct coroutine_frame {
  void(*resumeFn)(void*);
  void(*destroyFn)(void*);
  coroutine_frame* nextFrame;
};

Unfortunately, this approach breaks down when the promise-type is overaligned (that is, it has an alignment larger than 2 pointers: 32 bytes or larger on 64-bit platforms). This can happen if the folly::coro::Task<T> type is instantiated for an overaligned type, T, for example, a matrix type optimised for use with SIMD instructions.

In such cases the compiler inserts padding between the function pointers and the promise object in the structure to ensure that the promise is correctly aligned. This variation in layout makes it much more difficult to determine what offset to look at for the next coroutine-frame address because it is type dependent; the debugger needs to know something about the layout of the promise type to be able to calculate the offset.

In theory we could look at the value of the resumeFn/destroyFn pointers to lookup the promise type that corresponds to the coroutine body in a translation table, but this would either require debug information or by modifying the compiler to encode this information in the binary. We cannot assume the availability of debug info, and modifying the compiler is a much larger project. Other approaches are possible, such as changing the ABI of coroutine-frames to eliminate the padding, but these also require compiler changes, and would make the implementation compiler dependent.

The approach we took instead is to insert a new `folly::AsyncStackFrame` data-structure as a member of the coroutine promise and use these to form an intrusive linked list of async frames. That is, a structure that looks something like this:

namespace folly {
  struct AsyncStackFrame {
    AsyncStackFrame* parentFrame;
    // other members...
  };
}

that can then be added as a member to the coroutine promise objects:

namespace folly::coro {
  class TaskPromiseBase {
    ...
  private:
    std::coroutine_handle<> continuation_;
    AsyncStackFrame asyncFrame_;
    ...
  };

Whenever we launch a child coroutine by co_awaiting it we can hook up that child coroutine's AsyncStackFrame so that its parentFrame member points to the parent coroutine's AsyncStackFrame object.

Using a separate data structure gives us a lot of flexibility in how we represent async-stack traces. It insulates the data structures from any dependence on compiler internals, and will allow us to reuse AsyncStackFrame objects for non-coroutine async operations in future. It comes at a small memory and run time cost as we now effectively have two pointers to the parent coroutine to store and maintain. This decision can be revisited in future if we later want to squeeze some more performance by making some of the previously mentioned compiler changes.

Now we have a way to represent a chain of async-frames that can be walked by a debugger without needing to know anything about the concrete promise type. In the next post in the series we will look at how to determine the return-address of a coroutine and how to use these data structures to hook coroutine frames into a chain at runtime.

Other Blogs in this Series

The first post gives a high level background.
The third post shows how we tie async stack frames together into a chain.
The fourth post, and the last to discuss the infrastructure of async stack traces, covers walking async and synchronous stacks together and how all of this connects together.
The final post in the series is about integration work we have done to add the folly support discussed above: crash traces, on-demand printing, gdb integration and exception tracing.

To learn more about Facebook Open Source, visit our open source site, subscribe to our YouTube channel, or follow us on Twitter and Facebook.

Interested in working with open source technologies at Facebook? Check out our open source-related job postings on our career page.

Tags:

Developer Tools Open Source 2021 Open Source Guides

More News

Async stack traces in folly: Synchronous and asynchronous stack traces

Normal synchronous stack traces

How do coroutine “stacks” differ from normal stacks?

Chaining asynchronous stack frames

Representing a chain of async stack-frames

Other Blogs in this Series

Async stack traces in folly: Improving debugging in the developer lifecycle

Async Stack Traces for C++ Coroutines in Folly: Walking the async stack

Async stack traces in folly: Forming an async stack from individual frames