Exploratory Group - Embedded Heterogeneous Communication

For Software Engineers Who are NOT Focused on Communication, but Require It

Khronos is considering an open standardization initiative to unify point-to-point communication into a simple API with the aim of reducing application complexity, minimizing development costs, and improving time-to-market for high-performance embedded products. If successful, this new standard could transform the way applications are developed for heterogeneous systems and edge computing.

Khronos Exploratory Group Process

Current Issues With Existing APIs

No Single API Works With All Endpoint Variations

Edge computing applications typically need multiple point-to-point APIs to collect various types of sensor data and to distribute computing across processors, processes, and threads. Developing and maintaining code against several APIs at once is not trivial.

SOLUTION

A unified communication API should fit all endpoint variations.

Existing APIs are Not Simple or Intuitive

Engineers who focus on algorithm development rather than communication struggle with the steep learning curves of communication APIs. Setting up communication endpoints and invoking efficient transfers can take weeks or months to get right, when it should take hours or days.

SOLUTION

The functionality of a unified communication API should include only a few simple and intuitive concepts:

  • Create / Destroy Endpoint
  • Send --> Receive
  • Read / Write

Common APIs | Function Count | Typical Drawbacks
Sockets     | ~20            | Confusing naming / concepts; difficult to tune for performance
RDMA        | ~100           | Confusing naming / concepts; high learning curve, experts are rare
libFabrics  | ~100           | Confusing naming / concepts; high learning curve, experts are rare
MPI         | ~300           | Confusing transfer variations; 'mpirun' not portable and difficult to tune
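
To make the three concepts concrete, here is a minimal sketch of what such a surface could look like for a single in-process endpoint. Every name in it (toy_endpoint, toy_create, toy_send, and so on) is hypothetical and invented for illustration; none comes from an existing library or from any proposal. It also copies data through a one-slot mailbox, so it illustrates only the shape of the API, not zero-copy behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy endpoint: a one-message mailbox for two-sided transfers
   and a memory region exposed to one-sided read/write. */
typedef struct {
    unsigned char *mailbox;   /* holds at most one pending message */
    size_t mailbox_cap;
    size_t pending_len;       /* 0 means the mailbox is empty */
    unsigned char *region;    /* memory reachable by toy_read/toy_write */
    size_t region_len;
} toy_endpoint;

/* Create: allocate the endpoint; returns NULL on allocation failure. */
toy_endpoint *toy_create(size_t mailbox_cap, size_t region_len) {
    toy_endpoint *ep = calloc(1, sizeof *ep);
    if (!ep) return NULL;
    ep->mailbox = malloc(mailbox_cap);
    ep->region = calloc(region_len, 1);
    if (!ep->mailbox || !ep->region) {
        free(ep->mailbox);
        free(ep->region);
        free(ep);
        return NULL;
    }
    ep->mailbox_cap = mailbox_cap;
    ep->region_len = region_len;
    return ep;
}

/* Destroy: release everything owned by the endpoint. */
void toy_destroy(toy_endpoint *ep) {
    if (!ep) return;
    free(ep->mailbox);
    free(ep->region);
    free(ep);
}

/* Send (two-sided): deposit a message; fails if one is already pending. */
int toy_send(toy_endpoint *dst, const void *data, size_t len) {
    if (len == 0 || len > dst->mailbox_cap || dst->pending_len != 0) return -1;
    memcpy(dst->mailbox, data, len);
    dst->pending_len = len;
    return 0;
}

/* Receive (two-sided): returns bytes received, 0 if nothing is pending,
   or -1 if the caller's buffer is too small. */
long toy_recv(toy_endpoint *ep, void *out, size_t out_cap) {
    size_t n = ep->pending_len;
    if (n == 0) return 0;
    if (n > out_cap) return -1;
    memcpy(out, ep->mailbox, n);
    ep->pending_len = 0;
    return (long)n;
}

/* Write (one-sided): the remote side takes no action. */
int toy_write(toy_endpoint *remote, size_t offset, const void *data, size_t len) {
    if (offset > remote->region_len || len > remote->region_len - offset) return -1;
    memcpy(remote->region + offset, data, len);
    return 0;
}

/* Read (one-sided): the remote side takes no action. */
int toy_read(const toy_endpoint *remote, size_t offset, void *out, size_t len) {
    if (offset > remote->region_len || len > remote->region_len - offset) return -1;
    memcpy(out, remote->region + offset, len);
    return 0;
}

/* Exercises all three concepts; returns 0 if every step behaved as expected. */
int toy_demo(void) {
    toy_endpoint *ep = toy_create(64, 16);
    if (!ep) return -1;
    char in[8] = {0};
    int ok = toy_send(ep, "frame", 6) == 0
          && toy_recv(ep, in, sizeof in) == 6
          && strcmp(in, "frame") == 0
          && toy_write(ep, 4, "abcd", 4) == 0
          && toy_read(ep, 4, in, 4) == 0
          && memcmp(in, "abcd", 4) == 0;
    toy_destroy(ep);
    return ok ? 0 : -1;
}
```

Calling toy_demo() from a main function runs one create, one send and receive, one write and read, and one destroy. The whole surface is six functions, which is the scale of API the list above argues for.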

Existing APIs are Missing Critical Features

Embedded / edge computing has very different requirements from large homogeneous server farm computing. A missing feature may severely limit the choice of API, which may in turn force a much steeper learning curve. For example:

  • MPI appears very simple at face value, but if the engineer is designing for real-time (needs determinism) and must allow for dropped data (UDP-like protocol), then MPI can't be used, and the engineer may be forced to use the very complex RDMA.
  • Sockets can't pre-register memory addresses at creation time, so a TCP sender of a large message must block until the TCP receiver is called.

SOLUTION

A unified communication API should support these critical features:

  • Reliable transfers (every byte matters)
  • Unreliable transfers (allow dropping data)
  • Fault-tolerance hooks that let the application be fault tolerant (timeouts, disconnect detection, creating endpoints on the fly)
  • All endpoint localities: inter-thread, inter-process, inter-device
  • All hardware: CPUs, GPUs, FPGAs, etc.
  • Zero-copy and one-way (not just one or the other)
  • Non-blocking (so the CPU is not bogged down waiting)
  • Two-sided transfers (a coordination between send and recv)
  • One-sided transfers (read/write/atomics) where the remote endpoint is not involved
  • Multiple memory blocks per message (mix CPU and GPU)
  • Connect to 3rd party endpoints

Some underlying interconnects may not support all the features, but the API should not limit the interconnects that can.
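
As an analogy for the "multiple memory blocks per message" feature, POSIX scatter/gather I/O already lets one call send several separate buffers as a single message. The sketch below uses the standard writev() call over a pipe; the function name gather_demo and the buffer contents are made up for illustration. Both blocks live in CPU memory, and the kernel copies the data, so this shows only the gathering idea, not zero-copy transfers or mixed CPU and GPU memory.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends two separate memory blocks as one message with a single writev()
   call, then reads the message back into one contiguous buffer.
   Returns the number of bytes read, or -1 on any failure. */
ssize_t gather_demo(char *out, size_t out_cap) {
    int fds[2];
    if (pipe(fds) != 0) return -1;

    char header[] = "HDR:";          /* first block */
    char payload[] = "sensor-data";  /* second block, stored separately */
    struct iovec blocks[2] = {
        { .iov_base = header,  .iov_len = strlen(header)  },
        { .iov_base = payload, .iov_len = strlen(payload) },
    };

    /* One call, two blocks: no staging buffer is assembled by the caller. */
    ssize_t sent = writev(fds[1], blocks, 2);
    ssize_t got = (sent > 0) ? read(fds[0], out, out_cap) : -1;

    close(fds[0]);
    close(fds[1]);
    return (got == sent) ? got : -1;
}
```

The receiver sees one contiguous message even though the sender never concatenated the blocks itself. A unified API could expose the same idea while letting each block name a different kind of memory.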

Unrealized Performance

Performance is a combination of throughput, latency, and determinism. Some communication APIs may reduce the performance provided by the underlying interconnect. For example, if the destination address of the data is not known until transfer time, then the transfer may be zero-copy but it can't be one-way; i.e., a round trip is required to learn where to place the data.

SOLUTION

A unified API should provide the best performance via:

  • Zero-copy AND one-way: when possible, data addresses need to be pre-registered (and pinned) when the endpoint is initialized
  • Non-blocking: when possible, transfers should be offloaded to a DMA engine, freeing up main processing resources
  • Minimize implicit activity at transfer time: The application should be responsible for synchronization of memory buffer use
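
The three points above can be sketched in a few lines, under loud assumptions: a worker thread stands in for a DMA engine, a memcpy stands in for the hardware transfer, and every name (toy_transfer, toy_register, toy_start, toy_is_done, toy_finish) is hypothetical. The destination is registered once at setup, the start call returns immediately, and the application, not the library, decides when the buffer is safe to read by polling a completion flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    void *dst;          /* destination registered once, at setup time */
    size_t dst_cap;
    const void *src;    /* source of the transfer currently in flight */
    size_t len;
    atomic_int done;    /* 0 = in flight, 1 = complete */
    pthread_t worker;   /* stand-in for a DMA engine */
} toy_transfer;

/* Runs on the worker thread: move the data straight into the
   pre-registered destination, then publish completion. */
static void *toy_engine(void *arg) {
    toy_transfer *t = arg;
    memcpy(t->dst, t->src, t->len);
    atomic_store_explicit(&t->done, 1, memory_order_release);
    return NULL;
}

/* Setup time: the destination address is known before any transfer starts. */
void toy_register(toy_transfer *t, void *dst, size_t dst_cap) {
    t->dst = dst;
    t->dst_cap = dst_cap;
    atomic_init(&t->done, 0);
}

/* Transfer time: hand the work to the engine and return immediately. */
int toy_start(toy_transfer *t, const void *src, size_t len) {
    if (len > t->dst_cap) return -1;
    t->src = src;
    t->len = len;
    atomic_store(&t->done, 0);
    return pthread_create(&t->worker, NULL, toy_engine, t) == 0 ? 0 : -1;
}

/* Non-blocking completion check; the caller must not touch dst before
   this returns 1. */
int toy_is_done(toy_transfer *t) {
    return atomic_load_explicit(&t->done, memory_order_acquire);
}

/* Reclaim the worker thread once the transfer has completed. */
void toy_finish(toy_transfer *t) {
    pthread_join(t->worker, NULL);
}

/* Returns 0 if the data arrived intact in the pre-registered buffer. */
int toy_offload_demo(void) {
    static char dst[16];
    char src[] = "telemetry";
    toy_transfer t;
    unsigned long other_work = 0;

    toy_register(&t, dst, sizeof dst);
    if (toy_start(&t, src, sizeof src) != 0) return -1;
    while (!toy_is_done(&t)) other_work++;  /* CPU stays free for other work */
    toy_finish(&t);
    return strcmp(dst, "telemetry") == 0 ? 0 : -1;
}
```

Because the destination was fixed at setup, no round trip is needed at transfer time to ask where the data should go. The release store paired with the acquire load is the only synchronization in the sketch, and it is explicit in application-visible calls rather than hidden inside the transfer.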

Proposals

Proposals are welcome – please contact us if you would like to discuss getting involved.

 

Only 8 Function Calls!

Group     | Functions                                                         | Details
Create    | takyonCreate(), takyonDestroy()                                   | Dynamically create and destroy endpoints
Two-Sided | takyonSend(), takyonIsSent(), takyonPostRecvs(), takyonIsRecved() | Both endpoints involved with transfer via coordinated send -> recv
One-Sided | takyonOneSided(), takyonIsOneSidedDone()                          | One endpoint does all the work, and the other endpoint is not involved

 

100% Open Source

GitHub Presentation takyon.h

 

Recently Adopted By

Lockheed Martin
Ametek / Abaco Systems