When performance matters, it can matter greatly. For example:
- Highly distributed, high-demand web services need to deliver sub-second responses.
- Artificial intelligence algorithms must correctly analyze complex signals on the fly.
- Media and 3D games need to deliver high-quality, immersive experiences without pauses or interruptions.
A significant part of Cone's performance story comes from obvious architectural choices. Cone's statically-typed programs are compiled to native executables for all popular platforms: Windows, Linux, macOS, Android, iOS, and WebAssembly. Cone accomplishes this using LLVM's award-winning backend technology and its powerful suite of code optimizers, which means no performance is lost to interpretation or JIT translation. Further, Cone's static types are aligned with the way CPUs operate on data, eliminating the CPU cycles wasted on operations over needlessly complex data structures.
This is not enough, however, as high-performance programs often grow in complexity faster than CPUs grow in speed. To win at this performance game, we must get smarter about architecting programs for performance. This means taking better advantage of proven performance strategies, such as cache locality, multi-core machines, distributed computing, and more.
This is why a top design goal for Cone is making it easier for knowledgeable programmers to leverage high-performance strategies to improve throughput and reduce latency.
Memory and Throughput
Cache locality: Relative to modern CPUs, main memory is glacially slow: a fetch from RAM takes roughly 100x longer than a fetch from L1 cache. As a result, cache-friendly approaches to data layout and access can yield significant throughput gains. Like other systems programming languages, Cone facilitates explicit control over the physical size, layout, and representation of objects in memory, so as to maximize cache locality. When data that is processed together is laid out together compactly, operations unfold far more quickly.
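To make this concrete, here is a minimal sketch in Rust (Cone's own syntax differs, and the `Particle` types below are purely hypothetical) contrasting a pointer-chasing layout with an inlined, contiguous one:

```rust
// Illustrative Rust sketch (not Cone syntax): contrasting a pointer-chasing
// layout with an inline, cache-friendly one. All names here are hypothetical.

// Cache-hostile: every field access hops through a separate heap allocation,
// scattering a scan across memory.
#[allow(dead_code)]
struct BoxedParticle {
    position: Box<[f32; 3]>,
    velocity: Box<[f32; 3]>,
}

// Cache-friendly: fields are inlined, and a Vec stores the records
// contiguously, so a scan walks memory linearly and stays in cache.
#[derive(Clone, Copy)]
struct Particle {
    position: [f32; 3],
    velocity: [f32; 3],
}

// Summing one coordinate touches consecutive cache lines; the hardware
// prefetcher can keep the loop fed.
fn total_x(particles: &[Particle]) -> f32 {
    particles.iter().map(|p| p.position[0]).sum()
}

fn main() {
    let particles =
        vec![Particle { position: [1.0, 2.0, 3.0], velocity: [0.0; 3] }; 1024];
    println!("sum of x coordinates: {}", total_x(&particles));
}
```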
Region memory management: Cone's region approach to memory management enables significant reductions in the runtime overhead of allocating and freeing memory. For example, a program that exclusively uses ref-counting or tracing GC can improve throughput by switching some or all of its allocations to arenas or pools, whose allocation throughput can be up to 10-20x faster than general-purpose allocation and free.
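The following Rust sketch shows the core mechanic behind an arena's speed; it assumes nothing about Cone's actual region API. Allocation is just a bounds check and a pointer bump, and the entire arena is released in one step:

```rust
// A minimal bump-arena sketch (illustrative only, not Cone's region API):
// allocation is a pointer increment, and the whole arena is freed at once
// when it is dropped, avoiding per-object free costs.
struct Arena {
    buf: Vec<u8>,
    used: usize,
}

impl Arena {
    fn with_capacity(bytes: usize) -> Arena {
        Arena { buf: vec![0u8; bytes], used: 0 }
    }

    // Allocate `n` bytes by bumping an offset: O(1), no locks, no free lists.
    fn alloc(&mut self, n: usize) -> Option<&mut [u8]> {
        if self.used + n > self.buf.len() {
            return None; // arena exhausted
        }
        let start = self.used;
        self.used += n;
        Some(&mut self.buf[start..start + n])
    }
}

fn main() {
    let mut arena = Arena::with_capacity(4096);
    let slot = arena.alloc(64).expect("arena has room");
    slot[0] = 42;
    println!("first byte: {}", slot[0]);
    // Dropping `arena` here releases every allocation in one step.
}
```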
Additional performance improvements can be realized by converting most region-managed references to borrowed references, thereby eliminating those references' runtime overhead entirely. Beyond this direct benefit, borrowed references can point to data structures inlined within larger ones, which facilitates cache-friendly data layout and further improves throughput.
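Here is a small Rust analogue of the borrowed-reference idea (the `Packet` and `Header` types are invented for illustration): borrowing a field inlined inside a larger structure incurs no refcount bump or GC barrier, only a compiler-checked pointer.

```rust
// Sketch of the borrowed-reference idea in Rust terms: `&` references carry
// no runtime region overhead and may point into data inlined inside a larger
// structure. Types here are illustrative, not from Cone.
struct Header {
    version: u32,
}

#[allow(dead_code)]
struct Packet {
    header: Header,   // inlined, not behind a separate allocation
    payload: [u8; 32],
}

// Borrowing the inlined header costs nothing at runtime: no refcount bump,
// no GC barrier, just a pointer whose validity the compiler checks.
fn version_of(packet: &Packet) -> u32 {
    let header: &Header = &packet.header;
    header.version
}

fn main() {
    let p = Packet { header: Header { version: 3 }, payload: [0; 32] };
    println!("version {}", version_of(&p));
}
```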
Region-based memory management doesn't just improve throughput. It can also reduce (or eliminate) the stop-the-world pauses that hurt systems needing to continuously deliver in real time.
Thread Communications
Another source of unnecessary throughput loss is failing to fully leverage the parallelism and concurrency capabilities of modern hardware. This happens when programs do not spread work across all available CPUs, block on I/O (rather than scheduling other work while waiting), or spend too much time on synchronized communication between concurrent tasks. Concurrency can be genuinely hard to get right.
Cone is designed to make actors, one of the fastest models of concurrency, the easiest to use. What are the performance benefits? (A minimal sketch follows the list below.)
- Actors communicate via messages delivered using a FIFO queue. FIFO queues have very low synchronization costs compared to other synchronization methods (e.g., lock permissions and concurrent data structures).
- Because actors share memory, messages can be short, consisting largely of references to existing, accessible objects.
- Message references typically use static permissions, whose safety guarantees carry no runtime lock cost.
- Actors built on top of green threads minimize context-switching costs compared to OS processes or threads.
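The sketch below models an actor in Rust using a FIFO channel. It is only a structural illustration: Rust's `std` offers OS threads rather than green threads, and its channels transfer message ownership rather than sharing memory, so the boxed payload merely stands in for the small, reference-bearing messages described above.

```rust
use std::sync::mpsc;
use std::thread;

// Minimal actor sketch: one thread owns the state and drains a FIFO channel.
// (Cone plans actors on green threads; std offers OS threads, so this only
// shows the structure, not the context-switching economics.)
enum Msg {
    Add(Box<i64>), // small message carrying a reference-like payload
    Quit,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Msg>();

    // The actor processes messages one at a time in FIFO order, so its
    // state (`total`) needs no locks; the queue is the only sync point.
    let actor = thread::spawn(move || {
        let mut total: i64 = 0;
        for msg in rx {
            match msg {
                Msg::Add(n) => total += *n,
                Msg::Quit => break,
            }
        }
        total
    });

    for n in 1..=3 {
        tx.send(Msg::Add(Box::new(n))).unwrap();
    }
    tx.send(Msg::Quit).unwrap();
    println!("actor total: {}", actor.join().unwrap());
}
```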