Modern scientific computing is no longer just about running simulations. It is also about moving, transforming, sorting, joining, and preparing enormous datasets before any model can learn from them. That data engineering phase is becoming a serious bottleneck in fields like genomics, climate science, astronomy, and large-scale AI. The paper Radical-Cylon tackles exactly this problem by proposing a runtime system that brings together high-performance data processing and flexible heterogeneous execution on HPC platforms.
Scientific workflows increasingly depend on both data engineering and machine learning. Yet those two worlds are often disconnected. Data pipelines may use one framework, while deep learning and distributed execution depend on another. This fragmentation causes overhead, weak resource utilization, and difficulty deploying the same workflow across clusters, clouds, and supercomputers. The paper argues that what researchers need is a more unified execution model that can support both compute-intensive and data-intensive workloads without forcing major code rewrites.
Radical-Cylon is designed as that bridge. It combines:
Cylon, a distributed dataframe and data engineering framework
RADICAL-Pilot, a flexible runtime for task execution and resource management on HPC systems
Together, they form a system that can launch heterogeneous data tasks efficiently across many nodes while maintaining performance close to bare-metal execution.
The central design choice is surprisingly elegant: keep the two systems loosely coupled.
Instead of deeply rewriting Cylon to fit a new runtime, the authors connect Cylon and RADICAL-Pilot through their native Python APIs. Cylon keeps doing what it does best, distributed dataframe operations, while RADICAL-Pilot handles task execution, scheduling, and construction of the private communicators needed for distributed runs. This keeps the integration flexible, portable, and easier to evolve as either system changes. The loose coupling also makes the system more resilient to change: if one component is updated or replaced, the other does not need to be redesigned.
Cylon is the data engineering side of the system. It provides distributed table abstractions and supports both local and distributed operators. It is built around Apache Arrow’s columnar data model, which improves interoperability with other systems and helps make distributed data operations more efficient. It also supports multiple communication backends such as MPI, UCX, and GLOO, making it suitable for heterogeneous environments.
The paper’s layered architecture figure is helpful here. Figure 1 shows how Cylon sits above the hardware and communication layers and exposes a structured data-processing stack that can connect with AI and ML workflows. Figure 2 then illustrates the communicator model, emphasizing that Cylon can operate across multiple communication frameworks instead of being locked into one.
RADICAL-Pilot is the runtime engine. It provides scalable execution of heterogeneous workloads across HPC resources. The paper explains that RADICAL-Pilot uses three main components:
PilotManager
TaskManager
RemoteAgent
These work together to acquire resources, manage tasks, and execute them efficiently on remote compute nodes. A major strength of RADICAL-Pilot is that it separates resource management from the application layer, so applications like Cylon do not need to be rewritten for every platform.
Even more importantly, RADICAL-Pilot can create private MPI communicators at runtime, which is exactly what Cylon needs for its distributed tasks.
The integration is explained through Figure 3 and Figure 4, which are among the most important visuals in the paper.
Figure 3 presents the modular Radical-Cylon architecture. From top to bottom, it shows how user-facing workflows and data engineering APIs connect into RADICAL-Pilot, which then manages scheduling and execution across compute nodes. The figure highlights the layered handoff from notebooks and workflows, to pilot/task management, to schedulers and executors, and finally to node-level CPU/GPU resources.
Figure 4 shows the control flow and data flow of heterogeneous execution. This is where the system becomes especially interesting. A user submits a traced program made of multiple computations. RADICAL-Pilot then manages the front-end orchestration, while the execution engine launches separate task pipelines. The actual Cylon tasks operate as SPMD-style jobs beneath this orchestration layer. In other words, Radical-Cylon enables a higher-level MPMD execution model by coordinating multiple lower-level SPMD computations.
The implementation also uses the RAPTOR subsystem, based on mpi4py, to support concurrent execution of heterogeneous MPI and non-MPI functions across nodes. This is crucial because it allows distinct groups of ranks to be isolated, bundled into private communicators, and delivered to the tasks dynamically at runtime.
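The idea of carving one pool of ranks into isolated groups can be sketched in plain Python. This is only an illustration of the bookkeeping, not RAPTOR's implementation: the function names, task names, and sizes below are hypothetical, and no MPI calls are made. In mpi4py, the analogous real operation would be `MPI.COMM_WORLD.Split(color, key)`, where ranks sharing a color end up in the same sub-communicator.

```python
def split_world(world_size, task_sizes):
    """Assign contiguous world ranks to each task.

    Returns {task_name: [world ranks]}. Hypothetical helper for illustration.
    """
    needed = sum(task_sizes.values())
    if needed > world_size:
        raise ValueError(f"tasks need {needed} ranks, only {world_size} available")
    groups = {}
    next_rank = 0
    for name, size in task_sizes.items():
        groups[name] = list(range(next_rank, next_rank + size))
        next_rank += size
    return groups


def local_view(groups, world_rank):
    """Return (task, local_rank, group_size) for a world rank, or None if unassigned."""
    for name, ranks in groups.items():
        if world_rank in ranks:
            return name, ranks.index(world_rank), len(ranks)
    return None


# Hypothetical example: 8 ranks shared by a 4-rank join and a 3-rank sort.
groups = split_world(8, {"join": 4, "sort": 3})
print(groups)                 # which world ranks each task owns
print(local_view(groups, 5))  # world rank 5 as seen from inside its task
print(local_view(groups, 7))  # rank 7 is left unassigned in this example
```

Each task only ever sees its own local rank numbering and group size, which is what lets several SPMD-style jobs run side by side under one higher-level MPMD orchestration layer.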
The paper evaluates Radical-Cylon on two HPC systems:
UVA Rivanna
ORNL Summit
The experiments test both join and sort operations under:
Weak scaling: 35 million rows per rank
Strong scaling: 3.5 billion total rows divided across ranks
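The two scaling modes differ only in what is held fixed, which a few lines of arithmetic make concrete. The row counts come from the setup above; the rank counts in the loop are arbitrary illustrative values, not the configurations used in the paper.

```python
WEAK_ROWS_PER_RANK = 35_000_000       # fixed work per rank (weak scaling)
STRONG_TOTAL_ROWS = 3_500_000_000     # fixed total work (strong scaling)


def weak_scaling_total_rows(ranks):
    """Total rows processed when every rank holds a fixed number of rows."""
    return WEAK_ROWS_PER_RANK * ranks


def strong_scaling_rows_per_rank(ranks):
    """Rows per rank when a fixed total is divided evenly across ranks."""
    return STRONG_TOTAL_ROWS / ranks


# Illustrative rank counts only (assumption, not taken from the paper).
for ranks in (50, 100, 200, 400):
    print(
        f"ranks={ranks:4d}  "
        f"weak total={weak_scaling_total_rows(ranks):,}  "
        f"strong per-rank={strong_scaling_rows_per_rank(ranks):,.0f}"
    )
```

In weak scaling the total problem grows with the rank count, so flat or slowly rising run times are the goal. In strong scaling the per-rank share shrinks, so run time should fall as ranks are added.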
The study compares Radical-Cylon with Bare-Metal Cylon (BM-Cylon), focusing on two metrics:
total execution time
Radical-Cylon overheads, especially communicator construction and task deserialization
These experiments are summarized in Table 1.
One of the strongest findings in the paper is that Radical-Cylon performs very similarly to BM-Cylon in both strong and weak scaling experiments.
For join operations, the paper shows the comparison in Figures 5 and 6 for Rivanna and Summit. In strong scaling, execution time drops significantly as ranks increase, and Radical-Cylon remains close to BM-Cylon. In weak scaling, execution time rises gradually as expected, but the gap between the two systems remains small.
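For readers unfamiliar with why a distributed join involves communication at all, here is a minimal single-process sketch of a hash-partitioned join. It is a common textbook pattern and is offered as an assumption about the general shape of the operation, not as Cylon's actual implementation; all function names are hypothetical, and the "ranks" are simulated as Python lists.

```python
from collections import defaultdict


def hash_partition(rows, num_ranks):
    """Route each row to a simulated rank by hashing its key (column 0)."""
    parts = [[] for _ in range(num_ranks)]
    for row in rows:
        parts[hash(row[0]) % num_ranks].append(row)
    return parts


def local_join(left, right):
    """Inner hash join of two lists of tuples on column 0."""
    index = defaultdict(list)
    for row in left:
        index[row[0]].append(row)
    joined = []
    for row in right:
        for match in index.get(row[0], []):
            joined.append(match + row[1:])
    return joined


def partitioned_join(left, right, num_ranks):
    """Shuffle both tables by key, then join each partition independently."""
    left_parts = hash_partition(left, num_ranks)
    right_parts = hash_partition(right, num_ranks)
    result = []
    for rank in range(num_ranks):
        result.extend(local_join(left_parts[rank], right_parts[rank]))
    return result


left = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
print(sorted(partitioned_join(left, right, num_ranks=3)))
```

Because matching keys always land on the same simulated rank, the per-rank joins need no further coordination. In a real distributed run, the partitioning step is where rows move between ranks.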
For sort operations, the same trend appears in Figures 7 and 8. Strong scaling improves nicely as more ranks are used, while weak scaling shows some increase in execution time due to added data shuffling and merge costs. Still, Radical-Cylon tracks BM-Cylon closely.
This is important because it shows that adding a flexible runtime layer does not destroy performance.
The overhead numbers in Table 2 are especially revealing. The system’s overheads remain small and roughly constant even as parallelism increases. The discussion section notes that constructing an MPI communicator for 518 ranks takes only about 3.4 seconds on average, which is minor compared with total execution times in the tens or hundreds of seconds.
That is a meaningful systems result. It means the runtime abstraction is not only flexible, but also cheap enough to be practical at scale.
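To put the 3.4-second figure in proportion, the share of a run spent on communicator construction can be computed directly. The total run times below are hypothetical round numbers chosen to span "tens or hundreds of seconds"; they are not measurements from the paper.

```python
COMMUNICATOR_SECONDS = 3.4  # average construction time quoted above


def overhead_share(total_seconds, overhead_seconds=COMMUNICATOR_SECONDS):
    """Fraction of a run's total time spent on the given overhead."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return overhead_seconds / total_seconds


# Hypothetical total execution times in seconds (assumption for illustration).
for total in (50.0, 100.0, 300.0):
    print(f"total={total:6.1f}s  communicator share={overhead_share(total):.1%}")
```

Under these assumed totals the share stays in the low single-digit percent range and shrinks as runs get longer.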
The most compelling part of the paper may be the multi-pipeline experiments on Summit.
Instead of running join and sort as separate batch jobs, Radical-Cylon treats them as distinct tasks within a single heterogeneous execution. This matters because resources released by one task can be reused by another. In traditional batch mode, each operation gets its own allocation, and resources freed by a finished operation may sit idle.
This advantage is shown in Figures 9 and 10. The heterogeneous pipeline executes multiple sort and join configurations together and manages resource release more effectively than separate batch jobs. The outcome is better resource utilization and lower total time.
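The resource-reuse argument can be illustrated with a toy scheduling model. This is a deliberately simplified sketch, not the paper's scheduler: the task names, rank counts, and durations are invented, and the one-job-at-a-time baseline is a simplification of batch submission. The size of the gap it produces is therefore not comparable to the paper's measured results.

```python
import heapq
from typing import NamedTuple


class Task(NamedTuple):
    name: str
    ranks: int
    seconds: float


def sequential_makespan(tasks):
    """Baseline: tasks run one after another, each as its own job."""
    return sum(t.seconds for t in tasks)


def shared_pool_makespan(tasks, total_ranks):
    """Tasks share one pool; ranks freed by a finished task are reused at once."""
    free = total_ranks
    now = 0.0
    running = []          # min-heap of (finish_time, ranks_held)
    pending = list(tasks)
    while pending or running:
        # Start every pending task that currently fits, in submission order.
        still_waiting = []
        for t in pending:
            if t.ranks <= free:
                free -= t.ranks
                heapq.heappush(running, (now + t.seconds, t.ranks))
            else:
                still_waiting.append(t)
        pending = still_waiting
        if not running:
            raise ValueError("a pending task needs more ranks than the pool has")
        # Advance time to the next completion and release its ranks.
        now, released = heapq.heappop(running)
        free += released
    return now


# Invented workload: two 64-rank tasks and one 128-rank task on a 128-rank pool.
tasks = [Task("join-a", 64, 100.0), Task("sort-a", 64, 80.0), Task("join-b", 128, 60.0)]
print("sequential:", sequential_makespan(tasks))
print("shared pool:", shared_pool_makespan(tasks, total_ranks=128))
```

In this invented workload the two smaller tasks overlap, and the large task starts as soon as both have released their ranks, so the shared pool finishes earlier than running the three tasks back to back.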
Then Figure 11 makes the performance difference explicit: Radical-Cylon improves over batch execution by 4% to 15% across various scaling configurations on ORNL Summit.
That is the paper’s biggest practical message: a smarter execution framework can outperform conventional batch submission even when using the same underlying hardware.
This paper is not just about joins and sorts. It points toward a broader future in which scientific computing systems need to support:
data engineering
machine learning
multiple concurrent pipelines
fine-grained resource sharing
heterogeneous hardware
The discussion section makes that forward-looking vision clear. The authors note that Cylon tasks can be represented as a DAG, and that future optimizations could schedule independent branches in parallel. They also discuss extending the model to CPU-GPU unified execution and multi-tenancy scenarios involving memory and network bandwidth tracking.
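The DAG idea can be sketched with Python's standard-library `graphlib` module (Python 3.9+). The task graph below is hypothetical and is not taken from the paper. The sketch only groups tasks into "waves" whose members have no unmet dependencies and could therefore run in parallel; it says nothing about how Radical-Cylon would actually schedule them.

```python
from graphlib import TopologicalSorter


def parallel_waves(dependencies):
    """Group DAG nodes into waves; all nodes in a wave are mutually independent.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical pipeline: two loads feed a join, the join feeds a sort,
# and a statistics task depends only on the first load.
pipeline = {
    "join": {"load_a", "load_b"},
    "sort": {"join"},
    "stats": {"load_a"},
}
for i, wave in enumerate(parallel_waves(pipeline)):
    print(f"wave {i}: {wave}")
```

Here `stats` and `join` fall in the same wave because neither depends on the other, which is the kind of independent branch the authors suggest could be scheduled concurrently.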
So Radical-Cylon is best viewed as a foundation. It already proves that flexible runtime management can be added with minimal cost, and it opens the door to more advanced orchestration for ML and data-intensive HPC workloads.
The paper is strong, but it is also clear about its current boundaries.
First, the experiments are limited to CPU clusters, even though the design aims to support CPU-GPU execution. The authors explain that dependencies on CUDA-aware MPI and platform support prevented full heterogeneous CPU-GPU evaluation at this stage.
Second, while the performance improvements are convincing, the benchmarks are still centered around data operations like join and sort. Future work will need to show how well the approach extends to broader end-to-end ML and DL pipelines.
Third, the paper mentions challenges in resource allocation within the RAPTOR module on Summit, which suggests that some runtime issues remain under active development.
Radical-Cylon is a strong systems paper because it solves a real integration problem without overcomplicating the architecture. Its main contribution is not inventing a new dataframe engine or a new scheduler in isolation. Instead, it shows how to combine the strengths of an HPC data framework and a pilot-based runtime into a unified, scalable execution model.
The most important takeaways are:
It gives Cylon access to heterogeneous HPC execution without requiring major rewrites.
It preserves near bare-metal performance for distributed join and sort workloads.
It introduces only small, stable overheads.
It outperforms traditional batch execution by 4–15% in multi-pipeline settings.
It lays the groundwork for more advanced unified ML/DL/data-engineering pipelines on HPC systems.
In a world where scientific computing increasingly depends on both large-scale data preparation and large-scale learning, that is a meaningful step forward.
@inproceedings{sarker2024radical,
title={Radical-Cylon: A Heterogeneous Data Pipeline for Scientific Computing},
author={Sarker, Arup Kumar and Alsaadi, Aymen and Perera, Niranda and Staylor, Mills and von Laszewski, Gregor and Turilli, Matteo and Kilic, Ozgur Ozan and Titov, Mikhail and Merzky, Andre and Jha, Shantenu and others},
booktitle={Job Scheduling Strategies for Parallel Processing},
pages={84--102},
year={2024},
doi = {10.1007/978-3-031-74430-3_5},
url = {https://doi.org/10.1007/978-3-031-74430-3_5},
organization={Springer Nature Switzerland}
}
@inproceedings{sarker2025drc,
author = {Sarker, Arup Kumar and Alsaadi, Aymen and Halpern, Alexander James and Tangella, Prabhath and Titov, Mikhail and Perera, Niranda and Staylor, Mills and von Laszewski, Gregor and Jha, Shantenu and Fox, Geoffrey},
title = {Deep RC: A Scalable Data Engineering and Deep Learning Pipeline},
year = {2025},
isbn = {978-3-032-10506-6},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
url = {https://doi.org/10.1007/978-3-032-10507-3_11},
doi = {10.1007/978-3-032-10507-3_11},
booktitle = {Job Scheduling Strategies for Parallel Processing: 28th International Workshop, JSSPP 2025, Milan, Italy, June 3–4, 2025, Revised Selected Papers},
pages = {205--223},
numpages = {19},
location = {Milan, Italy}
}