
An interview with Ewa Deelman, recipient of the 2025 Sidney Fernbach Memorial Award.
Dr. Ewa Deelman is Research Director at USC’s Information Sciences Institute and Research Professor in the USC Computer Science Department. Her pioneering work in workflow automation has transformed how large-scale scientific research is conducted.
Pegasus has become a cornerstone for reproducible science across disciplines. What were some of the early challenges you faced in designing a workflow system that could scale across such diverse scientific domains, and how did those challenges shape Pegasus’s evolution?
We faced a number of challenges, both technical and social. Since we have written many papers on the technical challenges of Pegasus ([1] gives a good overview of the system), here I will focus on the socio-technical aspects.
In the beginning, we needed to understand the requirements of the domain scientists. In 2000, we started working with gravitational-wave physicists from LIGO, high-energy physicists from CMS (the Compact Muon Solenoid) and ATLAS, as well as astronomers. I was just starting a position at USC/ISI after my postdoc at UCLA. These were new collaborations for me, new problems to tackle. My background was in high-performance computing and parallel discrete event simulation. In the NSF-funded GriPhyN project, we needed to address issues related to distributed systems (the data and computational resources were geographically distributed), and we were exploring the concept of virtual data in that context. The idea was that a scientist could ask for data, like a mosaic of an area of the sky, and the system would determine whether that image was already available or needed to be materialized. To support this idea, we were developing new computer science concepts, abstractions, and systems while aiming to understand the scientists' needs to ensure we provided them with the right tools. It also took some time to understand the nomenclature across the science and CS domains, and then we needed to find the right abstractions, develop prototypes, and test them with our science collaborators.
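To make the virtual data idea concrete, here is a minimal illustrative sketch in Python; the names are hypothetical, and this is not the actual GriPhyN or Pegasus code. The system returns a requested data product if it already exists, and otherwise plans and runs the workflow that materializes it.

```python
# Illustrative sketch of "virtual data": reuse a product if it exists,
# otherwise plan and run the workflow that creates it.
# `catalog`, `plan_workflow`, and `execute` are hypothetical stand-ins.

def request_product(catalog, request, plan_workflow, execute):
    """Return the data product described by `request` (e.g., a sky mosaic)."""
    existing = catalog.lookup(request)        # is the product already materialized?
    if existing is not None:
        return existing                       # reuse: no computation needed
    workflow = plan_workflow(request)         # derive the steps that produce it
    product = execute(workflow)               # materialize the product
    catalog.register(request, product)        # record it for future reuse
    return product
```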
The first Pegasus prototype for LIGO, demonstrated at SC in 2001, was not fully successful. We raised the level of abstraction too high: we provided the scientists with a web-based interface where they could conduct a pulsar search using only metadata attributes, such as the location on the sky and the time period they wanted to analyze. The system would then interpret the request, and if the desired data was not readily available, Pegasus would launch a workflow to create it and deliver it to the user. The feedback from our LIGO collaborators was that the interface was nice, but they wanted to be able to directly create and manipulate the workflows to make sure that their scientific intent was accurately reflected in the computations. As a result, Pegasus focused on providing the abstraction of a resource-independent workflow description and becoming a “compiler” that maps this description onto the distributed data and compute resources available to the LIGO scientists. At that time, we were also leveraging the capabilities of the DAGMan workflow engine and the HTCondor job scheduler, developed by Miron Livny's team at the University of Wisconsin-Madison; Miron was also part of GriPhyN. This started a collaboration between the two teams that is still active today. Utilizing this robust software stack allows us to focus on issues of abstraction, compilation, scalability, and optimization.
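As a rough illustration of what this “compilation” step involves, here is a minimal sketch with hypothetical names and structures, not the actual Pegasus API: an abstract, resource-independent task description is mapped to concrete jobs by selecting an execution site and resolving logical files to physical replicas, and the resulting plan could then be handed to an engine such as DAGMan/HTCondor for execution.

```python
# Minimal sketch of mapping a resource-independent workflow onto resources.
# Task, site_catalog, and replica_catalog are illustrative assumptions,
# not the Pegasus API.

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    inputs: list = field(default_factory=list)   # logical file names
    outputs: list = field(default_factory=list)
    parents: list = field(default_factory=list)  # names of tasks this one depends on

def compile_workflow(tasks, site_catalog, replica_catalog):
    """Turn abstract tasks into concrete jobs: choose an execution site and
    resolve each logical input to a physical location for staging."""
    jobs = []
    for task in tasks:
        site = site_catalog.select(task)                           # resource selection
        stage_in = [replica_catalog.resolve(f, site) for f in task.inputs]
        jobs.append({"task": task.name, "site": site,
                     "stage_in": stage_in, "after": task.parents})
    return jobs  # concrete plan, e.g., for submission via DAGMan/HTCondor
```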
This also brings me to another socio-technical challenge: it often takes time and commitment to see the results of collaborations come to life. LIGO is again a great example. We started working together in 2001; the first important result came in 2011, when LIGO passed the “blind injection test”. Even though the LIGO interferometers were not yet sensitive enough to detect a gravitational wave, the test proved that the science application and the software stack supporting it (including Pegasus) could detect the wave should it be captured by the instruments. In 2016, after an upgrade to the LIGO instruments and six months of data analysis, LIGO confirmed the first-ever direct detection of a gravitational wave, and in 2017, three LIGO scientists received a Nobel Prize for their work. Over that span of time, one of our LIGO colleagues, Duncan Brown, received his PhD, completed a postdoc at Caltech, became faculty at Syracuse University, and is now the university's Vice President for Research. We still work very closely with him and other LIGO colleagues, and when we are ready to release new versions of Pegasus, they are among our key testers. We want to make sure that the latest versions of the software do not impede their scientific pursuits.
Collaborations with both domain scientists and computer scientists are crucial. Engaging with domain scientists allows us to ground our solutions and gives us the satisfaction of making an impact on others’ research. Collaborating with computer scientists with complementary expertise brings in new ideas and solutions. We often view the same problem from different angles, and we often have different sets of domain science collaborators that drive novel solutions. For example, in an NSF-funded project with Michael Zink, Anirban Mandal, Eric Lyons, and Komal Thareja, our work on end-to-end orchestration of edge-to-cloud resources enabled us to support the analysis of data coming off weather radars deployed in the Dallas-Fort Worth area. On typical days, when the weather is good, edge computing resources close to the radars can handle the computational load; however, when severe weather approaches, additional computing is needed to support more complex weather prediction ensembles (see the sketch below). To develop operational solutions, we needed to leverage our collective expertise in weather forecasting, edge computing, cloud and network provisioning, and workflow management.
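As a rough sketch of that bursting decision, with assumed object names and parameters rather than the project's actual code: edge resources near the radars handle routine processing, and additional cloud capacity is provisioned only when severe weather is expected.

```python
# Illustrative sketch of edge-to-cloud bursting for radar data analysis.
# The objects (forecast, edge, cloud) and parameters are hypothetical.

def plan_resources(forecast, edge, cloud):
    """Process radar data at the edge by default; provision extra cloud
    capacity for larger prediction ensembles when severe weather is expected."""
    resources = [edge.allocate()]                 # always compute close to the radars
    if forecast.severe_weather_expected():
        # Burst: add cloud nodes to run a more complex ensemble in time.
        resources.append(cloud.provision(nodes=16, ensemble="high_resolution"))
    return resources
```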
Today, as AI is permeating many facets of our lives, we have formed a collaboration of AI and systems experts (Prasanna Balaprakash, Anirban Mandal, Krishnan Raghavan, Franck Cappello, Hongwei Jin, Jean Luca Bez, Fred Sutter, Komal Thareja, and several students, including Prachi Jadhav and Shixun Wu) to reimagine what workflow systems will look like in the future. We envision fully distributed agents that coordinate to ensure that the workflows submitted by users, sensors, and instruments are executed efficiently and reliably. This DOE-funded research project, SWARM, brings together expertise in AI, systems, resource and workflow management, and networking, and infuses AI agents with algorithms from distributed and high-performance computing (for example, scheduling and multi-party consensus). At the same time, we are adding AI to heuristics, such as those for building network overlays that enable more efficient agent communication.
Meanwhile, with Anirban Mandal, Sai Swaminathan, Michela Taufer, and Michael Zink, we are enhancing production versions of Pegasus with AI: improving the user experience for workflow composition, monitoring, and debugging, as well as exploring ways to improve the type and quality of the decisions that Pegasus makes while mapping the workflow to resources and during execution.
I mentioned collaborators, but the core Pegasus work has so far been done by my team. I am grateful for their work and dedication, particularly Karan Vahi, Mats Rynge, and Rajiv Mayani, who have been with me for decades. For example, Karan started in my group in 2002 as a graduate student and now leads the Pegasus architecture. Rajiv and Mats joined six and seven years later, respectively. We have had a number of other staff as well, for example, Gaurang Mehta, who was there at the very beginning, and more recently Ryan Tanaka; postdocs such as Rafael Ferreira da Silva and Loïc Pottier; Ph.D. students, most notably Gurmeet Singh, Gideon Juve, Weiwei Chen, Tu Do, and George Papadimitriou; and many MSc students, undergraduates, and even one high school student (https://scitech.group/people).
When we take on large-scale projects, such as CI Compass, the NSF Cyberinfrastructure Center of Excellence, which has many programmatic components, including significant outreach and undergraduate spring and summer programs, I rely on the expertise of Nicole Virdone, who manages a large set of complex, interdependent tasks, communications with our NSF facility partners, and the organization of student recruitment and programmatic activities.
As the Director of CI Compass, you’re helping NSF facilities navigate complex cyberinfrastructure needs. How do you balance the rapidly evolving landscape of cloud and hybrid computing with the long-term sustainability and reproducibility goals of large-scale science?
Managing change is inherently complex, particularly within NSF Major Facilities. These facilities often take years—sometimes decades—to design and build and are then expected to operate for another 30 to 40 years. While evolution is inevitable, each facility must balance the need to adopt new technologies with the responsibility to provide stable, reliable capabilities for its science and engineering communities. This creates a persistent tension between innovation and continuity. Because of their mission-critical focus on scientific integrity, facilities tend to be deliberate and cautious when implementing change—prioritizing reproducibility, reliability, and stewardship of their scientific outputs.
You bring up cloud and hybrid computing. These are important capabilities that offer scalability, resilience, and mature built-in services. NSF facilities can benefit from them, and a number of facilities have moved to primarily cloud infrastructures; for example, NEON (the National Ecological Observatory Network) has migrated most of its on-prem services to Google Cloud. For NEON, the move made sense because its data processing is relatively lightweight and left on-prem infrastructure cycles unused; pay-per-use cloud capabilities made financial sense. Others, like OOI (the Ocean Observatories Initiative), take a hybrid approach, mixing on-prem and commercial clouds. As NSF facilities use clouds, they need to manage costs carefully, as cloud data storage and computation can be expensive and prices can change over time, which makes budgeting difficult. Thus, the adoption of cloud technologies is carefully evaluated and deliberated. Facilities make decisions about what data to retain long term, how to manage data and compute use within their communities, and how to mitigate the risks associated with the temporary unavailability of cloud services or the long-term loss of data storage. Facilities can also leverage NSF-funded resources such as PATh and ACCESS, which can be and are used for facility data processing. Some, like NHERI (the Natural Hazards Engineering Research Infrastructure), have built-in cyberinfrastructure partners, such as TACC (the Texas Advanced Computing Center), that provide the needed services.
There are also some interesting cases of managing change, as facilities sometimes operate in constrained environments where, even if one wanted to make changes, the physical environment makes it challenging. For example, IceCube operates at the South Pole, so bringing in new hardware and deploying new solutions must be carefully managed. As a result, IceCube maintains a mirror cyberinfrastructure at the University of Wisconsin-Madison, where they can introduce new hardware and software, test them, and deploy them only when they are confident everything will work as expected. Ocean-going research vessels have different challenges: the physical space on board is limited, and so is network connectivity, so careful planning is needed when making changes and upgrades. Research vessels need to make sure that the needed data collection, processing, and storage capabilities are fully operational during science cruises.
Clouds are not just platforms; they come with a set of services, and today these services increasingly involve AI, providing additional opportunities to enhance facilities’ capabilities. Facilities are already exploring the use of machine learning, for example, for object detection in images captured by the OOI underwater cameras, or for detecting anomalies in instruments, as in the CMS high-energy physics experiment, where quick detection and correction of detector issues can improve the quality of the collected data.
You founded the WORKS Workshop nearly two decades ago. How have you seen the community’s understanding and use of scientific workflows evolve over time, and what emerging trends or technologies are you most excited about today?
Indeed, I organized the first WORKS Workshop in 2006 in conjunction with the HPDC conference in Paris. In 2008, we proposed the workshop to SC, which has hosted the workflow community ever since; the workshop is chaired by researchers active in the community.
In the past two decades, we have seen an increase in the use of workflow technologies by scientists who must orchestrate ever more complex, multi-step applications and data flows. Over time, the workflow systems community has explored ease of workflow composition, workflow scheduling, resource provisioning, provenance, and reproducibility. Many workflow systems have been developed in the last 20+ years, with some focusing on specific application domains or scientific disciplines. Some were designed specifically for clouds; some are driven by data flow rather than by task control. Scientists can also be using workflows without knowing it, for example, when workflow systems operate in the back end of portals and other user interfaces.
I think that we are entering a renaissance in workflow management systems. With AI, we can make systems, not only workflow systems but the entire cyberinfrastructure, more robust to anomalies and failures, and more intelligent in how they learn user, application, service, and resource behavior and adapt to a changing environment. For example, knowing that certain anomalies lead to job failures, and why, the system can choose different resources for application execution if the cause is insufficient memory; if the anomalies indicate a failing node, administrators can also be notified to take the node offline and schedule maintenance.
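As a hedged sketch of that kind of decision logic (the failure categories, object names, and actions are illustrative assumptions, not a description of a shipped Pegasus feature):

```python
# Illustrative failure-handling policy driven by a diagnosed anomaly cause.
# `scheduler` and `ops` are hypothetical stand-ins for a job scheduler and
# an operations/notification service.

def handle_failure(job, diagnosis, scheduler, ops):
    """React to a failed job based on the diagnosed cause of the anomaly."""
    if diagnosis == "insufficient_memory":
        # Resubmit with a larger memory request, on a resource that can satisfy it.
        scheduler.resubmit(job, min_memory_gb=2 * job.requested_memory_gb)
    elif diagnosis == "node_failure":
        # Retry elsewhere and let administrators drain the suspect node.
        scheduler.resubmit(job, exclude_nodes=[job.node])
        ops.notify(f"Node {job.node} suspected faulty; please schedule maintenance.")
    else:
        scheduler.resubmit(job)  # default: simple retry
```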
Agentic AI is another area where innovation is moving fast, driven primarily by industry, which is developing agentic workflows to schedule human workers, orchestrate industrial processes, control robots and other assets, and collaborate on orchestration tasks. Although there is enthusiasm for using agents, there are still unresolved challenges in safety assurance, cybersecurity, reliability, transparency, and governance of agentic systems operating in complex real-world environments.
Agentic technologies and AI methods are increasingly being applied in scientific domains, for instance in weather prediction, wildlife detection, and the emerging concept of self-driving laboratories. The latter are capturing the imagination of researchers across disciplines as tools for automating the design, characterization, and experimentation involved in creating new materials, proteins, batteries, and more. Even today, AI is beginning to be integrated directly into experimental facilities (as in the CMS example I mentioned earlier). However, realizing the full vision of self-driving labs will require addressing not only the technical challenges of robustness and cybersecurity already mentioned, but also critical issues such as result validation and verification, model interpretability, ethical considerations, and the evolving role of humans within the automated scientific lifecycle.
[1] Deelman, E., Vahi, K., Rynge, M., Mayani, R., Ferreira da Silva, R., Papadimitriou, G., & Livny, M. (2019). The Evolution of the Pegasus Workflow Management Software. Computing in Science & Engineering, 21(4), 22–36. https://doi.org/10.1109/MCSE.2019.2919690
Dr. Ewa Deelman will receive her award at SC 2025.