Larry Smarr on The Rise of Supernetwork Data Intensive Computing

Sam Fried

“So even though we had just done the SIGGRAPH satellite hookup in ’89, by ’91 we were able to do these remote collaboration demos for real. I want to acknowledge that the annual SC conferences have always been critical to our progress, because we used them to drive things forward. For instance, SC ’95 was in San Diego, with Sid Karin as the overall SC ’95 Chair while I served as Program Chair. A group of us, including Tom DeFanti, Ian Foster, and Rick Stevens, organized the I-WAY project to run at the conference. Linda Winkler was absolutely critical for making this all work. I-WAY featured the first interconnection of the telcos’ 155-megabit/s networks. Researchers across the country carried out 65 different science projects on this distributed system using what we called in those days “high-speed networking”. More importantly, the I-Soft software system that Ian Foster developed to control that distributed cyberinfrastructure led directly to the Globus project, which is still used everywhere,” said Smarr.

“I remember sitting on the couch with Miron Livny in my office when we made the decision to name this the National Technology Grid. The NSF-funded vBNS directly led to the formation of Internet2, which is now the dominant network connecting all the universities in the country,” said Smarr.

Smarr credits the building of Roadrunner, a COTS-based Linux “supercluster” at the University of New Mexico, and placing it on the National Technology Grid as a watershed moment. He noted that David Bader, who led its development, received the 2021 Sidney Fernbach Award at SC21, in part for work on Roadrunner. “This was really the golden spike, in my view, in creating the distributed systems we use today.”

Another critical development was the deployment of I-Wire in Illinois and, at about the same time, I-Light in Indiana, both of which were state dark-fiber networks. “These really were very inspirational; they were copied widely,” said Smarr, who transitioned from the University of Illinois to UCSD around 2000. “When I got to California, a year later I found that CENIC (Corporation for Education Network Initiatives in California) was still buying bandwidth from Verizon and AT&T and did not have a dark fiber system. I joined the nonprofit board of CENIC, and within a few years, we started buying our own fiber. Today CENIC has 8,000 miles of owned and managed fiber. So, this growth over the last 20 years of these regional optical networks is a critical part of what lets us build today’s distributed systems, such as the Pacific Research Platform.”

“We put together a team to create a proposal to that ITR [NSF Information Technology Research] program, and we were able to win one called the OptIPuter. The real question at that time was this: we had a lot of clusters, Beowulf clusters and others like those David Bader had developed, but their backplanes were much faster than the wide area network. So compute clusters were basically digital islands. The OptIPuter team had the idea that, because of the fiber optics, we would be able to make the wide area bandwidth equal to the cluster bandwidth and remove the locality issue,” recalled Smarr.

Working with funds from the OptIPuter grant, the team was able to build a prototype and interconnect with the NSF TeraGrid over what was then the National LambdaRail. This happened as 4K video streaming was just emerging. “We had the first 4K projector in the United States at Calit2, which I founded in 2000,” said Smarr.

“As all this developed, DOE, and particularly ESnet, were able to codify the notion of what it meant to have this kind of high-speed networking for scientific computing and instrument data on a campus. This led to their Science DMZ concept in 2010, in which you needed a Data Transfer Node (DTN) to terminate the high-speed optical fiber, as we had learned from the gigabit testbeds, with network performance monitoring being carried out by perfSONAR,” said Smarr.
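To make the monitoring piece concrete, here is a minimal sketch, written against an assumed setup rather than any particular site’s configuration, of how an on-demand perfSONAR throughput test could be launched from a DTN using the pscheduler command-line tool that ships with the perfSONAR toolkit. The destination host name is a placeholder.

# Minimal sketch: trigger a one-off perfSONAR throughput test from a DTN.
# Assumes the perfSONAR toolkit (and thus the pscheduler CLI) is installed
# locally and that the destination host also runs perfSONAR.
import subprocess

def run_throughput_test(dest: str) -> str:
    """Run a pscheduler throughput test to `dest` and return its text output."""
    result = subprocess.run(
        ["pscheduler", "task", "throughput", "--dest", dest],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Hypothetical remote perfSONAR host on another Science DMZ.
    print(run_throughput_test("perfsonar.example.edu"))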

“NSF heard about this, and it created a second great example of NSF adopting from DOE (the first being setting up the NSF Supercomputer Centers). NSF created the Campus Cyberinfrastructure (CC*) program that Kevin Thompson has been running since 2012; [it] has made over 340 awards, funding these optical-fiber-connected Science DMZ cyberinfrastructure systems on campuses across the country,” said Smarr.

The next obvious step, said Smarr, was to take all the campus Science DMZ deployments that NSF had funded and hook them together. By 2015 CENIC had [its own] elaborate fiber-optic system that connected all the California campuses and supercomputer centers (NASA, DOE and NSF) together. “We put a proposal together six years ago called the Pacific Research Platform [PRP] to see if we could create a regional Science DMZ,” he said.

“We’ve created a PRP website over the last few years that has everything you need to get started on the PRP. It includes all the technical details of how to use Jupyter notebooks, how to containerize, and so forth. I recommend you go there if you’re interested,” said Smarr.

To solve the host I/O bottleneck that the Gigabit Testbeds discovered, “We’ve developed these DTNs we call Flash I/O Network Appliances, or FIONAs. Flash made all the difference. It solved the host I/O data bottleneck, the Achilles heel that Bob Kahn talked about with the gigabit testbeds, but now at 10, 40, or 100 times the speed that the gigabit testbeds had. They are just PCs with terabytes of flash memory and multicore CPUs. But then you can put in eight gaming 32-bit GPUs and turn it into a machine learning platform. So, these FIONAs are the endpoints that the fiber optics terminate in, and they’re on your campus. Because they’re commodity, if you just get the specs you can order them yourself,” said Smarr.

Of course, the science community was not the only one vigorously pursuing distributed computing. Commercial cloud vendors also have racks of PC systems tied together with fiber optics and face similar challenges in managing user access and software execution.

“Google developed Kubernetes, and then made it open source, as a way to orchestrate containers across a global set of optical-fiber-connected PCs. Now, if you’ve containerized your application, you have this global digital fabric on which Kubernetes can move your things around, and you can run Ceph as the storage system under Rook, which runs under Kubernetes. That means we can manage petabytes of distributed storage and GPUs for data science use,” said Smarr, crediting work by John Graham to accomplish this.
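To illustrate what that orchestration looks like in practice, here is a minimal, hypothetical sketch (not the PRP’s actual deployment code) that uses the Kubernetes Python client to submit a containerized GPU job with a persistent volume claim backed by a Rook-managed Ceph storage class. The namespace, container image, and storage-class name are assumptions for illustration.

# Minimal sketch: submit a GPU-accelerated, containerized job to a Kubernetes
# cluster whose storage is provided by Rook/Ceph. Assumes a valid kubeconfig
# and a Rook-provisioned storage class named "rook-ceph-block" (illustrative).
from kubernetes import client, config

config.load_kube_config()  # use local kubeconfig credentials

# Persistent volume claim against the assumed Rook/Ceph storage class.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="demo-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="rook-ceph-block",
        resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
    ),
)

# Pod that mounts the claim and requests one NVIDIA GPU.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ml-demo"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="tensorflow/tensorflow:latest-gpu",  # illustrative image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
                volume_mounts=[client.V1VolumeMount(name="data", mount_path="/data")],
            )
        ],
        volumes=[
            client.V1Volume(
                name="data",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="demo-data"
                ),
            )
        ],
    ),
)

v1 = client.CoreV1Api()
v1.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
v1.create_namespaced_pod(namespace="default", body=pod)

Because both the application and its storage request are declared this way, the scheduler is free to place the job on any node in the fabric that has a free GPU, which is the point Smarr makes about Kubernetes moving “your things around.”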

The PRP kept expanding beyond the West Coast and across the country. “Now we have 184 of these [FIONAs] located on over 25 campuses, and they’re all interconnected at 10 to 100 gigabits per second. Together the PRP FIONAs contain nearly 7,500 CPU cores and over 500 32-bit GPUs, as well as four petabytes of rotating storage, all running over CENIC, the Quilt regional networks, and Internet2. The extended PRP runs all the way from Korea and Guam to Hawaii, across the U.S., and on to Amsterdam in Europe,” said Smarr.

There were speedbumps along the way – this is science after all – and Smarr recalled that early efforts to use the FIONAs widely had problems. “When we got started in 2016, we tried to do 10-gigabyte file transfers between all pairs of the FIONAs and then measure them. However, you can see that back in 2016 the orange [on this slide] means we couldn’t even get the thing to read; we got basically no throughput; it was broken. But a year and a half later, you can see the green, indicating that we had accomplished disk-to-disk file transfer at five gigabits per second, and this was pretty much uniform. We also added traceroute to determine the state of all the network segments between the PRP FIONAs. These measurements have by now all been containerized.”
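For a sense of what such all-to-all measurements involve, the sketch below, which is only an illustration and not the PRP’s actual containerized measurement tooling, probes a pair of hypothetical FIONA endpoints with iperf3 for throughput and traceroute for path state. Both tools must be installed on the probing node, and the host names are placeholders.

# Minimal sketch: probe remote DTN-style endpoints for throughput and path.
# Assumes iperf3 servers are running on the peers and traceroute is installed.
import json
import subprocess

# Hypothetical FIONA endpoints to probe from this node.
PEERS = ["fiona1.example.edu", "fiona2.example.edu"]

def throughput_gbps(host: str) -> float:
    """Run a 10-second iperf3 test to `host` and return the sender rate in Gb/s."""
    out = subprocess.run(
        ["iperf3", "-c", host, "-t", "10", "-J"],
        capture_output=True, text=True, check=True,
    )
    result = json.loads(out.stdout)
    return result["end"]["sum_sent"]["bits_per_second"] / 1e9

def route(host: str) -> str:
    """Record the network path to `host` with traceroute."""
    out = subprocess.run(
        ["traceroute", host], capture_output=True, text=True, check=True
    )
    return out.stdout

if __name__ == "__main__":
    for peer in PEERS:
        print(f"{peer}: {throughput_gbps(peer):.2f} Gb/s")
        print(route(peer))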

There was a great deal more to Smarr’s talk. Maybe SC21 will make a video of it widely available. That said, it’s important to remember how transformational this work is for enabling the pursuit of science using powerful distributed computing infrastructure. Smarr cited a few examples. Here’s one.

“We built this [research platform] to enhance multi-campus, data-intensive applications. We brought in a bunch of application teams from particle physics, telescope surveys, atmospheric sciences, biomedicine, and virtual reality as drivers for this,” said Smarr. “Just to give you an example, this is a global precipitation map from the MERRA NASA satellite archives that send data to a center for hydrometeorology in Irvine to see all the developing precipitation systems.”

“As you’ve been reading, the atmospheric rivers just wiped out a lot of the Northwest US and Western Canada. You need to be able to predict those extreme precipitation events, and you want to be able to do it in real time,” he said. “So, there’s a center for Western weather in San Diego that looks at all of this. We wanted to link the NASA satellite archives to the Irvine center and then to the San Diego center. Well, the workflow was taking 19 days to do one round, and that’s not predictive. But by using the PRP distributed workflow with its machine learning, and optimizing that on the FIONAs, we were able to get it from 19 days down to 52 minutes, which means it goes from being retrospective to predictive.”

So where are we going in the future, Smarr asked.

“The first thing is, we’ve just gotten two awards from NSF that are going to allow us to increase participation by a number of Minority Serving Institutions by building FIONAs on those campuses, as well as bringing in more EPSCoR states, the states least funded by NSF, in South Carolina and up in Nebraska, and increasing the spread of the PRP across the country,” he said.

Secondly, he said, “The NSF XSEDE program, which funds the supercomputer centers, has funded a distributed version of SDSC called Expanse that uses the same kind of composable systems [and] Ceph storage as the PRP and the Open Science Grid. We have already federated the containerized applications on the PRP, Expanse, and the Open Science Grid. Expanse is a five-year proposal.”

“Third, the NSF has just funded Frank Wuerthwein, a PRP co-PI, to build the NRP, a prototype of the National Research Platform. This is going to bring in a lot of emerging computational systems like FPGAs – as well as 64-bit GPUs like those used on the supercomputers – and more distributed data systems, building on the work he’s done as executive director of the Open Science Grid.”

“I think you’re going to see this type of cyberinfrastructure become a permanent part of the research landscape – a completely distributed high-performance computing, networking, storage and analysis facility,” said Smarr.

“As I said, even though NSF is funding it in our country, we already have seen that it’s being interconnected to Europe and the Pacific Rim as well. In closing, I wanted to say that this talk is not meant as a scholarly historical analysis of the last 40 years. Rather, it’s my personal recollection of key moments that I was personally involved in.”

Dr. Larry Smarr is a University of California San Diego Distinguished Professor Emeritus in UCSD’s Department of Computer Science and Engineering. For the last 20 years he served as the founding Director of the California Institute for Telecommunications and Information Technology (Calit2). He was earlier the Founding Director of the National Center for Supercomputing Applications (NCSA). In 2006, he received the IEEE Computer Society Tsutomu Kanai Award for his lifetime achievements in distributed computing. Smarr continues to provide national leadership in advanced cyberinfrastructure (CI), currently serving as Principal Investigator on three NSF CI research grants: the NSF Pacific Research Platform, Cognitive Hardware and Software Ecosystem Community Infrastructure, and Toward the National Research Platform. He is a member of the National Academy of Engineering and a Fellow of the American Physical Society and has served on top level advisory committees to NIH, NASA, NSF, and DOE.
