At Oregon State University’s College of Engineering, interest in graphics processing unit (GPU) computing has been growing steadily. Students, staff, and faculty agreed that the University should fully embrace the emerging trend.
“Artificial intelligence, machine learning, parallel programming: those are all really hot items right now,” says Todd Shechter, Director of Information Technology at Oregon State University College of Engineering. “We are seeing a lot of interest in the undergraduate curriculum space.”
As the College of Engineering developed a strategy for provisioning high-power GPU computing capabilities, it worked with NVIDIA® and Microway to find a cutting-edge solution. The resulting investment in six new supercomputers has improved the school’s educational and research capabilities and helped its machine learning and artificial intelligence group grow into an increasingly important presence.
Six NVIDIA DGX-2 systems support machine learning and artificial intelligence
The College of Engineering makes up about a third of Oregon State University, with about 10,000 students, staff, and faculty. At the outset of the decision-making process, faculty and administrators gathered to define the characteristics of an ideal campus-wide computing resource. The University required enough GPU capacity to serve the diverse needs of undergraduate classes as well as research workloads, plus super-fast storage. The solution had to scale, yet it also had to represent a dramatic leap in capability.
The team opted for the NVIDIA DGX-2™ enterprise AI research system, partly because of NVIDIA’s containerized software, delivered as Docker images, and partly because of technical support considerations. Above all, the team chose it for its computational horsepower: each appliance delivers up to 2 petaFLOPS of AI performance, an entire supercomputer’s worth of computing power in a single system.
Once the team identified the DGX-2 as the right fit, it had to determine how many systems would be needed. The University hosted workshops with faculty, administrators, and NVIDIA to learn how researchers planned to use the new technology in areas such as medical imaging, nuclear research, bridge construction, robotics, and driverless vehicles. “What we learned is there is a lot of interest in GPU computing, but we had a hardware gap,” Shechter says.
Oregon State was not entirely without GPU capabilities before the upgrade, but its consumer-model GPUs were built for gaming. Designed for single-precision work, they lacked the double-precision performance that critical scientific applications require, and there was no effective way to link GPUs and systems together.
Scattered and lacking any efficient interconnect, the preexisting GPU infrastructure was simply unwieldy. There was no way to unlock the true aggregate power of the available systems.
All that would change. In addition to adding raw hardware to Oregon State’s GPU resources, the new DGX-2 systems would address common scaling problems by using the NVSwitch architecture and NVLink™ high-speed GPU-to-GPU interconnects inside each appliance.
Each DGX-2 packs 16 fully connected Tesla® V100 GPUs linked by these technologies. The result is a single system that offers capability equivalent to dozens of existing GPU nodes or hundreds of CPU-only servers, enough for many of the individual jobs run on campus.
After factoring in myriad use cases from researchers, enrollment figures for technology-enabled undergraduate courses, and an analysis of how heavily the existing infrastructure was overused, Oregon State concluded that a major scale-up was warranted. “So that the experience remains authentic and we aren’t trying to cram everyone onto a single machine, we came up with the number six,” says Shechter.
The proposed increase in scale was not only about capacity planning; it was also about unlocking new possibilities. Among the community’s diverse needs were requests to dramatically scale up its computational science.
That is where clustering the DGX-2 systems came into play. New Mellanox InfiniBand networking would link the DGX-2s, enabling jobs that require more than one system.
The cluster of six DGX-2s with InfiniBand connectivity gave Oregon State a linked network of 96 Tesla V100 GPUs, among the world’s most powerful computational engines.
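The cluster totals follow directly from the per-system figures quoted earlier. A minimal Python sketch, assuming the 2-petaFLOPS number is NVIDIA’s peak AI rating rather than sustained throughput; the function and constant names are illustrative only.

```python
# Back-of-the-envelope totals for a cluster of DGX-2 systems, using only the
# per-system figures quoted in this article (16 GPUs, up to 2 petaFLOPS AI).
GPUS_PER_SYSTEM = 16
PEAK_AI_PFLOPS_PER_SYSTEM = 2.0  # vendor peak rating, not sustained throughput


def cluster_totals(systems: int) -> dict:
    """Return the aggregate GPU count and peak AI petaFLOPS for `systems` DGX-2s."""
    return {
        "gpus": systems * GPUS_PER_SYSTEM,
        "peak_ai_pflops": systems * PEAK_AI_PFLOPS_PER_SYSTEM,
    }


if __name__ == "__main__":
    totals = cluster_totals(6)
    # prints "96 GPUs, up to 12 petaFLOPS peak AI performance"
    print(f"{totals['gpus']} GPUs, up to {totals['peak_ai_pflops']:.0f} petaFLOPS peak AI performance")
```

Peak figures simply add across systems; how much of that a real multi-system job achieves depends on the workload and on communication overhead across the InfiniBand links.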
The improved fabric linked computing resources more completely, with higher bandwidth and lower latency, than any of the campus’s earlier computational resources. Larger datasets, more experiments, higher-precision simulations, and more accurate results all became feasible across the campus community.
Making the plan a reality calls for expert installation
Deciding on a solution and getting it up and running are two separate achievements. Oregon State’s IT professionals had a number of questions about installing and setting up the new systems, particularly about power. What type of power delivery should they use? How could they make sure the amperage, voltage, wattage, and other details matched the systems’ requirements?
These were not trivial questions. Each DGX-2 draws about 10 kilowatts (kW) of power, the equivalent of more than eight homes or three to four traditional server racks full of systems. Although the DGX-2s are extremely efficient for the throughput they deliver, installing them in the datacenter required very careful planning.
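Part of that planning is simple arithmetic: total load, and the current a feed must carry at a given voltage. A minimal Python sketch, assuming the article’s figure of about 10 kW per system, a hypothetical 208 V single-phase feed, and a unity power factor; these are illustrative values, not Oregon State’s actual electrical design.

```python
# Rough power arithmetic for the DGX-2 deployment. The 10 kW per system comes
# from this article; the 208 V single-phase feed and unity power factor are
# hypothetical values chosen only for illustration.
KW_PER_SYSTEM = 10.0


def total_kw(systems: int, kw_per_system: float = KW_PER_SYSTEM) -> float:
    """Aggregate draw in kilowatts if every system runs at the quoted level."""
    return systems * kw_per_system


def current_amps(watts: float, volts: float, power_factor: float = 1.0) -> float:
    """Single-phase current estimate: I = P / (V * PF)."""
    return watts / (volts * power_factor)


if __name__ == "__main__":
    print(f"Six systems: about {total_kw(6):.0f} kW")  # prints "Six systems: about 60 kW"
    amps = current_amps(KW_PER_SYSTEM * 1000, volts=208)
    print(f"One system at 208 V: about {amps:.1f} A")  # prints "One system at 208 V: about 48.1 A"
```

Real installations add complications this sketch ignores, such as three-phase feeds, redundant power supplies, and breaker sizing rules, which is why the vendor’s published ratings and experienced installers matter.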
Microway, Inc., NVIDIA Partner Network HPC Partner of the Year, installed the DGX-2 deployment at Oregon State. “Microway was really great at helping us through the nitty-gritty details,” Shechter says.
Supercomputing as an education tool
For Oregon State, the new DGX-2s combine capability with flexibility. They handle both single-precision and double-precision workloads and can process data in essentially any form. They can be tied together into a cohesive computing unit when large jobs require it, yet they are also easily partitioned for smaller projects.
Oregon State’s six DGX-2 systems will not be reserved solely for researchers, graduate students, and faculty. These world-class computational resources will also serve as a teaching tool for undergraduates interested in experimenting with the latest technologies.
“When we teach an undergraduate class in parallel programming, machine learning, or artificial intelligence, we have the processing power to back up what we teach,” says Shechter.
Embracing the campus’s new supercomputers
“I’ve had faculty members who have run their simulations on existing hardware, and then run it on DGX hardware, and the difference just blows your mind away,” Shechter says. “We’re really hoping that this helps our faculty members produce results that they can then share broadly in their communities.”
From the student perspective, the resource provides opportunities that extend beyond the classroom. A student-led team developing a driverless electric car uses the DGX-2s for many of the simulations the young engineers must run to hone their designs and code. The team had previously worked only on conventional combustion-engine vehicles, but the new computing capabilities have enabled it to tackle the future of mobility.
Oregon State’s bold decision to invest in six new supercomputers is intended to strengthen its educational community. “We want to attract the very best to our campus, whether student, staff, or faculty, and we think the DGX-2 is going to go a long way to show how serious we are about that,” Shechter explains.