U.S. Department of Energy Office of Biological and Environmental Research

BER Research Highlights

A Game Changer: Protein Clustering Powered by Supercomputers
Published: April 23, 2018
Posted: January 24, 2019

New algorithm lets biologists harness massively parallel supercomputers to make sense of a “data deluge.”

The Science
In the world of big data, biologists create data sets containing hundreds of millions of proteins and other cellular components. They apply clustering algorithms to these data sets to identify key patterns. Many of these techniques have been widely used for more than a decade, but they can’t keep up with the torrent of biological data. In fact, few clustering algorithms can handle a biological network with millions of nodes (proteins) and edges (connections). Researchers from Lawrence Berkeley National Laboratory and the Joint Genome Institute took on one of the most popular clustering approaches in modern biology—the Markov Clustering (MCL) algorithm. They modified it to run quickly, efficiently, and at scale on distributed-memory supercomputers.

The Impact
The team’s high-performance algorithm, called HipMCL, handles massive biological networks that were impossible to cluster with the original MCL, without sacrificing MCL’s sensitivity or accuracy. With HipMCL, biologists can identify and characterize novel aspects of microbial communities. Using HipMCL, scientists processed a network with about 70 million nodes and 68 billion edges in a few hours, running on about 140,000 processor cores at the National Energy Research Scientific Computing Center. As an added benefit, HipMCL runs seamlessly on any computing system.

Given an arbitrary graph or network, it is difficult to know the most efficient way to visit all of its nodes and edges. A random walk gets a sense of the graph’s overall structure by exploring it randomly: it starts at a node and moves arbitrarily along an edge to a neighboring node. Because there are many different ways of traveling between nodes in a network, this step is repeated numerous times. Algorithms such as MCL keep running this random-walk process until successive iterations no longer differ significantly. Performing the random walks is by far the most computationally and memory-intensive step in the cluster analysis, and the most efficient way to execute random walks from many nodes of the graph simultaneously is with sparse matrix-matrix multiplication.
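The iterate-until-stable process described above can be sketched with SciPy’s sparse matrices. This is a toy illustration of plain MCL, not the distributed HipMCL implementation; the graph, parameter values, and function names here are made up for the example:

```python
import numpy as np
from scipy.sparse import csr_matrix, identity

def column_normalize(M):
    # Divide each column by its sum so columns are probability distributions.
    return csr_matrix(M.multiply(1.0 / M.sum(axis=0)))

def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov Clustering: alternate expansion (a sparse matrix-matrix
    multiplication, i.e., advancing random walks from every node at once)
    with inflation (entrywise powering) until iterations stop changing."""
    M = csr_matrix(adjacency, dtype=float)
    M = column_normalize(M + identity(M.shape[0], format="csr"))  # add self-loops
    for _ in range(max_iter):
        last = M.copy()
        M = M @ M                                  # expansion: sparse mat-mat multiply
        M = column_normalize(M.power(inflation))   # inflation: favors strong edges
        if abs(M - last).max() < tol:              # iterations barely differ: stop
            break
    return M  # each column's dominant row marks its cluster's attractor node

# Two triangles joined by a single weak edge should split into two clusters.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
result = mcl(A)
labels = result.toarray().argmax(axis=0)  # cluster label for each node
```

The expensive step is the `M @ M` expansion, which is exactly the sparse matrix-matrix multiplication that HipMCL parallelizes across distributed memory.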

The unprecedented scalability of HipMCL comes from its use of state-of-the-art algorithms for sparse matrix manipulation. Berkeley Lab computer scientists developed some of the most scalable parallel algorithms for the sparse matrix-matrix multiplication operation in the GraphBLAS standard and modified one of these state-of-the-art algorithms for HipMCL.
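To give a rough sense of why sparse methods matter at this scale, the sketch below (using SciPy rather than a GraphBLAS library; the graph is randomly generated purely for illustration) performs one expansion step as a single sparse matrix-matrix multiplication and compares the storage against a dense representation:

```python
import numpy as np
import scipy.sparse as sp

n = 10_000  # nodes in this toy graph (HipMCL handles ~70 million)
# A random graph with ~10 edges per node: about 1e5 stored entries
# instead of the 1e8 entries a dense n-by-n matrix would need.
A = sp.random(n, n, density=10 / n, format="csr", random_state=42)

# One expansion step: random walks from all n nodes advance at once
# through a single sparse matrix-matrix multiplication (SpGEMM).
A2 = A @ A

print(f"stored entries before: {A.nnz:,}")
print(f"stored entries after:  {A2.nnz:,}")
print(f"dense entries needed:  {n * n:,}")
```

The product grows denser than the input (each step lets walks reach more neighbors), so distributing both the matrices and the multiplication across many nodes, as HipMCL does, is what makes billion-edge networks tractable.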

Ariful Azad 
Lawrence Berkeley National Laboratory 

Aydin Buluç 
Lawrence Berkeley National Laboratory 

Nikos Kyrpides 
DOE’s Joint Genome Institute and Lawrence Berkeley National Laboratory 

Development of HipMCL was primarily supported by the Department of Energy’s (DOE’s) Office of Science via the Exascale Solutions for Microbiome Analysis (ExaBiome) project, which is developing exascale algorithms and software to address current limitations in metagenomics research. The development of the fundamental ideas behind this research was also supported by DOE’s Office of Advanced Scientific Computing Research’s Applied Math Early Career program. The team used resources at the Joint Genome Institute and the National Energy Research Scientific Computing Center, both DOE Office of Science user facilities.

A. Azad, G.A. Pavlopoulos, C.A. Ouzounis, N.C. Kyrpides, and A. Buluç, “HipMCL: A high-performance parallel implementation of the Markov clustering algorithm for large-scale networks.” Nucleic Acids Research gkx1313 (2018). [DOI: 10.1093/nar/gkx1313]

Related Links
Lawrence Berkeley National Laboratory press release: A Game Changer: Metagenomic Clustering Powered by HPC

Topic Areas:

  • Research Area: Genomic Analysis and Systems Biology
  • Research Area: Microbes and Communities
  • Research Area: DOE Joint Genome Institute (JGI)
  • Research Area: Computational Biology, Bioinformatics, Modeling
  • Cross-Cutting: Scientific Computing and SciDAC

Division: SC-23.2 Biological Systems Science Division, BER


BER supports basic research and scientific user facilities to advance DOE missions in energy and environment.
