Democratizing Access to Cancer-Related Proteomic Data - BioProcess International

The US National Cancer Institute (NCI) has launched the Proteomic Data Commons (PDC), a next-generation proteomic data repository that will facilitate data access, sharing, and analysis to speed development of precision-medicine therapies for cancer. Housed within the NCI’s Cancer Research Data Commons (CRDC), PDC hosts several multiomic data sets that have corresponding genomics and imaging data elsewhere. Bringing those data sets together simplifies access to them and enables integrative research.

Increase Access to Enable Discovery
Mass spectrometry (MS)–based proteomic profiling for pancancer analyses could enhance our understanding of cancers’ underlying genomic events and help identify potentially meaningful changes at the proteomic level that otherwise are missed at the genomic level. Currently, researchers’ ability to access such diverse data sets and perform robust and reproducible analyses is stifled by the siloed nature of the informatics infrastructure. Given that public money funds most cancer research, it is imperative not only to make these data freely available, but also to render them in a form that encourages reuse. Doing so eventually could maximize return on those public investments.

PDC provides the largest collection of freely available cancer proteomic data in a highly scalable cloud-based infrastructure that brings analytical tools to the data rather than the reverse. The CRDC ecosystem eliminates the need for researchers to download and store extremely large data sets by letting them use best-practice tools and pipelines that already have been implemented. Alternatively, researchers may bring their own tools to the data in the cloud instead of following the traditional process of moving the data to tools on local hardware.

Harmonizing the Data
Data standards are instrumental in the advancement of cancer research and are crucial for public data sharing. Through a rigorous harmonization process, PDC ensures that all data are standardized using community-based ontologies and controlled vocabularies and formats, thus providing a great opportunity for researchers to use, reuse, reprocess, and repurpose data to drive new discoveries.

In the past, researchers analyzed data sets with disparate computational pipelines, which complicated comparisons across data sets. The PDC instead harmonizes proteomic data with a common set of analytic pipelines, making it easier to compare different samples, cancer types, and experimental approaches. Such harmonization could revolutionize precision medicine.

PDC By the Numbers

The Proteomic Data Commons puts researchers in touch with:

46 studies
1,906 cases listed by disease, including ovarian serous cystadenocarcinoma, breast invasive carcinoma, and pediatric and adolescent/young-adult brain tumors
71,627 data files
>357,000,000 MS–MS spectra (recorded at a 1% false discovery rate, FDR)
>1,000,000 distinct peptides (identified by spectral match analysis at 1% FDR)
14,890 distinct proteins (identified in at least one analytical sample with a maximum 1% FDR).

Numbers as of 1 September 2020.

Technical Capability
ESAC, Inc., which provides research data management and bioinformatics information technology (IT) for government, commercial, and academic researchers, partnered with the NCI in 2017 to design and build the PDC. Now, the repository is making biomedical data sets accessible and connected at an unprecedented scale to facilitate creative ways to combine, analyze, and ask questions that drive precision medicine.

Hosted on the Amazon Web Services (AWS) cloud platform, the PDC provides access to highly curated and standardized biospecimen, clinical, and proteomic data through an intuitive interface that can filter, query, search, visualize, and download data and metadata. Robust application programming interfaces (APIs) are available to ensure that bioinformaticians and other researchers can access data programmatically for analysis with cloud-native and -agnostic applications alike. Thus, PDC users can access data from anywhere in the world quickly, easily, and securely.
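To illustrate what programmatic access of this kind might look like, the following Python sketch builds (but does not send) an HTTP request for cases of a single disease type. It is a minimal sketch under stated assumptions, not documentation of the PDC interface: the `/graphql` endpoint path, the GraphQL-style query, and the field names `cases`, `case_id`, and `disease_type` are hypothetical placeholders chosen for illustration. Consult the PDC's own API documentation for the actual endpoints and schema.

```python
import json
import urllib.request

# ASSUMPTION: the "/graphql" path is a hypothetical placeholder for illustration;
# only the base site https://pdc.cancer.gov is named in the article.
API_URL = "https://pdc.cancer.gov/graphql"


def build_case_query(disease_type: str, limit: int = 10) -> urllib.request.Request:
    """Build, without sending, a POST request asking for cases of one disease type."""
    # ASSUMPTION: the query shape and field names below are hypothetical.
    query = """
    query ($disease: String, $limit: Int) {
      cases(disease_type: $disease, limit: $limit) { case_id disease_type }
    }
    """
    payload = json.dumps(
        {"query": query, "variables": {"disease": disease_type, "limit": limit}}
    )
    return urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only inspect the prepared request; no network call is made here.
    req = build_case_query("Breast Invasive Carcinoma", limit=5)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Separating request construction from the network call keeps the query logic easy to inspect and test offline. A script running in the cloud next to the data, or on a local workstation, would pass the prepared request to `fetch` once the real endpoint and schema are confirmed.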

Access the PDC at https://pdc.cancer.gov. For more CRDC resources, visit https://datascience.cancer.gov/data-commons.

Ratna Rajesh Thangudu is a bioinformatics scientist at ESAC, Inc., 1801 Research Boulevard, Suite 500, Rockville, MD 20850; ratna.thangudu@esacinc.com.