SHREC 2020: Track on multi-domain protein shape retrieval


Envisioned task

The aim of this track is to assess the performance of shape retrieval algorithms on a dataset of related multi-domain protein surfaces.

Proteins are complex macromolecules made up of hundreds to millions of atoms, and are usually classified according to their function in the cellular environment. Most proteins are modular: they consist of multiple domains, each bearing a specific function (the ability to bind DNA, for instance). Proteins usually display motions reflecting (1) the relative motion of their atoms and (2) their ability to undergo small to large conformational changes in order to perform their cellular activities, notably through surface binding with other proteins (Protein-Protein Interactions, PPIs). Proteins can be described as non-rigid surfaces representing their solvent-excluded surface as defined by Connolly (Connolly, J Appl Cryst. 1983). Detecting partial similarities and/or dissimilarities between a large number of related protein surfaces (all surfaces from all proteins of a cell, for instance) is of major importance in drug discovery pipelines, adverse drug event prediction and the characterization of molecular processes and diseases.

This track proposes a set of 588 surfaces (provided as .off files) representing the conformational space of 7 proteins. Compared to the previous Protein Shape Retrieval contests, we focus on evaluating the retrieval of multi-domain protein surfaces from 26 orthologous proteins (proteins having the same activity in different organisms, e.g. the human and murine haemoglobin proteins), in addition to the usual evaluation of the retrieval of the different conformers of a given protein.

Dataset and Ground Truth

The dataset is based on the evolutionary and structural relationships as defined in the SCOPe v2.07 database (Fox et al, Nucleic Acids Research, 2014; Chandonia et al, Nucleic Acids Research, 2019). We retained only the entries from X-ray crystallographic Protein Data Bank (Berman et al, Nucleic Acids Research, 2000) structures. For each protein, only the largest domain is considered, so that each protein belongs to only one class.

The structures were retrieved and protonated using propka (Søndergaard et al, Journal of Chemical Theory and Computation, 2011; Olsson et al, Journal of Chemical Theory and Computation, 2011). All solvent-excluded surfaces (SES) were calculated using EDTSurf (Xu et al, PLoS ONE, 2009).

The participants are asked to produce a distance-to-the-query dissimilarity matrix, using each provided .off file as a query. The ground truth is derived from the SCOPe v2.07 database hierarchical classification; only the two lowest levels of the database (Species and Proteins, respectively) are used to generate the ground truth, and will be analyzed for the final report.
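As an illustration of the requested output, the sketch below reads OFF meshes and builds a square dissimilarity matrix in which row i holds the distances from query i to every surface. It is not a proposed method: `parse_off`, `toy_descriptor` and `dissimilarity_matrix` are hypothetical helper names, the descriptor (a normalised histogram of vertex-to-centroid distances) is only a placeholder for a real shape descriptor, and the parser assumes the plain OFF layout (an `OFF` header token, the vertex/face/edge counts, then vertex and face records without colour fields).

```python
import math

def parse_off(text):
    """Parse the text of a plain OFF file into (vertices, faces)."""
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            tokens.extend(line.split())
    if not tokens or tokens[0] != "OFF":
        raise ValueError("not an OFF file")
    n_vertices, n_faces = int(tokens[1]), int(tokens[2])
    pos = 4  # skip 'OFF' and the vertex, face and edge counts
    vertices = []
    for _ in range(n_vertices):
        vertices.append(tuple(float(t) for t in tokens[pos:pos + 3]))
        pos += 3
    faces = []
    for _ in range(n_faces):
        k = int(tokens[pos])  # number of vertex indices in this face
        faces.append(tuple(int(t) for t in tokens[pos + 1:pos + 1 + k]))
        pos += 1 + k
    return vertices, faces

def toy_descriptor(vertices, bins=8):
    """Placeholder descriptor: histogram of vertex-to-centroid distances."""
    n = len(vertices)
    centroid = [sum(v[i] for v in vertices) / n for i in range(3)]
    dists = [math.dist(v, centroid) for v in vertices]
    d_max = max(dists) or 1.0  # avoid dividing by zero for degenerate input
    hist = [0.0] * bins
    for d in dists:
        hist[min(int(d / d_max * bins), bins - 1)] += 1.0 / n
    return hist

def dissimilarity_matrix(descriptors):
    """Row i holds the Euclidean distances from query i to every descriptor."""
    return [[math.dist(a, b) for b in descriptors] for a in descriptors]

# A tiny tetrahedron stands in for one of the provided .off files.
TETRA = """OFF
4 4 6
0 0 0
1 0 0
0 1 0
0 0 1
3 0 1 2
3 0 1 3
3 0 2 3
3 1 2 3
"""

vertices, faces = parse_off(TETRA)
scaled = [(2 * x, 2 * y, 2 * z) for x, y, z in vertices]
matrix = dissimilarity_matrix([toy_descriptor(vertices), toy_descriptor(scaled)])
```

With the real dataset, each of the provided .off files would be read from disk and passed through the same steps, giving one row and one column per surface.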

SHREC2020_proteins.cla

SHREC2020_species.cla

OFF files can be downloaded here

SHREC20.tar.gz

Evaluation

Standard metrics of previous shape retrieval experiments will be used: precision-recall evaluation, Nearest Neighbor, First-Tier and Second-Tier, E-measure and Discounted Cumulative Gain. The participants are expected to return their results as a distance matrix file in binary format. The evaluation scripts will be published alongside the dataset.
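For orientation, three of these metrics can be sketched from a dissimilarity matrix and a list of class labels. This is a minimal sketch using the usual definitions (the query itself is excluded from its own ranking; First-Tier and Second-Tier are the recall within the top |C|-1 and 2(|C|-1) results, where |C| is the size of the query's class). The function name `retrieval_scores` is hypothetical, and the official evaluation scripts may differ in details such as tie handling.

```python
def retrieval_scores(dist, labels):
    """Average Nearest Neighbor, First-Tier and Second-Tier over all queries."""
    n = len(labels)
    nn = ft = st = 0.0
    counted = 0
    for q in range(n):
        # Number of other surfaces sharing the query's class.
        relevant = sum(1 for i in range(n) if i != q and labels[i] == labels[q])
        if relevant == 0:
            continue  # a singleton class has nothing to retrieve
        # Rank every other surface by increasing distance to the query.
        ranking = sorted((i for i in range(n) if i != q), key=lambda i: dist[q][i])
        hits = [labels[i] == labels[q] for i in ranking]
        nn += hits[0]
        ft += sum(hits[:relevant]) / relevant
        st += sum(hits[:2 * relevant]) / relevant
        counted += 1
    if counted == 0:
        raise ValueError("no query has another member of its class")
    return {"NN": nn / counted, "FT": ft / counted, "ST": st / counted}

# Toy example: two classes of two surfaces each, perfectly separated.
labels = ["a", "a", "b", "b"]
dist = [
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 1],
    [5, 5, 1, 0],
]
scores = retrieval_scores(dist, labels)
```

In the track itself, the labels would come from the Species and Proteins classification files, giving one set of scores per level.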

It is important that participants report the runtimes of their calculations, since runtime is critical information for processing large datasets, notably in this context of molecular shapes.

Expected number of participants

All 3DOR experts interested in handling non-conventional shapes with inherent complexity, such as molecular shapes, could be interested. To make the track easily accessible to most participants, we provide the meshes of the proteins' solvent-excluded surfaces in the .off format. The track proposal will also be forwarded to the participants of the previous SHREC'17, SHREC'18 and SHREC'19 protein shape retrieval tracks to broaden the audience.

 

Schedule timeline

Feb 24, 2020 - The dataset is made available on shrec2020.drugdesign.fr. The participants are allowed to run their calculations.

Mar 9, 2020 - Registration deadline. Registration must be sent to Matthieu Montès and Florent Langenfeld.

April 20, 2020 - Submission deadline of the results to the organizers. Each participant may submit a maximum of 3 distance matrices. Each participant also writes a brief summary, to be included in the track report, and submits it with the results.

April 22, 2020 - The organizers circulate the evaluation of all participants of the track, and release the ground truth.

April 27, 2020 - The organizers send a draft of the track report to the participants for reviews, comments and feedback.

May 03, 2020 - The track report is submitted for review.

Sep 4-5, 2020 - Eurographics Workshop on 3D Object Retrieval 2020 (3DOR)


Organizers 

Matthieu Montès - Conservatoire National des Arts-et-Métiers 
Florent Langenfeld - Conservatoire National des Arts-et-Métiers