NK Landscape Protein IFP Benchmark Suite

Introduction

This benchmark suite consists of nominal NK Landscape model instances designed to mimic the properties of one challenging optimisation problem from biology: the Inverse Folding Problem (IFP), here focusing on a simpler secondary structure version. The IFP problem consists in finding sequences that fold into a defined structure and is an important research problem that is at the heart of most rational protein design approaches. With a simple definition based on the well known NK Model, the motivation is to make the IFP problem more accessible to algorithm specialists and model experts contrary to being a problem solved mostly by bioinformaticians with main expertise in other fields.

Inverse Folding Problem (IFP)

The structure of a protein can be divided into different levels. The primary structure is the protein sequence of N amino acids {aai} where 1 ≤ i ≤ N is the residue position. The secondary structure defines or annotates the organisation of helices, sheets and loops of the tertiary structure and can be expressed by a type {Ti} ∈ {H, E, L} for each position i in the protein. The tertiary structure completely describes the arrangement of all atoms of a single sequence in three-dimensional space:

The solution to an instance of the protein Inverse Folding Problem (IFP) is ideally the complete set of all sequences that fold into the given reference protein structure. In practice researchers focus on finding a limited number of new matching sequences which are as different as possible from the original reference sequence.

The NKL Model

The NK Model introduced by Kaufmann is a tunable rugged fitness function designed to model complex epistatic links among variables, to study topics such as gene-interaction. A central feature of the model is its stochastic design which opens up possibilities for statistical analysis of its properties without exact knowledge of all underlying epistatic interactions. The nominal discrete NKL model proposed by Li et al. is of interest, as L denotes the possible values at each allele location with L = 2 defining the binary case corresponding to the original NK Model.

The proposed NKL model

The proposed NK model is a variation of the standard NKL model. It omits the contribution of the ith position in x, hence K for an identical neighborhood will be one larger than in the original model and the maximum K becomes K = N. This is a minor change that allows to re-create epistatic link effects of the target IFP problem. In addition, the model uses a single function F₀ instead of N different Fi functions:

The novelty is the combination of two NKL Models, F^A(x) and F^B(x), with different K and different neighborhood definitions by a simple multiplication:

Acknowledgments

Work funded by the National Research Fund of Luxembourg (FNR) as part of the EVOPERF project at the University of Luxembourg with the AFR contract no. 1356145.