What is pepADMET?

In drug development, poor efficacy and safety issues are the major reasons for failure. Evaluating the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of candidate agents at an early stage therefore plays a vital role in R&D. Here, we have developed pepADMET, the first comprehensive peptide ADMET prediction platform. Built on an extensive database of 36,643 entries, pepADMET provides users with predictions for 29 ADMET endpoints, together with systematic analyses, to accelerate peptide drug discovery.

Features

  • Over 10 years of data mining
  • More than 36,000 entries
  • 29 key ADMET properties
  • Supports cyclic, linear, natural, and modified peptides
  • Supports different cell lines, organs, and species
  • Freely available
  • Easy to use

Data overview

The cleaned, specialized datasets fed into the models are summarized as follows:

Property       | Total (pos/neg)   | Training set (pos/neg) | Test set (pos/neg) | Validation set (pos/neg)
LogD7.4        | 257               | 192                    | 65                 | -
F              | 305 (117/188)     | 228 (87/141)           | 77 (30/47)         | -
Caco-2         | 886               | 664                    | 222                | -
PAMPA          | 6698              | 5023                   | 1675               | -
RRCK           | 181               | 135                    | 46                 | -
BBB            | 850 (425/425)     | 636 (318/318)          | 214 (107/107)      | -
T1/2           | 970               | 727                    | 243                | -
Toxicity       | 14660 (8197/6465) | 11729 (6557/5170)      | 1467 (820/647)     | 1466 (820/646)
Cytolysis      | 121               | 96                     | 13                 | 12
GPCR toxin     | 130               | 104                    | 13                 | 13
Neurotoxin     | 848               | 678                    | 85                 | 85
Cytotoxicity   | 621               | 497                    | 62                 | 62
Hemostasis     | 148               | 118                    | 15                 | 15
Hemolysis      | 6656              | 5324                   | 666                | 666
AChR inhibitor | 195               | 156                    | 20                 | 19
Ca2+ inhibitor | 124               | 99                     | 13                 | 12
K+ inhibitor   | 274               | 219                    | 28                 | 27
Na+ inhibitor  | 295               | 236                    | 30                 | 29
HC50           | 2423              | 1938                   | 243                | 242

(pos/neg = number of positive/negative samples)

LogD7.4

LogD7.4 refers to the logarithm of the distribution coefficient of a compound at pH 7.4 (physiological pH). Since pH 7.4 is the typical pH of human blood, logD7.4 is one of the commonly used parameters in drug development to assess the balance between lipophilicity and hydrophilicity of a drug under physiological conditions.
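
For reference, the distribution coefficient at pH 7.4 follows the standard definition (a general textbook relation, not a pepADMET-specific quantity): the ratio of the total peptide concentration in the octanol phase to the total (neutral plus ionized) concentration in aqueous buffer at pH 7.4,

\[
\log D_{7.4} = \log_{10}\frac{[\text{peptide}]_{\text{octanol}}}{[\text{peptide}]_{\text{water}}^{\text{neutral}} + [\text{peptide}]_{\text{water}}^{\text{ionized}}}
\]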

Methods
We collected 257 data points from databases and the literature. PyBioMed and modlAMP were used to calculate small-molecule and peptide descriptors. Six machine learning algorithms (RF, XGBoost, SVR, DT, GBT, and LightGBM) were used to construct the regression models.
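
As an illustration of this workflow, the minimal sketch below trains a gradient-boosted-tree regressor with 5-fold cross-validation using scikit-learn. The descriptor matrix X and logD7.4 values y are random placeholders standing in for the PyBioMed/modlAMP features, and the hyperparameters are arbitrary, not those used by pepADMET.

```python
# Minimal sketch: GBT regression for logD7.4 with 5-fold CV.
# X and y are placeholders for precomputed descriptors and measured values.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(257, 50))   # placeholder descriptor matrix
y = rng.normal(size=257)         # placeholder logD7.4 values

# test_size=0.25 gives roughly the 192/65 split listed in the data table
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=0)

# 5-fold cross-validated R2 on the training set
cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
print("mean CV R2:", cv_r2.mean())

# Fit on the full training set and evaluate on the held-out test set
model.fit(X_train, y_train)
print("test R2:", r2_score(y_test, model.predict(X_test)))
```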

Model Performance
The best performance was achieved with the GBT algorithm, with R2 = 0.820 for 5-fold cross-validation and R2 = 0.818 for the test set.

Figure 1. The predicted LogD7.4 versus the experimental values for 5-fold cross-validation and test set.

Bioavailability (F)

Bioavailability is a crucial factor in drug development, as it determines the extent to which a drug reaches its target site in the body and exerts its therapeutic effect.

Methods
Using appropriate search keywords, 305 data points were collected from public publications. PyBioMed and modlAMP were used to calculate small-molecule and peptide descriptors. After recursive feature elimination with a random forest, a subset of 141 descriptors was retained as model input. Five machine learning algorithms (RF, XGBoost, SVM, DT, and LightGBM) were used to construct the models.
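
A minimal sketch of the feature-selection step, assuming a precomputed descriptor matrix X and binary bioavailability labels y (random placeholders here); it uses scikit-learn's RFE with a random-forest estimator to retain 141 descriptors, as reported above.

```python
# Sketch: recursive feature elimination with a random forest,
# keeping 141 descriptors as in the text. X and y are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(305, 400))      # placeholder descriptor matrix
y = rng.integers(0, 2, size=305)     # placeholder bioavailability labels

selector = RFE(
    estimator=RandomForestClassifier(n_estimators=200, random_state=0),
    n_features_to_select=141,        # subset size reported in the text
    step=10,                         # drop 10 features per iteration
)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)               # (305, 141)
```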

Model Performance
By combining different algorithms with different descriptor sets, the best model achieved AUCs of 0.901 and 0.90 on 5-fold cross-validation and the test set, respectively.

Figure 2. The AUC curve of the best models for bioavailability prediction.

Permeability

Permeability is the ability of a peptide molecule to cross cellular or biological membranes and is one of the key determinants of intestinal absorption and oral bioavailability. It can be assessed with common in vitro models such as Caco-2, PAMPA, and RRCK. During peptide drug development, assessing permeability helps to ensure that the drug can overcome biological barriers for effective delivery and therapeutic efficacy.
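
In these assays, permeability is usually reported as an apparent permeability coefficient; as a general reminder (the standard assay relation, not a pepADMET-specific definition):

\[
P_{\text{app}} = \frac{\mathrm{d}Q/\mathrm{d}t}{A \cdot C_0}
\]

where dQ/dt is the rate at which the compound appears in the receiver compartment, A is the membrane surface area, and C0 is the initial donor concentration.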

Dataset
Through systematic data collection, filtering, and de-duplication, we gathered 7,765 high-quality entries from the CycPeptMPDB database and public publications. These entries were categorized by different cell lines and structural types, resulting in five distinct datasets, as shown in Figure 3.

Figure 3. The data distribution across the five permeability datasets.

Methods
Different molecular descriptors and fingerprints were computed to represent the structural and physicochemical properties of the molecules, including MOE2D descriptors, peptide descriptors, small-molecule descriptors, and MACCS fingerprints. RF, XGBoost, SVR, DT, GBT, and LightGBM were used to construct the models. In addition, a GNN was employed to further improve predictive performance on the large datasets. Taking SMILES and sequences as inputs, the GNN integrates molecular-graph and descriptor information from two pathways and extracts key features through an attention mechanism to generate the final predictions.
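
As an example of the fingerprint step, the sketch below computes 167-bit MACCS keys from SMILES with RDKit; the SMILES string is an arbitrary dipeptide-like placeholder, and the exact featurization pipeline used by pepADMET may differ.

```python
# Sketch: MACCS fingerprints from SMILES with RDKit.
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys

smiles_list = ["CC(N)C(=O)NCC(=O)O"]   # placeholder input (Ala-Gly-like)

fingerprints = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:                    # skip unparsable SMILES
        continue
    fp = MACCSkeys.GenMACCSKeys(mol)   # 167-bit ExplicitBitVect
    fingerprints.append(np.asarray(list(fp), dtype=np.int8))

X_fp = np.vstack(fingerprints)
print(X_fp.shape)                      # (n_valid_molecules, 167)
```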

Figure 4. Overview of GNN framework for permeability prediction across diverse barriers.

Model performance
The best performance for the five datasets was achieved by MOE2D-SVR, MOE2D-LightGBM, MOE2D-RF, MOE2D-RF, and MOE2D-RF, respectively; the details are shown below. Because MOE2D descriptors rely on external software for computation, we chose to deploy the next-best models, which do not require MOE2D descriptors, for prediction. Their performance on 5-fold cross-validation and the test set was as follows: R2 = 0.411 and 0.435 for the Caco2-L model, R2 = 0.582 and 0.527 for the Caco2-C model, R2 = 0.500 and 0.476 for the Caco2-A model, R2 = 0.646 and 0.657 for the PAMPA-C model, and R2 = 0.691 and 0.623 for the RRCK-C model.

Figure 5. Predictive performance of the five optimal models.

BBB Penetration

Blood-brain barrier peptides (BBBPs) are peptides that can cross the blood-brain barrier (BBB) and enter the central nervous system. These molecules serve as efficient drug delivery tools for treating central nervous system diseases, while also promoting the development of novel drug carriers and more precise brain-targeted therapeutic strategies.

Methods
We collected 850 data points from public publications and databases, comprising 425 positive and 425 negative samples. PyBioMed and modlAMP were used to calculate small-molecule and peptide descriptors. After recursive feature elimination with a random forest, a subset of 121 descriptors was retained as model input. Five machine learning algorithms (RF, XGBoost, SVM, DT, and LightGBM) were used to construct the models.
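
A minimal sketch of the classification setup, assuming the 121 selected descriptors are already available as a matrix X with balanced binary labels y (placeholders below); it reports the 5-fold cross-validated ROC AUC of a random-forest classifier.

```python
# Sketch: random-forest BBB classifier evaluated by 5-fold CV ROC AUC.
# X and y are placeholders standing in for the curated dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(850, 121))                                 # placeholder descriptors
y = np.concatenate([np.ones(425), np.zeros(425)]).astype(int)   # balanced labels

clf = RandomForestClassifier(n_estimators=500, random_state=0)
auc_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("mean 5-fold CV AUC:", auc_scores.mean())
```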

Model Performance
The PEP-RF model achieved the best performance for BBB prediction, with a cross-validation AUC of 0.894 and an external test AUC of 0.889.

Figure 6. The AUC curve of the best models for BBB prediction.

Half-life (T1/2)

Half-life refers to the time required for the plasma concentration of a peptide drug to fall to half of its maximum value. This parameter reflects the metabolic and clearance rates of peptides in the body and serves as an important reference for drug design and treatment planning.
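
For a peptide cleared by first-order elimination, the half-life is tied to the elimination rate constant k_e by the standard pharmacokinetic relation:

\[
t_{1/2} = \frac{\ln 2}{k_e} \approx \frac{0.693}{k_e}
\]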

Dataset
A strict pipeline was customized for data collection and pre-processing. The data were sourced from the PEPlife database, PepTherDia database, THPdb database, and public publications. After data cleaning, filtering, and de-duplication, a total of 970 peptide half-life records were obtained. These records were categorized into five datasets based on different species and organs, as shown in Figure 7.

Figure 7. The data distribution across the five half-life datasets. HBN: natural peptides in human blood; HBM: modified peptides in human blood; MBN: natural peptides in mouse blood; MBM: modified peptides in mouse blood; MIM: modified peptides in the mouse intestine.

Methods
First, PyBioMed and modlAMP were used to calculate small-molecule and peptide descriptors. However, traditional peptide descriptors struggle to describe the behavior of highly flexible peptides in complex biological processes. Consequently, enzymatic cleavage features (such as the total number of cleavages and the cleavage sites) were introduced to provide a more comprehensive representation of the molecules for model building. After feature selection with RFECV-RF, we constructed regression models using three representative algorithms (RF, XGBoost, and SVR). The results showed that introducing enzymatic cleavage features improved model performance, but the gain was relatively limited. To further enhance predictive capability, we applied a transfer learning strategy: a large retention-time dataset containing 350,000 entries was first used for pre-training, and the model was then fine-tuned on the five peptide half-life datasets to achieve accurate half-life predictions.
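
The sketch below illustrates the transfer-learning idea in PyTorch: pre-train a small regressor on a large auxiliary dataset, then freeze the early layers and fine-tune the output head on a small half-life set. The architecture, feature dimension, and data are illustrative placeholders, not the actual pepADMET model.

```python
# Sketch of transfer learning: pre-train on a large auxiliary task
# (retention time), then fine-tune on a small half-life dataset.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_features = 128   # placeholder feature dimension

model = nn.Sequential(
    nn.Linear(n_features, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

def train(model, X, y, epochs, lr):
    # Optimize only parameters that still require gradients.
    opt = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        loss.backward()
        opt.step()
    return loss.item()

# 1) Pre-training on a large auxiliary dataset (random placeholder data).
X_rt, y_rt = torch.randn(5000, n_features), torch.randn(5000)
train(model, X_rt, y_rt, epochs=50, lr=1e-3)

# 2) Fine-tuning on a small half-life dataset: freeze the early layers
#    and update only the final head with a lower learning rate.
for layer in list(model.children())[:-1]:
    for p in layer.parameters():
        p.requires_grad = False

X_hl, y_hl = torch.randn(200, n_features), torch.randn(200)
print("fine-tune loss:", train(model, X_hl, y_hl, epochs=100, lr=1e-4))
```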

Figure 8. Workflow of the transfer learning strategy.

Model performance
The results showed that while the introduction of enzymatic cleavage features improved model performance only modestly, the application of transfer learning significantly enhanced predictive power: the optimal models achieved R2 values of 0.84, 0.90, 0.984, 0.93, and 0.94 on the respective test sets, an improvement of about 15% in correlation over related work. The details are shown below.

Figure 9. Combined plots of the best model’s results.

Toxicity

Peptide toxicity refers to the ability of peptide molecules to induce adverse biological reactions in the body, typically manifested as cytotoxicity, immunotoxicity, and hematotoxicity. Specifically, cytotoxicity refers to the ability of peptides to damage cellular function or induce cell death; hematotoxicity can lead to hemolysis or affect the normal function of blood; immunotoxicity involves peptides triggering allergic or immune responses, potentially causing allergic reactions or immune system abnormalities.

Dataset
First, we collected 14,660 data points from the UniProt database, the DBAASP v3 database, the Hemolytik database, and relevant public publications. These data were then categorized into three subsets based on the presence of toxicity, the type of toxicity, and the type of neurotoxicity, with details shown in Figure 10. Furthermore, 2,423 hemolytic peptides with accurate HC50 values were extracted from these toxicity records to develop a regression model for HC50 prediction.

Figure 10. Data distribution for different toxicity datasets.

Methods
MLR-GAT takes peptide sequences and SMILES as inputs. On one path, the SMILES are converted to molecular graphs, and atom and bond information is extracted by an RGCN. On the other path, molecular descriptors are computed from the SMILES and sequences and encoded by a two-layer MLP. The information from both paths is aggregated, and different attention mechanisms are applied to different tasks, enabling task-specific predictions.
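
A highly simplified sketch of this two-branch design, with the RGCN branch replaced by a placeholder linear encoder over pooled graph features; all dimensions, names, and the attention scheme are illustrative assumptions rather than the actual MLR-GAT implementation.

```python
# Sketch: two-branch fusion in the spirit of MLR-GAT. A learned
# attention weight blends a (placeholder) graph branch and a 2-layer
# descriptor MLP before task-specific output heads.
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    def __init__(self, graph_dim, desc_dim, hidden=128, n_tasks=3):
        super().__init__()
        # Placeholder graph branch: stands in for an RGCN + readout.
        self.graph_enc = nn.Sequential(nn.Linear(graph_dim, hidden), nn.ReLU())
        # Descriptor branch: 2-layer MLP, as in the text.
        self.desc_enc = nn.Sequential(
            nn.Linear(desc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Attention over the two branches (softmax over branch scores).
        self.attn = nn.Linear(hidden, 1)
        # One output head per task (e.g. toxic/non-toxic, toxicity type, ...).
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_tasks)])

    def forward(self, graph_feats, desc_feats):
        h = torch.stack([self.graph_enc(graph_feats),
                         self.desc_enc(desc_feats)], dim=1)   # (B, 2, hidden)
        w = torch.softmax(self.attn(h), dim=1)                # branch weights
        fused = (w * h).sum(dim=1)                            # (B, hidden)
        return [head(fused) for head in self.heads]           # per-task logits

model = TwoBranchFusion(graph_dim=64, desc_dim=200)
outputs = model(torch.randn(8, 64), torch.randn(8, 200))
print([o.shape for o in outputs])
```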

Figure 11. The overview of the MLR-GAT framework.

Model performance
The best model achieved an AUC of 0.885 for distinguishing toxic from non-toxic peptides, 0.949 for peptide toxicity type classification, and 0.905 for neurotoxin classification, as well as an R2 of 0.474 for predicting the HC50 values of hemolytic peptides.

Figure 12. The proposed stepwise toxicity prediction scheme and the performance of the best models.