Model-based Analysis of Protein Degradability

Introduction

MAPD (Model-based Analysis of Protein Degradability) is random forest model designed for predicting tractable targets of protein degraders. Here, we developed a R package, which enables users to:

Reproduce our MAPD model for benchmarking
Investigate protein features predictive of protein degradability
Extend the MAPD model by incorporating new protein degradability data and/or protein feature data

Installation

Install required packages

BiocManager::install(c("caret", "doParallel", "pROC", "PRROC"))
devtools::install_github('liulab-dfci/MAPD')

Load required packages

suppressPackageStartupMessages(library(doParallel))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(MAPD))

Reproducing MAPD for benchmarking

Build models for comparison

The MAPD model predicts protein degradability based on five protein-intrinsic features, including in order of importance: ubiquitination potential, phosphorylation potential, protein half-life, acetylation potential and protein length. You can build new models and compare their performance.

Load pre-trained MAPD model

data("MAPD", package = "MAPD")

Train a new model

cl <- makePSOCKcluster(16); registerDoParallel(cl) # Parallel Processing
# Load feature data
data("featureDat", package = "MAPD")
# Load kinase degradability labels
data("ProteinsForTrain", package = "MAPD")
# Use buid-in data or customized data to train a new model (not run)
MAPD.new = MAPD.train(class = ProteinsForTrain, featureDat = featureDat, features = results$optVariables)
stopCluster(cl) # Close the parallel cluster

Evaluate the performance of the final model based on cross-validation

The MAPD.CV function allows users to perform cross-validation, collect prediction scores of proteins in each fold, and finally evaluate the performance based on the prediction scores of all training proteins.

# Comparing `MAPD` and `MAPD.new` based on cross-validation (not run)
cl <- makePSOCKcluster(16); registerDoParallel(cl) # Parallel Processing
# Precison-Recall Curve
prc_1 = MAPD.CV(MAPD, metric = "PRC") 
prc_2 = MAPD.CV(MAPD.new, metric = "PRC")
PRC.plot(prcList = list(MAPD = prc_1, MAPD.new = MAPD.new))

# ROC Curve
roc_1 = MAPD.CV(MAPD, metric = "ROC")
roc_2 = MAPD.CV(MAPD.new, metric = "ROC")
ROC.plot(rocList = list(MAPD = roc_1, MAPD.new = MAPD.new))
stopCluster(cl) # Close the parallel cluster

Investigating features associated with protein degradability

With extensive proteomic profiling of protein degradability and protein features, you can further explore protein features predictive of degradability and develop a model to predict proteome-wide degradability.

Load build-in data

# Load feature data
data("featureDat", package = "MAPD")
head(featureDat[, 1:5])

##       Length  ArgRatio   LysRatio   SerRatio   ThrRatio
## H3C1     136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C10    136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C11    136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C12    136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C2     136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C3     136 0.1323529 0.09558824 0.03676471 0.07352941

# Load kinase degradability labels
data("ProteinsForTrain", package = "MAPD")
head(ProteinsForTrain)

## ACVR1B  BRSK1  BRSK2 CAMK1D CAMK2D CAMKK2 
##    low    low    low    low    low    low 
## Levels: low high

Incorporate new data

You can incorporate new feature data and/or increased protein degradability data into downstream analysis. The data format should be the same as the build-in data.

# ?featureDat: a matrix or a data frame, each row represents a gene/protein, each column represents a feature.
# ?ProteinsForTrain: a factor or a vector, specifying genes/proteins with high or low degradability for training the model. The names of the `ProteinsForTrain`  match the row names of `featureDat`.

Feature selection

Based on the new feature data and/or customized protein degradability data, you can investigate features predictive of protein degradability by performing feature selection using the function caret::rfe.

# Parallel Processing
cl <- makePSOCKcluster(16)
registerDoParallel(cl)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method = "cv", number = 20)
# rfe (recursive feature elimination) 
results <- rfe(featureDat[names(ProteinsForTrain), ], ProteinsForTrain,
               sizes=seq(2,10,2), rfeControl=control)
# summarize the results
print(results)
# plot the results
plot(results, type=c("g", "o"))
stopCluster(cl)

# In the MAPD manuscript, we performed forward feature selection and seleted five important features for predicting degradability, including Ubiquitination_2, Phosphorylation_2, Acetylation_1, Length, Zecha2018_Hela_Halflife.

Extending the MAPD model for predicting protein degradability

With the newly-selected features, you can use the MAPD.train function to train a final model for predicting protein degradability.

# You can input your own protein degradability labels and protein feature data to train a new model (not run)
cl <- makePSOCKcluster(16); registerDoParallel(cl) # Parallel Processing
MAPD = MAPD.train(class = ProteinsForTrain, featureDat = featureDat, features = results$optVariables)
stopCluster(cl) # Close the parallel cluster

Predicting proteome-wide degradability

You can see our predictions of proteome-wide degradability at http://mapd.cistrome.org/.

Predict degradability using the final model

# (not run)
preds = predict.train(MAPD, newdata = featureDat, type = "prob")
preds = preds[order(-preds[,1]), ]
head(preds)

Collect predictions from cross-validation

For proteins included in the training, we recommend collecting their prediction scores from cross-validation.

# (not run)
cl <- makePSOCKcluster(4); registerDoParallel(cl) # Parallel Processing

# Collect predictions from cross-validation
preds_cv = MAPD.CV(MAPD, out = "pred")
head(sort(preds_cv, decreasing = TRUE))

# Update the predictions for proteins in the training
preds[rownames(preds_cv), 1] = preds_cv
preds[rownames(preds_cv), 2] = 1-preds_cv
stopCluster(cl) # Close the parallel cluster

Wubing Zhang

2021-08-30