vignettes/MAPD.Rmd
MAPD.Rmd
MAPD (Model-based Analysis of Protein Degradability) is random forest model designed for predicting tractable targets of protein degraders. Here, we developed a R package, which enables users to:
BiocManager::install(c("caret", "doParallel", "pROC", "PRROC"))
devtools::install_github('liulab-dfci/MAPD')
The MAPD model predicts protein degradability based on five protein-intrinsic features, including in order of importance: ubiquitination potential, phosphorylation potential, protein half-life, acetylation potential and protein length. You can build new models and compare their performance.
data("MAPD", package = "MAPD")
cl <- makePSOCKcluster(16); registerDoParallel(cl) # Parallel Processing
# Load feature data
data("featureDat", package = "MAPD")
# Load kinase degradability labels
data("ProteinsForTrain", package = "MAPD")
# Use buid-in data or customized data to train a new model (not run)
MAPD.new = MAPD.train(class = ProteinsForTrain, featureDat = featureDat, features = results$optVariables)
stopCluster(cl) # Close the parallel cluster
The MAPD.CV
function allows users to perform cross-validation, collect prediction scores of proteins in each fold, and finally evaluate the performance based on the prediction scores of all training proteins.
# Comparing `MAPD` and `MAPD.new` based on cross-validation (not run)
cl <- makePSOCKcluster(16); registerDoParallel(cl) # Parallel Processing
# Precison-Recall Curve
prc_1 = MAPD.CV(MAPD, metric = "PRC")
prc_2 = MAPD.CV(MAPD.new, metric = "PRC")
PRC.plot(prcList = list(MAPD = prc_1, MAPD.new = MAPD.new))
# ROC Curve
roc_1 = MAPD.CV(MAPD, metric = "ROC")
roc_2 = MAPD.CV(MAPD.new, metric = "ROC")
ROC.plot(rocList = list(MAPD = roc_1, MAPD.new = MAPD.new))
stopCluster(cl) # Close the parallel cluster
With extensive proteomic profiling of protein degradability and protein features, you can further explore protein features predictive of degradability and develop a model to predict proteome-wide degradability.
## Length ArgRatio LysRatio SerRatio ThrRatio
## H3C1 136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C10 136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C11 136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C12 136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C2 136 0.1323529 0.09558824 0.03676471 0.07352941
## H3C3 136 0.1323529 0.09558824 0.03676471 0.07352941
# Load kinase degradability labels
data("ProteinsForTrain", package = "MAPD")
head(ProteinsForTrain)
## ACVR1B BRSK1 BRSK2 CAMK1D CAMK2D CAMKK2
## low low low low low low
## Levels: low high
You can incorporate new feature data and/or increased protein degradability data into downstream analysis. The data format should be the same as the build-in data.
# ?featureDat: a matrix or a data frame, each row represents a gene/protein, each column represents a feature.
# ?ProteinsForTrain: a factor or a vector, specifying genes/proteins with high or low degradability for training the model. The names of the `ProteinsForTrain` match the row names of `featureDat`.
Based on the new feature data and/or customized protein degradability data, you can investigate features predictive of protein degradability by performing feature selection using the function caret::rfe
.
# Parallel Processing
cl <- makePSOCKcluster(16)
registerDoParallel(cl)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method = "cv", number = 20)
# rfe (recursive feature elimination)
results <- rfe(featureDat[names(ProteinsForTrain), ], ProteinsForTrain,
sizes=seq(2,10,2), rfeControl=control)
# summarize the results
print(results)
# plot the results
plot(results, type=c("g", "o"))
stopCluster(cl)
# In the MAPD manuscript, we performed forward feature selection and seleted five important features for predicting degradability, including Ubiquitination_2, Phosphorylation_2, Acetylation_1, Length, Zecha2018_Hela_Halflife.
With the newly-selected features, you can use the MAPD.train
function to train a final model for predicting protein degradability.
# You can input your own protein degradability labels and protein feature data to train a new model (not run)
cl <- makePSOCKcluster(16); registerDoParallel(cl) # Parallel Processing
MAPD = MAPD.train(class = ProteinsForTrain, featureDat = featureDat, features = results$optVariables)
stopCluster(cl) # Close the parallel cluster
You can see our predictions of proteome-wide degradability at http://mapd.cistrome.org/.
# (not run)
preds = predict.train(MAPD, newdata = featureDat, type = "prob")
preds = preds[order(-preds[,1]), ]
head(preds)
For proteins included in the training, we recommend collecting their prediction scores from cross-validation.
# (not run)
cl <- makePSOCKcluster(4); registerDoParallel(cl) # Parallel Processing
# Collect predictions from cross-validation
preds_cv = MAPD.CV(MAPD, out = "pred")
head(sort(preds_cv, decreasing = TRUE))
# Update the predictions for proteins in the training
preds[rownames(preds_cv), 1] = preds_cv
preds[rownames(preds_cv), 2] = 1-preds_cv
stopCluster(cl) # Close the parallel cluster