Get a Quote

# random forest classifier sklearn

## how to save and load random forest from scikit-learn in python? | mljar automated machine learning

In this post I will show you how to save and load Random Forest model trained with scikit-learn in Python. The method presented here can be applied to any algorithm from sckit-learn (this is amazing about scikit-learn!).

Lets save the Random Forest. Im using joblib.dump method. The first argument of the method is variable with the model. The second argument is the path and the file name where the resulting file will be created.

To load the model back I use joblib.load method. It takes as argument the path and file name. I will load the forest to new variable loaded_rf. Please notice that I dont need to initilize this variable, just load the model into it.

While saving the scikit-learn Random Forest with joblib you can use compress parameter to save the disk space. In the joblib docs there is information that compress=3 is a good compromise between size and speed. Example below:

## learn and build random forest algorithm model in python - intellipaat

In this random forest tutorial blog, we will learn what random forest algorithm is? We will see how to build random forest models with the help of random forest classifier and random forest regression functions. This blog highlights the implementation of random forest in Python and Sklearn.

As the name suggests, random forest is nothing but a collection of multiple decision tree models. Random forest is a supervised Machine Learning algorithm. This algorithm creates a set of decision trees from a few randomly selected subsets of the training set and picks predictions from each tree. Then by means of voting, the random forest algorithm selects the best solution.

Also, automatic feature selection reduces the complexity of the model but does not necessarily increase the accuracy. In order to get the desired accuracy, we have to perform the feature selection process manually.

In this random forest tutorial blog, we answered the question, what is random forest algorithm? We also learned how to build random forest models with the help of random forest classifier and random forest regressor functions. Additionally, we talked about the implementation of the random forest algorithm in Python and Scikit-Learn.

## plot trees for a random forest in python with scikit-learn - stack overflow

Because this question asked for trees, you can visualize all the estimators (decision trees) from a random forest if you like. The code below visualizes the first 5 from the random forest model fit above.

The important thing to while plotting the single decision tree from the random forest is that it might be fully grown (default hyper-parameters). It means the tree can be really depth. For me, the tree with depth greater than 6 is very hard to read. So if the tree visualization will be needed I'm building random forest with max_depth < 7. You can check the example visualization in this post.

By clicking Accept all cookies, you agree Stack Exchange can store cookies on your device and disclose information in accordance with our Cookie Policy.

## bagging and random forest ensemble algorithms for machine learning

The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.

Bootstrap Aggregation is a general procedure that can be used to reduce the variance for those algorithm that have high variance. An algorithm that has high variance are decision trees, like classification and regression trees (CART).

Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data) the resulting decision tree can be quite different and in turn the predictions can be quite different.

When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf-node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging.

The only parameters when bagging decision trees is the number of samples and hence the number of trees to include. This can be chosen by increasing the number of trees on run after run until the accuracy begins to stop showing improvement (e.g. on a cross validation test harness). Very large numbers of models may take a long time to prepare, but will not overfit the training data.

A problem with decision trees like CART is that they are greedy. They choose which variable to split on using a greedy algorithm that minimizes error. As such, even with Bagging, the decision trees can have a lot of structural similarities and in turn have high correlation in their predictions.

It is a simple tweak. In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the most optimal split-point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features of which to search.

Where m is the number of randomly selected features that can be searched at a split point and p is the number of input variables. For example, if a dataset had 25 input variables for a classification problem, then:

These drops in error can be averaged across all decision trees and output to provide an estimate of the importance of each input variable. The greater the drop when the variable was chosen, the greater the importance.

These outputs can help identify subsets of input variables that may be most or least relevant to the problem and suggest at possible feature selection experiments you could perform where some features are removed from the dataset.

Thanks for your clear and helpful explanation of bagging and random forest. I was just wondering if there is any formula or good default values for the number of models (e.g., decision trees) and the number of samples to start with, in bagging method? Is there any relation between the size of training dataset (n), number of models (m), and number of sub-samples (n) which I should obey?

You need to pick data with replacement. This mean if sample data is same training data this mean the training data will increase for next smoking because data picked twice and triple and more. The best thing is pick 60% for training data from sample data to make sure variety of output will occurred with different results. I repeat. Training data must be less than sample data to create different tree construction based on variety data with replacement.

Sir, I have to predict daily air temperature values using random forest regression and i have 5 input varibales. Could you please explain how splitting is performed in regression? exactly what is done at each split point?

Hi Jason, its not true that bootstrapping a sample and computing the mean of the bootstrap sample means improves the estimate of the mean. The standard MLE (I.e just the sample mean) is the best estimate of the population mean. Bootstrapping is great for many things but not for giving a better estimate of a mean.

Hi, Jason! Im reading your article and helped me understand the context about bagging. I need to implement a Bagging for Caltech 101 dataset and I do not know how can I start. I am programing somenthing in Matlab but I dont know how can I create a file from Caltech101 to Matlab and studying the data to create Ensemble. I am little confusing! Do you have any consideration to help me?

Nice tutorial, Jason! Im a bit confuse about the Variable Importance part, which step in bagging algorithm do you need to calculate the importance of each variables by estimate the error function drops?

Could you please explain for me what is the difference between random forest, rotation forest and deep forest? Do you implement rotation forest and deep forest in Python or Weka Environment? If so, please send the link. Many thanks

As you mentioned in the post, a submodel like CART will have low bias and high variance. The meta bagging model(like random forest) will reduce the reduce the variance. Since, the submodels already have low bias, I am assuming the meta model will also have low bias. However, I have seen that it generally gets stated that bagging reduces variance, but not much is mentioned about it giving a low bias model as well. Could you please explain that?

option 2: is it more complex, i.e. am I supposed to somehow take the results of my other algorithms (Im using Logistic Regression, KNN, and Nave-Bayes) and somehow use their output as input to the ensemble algorithms.

In fact my base is composed of 500 days, each day is a time series (database: 24 lines (hours), 500 columns (days)) I want to apply a bagging to predict the 501 day. How can i apply this technique given it resamples the base into subsets randomly and each subset makes one-day forecasting at random. Is the result of the aggregation surely the 501 day?

Hello, Actually i trained the model with 4 predictors and later based on predictor importance one variable is not at all impact on response so i removed that parameter and trained the model but i am getting error obtained during 3 predictors is less as compared with 4 predictor model. will u please help me out why i am getting this error difference if i removed the parameter if it is not at all related to the response variable is reducing error or the error is same please help me out.

1) Can we define input -> output correlation or output -> output correlation ? 2) Can we tell model that particular these set of inputs are more powerful ? 3) Can we do sample wise classification ? 4) It is giving 98% accuracy on training data but still I am not getting expected result. Because model can not identify change in that particular input.

Hello, Jason, My question is; Does the random forest algorithm include bagging by default? So when I use the random forest algorithm, do I actually do bagging? If the random forest algorithm includes bagging by default and I apply bagging to my data set first and then use the random forest algorithm, can I get a higher success rate or a meaningful result?

If my ntree is 1000, that means that the number of bootstrap samples is 1000, each containing, by default, two thirds of the sampled poits and one third is used to get predictions out-of-bag, is this correct?

Hi jason. thank u for complete explanation. I have a high dimensional data with few samples .() 47 samples and 4000 feature) is it good to use random forest for getting variable importance or going to Deep learning?

Thank you Jason for this article ! Still Im a little confuse with Bagging. If rows are extracted randomly with replacement, it is be possible that a features value disappears from the final sample. Hence, the associated decision tree might not be able to handle/predict data which contains this missing value. How should a Random Forest model handle this case ? How to prevent it from such a situation ?

1. Please I have about 152 wells. Each well has unique properties and has time series data with 1000 rows and 14 columns. I merged all the wells data to have 152,000 rows and 14 columns. I used the data for 2 wells for testing (2,000 rows and 14 columns). and the rest for training (2,000 rows and 14 columns). The random forest regression model performs well for training and poorly for testing and new unseen data. Please, what could be the issue?

Hello Jason, I would like to know what is the difference between RandomForest vs BaggingClassifier(max_features < 1 , bootstrap_features = True) ie Subspace. In both these cases, we consider smaller set of features to derive the tree. Then what makes RandomForest special

## adaboost classifier in python - datacamp

In recent years, boosting algorithms gained massive popularity in data science or machine learning competitions. Most of the winners of these competitions use boosting algorithms to achieve high accuracy. These Data science competitions provide the global platform for learning, exploring and providing solutions for various business and government problems. Boosting algorithms combine multiple low accuracy(or weak) models to create a high accuracy(or strong) models. It can be utilized in various domains such as credit, insurance, marketing, and sales. Boosting algorithms such as AdaBoost, Gradient Boosting, and XGBoost are widely used machine learning algorithm to win the data science competitions. In this tutorial, you are going to learn the AdaBoost ensemble boosting algorithm, and the following topics will be covered:

An ensemble is a composite model, combines a series of low performing classifiers with the aim of creating an improved classifier. Here, individual classifier vote and final prediction label returned that performs majority voting. Ensembles offer more accuracy than individual or base classifier. Ensemble methods can parallelize by allocating each base learner to different-different machines. Finally, you can say Ensemble learning methods are meta-algorithms that combine several machine learning methods into a single predictive model to increase performance. Ensemble methods can decrease variance using bagging approach, bias using a boosting approach, or improve predictions using stacking approach.

Bagging stands for bootstrap aggregation. It combines multiple learners in a way to reduce the variance of estimates. For example, random forest trains M Decision Tree, you can train M different trees on different random subsets of the data and perform voting for final prediction. Bagging ensembles methods are Random Forest and Extra Trees.

Boosting algorithms are a set of the low accurate classifier to create a highly accurate classifier. Low accuracy classifier (or weak classifier) offers the accuracy better than the flipping of a coin. Highly accurate classifier( or strong classifier) offer error rate close to 0. Boosting algorithm can track the model who failed the accurate prediction. Boosting algorithms are less affected by the overfitting problem. The following three algorithms have gained massive popularity in data science competitions.

Stacking(or stacked generalization) is an ensemble learning technique that combines multiple base classification models predictions into a new data set. This new data are treated as the input data for another classifier. This classifier employed to solve this problem. Stacking is often referred to as blending.

On the basis of the arrangement of base learners, ensemble methods can be divided into two groups: In parallel ensemble methods, base learners are generated in parallel for example. Random Forest. In sequential ensemble methods, base learners are generated sequentially for example AdaBoost.

On the basis of the type of base learners, ensemble methods can be divided into two groups: homogenous ensemble method uses the same type of base learner in each iteration. heterogeneous ensemble method uses the different type of base learner in each iteration.

Ada-boost or Adaptive Boosting is one of ensemble boosting classifier proposed by Yoav Freund and Robert Schapire in 1996. It combines multiple classifiers to increase the accuracy of classifiers. AdaBoost is an iterative ensemble method. AdaBoost classifier builds a strong classifier by combining multiple poorly performing classifiers so that you will get high accuracy strong classifier. The basic concept behind Adaboost is to set the weights of classifiers and training the data sample in each iteration such that it ensures the accurate predictions of unusual observations. Any machine learning algorithm can be used as base classifier if it accepts weights on the training set. Adaboost should meet two conditions:

In the model the building part, you can use the IRIS dataset, which is a very famous multi-class classification problem. This dataset comprises 4 features (sepal length, sepal width, petal length, petal width) and a target (the type of flower). This data has three types of flower classes: Setosa, Versicolour, and Virginica. The dataset is available in the scikit-learn library, or you can also download it from the UCI Machine Learning Library.

AdaBoost is easy to implement. It iteratively corrects the mistakes of the weak classifier and improves accuracy by combining weak learners. You can use many base classifiers with AdaBoost. AdaBoost is not prone to overfitting. This can be found out via experiment results, but there is no concrete reason available.

In this tutorial, you have learned the Ensemble Machine Learning Approaches, AdaBoost algorithm, it's working, model building and evaluation using Python Scikit-learn package. Also, discussed its pros and cons.

## python - unbalanced classification using randomforestclassifier in sklearn - stack overflow

I have a dataset where the classes are unbalanced. The classes are either '1' or '0' where the ratio of class '1':'0' is 5:1. How do you calculate the prediction error for each class and the rebalance weights accordingly in sklearn with Random Forest, kind of like in the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance

Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. In the case of classification, splits are also ignored if they would result in any single class carrying a negative weight in either child node.

In older version there were a preprocessing.balance_weights method to generate balance weights for given samples, such that classes become uniformly distributed. It is still there, in internal but still usable preprocessing._weights module, but is deprecated and will be removed in future versions. Don't know exact reasons for this.

Some clarification, as you seems to be confused. sample_weight usage is straightforward, once you remember that its purpose is to balance target classes in training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_wight), and each element of sample witght 1-d array represent weight for a corresponding (observation, label) pair. For your case, if 1 class is represented 5 times as 0 class is, and you balance classes distributions, you could use simple

This is really a shame that sklearn's "fit" method does not allow specifying a performance measure to be optimized. No one around seem to understand or question or be interested in what's actually going on when one calls fit method on data sample when solving a classification task.

We (users of the scikit learn package) are silently left with suggestion to indirectly use crossvalidated grid search with specific scoring method suitable for unbalanced datasets in hope to stumble upon a parameters/metaparameters set which produces appropriate AUC or F1 score.

But think about it: looks like "fit" method called under the hood each time always optimizes accuracy. So in end effect, if we aim to maximize F1 score, GridSearchCV gives us "model with best F1 from all modesl with best accuracy". Is that not silly? Would not it be better to directly optimize model's parameters for maximal F1 score? Remember old good Matlab ANNs package, where you can set desired performance metric to RMSE, MAE, and whatever you want given that gradient calculating algo is defined. Why is choosing of performance metric silently omitted from sklearn?

At least, why there is no simple option to assign class instances weights automatically to remedy unbalanced datasets issues? Why do we have to calculate wights manually? Besides, in many machine learning books/articles I saw authors praising sklearn's manual as awesome if not the best sources of information on topic. No, really? Why is unbalanced datasets problem (which is obviously of utter importance to data scientists) not even covered nowhere in the docs then? I address these questions to contributors of sklearn, should they read this. Or anyone knowing reasons for doing that welcome to comment and clear things out.

From sklearn documentation: The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

By clicking Accept all cookies, you agree Stack Exchange can store cookies on your device and disclose information in accordance with our Cookie Policy.

## using random forests in python with scikit-learn | oxford protein informatics group

I spend a lot of time experimenting with machine learning tools in my research; in particular I seem to spend a lot of time chasing data into random forests and watching the other side to see what comes out. In my many hours of Googling random forest foobar a disproportionate number of hits offer solutions implemented in R. As a young Pythonista in the present year I find this a thoroughly unacceptable state of affairs, so I decided to write a crash course in how to build random forest models in Python using the machine learning library scikit-learn (or sklearn to friends). This is far from exhaustive, and I wont be delving into the machinery of how and why we might want to use a random forest. Rather, the hope is that this will be useful to anyone looking for a hands-on introduction to random forests (or machine learning in general) in Python.

Sklearncomes with several nicely formatted real-world toy data sets which we can use to experiment with the tools at our disposal. Well be using thevenerableiris dataset for classification and the Boston housingset for regression. Sklearn comes with a nice selection of data sets and tools for generating synthetic data, all of which are well-documented. Now, lets write some Python!

First well look at how to do solve a simple classification problem using a random forest. The iris dataset is probably the most widely-used example for this problem and nicely illustrates the problem of classification when some classes are not linearly separable from the others.

First well load the iris dataset into a pandas dataframe. Pandas is a nifty Python library which provides a data structure comparable to the dataframes found in R with database style querying. As an added bonus, the seaborn visualization library integrates nicely with pandas allowing us to generate a nice scatter matrix of our data with minimal fuss.

Neat. Notice that iris-setosa is easily identifiable by petal length and petal width, while the other two species are much more difficult to distinguish. We could do all sorts of pre-processing and exploratory analysis at this stage, but since this is such a simple dataset lets just fire on. Well do a bit of pre-processing later when we come to the Boston data set.

First, lets split the data into training and test sets. Well used stratified sampling by iris class to ensure both the training and test sets contain a balanced number of representatives of each of the three classes. Sklearn requires that all features and targets be numeric, so the three classes are represented as integers (0, 1, 2). Here were doing a simple 50/50 split because the data are so nicely behaved. Typically however we might use a 75/25 or even 80/20 training/test split to ensure we have enough training data. In true Python style this is a one-liner.

Now lets fit a random forest classifier to our training set. For the most part well use the default settings since theyre quite robust. One exception is the out-of-bag estimate: by default an out-of-bag error estimate is not computed, so we need to tell the classifier object that we want this.

If youre used to the R implementation, or you ever find yourself having to compare results using the two, be aware that some parameter names and default settings are different between the two. Fortunately both have excellent documentation so its easy to ensure youre using the right parameters if you ever need to compare models.

Lets see how well our model performs when classifying our unseen test data. For a random forest classifier, the out-of-bag score computed by sklearn is an estimate of the classification accuracy we might expect to observe on new data. Well compare this to the actual score obtained on our test data.

Not bad. However, this doesnt really tell us anything about where were doing well. A useful technique for visualising performance is the confusion matrix. This is simply a matrix whose diagonal values are true positive counts, while off-diagonal values are false positive and false negative counts for each class against the other.

Now lets look at using a random forest to solve a regression problem. The Boston housing data set consists of census housing price data in the region of Boston, Massachusetts, together with a series of values quantifying various properties of the local area such as crime rate, air pollution, and student-teacher ratio in schools. The question for us is whether we can use these data to accurately predict median house prices. One caveat of this data set is that the median house price is truncated at $50,000 which suggests that there may be considerable noise in this region of the data. You might want to remove all data with a median house price of$50,000 from the set and see if the regression improves at all.

As before well load the data into a pandas dataframe. This time, however, were going to do some pre-processing of our data by independently transforming each feature to have zero mean and unit variance. The values of different features vary greatly in order of magnitude. If we were to analyse the raw data as-is, we run the risk of our analysis being skewed by certain features dominating the variance. This isnt strictly necessary for a random forest, but will enable us to perform a more meaningful principal component analysis later. Performing this transformation in sklearn is super simple using the StandardScaler class of the preprocessing module. This time were going to use an 80/20 split of our data. You could bin the house prices to perform stratified sampling, but we wont worry about that for now.

As before, weve loaded our data into a pandas dataframe. Notice how I have to construct new dataframes from the transformed data. This is because sklearn is built around numpy arrays. While its possible to return a view of a dataframe as an array, transforming the contents of a dataframe requires a little more work. Of course, theres a library for that, but Imlazy so I didnt use it this time.

With the data standardised, lets do a quick principal-component analysis to see if we could reduce the dimensionality of the problem. This is quick and easy in sklearn using the PCA class of the decomposition module.

Notice how without data standardisation the variance is completely dominated by the first principal component. With standardisation, however, we see that in fact we must consider multiple features in order to explain a significant proportion of the variance. You might want to experiment with building regression models using the principal components (or indeed just combinations of the raw features) to see how well you can do with less information. For now though were going to use all of the (scaled) features as the regressors for our model. As with the classification problem fitting the random forest is simple using the RandomForestRegressor class.

Now lets see how we do on our test set. As before well compare the out-of-bag estimate (this time its an R-squared score) to the R-squared score for our predictions. Well also compute Spearman rank and Pearson correlation coefficients for our predictions to get a feel for how were doing.

Congratulations on making it this far. Now you know how to pre-process your data and build random forest models all from the comfort of your iPython session. I plan on writing more in the future about how to use Python for machine learning, and in particular how to make use of some of the powerful tools available in sklearn (a pipeline for data preparation, model fitting, prediction, in one line of Python? Yes please!), and how to make sklearn and pandas play nicely with minimal hassle. If youre lucky, and if I can bring myself to process the data nicely, I might include some fun examples from less well-behaved real-world data sets.

## random forest classification for predicting lifespan-extending chemical compounds | scientific reports

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Ageing is a major risk factor for many conditions including cancer, cardiovascular and neurodegenerative diseases. Pharmaceutical interventions that slow down ageing and delay the onset of age-related diseases are a growing research area. The aim of this study was to build a machine learning model based on the data of the DrugAge database to predict whether a chemical compound will extend the lifespan of Caenorhabditis elegans. Five predictive models were built using the random forest algorithm with molecular fingerprints and/or molecular descriptors as features. The best performing classifier, built using molecular descriptors, achieved an area under the curve score (AUC) of 0.815 for classifying the compounds in the test set. The features of the model were ranked using the Gini importance measure of the random forest algorithm. The top 30 features included descriptors related to atom and bond counts, topological and partial charge properties. The model was applied to predict the class of compounds in an external database, consisting of 1738 small-molecules. The chemical compounds of the screening database with a predictive probability of0.80 for increasing the lifespan of Caenorhabditis elegans were broadly separated into (1) flavonoids, (2) fatty acids and conjugates, and (3) organooxygen compounds.

Ageing is a major health, social and financial challenge, characterised by the deterioration of the physiological processes of an organism1,2. Ageing is a predominant risk factor for many conditions including various types of cancers, cardiovascular and neurodegenerative diseases3,4. Interventions targeting the cellular and molecular process of ageing have the potential to delay and protect against age-related conditions.

Several ageing studies have identified interventions that extend the lifespan of model organisms ranging from nematodes and fruit flies to rodents, using dietary restrictions, genetic modifications and pharmaceutical interventions. Lee et al. (2006) presented the first evidence that long-term dietary deprivation can improve longevity in a multicellular species, Caenorhabditis elegans (C. elegans)5. A few years later, Harrison et al. (2009) found that treating mice with rapamycin, an inhibitor of the mTOR pathway, extended the median and maxim lifespan of the mice6. Additionally, Selman et al. (2009) reported that genetic deletion of S6 protein kinase 1, a component of the mTOR signalling pathway, increased the lifespan of mice and protected against age-related conditions7.

Ye et al. (2014) developed a pharmacological network to identify pharmacological classes related to the ageing of C. elegans8. The network showed that resistance to oxidative stress and lifespan extension clustered in a few pharmacological classes, most of them related to intercellular signalling8. Moreover, Putin et al. (2016) developed a deep learning neural network that predicted human chronological age from a basic blood test9. The study identified the top five most critical blood markers for determining chronological age in humans, which were albumin, glucose, alkaline phosphatase, urea and erythrocytes9. Additionally, Mamoshina et al. (2018) developed a deep learning-based haematological ageing clock using blood samples from Canadian, South Korean, and Eastern European populations, with millions of subjects10. The findings showed that population-specific ageing clocks were more accurate in predicting chronological age and quantifying biological age than generic ageing clocks10.

Barardo et al. (2017) built a random forest model to predict whether a compound would increase the lifespan of C. elegans based on the data of the DrugAge database1,4. The features used to build the model were molecular descriptors and gene ontology terms. Feature selection was performed using random forests feature importance measure. The best performing model, with an AUC score of 0.80, was applied to predict the class of the compounds in the DGIdb database.

This study builds on the work conducted by Barardo et al. (2017) to further explore the use of the DrugAge database for predicting compounds with anti-ageing properties4. Specifically, the random forest algorithm was employed to predict whether a compound will increase the lifespan of C. elegans. This was achieved by building five predictive models, each using different descriptor types, based on the data of the DrugAge database published by Barardo et al. (2017)4. The features of the models were molecular fingerprints and/or molecular descriptors calculated from the structure of the compounds in the DrugAge database. Feature selection was performed using variance and mutual information-based methods. To the best of our knowledge, this is the first implementation of molecular fingerprints for building machine learning models based on the entries of the DrugAge database. The best performing model was applied to predict the class of the compounds in an external database, consisting of 1738 small-molecules extracted from the DrugBank database11.

Random forest is a supervised machine learning technique, consisted of an ensemble of decision trees, where each tree is trained independently using a random subset of the data12. The random forest model is widely used in chemo- and bioinformatics related tasks as it is robust to overfitting on high dimensional databases with a small sample sizes4.

The choice of chemical descriptors can significantly impact the quality and predictions of quantitative structureactivity relationship (QSAR) models. Descriptors represent chemical information of molecules in a digital or numerical way that is suitable for model development and are computer-interpretable13,14. In this study, 2D and 3D molecular descriptors were calculated using the Molecular Operating Environment (MOE) software15. 2D descriptors are calculated from the 2D structure of a molecule and provide information related to its structural, topological and physicochemical properties16. On the other hand, 3D descriptors are generated from the 3D structure of a chemical compound and include electronic parameters (e.g. dipole momentum), quantumchemical descriptors (e.g. HOMO and LUMO energies), and surface:volume descriptors14,17,18.

Molecular fingerprints represent the molecules structure using binary vectors, where 1 corresponds to a particular feature or structural group being present and 0 that it is absent. Herein, extended-connectivity fingerprints (ECFP) of 1024- and 2048-bit lengths and RDKit topological fingerprints of 2048-bit length were generated in the RDKit Python environment19. Lastly, the combination of molecular descriptors with ECFPs was tested.

Variance and mutual information filter-based methods were employed to select a subset of relevant features for predicting the anti-ageing properties of the molecules in the dataset. This was performed only for the training set which contained 80% of compounds in the dataset. Feature selection reduced the number of variables used by each model, making computational calculations less expensive. The median AUC scores and standard deviation of tenfold cross-validation (on the training set) obtained by random forest classification for each feature subset can be found in Supplementary Fig.1, Additional File 1. The feature subset with the highest AUC score in 10-fold cross-validation was selected for classifying the compounds in the test set. In cases where two feature subsets achieved the same AUC score, the subset with the lowest standard deviation was used.

The test set contained 20% of the data not used in training the models. The performances of the random forest classifiers on the test and train set as well as the optimal number of variables obtained by feature selection are shown in Table 1.

As illustrated in Fig.1, the predictive performances of the random forest models built with ECFP and MD features did not significantly drop for classifying the compounds in the test set and were compatible with the spread of the AUC scores from cross-validation. This indicated that overfitting was minimised. On the other hand, the performance of the classifier using topological RDKit fingerprints on the test set dropped by approximately 6%. Therefore, the RDKit5 model was not selected for predicting the effect of the compounds in the screening database on the lifespan of C. elegans.

Box-and-whisker plot displaying the distribution of the AUC for the tenfold cross-validation (CV) on the training set and the AUC score for the single measurement taken on the test set, obtained by random forest classification. Each box represents the cross-validation data for a different model, where ECFP of 1024-bit length is shown in green, ECFP of 2048-bit length in blue, RDKit fingerprints in pink, molecular descriptors in red and the combination of ECFP of 1024-bit length with molecular descriptors are represented in orange colour. The value reported within the boxes is the median AUC value of the tenfold cross-validation. The points outside the boxplots represent possible outliers.

The receiver operating characteristic (ROC) curve is the plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) at varying classification thresholds. The ROC curves, displayed in Fig.2, compare the performances of the descriptor types for classifying the samples of the test set. The figure illustrates that the five random forest models performed better than a random prediction.

The ROC curves comparing the performances of the ECFP_1024 (in grey), ECFP_2048 (in green), RDKit5 (in orange), MD (in purple) and ECFP_1024_MD (in blue) for classifying the compounds in the test set. The AUC scores are reported for each descriptor type. The red dashed line corresponds to a random classifier, that gives random answers, with an AUC value of 0.520.

The best performing model, selected by its ability to correctly classify the compounds in the test set, was used for predicting the class of the compounds in the screening dataset. The classifier built solely with molecular descriptors (MD), achieved the highest AUC score for predicting the class of the compounds in the test set. Combining MD with ECFP_1024, the feature type used to obtain the model with the second-highest predictive ability, did not further improve the performance of the classifier. The ECFP_1024 features could have provided additional information that was not useful to the random forest classifier making the predictions more difficult. Therefore, the MD model, which had an AUC score of 0.815 for classifying the compounds in the test set, was selected for further analysis.

In binary classification, the PPV and NPV are the percentage of correctly classified compounds among all compounds predicted as positives or negatives, respectively. Herein, the PPV and NPV indicate that the random forest model performed better on correctly classifying inactive compounds than active ones. The data used in this study was imbalanced as approximately 79% of the samples were negative entries. Thus, a random prediction that a compound is inactive had a much higher initial probability of being correct. To handle the imbalanced data, the class_weight argument of the random forest algorithm was set to balanced, which penalises misclassification of the minority class (i.e. the positive samples)21. The remaining parameters of the random forest model were left to the default settings of the scikit-learn Python library (please refer to the Random forest settings section in the Methods)22. This increased the PPV for classifying the compounds in the test set from 61.1% (value without balancing the class weights) to 65.6% (score achieved after balancing the class weights), while the NPV was reduced by 0.2%.

Investigating which features are considered more important by black-box models such as random forest can aid understanding of how these models make predictions. In this experiment, the feature relevance was measured using the Gini importance of the random forest algorithm. The selected model, MD, was composed of 69 molecular descriptors calculated by the MOE software23. The table containing the full feature ranking can be found in Additional File 2. The analysis was focused on the top 30 features with the highest Gini importance (Table 2), which contained both 2D and 3D molecular descriptors.

Atom and bond counts are simple descriptors that do not provide any information on molecular geometry or atom connectivity. The highest-ranking atom and bond count descriptors were a_nN, b_single, a_count, opr_brigid, and a_nH. While very simplistic, the atom and bond counts outperformed more complex 2D and 3D molecular descriptors. This is because atom and bond counts can partially capture the overall properties of a compound such as size, hydrogen bonding and polarity, which often impact the activity of a drug24. The number of nitrogen atoms, a_nN, was the top-ranking feature of the MD random forest model with a Gini importance score of 0.062. This is consistent with the results of Barardo et al. (2017) where a_nN was also ranked highest for predicting the class of the compounds in the DrugAge database4. Nitrogen atoms could have affected the physicochemical properties of the drugs as well as the interactions and binding of the molecules with target residues.

The highest-ranking topological descriptors included chi0_C, chi0v_C, zagreb, weinerPol, Kier3, chi0 and Kier2. Topological descriptors take into account atom connectivity. The descriptors are computed from molecular graphs, where atoms are represented by vertices and the bonds by edges25. These descriptors can provide information on the degree branching of the structure as well as molecular size and shape25. Although topological descriptors are extensively used in predictive modelling, they are usually hard to interpret26. Topological descriptors may have provided information on how well a molecule fits in the binding site and along with atom counts the interactions with the binding residues.

Top ranking partial charge descriptors were PEOE_VSA+2, PEOE_VSA-4, PEOE_VSA+4, PEOE_VSA-6, PEOE_VSA_PPOS, Q_VSA_PNEG, PEOE_VSA_POL, Q_VSA_PPOS and PEOE_VSA_PNEG. The PEOE_ prefix denotes descriptors calculated using the partial equalization of orbital electronegativity (PEOE) algorithm for quantification of partial charges in the $$\sigma$$-system27,28. On the other hand, descriptors prefixed with Q_ were calculated using the Amber10:EHT force field23. In a ligand-receptor system, partial charges can play a key role in the binding properties of the molecule as well as molecular recognition.

The MD random forest model was applied to predict the class compounds in an external database, consisting of 1738 small-molecules obtained from the DrugBank database11. The top-ranking compounds with a predictive probability of $$\ge 0.80$$ for increasing the lifespan of C. elegans are shown in Table 3. The full ranking of the molecules in the screening database can be found in Additional File 2. The compounds were broadly separated into the following categories; (1) flavonoids, (2) fatty acids and conjugates, and (3) organooxygen compounds. The compounds were classified based on categories Class and Sub Class of the chemical taxonomy section of the DrugBank database (provided by Classyfire) or assigned manually if not present29.

Flavonoids are a group of secondary metabolites in plants that are common polyphenols in the human diet30. Major nutritional sources include tea, soy, fruits, vegetables, wine and nuts30,31. Flavonoids are separated into subclasses based on their chemical structure, including flavones, flavonols, flavanones, and isoflavones30.

Flavonoids have been associated with health benefits for age-related conditions such as metabolic diseases, cancer, inflammation and cognitive decline30,31. Possible mechanisms of action include antioxidant activity, scavenging of radicals, central nervous system effects, alteration of the intestinal transport, sequestration and processing of fatty acids, PPAR activation and increase of insulin sensitivity30.

Diosmin was the top-hit molecule in the screening database, with a predictive probability of 0.96. Diosmin is a flavonol glycoside that is either extracted from plants such as Rutaceae or obtained synthetically32. It has anti-inflammatory, free radical scavenging, and anti-mutagenic properties and has been used medically to treat pain and bleeding of haemorrhoids, chronic venous disease and lymphedema33. Nevertheless, diosmin has poor aqueous solubility, which is a challenge for oral administration34. Kamel et al. (2017) found that a combination of diosmin with essential oils showed skin antioxidant, anti-ageing and sun-blocking effects on mice34. The underlying mechanisms for diosmins anti-ageing and photo-protective effects include enhancing lymphatic drainage, ameliorating capillary microcirculation inflammation and preventing leukocyte activation, trapping, and migration34,35.

Other flavonoids that ranked high for increasing the lifespan of C. elegans were rutin and hesperidin with a predictive probability of 0.95 and 0.94, respectively. Rutin (or quercetin-3-rutinoside), is a flavonol glycoside that is abundant in many plants such as passionflower, apple, tea, buckwheat seeds and citrus fruits36,37. It possesses a range of biological properties including antioxidant, anticancer, neuroprotective, cardio-protective and skin-regenerative activities36,37. Rutin had a high structural similarity to other flavonoids in the DrugAge database and particularly with quercetin 3-O--d-glucopyranoside-(41)--d-glucopyranoside (Q3M). The Tanimoto coefficient between the RDKit fingerprints of Q3M and rutin was 0.99. The similarity map between the two compounds is shown in Fig.4.

Similarity map for ECFP fingerprint with a default radius of 2 (a) structure of reference molecule Q3M from the DrugAge database (b) similarity map of rutin. In green colour are bits that if removed will decrease the similarity, whereas removing bits represented in pink colour will increase the similarity between the two compounds38. The figure was created in the RDKit Python environment19.

Q3M is a flavonoid abundant in onion peel that was found to extend the lifespan of C. elegans39. In the same study, even although rutin was found to improve the tolerance of C. elegans to oxidative stress, which is desirable for longevity, rutin (20g/mL) did not significantly affect the worm's lifespan39. On the other hand, a more recent study by Cordeiro et al. (2020) found that treatment of C. elegans with 15, 30 and 120M rutin increased the lifespan of the nematodes from 28 (control) to 30 days40.

Hesperidin has shown reactive oxygen species (ROS) inhibition and anti-ageing effects in the yeast species Saccharomyces cerevisiae41. Fernndez-Bedmar et al. (2011) found that hesperidin extracted from orange juice had a positive influence on the lifespan of D. melanogaster42. Wang et al. (2020) showed that orange extracts, where hesperidin was the predominant phenolic compound, increased the mean lifespan of C. elegans43. In the same study, orange extracts were also found to promote longevity by enhancing motility and reducing the accumulation of age pigment and ROS levels43.

Soy isoflavones include genistein, glycitein, and daidzein. Genistein, a compound of the DrugAge, has been found to prolong the lifespan of C. elegans and increase its tolerance to oxidative stress44. Gutierrez-Zepeda et al. (2005) found that C. elegans fed with soy isoflavone glycitein had an improved resistance towards oxidative stress45. However, in comparison to control worms, the lifespan of C. elegans fed with glycitein (100g/ml) was not significantly affected45. The effect of daidzein (100M) on the lifespan of C. elegans in the presence of pathogenic bacteria was investigated by Fischer et al. (2012)46. The study found that daidzein had an estrogenic effect that extended the lifespan of the nematode in presence of pathogenic bacteria and heat46. Herein, we applied the MD random forest model to predict the effect of 6''-O-malonyldaidzin on the lifespan of C. elegans. 6''-O-Malonyldaidzin is an o-glycoside derivative of daidzein found in food products such as soybean, miso, soy milk and soy yoghurt47. Its predicted probability for extending the lifespan of the worm was 0.84.

Lipid metabolism has an essential role in many biological processes of an organism. Lipids are used as energy storage in the form of triglycerides and can therefore aid survival under severe conditions48. Additionally, lipids have a key role in intercellular and intracellular signalling as well as organelle homeostasis49. Research on both invertebrates and mammals indicates that alterations in lipid levels and composition are associated with ageing and longevity48,49.

A recent review by Johnson and Stolzing (2019), on lipid metabolism and its role in ageing, summarised key lipid-related interventions that promote longevity in C. elegans50. Some of the studies presented in the review are reported here. ORourke et al. (2013), showed that supplementing C. elegans with the $$\omega$$-6 polyunsaturated fatty acids (PUFAs) arachidonic acid and dihomolinoleic acid (DGLA) increased the worms starvation resistance and prolonged its lifespan by stimulating autophagy51. Similarly, Shemesh et al. (2017) found that DGLA extended the lifespan of C. elegans and maintained protein homeostasis in adulthood52. Additionally, Qi et al. (2017), found that treating C. elegans with $$\omega$$-3 PUFA $$\alpha$$-linolenic acid extended the nematodes lifespan53. The study indicated that the $$\omega$$-3 fatty acid underwent oxidation to generate a group of molecules known as oxylipins. The findings suggested that the increase of the worms lifespan could be a result of the combined effects of the -linolenic acid and oxylipin metabolites53. Sugawara et al. (2013) found that a low dose of fish oils, which contained PUFAs eicosapentaenoic acid and docosahexaenoic acid, significantly increased the lifespan of C. elegans54. The authors proposed that a low dose of fish oils induces moderate oxidative stress that extended the lifespan of the organism. In contrast, large amounts of fish oils had a diminishing effect on the worms lifespan54.

Gamolenic acid or $$~\gamma$$-linolenic acid (GLA) was the second top-hit molecule of the screening database with a predictive probability of 0.95. GLA is an $$\omega$$-6 PUFA, composed of an 18-carbon chain with three double bonds in the 6th, 9th and 12th position55. Rich sources of GLA include evening primrose oil (EPO), black currant oil, and borage oil56. In mammals, GLA is synthesized from linoleic acid (dietary) via the action of the enzyme $$\delta$$-6 desaturase55,56. GLA is a precursor for other essential fatty acids such as arachidonic acid55,56. Conditions such as hypertension and diabetes as well as stress and various aspects of ageing, reduce the capacity of $$\delta$$-6 desaturase to convert linoleic acid to GLA57. This may lead to a deficiency of long-chain fatty acid derivatives and metabolites of GLA. GLA has been used as a constituent of anti-ageing supplements and has shown to possess various therapeutic effects in humans including improvement of age-related anomalies55.

Sodium aurothiomalate, with a lifespan increase probability of 0.82, is a thia short-chain fatty acid used for the treatment of rheumatoid arthritis and has potential antineoplastic activities29,58. In preclinical models, sodium aurothiomalate inhibited protein kinase C iota (PKC) signalling, which is overexpressed in non-small cell lung, ovarian and pancreatic cancers58.

Lactose, with a lifespan increase probability of 0.89, is a disaccharide found in milk and other dairy product. In the human intestine, lactose is hydrolysed to glucose and galactose by the enzyme lactase. Out of the compounds in the DrugAge database, lactose had the highest structural similarity with trehalose. Trehalose has been found to increase the mean lifespan of C. elegans by over 30%, without showing any side effects59. The Tanimoto coefficient between the RDKit fingerprint representations of trehalose and lactose was 0.85. Even though lactose has a high (Tanimoto) similarity to trehalose, Xing et al. (2019) found that lactose treatment at 10, 25, 50 and 100mM concentrations shortened the lifespan of C. elegans60.

Sucrose, with a lifespan increase probability of 0.83, is a disaccharide composed of glucose and fructose61. It is used as the main form of transporting carbohydrates in fruits and vegetables61. Other sugars such as trehalose, galactose and fructose have been found to extend the lifespan of C. elegans59,62,63. Zheng et al. (2017) found the treating C. elegans with sucrose (55M, 111M, or 555M) had no significant effect on the organisms mean lifespan63. On the other hand, a more recent study by Wang et al. (2020) found that treatment with 50M sucrose significantly increased the lifespan of the nematodes, while a concentration of 400M significantly shortened their lifespan64.

Lactulose, with a lifespan increase probability of 0.83, is a synthetic disaccharide composed of monosaccharides lactose and galactose65. Lactulose has shown to be an effective treatment for chronic constipation in elderly patients as well as improve the cognitive function in patients with hepatic encephalopathy65,66.

Other compounds with a predictive probability of0.80 for increasing the lifespan of C. elegans included alloin, a constituent of aloe vera with a predictive probability of 0.81, as well as the antibiotics fidaxomicin (predictive probability=0.84), rifapentine (predictive probability=0.81) and chlortetracycline (predictive probability=0.80).

Rifapentine is a macrolactam antibiotic approved for the treatment of tuberculosis67. Macrolactams are a small class of compounds that consist of cyclic amides having unsaturation or heteroatoms replacing one or more carbon atoms in the ring29. Other macrolactams such as rifampicin and rifamycin have been found to increase the lifespan of C. elegans68.

Golegaonkar et al. (2015) showed that rifampicin reduced AGE products and extended the mean lifespan of C. elegans by 60%68. Advanced glycation end (AGE) products are formed from the non-enzymatic reaction of sugars, such as glucose, with proteins, lipids or nucleic acids68. AGE products have been implicated in ageing and age-related diseases such as diabetes, atherosclerosis, and neurodegenerative diseases68. The effect of two other macrolactams, rifamycin SV and rifaximin, on the lifespan of the nematode was also investigated. Rifamycin SV was found to exhibit similar activity to rifampicin, while rifaximin lacked anti-glycating activity and did not extend the lifespan of C. elegans. The authors suggested that the anti-glycation properties of rifampicin and rifamycin could be attributed to the presence of a para-dihydroxyl moiety, which was not present in rifaximin68. As shown in Fig.5, this functional group is also present in rifapentine. Experimental testing would be required to investigate whether rifapentine possesses similar properties to rifampicin and rifamycin.

Chemical structure of (a) rifamycin SV (b) rifampicin (c) rifaximin and (d) rifapentine. The para-dihydroxynaphthyl moiety possessed by rifamycin SV, rifampicin and rifapentine is highlighted in blue. Rifaximin possesses a para-aminophenyl moiety incorporated in a ring system, highlighted in red68. The figure was designed in ChemDraw and redrawn from Golegaonkar et al. (2015)68.

Pharmaceutical interventions that modulate ageing-related genes and pathways are considered the most effective approach for combating human ageing and age-related diseases. Widely used strategies for identifying active compounds include screening existing drugs with potential anti-ageing properties.

In this study, the random forest algorithm was built to predict whether a compound would increase the lifespan of C. elegans using the entries of the DrugAge database, which contains molecules with known anti-ageing properties such as metformin, spermidine, bacitracin, and taxifolin. Our results provide an update on the findings of Barardo et al. (2017), by employing the latest version of the DrugAge database, which includes many more entries. Specifically, five random forest models were built using molecular fingerprints and/or molecular descriptors as features. Feature selection and dimensionality reduction were performed using variation and mutual information-based pre-selection methods. The best performing classifier, the MD model, was built using molecular descriptors and achieved an AUC score of 0.815 for classifying the compounds in the test set. Combining molecular descriptors with ECFPs did not further improve the models performance. The features of the MD model were ranked using random forests Gini importance measure. Among the 30 highest important features were molecular descriptors related to atom and bond counts, topological and partial charge properties.

The highest performing model was applied to predict the class of the compounds in the screening database which consisted of 1738 small-molecules from DrugBank. The compounds with a predictive probability of0.80 for increasing the lifespan of C. elegans were broadly separated into (1) flavonoids, (2) fatty acids and conjugates, and (3) organooxygen compounds.

This study elucidated several molecules such as orange extracts, rutin, lactose and sucrose, that have been experimentally evaluated on C. elegans but were not entries of the predictive database. A limitation of our algorithm is that it does not consider the substances dose. For example, at certain concentrations sugars such as sucrose can promote longevity in C. elegans, whereas at higher concentrations such sugars have a detrimental effect on the lifespan of the nematodes64. Moreover, lactose, which received a predictive probability of 0.89 for increasing the lifespan of C. elegans by our model, was found to reduce the lifespan of C. elegans at 10100mM concentrations by Xing et al. (2019)60. Nevertheless, the compound could have a beneficial health effect at a different concentration.

Future work would involve in vivo testing of promising compounds such as $$\gamma$$linolenic acid, rutin, lactulose and rifapentine to investigate their effect on the lifespan of C. elegans, as well as, reevaluate the effect of lactose at lower concentrations. Finally, further work would also explore how the predicted probability of lifespan increase is affected when testing two structurally similar compounds that promote longevity at different concentrations.

The dataset published in the study by Barardo et al. (2017) contains positive entries, which are compounds that increase the lifespan of C. elegans and negative entries, compounds that do not increase the lifespan of C. elegans4. In particular, the dataset contains 1392 compounds of which 229 are positive and 1163 are negative entries4. The positive entries of this dataset were obtained from the DrugAge database of ageing-related drugs, (Build 2, release date: 01/09/2016), available on the Human Ageing Genomic Resources website1,69. DrugAge provides information on drugs, compounds and supplements with anti-ageing properties that have been found to extend the lifespan of model organisms1. The species include worms, mice and flies, with the majority of data representing C. elegans4. Data has been obtained from studies performed under standard conditions and contain information relevant to ageing, such as average/median lifespan, maximum lifespan, strain, dosage and gender where available1. The negative entries of the database used in the study of Barardo et al. (2017) were obtained from the literature.

At the time of writing, the latest version of the DrugAge database, Build 3 (release date: 19/07/2019), corrects for small errors and adds hundreds of new entries. Herein, the positive entries in the database used in Barardo et al. (2017) were replaced with the data from the newest version of DrugAge, Build 3. The same negative entries as Barardo et al. (2017) were used4. The modified database contained a total of 1558 compounds with 395 positive entries and 1163 negative ones. In this study, the term DrugAge database refers to the modified dataset with a total of 1558 compounds.

The chemical structures of the DrugAge dataset were converted into canonical SMILES strings using the Python package PubChemPy70. The SMILES strings were standardised by the Standardiser tool developed by Francis Atkinson in 201471. Standardisation removed inorganic compounds, salt/solvent components and metal species as well as neutralised the compounds by adding or removing hydrogen atoms71. Stereoisomers, even if biologically may have different activities, were treated as duplicates as they had identical SMILES strings. For two or more stereoisomers in the same class, only one was kept. For duplicates in different classes, both were removed72. After standardisation and duplicate removal, the number of molecules in the DrugAge database was reduced to a total of 1430 compounds with 304 positive and 1126 negative entries. The predictive database used in this study can be found in Additional File 2.

The standardised SMILES strings were converted into mol files in the RDKit environment and opened in the MOE software19,23. The chemical structures were energy minimised in the Energy Minimize General mode of MOE using Amber10:EHT force field23. A total of 354 descriptors were calculated including all 2D, internal i3D and external x3D coordinate depended on 3D descriptors. Due to software limitation, few 3D descriptors ('AM1_E', 'AM1_Eele', 'AM1_HF', 'AM1_HOMO', 'AM1_IP', 'AM1_LUMO', 'MNDO_E', 'MNDO_Eele', 'MNDO_HF', 'MNDO_HOMO', 'MNDO_IP', 'MNDO_LUMO', 'PM3_E', 'PM3_Eele', 'PM3_HF', 'PM3_HOMO', 'PM3_IP', 'PM3_LUMO') could not be calculated for ten chemical structures. The missing values were replaced with the average value of the remaining chemical structures for the given descriptor.

Molecular fingerprints were generated in the Python RDKit environment from the standardised SMILES strings19. ECFP of 1024-bits and 2048-bits length were calculated with an atomic radius of 2. These were represented as ECFP_1024 and ECFP_2048, respectively. In addition to the ECFPs, RDKit topological fingerprints of 2048-bits length were generated with a maximum path length of 5 bonds and denoted as RDKit5.

Five random forest models were built using five different feature types and trained with the data of the DrugAge database. The feature types explored in this study, ECFP_1024, ECFP_2048, RDKit5, MD and ECFP_1024_MD, are summarised in Supplementary Table 1, Additional File 1. The ECFP_1024_MD feature was a combined descriptor type consisting of ECFPs of 1024 bit-length and molecular descriptors.

Feature selection was implemented in the scikit-learn Python library22. Features with low variance were removed first, creating three feature subsets var_100, var_95 and var_90. The filters removed features with the same value in all entries (var_100), features that had greater than 95% of constant values (var_95) and features with more than 90% constant values, respectively (var_90)73.

where $$p\left( {x,y} \right)$$ is the joint probability mass function and $$p\left( x \right)$$ and $$p\left( y \right)$$ are the marginal probability mass functions for $$X$$ and $$Y$$, respectively74. Herein, Adjusted Mutual Information (AMI) was calculated between each feature and the class labels. AMI was used as it is a variation of MI that adjusts for chance75.

For each of the feature subset, AMI was applied using the adjusted_mutual_info_score function of scikit-learn to order the features based on their AMI score22. The following settings were tested: using 5%, 10%, 25%, 50%, 75% and 100% of the features with the highest AMI score73. For example, if var_100 for MD contained 349 features, the database with 5% of the features would consist only of the 17 highest-ranking features. This process is outlined in Supplementary Fig.2, Additional File 1.

Cross-validation was performed in the scikit-learn Python library using the cross_val_score function22. The predictive database was randomly split into 80% training and 20% test set. The tenfold cross-validation was performed only on the training set. The performance of the models was evaluated using the AUC measure. Cross-validation was repeated 10 times, yielding 10 AUC scores. The predictive accuracy reported was the median AUC value of the 10 measurements obtained by cross-validation. The median, rather than average, AUC score was calculated as the former is more robust to outliers4.

The random forest classifiers were built in the scikit-learn Python module22. To handle the unbalanced data used in this study, the random forest parameter class_weight was set to balanced. The remaining parameters of the random forest classifier were set to their default settings. The models were run with 100 estimators (number of trees in the forest) and the maximum number of features considered in each tree node was the square root of the total number of features. The AUC scores were calculated with the roc_auc_score matrix of scikit-learn using the predict_proba method22.

The best performing model on the test set was applied to predict the class of the compounds in an external database, where the effect of the compounds on the lifespan of C. elegans was mostly unknown. The external database consisted of small-molecules obtained from the External Drug Links database of DrugBank (version 5.1.5, released on 2020-01-03)11. The External Drug Links database contained a list of drugs and links to other databases, such as PubChem and UniProt, providing information on these compounds11,76,77.

Generation of SMILES strings, standardisation and descriptor calculation was performed in the same method used for the training (DrugAge) database, described in the above sections. Some of the entries of the DrugBank database were substances composed of more than one molecule, such as vegetable oils. These entries were either removed from the database or replaced by one of their main active ingredients. For example, borage oil was replaced with gamolenic acid. In the case of soy isoflavones, the major soy isoflavones (genistein, glycitein, and daidzein) had already been experimentally evaluated on the lifespan of C. elegans. Therefore, the entry was replaced with 6''-O-malonyldaidzin, a derivative of daidzein with unknown activity. Stereoisomers were treated as duplicates and only one of them was kept. Substances and stereoisomers present in both the DrugBank and DrugAge databases were removed from the screening database. The resulting database consisted of a total of 1738 small-molecules.

The Tanimoto coefficients and similarity maps were computed in the Python RDKit environment19. The Tanimoto similarity is calculated between a reference molecule, which is known to be active, and a compound of interest with unknown activity.

Herein, the reference molecules were the positive entries of the DrugAge database. The compound with unknown activity was a selected entry from the screening database. The Tanimoto coefficient between the compound of interest with each of the reference molecules was calculated. The highest score achieved and the reference molecule used to obtain that score was reported. The Tanimoto coefficients were computed using the RDKit fingerprint representations of the compounds. Similarity maps were generated using ECFP fingerprint representations.

Mamoshina, P. et al. Population specific biomarkers of human aging: A big data study using South Korean, Canadian, and Eastern European patient populations. J. Gerontol. A. Biol. Sci. Med. Sci. 73, 14821490 (2018).

Gaba, V., Rani, K. & Gupta, M. K. QSAR study on 4-alkynyldihydrocinnamic acid analogs as free fatty acid receptor 1 agonists and antidiabetic agents: Rationales to improve activity. Arab. J. Chem. 12, 17581764 (2019).

Roy, K., Kar, S. & Das, R. N. Chapter 2: Chemical Information and Descriptors. in Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment 4780 (Academic Press, 2015). https://doi.org/10.1016/B978-0-12-801505-6.00002-8.

Bender, A. & Glen, R. C. A discussion of measures of enrichment in virtual screening: Comparing the information content of descriptors with increasing levels of sophistication. J. Chem. Inf. Model. 45, 13691375 (2005).

Ramelet, A. A. Venoactive drugs. In Sclerotherapy: Treatment of Varicose and Telangiectatic Leg Veins (eds Goldman, M. P. et al.) 369377 (W.B. Saunders, 2011). https://doi.org/10.1016/B978-0-323-07367-7.00020-0.

Mangoni, A. A. Drugs acting on the cerebral and peripheral circulations. In A Worldwide Yearly Survey of New Data in Adverse Drug Reactions and Interactions Vol. 34 (ed. Aronson, J. K.) 311316 (Elsevier, 2012).

Cordeiro, L. M. et al. Rutin protects Huntingtons disease through the insulin/IGF1 (IIS) signaling pathway and autophagy activity: Study in Caenorhabditis elegans model. Food Chem. Toxicol. 141, 111323 (2020).

Fernndez-Bedmar, Z. et al. Role of citrus juices and distinctive components in the modulation of degenerative processes: Genotoxicity, antigenotoxicity, cytotoxicity, and longevity in drosophila. J. Toxicol. Environ. Heal. Part A 74, 10521066 (2011).

Shemesh, N., Meshnik, L., Shpigel, N. & Ben-Zvi, A. Dietary-induced signals that activate the gonadal longevity pathway during development regulate a proteostasis switch in caenorhabditis elegans adulthood. Front. Mol. Neurosci. 10, 254 (2017).

Rezapour-Firouzi, S. Chapter 24: Herbal oil supplement with hot-nature diet for multiple sclerosis. In Nutrition and Lifestyle in Neurological Autoimmune Diseases (eds Watson, R. R. & Killgore, W. D. S.) 229245 (Academic Press, 2017). https://doi.org/10.1016/B978-0-12-805298-3.00024-4.

Yahia, E. M., Carrillo-Lpez, A. & Bello-Perez, L. A. Carbohydrates. In Postharvest Physiology and Biochemistry of Fruits and Vegetables (ed. Yahia, E. M.) 175205 (Woodhead Publishing, 2019). https://doi.org/10.1016/B978-0-12-813278-4.00009-9.

Vinh, N. X., Epps, J. & Bailey, J. Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary? in Proceedings of the 26th Annual International Conference on Machine Learning 10731080 (Association for Computing Machinery, 2009). https://doi.org/10.1145/1553374.1553511.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## building random forest classifier with python scikit learn

In the Introductory article about random forest algorithm, we addressed how the random forest algorithm works with real life examples. As continues to that, In this article we are going to build the random forest algorithm in python with the help of one of the best Pythonmachine learning library Scikit-Learn.

To build the random forest algorithm we are going to use the Breast Cancer dataset. To summarize in this article we are going to build a random forest classifier to predict the Breast cancer type (Benign or Malignant).

Random forest algorithm is an ensemble classificationalgorithm. Ensemble classifier means a group of classifiers. Instead of using only one classifier to predict the target, In ensemble, we use multiple classifiers to predict the target.

In case, of random forest, these ensemble classifiers are the randomly created decision trees. Each decision tree is a single classifier and the target prediction is based on the majority voting method.

The majority voting conceptis same as the political votings. Each person votes per one political party out all the political parties participating in elections. In the same way, every classifier will votes to one target class out of all the target classes.

To declare the election results. The votes will calculate and the party which got the most number of votes treated as the election winner. In the same way, the target class which got the most number of votes considered as the final predicted target class.

I hope you have a clear understanding of how the random forest algorithm works. Now lets implement the same. As I said earlier, we are going to use the breast cancer dataset to implement the random forest.

Sadly breast cancer is to second most death reason for womens. In the US during the year 2016, almost 246,660 womens breast cancer casesare diagnosed. The myth people believe tumor as cancer but which is not true.

A benign tumor is not a cancerous tumor. Which means its not able to spread through the body like the cancerous tumors. The benign is serious when its growing in sensitive places. This kind of tumors are will well terminated with proper treatment and with the change in diet habits.

If you install the python machine learning packages properly, you wont face any issues. Even though you install the packages properly and you facing the issue ImportError: No module named model_selection. This means the scikit learn package you are using not updated to the new version.

The downloaded dataset is in the data format. So we are going to convert into the CSV format. To do that we are going to write asimple function which first loadsthe data format into the pandas dataframe and later the loaded dataframe will save into the CSV file format.

The loaded dataset doesnt have the header names. So we need to add the header names to the loaded dataframe. To do the same we havewritten a function with takes the dataset and header names as input and add the header names to the dataset.

After adding the header names to dataset we are saving the dataset into CSV format. While saving the file we parameterized the index=False. When we save the loaded dataframe without this the saved file will have an extra column with the indexes. So to eliminate this we are parameterized the index=False.

The best idea to start with is, calculating basic statistics for each column(features and target) of the dataset. You may be wondering what the use of calculating basic statistics of the dataset and how it gonna helps to find the missing values.

Yes, finding the basic statistics will helps us to find the missing values in the dataset. The idea is we can use pandas describe method on the loaded dataset to calculate the basic statistics. This outputs the stats about only the columns which are not having any missing values or categorical values.

As we know the missing values column and the character used to represent the missing values. Now lets write a simple function will take the dataset, header_name and missing value representing the character as input handles the missing values.

Now our dataset is missing values free. Now lets split the data into train and test dataset. The training dataset will use to train the random forest classifier and the test dataset used the validate the model random forest classifier.

To split the data into train and test dataset, Lets write a function which takes the dataset, train percentage, feature header names and target header name as inputs and returns the train_x, test_x, train_y and test_y as outputs.