TarDict: RandomForestClassifier-based software that predicts drug-target interactions using SMILES

The future of therapeutics depends on understanding the interaction between the chemical structure of a drug and the target protein that contributes to the etiology of a disease. Predicting the targets of investigational drugs from data on already characterized drugs is important both for understanding drug-molecule interactions and for developing new drugs. Using machine learning and published drug information, we designed an easy-to-use tool that predicts biological target proteins for medical drugs. TarDict is based on the simplified molecular-input line-entry system (SMILES), a line notation for chemical structures. It receives SMILES entries and returns a list of possible similar drugs as well as possible drug targets. TarDict uses 20442 drug entries with well-known biological targets to construct a predictive computational model capable of predicting novel drug targets with an accuracy of 95%. We developed a machine learning approach that recommends target proteins for approved drugs. The proposed method was highly predictive on a testing dataset consisting of 4088 targets and 102 manually entered drugs. The proposed computational model is an efficient and cost-effective tool for drug target discovery and prioritization. Such a tool could be used to enhance drug design, predict potential targets, and identify combination therapy crossroads.


Introduction
The identification of drug-target interactions (DTIs) drives revolutionary drug discoveries. Drug developers search for compounds that interact with specific targets that have biological activities of interest. However, identifying DTIs experimentally for an enormous number of chemical compounds usually takes 2-3 years and carries high costs [1]. Many computational methods and procedures have therefore been developed to address this problem. Docking methods, among the most common computational approaches, simulate the binding of a small molecule to a protein in its 3D structure and were the first to be studied and applied. They explore different scoring functions and placements of the molecule to minimize the free energy of binding. Docking methods have continued to improve; an alternative docking strategy termed DARC (Docking Approach using Ray-Casting) maps the structure of a surface pocket "observed" from within the protein to the structure "observed" when viewing a potential ligand [2]. Moreover, many studies have focused on similarity-based methods, in which similar drugs are assumed to bind to similar proteins and vice versa. Yamanashi et al. used a kernel regression method that takes information on known drug interactions as input to identify new DTIs, merging chemical, genomic, and pharmacological approaches [3]. These efforts successfully established the 'one drug, one target' paradigm in the pharmaceutical field by drawing attention to the role of the small number of key genes that interact with drugs [4]. Such interactions show how many drugs affect the body's proteins and explain why disease development is often the result of a series of disruptions in the body's global pathway network [5].

Highlights in Bioinformatics
Because studying different pathways and determining whether a chemical and a pathway interact within a cellular network is time-consuming, expensive, and laborious, it is reasonable to develop computational methods and machine learning algorithms that predict potential drug-target interactions in order to understand a drug's mechanism of action [6].
One of the most widely used algorithms in bioinformatics is RandomForestClassifier, which has proved to be a model of choice for various machine learning projects. A random forest, as its name suggests, consists of a number of decision trees that work together. Each tree in the forest produces a prediction, and the class that receives the most votes becomes the model's prediction [7,8].
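The voting step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not TarDict's code: the function name majority_vote and the target labels are hypothetical, and scikit-learn's own implementation combines the trees' class probabilities rather than counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the most trees.

    Ties are resolved in favour of the label that was seen first.
    """
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes from five trees for one drug (labels are placeholders).
votes = ["target_A", "target_B", "target_A", "target_A", "target_B"]
predicted_target = majority_vote(votes)
print(predicted_target)
```

Three of the five hypothetical trees vote for the same label, so that label becomes the forest's prediction.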
Computationally, there are many ways to represent a drug's chemical structure. The drug fingerprint is the most commonly used descriptor of drug substructure [9]: the drug is transformed into a binary vector in which each index indicates the presence or absence of a particular substructure. For proteins, descriptors are conventionally used as computational representations [10]. Unfortunately, feature-based models that use protein descriptors and drug fingerprints have shown weaker performance than conventional quantitative structure-activity relationship (QSAR) models [11]. We therefore developed a computational prediction tool called "TarDict" that predicts the biological target of any drug from its SMILES string, without the need for a drug fingerprint. TarDict uses the RandomForestClassifier algorithm to construct a classification model that predicts which target a drug's chemical compound would act on. We used annotated structural drug data from DrugBank, along with the target gene and pathway, as a training data set to improve predictive accuracy.

Drug-target data
Drug-target data were obtained from the DrugBank database [12], which contains the required information, such as drug name, gene name, target pathway, and SMILES. We focused on the SMILES of 20442 different drugs that have been tested and studied in liver carcinoma tissue and have well-known drug targets in the DrugBank database.

Algorithm selection
To select a suitable algorithm that fits the data and predicts reliably, we built a script that loops over 23 algorithms using the same training and testing data. After testing and evaluating these algorithms, RandomForestClassifier showed the best accuracy and performance (Code 1).
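A comparison loop of this kind might look like the following sketch. It is not the authors' Code 1: the data are synthetic stand-ins generated with make_classification, and the five candidate classifiers are an illustrative subset chosen here, not the 23 algorithms from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); the study used vectorized SMILES.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Illustrative candidates only; every model sees the same split.
candidates = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=30, random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out set

# Report the candidates from best to worst test accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
best_name = max(scores, key=scores.get)
```

Keeping the train/test split fixed across all candidates is what makes the accuracies comparable; the ranking on synthetic data says nothing about the ranking on DrugBank data.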

Model building using RandomForestClassifier
A random forest is essentially a set of decision trees, where each tree is a hierarchy of if/else questions that lead to a decision. The main downside of decision trees is their tendency to overfit the training data. In a random forest, each tree differs from the others. The idea behind random forests is that each tree may predict fairly well but will probably overfit part of the data. If many trees are built that all perform well and overfit in different ways, averaging their results reduces the amount of overfitting. This reduction in overfitting preserves the trees' predictive ability.

Parameters tuning
A crucial parameter is the number of trees in the forest, which is set by the n_estimators parameter in our script; we construct 30 trees. These trees are built independently, and the algorithm makes random choices for each tree to ensure that the trees are distinct. The number of features considered when building each if/else split is controlled by the max_features parameter, which is 40 in our case. We enabled bootstrapping of our data to make sure the forest is random. That is, we repeatedly draw samples at random, with replacement, from our data points (the same data point may be chosen many times). The random forest implementation comes from the scikit-learn machine learning library for Python [13] (Code 2). Figure 1 shows the flowchart-like structure of a random forest decision tree. We used the CountVectorizer class, which converts a collection of text documents to a matrix of token counts. We used CountVectorizer to convert the textual SMILES data to numeric form so that the machine learning model can learn from it. The vocabulary fitted on the training file serves as the reference for vectorizing SMILES input.
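The configuration described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the authors' Code 2: the eight training molecules and the placeholder labels target_A and target_B are invented for the example, and the character n-gram tokenization (analyzer="char", lowercase=False) is a choice made here because the text does not specify how SMILES strings were tokenized.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy training data: SMILES strings with placeholder target labels.
# The labels are illustrative only and are NOT taken from DrugBank.
train_smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",        # aspirin
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",   # ibuprofen
    "CC(=O)NC1=CC=C(C=C1)O",           # paracetamol
    "C1=CC=C(C(=C1)C(=O)O)O",          # salicylic acid
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",    # caffeine
    "CN1C=NC2=C1C(=O)NC(=O)N2C",       # theobromine
    "CN1C2=C(C(=O)N(C1=O)C)NC=N2",     # theophylline
    "CC(C1=CC2=C(C=C1)C=C(C=C2)OC)C(=O)O",  # naproxen
]
train_targets = ["target_A", "target_A", "target_A", "target_A",
                 "target_B", "target_B", "target_B", "target_A"]

# Character n-grams are an assumption; case is preserved because SMILES
# distinguishes upper- and lower-case atom symbols.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False)
X_train = vectorizer.fit_transform(train_smiles)

forest = RandomForestClassifier(
    n_estimators=30,                         # 30 trees, as stated in the text
    max_features=min(40, X_train.shape[1]),  # 40 per split, capped for toy data
    bootstrap=True,                          # draw training rows with replacement
    random_state=0,
)
forest.fit(X_train, train_targets)

# New input is vectorized with the vocabulary learned from the training data.
query = ["CC(=O)OC1=CC=CC=C1C(=O)O"]
X_query = vectorizer.transform(query)
prediction = forest.predict(X_query)[0]
print(prediction)
```

Calling transform rather than fit_transform on the query keeps its columns aligned with the training matrix, which is the role the training file plays as the reference for vectorization.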
Using the Python programming language, we packaged the developed model into a standalone tool. TarDict receives SMILES entries and returns a list of possible similar drugs and possible targets. It connects the input drugs with biological pathways and finally exports the possible pathways to the user. TarDict retrieves possible drug targets in three steps: (a) it receives the drug SMILES information; (b) it predicts the closest drug to this SMILES; and (c) the RandomForestClassifier-based model identifies the target that the predicted drug acts on and exports the name of the pathway to the user.
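The three steps could be wired together roughly as in the sketch below. The text does not say how the closest drug is found, so cosine similarity between count vectors is used here purely as an assumption; the reference table, the placeholder labels, and the function name predict_for_smiles are likewise hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical reference table: (drug name, SMILES, placeholder target label).
known_drugs = [
    ("aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O", "target_A"),
    ("ibuprofen", "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "target_A"),
    ("caffeine", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "target_B"),
    ("theobromine", "CN1C=NC2=C1C(=O)NC(=O)N2C", "target_B"),
]
names = [d[0] for d in known_drugs]
smiles = [d[1] for d in known_drugs]
targets = [d[2] for d in known_drugs]

vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False)
X_known = vectorizer.fit_transform(smiles)
model = RandomForestClassifier(n_estimators=30, random_state=0)
model.fit(X_known, targets)


def predict_for_smiles(query_smiles):
    """Return (closest known drug, predicted target) for one SMILES string."""
    # (a) receive the SMILES string and vectorize it
    x = vectorizer.transform([query_smiles])
    # (b) find the closest known drug (cosine similarity is an assumption)
    similarities = cosine_similarity(x, X_known)[0]
    closest_index = int(similarities.argmax())
    # (c) let the forest predict the target for that closest drug
    row = X_known[closest_index:closest_index + 1]
    predicted_target = model.predict(row)[0]
    return names[closest_index], predicted_target


closest, target = predict_for_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")
print(closest, target)
```

A final lookup from target to pathway name would follow step (c); it is omitted here because the text does not describe the mapping.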

Results and Discussion
Drug discovery and improvement depend heavily on the information obtained from known models of drug-target interactions, and drug-target prediction tools could enhance therapeutics and increase the impact of medical research on human health. In this study, biological and chemical information on 20442 drugs collected from the DrugBank database was used to construct a machine learning model that can blindly predict the targeted pathway of novel drugs from chemical structure information.
Ensemble algorithms combine several learning models, which enables more effective models to be built. Machine learning offers many such models, but two ensemble models have proved successful on a wide variety of datasets, both of which use decision trees as their building blocks.
To evaluate the predictive capability of TarDict, ten performance evaluation measures were applied. These measures capture different aspects of TarDict's performance (Table 1) and were computed with a Python script that imports the module for each test. In addition, we validated the accuracy of the model's predictions by testing randomly selected SMILES data. Figure 2 shows the classification report assessing the predictive accuracy of TarDict for each drug class and the difference between the actual and the predicted target proteins.
TarDict is an easy-to-use drug-target prediction tool that needs minimal information about a drug to predict its potential biological target. This could facilitate therapeutic studies in which large numbers of drugs are tested for their potential to target pathogenic genes. TarDict is open-source software, so researchers can use the same pipeline to create more effective prediction models for any drug class of interest. In addition, the step that predicts the closest drug will allow TarDict to remain compatible with future drug findings and supports its potential use in classifying drugs by targeted biological pathways.
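Evaluation measures of the kind mentioned in this section can be computed with scikit-learn's metrics module, roughly as sketched below. The true and predicted labels are made-up placeholders, and the four measures shown are a common selection rather than the ten measures reported in Table 1.

```python
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted target labels for six test drugs.
y_true = ["target_A", "target_A", "target_B", "target_B", "target_C", "target_C"]
y_pred = ["target_A", "target_B", "target_B", "target_B", "target_C", "target_A"]

# Macro averaging gives every target class the same weight.
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")

# Per-class precision, recall, and F1, similar in form to a classification report.
print(classification_report(y_true, y_pred))
```

Macro averaging is one reasonable choice when there are many target classes of unequal size; weighted averaging would instead favour the most frequent targets.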

Conclusions
In the field of drug-target interaction, tools like TarDict are an important innovation that evolves quickly alongside other modern fields such as precision medicine, genomics, and teleconsultation. Searching for and developing therapeutic agents to cure a particular illness through clinical trials takes years and a million dollars. Although approximately 17 databases and resources are available for detecting drug-target interactions [13], TarDict's capacity to recognize the targets of an unknown drug still stands out in this field. TarDict is an open-source platform that can be modified to include additional features, change the algorithm, and link its code with other tools.