A false positive, i.e. classifying a benign tumour as malignant, will recommend chemotherapy that is not required. On the other hand, a false negative, i.e. classifying a malignant tumour as benign, will lead to no treatment at all and let the disease advance further. In such problem domains, to develop trust in the machine learning model, it is desirable that the outcome of the model be understandable to a human expert. In other words, the human expert should be able to identify which features made the model predict a particular outcome for a given instance. So, a human-interpretable machine learning model is
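The notion of identifying which features drove a prediction can be illustrated with a linear classifier, where each feature's contribution to the decision score is simply its weight times its value. The sketch below is purely illustrative: the feature names, weights and the decision threshold are hypothetical, not taken from any real diagnostic model.

```python
# Sketch: interpreting a linear classifier's prediction via per-feature
# contributions (weight * value). All names and numbers are illustrative.

weights = {"tumour_radius": 1.8, "cell_symmetry": -0.9, "texture": 0.4}
bias = -0.5

def explain(instance):
    """Return the predicted label and each feature's contribution,
    sorted by absolute magnitude, for one instance."""
    contributions = {f: weights[f] * v for f, v in instance.items()}
    score = sum(contributions.values()) + bias
    label = "malignant" if score > 0 else "benign"
    ranked = sorted(contributions.items(), key=lambda kv: -abs(kv[1]))
    return label, ranked

label, ranked = explain({"tumour_radius": 2.0,
                         "cell_symmetry": 1.0,
                         "texture": 0.5})
```

A human expert can then see that, in this toy example, the prediction is dominated by the first-ranked feature rather than having to treat the model as a black box.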
basically one whose outcomes can be interpreted by a human expert. In machine learning, as the data acquire more features, the effort required to analyse the classification grows exponentially with the number of features. Richard E. Bellman coined the term "curse of dimensionality" for this phenomenon: the number of relevant features increases with the number of dimensions when considering problems of enumeration on product spaces in dynamic optimisation. To cope with the large number of features in such datasets, feature selection has been used extensively in machine learning applications. A classification model can be built by eliminating redundant or irrelevant features from the dataset and retaining the highest-ranked, highest-priority features extracted according to a chosen criterion by feature selection strategies. The common approach is to represent the high-dimensional data as variables known as features, which preserves the valuable information, as shown in Fig. 1.
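A minimal sketch of such a filter-style feature selection strategy is shown below: features are scored by a ranking criterion and only the top-k are retained. The criterion used here (absolute difference of per-class means) and the data are illustrative assumptions, standing in for whatever criterion a concrete method would use.

```python
# Sketch of filter-based feature selection: score every feature by a
# simple criterion (absolute difference of per-class means, chosen here
# for illustration) and keep the k highest-ranked feature indices.

def rank_features(samples, labels, k):
    """samples: list of equal-length feature vectors; labels: 0/1 per sample."""
    n0 = labels.count(0) or 1
    n1 = labels.count(1) or 1
    scores = []
    for j in range(len(samples[0])):
        mean0 = sum(s[j] for s, y in zip(samples, labels) if y == 0) / n0
        mean1 = sum(s[j] for s, y in zip(samples, labels) if y == 1) / n1
        scores.append((abs(mean1 - mean0), j))
    # Most class-discriminative features first.
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Toy data: feature 1 is nearly constant (irrelevant); features 0 and 2
# separate the two classes to different degrees.
X = [[0.1, 5.0, 1.0], [0.2, 5.1, 1.1], [0.9, 5.0, 3.0], [1.0, 4.9, 3.1]]
y = [0, 0, 1, 1]
selected = rank_features(X, y, k=2)
```

The irrelevant near-constant feature is discarded, which is exactly the dimensionality reduction the text describes.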
Microarray data sets are large and are analysed in terms of variables and data points. This strategy reduces the dataset to the genes that can distinguish between the two cases or classes. For example, a data point can have approximately 500,000 variables, and processing multiple such data points is not an easy task, incurring a high computational cost. When the dimensionality of a dataset grows rapidly, it becomes very difficult to establish results statistically because of the sparsity of the data.

The biomedical document parser identifies the structure of phrases and sentences in XML and PDF documents. Information about genes, proteins and their pathways is analysed using the MedScan toolkit. MedScan is a three-tier knowledge extraction system based on a biomedical document parsing model. In the first tier, the pre-processor module tags biomedical MeSH terms using domain-specific concepts. The pre-processor reads the XML format of a MEDLINE abstract and parses it into individual terms and sentences. In this module, a protein-name dictionary is used as a training dataset to filter protein names and to select the terms or sentences containing at least one gene or protein name. In the second tier, the natural language processor derives semantic relationships between term and sentence structures; it is based on a context-free grammar and a lexicon parse tree for MEDLINE protein extraction. In the final tier, the knowledge extraction engine acts as a domain-knowledge filter, extracting key MeSH-based document information in conceptual graph format. MedScan uses an ontology-based filter to place gene or protein semantic entities into the ontology tree structure. The main limitation of the MedScan toolkit is that the efficiency of the natural language processor should be optimised by improving its size, quality and pre-processing algorithms.
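The first-tier pre-processing step described above can be sketched in a few lines: split an abstract into sentences and keep only those mentioning at least one entry of a protein-name dictionary. The dictionary, abstract text and sentence-splitting rule below are toy assumptions; MedScan's actual parser is considerably more elaborate.

```python
# Sketch of dictionary-based sentence filtering, as in MedScan's
# pre-processor tier. Dictionary entries and the abstract are toy data.
import re

PROTEIN_DICT = {"p53", "BRCA1", "EGFR"}  # illustrative protein names

def filter_sentences(abstract):
    """Keep only sentences that mention at least one known protein name."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [s for s in sentences
            if any(p.lower() in s.lower() for p in PROTEIN_DICT)]

abstract = ("Mutations in BRCA1 increase cancer risk. "
            "The weather was not recorded. "
            "p53 regulates the cell cycle.")
kept = filter_sentences(abstract)
```

Only the two protein-bearing sentences survive, which is the filtering behaviour the pre-processor module performs before the later tiers run.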
The volume of biomedical gene, protein or disease entities can be increased several times by extending the ontology structure, at the cost of high computational demands. Support vector machine optimisation aims at constructing a separation function for domain knowledge extraction; each document is classified using hyperplanes in the feature vector space.
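The hyperplane classification just mentioned can be illustrated with a simplified linear SVM trained by sub-gradient descent on the hinge loss; the separating function is the sign of w·x + b. The toy "documents" below are hypothetical two-dimensional term-count vectors, and the training procedure is a bare-bones sketch rather than a full SVM solver.

```python
# Minimal sketch of hyperplane-based document classification: learn
# w, b for the separating function sign(w.x + b) by sub-gradient
# descent on the hinge loss (a simplified linear SVM). Toy data only.

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """X: feature vectors; y: labels in {-1, +1}. Returns (w, b)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # inside the margin: hinge-loss sub-gradient
                w = [wj + lr * (yi * xj - lam * wj)
                     for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # outside the margin: regularisation only
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def classify(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy "documents" as [count("gene"), count("weather")] vectors.
X = [[3, 0], [2, 1], [0, 3], [1, 2]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

Each document's side of the learned hyperplane determines its class, which is the feature-vector-space view of document classification described above.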