The system design was inspired on an initial analysis of information contained in various data resources related to toxicology, covering databases, literature, EPARs and a small sample of printed EFPIA toxicological reports provided by one of the partners. This information was used to define the overall text mining strategy to be followed, the type of entities and methods to be used and which data types contain in the industry reports could be captured from the other document collections. Thereafter followed the document data set construction, automatic retrieval of EPARs, NDAs in addition to the entire PubMed database and a large set of full text articles. These documents were standardized, converted to plain text, pre-processed, tokenized (including sentence boundary recognition).
To avoid limitations related to the lack (at this stage) of a proper ontology useful for text indexing and retrieval of hepatotoxicity textual data we opted for a text categorization strategy requiring only a small seed set of high precision hepatotoxicity keywords. A collection of PubMed abstract records relevant for drug-induced liver diseases suitable for text mining purposes was assembled, mainly derived from the Comparative Toxicogenomics Database (CTD). A semi-supervised text classification system based on Support Vector Machines was constructed, both to do the classification at the level of abstracts/passages as well as individual sentences. This system contained distinct classifiers for sentences from scientific literature and for sentences from EPARs. Validation was done on documents annotated in CTD as relevant for drug-induced liver disease.
A crucial step was the recognition of chemical compounds in the various article collections. A dictionary lookup approach using a refined set from the Jochem lexicon was used to detect compound and drug names. Another strategy was based on named entity recognition systems, which rely on a statistical model of text to recognize mentions of chemical substances. Here we applied various taggers, i.e. OSCAR4, ChemicalTagger and an in house system. Finally the third strategy (only for scientific documents) consisted in using metadata in terms of chemical substance MeSH terms. To normalize/ground the compound mentions, in case of the dictionary based approach we returned external identifiers from one of these databases/types: ChemIDplus, ChEBI, CAS, PubChem compound, InCHI string, DrugBank. In addition to this we carried out a name to structure conversion of all the compound mentions to generate SMILES and Sdfiles.
To determine if the detected compound/drug mentions are directly associated to hepatotoxicity we used several strategies. One approach relied on building a lexical resource of terms relevant to this particular toxic end point and then retrieved co-mentions at the sentence level (term co-mention). Another strategy relied on rules that analyzed in the context of mention of compound location descriptions and adverse effect descriptions (rule matching). We also explored textual pattern based approaches that made used of n-gram statistics and part of speech tagging to define the actual extraction patterns (pattern matching). Moreover, the result of the sentence and document classification system can be used for more efficient retrieval of hepatotoxicity evidence for a given search term or chemical substance (interpretation and validation). In addition to value as a search engine we have used this context based scoring of the sentences and documents to score/prioritize compounds for their relevance to hepatotoxicity (machine learning approach).