Synonymy is one of the most important relations between terms and is of critical importance for building high-quality text mining systems for biomedical literature. Thesauri that list synonymous terms have been found useful for improving the results of information retrieval systems, and synonymy relations are encoded in many ontologies, for example Gene Ontology, and in biological databases such as SWISS-PROT. General thesauri, such as WordNet, give relatively poor coverage of specialised domains, and for many languages and domains no thesaurus exists at all. Domain-specific thesauri and ontologies are expensive to construct, owing to the scarcity of human experts, and often do not list sufficient variants of terminology. Against this background, automatic discovery of synonymy relations between terms has been shown to be useful both for maintaining and expanding existing ontologies and for constructing new ones; however, the accuracy of such methods remains a major issue.

There has been a large amount of interest in constructing thesauri and ontologies automatically, not only for synonymy relations, which we study here, but also for hypernymy (general/specific) relations. Several methods have been suggested. For example, Morin & Jacquemin explore using term variation (for example "mouth cancer" vs. "cancer of the mouth") to detect synonyms, but many synonymous terms are not simple variants, so this method is limited. Distributional similarity, that is, identifying terms by the other terms that occur in close proximity, has also been shown to be effective for identifying synonymy, most notably the Latent Semantic Analysis method of Cederberg & Widdows. As shown in Dumais et al, this can be used to improve the recall of a synonym classifier, although at the cost of its precision. This method, however, lacks the ability to differentiate between specific semantic relations (for example synonymy, hypernymy, agent/disease).

Hearst used patterns including "X such as Y" to detect a hypernymy relation between terms; however, she chose these patterns by hand. Due to issues of accuracy and scalability to other relations, some work has gone into constructing these patterns automatically, notably in Snow et al. Their pattern extractor was based on a dependency grammar and so requires a grammar and a parser, which limits its applicability to those few languages for which large-coverage parsers have been developed. They then generated a large number of patterns and classified them by a logistic regression-based method. Finally, they attempted to improve their results by using distributional similarity. They also used a number of hand-chosen synonymy patterns to detect potential synonyms and used these to improve their detection rate. An attempt to find gene and protein name synonyms was explored in Yu et al; again, they manually chose their patterns.

One of the disadvantages of most of these approaches is that they give only a binary classification rather than outputting synonym sets. For practical applications, a simple list of synonymous terms is much more desirable for several reasons. Firstly, the results are much simpler and easier to store and work with, as one need only list the groups instead of the synonymy relation between each pair. Also, the result is self-consistent, so we will not get a result in which X is synonymous with Y, and Y with Z, but X is not synonymous with Z. Such an inconsistent result may cause problems in applications where only a single term is used to represent the synset; for example, a method based on an ontology may require one main term for the synset. An attempt at improving the result of a term similarity classifier by graph clustering was explored in Ibekwe-Sanjuan & Sanjuan; however, their method was not based on probabilities, and the suggested clustering method would lead to questionable behaviours such as grouping terms into a synset when the vast majority of links are not indicative of synonymy, hence the classifier must have a very high precision.

Our work is most closely related to the work in Snow et al for automatically discovering hypernyms. In order to find such patterns we chose to develop a system based on Soderland's WHISK system, as it uses regular expression matching and so requires no prior knowledge of the syntax of the language. In WHISK, a base pattern consisting only of "*"s and slots is grown by replacing each of these "*"s with a word from the corpus, and a set of patterns is then sought which maximises the overall performance. We also wish to investigate this method in specific domains and to see whether it is still feasible with a smaller seed set, as the seed set in Snow et al was the entirety of WordNet. For our experiment we chose to focus on the disease control domain, but our method could easily be applied to many domains in biomedicine or other areas.
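
The pattern-growing idea described above, in which a base pattern of "*"s and slots is specialised by replacing a "*" with a corpus word, can be illustrated with a minimal sketch. This is not the WHISK algorithm or the system developed in this work: it omits rule scoring and selection entirely. The names `SLOT`, `WILD`, `compile_pattern`, `grow` and `extract_pairs` are hypothetical, as are the two example sentences. As a simplifying assumption, a slot captures exactly one word and a wildcard skips one or more whitespace-separated tokens. The grown pattern is the "X such as Y" pattern mentioned above, so the pair it extracts stands in a hypernymy relation, not a synonymy relation.

```python
import re

SLOT = "<slot>"  # hypothetical marker for a term to be extracted
WILD = "*"       # wildcard, as in the base patterns described in the text


def compile_pattern(tokens):
    """Compile a list of pattern tokens into a regular expression."""
    parts = []
    for tok in tokens:
        if tok == SLOT:
            parts.append(r"(\w+)")  # assumption: a slot is a single word
        elif tok == WILD:
            parts.append(r"\S+(?:\s+\S+)*?")  # one or more tokens, as few as possible
        else:
            parts.append(re.escape(tok))  # a literal word from the corpus
    return re.compile(r"\s+".join(parts))


def grow(tokens, index, word):
    """Return a new pattern with the wildcard at `index` replaced by `word`."""
    if tokens[index] != WILD:
        raise ValueError("can only grow at a wildcard position")
    return tokens[:index] + [word] + tokens[index + 1:]


def extract_pairs(tokens, sentences):
    """Collect the slot fillers of every match of the pattern."""
    regex = compile_pattern(tokens)
    return [m.groups() for s in sentences for m in regex.finditer(s)]


# Made-up example sentences, for illustration only.
sentences = [
    "diseases such as influenza spread quickly",
    "influenza is caused by viruses",
]

base = [SLOT, WILD, WILD, SLOT]                   # only wildcards and slots
grown = grow(grow(base, 1, "such"), 2, "as")      # <slot> such as <slot>

print(extract_pairs(base, sentences))   # the base pattern also matches unrelated word pairs
print(extract_pairs(grown, sentences))  # the grown pattern is more selective
```

The base pattern matches almost any pair of words a few tokens apart, which is why a system of this kind needs some way to evaluate each grown pattern against seed data before keeping it.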