TExSIS - Terminology Extraction for Semantic Interoperability and Standardization
CrossLang is a partner and user group member of the TExSIS project, which aims at the automatic extraction of monolingual and multilingual company-specific terminology from a company’s document streams. The resulting term lists are crucial for all language-based man-machine communication: in machine translation, in computer-assisted translation, and in monolingual and multilingual document management.
The Flemish government funds 92.5% of TExSIS. The remaining 7.5% has been raised by the members of the user group: CrossLang, SD Worx, Telenet, Belga Persbureau, Selor, Yamagata Europe, Telelingua, Mentoring Systems, Xplanation Language Services, TextKernel, Comsof, Actonomy, ITP, Nacco Materials Handling Group, Docbyte, PSA Peugeot Citroën SA, Intersystems, Jabbla, MEGA-doc, Mediargus, Eurologos, Wetenschappelijk en Technisch Centrum voor het Bouwbedrijf, and Limecraft.
Project name: Terminology Extraction for Semantic Interoperability and Standardization
Start date: Jan. 1, 2011
End date: Dec. 31, 2012
Sponsor: IWT Tetra fund
Description
Deliverables
The concrete deliverables of the project include knowledge (reported in technical reports and publications), prototypes of the different components, and a prototype client-server architecture for fully automatic monolingual and multilingual terminology extraction. These prototypes will be made available as open source. Software developers can further customize the prototypes and implement them in company-specific end-user applications.
Evaluation
The practical applicability of the terminology extractor will be evaluated in two use cases with a broad scope: machine translation and information retrieval. Both use cases cover the needs of a broad target group of companies and organisations. For both use cases, there will be close cooperation with the companies in the user group, which will deliver documents and evaluate the prototype. This should guarantee the general applicability of both use cases.
1) Use case machine translation
The terminology extraction tools developed in the course of the project will be integrated in machine translation systems. How can (Flemish) companies benefit from these results?
- Reduction of the implementation cost of machine translation (MT). Implementing MT systems and customizing them to a specific company environment (e.g. the automotive, banking or telecom domain) requires a huge amount of manual work, leading to a high implementation cost. Accurate terminology extraction will lead to smoother domain adaptation and a lower implementation cost of MT.
- Reduction of manual post-editing. Companies that use MT to translate their user manuals, customer support documents, etc. currently often have to post-edit the output heavily to reach an acceptable quality. Integrating fully automatic terminology extraction in the MT system will improve translation quality and thus reduce the cost of manual correction. This reduction in manual post-editing also leads to a shorter time-to-market.
- Automatic production of company-specific monolingual and multilingual dictionaries, which can be used for uniform communication on new and existing products and services. A graphical user interface can assist writers and translators in adapting and correcting these automatically extracted monolingual and multilingual term lists.
- Automatic method for consistency checks. By running automatic terminology extraction on all new texts in TMX, HTML, XML, etc., the terms in these documents can be checked against the company-specific term banks. A graphical user interface will be developed for replacing incorrect terms in these new documents.
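The consistency check described in the last item can be illustrated with a minimal Python sketch. It is not the TExSIS implementation: the term bank format (a mapping from a deprecated variant to the preferred company term), the example entries and the function names are all assumptions made for illustration, and the input is assumed to be plain text already extracted from the TMX, HTML or XML source.

```python
import re

# Hypothetical term bank: deprecated variant (lowercase) -> preferred company term.
TERM_BANK = {
    "log-in": "login",
    "e-mail address": "email address",
    "hand book": "manual",
}


def _variant_pattern(term_bank):
    """Build one case-insensitive pattern matching any deprecated variant as a whole word."""
    # Longest variants first, so that longer entries win over shorter overlapping ones.
    variants = sorted(term_bank, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in variants)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def find_inconsistent_terms(text, term_bank):
    """Return (offset, variant as found, preferred term) for every deprecated variant."""
    pattern = _variant_pattern(term_bank)
    return [
        (m.start(), m.group(0), term_bank[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


def replace_inconsistent_terms(text, term_bank):
    """Replace every deprecated variant by its preferred term in a single pass."""
    pattern = _variant_pattern(term_bank)
    return pattern.sub(lambda m: term_bank[m.group(0).lower()], text)


if __name__ == "__main__":
    sample = "Use your Log-in and e-mail address."
    for offset, found, preferred in find_inconsistent_terms(sample, TERM_BANK):
        print(f"offset {offset}: '{found}' should be '{preferred}'")
    print(replace_inconsistent_terms(sample, TERM_BANK))
```

In practice the reported offsets would feed a review interface, so that a writer or translator confirms each replacement rather than having it applied blindly.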
2) Use case information retrieval
The terminology extraction tools developed in the project can also be integrated in intelligent information management applications. These applications contribute to faster, easier and more user-friendly solutions for managing, indexing, categorising and searching large document collections. Companies that manage large archives or databases can be assisted by the TExSIS project in different ways:
- Automatic construction of a thesaurus for a given archive. For an existing archive of textual documents, TExSIS can extract a thesaurus of relevant keywords (names of persons, organisations, locations and other keywords) in a fully or semi-automatic way. This will allow companies to archive and maintain their documents in a structured way.
- Term suggestions during document construction. When writing a new document, the writer can be guided in his/her choice of relevant keywords via a term suggestion system. These new keywords are automatically linked to the thesaurus; the user will also be able to add new keywords to the thesaurus.
- Automatic assignment of metadata to documents. When archiving a new document, the most important keywords can be added to the document as metadata in order to simplify or even automate classification. This can be done in a fully automatic or semi-automatic way.
- Increased speed in searching large archives and databases. When documents are labeled with relevant metadata, it becomes possible to browse documents per category or to search for specific entities, which greatly improves the user-friendliness of search engines. Such search applications can make use of unfolded tree structures with subcategories or graphical sets of relevant suggestion terms.
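The metadata assignment and category browsing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TExSIS prototype: the thesaurus is assumed to be a plain set of keywords, keywords are ranked by raw frequency, and the names `assign_keywords` and `build_keyword_index` as well as the sample documents are hypothetical.

```python
import re
from collections import Counter, defaultdict


def assign_keywords(text, thesaurus, max_keywords=5):
    """Return the thesaurus keywords found in text, most frequent first."""
    lowered = text.lower()
    counts = Counter()
    for keyword in thesaurus:
        # Whole-word, case-insensitive match of the keyword.
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        occurrences = len(re.findall(pattern, lowered))
        if occurrences:
            counts[keyword] = occurrences
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [keyword for keyword, _ in ranked[:max_keywords]]


def build_keyword_index(documents, thesaurus):
    """Map each assigned keyword to the set of document ids labeled with it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for keyword in assign_keywords(text, thesaurus):
            index[keyword].add(doc_id)
    return dict(index)


if __name__ == "__main__":
    thesaurus = {"Acme", "broadband", "Ghent"}  # hypothetical keywords
    documents = {
        "d1": "Acme announced a new broadband offer. Acme expects growth.",
        "d2": "The broadband market in Ghent is growing.",
    }
    print(assign_keywords(documents["d1"], thesaurus))
    print(build_keyword_index(documents, thesaurus))
```

Looking up a keyword in the resulting index returns the labeled documents directly, which is what makes browsing per category or searching for a specific entity fast compared with scanning the full text of the archive.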
More info can be found on the website of the Language and Translation Technology Team of the University College Ghent Faculty of Translation Studies.
