TExSIS (Terminology Extraction for Semantic Interoperability and Standardisation)

CrossLang is a partner and user group member of the TExSIS project, which aims to automatically extract monolingual and multilingual company-specific terminology from a company’s document streams. Such term lists are crucial in every language-based man-machine communication: in machine translation, in computer-assisted translation, and in monolingual and multilingual document management.

92.5% of TExSIS is funded by the Flemish government. The remaining 7.5% was raised by the members of the user group: CrossLang, SD Worx, Telenet, Belga Persbureau, Selor, Yamagata Europe, Telelingua, Mentoring Systems, Xplanation Language Services, TextKernel, Comsof, Actonomy, ITP, Nacco Materials Handling Group, Docbyte, PSA Peugeot Citroën SA, Intersystems, Jabbla, MEGA-doc, Mediargus, Eurologos, Wetenschappelijk en Technisch Centrum voor het Bouwbedrijf, and Limecraft.

Project details

Project name

Terminology Extraction for Semantic Interoperability and Standardization

Start date

Jan. 1, 2011

End date

Dec. 31, 2012

Sponsor

IWT Tetra fund

Description

Deliverables

The concrete deliverables of the project include knowledge (reported in technical reports and publications), prototypes of the different components, and a prototype client-server architecture for fully automatic monolingual and multilingual terminology extraction. These prototypes will be made available as open source. Software developers can further customise the prototypes and integrate them into company-specific end-user applications.

Evaluation

The practical applicability of the terminology extractor will be evaluated in two use cases with a broad scope: machine translation and information retrieval. Both use cases cover the needs of a broad target group of companies and organisations. For both use cases, there will be close cooperation with the companies in the user group, which will supply documents and evaluate the prototype. This should guarantee the general applicability of both use cases.

1) Use case machine translation

The terminology extraction tools developed in the course of the project will be integrated into machine translation systems. How can (Flemish) companies benefit from these results?

Reduction of the implementation cost of machine translation (MT). The implementation of MT systems and the customisation of these systems to a specific company environment (e.g. the automotive, banking or telecom domain) require a huge amount of manual work, leading to a high implementation cost. Accurate terminology extraction will lead to smoother domain adaptation and a lower MT implementation cost.

Reduction of manual post-editing. Companies that use MT to translate their user manuals, customer support documents, etc. currently often have to post-edit their texts heavily to obtain acceptable output. Integrating fully automatic terminology extraction into the MT system will improve translation quality and thus reduce the cost of manual correction. This reduction in manual post-editing also leads to a shorter time-to-market.

Automatic production of company-specific monolingual and multilingual dictionaries, which can be used for uniform communication about new and existing products and services. A graphical user interface can assist writers and translators in adapting and correcting these automatically extracted monolingual and multilingual term lists.

Automatic method for consistency checks. By running automatic terminology extraction on all new texts in TMX, HTML, XML, etc., the terms in these documents can be checked against the company-specific term banks. A graphical user interface will be developed for replacing incorrect terms in these new documents.
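As an illustration of the consistency check described above, the following is a minimal Python sketch, not the TExSIS implementation. It assumes a term bank that maps discouraged variants to the preferred company term; the term bank entries and the function names `find_inconsistent_terms` and `replace_inconsistent_terms` are hypothetical. It works on plain text only; parsing TMX, HTML or XML is left out.

```python
import re

# Hypothetical term bank: discouraged variant -> preferred company term.
TERM_BANK = {
    "hand brake": "parking brake",
    "set-top box": "digital receiver",
}


def _variant_pattern(variant):
    # Match the variant as a whole word or phrase, ignoring case.
    return re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)


def find_inconsistent_terms(text, term_bank):
    """Return (found variant, preferred term, position) tuples in text order."""
    findings = []
    for variant, preferred in term_bank.items():
        for match in _variant_pattern(variant).finditer(text):
            findings.append((match.group(0), preferred, match.start()))
    return sorted(findings, key=lambda finding: finding[2])


def replace_inconsistent_terms(text, term_bank):
    """Replace every discouraged variant with its preferred term."""
    for variant, preferred in term_bank.items():
        text = _variant_pattern(variant).sub(preferred, text)
    return text


if __name__ == "__main__":
    sample = "Release the hand brake before connecting the Set-Top Box."
    for found, preferred, position in find_inconsistent_terms(sample, TERM_BANK):
        print(f"{position}: '{found}' should be '{preferred}'")
    print(replace_inconsistent_terms(sample, TERM_BANK))
```

In an interactive tool, the findings would be shown to the writer or translator for confirmation rather than replaced blindly, since a literal match does not always mean the term is used in the sense recorded in the term bank.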

2) Use case information retrieval

The tools for terminology extraction developed in the project can also be integrated in intelligent information management applications. These applications contribute to faster, easier and more user-friendly solutions to manage, index, categorise and search large document collections. Companies which manage large archives or databases can be assisted in different ways via the TExSIS project:

Automatic construction of a thesaurus for a given archive. For an existing archive of textual documents, TExSIS can extract a thesaurus of relevant keywords (names of persons, organisations, locations and other keywords) in a fully automatic or semi-automatic way. This will allow companies to structure the archiving and maintenance of their documents.

Term suggestions during document construction. When constructing a new document, the writer can be guided in his/her choice of relevant keywords via a term suggestion system. These new keywords are automatically linked to the thesaurus; the user will also be able to add new keywords to the thesaurus.

Automatic assignment of metadata to documents. When a new document is archived, its most important keywords can be added to it as metadata in order to simplify or even automate classification. This can be done in a fully automatic or semi-automatic way.

Increased speed in searching large archives and databases. When documents are labelled with relevant metadata, it becomes possible to browse documents by category or to search for specific entities, which makes search engines considerably more user-friendly. Such search applications can make use of expandable tree structures with subcategories or graphical sets of relevant suggestion terms.
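The metadata assignment and keyword-based browsing described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: plain word frequency with a small stopword list stands in for real terminology extraction, and the names `extract_keywords`, `build_keyword_index` and `STOPWORDS` are hypothetical, not part of TExSIS.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on", "with"}


def extract_keywords(text, top_n=3):
    """Pick the top_n most frequent non-stopword tokens as keyword metadata.

    Frequency counting is a stand-in for terminology extraction.
    Ties are broken alphabetically so the result is deterministic.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


def build_keyword_index(documents, top_n=3):
    """Map each keyword to the set of document ids labelled with it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for keyword in extract_keywords(text, top_n):
            index[keyword].add(doc_id)
    return index


if __name__ == "__main__":
    archive = {
        "d1": "The invoice lists the invoice number and the payment date.",
        "d2": "Payment of the invoice is due on the payment date.",
    }
    index = build_keyword_index(archive, top_n=2)
    for keyword in sorted(index):
        print(keyword, sorted(index[keyword]))
```

Once such an index exists, browsing by category amounts to looking up a keyword and listing its documents; in a semi-automatic setting the suggested keywords would first be confirmed or corrected by the archivist.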

More info can be found on the website of the Language and Translation Technology Team of the University College Ghent Faculty of Translation Studies.