MULINEX: Multilingual Web Search and Navigation
MULINEX is a multilingual search engine for the WWW. During the phase of document gathering, the system extracts information about documents by making use of language identification, thematic classification and automatic summarisation. In the search phase, the users’ query terms are translated in order to enable search in different languages. Search results are presented with a summary and information about the language and thematic categories to which the document belongs. Summaries and documents are translated on demand by making use of the LOGOS machine translation system. The system is to be deployed in the online services of Bertelsmann Telemedia and Grolier Interactive Europe, and supports French, German and English. The current MULINEX prototype is the first system for translingual information access integrating retrieval, summarisation and translation.
translingual information retrieval, categorisation, summarisation, language identification, query translation, machine translation
The Internet is rapidly changing from an English dominated medium to a multilingual information and communication service. At present navigation in this multilingual information space is still far from the ideal scenario – the ability to access, in one’s own mother tongue, the mass of multilingual documents over the Internet in a seamless and transparent fashion.
1.1. Social and Economic Factors
The total number of online households in the world is expected to rise from 23.4 million (8.3 % of all households) in 1996 to more than 60 million (nearly 25 %) in 2000. Increased PC penetration, telecommunications deregulation, indigenous content development, and deployment of the integrated services digital network (ISDN) in Europe’s and Asia’s most advanced online markets will be among the key factors driving this growth.
By the year 2000, countries outside the US (especially in Europe and Asia) will account for 46 per cent of all online households, up from 37 per cent today. After the US, France, Germany, Japan and the UK will be the largest online markets in 2000. Scandinavia and Italy will experience growth rates that are at least as large as those of the major markets.
MULINEX addresses some of the issues of multilinguality by developing a leading-edge application that facilitates multilingual information access with navigation and browsing, enabling effective multilingual searching on the Internet by providing translation of queries, customised summaries, and thematic classification of documents.
MULINEX processes multilingual information from the WWW and other online sources and presents it to the user in a way which facilitates finding and evaluating the desired information quickly and accurately. It does this by combining the newest information retrieval technology with advanced language technologies to improve search and navigation in the WWW.
The current MULINEX prototype is the first system for translingual information access integrating retrieval, summarisation and translation.
1.2. Consortium Objectives
The MULINEX consortium consists of five European companies, who aim to improve their competitiveness in the internet market through the development and application of advanced language technology for providing improved user-friendly web search and navigation services.
The co-ordinating partner DFKI conducts basic and application-oriented research in artificial intelligence and other fields of advanced computer science. DFKI’s Language Technology Lab applies cutting-edge language technology to a variety of application areas. It develops large scale resources such as lexicons, morphology and grammars.
The two user partners provide web services. Grolier Interactive Europe operates Club Internet and creates and hosts websites for commercial customers. Bertelsmann Telemedia defines itself as an Internet Solutions Company which provides online services and is active in the area of electronic commerce.
DATAMAT specialises in system integration and has been involved in the development of Fulcrum SearchServer, a second-generation text retrieval software package which is used in the project as the basic full-text retrieval system.
TRADOS develops translation tools, terminology database systems and translation memory systems.
The analysis of user needs by Grolier Interactive Europe and Bertelsmann Telemedia has established the following requirements for a multilingual web search engine (Hernandez 1997):
The following functionalities were listed as desirable additions, which add value to the MULINEX system, compared with other search engines.
Bertelsmann Telemedia have conducted a survey with about 70 internal and external users, the results of which are summarised in the following:
To evaluate various design options for the MULINEX user interface, a psychological experiment was carried out jointly by DFKI and MEFIS, an institute which specialises in media psychology. The purpose of the experiment was to find out how the system can best support the user in choosing relevant documents on the results page. The following questions were addressed in the experiment:
A total of 84 German subjects were tested with a mock-up system which presented subgroups with various design alternatives. The subjects were asked to submit a predetermined query to a search engine, and to select documents which were relevant to a given information need by looking through lists of potentially relevant documents. The information needs were formulated as follows:
We gathered data about the background of the users (about their language skills, computer skills etc.), their impression of the system (by making use of a semantic differential designed for software evaluation and open-ended questions), and their performance (by analysis of the log files).
The users reacted very positively to the (manually constructed) thematic classification of documents. Document summaries were criticised as being too short and uninformative, and different kinds of summaries had no significant quantitative effect on the subjects’ performance. There was a subgroup of subjects which made extensive use of automatic translations of summaries. Translations were considered useful although their quality was criticised.
MULINEX is a multilingual Internet search engine that supports selective information access, navigation and browsing in a multilingual environment. While the web spider gathers documents, they are analysed in order to obtain useful information in addition to the traditional keyword-based indices. The project emphasises a user-friendly interface, which presents search results along with their language, thematic category and automatically generated summaries, and allows the user to sort results by multiple criteria. Translingual search (Grefenstette 1998) is supported by interactive translation of user queries. Commercial machine translation technology (LOGOS) is used to provide translations of foreign-language documents on demand.
The demonstrator includes the following functionalities:
The underlying architecture is object-oriented and manager-based. The benefits of an object-oriented architecture are increased modularity, flexible systems and ease of reusability through inheritance of structures and components. The main characteristic of a manager-based approach is encapsulation of the interaction of components, leading to increased independence between components and objects.
4.1. Architecture Overview
The MULINEX system consists of weakly coupled subsystems. Subsystems communicate with each other through their managers by sending requests and receiving corresponding responses. Requests and responses are encapsulated in objects. According to the current architecture we identify the following subsystems:
Figure 1 - A subsystem-oriented view of the architecture of MULINEX.
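The manager-based communication described above can be illustrated with a minimal Python sketch. It is not the MULINEX implementation: the class names `Request`, `Response`, `Manager` and `Dispatcher`, and the idea of a central dispatcher that routes requests by subsystem name, are assumptions introduced only for illustration. The sketch shows requests and responses encapsulated in objects, with each subsystem reachable only through its manager.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Request:
    target: str                      # name of the subsystem addressed (hypothetical)
    action: str                      # operation requested from that subsystem
    payload: dict = field(default_factory=dict)


@dataclass
class Response:
    ok: bool
    result: object = None


class Manager:
    """Single entry point of a subsystem; hides the components behind it."""

    def __init__(self, name: str):
        self.name = name
        self._handlers: Dict[str, Callable[[dict], object]] = {}

    def register(self, action: str, handler: Callable[[dict], object]) -> None:
        self._handlers[action] = handler

    def handle(self, request: Request) -> Response:
        handler = self._handlers.get(request.action)
        if handler is None:
            return Response(ok=False, result=f"unknown action: {request.action}")
        return Response(ok=True, result=handler(request.payload))


class Dispatcher:
    """Routes request objects to the manager of the addressed subsystem (an assumption)."""

    def __init__(self):
        self._managers: Dict[str, Manager] = {}

    def add(self, manager: Manager) -> None:
        self._managers[manager.name] = manager

    def send(self, request: Request) -> Response:
        manager = self._managers.get(request.target)
        if manager is None:
            return Response(ok=False, result=f"unknown subsystem: {request.target}")
        return manager.handle(request)
```

Because callers only see `Request` and `Response` objects, a subsystem's internal components can be replaced without touching the other subsystems, which is the independence the manager-based approach aims for.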
We develop the system in an iterative and incremental process through analysis, design, implementation, and testing. We started by developing a use case model (Jacobson et al. 1992) and defining the corresponding scenario diagrams. This way we could easily identify the main classes that carry out the system’s functionality. The use case model specifies the functionality of the system. It controls the formation of all other models; i.e. the functionality specified by the use cases is structured by the analysis model, realised by the design model, implemented by the implementation model and tested in the testing model.
UML, the Unified Modelling Language (UML Specification 1997), is used as the modelling language, and Rational Rose as the object-oriented modelling and software development tool.
In this section, we describe the technologies and resources that are used in the components of the MULINEX system.
5.1. Document Acquisition
The gathering of documents is performed by a modified version of the Harvest gatherer (Bowman et al. 1994). Harvest has been augmented to call routines for language identification, document classification, and automatic summarisation, and to work in conjunction with the Fulcrum SearchServer, which is used as the information retrieval core engine in our system.
5.2. Language Identification
Language identification is performed by making use of an algorithm which compares the relative frequencies of the most frequent n-grams (from 1 to 5 characters) in a document to 40 stored language profiles (Cavnar and Trenkle 1994).
5.3. Categorisation
Document classification is performed by the k-nearest-neighbour algorithm (Yang 1994), a statistical algorithm which classifies a new document by combining the category assignments of the k most similar training documents, weighted by the statistical distance (tf.idf) between the new document and each of the k best matching training documents. The categoriser is trained with documents from newsgroups in French, German and English.
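The k-nearest-neighbour scheme just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the MULINEX categoriser: it assumes one category per training document, tokenised input, a smoothed idf of log(N/df) + 1, and cosine similarity over tf.idf vectors as the similarity measure; all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Convert tokenised documents into tf.idf weight dictionaries; also return the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(tokens, train_docs, train_labels, k=3):
    """Return the category with the highest similarity-weighted vote among the k nearest documents."""
    vectors, idf = tfidf_vectors(train_docs)
    # Terms unseen in training carry no weight for the new document.
    query = {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}
    ranked = sorted(
        ((cosine(query, vec), label) for vec, label in zip(vectors, train_labels)),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for similarity, label in ranked:
        scores[label] += similarity      # each neighbour votes with its similarity
    return max(scores, key=scores.get) if scores else None
```

With a toy training set of two sports and two finance postings, a call such as `knn_classify(["team", "goal"], train_docs, train_labels)` lets the sports documents outvote the others because only they share terms with the new document. A production categoriser would additionally need thresholds for assigning several categories or none.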
In addition, there is a keyword-based categorisation algorithm for narrow, specialised categories.
5.4. Summarisation
Summarisation is performed by selecting the sentences which best characterise a document. During document gathering, it operates in query-independent mode by selecting sentences on the basis of structural and layout HTML markup, and by position in the document or paragraph.
5.5. Query Formulation / Translation
The MULINEX system translates and expands the users’ queries. Since the retrieval performance of automatically translated queries is inferior to monolingual information retrieval (Oard 1997), there is an (optional) step of user interaction, where the user can select terms from the translated query and add his own translation. Queries are morphologically analysed by making use of Morphix (Finkler and Neumann 1988) and MMORPH (Petitpierre and Russell 1995), and then translated by making use of multilingual dictionaries. We make use of the terminology database MultiTerm from TRADOS. The translated queries are the input to the search in the document collection.
5.6. Information Retrieval
The search is performed by the Fulcrum SearchServer, a state-of-the-art information retrieval system, which incorporates linguistic technologies for morphological normalisation of documents and queries. The results are presented along with information about their language, thematic categories and automatically generated summaries. If the result pages are accessible to the system without large delays (e.g., if they reside on the same intranet), a summary which is tailored to the user’s query can be produced. Results can be ordered by relevance or by thematic categories.
5.7. Database
Two SQL-based database management systems are used in the MULINEX system: Fulcrum SearchServer for all information retrieval tasks and for storing category profiles, and a standard SQL database (MSQL) for storing user profiles and the multilingual lexicon.
5.8. Multilingual Lexical Resources
The MULINEX system uses six bilingual lexicon databases with 100,000 to 200,000 entries each, one for each of the six language pairs supported by the system (German-English, German-French, French-English and the converse pairs).
5.9. Machine Translation
Summaries and result documents can be translated on demand by making use of the LOGOS machine translation system.
The system will be made publicly available by the user partners in the consortium, who will obtain feedback from the end users in order to evaluate the usability of the system.
6.1. Validation Sites
In May 1998, the system was installed in the online services of Grolier Interactive Europe and Bertelsmann Telemedia, two large internet service and content providers in France and Germany. They will use it to provide multilingual search facilities for their sites, and to enhance the functionality of their existing search engines. These services will become publicly available in the 3rd quarter of 1998.
6.2. Validation Methodology
The end users of the system will be invited to provide feedback on the usability of the system via questionnaires, in which they evaluate the system, suggest improvements and can provide personal details. Users can also use a mailto-link to give feedback in free form. In addition, there will be in-depth interviews with a selected group of end-users.
In the coming months, until the end of the project, a number of enhancements will be made to the basic functionality of the system described above.
7.1. MULINEX Agent
The agent system performs information search tasks periodically on behalf of the user. Registered users can specify search queries and define interest profiles. The agent system runs these queries periodically and informs the user of new information which matches his interests.
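The core of such an agent, re-running stored queries and reporting only documents not seen before, can be sketched as follows. This is a hypothetical sketch rather than the MULINEX agent: the class `SearchAgent`, its methods and the `search` callable (which maps a query string to a list of document URLs) are assumptions, and the periodic scheduling itself is left to the caller.

```python
class SearchAgent:
    """Remembers registered queries and the results already reported for each."""

    def __init__(self, search):
        self._search = search    # callable: query string -> iterable of document URLs
        self._seen = {}          # query -> set of URLs already reported to the user

    def register(self, query):
        """Store a query on behalf of the user; re-registering keeps its history."""
        self._seen.setdefault(query, set())

    def run_once(self):
        """Run every stored query once and return only the hits not reported before."""
        news = {}
        for query, seen in self._seen.items():
            fresh = [url for url in self._search(query) if url not in seen]
            seen.update(fresh)
            if fresh:
                news[query] = fresh
        return news
```

A scheduler would call `run_once()` at fixed intervals and pass any non-empty result on to the notification step.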
For selected domains, the agent will extract the important facts from documents for the user by making use of the Saarbrücken Message Extraction System SMES (Neumann et al. 1997). The user will be notified by e-mail, a personalised web page or a push channel.
7.2. Query Expansion
We plan to add a query expansion module based on the results of the EuroWordNet project (Vossen 1997).
7.3. Result Clustering and Visualisation
Presently, search results can be grouped into a set of pre-defined categories. In future versions, MULINEX will use clustering methods for automatically grouping search results.
The clusters will be presented to the users with a graphical user interface based on VRML.
7.4. Query Disambiguation
Presently, translations of queries are disambiguated by the user who selects among alternative translations. This presupposes knowledge of the target language by the user.
In the next version, the user will be presented with terms in his own language which correspond to the translation alternatives, so that he can perform disambiguation in his own language.
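One way to realise this is to translate each alternative back into the user's language and show the resulting terms next to it. The Python sketch below illustrates the idea under stated assumptions: the function name `disambiguation_options` is hypothetical, the two dictionaries stand in for the bilingual lexicons, and the German-English entries are toy data chosen for the example.

```python
def disambiguation_options(term, forward, backward):
    """Map each translation of `term` to source-language terms that share that translation.

    forward:  source term -> list of target-language translations
    backward: target term -> list of source-language translations
    """
    options = {}
    for translation in forward.get(term, []):
        # Drop the query term itself: it would not help the user tell senses apart.
        options[translation] = [g for g in backward.get(translation, []) if g != term]
    return options


# Toy lexicon entries (illustrative only).
de_en = {"Bank": ["bank", "bench"]}
en_de = {"bank": ["Bank", "Geldinstitut"], "bench": ["Bank", "Sitzbank"]}

options = disambiguation_options("Bank", de_en, en_de)
# options == {"bank": ["Geldinstitut"], "bench": ["Sitzbank"]}
```

A German-speaking user can then choose between "Geldinstitut" and "Sitzbank" without knowing English, and the system searches with the corresponding English term.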
Bowman et al. 1994
Capstick et al. 1998
Cavnar and Trenkle 1994
Finkler and Neumann 1988
Fowler and Scott 1997
Jacobson et al. 1992
Neumann et al. 1997
Petitpierre and Russell 1995
UML Specification 1997