Natural Language Processing (NLP)

Applied to searching Chinese, Japanese, and Korean (CJK) language patents.

An emerging trend in IR relates to the use of machine translation (MT) to allow searching across information in many languages. A number of popular search engines offer “on the fly” MT of web pages from a number of languages into English, but there is a severe “lost in translation” problem that would greatly complicate reliance on this for technology searching. Nonetheless, cross-language searching remains an area of intensive research and is already proving valuable in some applications.

Patent data show a more formalised structure than natural language, however. This may somewhat reduce the challenge of using character-based text search algorithms. Our scoping of the Chinese patent data suggests that it is feasible to adapt and develop search capabilities that integrate the handling of the much more complex non-ASCII characters with the existing searchable patent data structures. However, research needs to continue into the semantic models and utilities for differentiated thesauri.

Languages with a large number of characters compared to the Latin alphabet, or with an absence of white space between words, pose particular IR challenges. Issues such as character encoding, word break analysis and specific processing for transliterated foreign words, abbreviations, and personal, organisation and company names must be addressed for the effective searching of CJK language patents.
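One widely used way to search text that has no white space between words, without full word break analysis, is to index overlapping character bigrams. The following is a minimal sketch of that idea, not a description of the system discussed here; the function names and the two sample documents are hypothetical.

```python
from collections import defaultdict


def char_bigrams(text: str) -> list[str]:
    """Return the overlapping two-character substrings of text, ignoring white space."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each bigram to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_bigrams(text):
            index[gram].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every bigram of the query.

    These are candidates only: the bigrams need not be adjacent in the
    document. Queries shorter than two characters yield no bigrams and
    are not handled by this sketch.
    """
    grams = char_bigrams(query)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# Hypothetical sample documents.
docs = {"d1": "北京大学", "d2": "国际专利"}
index = build_index(docs)
print(search(index, "大学"))
```

Bigram indexing trades precision for robustness: it needs no dictionary, but it can match character pairs that straddle a word boundary, which is why dictionary-based word break analysis is still worth pursuing.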

Fortunately, a large part of the informatics groundwork necessary for integration of Chinese information into databases can be extended to databases in other languages based on non-ASCII characters and other semantic structures. Thus further extension into patent information in the languages of other important trading partner jurisdictions such as India, the Middle East, Indonesia, Korea and Japan should be achievable as the program continues.

Chinese Language

Foreign proper nouns are usually transliterated into Chinese by using characters for their phonetic rather than semantic value. Each Chinese-speaking region may transliterate the same name differently. For instance, ‘Novartis’ could be transliterated as 诺瓦提斯 or 挪伐帝司, depending on the attorney who prepares the documents. In this example, the characters used for the transliteration are completely different. In Mainland China, the foreign names of individuals, companies and institutions that appear on the bibliographic page of patents are normally transliterated into Chinese, which makes searching difficult. Fortunately, each region uses a relatively small and consistent set of characters when transliterating.

A further complication is the use of simplified characters in Mainland China and traditional characters in Hong Kong and Taiwan. Although traditional Chinese and simplified Chinese share more than 70% of their characters, some characters (and therefore words) can be very different. For example, ‘international’ is ‘國際’ in traditional Chinese and ‘国际’ in simplified Chinese. Since the majority of Chinese language patent documents filed with WIPO or EPO are from Mainland China, simplified Chinese is more dominant. However, if the original documents were prepared in traditional Chinese, the search terms must also be entered in traditional Chinese to find the relevant documents.
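One way to spare the searcher from entering each term twice is to expand a query into both scripts before it is run. The sketch below assumes a character-level mapping table; the two-entry table and the function name are illustrative only. A real table would be far larger, and because some simplified characters correspond to more than one traditional character, a character-by-character conversion is only an approximation.

```python
# Illustrative simplified-to-traditional table covering only the example above.
S2T = {"国": "國", "际": "際"}
# Reverse table (valid here because this tiny table is one-to-one).
T2S = {trad: simp for simp, trad in S2T.items()}


def script_variants(term: str) -> set[str]:
    """Return the term together with its simplified and traditional forms."""
    simplified = "".join(T2S.get(c, c) for c in term)
    traditional = "".join(S2T.get(c, c) for c in term)
    return {term, simplified, traditional}


# Either spelling of 'international' expands to the same pair of search terms.
print(script_variants("国际"))
print(script_variants("國際"))
```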

Phrases and names can be abbreviated by taking a character from each part of the word or phrase. There is no clear rule as to whether these should be the first or subsequent characters. For example, Beijing University, 北京大学, is usually abbreviated as 北大. Abbreviations are irregular and their use widespread, so constructing a comprehensive lexicon is a challenge.
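Because any character of each part may be chosen, candidate abbreviations can be enumerated mechanically and then checked against a corpus or lexicon. The sketch below assumes the name has already been segmented into its parts (itself a non-trivial step); the function name is hypothetical.

```python
from itertools import product


def abbreviation_candidates(parts: list[str]) -> set[str]:
    """Return every string formed by taking one character from each part, in order."""
    if not parts:
        return set()
    # product(*parts) iterates over the characters of each part.
    return {"".join(chars) for chars in product(*parts)}


# 北京大学 segmented into 北京 + 大学; the attested abbreviation 北大
# is one of the four candidates generated.
candidates = abbreviation_candidates(["北京", "大学"])
print(candidates)
```

Most candidates will never occur in practice, so the list is best treated as a set of hypotheses to be filtered by observed usage rather than as a lexicon in its own right.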

Japanese Language

The Japanese language uses a combination of four scripts with an extremely complex morphology and orthography. Natural Language Processing (NLP) techniques are essential for successful full text searching of Japanese text. The CJK Dictionary Institute (CJKI)32 specialises in CJK computational lexicography and has developed a lexical database with over two million Japanese and one million Chinese entries.

A major source of complexity in processing Japanese texts is the presence of an extremely large number of homophones. Many homophones are synonyms in some senses but not others, and as a result it is hard to predict which one an author will choose to use in a particular context. For example, the verb ‘noboru’ can be written using three different Chinese characters: 上る can be used to mean “to move upwards”, 登る can be used to mean “to transport oneself to a high point” and 昇る can be used to depict the rise of an astronomical body. Given that these words are all pronounced in the same way and have very similar meanings, there is ample scope for one to be used in place of another through error or expediency on the part of an author.

On the other hand, a personal name (in most cases written in Chinese characters) can often be pronounced in several ways for a single combination of characters. For example, the first name ‘Shoko’, written in Chinese characters as ‘尚子’, can also be read as ‘Naoko’ (the most common reading) and ‘Hisako’. This can create a problem when searching for an inventor’s name if it has been transliterated into English, as all variations must be considered when searching.

One of the most diverse areas of the Japanese language in terms of orthography is words derived from English. The orthography of these “katakana” words is extremely variable, and even native Japanese speakers wishing to search Japanese text may benefit from NLP techniques that allow English keywords to be entered to retrieve all katakana and Latin alphabet variants. Katakana characters are also frequently used for Japanese words on certain occasions, and for words from languages other than English. The former use is often seen for the names of organisms in scientific literature; e.g. rice is ‘稲 (ine)’, which is written in katakana as ‘イネ’ when it refers to rice as scientific material (rather than as an agricultural crop). This creates a problem when using the word ‘rice’ as a search term, as (again) both the katakana and the Chinese character forms must be considered.
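The katakana form of a native Japanese word can be derived from its hiragana reading, because the two syllabaries occupy parallel Unicode ranges offset by 0x60. The sketch below uses this to expand an English keyword into its script variants. The one-entry lexicon, its field names and the function names are hypothetical, and the hiragana reading いね is supplied here for illustration.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # ぁ .. ゖ
KANA_OFFSET = 0x60  # distance from a hiragana code point to its katakana twin


def hiragana_to_katakana(text: str) -> str:
    """Convert hiragana characters to katakana; leave everything else unchanged."""
    return "".join(
        chr(ord(c) + KANA_OFFSET) if HIRAGANA_START <= ord(c) <= HIRAGANA_END else c
        for c in text
    )


# Hypothetical lexicon entry: English keyword -> kanji form and hiragana reading.
LEXICON = {"rice": {"kanji": "稲", "reading": "いね"}}


def expand_keyword(keyword: str) -> set[str]:
    """Return the kanji, hiragana and katakana forms recorded for an English keyword."""
    entry = LEXICON.get(keyword)
    if entry is None:
        return set()
    reading = entry["reading"]
    return {entry["kanji"], reading, hiragana_to_katakana(reading)}


print(expand_keyword("rice"))
```

This covers only native words written in katakana; the variable katakana spellings of English loanwords cannot be generated this way and need a variant lexicon.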

Finally, Japanese has several ways of writing numerals, similar to those explained in the section on Chinese above. In patent documents, Arabic numerals are used on the bibliographic information sheet to identify the different types of information (e.g. ‘(51) International patent classification’, ‘(21) application number’), and for dates, references, and names of things like vectors, proteins, etc. in the specification. All other numbers are mostly in Unicode, e.g. “200”; Chinese characters are rarely used (except in old patent documents).
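When the same number can be written in more than one way, normalising numerals before indexing and querying lets a single search term match all forms. The sketch below assumes that the variants encountered are full-width digits (folded to ASCII by Unicode NFKC normalisation) and, in older documents, Chinese-character digits written positionally; the function names are hypothetical.

```python
import unicodedata

# Positional Chinese-character digits only; forms using 十, 百, 千
# (e.g. 二百 for 200) are not handled by this sketch.
KANJI_DIGITS = str.maketrans("〇一二三四五六七八九", "0123456789")


def normalise_width(text: str) -> str:
    """Fold full-width characters, including digits such as ２００, to their ASCII forms."""
    return unicodedata.normalize("NFKC", text)


def normalise_number_field(text: str) -> str:
    """Normalise a field known to hold a number.

    Apply this only to numeric fields: characters such as 一 also occur
    inside ordinary words and must not be rewritten in running text.
    """
    return normalise_width(text).translate(KANJI_DIGITS)


print(normalise_width("２００"))
print(normalise_number_field("二〇〇"))
```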

Cross-Language Information Retrieval

The Cross-Language Information Retrieval (CLIR) activities initiated within the Text Retrieval Conference (TREC) have stimulated much interest in Europe and Asia. The European Cross-Language Evaluation Forum (CLEF)33 supports TREC-like workshops for CLIR using European languages, whilst the Japanese NTCIR workshops34 support CLIR in Chinese, Japanese, Korean and English (with a particular interest in patents). A task at the most recent NTCIR workshop involved a patent examiner seeking to invalidate a claim written in English by using English queries to identify sections within Japanese patents that would invalidate the claim.

Current CLIR techniques make use of parallel bodies of text available in both the query language and the corpus language (the language of the text being searched). For patents, professionally produced English abstracts may be used. This data is used in Machine Translation (MT) of either the query into the corpus language or documents into the query language. Both strategies provide good results, and combining both provides close to mono-lingual search quality.
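One simple way to combine the two strategies is to run both searches and merge their ranked results. The sketch below uses min-max normalisation followed by a weighted sum, which is a common fusion technique but not necessarily the one used by the systems referred to above. The document ids, scores and the `weight` parameter are hypothetical.

```python
def normalise(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range 0..1 so that two runs are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(query_translation: dict[str, float],
         document_translation: dict[str, float],
         weight: float = 0.5) -> list[str]:
    """Rank documents by a weighted sum of the two normalised runs.

    A document missing from one run contributes 0 for that run.
    Ties are broken by document id for a stable ordering.
    """
    a = normalise(query_translation)
    b = normalise(document_translation)
    combined = {
        doc: weight * a.get(doc, 0.0) + (1 - weight) * b.get(doc, 0.0)
        for doc in set(a) | set(b)
    }
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


# Hypothetical scores from the two strategies for the same English query.
qt_run = {"JP1": 3.0, "JP2": 2.0, "JP3": 1.0}  # query translated into Japanese
dt_run = {"JP2": 0.8, "JP3": 0.6, "JP4": 0.2}  # documents translated into English
print(fuse(qt_run, dt_run))
```

A document that scores reasonably under both strategies (JP2 here) outranks one that is found by only one of them, which is the intuition behind combining the runs.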