Optical character recognition (OCR)

Over recent years the accuracy and speed of Optical Character Recognition (OCR) technology have greatly improved, and it is now practical to perform OCR cost-effectively on very large collections of digital images. CAMBIA has extensive experience in this area, having carried out OCR on millions of pages covering PCT, EPB and Australian patents (USPTO data is brought in electronically, without the need for an OCR step).

Many patent filings in Australia enter national phase via the PCT. PCT applications should be considered prior art for Australian filings as soon as they are published, so another factor complicating the searching of Australian patents and applications is that PCT applications may first appear in a language other than English, such as Japanese. There is discussion of adding Chinese as a language in which PCT applications are accepted. Optical character recognition of Chinese and Japanese characters may require special software.

We have recently conducted extensive trials of the latest OCR technology, with a focus on the recognition of Chinese and Japanese as well as European languages. We found two OCR engines that, used with our applications, achieved accuracy levels of over 99% on PCT patents published in Japanese and on patent documents provided in Chinese.
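The trials above report accuracy as a percentage without stating how it was measured. One common convention, assumed here purely for illustration, is character-level accuracy: one minus the edit distance between the OCR output and a hand-checked reference text, divided by the length of the reference. The function names below are hypothetical and are not taken from any OCR engine.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Python strings are sequences of Unicode code points, so Japanese or
    # Chinese text is compared character by character, like Latin text.
    print(character_accuracy("claims", "c1aims"))
```

Under this assumed metric, a page of 3,000 characters recognised at 99% accuracy would still contain about 30 character errors, which is why downstream markup needs to tolerate occasional misreadings.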

Some national patent offices[30] indicate that current data provision issues could limit the usefulness of even a well-designed search interface for full-text or field-based searching of the bibliographic data, claims and specification sections of patents. To facilitate the future searching of specific sections of text, such as claims, it will be important to anticipate OCR data quality issues so that they do not hinder data markup.
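To illustrate how an OCR error can interfere with section markup, the sketch below splits OCR text into sections by looking for heading lines, and tolerates small misreadings such as "C1aims" for "Claims". It is a minimal example under stated assumptions: the heading list, the similarity threshold of 0.8 and the function names are all hypothetical, and real patent documents use more varied headings than these.

```python
from difflib import SequenceMatcher

# Hypothetical section headings; actual documents vary by office and language.
SECTION_HEADINGS = ("abstract", "description", "claims")


def match_heading(line, threshold=0.8):
    """Return the heading a line most resembles, or None if nothing is close."""
    candidate = line.strip().lower()
    if not candidate:
        return None
    best_name, best_score = None, 0.0
    for name in SECTION_HEADINGS:
        score = SequenceMatcher(None, candidate, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def split_sections(lines):
    """Group lines under the most recent heading found above them."""
    sections = {}
    current = None
    for line in lines:
        heading = match_heading(line)
        if heading is not None:
            current = heading
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


if __name__ == "__main__":
    page = [
        "Description",
        "A widget.",
        "C1aims",  # OCR misread the letter "l" as the digit "1"
        "1. A widget comprising a lever.",
    ]
    print(split_sections(page))
```

An exact string comparison would miss the misread "C1aims" heading, and the claims would be filed under the description. Approximate matching recovers the heading here, at the cost of possible false positives on short body lines, so the threshold would need tuning against real data.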

One way to address at least some of the data integrity issues is through the adoption of applicable international standards. For example, for the dissemination of Australian patents after OCR, the revised “wo-published-application” DTD format, outlined in the WIPO document C. PCT 1037 – 76,[31] may have some relevance. The OCR quality of photocopied PCT applications transmitted to Australia for national phase could be improved by instead obtaining directly the OCR already performed on the PCT applications; the national phase information that is often stamped on hard copies could then be added electronically. The recent decision by IP Australia to accept, from 18 July 2005, limited filing of international applications directly in electronic format in its capacity as PCT receiving office also offers an opportunity to focus on data quality based on international standards.


[30] See, for example, http://www.ipaustralia.gov.au/pdfs/patents/fields.pdf
[31] C. PCT 1037 – 76 of 8 July 2005