Understanding the Extent of Patenting of Arabidopsis Sequences

In Chapter 8 we expanded upon the idea that claims based on sequences from one organism, for example Arabidopsis, may also cover similar sequences from other organisms. In particular, claims to sequences from a model organism (which may itself have little economic importance) may include hybridisation and/or similarity language that effectively claims sequences from economically important crops. Companies that seek to claim large numbers of genes from Arabidopsis are therefore, in effect, also attempting to claim similar sequences from important crop plants.

The degree to which such claims may read on other organisms is very difficult to calculate due to:

  • Difficulty in obtaining the sequences claimed in patent applications: Only a subset of patented sequences is present in popular public databases such as GenBank. Sequences from published patent applications, although in the public domain, must be obtained from the various patent authorities.
  • Mapped genomes: Whilst we can identify and map claimed sequences from organisms such as Arabidopsis, where we have extensive genome information, the genomes from many other organisms are not as well defined. Hence it is not possible to map the claims from Arabidopsis onto the genome map of soya, since the latter effectively does not exist.
  • Data formats and availability are variable: WIPO sequences are, for the most part, available as TIFF images only. The USPTO supplies recent sequence data within application documents, available at its website. Bulk sequence data must be downloaded separately and has a data structure different from that used at GenBank. Much of the older sequence data is not publicly available in a useful format (i.e. it is available only as scanned pages or as OCRed data files).
  • Software issues: Since data formats vary between published sequences, only data present within GenBank are easily analysed by the average user. Software is not yet readily available to allow the average user to perform in-depth analysis of patented sequences, to map such sequences to the claimed organism, or to find matches in the genomes of related organisms. Issues of claim language (see below) make quantitative analysis of large groups of patent documents difficult.
  • Claim language: The claim language used in patent documents is critical. While we can recognise groups of claims (based on hybridisation conditions and sequence similarity measures), each claim can be subtly different, so analysis is complex and time-consuming. Because such language is complex and highly variable, it does not lend itself well to automated analysis. Nevertheless, a software solution is needed to analyse bulk sequence claims and to determine the extent of inter-genome claims.
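As a rough illustration of why similarity language matters, the following Python sketch tests whether a candidate sequence would fall within a claim of the form "at least X% identical to the claimed sequence". It is not the method used in this study. The function names, the toy sequences and the choice of denominator (all aligned columns, excluding columns that are gaps in both sequences) are assumptions made for the example; real claims define identity in differing ways, which is part of the difficulty described above. The sequences are assumed to have been aligned already, with gaps written as "-".

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: gaps are written as '-'. Columns that are gaps in both
    sequences are ignored; every other column counts in the denominator.
    This is only one of several conventions in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no aligned columns to compare")
    # A column matches only when both residues are identical and not a gap.
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def falls_within_claim(claimed: str, candidate: str, min_identity: float) -> bool:
    """True if the candidate meets a hypothetical 'at least X% identical' threshold."""
    return percent_identity(claimed, candidate) >= min_identity


# Toy, invented sequences (not taken from any patent or genome).
claimed_seq = "ATGGCGTACGTTAGC"    # stands in for a claimed model-organism sequence
candidate_seq = "ATGGCGTATGTTAGC"  # stands in for a crop homologue, one mismatch

print(f"identity: {percent_identity(claimed_seq, candidate_seq):.1f}%")
print("within a 90% claim:", falls_within_claim(claimed_seq, candidate_seq, 90.0))
print("within a 95% claim:", falls_within_claim(claimed_seq, candidate_seq, 95.0))
```

With these toy inputs, the same candidate falls inside a 90% claim but outside a 95% claim, so a small change in the claimed threshold or in the definition of identity changes which sequences from other organisms a claim may cover.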

For each of the points above, we have endeavoured to develop a solution, an improvement, or an approximation that gives a better idea of the state of inter-genome claims.