Methods for Genome Mapping

In this genome landscape, we have determined the extent to which Arabidopsis genome sequences are claimed in both issued U.S. patents and U.S. patent applications. Our initial approach was to reproduce aspects of the analysis of the human genome performed by Jensen and Murray (Science 310:239-40, 2005). Our process entailed a number of informatics steps, which are outlined below.

In summary, we compiled a database of sequences that are claimed in patents and patent applications, and compared these sequences to the published Arabidopsis genome using a BLAST-type interface. We then determined which portions of the claimed sequences had significant homology to sequences in the Arabidopsis genome, and mapped these sequences to their locations on the Arabidopsis chromosomes. The results of our analysis are shown in the subsequent pages of this landscape.

1. Compilation of a searchable Arabidopsis genome database

We used the most recent Arabidopsis genome sequences from the Arabidopsis Genome Project at NCBI. We then used the formatdb program from NCBI to convert the data to a searchable BLAST database.

2. Compilation of sequence databases for granted patents and patent applications

Applications

For patent applications, we acquired the bulk sequence listings from the Publication Site for Issued and Published Sequences (PSIPS) web site. This web site provides the sequence listings of U.S. patents and applications whose listings are longer than 300 pages. We also acquired the non-bulk sequence listings (fewer than 300 pages in length), which the USPTO publishes as an XML document. For each of the listing types (bulk and non-bulk), there was a separate file for nucleotides and for amino acids. Data for U.S. applications have only been available since 2001.

The bulk and non-bulk sequence listings were then converted to a common data format (FASTA) and combined to create one database for nucleotide sequences, and one database for amino acid sequences. Additionally, each of these combined databases was converted to a searchable BLAST database for use with CAMBIA’s patent sequence search tool.
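As an illustration of the conversion step, the following is a minimal sketch of writing records from both listing types into a single FASTA text. The record layout (a header built from a document number and a sequence ID number) and the function name `to_fasta` are assumptions made for this example; the source does not describe the actual header format or tooling.

```python
from typing import Iterable, Tuple


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Render (header, sequence) pairs as FASTA text, wrapped at `width`."""
    lines = []
    for header, seq in records:
        # Sequence listings often contain spaces and line breaks; remove them.
        seq = "".join(seq.split()).upper()
        if not seq:
            continue  # skip empty listings
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


# Hypothetical records: "<document number>_<SEQ ID NO>" is an assumed header.
bulk = [("US20010000001_1", "atgc atgc")]
non_bulk = [("US20010000002_3", "GGCCTTAA")]

# One combined nucleotide database, written in a common format.
combined = to_fasta(bulk + non_bulk, width=4)
print(combined)
```

The same function could be reused unchanged for the amino acid files, since FASTA does not distinguish between the two at the format level.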

Granted (Issued) Patents

For granted U.S. patents, we had a data source that was not available for the applications: GenBank at NCBI has a searchable database of sequences disclosed in granted patents. To create our granted patents sequence database, we started by acquiring the U.S. patent sequences from GenBank. This required removing all sequences that originated from non-U.S. patents.

We then acquired the sequence listings from the bulk and non-bulk patents in the manner described above in the Applications section. The data from all three sources (GenBank, bulk, and non-bulk) were converted to a common format. We then carried out a filtering step that removed any sequences duplicated between the data provided by GenBank and the data provided by the USPTO (bulk and non-bulk).
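The filtering step can be sketched as follows. The source does not say how duplicates were recognized, so this example assumes that two records are duplicates when they share a patent number and have identical residues after whitespace and case are normalized. The record layout and the function name are likewise hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed record layout: (patent number, SEQ ID NO, sequence)
Record = Tuple[str, str, str]


def merge_without_duplicates(*sources: Iterable[Record]) -> List[Record]:
    """Merge sources in order, keeping the first copy of each duplicate."""
    seen = set()
    merged = []
    for source in sources:
        for patent, seq_id, seq in source:
            # Assumed duplicate key: same patent and same normalized residues.
            key = (patent.strip().upper(), "".join(seq.split()).upper())
            if key in seen:
                continue
            seen.add(key)
            merged.append((patent, seq_id, seq))
    return merged


# Hypothetical data: the first USPTO record repeats the GenBank record.
genbank = [("US6000000", "1", "ATGCATGC")]
uspto = [("us6000000", "1", "atgc atgc"), ("US6000001", "2", "ATGCATGC")]

merged = merge_without_duplicates(genbank, uspto)
print(len(merged))
```

Listing GenBank first means its copy is the one retained; reversing the argument order would prefer the USPTO copy instead.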

The identical process was carried out for nucleotide sequences and amino acid sequences. As with the applications, each of these combined databases was converted to a searchable BLAST database for use with CAMBIA’s patent sequence search tool.

3. Identification of sequences that are claimed in granted patents and patent applications

A key feature of our analysis is that we determined which sequences were actually claimed in patents and applications, rather than just disclosed in the specification. To this end, we created four databases that contain only the sequences that are claimed in patents and patent applications. The four databases correspond to nucleotide sequences in applications, amino acid sequences in applications, nucleotide sequences in granted patents, and amino acid sequences in granted patents.

We compiled a list of common phrases that are used to identify sequence listings in claims. This step was difficult, because patent applicants use many different phrases to designate sequence listings in claims (see examples of phrases). Using these phrases, we created a list of the sequence ID numbers that are designated in patent claims. We then used this list to build the four databases that contain only the sequences claimed in applications and patents.
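To make the phrase-matching idea concrete, here is a minimal sketch that extracts sequence ID numbers from claim text with a regular expression. The phrase variants it recognizes ("SEQ ID NO: 2", "SEQ ID NOS: 4-6", "SEQ. ID. No. 9, 11, and 13") are illustrative assumptions and are not the phrase list compiled for this analysis, which would need to be considerably broader.

```python
import re

# Assumed phrase variants; real claims use many more forms than these.
SEQ_ID_PATTERN = re.compile(
    r"SEQ(?:UENCE)?\.?\s*ID(?:ENTIFICATION)?\.?\s*NOS?\.?\s*:?\s*"
    r"(\d+(?:\s*(?:,\s*(?:and|or)?|-|\u2013|to|through|and|or)\s*\d+)*)",
    re.IGNORECASE,
)
RANGE_WORDS = r"\d+|-|\u2013|to|through"


def claimed_seq_ids(claims_text: str) -> list:
    """Return the sorted sequence ID numbers designated in the claim text."""
    ids = set()
    for match in SEQ_ID_PATTERN.finditer(claims_text):
        tokens = re.findall(RANGE_WORDS, match.group(1), re.IGNORECASE)
        previous = None
        range_open = False
        for token in tokens:
            if not token.isdigit():
                range_open = True  # "-", "to" or "through" starts a range
                continue
            number = int(token)
            if range_open and previous is not None:
                lo, hi = sorted((previous, number))
                ids.update(range(lo, hi + 1))  # expand "4-6" to 4, 5, 6
            else:
                ids.add(number)
            previous = number
            range_open = False
    return sorted(ids)


claims = (
    "1. An isolated nucleic acid comprising SEQ ID NO: 2 or SEQ ID NOS: 4-6. "
    "2. The polypeptide of SEQ. ID. No. 9, 11, and 13."
)
print(claimed_seq_ids(claims))
```

A pattern like this will produce some false positives (for example, a number that follows a sequence ID and a comma but refers to something else), so the extracted IDs would still need spot checks against the claims.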

4. BLAST search of Arabidopsis genome database using the claimed sequences as input

After compiling a collection of sequences that are claimed in patents and applications, we used those sequences to query the Arabidopsis genome database (see step 1) using MegaBLAST, to identify claimed sequences that have significant homology to sequences in the Arabidopsis genome. We performed this analysis only with the claimed nucleotide sequences in patents and applications.
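The sketch below shows one way the search results could be read and filtered, assuming the hits are available in BLAST's 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score). The source does not state which output format or BLAST release was used, so the format, the `Hit` type, and the function names are assumptions. The default threshold mirrors the E-value cutoff given in step 5.

```python
import csv
import io
from typing import Iterable, Iterator, List, NamedTuple, TextIO


class Hit(NamedTuple):
    query: str      # claimed sequence identifier
    subject: str    # genome sequence (e.g. chromosome) identifier
    length: int     # alignment length
    sstart: int     # start of the match on the genome
    send: int       # end of the match on the genome
    evalue: float
    bitscore: float


def parse_tabular(handle: TextIO) -> Iterator[Hit]:
    """Yield hits from 12-column tabular BLAST output (assumed format)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        yield Hit(row[0], row[1], int(row[3]), int(row[8]), int(row[9]),
                  float(row[10]), float(row[11]))


def significant(hits: Iterable[Hit], max_evalue: float = 1e-200) -> List[Hit]:
    """Keep hits whose E-value is below the cutoff."""
    return [hit for hit in hits if hit.evalue < max_evalue]


# Hypothetical output lines; very small E-values may be reported as 0.0.
sample = io.StringIO(
    "US1_1\tChr1\t100.00\t500\t0\t0\t1\t500\t1000\t1499\t0.0\t924\n"
    "US1_2\tChr2\t92.00\t60\t4\t1\t1\t60\t200\t259\t3e-15\t80.5\n"
)
kept = significant(parse_tabular(sample))
print(kept)
```

With current BLAST+ releases, output in this format can be requested with `blastn -task megablast ... -outfmt 6`; whether the original analysis used that tool or an earlier MegaBLAST release is not stated.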

5. Plotting the results of the analysis

The sequences that had significant homology to sequences in the Arabidopsis genome were plotted in three different ways. The criterion for a match was a BLAST E-value of less than e-200 (that is, 1 × 10^-200).

  1. Sequence Count. For these plots, the Arabidopsis genome was divided into 300 kbp segments. For patent applications that claim Arabidopsis gene sequences, we plotted the number of sequences that match at least a 150 bp fragment of each 300 kbp genome segment. For each sequence, only the highest-scoring genome match was counted.
  2. Patent Count. As with the Sequence Count plots, the Arabidopsis genome was divided into 300 kbp segments. For patent applications that claim Arabidopsis gene sequences, we plotted the number of patent applications that claim a fragment of at least 150 bp of each 300 kbp genome segment. For each sequence, only the highest-scoring genome match was counted.
  3. Percent Genome Coverage. For these plots, we plotted the percentage of each genome segment that was claimed in patent applications. Unlike the previous two plot types, there was no requirement that the fragments that matched the Arabidopsis genome sequences be a minimum length (e.g., 150 base pairs). However, the minimum size for a positive match using the BLAST interface is 26 base pairs, so matches shorter than 26 were not included in the analysis. In addition, if multiple sequences covered the same portion of the genome, that portion of the genome was counted as being covered only once.
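The three plot types above can be sketched in a few functions. This is an illustration under stated assumptions, not the code used for the landscape: the hit tuple layout, the function names, and the use of 1-based inclusive coordinates are hypothetical, every segment is treated as a full 300 kbp (the last segment of a chromosome is usually shorter), and the coverage calculation uses all hits rather than only the best hit per sequence, which the source does not specify.

```python
from collections import defaultdict

SEGMENT = 300_000   # segment size in bp (300 kbp)
MIN_MATCH = 150     # minimum matched length for the two count plots

# Assumed hit layout: (patent, seq_id, chromosome, start, end, score)


def best_hits(hits):
    """Keep only the highest-scoring genome match for each claimed sequence."""
    best = {}
    for hit in hits:
        key = (hit[0], hit[1])
        if key not in best or hit[5] > best[key][5]:
            best[key] = hit
    return list(best.values())


def overlaps(lo, hi):
    """Yield (segment index, overlapping bp) for the interval lo..hi."""
    for index in range((lo - 1) // SEGMENT, (hi - 1) // SEGMENT + 1):
        seg_lo, seg_hi = index * SEGMENT + 1, (index + 1) * SEGMENT
        yield index, min(hi, seg_hi) - max(lo, seg_lo) + 1


def segment_counts(hits):
    """Return (sequence count, patent count) dictionaries keyed by segment."""
    seqs, patents = defaultdict(set), defaultdict(set)
    for patent, seq_id, chrom, start, end, _ in best_hits(hits):
        lo, hi = sorted((start, end))  # minus-strand hits have start > end
        for index, bp in overlaps(lo, hi):
            if bp >= MIN_MATCH:
                seqs[(chrom, index)].add((patent, seq_id))
                patents[(chrom, index)].add(patent)
    return ({k: len(v) for k, v in seqs.items()},
            {k: len(v) for k, v in patents.items()})


def percent_coverage(hits):
    """Percent of each segment covered, counting overlapping hits once."""
    by_chrom = defaultdict(list)
    for _, _, chrom, start, end, _ in hits:
        by_chrom[chrom].append(tuple(sorted((start, end))))
    covered = defaultdict(int)
    for chrom, intervals in by_chrom.items():
        merged = []
        for lo, hi in sorted(intervals):
            if merged and lo <= merged[-1][1] + 1:
                merged[-1][1] = max(merged[-1][1], hi)  # extend the interval
            else:
                merged.append([lo, hi])
        for lo, hi in merged:
            for index, bp in overlaps(lo, hi):
                covered[(chrom, index)] += bp
    return {seg: 100.0 * bp / SEGMENT for seg, bp in covered.items()}


# Hypothetical hits, for illustration only.
hits = [
    ("US1", "1", "Chr1", 1, 1000, 900.0),
    ("US1", "1", "Chr2", 5000, 5999, 800.0),        # lower-scoring match
    ("US2", "7", "Chr1", 501, 1500, 950.0),
    ("US2", "8", "Chr1", 299_801, 300_100, 400.0),  # straddles two segments
]
seq_counts, patent_counts = segment_counts(hits)
coverage = percent_coverage(hits)
print(seq_counts, patent_counts, coverage)
```

In this example the straddling hit is counted only in the first segment, where it overlaps by 200 bp, because its 100 bp overlap with the second segment falls below the 150 bp minimum. The coverage figure for the first segment counts the overlapping region between the first and third hits once.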

The end result of our analysis was a dataset of patents containing Arabidopsis sequences in their claims, linked to the specific map location of those sequences on the Arabidopsis genome. The claimed sequences were matched to genomic regions using highly specific search criteria, resulting in “exact” matching of claimed sequences to their corresponding genomic sequence. No attempt was made to identify homologous sequences within the genome, which may also be claimed by the patent document. Hence, the maps obtained are likely to under-represent the actual extent of sequence claims in patents claiming Arabidopsis sequences.