Sequence Search Methodology

For the study reported here, two different datasets were compiled: influenza sequences (both nucleotide and amino acid) and sequences recited in patent claims. The patent sequences were compared to the influenza sequences using BLAST programs. Then, the portions of the patent sequences that had significant homology to influenza sequences were determined. The criteria for a match depended on whether the query was a nucleotide or an amino acid sequence.

1. Generation of a searchable influenza genome and protein database

We obtained the most recent influenza genome and protein sequence collections from the FTP site of NCBI’s Influenza Virus Resource. The database contains more than 2,000 complete influenza genome sequences. Because relatively small changes in the flu genome at either the nucleotide or amino acid level can have large epidemiological effects, all sequences in the database were obtained. The formatdb program from NCBI was used to convert the data to a searchable BLAST database.
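As an illustration of the conversion step, the following minimal sketch builds a command line for the legacy formatdb program. The file names and database names are hypothetical, and the options shown (-i, -p, -n) are an assumption about how the program might have been invoked, not a record of the command actually used in the study.

```python
def formatdb_command(fasta_path, is_protein, db_name):
    """Build an argument list for NCBI's legacy formatdb program.

    The command is only constructed here, not executed.
    """
    return [
        "formatdb",
        "-i", fasta_path,                  # input FASTA file
        "-p", "T" if is_protein else "F",  # T = protein, F = nucleotide
        "-n", db_name,                     # base name of the database files
    ]


# Hypothetical file and database names, for illustration only.
nucleotide_cmd = formatdb_command("influenza.fna", False, "influenza_nt")
protein_cmd = formatdb_command("influenza.faa", True, "influenza_aa")

print(" ".join(nucleotide_cmd))
print(" ".join(protein_cmd))
```

The resulting lists could be passed to subprocess.run on a machine where formatdb is installed.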

2. Compilation of sequence databases for granted patents and patent applications

Applications
For patent applications, the bulk sequence listings were obtained from the Publication Site for Issued and Published Sequences (PSIPS) web site. This web site provides sequence listings longer than 300 pages for U.S. patents and applications. Non-bulk sequence listings (fewer than 300 pages in length) are published by the USPTO as XML documents. For each of the listing types (bulk and non-bulk), there was a separate file for nucleotides and amino acids. Data for U.S. applications have been available only since 2001.

The bulk and non-bulk sequence listings were then converted to a common data format (FASTA) and combined to create one database for nucleotide sequences, and one database for amino acid sequences. Additionally, each of these combined databases was converted to a searchable BLAST database for use with CAMBIA’s patent sequence search tool.
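The conversion to a common format can be sketched as follows. This is a minimal example that assumes each listing has already been parsed into (identifier, sequence) pairs. The identifiers, sequences, and the 60-character line width are hypothetical, and parsing of the PSIPS and XML source files is not shown.

```python
def to_fasta(records, line_width=60):
    """Render (identifier, sequence) pairs as FASTA text."""
    lines = []
    for seq_id, sequence in records:
        # Remove embedded whitespace and normalise case.
        cleaned = "".join(sequence.split()).upper()
        lines.append(">" + seq_id)
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(cleaned), line_width):
            lines.append(cleaned[start:start + line_width])
    return "\n".join(lines) + "\n"


# Hypothetical records standing in for parsed bulk and non-bulk listings.
bulk_records = [("US20010000001_1", "atg gcc aag")]
non_bulk_records = [("US20020000002_3", "ATGAAAGTC")]

combined_fasta = to_fasta(bulk_records + non_bulk_records)
print(combined_fasta, end="")
```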

Granted (Issued) Patents
For granted U.S. patents, we used a data source that was not available for the applications: GenBank at NCBI has a searchable database of sequences disclosed in granted patents. To create a database of sequences in granted U.S. patents, the patent sequences were acquired from GenBank, and all sequences that originated from non-U.S. patents were removed.

Sequence listings from the bulk and non-bulk patents were obtained in the manner described above for Applications. The data from all three sources (GenBank, bulk, and non-bulk) were converted to a common format. A filtering step removed duplicate sequences in the data provided by GenBank and by the USPTO (bulk and non-bulk).
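The filtering step might look like the sketch below. The text does not state what counted as a duplicate, so this example assumes that two records are duplicates when they share a patent number and an identical sequence (ignoring case and whitespace). The record layout and patent numbers are hypothetical.

```python
def remove_duplicates(records):
    """Keep the first occurrence of each (patent number, sequence) pair.

    Each record is a (source, patent_number, sequence) tuple.
    """
    seen = set()
    unique = []
    for source, patent_number, sequence in records:
        # Assumed duplicate key: same patent and same normalised sequence.
        key = (patent_number, "".join(sequence.split()).upper())
        if key in seen:
            continue
        seen.add(key)
        unique.append((source, patent_number, sequence))
    return unique


# Hypothetical records from the three sources.
records = [
    ("GenBank", "US0000001", "ATGGCCAAG"),
    ("USPTO non-bulk", "US0000001", "atggccaag"),  # same patent, same sequence
    ("USPTO bulk", "US0000002", "ATGGCCAAG"),      # same sequence, other patent
]
unique_records = remove_duplicates(records)
print(unique_records)
```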

The identical process was carried out for nucleotide sequences and amino acid sequences. As with the applications, each of these databases was converted to a searchable BLAST database for use with CAMBIA’s patent sequence search tool.

3. Identification of sequences that are recited in the claims of granted patents and patent applications

A key feature of our analysis is determining which sequences are recited in the claims of patents and applications, rather than just disclosed in the specification. To this end, we created four databases that contain only the sequences that are recited in the claims of patents and patent applications. The four databases correspond to nucleotide sequences in applications, amino acid sequences in applications, nucleotide sequences in granted patents, and amino acid sequences in granted patents.

Sequences in claims were identified by searching for the keywords that are used to refer to sequence listings (e.g., SEQ ID NO:). Establishing a comprehensive list of keywords is challenging, as many different phrases are used. Applying these phrases resulted in a list of sequence ID numbers that are designated in patent claims. Four new databases were then created that contain only the sequences that are recited in the claims of applications and patents.
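A keyword search of this kind can be sketched with a regular expression. The pattern below is an assumption made for illustration and covers only a few variants of "SEQ ID NO"; it is not the keyword list used in the study, and it does not handle numeric ranges or phrases that omit the prefix.

```python
import re

# Matches e.g. "SEQ ID NO: 1", "SEQ. ID. NO. 12", "SEQ ID NOS: 4, 5 and 7".
SEQ_ID_PATTERN = re.compile(
    r"SEQ\.?\s*ID\.?\s*NOS?\.?\s*:?\s*"                 # keyword prefix
    r"(\d+(?:\s*(?:,\s*(?:and|or)?|and|or)\s*\d+)*)",   # one or more numbers
    re.IGNORECASE,
)


def claimed_sequence_ids(claim_text):
    """Return the sorted sequence ID numbers mentioned in the claim text."""
    ids = set()
    for match in SEQ_ID_PATTERN.finditer(claim_text):
        ids.update(int(number) for number in re.findall(r"\d+", match.group(1)))
    return sorted(ids)


# Hypothetical claim language.
claims = (
    "An isolated polynucleotide comprising SEQ ID NO: 1. "
    "A polypeptide selected from SEQ ID NOS: 4, 5 and 7. "
    "A vector comprising the sequence of SEQ. ID. NO. 12."
)
claimed_ids = claimed_sequence_ids(claims)
print(claimed_ids)
```

The resulting ID numbers could then be used to select the matching records from the combined sequence databases.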

4. BLAST search of influenza genome and protein databases using the sequences recited in claims as input

After compiling a collection of sequences that are recited in the claims of patents and applications, we used those sequences to query the influenza genome and protein databases (see step 1). MEGABLAST was used for the nucleotide sequences and blastp for the amino acid sequences, in order to identify sequences that are recited in claims and have significant homology to influenza sequences.

Nucleotide MEGABLAST parameters:

Positive matches had an E value of 1e-200 or less, and were at least 150 nucleotides in length.

Amino Acid blastp parameters:

Positive matches had at least 80% identity.
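The match criteria above can be expressed as simple filters. The sketch below assumes the hits are available in the 12-column tab-separated BLAST output format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E value, bit score), and it interprets "length" as alignment length. Both points are assumptions, and the example hit lines are invented.

```python
NUCLEOTIDE_MAX_EVALUE = 1e-200   # E value of 1e-200 or less
NUCLEOTIDE_MIN_LENGTH = 150      # at least 150 nucleotides
PROTEIN_MIN_IDENTITY = 80.0      # at least 80% identity


def parse_hit(line):
    """Parse one line of 12-column tabular BLAST output."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query": fields[0],
        "subject": fields[1],
        "identity": float(fields[2]),
        "length": int(fields[3]),
        "evalue": float(fields[10]),
    }


def is_nucleotide_match(hit):
    """Apply the MEGABLAST criteria (E value and length)."""
    return (hit["evalue"] <= NUCLEOTIDE_MAX_EVALUE
            and hit["length"] >= NUCLEOTIDE_MIN_LENGTH)


def is_protein_match(hit):
    """Apply the blastp criterion (percent identity)."""
    return hit["identity"] >= PROTEIN_MIN_IDENTITY


# Invented hit lines, for illustration only.
long_hit = parse_hit("claim_1\tflu_9\t99.50\t400\t2\t0\t1\t400\t1\t400\t0.0\t780")
short_hit = parse_hit("claim_2\tflu_3\t98.00\t120\t2\t0\t1\t120\t1\t120\t1e-50\t200")
weak_protein_hit = parse_hit("claim_3\tflu_5\t62.00\t200\t76\t0\t1\t200\t1\t200\t1e-70\t250")

print(is_nucleotide_match(long_hit), is_nucleotide_match(short_hit))
print(is_protein_match(long_hit), is_protein_match(weak_protein_hit))
```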