Mapping Arabidopsis Patents to the Arabidopsis Genome
The first step in understanding the extent to which claims from one organism extend into the genome of another is to develop the databases, software, and methodology needed to answer the question. The next step we chose was to perform the most straightforward analysis, both to test the methodology and to obtain a “baseline” for comparison with future analyses. We set out to:
- Develop reliable databases containing the defined sequences of interest. In the following analyses we have used data from a number of sources to develop a database containing sequences in the claims section of US granted patents and patent applications. To the extent possible, we have filled the known gaps in the existing GenBank patent sequence data using data from the USPTO.
- Define a standard methodology. Our in-house bioinformatics team developed the methodology and parameters used during analysis (initially based on the work of Jensen & Murray, 2005).
- Develop software tools. Our in-house bioinformatics team developed and identified software solutions allowing us to build the patent query databases described above and the target database (the Arabidopsis genome sequence), to perform the megaBLAST analysis, and to map the output onto Arabidopsis chromosomes.
These initial steps were used to attempt to reproduce, for Arabidopsis, aspects of the analysis that Jensen and Murray (Science. 2005 Oct 14;310(5746):239-40.) performed on the human genome. Essentially, claimed sequences from US granted patents were megaBLASTed against the Arabidopsis genome sequence, and sequences of interest (longer than 28 bp, with expect values of <1 x 10e-200 and the highest bit scores) were identified. The genome positions of these sequences of interest were identified and grouped into regions, each spanning 300 kb of an Arabidopsis chromosome. The number of patents in each region was then mapped to the position of that region on the respective chromosome.
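The filtering and grouping described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: it assumes the megaBLAST hits are available in the standard 12-column BLAST tabular format, that each query ID begins with a patent number followed by an underscore (a hypothetical naming convention), and that the expect-value cutoff is read as 1e-200. The function names are also illustrative.

```python
from collections import defaultdict

REGION_SIZE = 300_000  # 300 kb regions, as described in the text
MIN_LENGTH = 28        # keep alignments longer than 28 bp
MAX_EVALUE = 1e-200    # assumed reading of the expect-value cutoff


def parse_hits(lines):
    """Yield one dict per hit from 12-column BLAST tabular lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield {
            "query": f[0],
            "chrom": f[1],
            "length": int(f[3]),
            # Minus-strand hits report start > end, so take the smaller one.
            "start": min(int(f[8]), int(f[9])),
            "evalue": float(f[10]),
            "bitscore": float(f[11]),
        }


def best_hits(hits, min_length=MIN_LENGTH, max_evalue=MAX_EVALUE):
    """Keep, per query, the highest bit score hit that passes both cutoffs."""
    best = {}
    for h in hits:
        if h["length"] <= min_length or h["evalue"] >= max_evalue:
            continue
        current = best.get(h["query"])
        if current is None or h["bitscore"] > current["bitscore"]:
            best[h["query"]] = h
    return best


def patents_per_region(best, region_size=REGION_SIZE):
    """Count distinct patents per (chromosome, region index)."""
    regions = defaultdict(set)
    for query, h in best.items():
        patent = query.split("_")[0]  # hypothetical ID convention
        region_index = (h["start"] - 1) // region_size  # 1-based coordinates
        regions[(h["chrom"], region_index)].add(patent)
    return {key: len(patents) for key, patents in regions.items()}


if __name__ == "__main__":
    # Made-up hits, for illustration only.
    sample = [
        "US111_1\tChr1\t100.0\t500\t0\t0\t1\t500\t1000\t1499\t0.0\t924",
        "US222_3\tChr1\t100.0\t600\t0\t0\t1\t600\t299000\t299599\t0.0\t1100",
        "US333_1\tChr1\t100.0\t25\t0\t0\t1\t25\t400000\t400024\t1e-5\t50",
    ]
    print(patents_per_region(best_hits(parse_hits(sample))))
```

Counting distinct patents per region, rather than hits, means a patent claiming several sequences in the same 300 kb window is counted once there; whether the original analysis counted this way is not stated, so treat that choice as an assumption too.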
The end result of this analysis was a dataset of patents containing Arabidopsis sequences in their claims, linked to the specific map location of those sequences on the Arabidopsis genome. The matches between claimed sequences and genomic regions were made using relatively stringent search criteria, resulting in “exact” matching of claimed sequences to their corresponding genomic sequence. No attempt was made to identify homologous sequences within the genome, which may also be claimed by the patent document. Hence the maps obtained are likely to under-represent the actual sequence claims on Arabidopsis from patents claiming Arabidopsis sequences.
Note also: This analysis does not deal with sequences that are present in patent applications and, most importantly, will not show sequences appearing in the claims section of bulk sequence applications!