Specialised Data Searching
Two areas of specialised data searching are presented here, chemical and sequence searching, because the art areas in which they occur are of special significance to Australian industry, and because they present particular challenges for claim construction.
Searching the patent literature for chemical substances poses a unique set of challenges.
A single substance can have many synonymous names, and in the literature chemical substances are often represented by chemical structures, which can themselves be depicted in a number of equivalent ways.
Claims over chemical structures are frequently presented in Markush42 form. A Markush structure is a claim with multiple “functionally equivalent” chemical entities allowed in one or more parts of the compound, so a single claim can prophetically cover a large variety of substances that share a named core structure. The metes and bounds of a given claim are often set by transitional phrases such as “consisting of” or “comprising”.
As an indication of the challenge, the Markush structure shown on the figure page that follows covers more than 150,000 individual compounds. Even though relatively few words are used for the substituents, a claim on this structure would read on more than 150,000 possible variations of the core molecule, each of which would theoretically need to be checked in the prior art.
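The growth in the number of covered compounds follows from simple multiplication: the total is the product of the number of alternatives allowed at each variable position. The following is a minimal Python sketch of that arithmetic. The position names and substituent lists are hypothetical and are not taken from the figure.

```python
from itertools import product
from math import prod

# Hypothetical alternatives at each variable position (illustrative only,
# not the substituents of the Markush structure in the figure).
markush = {
    "R1": ["H", "CH3", "C2H5", "Cl", "Br"],
    "R2": ["H", "OH", "OCH3", "NH2"],
    "R3": ["H", "F", "Cl"],
    "X": ["O", "S"],
}

def count_compounds(groups):
    """Number of distinct compounds = product of the alternatives per position."""
    return prod(len(options) for options in groups.values())

def enumerate_compounds(groups):
    """Yield every combination as a dict mapping position -> substituent."""
    names = list(groups)
    for combo in product(*(groups[name] for name in names)):
        yield dict(zip(names, combo))

total = count_compounds(markush)  # 5 * 4 * 3 * 2 = 120
```

Four positions with only a handful of alternatives each already give 120 compounds. A few more positions, or longer lists at each position, are enough to reach counts in the hundreds of thousands.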
Searching for specific substances requires a database containing an index of chemical fragment codes. Searching Markush structures requires a specialised “Markush” database. Both types of searches require dedicated chemical structure tools (computer programs) that allow standard chemical structures to be converted to and from chemical fragment codes and/or Markush structures.
The following chemical structure databases are currently available; both are fee-paying services with non-trivial costs:
- Derwent/INPI Merged Markush Service (also called Markush DARC), which comprises the Derwent Markush index merged with the INPI Pharmsearch, with data from 1987 onwards, available on Questel Orbit.
- Chemical Abstracts Service (CAS) MARPAT, available online on STN Web and available via the desktop PC programs STN Express and SciFinder, with data from 1988 onwards.
Both these databases have richer data indices for more recent patents than for earlier patents.
An encouraging trend in the area of chemical searching is the use of Bayesian search algorithms, exemplified by Reel Two,43 which can use thesauri defined by the user to translate chemical names (CAS, IUPAC, common, SMILES) to structure. The structures selected by the user can then be utilised as a starting point for iterative searches in which additional synonyms and related terms found in the searched database are brought forward to the user as options. However, good knowledge of chemical terminology by the searcher is clearly required. Additionally, this is currently set up primarily for in-house dedicated and proprietary databases and is not available as a web service.
Figure 1: An example of wording in a Markush claim over chemical structures referred to in a diagram.
Figure 2: An example of a BLAST alignment.
42 Named after Dr. Eugene Markush, whose 1923 patent became a test case for the inclusion of multiple, independently varying functional groups in the description of a chemical invention (this is the source of the figure shown).
Searching Patent Data for Biological Sequences
Biological sequences can be broadly categorised into two types: sequences of nucleic acids (DNA and RNA) identified by the coding letters A, U or T, C, and G, which identify nucleotides with particular base-pairing relationships; and polypeptide, peptide or protein sequences (linear polymers of amino acids either identified by a single letter alphabetic code or a three letter code, e.g. “S” or “Ser” for the amino acid serine)…
The challenge in claim construction is that biological function is usually not dependent on the exact sequence, and similar or homologous functions can be “encoded” by many variations that might be only weakly similar to the particular sequences in the specification. Thus, claims are seldom over a single exact sequence, and tend to be worded such that they read on many sequences that may be found in the literature.
For example, a typical claim wording is “An isolated polynucleotide having a sequence (to some named percentage of the sequence such as 65%, with higher homology being “preferred”, and higher homology up to 95-99% “most preferred”, “particularly preferred”, or “especially preferred”) homologous to SEQ ID No. 1, or a portion / fragment (of specified length) thereof” or “An isolated polynucleotide which hybridizes (usually under some specified conditions) to SEQ ID No.1, or a portion / fragment thereof”44.
The language used in sequence claims is also very often similar to that used in chemical claims using the “Markush” notation. The metes and bounds of a given claim are provided by various transitional phrases such as “consisting of” or “comprising” multiple sequences.
There is no general rule for the required percentage agreement or for the stringency conditions for chemical hybridisation. The sequence of an enzyme, and particularly the portion encoding the active site, may be very highly conserved, whereas the function of a structural protein may be retained in variants that are much less similar.
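To make the percentage thresholds in such claims concrete, the following minimal Python sketch computes percent identity over two sequences that have already been aligned. It assumes one common definition (identical columns divided by all aligned columns, including gapped ones). Search tools and patent specifications may use a different denominator, so the definition in the specification at hand should be checked.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over an existing alignment (gaps written as '-').

    Assumed definition: identical columns / all aligned columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both sequences have a gap.
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)

# One mismatch in ten aligned positions gives 90% identity.
identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")
```

Under this definition a single gapped column counts against identity in the same way as a mismatch.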
A further complication in prior art searching is that the literature may or may not indicate the biological function related to a particular sequence accurately, so for a particular sequence claim it may be challenging to find the many sequences in the prior art that could be related by sequence similarity, by potential hybridisation or by function.
Furthermore, amino acid sequences are encoded by DNA sequences that are “translated” by a biological mechanism that reads successive groups of three nucleotides in a given “frame”, so a deletion or insertion error in the nucleotide sequence can shift the frame and result in a different “translation”. Thus, for comparisons involving a relation to a predicted protein sequence, DNA sequences must be checked in three “frames” on each of the two base-pairing strands of each DNA molecule. For example, the figure page above shows a sample output in which a DNA sequence being searched has matched a sequence in a patent document, although there are short gaps that suggest the protein sequence would not match.
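The six-frame check can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard genetic code and plain A/C/G/T (or U) input; the function names are illustrative and do not belong to any particular search tool.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def normalise(seq):
    """Upper-case the sequence and treat RNA 'U' as DNA 'T'."""
    return seq.upper().replace("U", "T")

def reverse_complement(dna):
    """Sequence of the opposite base-pairing strand, read 5' to 3'."""
    return normalise(dna).translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = normalise(dna)
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translations(dna):
    """Translate three frames on each strand; keys are '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", normalise(dna)), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translations("ATGGCCTAA")
# A single-base deletion shifts the frame and changes the predicted protein.
original = translate("ATGGCCTAA")
shifted = translate("ATGCCTAA")
```

For the nine-base example, frame +1 reads as Met-Ala-stop, while deleting one base changes the second residue from Ala to Pro and removes the stop codon from the frame. This is why a short gap in a DNA alignment can mean the protein sequences do not match.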
Biological sequences are usually given as separate listings as well as shown in figures or tables within the specification or claims section of a patent. Short sequences (less than 20 residues) may be quoted within the text of a sentence, and not all sequences in the specification or sequence listing will be covered by the claims; for example, sequences claimed by others or in the public domain may have been used in the examples of a patent specification to find or compare the sequences that are claimed.
The two most commonly used sequence search algorithms are variations of the Basic Local Alignment Search Tool (BLAST), as developed by the US National Center for Biotechnology Information (NCBI), and the similar “local alignment” algorithm FASTA, both of which use heuristics to search DNA and protein sequences. For a comparison of sequence search algorithms for protein sequences, see Shpaer et al. (1996).45 Most of the patent search service vendors provide these algorithms for searching patent data for sequences, though with varying flexibility of use. In general it is desirable, depending on the claims, to do:
- DNA sequence query against DNA sequences, both for similar sequences and complementary sequences that would hybridise to similar sequences
- Protein sequence query against protein sequences
- Protein sequence query against translated DNA sequences
- “DNA sequence translated into 6 possible reading frames” query against protein sequences
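In the NCBI BLAST family these four search types correspond to the programs blastn, blastp, tblastn and blastx respectively. The lookup below is a minimal Python sketch of that mapping. The function name and the "dna"/"protein" labels are illustrative, and vendor search services may expose the same searches under different names.

```python
# Map (query type, database type) to the NCBI BLAST program that performs it.
BLAST_PROGRAMS = {
    ("dna", "dna"): "blastn",          # DNA query against DNA sequences
    ("protein", "protein"): "blastp",  # protein query against protein sequences
    ("protein", "dna"): "tblastn",     # protein query against translated DNA
    ("dna", "protein"): "blastx",      # six-frame translated DNA query against protein
}

def choose_program(query_type, database_type):
    """Return the BLAST program name for a query/database pairing."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unsupported combination: {key}")
    return BLAST_PROGRAMS[key]

program = choose_program("DNA", "protein")
```

A search that covers all four cases therefore involves at least two query sequences (DNA and protein) and two kinds of target database.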
These algorithms tend to trade off completeness for search speed, and many users do not realise how strongly the parameters they specify determine the precision of the output. The sequences being compared are aligned within a moving “search window”, and controls for adjusting input parameters include:
- scoring (substitution) matrix (e.g. BLOSUM62)
- “word” length (the length of the moving search window in which matches are made)
- gap penalty (for adjustments due to small insertions and deletions in one sequence relative to another)
- cut-off threshold for expected frequency (“E score”)
- cut-off threshold for percent identity over the sequences being compared
- options for displaying output, such as displaying alignments, etc.
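The effect of word length on completeness can be illustrated with a deliberately simplified Python sketch. It assumes that a candidate match is only examined when the query and the subject share at least one exact word of the chosen length. Real implementations are more elaborate (protein searches, for example, also accept similar words scored with the matrix), so this shows the principle and is not a description of BLAST or FASTA internals.

```python
def words(seq, w):
    """Set of all overlapping words (k-mers) of length w in seq."""
    seq = seq.upper()
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}

def shared_words(query, subject, w):
    """Exact words of length w present in both sequences (candidate seeds)."""
    return words(query, w) & words(subject, w)

# Two hypothetical 8-base sequences that differ at a single central position.
query = "ACGTTGCA"
subject = "ACGTAGCA"

seeds_by_word_length = {w: len(shared_words(query, subject, w)) for w in (3, 4, 5)}
```

With these two sequences, a word length of 3 finds three shared words, a word length of 4 finds one, and a word length of 5 finds none, so under this simplified model the related sequence would be missed entirely at the longest setting.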
Because different patents can claim different degrees of relatedness for sequences (e.g. from exact matches to as low as 40% identity), searches by examiners and searches for freedom to operate should be performed initially with small word lengths, low gap penalties and low cut-off thresholds, with the intention of finding all related sequences, including distantly related ones. Then, depending on the claim language of the patent documents of interest, each patent document found to contain a sequence that a claim may read on can be checked to determine whether it should be discarded or included.
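This two-stage workflow (search broadly, then test each hit against the claim at hand) can be sketched as follows. The hit records, document labels and thresholds are hypothetical values used for illustration only, and a real assessment would also consider fragment length, hybridisation language and function.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    document: str            # hypothetical document label
    percent_identity: float  # identity reported by the broad search

# Hypothetical results of a permissive first-pass search.
broad_hits = [
    Hit("Document A", 98.0),
    Hit("Document B", 72.5),
    Hit("Document C", 41.0),
]

def hits_within_claim(hits, min_identity):
    """Keep hits whose identity meets the threshold recited in a claim."""
    return [h.document for h in hits if h.percent_identity >= min_identity]

# The same broad result set is re-used for claims with different thresholds.
claim_at_95 = hits_within_claim(broad_hits, 95.0)
claim_at_65 = hits_within_claim(broad_hits, 65.0)
claim_at_40 = hits_within_claim(broad_hits, 40.0)
```

Running the broad search once and filtering afterwards avoids repeating the search for every claim, and ensures that distantly related sequences are available when a claim with a low threshold is examined.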
44 Examples are taken from the Examination Guidelines for Patent Applications relating to Biotechnological Inventions in the UK Patent Office.
45 Shpaer E et al. (1996) “Sensitivity and Selectivity in Protein Similarity Searches: A Comparison of Smith-Waterman in Hardware to BLAST and FASTA”. Perkin-Elmer, Genomics 38, 179-191.