
12

An Integrated Chemical and Biological Data Retrieval System for Drug Development

Downloaded by UNIV OF NEW ENGLAND on January 22, 2017 | http://pubs.acs.org Publication Date: December 14, 1978 | doi: 10.1021/bk-1978-0084.ch012

J. A. PAGE¹, R. THIESEN, and F. KUHL
Walter Reed Army Institute of Research, Bethesda, MD 20014

The Division of Experimental Therapeutics, The Walter Reed Army Institute of Research, in conjunction with the Division of Biometrics, has been engaged in the development and implementation of a large-scale integrated chemical-biological data retrieval system for the support of the Army Medical Research and Development Command's drug development activities. The system is being developed on a Control Data Corporation 3500 with one million bytes of memory, 16 disk drives with removable packs which contain 37 million bytes of storage each, six 7-track tape drives, two line printers, and 16 communication lines supporting line speeds of 110, 300, and 1200 baud. This effort represents a total redesign of the original system, which was described earlier [1].

File Organization

The WRAIR Chemical Information Retrieval System (CIRS) is comprised of four subsystems: Biology, Inventory, Chemistry and the Report Generator. The first three subsystems contain files of information peculiar to each system, and programs for searching these files. The Report Generator is used to combine and sort output from searches of the other subsystems. The subsystems must be searched separately because they are too large to be searched together. The output from any subsystem may be used to control the search of the next, by means of a common key. For example, a chemistry search yields a number of structures, each of which is identified by a unique accession number. These numbers might then be used to garner information from Biology and Inventory relative to samples of the structures whose numbers come from Chemistry. Or, the list of sample numbers from an Inventory search might be used to extract information from Chemistry relating to those samples. In both examples, the Report Generator would combine the information from the various subsystems and sort the report into the desired order. The standard CIRS report may contain information from any of the subsystems alone, or from any two, or from all three.

¹ Present address: Uniformed Services University of the Health Sciences, Bethesda, MD 20014.

This chapter not subject to U.S. copyright. Published 1978 American Chemical Society. Howe et al.; Retrieval of Medicinal Chemical Information ACS Symposium Series; American Chemical Society: Washington, DC, 1978.

Chemistry Retrieval Subsystem. The chemistry retrieval subsystem file design and organization is being described in detail for publication elsewhere [2]. Briefly, the system consists of two numeric index cross reference files, a screen index file, and a master structure file. It contains about 270,000 unique structures and occupies 8 disk packs of 37 million characters each. The three index files provide a cross indexing scheme that allows for flexibility in sequencing and updating. The system may be accessed by 1) accession number, a unique number similar in concept to the CAS registry number, which is assigned by the chemistry update system to each new structural formula, 2) sample number, which is a unique, sequential number assigned to each physical sample without regard to the chemical structure by the inventory update system, or 3) chemical structures, either whole or sub-structures.

The accession index file contains the accession number for a given structural formula and a table of sample index records for each accession number. This sequence provides quick access to data for all samples of a particular chemical.
In order to provide continuity and allow for the expression of a hierarchical relationship, the accession number is structured so that functionally different files may be maintained and salts may be tied, through their accession number, to the parent compound. The parts and functions of the parts are as follows:

1) A two-character alphabetic prefix designates the series. Currently only two series are being used: "WR" for structures for which a physical sample has been received for screening and "XR" for structures proposed or under consideration but not actually received. An additional series for related structures reported in the literature is planned. The series prefix is automatically upgraded to "WR" if the compound is received and processed through the inventory system.

2) A six-digit sequential number which identifies the primary chemical structure.

3) A two-digit numeric "salt suffix" which is assigned by the update system to different salts of compounds having the same primary structure. This allows the user to retrieve a specific compound and all of its salt forms without doing a sub-structure search. It also allows data on a given compound and all of its salt forms to be grouped together on an accession number sequenced report.
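The three-part accession number described above can be illustrated with a short Python sketch. The written layout used here (prefix, six digits, two digits, with optional hyphens) and the names `AccessionNumber`, `parse_accession`, and `same_primary_structure` are assumptions made for illustration; the chapter does not say how the number is printed or stored.

```python
import re
from typing import NamedTuple


class AccessionNumber(NamedTuple):
    series: str      # two-letter series prefix, "WR" or "XR"
    structure: int   # six-digit sequential number of the primary structure
    salt: int        # two-digit salt suffix


# Assumed written form: prefix, six digits, two digits, hyphens optional,
# e.g. "WR-004321-02". The chapter does not specify the printed layout.
_ACCESSION = re.compile(r"^(WR|XR)-?(\d{6})-?(\d{2})$")


def parse_accession(text: str) -> AccessionNumber:
    """Split an accession number into series, structure number and salt suffix."""
    match = _ACCESSION.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a recognizable accession number: {text!r}")
    series, structure, salt = match.groups()
    return AccessionNumber(series, int(structure), int(salt))


def same_primary_structure(a: AccessionNumber, b: AccessionNumber) -> bool:
    """True when two accession numbers differ at most in their salt suffix."""
    return a.series == b.series and a.structure == b.structure


if __name__ == "__main__":
    raw = ("WR-004321-02", "WR-000017-00", "WR-004321-00")
    numbers = [parse_accession(item) for item in raw]
    # Tuples sort by series, then structure number, then salt suffix, so the
    # salt forms of one compound end up next to each other in the listing.
    for number in sorted(numbers):
        print(number)
```

Sorting on the parsed tuple is one simple way to obtain the grouping of a compound with its salt forms that the accession-number-sequenced report relies on.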


The sample index file is keyed by the sample or bottle number. This number is cross-referenced to the accession number so that a given sample may be attached to a specific structure. The sequence permits the chemistry subsystem direct access to either the biology or the inventory subsystem and provides for direct access of the structures for reports by the inventory and/or biology subsystems. The sample index file also contains some administrative information about each sample, such as the source; the method by which the sample was obtained (e.g., gift, purchase); whether this sample is the original submission or a duplicate; and whether it is discreet (i.e., proprietary) or open.

The screen index file is the first file accessed for any structure or sub-structure search. It contains all the information necessary to determine structure matches. When the structure matches have been located, additional information for each structure, such as the structure picture, may be retrieved via the accession number. Because its organization is index-sequential, the screen index file may be accessed either sequentially, or selectively by use of its indexes.

Each structure has its own record on the screen index. The chief items stored for each are the connection table (in a compressed, non-redundant format), and the structure's unique accession number. The key for each record consists of the accession number, preceded by the structure screen and the partitioning factor.
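As a rough illustration of selective access through a composite key, the Python sketch below keeps keys as sorted tuples and uses binary search to find every record sharing a given partitioning factor and screen. The field order (factor first, then screen, then accession number), the small integer values, and the function name `records_with_prefix` are assumptions for the example, not a description of the actual file layout.

```python
from bisect import bisect_left


def records_with_prefix(sorted_keys, factor, screen):
    """Return all keys whose first two fields equal (factor, screen).

    sorted_keys must be a sorted list of
    (partitioning_factor, screen, accession_number) tuples,
    with factor and screen held as integers.
    """
    # A shorter tuple sorts before any longer tuple that starts with it,
    # so (factor, screen) locates the first key carrying that prefix.
    low = bisect_left(sorted_keys, (factor, screen))
    # (factor, screen + 1) sorts after every key with prefix (factor, screen).
    high = bisect_left(sorted_keys, (factor, screen + 1))
    return sorted_keys[low:high]


if __name__ == "__main__":
    keys = sorted([
        (3, 0b1010, "WR00000100"),
        (3, 0b1110, "WR00000300"),
        (7, 0b1010, "WR00000400"),
        (3, 0b1010, "WR00000200"),
    ])
    for key in records_with_prefix(keys, 3, 0b1010):
        print(key)
```

A sequential pass over `keys` visits the same records in key order, which mirrors the two access modes an index-sequential organization offers.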
The screen is a 96-bit superimposed code derived algorithmically from the structure. It has been described in detail by Feldman [3]. If two structures have different screens, they must be different structures. Thus only those file structures having the same screen as a given query compound are candidates for matching. Iterative matching will be necessary to confirm the matches, but the amount of iterative matching required is drastically reduced by the screen.

For sub-structure searches, the inclusive property of the screen becomes significant. In the example below, the first structure (discounting hydrogens, which are not considered in the calculation of the screen) is wholly contained by the second and as a sub-structure would be a match.

[Structure drawings not reproduced.]

The screen for the first structure is also wholly contained in the screen for the second, i.e., for each bit set in the first screen, the corresponding bit is set in the second. To match a sub-structure, then, a candidate's screen must have at least all the bits that are set in the query's screen. The effectiveness of this system in eliminating file compounds from consideration depends on the nature of the sub-structure and varies greatly.
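The two screen tests just described (exact equality for identity searches, bit inclusion for sub-structure searches) reduce to simple integer operations. The Python sketch below is illustrative only: `toy_screen` is a stand-in that hashes hypothetical feature strings to bits and is not the screen-generation algorithm of reference [3]; the function names are likewise assumptions.

```python
import zlib

SCREEN_BITS = 96  # screen width given in the text


def toy_screen(features):
    """Superimpose one bit per feature string into a 96-bit integer.

    Stand-in only: the real screen is derived from the structure by the
    algorithm described in reference [3].
    """
    screen = 0
    for feature in features:
        bit = zlib.crc32(feature.encode("utf-8")) % SCREEN_BITS
        screen |= 1 << bit
    return screen


def identity_candidate(query_screen, file_screen):
    """Identity search: only structures with the same screen can match."""
    return query_screen == file_screen


def substructure_candidate(query_screen, file_screen):
    """Sub-structure search: every bit set in the query must be set in the file screen."""
    return (query_screen & file_screen) == query_screen


if __name__ == "__main__":
    query = toy_screen(["C-N", "N-H"])
    larger = toy_screen(["C-N", "N-H", "C=O", "ring"])
    # The larger feature set includes the query's features, so its screen
    # includes the query's bits and it survives the screening step.
    print(substructure_candidate(query, larger))
```

Passing the screen test only makes a file structure a candidate; atom-by-atom matching is still needed to confirm it, as the text notes.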


The partitioning factor is conceptually similar to Hode's bucket index [4]. It is a 12-bit code derived from the screen through a series of AND and OR operations in such a way that the inclusion properties of the screen are preserved. Its function in the screen index file is to allot records to one of 4096 partitions with a theoretically uniform distribution. Therefore, in a file of 250,000 records the expected number of records in a given partition is 61. Since, for identity searches, the factor and screen of the query must be matched exactly, only this very small portion of the file need be read prior to the search.

The utilization of the partitioning factor in the sub-structure search is more complex. Any compound containing a given sub-structure will have a partitioning factor which contains at least those bits set in the sub-structure's partitioning factor. As sub-structure queries become more specific, more screen bits are usually set, and more bits are set in the partitioning factor. The number of possible inclusive matches to the factor drops exponentially as the number of one bits rises. Because the screen index has the partitioning factor as its major key, it is necessary to read only those records having the right factor. If, however, the sub-structure is so general that more than one third of the partitions must be accessed randomly, it is quicker to scan the screen index file sequentially.

The master file contains the chemical structure, which has been captured at input and saved in a condensed form.
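The arithmetic behind the partitioning factor of the screen index can be checked with a small Python sketch. It assumes only what the text states (a 12-bit factor, inclusive matching, and the one-third rule); the function names are hypothetical, and the derivation of the factor from the screen is not modeled.

```python
FACTOR_BITS = 12
PARTITIONS = 1 << FACTOR_BITS  # 4096 partitions


def expected_records_per_partition(record_count):
    """Average partition size under a uniform distribution."""
    return record_count / PARTITIONS


def candidate_partitions(query_factor):
    """All 12-bit factors that contain every bit set in the query's factor."""
    return [
        factor
        for factor in range(PARTITIONS)
        if (factor & query_factor) == query_factor
    ]


def should_scan_sequentially(query_factor):
    """Apply the one-third rule from the text.

    A query factor with k one bits is included in 2 ** (12 - k) factors,
    so the candidate count halves with each additional bit.
    """
    one_bits = bin(query_factor).count("1")
    return 2 ** (FACTOR_BITS - one_bits) > PARTITIONS / 3


if __name__ == "__main__":
    print(round(expected_records_per_partition(250_000)))  # 61
    print(len(candidate_partitions(0b000000000111)))        # 512
    print(should_scan_sequentially(0b000000000001))         # True
    print(should_scan_sequentially(0b000000000011))         # False
```

Under these assumptions, a query whose factor has two or more bits set touches at most a quarter of the partitions, so selective access is preferred; with one bit or none, half or all of the partitions qualify and the sequential scan wins.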
The master file is sequenced by accession number, as this number is automatically assigned to a new structure by the file update system. In addition to the structure, the molecular formula and qualifiers are in this file. The molecular formula has been stored in such a way as to permit searching either as an exact match or as an inclusive match. This format also permits the sorting of matches by molecular formula into CAS sequence for reporting.

Because the connection tables are essentially two-dimensional, and do not contain special bond types, many chemical properties such as stereochemistry cannot be represented. The solution to this and similar problems was the inclusion of machine-readable qualifier fields for each structure to indicate such things as stereo information, polymers, mixtures, and coordination complexes. Each qualifier also has non-searchable free text stored with it to aid human interpretation of the picture.

Biology Retrieval Subsystem. The biology retrieval subsystem consists of two indexed sequential files containing biological test data relating to the structures of the chemistry subsystem. There are over three million records occupying 6 disk packs of 37 million characters each. Both files are sequenced by sample number (BN) and laboratory identification number (Lab I.D.). The data fields are dependent on the type of experimentation done by a specified laboratory and are predefined in a data name dictionary. From a user's point of view the two files are identical. The division is arbitrary and is designed to allow searching only part of the data base. The primary file contains those lab IDs most frequently accessed, while the other file contains a number of secondary test systems and historical data. New laboratories can be added to the system by simply adding entries into the data name dictionary under a new lab ID number.

Inventory Retrieval Subsystem. The inventory retrieval subsystem is an indexed sequential file containing information pertinent to the physical samples. It currently contains 433 thousand records and occupies 5 disk packs of 37 million characters each. The file is maintained in sample number sequence. When a sample is received it is assigned the next available sample number and all available data (i.e., date of receipt, source, amount, condition of receipt, shelf location, chemical and physical properties, etc.) are entered into the record. All transactions involving that sample (shipments to testing laboratories, removal from inventory, etc.) and the dates of the transactions are also entered into the record. The data fields for this file are also predefined in a data name dictionary for searching.

Retrieval Criteria

Chemical Subsystem. The heart of the chemical retrieval subsystem is the sub-structure search capability. The general purpose of sub-structure searching is to retrieve compounds having specified structural similarities.
In our system, the similarities are specified in the form of an incomplete structure, which must be included in any file structure that is to be retrieved. While the file structure may contain atoms and interconnections not shown in the query, those in the query must be matched. Thus, a query sub-structure may contain normal structure atoms and bonds, and indefinite atoms or bonds. The former must be matched exactly, and the latter may be substituted according to the rules governing the particular atom or bond.

Structures, either queries or file compounds, are represented by a connection table. The table contains an entry for each non-hydrogen atom, together with information on the numbers and sizes of covalent bonds on each atom, and the other non-hydrogen atoms attached to it (called "neighbors"). Each entry also shows the number of hydrogens attached to the atom, any ionic charges, and a flag that is set if the atom is in a ring. We have deliberately discarded the knowledge of what type of bond attaches which neighbor, not because it is uninteresting but because doing so allows resonating structures, such as phenyl rings, to appear identical regardless of the precise arrangement of double and single bonds. We also make certain adjustments to tautomers to allow them to be identified by either form regardless of which form is used for input. The details of this have been discussed elsewhere [5].

Our connection tables are not capable of distinguishing stereoisomers or polymers. But codes for the presence of such conditions and text explaining them are stored with the original input formula and are retrieved with it. These are the qualifiers mentioned above. They may also be specified for a chemistry search.

Normal structures are coded for input by means of a specially-modified teletype [6], which allows the structural formula to be typed as a combination of atoms and bonds to represent chains and rings. It also allows strings of subscripted element symbols and groups enclosed within parentheses, whose connections must be inferred. The extensive logic necessary to interpret these formulas has been described [7]. The result is a fairly simple set of rules for the chemist writing the formula for input that generally corresponds to normal chemical conventions. These structures are complete, and except for certain Markush-type structures, no ambiguity or substitution is allowed at any point. Queries, on the other hand, are incomplete structures, allowing additions and substitutions at specified points.

Most query atoms are normal atoms. That is, they are not special atoms, they have no unspecified bonds and their valence is not zero. It is required that they match file atoms. The file structure must contain one identical atom for each normal atom in the query. If the query atom is in a ring, a flag is set in the atom descriptor word and the file atom must also be in a ring.
However, if the query atom is not in a ring, the file atom need not be in a ring, but it is allowed to be. For example:

Z - NH - Z

will allow [matching structures not reproduced].

If you wish to force a query atom to be matched only by a file atom which is a ring member, the query atom must be in a ring. If the query cannot be written in such a way as to include the particular query atom in a ring, ring members may perhaps be specified with a special atom.

The simplest type of substitution is that of a special atom. Special atoms appear in the query as atoms with special symbols and may be replaced in the file structure by any atom meeting the criteria they impose. In Figure 1, the query structure contains two special atoms, an "X" which allows the substitution of any non-hydrogen atom and a "Q" which allows the substitution of any non-hydrogen, non-carbon atom. These special atoms may appear anywhere within the query structure; that is, they need not be terminal atoms but may be incorporated in a string or in a ring. In order to be a "hit" to a query, a file atom need only have at least the bonds, charges, etc., indicated for a special atom.


Therefore, the file compounds in Figure 1 contain many atoms other than the normal atoms and the special atoms in the query. The special atoms allowed and their restrictions on ring membership are given in Table I.

Table I. Types of Special Atoms

Symbol   Atom Type       Ring Membership
Z        Any             Indifferent
X        Not H           Indifferent
Xr       Not H           Required
Xc       Not H           Excluded
Q        Not H, not C    Indifferent
Qr       Not H, not C    Required
Qc       Not H, not C    Excluded
Ha       F, Cl, Br, I    Indifferent
Mt       Any metal       Indifferent
Rc       Carbon          Required
Cc       Carbon          Excluded
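Table I can be read as a lookup from special-atom symbol to an element test plus a ring rule. The Python sketch below encodes it that way as an illustration; the `METALS` set is a small illustrative subset rather than a complete list, and the names `SPECIAL_ATOMS` and `special_atom_matches` are assumptions. Bonds and charges, which a real match must also check, are left out.

```python
HALOGENS = {"F", "Cl", "Br", "I"}
# Illustrative subset only; a real system would carry the full list of metals.
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn", "Cd", "Hg"}


def _any(element):
    return True


def _not_h(element):
    return element != "H"


def _not_h_not_c(element):
    return element not in ("H", "C")


def _halogen(element):
    return element in HALOGENS


def _metal(element):
    return element in METALS


def _carbon(element):
    return element == "C"


# Ring rule: None = indifferent, True = ring required, False = ring excluded.
SPECIAL_ATOMS = {
    "Z": (_any, None),
    "X": (_not_h, None),
    "Xr": (_not_h, True),
    "Xc": (_not_h, False),
    "Q": (_not_h_not_c, None),
    "Qr": (_not_h_not_c, True),
    "Qc": (_not_h_not_c, False),
    "Ha": (_halogen, None),
    "Mt": (_metal, None),
    "Rc": (_carbon, True),
    "Cc": (_carbon, False),
}


def special_atom_matches(symbol, element, in_ring):
    """Check a file atom against the element and ring rules of a special atom."""
    element_ok, ring_rule = SPECIAL_ATOMS[symbol]
    if not element_ok(element):
        return False
    return ring_rule is None or ring_rule == in_ring


if __name__ == "__main__":
    print(special_atom_matches("Q", "N", in_ring=False))   # True
    print(special_atom_matches("Rc", "C", in_ring=False))  # False
```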

Any atom in the query, including special atoms, may carry a charge, which must then be matched by the file atom. The inverse, however, is not true. That is, the absence of a charge on a query atom does not preclude a charged file atom as a match. Because the valence of special atoms is indeterminate (except for Ha, Rc, Cc), all specifically required bonds to a special atom must be shown explicitly. The ring and chain carbons (Rc and Cc) may be used to allow one to write part of a ring as a chain, or to exclude rings, especially fused rings, as answers (see Figure 2).

Figures 3, 4, and 5 show the relationships among the special atoms and how they may be used to modify a query so that it becomes more general or more specific depending on the nature and the number of matches desired. Figure 3 divides the universe of possible substitutions into eight categories and lists which of these categories will be retrieved by each special atom. Figure 4 shows representative structures for each category singly substituted on a methyl group and which of these structures would be retrieved by each special atom. Figure 5 shows a few of the possibilities for an ortho di-substituted phenyl ring. These simple examples give, of course, only an indication of the versatility possible. Combinations of these few special atoms provide a very powerful sub-structure search capacity.

The other major query element is the unspecified bond, written as any bond overstruck with a question mark.
In general, the unspecified bond may be used to allow the connections on the connected atoms to vary, as long as the neighbor relation is maintained. Used between two normal atoms, it requires the two atoms to be neighbors, but allows the valence of the two atoms to vary (Figure 6) or the nature of the attachment between the atoms to vary (Figure 7).

Figure 1. Example of a simple substructure query. [Query and matching structures not reproduced.]

Figure 2. Substructure query using Rc and Cc. [Queries A and B and their matching structures not reproduced.]

Figure 3. Areas of substitution allowed by the special atoms

Category   Substituent
1          H
2          Ha
3          Mt, chain
4          not C, chain
5          C, chain
6          Mt, ring
7          not C, ring
8          C, ring

Special atom   Categories allowed
Z              1-8
X              2-8
Xc             2-5
Xr             6-8
Q              2, 3, 4, 6, 7
Qc             2-4
Qr             6-7
Ha             2
Mt             3, 6
Cc             5
Rc             8

Figure 4. Substructure substitution at a single point. [Grid relating the query CH3- plus each special atom to representative response structures not reproduced.]

A single query may consist of a sub-structure fragment with normal atoms and any number and combination of special atoms and unspecified bonds. It may in fact consist of only special atoms. It may also consist of several sub-structure fragments, each of which may have special atoms and unspecified bonds. These fragments may be related to each other in any combination of three ways. If the fragments are simply disconnected, each fragment must be distinctly and simultaneously present in the file structure. If, however, two fragments are related through the Boolean operator "AND", they are matched independently. Thus if parts of the fragments are identical, those identical parts are redundant and need not be distinctly present. For example:

X-[ring]-X   X-CH2-Cl   Cl

requires at least two Cl atoms in the response [response structures not reproduced], while

X-[ring]-X AND X-CH2-Cl AND Cl

allows two Cl atoms in the response but only requires one. It therefore retrieves further structures [drawings not reproduced], etc.,

in addition to the first responses.

Two fragments may also be related through the Boolean operator "AND NOT". In this case the file structure must not match the specified fragment. Each query may contain up to 32 such fragments in any combination that is not directly contradictory.

Figure 6. Unspecified bond query. [Query S?C-C, where ? marks the unspecified bond; matching structures not reproduced.]

Figure 7. Unspecified bond query. [Query N?N, where ? marks the unspecified bond; matching structures not reproduced.]

Biology and Inventory Retrieval Subsystems. Essentially the same programs are used to search the biology and inventory systems. The major difference is in the data name dictionary that is attached. Each subsystem has a dictionary of all data items in its file. This provides the capability of searching on any field or combination of fields in either data base. A field may be defined as numeric, alpha, alphanumeric or as a repeating group. Additional flexibility is provided by allowing new fields that are not in the data name dictionary to be defined and used in the search. The search is made up of a search command (SUBSET, or SUBSET, USING), the field(s) to be selected as defined in the dictionary, and the search criteria. The retrieval system is capable of handling up to ten imbedded ANDed or ORed functions as search criteria. There are five operations which are used with the data fields to make up the search criteria. They are EQUAL, GREATER THAN, LESS THAN, CONTAINS (which compares for a specific string of characters) and the NOT of each. The CONTAINS operator used with the word KEY works as a partition search using only the major part of the key. Thus it functions as a generic search. Searches using repeating groups must use a subscript in the data items declared as repeating in the dictionary. For a specific occurrence of a repeating group, a number subscript is used in the search criteria. The subscript ALL is used when all occurrences must meet the search criteria.

Searching Procedures

Chemistry Subsystem. The chemistry system may be searched in either an on-line or batch mode. Each mode has differing capabilities and uses. The on-line system will perform 4 distinct types of searches: 1) by accession number; 2) by sample number; 3) by full structure, or identity search; 4) simple sub-structure search.

All on-line functions use an IMLAC PDS-4 intelligent graphics terminal. The IMLAC machine contains a small computer with 8K of memory, and a display processor driving a graphics CRT. Structures are represented in the IMLAC by the character set developed for the chemical teletype [6]. This character set allows two-dimensional representation of most structures. Each character, along with the blank, the backspace, the line feed and the reverse line feed, has its own 7-bit ASCII code, and a picture is stored in the IMLAC as a series of such codes. The IMLAC processor is programmed to interpret the codes and draw the corresponding characters on the CRT. The IMLAC is also programmed to allow its operator to enter or edit such pictures with its keyboard. Pictures may also be sent to or received from the host computer. Thus, answers may be displayed to the IMLAC operator by transmitting the ASCII representation of the structure via phone line to the IMLAC.

The IMLAC could in fact be used for primary input rather than the teletype, and in a system with less throughput this would be desirable. The teletypes are used because they are much less expensive, we already owned them, and the operators are familiar with them.
The first three types of on-line retrievals listed above were designed to replace manually searched card files. With the program, the operator may retrieve and display structures by knowing either the accession number, the sample number, or the structure itself. In the case of an identity structure search, the index-sequential design of our connection-table file allows the structure to be retrieved in less than 5 seconds, and the picture to be displayed in another 5 seconds. Since the structure screen is computed from the structure itself, only the structure must be entered for identity queries. The person entering the query need not know the index criteria.

The on-line sub-structure search was intended originally as a tool for testing the system, but has been made available to selected users of the system. Only the simplest searches are allowed: there is no Boolean combination, nor is there any ability to search any information except the structures. (This means, among other things, that the entire data base is eligible for the report, regardless of the privacy attaching to any compound. Since discreet compounds are thus available, access to the program must be restricted.) The full range of special atoms, unspecified bonds, and structure qualifiers is available to the on-line user. Answers are returned by structure and accession number only (not by sample number), each structure being displayed as the user asks for it. If the corresponding sample numbers are desired, they can be obtained by requesting all sample numbers for the given accession number. The usefulness to the WRAIR staff of searching for sub-structures on-line has yet to be determined.
Since the chemistry retrieval system uses the screen and connection table generated by the query to search the screen index file for either a whole structure or a sub-structure, no more information (disk drives) is needed on-line for sub-structure searching than for identity matches, so one is no more costly than the other in terms of disk resources. But the elapsed time needed for such sub-structure searches varies greatly and may be large. The time needed may be kept small if the query is specific and many screen bits are set. Then, because the files are indexed by an abbreviation of the screen, i.e., the partitioning factor, they need not be read completely, thus eliminating most of the input/output time necessary to read the entire file. (Of course additional time will also be saved because such a situation will limit the number of iterative matches needed as well.) The worst situation is a query so general that the entire screen file must be read, and most of it must be searched iteratively. The time for such a case depends largely on the size of the file and the competition for CPU time from other multi-programmed jobs. The estimated worst-case search would require about 35 minutes elapsed time in an otherwise empty machine, for a file of 250,000 structures.

The economies to be gained by searching many sub-structures at once, i.e. running batch searches, are great. They arise chiefly because the time needed to read the file of structures from the disk may be apportioned among the sub-structure queries.
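The role of the screen and its partitioning factor can be illustrated with a minimal sketch. It is not the CIRS implementation: the bit-mask screens, the four-bit partition key, and the set-of-bonds stand-in for the iterative connection-table match are all assumptions chosen for illustration, as are the function names.

```python
# Hypothetical model: each structure carries a bit-mask "screen" and a
# connection table, reduced here to a set of bonded atom pairs.

def partition_key(screen, width=4):
    """Abbreviation of the screen used to index the file (assumed: low bits)."""
    return screen & ((1 << width) - 1)

def build_index(structures, width=4):
    """Group structures into partitions keyed by the abbreviated screen."""
    index = {}
    for accession, (screen, bonds) in structures.items():
        key = partition_key(screen, width)
        index.setdefault(key, []).append((accession, screen, bonds))
    return index

def search(index, query_screen, query_bonds, width=4):
    """Return (sorted hits, number of partitions that had to be read)."""
    query_key = partition_key(query_screen, width)
    hits, partitions_read = [], 0
    for key, bucket in index.items():
        # A partition can hold answers only if it has every query key bit.
        if (key & query_key) != query_key:
            continue  # partition skipped without being read
        partitions_read += 1
        for accession, screen, bonds in bucket:
            # Cheap screen test first: all query bits must be set.
            if (screen & query_screen) != query_screen:
                continue
            # Stand-in for the costly iterative (atom-by-atom) match.
            if query_bonds <= bonds:
                hits.append(accession)
    return sorted(hits), partitions_read

structures = {
    "A1": (0b00110101, frozenset({("C", "N"), ("C", "C")})),
    "A2": (0b00010001, frozenset({("C", "C")})),
    "A3": (0b01110111, frozenset({("C", "N"), ("C", "O"), ("C", "C")})),
    "A4": (0b00110101, frozenset({("C", "C")})),  # passes screen, fails match
}
index = build_index(structures)

specific = search(index, 0b00110101, frozenset({("C", "N")}))
general = search(index, 0, frozenset())
```

In this toy file a specific query skips a whole partition without reading it, while a query with no screen bits set forces every partition to be read and every structure to be matched, which corresponds to the worst case described above.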
The time needed to read the file is larger than the time required by the iterative searching usually needed for searches, and thus the time needed to search for a batch of 10 sub-structures is nowhere near the time needed to do 10 single sub-structure searches. For example, 10 batched general queries of the worst-case type noted above should require only about 45 minutes of elapsed time.

There are two additional reasons for a batch off-line search. First, the search criteria can be expanded to allow searches that are not possible on-line because of time considerations, as in the case of Boolean combinations of simple fragments, or because of available computer resources, as in the case of criteria based on non-structural information that is stored on other files. The second reason for batch searching is based on the need to efficiently handle large numbers of queries and answers which require a source of high-volume hard copy output.

All searches, whether for on-line or batch processing, may be formulated interactively. The structure fragments are generated on the IMLAC terminal and edited on-line. If erroneous, they are immediately returned for correction. Thus the user may define and edit all the questions for a given batch search on-line, and be sure they are at least in the correct syntax. This capacity increases the effectiveness of the batch search by eliminating long delays in turn-around due to input errors. Individual fragments, with no Boolean combinations, may be used for a preliminary search on-line. Thus the user has the capability of formulating a query and looking at a predetermined number of answers. Based on this "preview" the user may then submit the query to the batch search, redefine the search criteria and look at a revised set of answers, or delete the query entirely.
This preview capability may in fact prove to be the most useful function of the on-line sub-structure search.

The batch search also contains the capacity to search by a specified molecular formula. This capability was not included in the on-line search because there was no requirement for it.

Batch search output is printed off-line on a Versatec electrostatic dot matrix printer/plotter. It prints 100 dots per inch and uses 11-inch-wide paper which is easily incorporated into reports.

Biology and Inventory Subsystems. These subsystems are only searched in batch mode. Quick access is provided by periodic COM listings in accession number and sample number sequences. Because of the size of the files, simple queries (i.e. what is the malaria screening data for compound A, or how much of compound B do we have on hand) are most efficiently and economically handled with microfilm. More complex searches and searches requiring interaction among the subsystems require sufficient computer resources to prohibit running during prime time on a multi-programmed computer.

While either the Biology or Inventory subsystems may be searched independently, in practice this is seldom done. More often, information from more than one system must be found in order to answer a query. If the search begins with the Biology system, the SUBSET command provides the interface with the other systems by generating a file of answers which can be used as input to continue the search. For example, a search such as

SUBSET, BIOREC, WHERE, BIOSOURCE = 2300, AND, BIOPGS = G, OR, BIOSOURCE = 2300, AND, BIOPGS = P.

would create a file of biology records where the source of the compounds was 2300 and the type was either P (purchased) or G (gift). This file could then be used, for example, to find additional samples of the same compounds from other sources.

The SUBSET, USING command permits the output of one search to be used as the controlling factor for a second search. A record is read from the using file, the key is extracted from the record and is used to perform a random find on the new input file. Once the record is found the additional criteria, if any, are applied and, if these criteria are met, a new subset file is created from the matched records. For example:

SUBSET, USING, FILA, CIS, WHERE, AMTON = 500 M, AND, ARRAY (ALL) NOT = MM.

The previously retrieved biological data sub-file would be used to query the inventory file. Once matching records (i.e. records with the same sample number) were found, they would be checked for an amount on hand equal to or greater than 500 mg and no shipment to test system MM. If these criteria are met, the matching inventory records are used to create a new SUBSET file. The search may then continue to obtain additional information from the chemistry files. The Report Generator would then process all appropriate SUBSET files according to criteria specified by the user to generate the final report.
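The two commands can be pictured with a short sketch. This is an illustration under stated assumptions, not the CIRS code: records are modelled as Python dictionaries, the inventory file is modelled as a dictionary keyed by sample number, and the function names and the SAMPLE field are hypothetical. The amount test follows the prose description (500 mg or more), and the ALL subscript on the repeating group is rendered with `all(...)`.

```python
def subset(records, where):
    """SUBSET: sequentially read a file and keep the records meeting the criteria."""
    return [rec for rec in records if where(rec)]

def subset_using(using_file, keyed_file, key, where=lambda rec: True):
    """SUBSET, USING: drive a second search from the output of a first one.

    For each record of the using file, extract the key, perform a random
    find on the keyed file, and keep the found record if it meets the criteria.
    """
    matched = []
    for driver in using_file:
        rec = keyed_file.get(driver[key])  # random find by key
        if rec is not None and where(rec):
            matched.append(rec)
    return matched

# Illustrative data only.
biology = [
    {"SAMPLE": "S1", "BIOSOURCE": 2300, "BIOPGS": "G"},
    {"SAMPLE": "S2", "BIOSOURCE": 2300, "BIOPGS": "P"},
    {"SAMPLE": "S3", "BIOSOURCE": 2300, "BIOPGS": "X"},
    {"SAMPLE": "S4", "BIOSOURCE": 1100, "BIOPGS": "G"},
]
inventory = {
    "S1": {"SAMPLE": "S1", "AMTON": 750, "ARRAY": ["AA"]},
    "S2": {"SAMPLE": "S2", "AMTON": 900, "ARRAY": ["AA", "MM"]},
    "S3": {"SAMPLE": "S3", "AMTON": 100, "ARRAY": []},
    "S4": {"SAMPLE": "S4", "AMTON": 800, "ARRAY": []},
}

# Source 2300 and type G, or source 2300 and type P.
fila = subset(
    biology,
    lambda r: (r["BIOSOURCE"] == 2300 and r["BIOPGS"] == "G")
    or (r["BIOSOURCE"] == 2300 and r["BIOPGS"] == "P"),
)

# At least 500 mg on hand, and no occurrence of the repeating group equals MM.
result = subset_using(
    fila,
    inventory,
    "SAMPLE",
    lambda r: r["AMTON"] >= 500 and all(ts != "MM" for ts in r["ARRAY"]),
)
```

Here `fila` holds samples S1 and S2, and `result` holds only S1, because S2 has already been shipped to test system MM.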
Search Strategies

As indicated earlier, most searches require accessing more than one data base. There are no rules governing the sequence in which the different subsystems must be searched. The output from a search of any system can be used, through the SUBSET USING command, as criteria or partial criteria for searching any other subsystem. The outcome of any given search should be the same regardless of which system is used to begin the search. However, the amount of computer time required to complete a given search may in fact depend on the sequence in which the systems are searched. In general it is best to begin a search on the system which will be most restrictive in the number of responses. However, the determination of that system is not always obvious and requires a fair degree of familiarity with the contents as well as the organization of the data bases.

For example, suppose a user wishes to send to test system B all quinoline derivatives that are active in test system A. In addition, test system B requires a minimum quantity of 700 mg to complete the test and the user does not wish to deplete the total inventory supply. The search criteria then are:

Biology subsystem search: SUBSET,BIOREC,WHERE,LABID=TSA.
Inventory subsystem search: SUBSET,INVREC,WHERE,LABID(ALL),NOT=TSB, AND, QUANT, NOT 1000 mg
Chemistry subsystem search: Structure = [structure diagram of the quinoline query, with positions marked Z]

The user must realize that starting this search with the inventory system will not provide a satisfactory answer, since the inventory system is sequenced by sample number. Even though an individual sample may appear to meet the inventory criteria, it is necessary to know the pertinent accession numbers so that inventory data for all samples of a given compound may be compared against the inventory criteria together. That is, if any one sample of a compound with multiple inventory entries has been shipped to TSB, then all samples of the same compound, regardless of shipping data, should be rejected by the search. The search cannot do this until it has the necessary information to correlate sample data.

Similarly, the decision to begin with either of the remaining systems rests on the user's knowledge of the data. That is, if there are a large number of quinolines on the structure file and test system A is a small laboratory with relatively few actives, it would be preferable to start the search with the biology system. If the sub-structure criteria of the chemistry portion of the search are relatively specific then, because of the speed of the search, it may be desirable to start with the chemistry search.

The feasibility of programming a master search which would accept free text search criteria from the user and construct a search strategy is being studied. First, however, more user experience in the development of effective search strategies is necessary.

The Report Generator contains additional options open to the user which are not precisely search criteria.
Because of the proprietary nature of much of the data base, unless the user specifically indicates otherwise, only open (non-proprietary) data will be reported.

It is also possible for the user to specify that all hits on a combined search be reported rather than only matched hits. For example, the chemistry data base could be searched for triazines and the output used to search the biology data base for malaria actives. The user has the option of specifying that only those triazines that show malaria activity be reported, or that all triazines along with any available data showing malaria activity be reported.

The user may specify that only data within a certain range of accession numbers or sample numbers be reported, or that only data received within a specific time span be reported. In addition the user may specify the sequence of the data in the report.
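Two points from this section lend themselves to a small sketch: the need to judge inventory criteria per compound (accession number) rather than per sample, and the report option of listing matched hits only or all hits. The code below is a hedged illustration, not the Report Generator itself; the tuple layout, the function names, and the quantity threshold passed as `min_mg` are assumptions made for the example.

```python
from collections import defaultdict

def eligible_accessions(inventory, test_system, min_mg):
    """Apply inventory criteria to all samples of a compound together.

    inventory: iterable of (accession, sample, quantity_mg, shipped_to_set).
    A compound is rejected outright if ANY of its samples has already been
    shipped to test_system; otherwise its samples holding at least min_mg
    are returned.
    """
    by_accession = defaultdict(list)
    for accession, sample, quantity, shipped in inventory:
        by_accession[accession].append((sample, quantity, shipped))

    eligible = {}
    for accession, samples in by_accession.items():
        if any(test_system in shipped for _, _, shipped in samples):
            continue  # one shipped sample rejects every sample of the compound
        large_enough = [s for s, quantity, _ in samples if quantity >= min_mg]
        if large_enough:
            eligible[accession] = large_enough
    return eligible

def report(chemistry_hits, biology_data, matched_only=True):
    """List chemistry hits with biology data: matched hits only, or all hits."""
    rows = []
    for accession in chemistry_hits:
        data = biology_data.get(accession)
        if data is None and matched_only:
            continue
        rows.append((accession, data))
    return rows

inventory = [
    ("AN1", "S1", 1200, {"TSA"}),
    ("AN1", "S2", 1500, {"TSB"}),  # this shipment rejects all of AN1
    ("AN2", "S3", 1100, {"TSA"}),
    ("AN2", "S4", 300, set()),
    ("AN3", "S5", 400, set()),
]
eligible = eligible_accessions(inventory, "TSB", 1000)

triazines = ["AN1", "AN2", "AN3"]
malaria = {"AN2": "active"}
matched = report(triazines, malaria)
everything = report(triazines, malaria, matched_only=False)
```

Sample S1 would pass a sample-by-sample test, yet compound AN1 is rejected because its other sample went to TSB; this is why the accession numbers must be known before the inventory criteria are applied. In the report, `matched` lists only AN2, while `everything` lists all three compounds with whatever data exist.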


Applications

The CIRS is used by the Division of Experimental Therapeutics in support of a large drug development program. The Retrieval System (as separate from the structure registry and data maintenance system) is routinely used for a variety of functions.

Analysis of Research Proposals. The Army Medical Research and Development Command supports research in the directed synthesis of screening candidates. Twice yearly the Division of Experimental Therapeutics reviews synthesis proposals. The proposed structures in each proposal are entered into the data base in the "XR" series and are also entered as identity searches to determine whether or not the compounds are already on hand and to identify duplication among the proposals. In addition, sub-structure searches for the major classes of compounds proposed are also run. Matches from both the identity and sub-structure searches are then used to query the biology and inventory files. The reviewer is presented with a report sequenced by proposal number providing him with all available information on the availability and activity of all specific compounds and classes of compounds in each proposal. A similar procedure is used by contract monitors to review progress on synthesis contracts and to prevent duplication of effort.

Review of Screening Data. Structures are added to the screening data after the data are received from the laboratories. The screening data simply act as whole structure queries to the chemistry retrieval system.
The report is sequenced by accession number so that data for duplicate samples will be grouped together. Sub-structure searches can be formulated from the report to identify additional samples either on hand or in preparation in those classes of compounds that are interesting. The matches from the sub-structure search can then be used to further query the biology and inventory files to determine whether those additional samples have already been screened and, if so, their activity; if not, the amount available and a possible source for obtaining additional material.

Monitoring of Screening Laboratories. The system is used to determine correlation of activity between primary and secondary screens for the same disease and between basic screens for different diseases. It is also used to determine samples with sufficient quantity for screening when a new test system is being developed. It can be used to monitor ongoing activity. For example, the user could request a report on all samples shipped to a given test system for which no biological data have been received.
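The monitoring report in the last example amounts to finding shipments that have no corresponding biology record. A minimal sketch follows, assuming that shipments and results are each available as (sample number, test system) pairs; that layout and the function name are hypothetical.

```python
def awaiting_results(shipments, biology_results, test_system):
    """Samples shipped to test_system for which no biological data exist.

    shipments, biology_results: iterables of (sample_number, test_system).
    """
    received = set(biology_results)
    return sorted(
        sample
        for sample, system in shipments
        if system == test_system and (sample, system) not in received
    )

shipments = [("S1", "MM"), ("S2", "MM"), ("S3", "AA"), ("S4", "MM")]
biology_results = [("S1", "MM"), ("S3", "AA"), ("S4", "AA")]

overdue = awaiting_results(shipments, biology_results, "MM")
```

Sample S4 appears in `overdue` even though it has data from test system AA, because the comparison is made on the sample and the test system together.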


Conclusion

The chemical-biological data system has been an integral part of the Army's drug development program for over 12 years. The current upgrading of the original system will eliminate the labor-intensive maintenance of manual files, allow closer monitoring of all aspects of the program and provide information in a more timely manner. Those features of the system which are particularly attractive to the end user are structure input and output, machine coding of sub-structure queries, on-line editing of queries, and user-designed reports which provide correlated data from all data files.

References

1. D. P. Jacobus et al., "Experience with Mechanized Chemical and Biological Information Retrieval Systems." J. Chem. Doc., Vol 10, p 135, 1970.

2. J. A. Page, R. Theisen, F. Kuhl, Manuscript in preparation.

3. A. Feldman, "An Efficient Design for Chem Structure Searching. I. The Screens." J. Chem. Info. and Comp. Sciences, Vol 15, No. 3, 1975.

4. L. Hodes and A. Feldman, "An Effective Design for Chemical Structure Searching. II. File Organization." J. Chem. Info. and Comp. Sciences, in press May 78.

5. A. Feldman, "An Efficient Design for Chemical Structure Searching. III. The Coding of Resonating and Tautomeric Forms." J. Chem. Info. and Comp. Sciences, Vol 17, No. 4, 1977.

6. A. Feldman, "A Chemical Teletype." J. Chem. Doc., Vol 13, No. 2, 1973.

7. A. N. De Mott, "Interpretation of Organic Chemical Formulas by Computer", IEEE Spring Joint Computer Conf, 1968, p 61.

RECEIVED August 29, 1978.

Howe et al.; Retrieval of Medicinal Chemical Information ACS Symposium Series; American Chemical Society: Washington, DC, 1978.