
Clinical and Pharmacogenomic Data Mining: 4. The FANO Program and Command Set as an Example of Tools for Biomedical Discovery and Evidence Based Medicine

Barry Robson

IBM Global Pharmaceutical and Life Sciences, Route 100, Somers, New York 10589, and Department of Biostatistics Epidemiology & Evidence Based Medicine, Saint Matthews University School of Medicine, Grand Cayman, Cayman Islands, British West Indies

Received March 19, 2008

The culmination of methodology explored and developed in the preceding three papers is described in terms of the FANO program (also known as CliniMiner) and specifically in terms of the contemporary command set for data mining. This provides a more detailed account of how strategies were implemented in applications described elsewhere, in the previous papers in the series and in a paper on the analysis of 667 000 patient records. Although it is not customary to think of a command set as the output of research, it represents the elements and strategies for data mining biomedical and clinical data with many parameters, that is, in a high dimensional space that requires skilful navigation. The intent is not to promote FANO per se, but to report its science and methodologies. Typical example rules from traditional data mining are that A and B and C associate, or IF A & B THEN C. Much higher complexity rules are needed for clinical data, especially with inclusion of proteomics and genomics. FANO’s specific goal is to be able routinely to extract from clinical record repositories and other data not only the complex rules required for biomedical research and the clinical practice of evidence based medicine, but also to quantify their uncertainty, that is, their essentially probabilistic nature. The underlying information and number theoretic basis previously described is less of an issue here, being “under the hood”, although the fundamental role and use of the Incomplete (generalized) Riemann Zeta Function as a general surprise measure is highlighted, along with its covariance or multivariance analogue, as it appears to be a unique and powerful feature. Another characteristic described is the very general tactic of the metadata operator ‘:=’. It allows decomposition of diverse data types such as trees, spreadsheets, biosequences, sets of objects, amorphous data collections with repeating items, XML structures, and so forth into universally atomic data items with or without metadata, and assists in reconstruction of ontology from the associations and numerical correlations so data mined.

Keywords: Data Mining • Clinical • Genomic • Proteomic • Fano • Mutual Information

Journal of Proteome Research 2008, 7, 3922–3947. Published on Web 08/13/2008. 10.1021/pr800204f CCC: $40.75. © 2008 American Chemical Society.

1. Introduction

1.1. Fano’s Mutual Information. In collaborations with the University of Virginia, St. Matthew’s University School of Medicine, the University of British Columbia, and others, the author has been exploring methods for obtaining rules from patient clinical records and curating them for inference and decision making in clinical practice and research. Though relatively few of the records included patient genomic or proteomic data at the time of the studies, it is evident that such data increase the need for probabilistically weighted rules or other associated measures of certainty/uncertainty. There have been many traditional statistical approaches to the analysis of data as a basis for inferences and decisions with uncertainty.1–11 The measure developed by Fano12 is attractive for measuring the mutual information between specified states, events, observations, or measurements, for example, I(Congestive heart failure; Diabetes complicated; Heart valve expression biomarker #76872). This “rule” as written is Fano’s mutual information function I(A; B), although here extended to an indefinite number of items I(A; B; C;...) (Section 3.1), the value of which is zero (no evidence), positive (evidence for), or negative (evidence against) the association shown. With further explicit inclusion of the negative evidence provided by the contrary predicted state, that is, by including the analogous rule for no congestive heart failure, this accords with decision theory.13–15 Many such rules can be brought together by a process of inference for diagnoses, decision guidance, predictions, and patient risk assessment for certain morbidities (serious diseases). Many data analytic techniques1–36 can and will contribute to Evidence Based Medicine,37 but the above seems particularly persuasive.

1.2. Early Applications. Despite that, before the present series of papers,20–22 the main use in biology of such Fano-based decision theory appears to have been the series of studies13–15 and resulting widely used GOR methods16,17 for

predicting protein secondary structure, in which case the states were the secondary structure of amino acid residues, say α-helical, and their chemical nature, such as alanine. That is, for the most part, some 3–5 and 20 qualitative (categorical, nominal) states, respectively. Here, the rules were called the parameters of the method, and the form of the parameters and the method of use for prediction were essentially “hard-wired” to satisfy the special physicochemical constraints, mainly the vicinity effect in sequence for secondary structure formation, and a limited number of relevant terms in the information theoretic expansion15 of I(residue secondary structure; surrounding sequence). A diagnostic conclusion, selection of best therapy, or prediction of a morbidity as a risk factor is analogous to a prediction of a secondary structure state, but the clinical record is not of the sequence data type, and has no such physicochemical constraints. The limited number of qualitative states, and the relative simplicity and small number of information terms as rules for the problem addressed by the GOR method, are in sharp contrast to the potentially indefinitely large number of states, frequently quantitative, and the greater complexity of interactions between them, in medical records. Further, the inclusion of further rules based on variances between values (covariance, multivariances) is required in medical record analysis if the valuable information in correlations is not to be lost. Clinical data mining thus requires a greater diversity of tactics for practical application. Nonetheless, a feature of GOR of importance for present purposes was the Bayesian estimate of Fano’s measure from what were at first very limited protein structure data, as Expected Information represented by so-called #() functions,15 pronounced “pound” (U.S.) or “hash” (U.K.) functions.
Clinical data, in which, for example, a potential condition such as congestive heart failure is to be predicted, can be framed as a similar and often data-deficient problem. This is especially so as the number of states (the dimensionality or complexity of the problem) increases. Since one wishes to include the rules of highest complexity attainable with available computing power, and there are in practice more rules which have higher complexity (more correctly, according to the binomial distribution), a great deal of the data is always sparse.

1.3. Recent Progress. To overcome this difficulty and provide means of generating and managing the diverse terms needed in the expansion of I(congestive heart failure; medical records), such terms were rendered as products of prime numbers in which each prime number uniquely characterized an argument of the information expression, and number theory was invoked.18,19 This transformed the basic GOR idea into the more general field of data mining.20–23 It also became immediately clear that the # function as a Bayesian expected information measure could be usefully reformulated as the Incomplete (generalized) Riemann Zeta Function24,25 (IRZF, or here simply “Zeta Function”), that is, the ζ(s,n) of number theory. The new parameter s, which #() lacked, relates to the nature of the surprise measure, the most typical choice being s = 1, which yields for expressions with terms #() the information measure in natural units or nats. Parameter n is the amount of observed data such as o[Congestive heart failure & Diabetes complicated & Heart valve expression biomarker #76872]. A further Zeta Function subtracted from it is a function of the expected frequency of occurrence (in the Chi-square test sense) using e[Congestive heart failure & Diabetes complicated & Heart valve expression biomarker #76872]. Other values of s are of interest; for example, s set as an arbitrary large value for

each Zeta Function ζ gives binary logic (values of 0 or 1),20–23 and the reciprocal 1/ζ can be related to the prior probability of uniqueness and matches between s randomly selected records20–23 or even between nucleic acids in the use of expression arrays.24,25

1.4. The FANO Program. The computer program called FANO (or CliniMiner) so developed20–23 generally seems well-suited to high dimensional problems in association and multivariance analysis,23,28–35 and hence to efficient analysis of large patient record archives.36 It thus seems well-suited for Evidence Based Medicine37 and Personalized Medicine with genomic and proteomic biomarkers.38 To achieve computational support for these disciplines, there is actually a further step required after data mining. Many rules with related arguments from the data mining must be brought together in an inference process, the universal best practice for which is not entirely clear,39 with Dirac’s inference system40 from quantum mechanics41 being, in principle at least, the main contender out of many methods (for argument, see refs 39–41). The extension from logic of contrary or negative arguments45,46 to the information theory case, capable of treating those so-called logical fallacies where these are really “can’t say”, is an example of an issue which is not as simple as it may appear if we want to make maximal use of information available. In contrast to categorical and syllogistic logic, that the patient has 99% risk of congestive heart failure is as useful as “will have congestive heart failure”.
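As a minimal sketch of the surprise measure described in Section 1.3, the Zeta Function can be taken as the partial sum ζ(s, n) = 1 + 2^-s + ... + n^-s, with the information for a rule given by the value at the observed count minus the value at the expected count. The function names below, and the restriction to whole-number counts, are assumptions made for illustration only; how FANO itself treats non-integer expected frequencies is not reproduced here.

```python
import math


def incomplete_zeta(s: float, n: int) -> float:
    """Partial sum 1 + 2**-s + ... + n**-s; an empty sum (n = 0) gives 0."""
    return sum(k ** -s for k in range(1, n + 1))


def expected_information(observed: int, expected: int, s: float = 1.0) -> float:
    """Zeta value at the observed count minus zeta value at the expected count.

    With s = 1 the result is in nats and approaches log(observed / expected)
    as both counts grow.
    """
    return incomplete_zeta(s, observed) - incomplete_zeta(s, expected)


if __name__ == "__main__":
    # Observed more often than expected: positive evidence for the association.
    print(expected_information(100, 50), math.log(2))
    # Large s: every non-zero count maps to (almost exactly) 1, i.e. binary logic.
    print(incomplete_zeta(50, 0), incomplete_zeta(50, 10))
```

For sparse data the partial sums differ noticeably from the logarithm, which is the regime in which a measure of this kind matters most.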
Fortunately, it is possible to separate these issues of using rules for inference from the issues relating to generating those rules in the first place, noting that these even without inference have significant interest for discovery of previously unnoted relationships.36 The matters relating to rule generation are less contentious, at least in the sense that the dragon,23 that is, the formidable difficulty, is well-characterized. This is the combinatorial explosion of rules of increasing dimensionality (complexity), for which not only are there many to manage, but the data also become progressively sparser.23 A mere hundred clinical parameters on patient records can potentially imply up to about 2^100, that is, about 10^30, rules,22 and records analyzed have on the order of 100–300 parameters,36 so it is clear that the further addition of potentially hundreds or thousands of genomic and proteomic biomarker data will take a heavy toll.38

1.5. FANO as a Research Result in Itself. To help tackle the above difficulty, and to do research into the best methods of doing so, FANO has a powerful but fairly minimal command set. That command set and the discussions of the consequences of using it are presented here as a research result illustrating the kind of methodology and formalism for control of data mining which is required for research data mining, and for research into data mining, which is a discipline in itself. Despite the variability of approach implied, it helps the mission (of seeking improved methods) that the result is always a well-defined object. It is a table (in both spreadsheet and XML format) of all rules of diverse complexity (dimensionality, number of arguments), and of type. The current types are association, covariance (not confined to complexity 2), and multivariance (high dimensional covariance treated in a different manner involving optimization to fit the data).
They are all presented in the same rule format and coranked by information content from most positive down to most negative. In contrast, the input files are more various, mainly the records to be analyzed, the command file, and an interpreter file that can be modified to handle different domains of applications, for example, medical analysis and market analysis say of

Table 1. input.dat, Extract of a Typical Simple Input File

CRS# (as listed): 7743, 7743, 7741, 7741, 7902, 7902, 7782, 7782, 7753, 7753, 7753, 7753, 7751

white blood cell count (g/L)   hemoglobin (g/L)   hematocrit (L/L)   red blood cell count (T/L)   platelet count (g/L)
5.4                            115                0.35               4                            203
8.8                            107                0.32               3.6                          116
7.6                            144                0.43               4.6                          159

6.4                            125                0.37               4.1                          284
8.5                            69                 0.2                2.2                          89
7.4                            126                0.38               3.9                          234
11.9                           102                0.29               3.1                          91

Entries without a recoverable row or column assignment: 3.7, 1, 0.5; 5.4, 1.3, 0.7; 4.4, 6.6, 3.5; 1.3, 0.4, 2.2; 0.6

... 20 cig./day, to rather similar effect.

2.2.3. Significance of Multiple Metadata. In the example Animal:=Vertebrate:=Primate:=human, Primate is first order metadata, Vertebrate is second order metadata, and so on. While for counting purposes, content to the left of the rightmost ‘:=’ is taken as the single metadata, the multiple metadata structure is preserved through to the rules in output. From the mined relationships, it is possible from FANO output to draw dendrograms or tree diagrams20 from which such multiple metadata structures may be constructed. Vertebrate:=Primate:=human represents one path or section of path up a tree from root to leaf node, and may either be deduced from the dendrogram or assembled to make larger ontologies. When a dendrogram is drawn from items with single or multiple metadata, it represents a process of assembly of a more complex ontology.

2.2.4. Relatively Structured Data. FANO performs relatively structured data analytics, the more structured the better. It does not use prior knowledge or grammar and vocabulary as does text analytics, which is in the field of unstructured data analytics. It does not intrinsically deal with the issue of typographical errors, though the convert.dat file could contain substantial code or links to other software to do this. Nonetheless, given text and any kind of character-based data, FANO makes a “best effort”, while careful reports are produced as a cross-check that the intended action was indeed taken. In occasional accidents in reading wrong files or failing to suppress text content, its suggestive output rules regarding word association make it seem that it is making some effort at text analysis (it could in principle learn through a series of passes, especially if helped by splitting sentences into records with grammatical determiners such as were copresented).
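The treatment of multiple metadata in Section 2.2.3 can be sketched as follows. This is an illustration under stated assumptions, not FANO's code: the operator is taken to be the two-character string ':=', and the helper names are hypothetical.

```python
METADATA_OP = ":="  # assumed spelling of the metadata operator


def split_item(item: str) -> tuple[str, str]:
    """Return (metadata, value).

    For counting purposes everything left of the rightmost operator is
    treated as a single metadata string.
    """
    metadata, op, value = item.rpartition(METADATA_OP)
    return (metadata, value) if op else ("", item)


def metadata_path(item: str) -> list[str]:
    """Root-to-leaf path through the implied tree, value last."""
    return item.split(METADATA_OP)


def metadata_order(item: str) -> dict[str, int]:
    """Map each metadata label to its order (1 = adjacent to the value)."""
    *labels, _value = metadata_path(item)
    return {label: order for order, label in enumerate(reversed(labels), start=1)}


if __name__ == "__main__":
    item = "Animal:=Vertebrate:=Primate:=human"
    print(split_item(item))
    print(metadata_path(item))
    print(metadata_order(item))
```

Keeping the full path alongside the single-metadata view mirrors the statement that the multiple metadata structure is preserved through to the output rules.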
Generally, however, FANO is expecting to see structure imposed by format represented by certain characters (or strings of characters) which are delimiters. If they are absent, FANO sees the content as one giant record, in which it can still count the appearance and even recurrences of items. The delimiters are specified by the contents of the command file (usually called command.dat). Metadata can be present on the input data file written by the user as Weight:=200_pounds, in which case order, and even countable recurrences of the same item, are immaterial in the

record. This is the default: it is overridden by the choice to use an XML or spreadsheet columns-with-headings format. For non-XML input mode, if it is declared that the first line of input is metadata (as is done by the command first line metadata=on), in the manner of column headings on a spreadsheet, then the metadata is automatically prefixed to all the data items in the column so headed. For example, 200_pounds will be prepended by column heading Weight to yield Weight:=200_pounds. Conversion to numeric data values and pooling into states above and below the average, or a specified threshold, takes place only if first line metadata=on. It is suppressed if the facility to use the first line as metadata is switched off by first line metadata=off (which is also the default), and then qualifiers must be added explicitly line by line, for example, entering the item as Age:=40. A common choice of tabular format and interpretation is compatible with spreadsheet applications, in which case first line metadata=on is required. Each line represents a record (patient record extract) with items delimited in comma-separated values (.csv) format. As in the typical spreadsheet application, this format also implies that commas within text in quotes are not treated as separators. Ideally, it is considered the job of other software to process data in spreadsheet-like form, but FANO has in fact a reasonable variety of format handling capabilities, some departing considerably from classical .csv format. Some are somewhat complicated to use in accordance with the complexity of the task that they perform, and may be switched off and hidden in some installations. For example, input could indeed resemble true literary text. This includes the interpretation of records as, for example, corresponding to sentences with words separated by white spaces and/or tabulation.
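The effect of first line metadata=on for spreadsheet-style input can be sketched as below: the column headings are prefixed to every item in their column, and commas inside quoted text are not treated as separators. The function name, the ':=' spelling of the operator, and the choice to leave out empty cells are assumptions of this sketch rather than documented FANO behavior.

```python
import csv
import io

METADATA_OP = ":="  # assumed spelling of the metadata operator


def records_from_csv(text: str) -> list[list[str]]:
    """Turn .csv text with a heading line into records of metadata:=value items."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    headings, *body = rows
    records = []
    for row in body:
        items = []
        for heading, cell in zip(headings, row):
            cell = cell.strip()
            if not cell:
                continue  # empty cell: left out of the record in this sketch
            items.append(f"{heading.strip()}{METADATA_OP}{cell}")
        records.append(items)
    return records


if __name__ == "__main__":
    sample = 'Weight,Name\n200_pounds,"Smith, J"\n,Jones\n'
    for record in records_from_csv(sample):
        print(record)
```

Using the standard csv reader keeps the quoted-comma rule consistent with what a spreadsheet application would export.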
Also a string of text such as a DNA or protein sequence may be considered as, for example, a “patient record extract”, and symbols can be read in chunks of specified length (say, 10 characters at a time as an item), which may also be optionally overlapping, that is, extracting as an item characters 1...10, then 2...11, then 3...12, and so forth. As well as the consideration of delimiters, a certain amount of other processing of the input data characters is done by FANO in order to maintain integrity of what constitutes an item, to render some optional forms (such as patient IDs expressed differently by different data curators) as a common form, and for metadata handling. In the output, in consequence, white space will separate events, and so for readability, all white space which existed within items is converted to underscores. That is, the item high blood pressure becomes high_blood_pressure. Commas within items are converted to dashes “-” (minus sign), which is usually sensible for readability. Any quotes round items including embedded commas are retained.

2.2.5. The Quantitative Preference. There are many ways to represent quantitative data as qualitative, and the converse (see Section 2.4). FANO attempts to see data as numeric whenever possible since this brings the powerful tool of covariance/multivariance to bear. Use of numerical data provides more information in any trend in value which is present, and can be more efficient at least for lower dimensional problems. The “trend” is in relation to values with other metadata. Numerical data must start with a string interpretable as a number, and hence, the easiest way to make numerical data qualitative is to prepend a non-numeric character. Valid numeric characters include for present purposes ‘+’ and ‘-’ and those in scientific notation such as 1.72E-23. 42_years is

read as 42, and 43% is read as 43, not 0.43, unless such conversion is requested on the convert file. Converting numbers to an invalid numeric form, such as 63 to #63, or Years_63, is a task which could also be performed by simple user-supplied code on the convert file (see below). For convenience, a limited number of interpretations and actions related to such matters can be controlled from the input data itself (as an option to specification using commands on the command file). The character pair (% in the first line of metadata has special significance as a useful support: (%10) means that all items prepended with this metadata will be grouped by dividing by 10 and taking the nearest integer. So the item 42 with metadata Age(%10) becomes Age(%10):=4, indicating age group 40–49. In some later FANO versions, the commands data=quantitative and data=qualitative will override the quantitative/qualitative interpretation. This applies to all the data, unless a list of metadata to be uniquely affected is applied, as in data=quantitative=Age,Weight,Height. In addition, a conversion file can render, for example, yes/don’t know/no, or high/normal/low, as +1/0/-1, so allowing quantitative analysis. When the first line is set to represent metadata, or between prespecified delimiters, absence of entry has a special significance. The absence of characters, or a blank or string of blanks, is taken as a “don’t know” or “not available”, which affects statistical normalization appropriately in later calculation. However, a command is available which can make specified characters be interpreted in the same way, such as “?” or “don’t know” (not necessarily in the quotes). The empty entry or blank is the default if this command is not given.

2.2.6. The Qualitative State Preference. This actually coexists and cooperates in parallel with the above quantitative preference.
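The numeric reading rules of Section 2.2.5 can be illustrated with a short sketch. The regular expression and function names are hypothetical. One assumption needs flagging: the grouping below truncates after division so that 40 through 49 share group 4, which matches the stated age group 40–49; a literal "nearest integer" reading would use rounding instead.

```python
import re

# A leading sign, digits with an optional decimal part, and an optional exponent.
_LEADING_NUMBER = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def leading_number(item: str):
    """Return the number an item starts with, or None if the item is qualitative."""
    match = _LEADING_NUMBER.match(item)
    return float(match.group(0)) if match else None


def group_value(value: float, width: int) -> int:
    """Group a value in the manner of a (%width) heading; truncation is assumed."""
    return int(value // width)


if __name__ == "__main__":
    for item in ["42_years", "43%", "1.72E-23", "#63", "Years_63"]:
        print(item, leading_number(item))
    print("Age(%10):=" + str(group_value(42, 10)))
```

Prefixing a non-numeric character, as in #63, is enough to make the first function return None, which corresponds to forcing the item to be treated as qualitative.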
While requesting covariance/multivariance analysis simply means that numerical data is interpreted that way, and qualitative data will be ignored, calling associative analysis, alone or at the same time, means that numerical data is also set up internally for treatment in a qualitative way. In general, qualitative (nominal, categorical) data amenable to association analysis are not amenable to variance analysis techniques, but the converse is not true. Numerical data can also be partitioned into two or more sets, say above and below the mean value, producing association rules for cross-checking with multivariance rules. Two states is the active default, since it keeps down the number of output rules and provides “negative” or complementary states useful in inference. In that case, items which are numeric are pooled about their averages for association analysis, and the quantities above and below the mean are distinguished for reporting covariance analysis. A command can change this option to divide numeric data into two pools separated by an alternative selected threshold.

2.3. File command.dat. As for the input data file, the following responses of FANO to the command file are described in some detail but may be found useful in following the kind of approach that was at least most useful to the program’s creator. The command.dat file or equivalent is formally optional input and, if absent or empty, defaults are assumed. However, it is generally needed in practice, since it is unlikely that every type of input will correspond to all the defaults. Though the number of commands is not large, their combined use is in some cases fairly complex for advanced use, as described separately below. The command file (containing commands and typically called ‘command.dat’) is read from the first line downward. If the command cannot be identified, an exception is reported

and the program halts, while it still issues a well-formed XML report file. With a few intuitive exceptions mainly to do with file manipulation, the order of commands is nonetheless immaterial. An example exception for advanced users is that, in the unusual circumstance that FANO cycles are being checked against previous functioning, or for an older installation environment, FANO is required to work like the original pilot release cycle. To use older commands or current commands with an older action, the older commands should be placed between pilot version=on and pilot version=off (or end of file). Immediately following a command name on command.dat, arguments may be present. If present, they are always preceded by ‘=’. Commands are not case sensitive, though their arguments are in general case sensitive. Example: auditor=Barry Robson. Throughout the use of commands, the usual tolerance of blanks and layout applies as in modern programming languages, but care is required if the argument is a regular expression (a pattern to match), which is presumed to start in the character immediately after the ‘=’. In the absence of an argument, assuming a default argument, the equals sign is not formally required, but when a command is described below, it is usually written ending in ‘=’ as an indication that an argument usually follows. The absence of a command implies a certain default action, which is usually that the argument is =off. Reasonably enough, the use (or use implied as default) of ‘off’ is that it switches off the action specified. This may or may not be the default depending on what is regarded as the common case of usage, and this is usually reasonably intuitive. Off is, for example, concerned with assuming that there are no special adaptations to spreadsheet data and gene or protein sequences or text, or assuming the time-saving approximations for very large data in order to avoid an accidental expensive run.
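The command syntax just described (name, optional ‘=’, argument; case-insensitive names; case-sensitive arguments; an absent command usually meaning off) can be sketched as a small parser. The function names are hypothetical, and two simplifications are assumed: text after ‘#’ is discarded as a comment, and surrounding blanks are trimmed from the argument, so a regular-expression argument that must begin immediately after the ‘=’ is not handled faithfully here.

```python
def parse_command(line: str):
    """Return (name, argument), or None for a blank or comment-only line."""
    line = line.split("#", 1)[0]          # assumed: '#' starts a comment
    if not line.strip():
        return None
    name, sep, argument = line.partition("=")
    name = " ".join(name.lower().split())  # names are not case sensitive
    return name, (argument.strip() if sep else None)


def load_commands(text: str) -> dict:
    """Read a command file top to bottom into a name -> argument mapping."""
    commands = {}
    for line in text.splitlines():
        parsed = parse_command(line)
        if parsed is not None:
            name, argument = parsed
            commands[name] = argument
    return commands


def setting(commands: dict, name: str, default: str = "off") -> str:
    """Argument for a command; the default applies when it is absent or bare."""
    value = commands.get(name.lower())
    return default if value is None else value


if __name__ == "__main__":
    text = "# sample\nFirst Line Metadata=on\nauditor=Barry Robson\n"
    commands = load_commands(text)
    print(commands)
    print(setting(commands, "pilot version"))
```

A mapping suits the statement that command order is mostly immaterial; the order-dependent exceptions noted in the text, such as the pilot version block, would need a list instead.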
Also, commands which might have been set in the operating environment command line can be set to off (i.e., by writing the word ‘off’ to the right of the equals sign =), which is not in general the default state. However, in case of doubt, the command and argument should be explicitly present to give the required action. Usually, there is always at least one argument, though, as indicated above, it may be optional and there will be a default.

2.4. Input File convert.dat. Essentially, the convert.dat file defines an automated process for curation of the input data. It is described in more detail with a worked example in Appendix. The convert.dat file is optional control, and if absent or empty, the data input is read as-is (subject however to any standard, basic in-FANO conversions due to the commands in the command.dat file). One need for curation is that it is not unusual to find inconsistencies of notation in digital records of one institution even within a presumed tidied tabular form, for example, blanks or zeros meaning a zero value in one column and an unknown in another. The convert.dat file’s further related functions are to customize the program to be able to better process data typical at the installation site, to standardize data like dates which may not be entered in a consistent manner, to ensure that “don’t know” entries are correctly interpreted from a different indicator of unknowns used at the site, to convert to numeric equivalent forms of data where desirable and possible, and not least conversely to change brief or uninformative codes for entries into forms which are more generally readable and commonly understood when read in the output. A simple example of data conversion is that true/don’t know/false might be converted to 1/0/-1,

or Female/Male to 0/1. As noted above, FANO will generally be able to work quite well without this file, but obviously, certain data will then not be accessible as anything more than categorical data for association analysis. For example, in the absence of conversion to a (here) required form in fractional “medical years” such as 2008.5832, dates and times will stay or be forced to be qualitative. This would then not allow the date and time stamping that FANO may otherwise be able to use in multivariance analysis. The name convert.dat is the default name, which can be changed by a command on the command file. Also, for more efficient use, file convert.dat can be replaced by a subroutine intrinsic to FANO by specifying the subroutine name in a command on command.dat. For example: conversion source=convertproject23.dat. But if the argument contains the word ‘subroutine’, as in conversion source=subroutine convertproject23, the corresponding subroutine is used instead. The contents of convert.dat are really a kind of Perl script (to curate the input data). Any commands in this conversion file must be Perl (unless embedded or called Java, etc.) but with restriction on use of variable names representing information which is passed to and from the main program. As for the command.dat file, the symbol # can be used to “comment out”, that is, treat as comment or effectively inactive instructions, the code on the rest of the line (that usage is also valid Perl). Special variables such as $set (the data) and $qualifier (the metadata) are often simply used to make the ultimate rules in the output more readable (see Table 2). Executable statements containing these are, in simple cases, the only lines of code on the convert.dat file. $set and $qualifier have been named to be easily recognizable, as in command $set = 1, which looks like “set equal to 1”, and easily modifiable by those with little or no programming experience.
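The two kinds of conversion mentioned above, categorical values to numbers and dates to fractional years, can be sketched as follows. The real convert.dat is a Perl script; this Python version is only an illustration, and its names are hypothetical. The fractional-year function relies on the standard library calendar, which handles leap years but not leap seconds, and it assumes a timezone-aware input.

```python
from datetime import datetime, timezone

# Value tables taken from the examples in the text; keys are compared in lower case.
NUMERIC_EQUIVALENTS = {
    "true": 1, "don't know": 0, "false": -1,
    "female": 0, "male": 1,
}


def convert_value(value: str):
    """Return the numeric equivalent of a categorical value, or the value unchanged."""
    return NUMERIC_EQUIVALENTS.get(value.strip().lower(), value)


def fractional_year(moment: datetime) -> float:
    """Express a timezone-aware moment as a year with a decimal part (UTC based)."""
    moment = moment.astimezone(timezone.utc)
    start = datetime(moment.year, 1, 1, tzinfo=timezone.utc)
    end = datetime(moment.year + 1, 1, 1, tzinfo=timezone.utc)
    return moment.year + (moment - start) / (end - start)


if __name__ == "__main__":
    print(convert_value("True"), convert_value("Female"), convert_value("blue"))
    print(fractional_year(datetime(2003, 6, 21, 14, 30, 55, tzinfo=timezone.utc)))
```

Dividing by the actual length of the year keeps leap and non-leap years on the same 0 to 1 scale, so durations computed as differences stay comparable across years.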
$unknown is the set symbol or regular expression for indicating an unknown (“don’t know”) entry, while assigning a string of characters or regular expression to array $uninteresting means that any rules with metadata or data item at least partially matching it will be pruned out from associative and multivariate analysis. Relatively readable examples are $set = $unknown if $qualifier =~ /AI571206/; meaning set the data item as an unknown entry if the metadata is AI571206, and so remove it from analysis, or alternatively $uninteresting[1] = 'AI571206'; which allows the data to be included for analysis but filtered out as uninteresting. In contrast, $interesting[1] is available, which ensures that in certain random sampling processes items with the specified data or metadata are selected. Again, anything which transforms these variables can either modify the incoming data and/or metadata accordingly or, in the case of $uninteresting, the action of FANO on that data. A mode can be set to make a much broader range of FANO variables accessible to the actions of this file and vice versa. This confers enormous power on the convert.dat file while making it risky if care is not taken in complex cases, primarily when the user is not the FANO programmer. If this is used and there is doubt about the action of a variable, then variables not intended to impact or be impacted by FANO behavior should definitely be declared local (“my” in Perl) to subroutines on convert.dat. Paradoxically, this open approach between FANO and convert data actually assists modularization, since a great deal can be done without editing the main program code. This is also consistent with the philosophy of FANO code being fixed and the code on convert.dat being special to

projects. Such openness of variables to use remains nonetheless well-known as an undesirable programming practice. To restate and elaborate on a point made above, the typical conversions by convert.dat are (1) qualitative to quantitative, in which qualitative data is converted to numeric data, so making it accessible to covariance analysis, and (2) quantitative to qualitative, in which numerical data is converted to a description, such as low/normal/high for a laboratory result. The former is the usual preferred direction, because of the access it provides to covariance analysis while retaining access to association analysis. It allows pooling in association analysis of quantitative data into sets representing ranges (see below). Hence, values like true/false should always be converted to binary 1/0, or true/don’t know/false to trinary +1/0/-1. If particular data are dominated by two states, then it is found desirable to rename the states 0 and 1 and the minority states 0, at least in initial analysis. A combined approach is also possible. Some standardization operations on convert files or subroutines can be quite complex, as shown in the latter half of the example in Appendix. A variety of date formats such as 5/10/2003 (U.S.) can be converted, if not all. Time should typically be in 14:30:55 or 14:30 format. If in doubt, the user should adopt the standard universal time Unix form 06-21-2003 14:30:55. Calendar and clock data under a metastate containing the word “date” are converted to “Real Biological Time”, “Standard Medical Time”, or “biochrons” or just “chrons”, which are years expressed with decimal parts (hence millichrons, etc.), reflecting GMT, and allowing for leap years and leap seconds (yes, they do exist). A typical time is something like 2003.567435907, and a patient who is say 49 will typically be something like 49.035673493 chrons old at the moment of any test. Thus, the duration of a medical event is not affected by crossing a U.S.
state boundary, for example. This capability resides on the convert.dat file since it is a conversion feature. FANO does not know where in the world the data came from, so the user has to supply GMT times or keep to a single time zone of choice. If an error occurs in the convert.dat file, the script on it is not obeyed, but notification is given and the run proceeds to facilitate debugging. This is because most uses are to do with convenient annotation or data preprocessing, the failure of which is not necessarily lethal, and an initial “practice run” with essential reporting is useful even if convert.dat fails.

2.5. Input Program File execute.dat.prex. This is basically an external data setup program that can work alone or be activated by FANO. Whereas convert.dat is usually obeyed in a FANO environment, execute.dat.prex provides an alternative environment in which convert.dat curates and otherwise processes input.dat. It may also contain a user interface to facilitate the setting up of the contents of the convert.dat file by nonprogrammers. Such preprocessing and curation of data by convert.dat under the control of execute.dat.prex rather than by FANO eliminates the need for processing to occur at FANO run time, thus making the FANO run more efficient for large tasks. However, execute.dat.prex can also be activated by a command from FANO prior to the main analysis, in which case curation of input.dat takes place after FANO starts, but still prior to the main part of the analysis run, not during it, as if execute.dat.prex had been called prior to FANO. Thus, only the limited number of variables used above can be shared with FANO, and obviously no variables can be shared during the main analysis. For FANO to call this external program, the command execute file must appear in the command file, or

Table 2. command.dat, Sample Control File

execute file= followed by another program name (execute.dat.prex being the default). Compare the command evaluate=, which executes not an external script but only the line of Perl code following the equal sign, at command time. No external file is called if execute file= is not used.

2.6. Output File FANO.xml. See the example of Table 3. This is the main “output results and report” file in an integrated system, though a separate spreadsheet file FANO.csv with the data mined rules is also output, and that would be the convenient main output for a personal computer. The output of importance is the ranked associations as rules and/or multivariances as rules. The output extract shown in Table 4 is a relatively small scale run from the first paper in this series,20 though as some of the reports indicate, a small number of records with relatively few parameters can still generate much output. For examining these, FANO.xml is not easily readable by eye compared with the alternative spreadsheet output, but it also contains further extensive data about the run, including a reproduction of the command and conversion file contents, and at least the four top records of input data. The FANO.xml file provides XML which can be processed for display or subsequent analysis. The file should contain well-formed XML with a correctly declared FANO: tag name prefix. This is so even for exceptions, which are presented as and tags. On rare occasions, an error can also prematurely terminate writing of the XML. In such a case, the error report goes to the screen, commencing with the description “BAD XML”. In very severe cases when FANO aborts with interpreter/compiler error reports, an output file trakfano.dat should be examined for information.

2.7. FANO.csv. See the example of Table 4.
This is sometimes referred to as the “uniform format file” because it summarises all the results in the same “rule-like” format, where information (the quantity of information assessed as being in the rule) is analogous to “rule strength”. The example shows the upper left quadrant, and a few lines of output only. In general, there are always at least two events (arguments); the number is given by “complexity”. The heading “type” refers to the statistical approach which gave rise to the measure. The headings “saw” and “expected” are the number of actual and expected observations in the case of associations, and the effective corresponding number for covariances (shown) and multivariances (not shown), treating the covariances, loosely speaking, as if they were associations for “fuzzy sets”.

3. The FANO Commands

3.1. The Surprise Function. These commands relate to setting the surprise function, conditional on finite data, which is most commonly, but not necessarily, a measure of information. The most general form of a rule is S(A; B; C;...), which is the surprise function with arguments A, B, C, and so forth. These arguments represent states, events, observations or measurements represented by a data item with possible metadata associated with it, as in smoker, or Age:=>60.

3.1.1. Set Riemann Zeta s=. The parameter can be of any numeric value, real or complex, but is typically 1. This is the core of the data mining technique. A “rule” in information theoretic terms is calculated as a general “surprise” measure:

$$S_s(A;B;C;\ldots) = \zeta(s,\, o[A\,\&\,B\,\&\,C\,\&\ldots] + a) - \zeta(s,\, e[A;B;C;\ldots] + a) \tag{1}$$

Here o[A & B & C &...] is the observed number of times that A and B and C, and so forth occur together, and e[A; B; C;...] is

Journal of Proteome Research • Vol. 7, No. 9, 2008

Table 3. FANO.xml^a

the expected number of times in the Chi-square sense, and specifically here

$$e[A;B;C;\ldots] = N^{-c+1}\, o[A]\, o[B]\, o[C] \ldots \tag{2}$$

where c is the complexity, that is, the number of arguments A, B, C,..., and N is the total amount of data, according to normalization requirements such that, for example, the classical (“frequentist”) estimated probability would be P(A) = o(A)/N, P(B) = o(B)/N, and so on. Parameter a is described below, and is typically a = 0. The Incomplete (generalized) Riemann Zeta Function is given by

$$\zeta(s,n) = 1 + 2^{-s} + 3^{-s} + 4^{-s} + \ldots + n^{-s} \tag{3}$$

The function can handle absence of data (N = 0) and negative values. These latter can arise because positive and negative quantities can be added to the natural frequencies to express the contribution of prior information, to get the final value of o[ ] and e[ ] as arguments. It can also formally handle fractional values of e[ ] such as 6.23, by an integration rather than series sum for eq 3. However, FANO simply interpolates to achieve these values. For the expected value of Fano-like mutual information15,20–23 for any size of data, large or small, then s = 1, and $\zeta(s,n) = 1 + 2^{-1} + 3^{-1} + 4^{-1} + \ldots + n^{-1}$, so

$$I(A;B;C;\ldots) = \zeta(s=1,\, o[A\,\&\,B\,\&\,C\,\&\ldots] + a) - \zeta(s=1,\, e[A;B;C;\ldots] + a) \tag{4}$$

^a FANO has no graphical representation of its own and writes XML to allow the user to provide his or her own, or to integrate FANO into a larger system in which output from other data analytic software, or medical images, may be displayed at a portal. For most routine purposes, the alternative spreadsheet output (Table 4) suffices. Extract of output from top and bottom of file from a simple association analysis run, in this case with “unstructured” data records which are “open” to new classes of data (and potentially to data with new metadata) and are of varied length. This requires first line metadata=off. It relates to biomarker studies described in the first paper in this series,20 and such a file was used to generate the dendrogram (“tree diagram”) clustering morbidities and markers for syndromes. Simple biomarkers are reported for patients with the morbidities indicated. By “unstructured” data is meant a bag or collection data type. Such data is not tabular and the order is arbitrary, and data items can appear more than once per record, such as recurrences of a disease such as pancreatitis or reappearance of a biomarker from an expression array result (they could be date stamped: see text).

is written as the “rule” as an information function about the association of A, B, C,... and can be positive, zero, or negative in units of nats (natural logarithmic units, in comparison to bits or binary logarithm units). Some advantageous points may be made regarding the importance of the surprise function as an information measure when data is very sparse. Strictly speaking, for each Zeta Function ζ with s = 1 there is an offset from equivalence to logarithms by a value between zero and the small Euler-Mascheroni constant γ = 0.5772..., but this cancels, or almost so, in all current applications, since one Zeta Function for the expected frequencies is subtracted from another for the observed frequencies. In any event, the author avoids the concept of error here by taking the position that it is really the surprise measure S() discussed below that matters, which in the FANO context behaves as “very log_e-like” for s = 1. Most importantly, it is functions in terms of this measure which directly arise as the Bayesian expectation of information functions15 over the possible interpretations of the data. For large data, given large values for o[A], o[B], o[C],..., for s = 1 then $S_s(A;B;C;\ldots) = I(A;B;C;\ldots) = \log_e P(A\,\&\,B\,\&\,C\ldots) - \log_e P(A) - \log_e P(B) - \log_e P(C) - \ldots$, consistent with Fano’s definition with probabilities determined as classical (“frequentist”) estimated probabilities P(A) = o(A)/N, P(B) = o(B)/N. Noting that such a classical definition of probability behaves poorly as an estimate for smaller values and fails completely for a frequency of zero (where log_e[o(A)/N] takes negative infinity!), the above seems a very satisfactory position to take for both sparse and large data. It is also a pleasing position of definition, because it is consistent with 0 nats for zero frequencies (no relevant data at all), 1 nat for the information gained in the very first observation (frequency = 1) for a hypothesis, and -1 for the first against that hypothesis.
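As an illustration of eqs 1 to 4, the following Python sketch evaluates the surprise measure for sparse and for large counts. It is a minimal sketch under stated assumptions, not FANO's code: the function names are hypothetical, fractional arguments are reached by linear interpolation (the text says only that FANO interpolates), and negative arguments and N = 0 are not covered.

```python
import math


def incomplete_zeta(s, n):
    """Eq 3: 1 + 2**-s + ... + n**-s, with n = 0 giving an empty sum.

    Fractional n is reached by linear interpolation between neighbouring
    integers (an assumption of this sketch). Negative n is not covered.
    """
    if n < 0:
        raise ValueError("negative arguments are outside this sketch")
    whole = math.floor(n)
    total = sum(float(k) ** (-s) for k in range(1, whole + 1))
    return total + (n - whole) * float(whole + 1) ** (-s)


def expected_frequency(single_counts, n_total):
    """Eq 2: e[A;B;C;...] = N**(1 - c) * o[A] * o[B] * o[C] ..."""
    c = len(single_counts)
    return float(n_total) ** (1 - c) * math.prod(single_counts)


def surprise(observed, expected, s=1.0, a=0.0):
    """Eq 1; with s = 1 this is the information of eq 4, in nats."""
    return incomplete_zeta(s, observed + a) - incomplete_zeta(s, expected + a)


# Sparse data: no data, first observation for, first observation against.
sparse = [surprise(0, 0), surprise(1, 0), surprise(0, 1)]

# Large data: the measure approaches log_e(observed / expected).
large = surprise(2000, 1000)
log_ratio = math.log(2000 / 1000)
```

The three sparse cases give 0, 1, and -1 nats, as described in the text, while the large-data value differs from the logarithm of the ratio only in the fourth decimal place.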
The s parameter has several interesting effects and interpretations, one being that it sets a continuous path of options ranging from observed minus expected frequencies, through information convergent with increasing n to the difference between log_e observed and log_e expected frequencies, down

Table 4. FANO.csv, Example Spreadsheet Output File^a

information  complexity  type         saw     expected
2.1          2           association  10      0.82
2.06         2           association  15      1.5
2.04         3           covariance   195.63  25
2.04         3           covariance   195.68  25
2.04         3           covariance   196.95  25
2.04         3           covariance   196.47  25
2.02         3           covariance   192.79  25
2.01         3           covariance   191.37  25

event1 column (entries as far as recoverable, in row order): LDH(LACTATE_DEHYDROGENASE_U/L):=low-0.25; Echocardiogram:=>-0.97; Blood_transfusion:=>-0.35; Lobectomy_or_pneumonectomy:=>-0.99; Echocardiogram:=>-0.97
event2 column (entries as far as recoverable, in row order): UPROT:=low-0.71; Peripheral_vascular_disorder:=>-0.79; Echocardiogram:=>-0.97; Physical_therapy_exercises_manipulation_etc:=>-0.22; Electrocardiogram:=>-0.78

^a Here, the input is structured metadata from a spreadsheet with column headings such as Hypertension, in comparison with the relatively unstructured data of Table 1. Either yields rules which can be expressed on a spreadsheet, however, since the new columns relate to rule features, not the original medical metadata.
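The “information”, “saw”, and “expected” columns of Table 4 can be related through eq 1. The Python sketch below recomputes the information column from the other two, assuming s = 1, a = 0, and linear interpolation of the zeta function for fractional arguments; the interpolation scheme and the truncation to two decimals are assumptions chosen for this illustration, not documented FANO behavior.

```python
import math


def incomplete_zeta(s, n):
    """Eq 3, with linear interpolation for fractional n (an assumption)."""
    whole = math.floor(n)
    total = sum(float(k) ** (-s) for k in range(1, whole + 1))
    return total + (n - whole) * float(whole + 1) ** (-s)


# (information, saw, expected) as printed in Table 4.
TABLE_4 = [
    (2.10, 10, 0.82),
    (2.06, 15, 1.5),
    (2.04, 195.63, 25),
    (2.04, 195.68, 25),
    (2.04, 196.95, 25),
    (2.04, 196.47, 25),
    (2.02, 192.79, 25),
    (2.01, 191.37, 25),
]


def truncate(value, places=2):
    """Cut a value to a fixed number of decimals without rounding."""
    factor = 10 ** places
    return math.floor(value * factor) / factor


recomputed = [
    truncate(incomplete_zeta(1.0, saw) - incomplete_zeta(1.0, expected))
    for _, saw, expected in TABLE_4
]
printed = [information for information, _, _ in TABLE_4]
```

Under these assumptions the recomputed values agree with the printed column, which illustrates how effective frequencies for covariances are passed through the same measure as counted frequencies for associations.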

to logic. When Riemann t = 0 (see below), set s = 0 to obtain the difference between the frequencies of abundance of observed and expected events. Alternatively, one may set s = 2 to obtain the second intrinsic moment of the information, and so on, or set s = 1000 or some other arbitrarily large number to give a result of eq 1 which can be -1, 0, or +1, that is, ternary logic, in which the answer must be -1, 0, or +1 nats. This means effectively, “yes”/“don’t know”/“no”. In addition, a reciprocal probability 1/[1 - P(s,n)], which is a kind of pseudofrequency reminiscent of many Evidence Based Medicine metrics, is known to relate to ζ(s,n). It relates to the a priori number of records or rule incidences on records (i.e., examples of the rule which contributed to the rule’s evaluation), that is, number n(s,n) = s, which have all items identical.22,23 Three related commands setting the value of s are accessible for advanced research purposes.

3.1.2. set Riemann zeta t=. Set the imaginary component t of the Riemann zeta parameter as (here, e.g.) 1. Currently available for s > 1 and with experimental code for s > 0.

3.1.3. use real part=. May be on/off. Do/do not use the real part in ranking. Ranking takes the absolute of the complex number (s,t) set with set Riemann zeta t=.

3.1.4. set zeta interpolation=. May be on/off. Do/do not interpolate the value of the zeta function for real number arguments; if off, the nearest integer is used instead. The standard Zeta Function called by FANO is valid for real (and complex) values of s of 1 or more. A lower value calls the more accurate, computationally more intense, and rather more general algorithm for ζ(s,n), with the algebra given in a previous paper,20 which is valid for s of 0 or more. It has allowed more general exploration of S(s,n) (generalizing eq 1) as a surprise measure.
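The two extremes of the s parameter described above can be checked numerically. The following Python sketch uses integer counts, a = 0, and a real s only; the function names are hypothetical and the complex parameter t is not modeled.

```python
def incomplete_zeta(s, n):
    """Eq 3 for a non-negative integer n."""
    return sum(float(k) ** (-s) for k in range(1, n + 1))


def surprise(observed, expected, s):
    """Eq 1 with a = 0."""
    return incomplete_zeta(s, observed) - incomplete_zeta(s, expected)


# s = 0: every term of the series is 1, so the measure is
# simply observed minus expected.
difference = surprise(12, 5, s=0)

# Very large s: only the k = 1 term survives numerically,
# leaving one of the three values -1, 0, +1.
supported = surprise(12, 0, s=1000)
undecided = surprise(12, 5, s=1000)
opposed = surprise(0, 5, s=1000)
```

With s = 0 the result is the plain difference of counts (here 7), while with s = 1000 the same function returns +1, 0, and -1 for the three cases, which is the ternary logic referred to in the text.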
In addition, there is a certain symmetry of s and n in that functions of reciprocals of ζ(s,n) relate to the probability that all of s records will be novel in content, and hence to the a priori probability of obtaining matches.22,23 A further example of basic research is the exploration of the effect of n in ζ(s = t + 0.5, n) on the nontrivial zeros of the Riemann Zeta Function at a real value of s = 0.5, which may have possible significance in biology for the stable states of complex systems and, even more relevantly, for certain classes of complex inference networks where stable solutions are sought. The result is that the location of the zeros is remarkably stable down to ca. n = 3. Another, more relevant here, is classifying the reliability of sets of rules with respect to those that retain the same rank order from most positive to most negative, even though the values of the


information change, except in cases where approximations are involved, notably from nonexhaustive, statistical (“Monte Carlo”) sampling (see below). Surprisingly, the rank order of rules is rather insensitive to s of 1 or more; the circumstances when exceptions arise are now understood.22 For values below 1, the rank order is not preserved in general, but strong features invariant on certain aspects of sampling method do retain rank order. This leads to a convenient empirical classification in which rank order retained for 1 or more is deemed reliable, and all those retained for lower values of s are considered strong. Rules altered in their role by changing s for values of 1 or greater, including ideally their role in subsequent inference, are called nonrobust or fragile.

3.1.5. set prior=. Set the value of parameter a in eq 1. This parameter represents an absolutely prior probability density and is the same for observed and expected frequencies of occurrence. This is a number added to the number n of observed or expected events. If e[A; B; C;...] = 0, then this parameter is seen to represent some kind of expected frequency of abundance based on intuition. FANO does not allow the user to have a different value of a for observed and expected information functions, though the position is philosophical and could be easily changed in the program. Large positive values of a will represent a strong belief that observed and expected frequencies of occurrence are the same, and much data is required to overcome this and return significantly nonzero information values. Normally, the recommended argument is zero (the default), but there is also a theoretical argument due to Wilks,3 University of Cambridge and others that using -1 is more correct, because when adding up all the underlying probabilities for more complex events to deduce the number of simpler events, the prior beliefs should be added up too, with bizarre results otherwise.
It corresponds to Dirichlet’s “-1” prior density (i.e., reciprocals of probabilities).

3.2. The Major Modes of Association Analysis. Association addresses the relative probability or information with which states, events, measurements, or observations will come together on the same record. It relates to numbers of times things occur and occur together, not the trend in numerical data, which is a matter of multivariate analysis (see below). Association relates to the Chi-square test, and has the same notion of observed and expected frequency, except that the test only indicates that associations depart from expectations on a random basis somewhere within the table, and does not distinguish positive and negative association. In contrast, association analysis pinpoints the key “players”, and the exact relations,

measured both in sign and in strength, are pinpointed and presented as rules. The individual items like smoker, Weight:=>250 pounds, are coded as prime numbers in FANO.22 Any record or rule is a composite number as the product of such, and number theoretic functions are used to manage the combinations that must be counted, to introduce time saving procedures.22 Management tasks such as assigning the lowest primes to the most abundant items are almost all “under the hood”, done automatically by FANO. However, making the analysis practical does require consideration of various types of segmentation of the records, an element of random sampling in high complexity cases with at least a double pass, and so on, described in detail in Sections 3.3 and 3.4. Fortunately, the time performance issues in a number theoretic approach are those in any approach, so the choice of tactic is governed by finding the limit of what is achievable with the available current resource, and then working up through the approximations to raise that limit.

3.2.1. associations=. Set associations on/off, for all associations of pairs, triplets, and so forth such as (A,B), (A,B,C), and so forth, from complexity 2 up to indefinite complexity. Association in FANO is the tendency for things to occur on the same records more, or less, than expected. The usual reason for using this command is as associations=off when one wants to do only covariances and/or multivariances. That is because association analysis can cover both qualitative (e.g., text) and numerical data. When studied by associations, numerical data is pooled above and below the mean, or above and below a specified value set by partition state=. This pooling greatly reduces the number of types of state A, B, C,..., and so forth.
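The pooling of numerical data into two states can be sketched in Python as follows. This is an illustrative sketch only: the function name, the treatment of unknown values as None, the placing of values equal to the cut in the lower state, and the operator string passed as a parameter are all assumptions made here, not FANO behavior.

```python
def pool_about_mean(metadata, values, threshold=None, operator=":="):
    """Replace numeric values by one of two states, above or not above a cut.

    None marks an unknown value and is passed through unchanged. Values
    equal to the cut go to the lower state (an arbitrary choice here).
    """
    known = [v for v in values if v is not None]
    # Use the mean of the known values unless a cut is given explicitly.
    cut = sum(known) / len(known) if threshold is None else threshold
    states = []
    for v in values:
        if v is None:
            states.append(None)
        else:
            sign = ">" if v > cut else "<"
            states.append(f"{metadata}{operator}{sign}{cut:g}")
    return states


ages = [30, 50, None, 70]
pooled = pool_about_mean("Age", ages)
```

However many distinct ages appear in the input, only two state labels result, which is what keeps the number of candidate association rules manageable.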
For example, many possible entries such as Age:=0, Age:=1,..., Age:=120 are processed into just two states Age:=>43.2 and Age:=<43.2. [...] 40 (with metadata/qualifier “Age:=” added in preparation of the input file), pollen allergy (without metadata), measles (without metadata), broken leg (without metadata) and so on. The number of occurrences of broken leg in a population might be of interest, including counting separate incidences for the same patient. Even for such unstructured records, one may sometimes wish to ignore duplicates. For immunological reasons, for example, our interest may be in seeing the patient as having had a pollen allergy or influenza at least once, but the number of times is irrelevant. In such a case, duplicates should be switched off. The default if the command is not present is ‘on’.

3.3. The Two Major Modes of Multivariance Analysis. These commands are for quantitative data only, such as Weight:=200 pounds, correlating it say with height in meters. The relationship between associations and covariances is that the associations merely signal a match 1 or mismatch 0 between data, which can be qualitative or quantitative, while covariances show degrees of difference or deviation and reflect a positive, zero, or negative trend. Covariance may involve correlating variance between more than two parameters, and FANO reserves the term multivariance for high complexity multiparameter correlation. Note that FANO usually uses the convention that just one set of parameters running contra to each other makes the correlation negative, since that is appropriate to inference using the rules.

3.3.1. covariances=. Set covariances on/off. The lower dimensional forms are analogous to simplets in associations (see above), so that covariances=off is analogous to simplets=off, and the maximum complexity of the simplets is analogously defined by maximum correlation complexity=.
However, the maximum value would normally be 3, as there is, for various reasons discussed below, a natural break point between complexity 3 and 4. The approach to 4 and higher is somewhat experimental and subject to ongoing experimentation, as high dimensionality is the crux of the challenge to analysis of medical data with multiple genomic and proteomic parameters.


Incidentally, the analogy breaks down with simplets=off in any event, in that the manner of generating multivariance rules is not random sampling as for associations, but a very different approach. However, see Note Added in Proof. Covariances, unlike associations, reflect any common trend in values in the same (positive covariance) or opposite (negative covariance) direction. FANO makes them look like associations, so that their information content can be estimated and they can be coranked with associations. The means to do this is based on a fuzzy set argument. As the starting point for explanation of this, recall that the classical covariance (bivariance) measure is of the form (subject to required normalization)

$$F(X,Y) = \frac{\sum_i (X_i - \langle X \rangle)(Y_i - \langle Y \rangle)}{\sqrt{\sum_i (X_i - \langle X \rangle)^2 \cdot \sum_i (Y_i - \langle Y \rangle)^2}} \tag{5}$$

An analogous multivariate description is

$$F(X_1,X_2,\ldots) = \frac{\sum_i \prod_j (X_{i,j} - \langle X_j \rangle)}{\sqrt{\prod_j \sum_i (X_{i,j} - \langle X_j \rangle)^2}} \tag{6}$$
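The measure of eq 6 can be written directly in Python. The sketch below takes the data as a list of columns (one list of values per parameter), applies no normalization beyond that shown in eq 6, and does not guard against a constant column, which would make the denominator zero; the function name is hypothetical.

```python
import math


def multivariance(columns):
    """Eq 6: sum over records of the product of deviations from the mean,
    divided by the square root of the product of the summed squares."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    numerator = sum(
        math.prod(col[i] - m for col, m in zip(columns, means))
        for i in range(n)
    )
    denominator = math.sqrt(
        math.prod(sum((x - m) ** 2 for x in col) for col, m in zip(columns, means))
    )
    return numerator / denominator


rising = multivariance([[1, 2, 3, 4], [2, 4, 6, 8]])
falling = multivariance([[1, 2, 3, 4], [8, 6, 4, 2]])
```

For two columns this reduces to eq 5, that is, to Pearson's correlation coefficient, so perfectly correlated and perfectly anticorrelated columns give +1 and -1, respectively.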

As mentioned above and for reasons discussed below, FANO normally calculates covariances directly only for complexities 2 and 3, and higher complexities by a different method. For explanation and comparison of these methods, eq 6 is best replaced in the following general way. Each component is assigned a coefficient c(i) which would be 1 if the component contributes 100%, and 0 if 0% (see ref 21), as follows:

$$F(X_0,X_1,\ldots,X_B) = \left\{ \sum_{a=0,1,2,\ldots,A} \; \prod_{b=0,1,2,\ldots,B} \left| z(a,b) \right|^{c(b)} \right\}^{1/\left(\sum_b f(c(b))\right)} \Big/ \; Z(X_0,X_1,\ldots,X_B) \tag{7}$$

Say each a is a particular choice of an array from a set of A identical but noisy expression arrays and b one of B biomarkers, or one aspect of such, materially represented as a fluorescent spot. Here z(a,b) is the classical z-value $z(a,b) = [X_{a,b} - \langle X_b \rangle]/\sigma_b$, which in turn is equal to the classical t value t(a,b)/(n - 1) for n - 1 degrees of freedom. FANO is normally set to handle bivariate (complexity 2) and trivariate (complexity 3) cases directly, and what is meant is that this is by using the above equations in a manner similar to Pearson’s classic bivariate correlation coefficient. Here this means that f(c(b)) = c(b) and Z = 1. The resulting value, however, is conceptually based on the z-value of the classical z-test: in considering complexity C, it is analogous to a one-dimensional z-value to the Cth power, or more correctly a product of C z-values. Bivariance and trivariance raise fewer issues than higher multivariance in regard to normalization and sensitivity, that is, the ability to yield a value suggestive of a strong correlation between a few parameters against the background noise of many other parameters correlating randomly. Nonetheless, the temptation to take a general encompassing treatment (without a “break point” in treatment between 3 and 4), and to introduce commands to allow choice of flexibility of f(c(b)) and Z, is huge. Indeed, experimental versions of FANO have explored use of commands to redefine these options. The function f(c(b)) could for example be meaningfully set as f(c(b) + k) and f(|c(b)| + k), and also the normalizing term Z. After all, such alternatives reflect a matter of user choice. Different choices could be used to make the measures variously analogous to (a) multivariate z, t, or different powers and roots of


Pearson-like measures such as the correlation coefficient (which is the square of Pearson’s basic measure), or to (b) no less valid new measures for highly multivariate data which have appropriate sensitivity to detect a few correlating parameters among various amounts of data for many random ones. For example, divisor Z of eq 7 could be defined as the same formula as the numerator, except that it is the result that would be obtained if it is run on perfectly correlated data, say with all columns 0,1,2,3,..., and so forth. Equally, the reference data could be purely random, or show degrees of correlation/randomness, in which case the final results represent correlation relative to some specified degree of correlation/randomness. Even for complexities 4 and higher, however, the principal and current version usually simply takes f(c(b)) = c(b) and Z = 1, on the grounds that issues of normalization and sensitivity are largely matters of human (or at least statistician) taste that tend to evaporate given the following consideration. In multivariate mode (higher than complexity 2 and 3), FANO seeks to consider initially all parameters. In effect, it makes a starting attempt at assuming significance of a “giant rule” which is as complex as the entire record of every patient. It then seeks to optimize the fit to the data of F as a function of the coefficients (see next Section 3.3.2), irrespective of how it may be normalized and sensitized. In this optimization, F is constrained to lie between -1 and +1. When values of a coefficient approach zero, that parameter is in effect “edited out”, implying a lower complexity rule. The principal difference from the classical approach is not the choice of normalization and sensitivity, but the use of weighting coefficients as power terms c.
The departure is (arguably) not heuristic but simply set up in such a way as to enable optimization of its form to be consistent with the data, by minimization as a function of the coefficients (see Methods). In addition, eq 7 could be used with fixed values of the coefficients to introduce the effects of weightings of different data (typically, different columns), and this includes the case of setting the coefficients as 0 or 1 to edit the contribution out or in, respectively. See discussion of multivariance below. For variances between data of any complexity, FANO calculates the so-called effective frequencies of covariance. What is required is an effective observed frequency Eo[ ] and an effective expected frequency Ee[ ] so that these can then be applied to eq 1, as if they were observed frequencies o and expected frequencies e, respectively. By analogy with eq 1, there are a number of possible methods1 including those based on using effective frequency Ntot(1 + F(X0, X1,...))/2. Here Ntot is the number of data items other than unknown qualified by the same metadata. In the version discussed here, we proceed differently. The effective expected frequency is in practice calculated first, and used to define the effective observed frequency.

$$E_e[X_1,X_2,\ldots,X_k] = N_{tot} \cdot p_1 p_2 \ldots p_k \tag{8}$$

Here the probability-like measure p is the proportional size of the state above the mean in each column or set of metadata. It reduces in practice to the proportional frequencies as the number of items, with specific metadata, above the mean, compared with the total frequency. That is, it reduces to the same use of expected frequencies as for association when the same numerical data is partitioned into two states above and below the mean. This option is selected by covariance prior=on. In the default, equivalent to using covariance prior=off, the

simple initial approach is taken assuming absence of prior knowledge, in the sense that the partitioning for each state is “50:50”. That is, p_i = 0.5 for all i = 1, 2,..., k and thus E[o(X1, X2,...)|D] = Ntot × (0.5)^k, where k is the number of columns or metadata. It is equivalent to assuming that the numerical values are randomly distributed around the mean. The so-called effective observed frequency is then given by

$$E_o[X_1,X_2,\ldots] = \begin{cases} E_e[X_1,X_2,\ldots,X_k] \times \left(1 - \sqrt{|F(X_1,X_2,\ldots)|}\right), & F(X_1,X_2,\ldots) < 0 \\ E_e[X_1,X_2,\ldots,X_k] + \left(N_{tot} - E_e[X_1,X_2,\ldots,X_k]\right) \times \sqrt{|F(X_1,X_2,\ldots)|}, & F(X_1,X_2,\ldots) \ge 0 \end{cases} \tag{9}$$

For analysis of records in spreadsheet format, Ntot corresponds to the number of records which do not have “don’t knows” for that metadata. The effect of the above is to “pivot” the resulting measure around the expected value so as to give appropriate values close to the limits (0, Ntot) and to the expected value (lying between 0 and Ntot). Note the test (condition) F(X1, X2,...) < 0 in eq 9. The sign of the correlation value is meaningful for bivariance and trivariance, but for high complexity data it is of little meaningful interest in routine statistical use, which represents another reason for a discontinuity in method between complexity 3 and 4. The resulting value depends on the number of anticorrelating sets of parameters: odd numbers of negatively correlating parameters yield a negative value, otherwise a positive one. FANO takes a step toward resolving the problem simply by a matter of definition: if just one parameter anticorrelates with any other, the result is negative. Even so, this causes conceptual problems regarding the conversion of correlations to analogous association form. For example, weight and height correlate positively, and weight and cholesterol level may correlate positively, and height and cholesterol level might correlate negatively.
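The pivot of eqs 8 and 9 around the expected value can be sketched in Python as follows. The function names are hypothetical, and the optional list of proportions stands in for the covariance prior=on case; passing no proportions gives the 50:50 default described above.

```python
import math


def effective_expected(n_total, k, proportions=None):
    """Eq 8: Ntot * p1 * p2 * ... * pk.

    With no proportions given, the 50:50 partition is assumed,
    giving Ntot * 0.5**k.
    """
    if proportions is None:
        proportions = [0.5] * k
    return n_total * math.prod(proportions)


def effective_observed(n_total, expected, f):
    """Eq 9: pivot about the expected value according to the sign of F."""
    root = math.sqrt(abs(f))
    if f < 0:
        return expected * (1 - root)
    return expected + (n_total - expected) * root


expected = effective_expected(200, 3)
limits = [
    effective_observed(200, expected, -1.0),  # full anticorrelation
    effective_observed(200, expected, 0.0),   # no correlation
    effective_observed(200, expected, 1.0),   # full correlation
]
```

The three limiting cases return 0, the expected value, and Ntot, which is the pivoting behavior that allows the result to be passed to eq 1 as if it were a counted frequency.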
So what does the analogous association rule mean as the information measure I(Weight; Height; Cholesterol)? The clue is that the words Weight and so forth are metadata without values. What is meaningful and analogous to association is I(Large Weight; Large Height; Low Cholesterol), I(Low Weight; Large Height; High Cholesterol) and so on. See for example the discussion of partition state= in Section 3.4.11. FANO thus curates such rules to analogous association ones by, for example, I(Height:=>av, Weight:=>av, Cholesterol:=>av), where av is the average (mean) value relevant to each case, that is, a value such as 205 pounds.

3.3.2. high dimensional covariance=. As a means to facilitate comparison with the treatment of low dimensional (low complexity) covariance, the purpose of this command was introduced above. It sets high dimensional covariance either to off, normally confining analysis of variance between data to complexity 2 and 3, or to two arguments value%(precision)=value(iterations), which implies that all higher complexity possibilities are considered. For example: High dimensional covariance=0.01%=500. The synonym is multivariance=. As described above, eq 7 can be optimized as a function of coefficients c(i) to fit the data for all metadata (i.e., in spreadsheet terms, every column) where that data is quantitative.21 This command (when not set to off) invokes an extended simplex-like global minimization technique47,48 for rough surfaces with discontinuous derivatives. The first “%” argument is the accuracy required for convergence of the multivariance strength on a 0...0.100% scale. The default is 0.0001%. This process of convergence will be repeated the number of times

shown by the final parameter (above, 500), the default being 100. Although a fresh random start is taken, some information is taken from the preceding passes to try to find a global solution, first assuming that the minima located may follow a trend. The rules are reported as type multivariate. If the effective expected or observed frequency is less than one, a so-called “reduced multivariate” frequency is reported, which is computed by dividing by the complexity. The reason is that the effective expected frequency provides some indication of the kind of data levels which we need to obtain in order to verify and really explore the correlation, and clearly rounding to zero (the normal action for frequencies less than one) obscures the insight which would otherwise be obtained. Since many islands of internal correlation can exist as “eigensolutions” without correlating with each other, different runs might locate them as different, but equally valid, results (though the strength of the correlation, and hence the information content, may be different). Thus FANO contains the capability to repeat many optimization runs with “fresh” starts. As discussed elsewhere,21 the kind of optimization problem being attacked here is nontrivial and has the same challenges as the notorious protein folding problem. Thus, it is common practice to remove many parameters from records to focus on others and so reduce the difficulty of the task. Happily, however, the same methodology provides a means to do this. Discovering what parameters have little correlation with the rest of the data is mathematically much less difficult than identifying specific combinations of specific parameters with strong correlations. The coefficients c(i) approaching zero in optimization indicate those parameters which do not correlate with any other one or more columns. This is extremely useful for eliminating metadata (in spreadsheets, columns of data) from subsequent analysis.
This means that the normal procedure for refining the determination of correlations will be to repeat many runs with different selected metadata. This is much less often the case with associations, and it also suggests combining early feedback from the direct treatment of quantitative data truly as associations, by pooling quantitative data into ranges (typically above and below the mean or a selected value, as discussed above).

3.3.3. optimization history). If high dimensional covariance) is applied, this shows the optimization history in the XML output file FANO.xml. Represented in a section separate from the ranked rules, it reports on the success or otherwise of the optimization, the degree of convergence, and the minima located. It allows the user to see readily any other correlations detected between items and the rest of the data, which, compared with the final result of optimization, will represent weaker but nonetheless equally valid “eigensolutions”.

3.4. Fine Control of Association Sampling Efficiency. These commands seek to render analysis tractable when it might otherwise be too demanding on processing time and/or memory. They relate largely to focusing on certain types of data or rule, or impose limitations or sampling approximations which are required for efficiency, as follows.

3.4.1. maximum number of items per event). This sets the maximum complexity allowed for association rules, both by exact, exhaustive sampling and by random sampling (see below). Usually, the argument of this command is equal to or greater than that set by the sister command maximum exhaustive association complexity) below, which specifies the limit for exhaustive sampling. Rules up to a complexity of typically 5-6 are usually calculated exactly, by counting all events in all records, exhaustively. Higher complexity rules will be calculated statistically by “Monte Carlo” sampling. The command maximum number of items per event) may operate on chunks of a record, as set by maximum items per record sample) and described in the section on sampling. If the further command simplets)off is used to suppress exact calculation of simplets, they will not be calculated by statistical sampling either. Keeping down the complexity, where possible and appropriate, is a key tactic. Certainly, when records are very long, one important thing to do is to try an argument of 5, and decrease this until there are no machine memory problems. At first inspection, 5 seems rather small. After all, a complex disease such as cardiovascular disease might involve some 80 genetic and other factors. However, the number of rules which can be generated from the complexity C is very close to 2^C, which can quickly become astronomic. FANO versions use several tricks, also described below, to tackle this difficulty.22 In principle, any argument greater than 1 is allowed for this command, but a value above 10 will bring into play a method which exploits number theory using actual divisions of very large numbers, and calculation can be slow. For 10 and below, number theory was used behind the scenes, but rather to set up “hard wired” combinatorial code which avoids the large number divisions. In future versions, it is likely that combinatorially generating code in this way can be made a feature at FANO run time. One simple and effective trick for reducing huge combinations of events, not of that sophisticated type, is possible when numeric data is analyzed (both as association and as variance between data, namely, covariance and multivariance). Data is pooled above and below the mean, or above and below a specified value defined by partition state). This pooling greatly reduces the number of types of state.
For example, many possible entries such as Age:)0, Age:)1,..., Age:)120 are processed into just two states, Age:)>43.2 and Age:)<43.2.
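The pooling of numeric data into two states can be sketched as follows. This is an illustration under stated assumptions, not FANO's implementation: the function name `pool_two_states` and the sample ages are hypothetical, and the handling of a value exactly equal to the cut point (placed here in the lower state) is an arbitrary choice that the text does not specify.

```python
from statistics import mean


def pool_two_states(name, values, cut=None):
    """Replace each numeric value by one of two items, above or below a cut.

    If no cut is given, the mean of the values is used. A value equal to
    the cut is placed in the lower state (an assumption of this sketch).
    """
    if cut is None:
        cut = mean(values)
    return [f"{name}:){'>' if v > cut else '<'}{cut:g}" for v in values]


ages = [20, 35, 43, 50, 68]           # hypothetical sample; mean is 43.2
items = pool_two_states("Age", ages)  # pooled about the mean
fixed = pool_two_states("Age", ages, cut=40)  # pooled about a chosen value
```

Five distinct raw entries collapse into two pooled states, which is what keeps the number of candidate combinations manageable when such columns enter higher complexity rules.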
