Ind. Eng. Chem. Res. 1998, 37, 2215-2222
Automatic Classification for Mining Process Operational Data X. Z. Wang* and C. McGreavy Department of Chemical Engineering, The University of Leeds, Leeds LS2 9JT, U.K.
In process plant operation and control, modern distributed control and automatic data logging systems create large volumes of data that contain valuable information about normal and abnormal operations, significant disturbances, and changes in operational and control strategies. These data have tended to be underexploited for a variety of reasons, including the large volume and lack of effective automatic computer-based support tools. This paper considers a data mining system that is able to automatically cluster the data into classes corresponding to various operational modes and thereby provide some structure for analysis of behavioral responses. The method is illustrated by reference to a case study of a refinery fluid catalytic cracking process.

Introduction

In process plant operation and control, modern distributed control and automatic data logging systems collect large volumes of data that contain valuable information about both normal and abnormal operations as well as the effects of significant disturbances and changes in operational and control strategies. The volume of data is generally so large and the data structure too complex for it to be used for characterization of behavior by manual analysis. However, recent developments in machine learning techniques offer the capability of extracting knowledge from such data that can be used for process operational decision support in fault detection and diagnosis, gross error detection, and data reconciliation, as well as control performance monitoring. Back-propagation neural networks (BPNN) with a limited number of hidden neurons have been employed to learn human operators' control skills from the operational records of skilled process operators (Kuespert and McAvoy, 1994).
There are now also techniques for feature extraction, dynamic trend interpretation, as well as pattern recognition (Stephanopoulos and Han, 1996; Dong and McAvoy, 1996; Bakshi and Stephanopoulos, 1996; Dai et al., 1995; Joshi et al., 1995; Janusz and Venkatasubramanian, 1991; Cheung and Stephanopoulos, 1990a,b; Whiteley and Davis, 1992a, 1994; Wang et al., 1996). Marlin (1995) has divided process monitoring and diagnosis into two levels: the immediate safety and operations of the plant, usually monitored by plant operators, and the long-term performance analysis, monitored by supervisors and engineers. The emphasis of the reported work has been on supporting the operators' level and dealing with real time data, such as dynamic trend interpretation, real time data reconciliation, and pattern recognition for fault diagnosis. Little attention has been paid to the use of the data accumulated over days, weeks, and even months of operation as a vast multivariate information source for analysis of operational problems (Santen et al., 1997). The accumulated history enables features to be identified that can help supervisors and engineers monitor the long-range performance of the plant to identify opportunities for performance improvement and causes of poor operation. The problem in doing this analysis is the overwhelming volume of multivariate data, which is a problem not only for process operations but for every sector where database technologies are used, including process design, business process, safety, and research. Indeed, the problem of “data rich but information poor” has been widely recognized.

* To whom all correspondence should be addressed. Fax: +44 113 2332405. Telephone: +44 113 233 2427. E-mail: [email protected]

In recent years, there have been significant developments in automating data analysis, originally in the computer industry and then in the space, telecommunication, business, and marketing industries. A research community has since developed under the label of data mining and knowledge discovery in databases (KDD). Data mining and KDD combine technologies of database management and information modeling, artificial intelligence, and domain knowledge to provide a framework for classification and clustering, regression, dependency modeling, link analysis, sequence analysis, as well as summarizing (Fayyad and Simoudis, 1997; Fayyad et al., 1996). Successful practical applications have been reported that often involve combined use of data warehouse and data mining technologies, which can be adaptations of traditional approaches such as principal component analysis or newly developed technologies such as Bayesian graph learning (Buntine, 1996). This paper illustrates how a Bayesian automatic classification method (which is discussed in the next section) developed by NASA (Cheeseman and Stutz, 1996; Cheeseman et al., 1989, 1988; Hanson et al., 1997) can be applied to the analysis of operational data of a refinery fluid catalytic cracking process. The purpose is to explore the applicability of data mining techniques to computer-aided process operational decision support systems. In the following, the Bayesian automatic classification approach will be briefly introduced first.
Then, the application of this approach to a refinery fluid catalytic cracking process for analyzing operational data will be presented. The approach is also compared with neural network approaches.

AutoClass: Bayesian Automatic Classification

The approach can be described as follows: given a number of instances (sometimes called cases, observations, samples, objects, or individuals), each of which is
S0888-5885(97)00620-9 CCC: $15.00 © 1998 American Chemical Society Published on Web 03/24/1998
Table 1. Example of Data Structure (rows: instances x1, x2, ..., xi, ..., xn; columns: attributes)
described by a set of attributes, devise a classification scheme for grouping the objects into a number of classes such that instances within a class are similar, in some respect, but distinct from those from other classes. The techniques can be divided into two types: supervised classification, which means assigning a new instance to one of an existing set of possible classes, and unsupervised classification, which means identifying the classes from a given set of unclassified instances, which includes determining the number and their descriptions. The automatic classification approach used here is based on unsupervised Bayesian classification developed by NASA (Cheeseman and Stutz, 1996; Cheeseman et al., 1988, 1989; Hanson et al., 1997), which finds the most probable set of class descriptions given the data and its prior expectations. This approach has several advantages over other methods. First, the number of classes is determined automatically. Second, data can be real or discrete and missing values of attributes are allowed. Third, instances are not assigned to classes absolutely, instead with a certainty degree measure. Last, the approach makes the classification by taking all attributes into consideration, permitting uniform consideration of all the data. Overview of Bayesian Classification. AutoClass is based on Bayes’ theorem, which is a formula for combining probabilities. Given observed data D and a hypothesis H, this theorem states that the probability that the hypothesis explains the data p(H|D) (called the posterior probability of the hypothesis given the data) is proportional to the probability of observing the data if the hypothesis were known to be true p(D|H) (the likelihood of the data) times the inherent probability of the hypothesis regardless of the data (p(H), the prior probability of the hypothesis). Bayes’ theorem is commonly expressed as
$$p(H \mid D) = \frac{p(H)\, p(D \mid H)}{p(D)} \tag{1}$$
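Equation 1 can be illustrated with a small numerical sketch. The two hypotheses and all probability values below are hypothetical and chosen only to show how the normalizing constant p(D) arises as the sum of the prior-times-likelihood products; the function name `posterior` is likewise an illustrative choice.

```python
def posterior(priors, likelihoods):
    """Return p(H|D) for each hypothesis from p(H) and p(D|H), as in eq 1."""
    joint = [p_h * p_d_h for p_h, p_d_h in zip(priors, likelihoods)]
    evidence = sum(joint)  # p(D): the normalizing constant
    return [j / evidence for j in joint]

# Hypothetical values for two competing classification hypotheses H1 and H2.
priors = [0.5, 0.5]          # p(H1), p(H2)
likelihoods = [0.02, 0.08]   # p(D|H1), p(D|H2)
post = posterior(priors, likelihoods)
print(post)
```

Because only the ratio of the products matters, the ranking of the hypotheses is unchanged if p(D) is never computed.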
For the classification purpose, the hypothesis H is the number and descriptions of the classes from which the data D is believed to have been drawn. Given D, we are to determine H to maximize the posterior p(H|D). For a specific classification hypothesis, calculation of the likelihood of the data p(D|H) involves a straightforward application of statistics. The prior probability of the hypothesis p(H) is less transparent and is taken up in the next section. Finally, the prior probability of the data, p(D), in the denominator of eq 1 need not be calculated directly; it can be derived as a normalizing constant or ignored as long as we seek only the relative probability of hypotheses.

Application to Classification. AutoClass is concerned with classification given a data matrix in the form of Table 1, in which the n rows represent instances and the m columns represent attributes. For instance, the attributes of a database about the operation of a
fluid catalytic cracking process may include reaction temperature and pressure, regeneration temperature and pressure, and oxygen and carbon monoxide contents in flue gas. The attributes are allowed to be real numbers, such as 510 °C, as well as discrete values, such as manual, auto, or cascade modes of a controller. Given a database like Table 1, AutoClass is used to determine the number of classes and descriptions of each class. The fundamental model of AutoClass is the classical finite mixture model of Everitt (1981) and Titterington et al. (1985). This is a two part model. The first part is the probability of an instance being drawn from a class $C_s$ ($s = 1, \ldots, k$), which is denoted $\lambda_s$. Each class $C_s$ then is modeled by a class distribution function, $p(x_i \mid x_i \in C_s, \theta_s)$, giving the probability distribution of attributes conditional on the assumption that instance $x_i$ belongs to class $C_s$. These class distributions are described by a class parameter vector, $\theta_s$, which for a single attribute normal distribution would consist of the class mean, $\mu_s$, and variance, $\sigma_s^2$. Thus, the probability of a given datum coming from a set of classes is the sum of the probabilities that it came from each class separately, weighted by the class probabilities:

$$p(x_i \mid \theta, \lambda, k) = \sum_{s=1}^{k} \lambda_s\, p(x_i \mid x_i \in C_s, \theta_s) \tag{2}$$
It is assumed that the data are unordered and independent of each other given the model. Thus, the likelihood of measuring an entire database is the product of the probabilities of measuring each object:

$$p(x \mid \theta, \lambda, k) = \prod_{i=1}^{n} p(x_i \mid \theta, \lambda, k) \tag{3}$$
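The mixture density of eq 2 and the database likelihood of eq 3 can be sketched for a single real valued attribute as follows. This is a minimal illustration, assuming normal class distributions; the class probabilities, means, standard deviations, and temperature readings are hypothetical. The likelihood is evaluated as a sum of logarithms, which is equivalent to the product in eq 3 but avoids numerical underflow.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_density(x, lambdas, mus, sigmas):
    """Eq 2: class densities weighted by the class probabilities lambda_s."""
    return sum(l * normal_pdf(x, m, s) for l, m, s in zip(lambdas, mus, sigmas))

def log_likelihood(data, lambdas, mus, sigmas):
    """Logarithm of eq 3: sum of log mixture densities over all instances."""
    return sum(math.log(mixture_density(x, lambdas, mus, sigmas)) for x in data)

# Hypothetical two-class model of a reaction temperature (deg C).
lambdas = [0.7, 0.3]
mus = [510.0, 495.0]
sigmas = [2.0, 4.0]
data = [509.2, 511.0, 496.5, 510.4]
ll = log_likelihood(data, lambdas, mus, sigmas)
print(ll)
```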
For a given value of the class parameters, we can calculate the probability that instance i belongs to a class using Bayes’ theorem:
$$p(x_i \in C_s \mid x_i, \theta, \lambda, k) = \frac{\lambda_s\, p(x_i \mid x_i \in C_s, \theta_s)}{p(x_i \mid \theta, \lambda, k)} \tag{4}$$
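The membership probability of eq 4 can be sketched in the same spirit. The example assumes a single attribute with normal class distributions, and the parameter values are hypothetical; the denominator is the mixture density of eq 2.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def memberships(x, lambdas, mus, sigmas):
    """Eq 4: probability that instance x belongs to each class."""
    weighted = [l * normal_pdf(x, m, s) for l, m, s in zip(lambdas, mus, sigmas)]
    total = sum(weighted)  # the mixture density of eq 2
    return [w / total for w in weighted]

# Hypothetical two-class model; 503.0 lies between the two class means.
w = memberships(503.0, [0.7, 0.3], [510.0, 495.0], [2.0, 4.0])
print(w)
```

An instance lying between two class means receives a nonzero probability for both classes, which is the sense in which the classes are not assigned absolutely.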
These classes are “fuzzy” in the sense that even with perfect knowledge of an object's attributes, it will be possible to determine only the probability that it is a member of a given class. The problem of identifying a mixture is broken into two parts: determining the classification parameters for a given number of classes and determining the number of classes. Rather than seeking an estimator of the classification parameters (i.e., the class parameter vectors, θ, and the class probabilities, λ), we seek their full posterior probability distribution. The posterior distribution is proportional to the product of the prior distribution of the parameters p(θ,λ|k) and the likelihood function p(x|θ,λ,k):

$$p(\theta, \lambda \mid x, k) = \frac{p(\theta, \lambda \mid k)\, p(x \mid \theta, \lambda, k)}{p(x \mid k)} \tag{5}$$
The pseudo-likelihood p(x|k) is simply the normalizing constant of the posterior distribution, obtained by normalizing (integrating) out the classification parameters; that is, in effect, treating them as “nuisance” parameters:
$$p(x \mid k) = \iint p(\theta, \lambda \mid k)\, p(x \mid \theta, \lambda, k)\, d\theta\, d\lambda \tag{6}$$
To solve the second half of the classification problem (i.e., determining the number of classes k), we calculate the posterior distribution of the number of classes k. This distribution is proportional to the product of the prior distribution p(k) and the pseudo-likelihood function p(x|k):

$$p(k \mid x) = \frac{p(k)\, p(x \mid k)}{p(x)} \tag{7}$$
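Once approximate values of the pseudo-likelihood are available, eq 7 reduces to a normalization over the candidate values of k. The sketch below assumes that log p(x|k) has already been approximated for each k; the numbers are hypothetical, as is the uniform prior. Working in logarithms with the log-sum-exp device keeps the calculation stable when the likelihoods are very small.

```python
import math

def posterior_over_k(log_prior, log_evidence):
    """Eq 7 in log space: p(k|x) is proportional to p(k) p(x|k)."""
    log_joint = {k: log_prior[k] + log_evidence[k] for k in log_evidence}
    top = max(log_joint.values())
    # log p(x), computed by log-sum-exp to avoid underflow
    log_px = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {k: math.exp(v - log_px) for k, v in log_joint.items()}

# Hypothetical approximations of log p(x|k) for k = 3..6 and a uniform prior p(k).
log_evidence = {3: -1250.0, 4: -1238.5, 5: -1236.0, 6: -1239.0}
log_prior = {k: math.log(0.25) for k in log_evidence}
post_k = posterior_over_k(log_prior, log_evidence)
best_k = max(post_k, key=post_k.get)
print(best_k, post_k)
```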
In principle, we can determine the most probable number of classes by evaluating p(k|x) over the range of k for which our prior p(k) is significant. In practice, the multidimensional integrals of eq 6 are computationally intractable and we must search for the maximum of the function and approximate it about that point.

AutoClass Attribute Model. In AutoClass it is assumed that attributes are independent in each class. This assumption permits an extremely simple form for the class distributions used in eq 2:

$$p(x_i \mid x_i \in C_s, \theta_s) = \prod_{j=1}^{m} p(x_{ij} \mid x_i \in C_s, \theta_{sj}) \tag{8}$$

where $\theta_{sj}$ is the parameter vector describing the jth attribute in the sth class $C_s$. AutoClass models for real valued attributes are Gaussian normal distributions parametrized by a mean and a standard deviation, and thus, $\theta_{sj}$ takes the following form:

$$\theta_{sj} = \begin{pmatrix} \mu_{sj} \\ \sigma_{sj} \end{pmatrix}$$

The class distribution is thus as shown in eq 9:

$$p(x_{ij} \mid x_i \in C_s, \mu_{sj}, \sigma_{sj}) = \frac{1}{\sqrt{2\pi}\,\sigma_{sj}} \exp\left[-\frac{1}{2}\left(\frac{x_{ij} - \mu_{sj}}{\sigma_{sj}}\right)^{2}\right] \tag{9}$$

Search Algorithm. As mentioned earlier, AutoClass breaks the classification problem into two parts: determining the number of classes and determining the parameters defining them. It uses a Bayesian variant of Dempster and Laird's EM (expectation and maximization) algorithm (Dempster et al., 1977) to find the best class parameters for a given number of classes (the maximum of eq 5). To derive the algorithm, we differentiate the posterior distribution with respect to the class parameters and equate with zero. This process yields a system of nonlinear equations that hold at the maximum of the posterior:

$$\hat{\lambda}_s = \frac{W_s + w' - 1}{n + k(w' - 1)}, \qquad s = 1 \ldots k \tag{10}$$

$$\frac{\partial}{\partial \theta_s} \ln\left[ p(\hat{\theta}_s) \prod_{i=1}^{n} p(x_i \mid \hat{\theta}_s, x_i \in C_s)^{w_{is}} \right] = 0, \qquad s = 1 \ldots k \tag{11}$$

where $w_{is}$ is the probability that the datum $x_i$ was drawn from class s (previously given by eq 4) and $W_s$ is the total weight in class $C_s$:

$$w_{is} = p(x_i \in C_s \mid x_i, \hat{\theta}, \hat{\lambda}), \qquad W_s = \sum_{i=1}^{n} w_{is}$$

To find a solution to this system of equations, we iterate between eqs 10 and 11 (treating w as a constant) and eq 4 (treating λ and θ as constants). On any given iteration, the membership probabilities are constant, so eq 11 can be simplified by bringing $w_{is}$ through the derivative, giving:

$$\frac{\partial}{\partial \theta_s} \ln p(\hat{\theta}_s) + \sum_{i=1}^{n} w_{is} \frac{\partial}{\partial \theta_s} \ln p(x_i \mid \hat{\theta}_s) = 0, \qquad s = 1 \ldots k \tag{12}$$

Thus far, our discussion of the search algorithm has been related to a general class model with an arbitrary $\theta_{sj}$. We now apply eq 12 to the specific AutoClass model of eqs 8 and 9. For real valued attributes, the equations for the updated $\hat{\mu}_{sj}$ and $\hat{\sigma}_{sj}$ are a function of the prior information and the empirical mean, $\bar{x}_{sj}$, and variance, $\sigma_{sj}^2$, of the jth attribute in class $C_s$, weighted by $w_{is}$:

$$\bar{x}_{sj} = \frac{\sum_{i=1}^{n} w_{is} x_{ij}}{W_s}, \qquad \sigma_{sj}^2 = \frac{\sum_{i=1}^{n} w_{is} x_{ij}^2}{W_s} - \bar{x}_{sj}^2$$

The update formulas are then:

$$\hat{\mu}_{sj} = \frac{w' \bar{x}'_j + W_s \bar{x}_{sj}}{w' + W_s}, \qquad s = 1 \ldots k,\; j = 1 \ldots m \tag{13}$$

$$\hat{\sigma}_{sj}^2 = \frac{w' (\sigma'_j)^2 + W_s \sigma_{sj}^2}{w' + W_s + 1} + \frac{w' W_s (\bar{x}'_j - \bar{x}_{sj})^2}{(w' + W_s)(w' + W_s + 1)}, \qquad s = 1 \ldots k,\; j = 1 \ldots m \tag{14}$$
Equations 10, 13, and 14 do not, of course, give the estimators explicitly; instead they must be solved using some type of iterative procedure (Ayoubi and Leonhardt, 1977). Perhaps the simplest approach to maximum likelihood estimation of the parameters is that suggested by Wolfe (1969), which is essentially an application of the EM algorithm (Dempster et al., 1977). AutoClass, by contrast, uses Bayesian parameter estimation through a Bayesian variant of the Dempster and Laird EM algorithm. Initial estimates of the $\lambda_s$, $\mu_{sj}$, and $\sigma_{sj}^2$ parameters are obtained by one of a variety of methods (Everitt and Hand, 1981), and these are then used to obtain first estimates of $p(s \mid x_i)$; that is, the weights $w_{is}$ and hence $W_s$ (the E-step). These weights are then inserted into eqs 10, 13, and 14 to give revised parameter estimates, which is essentially the M-step. The process is continued until some convergence criterion is satisfied (Everitt and Hand, 1981).
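The iteration between eq 4 and eqs 10, 13, and 14 can be sketched as below. This is an illustrative reading rather than the AutoClass implementation: the prior mean and standard deviation are assumed to be taken from the whole data set, the prior weight `w_prior` (w′) is set to 1, a fixed number of iterations stands in for a convergence criterion, and the last term of eq 14 is assumed to be the squared difference between the prior mean and the empirical class mean. The data and starting means are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def e_step(data, lam, mu, sigma):
    """Eq 4 with the independent-attribute class model of eqs 8 and 9."""
    weights = []
    for x in data:
        joint = [lam[s] * math.prod(normal_pdf(x[j], mu[s][j], sigma[s][j])
                                    for j in range(len(x)))
                 for s in range(len(lam))]
        total = sum(joint)  # assumes no underflow to zero (fine for this toy data)
        weights.append([v / total for v in joint])
    return weights

def m_step(data, weights, k, prior_mean, prior_sd, w_prior):
    """Eqs 10, 13, and 14 with the weights held constant."""
    n, m = len(data), len(data[0])
    lam, mu, sigma = [], [], []
    for s in range(k):
        W = sum(weights[i][s] for i in range(n))                     # total weight W_s
        lam.append((W + w_prior - 1.0) / (n + k * (w_prior - 1.0)))  # eq 10
        mu_s, sigma_s = [], []
        for j in range(m):
            mean = sum(weights[i][s] * data[i][j] for i in range(n)) / W
            var = sum(weights[i][s] * data[i][j] ** 2 for i in range(n)) / W - mean ** 2
            var = max(var, 0.0)  # guard against small negative rounding errors
            mu_s.append((w_prior * prior_mean[j] + W * mean) / (w_prior + W))  # eq 13
            v = ((w_prior * prior_sd[j] ** 2 + W * var) / (w_prior + W + 1.0)
                 + w_prior * W * (prior_mean[j] - mean) ** 2
                 / ((w_prior + W) * (w_prior + W + 1.0)))                      # eq 14
            sigma_s.append(math.sqrt(v))
        mu.append(mu_s)
        sigma.append(sigma_s)
    return lam, mu, sigma

def fit(data, k, init_mu, iterations=50, w_prior=1.0):
    n, m = len(data), len(data[0])
    # Assumption: prior mean and spread are taken from the whole data set.
    prior_mean = [sum(x[j] for x in data) / n for j in range(m)]
    prior_sd = [math.sqrt(sum((x[j] - prior_mean[j]) ** 2 for x in data) / n)
                for j in range(m)]
    lam = [1.0 / k] * k
    mu = [list(c) for c in init_mu]
    sigma = [list(prior_sd) for _ in range(k)]
    for _ in range(iterations):
        weights = e_step(data, lam, mu, sigma)
        lam, mu, sigma = m_step(data, weights, k, prior_mean, prior_sd, w_prior)
    return lam, mu, sigma, e_step(data, lam, mu, sigma)

# Hypothetical data: two attributes, two well separated groups of instances.
data = [[510.1, 1.02], [509.8, 0.98], [510.3, 1.01],
        [495.2, 1.30], [494.7, 1.28], [495.5, 1.33]]
init_mu = [[510.0, 1.0], [495.0, 1.3]]
lam, mu, sigma, weights = fit(data, 2, init_mu)
print(lam)
```

With w′ = 1, eq 10 reduces to the familiar maximum likelihood estimate Ws/n, while the prior terms in eqs 13 and 14 keep the class standard deviations from collapsing to zero.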
EM Algorithm Steps. The steps of the EM algorithm are explained next. In the E-step, starting values of µ, σ, and λ are obtained using one of a variety of cluster analysis methods and are then used to obtain a first estimate of the weights, ŵ (an estimate of w), using eq 4. For instance, assume that the data have three classes (k = 3; i.e., λ = λ1, λ2, and λ3) and two attributes (j = 1, 2). Then we will have 3 × 2 = 6 µ and 6 σ. In total, we will have 6 + 6 + 3 = 15 parameters to estimate, yielding three weights (wi,s=1, wi,s=2, and wi,s=3) for each observation. The M-step requires the calculation of µ, σ, and λ using eqs 10, 13, and 14. The iteration of the two steps is continued until the parameter estimates converge at a maximum of the posterior.

Operational Status Classification of a Fluid Catalytic Cracking Process

Data Mining for Process Monitoring and Diagnosis. Complex chemical processes such as FCC require monitoring and diagnosis by excellent automation as well as people (Marlin, 1995). Plant control and computing systems provide monitoring features for two sets of people who perform two different sets of functions: (1) the long-term performance analysis, monitored by supervisors and engineers; and (2) the immediate safety and operations of the plant, usually monitored by plant operators. Obviously, both types of monitoring and diagnosis require people to make and implement decisions. The supervisors and engineers monitor the long-range performance of the plant to identify opportunities for improvement and causes of poor operation. Usually a substantial amount of data covering a long time period is used in the analysis. While there has been significant progress in computer support for on-line process monitoring and control, we must never forget that many of the important decisions in plant operation that contribute to longer-term safety and profitability are based on monitoring and diagnosis and are implemented by people “manually”.
This process involves the analysis of large volumes of multivariate data. Tools such as AutoClass that are able to identify similarities and differences between data cases and automatically group them into classes can provide very useful support for people to gain insight into the data and make strategic decisions. The plant operators require very rapid information so that they can ensure that the plant conditions remain within acceptable bounds. If undesirable situations occur, the operators need to respond rapidly and intervene to restore acceptable performance. Because an operator cannot monitor all variables simultaneously, a computer automated data analysis tool will provide very useful support. Although AutoClass, like back-propagation neural networks (BPNN), is not a recursive approach, which means that its learning always requires the combined use of real-time and previous data, it has the advantage of being an unsupervised approach that can learn from daily operational data. As will be discussed in the section Comparison with Neural Networks, supervised BPNN is not able to use daily operational data.

Process Description. A simplified flowsheet of the FCC process is shown in Figure 1. Briefly, the fresh feed and recycle sludge oil are preheated, mixed, and fed into the riser tube reactor where they contact regenerated catalyst and start the cracking reactions. Product vapor and catalyst solids are separated into
Figure 1. The simplified FCC flowsheet.
Figure 2. Reaction temperature response for cases 15, 17, and 19.
gaseous products and liquid. The spent catalyst, including residually adsorbed hydrocarbons and coke deposit, passes to the steam stripping section and enters the regenerator where the coke on the catalyst is burnt off with air, and in so doing supplies the heat for the endothermic cracking reactions. Because this is a heavy oil cracking process, coke burning provides more heat than the cracking reactions require; the excess is taken away by water and steam from the regenerator and the heat exchanger outside the regenerator. The main controllers include reaction temperature, regenerator pressure, compressor inlet pressure (which is reactor pressure indirectly), reactor catalyst hold up, as well as various flow rate controls. A customized dynamic training simulator including hardware and software was developed in 1992 for the LuoYang Refinery of SINOPEC (China PetroChemical Incorporation) and has been successfully used for training of operators in startup and shutdown, normal operations, as well as fault diagnosis and emergency treatment. The same simulator has also been used by several other refineries for training, although they have slightly different process configurations.

Case Data Generation. To generate an instance (or a data pattern), the simulator is run in normal mode. When all parameters become stable, a disturbance or fault is introduced and, at the same time, data recording is started. For each variable in a run, 60 data points are recorded. For example, the dynamic trend labeled 15 (reaction temperature) in Figure 2 is composed of 60 data points recorded when the valve opening on the top of the distillation column changes from 100 to 90%. Six process parameters are chosen to be recorded, including reaction and regeneration temperatures (TRA
Table 2. Summary of the 42 Simulated Cases
cases   description of cases
1-11    normal operation
12      fresh feed pump (P1321) failure
13      a step increase of 15% in fresh feed flow rate
14      a step decrease of 50% in fresh feed flow rate
15      the valve opening on top of the distillation column changes from 100% to 90%
16      100% to 80%
17      100% to 60%
18      100% to 40%
19      100% to 30%
20      100% to 20%
21      manual valve (V20) controlling catalyst to heat removal system, 75% → 80%
22      75% → 90%
23      75% → 100%
24      75% → 60%
25      75% → 50%
26      75% → 40%
27      75% → 35%
28      75% → 30%
29      recycle sludge oil flow rate increased to 300%
30      fresh feed flow rate decreased to 10% of that of normal operation
31      recycle sludge oil pump (P1329) failure
32      fresh feed flow rate increased by 9%
33-42   normal operation
and TRG), reactor and regenerator pressures (PRA and PRG), and oxygen and carbon monoxide volumetric contents in the flue gas from the regenerator (PTO2 and PTCO). These parameters are recognized as the major parameters for monitoring FCC operation, although more precise characterization can be expected if more parameters are included. Altogether 42 data instances have been generated and are summarized in Table 2. With the prior knowledge of the data, it is possible to test the automatic classification capabilities. Because each data instance involves six variables and each variable is represented by 60 data points, the database is a 360 × 42 matrix. In the literature, various methods have been reported to reduce the dimension of a dynamic trend without losing its important features. These methods include wavelet approaches (Bakshi and Stephanopoulos, 1996; Dai et al., 1995; Joshi et al., 1995), episode representation (Janusz and Venkatasubramanian, 1991; Cheung and Stephanopoulos, 1990a,b), neural networks (Whiteley and Davis, 1992), as well as fuzzy logic (Wang et al., 1996). In this work, the raw data are used because our focus here is not on dimension reduction of dynamic data.

Classification Results. The 42 data cases are fed to the AutoClass tool. Although the classification using AutoClass is an automatic process, it still has many options that allow users to properly control the learning and results output.
For example, users can choose from attribute probability models including (i) the single multinomial model, which implements a single multinomial likelihood model term for symbolic or integer attributes that is conditionally independent of other attributes given the class; (ii) the single normal CN model, which models real valued attributes with a conditionally independent Gaussian normal distribution (this model assumes that there are no missing values); (iii) the single normal CM model, which also models real valued attributes but allows missing values; and (iv) the multi-normal CN covariant model, also for real valued attributes, which expresses mutual dependencies within the class. The probability that the class will produce any particular instance is then the product of any independent and covariant probability terms.
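The difference between a conditionally independent normal model and a covariant model can be sketched for two attributes as follows. This is not the AutoClass code: the covariant case is illustrated with the standard bivariate normal density and a correlation coefficient `rho`, and the means, standard deviations, and the instance are hypothetical values for a reaction and a regeneration temperature.

```python
import math

def log_independent_normal(x, mu, sigma):
    """Independent model: the class density is a product of 1-D normal terms."""
    return sum(-0.5 * ((xj - mj) / sj) ** 2 - math.log(sj) - 0.5 * math.log(2.0 * math.pi)
               for xj, mj, sj in zip(x, mu, sigma))

def log_bivariate_normal(x, mu, sigma, rho):
    """Covariant model for two attributes with within-class correlation rho."""
    z1 = (x[0] - mu[0]) / sigma[0]
    z2 = (x[1] - mu[1]) / sigma[1]
    q = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    return (-0.5 * q - math.log(2.0 * math.pi * sigma[0] * sigma[1])
            - 0.5 * math.log(1.0 - rho * rho))

# Hypothetical class description and instance (deg C).
mu = [510.0, 670.0]
sigma = [2.0, 2.0]
x = [512.0, 672.0]  # both attributes deviate upward together

independent = log_independent_normal(x, mu, sigma)
covariant = log_bivariate_normal(x, mu, sigma, 0.9)
print(independent, covariant)
```

When the two attributes tend to move together within a class, the covariant model assigns a higher density to an instance in which both deviate in the same direction; with zero correlation the two models coincide.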
Table 3. AutoClass Clustering Results of the 42 Cases in Table 2
classes   cases
1         1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42
2         21, 22, 23, 31, 32, 13, 29
3         15, 16, 17, 18, 19, 20
4         24, 25, 26, 27, 28
5         12, 14, 30
In this work we have chosen the single normal CN model, as indicated by eq 9. The approach is based on generation of alternative classification schemes that are ranked (the first being the best). It then remains to analyze the classification schemes to determine which is the most acceptable classification. The best classification for the current problem is shown in Table 3.

Analysis of the Classification Results. The classified results are analyzed by comparing Table 3 and Table 2. Class I includes cases 1-11 and 33-42, which correspond to normal operations, as can be seen from Table 2. Thus, it is reasonable to assign these cases to a single class. The significance of being able to automatically distinguish between normal operational data and abnormal data that represent moderate to significant disturbances as well as faults is that process upsets can be spotted. Class III includes cases 15-20, which correspond to decreases in the opening of valve 401-ST from 100% to 90, 80, 60, 40, 30, and 20%, which cause the differential pressure (PRG - PRA) between the regenerator and reactor to decrease. Consequently, the regenerated catalyst circulation rate will reduce and the reaction and regeneration temperatures will be influenced. The closed-loop dynamic responses of reaction and regeneration temperatures for cases 15, 17, and 19 are shown in Figures 2 and 3. It can be seen that although they are not identical, they resemble each other more closely than they resemble the cases in, for example, class I (i.e., normal operation). Class IV includes cases 24-28, which correspond to changes in the opening of the hand-operated valve V20, which first cause regeneration temperature to change
Figure 3. Regeneration temperature response for cases 15, 17, and 19.
Figure 4. Regeneration T responses for cases 24, 26, and 28.
Figure 5. Dynamic responses of the O2 content in flue gas for cases 24, 26, and 28.
Figure 6. Dynamic responses of the O2 content in flue gas for cases 21, 22, and 23.
Figure 7. Reaction temperature responses for cases 13, 29, 31, and 32.
due to heat transfer changes and then affect all other parameters. Cases 24-28 refer to a reduction in the opening of V20 from its normal operation value of 75% to 60, 50, 40, 35, and 30%. All these operations cause the regeneration temperature to increase, so it is quite reasonable that they should be grouped into one class. The changes in regeneration temperature, TRG, and oxygen volumetric percentage in flue gas, PTO2, are shown in Figures 4 and 5. It is important that cases 21, 22, and 23 are grouped in a different class (class II), although they also represent V20 opening changes, because they are increases in opening that have different effects on associated variables. Figure 6 shows the
PTO2 changes for cases 21, 22, and 23. It is clear that they are different from cases 24, 26, and 28 shown in Figure 5. Class V includes cases 12, 14, and 30, which are clearly in one class. These cases represent feed pump P1321 failure and feed flow rate decreases of 50 and 90%, all of which mean a sharp decrease in feed flow rate. Class II includes cases 21, 22, 23, 31, 32, 13, and 29. As already discussed, it is reasonable that cases 21, 22, and 23 should be grouped together because they represent an increase in the opening of V20 that will cause a decrease in regeneration temperature. Both cases 13 and 32 represent slight increases (15 and 9%, respectively) in fresh feed flow rate, whereas case 29 represents a three times increase in sludge oil flow rate and case 31 represents a sludge oil pump failure. Because the sludge oil flow is very small compared with the fresh feed at normal operations (8000 kg/h sludge oil versus 150 000 kg/h fresh feed), it is not surprising that a three times increase in sludge oil has a similar effect on process operation as slight fresh feed increases of 15 and 9%. The major difference between the four cases (13, 29, 31, and 32) is that cases 13, 29, and 32 represent increases in feed whereas case 31 indicates a decrease, as indicated in the early stage of the reaction temperature responses (Figure 7). Because the process is under closed loop control, such a difference is not significant enough to regard case 31 as a different class. It is interesting that cases 21, 22, and 23 and cases 31, 32, 13, and 29 are grouped together, which would not have been expected. The explanation is that they are not very significant disturbances and all affect the operations of the process in a similar way, mainly by influencing the heat balances of the two reactors, and the heat balance is the dominant factor in FCC operations.

Table 4. ART2 Clustering Results of the 42 Cases in Table 2
classes   cases
A         1-10, 13, 21, 29, 31-42
B         11
C         12
D         14
E         15, 16, 17
F         18, 19, 20
G         22
H         23
I         24, 25, 26, 27, 28
J         30

Comparison with Neural Networks. Various computer-based technologies have been used to develop process operational decision support systems using online process operational data. Knowledge-based expert systems make use of the expertise and experience of human experts. Previous research on automatically extracting fuzzy rules from operational data has been reported (Wang et al., 1997a). Neural network-based machine learning uses data to train the networks, but the way of making use of data depends on the type of learning (i.e., supervised or unsupervised). Supervised learning approaches, such as BPNN, are associated with assignment or identification of a case to previously established classes. Although supervised learning normally gives more accurate predictions, there are often difficulties in finding training data. For process fault identification and diagnosis, a BPNN network needs both symptoms and faults. Therefore, the data collected daily by distributed control systems cannot be used by BPNNs as training data. Obviously, faults would not generally be deliberately introduced to an operating process to generate training cases. Dynamic simulators have proved to be an effective way to generate training cases (Wang et al., 1997b). But there is another problem with supervised training; namely, it is not effective in dealing with new cases that are beyond the range of the training patterns. Unsupervised learning is designed to find similarities and differences and so is able to deal with unfamiliar cases very effectively. More importantly, unsupervised learning is able to effectively use daily operational data collected by distributed control systems.
The ART2 neural network is a representative unsupervised neural network approach (Whiteley and Davis, 1994) that has been applied to the same problem, and the best results are shown in Table 4. ART2 classifies the 42 cases into 10 classes. This result is reasonably good (compare Tables 2 and 4). Class A includes all normal operation cases. Classes B, C, D, G, H, and J are individual cases. Classes E, F, and I each represent a class of similar operations. However, the classification gives too much detail and is strongly influenced by a vigilance or threshold value that needs to be input by users based on trial and error. Two extreme vigilance values are those that classify each case into a single class or place all in only one class. There is no fundamental rule to follow in selecting a vigilance value, other than trial and error, which is clearly a major shortcoming compared with AutoClass, which automatically seeks the best classifications. With respect to process operational data analysis, the classification in Table 4 involves too much detail to be very useful.
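The two partitions can also be compared mechanically. The sketch below is an illustrative check that is not part of the original study: it encodes the AutoClass classes (I to V, as discussed in the text) and the ART2 classes (A to J, Table 4) and lists, for each ART2 class, the AutoClass classes it overlaps. The helper name `expand` is a hypothetical convenience.

```python
def expand(spec):
    """Turn a case list such as '1-10, 13' into a set of case numbers."""
    cases = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            cases.update(range(int(low), int(high) + 1))
        else:
            cases.add(int(part))
    return cases

autoclass = {
    "I": expand("1-11, 33-42"),
    "II": expand("13, 21-23, 29, 31, 32"),
    "III": expand("15-20"),
    "IV": expand("24-28"),
    "V": expand("12, 14, 30"),
}
art2 = {
    "A": expand("1-10, 13, 21, 29, 31-42"),
    "B": expand("11"), "C": expand("12"), "D": expand("14"),
    "E": expand("15-17"), "F": expand("18-20"),
    "G": expand("22"), "H": expand("23"),
    "I": expand("24-28"), "J": expand("30"),
}

# For each ART2 class, the AutoClass classes that share at least one case with it.
overlap = {a: sorted(c for c, members in autoclass.items() if members & cases)
           for a, cases in art2.items()}
straddling = [a for a, classes in overlap.items() if len(classes) > 1]
print(overlap)
print(straddling)
```

Every ART2 class except A lies wholly inside one AutoClass class, whereas class A spans AutoClass classes I and II, so the ART2 result is mostly a finer subdivision of the AutoClass result.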
Increasing the vigilance value to get fewer classes leads to poor classification.

Concluding Remarks

A data mining system has considerable potential in gaining insight into the characterization of the behavior of complex processes such as fluid catalytic cracking. It is demonstrated that data mining is able to automatically convert operational data into clusters that represent significantly different operational modes. Although most of the classification results are what would have been expected, some certainly are not. It is only after more detailed thought and inspection that they can be seen to be valid classes. The significance of the results is that they show that process operational data can be converted into high-value information through application of appropriate data mining technologies. Comparison with supervised and unsupervised neural networks, BPNN and ART2, has shown that a major advantage of the Bayesian-based automatic classification approach used in this work is that it avoids the need to arbitrarily introduce vigilance factors, which are empirical and have to be estimated by trial and error. As has been noted by several researchers (Cheeseman and Stutz, 1996; Fayyad and Simoudis, 1997), data mining is rarely a one-shot process of throwing some databases at a tool. Instead, it is a process of finding classes and interpreting the results. In the present case, several classifications have been found and ranked by the program. Ultimately, domain experts must analyze the results to determine which is the best. It is important to note that although the approach is well founded, there are problems to be solved in dealing with large databases. These problems arise from data that are incomplete (i.e., values of some data attributes are missing), complex in structure and type, dynamic, redundant, noisy, and sparse. The approach used here is able to deal with only some of these issues, but not all.
Therefore, there is considerable scope for further developments to make it possible to combine domain knowledge with case-based reasoning and technologies such as normalization, visualization, and neural networks.

Acknowledgment

Support for this work by EPSRC (Grant reference: GR/L61774) is gratefully acknowledged. Special thanks are extended to Mr. B. H. Chen for helpful discussions on ART2 neural networks.

Nomenclature

ART = adaptive resonance theory
BPNN = back-propagation neural networks
C_s = the sth class in a classification scheme
p = probability
H = hypothesis
D = data
FC = flow rate controller
FCC = fluid catalytic cracking process
HC = catalyst hold-up controller or liquid level control
i = the ith instance in a database, i = 1, n
j = the jth attribute, j = 1, m
k = the number of classes for a classification scheme
KDD = knowledge discovery in databases
PC = pressure controller
PTCO = volumetric percentage of CO in regenerator flue gas
PTO2 = volumetric percentage of O2 in regenerator flue gas
TC = temperature controller
TRA = reaction temperature, °C
TRG = regeneration temperature, °C
w_is = the probability that the instance x_i was drawn from class C_s
W_s = the total weight in class C_s, eqs 10 and 11
x = instance in a database; x_i is the ith instance
θ = parameter vector for class distribution
θ_s = parameter vector for the sth class distribution; for a single-attribute normal distribution, it would consist of the class mean µ_s and variance σ_s^2
λ = the probability vector of an instance being drawn from classes
λ_s = the probability of an instance being drawn from a class C_s
ˆ = symbol on top of a parameter that indicates an estimate of the parameter
Literature Cited

Ayoubi, M.; Leonhardt, S. Method of fault diagnosis. Control Engineering Practice 1997, 5, 683.
Bakshi, B. R.; Stephanopoulos, G. Compression of chemical process data by function approximation and feature extraction. AIChE J. 1996, 42, 477.
Buntine, W. A guide to the literature on learning probabilistic networks from data. IEEE Trans. Knowl. Data Eng. 1996, 8, 195.
Cheeseman, P.; Stutz, J. Bayesian classification (AutoClass): theory and results. In Advances in Knowledge Discovery and Data Mining; Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Eds.; AAAI Press/MIT: 1996 (http://fi-www.arc.nasa.gov/fi/projects/bayes-group/group/autoclass/).
Cheeseman, P.; Stutz, J.; Self, M.; Taylor, W.; Goebel, J.; Volk, K.; Walker, H. Automatic classification of spectrum from the infrared astronomical satellite (IRAS). In NASA reference publication #1217; National Technical Information Service: Springfield, VA, 1989.
Cheeseman, P.; Kelly, J.; Self, M.; Stutz, J.; Taylor, W.; Freeman, D. AutoClass: a Bayesian classification system. In Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, MI; Morgan Kaufmann: San Francisco, CA, 1988.
Cheung, J. T. Y.; Stephanopoulos, G. Representation of process trends - I. A formal representation framework. Comput. Chem. Eng. 1990a, 14, 495.
Cheung, J. T. Y.; Stephanopoulos, G. Representation of process trends - II. The problem of scale and qualitative scaling. Comput. Chem. Eng. 1990b, 14, 511.
Dai, X.; Joseph, B.; Motard, R. L. Process signal features analysis. In Wavelet Application in Chemical Engineering; Motard, R. L., Joseph, B., Eds.; Kluwer Academic: Norwell, MA, 1995.
Dempster, A. P.; Laird, N. M.; Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc., Ser. B 1977, 39(1), 1.
Dong, D.; McAvoy, T. J. Nonlinear principal component analysis - based on principal curves and neural networks. Comput. Chem. Eng. 1996, 20, 65.
Everitt, B. S.; Hand, D. J. Finite Mixture Distributions; Chapman & Hall: London, 1981.
Everitt, B. S.; Hand, D. J. Cluster Analysis, 2nd ed.; John Wiley & Sons: New York, 1980.
Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM 1996, 39, 27.
Fayyad, U.; Simoudis, E. Data mining and knowledge discovery. In Tutorial Notes at PADD'97-1st Int. Conf. Prac. App. KDD & Data Mining; London, 1997.
Hanson, R.; Stutz, J.; Cheeseman, P. Bayesian classification theory; http://fi-www.arc.nasa.gov/fi/projects/bayes-group/group/autoclass/autoclass-c-program.html, 1997.
Janusz, M. E.; Venkatasubramanian, V. Automatic generation of qualitative descriptions of process trends for fault detection and diagnosis. Eng. Applic. Artif. Intell. 1991, 4, 329.
Joshi, A.; Kumer, A.; Motard, R. L. Trend analysis using the Frazier-Jawerth transform. In Wavelet Application in Chemical Engineering; Motard, R. L., Joseph, B., Eds.; Kluwer Academic: Norwell, MA, 1995.
Kuespert, D. R.; McAvoy, T. J. Knowledge extraction in chemical process control. Chem. Eng. Commun. 1994, 130, 251.
Marlin, T. E. Process Control: Designing Processes and Control Systems for Dynamic Performance; McGraw-Hill: New York, 1995.
Santen, A.; Koot, G. L. M.; Zullo, L. C. Statistical data analysis of a chemical plant. Comput. Chem. Eng. 1997, 21, s1123.
Stephanopoulos, G.; Han, C. Intelligent system in process engineering: A review. Comput. Chem. Eng. 1996, 20, 743.
Titterington, D. M.; Smith, A. F. M.; Makov, U. E. Statistical analysis of finite mixture distributions; John Wiley & Sons: New York, 1985.
Wang, X. Z.; Chen, B. C.; Yang, S. H.; McGreavy, C. Neural net, fuzzy sets and digraphs in safety and operability studies of refinery reaction processes. Chem. Eng. Sci. 1996, 51, 2169.
Wang, X. Z.; Chen, B. H.; Yang, S. H.; McGreavy, C.; Lu, M. L. Automatic generation of production rules from data for process operational decision support. Comput. Chem. Eng. 1997a, 21, s661.
Wang, X. Z.; Lu, M. L.; McGreavy, C. Machine learning dynamic fault models based on fuzzy set covering method. Comput. Chem. Eng. 1997b, 21, 621.
Whiteley, J. R.; Davis, J. F. Knowledge-based interpretation of sensor patterns. Comput. Chem. Eng. 1992, 16, 329.
Whiteley, J. R.; Davis, J. F. A similarity-based approach to interpretation of sensor data using Adaptive Resonance Theory. Comput. Chem. Eng. 1994, 18, 637.
Wolfe, J. H. Pattern Clustering By Multivariate Analysis; U. S. Naval Personnel & Training Research Lab., 1970.
Received for review September 3, 1997
Revised manuscript received December 8, 1997
Accepted January 15, 1998
IE970620H