**Use of Sparse principal component analysis and multidimensional visualization technique for process fault detection and diagnosis**

Shriram Gajjar, Murat Kulahci, Ahmet Palazoglu

University of California, Davis

Chemical process operations are routinely subject to process and operational disturbances. Timely and effective detection and diagnosis of faults (process monitoring) is therefore critical to ensure safety and process stability and to maintain optimal levels of operation. Monitoring techniques based on first-principles models have been studied for more than two decades, but their contribution to industrial practice has not been pervasive because of the substantial cost and time required to develop a sufficiently accurate model of a complex chemical plant. In a large-scale unit, on the other hand, a Distributed Control System (DCS) collects data at high sampling rates from sensor arrays distributed throughout the plant and stores them. These data contain information about the underlying process characteristics and can be used for process monitoring.

In practice, a plant operator watches several DCS screens and, drawing on experience and domain knowledge, focuses on critical process variables to anticipate and prevent abnormal process operation. In the absence of such experience or domain knowledge, more automated techniques are required to inform and advise plant operators. The control chart is one of the primary techniques for statistical process monitoring of real-time data. However, monitoring hundreds of variables simultaneously with univariate control charts is difficult, and 2-D charts limit our ability to visualize and interpret high-dimensional data. To overcome this challenge, Inselberg established the concept of parallel coordinates in 1985 [1]. In the plane, parallel coordinates induce a point-line duality, and their 2-D layout makes cluster identification and pattern recognition in multidimensional data easier. Even so, the abundance of data and the number of variables to be monitored at once make the monitoring task infeasibly difficult. Principal component analysis (PCA), a technique that captures the critical information in reduced dimensions, is therefore widely used for process monitoring.
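The duality that parallel coordinates induce in the plane can be checked with a small numerical sketch. It assumes two parallel axes a distance `d` apart, so that a Cartesian point (x1, x2) is drawn as the segment from (0, x1) to (d, x2); the function names and the sample line are illustrative choices, not taken from [1]. Under that assumption, all points on the line x2 = m·x1 + b (with m ≠ 1) produce segments that pass through the single point (d/(1−m), b/(1−m)).

```python
def segment_height(x1, x2, t, d=1.0):
    """Height at horizontal position t of the line through (0, x1) and (d, x2)."""
    return x1 + (t / d) * (x2 - x1)


def dual_point(m, b, d=1.0):
    """Point in the parallel-coordinates plane that represents the line x2 = m*x1 + b."""
    if m == 1:
        raise ValueError("slope 1 maps to a point at infinity")
    return d / (1 - m), b / (1 - m)


# Illustrative line x2 = -2*x1 + 3 and a few points that lie on it.
m, b = -2.0, 3.0
t_star, h_star = dual_point(m, b)
heights = [segment_height(x1, m * x1 + b, t_star) for x1 in (-1.0, 0.0, 0.5, 2.0)]
# Every entry of `heights` equals h_star up to rounding: the segments share one point.
```

A negative slope, as here, places the shared point between the two axes, which is why negatively correlated variables show up as crossing segments in a parallel coordinates plot.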

PCA-based monitoring methods, which build statistical models from normal operation data and partition the measurements into a principal component subspace (PCS) and a residual subspace (RS), are among the most widely used multivariate statistical methods [2]. In these approaches, the dimension of the PCA model, i.e., the number of principal components (PCs) retained, must be decided, and this decision plays an important role in process monitoring performance. The choice is not unique, however, especially because of the influence of sensor noise [3]. It is also a crucial step for the interpretation of monitoring results and for any subsequent analysis, because retaining too few PCs can lose important information while retaining too many can include undesirable interference. A number of well-known techniques for selecting the number of PCs have been proposed to tackle this challenge. A simple approach, termed cumulative percent variance (CPV), is to choose the smallest number of PCs whose explained variance reaches a predetermined percentage, such as 85% (Jackson, 1991). Other methods include cross validation, the average eigenvalue approach, the variance of reconstruction error (VRE) criterion, and the fault signal-to-noise ratio (fault SNR) [3-6].
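As a minimal sketch of the ideas in this paragraph, the code below fits a PCA model to normal operation data, picks the number of PCs with the CPV rule, and computes the two statistics commonly used with the PCS/RS partition (Hotelling's T² in the PCS and the squared prediction error, SPE, in the RS). The function names, the synthetic data, and the use of autoscaling are assumptions made for illustration; control limits are omitted.

```python
import numpy as np


def fit_pca_monitor(X_normal, cpv_target=0.85):
    """Fit a PCA monitoring model on normal-operation data (rows are samples)."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0, ddof=1)
    Z = (X_normal - mean) / std                      # autoscale each variable
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]                # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cpv = np.cumsum(eigvals) / eigvals.sum()
    # Smallest k whose cumulative percent variance reaches the target.
    k = min(int(np.searchsorted(cpv, cpv_target)) + 1, eigvals.size)
    return {"mean": mean, "std": std, "P": eigvecs[:, :k],
            "eigvals": eigvals[:k], "cpv": cpv, "k": k}


def monitoring_statistics(model, X):
    """Return (T2, SPE) for each row of X under the fitted model."""
    Z = (X - model["mean"]) / model["std"]
    scores = Z @ model["P"]                          # projection onto the PCS
    t2 = np.sum(scores**2 / model["eigvals"], axis=1)
    residual = Z - scores @ model["P"].T             # part left in the RS
    spe = np.sum(residual**2, axis=1)
    return t2, spe


# Synthetic stand-in for normal operation: 6 measured variables, 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X_normal = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))

model = fit_pca_monitor(X_normal, cpv_target=0.85)
t2, spe = monitoring_statistics(model, X_normal)
```

Changing `cpv_target` changes `model["k"]`, and with it how variation is split between T² and SPE, which is one way to see why the number of retained PCs affects detection performance.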

Using PCA for dimension reduction has a drawback: each PC is a linear combination of all *m* variables, and the loadings are typically all nonzero, which makes the derived PCs difficult to interpret. To obtain modified PCs in which some loadings are exactly zero, alternative approaches have been proposed by Cadima and Jolliffe [7], McCabe [8], Tibshirani [9], and Zou and Hastie [10]. In this paper we use the sparse PCA (SPCA) method of Zou et al. [11], in which sparse loadings are obtained by imposing lasso (elastic net) constraints on the coefficients (i.e., loadings) of the PCA model. When the lasso penalty term vanishes, the method returns the exact PCA results. SPCA is essentially an optimization of the trade-off between the variance captured by the PCs and their sparsity. It lets the user control the sparsity of the loadings and improves the ability to identify the important variables.
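For reference, the elastic-net criterion of [11] can be written as below. The notation is assumed here for illustration: $x_i$ is the $i$-th (centered) observation, $A$ and $B$ are $m \times k$ matrices with columns $\alpha_j$ and $\beta_j$, $\lambda$ is the ridge penalty, and $\lambda_{1,j}$ is the lasso penalty for the $j$-th component.

```latex
(\hat{A}, \hat{B}) = \arg\min_{A,\,B}\;
  \sum_{i=1}^{n} \left\lVert x_i - A B^{\top} x_i \right\rVert_2^2
  + \lambda \sum_{j=1}^{k} \lVert \beta_j \rVert_2^2
  + \sum_{j=1}^{k} \lambda_{1,j} \lVert \beta_j \rVert_1
\quad \text{subject to } A^{\top} A = I_k ,
\qquad
\hat{v}_j = \frac{\hat{\beta}_j}{\lVert \hat{\beta}_j \rVert_2}
```

The normalized vectors $\hat{v}_j$ are the sparse loadings. Larger values of $\lambda_{1,j}$ drive more entries of $\beta_j$ to zero, and setting every $\lambda_{1,j} = 0$ recovers the ordinary PCA loadings.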

One of the challenges in using SPCA is deciding the penalty parameters or, equivalently, choosing the number of nonzero loadings. Zou et al. [11] use penalty parameters such that the sparse approximation explains almost the same amount of variance as ordinary PCA does. In this paper, we vary the number of nonzero loadings for each PC until the variance captured by that PC in SPCA is approximately the same as the variance of the corresponding PC in ordinary PCA. This method simplifies the selection of penalty parameters and provides a more intuitive solution for chemical processes. It also illustrates the key trade-off between sparsity and information retention.
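The search over the number of nonzero loadings can be sketched as follows. Note the substitution: this sketch does not implement the elastic-net SPCA of [11]. As a simplified stand-in, it keeps the largest-magnitude entries of an ordinary PCA loading vector (simple thresholding) and renormalizes. The function names, the 5% tolerance, and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def explained_variance(Z, v):
    """Sample variance of the scores Z @ v for a unit-norm loading vector v."""
    return float(np.var(Z @ v, ddof=1))


def truncate_loading(v, n_nonzero):
    """Keep the n_nonzero largest-magnitude loadings, zero the rest, renormalize."""
    keep = np.argsort(np.abs(v))[-n_nonzero:]
    sparse = np.zeros_like(v)
    sparse[keep] = v[keep]
    return sparse / np.linalg.norm(sparse)


def smallest_sparse_loading(Z, v, tol=0.05):
    """Increase the number of nonzero loadings until the captured variance
    is within a fraction tol of the variance of the ordinary PC."""
    full_var = explained_variance(Z, v)
    for n_nonzero in range(1, v.size + 1):
        sparse = truncate_loading(v, n_nonzero)
        if explained_variance(Z, sparse) >= (1 - tol) * full_var:
            return sparse, n_nonzero
    return v, v.size


# Synthetic data: 8 variables, the first 3 driven by one common factor.
rng = np.random.default_rng(1)
factor = rng.normal(size=(400, 1))
X = 0.3 * rng.normal(size=(400, 8))
X[:, :3] += factor * np.array([1.0, 0.9, 0.8])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
v1 = eigvecs[:, -1]                      # ordinary loading of the first PC
sparse_v1, n_nonzero = smallest_sparse_loading(Z, v1, tol=0.05)
```

The loop stops at the sparsest loading vector that still meets the variance target, which mirrors the trade-off between sparsity and information retention described above. An elastic-net solver would replace `truncate_loading` in a faithful implementation.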

Moreover, fault detection ability has been shown to depend on which PCs are retained in the PCA model [12]. Togkalidou et al. [13] also indicated that including components with smaller eigenvalues in the PCA model and excluding some with larger eigenvalues could improve prediction quality. Most such approaches extract PCA components by considering only significant data variations, not the magnitudes of the component loadings. The advantage is that no a priori assumptions about the data structure are required. The downside is that the loadings of the extracted components are difficult to interpret. Motivated by this perspective, the present paper addresses several of the limitations inherent in interpreting the loadings of the PCs retained by PCA. In many cases, traditional PCA can be altered so that the resulting loadings have a clear interpretation without significant loss of the information extracted in each PC. Such an approach can help in the application of PCA, since a better understanding of the PC loadings directly facilitates process monitoring.

In this paper, we first illustrate the advantages of using SPCA with a synthetic example. Second, we compare fault detection rates and fault diagnosis using SPCA and PCA on data obtained from the Tennessee Eastman benchmark process. In summary, this paper focuses on the use of parallel coordinates for multidimensional visualization with SPCA and discusses its accuracy for fault detection, fault diagnosis, and the analysis of fault propagation.

**References**

1. Inselberg, A., *The plane with parallel coordinates.* The Visual Computer, 1985. **1**(2): p. 69-91.

2. Cinar, A., A. Palazoglu, and F. Kayihan, *Multivariate Statistical Monitoring Techniques*, in *Chemical Process Performance Evaluation*. 2007, CRC Press. p. 37-71.

3. Tamura, M. and S. Tsujita, *A study on the selection of model dimensions and sensitivity of PCA-based fault detection.* Computers & Chemical Engineering, 2007. **31**(9): p. 1035-1046.

4. Valle, S., W. Li, and S.J. Qin, *Selection of the Number of Principal Components: The Variance of the Reconstruction Error Criterion with a Comparison to Other Methods.* Industrial & Engineering Chemistry Research, 1999. **38**(11): p. 4389-4401.

5. Wold, S., *Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models.* Technometrics, 1978. **20**(4): p. 397-405.

6. Dunia, R. and S. Joe Qin, *Joint diagnosis of process and sensor faults using principal component analysis.* Control Engineering Practice, 1998. **6**(4): p. 457-469.

7. Cadima, J. and I.T. Jolliffe, *Loading and correlations in the interpretation of principle components.* Journal of Applied Statistics, 1995. **22**(2): p. 203-214.

8. McCabe, G.P., *Principal Variables.* Technometrics, 1984. **26**(2): p. 137-144.

9. Tibshirani, R., *Regression Shrinkage and Selection via the Lasso.* Journal of the Royal Statistical Society. Series B (Methodological), 1996. **58**(1): p. 267-288.

10. Zou, H. and T. Hastie, *Regularization and variable selection via the elastic net.* Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2005. **67**(2): p. 301-320.

11. Zou, H., T. Hastie, and R. Tibshirani, *Sparse Principal Component Analysis.* Journal of Computational and Graphical Statistics, 2006. **15**(2): p. 265-286.

12. Kano, M., et al., *Comparison of multivariate statistical process monitoring methods with applications to the Eastman challenge problem.* Computers & Chemical Engineering, 2002. **26**(2): p. 161-174.

13. Togkalidou, T., et al., *Experimental design and inferential modeling in pharmaceutical crystallization.* AIChE Journal, 2001. **47**(1): p. 160-168.

Group/Topical: Topical A: 2nd Big Data Analytics