Predictive Modeling of Pharmaceutical Processes with Missing and Noisy Data

Boukouvala F; Muzzio FJ; Ierapetritou MG

AIChE Journal, Vol.56, No.11, 2860-2872, 2010

Limited knowledge of the first principles that describe the behavior of processed particulate mixtures has drawn significant attention to data-driven models for characterizing the performance of pharmaceutical processes, which are often treated as black-box operations. Uncertainty in the experimental data sets, however, can decrease the quality of the resulting predictive models. In this work, the effect of missing and noisy data on the predictive capability of surrogate modeling methodologies such as Kriging and Response Surface Method (RSM) is evaluated. The key factors that affect the final prediction error and the computational efficiency of the algorithm were found to be: (a) the method used to assign initial estimates to the missing elements and (b) the iterative procedure used to further improve these initial estimates. The proposed approach combines the most appropriate initialization technique with the Expectation Maximization Principal Component Analysis (EM-PCA) algorithm to impute missing elements and minimize noise. A comparative analysis of different initial imputation techniques (mean, matching procedure, and a Kriging-based approach) shows that the former two approaches give more accurate, "warm-start" estimates of the missing data points that can significantly reduce computational time requirements. Experimental data from two case studies of different unit operations of the pharmaceutical powder tablet production process (feeding and mixing) are used as examples to illustrate the performance of the proposed methodology. Results show that, by introducing an extra imputation step, the pseudo-complete data sets created produce very accurate predictive responses, whereas discarding incomplete observations leads to loss of valuable information and distortion of the predictive response. Results are also given for different percentages of missing data and different missing patterns.
© 2010 American Institute of Chemical Engineers AIChE J, 56: 2860-2872, 2010

Keywords: missing and noisy data; Kriging; response surface; pharmaceutical; data-driven models; imputation; EM-PCA
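
Illustrative sketch: the two-stage idea described in the abstract (an initial estimate for the missing elements, followed by iterative refinement with a PCA model) can be illustrated with a generic iterative PCA imputation loop. The code below is not the authors' implementation; it assumes mean initialization, a fixed number of retained components, and missing entries encoded as NaN, with every column containing at least one observed value. The function name em_pca_impute and all of its parameters are hypothetical.

```python
import numpy as np


def em_pca_impute(X, n_components=1, max_iter=500, tol=1e-6):
    """Fill NaN entries of X by iterating a low-rank PCA reconstruction.

    Assumption: each column has at least one observed (non-NaN) value.
    Returns the completed array and the number of iterations performed.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Stage (a): initial estimate, here the mean of the observed values per column.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Fit PCA to the current completed data via SVD of the centered matrix.
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        k = n_components
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mu

        # Stage (b): update only the missing entries; observed data stay fixed.
        step = np.linalg.norm(recon[missing] - filled[missing])
        filled[missing] = recon[missing]
        if step <= tol:
            break
    return filled, n_iter


# Small synthetic demonstration: three correlated variables, three gaps.
t = np.linspace(0.0, 1.0, 20)
X_true = np.column_stack([t, 2.0 * t + 1.0, -t + 3.0])
X_obs = X_true.copy()
X_obs[2, 0] = np.nan
X_obs[10, 1] = np.nan
X_obs[17, 2] = np.nan

X_completed, iterations = em_pca_impute(X_obs, n_components=1)
print("iterations:", iterations)
print("max abs imputation error:", np.abs(X_completed - X_true).max())
```

A better starting estimate in stage (a) leaves less work for the loop in stage (b), which is the sense in which the abstract speaks of "warm-start" estimates reducing computational time. Swapping the mean initialization for another technique only requires changing the line that builds the initial filled array.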