Data mining and knowledge discovery in chemical processes: Effect of alternative processing techniques

Published online by Cambridge University Press:  26 April 2022

Luis A. Briceno-Mena
Affiliation:
Cain Department of Chemical Engineering, Louisiana State University, Baton Rouge, Louisiana 70803, USA
Miriam Nnadili
Affiliation:
Cain Department of Chemical Engineering, Louisiana State University, Baton Rouge, Louisiana 70803, USA
Michael G. Benton
Affiliation:
Cain Department of Chemical Engineering, Louisiana State University, Baton Rouge, Louisiana 70803, USA
Jose A. Romagnoli*
Affiliation:
Cain Department of Chemical Engineering, Louisiana State University, Baton Rouge, Louisiana 70803, USA
*Corresponding author. E-mail: jose@lsu.edu

Abstract

Data mining and knowledge discovery (DMKD) focuses on extracting useful information from data. In the chemical process industry, tasks such as process monitoring, fault detection, process control, and optimization can be achieved using DMKD. However, selecting the appropriate method for each step in the DMKD process, namely data cleaning, sampling, scaling, dimensionality reduction (DR), clustering, cluster analysis, and data visualization, to obtain meaningful insights is far from trivial. In this contribution, a computational environment (FastMan) is introduced and used to illustrate how method selection affects DMKD in chemical process data. Two case studies, using data from a simulated natural gas liquid plant and real data from an industrial pyrolysis unit, were conducted to demonstrate the applicability of these methodologies in real-life scenarios. Sampling and normalization methods were found to have a great impact on the quality of the DMKD results. Also, t-distributed stochastic neighbor embedding, a neighbor-graph method for DR, outperformed principal component analysis, a matrix factorization method frequently used in the chemical process industry, at identifying both local and global changes.
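
To make the scaling and clustering steps concrete, the following is a minimal sketch in plain Python, not FastMan's implementation. The toy data, the function names (`zscore`, `over_mean`, `dbscan`), and the `eps` value are assumptions chosen for illustration; `min_samples` follows the parameter name used in the figure captions. It scales two hypothetical process variables and then groups the samples with a small DBSCAN, showing how a change in `min_samples` alone can turn two clusters into noise.

```python
import math
from statistics import mean, pstdev


def zscore(column):
    """Center a variable on zero and scale it to unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


def over_mean(column):
    """Divide a variable by its mean, preserving its relative spread."""
    mu = mean(column)
    return [x / mu for x in column]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # A point counts as its own neighbor, as in common DBSCAN variants.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:
                queue.extend(reach)  # j is a core point, so keep expanding
    return labels


# Hypothetical process data: two operating regimes that differ in flow only.
flow = [100.0, 101.0, 99.0, 100.5, 150.0, 151.0, 149.0, 150.5]
temp = [1.00, 1.02, 0.98, 1.01, 1.00, 1.02, 0.98, 1.01]

scaled = list(zip(over_mean(flow), over_mean(temp)))
labels = dbscan(scaled, eps=0.1, min_samples=3)
strict = dbscan(scaled, eps=0.1, min_samples=5)

print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1]
print(strict)  # [-1, -1, -1, -1, -1, -1, -1, -1]
```

In this toy set each regime holds only four samples, so raising `min_samples` to 5 leaves no core points and every sample is labeled as noise. Swapping `over_mean` for `zscore` changes the distances between samples, so `eps` would need to be retuned; this is one way the choice of scaling method can propagate into the clustering result.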

Information

Type
Tutorial review
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided that no alterations are made and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use and/or adaptation of the article.
Copyright
© The Author(s), 2022. Published by Cambridge University Press
Figure 1. Typical flow of a data mining and knowledge discovery methodology.

Table 1. Data preprocessing and analysis methods available in FastMan.

Figure 2. FastMan typical view.

Figure 3. Simulated NGL plant schematic (Chebeir et al., 2019).

Figure 4. Time evolution of the XIC 100.PV variable for the NGL plant and partition of operational conditions. (Left) Line plot. (Right) SOM, where darker regions represent high similarity (clusters) between data points and brighter regions represent low similarity (separation between clusters).

Figure 5. Effect of sampling on the visualization and classification for the NGL plant via SOM, 3D projection, and time evolution of XIC 100.PV process variable. First row shows random sampling and second row shows cNN sampling.

Figure 6. Effect of scaling on the visualization and projections of process data from the NGL plant: (first row) z-score and (second row) over mean.

Figure 7. Effect of DR techniques: (first row) PCA; (second row) t-SNE; and (third row) UMAP.

Figure 8. Effect of alternative clustering techniques: (a-c) DBSCAN; (d-f) HDBSCAN; and (g-i) K-Means.

Figure 9. Effect of min_samples parameter: (a, b) min_samples = 10 and (c-f) min_samples = 5.

Figure 10. Cluster analysis results: (left) SGS analysis of the clusters produced with PCA-DBSCAN combination. (right) Cluster projections over the components SOM for most influential variable.

Figure 11. Schematic representation of the industrial pyrolysis reactor.

Figure 12. Rare operation states in the pyrolysis reactor dataset: (a) raw data; (b) zoom around apparent outlier; and (c) further zoom showing a 5 hr operation region.

Figure 13. Hydrocarbon flows for coils 2 and 6 of the pyrolysis reactor: (a) raw data; (b) data after outlier detection/elimination; (c) zoomed view of the section within the circle; and (d) data projected in a 3D space using PCA-DBSCAN.

Figure 14. Effect of the number of neighbors, $k$, on the data makeup for the pyrolysis reactor.

Figure 15. Effect of sampling on the process variables. cNN sampling: (a) k = 5 and (c) k = 50. (b) and (d) corresponding projection to 3D space.

Figure 16. Effect of sampling on the process variables. Random sampling: (a) n = 2 and (c) n = 10; (b) and (d) corresponding projection to 3D space.

Figure 17. Process variables time evolution: (a) coils 2 and 4 steam flow rates and (b) coils 1 and 5 hydrocarbon flow rates.

Figure 18. Process variables time evolution: (a) 3D projection plots and (b) the SOM for the PCA-DBSCAN.

Figure 19. Projection of the main cluster as predicted by PCA-DBSCAN into SOM: (a) time evolution plot and (b) SOM.

Figure 20. Projection of clusters when min_samples in DBSCAN is reduced from 10 to 5: (a, d) time evolution plot; (b, c, e) cluster projection over the SOM; and (c) subspace greedy search results.

Figure 21. (a) Time evolution of the hydrocarbon flows of coils 1 and 5. Circles mark the operating regions identified by the t-SNE-DBSCAN combination of DR and clustering; (b) clustering results in 3D plot.

Figure 22. Cluster (corresponding to two subregions) projected into the time evolution as well as into the SOM and component plots.

Figure 23. Contributing variables as identified by the subspace greedy search. Variables with the highest scores have a greater influence on the distribution of the data (i.e., cluster separation).
