
Efficient product portfolio optimization with SAT-based association rule mining using Apriori algorithm

Published online by Cambridge University Press:  27 August 2025

Thorsten Schmidt*
Affiliation:
Helmut Schmidt University, University of the Federal Armed Forces Hamburg, Germany
Steffen Marbach
Affiliation:
Helmut Schmidt University, University of the Federal Armed Forces Hamburg, Germany
Frank Mantwill
Affiliation:
Helmut Schmidt University, University of the Federal Armed Forces Hamburg, Germany

Abstract:

Managing high-variant product portfolios effectively is a crucial competitive advantage when offering mass customized products in saturated markets. Association Rule Mining (ARM) is a field of data mining that determines frequent itemsets from historic transactions and derives patterns of conclusions from them. This paper introduces a new approach that transfers ARM to feature-based configuration, e.g. in the German automotive industry. Existing apriori product knowledge is expressed as constraints and used to lower runtime effectively by reducing the number of candidate-sets through a Boolean satisfiability check. For an efficient implementation, three different Apriori algorithms are tested and benchmarked on a generic dataset for different parameters. Results show a significant improvement from SAT-based pre-screening, while the efficiency of the implementation depends on the given example.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
Copyright
© The Author(s) 2025

1. Introduction

Offering mass customizable products that meet customer-specific requirements is a crucial competitive advantage in saturated markets. Customizable configuration comes at the cost of variant-induced complexity and the associated complexity costs. Continuously optimizing product portfolios is therefore necessary to benefit from offering customized-to-order, variant-rich products. Optimization includes streamlining the internal variance while offering as much external variance to the customer as possible.

The rising requirements for traceability and the documentation of growing complexity increase the amount of resulting data, fuelling the need to manage large databases. Besides technical documentation, product sales are typically stored in databases of item-based transactions. Analysing these transactions to discover underlying patterns of interest requires techniques of data mining. Data mining uses Machine Learning algorithms and is part of the field of Knowledge Discovery in Databases (Al-Maolegi & Arkok, 2014).

Managing product portfolios addresses high-variant configurations and includes the two main stages of portfolio identification and portfolio evaluation (Jiao & Zhang, 2005). Variant-rich products are described by their features and are restricted by additional constraints that can be of technical, legal or marketing nature. Portfolio identification requires the discovery of frequent feature-sets, which is an unsupervised Machine Learning task. Deriving associations from these frequent feature-sets in the form of "customers who bought this might also like this..." is the discipline of Association Rule Mining (ARM) (Agrawal et al., 1993). ARM is a well-known approach for product basket analysis of item-based transactions in e-commerce and retail. In mass production, ARM is performed to deal not only with product variety but also with process variety mapping (Jiao et al., 2008).

The research question addressed in this work is how ARM can be transferred to feature-based configurations without violating existing constraints, and how it can be implemented efficiently.

The contributions of this paper are threefold. First, the well-studied methods of ARM are transferred from item-based transaction data to the feature-based configuration data found in high-variant product portfolios. To achieve this transferability, the feature-based product description is briefly introduced and the notation used in the German automotive industry is outlined. Second, apriori product knowledge in the form of additional constraints needs to be considered to achieve feasible results, which is done by combining satisfiability checking with ARM. To this end, constraints are described and the Boolean satisfiability problem is introduced. Third, three different approaches are implemented: the original Apriori algorithm by Agrawal et al. (1993), an advanced Apriori algorithm by Al-Maolegi & Arkok (2014), and a new implementation using Boolean arrays. All three approaches are tested against each other and benchmarked for efficiency.

The goal of this paper is thus to apply a method for portfolio optimization to products with high variance, thereby supporting designers in planning their product portfolios and providing insights for development and production planning. To use existing apriori knowledge and assure the satisfiability of the results, two approaches need to be combined and implemented efficiently. This combined application is a novel approach for the early phases of Engineering Design as well as important support for designers in maintaining product life cycle updates for mature products.

This work promotes the use of product portfolio optimization and provides a new capability in the form of digital assistance to support Engineering Design. Extracting information from available sales data has the potential to strengthen data-driven decision support in product development. Possible applications include, but are not limited to, portfolio streamlining, recommender systems, feature-packaging and product architecture decisions in the early design phases.

This paper is structured in five sections, starting with a brief introduction to the state of the art, followed by the applied methodology and the presented results. Finally, the findings are summarized and discussed, and directions for future research are derived.

2. State of the art

The following section introduces the state of the art in documenting variant-rich products, along with the two most important concepts for mining patterns in customer transactions and modern configurators: the Apriori algorithm for ARM and the Boolean satisfiability problem for logical verification.

2.1. Feature-based product description of high-variant products

In order to highlight the particularities of a feature-based product description as found, e.g., in the German automotive industry, the original item-based transaction needs to be introduced first.

In item-based transactions, customers usually choose items from a catalogue of items. These items can be formalized as an item-catalogue $\mathcal{J} = \{I_1, I_2, \ldots, I_w\}$. Every transaction $T$ is a subset of these items such that $T \subseteq \mathcal{J}$, and all transactions are combined in $\mathcal{T} = \{T_1, T_2, \ldots, T_z\}$. Each transaction $T$ is assigned a unique identifier, the Transaction-ID.

Table 1 shows an item-based set of $z = 5$ exemplary transactions in a Boolean notation where items $I_1, I_2, I_3, \ldots, I_w$ from an item-catalogue $\mathcal{J}$ are either chosen (1) or not chosen (0). Transactions can be filtered either for one customer or for all customers, depending on the context of the addressed analysis.

Table 1. Exemplary item-based transactions from an item-catalogue ${\mathcal J}$ with w items

In mass customized products, customers usually define their configuration by choosing features from a catalogue of feature-families (often referred to, and used synonymously, as "options" or three-digit "primary number" codes in automotive). In the customer-specific configuration process, exactly one feature $F_{v,f}$ per feature-family $\mathcal{F}_f$ has to be chosen. On top of these exclusive-OR (XOR) family-constraints, there are additional technical constraints restricting the possible combinations of features into configurations for various reasons. Product configurators assist customers in finding a satisfiable configuration, typically starting with a default configuration that can be modified iteratively in a reconfiguration process. The semantic description of products based on features is also known as a Feature Model (Benavides et al., 2005), e.g. in Software Product Lines.

To formalize the documentation, there is a set of customer-specific configurations $\mathcal{C} = \{C_1, C_2, \ldots, C_y\}$ where each $C$ is a subset of the union of all feature-families, $C \subseteq \bigcup_f \mathcal{F}_f$, containing exactly one feature $F_{v,f}$ from each feature-family $\mathcal{F}_f$. Each configuration is assigned a unique identifier, the Configuration-ID. Table 2 shows a feature-based set of $y = 5$ exemplary configurations in a Boolean notation where the features $F_{1,1}, F_{2,1}, F_{3,1}, F_{4,2}, F_{5,2}, \ldots, F_{m,n}$ from the feature-families $\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_n$ are either chosen (1) or not chosen (0). The family sizes vary from $|\mathcal{F}_f|_{min} = 1$ for $n = m$ to $|\mathcal{F}_f|_{max} = m$ for $n = 1$.

Table 2. Exemplary feature-based configurations from a feature-family catalogue with m features out of n feature-families

Both the family-constraints and the additional technical constraints are combined in a ruleset, which makes configuration a rule-based process. The intra family-constraint of $\mathcal{F}_2 = \{F_{4,2}, F_{5,2}\}$ is formulated as an example in Equation 1. Constraints can be reformulated, for example from requirement to prohibition, without changing their logical statement, as shown for an inter family-constraint in Equation 2.

(1) $$\text{Intra family-constraint:}\quad F_{4,2} \,\dot{\vee}\, F_{5,2}$$
(2) $$\text{Inter family-constraint:}\quad F_{1,1} \xrightarrow{\text{forces}} F_{4,2} \;\overset{(1)}{\Longleftrightarrow}\; F_{1,1} \xrightarrow{\text{prohibits}} F_{5,2}$$

A valid, contradiction-free and complete ruleset is known as a consistent variance scheme. The ratio of configurations containing a feature $F_{v,f}$ over all of $\mathcal{C}$ is called the installation rate of that feature. Within each feature-family, the installation rates sum to 1, since every configuration $C$ contains exactly one feature per family ($\sum_v F_{v,f} = 1$ for each $f$), and summing over all families yields $\sum_f \sum_v F_{v,f} = n$ features per configuration.

Feature-based configurations are thus comparable to item-based transactions with additional constraints that need to be considered. Since they are already described using dependencies in the form of rules, mining rules in the same semantics promises good results for managing complex product portfolios.

2.2. Association Rule Mining (ARM) with Apriori algorithm

Mining for patterns and dependencies in the form of association rules $A \to B$ is the subject matter of ARM. ARM is a well-known field of study and is performed in two stages: first, enumerating frequent itemsets; second, generating association rules (Agrawal et al., 1993; Srikant, 1996).

ARM is conducted with a growing number of approaches and algorithms. The Apriori algorithm is a specialized and the most popular approach for frequent itemset mining, originally introduced by Agrawal et al. (1993). Other algorithms include Partition (Savasere et al., 1995), FP-growth (Han et al., 2000), ECLAT (Zaki, 2000), LCM (Uno et al., 2004) or, lately, ASP (Gebser et al., 2016). Apriori has been improved many times, for example in M-Apriori by Al-Maolegi & Arkok (2014), or introduced in declarative approaches for discovering frequent, closed and maximal patterns in item sequences by Coquery et al. (2012) and Jabbour et al. (2018). ARM has many more applications and has recently been combined with constraint-based approaches. Guns et al. (2011) introduced a constraint programming technique to model and solve constraint-based itemset mining tasks. Dlala et al. (2016) and Boudane et al. (2016) presented related approaches that model itemsets in the form of constraints or as propositional formulas in declarative approaches (Henriques et al., 2012).

In item-based ARM, let there be two disjoint itemsets $A$ and $B$ with $A \subseteq \mathcal{J}$, $B \subseteq \mathcal{J}$ and $A \cap B = \emptyset$ (Agrawal et al., 1993). The frequency of an itemset $A$ is given as $support(A)$ in Equation 3. ARM searches for patterns of the form $A$ (antecedent) $\to$ $B$ (consequent), such that transactions in $\mathcal{T}$ containing $A$ also contain $B$, with support and confidence defined in Equations 4 and 5, where support refers to the frequency of the occurring pattern and confidence refers to the strength of the implication.

(3) $$support(A) = \frac{|\{T \in \mathcal{T} \mid A \subseteq T\}|}{|\mathcal{T}|} = \frac{\text{number of transactions containing } A}{\text{total number of transactions}}$$
(4) $$support(A \cup B) = \frac{|\{T \in \mathcal{T} \mid A \cup B \subseteq T\}|}{|\mathcal{T}|} = \frac{\text{number of transactions containing } A \cup B}{\text{total number of transactions}}$$
(5) $$confidence(A \to B) = \frac{|\{T \in \mathcal{T} \mid A \cup B \subseteq T\}|}{|\{T \in \mathcal{T} \mid A \subseteq T\}|} = \frac{support(A \cup B)}{support(A)}$$
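To make Equations 3-5 concrete, the following minimal Java sketch (our illustration, not code from the paper; the class and method names are hypothetical) computes support and confidence over transactions stored as Boolean rows, matching the notation of Table 1:

```java
import java.util.stream.IntStream;

/** Illustrative only: support and confidence per Equations (3)-(5)
 *  over transactions stored as Boolean rows (cf. Table 1). */
public class SupportConfidence {

    /** support(A): fraction of transactions containing every item in A. */
    static double support(boolean[][] transactions, int[] itemset) {
        int hits = 0;
        for (boolean[] t : transactions) {
            boolean containsAll = true;
            for (int item : itemset) {
                if (!t[item]) { containsAll = false; break; }
            }
            if (containsAll) hits++;
        }
        return (double) hits / transactions.length;
    }

    /** confidence(A -> B) = support(A ∪ B) / support(A), per Equation (5). */
    static double confidence(boolean[][] transactions, int[] a, int[] b) {
        int[] union = IntStream.concat(IntStream.of(a), IntStream.of(b))
                               .distinct().toArray();
        return support(transactions, union) / support(transactions, a);
    }

    public static void main(String[] args) {
        // Five toy transactions over four items, items indexed 0..3.
        boolean[][] t = {
                {true,  true,  false, false},
                {true,  true,  true,  false},
                {false, true,  true,  false},
                {true,  true,  false, true},
                {true,  false, false, false}};
        System.out.println(support(t, new int[]{0, 1}));               // 3 of 5 -> 0.6
        System.out.println(confidence(t, new int[]{0}, new int[]{1})); // 3 of 4 -> 0.75
    }
}
```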

Applying Apriori iterates over the Apriori-Gen subroutine until a termination criterion is met. In each step, candidate-sets $L_k$ consisting of $k$ items are created and checked against the specified minimum support minsupp and minimum confidence minconf. This $k$ is referred to as the depth of the itemset and describes the number of items in the candidate-set. Every $L_k$ with $support(L_k) > minsupp$ becomes a new frequent itemset. In the next step, the surviving frequent itemsets are combined into new candidate-sets $L_{k+1}$ covering every possible combination. The procedure terminates when no $L_k$ surpasses the specified threshold and $L_k = \emptyset$, or when $k = n$ (Agrawal et al., 1993). Note that modified approaches may have generalized termination criteria such as weighted utility measures that consider quantities (Hidouri et al., 2020). Subsequently, all conclusions are calculated for each frequent itemset and checked for $confidence(L_k) > minconf$ accordingly.
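As an illustration of the candidate generation just described, here is a compact sketch of the Apriori-Gen join-and-prune step in Java (a simplified reading of Agrawal et al. (1993), not the benchmarked implementation; itemsets are modelled as sorted integer arrays):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AprioriGen {

    /** Builds candidate-sets of size k+1 from frequent itemsets of size k:
     *  join two itemsets agreeing on their first k-1 items, then prune any
     *  candidate that has a k-subset which is not frequent. */
    static List<int[]> aprioriGen(List<int[]> frequentK) {
        Set<List<Integer>> frequent = new HashSet<>();
        for (int[] s : frequentK) frequent.add(asList(s));

        List<int[]> candidates = new ArrayList<>();
        if (frequentK.isEmpty()) return candidates;
        int k = frequentK.get(0).length;

        for (int i = 0; i < frequentK.size(); i++) {
            for (int j = i + 1; j < frequentK.size(); j++) {
                int[] a = frequentK.get(i), b = frequentK.get(j);
                // Join step: prefixes of length k-1 must match.
                if (!Arrays.equals(Arrays.copyOf(a, k - 1), Arrays.copyOf(b, k - 1))) continue;
                int[] c = Arrays.copyOf(a, k + 1);
                c[k] = b[k - 1];
                Arrays.sort(c);
                // Prune step: every k-subset of the candidate must itself be frequent.
                boolean keep = true;
                for (int drop = 0; drop < c.length && keep; drop++) {
                    List<Integer> sub = new ArrayList<>(asList(c));
                    sub.remove(drop); // remove element at index 'drop'
                    keep = frequent.contains(sub);
                }
                if (keep) candidates.add(c);
            }
        }
        return candidates;
    }

    private static List<Integer> asList(int[] a) {
        List<Integer> l = new ArrayList<>();
        for (int x : a) l.add(x);
        return l;
    }
}
```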

Al-Maolegi & Arkok (2014) introduced an improved Apriori algorithm for mining association rules that transposes the transactions into a list of items with their corresponding transactions, reducing the time spent scanning the whole database through this reorganization. The improved M-Apriori algorithm calculates support and confidence over the conjunction rather than the disjunction of sets, using intersection (cf. Figure 2).

The Apriori algorithm is simple to understand and reasonable to implement, yet suffers from runtime limitations due to the vast number of rapidly growing candidate-sets that need to be checked: in general up to $2^w - 1$ itemsets for $w$ items (Al-Maolegi & Arkok, 2014), respectively the binomial coefficient $\binom{w}{k}$ candidate-sets of size $k$. For example, $w = 100$ items already yield $\binom{100}{5} \approx 7.5 \times 10^7$ possible candidate-sets at depth $k = 5$. Therefore, the number of candidate-sets to be checked for each $k$ is crucial for the performance and applicability of Apriori. Moreover, efficiency relies mainly on the efficient calculation of support and confidence over the disjunction of sets.

2.3. Boolean satisfiability problem (SAT)

The Boolean satisfiability problem (SAT), also known as propositional satisfiability, is an NP-complete problem in theoretical logic and includes the verification of product configurations (Janota et al., 2014). SAT is a special case of the Constraint Satisfaction Problem (CSP), a field of problems in computer science concerned with solving combinatorial questions using techniques from Artificial Intelligence, automatic theorem proving, reasoning, and operations research (Eén & Sörensson, 2004). SAT has many applications in configuration and neighbouring domains (Rossi et al., 2006). The task of SAT is to find a satisfiable variable assignment for a given propositional term or to prove that there is none (Davis et al., 1962). SAT-solvers are used to efficiently find one or enumerate all solutions to a propositional term, based on the original Davis-Putnam-Logemann-Loveland (DPLL) algorithm, through backtracking, Boolean propagation and resolution (Davis et al., 1962). Another approach is Conflict-Driven Clause Learning (CDCL), introduced by Marques-Silva & Sakallah and further developed in Zhang et al. (2001) and Biere et al. (2021).

In configuration tasks, SAT is used to find and prove feasible solutions for customized products, e.g. for a "lazy and eager" interactive reconfiguration of a default product configuration (Janota et al., 2014). Considerations of Boolean model enumeration can be found in Jabbour et al. (2013).

SAT-solvers use a binary notation for their variables (in SAT: literals) and a conjunctive form for their constraints (in SAT: clauses). Literals can represent an item $I$ as well as a feature $F_{v,f}$; the encoding needs to be defined accordingly. The binary notation of features is known as One-Hot encoding. Typically, the constraints are formulated in Conjunctive Normal Form (CNF). CNF represents a propositional formula as a conjunction of $i$ clauses, where each clause is a disjunction of $j$ literals $(x_{i,1} \vee x_{i,2} \vee \ldots \vee x_{i,j})$. Literals are positive or negated propositional variables $\{x_{i,j}, \neg x_{i,j}\}$. A general formulation of a CNF is given in Equation 6. CNF is used by most SAT-solvers as a standard and efficient formulation. CNF can be compressed (Tseitin, 1983) or translated into other formulations (Sinz, 2005). The intra family-constraint previously described in Equation 1 is reformulated in CNF notation in Equation 7, and the inter family-constraint of Equation 2 in Equation 8.

(6) $$\text{Conjunctive Normal Form:}\quad \bigwedge_i \bigvee_j (\neg) x_{i,j}$$
(7) $$\text{Reformulated (1) in CNF:}\quad F_{4,2} \,\dot{\vee}\, F_{5,2} \;\Rightarrow\; (F_{4,2} \vee F_{5,2}) \wedge (\neg F_{4,2} \vee \neg F_{5,2})$$
(8) $$\text{Reformulated (2) in CNF:}\quad F_{1,1} \to F_{4,2} \;\Rightarrow\; (\neg F_{1,1} \vee F_{4,2})$$

All existing apriori product knowledge can thus be written as constraints and formulated in CNF. This allows the validity of configurations to be verified by a SAT-solver at each step of the process.
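As a minimal illustration of such a check (our sketch using the SAT4J library named in Section 3.1; the variable numbering and the class name SatFilter are assumptions made for this example), the constraints of Equations 7 and 8 can be added as clauses and a candidate feature-set tested as assumptions:

```java
import org.sat4j.core.VecInt;
import org.sat4j.minisat.SolverFactory;
import org.sat4j.specs.ContradictionException;
import org.sat4j.specs.ISolver;
import org.sat4j.specs.TimeoutException;

/** Hedged sketch of the SAT check with SAT4J: features are One-Hot encoded
 *  as positive integers, constraints are CNF clauses, and a candidate
 *  feature-set is kept only if it can be extended to a valid configuration.
 *  The variable numbering and clauses here are illustrative. */
public class SatFilter {
    private final ISolver solver = SolverFactory.newDefault();

    public SatFilter() throws ContradictionException {
        solver.newVar(5); // F_{1,1}..F_{3,1} -> 1..3, F_{4,2} -> 4, F_{5,2} -> 5
        // Eq. (7): XOR within family F_2: (4 ∨ 5) ∧ (¬4 ∨ ¬5)
        solver.addClause(new VecInt(new int[]{4, 5}));
        solver.addClause(new VecInt(new int[]{-4, -5}));
        // Eq. (8): F_{1,1} forces F_{4,2}: (¬1 ∨ 4)
        solver.addClause(new VecInt(new int[]{-1, 4}));
    }

    /** True iff the candidate features can occur together in a valid configuration. */
    public boolean satisfiable(int[] candidate) throws TimeoutException {
        return solver.isSatisfiable(new VecInt(candidate)); // candidate as assumptions
    }

    public static void main(String[] args) throws Exception {
        SatFilter f = new SatFilter();
        System.out.println(f.satisfiable(new int[]{1, 4})); // true: consistent with (7), (8)
        System.out.println(f.satisfiable(new int[]{1, 5})); // false: 1 forces 4, but 4 XOR 5
    }
}
```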

3. Methodology

The following section describes the underlying dataset for benchmarking results as well as the methodological setup of the combined approaches and their implementation using different algorithms.

3.1. Description of the SAT-based benchmark setup and the used dataset

This research is based on a parameter study with various parameters under consideration. The setup consists of a pipeline that combines an implementation of the Apriori algorithm with a SAT-solver that checks the satisfiability of configurations against the constraints. Satisfiability is checked before Apriori and acts as a filter, pre-screening for satisfiable candidate-sets of features only. The results of the feature-based approach are satisfiable and frequent feature-sets. For benchmarking, the SAT-Filter can optionally be selected and the parameters minsupp and minconf are varied. For the efficiency assessment, three different implementations of the Apriori algorithm are tested. Based on the gathered frequent sets, all permutations of conclusions are calculated. Figure 1 illustrates the original item-based ARM setup (left) compared to the feature-based pipeline for SAT-checked configurations introduced in this paper (right).

Figure 1. Schematic comparison of the original ARM and the proposed feature-based ARM setup

Benefiting from the consideration of additional constraints using SAT, the advanced approach is able to streamline the checked combinations by discarding, in the first place, all feature combinations which do not satisfy the additional constraints. The difference between all possible naive combinations and the actually needed checks can be tremendous and consists of two parts. First, combinations that are already checked in other candidate-sets or have already been checked in previous iterations are excluded by a proper implementation of Apriori-Gen. Second, combinations that are prohibited or forced by constraints are additionally excluded by the SAT-Filter, as sketched below.
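A possible wiring of this pre-screening, purely illustrative and reusing the helpers sketched in Sections 2.2 and 2.3 (names and signatures are assumptions, not the published code):

```java
import java.util.ArrayList;
import java.util.List;
import org.sat4j.specs.TimeoutException;

/** Illustrative pre-screening loop: candidate-sets failing the SAT check are
 *  skipped before any support scan; only satisfiable candidates are counted
 *  against minsupp. Assumes one consistent feature numbering: literals 1..n
 *  for the solver and the same numbers as column indices (column 0 unused). */
public class SatPrescreenedApriori {

    static List<int[]> frequentSatisfiableSets(List<int[]> frequentK,
                                               boolean[][] configurations,
                                               SatFilter filter,
                                               double minSupp) throws TimeoutException {
        List<int[]> survivors = new ArrayList<>();
        for (int[] candidate : AprioriGen.aprioriGen(frequentK)) {
            // Excluded by the ruleset: no support scan needed, just count as skipped.
            if (!filter.satisfiable(candidate)) continue;
            if (SupportConfidence.support(configurations, candidate) > minSupp) {
                survivors.add(candidate);
            }
        }
        return survivors;
    }
}
```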

For each $k$, all checked candidate-sets are counted, their support and confidence are calculated, and the calculation time is documented. All skipped candidate-sets are merely counted and saved. To achieve reproducible and robust results, synthetic data of a generic high-variant product is used for testing and benchmarking. The synthetic data simulates a customer buying pattern with an underlying randomness and is used solely to test and validate the proposed algorithm. Synthetic data is widely used for validation while retaining control of the determined length and specified feature distribution (Savasere et al., 1995).

As part of this contribution, a synthetic dataset has been generated and made public. The synthetic data is based on a generic product description of binary configurations and constraints in CNF. The CNF consists of 100 features (variables), 609 constraints (clauses) and a total of $\approx 10^9$ feasible product variants. The generation of the CNF has been described in detail by Schmidt et al. (2024). The CNF is publicly available on GitHub (Footnote 1). The generated synthetic configuration data consists of 500,000 binary configurations and is also publicly available on GitHub (Footnote 1). Installation rates were calculated with two decimal places of floating-point precision. The SAT-solver used is the SAT4J solver (Footnote 2) in Java. All presented results were obtained on a server with an Intel® Xeon® 12-core, 24-thread CPU running at 3.40 GHz and 512 GB of RAM.

3.2. Efficient implementation using M-Apriori and B-Apriori

For a time- and memory-efficient implementation, three different approaches to the Apriori algorithm have been implemented and benchmarked against each other. An efficient implementation requires the efficient generation of sets and fast calculation, for which the three approaches were compared in terms of their performance under limited memory allocation.

The first, Apriori, is based on the original work by Agrawal et al. (1993) on transaction-based basket analysis of collections of items, described in sets and subsets of items and scanned in each iteration.

The second, M-Apriori, is based on the improved Apriori implementation of Al-Maolegi & Arkok (2014), which builds on a rearranged storage of item-based transactions, and is implemented using an optimized intersection of integer arrays in Java. The transactions are scanned and stored only once.
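The core of this variant is a merge-style intersection of sorted ID arrays; a minimal sketch of the technique (our illustration, not the repository code):

```java
import java.util.Arrays;

/** Sketch of the M-Apriori storage idea: every item keeps a sorted array of
 *  the Transaction-IDs that contain it; the support count of an itemset is
 *  the size of the intersection of these arrays. */
public class IdIntersection {

    /** Merge-style intersection of two sorted ID arrays, O(|a| + |b|). */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[n++] = a[i]; i++; j++; } // ID occurs in both lists
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] idsOfA = {1, 2, 4, 5};            // transactions containing item A
        int[] idsOfB = {1, 2, 3, 4};            // transactions containing item B
        int[] both = intersect(idsOfA, idsOfB); // {1, 2, 4}
        System.out.println(both.length / 5.0);  // support({A,B}) = 0.6 of 5 transactions
    }
}
```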

The third, B-Apriori, is the contribution of this paper: it is based on the improved M-Apriori and transferred to a feature-based description of configurations. B-Apriori is implemented using Boolean arrays in Java, taking advantage of the One-Hot encoded product description, and also scans the data only once.
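The corresponding Boolean variant replaces the merge by a position-wise AND over fixed-length occurrence vectors; again a minimal sketch of the idea under the same caveats:

```java
/** Sketch of the B-Apriori idea: every feature keeps a Boolean occurrence
 *  vector over all configurations (the One-Hot rows transposed); intersection
 *  is a position-wise AND, support is the count of remaining true entries.
 *  The vectors keep a fixed length but occupy little memory per entry. */
public class BooleanIntersection {

    static boolean[] and(boolean[] a, boolean[] b) {
        boolean[] out = new boolean[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] && b[i];
        return out;
    }

    static int count(boolean[] v) {
        int n = 0;
        for (boolean x : v) if (x) n++;
        return n;
    }

    public static void main(String[] args) {
        boolean[] occF1 = {true, true, false, true, true};  // feature F1 over 5 configurations
        boolean[] occF2 = {true, true, true, true, false};  // feature F2 over 5 configurations
        double support = count(and(occF1, occF2)) / 5.0;    // support({F1, F2}) = 0.6
        System.out.println(support);
    }
}
```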

Figure 2 shows the comparison between the three different Apriori implementations along the process from scanning and storing transactions, to building frequent sets to calculating conclusions.

Figure 2. Comparison of the Apriori, M-Apriori and B-Apriori approaches

While the original approach relies on building subsets within sets of elements, M-Apriori and B-Apriori are based on optimized intersection functions for integer and Boolean arrays in Java, respectively. Support and confidence are calculated accordingly, using conjunction (M-Apriori and B-Apriori) instead of disjunction (Apriori).

4. Results and discussion

The following section presents the results and discusses implications of different parameter settings.

4.1. Impact of SAT-Filter regarding additional constraints in ARM

The first result addresses the influence of integrating a SAT-Filter that considers additional product knowledge. Available knowledge is expressed as constraints formulated in CNF. Taking restrictive constraints into account potentially reduces the number of combinations that need to be checked, while adding the extra step of checking satisfiability. Whether applying the SAT-Filter pays off is therefore dependent on the case at hand. Figure 3 shows the total number of checked candidate-sets on a logarithmic scale over depth $k$ for four different levels of minsupp = 0.2, 0.4, 0.6, and 0.8, each with (true) and without (false) the SAT-Filter.

Figure 3. Chart of cumulated checked candidate-sets (minsupp, SAT)

A pairwise comparison at the same level of minsupp reveals that the share of candidate-sets still checked with the SAT-Filter declines relative to the run without it, which can be explained by the exponential growth of candidate-sets for rising $k$. The SAT-Filter therefore reduces the number of candidate-sets to be checked tremendously, especially for lower levels of minsupp or extensive data with high variance. As shown in Figure 3, not only is the total number of checked candidate-sets significantly lower with SAT and smaller for higher levels of minsupp, the maximum depth reached before ARM terminates is also remarkably lower for SAT and for higher levels of minsupp. Reducing the number of checked candidate-sets directly correlates with calculation time and makes analyses feasible, e.g. overnight.

4.2. Comparison for different levels of support and confidence

Another important decision is the level of minsupp and minconf, since it has the second biggest impact on the result. Support and confidence need to be calculated for each candidate-set individually; setting a proper threshold is therefore crucial for determining how many combinations are needed and when the algorithm will eventually terminate. Consequently, calculation time also depends on the level of support and confidence. Finding a proper threshold is multifactorial and context-dependent, but primarily a matter of the particular use case and the question asked. Setting a reasonable level for minsupp and minconf is crucial and should be considered carefully ("as high as necessary, but as low as possible"). Since each candidate-set needs to be checked, there is a linear correlation between checked candidate-sets and calculation time. Figure 4 shows the elapsed time, cumulated over depth $k$, for SAT (true) and No-SAT (false) on a logarithmic scale as well as for minsupp levels 0.2, 0.4, 0.6, and 0.8. Comparisons between the different levels of minsupp in the No-SAT group reveal a significant growth in calculation time for lower levels of minsupp. A comparable pattern is found for the SAT group. Consequently, the choice of minsupp and minconf has a significant impact on the number of checked candidate-sets and therefore on the time needed to calculate frequent feature-sets.

Figure 4. Chart of cumulated calculation time (minsupp, algorithm, SAT)

4.3. Benchmark of different Apriori, M-Apriori and B-Apriori implementations

To successfully introduce ARM into the portfolio optimization of variant-rich, feature-based configurations, an implementation of the Apriori algorithm is needed that is both runtime- and memory-efficient. For that reason, three different approaches have been implemented and tested, with and without SAT, at four different levels of minsupp = 0.2, 0.4, 0.6, and 0.8. The three implemented approaches comprise the original Apriori (Agrawal et al., 1993), the improved M-Apriori (Al-Maolegi & Arkok, 2014) based on integer arrays, and the novel B-Apriori based on Boolean arrays. All three approaches are implemented in Java and publicly available on GitHub (Footnote 1), along with the dataset described in Section 3.1. The efficiency metrics are runtime speed and small memory allocation. Since ARM is both runtime- and memory-intensive, an efficient implementation is crucial for its application. Figure 5 illustrates all three implementations using SAT (true) and No-SAT (false) with minsupp = 0.8 (top left), 0.6 (top right), 0.4 (bottom left), and 0.2 (bottom right).

Figure 5. Benchmark of different Apriori implementations (algorithm, SAT)

As a result, M-Apriori is faster than Apriori in every setting. B-Apriori is faster than M-Apriori for smaller numbers of candidate-sets to be checked but loses its runtime advantage for larger candidate-sets and even falls behind Apriori for minsupp = 0.4 and 0.2. For minsupp = 0.2, M-Apriori runs out of memory at k = 10 and fails to terminate properly.

To summarize the results with Apriori as the baseline: M-Apriori is faster but more memory-intensive, which becomes a problem for larger candidate-sets and eventually leads to a memory error. B-Apriori is even faster for small candidate-sets yet slower for larger sets, and it is the most memory-efficient, enabling it to handle even the largest datasets. This can be explained by the fact that the integer arrays take up more memory in RAM but become shorter over time, while the Boolean arrays keep the same length but take up very little memory per entry. All three implementations finished in less than one second for minsupp = 0.6 and 0.8 in combination with SAT, which makes those runs unrepresentative for illustration.

5. Conclusions

This paper proposes a data-driven product development approach that uses ARM for the optimization of variant-rich product portfolios. ARM is a well-studied Machine Learning technique that has been improved and applied many times across different domains over the last 30 years. This paper demonstrates the transfer of ARM from the original item-based approach used for transactions to a feature-based approach for variant-rich configurations in the development of mass customizable products. Since there is a linear correlation between calculation time and conducted checks, an obvious way to improve the analysis is to reduce the number of checks to be made and to make the calculation more efficient. Considering the satisfiability of additional constraints that describe dependencies between specified features is both possible and reasonable. Furthermore, using this type of apriori product knowledge for pre-screening candidate-sets dramatically reduces the number of candidate-sets to be checked, and therefore the time and memory needed to execute the analysis. On top of that, three different approaches have been implemented and validated for efficiency, recommending B-Apriori for small (runtime advantage) and very large datasets (memory advantage) and M-Apriori for mid-sized datasets before it runs out of memory. Since there is no clear evidence for the supremacy of a single algorithm, further research is needed to combine the advantages of memory-efficient Boolean arrays and integer arrays that decrease in length. All tests have been conducted on publicly available synthetic data of a generic product, and the code is available open source. The results are limited to the used dataset and the chosen parameter setup. Choosing a reasonable minsupp is crucial and heavily dependent on the research question and the underlying variance scheme. The analysed data is suited for benchmarking, yet still relatively small compared to real-world data.

Finally, identifying frequent feature-sets efficiently is the first step of an effective portfolio evaluation.

Mining conclusions from frequent feature-sets in the form of association rules supports product developers in data-driven decision making and can be included as a hybrid assistance system in the configuration process or in the early phase of architecture decisions in the development process.

Footnotes

1 Apriori-SAT-Filter on GitHub: https://github.com/SteffenHub/Apriori-SAT-Filter. Used data: data/input_data: cnfBuilder100VarsVariance9995700077, Decimal: 100Decimal, Sales: randomCarBuilder_result_500000

2 "SAT4J": The Boolean satisfaction and optimization library in Java: https://www.sat4j.org/

References

Agrawal, R., Imieliński, T. & Swami, A. (1993). Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD Record, Volume 22, Issue 2 (pp. 207-216). New York, NY, USA. https://doi.org/10.1145/170036.170072
Al-Maolegi, M. & Arkok, B. (2014). An improved Apriori algorithm for association rules. International Journal on Natural Language Computing (IJNLC), 3(1) (pp. 21-29). https://doi.org/10.5121/ijnlc.2014.3103
Benavides, D., Trinidad, P. & Ruiz-Cortés, A. (2005). Automated Reasoning on Feature Models. In: Pastor, O., Falcão e Cunha, J. (Eds.), Advanced Information Systems Engineering (CAiSE). Lecture Notes in Computer Science, Volume 3520 (pp. 491-503). Springer, Berlin, Heidelberg. https://doi.org/10.1007/11431855_34
Biere, A., Heule, M.J.H., van Maaren, H. & Walsh, T. (2021). Handbook of Satisfiability, Second Edition. Volume 336 of Frontiers in Artificial Intelligence and Applications. IOS Press. ISBN: 978-1-64368-160-3
Boudane, A., Jabbour, S., Sais, L. & Salhi, Y. (2016). A SAT-Based Approach for Mining Association Rules. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) (pp. 2472-2478). AAAI Press. https://dl.acm.org/doi/10.5555/3060832.3060967
Coquery, E., Jabbour, S., Sais, L. & Salhi, Y. (2012). A SAT-Based Approach for Discovering Frequent, Closed and Maximal Patterns in a Sequence. In: European Conference on Artificial Intelligence (ECAI-12). https://doi.org/10.3233/978-1-61499-098-7-258
Davis, M., Logemann, G. & Loveland, D. (1962). A machine program for theorem proving. Communications of the ACM, Volume 5, Issue 7 (pp. 394-397). https://doi.org/10.1145/368273.368557
Dlala, I.O., Jabbour, S., Sais, L. & Yaghlane, B.B. (2016). A Comparative Study of SAT-Based Itemsets Mining. In: Bramer, M., Petridis, M. (Eds.), Research and Development in Intelligent Systems XXXIII (SGAI 2016) (pp. 37-52). Springer, Cham. https://doi.org/10.1007/978-3-319-47175-4_3
Eén, N. & Sörensson, N. (2004). An Extensible SAT-solver. In: Giunchiglia, E., Tacchella, A. (Eds.), Theory and Applications of Satisfiability Testing (SAT 2003). Lecture Notes in Computer Science, Volume 2919 (pp. 502-518). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24605-3_37
Gebser, M., Guyet, T., Quiniou, R., Romero, J. & Schaub, T. (2016). Knowledge-based sequence mining with ASP. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16) (pp. 1497-1504). AAAI Press. https://dl.acm.org/doi/10.5555/3060621.3060829
Guns, T., Nijssen, S. & Raedt, L.D. (2011). Itemset mining: A constraint programming perspective. Artificial Intelligence, Volume 175, Issues 12-13 (pp. 1951-1983). https://doi.org/10.1016/j.artint.2011.05.002
Han, J., Pei, J. & Yin, Y. (2000). Mining Frequent Patterns Without Candidate Generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD Record, Volume 29, Issue 2 (pp. 1-12). Dallas, Texas, USA. https://doi.org/10.1145/342009.335372
Henriques, R., Lynce, I. & Manquinho, V. (2012). On When and How to use SAT to Mine Frequent Itemsets. https://arxiv.org/abs/1207.6253
Hidouri, A., Jabbour, S., Raddaoui, B. & Yaghlane, B.B. (2020). A SAT-Based Approach for Mining High Utility Itemsets from Transaction Databases. In: Song, M., Song, I.Y., Kotsis, G., Tjoa, A.M., Khalil, I. (Eds.), Big Data Analytics and Knowledge Discovery (DaWaK 2020). Volume 12393 (pp. 91-106). Springer, Cham. https://doi.org/10.1007/978-3-030-59065-9_8
Jabbour, S., Sais, L. & Salhi, Y. (2013). Boolean satisfiability for sequence mining. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13) (pp. 649-658). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2505515.2505577
Jabbour, S., Mana, F.E., Dlala, I.O., Raddaoui, B. & Sais, L. (2018). On Maximal Frequent Itemsets Mining with Constraints. In: Hooker, J. (Ed.), Principles and Practice of Constraint Programming. Volume 11008 (pp. 554-569). Springer, Cham. https://doi.org/10.1007/978-3-319-98334-9_36
Janota, M., Botterweck, G. & Marques-Silva, J. (2014). On lazy and eager interactive reconfiguration. In: Proceedings of the 8th International Workshop on Variability Modelling of Software-Intensive Systems (VaMoS '14), Article 8 (pp. 1-8). New York, NY, USA. https://doi.org/10.1145/2556624.2556644
Jiao, J. & Zhang, Y. (2005). Product portfolio identification based on association rule mining. Computer-Aided Design, Volume 37, Issue 2 (pp. 149-172). https://doi.org/10.1016/j.cad.2004.05.006
Jiao, J., Zhang, L., Zhang, Y. & Pokharel, S. (2008). Association rule mining for product and process variety mapping. International Journal of Computer Integrated Manufacturing, 21(1) (pp. 111-124). https://doi.org/10.1080/09511920601182209
Rossi, F., van Beek, P. & Walsh, T. (2006). Handbook of Constraint Programming. Elsevier Science Inc., New York, NY, USA. ISBN: 978-0-08046-380-3
Savasere, A., Omiecinski, E. & Navathe, S.B. (1995). An Efficient Algorithm for Mining Association Rules in Large Databases. In: Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95) (pp. 432-444). Morgan Kaufmann Publishers Inc. https://dl.acm.org/doi/10.5555/645921.673300
Schmidt, T., Marbach, S. & Mantwill, F. (2024). Generation of Rule-Based Variance Schemes Towards a Data-Driven Development of High-Variant Product Portfolios. In: Proceedings of the 26th International DSM Conference (DSM 2024) (pp. 79-88). Stuttgart, Germany. https://doi.org/10.35199/dsm2024.09
Sinz, C. (2005). Towards an Optimal CNF Encoding of Boolean Cardinality Constraints. In: van Beek, P. (Ed.), Principles and Practice of Constraint Programming - CP 2005. Lecture Notes in Computer Science, Volume 3709 (pp. 827-831). Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564751_73
Srikant, R. (1996). Fast algorithms for mining association rules and sequential patterns. Doctoral Thesis, University of Wisconsin. https://dl.acm.org/doi/10.5555/924822
Tseitin, G.S. (1983). On the Complexity of Derivation in Propositional Calculus. In: Siekmann, J.H., Wrightson, G. (Eds.), Automation of Reasoning. Symbolic Computation (pp. 466-483). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-81955-1_28
Uno, T., Asai, T., Uchida, Y. & Arimura, H. (2004). An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases. In: Suzuki, E., Arikawa, S. (Eds.), Discovery Science. Volume 3245 (pp. 16-31). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30214-8_2
Zaki, M.J. (2000). Scalable Algorithms for Association Mining. IEEE Transactions on Knowledge and Data Engineering, 12(3) (pp. 372-390). https://doi.org/10.1109/69.846291
Zhang, L., Madigan, C.F., Moskewicz, M.H. & Malik, S. (2001). Efficient conflict driven learning in a Boolean satisfiability solver. In: Proceedings of the 2001 IEEE/ACM International Conference on Computer-Aided Design (ICCAD '01) (pp. 279-285). IEEE Press. https://dl.acm.org/doi/10.5555/603095.603153