Machine learning in requirements elicitation: a literature review

Abstract
A growing trend in requirements elicitation is the use of machine learning (ML) techniques to automate the cumbersome requirement handling process. This literature review summarizes and analyzes studies that incorporate ML and natural language processing (NLP) into requirements elicitation. We answer the following research questions: (1) What requirement elicitation activities are supported by ML? (2) What data sources are used to build ML-based requirement solutions? (3) What technologies, algorithms, and tools are used to build ML-based requirement elicitation? (4) How can an ML-based requirements elicitation method be constructed? (5) What tools are available to support ML-based requirements elicitation methodology? Keywords derived from these research questions led to 975 records initially retrieved from 7 scientific search engines. Finally, 86 articles were selected for inclusion in the review. As the primary research finding, we identified 15 ML-based requirement elicitation tasks and classified them into four categories. Twelve different data sources for building a data-driven model are identified and classified in this literature review. In addition, we categorized the techniques for constructing ML-based requirement elicitation methods into five parts: Data Cleansing and Preprocessing, Textual Feature Extraction, Learning, Evaluation, and Tools. More specifically, 3 categories of preprocessing methods, 3 different feature extraction strategies, 12 different families of learning methods, 2 different evaluation strategies, and various off-the-shelf publicly available tools were identified. Furthermore, we discussed the limitations of the current studies and proposed eight potential directions for future research.


Introduction
Requirement elicitation is one of the important processes of product development. Several conventional requirement elicitation techniques, such as interviews, meetings, and brainstorming, are used to collect precise and individualized requirements. However, due to the ever-growing demands of end-users and the rapid pace of product iterations, relying on traditional methods alone to elicit requirements is insufficient.
The fourth industrial revolution is triggering a pervasive digital transformation in many fields of human activity. In particular, engineering is being transformed into "Digital Engineering" (Zimmerman, 2017; US DoD, 2018). In digital engineering, digital data and models will be shared across the engineering life cycle (US DoD, 2018); engineering artifacts and processes will be digitalized with a standardized digital representation, a unique identifier, and augmented metadata about their attributes, including provenance, thus making those digital artifacts machine-processable, uniquely identifiable, traceable, and accountable (Huang et al., 2020). The digital engineering transformation brings both opportunities and challenges for requirements elicitation.
The evolution of digital transformation has led to improved productivity, quality, and customer satisfaction through agile and robust big data collection, analysis, learning, and decision-making processes. Success stories, advancing technologies, and growing customer demands are why digitalization has become necessary in various fields. For example, in recent years, there has been a growing number of studies involving a digital transformation in requirement engineering, such as identifying requirements from documents (Wang et al., 2019), automatically classifying requirements (Casamayor et al., 2012), and prioritizing requirements (Maiti and Mitropoulos, 2017). By applying advanced technologies and shifting the existing process to a new digitized paradigm, it may be possible to address the insufficiency of traditional elicitation methods.
Traditionally, expert experience or intuition has been used to direct requirement elicitation activities. Each decision is based on a combination of implicit and explicit domain expertise (Maalej and Thurimella, 2013). A computer model that mimics expert reasoning with knowledge is expensive to construct and maintain. A data-driven strategy, unlike knowledge-based systems, does not require codifying the rules and knowledge for decision-making. The term data-driven refers to a decision-making strategy based on data analytics, interpretation, and prediction rather than pure intuition (Provost and Fawcett, 2013). Over the past 15 years, several studies have been published on the application of machine learning (ML) to requirements engineering, followed by reviews that summarize these studies (Meth et al., 2013; Wong et al., 2017; Lim et al., 2021). Unlike those existing studies, this literature review includes 86 studies from 2007 to the present and categorizes the included works from 7 perspectives: tasks, data collection, data cleansing and preprocessing, textual feature extraction, learning, evaluation, and open-source tools.
The rest of this paper is structured as follows. Section "Related works" summarizes literature reviews related to the present review. Section "Review methodology" presents the scope and methodology of the literature review, as well as the search strategies, the criteria for inclusion and exclusion, and the data extraction template. Section "Results" shows the primary results of the literature review. Section "Findings" summarizes the major findings from the review by analyzing the included works and categorizing them from seven different research concerns. Section "Open issues and future works" discusses the current role of ML in requirement elicitation and its limitations, together with the open issues and potential future works in this field. Section "Limitation of this review" discusses the potential threats to the validity of the review and the measures we took to address them. Finally, Section "Conclusion" concludes the paper.

Related works
To our knowledge, eight existing review articles, as shown in Table 1, are relevant to our study. Meth et al. (2013) conducted a review of automated approaches to requirements elicitation, mainly focusing on the degree of automation of the proposed approaches. Binkhonain and Zhao (2019) introduced ML algorithms in the requirements elicitation domain by dividing the 24 related articles into 3 sections: NLP techniques, ML algorithms, and evaluation. Perez-Verdejo et al. (2020) applied topic models and visualization techniques to analyze ML-based requirement classification articles. Wong et al. (2017) identified various software requirements elicitation methods, including manual, rule-based, and ML-based approaches. Shabestari et al. (2019) proposed a systematic literature review that covers early product development phases, including various activities such as requirements elicitation, requirement identification, and requirement categorization. Similarly, Sampada et al. (2020) focus on the early requirement phases but are more concerned with requirements elicitation and documentation. Ahmad et al. (2020) reviewed a collection of articles for identifying requirements for Q&A platforms.
Among the existing studies, the work by Lim et al. (2021), which was conducted almost concurrently with ours, is the closest to our research. Both works aim to introduce the current state of data-driven requirements elicitation; however, their focuses differ. Lim et al. (2021) focus more on data sources, data types, learning techniques, and degree of automation. In comparison, the present review focuses more on technical details, such as text features. Our work aims to provide a comprehensive overview of current work and includes a more detailed investigation into the types of requirement elicitation tasks, existing methods, algorithms, and tools. This review could provide a more practical guide for requirements elicitation researchers and engineers who want to leverage the existing techniques in their projects.

Review scope
The review adheres to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) reporting structure, aiming to answer the five research questions listed in the abstract. The review scope is defined in Table 2 according to Cooper's taxonomy for literature reviews, including focus, goal, perspective, coverage, organization, and audience (Cooper, 1988; Cooper et al., 2019). First, in this work, the emphasis is on practical solutions that can be applied; therefore, theoretical works are not our focus. Second, this study aims to synthesize and integrate existing studies to identify the specific requirement elicitation tasks supported by ML; thus, criticism of the field or related works is not a goal of this article. Third, this paper does not take an espousal perspective to advocate for or against ML-based requirements elicitation. Instead, it demonstrates how the existing work converts requirements elicitation challenges into ML problems. Fourth, the coverage of the literature is a nonexhaustive set of research articles that are retrieved by search queries and filtered by inclusion and exclusion criteria. Fifth, the work applies a methodological organization that groups similar methodologies or tools together, presenting a modular organization to the target audience. Finally, the targeted audiences are mainly requirements analysts, engineers, and scholars.

Databases and search strategies
Seven bibliographic databases, including Scopus, Web of Science, Google Scholar, IEEE Xplore, Springer Link, ACM digital library, and ASME digital library, are adopted to guarantee the coverage of the review. Three search strategies are applied: (1) the query expanding strategy is used to add synonyms, inflectional, and derivational morphological forms to the original term; (2) a wildcard character is used to capture multiple forms of a keyword by replacing one or more characters with a star symbol (*) or a question mark (?); and (3) a query scoping strategy is applied when the search term is too general to retrieve a related result, such as adding terms "system engineering" or "requirement engineering" in addition to the search string.
Based on the above consideration, the search strings are defined as follows: ("Requirement" OR "Demand" OR "Need") AND ("Elicit" OR "Collect" OR "Gather" OR "Detect" OR "Identify" OR "Classify") AND ("ML" OR "Machine Learning" OR "Deep Learning" OR "Text Mining" OR "Data Mining" OR "NLP" OR "Natural Language Processing" OR "Neural Network" OR "Automated" OR "Data-driven" OR "Decision Support").
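The structure of the search string (terms joined by OR within a group, groups joined by AND) can be reproduced programmatically, which may help when adapting it to engines with different syntax. The following is a minimal sketch; the helper name build_query and the variable TERM_GROUPS are hypothetical, and only the terms themselves come from the search string above.

```python
# Term groups copied from the search string above.
# TERM_GROUPS and build_query are illustrative names, not part of the review protocol.
TERM_GROUPS = [
    ["Requirement", "Demand", "Need"],
    ["Elicit", "Collect", "Gather", "Detect", "Identify", "Classify"],
    ["ML", "Machine Learning", "Deep Learning", "Text Mining", "Data Mining",
     "NLP", "Natural Language Processing", "Neural Network", "Automated",
     "Data-driven", "Decision Support"],
]


def build_query(groups):
    """Join the quoted terms of each group with OR, then join the groups with AND."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


query = build_query(TERM_GROUPS)
print(query)
```

Engine-specific operators, such as the "NEAR" operator of Web of Science, would have to be added on top of this generic form.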
Because search functions differ across the seven academic search engines, the actual search patterns may differ slightly. For example, the Web of Science engine supports additional Boolean operators, such as the "NEAR" operator, which provides additional restrictions on the "AND" operator by considering a fixed-size context window.

Inclusion exclusion criteria and paper screening
The next step of the literature review is selecting studies by screening the title, abstract, and full text of the works found in the previous steps. We applied the inclusion/exclusion criteria in Table 3.

Data extraction table
Research information was collected from each included article with a data extraction form. Basic information about the study (author, title, year of publication, etc.) and content related to research interests (data source, preprocessing, feature extraction, etc.) was collected. This includes 14 data elements described in Table 4. The required data fields are designed as open questions, which require reviewers to collect, summarize, and categorize data from the collected works.
Based on the search strategies applied on the 7 included scientific search engines, 975 papers were retrieved. Upon initial screening and the title screening, 915 works were forwarded to the title-abstract screening. A subset of 774 was irrelevant and thus discarded. As a result, 129 papers were retained for the full-text screening. In accordance with the inclusion-exclusion criteria, 43 articles were excluded, and finally, 86 articles were selected. The complete process of study selection is shown in Figure 1.

Tendencies of the publications
The overall trend for the 86 articles is shown in Figure 2, and an increasing trend can be observed. The reviewed studies came from 30 different countries, and 17 of them were conducted in more than one country. The average number of publications in each country is 3.8, with seven countries having more publications than the average, which are the United States (n = 28, 25.0%), China (n = 14, 12.5%), Germany (n = 13, 11.6%), Canada (n = 8, 7.1%), Singapore (n = 5, 4.5%), South Korea (n = 5, 4.5%), and the United Kingdom (n = 5, 4.5%). Thirty-nine of the studies are conference papers (n = 39, 45.3%) and 31 are journal papers (n = 31, 36.0%). In addition, eight workshop papers (n = 8, 9.3%) and eight book sections (n = 8, 9.3%) are included. The included conference papers are collected in 23 unique conference proceedings, with 16 works appearing in the Proceedings of the IEEE

The data source for ML-based requirement elicitation
The requirement specification (RS) is a textual document that systematically stores system definition and system requirements (Bourque and Fairley, 2014). In the reviewed articles, 17 works applied available requirement specifications to build ML-based solutions to support requirement elicitation. The majority (n = 12) of the requirement specifications in the selected studies are written in English, two are bilingual (Ko et al., 2007; Lyutov et al., 2019), and three are written in languages other than English (Falessi et al., 2010; Ott, 2013; Gulle et al., 2020).
A total of 27 works are based on already existing corpora, mainly DePaul's NFR corpus (Cleland-Huang et al., 2007), the SecReq dataset (Knauss et al., 2011), and the PURE dataset (Ferrari et al., 2017). DePaul's NFR corpus and the SecReq dataset are labeled datasets for specific tasks, whereas the PURE corpus is unlabeled and contains multiple raw requirement documents.

Requirement representations and features
The most frequently applied textual feature in the included works is the Bag-of-Words (BOW) language model (n = 49), which is an effective and efficient method to convert text into a numerical format. Apart from the BOW model, various grammatical features are introduced by the selected papers, including POS n-gram (Kurtanovic and Maalej, 2017b), the frequencies of POS (Noun/Verb/Adj/Adv/Modal) (Hussain et al., 2008; Liu et al., 2013; Kurtanovic and Maalej, 2017a), the frequency of keywords (Halim and Siahaan, 2019), and the number of syntax sub-trees (Kurtanovic and Maalej, 2017a, 2017b; Dalpiaz et al., 2019).
Some statistics of sentences are applied to represent text, such as the number of characters (Abualhaija et al., 2019), the number of words (Kurtanovic and Maalej, 2017b), the number of sentences (Qi et al., 2016), the number of paragraphs (Parra et al., 2015), and the number of words per sentence (Ormandjieva et al., 2007).
Metadata of UGC is used by several works to make requirement representations more informative, such as the average star ratings (Maalej et al., 2016) and the total number of reviews (Martens and Maalej, 2019). In addition, platform metadata about users, such as the total number of reviews or ratings a user has submitted (Martens and Maalej, 2019) and the platform level of the user (Qi et al., 2016), is also included in feature construction.
Domain knowledge is another source of supporting information for representing requirements. It is reflected by domain-specific terms, for example, the number of design terms (Parra et al., 2015) and the number of keywords from the domain (Hussain et al., 2008; Stanik et al., 2019).
Among the included works, a large proportion of recent works have used word embedding techniques to represent requirements. Word2vec (Mikolov et al., 2013), FastText (Joulin et al., 2017), and BERT (Devlin et al., 2019) are the most widely applied embedding models across the included works (n = 14, 16%).

Machine learning techniques
Most of the selected works (n = 67) applied classification algorithms for classifying textual documents. Naïve Bayes (n = 33), Support Besides, the reviewed articles present a variety of other supervised machine learning algorithms including logistic regression (n = 14), K-nearest neighbors (n = 8), and random forest (n = 5).

Model evaluation methods
The evaluation metrics are employed differently in supervised and unsupervised approaches due to the mechanical differences between their learning methods. A manually annotated data corpus is used to train and validate supervised ML algorithms. Hence, comparing machine predictions with actual values on a labeled dataset is a simple, straightforward way to evaluate a learning algorithm. For regression models, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are common error functions to reflect the accuracy of regression methods (Chai and Draxler, 2014). Common metrics for classification tasks are precision, recall, accuracy, F-score, and area-under-curve (AUC) (Chai and Draxler, 2014). Precision (n = 58), recall (n = 56), and F1 score (n = 49) are the metrics most often applied by the included works to evaluate a supervised classifier. Due to the differences in data and research questions, it is difficult to compare the included works.
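The metrics named above follow standard definitions and can be computed in a few lines. The sketch below is illustrative only: the function names and the toy labels and scores are assumptions made for this example and do not come from any reviewed study.

```python
import math


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class (standard definitions)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large errors more strongly than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy classification example: 1 = requirement, 0 = non-requirement (made-up data).
labels = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(labels, predictions)

# Toy regression example: annotated vs. predicted helpfulness scores (made-up data).
true_scores = [4.0, 2.0, 5.0]
predicted_scores = [3.0, 2.0, 7.0]
print(precision, recall, f1, mae(true_scores, predicted_scores), rmse(true_scores, predicted_scores))
```

In this toy case there are three true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 0.75.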

Tools
Most of the included works are built upon existing open-access tools and libraries. Scikit-learn and the Waikato Environment for Knowledge Analysis (Weka) are the two most popular ML tools mentioned in the included articles (Hall et al., 2009; Pedregosa et al., 2011). Seventeen works applied Scikit-learn to build different kinds of algorithms such as Naïve Bayes, Support Vector Machine, and Random Forest. Another popular tool in the reviewed articles is Weka, with 19 articles reporting that they applied Weka for building their solutions. Both Scikit-learn and Weka provide ready-to-use learning algorithms and numerous utilities for preprocessing and feature extraction. For example, Wang et al. (2018a) used the StringToWordVector package from Weka to produce TF-IDF word vectors. In contrast, Dekhtyar and Fong (2017) applied TfidfVectorizer from the Scikit-learn library for the same purpose.
For natural language processing (NLP), the most popular tool is the Natural Language Toolkit (NLTK), a Python library designed specifically for human language processing (Loper and Bird, 2002). The NLTK library is applied in selected papers for numerous preprocessing and feature extraction tasks, such as tokenization (Rahman et al., 2019), sentiment analysis (Noei et al., 2021), part-of-speech tagging (Halim and Siahaan, 2019), lemmatization (Guzman and Maalej, 2014), and stemming (Jha and Mahmoud, 2019). For POS and dependency parsing tasks, tools from the Stanford NLP group are mentioned, such as Stanford parser, CoreNLP, and POS tagger. TensorFlow and its high-level wrapper Keras are the most often used neural network libraries in the listed studies.
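As an illustration of the TfidfVectorizer usage mentioned above, the following minimal sketch converts a few requirement sentences into TF-IDF vectors. The example sentences and parameter choices are assumptions made for illustration; they do not reproduce the configuration of Dekhtyar and Fong (2017).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up requirement sentences used only for illustration.
requirements = [
    "The system shall encrypt all stored passwords.",
    "The system shall respond to a search query within two seconds.",
    "Users shall be able to reset their passwords by email.",
]

# Lowercase the text and drop common English stop words before weighting terms.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(requirements)

# One row per requirement, one column per vocabulary term.
print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

The resulting sparse matrix can be passed directly to Scikit-learn classifiers such as Naïve Bayes or a Support Vector Machine.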

Findings
The present work is a literature review that collects, reviews, analyzes, synthesizes, and reports existing works based on the PRISMA methodology, with the aim of providing the audience with summarized knowledge of ML-based requirement elicitation. The findings from the literature review are discussed in this section. The section is organized according to the order of our research questions. In addition, the articles included in this review are categorized according to the different perspectives on the research questions. A summary of our categorization is illustrated in Figure 3.

What requirements elicitation activities are supported by ML?
After analyzing the selected 86 papers in depth, 15 different ML-based requirement elicitation tasks are identified (as shown in Figure 4). The identified tasks can be categorized into four main categories, which are Preparation, Collection, Inspection, and Negotiation.
Preparation refers to a set of activities that engineers must undertake before the elicitation of requirements to ensure that the process is supported by sufficient knowledge. A total of five articles propose methods to extract design knowledge from textual documents. For example, Liu et al. (2007) proposed an SVM-based design knowledge acquisition framework that can collect research articles according to an organizational design knowledge taxonomy.
In addition, extracting user preferences, requests, and complaints from massive UGC is also considered a Preparation task. ML-based text mining algorithms are used to extract useful information from UGC, providing engineers with insights and knowledge about the target product. For example, Maalej et al. (2016) proposed a supervised method to automatically classify user app reviews into four predefined categories: user experience, bug report, feature request, and ratings. Liu et al. (2013) present a regression model that enables engineers to estimate the usefulness of customer reviews. UGC helpfulness analysis helps determine whether users' feedback is constructive. However, evaluating usefulness is a subjective activity that often entails a viewpoint. In a data-driven approach, the annotators represent the viewpoint. This review identifies two perspectives: the designer perspective (Liu et al., 2013; Qi et al., 2016) and the consumer perspective (Chen et al., 2016).
Stakeholder preference (or tendency, rationale) is another activity categorized as Preparation. Since UGC is the cumulative contribution of users over some time, it incorporates their preferences and emotions about the product, product functions, and product features. For example, combining LDA and sentiment analysis techniques can help engineers to explain which features of the product are loved by users (Guzman and Maalej, 2014; Zhou et al., 2020), and which product characteristics cause the most dissatisfaction (Fu et al., 2013).
The second group of tasks is Collection, which includes tasks related to directly extracting requirements or identifying specific types of requirements from a given collection of documents. In the selected articles, all ML-based solutions in this category are supervised methods. The first type of collection task is requirement identification, which refers to the activity of determining whether a given sentence or paragraph is a user requirement. For example, Kengphanphanit and Muenchaisri (2020) proposed a requirement identification framework named ARESM, which can distinguish whether a given text is a requirement or non-requirement.
Requirement classification is another task in the Collection category. The objective of this task is to categorize the given requirements based on a certain concern. For example, Hussain et al.
The Inspection and Negotiation tasks could happen at any stage during a requirement engineering process. Inspection refers to the ML-based methods applied to inspect and assure the quality and validity of the requirements. The Inspection category includes equivalent requirement detection (Falessi et al., 2010), requirement quality support (Ormandjieva et al., 2007; Parra et al., 2015), requirement dependency analysis (Deshpande et al., 2019), and fake review detection (Martens and Maalej, 2019). The Negotiation category includes activities to support resolving requirement-related conflicts, and three types of tasks were identified under this category. An SVM classifier was used by Khelifa et al. (2018) to automatically classify users' change requests into functional change and technical change, thereby assisting project managers to negotiate requirements and make appropriate decisions. Lyutov et al. (2019) presented a supervised learning-enabled workflow that automatically routes customer requirements to the corresponding department to facilitate the process of requirement negotiation. Moreover, an ML-based software refactoring recommendation method is proposed to assist decision-makers in deciding which major update should be applied according to customers' requests (Nyamawe et al., 2019).

What data sources are used to build ML-based requirement elicitation solutions?
Based on an in-depth analysis of included studies, we found that current studies heavily rely on three types of data sources: Textual Documents, UGC, and Existing Requirement Datasets (Fig. 5 the dataset contains 358 FRs and 326 NFRs from 15 different RS. Follow-up studies applied the DePaul NFR dataset to build binary classifiers to distinguish between FR and NFR (Hussain et al., 2008; Canedo and Mendes, 2020), or multi-class classifiers to assign requirements to finer categories (Abad et al., 2017; Rahman et al., 2019).
SecReq is another publicly available requirement dataset, which was created to assist in the early stages of security requirement elicitation (Houmb et al., 2010). The dataset contains three projects, which are Electronic Purse, Customer Premises Network, and Global Platform Specification. The three projects contain 511 requirements that are tagged as security-related requirements (sec) or non-security-related requirements (non-sec). Three works trained and tested their data-driven requirement elicitation methods with the SecReq corpus (Dekhtyar and Fong, 2017; Li, 2018; Kobilica et al., 2020).
The PURE dataset has 79 requirement specifications including about 35,000 sentences with an average length of 15 words (Ferrari et al., 2017). Unlike the two datasets described previously, PURE is not labeled; rather, the authors made it open for a variety of applications. Deshpande et al. (2019) studied requirement dependencies with the PURE corpus, and EzzatiKarami and Madhavji (2021) merged the DePaul NFR and PURE datasets to construct a bigger training set for their study.
User-generated content (UGC) is another important source for data-driven requirements elicitation. Research shows that the needs of system users are hidden in rich UGC, such as user feedback, social networks, online software market reviews, and product discussion forums (Maalej et al., 2015, 2016; Lu and Liang, 2017; Perini, 2018). UGC contains any form of data generated by users, like numerical ratings, textual product reviews, and videos. In total, half of the included studies (n = 43) applied UGC to build their ML-based solutions. The UGC sources include mobile application platform user reviews (Apple App Store and Google Play Store), e-commerce user reviews (Amazon and other online retailers), social media (Twitter and Facebook), and crowdsourcing platforms.

What technologies, algorithms, and tools are used to build ML-based requirement elicitation?
This subsection answers RQ3 and RQ4. Our study identified the technical approaches and algorithms used by the included studies and divided them into three categories: Textual Data Cleansing and Preprocessing, Textual Features Extraction, and Machine Learning (ML) (Fig. 6). The ML models are evaluated by two strategies, which are Manual evaluation and Metrics-based evaluation. In addition, we categorized the open-source tools identified from the reviewed articles into two categories: ML tools and NLP tools.

Textual data cleansing and preprocessing
Twenty different techniques were identified from the included papers specifically for cleaning and preparing data, which we categorized under the Textual Data Cleansing and Preprocessing category. In addition, due to the functional features of these
Tokenization is a procedure to break a given sequence of text down into smaller parts, such as breaking a document into sentences (sentence tokenization) or breaking a sentence into individual words (word tokenization).
Text filtering is a group of preprocessing methods that aim to eliminate redundant, erroneous, non-representative, inconsistent, and ineligible data from a text document. In the reviewed articles, techniques including stopword removal, rare word filtering, non-English word removal, URL removal, special character handling, empty value handling, punctuation removal, emoticon handling, non-informative/irrelevant word removal, and inconsistent information removal are considered under this classification.
Text normalization aims to transform a text sequence into a standard form to reduce its randomness. Stemming and lemmatization are the most common text normalization methods. In a document, a word has various forms, and some of these forms can be converted to one another by adding or removing a prefix or suffix (Manning et al., 2008). Stemming is a crude heuristic procedure that removes the tails from words to get word stems, which are the fundamental word units; for example, a stemmer such as Porter's reduces the word requirements to the stem requir (Manning et al., 2008). In comparison, lemmatization yields the basic dictionary form of a word. For example, the lemmatization of requirements will yield requirement. Case folding is another popular text normalization approach that changes all letters in a word into lower case (Manning et al., 2008). In addition, slang translations, abbreviation translations, typo corrections, and acronym substitutions are considered text normalization procedures since they convert text into a more generic form.
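The three groups of preprocessing steps described above (tokenization, text filtering, and text normalization) are often chained into a small pipeline. The sketch below uses only the Python standard library; the stopword list, the regular expressions, and the function names are simplifying assumptions made for illustration, and it covers case folding but not stemming or lemmatization, for which the reviewed studies typically rely on libraries such as NLTK.

```python
import re

# Tiny illustrative stopword list; real studies use much larger lists.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "it", "this", "i"}
URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def sentence_tokenize(text):
    """Naive sentence tokenization: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def preprocess(sentence):
    """Apply case folding, URL removal, word tokenization, and stopword removal."""
    lowered = sentence.lower()                  # text normalization: case folding
    no_urls = URL_PATTERN.sub(" ", lowered)     # text filtering: URL removal
    tokens = TOKEN_PATTERN.findall(no_urls)     # word tokenization, drops punctuation
    return [t for t in tokens if t not in STOPWORDS]  # text filtering: stopwords


review = ("The app crashes on login! Please fix it, see https://example.com/bug. "
          "I love the new UI.")
sentences = sentence_tokenize(review)
cleaned = [preprocess(s) for s in sentences]
print(cleaned)
```

The order of the steps matters: URLs are removed before word tokenization so that fragments of the address do not end up as tokens.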

Textual features extraction
Textual Features Extraction includes a set of techniques to convert natural text into numbers.We found three major textual data representation strategies from the reviewed articles: Bag-of-word, Rule-based, and Embedding features.The Bag-of-word considers a sequence of text as a set (or multi-set) of the word regardless of word order and grammar (Manning and Schütze, 1999).Various BOW representation strategies can be found in the included works, such as using simple raw counts for words, a bag of bigram or trigram (Kurtanovic and Maalej, 2017a), and BOW with TF-IDF weighting (Li et al., 2018).
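To make the BOW idea concrete, the following sketch builds raw-count bags of unigrams and bigrams and turns them into fixed-length vectors over a shared vocabulary. The function names and the two example sentences are assumptions made for illustration, and whitespace splitting stands in for a proper tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, n_values=(1, 2)):
    """Count unigrams and bigrams; word order beyond the n-gram window is discarded."""
    tokens = text.lower().split()
    bag = Counter()
    for n in n_values:
        bag.update(ngrams(tokens, n))
    return bag


def to_vector(bag, vocabulary):
    """Map a bag onto a fixed vocabulary order; absent terms get a count of 0."""
    return [bag[term] for term in vocabulary]


documents = ["the system shall log errors", "the system shall encrypt the log"]
bags = [bag_of_ngrams(d) for d in documents]
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(b, vocabulary) for b in bags]
print(vocabulary)
print(vectors)
```

Replacing the raw counts with TF-IDF weights leaves the vector layout unchanged and only rescales the entries.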
In addition to BOW features, studies included in this review also applied rule-based handcraft features, such as POS n-gram (Kurtanovic and Maalej, 2017b), the number of Noun/Verb/Adj/ Adv/Modal (Hussain et al., 2008;Liu et al., 2013;Kurtanovic and Maalej, 2017a), frequency of POS of the keywords (Halim and Siahaan, 2019), and the count of syntax sub-tree (Kurtanovic andMaalej, 2017a, 2017b;Dalpiaz et al., 2019).In addition, textual descriptive statistics are also applied to represent requirements, including the number of characters (Abualhaija et al., 2019), word count (Kurtanovic and Maalej, 2017b), sentence count (Qi et al., 2016), paragraphs count (Parra et al., 2015), and the number of words per sentence (Ormandjieva et al., 2007).Furthermore, temporal features including verb tense (Stanik et al., 2019), the number of elapsed days (Liu et al., 2013), and temporal tags, such as time, duration, and time set (Abad et al., 2017), were used to represent the temporal information of the requirements.For UGC-based research, some platforms provide metadata that can be extracted to represent user comments.Metadata features include star ratings (Maalej et al., 2016), review count (Martens and Maalej, 2019), and the number of links (Parra et al., 2015).
Moreover, some studies applied document quality features to represent textual requirements, including the number of subjective/objective sentences in a review (Liu et al., 2013), the number of ambiguous expressions in a requirement (Ormandjieva et al., 2007;Parra et al., 2015), and the number of the sentence referring product feature appeared in a user review (Liu et al., 2013;Qi et al., 2016).Additionally, some articles introduce domainspecific features, such as the number of design terms (Parra et al., 2015) and the number of keywords from the input text (Hussain et al., 2008;Stanik et al., 2019).
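Several of the handcrafted features listed above are simple counts that can be derived without any NLP library. The sketch below computes a few textual descriptive statistics and a domain keyword count; the feature names, the regular expressions, and the keyword set DESIGN_TERMS are hypothetical choices for illustration rather than the definitions used in the cited studies.

```python
import re

# Hypothetical domain keyword list used only for this example.
DESIGN_TERMS = {"interface", "layout", "button", "mode"}


def descriptive_features(text):
    """Compute simple count-based features for one requirement or review."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z0-9']+", text)
    return {
        "n_characters": len(text),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "n_design_terms": sum(1 for w in words if w.lower() in DESIGN_TERMS),
    }


features = descriptive_features("The app is slow. Please add a dark mode!")
print(features)
```

Such count features are usually concatenated with BOW or embedding vectors before training a classifier.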
In recent years, word embedding has gained popularity in a range of NLP applications. The selected articles used a range of embedding techniques, including Word2vec (Mikolov et al., 2013), FastText (Joulin et al., 2017), GloVe (Pennington et al., 2014), and BERT (Devlin et al., 2019), to represent words. Three strategies associated with embedding features are identified in the included studies: training the embedding from scratch, using a pre-trained embedding, and fine-tuning previously trained language models.

Machine learning
In this review, the learning algorithms applied by the included studies fall into two categories: supervised and unsupervised learning. Within supervised learning, only three studies applied regression models (Liu et al., 2013; Chen et al., 2016; Qi et al., 2016). These regression methods help engineers predict a numerical value that reflects the helpfulness of a given user review. The remaining supervised methods are all classification algorithms. Topic modeling and clustering are two frequently applied unsupervised learning techniques, and LDA is the most widely applied unsupervised method in the included papers.
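To make the supervised classification setting concrete, the following minimal sketch trains a multinomial Naive Bayes classifier with add-one smoothing to separate functional from non-functional requirements. It is not the method of any particular reviewed study; the class name, the four training sentences, and their labels are hypothetical, and a realistic model would need a much larger labeled dataset.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add-one (Laplace) smoothing avoids zero probabilities for unseen words.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled requirements (FR = functional, NFR = non-functional).
train_texts = [
    "the system shall allow users to upload files",
    "the user shall be able to reset the password",
    "the system shall respond within two seconds",
    "the page shall load within one second",
]
train_labels = ["FR", "FR", "NFR", "NFR"]

classifier = NaiveBayesTextClassifier().fit(train_texts, train_labels)
functional_guess = classifier.predict("users can upload files")
quality_guess = classifier.predict("the page shall respond within one second")
```

On this toy data the first query is assigned to the FR class and the second to the NFR class, because their words occur more often in the respective training sentences.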

Evaluation methods
The quality of models is reflected in evaluation metrics, a set of formulas and units of measurement that indicate how well a learning algorithm performs (Hossin and Sulaiman, 2015). Different types of learning tasks call for different evaluation methods. In the included studies, the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are employed for regression models. Both MAE and RMSE are negatively oriented, which means that the better the model, the lower the error. Precision, Recall, and F1 score are most frequently applied for classification models. Unsupervised methods, on the other hand, are evaluated using two strategies: internal and external evaluation. For internal evaluation, the included works applied intra- and inter-cluster similarity (or distance), Silhouette score, and perplexity to assess the clustering results. For external evaluation, domain experts are asked to evaluate the models' results manually. Additionally, a truth set can be built to evaluate the clustering results in the same way as a supervised classifier.
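The regression and classification metrics named above follow standard definitions and can be sketched in a few lines. The function names and the example values below are illustrative assumptions, not data from the reviewed studies; precision, recall, and F1 are computed for a single positive class.

```python
import math


def mean_absolute_error(y_true, y_pred):
    # MAE: average absolute difference between true and predicted values.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    # RMSE: square root of the average squared difference.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical helpfulness scores for a regression model.
mae = mean_absolute_error([3.0, 4.0, 5.0], [2.0, 4.0, 7.0])
rmse = root_mean_square_error([3.0, 4.0, 5.0], [2.0, 4.0, 7.0])

# Hypothetical FR/NFR labels for a classifier, with NFR as the positive class.
true_labels = ["NFR", "NFR", "FR", "FR", "NFR"]
predicted_labels = ["NFR", "FR", "FR", "NFR", "NFR"]
precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels, positive="NFR")
```

In this example the absolute errors are 1, 0, and 2, so the MAE is 1.0 and the RMSE is the square root of 5/3; the classifier yields two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.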

Available tools
The included studies widely mention two types of tools: ML tools and NLP tools. NLP tools such as NLTK and CoreNLP are applied to preprocess the textual data and extract features from it. The most widely mentioned ML tools are Weka and Scikit-learn, which integrate multiple ML algorithms and make it possible to build a data-driven solution quickly. Keras is a popular deep learning library among the included studies; it provides the most popular neural network architectures through compact and straightforward APIs.
Table 5 lists the tools mentioned in the reviewed articles, arranged by their uses. In total, seven types of tools are extracted: ML tools (conventional), deep learning tools, language and topic model tools, multi-purpose NLP tools, sentiment analysis tools, parsing tools, and single-purpose tools.

How to construct an ML-based requirements elicitation method?
According to the included studies, building an ML-based requirements elicitation method involves four major steps: study design, data preparation, model construction, and model implementation. The first step is to design the ML-based requirements elicitation study by considering two fundamental elements: the requirements elicitation subtasks to be addressed and the available datasets. In this literature review, we identify four major tasks and three types of data sources that support requirements elicitation from the reviewed articles.
Different ML-based requirement elicitation tasks require different datasets and data annotation strategies, so studies should be designed accordingly. Both requirement documents and UGC data were used in the selected research. Through a detailed review of the included papers, we found that requirements are usually stored in plain text format and that the corresponding tasks mostly focus on requirement text classification. UGC data, however, contains additional metadata that describes the data in many respects, such as ratings and timestamps. With these additional data, researchers can design studies such as predicting the usefulness of user reviews (Liu et al., 2013) and analyzing user preferences over time (Fu et al., 2013).
In addition, sentiment is another factor that differentiates requirement document analysis from UGC analysis. Documents describing requirements are usually written in neutral language; therefore, analyzing the sentiment of each requirement may not be as significant as analyzing the sentiment of UGC. As a result, sentiment analysis does not appear in the requirement document analysis tasks but is commonly used in UGC-based research.
With the task and dataset defined, the next step is to construct an ML pipeline, which relies heavily on an understanding of machine learning and NLP techniques. Data cleansing, data preprocessing, feature extraction, model training, and model evaluation are all part of this phase. Although the model construction pipeline can be independent of domain knowledge for unsupervised learning, domain expertise is still necessary to validate and evaluate the models. Finally, model implementation is an important last step in building an ML-based requirements elicitation method. Multiple aspects must be considered, such as organizational culture, management, security, development, and operation procedures.

Open issues and future works
It is important to note that eliciting requirements is not a single activity; rather, it comprises multiple sessions and operations that work together as a whole. However, neither academia nor industry offers a detailed definition of, or a uniform approach to, this stage. For example, Young (2004) suggested a twenty-eight-step checklist of requirement gathering activities, including planning, managing, collecting, reviewing, and tracing. Wiegers and Beatty (2013) summarize 21 best practices for requirements elicitation, including defining scope, identifying stakeholders, reusing existing requirements, modeling the application environment, and analyzing requirements feasibility. A single ML model cannot accomplish so many different tasks; ML techniques can therefore cover only part of the tasks involved in requirement elicitation. Furthermore, most of the included studies focus on resolving a particular task with ML rather than designing a complete system that supports requirement elicitation. In this regard, most of the ML-based methods developed so far play a supporting or complementary role to traditional methods. For example, in an ML-aided requirement elicitation system, conventional methods, such as interviews, questionnaires, and brainstorming, are responsible for producing and collecting requirement-related data, whereas ML algorithms are responsible for analyzing the data or supporting follow-up data-related activities.
In Section "What requirements elicitation activities are supported by ML?", we summarized 15 ML-based requirements elicitation subtasks from the included studies and categorized them into four groups. Most works were classified as Preparation (n = 37) and Collection (n = 41) tasks, and only eight articles were identified as Validation (n = 5) and Negotiation (n = 3) tasks. One reason is that validation and negotiation are hard to articulate because of the high complexity of the tasks. For example, Negotiation tasks require collaboration, discussion, and trade-offs among stakeholders on many aspects. Most of the challenges in these tasks therefore stem from background knowledge, communication, budgets, or other limitations imposed by the real world. As a result, it is difficult to model these tasks correctly.
It is still challenging to build an ML-based solution that fully automates requirement elicitation. First, since requirement elicitation is a comprehensive process composed of a variety of tasks and goals, it is difficult to develop an end-to-end ML model that automates the whole process. Second, requirements can come from a large variety of sources, particularly in the big data era. In terms of data type and format, the datasets in the included studies were highly heterogeneous. For example, sentiment analysis may be useful when analyzing UGC data, but it is of little value when analyzing neutral document data. Hence, a model designed specifically for UGC, such as ASUM (Jo and Oh, 2011), cannot be expected to perform well on document data, and vice versa. Third, the ML-based requirement elicitation approach is automatic but easily affected by errors and failures. Unlike rule-based systems, which can be debugged and fixed locally in the coded knowledge body, an ML model is difficult to tune directly when dealing with known errors. In addition, the interpretation of ML models is still an open challenge in academia and industry. For example, deep neural networks learn features automatically, which makes it more challenging to analyze the reasons behind ML-based solutions. Furthermore, only a few studies considered the changing nature of requirements. In practice, because requirements are dynamic, requirement elicitation requires engineers to identify and modify requirements in response to unpredictable user needs (Xie et al., 2017). Finally, in terms of both content and type of task, the current research lacks variety: the vast majority of studies still focus on classification and clustering.
To tackle these challenges, we suggest the following future research directions. First, although there is growing interest and work in building ML-based requirement elicitation methods, a systematic guide on how to integrate ML-based components into the requirement elicitation framework is still missing. Multiple aspects of the integrated system should be considered, such as how humans and machines interact in requirements elicitation, what the inputs and outputs of the system and each subsystem are, and which specific tasks should be performed by machines when expert involvement is required. Hence, systematic study and guidance on AI system design, engineering, implementation, and management are required.
Second, there is a lack of in-depth analysis of failures and errors in ML-based requirement elicitation. For example, research papers and projects typically rely on statistical metrics for ML model validation and evaluation. This type of evaluation can tell us how good or bad a model is, but it does not address the question of what leads a model to perform unexpectedly. Future studies should address this issue by introducing methods and techniques to explore the factors that affect the performance of ML-based requirement elicitation.
Third, ML-based methods, especially deep learning models, lack transparency. Deep neural networks derive their features from data rather than from experience and knowledge, which is more effective but less intuitive. Since requirement elicitation is a knowledge-intensive, human-involved activity, engineers expect models not only to solve the problems but also to explain them. The significance of Explainable AI (XAI) has increased along with the widespread adoption of deep learning methods in recent years (Xu et al., 2019). In the future, research on ML-based requirements engineering will also need to leverage XAI techniques and methods to investigate the nature of decision-related requirements.
Fourth, a broad range of NLP tasks could be incorporated into requirements elicitation. Apart from text classification, many other NLP techniques can be utilized to support requirements elicitation, such as neural summarization, text generation, neural conversational bots, question asking, question answering, and text to speech. Because it spans such a wide range of tasks, requirements elicitation provides an excellent opportunity to practice cutting-edge NLP methods, and future research should make greater efforts to incorporate them. As an example, neural text generation technologies such as Seq2Seq (Sutskever et al., 2014), GAN (Goodfellow et al., 2014), and T5 Text-to-Text transformers (Matena et al., 2019) have the potential to produce new mock requirements based on a particular context, which may provide innovative data-driven ideas from a new perspective.
Fifth, aside from natural text, user needs can also be mined from other data formats. E-commerce platforms, for instance, allow individuals to upload videos and pictures to share usage experiences, complaints, and feedback. Although techniques such as neural image description (Vinyals et al., 2015; Karpathy and Li, 2017) and neural video description (Yao et al., 2015) are not as mature as text classification techniques, they are of great research value and can also play a major role in requirement engineering.
Sixth, due to the data-intensive nature of ML methods, more high-quality requirement-related text data should be made available. However, relevant stakeholders ask that some interest-related requirements be kept confidential. Hence, sharing high-quality requirement data with the requirement engineering community is challenging. Masking sensitive data or substituting entities can be an effective means of modifying sensitive requirements, which can facilitate the sharing of information within the requirement engineering community. Another strategy to address insufficient training data is to develop a language model specifically for requirements engineering. Research shows that transfer learning techniques can overcome the limitations of insufficient data (Howard and Ruder, 2018). Future works could also consider building neural language models that are specifically trained on requirement specifications.
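The masking and entity substitution idea can be illustrated with a minimal sketch. It assumes that sensitive items are either matched by simple regular expressions (here, e-mail addresses and one phone number format) or supplied explicitly as a list of entity names; the patterns, placeholders, and example requirement are hypothetical, and a practical solution would more likely rely on named entity recognition.

```python
import re

# Assumption: illustrative patterns only; they do not cover all e-mail or phone formats.
MASK_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "<PHONE>"),
]


def mask_requirement(text, entity_names=()):
    # First replace pattern-based sensitive items with fixed placeholders.
    for pattern, placeholder in MASK_PATTERNS:
        text = pattern.sub(placeholder, text)
    # Then substitute each known entity name with a numbered placeholder.
    for index, name in enumerate(entity_names, start=1):
        text = re.sub(re.escape(name), f"<ENTITY_{index}>", text, flags=re.IGNORECASE)
    return text


# Hypothetical confidential requirement.
masked = mask_requirement(
    "Acme Corp requires that alerts are sent to ops@acme.example within 5 minutes.",
    entity_names=["Acme Corp"],
)
```

The masked text keeps the structure of the requirement ("<ENTITY_1> requires that alerts are sent to <EMAIL> within 5 minutes.") while hiding the stakeholder and contact details, so it could still serve as training data.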
Seventh, since user-requirement elicitation is a human-centric activity, analyzing user behavior may provide valuable insight into understanding and eliciting requirements. Representation learning, such as user embedding, is being applied to a variety of domains, including recommendation and healthcare systems (Miotto et al., 2016; Pan and Ding, 2019). Analyzing user behavior can help to predict user preferences and explore potential requirement changes.
Last, future work should address the issues caused by the dynamic nature of user requirements. In practice, stakeholder requirements are not always static; however, in the studies reviewed, ML algorithms were used to read static text to identify requirements. Further research on ML-based methods that focus on changing requirements and on reducing their impact is urgently needed.

Limitation of this review
We used PRISMA as the research framework to identify the primary research studies in this review. Unlike other popular methods, such as the snowballing approach, we did not exhaustively identify further relevant studies by iterating through the reference lists. This review chose to use minimum evidence to reflect the current state of ML-based requirements elicitation rather than providing an exhaustive result. Thus, some relevant studies may have been omitted from this review. In addition, there is a paradox between literature review and search query generation. Before a literature review is completed, it is not easy to define a set of exact keywords to represent the topic. At the same time, the absence of good search queries and keywords can undermine the effort to retrieve relevant papers effectively. Hence, it is challenging to develop, at the initial stage, a perfect set of search queries that covers all aspects of the field. To deal with these issues, we dynamically adjusted the search queries for the seven academic databases to reduce bias and loss in the search results.
Numerous publications were excluded due to a lack of technical details; this does not imply that those articles are unimportant to this field. Various ideas and concepts may still be derived from these works. Moreover, only one of several similar works by the same author was retained in the study; however, it is difficult to define a clear boundary for deciding which work to keep. As a precaution to minimize the risks associated with the inclusion-exclusion criteria, the authors discussed and evaluated the articles in meetings whenever it was challenging to decide individually.
Additionally, human errors could not be avoided in the data extraction phase because of its subjective nature. As the data extraction table in Table 4 illustrates, the reviewer needs to enter two types of data manually. The first type is descriptive data, which can be accessed from the academic research databases and the websites of journals. The second type, however, requires reviewers to assess and extract information based on personal understanding. Therefore, the data extraction process inevitably contains a certain amount of bias and subjectivity. In addition, since requirement elicitation is an interdisciplinary problem, many definitions are disputed. For example, requirement and requirement elicitation are defined differently by various researchers. Besides, some information was not explicitly stated in the reviewed articles, which made the corresponding information difficult to retrieve. To overcome this limitation, the author team iterated on and adjusted the data extraction table before reaching a final agreement.

Conclusion
The review provides an overview of the current research on ML-based requirements elicitation. First, we categorized the included studies into four ML-based requirement elicitation tasks: Preparation, Collection, Validation, and Negotiation. Second, we examined the data sources and corpora used by the included studies to develop the machine learning models for requirements elicitation. As a result, we identified three types of data sources for building ML solutions: Textual Documents, UGC, and Existing Requirement Datasets. Third, we extracted a general ML pipeline from the included studies: text cleansing and preprocessing, textual feature extraction, machine learning, and evaluation. Furthermore, we identified 19 tasks among the selected works and assigned them to three groups of text cleaning and preprocessing: filtering, normalizing, and tokenizing. For textual feature extraction, we classified the included works into three groups according to the technique used to extract the features. BOW language models and handcrafted features are frequently found in the reviewed publications, but in recent years, an increasing trend towards embedding features has been observed. In addition, we identified the most popular algorithms in the reviewed studies, such as Naive Bayes, Support Vector Machines, Decision Trees, and Neural Networks. Precision, Recall, and F1 score are the most prevalent evaluation metrics applied to assess model performance. Finally, we listed the most popular NLP tools, which are NLTK and CoreNLP, and the most commonly applied machine learning tools, Weka and Scikit-learn.
Apart from the main findings, one major observation is that most research focuses on requirements categorization tasks. A notable majority of papers in the collection focus on supervised text classification, followed by topic modeling and clustering techniques. Second, we noticed that the existing articles focus more on using machine learning to solve specific and fine-grained problems in requirements elicitation, such as classifying NFRs and extracting main topics from massive user reviews. Research on how to integrate machine learning-based requirements acquisition methods into existing requirements elicitation workflows has, however, been relatively rare. Hence, the lack of expertise in designing, engineering, implementing, and configuring ML-based requirement elicitation systems calls for further research. Furthermore, most studies lack concrete evidence that machine learning can help designers and engineers reduce the time and effort spent on requirement extraction. Last, although supervised learning is prevalent in this field, we found only two publicly accessible labeled datasets in the 86 reviewed papers: DePaul's NFRs dataset (Cleland-Huang et al., 2006) and SecReq (Knauss et al., 2011).
Thus far, ML-based solutions have been monolithic in eliciting requirements; however, the publications in this field provide sufficient evidence that ML can support requirements activities both theoretically and practically. A number of labor-intensive, error-prone activities in requirement engineering are waiting to be supported by ML. Despite what has already been accomplished, the best is yet to come.

Fig. 2. The number of included papers by year.

Fig. 3. An illustration of the categorization schema of the collected studies.
(2008) proposed a decision tree algorithm that can classify natural language requirements into functional requirements (FRs) and non-functional requirements (NFRs). The NFRs/FRs classification task takes NFRs or FRs as input and classifies them further into fine-grained subcategories. Cleland-Huang et al. (2007) proposed a TF-IDF-based classification algorithm that is capable of classifying textual requirements into predefined NFR subcategories. For this purpose, Cleland-Huang et al. (2007) established a manually labeled dataset for NFR classification. The last type of task identified in the Collection category is security requirement identification. Riaz et al. (2014) trained a K-NN classifier that can automatically detect six predefined security requirement types from natural text documents. Two articles introduce binary classifiers for identifying security requirements from written requirements (Li, 2018; Kobilica et al., 2020). Jindal et al. (2016) trained a decision tree to further categorize security requirements into four specific categories, which are authentication-authorization, access control, cryptography-encryption, and data integrity.
Fig. 5. The data source for building ML-based requirement elicitation solutions.

Table 1. Related works

Table 2. The research scope under Cooper's literature review taxonomy

Table 4. Elements of data extraction table
Fig. 1. PRISMA flowchart.
Artificial Intelligence for Engineering Design, Analysis and Manufacturing
International Conference on Requirements Engineering. The majority of the journal publications in this collection are from the Journal of Mechanical Design, Information and Software Technology, and Requirement Engineering.

Table 5. Tools mentioned by included works
Cheligeer et al.