Muddling-Through and Deep Learning for Managing Large-Scale Uncertain Risks

Abstract Managing large-scale, geographically distributed, and long-term risks arising from diverse underlying causes – ranging from poverty to underinvestment in protecting against natural hazards or failures of sociotechnical, economic, and financial systems – poses formidable challenges for any theory of effective social decision-making. Participants may have different and rapidly evolving local information and goals, perceive different opportunities and urgencies for actions, and be differently aware of how their actions affect each other through side effects and externalities. Six decades ago, political economist Charles Lindblom viewed “rational-comprehensive decision-making” as utterly impracticable for such realistically complex situations. Instead, he advocated incremental learning and improvement, or “muddling through,” as both a positive and a normative theory of bureaucratic decision-making when costs and benefits are highly uncertain. But sparse, delayed, uncertain, and incomplete feedback undermines the effectiveness of collective learning while muddling through, even if all participant incentives are aligned; it is no panacea. We consider how recent insights from machine learning – especially, deep multiagent reinforcement learning – formalize aspects of muddling through and suggest principles for improving human organizational decision-making. Deep learning principles adapted for human use can not only help participants in different levels of government or control hierarchies manage some large-scale distributed risks, but also show how rational-comprehensive decision analysis and incremental learning and improvement can be reconciled and synthesized.


Introduction
Traditional benefit-cost analysis and decision analysis typically involve multiple steps such as the following (Raiffa, 1968; Clemen & Reilly, 2014; Howard & Abbas, 2016):
1. Identify alternative feasible choices, decision rules, or courses of action. This "choice set," or set of decision alternatives, may be specified explicitly as a discrete set of alternatives, such as whether or not to fund a public project, or implicitly via constraints on the allowed values of decision variables, such as quantities of limited resources available to be allocated.
2. Identify preferences and value trade-offs for possible outcomes. These may be formally represented via a net benefit function or via a (possibly multiattribute) von Neumann-Morgenstern utility function or social utility function to be maximized (Keeney & Raiffa, 1976).
3. If the outcomes for each choice are uncertain, estimate the probabilities of different outcomes for each choice (e.g., its risk profile).
4. Optimize choices subject to feasibility constraints (e.g., on available time, budget, or limited resources) to identify and recommend a feasible choice that maximizes expected net benefit, expected utility, or expected social utility of outcomes.
These steps are all well-established parts of prescriptive decision analysis for a single decision-maker and benefit-cost analysis for a social decision-maker (Howard & Abbas, 2016; Raiffa, 1968). In 1959, political economist Charles Lindblom of Yale University pointed out that almost none of these steps can be applied in practice to the decisions and uncertainties faced by real government decision-makers, or by decision-makers in other bureaucracies. Preferences and value trade-offs may be unknown and difficult or impossible to articulate, quantify, and justify. Lindblom (1959) wrote, "Typically the administrator chooses – and must choose – directly among policies in which [different] values are combined in different ways. He cannot first clarify his values and then choose among policies," as multiattribute utility theory prescribes. Even identifying possible outcomes for each feasible choice may be impracticable if the number of possible choices is immense or possible outcomes are unknown. In addition, real-world bureaucratic and organizational decisions are almost never made by a single decision-maker. Rather than seeking to extend or refine normative decision analysis to overcome what he perceived as its fatal practical limitations for large-scale, multiperson organizational decision-making over time, Lindblom instead described a method of successive limited comparisons that he contrasts with the "rational-comprehensive" normative approach favored in benefit-cost analysis, decision analysis, operations research, and optimal control engineering. The rational-comprehensive approach seeks to solve decision optimization problems such as

max_{a∈A} R(a)    (1)

where
• a is a decision variable or policy (e.g., a vector or a time series of decision variables, or a feedback control decision rule mapping observations to actions);
• A is the set of feasible alternative decisions (the "choice set");
• R(a) is the reward (expected utility or net benefit) from choosing a. In many traditional economic, policy, and operations research analyses, the reward function to be maximized is assumed to be known. In statistical design of experiments and machine learning, it may have to be discovered. If the reward received depends both on the decision-maker's choice a and also on other variables not controlled by the decision-maker, collectively referred to as the state and modeled as a random variable s, then R(a) is the expected reward from choosing a given the probability distribution of s. When there are many players, R is often taken to be a weighted sum of individual utility functions (Gilboa et al., 2004);
• max_{a∈A} indicates that an act a in A is to be selected to maximize R(a).
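To make the formulation in equation (1) concrete, the following minimal sketch (in Python, with a hypothetical choice set, state probabilities, and payoffs not drawn from any real study) enumerates a small discrete choice set, computes each act's expected reward under an assumed probability distribution over states, and selects the maximizer.

```python
# Minimal sketch of rational-comprehensive choice: enumerate acts, compute
# expected reward under assumed state probabilities, and pick the maximizer.
# All numbers are hypothetical illustrations.

state_probs = {"mild_flood": 0.7, "severe_flood": 0.3}   # P(s)

# reward[a][s]: net benefit of act a if state s occurs (hypothetical units)
reward = {
    "no_levee":    {"mild_flood": 0.0,   "severe_flood": -100.0},
    "small_levee": {"mild_flood": -10.0, "severe_flood": -40.0},
    "large_levee": {"mild_flood": -30.0, "severe_flood": -35.0},
}

def expected_reward(act):
    """R(a) = sum over states of P(s) * reward(a, s)."""
    return sum(p * reward[act][s] for s, p in state_probs.items())

best_act = max(reward, key=expected_reward)   # argmax over the choice set A
print(best_act, expected_reward(best_act))
```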
Lindblom wrote that "the attention given to, and successes enjoyed by operations research, statistical decision theory, and systems analysis" have strengthened a "tendency to describe policy formulation even for complex problems as though it followed [this] approach," emphasizing "clarity of objective, explicitness of evaluation, a high degree of comprehensiveness of overview, and, wherever possible, quantification of values for mathematical analysis. But these advanced procedures remain largely the appropriate techniques of relatively small-scale problem-solving where the total number of variables to be considered is small and value problems restricted."

In contrast, for large-scale real-world decision problems faced by most bureaucracies, Lindblom considers the rational-comprehensive approach in equation (1) to be impracticable because the net benefit or reward function R is not known or agreed to; choice set A may be too large to enumerate or search effectively, or unknown and costly to develop; and often no single centralized authority is capable of, authorized to, or accountable for identifying and implementing the best choice in A. Instead of clarifying values and objectives in advance, goals and actions to achieve them are selected together as opportunities arise. The test of a "good" policy is not that it is the best means to desired ends, or that it maximizes some measure of expected net benefit, utility, or collective welfare, but that people will agree to it (possibly for different, and perhaps conflicting, private reasons). Important possible outcomes, feasible alternative policies, and affected values and trade-offs are neglected in favor of relatively simple comparisons between the current policy and a proposed incremental modification of it. A succession of such modifications may, if all goes well, produce gradually improving policies; this is the process that Lindblom refers to as successive limited comparisons, or, more colloquially, as muddling through.

He states that "Making policy is at best a very rough process. Neither social scientists, nor politicians, nor public administrators yet know enough about the social world to avoid repeated error in predicting the consequences of policy moves. A wise policy maker consequently expects that his policies will achieve only part of what he hopes and at the same time will produce unanticipated consequences that he would have preferred to avoid. If he proceeds through a succession of incremental changes, he avoids serious lasting mistakes in several ways" including learning from experience and being able to correct missteps fairly quickly. Of course, this view is optimistic if a single misstep could lead to disaster, ruin, or the destruction of the decision-making organizations, but Lindblom does not dwell on these grim possibilities. To model and evaluate the muddling through approach more formally, however, we will have to consider possibilities for safe learning, i.e., surviving and avoiding disastrous decisions during learning (Garcia & Fernandez, 2015).

Lindblom proposes muddling through not only as a descriptive theory of bureaucratic decision-making, but also as a normative one: "Why then bother to describe the method in all of the above detail? Because it is in fact a common method of policy formulation and is, for complex problems, the principal reliance of administrators as well as of other policy analysts.
And because it will be superior to any other decision-making method available for complex problems in many circumstances, certainly superior to a futile attempt at superhuman comprehensiveness." In short, muddling through by successive incremental adjustments of policy is proposed as both more desirable and more widely practiced than the rational-comprehensive approach. Since Lindblom's essay, revolutions have occurred in computer science, game theory, collective choice theory, automated and adaptive control, artificial intelligence, robust optimization and risk analysis, machine learning, computational statistics and data science, and the intersection of these fields with political economy, law-and-economics, and management science. It is timely to reexamine the extent to which Lindblom's critique of rational-comprehensive techniques for risk management decision support still applies; the extent to which the ferment of ideas and technical developments in artificial intelligence and other fields dealing with multiagent control has overcome his objections; how both the strengths and the limitations of muddling through can be understood better, and the technique applied more successfully, in light of progress since 1959; and whether there are circumstances in which muddling through provides a viable alternative or complement to decision analysis. The following sections undertake such a reexamination.

Developments in rational-comprehensive models of decision-making
An individual, team, organization, or artificial intelligence that repeatedly makes decisions to achieve some overall purposes or goals must repeatedly decide what to do next – e.g., what subgoals or tasks to undertake next – and how to do it, e.g., which agents should do what, and how much planning should be local and autonomous instead of centralized or hierarchical. In teams with no central coordinator, such as robot soccer teams of cooperating autonomous agents, cooperating swarms of drones, or search-and-rescue teams with autonomous agents and limited communication, the agents may have to infer and adapt to each other's plans on the fly as they observe each other's behaviors and messages (Hunt et al., 2014; Zhao et al., 2016). In bureaucracies or other organizations where policies are formulated and adapted via muddling through, success or failure in achieving stated goals may depend on who may propose what, when, and how decisions are made about which proposals to adopt, and how these changes and their consequences are linked to incentives and rewards for those participating in policy-making and administration.
In the face of such complexities, the simple prescriptive model of optimization-based rational-comprehensive decision-making in (1) has been generalized and extended in the following ways.
• Noncooperative game theory (Luce & Raiffa, 1957) replaces the reward function R(a) in (1) with a set of reward functions (also called "payoff functions"), one for each participant (called a "player" or "agent"). Each player has its own choice set of feasible alternatives to choose among, often called strategies in game theory, or policies in decision analysis, machine learning, and artificial intelligence.
Player i now seeks to choose a_i from A_i to maximize R_i(a_i, a_-i), where a_i denotes the strategy selected from A_i by player i; a_-i denotes all the strategies selected by the other players; and R_i(a_i, a_-i) is the reward to player i from choosing strategy a_i when the other players choose a_-i. There is no single net benefit, social welfare, or public interest to be maximized. Rather, each player seeks to act to maximize its own reward, given the actions of the rest. A Nash equilibrium is a set of choices such that no player can improve its own reward by unilaterally modifying its own choice, given the choices of the other players. Each player's choice is a best response to the choices of the rest. A set of choices by the players is Pareto-efficient if no other set of choices would give all players equal or greater rewards, and at least some of them strictly greater rewards. In practical applications such as deciding how to manage air pollution, antibiotic resistance, or climate change, a common challenge is that each player benefits if everyone else exercises restraint to avoid making the current problem worse, but each player also maximizes its own benefits by being unrestrained itself, whatever the other players are doing. In such cases, the unique Nash equilibrium is that no one exercises self-restraint, even though all would gain if all would do so; hence, it is not Pareto-efficient (a minimal numerical illustration appears after this list). A variety of "folk theorems" of game theory prove that both Pareto efficiency and multiperiod versions of Nash equilibrium can be achieved if players are sufficiently patient (i.e., they do not discount delayed rewards too steeply) in repeated games with discounted rewards and uncertain time horizons, where the players have a chance to observe each other's behaviors and make choices repeatedly over time. The trick is to have players make choices that punish those who do not cooperate in sustaining a Pareto-efficient outcome (Fudenberg & Maskin, 1986; Fudenberg et al., 1994; Hörner & Olszewski, 2006).
• Cooperative game theory further generalizes the multiplayer choice problem by allowing players to form coalitions and to bargain or negotiate with each other. For example, in the treaty participation game model of international cooperation (or lack of it) to limit emissions in hopes of limiting undesired climate change, a coalition of signatories might choose emissions levels to maximize their collective benefits, while nonsignatories choose emission levels to maximize their individual benefits (Barrett, 2013). The final levels of cooperation and emissions achieved in multistage games of coalition formation and decision-making about emissions depend on factors such as whether coalitions, once formed, are exclusive; whether players (e.g., countries) can make and enforce conditional agreements such as that some will reduce their emissions more if and only if others do; whether binding commitments can be made and enforced; how steeply participants discount future rewards and penalties compared to current ones; and whether the timing of catastrophic consequences from failure to muster sufficient cooperation is known or uncertain (Heitzig et al., 2011; Wood, 2011; Barrett, 2013).
• Team theory (Marschak & Radner, 1972) focuses on design of costly communication and agent decision rules (and, in some versions, on allocation of limited resources among the agents) for the special case of cooperating agents in an organization where all of the agents have identical preferences and goals.
That is, they all seek to maximize the same reward function of their joint choices, but local observations, actions, and communications are costly. Team theory has been applied to distributed control of systems by agents with sensors and actuators at different locations, as well as to organizational design, design of compensation systems, and dynamic allocation of tasks, roles, and responsibilities within teams of cooperating agents.
• Mechanism design: Institutions, social and moral norms, legal constraints and liabilities, regulations and their enforcement, wages and contractual incentives, outcome-sharing rules in principal-agent relationships and investment syndicates, and reputations in repeated transactions and long-term relationships all help to shape the rewards (positive or negative) and feedback that players receive for their choices and behaviors. Game theory studies how agents make choices in response to incentives. Mechanism design theory (Nisan, 2007) studies the inverse problem of how to design incentives, or the rules determining rewards in the games in which agents participate, to elicit choices that satisfy desired properties. These may include Pareto efficiency, self-enforcing stability (e.g., Nash equilibrium and its multiperiod extensions), implementability using information that can actually be obtained and incentives (e.g., payments) that can actually be provided, and voluntary participation. Although important impossibility theorems show that successful mechanism design satisfying most or all of these properties is impossible if preferences are arbitrary, many positive results are available when preferences satisfy restrictions (e.g., risk neutrality and "quasi-linear preferences" with utility linear in money) commonly assumed in traditional benefit-cost analyses.
• Organizational design and law-and-economics: Within bureaucracies and other hierarchical organizations (e.g., principal-agent relationships), as well as in the more specialized contexts of designing contracts and auctions, mechanism design can be applied to design incentive systems to promote revelation of local information, elicit desired behaviors despite private information, and optimize delegation and trade-offs between centralization and decentralization, taking into account costs of communication, monitoring, and control and inefficiencies due to remaining private information (Mookherjee, 2006). As a prominent application of the mechanism design perspective, the modern theory of law and economics (Miceli, 2017) explains how systems of laws establishing tort liability rules for hazardous activities, remedies for breach of contracts, property rights to internalize externalities, product liability and implicit warranty principles, and so forth can be designed to maximize the expected net economic benefit from voluntary transactions, usually assuming risk-neutral participants with quasi-linear preferences. Practical designs that explain many aspects of observed legal practice account for market imperfections such as private and asymmetric information (e.g., a consumer may not know how much care a manufacturer has taken to keep a product safe, or the manufacturer may not know how much care the consumer will exercise in using the product safely), costs of litigation, misperceptions of risk by buyers, and incentives for socially valuable research and disclosure of information by sellers.
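The tension between Nash equilibrium and Pareto efficiency noted in the first bullet can be illustrated numerically. The sketch below, with hypothetical payoffs for a two-player "restrain versus pollute" game, checks every strategy profile for the best-response (Nash) property and for Pareto efficiency; with these payoffs the unique Nash equilibrium is mutual non-restraint, which is Pareto-dominated by mutual restraint.

```python
from itertools import product

# Hypothetical payoffs (player1, player2) for a two-player emissions game.
# Polluting is individually better whatever the other player does, but
# mutual restraint beats mutual pollution for both players.
payoff = {
    ("restrain", "restrain"): (3, 3),
    ("restrain", "pollute"):  (0, 4),
    ("pollute",  "restrain"): (4, 0),
    ("pollute",  "pollute"):  (1, 1),
}
strategies = ["restrain", "pollute"]

def is_nash(profile):
    """No player can gain by a unilateral deviation."""
    a1, a2 = profile
    best1 = all(payoff[(a1, a2)][0] >= payoff[(d, a2)][0] for d in strategies)
    best2 = all(payoff[(a1, a2)][1] >= payoff[(a1, d)][1] for d in strategies)
    return best1 and best2

def is_pareto_efficient(profile):
    """No other profile gives both players at least as much and one strictly more."""
    u = payoff[profile]
    for other, v in payoff.items():
        if other != profile and v[0] >= u[0] and v[1] >= u[1] and v != u:
            return False
    return True

for profile in product(strategies, repeat=2):
    print(profile,
          "Nash" if is_nash(profile) else "",
          "Pareto-efficient" if is_pareto_efficient(profile) else "")
```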

Modern algorithms for single-and multiagent decision-making
The intersection of computer science with decision models and algorithms has tremendously advanced the design and practical application of algorithms for solving large-scale single-person and team decision optimization problems, as well as games and collective choice problems, in recent decades. Current state-of-the-art algorithms are briefly described next.
• Monte Carlo Tree Search (MCTS). Decision trees and game trees showing possible sequences of actions (choice nodes) and uncertainty resolutions (chance nodes, with probabilities for each branch) leading to rewards (utilities) at the ends (leaf nodes) of the tree are perhaps the best known rational-comprehensive models of normative decision analysis for small problems (Raiffa, 1968; Luce & Raiffa, 1957). For large problems, recent MCTS algorithms (Munos, 2014; Silver et al., 2016, 2018) sample possible future paths and rewards to avoid enumerating all possibilities. This decouples "rational" decision-making, based on optimizing current decisions using predicted future reward probabilities, from "comprehensive" modeling of the causal relationship between choices and reward probabilities, by selecting only the most promising choice nodes in a tree for further simulation and evaluation. MCTS can be combined with reinforcement learning (RL) techniques discussed next (Vodopivec et al., 2017) and applied to more general settings, such as those in which it is costly to observe the reward (Schulze & Evans, 2018), as is the case for many social policy interventions.
• Reinforcement learning (RL) of high-reward policies through trial-and-error learning (Sutton & Barto, 1998). Decision-makers (agents) often initially do not know how their choices affect reward probabilities, or expected benefits, but must discover the immediate and longer-term costs and benefits of alternative policies or choices from experience. Denote the true expected value of starting in state s and acting optimally thereafter by an (initially unknown) value function V(s), and let Q(a, s) denote an estimate of the value from taking each feasible action a when in each state s and then acting optimally (e.g., to maximize the discounted sum of future rewards) ever after. The initial estimates of these values may be random guesses, but they are updated in light of experience by adjusting current estimates by an amount proportional to the difference between expected and experienced rewards. The constant of proportionality is interpreted as the learning rate. For example, Q-learning uses the current estimate Q(a, s) to select which action to take next in the current state s. Then the resulting reward is used to update the estimate of Q(a, s) based on the difference between estimated and observed rewards (a minimal tabular sketch of this update appears after this list). In many settings, estimated Q(a, s) values converge; the policy of selecting a to maximize Q(a, s) is then the optimal policy, and the estimated value of Q(a, s) when that policy is used is the true value function V(s). This procedure is similar to value iteration in classical stochastic dynamic programming, but without the requirement that the reward function and state transition probabilities be initially known. It converges to yield optimal policies under certain conditions for Markov decision processes (MDPs), in which the actions taken affect next-state probabilities as well as probability distributions of current rewards (Krishnamurthy, 2015). The main conditions are that learning rates be kept small enough and that the MDPs are ergodic, involving no irreversible choices or fatal outcomes that would limit or prevent future exploration and adaptation (Bloembergen et al., 2015; Krishnamurthy, 2015; Xu et al., 2017).
• RL using policy gradient algorithms. RL can also be based on algorithms that emphasize adjusting policies directly rather than estimating values for different actions, as in benefit-cost analysis.
As usual, a policy in RL is a decision rule mapping observations (e.g., the current state) to actions. In most RL algorithms, however, this mapping is randomized: a policy specifies the probability of taking each feasible action when in each state (or, more generally, given current information, which may include imperfect observations of the current state). Policies are updated to favor selecting actions with higher expected values.
The tension between exploring further in hopes of finding a more valuable policy and exploiting what has been learned so far by selecting the actions with the highest expected values is managed carefully by choosing action-selection probabilities to avoid premature convergence to suboptimal policies. For example, a simple and effective policy in many settings is to select each action with a probability equal to the currently estimated probability that it is the best (value-maximizing) action; this is called Thompson sampling (Schulze & Evans, 2018). Such randomized sampling schemes prevent jumping to possibly erroneous conclusions about what works best in clinical trials and similar sequential decision optimization settings (Villar et al., 2015). Adjustments of policies continue until expected and experienced average rewards no longer differ. For large classes of adaptive decision problems under uncertainty, the policies arrived at by such successive incremental adjustments are the optimal policies that would be obtained by classical operations research methods (Bloembergen et al., 2015; Krishnamurthy, 2015; Xu et al., 2017). Table 1 lists important refinements and enhancements used in practice to make RL quicker and more robust to data limitations. Table 2 summarizes methods for safe learning that have proved effective in applications ranging from learning to control helicopters and quadcopters (e.g., allowing them to hover or navigate safely in cluttered environments) to learning to manage power grids and other networked infrastructures, without risking costly accidents and failures during learning. Table 3 summarizes variations and extensions of multiagent reinforcement learning (MARL) in which multiple agents act, learn, and perhaps communicate about how to control a system or accomplish a task. MARL can greatly increase the speed of learning and average rewards generated per unit time, under certain conditions (Omidshafiei et al., 2017; Gupta et al., 2017).
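The Q-learning update described above can be written in a few lines. The following minimal tabular sketch uses a hypothetical two-state environment (its dynamics are assumed for illustration and are unknown to the learner) and adjusts each Q(a, s) estimate by a learning rate times the difference between experienced and expected value, with epsilon-greedy exploration.

```python
import random

# Minimal tabular Q-learning sketch on a hypothetical two-state problem.
# transition(s, a) returns (next_state, reward); the dynamics are assumed for
# illustration only and are never consulted by the learner, which improves
# its estimates purely from observed rewards.

STATES, ACTIONS = [0, 1], [0, 1]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration

def transition(s, a):
    # Hypothetical dynamics: action 1 tends to move toward state 1,
    # which pays a higher reward.
    next_s = 1 if (a == 1 and random.random() < 0.8) else 0
    reward = 1.0 if next_s == 1 else 0.0
    return next_s, reward

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

s = 0
for step in range(5000):
    # Epsilon-greedy action selection from current Q estimates.
    if random.random() < EPSILON:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda act: Q[(s, act)])
    next_s, r = transition(s, a)
    # Adjust the estimate in proportion to (experienced - expected) value.
    td_target = r + GAMMA * max(Q[(next_s, act)] for act in ACTIONS)
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])
    s = next_s

print({k: round(v, 2) for k, v in Q.items()})
```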

Table 1. Important refinements and enhancements used in practice to make RL quicker and more robust to data limitations.
• Policy gradient RL algorithms: Directly modify policies, without first estimating a value function for the states, by estimating the gradient (slope) of the reward as a function of policy parameters and adjusting those parameters incrementally to ascend the estimated slope (Arulkumaran et al., 2017).
• Actor-critic architectures: Interpret the policy at any time as an "actor" and the value function as a "critic" that evaluates how well the current policy is working. Separating these two roles helps to speed convergence (Grondman et al., 2012).
• Model-based RL: Fit statistical models of reward probabilities and state transition probabilities to observed state-act-reward-next-state data. Use the models to speed learning of high-reward policies (if the models are usefully accurate) (Clavira et al., 2018).
• Model-free RL: Use empirically observed rewards to estimate state or action value functions (via iteratively updated Q values). Powerful statistical and machine learning techniques for approximating unknown functions from data, such as deep neural networks, can obtain most of the advantages of model-based RL while avoiding the potential pitfalls from using incorrect models (Mnih et al., 2015; Andrychowicz et al., 2018).
• Reward shaping: Modify the original reward function received from the environment to encourage quicker learning and discovery of better policies (Mannion et al., 2017).
• Experience replay: Use Monte Carlo simulation from frequency distributions of past experiences (e.g., state-action-reward-next-state sequences) to reduce computational burden and augment sparse training data (Andrychowicz et al., 2018).
• Deep learning control of the learning rate: Use deep learning neural networks to automatically adjust the learning rate parameter, using an actor-critic architecture in which one neural network adjusts the parameter and another provides feedback on how well the adjustments appear to be working (Xu et al., 2017).
• Meta-learning: Estimate crude high-level models of rewards and value functions relatively rapidly. Refine and improve them and use them to guide actions via RL as new observations are made. Such a hierarchy of modeling allows relatively rapid and effective adaptation to new conditions in nonstationary environments, including graceful compensation for and recovery from partial system failures (Lemke et al., 2015; Clavira et al., 2018).
• Inverse RL and imitation learning: Use observed data on state and action sequences leading to success or failure in a task to infer successful policies for choosing actions to take in each state to accomplish it successfully. This makes it possible for agents to learn quickly from humans or other more experienced and higher-performing agents how to do complex tasks (Shiarlis et al., 2016).
• Hybrids of the above techniques: Example: interleaving updates of the estimated value function with sampling from the experience replay buffer and adjustment of policies to increase expected reward ("policy gradient ascent" for rewards or "policy gradient descent" for losses, using a step size determined by the current learning rate parameter).
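As one concrete illustration of the first row of Table 1, the following minimal sketch applies a policy-gradient (REINFORCE-style) update to a two-action problem with hypothetical Bernoulli rewards: a softmax policy over two preference parameters is nudged along an estimate of the gradient of expected reward, using a running average reward as a simple baseline rather than a learned value function.

```python
import math, random

# Minimal policy-gradient (REINFORCE-style) sketch for a two-action problem.
# The policy is a softmax over two preference parameters; parameters are
# adjusted incrementally along the estimated gradient of expected reward.
# True success probabilities are hypothetical and unknown to the learner.

TRUE_SUCCESS_PROB = [0.3, 0.7]
theta = [0.0, 0.0]               # policy parameters (action preferences)
LEARNING_RATE = 0.05

def action_probs(theta):
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

baseline = 0.0                   # running average reward used as a baseline
for episode in range(5000):
    probs = action_probs(theta)
    a = 0 if random.random() < probs[0] else 1
    reward = 1.0 if random.random() < TRUE_SUCCESS_PROB[a] else 0.0
    baseline += 0.01 * (reward - baseline)
    advantage = reward - baseline
    # Gradient of log softmax: 1 - p(a) for the chosen action, -p(i) otherwise.
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += LEARNING_RATE * advantage * grad_log

print([round(p, 2) for p in action_probs(theta)])   # should favor action 1
```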
Table 2. Some principles for safe learning, i.e., learning without risking catastrophic failures.
• Risk-sensitive learning and control: Modify the reward function to consider variance in return; probabilities of ruin or large loss, such as crash of an autonomous vehicle; and risk-sensitive control policies (Garcia & Fernandez, 2015).
• Imitation learning with safe instruction: Use imitation learning from demonstrations supplied by instructors to assure that only safe examples are imitated (Garcia & Fernandez, 2015).
• Knowledge-based constraints on exploration: Use knowledge-based constraints supplied by instructors to assure that only safe changes in policies are explored during learning (Garcia & Fernandez, 2015).
• Maintain system stability while learning and exploring modified policies: Apply feedback control theory for dynamic systems to maintain stability of the system while collecting data. Use the collected data to learn to improve control performance and to expand the safe region of the state space, i.e., the set of states for which safe control policies are available (Berkenkamp et al., 2017). Keeping changes in control policies small enough to avoid destabilizing the system while learning is effective for systems that are known to have well-behaved dynamics, without large (e.g., discontinuous jump) responses to small changes in controls.
• Use model uncertainty to constrain exploration: Create uncertainty zones around regions of potentially high loss (e.g., around pedestrians with unpredictable behaviors) based on model uncertainty estimates, and avoid them during learning (Lütjens et al., 2018).
• Safe policy improvement using a known safe policy as default when model uncertainty is high: Engage in safe policy improvement by using known safe (i.e., catastrophe-avoiding) default policies when model uncertainty about effects of changing the policy is high. Explore for possible improvements in policies when model uncertainty is low (Petrik et al., 2016).
• Safe policy improvement using statistical confidence bounds to limit the risk from policy modifications: Use statistical confidence bounds (e.g., derived from importance sampling and probability inequalities) for performance of modified policies to avoid those that pose unacceptable risks (Thomas et al., 2015).
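The "safe default" rows of Table 2 can be illustrated with a minimal decision rule: deploy a candidate policy only if its estimated improvement over a known safe policy is positive even at a lower confidence bound, and otherwise keep the safe default. The sketch below uses hypothetical evaluation data and a simple normal approximation for the bound; it illustrates the idea only and is not the specific method of Petrik et al. (2016) or Thomas et al. (2015).

```python
import math, statistics

# Minimal "safe default" sketch: keep a known safe policy unless the estimated
# improvement of a candidate policy is reliably positive. The confidence bound
# is a simple normal approximation on hypothetical evaluation data.

def lower_confidence_bound(samples, z=1.96):
    """Approximate lower bound on the mean improvement (normal approximation)."""
    mean = statistics.mean(samples)
    se = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - z * se

def choose_policy(estimated_improvements, safe_policy, candidate_policy):
    # Deploy the candidate only if its improvement over the safe default is
    # positive even at the lower confidence bound; otherwise keep the default.
    if lower_confidence_bound(estimated_improvements) > 0:
        return candidate_policy
    return safe_policy

# Hypothetical per-episode improvement estimates of candidate over safe policy.
noisy_evidence = [0.4, -0.1, 0.3, 0.2, -0.2, 0.1, 0.5, 0.0]
print(choose_policy(noisy_evidence, "safe_default", "candidate"))
```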

Table 3. Variations and extensions of multiagent reinforcement learning (MARL).
• Noncooperative MARL: Convergence to Nash equilibria occurs under certain conditions if each agent uses RL and manages its learning rate appropriately (Hu & Wellman, 1998, 2003); however, Nash equilibria may not be Pareto-efficient.
• Collective choice MARL: Agents initially know only their own preferences. They negotiate by proposing joint actions to each other to improve their own payoffs. Accepted proposals are binding and generate mutual gains. This cooperative negotiation leads to Pareto-superior outcomes compared to noncooperative MARL in many games (Hu et al., 2015).
• MARL for teams without communication among agents: Teams of cooperating agents with the same goal (i.e., cooperating to maximize the same reward function) can learn to behave effectively in many settings even without explicit communication, by observing, modeling, and adjusting to each other's behaviors (Gupta et al., 2017).
• Decentralized MARL for distributed control of a system by a team of cooperating and communicating agents: Decentralized cooperative learning by a team of agents based on explicit communication (e.g., over an unreliable communication network), with agents sharing experiences (data, estimated value functions, or policies), improves learning of distributed control policies to maximize average reward. Applications include control of power grids, mobile sensor networks, and autonomous vehicles.
• Hierarchical MARL (HMARL): MARL systems with hierarchical organizations of agents, as well as other techniques such as reward shaping, speed convergence to high-reward policies in many settings (Mannion et al., 2017).
• Decentralized multilevel HMARL: In a multilevel hierarchy of agents, supervisory agents abstract and aggregate information from their subordinates, share it with their peers, pass summaries upward to their own supervisors, and pass supervisory suggestions and constraints on next actions down to their subordinates. This approach has been found to improve convergence of MARL learning in tasks requiring distributed control, such as network routing (Zhang et al., 2008).
• Two-level HMARL: A central controller coordinates learning among the agents. Local agents manage different parts of a system, such as a supply chain network. They send to the central controller information about their current policies (e.g., represented as deep neural networks for mapping observations to actions) and observations on local costs (e.g., arising from inventory ordering, holding, and stockout costs). The central controller sends feedback to the agents (e.g., weights for the best policies learned so far by each agent) to coordinate their learning. In experimental supply chains, such two-level hierarchical MARL systems discovered policies that substantially reduce costs (e.g., by 80%) compared to the performance of human managers (Fuji et al., 2018).
• Hierarchy of tasks assigned to a hierarchy of agents: Hierarchical deep MARL can be used to decompose a learning task into a hierarchy, with high-level learning of policies over multistep goals and low-level controllers learning policies for taking the actions or steps needed to complete those goals. This task decomposition architecture, combined with experience replay, proved effective for learning high-reward policies in complex and rapidly changing test environments, such as managing a team of cooperating agents in a simulated basketball attack/defense game, even in the presence of sparse and delayed rewards (Tang et al., 2018).

MARL algorithms and architectures that incorporate MCTS and enhancements to speed up convergence, safe learning, and communication and control hierarchies represent the current state of the art in machine learning models and methods for solving large-scale and distributed decision and control problems under uncertainty, including problems with sparse and delayed feedback. Although most MARL algorithms are designed for cooperating agents, Bowling and Veloso (2001) showed that convergence to Nash equilibria can also be achieved in a variety of noncooperative Markov games (generalizations of MDPs to multiple agents) if each agent uses RL but manages its learning rate to take large steps when the agent's experienced rewards are less than expected ("learn fast when losing") and small steps otherwise (when it is "winning" by receiving higher than expected rewards). The resulting WoLF ("win or learn fast") principle has been incorporated into many subsequent MARL algorithms for cooperative learning. It gives agents who are lagging in learning to contribute to the team's success time to catch up, while agents who are ahead of the rest continue to explore relatively cautiously (via relatively small incremental adjustment steps) for even better policies. In practice, MARL algorithms have been applied successfully to obtain high-reward policies for difficult distributed decision and control problems such as job shop scheduling among multiple agents (Gabel & Riedmiller, 2007); coordination of military force attacks in increasingly large-scale and realistic war game simulations (e.g., StarCraft battles) (Usunier et al., 2016); and self-organizing control of swarms of drones to perform missions or to cooperate in choosing locations to obtain full visual coverage of a complex and initially unknown environment (Pham et al., 2018).
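The WoLF step-size rule just described can be stated as a short function: take a large adjustment step when experienced reward falls below what the current policy was expected to deliver, and a small step otherwise. The two step sizes in this sketch are hypothetical.

```python
# Minimal sketch of a WoLF-style ("win or learn fast") step-size rule: use a
# large learning rate when experienced reward falls short of expectation
# ("losing") and a small one otherwise ("winning"). Step sizes are hypothetical.

DELTA_WIN, DELTA_LOSE = 0.01, 0.05

def wolf_step_size(expected_reward, experienced_reward,
                   delta_win=DELTA_WIN, delta_lose=DELTA_LOSE):
    """Return the step size to use for the next incremental policy update."""
    return delta_lose if experienced_reward < expected_reward else delta_win

# An agent whose policy was expected to yield 0.6 but that experienced 0.2
# adapts quickly; one doing better than expected adjusts cautiously.
print(wolf_step_size(0.6, 0.2))   # -> 0.05
print(wolf_step_size(0.6, 0.8))   # -> 0.01
```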
Safe MARL (SMARL) and Hierarchical MARL (HMARL) algorithms have demonstrated promising performance in controlling autonomous vehicles (Shalev-Shwartz et al., 2016) and teams of robots performing challenging tasks such as urban search and rescue in complex and uncertain environments (Cai et al., 2013), respectively. Such results suggest the potential for MARL principles and their extensions to contribute to improved control of complex distributed systems in important practical business, military, and industrial engineering applications.

Discussion: implications of advances in rational-comprehensive decision theory for muddling through
A key insight from machine learning is that policy gradient algorithms and other RL and MARL techniques that take successive incremental steps guided by experience – and in this sense muddle through – end up solving dynamic optimization problems. This finding addresses the "rational" component of Lindblom's critique by showing that muddling through and optimization are not opposed: muddling through provides one way to solve optimization problems. Likewise, RL's ability to solve adaptive dynamic optimization problems without requiring initial knowledge of the optimization problems being solved – specifically, of how different choices affect reward probabilities and next-state transition probabilities in dynamic systems or environments – renders the "comprehensive" knowledge requirement no longer necessary. Sampling-based approximate optimization algorithms such as MCTS further reduce the need for a comprehensive examination and evaluation of decision options. In short, rather than being thesis and antithesis, as Lindblom framed them, optimization and muddling through have undergone a useful synthesis in modern machine learning via RL and MARL. However, fully automated RL and MARL techniques for quickly discovering optimal or near-optimal policies remain elusive. Computational complexity results for decentralized control of Markov decision processes (MDPs) and their generalizations suggest that some of these limitations are intrinsic for MARL (although not for single-agent RL with MDPs) (Papadimitriou & Tsitsiklis, 1985), and hence that discovery of high-reward policies will always be time-consuming unless there is some measure of centralized control (Bernstein et al., 2000). Of course, real organizations do not simply implement computer science algorithms, and it would be simplistic to read into the complexities of human organizational design and behavior all the limitations (or only the limitations) of RL and MARL algorithms. Nonetheless, understanding how and why these algorithms fail in some settings suggests important pitfalls to avoid in organizations that rely on muddling through, insofar as they follow the same basic principles. Conversely, success factors that turn out to be necessary for effective RL or MARL machine learning of high-reward policies in relatively simple environments may help to suggest necessary (although not sufficient) conditions for effective organizational learning within and among human organizations. The following paragraphs summarize key lessons and some comparisons with observed real-world decision processes for human organizations.
1. Collect accurate, relevant feedback data and use it to improve policies.
After each new action is taken, RL evaluates the reward received and compares it to the reward that was expected so that the difference can be used to correct erroneous expectations and update the current policy. This requires that the effects of actions be evaluated and compared to prior expectations or predictions, and also that policies then be adjusted in light of the data. In the real world, policy-making and policy-administering bureaucracies frequently violate each of these requirements. For example, finding that investments in a costly course of action have yielded lower-than-expected returns may provoke those who originally chose it to escalate their commitment to it (Molden & Hui, 2011; Schultze et al., 2012). Possible psychological and political explanations for escalating commitment range from loss aversion to seeking to manage the impressions of others, but clearly such resistance to modifying or abandoning previous choices in light of experience inhibits effective learning (Cox, 2015; Tetlock & Gardner, 2015).
In business as well as government, data needed to evaluate and compare actual to predicted performance of a policy are often not even collected, or are ignored or misinterpreted if they are collected (Russo & Schoemaker, 1989). In social policy application areas as diverse as education, criminal justice, and health care, changes in policy are often implemented without any clear predictions about expected changes in rewards or careful evaluations of actual changes in rewards (Tetlock & Gardner, 2015). These failures of design and analysis prevent the crucial learning from experience that is essential to effective muddling through. The remedy is to collect, retain, candidly communicate, and use accurate data on predicted and observed outcomes from implemented policies to improve them over time.
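A minimal sketch of the feedback loop described above: maintain an explicit prediction of a policy's outcome, compare it with each observed outcome, and adjust the prediction by a learning rate times the prediction error. All numbers are hypothetical.

```python
# Minimal sketch of feedback-driven updating: record predicted and observed
# outcomes for an implemented policy and adjust the prediction by a fraction
# (the learning rate) of the prediction error. Data are hypothetical.

LEARNING_RATE = 0.3

def update_expectation(expected, observed, learning_rate=LEARNING_RATE):
    """New expectation = old + learning_rate * (observed - expected)."""
    return expected + learning_rate * (observed - expected)

expected_benefit = 100.0                      # predicted before implementation
observed_benefits = [60.0, 75.0, 70.0, 80.0]  # evaluations after implementation

for obs in observed_benefits:
    expected_benefit = update_expectation(expected_benefit, obs)
    print(round(expected_benefit, 1))
```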
2. Explore via experiments to discover how to cause desired changes in outcome probabilities.
It is tempting for a policy analyst or policy maker steeped in the rational-comprehensive tradition criticized by Lindblom to create the best possible model of how one believes the world works and then to choose the action or policy that maximizes expected utility according to this model, as in equation (1). But in reality, the causal relationship between choices of policies and resulting conditional probabilities of different consequences and rewards is often initially highly uncertain. Prudent and effective policy-making requires acknowledging and coping with this model uncertainty, rather than selecting and using a single model. RL and MARL algorithms do this via randomized selection of actions (e.g., using Thompson sampling or other randomized sampling schemes) (Schulze & Evans, 2018) to discover which policies work best and to avoid becoming stuck in local optima, but it is counter-cultural among people who believe that one should know and not guess about the best course of action before taking it (Tetlock & Gardner, 2015), and among decision analysts who believe that one should solve an expected utility optimization problem and then make deterministic decisions based on the results. Neither belief fully acknowledges or responds constructively to the reality emphasized by Lindblom that current knowledge is often simply insufficient to permit confident identification of the best policy, and that experimentation is the only practical way to discover how to do better. Fortunately, the use of randomized controlled trials (RCTs) in social policy experimentation and evaluation of interventions has become increasingly accepted and practiced recently, in areas ranging from disrupting poverty (Tollefson, 2015) to preventing delinquency (de Vries et al., 2018) to improving oral health of fifth grade students (Qadri et al., 2018) to reducing child abuse by intervening with substance-abusing parents (Barlow et al., 2019). For collective learning and decision problems, such as controlling air pollution health effects, RCTs may not be practicable or ethical, but natural experiments and quasi-experiments provide valuable opportunities to learn from observed responses to unplanned or nonrandom interventions (Boogaard et al., 2017; Henneman et al., 2017).
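The randomized action selection mentioned above can be illustrated with Thompson sampling for two candidate interventions with binary (success/failure) outcomes: each intervention's success probability gets a Beta posterior, a draw is taken from each posterior, and the intervention with the larger draw is tried next, so each is chosen with probability equal to the current posterior probability that it is best. The success probabilities here are hypothetical.

```python
import random

# Minimal Thompson sampling sketch for choosing between two candidate
# interventions with binary outcomes. Each round, a success probability is
# drawn from each intervention's Beta posterior and the intervention with the
# larger draw is tried; the posterior is then updated with the observed
# outcome. True success probabilities are hypothetical and unknown to the
# learner.

TRUE_SUCCESS = [0.45, 0.60]
alpha = [1.0, 1.0]   # Beta posterior parameters (successes + 1)
beta = [1.0, 1.0]    # Beta posterior parameters (failures + 1)

counts = [0, 0]
for trial in range(2000):
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(2)]
    choice = samples.index(max(samples))
    counts[choice] += 1
    if random.random() < TRUE_SUCCESS[choice]:
        alpha[choice] += 1
    else:
        beta[choice] += 1

print("times each intervention was tried:", counts)
```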
3. During collective learning, agents should advance slowly when doing better than expected, but retreat quickly when doing worse.
The "win or learn fast" (WoLF) principle from MARL provides a useful heuristic for coordinating the rates at which agents on a team adjust their individual policies to prevent collective instability, so that they can eventually find and exploit a coordinated set of individual policies for maximizing team reward. In practice, destabilized policy-making processes in human organizations can manifest as "policy churn," in which new policies are proposed before old ones are well implemented and evaluated by the teams of agents implementing them (Monios, 2016). Teachers implementing education reform programs; bankers implementing new risk management regulations and requirements; medical staff implementing new infection control protocols in hospital wards; and workers in bureaucracies implementing policy changes have all been frustrated by policy churn that encourages costly activity and change without providing the opportunities for careful and thorough evaluation and improvement needed to improve outcomes. Perhaps fear of constant deflections and the resulting lack of progress explains some of the previously discussed reluctance to systematically collect and use feedback data to evaluate and improve policies. Conversely, the desire to show action and strong leadership, or to obscure the results of previous ineffective choices, might provide incentives for policy churn. In any case, the study of RL and MARL performance suggests that deliberately controlling step sizes and adjustment rates for policy updates might facilitate productive incorporation of feedback data into policy updates for a group of cooperating agents without destabilizing their learning and improvement process.

4. Separate actors and critics.
The RL idealization of frequent small adjustments made without significant costs, delays, or uncertainties in implementation is too simple to describe most real-world decision processes. Nonetheless, some RL and MARL principles may still be useful for human organizations. One of the most useful may be that decision-making and evaluation of decision performance should be kept distinct processes. Reasons abound in individual and group psychology for keeping those who make decisions about policy adjustments (analogous to "actors" in actor-critic RL algorithms) separate from those who evaluate the performance of the policies and provide feedback and suggestions for improving them (the "critics"). Among these reasons are confirmation bias, motivated reasoning, groupthink, and other heuristics and biases (Cox, 2015). RL suggests an additional reason, rooted in statistics: in deep learning RL algorithms, training one network to decide what to do next and a separate one to evaluate how well it is working has been found to prevent overly optimistic assessments of policy performance due to overfitting, i.e., using the same data both to select estimated value-maximizing actions and to estimate the values from taking those actions (van Hasselt et al., 2015). The principle of separating the processes for choosing which changes to make and evaluating how well they perform can also be applied usefully to the choice of learning rates (i.e., choosing how much to modify current policies in light of feedback) as well as to the choice of policies (Xu et al., 2017). Possible future advances include deliberately diversifying the learning rates of different agents on the same team to obtain the advantages of both rapid exploration of new policies and thorough exploitation and refinement of old ones. This is an old concept in organizational science (e.g., March, 1991), but is still being developed in MARL research (Potter et al., 2001).
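The statistical rationale for separating selection from evaluation can be shown directly. In the sketch below, several actions all have the same true value (zero, with Gaussian noise); reporting the value of the apparently best action from the same data used to select it produces an optimistic bias, whereas evaluating the selected action on independent data does not. This is the kind of overestimation that separate "actor" and "critic" estimates help prevent.

```python
import random, statistics

# Minimal illustration of why separating selection from evaluation matters
# statistically: if the same noisy data are used both to pick the apparently
# best action and to report its value, the reported value is optimistically
# biased; evaluating the chosen action on independent data removes the bias.
# All actions here have the same true value (0), so any reported advantage is
# pure noise.

NUM_ACTIONS, NUM_SAMPLES, NUM_TRIALS = 10, 20, 2000

def noisy_estimates():
    # Sample mean reward of each action; true mean reward is 0 for all actions.
    return [statistics.mean(random.gauss(0, 1) for _ in range(NUM_SAMPLES))
            for _ in range(NUM_ACTIONS)]

same_data_values, independent_values = [], []
for _ in range(NUM_TRIALS):
    first = noisy_estimates()
    best = first.index(max(first))           # selection step
    same_data_values.append(first[best])     # evaluated on the SAME data
    second = noisy_estimates()
    independent_values.append(second[best])  # evaluated on INDEPENDENT data

print("same-data evaluation:", round(statistics.mean(same_data_values), 3))
print("independent evaluation:", round(statistics.mean(independent_values), 3))
```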
As a practical matter, separation of actors and critics can be applied fruitfully to major social learning and improvement initiatives, such as air pollution regulation, through accountability studies that revisit previous regulatory actions or other decisions to assess their results (Boogaard et al., 2017; Henneman et al., 2017). Use of such evaluation studies to evaluate and update previous policy decisions – ideally, in time to be useful in guiding policy decisions elsewhere – is clearly consistent with the principle of collecting and using relevant feedback data. Separation of actors and critics provides an additional principle for using feedback data to maximum advantage to improve policies and their results.

5. Shape rewards to promote learning and improvement.
Recently, it has been found that using causal (counterfactual) models to shape each agent's reward to reflect the estimated difference it has made – the difference between what was actually achieved and what would have been expected without that agent's contribution, or its marginal value in microeconomic terms – can speed collective learning and optimization when each agent seeks to maximize its own reward (Devlin et al., 2014). This research uses mathematical rewards that are costless to implement, so that budget constraints, such as the requirement that the sum of agent rewards not exceed the collective reward of the team, do not apply. However, it seems plausible that, even in the presence of budget constraints, rewarding each agent according to its estimated marginal contribution (or its expected marginal contributions, or Shapley values in game theory) might promote joint learning about how to contribute more effectively, as well as having other properties of efficiency and fairness familiar from microeconomics and game theory. Of course, the asymmetric information about relative roles of chance and effort typical in principal-agent problems can inhibit accurate reward shaping in practice, and causal modeling of individual marginal contributions to team performance is challenging. Nonetheless, research on how best to use reward shaping to provide feedback and encourage effective learning, as well as to create incentives, may be useful for human organizations as well as for MARL algorithms.
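A minimal sketch of the counterfactual reward-shaping idea: each agent's shaped reward is the team reward actually achieved minus the team reward that would have been obtained without that agent's contribution. The team production function and contributions below are hypothetical.

```python
# Minimal sketch of counterfactual ("difference") reward shaping: each agent is
# credited with the difference between the team reward actually achieved and
# the team reward that would have been obtained without its contribution.
# The team reward function and agent contributions are hypothetical.

def team_reward(contributions):
    # Hypothetical team production function with diminishing returns.
    total = sum(contributions.values())
    return 10.0 * total - 0.5 * total ** 2

def difference_reward(agent, contributions):
    actual = team_reward(contributions)
    without_agent = dict(contributions, **{agent: 0.0})   # counterfactual
    return actual - team_reward(without_agent)

contributions = {"agency_A": 3.0, "agency_B": 1.0, "agency_C": 0.0}
for agent in contributions:
    print(agent, round(difference_reward(agent, contributions), 2))
```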
6. Learn from the experiences and expertise of others.
Learning from each other by sharing valuable memories, experiences, and expertise (typically encoded as causal models or trained neural nets) helps teams of MARL agents discover high-reward joint policies for controlling large-scale systems and accomplishing tasks in complex, changing, uncertain environments. In applying such ideas to human organizations, it is valuable to recognize that the "agents" may themselves be organizations, such as different schools, hospitals, or companies; or similar government bureaucracies in different states or countries. States and counties implementing pollution-reducing regulations might learn from each other's experiences about which combinations of interventions and conditions (possibly involving copollutants, weather variables, and sociodemographic characteristics of the exposed population) generate the greatest public health benefits from pollution reduction measures. As usual, effective learning in human organizations must overcome challenges from various types of learning aversion that have no clear counterparts in machine learning (Cox, 2015). For example, human bureaucracies may reorganize to visibly mimic organizational structures in more successful organizations whose reputations they covet, but without corresponding learning of the more effective policies that drive improved performance (Monios, 2016). Players preoccupied with managing the perceptions and impressions of others to shape allocations of collective efforts and rewards to their own individual advantages may be unable to achieve Pareto efficiency or to maximize any measure of collective success or reward. These threats do not arise for teams of agents trying to cooperate in maximizing the same reward function. Our recommendation that agents should learn from each other in order to speed mastery of joint policies for obtaining high rewards from the environment is primarily applicable to such teams of cooperating agents.

Conclusions
In 1973, two professors of design and city planning offered the following sober assessment of the prospects for scientifically based social policy: "The search for scientific bases for confronting problems of social policy is bound to fail, because of the nature of these problems. They are 'wicked' problems, whereas science has developed to deal with 'tame' problems. Policy problems cannot be definitively described. Moreover, in a pluralistic society there is nothing like the undisputable public good; there is no objective definition of equity; policies that respond to social problems cannot be meaningfully correct or false; and it makes no sense to talk about 'optimal solutions' to social problems unless severe qualifications are imposed first. Even worse, there are no 'solutions' in the sense of definitive and objective answers." (Rittel & Webber, 1973)

We believe that subsequent developments warrant greater optimism. While it is true that sufficiently heterogeneous preferences may make it impracticable or impossible to define and measure a single indisputable public good to be optimized, it is also true that agents with at least some shared goals have already achieved impressive feats of cooperation and control using MARL principles, in applications as diverse as autonomous vehicle fleet and drone swarm control, search-and-rescue via teams of cooperating autonomous robots, distributed management of supply chains, and military gaming. Such applications are admittedly far less difficult than the wicked problems referred to by Rittel and Webber, but many of the differences are of scale rather than of kind: robot teams are already using RL, MARL, and HMARL to confront, with increasing competence, the difficulties of distributed decision-making with initially unclear roles and priorities, uncertain and changing environments, opportunistic revision of goals and plans, and local information that may be time consuming and expensive to share. Multiple practical applications have demonstrated the advantages of improving via small steps rather than trying to optimize in one big decision, and this insight from Lindblom's 1959 paper remains true for machine learning as well as for human organizations. It has been augmented by the discovery that successive incremental improvement based on feedback at each step and careful selection of step sizes is often an effective way to solve dynamic optimization problems when they can be clearly formulated, as well as an effective way to learn how to act when not enough is initially known to formulate a clear decision optimization problem.
As artificial intelligence and machine learning algorithms are tested and improved on increasingly challenging tasks, principles for learning how to manage risks and act effectively in a variety of centralized, decentralized, and hierarchical organizational structures have begun to emerge. We have discussed several based on recent work that uses deep neural networks to approximate value functions in RL, MARL, and HMARL algorithms. These principles are only the beginning of what may soon become a substantial flow of useful principles from multiagent machine learning to human management science for improving organizational design and performance in coping with realistically complex and uncertain collective decision and policy improvement challenges. These principles will doubtless require modifications and extensions for the human world, since human psychology for both individuals and groups differs greatly from RL and MARL agent programming. But the pace of discovery and progress in using machine learning to solve increasingly large, difficult, and important real-world problems of decision-making under uncertainty is now extremely rapid. Discovering how groups and teams of agents can organize, learn, decide, and adapt more effectively is becoming an experimental and applied science, as well as a theoretical one, in current artificial intelligence and machine learning. It seems likely that this research will produce insights and principles to help tame currently wicked problems and develop increasingly effective and beneficial policies in collective choice applications with high stakes for humans.