Knowledge-based Reasoning and Learning under Partial Observability in Ad Hoc Teamwork

Ad hoc teamwork refers to the problem of enabling an agent to collaborate with teammates without prior coordination. Data-driven methods represent the state of the art in ad hoc teamwork. They use a large labeled dataset of prior observations to model the behavior of other agent types and to determine the ad hoc agent's behavior. These methods are computationally expensive, lack transparency, and make it difficult to adapt to previously unseen changes, e.g., in team composition. Our recent work introduced an architecture that determined an ad hoc agent's behavior based on non-monotonic logical reasoning with prior commonsense domain knowledge and predictive models of other agents' behavior that were learned from limited examples. In this paper, we substantially expand the architecture's capabilities to support: (a) online selection, adaptation, and learning of the models that predict the other agents' behavior; and (b) collaboration with teammates in the presence of partial observability and limited communication. We illustrate and experimentally evaluate the capabilities of our architecture in two simulated multiagent benchmark domains for ad hoc teamwork: Fort Attack and Half Field Offense. We show that the performance of our architecture is comparable or better than state of the art data-driven baselines in both simple and complex scenarios, particularly in the presence of limited training data, partial observability, and changes in team composition.


Introduction
Ad Hoc Teamwork (AHT) is the challenge of enabling an agent (called the ad hoc agent) to collaborate with previously unknown teammates toward a shared goal (Stone et al. 2010). As motivating examples, consider the simulated multiagent domain Fort Attack (FA, Figure 1a), where a team of guards has to protect a fort from a team of attackers (Deka and Sycara 2021), and the Half Field Offense domain (HFO, Figure 1d), where a team of offense agents has to score a goal against a team of defenders that includes a goalkeeper (Hausknecht et al. 2016). Agents in these domains have limited knowledge of each other's capabilities, no prior experience of working as a team, limited ability to observe the environment (Figure 1b), and limited bandwidth for communication. Such scenarios are representative of multiple practical application domains such as disaster rescue and surveillance.
The state of the art in AHT has transitioned from the use of predetermined policies for selecting actions in specific states to the use of a key "data-driven" component. This component uses probabilistic or deep network methods to model the behavior (i.e., action choice in specific states) of other agents or agent types, and to optimize the ad hoc agent's behavior. These methods require large labeled datasets and are computationally expensive to build the necessary models or to revise them in response to new situations. At the same time, just reasoning with prior knowledge will not allow the ad hoc agent to accurately anticipate the behavior of other agents, and it is not possible to encode comprehensive knowledge about all possible situations. In a departure from existing work, we pursue a cognitive systems approach, which recognizes that AHT jointly poses representation, reasoning, and learning challenges, and seeks to leverage the complementary strengths of knowledge-based reasoning and data-driven learning from limited examples. Specifically, our knowledge-driven AHT architecture (KAT) builds on knowledge representation (KR) tools to support:
1. Non-monotonic logical reasoning with prior commonsense domain knowledge and rapidly learned predictive models of other agents' behaviors;
2. Use of reasoning and observations to trigger the selection and adaptation of relevant agent behavior models, and the learning of new models as needed; and
3. Use of reasoning to guide collaboration with teammates under partial observability.
In this paper, we build on and significantly extend our recent work that demonstrated just the first capability (listed above) in the FA domain (Dodampegama and Sridharan 2023a). We use Answer Set Prolog (ASP) for non-monotonic logical reasoning, and heuristic methods based on ecological rationality principles (Gigerenzer 2020) for rapidly learning and revising agents' behavior models. We evaluate KAT's capabilities in the FA domain and the more complex HFO domain. We demonstrate that KAT's performance is better than that of just the non-monotonic logical reasoning component, and is comparable or better than state of the art data-driven methods, particularly in the presence of partial observability and changes in team composition.

Related Work
Methods for AHT have been developed under different names and in different communities over many years, as described in a recent survey (Mirsky et al. 2022). Early work used specific protocols ('plays') to define how an agent should behave in different scenarios (states) (Bowling and McCracken 2005). Subsequent work used sample-based methods such as Upper Confidence bounds for Trees (UCT) (Barrett et al. 2013), or combined UCT with methods that learned models from historical data for online planning (Wu et al. 2011). More recent methods have included a key data-driven component, using probabilistic, deep-network, and reinforcement learning (RL)-based methods to learn action (behavior) choice policies for different types of agents based on a long history of prior observations of similar agents or situations (Barrett et al. 2017; Rahman et al. 2021). For example, RL methods have been used to choose the most useful policy (from a set of learned policies) to control the ad hoc agent in each situation (Barrett et al. 2017), or to consider predictions from learned policies when selecting an ad hoc agent's actions for different types of agents (Santos et al. 2021). Attention-based deep neural networks have been used to jointly learn policies for different agent types (Chen et al. 2020) and for different team compositions (Rahman et al. 2021). Other work has combined sampling strategies with learning methods to optimize performance (Zand et al. 2022). There has also been work on using deep networks to learn sequential and hierarchical models that are combined with approximate belief inference methods to achieve teamwork under ad hoc settings (Zintgraf et al. 2021).
Researchers have explored different communication strategies for AHT. Examples include a multiagent, multi-armed bandit formulation to broadcast messages to teammates at a cost (Barrett et al. 2017), and a heuristic method to assess the cost and value of different queries to be considered for communication (Macke et al. 2021). These methods, similar to the data-driven methods for AHT discussed above, require considerable resources (e.g., computation, training examples), build opaque models, and make it difficult to adapt to changes in team composition.
There has been considerable research in developing action languages and logics for single- and multiagent domains. This includes action language A for an agent computing cooperative actions in multiagent domains (Son and Sakama 2010), and action language C for modeling benchmark multiagent domains with minimal extensions (Baral et al. 2010). Action language B has also been combined with Prolog and ASP to implement a distributed multiagent planning system that supports communication in a team of collaborative agents (Son et al. 2010). More recent work has used B for planning in single agents and multiagent teams, including a distributed approach based on negotiations for non-cooperative or partially-collaborative agents (Son and Balduccini 2018). To model realistic interactions and revise the domain knowledge of agents, researchers have introduced specific action types, e.g., world-altering, sensing, and communication actions (Baral et al. 2010). Recent work has represented these action types in action language mA* while also supporting epistemic planning and dynamic awareness of action occurrences (Baral et al. 2022). These studies have demonstrated the expressive power and reasoning capabilities that logics in general, and non-monotonic logics such as ASP in particular, provide in the context of multiagent systems. Our work draws on these findings to address the reasoning and learning challenges faced by an ad hoc agent that has to collaborate with teammates without any prior coordination, and under conditions of partial observability and limited communication.

Architecture
Figure 2 provides an overview of our KAT architecture. KAT enables an ad hoc agent to perform non-monotonic logical reasoning with prior commonsense domain knowledge, and with incrementally learned behavior models of teammate and opponent agents. At each step, valid observations of the domain state are available to all the agents. Each agent uses these observations to independently determine and execute its individual actions in the domain. The components of KAT are described below using the following two example domains.
Example Domain 1: Fort Attack (FA). Three guards are protecting a fort from three attackers. One guard is the ad hoc agent that can adapt to changes in the domain and team composition. An episode ends if: (a) the guards manage to protect the fort for a period of time; (b) all members of a team are eliminated; or (c) an attacker reaches the fort. At each step, each agent can move in one of the four cardinal directions with a particular velocity, turn clockwise or anticlockwise, do nothing, or shoot to kill any agent of the opposing team that is within its shooting range. The environment provides four types of built-in policies for guards and attackers (see Section 4.1). The original FA domain is fully observable, i.e., each agent knows the state of other agents at each step. We simulate partial observability by creating a "forest" region (Figure 1b); any agent in this region is hidden from the others.
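To make the episode logic concrete, the termination conditions and the shooting-range restriction above can be sketched as follows. This is an illustrative simplification, not the FA environment's code; the shooting range, fort position, and reach threshold are assumed values.

```python
import math

SHOOT_RANGE = 0.4    # assumed value; the FA environment defines its own range
FORT = (0.0, -0.9)   # assumed fort position
REACH_EPS = 0.1      # assumed distance at which an attacker "reaches" the fort

def in_shoot_range(shooter, target):
    """An agent can only shoot opponents within its shooting range
    (the real environment also checks the shooter's orientation)."""
    return math.hypot(target[0] - shooter[0], target[1] - shooter[1]) <= SHOOT_RANGE

def episode_outcome(step, max_steps, guards_alive, attackers_alive, attacker_positions):
    """Termination conditions (a)-(c) from the domain description; None = ongoing."""
    if not attackers_alive:
        return "guards_win"        # (b) all attackers eliminated
    if not guards_alive:
        return "attackers_win"     # (b) all guards eliminated
    if any(math.hypot(x - FORT[0], y - FORT[1]) <= REACH_EPS
           for x, y in attacker_positions):
        return "attackers_win"     # (c) an attacker reached the fort
    if step >= max_steps:
        return "guards_win"        # (a) fort protected for the full period
    return None
```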
Example Domain 2: Half Field Offense (HFO). This simulated 2D soccer domain is a complex benchmark for multiagent systems and AHT (Hausknecht et al. 2016). Each game (i.e., episode) is essentially played in one half of the field. The ad hoc agent is one of the members of the offense team that seeks to score a goal against agents in the team defending the goal. An episode ends when: (a) the offense team scores a goal; (b) the ball leaves the field; (c) the defense team captures the ball; or (d) the maximum episode length (500 steps) is reached.
There are two versions of the HFO domain: (i) limited: with two offense agents and two defense agents (including the goalkeeper); and (ii) full: with four offense agents and five defense agents (including the goalkeeper). Similar to prior AHT methods, agents other than the ad hoc agent are selected from teams created in the RoboCup 2D simulation league competitions. Specifically, other offense team agents are based on the binary files of five teams: helios, gliders, cyrus, axiom, and aut. For defenders, we use agent2d agents, whose policy was derived from helios. The strategies of these agent types were trained using data-driven (probabilistic, deep, reinforcement) learning methods. HFO supports two state space abstractions: low and high; we use the high-level features. In addition, there are three abstractions of the action space: primitive, mid-level, and high-level; we use a combination of mid-level and high-level actions. This choice of representation was made to facilitate comparison with existing work.
Prior commonsense knowledge in these two domains includes relational descriptions of some domain attributes (e.g., safe regions), agent attributes (e.g., location), default statements, and axioms governing change in the domain, e.g., an agent can only move to a location nearby, only shoot others within its range (FA), and only score a goal from a certain angle (HFO). Specific examples of this knowledge are provided later in this section. Although this knowledge may need to be revised over time in response to changes in the domain, we do not explore knowledge acquisition and revision in this paper; for related work by others in our group, please see the papers by Sridharan and Mota (2023) and Sridharan and Meadows (2018).

Knowledge Representation and Reasoning
In KAT, the transition diagram of each domain is described in an extension of the action language AL_d (Gelfond and Inclezan 2013). Action languages are formal models of parts of natural language that are used to describe the transition diagrams of dynamic domains. The domain representation comprises a system description D, a collection of statements of AL_d, and a history H. D has a sorted signature Σ which consists of actions, statics, i.e., domain attributes whose values cannot be changed, and fluents, i.e., domain attributes whose values can be changed by actions. For example, Σ in the HFO domain includes basic sorts for the different kinds of agents (e.g., the ad hoc agent, other offense agents, defense agents), the ball, and locations, as well as the sort step for temporal reasoning. Sorts are organized hierarchically, with some sorts (e.g., the offense and defense agent sorts) being subsorts of others (e.g., agent). Statics in Σ are relations that encode the relative arrangement of locations (in the HFO domain). The fluents in Σ include inertial fluents that obey inertia laws and can be changed by actions, and defined fluents that do not obey inertia laws and are not changed directly by actions. Inertial fluents in the HFO domain describe the location of the ad hoc agent, the location of the ball, and the agent that has control of the ball; the value of these attributes changes as a direct consequence of executing specific actions. Defined fluents of the HFO domain encode the location of the external (i.e., non-ad hoc) agents, whether a defense agent is too close to another agent, and whether the ad hoc agent is far from the goal. Note that the ad hoc agent has no direct control over the value of these fluents, although its actions can influence their values. Next, actions in the HFO domain include the ad hoc agent's ability to move to a location, kick the ball toward the goal, dribble the ball to a location, and pass the ball to a teammate. Next, axioms in D describe
domain dynamics using elements in Σ in three types of statements: causal laws, state constraints, and executability conditions. For the HFO domain, these include Statements 4(a-d): Statements 4(a-b) are causal laws that specify that moving and dribbling change the ad hoc agent's and ball's location (respectively) to the desired location. Statement 4(c) is a state constraint that implies that only one agent can control the ball at any time. Statement 4(d) is an executability condition that prevents the consideration of a shooting action (during planning) if the ad hoc agent is far from the goal. Finally, the history H is a record of observations of fluents at particular time steps, i.e., obs(fluent, boolean, step), and of action executions at particular time steps, i.e., hpd(action, step). It also includes initial state defaults, i.e., statements about the initial state that are believed to be true in all but a few exceptional circumstances, e.g., a default stating that attackers in the FA domain usually spread and attack the fort. To enable an ad hoc agent to reason with prior knowledge, the domain description in AL_d is automatically translated to a program Π(D, H) in CR-Prolog (Balduccini and Gelfond 2003), an extension of ASP that supports consistency restoring (CR) rules. ASP is based on stable model semantics and can represent constructs difficult to express in classical logic formalisms. It encodes concepts such as default negation and epistemic disjunction, and supports non-monotonic reasoning; this ability to revise previously held conclusions is essential in AHT. Π(D, H) incorporates the relation holds(fluent, step) to state that a particular fluent is true at a given step, and occurs(action, step) to state that a particular action occurs in a plan at a given step. It includes the signature and axioms of D, inertia axioms, reality checks, closed world assumptions for defined fluents and actions, the observations, actions, and defaults from H, and a CR rule for every default allowing the agent to assume
that the default's conclusion is false in order to restore consistency under exceptional circumstances. For example, the CR rule associated with the default above allows the ad hoc agent to consider the rare situation of attackers mounting a frontal attack. Π(D, H) also includes helper axioms, e.g., to define goals and drive diagnosis. Reasoning tasks such as planning, diagnosis, and inference are then reduced to computing answer sets of Π. The ad hoc agent may need to prioritize different goals at different times, e.g., score a goal when it has control of the ball, and position itself at a suitable location otherwise; this is encoded by rules that select a goal based on fluents such as the ad hoc agent having control of the ball.
A suitable goal is selected and included in Π(D, H) automatically at run-time. In addition, heuristics are encoded to direct the search for plans, e.g., optimization statements that encourage the ad hoc agent to select actions that minimize the total cost when computing action sequences to achieve a particular goal. We use the SPARC system (Balai et al. 2013) to compute answer sets of the program. KAT also builds on a refinement-based architecture that couples a coarse-resolution domain description with a fine-resolution description, e.g., each fine-resolution location is a component of a coarse-resolution region. This coupling between the descriptions enables the ad hoc agent to automatically choose the relevant part of the relevant description based on the goal or abstract action, and to transfer relevant information between the descriptions. Example programs are in our repository (Dodampegama and Sridharan 2023b); details of the refinement-based architecture are in the paper by Sridharan et al. (2019).
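To illustrate what the translated program computes, the following sketch applies one causal law and the inertia axiom to a set of fluent literals. The relation and constant names (loc, has_ball, ad_hoc_agent, l1, l2) are illustrative stand-ins rather than the paper's exact SPARC encoding, and in KAT this computation is performed declaratively by the ASP solver rather than by imperative code.

```python
# Minimal sketch of the effect of causal laws plus inertia: holds(fluent, step)
# is modeled as a set of fluent tuples, and one step of action execution
# updates only the fluents a causal law mentions; all others persist.

def progress(holds, action):
    """Apply one action to a state (a set of fluent tuples)."""
    nxt = set(holds)
    kind = action[0]
    if kind == "move":            # causal law: moving changes the agent's location
        _, agent, loc = action
        nxt = {f for f in nxt if not (f[0] == "loc" and f[1] == agent)}
        nxt.add(("loc", agent, loc))
    elif kind == "dribble":       # causal law: dribbling moves agent and ball
        _, agent, loc = action
        nxt = {f for f in nxt
               if not (f[0] == "loc" and f[1] in (agent, "ball"))}
        nxt.update({("loc", agent, loc), ("loc", "ball", loc)})
    return nxt                    # inertia: everything untouched carries over

state0 = {("loc", "ad_hoc_agent", "l1"), ("loc", "ball", "l1"),
          ("has_ball", "ad_hoc_agent")}
state1 = progress(state0, ("dribble", "ad_hoc_agent", "l2"))
```

Note how ("has_ball", "ad_hoc_agent") survives the dribble step by inertia, while both location fluents change as direct effects of the action.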

Agent Models and Model Selection
Since reasoning with just prior domain knowledge can lead to poor team performance under AHT settings (see Section 4.2), KAT enables the ad hoc agent to also reason with models that predict (i.e., anticipate) the action choices of other agents. State of the art methods attempt to optimize performance in different (known or potential) situations by learning models offline from many (e.g., hundreds of thousands or millions of) examples. It is intractable to obtain such labeled examples of different situations in complex domains, and the learned models are truly useful only if they can be learned and revised rapidly during run-time to account for previously unknown situations. KAT thus chooses relevant attributes for models that can be: (a) learned from limited (e.g., 10000) training examples acquired from simple hand-crafted policies (e.g., spread and shoot in FA, pass when possible in HFO); and (b) revised rapidly during run-time to provide reasonable accuracy. Tables 1 and 2 list the identified attributes in the FA and HFO domains respectively; in Table 2, the number of attributes is the number of variables in each attribute times the number of agents. Similar to our recent work (Dodampegama and Sridharan 2023a), the attributes are identified and the predictive models are learned using the Ecological Rationality (ER) approach, which draws on insights from human cognition, Herb Simon's definition of Bounded Rationality, and an algorithmic model of heuristics (Gigerenzer 2020; Gigerenzer and Gaissmaier 2011). ER focuses on decision making under true uncertainty (e.g., in open worlds), characterizes behavior as a joint function of internal (cognitive) processes and the environment, and focuses on satisficing based on differences between observed and predicted behavior. Also, heuristic methods are viewed as a strategy to ignore part of the information in order to make decisions more quickly, frugally, and/or
accurately than complex methods. In addition, ER advocates the use of an adaptive toolbox of classes of heuristics (e.g., one-reason, sequential search, lexicographic), and comparative out-of-sample testing to identify heuristics that best leverage the target domain's structure. This approach has provided good performance in many applications (Gigerenzer 2016). Specifically, in KAT, ER principles such as abstraction and refinement, and statistical attribute selection methods, are applied to the set of 10000 samples to identify the key attributes and their representation in Tables 1 and 2; these define behavior in the FA domain and HFO domain respectively. The coarse- and fine-resolution representation described in Section 3.1 is an example of the principle of refinement. In addition to the choice of features, the characteristic factors of AHT, e.g., the need to make rapid decisions under resource constraints and respond to dynamic changes with limited examples, are matched with the toolbox of heuristics to identify and use an ensemble of "fast and frugal" (FF) decision trees to learn the behavior prediction models for each type of agent. Each FF tree in an ensemble focuses on one valid action, provides a binary class label, and has the number of leaves limited by the number of attributes (Katsikopoulos et al. 2021). Figure 3 shows an example of a FF tree learned (as part of the corresponding ensemble) for a guard agent (Figure 3a) and an attacker agent (Figure 3b) in the FA domain.
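The structure of such a tree can be sketched as a short chain of single-attribute tests, each with an immediate exit. The cues and the range threshold below are hypothetical, not the learned trees of Figure 3:

```python
class FFTree:
    """Fast-and-frugal tree: each cue is (test, exit_on, exit_label).
    If test(x) == exit_on, the tree exits immediately with exit_label;
    otherwise it falls through to the next cue (the last leaf is the default)."""

    def __init__(self, cues, default):
        self.cues, self.default = cues, default

    def predict(self, x):
        for test, exit_on, exit_label in self.cues:
            if test(x) == exit_on:
                return exit_label
        return self.default

# Illustrative tree for one valid action ("shoot") of a guard agent.
shoot_tree = FFTree(
    cues=[
        (lambda x: x["attacker_visible"], False, False),      # no target -> don't shoot
        (lambda x: x["attacker_dist"] <= 0.4, False, False),  # out of range -> don't shoot
    ],
    default=True)                                             # otherwise shoot

# An ensemble holds one such binary tree per valid action of the modeled agent.
```

Because every internal node has an exit leaf, prediction inspects at most one attribute per level, which is what makes these models fast to evaluate and cheap to revise online.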
The ad hoc agent's teammates and opponents may include different types of agents whose behavior may change over time. Unlike our prior work that used static prediction models, we enable the ad hoc agent to respond to such changes by automatically revising the current model, switching to a relevant model, or learning new models. Existing models are revised by changing the parameters of the FF trees, and Algorithm 1 is an example of our approach for selecting a suitable model in the context of predicting the pose (i.e., position and orientation) of agents. Specifically, the ad hoc agent periodically compares the existing models' predictions with the observed action choices of each agent (teammate, opponent) over a sliding window of domain states and the agents' action choices; in Algorithm 1, this window is of size 1 (Lines 4-5). Also, a graded strategy is used to compute the error, penalizing differences in orientation less than differences in position (Lines 6-7). The model whose predictions best match the observations is selected for subsequent use and revision (Line 10, Algorithm 1). Note that if none of the models provides a good match over multiple steps, this acts as a trigger to learn a new model.
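The selection step of Algorithm 1 can be sketched as follows; the specific weights implementing the graded strategy (orientation penalized less than position) are assumptions for illustration:

```python
import math

def graded_error(pred_pose, obs_pose, w_pos=1.0, w_ori=0.3):
    """Pose error with orientation differences penalized less than position
    differences (cf. Lines 6-7 of Algorithm 1); poses are (x, y, theta)."""
    (px, py, pt), (ox, oy, ot) = pred_pose, obs_pose
    pos_err = math.hypot(px - ox, py - oy)
    ori_err = abs(math.atan2(math.sin(pt - ot), math.cos(pt - ot)))  # wrapped angle
    return w_pos * pos_err + w_ori * ori_err

def select_model(models, window):
    """Return the name of the model whose pose predictions best match the
    observations in the sliding window (size 1 in Algorithm 1).
    models: name -> predict(state) -> pose; window: [(state, observed_pose)]."""
    def total_error(name):
        return sum(graded_error(models[name](s), obs) for s, obs in window)
    return min(models, key=total_error)
```

If the winning model's total error stays above a threshold over several windows, that is the trigger (not shown here) for learning a new model.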

Partial Observability and Communication
In practical AHT domains, no single agent can observe the entire domain, and communication is a scarce resource. To explore the interplay between partial observability and communication, we modified the original domains. Specifically, in the FA domain, we introduced a forest region where attackers can hide from the view of the two guards other than the ad hoc agent and secretly approach the fort (see Figure 1b). The ad hoc agent has visibility of the forest region; it can decide when to communicate with its teammates, e.g., when: (a) one or more attackers are hidden in the forest; and (b) one of the other guards is closer to the hidden attacker(s) than it is. The associated reasoning can be encoded using statements such as Statements 8(a-c): Statement 8(c) encodes that communication is used only when a hidden attacker is within the range of a teammate (i.e., guard agent); Statement 8(b) defines when an attacker is hidden; and Statement 8(a) describes the ad hoc agent's belief that a teammate receiving information about a hidden attacker will shoot it, although the teammate acts independently and may choose to ignore this information. If there are multiple guards satisfying these conditions, the ad hoc agent may only communicate with the guard closest to the hidden attacker(s).
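A minimal sketch of this communication decision, with an assumed range value and a tuple-based interface standing in for the ASP encoding of Statements 8(a-c):

```python
import math

def should_alert(ad_hoc_pos, teammates, hidden_attackers, shoot_range=0.4):
    """Return a (guard, attacker) pair worth communicating about, or None.
    Mirrors the conditions above: an attacker is hidden in the forest, a
    teammate guard is closer to it than the ad hoc agent and has it within
    shooting range; among eligible guards, pick the closest one.
    The shoot_range value is an assumption for illustration."""
    best = None
    for att in hidden_attackers:
        d_self = math.hypot(ad_hoc_pos[0] - att[0], ad_hoc_pos[1] - att[1])
        for guard, (gx, gy) in teammates.items():
            d = math.hypot(gx - att[0], gy - att[1])
            if d <= shoot_range and d < d_self and (best is None or d < best[0]):
                best = (d, guard, att)
    return None if best is None else (best[1], best[2])
```

The teammate is free to ignore the message; the function only decides whether the (costly) communication action is worth executing.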
In the HFO domain, we represent partial observability in an indirect manner using the built-in ability to limit each agent's perception to a specific viewing cone relative to the agent. Specifically, each agent is only able to sense objects (e.g., other agents, the ball) within its viewing cone; objects outside its viewing cone are not visible. Given this use of built-in functions, we added some helper axioms to ensure that the ad hoc agent only reasoned with visible objects; no additional communication action was implemented.
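The visibility test underlying these helper axioms amounts to a cone check of the following form; the 90-degree cone width is an assumption (HFO's view-angle setting is configurable), and in the simulator this check is built in:

```python
import math

def visible(agent_pos, agent_heading, obj_pos, half_angle=math.radians(45)):
    """True if obj_pos lies inside the agent's viewing cone (angle check only;
    the simulator's built-in sensing may also limit range, omitted here)."""
    bearing = math.atan2(obj_pos[1] - agent_pos[1], obj_pos[0] - agent_pos[0])
    # Wrap the angular difference into [-pi, pi] before comparing.
    diff = math.atan2(math.sin(bearing - agent_heading),
                      math.cos(bearing - agent_heading))
    return abs(diff) <= half_angle
```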

Experimental setup and results
We experimentally evaluated three hypotheses about KAT's capabilities:
H1: KAT's performance is comparable or better than that of state of the art baselines in different scenarios while requiring much less training;
H2: KAT enables adaptation to unforeseen changes in the type and number of other agents (teammates and opponents); and
H3: KAT supports adaptation to partial observability with limited communication capabilities.
We evaluated aspects of H1 and H2 in both domains (FA, HFO) under full observability. For H3, we considered partial observability in both domains, and explored limited communication in the FA domain. Each game (i.e., episode) in the FA domain had three guards and three attackers, with our ad hoc agent replacing one of the guards. In the HFO domain, each game (i.e., episode) had two offense and two defense players (including one goalkeeper) in the limited version, and four offense and five defense players (including one goalkeeper) in the full version. Our ad hoc agent replaced one of the offense agents in the HFO domain. In the FA domain, the key performance measure was the win percentage of the guards team. In the HFO domain, the key performance measure was the fraction of games in which the offense team scored a goal. In both domains, we also measured the accuracy of the predictive models. Further details of the experiments and the associated baselines are provided below.

Experimental Setup
In the FA domain, we used two kinds of policies for the agents other than our ad hoc agent: hand-crafted policies and built-in policies. Hand-crafted policies were constructed as simple strategies that produce basic behavior. Built-in policies were provided with the domain; they are based on graph neural networks trained using many labeled examples.
Hand-crafted Policies.
• Policy1: guards stay near the fort and shoot attackers who spread and approach.
• Policy2: guards and attackers spread and shoot their opponents.
Built-in Policies.
• Policy220: guards stay in front of the fort and shoot continuously as attackers approach.
• Policy650: guards try to block the fort; attackers try to sneak in from all sides.
• Policy1240: guards spread and shoot the attackers; attackers sneak in from all sides.
• Policy1600: guards are willing to move away from the fort; some attackers approach the fort and shoot to distract the guards while others try to sneak in.
The ad hoc agent was evaluated in two experiments: Exp1, in which other agents followed the hand-crafted policies; and Exp2, in which other agents followed the built-in policies. As stated earlier, the ad hoc agent learned behavior models in the form of FF trees from 10000 state-action observations obtained by running the hand-crafted policies. It was not provided any prior experience or models of the built-in policies.
Our previous work documented the accuracy of a basic AHT architecture that reasoned with some domain knowledge and static behavior prediction models in the FA domain (Dodampegama and Sridharan 2023a). In this paper, the focus is on evaluating the ability to select, revise, and learn the relevant predictive models, and to adapt to partial observability. For the former, each agent other than our ad hoc agent was assigned a policy selected randomly from the available policies (described above). The baselines for this experiment were:
• Base1: other agents followed a random mix of hand-crafted policies. The ad hoc agent did not revise the learned behavior models or use the model selection algorithm.
• Base2: other agents followed a random mix of hand-crafted policies. The ad hoc agent used the model selection algorithm without a graded strategy to compare the predicted and actual actions, i.e., a fixed penalty was assigned for an action mismatch in Line 6 of Algorithm 1.
• Base3: other agents followed a random mix of built-in policies. The ad hoc agent did not revise the learned behavior models or use the model selection algorithm.
• Base4: other agents followed a random mix of built-in policies. The ad hoc agent used the model selection algorithm without a graded strategy to compare predicted and actual actions, i.e., a fixed penalty was assigned for an action mismatch in Line 6 of Algorithm 1.
The baselines for evaluating partial observability and communication were:
• Base5: in Exp1, other agents followed hand-crafted policies and the ad hoc agent did not use any communication actions.
• Base6: in Exp2, other agents followed built-in policies and the ad hoc agent did not use any communication actions.
Recall that KAT allows the use of communication actions (when needed) under conditions of partial observability. Also, each experiment described above (in the FA domain) involved 150 episodes, and the results were tested for statistical significance.
In the HFO domain, we used six external agent teams from the 2013 RoboCup simulation competition to create the ad hoc agent's teammates and opponents. Five teams were used to create offense agents: helios, gliders, cyrus, axiom, and aut; agents of the defense team were based on the agent2d team. Similar to the initial phase in the FA domain, we deployed the existing agent teams in the HFO domain and collected observations of states before and after each transition in an episode. Since the actions of other agents are not directly observable, they were computed from the observed state transitions. To evaluate the ability to learn from limited data, we only used data from 300 episodes for each type of agent to create the tree-based models for behavior prediction, which were then revised (as needed) and used by the ad hoc agent during reasoning. We first compared KAT's performance with a baseline that only used non-monotonic logical reasoning with prior knowledge but without any behavior prediction models (Exp3), i.e., the ad hoc agent was unable to anticipate the actions of other agents. Next, we evaluated KAT's performance with each built-in external team, i.e., all offense agents other than the ad hoc agent were based on one randomly selected external team in each episode. In Exp4, we measured performance in the limited version, i.e., two offense players (including the ad hoc agent) against two defense agents (including the goalkeeper). In Exp5, we measured performance in the full version, i.e., four offense players (including the ad hoc agent) played against five defense agents (including the goalkeeper). In Exp6 and Exp7, we evaluated performance under partial observability in the limited and full versions respectively. As the baselines for Exp4-Exp5, we used recent (state of the art) AHT methods: PPAS (Santos et al. 2021) and PLASTIC (Barrett et al. 2017). These methods considered the same external agent teams mentioned above, allowing us to compare our results with the results reported in their papers. For Exp6-Exp7, we used the external agent teams as baselines. We conducted 1000 episodes for each experiment described above, and tested the results for statistical significance.

Experiment Results
We begin with the results of experiments in the FA domain. First, Table 3 summarizes the results of using our model selection algorithm in Exp1. When the other agents followed the hand-crafted policies and the model selection mechanism was not used by the ad hoc agent (Base1), the team of guards had the lowest winning percentage. When the ad hoc agent used the model selection algorithm with a fixed penalty assigned for any mismatch between predicted and actual actions (Base2), the performance of the team of guards improved. When the ad hoc agent used KAT's model selection method (Algorithm 1), the winning percentage of the team of guards was substantially higher than with the other two options. These results demonstrated that KAT's adaptive selection of the behavior prediction models improved performance.
Next, the results of Exp2 are summarized in Table 4. We observed that KAT enabled the ad hoc agent to adapt to previously unseen teammates and opponents that used the FA domain's built-in policies, based on the model selection algorithm and the online revision of the behavior models learned from the hand-crafted policies. KAT provided the best performance compared with not using any model adaptation or selection (Base3) and with model selection that assigned a fixed penalty for action mismatch (Base4). These results, together with Table 3, support H1 and H2.
The results from Exp1 under partial observability, with and without communication (Base5), are summarized in Table 5. Recall that the other agents used the FA domain's hand-crafted policies in this experiment. When the communication actions were enabled for the ad hoc (guard) agent, the winning percentage of the team of guards was substantially higher than when these actions were unavailable. Policy2 was a particularly challenging scenario (because both guards and attackers can shoot), which explains the lower overall winning percentage.
Next, the results from Exp2 under partial observability, with and without communication (Base6), are summarized in Table 6. Recall that the other agents used the FA domain's built-in policies. We observed that when the guards (other than the ad hoc agent) followed policies 650, 1240, or 1600, the winning percentage of the team of guards was comparable or higher when communication actions were enabled than when these actions were not available (Base6). With policy 220, the performance of the team of guards was slightly worse when the communication actions were enabled. However, unlike the other policies, policy 220 results in the guards spreading themselves in front of the fort and shooting continuously. Under these circumstances, partial observability and communication strategies were not important factors in determining the outcome of the corresponding episodes. These results support hypothesis H3.

We next describe the results from the HFO domain. Table 7 summarizes the results of Exp3, which compared KAT's performance with a baseline that had the ad hoc agent only reasoning with prior knowledge, i.e., without any learned models predicting the behavior of other agents. With KAT, the fraction of goals scored by the offense team was significantly higher than with the baseline. These results emphasize the importance of learning and using the behavior prediction models, and indicate that leveraging the interplay between representation, reasoning, and learning leads to improved performance, supporting hypothesis H1.
Next, the prediction accuracy of the learned behavior models for the limited version (Exp4) and full version (Exp5) of the HFO domain is summarized in Tables 8 and 9, respectively. Recall that these behavior models were learned for the agents other than the ad hoc agent using data from 300 episodes (for each external agent type). This is orders of magnitude fewer training samples than the few hundred thousand used by state of the art data-driven methods that do not reason with domain knowledge. The prediction accuracy varied over a range for the different agent types. Although the accuracy values were not very high, the models could be learned and revised quickly at run time; moreover, these models resulted in good performance when the ad hoc agent also reasoned with prior knowledge.
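The tree-based behavior models learned from these limited samples are fast-and-frugal (FF) trees of the kind shown in Figure 3: an ordered sequence of cues, each of which either exits with an action or defers to the next cue. The sketch below is illustrative only; the cue predicates, attribute names, and actions are assumptions in the style of the FA domain, not the trees actually learned by KAT.

```python
# Hypothetical sketch of a fast-and-frugal (FF) tree: an ordered list of
# cues, each either exiting with an action or deferring to the next cue.
def make_ff_tree(cues):
    """cues: list of (predicate, action); a (None, default) pair terminates."""
    def predict(state):
        for predicate, action in cues:
            if predicate is None or predicate(state):
                return action   # first satisfied cue exits the tree
        return None
    return predict

# Example: a guard-like policy in an FA-style domain (attributes assumed).
guard_tree = make_ff_tree([
    (lambda s: s["attacker_dist"] < 0.2, "shoot"),        # threat is close
    (lambda s: s["dist_to_fort"] > 0.5, "move_to_fort"),  # too far: fall back
    (None, "patrol"),                                     # default cue
])
```

Because each tree is just a short cue list, an ensemble of such trees can be learned from a few hundred episodes and revised online, consistent with the rapid run-time revision reported above.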
The results of Exp4 and Exp5, comparing KAT's performance with the state of the art baselines for the HFO domain (PPAS, PLASTIC), are summarized in Table 10. Recall that these data-driven baselines required orders of magnitude more training examples and did not support reasoning with prior domain knowledge. The fraction of goals scored (i.e., games won) by the team of offense agents including our ad hoc agent was comparable with that of the baselines in the limited version, and substantially better in the full version. These results strongly support hypotheses H1 and H2.
The results of evaluating KAT under partial observability in the HFO domain are summarized in Table 11, compared with teams of external agent types without any ad hoc agent. Although the results indicate that KAT's performance was slightly lower than that of the baseline teams, the difference was not significant and was mainly due to noise (e.g., in the perceived angle to the goal during certain episodes). The ability to provide performance comparable with teams whose training datasets were orders of magnitude larger strongly supports hypothesis H3.
In addition to the experimental results documented above, videos of experimental trials, including trials involving unexpected changes in the number and type of other agents, are provided in support of the hypotheses in our open-source repository (Dodampegama and Sridharan 2023b).

Conclusions
Ad hoc teamwork (AHT) refers to the problem of enabling an agent to collaborate with others without any prior coordination. This problem is representative of many practical multiagent collaboration applications. State of the art AHT methods are data-driven, requiring a large labeled dataset of prior observations to learn offline models that predict the behavior of other agents (or agent types) and determine the ad hoc agent's behavior. This paper described KAT, a knowledge-driven AHT architecture that supports non-monotonic logical reasoning with prior commonsense domain knowledge and predictive models of other agents' behavior that are learned and revised rapidly online using heuristic methods. KAT leverages KR tools and the interplay between reasoning and learning to automate the online selection and revision of the behavior prediction models, and to guide collaboration and communication under partial observability and changes in team composition. Experimental results in two simulated benchmark domains, Fort Attack and Half Field Offense, demonstrated that KAT's performance is better than that of the non-monotonic logical reasoning component alone, and is comparable with or better than that of state of the art data-driven methods that require much larger training datasets, provide opaque models, and do not support rapid adaptation to previously unseen situations.
Our architecture opens up multiple directions for further research. For example, we will investigate the introduction of multiple ad hoc agents in the benchmark domains used in this paper and in other complex multiagent collaboration domains. We will also continue to explore the benefits of leveraging the interplay between reasoning and learning for AHT in teams of many more agents, including on physical robots collaborating with humans. In addition, we will build on other work in our group (Sridharan and Mota 2023; Sridharan and Meadows 2018) to demonstrate the ad hoc agent's ability to learn previously unknown domain knowledge. Furthermore, we will build on our recent work (Dodampegama and Sridharan 2023a) and the work of others in our group (Mota et al. 2021) to enable the ad hoc agent to provide relational descriptions as explanations of its decisions and beliefs in response to different kinds of questions.

Fig. 2: Our KAT architecture combines the complementary strengths of knowledge-based and data-driven heuristic reasoning and learning.
Fig. 3: Examples of FF trees for a guard and an attacker in the FA domain. (a) One FF tree in the ensemble for a guard. (b) One FF tree in the ensemble for an attacker.

Table 1: Attributes considered for models of other agents' behavior in the FA domain. The number of attributes is the number of variables in each attribute times the number of agents.

Table 7: Fraction of goals scored (i.e., games won) by the offense team in the HFO domain with and without the learned behavior prediction models (Exp3). Reasoning with prior domain knowledge but without the behavior prediction models has a negative impact on performance.

Table 8: Prediction accuracy of the learned agent behavior models in the limited (2v2) version of the HFO domain (Exp4).

Table 9: Prediction accuracy of the learned agent behavior models in the full (4v5) version of the HFO domain (Exp5).

Table 10: Fraction of goals scored (i.e., games won) by the offense team in the HFO domain in the limited version (2v2, Exp4) and full version (4v5, Exp5). KAT's performance is comparable with the baselines in the limited version and much better in the full version.

Table 11: Goals scored (i.e., games won) by the offense team in the HFO domain under partial observability (Exp6, Exp7). KAT's performance is comparable with baseline teams that had no ad hoc agent but used training datasets that were orders of magnitude larger.