Exploration-exploitation-based trajectory tracking of mobile robots using Gaussian processes and model predictive control

Abstract Mobile robots are a key component for the automation of many tasks that either require high precision or are deemed too hazardous for human personnel. One of the typical duties for mobile robots in the industrial sector is to perform trajectory tracking, which involves pursuing a specific path through both space and time. In this paper, an iterative learning-based procedure for highly accurate tracking is proposed. This contribution shows how data-based techniques, namely Gaussian process regression, can be used to tailor a motion model to a specific reoccurring reference. The procedure is capable of explorative behavior meaning that the robot automatically explores states around the prescribed trajectory, enriching the data set for learning and increasing the robustness and practical training accuracy. The trade-off between highly accurate tracking and exploration is done automatically by an optimization-based reference generator using a suitable cost function minimizing the posterior variance of the underlying Gaussian process model. While this study focuses on omnidirectional mobile robots, the scheme can be applied to a wide range of mobile robots. The effectiveness of this approach is validated in meaningful real-world experiments on a custom-built omnidirectional mobile robot where it is shown that explorative behavior can outperform purely exploitative approaches.


Introduction
Mobile robots are widely used in various industries due to their automation capabilities. In many applications, such as parcel centers, these robots follow the same path repeatedly. Additionally, mobile robots can be utilized as measurement platforms, for example, for automating measurement campaigns of wireless signals during product evaluation and development. Extremely high accuracy and reproducibility are critical in these cases. Data obtained from driving these periodic paths can be utilized to enhance the tracking capabilities of mobile robots through an approach that combines optimization and data analysis. As more input-output data become available, the reference is adapted in an iterative fashion lowering the tracking error (exploitation) while at the same time exploring previously unseen reference solutions (exploration).
Exploration-exploitation itself is associated with various fields ranging from psychology and economics to computer science, most notably reinforcement learning. The core idea is that an agent needs to exploit its knowledge to maximize a certain reward but at the same time needs to explore to gather knowledge to make better decisions in the long run. The dilemma known as the exploration-exploitation trade-off arises; in a complex environment, following only one of these two goals will in general not maximize the reward [1].
In the proposed procedure, this exploration-exploitation trade-off is done automatically in a controlled and safe manner using a novel offline optimization-based reference generator and a special learning cost function [2]. The main advantage of explorative over purely exploitative approaches is that they are less prone to getting stuck in local optima, which in the case of mobile robots can reliably improve the tracking performance. A practical tracking accuracy in the range of a few millimeters becomes attainable even when using comparatively cheap robotic hardware, which can further boost the employment of mobile robots for the automation of various tasks in industrial practice. While the approach is conceptually similar to the work of ref. [3], trajectory tracking instead of path following is considered and the model in this work is tailored to a specific reference trajectory, reducing the tracking error further without significantly increasing the computational burden. Contrary to comparable iterative approaches [4], the scheme is model-and optimization-based and respects the physical constraints of the robot using a model predictive controller (MPC).
This study is focused on omnidirectional mobile robots. The proposed procedure, however, is applicable to general robots in a wide range of applications. In Section 2, the robotic hardware is presented. Section 3 deals with the necessary Gaussian process fundamentals. Subsequently, in Section 4, the model setup is explained in detail, whereas in Section 5, the proposed reference generator and the corresponding model-based controller are introduced. Finally, in Section 6, the effectiveness of the approach is shown in real-world trajectory tracking experiments of an omnidirectional mobile robot. To the best of the authors' knowledge, the approach taken is conceptually novel and unique in the very high fidelity it achieves. This paper is an invited extension of ref. [5].

The omnidirectional mobile robot
There are various methods to achieve omnidirectional motion, but this study employs a specialized type of wheel called the Mecanum wheel. This type of wheel has several rollers attached to its circumference at an angle. When using Mecanum wheels for omnidirectional mobile robots, there are different wheel arrangements available, with rectangular and radial arrangements being the most popular. The custombuilt robot used in the hardware experiments of Section 6 adopts the radial arrangement, which utilizes four equally spaced wheels as shown in Fig. 1. The robot receives inputs in the form of u := [v x v y ω] T , which refers to the desired translational velocities in the robot's frame of reference, denoted as v x and v y , and the angular velocity around the vertical axis, represented as ω. To compute the four resulting desired wheel speeds i for wheel i, the relationship is used, where R = 3 cm is the radius of a wheel and = 11.5 cm is the distance between the robot's and each wheel's center in the motion plane. Over time, relationship (1) can be utilized to compute the corresponding set points for each wheel's individual proportional-integral-derivative controller, which runs at a frequency of 100 Hz. As the robot should rotate independently from its translational velocity [v x v y ] T , the input used to compute (1) is rotated by an angle of −ω/100 Hz about the vertical axis after each cycle unless a new input is received. Without this counter-rotation, the robot would drive arcs, similar to a differential-drive [6] or car-like mobile robot, which would not fully utilize the robot's omnidirectional capabilities [7]. The custom control software runs on an onboard BeagleBone Blue Linux-based computer with integrated motor drivers. The input u is updated through a wireless network at a frequency of 5 Hz. The robot's pose, which consists of its position within the movement plane in the inertial frame of reference and its orientation, is determined with a frequency of 100 Hz by an external motion tracking system consisting of five Optitrack Prime 13W cameras. The pose measurements and the inputs to the mobile robot are communicated between the different machines via a local network using the Lightweight Communications and Marshalling library [8].

Gaussian process regression
Throughout this paper, a suitable regression method is needed to deal with collected input-output data. We opt for nonparametric Gaussian processes (GPs) [9] as the regression technique, as opposed to parametric regression with a fixed number of basis functions [10], because the structure of the mismatches is unknown a priori and the right choice of basis functions in high dimensional input spaces can be challenging. Neural Networks, on the other hand, while being less restrictive, require large amounts of training data to fit the high number of unknown parameters [11]. A typical scenario for a regression problem involves an unknown scalar function f and a dataset consisting of N f noisy observations at the corresponding input x f i ∈ R n . Here, the additive noise ε follows an independent, identically distributed Gaussian distribution with zero mean and noise variance σ 2 n . The provided input-output tuples are collected in the data set D f = x f i , y i , i ∈ 1, . . . , N f . A Gaussian process can be seen as a generalization of Gaussian distributions to the function space. Any GP is uniquely defined by a mean function m : R n → R and a covariance function k : R n × R n → R. Usually, for notational simplicity the mean function is taken to be zero m(x) ≡ 0. The squared exponential (SE) kernel with inputs x, x ∈ R n is a popular choice for the covariance function k Here, σ 2 SE is the signal variance and typically A is chosen as The parameters i are called the length scales and control how much the kernel function will be affected by changes in each input dimension. Therefore, the length scales play a crucial role in determining the smoothness of the regression function and are inversely related to the importance of each input component. In case the function (2) has a scalar input n = 1 and is periodic with (known) fixed period P PER , a periodic kernel can be used. A suitable periodic kernel can be defined as ref. [12] Again, σ 2 PER is the signal variance and PER is the length scale. The periodic kernel can be combined with other kernels, for example, by multiplication with the SE kernel in case the function f has a multidimensional input and is only periodic in one of its arguments. The posterior distribution at a given input x * is Gaussian N (μ(x * ), σ 2 (x * )) with mean μ and variance σ 2 . In the following, the ith component of a vector-valued variable is indicated via [•] i while [•] ij is the entry of a matrix in row i and column j.

By using the notation[K ab
i and [y] i = y i , the posterior mean and variance are given by Evaluating the mean and variance at a given input requires the inversion of a matrix of size N f × N f , which is computationally expensive for large N f and of order O N 3 f . However, with some precomputations, the order of evaluating the mean and variance can be reduced to O N f and O N 2 f , respectively [13].
For the SE kernel with automatic relevance detection (ARD), the kernel hyperparameters, namely the signal σ 2 SE and noise σ 2 n variance and length scales i , are optimized by solving a maximum loglikelihood problem, see ref. [9] for details. However, the hyperparameters can also be chosen with the help of expert knowledge.

Sparse Gaussian processes
The computational effort of evaluating the posterior mean and variance scales with the number of data points N f . This can limit the use of GPs for larger data sets D f or real-time applications and prohibit an effective usage for this paper's purposes. Therefore, several approaches have been proposed to overcome this issue [14][15][16], with one of them being sparse approximations (sGP) of GPs; see ref. [13] for an overview. In this paper, the variational free energy (VFE) [17] approach is used, which provides a computationally efficient method for approximating the posterior distribution of the GP, with the potential to reproduce the full GP exactly. This approximation technique relies on so-called pseudo-inputs. The number of pseudo-inputs M is usually chosen to be significantly lower than the number of training points M N f . Subsequently, the notation Q ab = K au K −1 uu K ub is used, where index u refers to the superscript of the pseudo-inputs x u i , i ∈ {1, . . . , M} and K uu is the corresponding covariance matrix. For the posterior distribution, when using the VFE technique, one now obtains N μ s (x * ), σ 2 s (x * ) with After some pre-computations, the order of evaluating the mean and variance is reduced to O(M) and O(M 2 ), respectively. The location of the pseudo-inputs can be chosen as a subset of the actual training inputs, for example, by Greedy optimization of their location, or they can be included in the overall hyperparameter optimization.

Data-based model of the mobile robot
A kinematic model of the planar omnidirectional mobile robot is introduced as a nominal baseline. This model is commonly used for robotics applications and will subsequently serve as a starting point for learning purposes. Thus, the kinematic discrete-time relationship that describes the position at time step t ∈ N 0 in the inertial frame of reference is given by where x t := [x t y t θ t ] T is the pose of the mobile robot at time t, that is, its position [x t y t ] T in the inertial frame of reference and its orientation θ t . The rotation matrix describes a positive rotation about the robot's vertical axis by the angle θ t . The notation v t := v(t) is used as a shorthand, unless the explicit time dependency needs to be emphasized. A sampling time of T = 0.2s is selected. The simple kinematic model (7) is only suitable for reasonable input combinations. Thus, the wheel speeds and their rates of change are restricted where max = 20 1/s and max = 3.3 1/s are the maximum wheel speed and the wheel speeds' maximal change between two time instances, respectively. Additionally, the maximum magnitude of the angular velocity ω t of the robot is limited to ω max = 0.5 rad/s. The maximum norm is indicated by • ∞ . These constraints are imposed point-wise at each time step t. If a different type of mobile robot is used, (7) has to be replaced with the corresponding (kinematic) model.

Including data as prior knowledge
In the hardware experiments, it was observed that the mobile robot exhibited interesting nonlinear behavior that was not captured by the simple kinematic model (7). To account for this, an augmented model was introduced that includes a data-based nonlinear function g = g x g y g ω T . This function captures systematic mismatches between the actual system dynamics and the nominal model in the robot's frame of reference. The unknown nonlinear function g(a t ) is assumed to be dependent on the current and previous inputs, that is, This choice is motivated by experience with the robotic hardware. Based on (9), the set of admissible inputs A ⊂ R 6 to the function g can be defined as that is, two consecutive inputs that each satisfy the condition for the maximum wheel speed and angular velocity as well as the condition for the change of wheel speed. The training data D g := a(i), e g (i) , i ∈ 1, . . . , N g are obtained by considering the model mismatch to the nominal model between two consecutive time steps in the robot frame of reference Measurements of the position and orientation of the robot are taken via an optical tracking system with high accuracy. These measurements could be available, for example, from the prior operation of the hardware. In our case, to ensure a good generalization to all possible input combinations satisfying (9) while sufficiently covering the corresponding input space, we generate this data using a special automatic model tuning procedure from refs. [18,19]. To generate the training data in the hardware experiment, the hit-and-run sampler from ref. [20] is used to generate uniformly distributed random points inside the six-dimensional convex polytope A, see Fig. 2. These samples are then concatenated in a time series and applied to the robot. For more information on the specific training procedure, it is referred to ref. [19]. Some of the trajectories that are used in the training process are depicted in Fig. 3.  Each component of the unknown function is then approximated using an individual Gaussian process with posterior mean μ g = μ x g μ y g μ ω g T and diagonal variance g , computed using (5).
In the following, M x = M y = 150 pseudo-inputs are used (see Section 3) for g x and g y , whereas for g ω , fewer points M ω = 70 suffice, based on a convergence analysis of the available data. The size of the overall data set is N g ≈ 2500. We opt for the SE-ARD kernel from (3) as covariance function. The GPML Matlab toolbox [21] is used to train the sparse GPs. Thereby, the location of the pseudo-inputs is included in the optimization. The gradient-based hyperparameter optimization is randomly initialized various times to avoid over-or under-fitting caused by local optima.
The data-based model of the omnidirectional robot using the information gained during previous operation (which was generated using the training procedure in this case) is defined as Here, only the mean predictions of the GPs are used. While, in this paper, Gaussian process regression and the special automatic model tuning procedure from ref. [18] is used, the model mismatch could be trained using almost any other regression technique. After several hundred hours of operational time of the hardware on different trajectories and millions of data points in D g for example, it is more efficient to switch to parametric regression approaches that can cope with large amounts of data such as neural networks [11].
Using data-based dynamical models of the mobile robot Throughout this study, kinematic models are employed. The primary reason is that not all (commercially available) mobile robots allow the user to command specific wheel torques or motor currents directly. In order to keep the results applicable to a wide range of robots, this work focuses on kinematic models with (wheel) speeds as inputs. It is also possible to extend the framework to dynamical robot models, for example, obtained when modeling the robot as a rigid multi-body system via the Newton-Euler approach or Lagrange formulation [22]. However, with dynamical models, the specific hardware setup and the sources of the model mismatch have a more significant influence on the approach one may take, for example, whether the model mismatch g is only added on position or also on velocity level. In order to obtain an accurate model, it may also be necessary to measure or estimate the velocity of the robot chassis and the wheels' angular velocities to predict the robot's state precisely.

Learning trajectory-specific mismatches
The model (14) can predict the robot's position for general input combinations. In the upcoming section, trajectory tracking will be considered. To that end, it is assumed that the robot needs to follow a specific reference trajectory x r t for the state and u r t for the input. To obtain an even higher accuracy for specific, a priori known trajectories, subsequently a more specialized model based on the data-driven model (14) is developed. Thus, a new function denoted as h p t with corresponding input p t is introduced to address discrepancies that are unique to the current trajectory, such as the impact of uneven terrain or discrepancies that arise over longer time periods and cannot be accounted for by g. This will further improve the accuracy of the model for the specific reference trajectory.
To learn the trajectory-dependent model mismatch, one may generally assume that the current mismatch might, classically, depend on the robot's current state and the corresponding input which would not directly fit the idea of learning a trajectory-specific mismatch or mismatches that cannot be represented using only the chosen set of states [3,23]. However, assuming that, due to the choice and consideration of g, the robot's state and input will generally already evolve in the vicinity of the state and input reference x t ≈ x r t and u t ≈ u r t , we instead subsequently assume a dependency on the current input reference and time, that is, p T t := u r t T t . Thus, any state dependency of the mismatch (and other, more structurally profound model mismatches) will be purely taken into account through an explicit dependency on time; a dependency on the actual input is replaced by a dependency on the input reference. However, the input and state references need to be calculated with a model of the robot, preferably with the most accurate one available. Hence, it makes sense to use the most detailed model, taking into account trajectory-specific model mismatches, for the calculation of the reference. Yet, as designed, that model depends on the reference. Therefore, subsequently, an iterative approach is used. As more and more sets of data become available, a candidate for the trajectory-specific mismatch is calculated, which is used to calculate a new input and state reference. On the basis of the latter, with new data from a trajectory tracking experiment using a controller with the updated reference and model, an updated candidate for the trajectory-specific mismatch is calculated, and so forth. The resulting iterative procedure is visualized in Fig. 4.
Each component of the function h = h x h y h ω T is, again, approximated by a Gaussian process x , σ 2 y , σ 2 ω , each computed using (5). Apart from their high data efficiency, the advantage of using GPs as the regression technique that one also obtains an estimate of the remaining uncertainty of the model. This will become useful for the exploration-based approach taken in Section 5.1.
For the dynamics, taking the trajectory-specific mismatches into account, only the mean prediction μ h of the GPs is considered The GPs for the function h are trained using the corresponding model mismatch e h (t), that is, It should be noted that the error e h (t), calculated in the robot's frame of reference, pertains to the baseline model f g rather than the deviation from the state reference. For the training data D h := p(i), e h (i) , i ∈ {1, . . . , N h } containing N h data points, the robot drives N lap laps for each reference trajectory to reduce the effect of random disturbances. Training data that was gathered using a specific input reference can be added to D h . The GPs approximating h can be trained, and the input u r and state x r references are subsequently adapted for the updated model f gh , see Fig. 4. The idea is that, as more data D h for different references become available, the model f gh is improved, further reducing the tracking error. As in the first iteration, no trajectory-specific information is available (D h = ∅), one obtains μ h ≡ 0, and therefore f gh ≡ f g . In Section 5.2, the state and input reference are determined using an optimization-based approach.

Trajectory tracking of periodic references
When it comes to trajectory tracking, robots are typically expected to adhere to a predetermined trajectory x des t that is specified in both space and time. This is a commonplace task for robots in real-world applications. To later obtain a functioning controller, it is generally advantageous to calculate a corresponding input reference as well. This input sequence is computed based on the specific underlying control model of the robot. Using this feedforward solution, the robot will be able to roughly follow its trajectory while the controller merely needs to compensate for small deviations from the desired reference. The input and state reference can be computed analytically for the nominal dynamics via Here, θ des t := x des t 3 refers to the desired orientation of the robot, which is the third component of the desired trajectory. When computing the nominal reference solution that way, it is important to ensure that the nominal reference input adheres to the constraints (9) on the control inputs. This work considers periodic reference trajectories, which satisfy the following properties that is, the final state and input should loop back to the initial state and input reference. However, it is not necessary for the orientation of the robot to be exactly the same from one lap to the next as long as it changes by a whole number of full rotations. It is worth noting that the methods presented in this work can also be applied to more general trajectories, as long as (19) holds for at least some sufficiently long time intervals of the overall trajectory. For instance, if the same corridor is passed with the same velocity profile at irregular intervals but the trajectories do not coincide otherwise.

Exploration-exploitation-based reference generation
In the presence of input constraints (9) and the nonlinear dynamics (14) or (16), it is often difficult to achieve perfect tracking of the desired trajectory x des t . This also means that the reference cannot be computed in closed form like in (18). In this subsection, a method is proposed to compute reference trajectories u r t , x r t for input and state, respectively, automatically solving the exploration-exploitation trade-off of following a desired trajectory x des t as close as possible (exploitation), while at the same time reducing the posterior variance of the underlying Gaussian process model by exploring input sequences along the desired trajectory that may have the potential to further reduce the tracking error (exploration).
Without loss of generality, x r 0 := x des 0 is assumed. The notation v t+i|t is used to indicate the modelbased prediction at time t of the quantity v for time t + i. The reference trajectories are computed in a receding horizon fashion, where for each time step, the trajectory is optimized over a horizon H ff < N. This approach is computationally more efficient than optimizing the entire reference at once but possibly suboptimal. Additionally, this method scales linearly with the period of the reference N, making it viable for longer periods in the first place. The so-called performance cost is defined to minimize the deviations to the input reference with v 2 V := v T Vv. Here, Q ff and R ff denote the symmetric positive definite weighting matrices. The input change in (20) is penalized to encourage smooth input references. For f gh the reference will be iteratively updated as more data becomes available. On its own, the performance cost J t does not incentivize the robot to venture into areas of the input space that have not yet been visited. Because h is inferred from GPs, this can be done by considering the posterior variance h [24]. The model f gh reliably predicts the position of the robot around previously seen input sequences, that is, if the variances of the corresponding GP predictions are low. Input sequences that have not yet been applied to the robot, however, show a large posterior variance of the GP predictions. Therefore, the learning cost function is defined as Here, w j are weighting parameters for the standard deviations σ j p k|t which when squared are the diagonal entries of the posterior variance h p k|t of the GPs approximating h from (15). The scalar parameter 0 < β < 1 is called the forgetting factor [2]. This exponential discounting puts more emphasis on the learning cost towards the beginning of the prediction horizon. This can be useful for the convergence of the reference generation as nominal 'exploitative' behavior is incentivized towards the end of the prediction horizon, which would otherwise conflict with the performance cost J t from (20). The so-called exploration-exploitation cost function [25] used for the reference generation is now defined as Note that L enters V t negatively since the learning cost should be maximized compared to the performance cost J t , that is, the deviation to the desired reference. Minimizing V t allows the reference generator to explore new input sequences in unseen regions of the input space (exploration) while at the same time refining the solution for previously scouted input sequences (exploitation). Thereby, the parameter γ ≥ 0 controls the exploration-exploitation trade-off. This cost function is less prone to getting stuck in local optima, which can possibly reduce the tracking error compared to a purely exploitative approach merely minimizing J t . For each time step 0 < t < N − H ff , the reference can be computed via x r k+1|t = f g/gh (x r k|t , u r k|t , •), The expression f g/gh (x r k|t , u r k|t , •) depends on the application and can refer to either (14) or (16), where the state and input are replaced with their respective reference values. A constant 0 < k ε < 1 is introduced to ensure that the controller can still compensate for any mismatches that may occur during closed-loop operation, as 0 is in the interior of A. In (23e). The notation εA := εa ∈ R 6 | a ∈ A for some ε ∈ R is used to denote a scaled set. The constraint (23g) limits the deviation between the input reference and a baseline input sequence u b t to br = 0.05 m/s 0.05 m/s 0.15 rad/s T . Here u b t is set to the input sequence computed for the nominal system (18), which restricts the solutions to inputs near a reasonable guess. This also limits the amplitude of possible oscillations that can occur due to the exploratory nature of the cost function (22).
The optimization problem (23) is solved for time steps 0 < t < N − H ff , and the state and input reference are set to x r t+1 := x r t+1|t and u r t := u r t|t . However, for the initial time step t = 0 and the final time steps, the optimization problem (23) needs to be modified as shown in Table I (23) is set up in Matlab using CasADi [26] and is solved using the interior-point optimizer IPOPT [27].

Reducing the effective number of data points
The number of training points N h is constantly increasing due to the iterative nature of this procedure. Since, as detailed in Section 3, additional training points increase the time it takes for a prediction of the posterior mean and variance, we aim to mitigate this subsequently.
One suitable approach to reduce the effective number of data points for the GPs is called local approximate Gaussian process models [16]. Since, for each input at a given time step t, the data points that were recorded for a time step far into the future or in the past in previous iterations will likely have little effect on the approximation. Therefore, only the points that have the most influence on the Gaussian process approximation in the current time interval could be selected from D h .
Several approaches were proposed in the literature [3,15,28] on how to select a subset of points from D h for local GP approximations, ranging from Greedy optimization [29] to simple nearest neighbor approaches [16]. Note, however, that the scheme to select these points should be reasonably quick, as this would otherwise conflict with the overall goal of reducing the computational time. Generally, it has been observed that a simple (and fast) nearest neighbor approach works well [16] and will therefore be used in the following. Contrary to what was proposed in refs. [16,29], however, instead of actually reducing the training points to a subset of D h , sparse GPs are trained on the full data set and pseudo-inputs are placed at the specific locations [28,30] which results in the same computational effort for the evaluation of the posterior mean and variance. Since an input is comprised of translational speeds, a rotational velocity, and a time step, the kernel k j , j ∈ {x, y, ω}, is used as a similarity/distance measure [30]. This procedure has to be done prior to solving optimization problem (23) for each time step.
Since for every iteration N lap laps are driven, for each time step, several data points with slightly different model mismatches are obtained. However, the corresponding inputs of each iteration repeat N lap times due to the assumption that h only depends on the reference, which is fixed for each iteration. Since adding the same pseudo-input twice will not improve the approximation further [31], the possible locations of the pseudo-inputs are restricted to the unique set of inputs in D h . This also means that for the choice of p t , driving several laps with the same reference input sequence will not further increase the effective number of inputs.
To ensure a reliable approximation, one should at least include all the data points that were recorded for the time steps {t, . . . , t + H ff − 1} of the current time interval. Any additional points then improve the approximation accuracy near the boundary of the interval. This results in N loc ≤ N effective data points. Especially, the setup of the optimization problem is significantly sped up that way. This approximation makes the optimization-based scheme also viable for very long trajectories, as the effective number of training points is independent of N.
The procedure combining the two aforementioned approaches for generating a local GP from the full GP trained on the full data set D h is summarized in Algorithm 1. Using this procedure reduces the effective number of inputs from N h = NN lap N iter (if the inputs are selected naïvely) to N loc . As N h is increasing with each iteration, the number of pseudo-inputs N loc ∝ N iter is increased accordingly to retain a good approximation with a growing number of data points. Here, N iter denotes the number of iterations that are included in D h . The more laps are driven for each iteration, and the longer the trajectories are, the more effective these approximations are.

Model predictive trajectory tracking controller
To follow the computed reference trajectory, a MPC [32] is employed, because it can consider the input constraints (9) as well as leverage the nonlinear data-based model of our robot.
Starting from an initial state measurement x t at time step t, the MPC predicts the future poses for the following H time steps using either the nominal or data-based models. The input sequence is thereby chosen such that the weighted squared error between the states and inputs to their corresponding reference values computed in Subsection 5.1 is minimized. State and input constraints can easily be integrated into the optimization problem. The optimal control problem is then solved in real time to find the optimal control action for each time step. Using the more accurate data-based control model within the optimal control problem instead of the nominal one will improve the prediction accuracy and, subsequently, the control performance.
The optimal control problem for trajectory tracking can be formulated as x k+1|t = f n/g/gh x k|t , u k|t , • , (24d) with positive definite weighting matrices Q = I and R = 1 30 I, the current state measurement x t and the prediction horizon H = 20. This corresponds to a prediction of the robot pose HT = 4 s into the future. The constraint (24f) is an equality constraint for the terminal state of the prediction, that is, the state prediction should coincide with the reference at the end of the horizon. Together with the constraint (24g), this makes sure that applying the nominal reference trajectory is feasible for all future time steps, which is a straightforward way of guaranteeing convergence to the reference for the undisturbed system [33,34]. After (24) is solved, the current input is set to the solution u t := u t|t for each time step. The online MPC problem (24) is implemented using the ACADO Toolkit [35], which implements a realtime iteration scheme with Gauss-Newton Hessian approximation [36]. The thereby arising quadratic programs are solved using qpOASES [37]. The average solving time for the OCP is around 200 and 700 μs using the nominal and the data-based models, respectively. All computations in this study are done using a laptop computer with an Intel Core i7-9750H CPU. Note that, as in (24) the trajectory-specific mismatch h is neither explicitly state nor input dependent, for fixed u r t and t the term h p t is constant. Therefore, the computational effort of solving the OCP for f g and f gh is similar.
One way to improve the tracking performance using the nominal model f n would be to select the weighting matrices Q and R of the MPC in (24a) to obtain a stiffer control, however, in turn making the controller more sensitive to measurement noise. A stiffer control can only reduce systematic model mismatches but never eliminate them entirely. In general, high-gain position feedback should be avoided for robotic applications to be more compliant, especially around human personnel [38]. Contrary to the nominal model, as systematic mismatches are accounted for using the data-based model, a high tracking accuracy can be achieved without requiring a stiff controller. Note that, when a differential-drive or car-like mobile robot is used instead of an omnidirectional mobile robot, some modifications to the cost function (24a) might be necessary to retain some theoretical properties, see [6] for details.

Experimental results
For the functions h j , the corresponding kernels k j , j ∈ {x, y, ω} are used, where x, x ∈ R 3 and t, t ∈ N 0 . This choice of the kernel is a product of the SE kernel (3) and the periodic kernel (4), which respects the periodicity of the considered trajectories, see (19). The kernel hyperparameters σ s,j , σ n,j , A j , PER,j are set to the values from Table II. The selection of the hyperparameters requires experience with the system, the environment, and the measurement setup. However, once a (nominal) model of the robot is available, a large portion of the relevant parameters can be tuned in simulation before moving to real hardware, including the parameters of the reference generator and the MPC. Optimizing the hyperparameters can be a good first step in building some intuition for reasonable values of the kernel hyperparameters in case no prior experience with the hardware setup is available. The parameters for the learning cost function are set to β = 0.8, and the individual weights for the posterior variances are set to w j = σs,ω σ s,j , j ∈{x, y, ω}. This choice of w j is made as an effort to lower the influence of GPs with high signal variances. A value of γ = 0.01 is used unless stated otherwise.
To compare the experimental results for different cost functions (22), the cumulative learning cost is defined as (26) as an indicator of the uncertainty of the current reference solution. Note that since σ 2 j is lower bounded by the signal noise variance σ 2 n,j the term L also has a lower bound of L lb = N j∈{x,y,ω} w j σ n,j which is, however, only reached in the limiting case N h → ∞, see ref. [39] for details. Since the approximation from Subsection 5.1 is used, each σ j (p t ) is evaluated using the local GP approximation corresponding to time step t. The interval for the approximation at time step t (see Algorithm 1) includes N loc = N iter (H ff + 24) points, that is, all the points corresponding to the time steps from the current prediction interval {t, . . . , t + H ff − 1} as well as 24N iter additional points to improve the prediction quality at the boundary. With this setup, the performance of the proposed method is intently analyzed for different trajectories.

Infinity-shaped trajectory
In the first scenario, the custom-built robot drives five consecutive laps of a ∞-shaped trajectory with a length of N = 124 time steps and a desired orientation of θ des t ≡ 45 • , compare Fig. 5. Here, the error to the desired trajectory is depicted using a magnification of ten. Clearly, the tracking error shows systematic behavior, which is especially noticeable using the nominal model. Contrary to any random, non-systematic mismatches, this behavior can be compensated when using a more accurate, data-based model.
In Fig. 6, the root mean square error (RMSE) of the position is depicted over the iterations. The initial (0th) iteration is the one corresponding to the nominal solution. The five laps that are driven using f g are considered the first iteration and are used to train the trajectory-specific mismatch for the second iteration. Evidently, the error is not decreasing monotonically. This is, on the one hand, due to the exploitative behavior driving the robot away from the desired trajectory and, on the other hand, due to a wrong and uncertain model f gh , reflected by higher values of L . Over the iterations, more knowledge is gathered, and the algorithm becomes more confident in its solution, compare Fig. 6 (right). This exploration-exploitation trade-off is done fully automatically thanks to the cost function (22).
In the tracking error histogram in Fig. 7b, the control error for the nominal model f n persists due to unmodeled mismatches, resulting in a RMSE of 9.27 mm or 1.00 • , in terms of position and orientation, respectively. The general GP-based model f g significantly reduces the RMSE to 3.50 mm or 0.53 • . However, the model specifically tailored to the trajectory, f gh , manages to further reduce the tracking RMSE to 1.68 mm and 0.39 • in the best (10th) iteration. This is shown in the histogram in Fig. 7b. The error in position is consistently reduced for most time steps, as seen in Fig. 7a (left). The improvement in orientation error between f g and f gh is only marginal.
An explanation might be that the reference-specific term may primarily be explained by effects due to local ground changes that might not have as big of an impact on the orientation as on the position. These effects are averaged out for f g due to the way the data D g was collected.  In the magnification in Fig. 5, it also becomes evident that even the trajectory-specific model can struggle to perfectly track the desired trajectory. One explanation could be that it is impossible to perfectly track the desired x des t without some local modifications, for example, due to the saturation of the input limits (9). Another explanation would be that the algorithm could not find a potential input sequence with better tracking performance within the limits of its possibilities, for example, due to the selection of the parameters γ or br .

Rectangular trajectory and the influence of explorative behavior
In the second scenario, the robot shall follow a rectangular trajectory. To that end, the robot drives trapezoidal velocity profiles on each rectangle edge with a top speed of about 0.45 ms, N = 197 and θ des t ≡ 0 • . At each corner, the robot halts for a time period of 1 s. In Fig. 8, five consecutive laps for four different input sequences are depicted. The nominal model struggles to accurately predict the robot's behavior which is noticeable in the overshoot in the braking phase at each corner of the rectangle. The data-based model without any trajectory-specific information is able to reduce the RMSE by more than 50% (from 5.8 to 2.6 mm). Both of the reference solutions for f n and f gh were generated with γ = 0. When including trajectory-specific information incorporating a learning term by choosing γ = 0.01 outperforms the pure 'exploitative' approach of γ = 0. The uncertainty decreases for both values of γ (Fig. 8, right). This can indicate that for γ = 0, the algorithm is stuck in a local optimum quickly decreasing L by revisiting previous input sequences with low variances. For γ = 0.01, the first iteration is about as good as with γ = 0 but, in the following iterations, the error, as well as the uncertainty, decrease slowly but steadily until iteration 6 where the uncertainty is about as low as for γ = 0 but the error in the position is about 45% lower (1.9 vs. 1.1 mm). Note that the RMSE also includes any non-systematic noise, which means achieving an RMSE of zero is essentially impossible in any real-world experiment. For the final seventh iteration, the exploration-based cost function, γ = 0 was set as enough information was gathered. Otherwise, at the beginning of this scheme, the choice of γ = 0 can hurt the tracking performance as it can increase the tracking error in exchange for exploratory behavior. Therefore, the tracking performance can generally be expected to be worse compared to taking γ = 0 in the first few iterations, especially for very large values of γ . The error for the orientation is about 0.2 • regardless of the value of γ and the iteration and is therefore not depicted in Fig. 8. While this difference in tracking performance between the two different cost functions seems small, when looking at the tracking error in more detail in Fig. 9, there is some systematic error left for γ = 0 in the top straight especially whereas for γ = 0.01, there is hardly any visible systematic error even with the employed error magnification of 25.
The time it takes to pre-compute the references for f g is about 50 ms per time step, see Fig. 10 (iteration 1). Due to the increasing number of data points, the computational times for f gh increase with each iteration. While the computational time per time step increases approximately quadratically, for γ = 0.01 (or γ = 0 in general) it increases faster than taking γ = 0. This is due to the optimization problem (23) being harder to solve when exploitation is included in the cost function (22). Moreover, as detailed in Section 3, the time to evaluate the GPs posterior variances also increases quadratically with the number of pseudo-inputs, which in this case scales linearly with the number of iterations. The computational time could be reduced by lowering the number of effective data points N loc in exchange for a worse approximation accuracy of the GPs, which is not done here in order to introduce as little of an error as possible via the local approximations. The nominal model f n does not require any time-consuming pre-computation, as the reference can be computed in closed form (18).

Discussion
This procedure will not replace an experienced control engineer, as the proper selection and identification of the relevant inputs a k and p k and the selection of individual parameters in (23) or (24) requires a thorough understanding of the hardware setup and more importantly whether there are sources of systematic errors present. However, this procedure is a semi-automated way to leverage the full potential of the hardware using real-world input-output data, without having to make any changes to the hardware  (23) for each iteration using different values of γ . The times were recorded for the rectangular trajectory (N = 197). The first iteration is the solution corresponding to no additional trajectory-specific information, that is, f g .
itself. The framework is also extremely flexible in the sense that prior knowledge can be included on different levels. If the learning factor is set to zero (γ = 0) much of the experience and prior knowledge needed to select the GP hyperparameters is not necessary, although this might hurt the effectiveness of this procedure. Other simplifications include the optimization of the hyperparameters before each iteration or assuming a pure time dependence of h (for γ = 0). The right choice of the learning factor γ is also important, as a learning factor close to zero can result in hardly any exploration and a very large learning factor can result in a very slow decrease of the uncertainty and the tracking error. Ideally, γ is chosen large enough such that the algorithm does not get stuck in local minima, but small such that the corresponding solutions stay somewhat close to a reasonable input sequence, like the one obtained with f g for example.
In order for the method to be viable in practice, the computational times have to be kept small. Several modifications to the presented framework are possible that trade control performance for reduced computational times. For the reference generation, special data management techniques [23,30] can be used to eliminate irrelevant or old data points over time to keep N h below a certain value. In order to reduce the computational times for the OCP (24) solved in real time, one could fix the inputs a k|t of g to their corresponding reference values a r t resulting in a constant expression for fixed t. In fact, for less intricate systematic mismatches or in case no prior input-output data is available, the function g can be dropped from the procedure altogether.

Conclusion and outlook
By combining a robot-specific data-based model and a trajectory-specific data-based correction term, this paper achieves high-accuracy tracking of periodic references. Gaussian process regression was used to model the trajectory-specific term, which was assumed to explicitly depend on time and the reference. Leveraging this data-based model in a model predictive control framework results in a significant reduction of the computational time. Additionally, a method based on iterative optimization was introduced to efficiently calculate periodic input and state references while considering the dynamics of the model. It was shown that exploitative behavior, that is, trying to reduce the posterior variances of the Gaussian process model, can outperform the purely exploitative approach, merely minimizing the tracking error, which may get stuck in local optima. Since this procedure shows impressive control performance within only a few iterations, it can be used for fast online adaptations of the input reference in case of external disturbances, like changes in the environment or the configuration of the robot, without manual re-calibration. In future work, the proposed theory could be extended to high-accuracy path following control by incorporating a path parameter into the disturbance model, or to the distributed case, such as for decentralized high-accuracy formation control [40].