Transfer Learning over Heterogeneous Agents with Restraining Bolts
On this page, we describe the experiments of the submission “Transfer Learning over Heterogeneous Agents with Restraining Bolts” to the “Planning and Learning” track of ICAPS 2020.
You can find the code to reproduce them at this link.
Experiment setup
To make it easier to assess the quality of the learned policy, each experiment follows the structure below:
- we fix a restraining bolt [1] specified in LTLf/LDLf;
- then, we play a simplified variant (Variant A) of the game, recording its traces (projected on fluents only) and labeling them as good or bad according to the satisfaction of the LTLf formula. In this way, we generate an “expert behavior” that is later used to assess the quality of the learned policy;
- then, we learn a DFA from the traces (such a DFA is typically not equivalent to the LTLf formula, but it is close enough);
- next, the learner learns a policy in the more complex game (Variant B), using the learned DFA as the restraining bolt;
- finally, to assess the learned policy, we simulate its execution together with the DFA of the original LTLf formula, checking when its final states are reached.
The actions in Variant A and Variant B are different. In particular, in Variant A, actions are stronger, making the game easier.
The traces that we consider are generated by playing a game (Variant A) that is easier than the one on which we then learn the policy (Variant B). In fact, we use Variant A only to obtain the DFA constituting the restraining bolt, which is then used to synthesize a behavior for Variant B.
Notice that, for the approach to be effective in practice, there must be a relationship between the actions in the two variants, but such a relationship can be quite loose and can remain unexpressed.
Note also that actions are not part of the alphabet used to progress the DFA, and hence they do not affect the reward given by the restraining bolt. This is what allows the two variants to have different actions.
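To fix ideas, the whole pipeline can be summarised by the following Python sketch. All function names (`generate_traces`, `learn_dfa`, `train_with_restraining_bolt`, `evaluate_against`) are hypothetical placeholders and do not refer to the API of the released code.

```python
# Sketch of the experimental pipeline described above.  Every function
# name below is a hypothetical placeholder, not the actual API.

def run_experiment(variant_a, variant_b, ltlf_dfa):
    # 1. Play the simpler game (Variant A), recording traces projected on
    #    fluents and labelling them according to the LTLf/LDLf formula.
    positive, negative = generate_traces(variant_a, ltlf_dfa)

    # 2. Learn a DFA from the labelled traces; it approximates the original
    #    formula but need not be identical to it.
    learned_dfa = learn_dfa(positive, negative)

    # 3. Learn a policy in the harder game (Variant B), using the learned
    #    DFA as the restraining bolt.
    policy = train_with_restraining_bolt(variant_b, learned_dfa)

    # 4. Assess the policy by simulating it together with the DFA of the
    #    original formula, checking when its final states are reached.
    return evaluate_against(variant_b, policy, ltlf_dfa)
```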
We tested our approach in 3 different environments (the same used in [1]):
- Breakout: an implementation of the popular Atari game in Pygame.
- Sapientino: an environment inspired by Sapientino, an educational game for children.
- Minecraft: a Minecraft-like environment implemented with Pygame.
Breakout
Variant A
The variant A of the Breakout environment is defined as follows:
- The state space $S = P_x$ is the set of all the possible $x$ positions of the paddle: $\{x_0, x_1, \dots, x_{width} \}$
- The action space $A = \{ left, right, nop, fire \}$
- The reward function $R(s, a, s')$ is a positive number $r_b$ when a brick is removed, and $0$ otherwise.
The environment is deterministic.
Variant B
The variant B of the Breakout environment is defined as follows:
- The state space $S = P_x \times B_x \times B_y \times V_x \times V_y$, where:
- $P_x$ is the set of all the possible $x$ positions of the paddle;
- $B_x$ and $B_y$ are, respectively, the sets of all the possible $x$ and $y$ positions of the ball;
- $V_x$ and $V_y$ are, respectively, the sets of all the possible (discrete) $x$ and $y$ speeds of the ball;
- The action space $A = \{left, right, nop\}$
- The reward function $R(s, a, s')$ is defined as in Variant A.
The environment is deterministic.
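For concreteness, the state and action spaces of the two Breakout variants can be written down as a small Python sketch. This is our own illustration: the grid sizes and the discrete velocity values below are assumptions, not taken from the implementation.

```python
from itertools import product

# Assumed discretisation; the actual sizes and velocity values may differ.
PADDLE_X = range(10)                   # possible x positions of the paddle
BALL_X, BALL_Y = range(10), range(12)  # possible x and y positions of the ball
VEL_X, VEL_Y = (-1, 0, 1), (-1, 1)     # possible discrete ball velocities

ACTIONS_A = {"left", "right", "nop", "fire"}  # Variant A can destroy bricks directly
ACTIONS_B = {"left", "right", "nop"}          # Variant B must use the ball

# Variant A observes only the paddle position...
STATES_A = list(PADDLE_X)
# ...while Variant B also observes the ball position and velocity.
STATES_B = list(product(PADDLE_X, BALL_X, BALL_Y, VEL_X, VEL_Y))
```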
Experiment
The LDLf formula that defines the Restraining Bolt is:
Where $c_i$ means that the $i$-th column of bricks has just been broken.
The restraining bolt is endowed with sensors that detect the presence or absence of bricks, and hence it can determine, at each step, whether fluent $c_i$ is true or not.
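As an illustration of what such a sensor computes, here is a minimal sketch (our own code, assuming the bricks still on screen are represented as a set of (row, column) pairs; the repository may implement this differently):

```python
def column_fluents(bricks_before, bricks_after, n_columns=3):
    """Return the fluents c_i that just became true, i.e. the columns that
    were non-empty before the step and are empty after it.

    Both arguments are sets of (row, column) pairs of the bricks still
    present on the screen (an assumed representation)."""
    fluents = set()
    for i in range(n_columns):
        before = any(col == i for _, col in bricks_before)
        after = any(col == i for _, col in bricks_after)
        if before and not after:
            fluents.add(f"c{i}")
    return fluents
```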
The equivalent DFA is:
The expert behaviour is learned as in [1]. A recording of such behaviour follows:
Expert behaviour for Variant A of Breakout.
The (unique) example of a positive trace is $c_0;c_1;c_2$. Some examples of negative traces are:
- the empty trace: no column removed;
- $c_0$: only the first column has been removed;
- $c_1;c_0;c_2$: the second, the first, and the third columns have been removed, in this order;
- $c_2;c_1$: the third and the second columns have been removed.
Notice that we filter out steps of the traces that have no fluents (i.e. empty propositional interpretations).
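A minimal sketch of this filtering step, assuming a trace is represented as a list of sets of fluent names:

```python
def filter_empty_steps(trace):
    """Drop the steps with an empty propositional interpretation, keeping
    only the steps in which at least one fluent holds.

    A trace is assumed to be a list of sets of fluent names."""
    return [step for step in trace if step]

# For instance: [set(), {"c0"}, set(), {"c1"}, {"c2"}] -> [{"c0"}, {"c1"}, {"c2"}]
```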
You can find examples of the produced traces in positive_traces.txt and negative_traces.txt
The automaton learned from the traces is:
The labels are in brackets because they are to be read as propositional interpretations. We also assume that every state of the DFA has a self-loop labeled with the empty propositional interpretation, which holds when no fluent is true.
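The following sketch shows how the learned DFA can be progressed under this convention; representing the transitions as a Python dictionary is our own assumption.

```python
def progress(transitions, state, interpretation):
    """Advance the learned DFA by one step.

    `transitions` maps (state, frozenset_of_true_fluents) -> next_state
    (an assumed representation).  The empty interpretation leaves the
    state unchanged, encoding the implicit self-loops described above."""
    if not interpretation:
        return state
    return transitions[(state, frozenset(interpretation))]
```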
And this is a recording of the optimal policy learned by the learner agent.
Learner behaviour for Variant B of Breakout.
Notice that the learner is able to learn the same goal just by observing the traces of fluents produced by the expert. We also stress that the expert can only fire, whereas the learner can only use the ball: the knowledge of the goal can thus be transferred even across different agents.
Sapientino
Variant A
The variant A of the Sapientino environment is defined as follows:
- the state space $S = Ag_x \times Ag_y$, that is the $(x, y)$ coordinates of the agent in a grid.
- the action space $A = \{up, down, right, left, beep, nop\}$ to move on the grid.
- the reward function $R(s, a, s')$ (see the sketch after this list):
- $-1$ if the agent tries to move outside the grid;
- $-1$ if the agent does a ‘beep’ action in an illegal cell (either a blank cell or a coloured cell already visited);
- $-0.01$ at each transition;
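A sketch of this reward function (our own code: `is_legal_beep` is a hypothetical helper, and combining the penalties with the step cost additively is an assumption):

```python
def reward_variant_a(state, action, next_state, grid):
    """Illustrative Variant A reward; identifiers are ours."""
    r = -0.01                                        # cost paid at every transition
    moves = ("up", "down", "left", "right")
    if action in moves and next_state == state:      # bumped against the grid border
        r += -1.0
    if action == "beep" and not grid.is_legal_beep(next_state):
        r += -1.0                                    # beep on a blank or already-visited cell
    return r
```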
Variant B
The variant B of the Sapientino environment is defined as follows:
- the state space $S = Ag_x \times Ag_y \times \Theta$, that is, the triples $(x, y, \theta)$ where $x$ and $y$ are defined as before and $\theta \in \{0, 90, 180, 270\}$ captures the orientation of the agent;
- the action space $A = \{turn\_right, turn\_left, forward, backward, beep, nop\}$ to move on the grid (see the sketch of these dynamics after this list);
- the reward function $R(s, a, s')$ is defined as in Variant A.
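A sketch of the orientation-based dynamics mentioned above (our own code; the conventions that $\theta = 0$ means “facing east” and that the $y$ axis points downwards, as in Pygame, are assumptions):

```python
# Illustrative dynamics of Variant B (our own sketch).
DELTAS = {0: (1, 0), 90: (0, -1), 180: (-1, 0), 270: (0, 1)}

def step_variant_b(x, y, theta, action):
    """Apply one orientation-based action to the pose (x, y, theta)."""
    if action == "turn_left":
        theta = (theta + 90) % 360
    elif action == "turn_right":
        theta = (theta - 90) % 360
    elif action in ("forward", "backward"):
        dx, dy = DELTAS[theta]
        sign = 1 if action == "forward" else -1
        x, y = x + sign * dx, y + sign * dy
    # beep and nop do not change the pose
    return x, y, theta
```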
Experiment
The LDLf formula that defines the Restraining Bolt is:
Where:
- $[color]$ means: a color has been “visited” correctly (i.e. the robot did a “beep” action on the cell with that color, and the cell has not been visited earlier)
- $bad\_beep$: this fluent is activated when the robot does a bad beep, e.g. when on a blank cell, or on a colored cell already visited.
In English, the goal can be stated as: visit the colors in a certain order, without any bad beep or wrong color visited in between.
The restraining bolt is endowed with sensors that detect whether the robot is on a colored cell and whether it has just executed the action “beep”; hence, it is able to determine at every step whether the fluents are true or not.
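A minimal sketch of how these sensors can be turned into fluents (our own code; the arguments are assumed to be provided by the environment):

```python
def sapientino_fluents(cell_color, already_visited, did_beep):
    """Compute the propositional interpretation of the current step.

    Returns the colour fluent if the robot beeped on a not-yet-visited
    coloured cell, {'bad_beep'} for a beep on a blank or already-visited
    cell, and the empty interpretation otherwise."""
    if not did_beep:
        return set()
    if cell_color is not None and not already_visited:
        return {cell_color}          # e.g. {"red"}
    return {"bad_beep"}
```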
The equivalent DFA is:
The expert behaviour is learned as in [1]. A recording of such behaviour follows:
Expert behaviour for Variant A of Sapientino.
The (unique) example of a positive trace is $red;green;blue$. Some examples of negative traces are:
- the empty trace: no color visited;
- $blue$: only the blue color is visited;
- $red;green;green$: the green color has been visited twice.
Notice that we filter out steps of the traces that have no fluents (i.e. empty propositional interpretations).
You can find examples of the produced traces in positive_traces.txt and negative_traces.txt
The automaton learned from the traces is:
And this is a recording of the optimal policy learned by the learner agent.
Learner behaviour for Variant B of Sapientino.
Minecraft
Variant A
The variant A of the Minecraft environment is defined as follows:
- the state space $S = Ag_x \times Ag_y$, that is the $(x, y)$ coordinates of the agent in a grid.
- the action space $A$ (the $goto$ actions are illustrated in the sketch after this list):
- $nop$, do nothing;
- $goto\_wood$, move to the location where the resource “wood” is located;
- $goto\_grass$, the same, but for the “grass”;
- $goto\_iron$, the same, but for the “iron”;
- $goto\_toolshed$, move to the location where the tool “toolshed” is located;
- $goto\_workbench$, the same, but for the “workbench”;
- $goto\_factory$, the same, but for the “factory”;
- $get$, get a resource;
- $use$, use a resource;
- the reward function $R(s, a, s')$:
- $-1$ if the agent tries to move outside the grid;
- $-1$ if the agent does a ‘get’ action in an illegal cell (either a blank cell, a non-resource cell, or a cell whose resource has already been collected);
- $-1$ if the agent does a ‘use’ action in an illegal cell (either a blank cell, a non-tool cell, or a cell whose tool has already been used);
- $-0.01$ at each transition;
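The following sketch illustrates the semantics of the $goto$ actions, assuming (for illustration only) that they move the agent directly to the target cell in a single step; the coordinates and identifiers below are placeholders, not the actual map of the game.

```python
# Illustrative semantics of the Variant A goto actions (our own sketch).
LOCATIONS = {"wood": (1, 4), "grass": (3, 0), "iron": (6, 2),
             "toolshed": (0, 7), "workbench": (5, 5), "factory": (8, 8)}

def apply_action_a(state, action):
    """Assume a goto_x action moves the agent directly to the cell of x in
    a single step; this is what makes Variant A easier than Variant B,
    where the agent has to navigate with turn/forward/backward moves."""
    if action.startswith("goto_"):
        return LOCATIONS[action[len("goto_"):]]
    return state    # get, use and nop do not move the agent
```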
Variant B
The variant B of the Minecraft environment is defined as follows:
- the state space $S = Ag_x \times Ag_y \times \Theta$, analogous to Variant B of Sapientino.
- the action space $A = \{turn\_right, turn\_left, forward, backward, use, get, nop\}$ to move on the grid, to get/use resources/tools, and to do nothing.
- the reward function $R(s, a, s')$ is defined as in Variant A.
Experiment
The LDLf formula that defines the Restraining Bolt is:
Where:
- $[resource|tool]$ means: a resource/tool has been got/used correctly (i.e. the robot did a “get”/“use” action properly, as described above).
In English, the goal can be stated as: get the iron, then get the wood, and then use the factory, without any other operation in between.
The restraining bolt is endowed with sensors that detect whether the robot is on a resource/tool cell and whether it has just executed the action “get” or “use”; hence, it is able to determine at every step whether the fluents are true or not.
The equivalent DFA is:
The expert behaviour is learned as in [1]. A recording of such behaviour follows:
Expert behaviour for Variant A of Minecraft.
The (unique) example of a positive trace is $iron;wood;factory$. Some examples of negative traces are:
- the empty trace: no relevant action done;
- $wood$: only the wood resource has been collected;
- $iron;wood;toolshed$: used the wrong tool.
Notice that we filter out steps of the traces that have no fluents (i.e. empty propositional interpretations).
You can find examples of the produced traces in positive_traces.txt and negative_traces.txt
The automaton learned from the traces is:
And this is a recording of the optimal policy learned by the learner agent.
Learner behaviour for Variant B of Minecraft.
References
- [1] De Giacomo, Giuseppe, et al. “Foundations for Restraining Bolts: Reinforcement Learning with LTLf/LDLf Restraining Specifications.” Proceedings of the International Conference on Automated Planning and Scheduling. Vol. 29. No. 1. 2019.