Transfer Learning over Heterogeneous Agents with Restraining Bolts

On this page, we describe the experiments of the paper “Transfer Learning over Heterogeneous Agents with Restraining Bolts”, submitted to the “Planning and Learning” track of ICAPS 2020.

You can find the code to reproduce them at this link.

Experiment setup

To make the assessment of the quality of the learned policy easier, each experiment follows the structure below:

The actions in Variant A and Variant B are different. In particular, in Variant A, actions are stronger, making the game easier.

The traces that we consider are generated by playing an easier game (Variant A) with respect to the one that we then use for learning the policy (Variant B). In fact, we use Variant A only to obtain the DFA constituting the restraining bolt, which is then used to synthesize a behaviour for Variant B.

Notice that, for the approach to be effective in practice, there must be a relationship between the actions in the two variants; however, such a relationship can be quite loose and can remain unexpressed.

Note also that actions are not part of the alphabet used to progress the DFA, and hence do not contribute to the reward given by the restraining bolt. This allows the two variants to have different actions.
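To make this concrete, the following is a minimal sketch (with hypothetical names such as `RestrainingBolt` and `Interpretation`, not the actual code of our implementation) of a restraining bolt as a DFA that is progressed only by the observed fluent interpretations, never by actions:

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Interpretation = FrozenSet[str]  # fluents that hold at a step, e.g. frozenset({"c1"})


@dataclass
class RestrainingBolt:
    """Minimal sketch of a restraining bolt: a DFA over fluent interpretations.

    The DFA is progressed only by fluents; actions never appear in its alphabet,
    which is what allows the two variants to use different action sets.
    """
    initial: int
    accepting: FrozenSet[int]
    transitions: Dict[Tuple[int, Interpretation], int]
    reward: float = 1.0

    def __post_init__(self) -> None:
        self.state = self.initial

    def reset(self) -> None:
        self.state = self.initial

    def step(self, fluents: Interpretation) -> float:
        """Progress the DFA with one interpretation and return the bolt reward."""
        if not fluents:  # empty interpretation: implicit self-loop, no reward
            return 0.0
        # Missing (state, interpretation) pairs are treated as self-loops in this
        # sketch; a complete treatment would route them to a failure state.
        self.state = self.transitions.get((self.state, fluents), self.state)
        return self.reward if self.state in self.accepting else 0.0
```

As in [1], during learning the agent's state is the pair (environment observation, bolt state), and the bolt reward is added to the environment reward; the same bolt can therefore be attached to agents with different action sets.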

We tested our approach in three different environments (the same ones used in [1]):

Breakout

Variant A

Variant A of the Breakout environment is defined as follows:

The environment is deterministic.

Variant B

Variant B of the Breakout environment is defined as follows:

The environment is deterministic.

Experiment

The LDLf formula that defines the Restraining Bolt is:

Where $c_i$ means: the $i$-th column of bricks has just been broken.

The restraining bolt is endowed with sensors in order to detect the presence or absence of bricks, and hence to determine at every step whether fluent $c_i$ is true or not.
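As an illustration, a sketch of how such sensors could compute the fluents $c_i$ from the brick grid follows; the grid representation and the function name `breakout_fluents` are assumptions made for the example, not the actual sensor code:

```python
from typing import FrozenSet, Sequence

def breakout_fluents(prev_bricks: Sequence[Sequence[bool]],
                     curr_bricks: Sequence[Sequence[bool]]) -> FrozenSet[str]:
    """Hypothetical sensor for Breakout: c_i holds when column i has just been broken.

    `prev_bricks[row][col]` / `curr_bricks[row][col]` are True when that brick is
    still present; this grid representation is an assumption for the example.
    """
    fluents = set()
    for col in range(len(curr_bricks[0])):
        column_was_present = any(row[col] for row in prev_bricks)
        column_is_empty = not any(row[col] for row in curr_bricks)
        if column_was_present and column_is_empty:
            fluents.add(f"c{col + 1}")  # e.g. "c1" for the first column
    return frozenset(fluents)
```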

The equivalent DFA is:

[figure: the equivalent DFA]

The expert behaviour is learned as in [1]. A recording of this behaviour follows:

[recording of the expert behaviour]

Expert behaviour for Variant A of Breakout.
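Both the expert (in Variant A) and, later, the learner (in Variant B) are trained with the scheme of [1]. The sketch below illustrates that scheme as tabular Q-learning over the product of the environment observation and the bolt state; it assumes a Gym-like interface (`env.reset`/`env.step`), hashable discrete observations, and a hypothetical `extract_fluents` sensor function, so it is an illustration rather than the actual training code:

```python
import random
from collections import defaultdict

def q_learning_with_bolt(env, bolt, extract_fluents,
                         episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Sketch of RL with a restraining bolt: Q-values over (observation, bolt state)."""
    Q = defaultdict(lambda: defaultdict(float))
    for _ in range(episodes):
        obs, done = env.reset(), False
        bolt.reset()
        while not done:
            state = (obs, bolt.state)
            # epsilon-greedy action selection over the product state
            if random.random() < eps or not Q[state]:
                action = env.action_space.sample()
            else:
                action = max(Q[state], key=Q[state].get)
            next_obs, env_reward, done, _ = env.step(action)
            # the bolt sees only the fluents extracted from the observation,
            # never the action that was executed
            bolt_reward = bolt.step(extract_fluents(next_obs))
            next_state = (next_obs, bolt.state)
            best_next = max(Q[next_state].values(), default=0.0)
            target = env_reward + bolt_reward + gamma * best_next
            Q[state][action] += alpha * (target - Q[state][action])
            obs = next_obs
    return Q
```

The only difference between the two variants is the bolt that is plugged in: the expert uses the one obtained from the LDLf formula, while the learner uses the automaton learned from the traces.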

The (unique) example of a positive trace is $c_1,c_2,c_3$. Some examples of negative traces are:

Notice that we filter out steps of the traces that have no fluents (i.e. empty propositional interpretations).

You can find examples of the produced traces in positive_traces.txt and negative_traces.txt.
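The filtering step mentioned above is straightforward; here is a minimal sketch, assuming (only for the example) that a trace is represented as a sequence of `frozenset`s of fluents:

```python
from typing import FrozenSet, Iterable, List

def filter_trace(trace: Iterable[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Drop the steps with an empty propositional interpretation."""
    return [step for step in trace if step]

# Example: a raw trace observed while the expert plays Breakout.
raw = [frozenset(), frozenset({"c1"}), frozenset(), frozenset({"c2"}), frozenset({"c3"})]
assert filter_trace(raw) == [frozenset({"c1"}), frozenset({"c2"}), frozenset({"c3"})]
```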

The automaton learned from the traces is:

[figure: the automaton learned from the traces]

The labels are in brackets because they are to be read as propositional interpretations. We also assume that every state of the DFA has a self-loop labelled with the empty propositional interpretation, which holds when no fluent is true.
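The learning algorithm is not detailed on this page; as an illustration of its input and output, the sketch below builds the prefix-tree acceptor over the labelled traces, which is the usual starting point of passive DFA learning (an actual learner, e.g. one based on RPNI-style state merging, would then generalise by merging compatible states). The negative trace in the example is hypothetical; the real ones are those in negative_traces.txt:

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Trace = List[FrozenSet[str]]   # a trace is a sequence of propositional interpretations

def build_prefix_tree(positive: List[Trace], negative: List[Trace]):
    """Build the prefix-tree acceptor (PTA) over the labelled traces.

    Only the starting point of passive DFA learning: a real learner would then
    generalise by merging compatible states.
    """
    transitions: Dict[Tuple[int, FrozenSet[str]], int] = {}
    accepting: Set[int] = set()
    rejecting: Set[int] = set()
    next_state = 1                       # state 0 is the initial state

    def insert(trace: Trace, is_positive: bool) -> None:
        nonlocal next_state
        state = 0
        for symbol in trace:
            key = (state, symbol)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        (accepting if is_positive else rejecting).add(state)

    for t in positive:
        insert(t, True)
    for t in negative:
        insert(t, False)
    return transitions, accepting, rejecting

# The unique positive trace, plus one *hypothetical* negative trace.
pos = [[frozenset({"c1"}), frozenset({"c2"}), frozenset({"c3"})]]
neg = [[frozenset({"c2"}), frozenset({"c1"}), frozenset({"c3"})]]
transitions, accepting, rejecting = build_prefix_tree(pos, neg)
```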

And this is a recording of the optimal policy learned by the learner agent.

[recording of the learner behaviour]

Learner behaviour for Variant B of Breakout.

Notice that the learner is able to learn the same goal just by observing the traces of fluents produced by the expert. We also stress that the expert can only fire, whereas the learner can only use the ball, meaning that the knowledge of the goal can be transferred even across different agents.

Sapientino

Variant A

Variant A of the Sapientino environment.

Variant B

Variant B of the Sapientino environment.

Experiment

The LDLf formula that defines the Restraining Bolt is:

Where:

In English, the goal can be stated as: visit the colors in a certain order, without any bad beep or wrong color visited in between.

The restraining bolt is endowed with sensors in order to detect whether the robot is on a colored cell and whether it has just executed the action “beep”; hence it is able to determine at every step whether the fluents are true or not.
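For illustration, a possible sketch of these sensors follows; the observation format (the color of the current cell and whether a beep was just executed) and the fluent names ("red", "bad_beep", ...) are assumptions made for the example:

```python
from typing import FrozenSet, Optional

def sapientino_fluents(cell_color: Optional[str], did_beep: bool) -> FrozenSet[str]:
    """Hypothetical sensor for Sapientino.

    A color fluent (e.g. "red") holds at the step in which the robot executes
    "beep" on a cell of that color; "bad_beep" holds when it beeps on a blank
    cell.  Whether a visited color is the wrong one (out of order) is decided
    by the DFA of the restraining bolt, not by the sensors.
    """
    if not did_beep:
        return frozenset()
    if cell_color is None:
        return frozenset({"bad_beep"})
    return frozenset({cell_color})
```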

The equivalent DFA is:

[figure: the equivalent DFA]

The expert behaviour is learned as in [1]. A recording of this behaviour follows:

[recording of the expert behaviour]

Expert behaviour for Variant A of Sapientino.

The (unique) example of a positive trace is $red;green;blue$. Some examples of negative traces are:

Notice that we filter out steps of the traces that have no fluents (i.e. empty propositional interpretations).

You can find examples of the produced traces in positive_traces.txt and negative_traces.txt.

The automaton learned from the traces is:

[figure: the automaton learned from the traces]

And this is a recording of the optimal policy learned by the learner agent.

[recording of the learner behaviour]

Learner behaviour for Variant B of Sapientino.

Minecraft

Variant A

Variant A of the Minecraft environment.

Variant B

Variant B of the Minecraft environment.

Experiment

The LDLf formula that defines the Restraining Bolt is:

Where:

In English, the goal can be stated as: get iron, then get wood, and then use the factory, with no other operation in between.

The restraining bolt is endowed with sensors in order to detect whether the robot is on a resource/tool cell and whether it has just executed the action “get” or “use”; hence it is able to determine at every step whether the fluents are true or not.
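For illustration, a possible sketch of these sensors follows; the observation format (the item on the current cell and the last action) and the fluent names are assumptions made for the example:

```python
from typing import FrozenSet, Optional

def minecraft_fluents(cell_item: Optional[str], action: str) -> FrozenSet[str]:
    """Hypothetical sensor for Minecraft.

    A fluent such as "iron", "wood" or "factory" holds at the step in which the
    agent executes "get" on a resource cell or "use" on a tool cell.
    """
    if cell_item in ("iron", "wood") and action == "get":
        return frozenset({cell_item})
    if cell_item == "factory" and action == "use":
        return frozenset({"factory"})
    return frozenset()
```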

The equivalent DFA is:

[figure: the equivalent DFA]

The expert behaviour is learned as in [1]. A recording of this behaviour follows:

[recording of the expert behaviour]

Expert behaviour for Variant A of Minecraft.

The (unique) example of a positive trace is $iron;wood;factory$. Some examples of negative traces are:

Notice that we filter out steps of the traces that have no fluents (i.e. empty propositional interpretations).

You can find examples of the produced traces in positive_traces.txt and negative_traces.txt.

The automaton learned from the traces is:

[figure: the automaton learned from the traces]

And this is a recording of the optimal policy learned by the learner agent.

[recording of the learner behaviour]

Learner behaviour for Variant B of Minecraft.

References

  1. De Giacomo, Giuseppe, et al. “Foundations for restraining bolts: Reinforcement learning with LTLf/LDLf restraining specifications.” Proceedings of the International Conference on Automated Planning and Scheduling. Vol. 29. No. 1. 2019.