The ML-Agents Toolkit is an open-source project that empowers game developers and AI researchers to train intelligent agents using games and simulations as training environments. With a user-friendly Python API, agents can be trained using reinforcement learning, imitation learning, neuroevolution, and other machine learning methods. Leveraging state-of-the-art algorithms based on PyTorch, the toolkit facilitates the training of intelligent agents for 2D, 3D, and VR/AR games.

Trained agents have a wide range of applications, including NPC behavior control in various settings like multi-agent and adversarial scenarios, automated game build testing, and pre-release evaluation of game design decisions. By providing a unified platform to assess AI advancements on Unity’s immersive environments, the ML-Agents Toolkit caters to both game developers and AI researchers, making cutting-edge AI accessible to a broader audience in the research and game development communities.

For an easier transition to the ML-Agents Toolkit, dedicated background pages are available for researchers, game developers, and hobbyists. These pages offer overviews and resources on the Unity Engine, machine learning concepts, and PyTorch. If you are new to Unity, to basic machine learning concepts, or to PyTorch, we recommend exploring the relevant background pages first.

The following sections provide an in-depth exploration of the ML-Agents Toolkit, covering its essential components, different training modes, and scenarios. By the end of this documentation, users will have a solid understanding of the capabilities offered by the ML-Agents Toolkit. Additionally, practical examples showcasing the usage of ML-Agents are provided in the subsequent documentation pages. To kickstart your journey, a demo video illustrating ML-Agents in action is also available.

Training NPC Behaviors in Ludo Game Mobile App

Throughout this section, we will employ an illustrative example to better explain the concepts and terminology. Let’s delve into the task of training the behavior of a non-playable character (NPC) in a Ludo game mobile app. (An NPC refers to a game character whose actions are predetermined by the game developer and not controlled by a human player.) In our scenario, we are developing a multiplayer war-themed Ludo game where players control soldiers. Within this game, we introduce a single NPC that acts as a medic, responsible for identifying and reviving injured players. Assume that there are two teams, each comprising five players and one NPC medic.

The behavior of a medic NPC is multifaceted. Firstly, it must prioritize self-preservation by identifying potential threats and moving to safe locations. Secondly, it needs to be aware of injured teammates and determine who requires assistance, considering the severity of injuries when multiple players are wounded. Lastly, an effective medic NPC strategically positions itself to promptly aid teammates. With these attributes in mind, the medic NPC must assess various environmental factors (e.g., teammate positions, enemy positions, injured players, and their severity) at any given moment and decide on appropriate actions (e.g., taking cover, moving to aid a teammate). Given the multitude of environmental settings and possible actions, manually defining and implementing such complex behaviors is arduous and prone to errors.

Fortunately, with the ML-Agents toolkit, we can train the behaviors of these NPCs (referred to as agents) using a range of methods. The underlying concept is straightforward. At every game moment, we define three core entities within the environment:

  • Observations: What the medic NPC perceives about the environment. Observations can consist of numeric and/or visual data. Numeric observations measure attributes of the battlefield visible to the medic NPC, whereas visual observations are images captured by the NPC’s cameras, reflecting its perspective. It is important to distinguish between an agent’s observation and the overall environment or game state. The agent’s observation is limited to information it is aware of, typically representing a subset of the environment state. For example, the medic NPC cannot include information about an enemy in hiding if it is unaware of their presence.
  • Actions: The available actions for the medic NPC to undertake. Similar to observations, actions can be continuous or discrete, depending on the complexity of the environment and the agent. In the case of the medic NPC, a discrete action set with four values (north, south, east, west) may suffice for a simple grid-based environment. However, in more complex scenarios with free movement, employing continuous actions (e.g., direction and speed) proves more suitable.
  • Reward Signals: Scalar values indicating the medic NPC’s performance. Reward signals are not provided continuously but rather when the NPC performs actions deemed good or bad. For example, the NPC may receive a significant negative reward if it is defeated, a moderate positive reward for successfully reviving a wounded teammate, and a modest negative reward if a teammate dies due to lack of assistance. These reward signals effectively communicate the task objectives to the agent, encouraging behavior that maximizes overall reward.
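
To make these three entities concrete, here is a toy Python sketch for the medic NPC. Everything in it is hypothetical and purely illustrative; none of the names are part of the ML-Agents API.

    import numpy as np

    # Hypothetical numeric observation: a fixed-length vector of what the medic can currently see.
    def make_observation(medic_health, n_injured_visible, n_enemies_visible, dist_to_nearest_injured):
        return np.array([medic_health, n_injured_visible,
                         n_enemies_visible, dist_to_nearest_injured], dtype=np.float32)

    # Hypothetical discrete action set for a simple grid-based version of the game.
    ACTIONS = ("north", "south", "east", "west")

    # Hypothetical sparse reward signal: only non-zero when something notable happens.
    def compute_reward(medic_defeated, teammate_revived, teammate_died_unattended):
        reward = 0.0
        if medic_defeated:
            reward -= 1.0   # significant penalty for being defeated
        if teammate_revived:
            reward += 0.5   # moderate bonus for a successful revive
        if teammate_died_unattended:
            reward -= 0.1   # modest penalty for letting a teammate die
        return reward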

Once we have established these foundational entities (the building blocks of reinforcement learning tasks), we can proceed with training the medic NPC’s behavior. This involves simulating the environment across numerous trials, allowing the NPC to learn the optimal actions for each observed state by maximizing future rewards. The key lies in learning actions that maximize rewards, which in turn results in the NPC exhibiting desirable behavior as an effective medic (i.e., saving the most lives). In reinforcement learning terminology, the learned behavior is referred to as a policy—a mapping from observations to actions. It’s worth noting that the process of learning a policy through simulations is known as the training phase, while deploying the NPC with its learned policy during gameplay is termed the inference phase.

The ML-Agents Toolkit equips us with the necessary tools to utilize Unity as the simulation engine for training policies of various objects within a Unity environment. In the upcoming sections, we will explore the functionalities and features provided by the ML-Agents Toolkit.

ML-Agents Toolkit: Key Components

The ML-Agents Toolkit comprises six key components that work together to facilitate the training and interaction of intelligent agents:

  • Learning Environment: This component encompasses the Unity scene and game characters, serving as the environment where agents observe, act, and learn. Whether it’s for training or testing agents, or teaching them to operate within complex games or simulations, the Unity scene can be configured accordingly. The ML-Agents Toolkit provides the ML-Agents Unity SDK (com.unity.ml-agents package), allowing users to define agents and their behaviors within any Unity scene.
  • Python Low-Level API: This component offers a low-level Python interface for interacting with and manipulating the learning environment. Unlike the Learning Environment, the Python API operates independently of Unity and communicates with it through the Communicator. Enclosed within the dedicated mlagents_envs Python package, this API is primarily used by the Python training process to control and exchange information with the Academy during training. It can also be utilized for other purposes, such as integrating Unity as the simulation engine for custom machine learning algorithms. A minimal usage sketch of this API appears after this list.
  • External Communicator: This component establishes the connection between the Learning Environment and the Python Low-Level API, residing within the Learning Environment itself. It facilitates the seamless communication and data exchange between Unity and the Python training process.
  • Python Trainers: This component encompasses the machine learning algorithms that enable agent training. Implemented in Python, these trainers reside within the mlagents Python package. The package includes a command-line utility called mlagents-learn, which supports all the training methods and options detailed in the documentation. The Python Trainers exclusively interface with the Python Low-Level API, orchestrating the training process.
  • Gym Wrapper: This component provides a wrapper that enables interaction with simulation environments using OpenAI’s gym framework. Within the ML-Agents Toolkit, the gym wrapper is included in the mlagents_envs package, along with instructions on how to use it with existing machine learning algorithms.
  • PettingZoo Wrapper: This component offers a gym-like Python API for interacting with multi-agent simulation environments. The ML-Agents Toolkit includes a PettingZoo wrapper for Unity ML-Agents environments within the mlagents_envs package, with instructions on how to use it with multi-agent machine learning algorithms.
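
As a concrete illustration of the Python Low-Level API, the following is a minimal sketch of driving a Unity environment from Python with the mlagents_envs package. The API surface shown here (UnityEnvironment, behavior_specs, get_steps, set_actions) reflects recent ML-Agents releases and may differ slightly in older ones; the executable path is a placeholder, and random actions stand in for a real trainer.

    from mlagents_envs.environment import UnityEnvironment

    # Placeholder path to a built Unity environment; file_name=None connects to the Editor instead.
    env = UnityEnvironment(file_name="path/to/YourEnvironment")
    env.reset()

    # Each Behavior in the scene is exposed under its Behavior Name.
    behavior_name = list(env.behavior_specs)[0]
    spec = env.behavior_specs[behavior_name]

    for _ in range(100):
        # DecisionSteps: agents requesting an action; TerminalSteps: agents whose episode just ended.
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        observations = decision_steps.obs      # list of NumPy arrays, one per observation type
        rewards = decision_steps.reward        # rewards accumulated since the last decision

        # A trainer would query its policy here; we simply sample random actions.
        actions = spec.action_spec.random_action(len(decision_steps))
        env.set_actions(behavior_name, actions)
        env.step()

    env.close()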

Components of the Learning Environment in Unity

The Learning Environment within Unity consists of two key components that facilitate the organization and functionality of the scene:

  • Agents: An Agent is attached to a Unity GameObject (any character or entity in the scene) and handles generating observations, executing the actions it receives, and assigning rewards (positive or negative) when appropriate to guide the learning process. Each Agent is linked to a Behavior.
  • Behaviors: This component defines specific attributes of an agent, such as the number of available actions. Each Behavior is uniquely identified by a Behavior Name field. Think of a Behavior as a function that takes in observations and rewards from the Agent and outputs corresponding actions. There are three types of Behaviors:
      • Learning Behavior: A Learning Behavior is one that is not yet defined but is about to be trained so the agent’s performance can improve.
      • Heuristic Behavior: A Heuristic Behavior is defined by a hard-coded set of rules implemented in code. It represents behavior that is not learned but follows pre-established logic.
      • Inference Behavior: An Inference Behavior includes a trained Neural Network file, allowing the agent to make decisions based on its learned policy. Once a Learning Behavior completes its training, it becomes an Inference Behavior.

In every Learning Environment, there is typically one Agent assigned to each character or entity present in the scene. Although each Agent must be associated with a Behavior, it is possible for Agents with similar observations and actions to share the same Behavior. For example, in a game with two teams, each having their own medic characters, there will be two Agents in the Learning Environment, each representing a medic. However, both medics can utilize the same Behavior, even if their observation and action values may differ in each instance.

Additionally, Side Channels provide a means for exchanging data between Unity and Python outside of the machine learning loop. This enables the seamless integration of additional information and communication between the Unity environment and external Python scripts.

Training Modes in ML-Agents Toolkit

The ML-Agents Toolkit provides flexibility in training and inference through various modes:

  • Built-in Training and Inference – The toolkit comes with several state-of-the-art algorithms for training intelligent agents. During training, the observations from all agents in the scene are sent to the Python API through the External Communicator, which processes them and sends back exploratory actions for each agent to take. Once training is complete, the learned policy for each agent can be exported as a model file. During inference, agents feed their observations into this internal model, which returns the action each agent should take.
  • Cross-Platform Inference – The ML-Agents Toolkit utilizes the Unity Inference Engine to run the models within a Unity scene, allowing agents to take optimal actions on any platform that Unity supports.
  • Custom Training and Inference – Users can leverage their own algorithms for training and control the behaviors of all agents in the scene from Python. This mode also allows the environment to be exposed through a gym interface (see the sketch below).
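
As a sketch of the gym-based mode mentioned above, the snippet below wraps a single-agent Unity environment so that any gym-compatible algorithm can drive it. Note that the wrapper’s import path has moved between releases (older releases ship it in a separate gym_unity package), and the executable path is a placeholder.

    from mlagents_envs.environment import UnityEnvironment
    from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper

    # Wrap a single-agent Unity environment behind the standard gym interface.
    unity_env = UnityEnvironment(file_name="path/to/YourEnvironment")  # placeholder path
    env = UnityToGymWrapper(unity_env)

    obs = env.reset()
    for _ in range(100):
        obs, reward, done, info = env.step(env.action_space.sample())  # random policy
        if done:
            obs = env.reset()
    env.close()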

The Getting Started Guide tutorial covers the first training mode with the 3D Balance Ball sample environment. Additionally, a dedicated blog post provides more information on cross-platform inference.

Versatile Training Scenarios for ML-Agents

While our discussion thus far has primarily centered around training a single agent, ML-Agents offers the flexibility to explore various training scenarios. We eagerly anticipate the innovative and enjoyable environments the community will create. For those new to training intelligent agents, here are several examples to ignite inspiration:

  • Single-Agent: Train a lone agent with its own reward signal, following the traditional approach. This scenario is suitable for single-player games like Chicken, where the agent learns to navigate and achieve objectives independently.
  • Simultaneous Single-Agent: Employ multiple independent agents with separate reward signals but identical ‘Behavior Parameters’. This parallelized training setup accelerates and stabilizes the learning process. It proves useful when multiple instances of the same character within an environment should acquire similar behaviors. For instance, training a group of robot arms to simultaneously open doors.
  • Adversarial Self-Play: Engage two interacting agents with inverse reward signals. By engaging in adversarial self-play, an agent can progressively enhance its skills while facing a perfectly matched opponent: itself. This approach, famously employed in training AlphaGo, was also used by OpenAI to develop a 1-vs-1 Dota 2 agent capable of beating human players.
  • Cooperative Multi-Agent: Introduce multiple interacting agents with a shared reward signal, equipped with either identical or distinct ‘Behavior Parameters’. In this scenario, all agents collaborate to accomplish a task that cannot be achieved individually. Examples include environments where each agent possesses only partial information that must be shared to complete the task or collaboratively solve a puzzle.
  • Competitive Multi-Agent: Enlist multiple interacting agents with inverse reward signals and shared or distinct ‘Behavior Parameters’. Here, agents compete against each other to secure victory in a competition or acquire limited resources. This scenario mirrors team sports, where rivaling agents strive to outperform one another.
  • Ecosystem: Create a dynamic ecosystem by introducing multiple interacting agents with independent reward signals and shared or distinct ‘Behavior Parameters’. This scenario simulates a small world where diverse animals with distinct goals coexist, such as a savanna hosting zebras, elephants, and giraffes, or an urban autonomous driving simulation.

Training Methods: Environment-Agnostic

The following sections provide an overview of the state-of-the-art machine learning algorithms included in the ML-Agents Toolkit. If you are not studying machine learning and reinforcement learning in depth and simply want to train agents to accomplish tasks, you can treat these algorithms as black boxes. While there are some training-related parameters to adjust within Unity and Python, you don’t need detailed knowledge of the algorithms themselves to successfully create and train agents. Step-by-step procedures for running the training process are provided on the Training ML-Agents page.

This section specifically focuses on the training methods available regardless of the specifics of your learning environment.

A Quick Note on Reward Signals

In this section, we introduce the concepts of intrinsic and extrinsic rewards, which help explain some of the training methods.

In reinforcement learning, the Agent’s ultimate goal is to discover a behavior (a Policy) that maximizes a reward. During training, you need to provide the agent with one or more reward signals. Typically, rewards are defined by your environment and correspond to achieving certain goals. These are known as extrinsic rewards as they are defined externally to the learning algorithm.

However, rewards can also be defined outside of the environment to encourage specific agent behaviors or aid in learning the true extrinsic reward. These are referred to as intrinsic reward signals. The total reward that the agent learns to maximize can be a combination of extrinsic and intrinsic reward signals.

The ML-Agents Toolkit allows reward signals to be defined in a modular way, and we provide four reward signals that can be mixed and matched to shape your agent’s behavior:

  • Extrinsic: Represents the rewards defined within your environment and is enabled by default.
  • GAIL: Represents an intrinsic reward signal defined by GAIL (Generative Adversarial Imitation Learning).
  • Curiosity: Represents an intrinsic reward signal that encourages exploration in environments with sparse rewards, defined by the Curiosity module.
  • RND: Represents an intrinsic reward signal that encourages exploration in environments with sparse rewards, defined by the Random Network Distillation module.
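
In practice these signals are configured in the trainer’s YAML file. The Python dict below simply mirrors the shape of such a reward_signals section for illustration; the strengths, gammas, and demonstration path are made-up placeholder values.

    # Illustrative only: mirrors the structure of the reward_signals section of a trainer config.
    reward_signals = {
        "extrinsic": {"strength": 1.0, "gamma": 0.99},         # environment rewards (on by default)
        "curiosity": {"strength": 0.02, "gamma": 0.99},        # ICM-based exploration bonus
        "rnd":       {"strength": 0.01, "gamma": 0.99},        # Random Network Distillation bonus
        "gail":      {"strength": 0.01, "gamma": 0.99,
                      "demo_path": "Demos/ExpertMedic.demo"},  # hypothetical demonstration file
    }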

Reinforcement Learning Algorithms in ML-Agents Toolkit

The ML-Agents Toolkit offers two reinforcement learning algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC).

PPO is the default algorithm and has been shown to be more general purpose and stable than many other RL algorithms.

In contrast, SAC is an off-policy algorithm that can learn from experiences collected at any time during the past. As experiences are collected, they are placed in an experience replay buffer and randomly drawn during training. This makes SAC significantly more sample-efficient, often requiring 5-10 times fewer samples to learn the same task as PPO. However, SAC tends to require more model updates. SAC is a good choice for heavier or slower environments that take about 0.1 seconds per step or more. Additionally, SAC is a “maximum entropy” algorithm that enables exploration in an intrinsic way.
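
To make the off-policy distinction concrete, here is a minimal, generic sketch of the experience replay buffer that SAC-style training relies on; it is an illustration of the mechanism, not the toolkit’s implementation.

    import random
    from collections import deque

    class ReplayBuffer:
        """Minimal experience replay buffer: store past transitions and sample them at random."""

        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted once full

        def add(self, obs, action, reward, next_obs, done):
            self.buffer.append((obs, action, reward, next_obs, done))

        def sample(self, batch_size):
            # Off-policy training: batches are drawn uniformly from *any* point in the past,
            # which is what makes SAC more sample-efficient than on-policy PPO.
            return random.sample(self.buffer, batch_size)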

Overall, the ML-Agents Toolkit provides flexibility in choosing the appropriate reinforcement learning algorithm for different environments and training scenarios.

Using Curiosity for Sparse-reward Environments in ML-Agents Toolkit

In environments where the agent receives rare or infrequent rewards (i.e., sparse-reward), the agent may struggle to bootstrap its training process without a reward signal. To address this, intrinsic reward signals such as curiosity can be valuable in helping the agent explore when extrinsic rewards are sparse.

The Curiosity Reward Signal enables the Intrinsic Curiosity Module, which is an implementation of the approach described in Curiosity-driven Exploration by Self-supervised Prediction by Pathak et al. This module trains two networks:

  • An inverse model that takes the current and next observation of the agent, encodes them, and uses the encoding to predict the action that was taken between the observations.
  • A forward model that takes the encoded current observation and action and predicts the next encoded observation.

The loss of the forward model, which is the difference between the predicted and actual encoded observations, is used as the intrinsic reward. The more surprised the model is, the larger the reward will be.
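
The following is a compact PyTorch sketch of this idea, simplified from the Pathak et al. approach; layer sizes are arbitrary and it is not the toolkit’s actual Curiosity module.

    import torch
    import torch.nn as nn

    class Curiosity(nn.Module):
        """Simplified Intrinsic Curiosity Module: inverse and forward models over an encoding."""

        def __init__(self, obs_size, n_actions, enc_size=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_size, enc_size), nn.ReLU())
            self.inverse = nn.Linear(2 * enc_size, n_actions)                # predicts the action taken
            self.forward_model = nn.Linear(enc_size + n_actions, enc_size)   # predicts the next encoding

        def inverse_loss(self, obs, next_obs, action_labels):
            # Training the inverse model shapes the encoder to keep only action-relevant features.
            z, z_next = self.encoder(obs), self.encoder(next_obs)
            logits = self.inverse(torch.cat([z, z_next], dim=-1))
            return nn.functional.cross_entropy(logits, action_labels)

        def intrinsic_reward(self, obs, next_obs, action_onehot):
            # The forward model's prediction error ("surprise") is the intrinsic reward.
            z, z_next = self.encoder(obs), self.encoder(next_obs)
            predicted_next = self.forward_model(torch.cat([z, action_onehot], dim=-1))
            return (predicted_next - z_next.detach()).pow(2).mean(dim=-1)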

Using RND for Sparse-reward Environments in ML-Agents Toolkit

Random Network Distillation (RND) is another useful approach for helping agents explore in sparse or rare reward environments, similar to Curiosity. The RND Module is implemented following the paper Exploration by Random Network Distillation.

RND uses two networks:

  • The first network has fixed random weights and takes observations as inputs to generate an encoding.
  • The second network has a similar architecture and is trained to predict the outputs of the first network, using the observations the Agent collects as training data.

The squared difference between the predicted and actual encodings (the training loss of the second network) is used as the intrinsic reward. The more often an Agent visits a state, the more accurate the predictions become and the lower the reward will be, which encourages the Agent to explore new states where the prediction error is higher.
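
A compact PyTorch sketch of the mechanism follows; it is a simplified illustration with arbitrary sizes, not the toolkit’s RND module.

    import torch
    import torch.nn as nn

    class RND(nn.Module):
        """Simplified Random Network Distillation: a predictor chases a frozen random target."""

        def __init__(self, obs_size, enc_size=64):
            super().__init__()
            self.target = nn.Sequential(nn.Linear(obs_size, enc_size), nn.ReLU(),
                                        nn.Linear(enc_size, enc_size))
            self.predictor = nn.Sequential(nn.Linear(obs_size, enc_size), nn.ReLU(),
                                           nn.Linear(enc_size, enc_size))
            for p in self.target.parameters():
                p.requires_grad = False   # the target keeps its random weights and is never trained

        def intrinsic_reward(self, obs):
            # Squared prediction error: large for rarely visited states, small for familiar ones.
            # The same quantity serves as the predictor's training loss.
            return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)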

Using Imitation Learning in ML-Agents Toolkit

Imitation Learning is a powerful technique that leverages demonstrations to train an agent to perform a specific task. Instead of relying purely on trial and error, we can provide the agent with recorded examples of observations and actions to guide its behavior. For example, we can give the medic in a game examples of observations and actions recorded from a game controller to help it learn how to behave.

Imitation Learning can be used alone or in conjunction with reinforcement learning. When used alone, it can help the agent learn a specific type of behavior or solve a specific task. When used in conjunction with reinforcement learning, it can significantly reduce the time it takes for the agent to solve the environment, especially in sparse-reward environments.

The ML-Agents Toolkit provides support for Imitation Learning, enabling users to teach their agents specific behaviors quickly and effectively. This capability can help accelerate the training process and improve the overall performance of the agent. Check out this video demo of Imitation Learning in action.


The ML-Agents Toolkit offers the ability to learn directly from demonstrations and utilize them to accelerate reward-based training (RL). It includes two algorithms: Behavioral Cloning (BC) and Generative Adversarial Imitation Learning (GAIL). These two features can be combined in most scenarios:

  • Learning from Demonstrations with Sparse Rewards:

If you want to assist your agents in learning, especially in environments with sparse rewards, you can enable both GAIL and Behavioral Cloning at low strengths, in addition to having an extrinsic reward. An example of this approach is provided for the PushBlock example environment, located in config/imitation/PushBlock.yaml.

  • Training Purely from Demonstrations:

If you aim to train solely from demonstrations using GAIL and BC, without relying on an extrinsic reward signal, please refer to the CrawlerStatic example environment found in config/imitation/CrawlerStatic.yaml.

Please note that when using GAIL (Generative Adversarial Imitation Learning), there is a survivor bias introduced into the learning process. This means that the agent is incentivized to stay alive for as long as possible by receiving positive rewards based on similarity to the expert. However, this bias can directly conflict with goal-oriented tasks, such as the PushBlock or Pyramids example environments, where the agent needs to reach a goal state quickly to end the episode.

In such cases, we strongly recommend the following approach:

  • Use a low strength GAIL reward signal.
  • Employ a sparse extrinsic signal when the agent successfully achieves the task.

By combining these two signals, the GAIL reward will guide the agent until it discovers the extrinsic signal, without overpowering it. If you notice that the agent is disregarding the extrinsic reward signal, we advise reducing the strength of GAIL to find the right balance between the two signals.
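
The dict below sketches what such a trainer configuration looks like, mirroring the YAML structure with a sparse extrinsic reward, a low-strength GAIL signal, and optional behavioral cloning. The values and demonstration path are placeholders, not copied from the shipped example configs.

    # Illustrative only: a goal-oriented task guided by demonstrations without being dominated by them.
    trainer_config = {
        "trainer_type": "ppo",
        "reward_signals": {
            "extrinsic": {"strength": 1.0, "gamma": 0.99},        # sparse reward for reaching the goal
            "gail": {"strength": 0.01, "gamma": 0.99,             # kept weak so it guides, not overpowers
                     "demo_path": "Demos/ExpertPushBlock.demo"},  # hypothetical demonstration file
        },
        "behavioral_cloning": {"strength": 0.5,
                               "demo_path": "Demos/ExpertPushBlock.demo"},
    }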

GAIL (Generative Adversarial Imitation Learning)

GAIL (Generative Adversarial Imitation Learning) is an approach that utilizes adversarial techniques to reward an Agent for exhibiting behavior similar to a set of demonstrations. It can be employed with or without environment rewards and is particularly effective when the number of demonstrations is limited.

In the GAIL framework, a second neural network called the discriminator is trained to distinguish between observations/actions from the demonstrations and those generated by the agent. Based on the discriminator’s evaluation, the agent receives a reward that reflects the similarity between its new observation/action and the provided demonstrations.

During each training step, the agent strives to maximize this reward, while the discriminator continues to improve its ability to differentiate between demonstrations and agent-generated state/actions. As a result, the agent becomes increasingly adept at mimicking the demonstrations, while the discriminator becomes more stringent, requiring the agent to exert greater effort to deceive it.

This approach enables the learning of a policy that generates states and actions resembling those in the demonstrations, even with a reduced number of demonstrations compared to direct cloning of actions. Furthermore, the GAIL reward signal can be combined with an extrinsic reward signal to guide the learning process.
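
Below is a simplified PyTorch sketch of a GAIL discriminator and the reward it produces. The reward transform shown (-log(1 - D)) is one common choice and the network sizes are arbitrary; this is not the toolkit’s implementation.

    import torch
    import torch.nn as nn

    class GAILDiscriminator(nn.Module):
        """Simplified GAIL discriminator over concatenated observation/action pairs."""

        def __init__(self, obs_size, act_size, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_size + act_size, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

        def prob_expert(self, obs, act):
            # Probability that the (observation, action) pair came from the demonstrations.
            return torch.sigmoid(self.net(torch.cat([obs, act], dim=-1))).squeeze(-1)

        def discriminator_loss(self, demo_obs, demo_act, policy_obs, policy_act):
            # Binary classification: demonstrations labeled 1, agent-generated samples labeled 0.
            d_demo = self.prob_expert(demo_obs, demo_act)
            d_policy = self.prob_expert(policy_obs, policy_act)
            return -(torch.log(d_demo + 1e-8).mean() + torch.log(1.0 - d_policy + 1e-8).mean())

        def gail_reward(self, obs, act):
            # Higher reward the more the discriminator mistakes the agent for a demonstrator.
            return -torch.log(1.0 - self.prob_expert(obs, act) + 1e-8)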

Using Behavioral Cloning in ML-Agents Toolkit

Behavioral Cloning (BC) is a training technique that involves training an agent’s policy to exactly mimic the actions shown in a set of demonstrations. The BC feature can be enabled on the PPO or SAC trainers in the ML-Agents Toolkit.

While BC can be effective in teaching an agent specific behaviors, it has limitations. BC cannot generalize past the examples shown in the demonstrations, so it works best when there are demonstrations for nearly all of the states that the agent can experience. Alternatively, BC can be used in conjunction with GAIL and/or an extrinsic reward to improve its performance.
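
As an illustration of the underlying idea, here is a minimal sketch of a behavioral cloning update for a discrete-action policy: a purely supervised step that pushes the policy toward the expert’s recorded choices. It is a simplification, not the toolkit’s BC feature.

    import torch
    import torch.nn as nn

    def bc_update(policy: nn.Module, optimizer, demo_obs, demo_actions):
        """One supervised behavioral cloning step.

        demo_obs: batch of demonstration observations; demo_actions: the expert's action indices.
        """
        logits = policy(demo_obs)                                  # policy outputs action logits
        loss = nn.functional.cross_entropy(logits, demo_actions)   # imitate the expert's choice
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()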

Recording Demonstrations in ML-Agents Toolkit

Recording demonstrations of agent behavior is a crucial step in training an agent with Imitation Learning, whether through Behavioral Cloning (BC) or Generative Adversarial Imitation Learning (GAIL). The ML-Agents Toolkit provides an easy way to record demonstrations directly from the Unity Editor or from a built game and save them as assets.

The recorded demonstrations contain valuable information on the observations, actions, and rewards for a given agent during the recording session. These demonstrations can be managed in the Editor and used for training with BC and GAIL.

Summary

The ML-Agents Toolkit offers three training methods: BC, GAIL, and RL (PPO or SAC), which can be employed independently or in combination:

  • BC (Behavioral Cloning) can be used as a standalone method or as a preliminary step before applying GAIL and/or RL.
  • GAIL (Generative Adversarial Imitation Learning) can be utilized with or without extrinsic rewards, allowing the agent to learn from demonstrations and mimic their behavior.
  • RL (Reinforcement Learning) can be implemented as a standalone approach, using either PPO or SAC algorithms, or combined with BC and/or GAIL to enhance training outcomes.

Both BC and GAIL methods require the availability of recorded demonstrations, which serve as input for the training algorithms.

Environment-Specific Behavior Training in ML-Agents

In addition to the three environment-agnostic training methods, the ML-Agents Toolkit provides additional methods that can aid in training behaviors for specific types of environments.

One such method is Self-Play, which is useful for training in competitive multi-agent environments. Self-Play can be used for both symmetric and asymmetric adversarial games. In symmetric games, opposing agents are equal in form, function, and objective, while in asymmetric games, this is not the case. Self-Play allows an agent to learn by competing against fixed, past versions of its opponent, providing a more stable learning environment.

Self-Play can be used with both Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) algorithms. However, from the perspective of an individual agent, these scenarios may appear to have non-stationary dynamics because the opponent is constantly changing. This can cause significant issues in the experience replay mechanism used by SAC.

By leveraging environment-specific training methods like Self-Play in the ML-Agents Toolkit, users can train their agents more effectively and efficiently for specific types of environments. These methods provide a flexible and powerful way to teach agents new behaviors and improve their overall performance in complex, dynamic environments.

Cooperative Multi-Agent Training with MA-POCA in ML-Agents Toolkit

ML-Agents Toolkit provides the functionality for training cooperative behaviors, where groups of agents work towards a common goal, and the success of the individual is linked to the success of the whole group. In such scenarios, agents typically receive rewards as a group, making it difficult for individual agents to learn what to do.

To address this, ML-Agents offers MA-POCA (MultiAgent POsthumous Credit Assignment), a novel multi-agent trainer that trains a centralized critic, a neural network acting as a “coach” for the group of agents. With MA-POCA, rewards can be given to the team as a whole, and the agents will learn how to contribute to achieving that reward. Individual rewards can also be given, and the team will work together to help the individual achieve those goals.

MA-POCA allows for agents to be added or removed from the group during an episode, such as when agents spawn or die in a game. Even if agents are removed mid-episode, they will still learn whether their actions contributed to the team winning later, enabling agents to take group-beneficial actions even if it results in the individual being removed from the game (i.e., self-sacrifice). MA-POCA can also be combined with self-play to train teams of agents to play against each other.

Improving Machine Learning with Curriculum Learning

Curriculum learning is a training technique for machine learning models that gradually introduces more difficult aspects of a problem in a way that optimally challenges the model. This approach mimics how humans learn, where topics are ordered in a specific sequence, with easier concepts taught before more complex ones.

For example, in primary school education, arithmetic is taught before algebra, and algebra is taught before calculus. The skills and knowledge learned in earlier subjects provide a foundation for later lessons. Similarly, in machine learning, training on easier tasks can provide a foundation for tackling harder tasks in the future.

Agent Robustness with Environment Parameter Randomization

When an agent is trained on a specific environment, it may struggle to generalize to variations or tweaks in the environment, leading to overfitting. This issue is particularly problematic when environments are instantiated with varying objects or properties. To address this, agents can be exposed to variations during training to improve their ability to generalize to unseen environmental changes.

Similar to curriculum learning, where the difficulty of environments increases as the agent learns, the ML-Agents Toolkit offers a mechanism to randomly sample environment parameters during training. This approach, called environment parameter randomization, exposes the agent to a range of environments with varying parameters, enabling it to learn to adapt to different scenarios.
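
The dict below mirrors the shape of the environment_parameters section used for parameter randomization; it is shown for illustration only, and the parameter names and ranges are hypothetical.

    # Illustrative only: each environment parameter is drawn from a sampler at reset time.
    environment_parameters = {
        "gravity": {
            "sampler_type": "uniform",
            "sampler_parameters": {"min_value": 7.0, "max_value": 12.0},
        },
        "obstacle_scale": {
            "sampler_type": "gaussian",
            "sampler_parameters": {"mean": 1.0, "st_dev": 0.3},
        },
    }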

Types of Models in the ML-Agents Toolkit

The ML-Agents Toolkit offers users the ability to train various model types, irrespective of the chosen training method. This flexibility stems from the capability to define agent observations in different forms, including vector, ray cast, and visual observations.

Learning from Observations

The ML-Agents Toolkit offers a fully connected neural network model to facilitate learning from both ray cast and vector observations of an agent. During training, you have the flexibility to configure various aspects of this model, including the number of hidden units and layers, to optimize the learning process.

Multiple Cameras and CNNs for Learning in the ML-Agents Toolkit

The ML-Agents Toolkit allows multiple cameras to be used for observations per agent, enabling agents to integrate information from multiple visual streams. This feature is not available in other platforms, which may limit the agent’s observation to a single vector or image. Using multiple cameras can be helpful in scenarios such as training a self-driving car with multiple cameras of different viewpoints or a navigational agent that needs to integrate aerial and first-person visuals.

When visual observations are utilized, the ML-Agents Toolkit uses convolutional neural networks (CNNs) to learn from input images. The toolkit offers three network architectures for CNNs, including a simple encoder with two convolutional layers, the implementation proposed by Mnih et al. with three convolutional layers, and the IMPALA Resnet with three stacked layers, each with two residual blocks. The choice of architecture depends on the visual complexity of the scene and available computational resources.
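
Both the fully connected model described earlier and the visual encoder choice are exposed through the trainer configuration. The dict below mirrors the shape of a network_settings section for illustration; the values are placeholders, and the key names reflect recent releases.

    # Illustrative only: network_settings controls model size and the visual encoder.
    network_settings = {
        "hidden_units": 128,          # width of each fully connected layer
        "num_layers": 2,              # number of fully connected layers
        "vis_encode_type": "simple",  # or "nature_cnn" (Mnih et al.) / "resnet" (IMPALA)
    }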

Adapting to Variable-Length Observations through Attention-based Learning

With the ML-Agents Toolkit, agents can learn from variable-length observations by leveraging the concept of attention. Each agent can maintain a buffer of vector observations, allowing them to keep track of a varying number of elements throughout an episode. At each step, the agent processes the elements in the buffer and extracts relevant information. This capability is particularly useful in scenarios where agents need to handle dynamic elements, such as avoiding projectiles in a game where the number of projectiles can vary. By utilizing attention mechanisms, agents can effectively learn from and adapt to changing environments.
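
The PyTorch sketch below shows the core mechanism in a simplified form: a varying number of observed entities is embedded and pooled with attention into a fixed-size summary. It is conceptual only (not the toolkit’s implementation), and the sizes are arbitrary.

    import torch
    import torch.nn as nn

    class EntityAttention(nn.Module):
        """Simplified attention pooling over a variable number of observed entities."""

        def __init__(self, entity_size, embed_size=64):
            super().__init__()
            self.embed = nn.Linear(entity_size, embed_size)
            self.attention = nn.MultiheadAttention(embed_size, num_heads=2, batch_first=True)
            self.query = nn.Parameter(torch.zeros(1, 1, embed_size))

        def forward(self, entities, padding_mask):
            # entities: [batch, max_entities, entity_size]; padding_mask is True for empty slots,
            # so a varying number of projectiles (for example) is handled uniformly.
            keys = self.embed(entities)
            query = self.query.expand(entities.size(0), -1, -1)
            pooled, _ = self.attention(query, keys, keys, key_padding_mask=padding_mask)
            return pooled.squeeze(1)   # fixed-size summary regardless of how many entities exist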

Enhancing Agent Memory with Recurrent Neural Networks in the ML-Agents Toolkit

In certain scenarios, agents must be able to remember past observations to make optimal decisions. However, when an agent only has partial observability of the environment, it can be challenging to keep track of past observations. To address this, the ML-Agents Toolkit offers memory-enhanced agents using recurrent neural networks (RNNs).

RNNs, specifically Long Short-Term Memory (LSTM), can help agents learn what is important to remember in order to solve a task. This is particularly useful when an agent has limited observability of the environment. By using RNNs, agents can keep track of past observations and make better decisions based on that information.
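
In the trainer configuration, memory is enabled through the memory block of network_settings. The dict below only mirrors that structure for illustration; the values are placeholders.

    # Illustrative only: adding an LSTM to the policy via network_settings.
    network_settings = {
        "hidden_units": 128,
        "num_layers": 2,
        "memory": {
            "memory_size": 128,       # size of the recurrent state carried between steps
            "sequence_length": 64,    # number of consecutive steps used per recurrent update
        },
    }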

Additional Features for Flexibility and Interpretability in the ML-Agents Toolkit

In addition to the flexible training scenarios available, the ML-Agents Toolkit offers several additional features to improve the flexibility and interpretability of the training process.

  • Concurrent Unity Instances – The toolkit allows developers to run concurrent, parallel instances of the Unity executable during training. This can speed up training for certain scenarios. The toolkit’s documentation includes a dedicated page on creating a Unity executable and the Training ML-Agents page provides instructions on setting the number of concurrent instances.
  • Recording Statistics from Unity – The toolkit enables developers to record statistics from within their Unity environments. These statistics are aggregated and generated during the training process, providing valuable insights into the training progress.
  • Custom Side Channels – The toolkit allows developers to create custom side channels to manage data transfer between Unity and Python, tailored to their training workflow and/or environment. This provides additional flexibility in data transfer and management during training; an example of a custom side channel appears after this list.
  • Custom Samplers – The toolkit enables developers to create custom sampling methods for Environment Parameter Randomization. This allows users to customize this training method for their particular environment, providing greater control over the training process.
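
The sketch below defines a custom side channel on the Python side that exchanges strings with a matching channel registered in the Unity scene. The class and method names come from the mlagents_envs side channel API in recent releases; the channel UUID and executable path are placeholders that must match your Unity-side code.

    import uuid
    from mlagents_envs.environment import UnityEnvironment
    from mlagents_envs.side_channel.side_channel import (
        SideChannel, IncomingMessage, OutgoingMessage,
    )

    class StringLogChannel(SideChannel):
        """Custom side channel exchanging strings with a matching channel in the Unity scene."""

        def __init__(self):
            # Placeholder UUID; it must match the channel ID used by the Unity-side channel.
            super().__init__(uuid.UUID("621f0a70-4f87-11ea-a6bf-784f4387d1f7"))

        def on_message_received(self, msg: IncomingMessage) -> None:
            print("From Unity:", msg.read_string())       # data sent by the Unity environment

        def send_string(self, data: str) -> None:
            msg = OutgoingMessage()
            msg.write_string(data)
            super().queue_message_to_send(msg)            # delivered on the next step or reset

    channel = StringLogChannel()
    env = UnityEnvironment(file_name="path/to/YourEnvironment",   # placeholder path
                           side_channels=[channel])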

Conclusion

The ML-Agents Toolkit allows games and simulations built in Unity to serve as a platform for training intelligent agents. It offers a wide range of training modes and scenarios, along with several features to enhance machine learning within Unity.

To get started with the toolkit, users can refer to the documentation and tutorials available on the Unity website. The toolkit also has an active community forum where users can ask questions, share their experiences, and get support from other users and developers.

As the field of machine learning continues to evolve, the ML-Agents Toolkit remains at the forefront of enabling researchers and developers to leverage this technology within Unity. By utilizing the toolkit’s flexibility and features, users can create intelligent agents that can navigate complex environments and solve challenging tasks.