Create a Bot to Find Diamonds in Minecraft


Reinforcement Learning and Behavior Cloning in Python with MineRL

Image by author (Mojang license)

Minecraft is the next frontier for Artificial Intelligence.

It’s a huge game, with many mechanics and complex sequences of actions. It takes an entire wiki with over 8000 pages just to teach humans how to play Minecraft. So how good can artificial intelligence be?

This is the question we’ll answer in this article. We’ll design a bot and try to achieve one of the most difficult challenges in Minecraft: finding diamonds from scratch. To make things even worse, we will take on this challenge in randomly generated worlds so we can’t learn a particular seed.

Sequence of actions to find diamonds, image by author (Mojang license)

What we’re gonna talk about is not limited to Minecraft. It can be applied to similar complex environments. More specifically, we will implement two different techniques that will become the backbone of our intelligent agent.

But before we can train an agent, we need to understand how to interact with the environment. Let’s start with a scripted bot to get familiar with the syntax. We’ll use MineRL, a fantastic library to build AI applications in Minecraft.

The code used in this article is available on Google Colab. It is a simplified and finetuned version of the excellent notebooks made by the organizers of the MineRL 2021 competition (MIT License).

MineRL allows us to launch Minecraft in Python and interact with the game. This is done through the popular gym library.

Image by author

We are in front of a tree. As you can see, the resolution is quite low. A low resolution means fewer pixels, which speeds things up. Fortunately for us, neural networks don’t need a 4K resolution to understand what’s happening on screen.

Now, we would like to interact with the game. What can our agent do? Here’s the list of possible actions:

List of actions (image by author)

The first step to find diamonds is to get wood to make a crafting table and a wooden pickaxe.

Let’s try to get closer to the tree. This means holding the “forward” button for less than a second. MineRL processes 20 actions per second, so we don’t need a full second: let’s send the action 5 times (a quarter of a second), then wait for 40 more ticks.
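To make the timing concrete, here is a minimal sketch of that sequence. Plain dictionaries stand in for the action returned by MineRL’s `env.action_space.noop()`, and the helper names `make_noop` and `build_forward_script` are hypothetical, used only for illustration.

```python
TICKS_PER_SECOND = 20  # from the text: MineRL processes 20 actions per second


def make_noop():
    """Stand-in for env.action_space.noop(): an action that does nothing.

    Only the keys used in this article are listed (an assumption); the real
    MineRL action dictionary contains more entries.
    """
    return {"forward": 0, "attack": 0, "jump": 0, "camera": [0.0, 0.0]}


def build_forward_script(n_forward=5, n_wait=40):
    """Return a list of actions: hold forward, then stay idle."""
    script = []
    for _ in range(n_forward):
        action = make_noop()
        action["forward"] = 1
        script.append(action)
    for _ in range(n_wait):
        script.append(make_noop())
    return script


script = build_forward_script()
duration = len(script) / TICKS_PER_SECOND
print(f"{len(script)} ticks = {duration:.2f} s")

# With a real environment, each action would be sent with:
#     obs, reward, done, info = env.step(action)
```

The whole sequence lasts 45 ticks, of which only the first 5 actually move the agent.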

Image by author
Image by author

Great, let’s chop this tree now. We need four actions in total:

  • Forward to go in front of the tree;
  • Attack to chop the tree;
  • Camera to look up or down;
  • Jump to get the final piece of wood.

Image by author

Handling the camera can be a hassle. To simplify the syntax, we’re gonna use the str_to_act function from this GitHub repository (MIT license). This is what the new script looks like:

The agent efficiently chopped the entire tree. This is a good start, but we would like to do it in a more automated way…

Our bot works well in a fixed environment, but what happens if we change the seed or its starting point?

Everything is scripted so the agent would probably try to chop a non-existent tree.

This approach is too static for our requirements: we need something that can adapt to new environments. Instead of scripting orders, we want an AI that knows how to chop trees. Naturally, reinforcement learning is a pertinent framework to train this agent. More specifically, deep RL seems to be the solution since we’re processing images to select the best actions.

There are two ways of implementing it:

  • Pure deep RL: the agent is trained from scratch by interacting with the environment. It is rewarded every time it chops a tree.
  • Imitation learning: the agent learns how to chop trees from a dataset. In this case, it is a sequence of actions to chop trees made by a human.

The two approaches have the same outcome, but they’re not equivalent. According to the authors of the MineRL 2021 competition, it takes 8 hours for the pure RL solution and 15 minutes for the imitation learning agent to reach the same level of performance.

We don’t have that much time to spend, so we’re going for the Imitation Learning solution. More precisely, we’ll use Behavior Cloning, which is the simplest form of imitation learning.

Note that Imitation Learning is not always more efficient than RL. If you want to know more about it, Kumar et al. wrote a great blog post about this topic.

Image by author

The problem is reduced to a multi-class classification task. Our dataset consists of mp4 videos, so we’ll use a Convolutional Neural Network (CNN) to translate these images into relevant actions. Our goal is also to limit the number of actions (classes) that can be taken so the CNN has fewer options, which means it’ll be trained more efficiently.
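To illustrate what “multi-class classification” means here, the sketch below computes a cross-entropy loss between predicted logits and the action a human took on each frame. It uses NumPy rather than a deep learning framework, and treating cross-entropy as the training loss is an assumption about the setup, not something stated above. The batch is made up.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean cross-entropy between logits (N, C) and integer labels (N,)."""
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Hypothetical batch: 3 frames, 7 possible actions.
labels = np.array([0, 1, 2])

# A model that has learned nothing predicts the same score for every action.
uniform_logits = np.zeros((3, 7))
uniform_loss = cross_entropy(uniform_logits, labels)  # equals log(7)

# A model that favours the human's action gets a lower loss.
confident_logits = np.zeros((3, 7))
confident_logits[np.arange(3), labels] = 5.0
confident_loss = cross_entropy(confident_logits, labels)

print(f"uniform: {uniform_loss:.3f}, confident: {confident_loss:.3f}")
```

With 7 classes, a uniform guess gives a loss of log(7), roughly 1.95, which is a useful baseline when reading a training log.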

In this example, we manually define 7 relevant actions: attack, forward, jump, and move the camera (left, right, up, down). Another popular approach is to apply K-means in order to automatically retrieve the most relevant actions taken by humans. In any case, the goal is to discard the actions that are least useful for our task, such as crafting in our example.
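A reduced action set like this one can be sketched as a lookup table from a class index to the changes applied on top of a no-op action. The table below is an illustration, not the exact mapping used in the notebook: the camera angle, the order of the entries, and which sign corresponds to “up” or “left” are all assumptions.

```python
CAMERA_ANGLE = 10  # degrees per step; a hypothetical value

# Index -> list of (key, value) pairs applied on top of a no-op action.
# In MineRL the camera entry is [pitch, yaw]; the sign conventions used
# here are an assumption.
ACTIONS = [
    [("attack", 1)],
    [("forward", 1)],
    [("jump", 1)],
    [("camera", [-CAMERA_ANGLE, 0])],  # pitch one way
    [("camera", [CAMERA_ANGLE, 0])],   # pitch the other way
    [("camera", [0, -CAMERA_ANGLE])],  # yaw one way
    [("camera", [0, CAMERA_ANGLE])],   # yaw the other way
]


def to_action(index):
    """Turn a class index (0-6) into a dictionary-style action."""
    # Plain dict standing in for env.action_space.noop().
    action = {"attack": 0, "forward": 0, "jump": 0, "camera": [0, 0]}
    for key, value in ACTIONS[index]:
        # Copy lists so that the table itself is never modified.
        action[key] = list(value) if isinstance(value, list) else value
    return action


print(to_action(0))
print(to_action(3))
```

Keeping the table small is the point: the classifier only has to choose among 7 classes instead of the full action space.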

Let’s train our CNN on the MineRLTreechop-v0 dataset. Other datasets can be found at this address. We chose a learning rate of 0.0001 and 6 epochs with a batch size of 32.

Step 4000 | Training loss = 0.878
Step 8000 | Training loss = 0.826
Step 12000 | Training loss = 0.805
Step 16000 | Training loss = 0.773
Step 20000 | Training loss = 0.789
Step 24000 | Training loss = 0.816
Step 28000 | Training loss = 0.769
Step 32000 | Training loss = 0.777
Step 36000 | Training loss = 0.738
Step 40000 | Training loss = 0.751
Step 44000 | Training loss = 0.764
Step 48000 | Training loss = 0.732
Step 52000 | Training loss = 0.748
Step 56000 | Training loss = 0.765
Step 60000 | Training loss = 0.735
Step 64000 | Training loss = 0.716
Step 68000 | Training loss = 0.710
Step 72000 | Training loss = 0.693
Step 76000 | Training loss = 0.695

Our model is trained. We can now instantiate an environment and see how it behaves. If the training was successful, it should frantically cut all the trees in sight.

This time, we’ll use the ActionShaping wrapper to map the array of numbers created with dataset_action_batch_to_actions to discrete actions in MineRL.

Our model needs a pov observation in the correct format and outputs logits. These logits can be turned into a probability distribution over a set of 7 actions with the softmax function. We then randomly choose an action based on the probabilities. The selected action is implemented in MineRL thanks to env.step(action).
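The logits-to-action step can be sketched in a few lines. This version uses NumPy instead of the model’s own framework, and the logits are made-up values standing in for the CNN’s output on one frame.

```python
import numpy as np


def softmax(logits):
    """Convert a vector of logits into a probability distribution."""
    shifted = logits - np.max(logits)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)  # seeded so the example is reproducible

# Hypothetical output of the network for one frame (7 actions).
logits = np.array([2.0, 1.0, 0.5, 0.0, 0.0, -1.0, -1.0])
probabilities = softmax(logits)

# Sample an action index according to the probabilities,
# instead of always taking the most likely one.
action_index = rng.choice(len(probabilities), p=probabilities)

print(probabilities.round(3), action_index)
```

Sampling rather than taking the argmax keeps some randomness in the behavior, which can help the agent avoid repeating the same action forever when it gets stuck.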

This process is repeated as many times as we want. Let’s do it 1000 times and watch the result.

Our agent is quite chaotic, but it manages to chop trees in this new, unseen environment. Now, how do we find diamonds?

A simple yet powerful approach consists of combining scripted actions with artificial intelligence. Learn the boring stuff, script the knowledge.

In this paradigm, we’ll use the CNN to get a healthy amount of wood (3000 steps). Then, we can script a sequence to craft planks, sticks, a crafting table, a wooden pickaxe, and start mining stone (it should be below our feet). This stone can then be used to craft a stone pickaxe, which can mine iron ore.
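The two-phase idea can be sketched as a small control loop: the learned policy acts for a fixed number of steps, then a fixed list of actions is replayed. Everything here is a stand-in: `run_hybrid`, `fake_step`, and the action names are hypothetical, and a real `step` function would wrap `env.step`.

```python
def run_hybrid(initial_obs, policy, script, step, cnn_steps=3000):
    """Run the learned policy for cnn_steps, then replay a fixed script.

    policy: callable mapping an observation to an action.
    script: list of actions to execute afterwards.
    step:   callable mapping an action to (observation, done).
    """
    obs, done = initial_obs, False

    # Phase 1: the learned agent gathers wood.
    for _ in range(cnn_steps):
        if done:
            break
        obs, done = step(policy(obs))

    # Phase 2: the script crafts tools and mines.
    for action in script:
        if done:
            break
        obs, done = step(action)

    return obs


# Tiny demonstration with stand-ins instead of a real environment.
log = []


def fake_step(action):
    log.append(action)
    return len(log), False  # (observation, done)


final_obs = run_hybrid(
    initial_obs=0,
    policy=lambda obs: "chop",
    script=["craft_planks", "craft_pickaxe"],
    step=fake_step,
    cnn_steps=3,
)
print(log)
```

The `done` checks matter in practice: if the agent dies during the first phase, there is no point replaying the script.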

CNN + script approach, image by author (Mojang license)

This is when things get complicated: iron ore is quite rare, so we would need to run the game for a while to find a deposit. Then, we would have to craft a furnace and smelt the ore to get the iron needed for an iron pickaxe. Finally, we would have to go even deeper and be even luckier to obtain a diamond without falling into lava.

As you can see, it’s doable but the outcome is fairly random. We could train another agent to find diamonds, and even a third one to create the iron pickaxe. If you’re interested in more complex approaches, you can read the results of the MineRL Diamond 2021 Competition by Kanervisto et al. It describes several solutions using different clever techniques, including end-to-end deep learning architectures. Nonetheless, it is a complex problem and no team managed to consistently find diamonds, if at all.

This is why we will limit ourselves to obtaining a stone pickaxe in the following example, but you can modify the code to go further.

We can see our agent chopping wood like a madman during the first 3000 steps, then our script takes over and completes the task. It might not be obvious, but the command print(obs.inventory) shows a stone pickaxe. Note that this is a cherry-picked example: most of the runs don’t end that well.

There are several reasons why the agent may fail: it can spawn in a hostile environment (water, lava, etc.), in an area without wood, or even fall and die. Playing with different seeds will give you a good understanding of the complexity of this problem and, hopefully, ideas to build even better agents.

I hope you enjoyed this little guide to reinforcement learning in Minecraft. Beyond its obvious popularity, Minecraft is an interesting environment to try and test RL agents. Like NetHack, it requires a thorough knowledge of its mechanics to plan precise sequences of actions in a procedurally-generated world. In this article,

  • We learned how to use MineRL;
  • We saw two approaches (script and behavior cloning) and how to combine them;
  • We visualized the agent’s actions with short videos.

The main drawback of the environment is its slow processing time. Minecraft is not a lightweight game like NetHack or Pong, which is why the agents take a long time to be trained. If this is a problem for you, I would recommend lighter environments like Gym Retro.

Thank you for your attention! Feel free to follow me on Twitter if you’re interested in AI applied to video games.




