Friday Facts #147 - Multiplayer rewrite

Posted by kovarex on 2016-07-15

Hello!

Multiplayer - new field for me

Once we started the matching server, we finally had to face the reality of multiplayer over the internet with people around the world. We realised there are A LOT of problems rising to the surface, and that they need to be worked on. I had left all the multiplayer logic to cube and tomas until now, and I had only a very simplistic idea of how it works. With tomas on holiday and cube busy with other tasks, I realised how big of a problem it is that no one else has a clue how it works under the hood, so I took this opportunity to dive into it personally. After a week of reading, discussions with cube and partial rewrites, I can present you with my findings and a roadmap of ongoing changes.

From the Peer-to-peer model to the server-client model that is still kind of peer-to-peer

As some of you might know, the Factorio multiplayer was originally written to be always peer-to-peer. The motivation was to minimize the latency: in the theoretical case of everyone having the same connection to everyone else in the game, the latency would actually be smaller than in the server-client model. The problem is that this came at a price in many respects:
  • Everyone needs to be actually sending packets to everyone else, which isn't that easy in the current world, where IPv6 isn't everywhere and a public IPv4 address is becoming quite a luxury. This can be solved by NAT punching, but that isn't 100% reliable either.
  • The logic of events like joining, quitting and disconnecting is very complex, as it always has to be discussed by the peers before anything can be done. And as we have the lock-step simulation, it always has to be ensured that these actions are performed in perfect synchrony. Complexity means bugs, and in this case some of them are hard to fix. On top of that, even if it was written perfectly, it wouldn't feel perfect.
  • Everyone needs to have the same latency.
  • No defence from lag spikes of individual players.
  • One packet per player per tick is sent and received by everyone, so the number of packets sent is O(n^2).
So once we encountered the "real internet" network communication, these problems proved to be too serious. We could have anticipated this if we had only listened to the people warning us that peer to peer will lead to trouble when we were first writing about the implementation more than a year ago. But sometimes, you just have to learn from your own mistakes.
So we added an additional option to run in the Server mode, which later became the only option.
But our server mode only solved the first problem: it was just a patch that re-routed all the communication between peers through the server, while all of the complexity related to the peer discussion stayed.

The original peer to peer model, 6 packets per tick minimum with 3 players.


The new server mode we have currently. Peer 1 resends the packets between the other peers, so here we send 8 packets per tick minimum.

In other words, we took the worst from both of the models and combined it.

The real server-client architecture version 1.0 (to be done)

The current state can't be solved by just small fixes and tweaks. Fundamental changes in the internals of the multiplayer logic have to be made on almost all of the layers to take advantage of the simplifications made possible by the fact that peer to peer isn't supported anymore. Let me present the most important changes that I'm working on:
Clients receive merged package once per tick.
One of the most obvious changes is that instead of re-sending all the packets, the server unpacks and merges them. It first waits to get the actions of all players for a certain tick, and then sends them to all the clients as a single message. This not only reduces the number of packets sent (from O(n^2) to O(n)), but it also keeps the clients from having to deal with the synchronisation and shit. They just accept the package as it is and apply it. If they miss something due to packet loss, the client just asks the server to resend the whole package. In other words, the clients don't communicate with each other at all.
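To make the merging step more concrete, here is a minimal Python sketch of what the server side could look like. It is not the actual Factorio code: the class `TickMerger`, its methods and the package layout (a dict with `tick` and `actions`) are all made up for illustration, and it assumes that every player delivers exactly one (possibly empty) list of actions per tick.

```python
from collections import defaultdict


class TickMerger:
    """Hypothetical server-side helper: merges all players' actions once per tick."""

    def __init__(self, player_ids):
        self.player_ids = set(player_ids)
        self.pending = defaultdict(dict)  # tick -> {player_id: [actions]}
        self.sent = {}                    # tick -> merged package, kept for resends

    def receive(self, tick, player_id, actions):
        """Store one player's actions; return the merged package once it is complete."""
        self.pending[tick][player_id] = list(actions)
        if set(self.pending[tick]) != self.player_ids:
            return None  # still waiting for somebody
        per_player = self.pending.pop(tick)
        package = {
            "tick": tick,
            # Sorted by player id so that every client applies the same order.
            "actions": [(pid, a) for pid in sorted(per_player) for a in per_player[pid]],
        }
        self.sent[tick] = package
        return package

    def resend(self, tick):
        """Answer a client that lost the package for this tick."""
        return self.sent[tick]


merger = TickMerger(player_ids=[1, 2, 3])
merger.receive(10, 2, ["build"])
merger.receive(10, 1, [])
package = merger.receive(10, 3, ["mine", "craft"])  # last player in, package is ready
```

The sketch keeps every sent package forever, which a real server would obviously not do. It is only there to show that a resend needs nothing from the other clients.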

The future server model. Peer 1 sends the merged package, so we are down to 4 packets per tick. (The difference will grow greatly with more players.)
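The packet counts from the three pictures can be generalised to n players with a bit of arithmetic, sketched here in Python. The peer-to-peer and merged formulas follow directly from the descriptions. The breakdown for the current relay mode is an assumption chosen to match the 8 packets quoted for 3 players (each client still sends one packet per other peer, and the server forwards the ones that are not addressed to itself); the post does not spell it out. The function names are made up.

```python
def peer_to_peer_packets(n):
    """Every peer sends one packet to each of the other n - 1 peers."""
    return n * (n - 1)


def relay_server_packets(n):
    """Assumed breakdown of the current relay mode (not spelled out in the post)."""
    clients = n - 1
    to_server = clients * (n - 1)      # each client still addresses every other peer
    relayed = clients * (clients - 1)  # server forwards the client-to-client packets
    server_own = clients               # the server's own packet to each client
    return to_server + relayed + server_own


def merged_server_packets(n):
    """Each client sends one packet up and receives one merged package back."""
    return 2 * (n - 1)


for n in (3, 10, 50):
    print(n, peer_to_peer_packets(n), relay_server_packets(n), merged_server_packets(n))
```

For 3 players this reproduces the 6, 8 and 4 packets from the captions. For 10 players the same formulas give 90, 162 and 18.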

Clients don't know about other clients (network wise)
As clients don't need to communicate with each other, they don't even need to know about each other's existence in the game. This doesn't mean that you wouldn't know about other players in the game. When a player joins, clients receive a "Player joined" input action as part of the merged package, so the player is created on the map and in the player list, but this is not related to network logic; it is a different layer that works like this already. The difference is that the clients don't need to know which network entity is related to which player. They don't care. Ignorance is bliss!
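From the client's point of view this could look roughly like the following Python sketch. It is only an illustration under assumptions: the `Client` class, the package layout (a dict with `tick` and a list of `(sender, action)` pairs) and the `player_joined` action type are all hypothetical names. The point is that the client only ever talks to the server: it applies packages in tick order, asks for a resend when there is a gap, and learns about new players purely through an input action.

```python
class Client:
    """Hypothetical client: applies merged packages in tick order, knows no other peers."""

    def __init__(self, first_tick=0):
        self.next_tick = first_tick
        self.players = []   # game-level player list, not network peers
        self.buffered = {}  # packages that arrived ahead of a missing one

    def on_package(self, package):
        """Apply what can be applied; return the ticks to ask the server to resend."""
        if package["tick"] >= self.next_tick:
            self.buffered[package["tick"]] = package
        # Apply everything that is now contiguous.
        while self.next_tick in self.buffered:
            self._apply(self.buffered.pop(self.next_tick))
            self.next_tick += 1
        # Anything still buffered sits behind a gap, so those ticks are missing.
        if not self.buffered:
            return []
        return [t for t in range(self.next_tick, max(self.buffered))
                if t not in self.buffered]

    def _apply(self, package):
        for _sender, action in package["actions"]:
            if action["type"] == "player_joined":
                self.players.append(action["name"])
            # Other input actions would be applied to the simulation here.
```

Note that nothing in the client stores addresses or connection state of other players. The player list is plain game state.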
Server is the only Input action authority
The clients also send input actions, but only to the server, and it is up to the server to decide whether they should be included in the merged package or not. As the merged package is the only source of the actions to be applied, the server can safely omit a player from the package if he has a lag spike, so the lag spike is isolated from the rest of the game. This is not possible in the peer to peer model.
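A minimal sketch of this lag spike isolation, in Python. The names (`AuthoritativeServer`, `close_tick`) are made up, and the sketch assumes that actions arriving late are simply packed into a later package, which the post does not specify.

```python
class AuthoritativeServer:
    """Hypothetical server that closes each tick on its own clock."""

    def __init__(self, player_ids):
        self.player_ids = set(player_ids)
        self.inbox = {}  # player_id -> actions waiting to be packed

    def receive(self, player_id, actions):
        self.inbox.setdefault(player_id, []).extend(actions)

    def close_tick(self, tick):
        """Pack whatever has arrived; never wait for a silent client."""
        included = sorted(self.inbox)
        package = {
            "tick": tick,
            "actions": [(pid, a) for pid in included for a in self.inbox[pid]],
        }
        omitted = sorted(self.player_ids - set(included))
        self.inbox = {}
        return package, omitted


server = AuthoritativeServer(player_ids=[1, 2, 3])
server.receive(1, ["build"])
server.receive(3, ["mine"])
package, omitted = server.close_tick(5)  # player 2 is having a lag spike
server.receive(2, ["late action"])       # arrives after the tick was closed
next_package, _ = server.close_tick(6)   # so it goes into the next package
```

The other players never notice that player 2 was late. Only his own actions are delayed.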
Removal of strange freezes on network events
Currently, when a player wants to join a game, the peers first have to agree to stop the game at a certain tick. This tick has to be at least one latency step in the future, as other players could already be ahead of us, and you can't go backwards in the Factorio state (yes, entropy works the same way in the Factorio world). This is the strange freeze that happens when someone is connecting, disconnecting etc. During this time, the new client is introduced to the others so they know they have to account for him.
But as we decided that clients know nothing about other clients, this can be removed completely. Once the server accepts the new player into the game, it can just start sending his actions as part of the merged package without any interruption. The save-game still needs to be uploaded by the server, so there will still be waiting, but there shouldn't be any strange freezes between the download progress bar and the normal game anymore.
Internal code simplification
As all the logic is straightened out, the internal code will actually get much simpler as well. Simpler code means fewer bugs. This should also mean that if we want someone else to tweak the internals of the multiplayer, it shouldn't take him 3 weeks of study to understand what is going on.

The possible improvements (version 2.0)

Once this is all implemented and working, which will take some weeks, we could use this architecture as a reliable base to make additional improvements. These are ideas that shouldn't be that hard to do, but can't be promised.
Individual latency
The latency is now a global parameter of the multiplayer game. It is the delay between the creation of a user action (input action) and its execution. The bigger the latency is, the more time there is to deliver the actions between players, so the game might be less laggy. Big latency is bad for gameplay, small latency is bad for distant players. But with the proper server-client architecture and the server being the only input action authority, everyone can have a different latency. The guy on the same street as the server only needs 30ms to send the package to the server to be included in the next merged package, and another 30ms to get his action back and see it on the screen. But someone on the other side of the world in the same game might need 500ms to send the message, so his actions will just be packed into the merged package with a bigger delay.
The latency of individual players should be tweaked automatically during the game by the server, so it could make sure that it is as small as possible for a flawless game.
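As a rough illustration of how the server could pick a latency per player, here is a small Python sketch. It assumes a simulation rate of 60 ticks per second, which is not stated in this post, and the function name and the `safety_ticks` margin are made up. It only converts a measured round trip into a whole number of ticks.

```python
import math

TICKS_PER_SECOND = 60  # assumed simulation rate; not stated in the post


def latency_ticks(round_trip_ms, safety_ticks=0):
    """Ticks between creating an input action and seeing it executed, for one player."""
    return math.ceil(round_trip_ms * TICKS_PER_SECOND / 1000) + safety_ticks


# Round trips from the example above: 30ms + 30ms and 500ms + 500ms.
round_trips_ms = {"server": 0, "same street": 60, "other side of the world": 1000}
per_player = {name: latency_ticks(rtt) for name, rtt in round_trips_ms.items()}
```

Under these assumptions the neighbour ends up with 4 ticks of latency, the distant player with 60, and the server itself with 0. A real implementation would probably add a margin for jitter (the hypothetical `safety_ticks`) and keep adjusting the value as the measured round trip changes.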
An implication of this would be that the server would have 0 latency. This would be unfair in a competitive game, but in Factorio there is no reason to drag everyone down just to make it fair.
Don't wait for upload
This feature could also be added. When someone joins the game, the server needs to save it, and others have to wait for that. This can't be removed, but once the server starts uploading the game, the other players (including the server) could just continue playing while the server is providing the map. On top of that, the server would save all the actions made by the players in the meantime. Once the map is uploaded, the server would send these actions and the client would fast-forward to catch up. The only limitation is that the client has to be able to run the map at a faster speed, so he can catch up.
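The catch-up limitation can be put into numbers with a small back-of-the-envelope sketch in Python. It assumes the live game keeps running at normal speed during both the upload and the catch-up, and that the joining client can replay at a constant multiple of normal speed. The function name and parameters are made up.

```python
def catch_up_seconds(upload_seconds, speed_factor):
    """Real time the joining client needs to reach the live tick.

    While the client replays the backlog at `speed_factor` times normal speed,
    the live game keeps advancing at 1x, so the gap only closes at
    (speed_factor - 1) seconds of game time per second of real time.
    """
    if speed_factor <= 1:
        raise ValueError("client must simulate faster than real time to catch up")
    return upload_seconds / (speed_factor - 1)


# A 60 second upload, replayed at 2x and at 4x normal speed.
at_double_speed = catch_up_seconds(60, 2)
at_quadruple_speed = catch_up_seconds(60, 4)
```

So a client that can only run the map at twice the normal speed needs another 60 seconds after a 60 second upload, while at four times the speed it needs 20 seconds. A client that can barely keep up at normal speed never catches up at all.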
Auto kick based on network speed or CPU limits
Apart from the latency tweaking based on the network round-trip time of individual peers, the server could also measure slowdown related to CPU lag. CPU lag means that some of the players' computers are not able to simulate the Factorio map fast enough, so others have to slow down their simulation and wait for them. It should be possible to set an option to auto-kick players who drag down the game too much. A similar limitation could be applied to upload speed.
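One way such an option could work is sketched below in Python. Everything here is hypothetical: the `LagMonitor` class, the window of recent ticks and the threshold are illustrative values, not planned settings. The idea is that the server records for each tick whether the game had to wait for a given player, and only considers kicking once it has a full window of samples.

```python
from collections import deque


class LagMonitor:
    """Hypothetical tracker of how often each player held the game back."""

    def __init__(self, window=600, max_late_fraction=0.2):
        self.window = window                        # number of recent ticks to look at
        self.max_late_fraction = max_late_fraction  # tolerated share of late ticks
        self.history = {}                           # player_id -> deque of booleans

    def record(self, player_id, was_late):
        """Note whether the game had to wait for this player on the current tick."""
        samples = self.history.setdefault(player_id, deque(maxlen=self.window))
        samples.append(was_late)

    def players_to_kick(self):
        """Players with a full window of samples and too many late ticks."""
        return sorted(
            pid for pid, samples in self.history.items()
            if len(samples) == self.window
            and sum(samples) / len(samples) > self.max_late_fraction
        )
```

Waiting for a full window keeps a single hiccup right after joining from getting someone kicked.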

Translations for Factorio

Scott and Mishka have been spending this week working through Crowdin, the website we use to allow the community to help us translate the game. To date, the community on Crowdin has helped us translate the game into dozens of languages, along with the subtitles for the trailer and other related media.
We'd like to take an opportunity to thank all the translators, as their contribution really has a great impact on the game, and their perspective often helps us in understanding how terms and descriptions we write in the game are interpreted by the players.
We still have many untranslated languages on Crowdin, so if you think you might be able to help out with the translation for your language, please check out the Factorio project.

As always, let us know what you think on our forums