Machine Learning on DOTA 2 Statistics
DOTA 2 is a heavily played game, with 640,227 average players in June of 2016. 1 Recently, I had the chance to do an independent study at the University of Missouri – Kansas City looking at the YASP Dota 2 Dataset. 2 I tried to answer two questions: how accurately can one predict which team will win based on the teams' initial choices of heroes, and how does the amount of resources acquired at different points in the game affect the likelihood of winning?
Reason For Doing This Project
I am not someone who regularly plays DOTA 2, but I have many friends who do, and there is a large community and plenty of public data. I think this made it a good choice for a machine learning project, since I have minimal domain-specific knowledge and the data was widely accessible. Additionally, having people who are interested in the results and like to chat about the project made gathering knowledge easier than it would be for something like a finance or crime project.
Dataset and Filters
For a dataset, I initially looked at the YASP 3.5 Million Data Dump 3 but ended up mostly using the YASP December 2015 500k Data Dump 4 since it was smaller and easier to deal with. I filtered the data to keep only games with 10 human players where none of the players had leaver status. 5 This was done to remove some variance from the data while still preserving a wide range of games. Unlike most other projects doing similar things, I did not filter out games in the low skill bracket.
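The filter described above can be sketched as follows. The field names `human_players` and `leaver_status` are my assumptions about the YASP JSON layout (with `leaver_status` 0 meaning the player stayed for the whole game), so treat this as an illustration rather than the project's actual filtering code.

```python
def keep_match(match):
    """Keep only full 10-human games where no player has leaver status."""
    if match.get("human_players", 0) < 10:
        return False
    return all(p.get("leaver_status", 0) == 0 for p in match.get("players", []))

# Three toy matches: one clean, one short a player, one with a leaver.
matches = [
    {"human_players": 10, "players": [{"leaver_status": 0}] * 10},
    {"human_players": 9,  "players": [{"leaver_status": 0}] * 9},
    {"human_players": 10, "players": [{"leaver_status": 0}] * 9 + [{"leaver_status": 1}]},
]
filtered = [m for m in matches if keep_match(m)]
```

Only the first toy match survives the filter.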
Accuracy of Prediction in Games given GPM 6 and XPM 7 History To That Minute
This question explores the link between a team's resource-acquisition history and how accurately the winner can be predicted. DOTA 2 is largely considered a game decided by resource advantages. 8 For each minute, I built a logistic regression model on the history of the GPM and XPM at every minute up to that point, and used it to predict the outcome of the game. Each model used 5000 samples and was validated with 10-fold cross-validation. These weren't necessarily the same 5000 samples, since each model takes the first 5000 games long enough to contain that many minutes, so a 40-minute game wouldn't be considered at 60 minutes.
The model's vectors are:
X_t = the GPM and XPM values at minute t, for each t up to T
Y = 1 if Radiant won, 0 otherwise
where T is a particular minute of the game, and max_time is the maximum amount of time considered.
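The per-minute modelling loop can be sketched like this. The data layout (per-game, per-minute GPM and XPM arrays) and the use of `cross_val_score` are my assumptions standing in for the project's actual code; the random toy data is only there to make the sketch runnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy stand-in for the real data: one GPM and one XPM value per minute
# for each game.  In the real project these came from the YASP dump.
n_games, max_time = 200, 60
gpm = rng.normal(400, 50, size=(n_games, max_time))
xpm = rng.normal(450, 50, size=(n_games, max_time))
y = rng.integers(0, 2, size=n_games)  # 1 if Radiant won, 0 otherwise

def accuracy_at_minute(t):
    """Fit logistic regression on the GPM/XPM history up to minute t,
    scored with 10-fold cross-validation."""
    X = np.hstack([gpm[:, :t], xpm[:, :t]])
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=10).mean()

acc = accuracy_at_minute(30)
```

On random labels like these the score hovers near chance; on real match data it should rise with t, as the graph below shows.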
[Graph: prediction accuracy versus minutes into the game]
There appears to be a fairly linear relationship between prediction accuracy and minutes into the game until about 30 minutes in, when it starts to level off. This could hint that the first 30 minutes of the game are the most important in terms of resources.
Predicting Win/Loss Using Initial Hero Picks
Besides resources, team composition and hero picks are another important factor in determining a team's success in a DOTA 2 match. I used a few different models, each trained on the following vectors:
Xi = 1 if heroID i was in the match on radiant side, 0 otherwise
X(i+113) = 1 if heroID i was in the match on dire side, 0 otherwise
Y = 1 if radiant won, 0 otherwise
which leads to a 226-element binary feature vector. 9 The number 113 was chosen because there are 113 heroIDs in DOTA 2. There may have been fewer heroes in the dataset I considered, but constant 0s shouldn't have an effect on the output.
Models
Logistic Regression
Logistic Regression performed fairly well and was very quick to train. I used most of the SKlearn defaults for LogisticRegression but changed the number of jobs for the fitting routine to match the number of cores I wanted to use. 10
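A minimal sketch of that setup follows, using sklearn defaults apart from `n_jobs` (−1 uses all available cores). The synthetic data is a placeholder for the real 226-element pick vectors, and `max_iter` is raised only so the sketch converges cleanly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data shaped like the hero-pick vectors (226 features).
X, y = make_classification(n_samples=500, n_features=226, random_state=0)

# Defaults except the number of jobs for the fitting routine.
clf = LogisticRegression(n_jobs=-1, max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10)
mean_acc = scores.mean()
```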
The following graph shows the accuracy of predicting a game from the feature vector as the number of samples grows, along with the standard deviation.
Accuracy was slightly higher with fewer samples, but plateaued again at about 2500 samples, with a lower standard deviation. The sharp spikes at the beginning suggest to me that the model was undertrained and happened to 'get lucky' on the data it encountered.
K-Nearest Neighbors
I tried K-Nearest Neighbors a few different ways with varying results. KNN also has a parameter K, which needed to be found for every model but turned out to be about 48 for each. Because the feature space was high-dimensional relative to the number of samples, KNN did not perform as well as it could have.
K-Nearest Neighbours, Sklearn Default Settings
I ran a program to find the optimal number of neighbours on 5000 samples; accuracy peaked at around 48 neighbours.
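The search over K can be sketched like this; the step size, range, and synthetic data are my choices for illustration, not the parameters the original program used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data in place of the real 5000-sample pick vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Score a range of K values with cross-validation and keep the best.
ks = list(range(1, 61, 5))
scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in ks
]
best_k = ks[int(np.argmax(scores))]
```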
After finding the optimal number of neighbours, I looked at how the model performed with different numbers of samples.
K-Nearest Neighbours, Distance Weighting
SKlearn has an optional weights 11 parameter which controls how much weight each neighbour has. Setting this to ‘distance’ weights points according to how far away they are and provided a boost in accuracy. This is probably because the factor space is too big for the number of samples, so matches which are far away are getting considered too heavily without a weighting metric.
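Switching the weighting on is a one-parameter change, sketched below on placeholder data; the comparison against the uniform default is mine, added to show where the accuracy boost would appear.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data in place of the real pick vectors.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# K = 48 as found above; 'distance' down-weights far-away neighbours.
uniform = KNeighborsClassifier(n_neighbors=48, weights="uniform")
distance = KNeighborsClassifier(n_neighbors=48, weights="distance")

acc_uniform = cross_val_score(uniform, X, y, cv=5).mean()
acc_distance = cross_val_score(distance, X, y, cv=5).mean()
```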
Looking at neighbours versus accuracy for this model gives a similar graph to the non-weighted version.
In terms of accuracy versus number of samples, it performs slightly better than the unweighted version, but not as well as Logistic Regression.
K-Nearest Neighbours, Kevin Technologies Custom Weighting and Distance Metric
Kevin Technologies reported improvements using a custom weighting function described here. I tried to replicate their work, but I could not get any good results, as shown in the following graph. This may very well be a result of me implementing it incorrectly, or of them training only on high-skill games.
Neural Networks
I tried to implement a neural network because I believed it would do well in a high-dimensional feature space and capture hero relationships, but I couldn't get accuracy above 53%, probably because I only have a very basic understanding of them. Exploring neural networks could be a project for the future.
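As a starting point for that future work, a basic network on the pick vectors might look like the sketch below. The architecture (one small hidden layer via sklearn's MLPClassifier) and the synthetic data are entirely my assumptions; the post doesn't record what configuration was actually tried.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Placeholder data shaped like the 226-element pick vectors.
X, y = make_classification(n_samples=500, n_features=226, random_state=0)

# One small hidden layer as a hedged starting point.
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
acc = cross_val_score(net, X, y, cv=5).mean()
```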
Tools Used
Special Thanks
Professor Eddie Burris – Helping me and overseeing the project
Nathan Walker – Consulting with me and providing domain specific knowledge
Code Listing
Note: This does not show everything in this post, simply because I would change things around and overwrite the file.
Update January 8, 2017: I received an email saying the code didn't run. I changed it so it hopefully does now.
http://dev.dota2.com/showthread.php?t=105752 The best explanation of leaver status I could find. ↩
Gold Per Minute. Gold is used to buy items to power up a player's hero. http://dota2.gamepedia.com/Gold ↩
Experience Per Minute. Experience is used to build a hero to be stronger and learn new abilities. http://dota2.gamepedia.com/Experience ↩
jmcauley/cse255/reports/fa15/018.pdf Page 6, Table 1 ↩

