Distributed Multi-sense Word Embedding (DMWE)

The DMWE tool is a parallelization of the Skip-Gram Mixture [2] algorithm on top of the DMTK parameter server. It provides an efficient, industry-scale solution for multi-sense word embedding.

The DMWE tool runs as follows.
On the client side (running on multiple nodes), three local training steps are executed repeatedly:
1. Get the latest parameters from the DMTK parameter server
2. Run the Skip-Gram Mixture algorithm [2] to generate updates to the current parameters
3. Send the parameter updates to the DMTK parameter server

On the server side, the DMTK parameter server does the following:
1. Pack the requested parameters and send them to clients
2. Aggregate parameter updates from different clients and merge them into the global parameters
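
The client and server steps above can be sketched in a few lines of Python. This is a toy, in-process illustration only: ToyParameterServer, local_update, and client_iteration are hypothetical names, not the DMTK API, and the update rule is a placeholder rather than the Skip-Gram Mixture algorithm.

```python
class ToyParameterServer:
    """In-process stand-in for a parameter server (hypothetical, not DMTK)."""

    def __init__(self, num_rows, dim):
        self.table = [[0.0] * dim for _ in range(num_rows)]

    def get(self, rows):
        # Server step 1: pack the requested rows and return copies to the client.
        return {r: list(self.table[r]) for r in rows}

    def add(self, deltas):
        # Server step 2: merge a client's updates into the global parameters.
        for r, delta in deltas.items():
            row = self.table[r]
            for i, d in enumerate(delta):
                row[i] += d


def local_update(params, lr=0.1):
    # Placeholder for the Skip-Gram Mixture step. A real client would compute
    # gradients from its data block; here each value is pulled toward 1.0.
    return {r: [lr * (1.0 - v) for v in vec] for r, vec in params.items()}


def client_iteration(server, rows):
    params = server.get(rows)       # 1. get the latest parameters
    deltas = local_update(params)   # 2. compute updates locally
    server.add(deltas)              # 3. send the updates to the server


server = ToyParameterServer(num_rows=4, dim=2)
# Two "clients" that touch overlapping rows, one iteration each.
client_iteration(server, [0, 1])
client_iteration(server, [1, 2])
print(server.table)
```

Row 1 is requested by both clients, so it receives two merged updates, while row 3 is never requested and stays at its initial value.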


Word2vec [1] uses a single embedding vector for each word, which is not enough to express the multiple meanings of polysemous words. To solve this problem, the Skip-Gram Mixture model was proposed to produce multiple embedding vectors for polysemous words [2]. However, computing multiple vectors per word is computationally expensive, so we developed the Distributed Multi-sense Word Embedding (DMWE) tool, which is highly scalable and efficient. It can be used to train multi-sense embedding vectors on very large datasets. The training process is powered by the DMTK framework:

  1. The DMTK parameter server stores the parameters in a distributed way, which means that each machine holds only a partition of the entire parameter set. This allows the full set of embedding vectors to be very large. For example, in our experiment on the ClueWeb data, the vocabulary size is 21 million and the number of parameters exceeds 2 billion.

  2. The training process on the clients is conducted in a streaming manner and is automatically pipelined. Specifically, during training the data are processed block by block. For each block, the client goes through the three steps described above. The parameter request and model training steps of successive data blocks are pipelined to hide the delay caused by network communication. As a result, each client only needs to hold the parameters for a few data blocks at a time, which keeps memory usage low.
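
The pipelining idea in item 2 can be illustrated with a background thread that requests the parameters for the next block while the current block is being trained. This is a minimal sketch under stated assumptions: fetch_params and train_on_block are hypothetical stand-ins for the network request and the local training step, and the actual DMWE client is written in C++.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_params(block):
    # Stand-in for a request to the parameter server: return one (dummy)
    # parameter per distinct word id that appears in the block.
    return {word: 0.0 for word in set(block)}


def train_on_block(block, params):
    # Stand-in for local training on one data block.
    return sum(block) + sum(params.values())


def pipelined_training(blocks):
    results = []
    if not blocks:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Request parameters for the first block before training starts.
        pending = pool.submit(fetch_params, blocks[0])
        for i, block in enumerate(blocks):
            params = pending.result()  # wait for this block's parameters
            if i + 1 < len(blocks):
                # Start fetching the next block's parameters now, so the
                # request overlaps with training on the current block.
                pending = pool.submit(fetch_params, blocks[i + 1])
            results.append(train_on_block(block, params))
    return results


blocks = [[1, 2], [3, 4], [5, 6]]
results = pipelined_training(blocks)
print(results)
```

At any moment the loop holds parameters for at most two blocks (the one being trained and the one being fetched), which mirrors the memory argument made above.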


To download the source code of DMWE, run
$ git clone https://github.com/Microsoft/distributed_skipgram_mixture
Please note that DMWE is implemented in C++ for performance reasons.



DMWE is built on top of the DMTK parameter server, so please download and build that project first.

For Windows

  1. Open windows\distributed_skipgram_mixture\distributed_skipgram_mixture.sln using Visual Studio 2013. Add the necessary include path (for example, the path for DMTK multiverso) and lib path. Then build the solution.

For Ubuntu (Tested on Ubuntu 12.04)

  1. Download and build by running $ sh scripts/build.sh. Modify the include and lib paths in the Makefile, then run $ make all -j4.

Running DMWE

Training on a single machine

  1. Set the options in run.py according to your needs
  2. Run run.py in the solution directory

Training in a distributed setting

Using MPI:
  1. Create a host.txt file listing all the machines to be used for training
  2. Split your dataset into several parts and store them in the same directory on each of these machines
  3. Copy the same executable file into the same directory on each of these machines
  4. Run the command "smpd.exe -d -p port" on every machine
  5. Run run.py on one of the machines with host.txt as its argument
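
Step 2 above (splitting the dataset) can be done with any tool. As one possible sketch, the helper below distributes the lines of a text corpus round-robin into several part files. The function name split_corpus, the part_N.txt naming, and the assumption that the corpus has one sentence or document per line are illustrative choices, not requirements of DMWE.

```python
import os
import tempfile


def split_corpus(input_path, output_dir, num_parts):
    """Write the lines of input_path round-robin into num_parts files."""
    os.makedirs(output_dir, exist_ok=True)
    paths = [os.path.join(output_dir, f"part_{i}.txt") for i in range(num_parts)]
    outputs = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(input_path, encoding="utf-8") as src:
            for line_no, line in enumerate(src):
                outputs[line_no % num_parts].write(line)
    finally:
        for out in outputs:
            out.close()
    return paths


# Demo on a tiny temporary corpus of 10 lines split into 4 parts.
workdir = tempfile.mkdtemp()
corpus = os.path.join(workdir, "corpus.txt")
with open(corpus, "w", encoding="utf-8") as f:
    for i in range(10):
        f.write(f"sentence {i}\n")

parts = split_corpus(corpus, os.path.join(workdir, "parts"), num_parts=4)
line_counts = []
for p in parts:
    with open(p, encoding="utf-8") as f:
        line_counts.append(sum(1 for _ in f))
print(line_counts)
```

Round-robin assignment keeps the parts close to equal in size without reading the corpus twice. Each part would then be copied to the agreed directory on its machine.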
Using ZMQ:
  1. Compile the DMTK parameter server library with the communication mode set to ZMQ
  2. Compile the project Multiverso.Server to get the executable Multiverso.Server.exe
  3. Prepare a configuration file end_points.txt that describes the server endpoints
  4. Add a parameter setting in run.py, e.g., '_endpoint_file=end_points.txt'
  5. Start Multiverso.Server.exe on each server machine with appropriate command line arguments (run Multiverso.Server.exe -help for further information)
  6. Run run.py on one of the machines with end_points.txt as its argument

Algorithm configuration for DMWE

For the Skip-Gram Mixture word embedding algorithm, we provide hyperparameters such as the embedding size, the number of polysemous words, the number of senses, and others. You can specify their values in run.py.

For distributed training, users can configure the size of the data block and the mechanism for parameter update (such as ASP - Asynchronous Parallel, SSP - Stale Synchronous Parallel, BSP - Bulk Synchronous Parallel, and MA - Model Average) by setting the parameters in run.bat. For more details, please refer to the documentation of the DMTK parameter server.
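
To make the difference between these update mechanisms concrete, the sketch below shows the waiting rule usually associated with SSP, with BSP and ASP as its two extremes, plus a simple model-averaging merge. It describes the general techniques only. The names may_proceed and model_average are hypothetical, and how DMTK implements or exposes these modes is not shown here.

```python
import math


def may_proceed(worker_clock, all_clocks, staleness):
    """Return True if a worker may start its next iteration.

    Stale-synchronous rule: a worker may be at most `staleness` iterations
    ahead of the slowest worker.
    """
    return worker_clock - min(all_clocks) <= staleness


BSP = 0          # staleness 0: everyone waits for the slowest worker
ASP = math.inf   # unbounded staleness: workers never wait


def model_average(models):
    """MA-style merge: element-wise mean of the clients' parameter vectors."""
    n = len(models)
    return [sum(values) / n for values in zip(*models)]


clocks = [5, 3, 4]  # iterations completed by three hypothetical workers
print(may_proceed(5, clocks, BSP))   # fastest worker under BSP
print(may_proceed(5, clocks, 2))     # fastest worker under SSP, staleness 2
print(may_proceed(5, clocks, ASP))   # fastest worker under ASP
print(model_average([[1.0, 2.0], [3.0, 4.0]]))
```

With a larger staleness bound, fast workers wait less but compute on older parameters, so the choice trades throughput against update freshness.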

The details of all the parameters in run.py are explained in parameters_setting.txt.


We report the performance of the DMWE tool on the English versions of Wiki2014 [4] and Clueweb09 [5]. The statistics* of these datasets and the performance of DMWE are given below. The experiments were run on 20 cores of an Intel Xeon E5-2670 CPU on each machine.

Method          Dataset        Token#           Vocabulary size  Embedding dimension  Machine#  Training time / epoch (seconds)  Spearman's Rank Correlation on Word Similarity in Context [3]
Word2Vec [1]    Wiki2014(en)   3,402,883,423    2,043,680        50                   1         14,305                           0.5505
SG-Mixture [2]  Wiki2014(en)   3,402,883,423    2,043,680        50                   1         21,779                           0.5695
DMWE            Wiki2014(en)   3,402,883,423    2,043,680        50                   4         7,734                            0.5709
DMWE            Clueweb09(en)  143,820,387,816  10,784,180       50                   8         162,416                          0.5996

* The dataset statistics were obtained after data preprocessing.


  • For a fair comparison, Word2Vec is configured as Skip-Gram + Hierarchical Softmax. For DMWE, ASP was used as the mechanism for parameter update. The data block size is set to 50k for Wiki2014 and 750k for Clueweb09.
  • Given that the Clueweb09 dataset is very large, we only went through the data once during training (one training epoch). For the Wiki2014 dataset, the results were obtained after 20 epochs.
  • The results show that DMWE achieves a good speedup over its single-machine version by leveraging the DMTK framework: on Wiki2014, an epoch takes 7,734 seconds on 4 machines versus 21,779 seconds for single-machine SG-Mixture (about 2.8x faster), with a comparable correlation score (0.5709 vs. 0.5695).
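
The speedup can be checked directly from the table. The short calculation below uses only the reported numbers. Parallel efficiency is defined here as speedup divided by the number of machines, and tokens per second is a derived figure that is not reported in the original results.

```python
# Per-epoch training times from the table (seconds).
sg_mixture_time = 21_779   # SG-Mixture, Wiki2014, 1 machine
dmwe_wiki_time = 7_734     # DMWE, Wiki2014, 4 machines
dmwe_clue_time = 162_416   # DMWE, Clueweb09, 8 machines

wiki_tokens = 3_402_883_423
clue_tokens = 143_820_387_816

speedup = sg_mixture_time / dmwe_wiki_time
efficiency = speedup / 4   # 4 machines were used for DMWE on Wiki2014

wiki_throughput = wiki_tokens / dmwe_wiki_time
clue_throughput = clue_tokens / dmwe_clue_time

print(f"speedup over single-machine SG-Mixture: {speedup:.2f}x")
print(f"parallel efficiency on 4 machines: {efficiency:.0%}")
print(f"Wiki2014 throughput: {wiki_throughput:,.0f} tokens/s")
print(f"Clueweb09 throughput: {clue_throughput:,.0f} tokens/s")
```

This gives a speedup of about 2.8x on 4 machines, or roughly 70% parallel efficiency.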


[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.
[2] Tian, F., Dai, H., Bian, J., Gao, B., Zhang, R., Chen, E., & Liu, T. Y. (2014). A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING (pp. 151-160).
[3] Huang, E. H., Socher, R., Manning, C. D., & Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1 (pp. 873-882).
[4] https://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki
[5] http://www.lemurproject.org/clueweb09.php