Distributed Word Embedding (DWE)


The DWE tool is a parallelization of the Word2Vec[1] algorithm on top of our DMTK parameter server. It provides an efficient "scaling to industry size" solution for word embedding.

The DWE tool runs in the following manner:
On the client side (running on multiple nodes), three local training steps are executed repeatedly:
1. Get the latest parameters from the DMTK parameter server
2. Run the CBOW/Skip-gram algorithm to generate updates to the current parameters
3. Send the parameter updates to the DMTK parameter server

On the server side, the DMTK parameter server acts as follows (a combined sketch of both sides is given after this list):
1. Pack the requested parameters and send them to clients
2. Aggregate parameter updates from different clients and merge them into the global parameters
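
Below is a minimal, single-process sketch of this interaction, assuming a toy in-memory table and made-up names (ServerGetRows, ServerApplyDeltas, ClientProcessBlock); it is not the actual DWE/DMTK API. In the real system the parameter table is partitioned across several server machines and is accessed over the network.

    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    // Toy, single-process sketch of the client/server interaction above.
    // All names are illustrative, not the actual DWE/DMTK parameter-server API.
    using Row = std::vector<float>;              // one embedding vector
    using Table = std::unordered_map<int, Row>;  // word id -> embedding row

    Table g_server_table;  // stands in for the (distributed) global parameters

    // Server, request handling: pack the requested rows and return them.
    Table ServerGetRows(const std::vector<int>& word_ids, int dim) {
      Table slice;
      for (int id : word_ids) {
        if (!g_server_table.count(id)) g_server_table[id] = Row(dim, 0.0f);
        slice[id] = g_server_table[id];
      }
      return slice;
    }

    // Server, update handling: aggregate a client's deltas into the global parameters.
    void ServerApplyDeltas(const Table& deltas) {
      for (const auto& kv : deltas) {
        Row& row = g_server_table[kv.first];
        if (row.size() < kv.second.size()) row.resize(kv.second.size(), 0.0f);
        for (size_t i = 0; i < kv.second.size(); ++i) row[i] += kv.second[i];
      }
    }

    // Client: the three local training steps, run once per data block.
    void ClientProcessBlock(const std::vector<int>& block_vocab, int dim) {
      Table params = ServerGetRows(block_vocab, dim);  // 1. pull the latest parameters
      Table deltas;                                    // 2. run CBOW/Skip-gram locally;
      for (const auto& kv : params)                    //    faked here as a constant delta
        deltas[kv.first] = Row(dim, 0.01f);
      ServerApplyDeltas(deltas);                       // 3. push the updates back
    }

    int main() {
      ClientProcessBlock({1, 2, 3}, 4);  // one "data block" touching three words
      std::printf("row 1, dim 0 = %f\n", g_server_table[1][0]);
      return 0;
    }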

Why DWE?


The DWE tool is highly scalable and efficient, and can therefore be used to train very large-scale word embeddings. This is powered by the DMTK framework:

  1. The DMTK parameter server stores the parameters in a distributed way, which means that each machine holds only a partition of the entire parameter set. This allows the entire embedding model to be very large. For example, in our experiment on the ClueWeb data, the vocabulary size is 21 million and the parameter size reaches 6 billion, which, as far as we know, is the largest word embedding model reported in the literature.

  2. The training process in the clients is conducted in a streaming manner and is automatically pipelined. Specifically, during training the data are processed block by block. For each block, the client software goes through the three steps described above. The parameter request and model training steps of successive data blocks are pipelined so as to hide the delay caused by network communication (see the sketch below). Furthermore, in this way the clients only need to hold the parameters for a few data blocks at a time, which keeps memory usage very low.
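
As a rough illustration of this pipelining, the sketch below overlaps the parameter request for block i+1 with the training of block i using std::async. All types and functions here are illustrative stand-ins, not the actual DWE client code, which relies on the DMTK communication layer instead.

    #include <cstdio>
    #include <functional>
    #include <future>
    #include <vector>

    // Illustrative stand-ins; not the actual DWE client code.
    struct Block  { int id; };
    struct Params { int for_block; };

    // Step 1 stub: in DWE this is a request to the DMTK parameter server.
    Params RequestParams(const Block& b) { return Params{b.id}; }
    // Step 2 stub: in DWE this runs CBOW/Skip-gram over the block.
    Params TrainAndGetDeltas(const Block& b, const Params& p) { (void)b; return p; }
    // Step 3 stub: in DWE this sends the deltas back to the parameter server.
    void PushDeltas(const Params& d) { std::printf("pushed deltas for block %d\n", d.for_block); }

    void PipelinedTraining(const std::vector<Block>& blocks) {
      if (blocks.empty()) return;
      // Prefetch the parameters for the first block.
      std::future<Params> next =
          std::async(std::launch::async, RequestParams, std::cref(blocks[0]));
      for (size_t i = 0; i < blocks.size(); ++i) {
        Params params = next.get();              // parameters for block i have arrived
        if (i + 1 < blocks.size())               // hide latency: request block i+1 now,
          next = std::async(std::launch::async,  // while block i is being trained
                            RequestParams, std::cref(blocks[i + 1]));
        PushDeltas(TrainAndGetDeltas(blocks[i], params));
      }
    }

    int main() {
      PipelinedTraining({{0}, {1}, {2}});
      return 0;
    }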

Downloading


To download the source code of DWE, please run
$ git clone http://github.com/Microsoft/WordEmbedding
Please note that Distributed Word Embedding is implemented in C++ for performance reasons.

Installation


Prerequisite

DWE is built on top of the DMTK parameter server, so please download and build that project first.

For Windows

  1. Download and build the dependencies
  2. Open sln/WordEmbedding.sln using Visual Studio 2013 and build all the projects

Ubuntu (tested on Ubuntu 12.04)

  1. Download and build the dependencies by running $ sh script/install_dep.sh
  2. Modify the include and lib paths in the Makefile
  3. Run $ make all -j4

Running DWE


Training on a single machine

  1. Initialize the settings in run.bat according to your preferences
  2. Run run.bat in the solution directory

Training in a distributed setting

Using MPI:
  1. Create a host.txt file listing all the machines to be used for training
  2. Split your dataset into several parts and store them in the same directory on these machines
  3. Distribute the same executable file to the same directory on these machines
  4. Run the command "smpd.exe -d -p port" on every machine
  5. Run run.bat on one of the machines with host.txt as its argument
Using ZMQ:
  1. Compile the library of the DMTK parameter server, specifying the communication mode as ZMQ
  2. Compile the Multiverso.Server project, which produces the executable Multiverso.Server.exe
  3. Prepare a configuration file end_points.txt that describes the server endpoints
  4. Add a parameter setting in run.bat, e.g., 'set _endpoint_file=config.txt'
  5. Start Multiverso.Server.exe on each server machine with the appropriate command line arguments (run Multiverso.Server.exe -help for further information)
  6. Execute run.bat on one of the machines with end_points.txt as its argument

Configuration

For the word embedding algorithms, we have implemented both CBOW and Skip-gram. For the output layer, we support both negative sampling and hierarchical softmax. Users can specify their desired setting through the parameters in run.bat.
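
As a reference for what the Skip-gram + negative sampling option computes, the sketch below applies one update for a single (center, context) pair plus a few negative samples, following [1]; the row-major layout and all names here are illustrative and not taken from the DWE code.

    #include <cmath>
    #include <utility>
    #include <vector>

    // One Skip-gram + negative-sampling update for a (center, context) pair,
    // following [1]. Layout and names are illustrative, not the DWE code.
    static float Sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

    // input:  |V| x dim input (word) embeddings, row-major in one flat vector
    // output: |V| x dim output (context) embeddings, same layout
    void SgnsUpdate(std::vector<float>& input, std::vector<float>& output, int dim,
                    int center, int context, const std::vector<int>& negatives, float lr) {
      std::vector<float> grad_center(dim, 0.0f);
      // The positive target (label 1) followed by the sampled negatives (label 0).
      std::vector<std::pair<int, float>> targets{{context, 1.0f}};
      for (int neg : negatives) targets.push_back({neg, 0.0f});

      float* in = &input[center * dim];
      for (const auto& t : targets) {
        float* out = &output[t.first * dim];
        float dot = 0.0f;
        for (int d = 0; d < dim; ++d) dot += in[d] * out[d];
        float g = lr * (t.second - Sigmoid(dot));  // scaled gradient of the log-loss
        for (int d = 0; d < dim; ++d) {
          grad_center[d] += g * out[d];            // accumulate gradient for the center word
          out[d] += g * in[d];                     // update the context / negative row
        }
      }
      for (int d = 0; d < dim; ++d) in[d] += grad_center[d];
    }

    int main() {
      const int vocab = 10, dim = 8;
      std::vector<float> input(vocab * dim, 0.1f), output(vocab * dim, 0.1f);
      SgnsUpdate(input, output, dim, /*center=*/3, /*context=*/5, /*negatives=*/{1, 7}, 0.025f);
      return 0;
    }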

For distributed training, users can configure the size of the data block and the mechanism for parameter update (such as ASP - Asynchronous Parallel, SSP - Stale Synchronous Parallel, BSP - Bulk Synchronous Parallel, and MA - Model Average) by setting the corresponding parameters in run.bat. For more details, please refer to the documentation of the DMTK parameter server.
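
These update mechanisms differ mainly in how far a worker may run ahead of the slowest one. The sketch below captures that difference conceptually; the enum and function are hypothetical helpers, not the DMTK parameter-server API, and the MA case is simplified to a per-block synchronization purely for illustration.

    #include <cstdio>

    // Conceptual illustration only; this enum and function are hypothetical
    // and are not part of the DMTK parameter-server API.
    enum class SyncMode { ASP, SSP, BSP, MA };

    // May a worker that has finished `my_clock` data blocks start the next one,
    // given that the slowest worker has finished `min_clock` blocks and the
    // allowed staleness bound is `s`?
    bool MayProceed(SyncMode mode, int my_clock, int min_clock, int s) {
      switch (mode) {
        case SyncMode::ASP: return true;                       // never wait for other workers
        case SyncMode::SSP: return my_clock - min_clock <= s;  // wait only beyond the staleness bound
        case SyncMode::BSP: return my_clock == min_clock;      // wait for everyone after every block
        case SyncMode::MA:  return my_clock == min_clock;      // shown like BSP: whole models are
                                                               // averaged at the synchronization point
      }
      return false;
    }

    int main() {
      // A worker two blocks ahead may proceed under SSP with staleness bound 3.
      std::printf("%d\n", MayProceed(SyncMode::SSP, /*my_clock=*/7, /*min_clock=*/5, /*s=*/3));
      return 0;
    }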

Performance


We report the performance of the DWE tool on the English versions of Wiki2014 [3] and Clueweb09 [4]. The statistics* of these datasets and the performance of DWE are given below. The experiments were run on 20 cores of an Intel Xeon E5-2670 CPU on each machine.

             Dataset         Tokens           Vocabulary size  Embedding dim.  Machines  Training time/epoch (s)  Analogy accuracy [1]  ws353 Spearman corr. [2]
Word2Vec[1]  Wiki2014 (en)   3,402,883,423    2,043,680        300             1         2,617                     60.9%                 61.5%
DWE          Clueweb09 (en)  143,820,387,816  10,784,180       300             8         144,000                   62.6%                 68.6%

* The dataset statistics were obtained after data preprocessing.

Remarks

  • All the above experiments were run with the CBOW + negative sampling configuration. For DWE, ASP was used as the parameter update mechanism and the data block size was set to 1 GB.
  • Given that the Clueweb09 dataset is very large, we only went through the data once during the training process (one training epoch). For the Wiki2014 dataset, the results were obtained by going through 20 epochs.
  • The results show that, by leveraging the DMTK framework, DWE achieves a good speed-up compared with the original Word2Vec implementation.

References


[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Word similarity task
[3] Wiki2014 dataset
[4] Clueweb09 dataset