Distributed Machine Learning Toolkit #

Distributed machine learning has become more important than ever in this big data era. Especially in recent years, practices have demonstrated the trend that bigger models tend to generate better accuracies in various applications. However, it remains a challenge for common machine learning researchers and practitioners to learn big models, because the task usually requires a large number of computation resources. In order to enable the training of big models using just a modest cluster and in an efficient manner, we release the Microsoft Distributed Machine Learning Toolkit (DMTK), which contains both algorithmic and system innovations. These innovations make machine learning tasks on big data highly scalable, efficient and flexible.

The current version of DMTK includes the following components (more components will be added to the future versions):

• DMTK Framework: a flexible framework that supports unified interface for data parallelization, hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency.

• LightLDA, an extremely fast and scalable topic model algorithm, with a O(1) Gibbs sampler and an efficient distributed implementation.

• Distributed (Multisense) Word Embedding, a distributed version of (multi-sense) word embedding algorithm.

Machine learning researchers and practitioners can also build their own distributed machine learning algorithms on top of our framework with small modifications to their existing single-machine algorithms.

We believe that in order to push the frontier of distributed machine learning, we need the collective effort from the entire community, and need the organic combination of both machine learning innovations and system innovations. This belief strongly motivates us to open source the DMTK project.

DMTK Framework #

This framework includes the following components:

• A parameter server that supports

o Hybrid data structure for model storage, which uses separate data structures for high- and low-frequency parameters, and thus achieves outstanding balance between memory capacity and access speed.

o Acceptance and aggregation of updates from local workers, which supports several different mechanisms for model sync-up, including BSP, ASP, SSP, in a unified manner.

• A client SDK that supports

o Local model cache, which syncs up with the parameter server only when necessary, thus greatly reduces the communication cost.

o Pipeline between local training and model communication, which enables very high training speed regardless of various conditions of computational resources and network connections.

o Scheduling big model training in a round-robin fashion, which allows each worker machine to pull the sub-models as needed from the parameter server, resulting in a frugal use of limited memory capacity and network bandwidth to support very big models.

LightLDA #

LightLDA is a new, highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers.

In the distributed implementation, we leverage the hybrid data structure, model scheduling, and automatic pipelining functions provided by the DMTK framework, so as to make LightLDA capable for extremely Big Data and Big Model even on a modest computer cluster. In particular, on a cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 10-million-word vocabulary (for a total of 10 trillion parameters), on a document collection with over 100-billion tokens - a scale not yet reported even with thousands of machines in the literature.

Distributed (Multi-sense) Word Embedding #

Word embedding has become a very popular tool to compute semantic representation of words, which can serve as high-quality word features for natural language processing tasks. We provide the distributed implementations of two word embedding algorithms. One is the standard word2vec algorithm, and the other is a multi-sense word embedding algorithm that learns multiple embedding vectors for polysemous words. By leveraging the model scheduling and automatic pipelining functions provided by the DMTK framework, we are able to train 300-d word embedding vectors for a 10-million-word vocabulary, on a document collection with over 100-billion tokens, on a cluster of just 8 machines.