Distributed machine learning has become more important than ever in this big data era. In recent years, practice has demonstrated a clear trend: more training data and bigger models tend to yield better accuracy in a wide range of applications. However, it remains challenging for common machine learning researchers and practitioners to learn big models from huge amounts of data, because the task usually requires a large amount of computational resources. To tackle this challenge, we release the Microsoft Distributed Machine Learning Toolkit (DMTK), which contains both algorithmic and system innovations. These innovations make machine learning tasks on big data highly scalable, efficient, and flexible. We will continue to add new algorithms to DMTK on a regular basis.
The current version of DMTK includes the following components (more will be added in future versions):
• DMTK Framework: a flexible framework that supports unified interface for data parallelization, hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency.
• LightLDA: an extremely fast and scalable topic model algorithm, with an O(1) Gibbs sampler and an efficient distributed implementation.
• Distributed (Multi-sense) Word Embedding: a distributed version of the (multi-sense) word embedding algorithm.
• LightGBM: a very high-performance gradient boosting tree framework (supporting GBDT, GBRT, GBM, and MART), and its distributed implementation.
Machine learning researchers and practitioners can also build their own distributed machine learning algorithms on top of our framework with small modifications to their existing single-machine algorithms.
We believe that in order to push the frontier of distributed machine learning, we need the collective effort from the entire community, and need the organic combination of both machine learning innovations and system innovations. This belief strongly motivates us to open source the DMTK project.
This framework includes the following components:
• A parameter server that supports
o Hybrid data structure for model storage, which uses separate data structures for high- and low-frequency parameters, and thus achieves an outstanding balance between memory capacity and access speed.
o Acceptance and aggregation of updates from local workers, which supports several different model synchronization mechanisms, including BSP, ASP, and SSP, in a unified manner.
• A client SDK that supports
o Pipelining between local training and model communication, which enables very high training speed under varying conditions of computational resources and network connections.
o Scheduling big model training in a round-robin fashion, which allows each worker machine to pull sub-models from the parameter server as needed, resulting in frugal use of limited memory capacity and network bandwidth to support very big models.
o Lua and Python bindings, which can enable users in various communities to leverage our parameter server to parallelize their machine learning tasks.
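To make the synchronization mechanisms above concrete, here is a minimal sketch (not the DMTK API; all names are illustrative) of how BSP, ASP, and SSP can be unified behind a single staleness bound s: s = 0 gives BSP (a full barrier), a finite s > 0 gives SSP, and an unbounded s gives ASP.

```python
class StalenessController:
    """Toy model of a parameter server's sync policy.
    staleness = 0 -> BSP, finite -> SSP, float('inf') -> ASP."""

    def __init__(self, num_workers, staleness):
        self.staleness = staleness
        # Each worker's local iteration clock.
        self.clock = {w: 0 for w in range(num_workers)}

    def tick(self, worker):
        """Worker finished an iteration; advance its local clock."""
        self.clock[worker] += 1

    def may_proceed(self, worker):
        """A worker may start its next iteration only if it is no more
        than `staleness` iterations ahead of the slowest worker."""
        slowest = min(self.clock.values())
        return self.clock[worker] - slowest <= self.staleness


bsp = StalenessController(num_workers=2, staleness=0)
bsp.tick(0)                      # worker 0 finishes iteration 0
assert not bsp.may_proceed(0)    # BSP: must wait for worker 1
bsp.tick(1)
assert bsp.may_proceed(0)        # barrier released
```

The appeal of this formulation is that one server-side check covers all three protocols, which is presumably why a unified interface is feasible at all.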
LightLDA is a new, highly efficient O(1) Metropolis-Hastings sampling algorithm, whose per-token cost is (surprisingly) agnostic of model size, and which empirically converges nearly an order of magnitude faster than current state-of-the-art Gibbs samplers.
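The core idea behind an O(1) Metropolis-Hastings sampler can be illustrated on a generic discrete distribution: draw from a cheap proposal in O(1) time, then accept or reject based on a ratio of unnormalized probabilities. The sketch below is a simplified, hypothetical illustration (a plain uniform proposal), not LightLDA's actual factorized word- and document-proposals.

```python
import random


def mh_discrete_sample(weights, steps=100, rng=random):
    """Sample an index approximately proportional to `weights` using
    Metropolis-Hastings with a symmetric uniform proposal.  Each step
    costs O(1) regardless of how the weights were computed -- the
    property LightLDA exploits with smarter, model-aware proposals."""
    k = len(weights)
    state = rng.randrange(k)
    for _ in range(steps):
        proposal = rng.randrange(k)  # O(1) proposal draw
        # Acceptance ratio for a symmetric proposal: p(new) / p(old).
        accept = min(1.0, weights[proposal] / weights[state])
        if rng.random() < accept:
            state = proposal
    return state


# Sampling repeatedly should favor the heaviest weight.
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(500):
    counts[mh_discrete_sample([1.0, 8.0, 1.0], steps=50, rng=rng)] += 1
assert counts[1] > counts[0] and counts[1] > counts[2]
```

In LightLDA the proposal itself is built so each draw stays O(1) even when the number of topics is huge, which is what makes the running cost independent of model size.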
In the distributed implementation, we leverage the hybrid data structure, model scheduling, and automatic pipelining functions provided by the DMTK framework, so as to make LightLDA capable of handling extremely big data and big models even on a modest computer cluster. In particular, on a cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 10-million-word vocabulary (for a total of 10 trillion parameters) on a document collection with over 100 billion tokens - a scale not yet reported in the literature, even with thousands of machines.
Word embedding has become a very popular tool for computing semantic representations of words, which can serve as high-quality word features for natural language processing tasks. We provide distributed implementations of two word embedding algorithms. One is the standard word2vec algorithm, and the other is a multi-sense word embedding algorithm that learns multiple embedding vectors for polysemous words. By leveraging the model scheduling and automatic pipelining functions provided by the DMTK framework, we are able to train 300-dimensional word embedding vectors for a 10-million-word vocabulary, on a document collection with over 100 billion tokens, on a cluster of 8 machines.
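For readers unfamiliar with the underlying objective, the sketch below shows a single SGD step of skip-gram with negative sampling, the training rule at the heart of word2vec. It is a toy, dependency-free illustration with made-up names, not DMTK's distributed implementation.

```python
import math


def sgns_update(w_in, w_out, neg_vecs, lr=0.025):
    """One SGD step of skip-gram with negative sampling on a single
    (center, context) pair.  `w_in` is the center word's input vector,
    `w_out` the observed context word's output vector, and `neg_vecs`
    the output vectors of randomly sampled negative words.  All
    vectors are plain lists, updated in place."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    grad_in = [0.0] * len(w_in)
    # Label 1 for the true context word, 0 for each negative sample.
    targets = [(w_out, 1.0)] + [(v, 0.0) for v in neg_vecs]
    for vec, label in targets:
        score = sigmoid(sum(a * b for a, b in zip(w_in, vec)))
        g = lr * (label - score)
        for i in range(len(w_in)):
            grad_in[i] += g * vec[i]   # accumulate gradient for w_in
            vec[i] += g * w_in[i]      # update the output vector
    for i in range(len(w_in)):
        w_in[i] += grad_in[i]
```

Repeated updates pull a word's input vector toward the output vectors of its true contexts and away from the negative samples; the multi-sense variant maintains several input vectors per word and picks one per occurrence.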
LightGBM is a new gradient boosting tree framework, which is highly efficient and scalable and supports many different algorithms, including GBDT, GBRT, GBM, and MART. LightGBM has been shown to be several times faster than existing implementations of gradient boosting trees, thanks to its fully greedy tree-growth method and histogram-based memory and computation optimizations. It also has a complete solution for distributed training, based on the DMTK framework. The distributed version of LightGBM takes only one or two hours to train a CTR predictor on the Criteo dataset, which contains 1.7 billion records with 67 features, on a cluster of 16 machines.
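The histogram optimization mentioned above can be sketched in a few lines: instead of scanning every sorted feature value when searching for a split, bucket the samples into a fixed number of bins, accumulate per-bin gradient statistics, and scan only the bin boundaries. The code below is a simplified, first-order illustration (LightGBM's real gain formula also uses second-order Hessian statistics and regularization); all names are ours.

```python
def best_histogram_split(values, gradients, num_bins=16):
    """Find the best split on one feature via the histogram trick:
    O(n) binning plus an O(num_bins) boundary scan, instead of an
    O(n) scan over fully sorted values per candidate threshold."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0
    grad_sum = [0.0] * num_bins
    count = [0] * num_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), num_bins - 1)
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(gradients), len(values)
    best_gain, best_bin = 0.0, None
    left_g, left_n = 0.0, 0
    for b in range(num_bins - 1):          # candidate split after bin b
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        # Simplified variance-gain criterion from the gradient sums.
        gain = (left_g ** 2 / left_n + right_g ** 2 / right_n
                - total_g ** 2 / total_n)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    threshold = None if best_bin is None else lo + (best_bin + 1) * width
    return threshold, best_gain


# Gradients flip sign at value 5, so the best split lands near there.
values = list(range(10))
grads = [-1.0] * 5 + [1.0] * 5
threshold, gain = best_histogram_split(values, grads)
assert 4.0 < threshold < 5.0
```

Because each histogram is a small fixed-size array, this also slashes memory traffic, which is a large part of why the method is fast in practice.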
A special note: DMTK is a platform designed for distributed machine learning. Deep learning is not our focus, and the algorithms released in DMTK are mostly non-deep-learning algorithms. If you want to use state-of-the-art deep learning tools, we highly recommend Microsoft CNTK. We collaborate closely with CNTK and provide support for its asynchronous parallel training functionality.