A graph of MoE experts can be put on the device

Motivation

The LLM is too big to fit on the device. Is there a better architecture that makes the model small enough to run on the device while keeping its precision and inference performance?

An LLM embeds words into a very high-dimensional space in order to classify them better. But it is affected by the curse of dimensionality, so we need a deeper neural network (more parameters) and more training data to avoid overfitting.

It also uses the attention mechanism to make the embeddings classifiable.

But each time we put some words into the space, we may be adding a new dimension that is useful only for the related words; for most other words it is a useless dimension. The whole vector space becomes sparse and needs more training data and a more complex model (more parameters).

We can use the idea of mixture of experts. The experts form a graph: some experts are independent, while others are related and can be connected by edges. The problem then becomes making each expert's dimension as small as possible, and the duplicated features across experts as few as possible. We convert one high-dimensional, sparse LLM into many low-dimensional, dense experts. This is just like symbol table design in a compiler: we could use one big flattened symbol table, but the scope level and scope name are the same for every entry within a scope and can be eliminated, so the traditional implementation uses a chain of per-scope hash tables, which is more compact in memory.
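A minimal sketch of the symbol-table analogy in Python (the chain of scopes is a standard compiler technique; the class and field names here are illustrative):

```python
# Sketch of the chain-of-scopes symbol table the analogy refers to.
# Instead of one flat table keyed by (scope, name), each scope keeps
# its own small hash table plus a link to its parent -- the shared
# scope prefix is factored out, just as shared features would be
# factored out of the experts.

class Scope:
    def __init__(self, parent=None):
        self.symbols = {}        # small, dense per-scope table
        self.parent = parent     # link to the enclosing scope

    def define(self, name, info):
        self.symbols[name] = info

    def lookup(self, name):
        # Walk outward through enclosing scopes until the name is found.
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        raise KeyError(name)

globals_ = Scope()
globals_.define("printf", "libc function")
function = Scope(parent=globals_)
function.define("x", "local int")
print(function.lookup("x"))       # found locally
print(function.lookup("printf"))  # found via the chain
```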

Design

A graph G = (V, E) represents the model: V is the set of all experts, and E is the set of edges between related experts.
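A minimal sketch of this structure, assuming each expert owns its own small, dense embedding table (all names and dimensions are illustrative):

```python
# Sketch: G = (V, E) where V is the set of experts and E connects
# related experts. Each expert is low-dimensional and dense.

from dataclasses import dataclass, field

@dataclass
class Expert:
    name: str
    dim: int                                    # low, expert-local dimension
    vocab: dict = field(default_factory=dict)   # word -> dense vector

@dataclass
class ExpertGraph:
    experts: dict = field(default_factory=dict)  # V: name -> Expert
    edges: set = field(default_factory=set)      # E: pairs of expert names

    def add_expert(self, expert):
        self.experts[expert.name] = expert

    def connect(self, a, b):
        self.edges.add(frozenset((a, b)))

    def neighbors(self, name):
        return [next(iter(e - {name})) for e in self.edges if name in e]

g = ExpertGraph()
g.add_expert(Expert("operating system", dim=64))
g.add_expert(Expert("database", dim=64))
g.connect("operating system", "database")
print(g.neighbors("database"))  # ['operating system']
```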

Partitioning

Take the computing system as the whole world. We can split the experts into:

  • operating system
  • database
  • compiler
  • programming language
  • distributed systems
  • network
  • microarchitecture
  • application
  • electronics
  • common knowledge

When a word like file system comes in, it is added to both the operating system and database experts; when a word like computer arrives, it is put into common knowledge. All experts depend on the common knowledge expert.
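A sketch of this routing rule, assuming a simple keyword table as the classifier (a real system would learn the routing; the table below is illustrative):

```python
# Sketch: route an incoming word to one or more experts; words that
# belong to no specific domain fall back to common knowledge, which
# every other expert depends on.

ROUTES = {
    "file system": ["operating system", "database"],
    "b-tree":      ["database"],
    "scheduler":   ["operating system"],
}

def route(word):
    # Domain words go to their experts; everything else goes to the
    # shared common-knowledge expert.
    return ROUTES.get(word, ["common knowledge"])

print(route("file system"))  # ['operating system', 'database']
print(route("computer"))     # ['common knowledge']
```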

Evolve

All the experts can be version-controlled, and they can evolve and be refactored: add more words, add dimensions, move words to other experts, extract new experts. Just like a package management system, all experts are packages.
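A sketch of what package-style metadata for one expert might look like (the field names are assumptions, modeled on ordinary package manifests):

```python
# Sketch: each expert carries package-style metadata so it can be
# versioned, depended on, and refactored independently.

expert_manifest = {
    "name": "database",
    "version": "1.2.0",               # bumped when words/dimensions change
    "dim": 64,
    "depends": ["common knowledge"],  # every expert depends on it
    "changelog": [
        "1.2.0: moved shared file-system terms to operating system",
        "1.1.0: added one dimension for transaction vocabulary",
    ],
}
```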

Query (inference)

The query is converted to tokens and sent to the related experts. The exploit method uses the dependent experts to combine the results; the explore method queries other experts to find results too. The result contains all the experts visited; this sub-graph forms a path, and the path can be cached.
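A sketch of this query flow, with the routing, neighbor, and scoring functions injected as parameters and an in-memory path cache (the 0.5 threshold and all names are illustrative):

```python
# Sketch: exploit = combine results from the experts the tokens map to
# plus their dependent neighbors; explore = also probe the remaining
# experts. The experts visited form a sub-graph path that is cached.

path_cache = {}   # tuple of tokens -> ordered list of expert names

def infer(tokens, route, neighbors, all_experts, score):
    key = tuple(tokens)
    if key in path_cache:
        return path_cache[key]             # cached path, skip routing

    hit = {e for t in tokens for e in route(t)}
    for e in list(hit):                    # exploit: dependent experts
        hit.update(neighbors(e))
    for e in all_experts:                  # explore: probe the others
        if e not in hit and score(e, tokens) > 0.5:
            hit.add(e)

    path = sorted(hit)                     # the sub-graph path
    path_cache[key] = path
    return path
```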

Online learning

Support real-time input sentences. The tokenizer first classifies the tokens and routes them to different experts; a full batch triggers an insert. Insert the data into the experts, and update the cached paths.
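A sketch of this update loop, assuming a per-expert buffer that flushes at a fixed batch size and invalidates any cached path touching an updated expert (batch size and names are illustrative):

```python
# Sketch: classify incoming tokens to experts, buffer them, and insert
# in batches; any cached path that used an updated expert is dropped
# so it will be recomputed with the new data.

from collections import defaultdict

BATCH_SIZE = 32
buffers = defaultdict(list)   # expert name -> pending tokens

def learn_online(sentence, classify, insert, path_cache):
    for token in sentence.split():
        for expert in classify(token):          # tokenize + classify
            buffers[expert].append(token)
            if len(buffers[expert]) >= BATCH_SIZE:
                insert(expert, buffers[expert])  # batch triggers insert
                buffers[expert].clear()
                # Update the paths: drop cached paths using this expert.
                for key in [k for k, p in path_cache.items() if expert in p]:
                    del path_cache[key]
```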