A graph of MoE can be put on the device
Motivation
The LLM is too big to fit on the device. Is there a better architecture that makes the model small enough to run on the device while keeping its precision and inference performance?
An LLM embeds each word into a very high-dimensional space in order to classify better. But it is affected by the curse of dimensionality, so we need a deeper neural network (more parameters) and more training data to avoid overfitting.
It also uses the attention mechanism to make the embeddings classifiable.
But each time we put new words into the space, we may add a new dimension that is only useful for the related words, while for most other words it adds a useless dimension. The whole vector space ends up sparse and needs more training data and a more complex model (more parameters).
We can use the idea of mixture of experts. The experts form one graph: some experts are independent, while others are related to each other and can be connected by edges. The problem then becomes making each expert's dimension as small as possible, and the features duplicated across different experts as few as possible. We convert one high-dimensional sparse LLM into many low-dimensional dense experts. This is just like the symbol table design in a compiler: we could use one big flat symbol table, but the scope level and scope name are the same for every entry in a scope and can be eliminated, so the traditional implementation uses a chain of per-scope hash tables, which gives a more compact memory layout.
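To make the analogy concrete, here is a minimal Python sketch of the scope-chain symbol table; the class and identifier names are illustrative assumptions, not a fixed design.

```python
# A minimal sketch of the symbol-table analogy. Each scope keeps its own
# small hash table plus a link to the enclosing scope, so the shared
# scope name/level is stored once per scope instead of repeated in every
# entry of one big flat table.

class Scope:
    def __init__(self, name, parent=None):
        self.name = name      # stored once per scope, not per symbol
        self.parent = parent  # chain to the enclosing scope
        self.symbols = {}     # small, dense per-scope table

    def define(self, symbol, info):
        self.symbols[symbol] = info

    def lookup(self, symbol):
        # Walk the chain outward, the way a small expert can fall back
        # to the experts it depends on.
        scope = self
        while scope is not None:
            if symbol in scope.symbols:
                return scope.symbols[symbol]
            scope = scope.parent
        return None

globals_scope = Scope("global")
globals_scope.define("printf", "function")
main_scope = Scope("main", parent=globals_scope)
main_scope.define("x", "int")
assert main_scope.lookup("x") == "int"            # found locally
assert main_scope.lookup("printf") == "function"  # found via the chain
```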
Design
A graph
Partitioning
Take the compute system as the whole world. We can split the experts into:
- operating system
- database
- compiler
- programming language
- distributed
- network
- micro architecture
- application
- electronics
- common knowledge
When a word like "file system" comes in, it will be added to both the operating system expert and the database expert; when a word like "computer" arrives, it will be put into the common knowledge expert. All experts depend on the common knowledge expert.
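A minimal sketch of this routing rule, assuming a simple word-to-experts table; the expert names come from the partition list above, and the routing entries and dependency map are illustrative assumptions:

```python
# Route a word to one or more domain experts; every domain expert
# depends on the common knowledge expert.

EXPERT_DEPS = {
    "operating system": ["common knowledge"],
    "database": ["common knowledge"],
    "compiler": ["common knowledge"],
    # ... every other domain expert depends on "common knowledge" too
    "common knowledge": [],
}

ROUTING = {
    "file system": ["operating system", "database"],  # shared concept
    "computer": ["common knowledge"],                 # generic concept
}

def route(word):
    """Return the experts a word should be inserted into."""
    # Unrouted words fall back to the common knowledge expert.
    return ROUTING.get(word, ["common knowledge"])

print(route("file system"))  # ['operating system', 'database']
print(route("computer"))     # ['common knowledge']
```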
Evolve
All the experts can be version controlled and can evolve through refactoring: adding more words, adding dimensions, moving words to other experts, extracting new experts. Just like a package management system, all experts are packages.
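As a sketch of the packages idea, each expert could carry a version and a dependency list, and every refactoring step is recorded as a new version; the manifest shape below is an assumption for illustration.

```python
# Experts as versioned packages: refactoring (adding words, moving
# words out, extracting an expert) bumps the version, with explicit
# dependencies, as a package manager would do.

experts = {
    "common-knowledge": {"version": "1.0.0", "depends": []},
    "operating-system": {"version": "1.2.0", "depends": ["common-knowledge"]},
    "database":         {"version": "0.9.1", "depends": ["common-knowledge"]},
}

def bump_minor(name):
    """Record a refactoring of an expert as a new minor version."""
    major, minor, _patch = map(int, experts[name]["version"].split("."))
    experts[name]["version"] = f"{major}.{minor + 1}.0"

bump_minor("database")  # e.g. after moving some words to another expert
print(experts["database"]["version"])  # 0.10.0
```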
Query (inference)
The query is converted to tokens and sent to the related experts. The exploit method uses the dependent experts to combine the results; the explore method queries other experts to find results too. The result covers all the involved experts, and this subgraph forms a path. The path can be cached.
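Here is a minimal sketch of the exploit/explore flow under stated assumptions: `query_expert` is a hypothetical stand-in for the real per-expert inference call, exploit follows the dependency edges of the routed experts, explore adds extra experts, and the resulting path is cached per query.

```python
# Exploit/explore over the expert graph, with per-query path caching.
# The dependency map and function names are illustrative assumptions.

EXPERT_DEPS = {
    "operating system": ["common knowledge"],
    "database": ["common knowledge"],
    "common knowledge": [],
}

path_cache = {}  # tokens -> the expert path that answered them

def query_expert(expert, tokens):
    # Placeholder for the real per-expert inference (assumption).
    return f"{expert}: {' '.join(tokens)}"

def answer(tokens, primary_experts, explore=()):
    key = tuple(tokens)
    if key in path_cache:
        path = path_cache[key]                 # reuse the cached path
    else:
        candidates = []
        for e in primary_experts:
            candidates.append(e)               # exploit: the routed expert
            candidates.extend(EXPERT_DEPS[e])  # exploit: its dependencies
        candidates.extend(explore)             # explore: extra experts
        path = list(dict.fromkeys(candidates)) # dedupe, keep order
        path_cache[key] = path
    return [query_expert(e, tokens) for e in path]

results = answer(["file", "system"], ["operating system", "database"])
print(results)  # answers from the whole sub-graph path
```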
Online learning
Support real-time input sentences. The tokenizer first classifies the tokens and routes them to different experts; a full batch triggers an insert. The data is inserted into the experts, and the cached paths are updated.
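A sketch of this flow, assuming a per-expert buffer and an illustrative batch size; `classify` and `insert_batch` are hypothetical placeholders for the tokenizer-side classifier and the real expert update, and flushing a batch also invalidates cached paths that touch the updated expert.

```python
# Buffer incoming tokens per expert; a full batch triggers an insert
# and invalidates the cached paths that used the updated expert.

from collections import defaultdict

BATCH_SIZE = 32
buffers = defaultdict(list)  # expert -> pending tokens
path_cache = {}              # tokens -> expert path (see the query sketch)

def classify(token):
    # Placeholder classifier (assumption): everything falls back to
    # the common knowledge expert here.
    return ["common knowledge"]

def insert_batch(expert, tokens):
    # Placeholder for the real expert update (assumption).
    print(f"insert {len(tokens)} tokens into {expert}")
    # Drop cached paths that used this expert so they get rebuilt.
    for key in [k for k, path in path_cache.items() if expert in path]:
        del path_cache[key]

def feed(token):
    for expert in classify(token):
        buffers[expert].append(token)
        if len(buffers[expert]) >= BATCH_SIZE:  # a full batch triggers insert
            insert_batch(expert, buffers[expert])
            buffers[expert] = []

for tok in ["computer"] * BATCH_SIZE:
    feed(tok)  # the final token of the batch triggers the insert
```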