I've dug into the internals of an MIT-licensed MoE system that uses Linear Attention (Lightning Attention) to extend its context length to 1M input tokens.
How Minimax-01 Achieves 1M Token Context…