Tencent has introduced VoCo-LLaMA, an approach that uses large language models (LLMs) to compress lengthy vision tokens into a single token with minimal loss of visual information.
VoCo-LLaMA introduces special "Vision Compression" (VoCo) tokens that distill the information carried by a sequence of vision tokens inside the LLM. The method reportedly achieves a compression ratio of 576x while retaining 83.7% of performance on common visual understanding benchmarks. It is also claimed to deliver efficiency gains: a 99.8% reduction in cache storage, a 94.8% decrease in FLOPs, and 69.6% faster inference.
However, the approach is also reported to diminish the model's ability to understand uncompressed tokens and to struggle with diverse fine-grained compression levels.
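The core idea behind this kind of vision-token compression can be sketched as an attention-masking scheme: a VoCo token attends to the vision tokens and distills them, while text tokens attend only to the VoCo token, so the raw vision tokens can be dropped from the KV cache. The sketch below is a minimal illustration of that masking pattern; the function name, token layout, and parameters are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

def voco_attention_mask(n_vision, n_voco, n_text):
    """Illustrative attention mask for VoCo-style compression.

    Sequence layout (an assumption for this sketch):
        [vision tokens | VoCo token(s) | text tokens]

    VoCo tokens attend to the vision tokens (distilling them);
    text tokens attend to the VoCo tokens but NOT to the raw vision
    tokens, so the vision entries in the KV cache can be discarded
    once compression is done.
    """
    n = n_vision + n_voco + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask
    # Cut text -> vision attention: text rows may not see vision columns.
    mask[n_vision + n_voco:, :n_vision] = False
    return mask

# 576 vision tokens compressed into a single VoCo token (the 576x figure).
mask = voco_attention_mask(n_vision=576, n_voco=1, n_text=8)

# Context kept by the last text token: 1 VoCo token + 8 text tokens.
kv_kept = int(mask[-1].sum())
print(kv_kept)  # 9
```

Under this masking, the per-step attention cost for text tokens scales with the number of VoCo tokens rather than the 576 vision tokens, which is the intuition behind the claimed cache and FLOP reductions.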