A New Battlefield for Large Model Applications: Unveiling the Key to the On-Device AI Race | On-Device Intelligence
Two-thirds of the way through 2024, a consensus in the field of large models has come into sharper focus:
The true value of AI technology lies in making it broadly accessible. Without applications, the underlying models cannot deliver on that value.
Looking back over the past six months, everyone from major internet companies to phone makers has been racing to find the killer app of the AI era, and the trend is already leaving its mark on top academic conferences.
One of the core questions drawing attention from both industry and academia is:
Given the brute-force scaling behind today's large models, how can AIGC applications run smoothly on compute-constrained terminal devices such as phones?
△Generated by Midjourney
Recently, more details have been emerging from the latest technology demos and accepted papers at top conferences such as ICML (International Conference on Machine Learning) and CVPR (IEEE/CVF Conference on Computer Vision and Pattern Recognition).
It's a good time to take stock: how far have AI applications come on their journey from the cloud to the device?
At present, in large model/AIGC applications, many Android phone makers maintain deep cooperation with Qualcomm.
At CVPR 2024 and other top conferences, Qualcomm's technology demos drew plenty of attention.
One example: deploying the multimodal large model LLaVA locally on an Android phone.
△Published by Qualcomm Research on YouTube
LLaVA is a large multimodal model with more than 7 billion parameters. It accepts multiple input types, including text and images, and supports multi-turn conversation about an image.
Hand it a photo of a puppy, for instance, and it will not only describe the photo but also chat with you about whether the dog makes a good house pet.
△Official demo captured by QbitAI at the Qualcomm booth at MWC Barcelona
Qualcomm also showed an example of running LoRA on an Android phone.
△Published by Qualcomm Research on YouTube
There is also an audio-driven 3D digital human AI assistant, which keeps running locally even when the network connection drops.
With demo prototypes already public, plus the optimization work of phone makers, the new experiences they showcase are, for ordinary users, just around the corner on our own devices.
But what drew even more attention at these conferences, beyond the demos, was a series of recent Qualcomm papers detailing the key technologies behind such applications that deserve focused investment.
One of them is quantization.
When deploying large model/AIGC applications on phones and other terminal devices, one of the key problems to solve is achieving high-performance inference.
Quantization is one of the most effective ways to improve compute performance and memory efficiency, and Qualcomm believes low-bit integer precision is critical for energy-efficient inference.
A number of Qualcomm studies have found that for generative AI, Transformer-based large language models are memory-bound, and quantizing their weights to 8-bit (INT8) or 4-bit (INT4) often yields large efficiency gains.
In particular, 4-bit weight quantization is not only feasible for large language models but achievable via post-training quantization (PTQ) with near-optimal accuracy, and the resulting efficiency surpasses that of floating-point models.
Specifically, Qualcomm's research shows that many generative AI models can be quantized to INT4 with techniques such as quantization-aware training (QAT).
Without sacrificing accuracy, an INT4 model consumes less power, delivering roughly 90% higher performance and 60% better energy efficiency than INT8.
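To make the INT8/INT4 comparison concrete, here is a minimal sketch of symmetric weight quantization in NumPy. It is illustrative only, not Qualcomm's implementation; a production PTQ pipeline would add per-channel scales, range calibration, and rounding optimization:

```python
import numpy as np

def quantize_weights(w, n_bits=4):
    """Symmetric per-tensor quantization of weights to n_bits integers."""
    qmax = 2 ** (n_bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = np.abs(w).max() / qmax        # map the largest weight to qmax
    w_int = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return w_int, scale

def dequantize(w_int, scale):
    # Recover an approximation of the original floating-point weights.
    return w_int.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
w4, s4 = quantize_weights(w, n_bits=4)
w8, s8 = quantize_weights(w, n_bits=8)

# INT8 reconstruction error is much smaller than INT4's; both shrink
# weight storage 4-8x versus FP32, which matters for memory-bound LLMs.
err4 = np.abs(w - dequantize(w4, s4)).mean()
err8 = np.abs(w - dequantize(w8, s8)).mean()
print(f"mean abs error  INT4: {err4:.4f}  INT8: {err8:.4f}")
```

The accuracy gap between INT4 and INT8 is what PTQ and QAT techniques work to close.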
This year, Qualcomm also proposed LR-QAT (Low-Rank Quantization-Aware Training), an algorithm that makes training large language models more compute- and memory-efficient.
Inspired by LoRA, LR-QAT uses a low-rank weight parameterization: it introduces low-rank auxiliary weights and folds them into the integer domain, enabling efficient inference without loss of accuracy.
Experiments on the Llama 2/3 and Mistral models show that LR-QAT matches full-model QAT performance with far lower memory usage.
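Setting the paper's exact formulation aside, the core idea can be sketched as follows: the pretrained weights stay frozen as low-bit integers, and only small low-rank factors A and B are trained and folded into the integer domain before rounding and clipping. All shapes and names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4
n_bits = 4
qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1

# Frozen pretrained weights, pre-scaled and stored as low-bit integers.
w0 = rng.standard_normal((d_out, d_in)).astype(np.float32)
scale = np.abs(w0).max() / qmax
w0_int = np.clip(np.round(w0 / scale), qmin, qmax)   # fixed during training

# Trainable low-rank auxiliary factors: the only full-precision state,
# storing (d_out + d_in) * rank floats instead of d_out * d_in.
A = rng.standard_normal((d_out, rank)).astype(np.float32) * 0.01
B = rng.standard_normal((rank, d_in)).astype(np.float32) * 0.01

def effective_weight(w0_int, A, B, scale):
    """Fold the low-rank update into the integer domain before clipping,
    so inference still uses a single low-bit weight tensor."""
    return scale * np.clip(np.round(w0_int + A @ B), qmin, qmax)

w_eff = effective_weight(w0_int, A, B, scale)
```

Because only A and B carry gradients, optimizer state shrinks accordingly, which is the source of the memory savings over full-model QAT.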
In addition, Qualcomm is exploring vector quantization (VQ). Unlike traditional scalar quantization, VQ considers the joint distribution of groups of parameters, achieving higher compression with less information loss.
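A toy illustration of the difference: instead of rounding each weight independently, VQ groups weights into short vectors and snaps each to the nearest entry of a learned codebook. This is a simple k-means sketch, not Qualcomm's method:

```python
import numpy as np

def vq_compress(w, dim=4, k=16, iters=10):
    """Toy vector quantization: split weights into length-`dim` vectors,
    learn a k-entry codebook via k-means, store only indices + codebook."""
    vecs = w.reshape(-1, dim)
    rng = np.random.default_rng(0)
    codebook = vecs[rng.choice(len(vecs), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest codeword: the decision is
        # made jointly over all `dim` weights in the group.
        dist = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(1)
        for j in range(k):                 # move codewords to cluster means
            if (idx == j).any():
                codebook[j] = vecs[idx == j].mean(0)
    return idx.astype(np.uint8), codebook

def vq_decompress(idx, codebook, shape):
    return codebook[idx].reshape(shape)

w = np.random.randn(128, 128).astype(np.float32)
idx, cb = vq_compress(w)
w_hat = vq_decompress(idx, cb, w.shape)
# Each group of 4 weights becomes a 4-bit index (k=16): large compression
# before codebook overhead, traded against reconstruction error.
```

Real VQ schemes refine this with better codebook learning and fine-tuning, but the joint-distribution intuition is the same.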
As AI models are deployed onto hardware, compilers are key to making them run efficiently, with the highest performance and lowest power consumption.
Compilation involves tiling, mapping, sequencing, and scheduling of the compute graph.
Qualcomm has accumulated substantial results in traditional compiler techniques, polyhedral AI compilers, and AI-driven compiler optimization.
For example, the Qualcomm AI Engine Direct framework sequences computation based on the hardware architecture and memory hierarchy of the Qualcomm Hexagon NPU, improving performance while minimizing memory spills.
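As a toy illustration of what such scheduling decides (this is not Qualcomm's compiler), the sketch below picks a topological execution order for a tiny compute graph and tracks peak memory by freeing each tensor after its last consumer runs:

```python
from collections import defaultdict

# Toy compute graph: op -> list of input ops; each op produces one
# tensor of the given size. All names and sizes are illustrative.
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
tensor_size = {"a": 4, "b": 2, "c": 2, "d": 1}

def schedule(graph):
    """Return a valid topological execution order (Kahn's algorithm)."""
    indeg = {op: len(deps) for op, deps in graph.items()}
    users = defaultdict(list)
    for op, deps in graph.items():
        for dep in deps:
            users[dep].append(op)
    ready = [op for op, n in indeg.items() if n == 0]
    order = []
    while ready:
        op = ready.pop()
        order.append(op)
        for u in users[op]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return order

def peak_memory(order, graph, tensor_size):
    """Simulate execution: free each tensor after its last consumer."""
    last_use = {}
    for i, op in enumerate(order):
        for dep in graph[op]:
            last_use[dep] = i
    live, peak = 0, 0
    for i, op in enumerate(order):
        live += tensor_size[op]              # materialize this op's output
        peak = max(peak, live)
        for dep in graph[op]:
            if last_use[dep] == i:           # last consumer: free the input
                live -= tensor_size[dep]
    return peak

order = schedule(graph)
print(order, "peak memory:", peak_memory(order, graph, tensor_size))
```

A real NPU compiler searches over many such orders (plus tiling and mapping choices) so that working sets fit in on-chip memory instead of spilling to DRAM.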
AI acceleration on the device side is inseparable from hardware support.
On the hardware side, the Qualcomm AI Engine uses a heterogeneous computing architecture, including the Hexagon NPU, Qualcomm Adreno GPU, and Qualcomm Kryo or Qualcomm Oryon CPU.
Among them, the Hexagon NPU has become a key processor in Qualcomm's AI engine today.
Taking the third-generation Snapdragon 8 mobile platform as an example, the Hexagon NPU is 98% faster than its predecessor in terms of performance, while reducing power consumption by 40%.
Architecturally, the Hexagon NPU has been upgraded with a new microarchitecture. Compared with the previous generation, faster vector accelerator clock speeds, stronger inference techniques, and support for more and faster Transformer networks have comprehensively improved the Hexagon NPU's responsiveness for generative AI, making it possible for on-device large models to answer user questions in seconds.
Beyond the Hexagon NPU, the third-generation Snapdragon 8 also invests more in the Qualcomm Sensor Hub, adding a next-generation micro NPU that improves AI performance by 3.5x and increases memory by 30%.
In fact, as one of the most closely watched players in the migration of large model/AIGC applications to the device, Qualcomm has long extended its AI research into a far wider range of fields.
Take CVPR 2024 as an example. In generative AI, Qualcomm proposed Clockwork Diffusion, a method for improving diffusion model efficiency that cuts compute by up to 32% while improving the perceptual score of Stable Diffusion v1.5, making SD models better suited to low-power devices.
And beyond phones, for practical needs in XR and autonomous driving, Qualcomm has also studied efficient multi-view video compression methods (LLSS).
In today's hottest research areas, such as AI video generation, Qualcomm is also making new moves:
It is developing efficient video architectures for on-device AI, for example optimizing FAIRY, a video-to-video generative AI technique. In FAIRY's first stage, state is extracted from anchor frames; in the second stage, the video is edited across the remaining frames. The optimizations include cross-frame optimization, efficient InstructPix2Pix, and image/text-guided editing.
Large model applications are the trend of the moment. And as those applications deepen, one key point is becoming clearer:
How fast application innovation can evolve depends on how solid the underlying technology is.
The technology base here refers not only to the basic model itself, but also to the full-stack AI optimization from model quantization and compression to deployment.
It can be understood that if the basic model determines the upper limit of the application effect of the large model, then a series of AI optimization technologies determine the lower limit of the application experience of the large model on the terminal side.
For ordinary consumers, the encouraging news is that vendors like Qualcomm are not only stepping up theoretical research, but also accelerating full-stack AI research and optimization spanning applications, neural network models, algorithms, software, and hardware.
Take the Qualcomm AI software stack as an example: a toolkit bundling a large set of AI technologies, with full support for mainstream AI frameworks, different operating systems, and multiple programming languages, improving the compatibility of AI software across smart devices.
It also includes Qualcomm AI Studio, which integrates all of Qualcomm's AI tools, including AI model plug-ins, model analyzers, and neural network architecture search (NAS).
Better still, based on the Qualcomm AI software stack, developers can develop an AI model once and deploy it across different devices anytime, anywhere.
In other words, the Qualcomm AI software stack acts like a "converter", solving a major problem in bringing large models to a wide variety of smart terminals: cross-device migration.
As a result, large model applications can move not only from the cloud to the phone, but also, and faster, into cars, XR, PCs, and IoT devices.
At this juncture, everyone is anticipating an even mightier wave of world-changing technology.
And the trendsetters at the crest of that wave are once again confirming a fact: the people and organizations that lead in technology share an "inventor culture" that values foundational technology.
They not only chase the latest trends, but lay the groundwork in advance and crack the fundamental problems first.
Qualcomm also mentions this in its white paper, "Making AI Accessible":
Qualcomm has been deeply engaged in AI research and development for more than 15 years, committed to making core capabilities such as perception, reasoning, and action ubiquitous on devices.
These AI studies, and the papers built on them, shape not only Qualcomm's own technology roadmap but the AI development of the entire industry.
In the era of large models, the "inventor culture" is still continuing.
It is this kind of culture that continues to promote the popularization of new technologies, promote the competition and prosperity of the market, and drive more industry innovation and development.
What do you think?
— END —