
The Pangu media model advances, and digital content keeps "making dreams" for the public

Intelligent Relativity 2024/06/21 20:51

Text | Intelligent Relativity

Author | Chen Bocheng

This was one scene from "Mountains and Rivers, Poetic Chang'an" at the Xi'an sub-venue of the Spring Festival Gala: a digital "Li Bai" appeared on stage and led the audience in reciting "Jiang Jin Jiu" ("Bring In the Wine"), vividly expressing the pride and romance that run in the bones of the Chinese people.

This was another scene, from the Yiwu commodity market in Zhejiang: a shop owner who speaks only a few words of English turned into a polyglot in seconds, switching seamlessly among 36 languages to introduce her goods fluently and sell at a furious pace.

Incredible scenes like these have repeatedly pushed today's Chinese culture and commerce into the public spotlight. Behind them all is the same support: AI technology empowering the production and application of digital content.

In recent years, as large AI models have kept upgrading and enabling new uses, the wave of digital content production and application has grown ever stronger. The steady fusion of real-world scenes with digital content is quietly reshaping the entire content-creation industry and, beyond it, driving new changes in related industries and businesses.

Technological innovation reshapes the paradigm of digital content production and application

The digital human "Li Bai"'s brilliant performance and the Yiwu shop owner's AI-powered selling both trace back to breakthroughs in technological innovation: the maturing application of large AI models has let ever more forms of digital content explode into public view.

At the Huawei Developer Conference (HDC 2024) held on June 21, the HUAWEI CLOUD Pangu model was upgraded to version 5.0. Within it, the Pangu media model's innovations in speech generation, video generation, and AI translation are reshaping the paradigm of digital content production and application.

Compared with past technical capabilities, the effect of the new technology is striking.

1. Advanced speech generation: with just a few words of audio, immersive, lifelike speech comes easily

In the past, speech generation relied on traditional voice-cloning models, which, being small and not very accurate, were often far more cumbersome in practice. In the data-collection stage, for example, recordings of the target speaker had to be as diverse as possible, covering different speaking rates, intonations, volumes, and contexts, and hundreds of sentences of audio were required.

Then, in the pre-processing stage, the collected audio had to be cleaned, with noise, silent clips, and other unwanted parts removed through manual annotation and similar work. Next came speech segmentation, cutting the continuous speech signal into smaller units such as phonemes or words. Finally, audio features were extracted for the subsequent voice modeling.
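To make the scale of that manual pipeline concrete, here is a minimal sketch of the cleaning, segmentation, and feature-extraction steps using the open-source librosa library. The file name and silence thresholds are placeholder assumptions; a production pipeline would add manual annotation and far more data on top.

```python
import librosa

# Load one recording of the target speaker, resampled to a standard rate
audio, sr = librosa.load("target_speaker_001.wav", sr=16000)

# Trim leading/trailing silence (a crude stand-in for manual cleaning)
audio, _ = librosa.effects.trim(audio, top_db=30)

# Segment the continuous signal at pauses into smaller utterances
segments = librosa.effects.split(audio, top_db=30)

# Extract MFCC features per segment for the downstream voice model
features = [librosa.feature.mfcc(y=audio[start:end], sr=sr, n_mfcc=13)
            for start, end in segments]
print(f"{len(segments)} segments, first feature matrix: {features[0].shape}")
```

Repeat that over hundreds of sentences, with humans checking each cut, and the workload the article describes becomes clear.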

And all of that covers only data collection and pre-processing, before actual speech generation even begins. The workload and operational complexity are already substantial, weighing heavily on both the efficiency and the quality of speech generation.

Today, with the innovation of technology, based on more advanced models, such as the speech generation capabilities of Pangu Media Model, this problem has been well solved. With only a few words and a few seconds of voice, AI can learn personalized timbre, intonation, and prosody of expression, so as to obtain high-quality personalized speech. At the same time, it also supports anthropomorphic emotional voices such as joy, anger and sorrow, and more than 10 tone styles such as small talk, news, and live broadcast, making the generated voice more realistic and emotional, and can be immersed in different scenes.
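The Pangu interface itself is not shown in this article, but the few-shot idea can be illustrated with an open-source analogue: Coqui's XTTS v2 model clones a voice from a few seconds of reference audio. The file names here are placeholders.

```python
from TTS.api import TTS

# XTTS v2: multilingual zero-shot voice cloning from a short reference clip
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Welcome to the Yiwu commodity market.",
    speaker_wav="reference_6s.wav",   # a few seconds of the target speaker
    language="en",
    file_path="cloned_voice.wav",
)
```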

In video translation, for example, AI can now approach the professionalism of human voice actors: with the video-translation capability of the Pangu media model, AI can translate a video into the target language while preserving the original characters' timbre, emotion, and tone. HUAWEI CLOUD is also working with partners on highly emotive voice cloning and dubbing in 14 less widely spoken languages, jointly building highly emotive, super-anthropomorphic multimodal audio applications. Combined with the Pangu media model's lip-driving model, it can also synchronize lip movements, achieving good lip matching even in difficult scenarios such as profile shots, multi-person dialogue, object occlusion, and moving characters.

2. A leap in video generation: with only a few dozen images, controllable and consistent video is at your fingertips

Traditional video-generation technologies face limitations on many fronts: resource requirements, datasets, temporal consistency, adherence to physical laws, the balance between efficiency and quality, controllability, fidelity and consistency, and range of application. Now, with the Pangu media model, it is enough to train on just a few dozen images in a given aesthetic style, Ghibli or anime, for example; feed in a live-action video and it quickly generates an animated video in that style.
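As an illustrative sketch of the general technique rather than Huawei's pipeline: with open-source diffusion tooling, one could fine-tune a LoRA on a few dozen style images and then stylize a live-action clip frame by frame. The base model ID is real; the LoRA path, file names, and parameters are assumptions.

```python
import numpy as np
import imageio
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Base diffusion model plus a LoRA fine-tuned on a few dozen style images
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("./anime_style_lora")  # hypothetical local LoRA weights

styled = []
for frame in imageio.get_reader("live_action.mp4"):
    img = Image.fromarray(frame).resize((512, 512))
    out = pipe(prompt="hand-drawn anime style", image=img, strength=0.45).images[0]
    styled.append(np.asarray(out))

imageio.mimsave("anime_version.mp4", styled, fps=12)
```

Stylizing each frame independently like this tends to flicker, since nothing ties one frame's output to the next; that is exactly the consistency problem the ID-consistency model described below is meant to solve.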

Beyond generating stable animated video of the requested duration, an ID-consistency model can be applied to the key characters in the generated frames, ensuring that their appearance stays consistent from one frame to the next and that profile views and motion trajectories look reasonable and coherent. This strengthens the controllability and consistency of AI video generation and makes the content more plausible and realistic.

The wider industry is also pushing to increase the realism and complexity of generated video. OpenAI's Sora, for example, attempts to simulate complex camera movements while keeping characters and visual style accurately consistent, making AI-created digital content more lifelike. NVIDIA has released a series of technology suites, including ACE (NVIDIA Avatar Cloud Engine), NeMo™, and RTX™, to enhance the realism of digital content and make digital characters' interactions and dialogue more sophisticated and believable.

3. Enhanced AI translation: with accuracy above 93%, real-time cross-language communication is just around the corner

In the past, machine translation systems were mostly built on statistical or rule-based models, so their output often strayed from the source text, read stiff and unnatural, and was ill-suited to varied scenarios. HUAWEI CLOUD now uses AI to deliver real-time interpretation across multiple languages with accuracy above 93%, fit for scenarios that demand live interpretation, such as real-time calls and cloud meetings.

Meanwhile, by combining the Pangu media model's voice replication, AI text translation, and TTS technologies, simultaneous speech interpretation becomes possible, delivering a cross-language communication experience that feels native. It can even be paired with digital-human technology: a digital human mimics the user's speech while lip-model technology matches mouth shape to voice precisely, so that AI translation, digital humans, and speech generation combine for scenarios such as online meetings and cross-border trade exchanges.
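The chain the paragraph describes, speech recognition, then text translation, then voice-cloned synthesis, can be sketched with open-source stand-ins (Whisper, a MarianMT model, and XTTS). This is a rough analogue under those assumptions, not Pangu's implementation; the file names are placeholders.

```python
import whisper
from transformers import pipeline
from TTS.api import TTS

# 1) ASR: transcribe the source-language speech
asr = whisper.load_model("base")
text_zh = asr.transcribe("speaker_zh.wav", language="zh")["text"]

# 2) MT: translate the transcript into the target language
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
text_en = translator(text_zh)[0]["translation_text"]

# 3) TTS with cloning: speak the translation in the original speaker's voice
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(text=text_en, speaker_wav="speaker_zh.wav",
                language="en", file_path="speaker_en.wav")
```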

Is the "bottleneck" of technology shrinking?

Technological innovation and breakthroughs have set off an explosion in digital content production and application. On the other hand, as that process accelerates, the corresponding technical bottlenecks are emerging too, steadily narrowing and coming into focus. At present, the problems large AI models face in producing and applying digital content show up at three levels.

First, bottlenecks in energy and computing efficiency. The compute needed to train large models keeps climbing: from GPT-3 to GPT-4, training compute grew roughly 68-fold. As training tokens and model parameters multiply, so does the computation that training demands.
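That coupling of parameters and tokens is often captured by the rule of thumb that training cost is about 6·N·D floating-point operations (N parameters, D training tokens). A back-of-the-envelope check, with the token count an assumed figure for illustration:

```python
# Rule of thumb: training compute C ≈ 6 * N * D FLOPs
N = 500e9                 # parameters: the 500B dense model cited below
D = 10e12                 # training tokens: assumed 10T, for illustration only
flops = 6 * N * D
print(f"~{flops:.1e} FLOPs")   # ~3.0e25 FLOPs, before hardware inefficiencies
```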

Worse, the cost behind it is unsustainable. By one estimate, training a dense model at the 500-billion-parameter scale takes about US$1 billion in basic computing infrastructure, 21 months of fault-free operation, and roughly 530 million yuan in electricity, far beyond what most enterprises can bear.

Yet if high-quality digital content is to be produced at scale, refining large models is the necessary path. The industry has therefore begun seeking more efficient computing solutions. HUAWEI CLOUD's Ascend AI cloud service, for example, is committed to providing convenient, easy-to-use computing, continuously innovating in raw compute and computational efficiency while offering full-stack services spanning cloud computing power, model development, model hosting, and ecosystem.

Second, the challenge of optimizing algorithm architecture. As model parameters grow, processing time keeps stretching in pursuit of better computations and better answers. In practical applications this is a significant problem for digital content production and application, and it works squarely against the industry's scaling and commercialization.

The industry has therefore also begun optimizing algorithm architectures, adjusting computational logic and processing methods for better results. A current focus of the AI industry is the MoE (mixture-of-experts) model built on a sparse-activation mechanism. In the spirit of "specialization within a trade," sparse activation breaks a task apart, classifies the pieces, routes each to a specific "expert" for processing, and finally outputs a weighted combination of their results. This not only improves computational efficiency but also makes the output more comprehensive and capable.
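A minimal PyTorch sketch of that sparse-activation idea: a router scores all experts for each token, only the top-k experts actually run, and their outputs are blended with the normalized router weights. The sizes and expert design here are illustrative, not Pangu's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal mixture-of-experts layer with top-k sparse gating."""
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)   # router scores each expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                           # x: (tokens, d_model)
        scores = self.gate(x)                       # (tokens, n_experts)
        topk, idx = scores.topk(self.k, dim=-1)     # keep k best experts/token
        weights = F.softmax(topk, dim=-1)           # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                  # only k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(16, 512)                            # 16 tokens
print(SparseMoE()(x).shape)                         # torch.Size([16, 512])
```

Because each token activates only k of the n experts, total parameters can grow without a proportional rise in per-token compute, which is the efficiency gain the paragraph describes.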

Third, security and ethics. The internal workings of large models are complex, and the content they produce lacks explainability and controllability while remaining vulnerable to adversarial-example attacks, creating regulatory problems and security holes. As the industry develops broadly, the safety and ethical issues around producing and applying digital content have grown increasingly prominent.

So while enterprises can enjoy the dividends of digital content during this period of rapid growth, they must also verify the security and reliability of their models and avoid the negative publicity that safety and ethical lapses bring. In response to such challenges, the upgraded HUAWEI CLOUD Pangu model is built to meet high standards of data governance and security compliance.

With technology "creating dreams", the future of digital content can be expected

A clear new paradigm brings a clearer technical direction, which means the future shape of the digital content industry is already visible, and the outlook remains optimistic. What cannot be ignored, however, is that the technical bottlenecks objectively exist and are growing more prominent; the industry still faces a fairly hard road ahead.

Even setting the technical bottlenecks aside, producing and applying digital content is not simple in practice. Teams routinely face concrete problems beyond the technology itself, problems that can only be solved one by one before a project finally lands.

When translating a documentary, the HUAWEI CLOUD team found the project running into problems of every kind: ambient sound too cluttered for the AI to reliably pick out the human voice and keep the dubbed speech complete, or characters whose posture and mouth shapes shifted with each scene change, demanding precise matching from the AI.

Left unsolved, these problems would greatly diminish the effect of AI translation. The HUAWEI CLOUD team therefore analyzed each problem precisely and applied different techniques to each subtle issue: separating the ambient sound from the human voice with a separation model, for example, and accurately matching voice to mouth shape with a lip model.
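For the cluttered-ambience problem in particular, the generic technique is audio source separation. As an open-source illustration (not the Huawei team's actual model), Spleeter's pre-trained two-stem model splits a clip into vocals and background; the file names are placeholders.

```python
from spleeter.separator import Separator

# Pre-trained 2-stem model: splits audio into vocals + accompaniment
separator = Separator("spleeter:2stems")
separator.separate_to_file("documentary_clip.wav", "separated/")
# -> separated/documentary_clip/vocals.wav         (clean speech for dubbing)
# -> separated/documentary_clip/accompaniment.wav  (ambient bed to remix later)
```

The cleaned vocal track can then feed recognition and translation, with the ambient bed mixed back under the dubbed voice.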

The technology may have been there all along, but knowing how to use it is exactly what determines whether a project lands. In other words, at today's stage of rapid industry growth and accelerating technical iteration, only practice truly pushes the industry forward. This is a contest fought project by project, and the more project experience a team accumulates, the better it knows how to apply the relevant technology to full effect.

China Film Group and HUAWEI CLOUD have already partnered to bring the media model into the film and television industry, jointly building a large model for film and television translation. It uses AI to translate videos into different languages while preserving the original characters' timbre, emotion, and tone, supports lip matching, and offers a new AI production method for film translation.

Digital content is now bursting forth, and behind the ever more impressive results lies a process of vendors continually applying, validating, and improving the technology. The road of future projects is long, and it is also the only way for the industry to mature. As technology keeps upgrading and improving through innovation and practice, we will all the sooner see a dreamlike world of wonderful digital content.
