
The visual style is distinctly "Kuaishou" and the application prospects are unclear: Kling will struggle to become Kuaishou's "panacea"

AI Large Model Workshop 2024/07/03 16:56


Written by Iced Latte

Edited by Fang Qi

Media | AI Large Model Workshop

Under the banner of "China's answer to Sora", Kuaishou has moved quickly into the text-to-video large-model race. On June 6, the official website of its video generation model "Kling" went live, offering video resolution of up to 1080p and clips of up to 2 minutes at a frame rate of 30fps. On June 21, Kling doubled down and announced the launch of an image-to-video feature.

Once the open beta began, industry insiders and curious onlookers quickly poured in and applied for access through Kuaishou's creative tool, the KuaiYing app. Beyond problems with semantic understanding, "hard flaws" such as generated footage that violates the laws of physics and poor realism were frequently mentioned, alongside complaints that "the texture of the visuals is hard to describe", "the aesthetics are off", and "it has too strong a Kuaishou flavor".

In short, just as Kuaishou once rose abruptly by betting on the lower-tier ("sinking") market, it is now showing a strikingly similar temperament in the AI world: the fundamentals are not bad, but it still struggles to escape the "sinking" label and the fate that comes with it.

1. Weak links abound, from semantic understanding to visual texture: can Kling escape the "sinking" label?

In terms of architecture, Kling has kept pace with Sora. According to the Kuaishou large-model team, it adopts a DiT (Diffusion Transformer) structure similar to the Sora model, replacing the convolution-based U-Net of traditional diffusion models with a Transformer. This is also the mainstream trend in today's video generation field: in recent years, U-Net-based diffusion models have been shown to struggle with complex instructions, whereas the Diffusion Transformer has clear advantages in handling large-scale visual data and can generate more complex and coherent video content.
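To make the architectural shift concrete, below is a minimal, illustrative PyTorch sketch of a DiT-style Transformer block that conditions on the diffusion timestep through adaptive layer normalization. It follows the general DiT recipe described in public research, not Kuaishou's or OpenAI's actual code; all dimensions and names are assumptions chosen for demonstration.

```python
# Illustrative sketch only: a DiT-style block that conditions Transformer layers on
# the diffusion timestep, instead of relying on a convolutional U-Net backbone.
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    """One Transformer block whose normalization is modulated by the
    diffusion timestep embedding (adaptive LayerNorm, as in the DiT paper)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Produces per-layer scale/shift/gate parameters from the timestep embedding.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patch_tokens, dim); t_emb: (batch, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaLN(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x


# A video clip is split into spatio-temporal patches ("tokens"), so the same block
# scales to long frame sequences -- the property the article credits to DiT.
tokens = torch.randn(2, 256, 512)   # (batch, patch tokens, hidden dim)
t_emb = torch.randn(2, 512)         # timestep embedding
out = DiTBlock(dim=512, num_heads=8)(tokens, t_emb)
```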

On this basis, Kling's overall performance is not bad; however, as the skill competition goes further, its shortcomings are gradually being exposed.

First, at the level of semantic understanding: under the discussion topic "how to view the explosion of the Chinese version of Sora", one netizen reported that entering "a giant panda happily eating zongzi" produced a panda eating dumplings. In another case, a user trying to generate a cat dragon-boat race entered "a group of cats sitting in a dragon boat", and the resulting video contained no cats at all, only people.


Behind this lies Kling's insufficient semantic comprehension and detail capture: whether it is failing to distinguish "human" from "cat" or confusing "zongzi" with "dumplings", Kling shows bias at the semantic level and cannot accurately capture the key information in the input description. Especially when handling unconventional or domain-specific objects, there is still room for improvement in semantic parsing.

Going a step further back, when constructing video scenes, Kling may be limited by its training data and algorithmic capability, and so cannot accurately translate text descriptions into visual content that meets expectations:

During training, the dataset Kling relied on may have lacked sufficient data for specific scenarios such as "dragon boat racing", leaving the model unable to learn and generate the relevant footage accurately. In addition, the training strategy may not have been optimized enough for detail, so the model could not fully learn the differences and characteristics between objects such as "humans" and "cats".
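For intuition about why prompt terms like "cats" or "zongzi" can get lost, here is a generic, hypothetical sketch of how text-to-video diffusion models commonly inject the prompt through cross-attention. It is not Kling's actual pipeline; the class, shapes, and names are purely illustrative.

```python
# Generic sketch (not Kling's code): the prompt conditions generation via
# cross-attention, so the visual tokens can only express concepts that the text
# encoder represents well and that the training data actually covers.
import torch
import torch.nn as nn


class CrossAttentionConditioning(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_patches, dim) -- the video latent being denoised
        # text_tokens:   (batch, n_words, dim)   -- prompt embeddings from a text encoder
        attended, _ = self.attn(query=self.norm(visual_tokens),
                                key=text_tokens, value=text_tokens)
        return visual_tokens + attended


# If "cats rowing a dragon boat" is rare in the training data, or the text encoder
# maps "zongzi" close to "dumplings", the keys/values carry the wrong signal and the
# generated frames drift toward the more common concept.
visual = torch.randn(1, 256, 512)
prompt = torch.randn(1, 12, 512)
conditioned = CrossAttentionConditioning(512)(visual, prompt)
```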

In another example, according to an evaluation by Daily Economic News, some generated videos contained many "failure" moments: a guitar-playing panda had human fingers, and a prompt specifying a "light green fabric sofa" produced a reddish-brown leather sofa. Likewise, in some videos with multiple subjects, certain elements were not fully rendered.


In fact, behind the one minute of text-to-video showmanship on stage lies the "ten years of practice off stage": accumulated training. That is also why, under the same architecture, the videos Kling generates still contain so many "bugs".

Earlier, Luo Jiangchun, founder of Glance Technology, publicly stated that the biggest challenge facing domestic generative video models is essentially the gap in underlying capabilities, spanning data, models, and compute: "We have the ability to catch up with Sora's results today, but by the time we catch up, Sora will have taken another big step forward, and this gap will persist for a long time."

Beyond these hard flaws, Kling's visual style is the most criticized aspect. Given the same prompt, the contrast between the styles generated by Kling and Sora is "indescribable".

Take the video that made Sora go viral; its prompt reads: "A stylish woman walks down a Tokyo street filled with warm glowing neon lights and animated city signs, carrying a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of colorful lights, and many pedestrians move around."

Some netizens fed the same prompt to Kling, and the generated video came out looking extremely "Kuaishou":

The heroine, mouth slightly crooked, strides along as if she recognizes no one, wearing an outfit whose individual pieces look normal but together make no sense, carrying herself like a small-town "spiritual little sister" marching into the city to collect a debt, with a guy in leggings trailing behind her; the whole street, too, has a strong urban-rural-fringe feel. It makes people want to caption it with a "society quote" such as "Exquisite little bag in my arms, driving my little Jetta" or "Big sis walks like Zhao Si jumping rubber bands".


On social platforms, many netizens have also commented that "the generated style looks dated", "it's a bit tacky", and "this really is something made by Kuaishou; it has that Kuaishou flavor".

In the final analysis, the quality and diversity of the dataset directly affect the model's output: if the training data contains large amounts of low-quality or stylistically monotonous images and videos, and lacks samples of modern, fashionable, or specific artistic styles, the model will struggle to learn high-quality, diverse styles and will find it hard to break out of its ingrained stylistic frame when generating.

At the same time, during generation the model may lack sufficient constraints to ensure stylistic consistency, detail richness, and overall aesthetics in its output; the optimization algorithm may also fail to fully explore the generative space, producing mediocre or monotonous results.
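As a rough illustration of how dataset curation shapes output style, the hypothetical sketch below filters training clips by an aesthetic score and caps how much any single "look" can dominate. The Clip fields, scores, and thresholds are invented for demonstration and do not describe Kuaishou's actual data pipeline.

```python
# Purely illustrative: one common way labs curate generative-model training data is
# to score each clip for aesthetics/style and keep only a balanced, high-scoring subset.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Clip:
    path: str
    aesthetic_score: float   # e.g. output of a learned aesthetic predictor, 0-10
    style_tag: str           # e.g. "cinematic", "vlog", "animation"


def curate(clips: List[Clip], min_score: float = 6.0,
           max_share_per_style: float = 0.3) -> List[Clip]:
    """Drop low-scoring clips and cap any single style so the model does not
    inherit one dominant 'look' from the raw data."""
    kept = [c for c in clips if c.aesthetic_score >= min_score]
    budget = max(1, int(max_share_per_style * len(kept)))
    by_style: Dict[str, int] = {}
    curated: List[Clip] = []
    for clip in sorted(kept, key=lambda c: c.aesthetic_score, reverse=True):
        if by_style.get(clip.style_tag, 0) < budget:
            curated.append(clip)
            by_style[clip.style_tag] = by_style.get(clip.style_tag, 0) + 1
    return curated
```

If the raw pool is dominated by one kind of footage, no amount of architecture can keep that "flavor" out of the generations, which is the article's point about why Kling's output leans so strongly toward a Kuaishou look.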

For example, MIT Technology Review reported that Guizang, an AI artist in Beijing, said Kling's weakness lies in the aesthetics of its results, such as composition and color grading: "But it's not a big problem. This can be fixed very quickly."

Granted, for now there is no harm where there is no comparison: as one of the first text-to-video models in China to open for public testing, Kling's problems seem forgivable. But on a track crowded with contenders, it will be hard for Kling to stay "dominant" for long.

2. Surrounded by contenders, more problems exposed: hard to become a "panacea"

Today there is no shortage of strong players in text-to-video. After Sora set the world alight in February, products in this field have sprung up across the board, and many are only a step away from opening to the public.

In April, Shengshu Technology released the text-to-video model Vidu, which can generate 16-second high-definition video at up to 1080p resolution directly from a text description. In May, Tencent said its Hunyuan model, based on the DiT architecture, supports a variety of generation capabilities, including text-to-video, image-to-video, text-and-image-to-video, and video-to-video. In June, Greatstar Technology and Tsinghua University jointly released "YiSu", billed as China's first Sora-level video generation model, with a native 16-second duration and the ability to generate videos longer than one minute...

As the technology continues to improve, opening to the public is gradually coming onto the agenda. On June 30, Runway opened Gen-3 access to some users; on July 2, Runway announced that its text-to-video model Gen-3 Alpha was available to all users, starting at $12 per month.

As more players lift the veil, the halo of being "the first open beta" over Kling's head will fade, and only then will the contest of real skill begin. At the same time, what matters for Kuaishou is not how strong its text-to-video capability is, but how to combine it with its commercial footprint and push real-world applications.

In the eyes of outsiders, as a short-video platform Kuaishou naturally has fertile ground for deployment: it can fold Kling into its creator ecosystem to further boost its content business. According to Kuaishou's Q1 2024 financial report, monthly active users during the period averaged 697 million, down 0.4% from the 700.4 million of Q4 2023, showing signs of user attrition.

However, AI creation is not a content "panacea": for users, watching AI-made short videos is mostly about novelty and spectacle, while real stickiness still comes from real streamers and high-quality content.

At the same time, as Kling opens up fully, even while it greatly lowers the overall cost and threshold of short-video production, it may also spawn more low-quality, bottom-scraping content. Some videos could even be mass-produced, abused, and maliciously spread, becoming tools for criminals to commit telecom fraud, online pyramid schemes, and extortion, making platform moderation harder.

Kuaishou clearly understands this. In June, Kuaishou E-commerce issued an announcement on proactively using AIGC capabilities in livestreaming, saying: "We prefer to see real, live content and encourage merchants and influencers to interact with their fans ('laotie') in real time and build deeper emotional bonds; low-quality content produced by deliberately exploiting AIGC's low-cost advantage is a form of content production the platform does not want to see." Accordingly, "compared with other real-time live content, the platform will not give special traffic support to content created with AIGC capabilities."

In fact, the consumer (C) side provides the spectacle, but the business (B) side is where the real substance lies: the true landing ground for text-to-video models remains the industry side. Overseas, for example, Sora has been connected to mainstream large language models; by learning the text structure of popular videos, such a system can generate copy and scripts suited to a merchant's products, automatically match the product materials the merchant provides, and produce a video in one click.

In China, the multimodal capabilities of Huawei's Pangu Model 5.0 include video generation technology that has already been deployed in industry: according to Zhang Ping'an, Huawei executive director and CEO of Huawei Cloud, Huawei has applied video generation to autonomous-driving training. In June, Dream AI officially announced that it would serve as chief AI technology partner for "Sanxingdui: Future Apocalypse", an AIGC sci-fi short series produced by Bona Film; built on Doubao large-model technology, Dream AI provides ten AI capabilities, including AI script creation and shot generation.

This also offers Kling a reference path for application. Clearly, in the leap from "Kling" to "AI panacea", there is still some way to go.
