
Tencent Hunyuan Open-Sources Image Model 2.1 With Native 2K Support


Late in the evening of September 9th, Tencent released and open-sourced Hunyuan's latest image generation model, HunyuanImage 2.1. The model offers industry-leading overall capability and supports native 2K high-definition image generation. The Tencent Hunyuan team also revealed that a native multimodal image generation model is coming soon.

HunyuanImage 2.1 is a fully open-source foundational model. The model weights and code have been officially released on open-source platforms such as Hugging Face and GitHub, so individual and enterprise developers can conduct research on the foundational model or build derivative models and plugins on top of it.

Since its open-source release, HunyuanImage 2.1 has climbed rapidly on Hugging Face's model trending list, reaching third place globally; three of the top eight models on the list come from the Tencent Hunyuan model family.

HunyuanImage 2.1 is reportedly a comprehensive upgrade of the 2.0 architecture, focused on balancing generation quality with performance. The new version supports native Chinese and English input and can render high-quality Chinese and English text within images while handling complex semantics. The overall aesthetics of generated images and the range of applicable scenarios have also improved significantly, so visual creators such as designers and illustrators can turn their ideas into visuals more efficiently. Whether generating high-fidelity creative illustrations, producing posters and packaging designs with Chinese and English slogans, or composing complex four-panel comics and comic strips, HunyuanImage 2.1 provides fast, high-quality support.

Thanks to a larger image-text alignment dataset, HunyuanImage 2.1 is markedly better at understanding complex semantics and generalizing across domains. It supports prompts of up to 1,000 tokens, accurately renders scene details, character expressions, and actions, and allows multiple objects to be described and controlled separately. It also enables precise control of text within images, so text blends naturally with the visuals.

Hunyuan Image 2.1 has three key highlights.

Highlight 1: The model has a strong understanding of complex semantics, supporting the separate description and accurate generation of multiple subjects.

Highlight 2: It provides more stable control over text and scene details in images.

Highlight 3: It supports a wide range of styles, including real-life, comic, and vinyl figures, and exhibits a high level of aesthetic appeal.

Based on Structured Semantic Alignment Evaluation (SSAE) results, HunyuanImage 2.1 currently achieves the best semantic alignment among open-source models and comes very close to the closed-source commercial model GPT-Image. Furthermore, Good/Same/Bad (GSB) evaluation results show that HunyuanImage 2.1's image generation quality is comparable to that of the closed-source commercial model Seedream 3.0 and slightly better than the comparable open-source model Qwen-Image.

The HunyuanImage 2.1 model not only trains on massive amounts of data but also leverages structured captions of varying lengths and content, significantly improving its ability to understand text descriptions. The caption model incorporates OCR and IP RAG expert models, effectively enhancing its ability to recognize complex text and draw on world knowledge.

To significantly reduce computational complexity and improve training and inference efficiency, the model employs a VAE with an ultra-high 32x spatial compression ratio, and uses DINOv2 feature alignment and a REPA loss to ease training. As a result, the model can efficiently generate 2K images natively.
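To see why the compression ratio matters, here is a back-of-the-envelope sketch of how many spatial tokens the diffusion backbone must process for a native 2K image. The 8x baseline used for comparison is an assumption (it is the spatial compression typical of earlier latent diffusion VAEs), not a figure from the announcement:

```python
def latent_tokens(image_size: int, vae_ratio: int) -> int:
    """Spatial tokens in the latent grid for a square image.

    A VAE with spatial compression ratio r maps an (S x S) image to an
    (S/r x S/r) latent grid; each latent position becomes one token.
    """
    side = image_size // vae_ratio
    return side * side

# A native 2K (2048 x 2048) image:
tokens_32x = latent_tokens(2048, 32)  # 64 * 64 = 4,096 tokens
tokens_8x = latent_tokens(2048, 8)    # 256 * 256 = 65,536 tokens (hypothetical 8x VAE)

print(tokens_32x, tokens_8x, tokens_8x // tokens_32x)
```

Since attention cost grows quadratically with token count, the 16x reduction in tokens is what makes native 2K generation tractable.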

For text encoding, Hunyuan Image 2.1 features a dual text encoder: an MLLM module to further improve image-text alignment, and a ByT5 model to enhance text generation. The overall architecture is a single-/dual-stream DiT model with 17B parameters.
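The announcement does not spell out how the two text streams are fused. A common pattern for dual-encoder conditioning, sketched below with toy NumPy arrays, is to project each encoder's output to the backbone's hidden width and concatenate along the token dimension; all shapes and names here are illustrative assumptions, not the actual HunyuanImage 2.1 configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def project(seq: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Linearly project a (tokens, dim) sequence to the backbone width."""
    return seq @ w

d_model = 64                            # hypothetical DiT hidden width
mllm_out = rng.normal(size=(77, 96))    # hypothetical MLLM text features
byt5_out = rng.normal(size=(128, 32))   # hypothetical ByT5 byte-level features

w_mllm = rng.normal(size=(96, d_model))
w_byt5 = rng.normal(size=(32, d_model))

# Fuse: project both streams to d_model, then concatenate along tokens,
# giving one conditioning sequence for the DiT's cross-attention.
context = np.concatenate(
    [project(mllm_out, w_mllm), project(byt5_out, w_byt5)], axis=0
)
print(context.shape)  # (205, 64)
```

The MLLM stream carries semantic alignment while the byte-level ByT5 stream preserves exact character identity, which is what helps the model spell text in images correctly.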

In addition, HunyuanImage 2.1 resolves the training-stability issues of the MeanFlow approach at the 17B-parameter scale, distilling inference from 100 steps down to 8 and significantly improving inference speed while preserving the original model's quality.
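The intuition behind MeanFlow-style distillation is that a model predicting the *average* velocity over an interval can take one large, exact step where an instantaneous-velocity sampler needs many small ones. The toy 1D ODE below (dx/dt = -x, with a known closed-form solution) only illustrates that principle; it is not HunyuanImage's actual distillation objective:

```python
import math

def instantaneous_v(x: float) -> float:
    """Instantaneous velocity field of the toy ODE dx/dt = -x."""
    return -x

def euler(x0: float, steps: int) -> float:
    """Integrate over [0, 1] with many small Euler steps (the undistilled sampler)."""
    x, dt = x0, 1.0 / steps
    for _ in range(steps):
        x += dt * instantaneous_v(x)
    return x

def mean_v(x0: float) -> float:
    """Average velocity over [0, 1], from the known solution x(t) = x0 * exp(-t).

    In MeanFlow distillation, a network learns this quantity instead of the
    instantaneous velocity.
    """
    return x0 * (math.exp(-1.0) - 1.0)

x0 = 1.0
exact = x0 * math.exp(-1.0)
one_step = x0 + 1.0 * mean_v(x0)  # one step with the mean velocity lands exactly
many_steps = euler(x0, 100)       # 100 Euler steps only approximate the target

print(exact, one_step, many_steps)
```

A single mean-velocity step is exact here, while 100 instantaneous Euler steps still carry discretization error; in practice a few distilled steps (8, per the announcement) are used to recover fine detail.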

The simultaneously open-sourced PromptEnhancer is billed as the industry's first systematic, industrial-grade Chinese-English prompt rewriting model. It restructures user text prompts and enriches their visual detail, significantly improving how faithfully generated images express the intended semantics. Officials stated that the newly released native 2K model achieves a better balance between quality and performance, meeting the diverse needs of users and businesses across visual scenarios.
