Journal of Textile Research ›› 2024, Vol. 45 ›› Issue (07): 165-172. DOI: 10.13475/j.fzxb.20230704201

• Apparel Engineering •

Single dress image video synthesis based on pose embedding and multi-scale attention

LU Yinwen1, HOU Jue1,2,3, YANG Yang1,2, GU Bingfei1, ZHANG Hongwei4, LIU Zheng2,3,5()   

1. School of Fashion Design & Engineering, Zhejiang Sci-Tech University, Hangzhou, Zhejiang 310018, China
    2. Apparel Engineering Research Center of Zhejiang, Hangzhou, Zhejiang 310018, China
    3. Key Laboratory of Silk Culture Inheritance and Product Design Digital Technology, Ministry of Culture and Tourism, Zhejiang Sci-Tech University, Hangzhou, Zhejiang 310018, China
    4. School of Electronic Information, Xi'an Polytechnic University, Xi'an, Shaanxi 710043, China
    5. School of International Education, Zhejiang Sci-Tech University, Hangzhou, Zhejiang 310018, China
  • Received: 2023-07-18; Revised: 2024-01-08; Online: 2024-07-15; Published: 2024-07-15
  • Contact: LIU Zheng, E-mail: koala@zstu.edu.cn

Abstract:

Objective Video generation from a single dress image has important applications in virtual try-on and 3-D reconstruction. However, existing methods suffer from incoherent motion between generated frames, poor video quality, and missing clothing details. To address these issues, a generative adversarial network model based on a pose embedding mechanism and multi-scale attention links is proposed.

Method A generative adversarial network (EBDGAN) based on a pose embedding mechanism and multi-scale attention was proposed. A pose embedding method was adopted to model the motion between adjacent frames and improve the coherence of motion in the generated video, and attention links were added at every resolution scale of the features to improve feature decoding efficiency and the fidelity of the generated frames. Human parsing images were used during training to improve the clothing accuracy of the synthesized images.
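The paper does not provide source code; the sketch below is a hypothetical PyTorch illustration of the two ideas named in the Method: a pose embedding block that encodes the poses of two adjacent frames into a compact motion code, and an attention-gated skip link applied at one decoder scale. All module names, channel sizes, and layer choices are assumptions for illustration, not the authors' EBDGAN architecture.

# Hypothetical sketch (not the authors' code): pose embedding of adjacent
# frames plus an attention-gated skip connection at a single scale.
import torch
import torch.nn as nn

class PoseEmbedding(nn.Module):
    """Encodes concatenated pose heatmaps of frames t-1 and t into one code."""
    def __init__(self, num_keypoints=18, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * num_keypoints, 64, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),          # -> (B, embed_dim, 1, 1)
        )

    def forward(self, pose_prev, pose_curr):
        x = torch.cat([pose_prev, pose_curr], dim=1)
        return self.encoder(x).flatten(1)     # (B, embed_dim)

class AttentionSkip(nn.Module):
    """Attention link for one resolution scale: encoder features are
    re-weighted by a sigmoid gate before being added to decoder features."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, enc_feat, dec_feat):
        attn = self.gate(torch.cat([enc_feat, dec_feat], dim=1))
        return dec_feat + attn * enc_feat

if __name__ == "__main__":
    # Quick shape check with dummy tensors
    code = PoseEmbedding()(torch.randn(2, 18, 64, 64), torch.randn(2, 18, 64, 64))
    out = AttentionSkip(256)(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
    print(code.shape, out.shape)  # torch.Size([2, 128]) torch.Size([2, 256, 32, 32])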

Results The learned perceptual image patch similarity (LPIPS) and peak signal-to-noise ratio (PSNR) values indicated that the results generated by EBDGAN were closer to the original video in terms of color and structure. The motion vector (MV) metric showed that the video generated by EBDGAN from a single image exhibited smaller motion between adjacent frames and higher inter-frame similarity, making the overall video more stable. Although the structural similarity index metric (SSIM) score was slightly lower than that of CASD, the proposed method was more efficient because it only requires image and pose information as input. In frames where the characters were far from the camera, EBDGAN retained the details of hair and shoes; in frames where the characters were closer to the camera, EBDGAN retained the collar and hem of the front garment, such as the collar in the left image of the second row and the hem of the clothing on the right. When the characters in the video turned around, EBDGAN did not produce strange poses or lose body parts, but instead generated a more reasonable body shape. The ablation experiments showed that the complete model efficiently exploits the pose information and features of the input image to guide video generation, and that removing any network component degrades model performance. The results of EBDGAN-1 indicated that multi-scale attention links help the network generate images with a more reasonable distribution. The MV of EBDGAN-2 suggested that adding the pose embedding module reduces the relative motion between adjacent frames, resulting in higher video stability.
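The evaluation relies on standard full-reference image metrics (SSIM, PSNR, LPIPS) plus a motion statistic between adjacent frames. As an illustration only, the snippet below computes SSIM and PSNR with scikit-image and uses a simple mean absolute frame difference as a stand-in for the MV statistic; the paper's exact MV definition and the LPIPS computation (available via the lpips package) are not reproduced here.

# Illustrative metric computation, not the paper's evaluation script.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def frame_metrics(generated, reference):
    """generated, reference: uint8 arrays of shape (T, H, W, 3)."""
    ssim_vals, psnr_vals = [], []
    for gen, ref in zip(generated, reference):
        ssim_vals.append(structural_similarity(gen, ref, channel_axis=-1))
        psnr_vals.append(peak_signal_noise_ratio(ref, gen))
    return float(np.mean(ssim_vals)), float(np.mean(psnr_vals))

def motion_proxy(frames):
    """Mean absolute difference between adjacent frames, scaled to [0, 1];
    a simple proxy for an adjacent-frame motion statistic."""
    diffs = np.abs(np.diff(frames.astype(np.float32) / 255.0, axis=0))
    return float(diffs.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_a = rng.integers(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
    video_b = rng.integers(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
    print(frame_metrics(video_a, video_b))
    print(motion_proxy(video_a))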

Conclusion This article proposes a method for generating videos from a single image based on a pose embedding mechanism and multi-scale attention links. The method uses the pose embedding module (EBD) to model the poses of adjacent frames in the time series, reducing the number of parameters while ensuring coherent motion between adjacent frames. Multi-scale attention links improve the efficiency of feature extraction and thereby further improve the quality of the generated video, and using human parsing images as auxiliary input enhances the representation of character clothing. The proposed method was experimentally validated on a public dataset, achieving an SSIM of 0.855, LPIPS of 0.162, PSNR of 20.89, and MV of 0.108 4. The ablation experiments prove that the proposed modules help the network achieve better performance in the video generation task, and comparative experiments show that the proposed method generates more stable videos with more realistic character details.

Key words: generative adversarial network, video synthesis, deep learning, pose embedding, attention mechanism, dressing image, virtual try-on

CLC Number: TS942.8

Fig.1  Network structure of EBDGAN

Fig.2  Network structure of pose embedding

Tab.1  Quantitative results of different models

Model      SSIM↑    LPIPS↓    PSNR↑    MV↑
PISE[19]   0.847    0.187     20.60    0.096 8
PATN[9]    0.851    0.171     20.63    0.094 2
CASD[20]   0.863    0.176     20.72    0.101 0
EBDGAN     0.855    0.162     20.89    0.108 4

Fig.3  Qualitative results of each model. (a) Target poses; (b) Results of PISE; (c) Results of PATN; (d) Results of CASD; (e) Results of EBDGAN; (f) Ground truth

Tab.2  Quantitative results of ablation experiments

Method     SSIM↑    LPIPS↓    PSNR↑    MV↑
EBDGAN-0   0.830    0.183     20.45    0.080 7
EBDGAN-1   0.846    0.168     20.52    0.097 0
EBDGAN-2   0.847    0.177     20.16    0.104 1
EBDGAN     0.855    0.162     20.89    0.108 4

Fig.4  Qualitative results of ablation experiments. (a) Target poses; (b) Results of EBDGAN-0; (c) Results of EBDGAN-1; (d) Results of EBDGAN-2; (e) Results of EBDGAN; (f) Ground truth

[1] 张颖, 刘成霞. 生成对抗网络在虚拟试衣中的应用研究进展[J]. 丝绸, 2021, 58(12):63-72.
ZHANG Ying, LIU Chengxia. Research progress on the application of generative adversarial network in virtual fitting[J]. Journal of Silk, 2021, 58(12): 63-72.
[2] 王晨麟, 赵正, 张涛, 等. 面向微运动视频的三维重建[J]. 计算机系统应用, 2022, 31(7):298-306.
WANG Chenglin, ZHAO Zheng, ZHANG Tao, et al. Three-dimensional reconstruction for micro-motion videos[J]. Computer Systems and Applications, 2022, 31(7): 298-306.
[3] CAI H, BAI C, TAI Y, et al. Deep video generation, prediction and completion of human action sequences[C]// Proceedings of the European Conference on Computer Vision (ECCV). Berlin: Springer-Verlag, 2018: 366-382.
[4] VILLEGAS R, YANG J, HONG S, et al. Decomposing motion and content for natural video sequence prediction[C]// 5th International Conference on Learning Representations (ICLR). Addis Ababa: International Conference on Learning Representations, 2017: 1-22.
[5] WALKER J, MARINO K, GUPTA A, et al. The pose knows: video forecasting by generating pose futures[C]// Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE Communications Society, 2017: 3332-3341.
[6] MATHIEU M, COUPRIE C, LECUN Y. Deep multi-scale video prediction beyond mean square error[C]// 4th International Conference on Learning Representations (ICLR). Addis Ababa: International Conference on Learning Representations, 2016:1-14.
[7] HU Q, WALCHLI A, PORTENIER T, et al. Learning to take directions one step at a time[C]// 2020 25th International Conference on Pattern Recognition (ICPR). Montreal: IEEE Communications Society, 2020:1-8.
[8] DONG H, LIANG X, SHEN X, et al. FW-GAN: flow-navigated warping GAN for video virtual try-on[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE Communications Society, 2019: 1161-1170.
[9] ZHU Z, HUANG T, SHI B, et al. Progressive pose attention transfer for person image generation[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2019: 2347-2356.
[10] ZHAO Q, ZHENG C, LIU M, et al. PoseFormerV2: exploring frequency domain for efficient and robust 3D human pose estimation[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2023: 8877-8886.
[11] ISOLA P, ZHU J Y, ZHOU T, et al. Image-to-image translation with conditional adversarial networks[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2017: 1125-1134.
[12] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2016: 770-778.
[13] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[C]// 3rd International Conference on Learning Representations (ICLR). Addis Ababa: International Conference on Learning Representations, 2015: 1-14.
[14] LI P, XU Y, WEI Y, et al. Self-correction for human parsing[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 44(6): 3260-3271.
[15] LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[C]// International Conference on Learning Representations. Addis Ababa: International Conference on Learning Representations, 2018:23-44.
[16] WANG Z, BOVIK A C, SHEIKH H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612. DOI: 10.1109/TIP.2003.819861; PMID: 15376593.
[17] ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2018: 586-595.
[18] HUYNH Q, GHANBARI M. Scope of validity of PSNR in image/video quality assessment[J]. Electronics Letters, 2008, 44(13): 800-801.
[19] ZHANG J, LI K, LAI Y K, et al. PISE: person image synthesis and editing with decoupled GAN[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Communications Society, 2021: 7982-7990.
[20] ZHOU X, YIN M, CHEN X, et al. Cross attention based style distribution for controllable person image synthesis[C]// Computer Vision-ECCV 2022: 17th European Conference. Berlin: Springer-Verlag, 2022: 161-178.