TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
Published on arXiv, under review at a CCF-A conference, 2025
TRivia is a self-supervised fine-tuning method that enables pretrained vision-language models to learn table recognition directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism.
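To make the two ingredients named above concrete, here is a minimal Python sketch of a question-answering-based reward combined with the group-relative advantage used in Group Relative Policy Optimization. It is an illustration under stated assumptions, not TRivia's implementation: the function names `qa_reward` and `group_relative_advantages`, the exact-match scoring rule, and the `eps` constant are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def qa_reward(predicted_answers, reference_answers):
    """Hypothetical reward: fraction of questions answered correctly.

    Assumes each question about a table image has a reference answer, and that
    `predicted_answers` were obtained by answering the same questions from the
    model's recognized table. Matching is case-insensitive exact match.
    """
    if not reference_answers:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predicted_answers, reference_answers)
    )
    return hits / len(reference_answers)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampling group.

    Uses the population standard deviation; implementations differ on this
    detail. `eps` (an assumed value) avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for a group of three sampled recognitions of the same table image.
rewards = [
    qa_reward(["3", "Total"], ["3", "total"]),   # both answers match
    qa_reward(["3", "Sum"], ["3", "total"]),     # one of two matches
    qa_reward(["4", "Sum"], ["3", "total"]),     # none match
]
advantages = group_relative_advantages(rewards)
```

In this sketch, a group whose sampled outputs all receive the same reward gets zero advantage for every output and therefore carries no learning signal. Whether TRivia's sample selection relies on this property is not stated in the summary above.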
Recommended citation: Junyuan Zhang, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen, Jialin Lu, Junjie Shan, Ziqi Zhao, Shuya Yang, Ziling Wang, Ziyang Miao, Huaping Zhong, Yuhang Zang, Xiaoyi Dong, Ka-Ho Chow, and Conghui He (2025). "TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition." arXiv.
Download Paper
