Xiangxi Shi

ADOPD: A Large-Scale Document Page Decomposition Dataset

Jiuxiang Gu, Xiangxi Shi, Jason Kuen, Lu Qi, Ruiyi Zhang, Anqi Liu, Ani Nenkova, Tong Sun

ICLR 2024 (Poster)

Research in document image understanding is hindered by the scarcity of high-quality document data. ADOPD is a large-scale dataset for document page decomposition. It applies a data-driven document taxonomy discovery process during collection and provides dense annotations. The dataset supports four tasks: Doc2Mask, Doc2Box, Doc2Tag, and Doc2Seq.

  • Data-driven taxonomy discovery during collection to improve diversity and balance.
  • Dense annotations, including entity masks and text bounding boxes, plus tags and captions cleaned with human involvement.
  • Four benchmark tasks: Doc2Mask / Doc2Box / Doc2Tag / Doc2Seq.

BibTeX

@inproceedings{
  gu2024adopd,
  title={{AD}o{PD}: A Large-Scale Document Page Decomposition Dataset},
  author={Jiuxiang Gu and Xiangxi Shi and Jason Kuen and Lu Qi and Ruiyi Zhang and Anqi Liu and Ani Nenkova and Tong Sun},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=x1ptaXpOYa}
}