Resources

- Corpora
  1. TGEA 2.0: A Large-Scale Diagnostically Annotated Dataset with Benchmark Tasks for Text Generation of Pretrained Language Models
    • paper: [pdf]
    • dataset and code:
  2. BiPaR: A bilingual machine reading comprehension (MRC) dataset on novels [Jing et al. 2019]
  3. Dataset for Shallow Discourse Annotation for Chinese TED Talks
  4. A Test Suite for Evaluating Discourse Phenomena in Document-level Neural Machine Translation
  5. RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich Semantic Annotations for Task-Oriented Dialogue Modeling
  6. TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talks
  7. Chinese WPLC: A Chinese Dataset for Evaluating Pretrained Language Models on Word Prediction Given Long-Range Context

- Code