- Corpora
- TGEA 2.0: A Large-Scale Diagnostically Annotated Dataset with Benchmark Tasks for Text Generation of Pretrained Language Models
- paper: [pdf]
- dataset and code:
- BiPaR: A bilingual MRC dataset on novels [Jing et al. 2019]
- Dataset for Shallow Discourse Annotation for Chinese TED Talks.
- A Test Suite for Evaluating Discourse Phenomena in Document-level Neural Machine Translation.
- RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich Semantic Annotations for Task-Oriented Dialogue Modeling.
- TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talks.
- Chinese WPLC: A Chinese Dataset for Evaluating Pretrained Language Models on Word Prediction Given Long-Range Context.
- Code
- Lab repositories:
- Reading Group Schedule: