CSL-2022 (Chinese Scientific Literature)

Introduced by Li et al. in CSL: A Large-scale Chinese Scientific Literature Dataset

We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.

We obtain paper meta-information from the National Engineering Research Center for Science and Technology Resources Sharing Service (NSTR), covering papers dated from 2010 to 2020. We then filter the data by the Catalogue of Chinese Core Journals. Based on the Catalogue and the collected data, we divide academic fields into 13 first-level categories (e.g., Engineering, Science) and 67 second-level disciplines (e.g., Mechanics, Mathematics). In total, we collect 396,209 instances for the CSL dataset, represented as tuples <T, A, K, c, d>, where T is the title, A is the abstract, K is a list of keywords, c is the category label and d is the discipline label. The paper distribution over categories and example disciplines are shown below:

| Category | #d | len(T) | len(A) | num(K) | #Samples | Discipline Examples |
|---|---|---|---|---|---|---|
| Engineering | 27 | 19.1 | 210.9 | 4.4 | 177,600 | Mechanics, Architecture, Electrical Science |
| Science | 9 | 20.7 | 254.4 | 4.3 | 35,766 | Mathematics, Physics, Astronomy, Geography |
| Agriculture | 7 | 17.1 | 177.1 | 7.1 | 39,560 | Crop Science, Horticulture, Forestry |
| Medicine | 5 | 20.7 | 269.5 | 4.7 | 36,783 | Clinical Medicine, Dental Medicine, Pharmacy |
| Management | 4 | 18.7 | 157.7 | 6.2 | 23,630 | Business Management, Public Administration |
| Jurisprudence | 4 | 18.9 | 174.4 | 6.1 | 21,554 | Legal Science, Political Science, Sociology |
| Pedagogy | 3 | 17.7 | 179.4 | 4.3 | 16,720 | Pedagogy, Psychology, Physical Education |
| Economics | 2 | 19.5 | 177.2 | 4.5 | 11,558 | Theoretical Economics, Applied Economics |
| Literature | 2 | 18.8 | 158.2 | 8.3 | 10,501 | Chinese Literature, Journalism |
| Art | 1 | 17.8 | 170.8 | 5.4 | 5,201 | Art |
| History | 1 | 17.6 | 181.0 | 6.0 | 6,270 | History |
| Strategics | 1 | 17.5 | 169.3 | 4.0 | 3,555 | Military Science |
| Philosophy | 1 | 18.0 | 176.5 | 8.0 | 7,511 | Philosophy |
| All | 67 | | | | 396,209 | |
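
The <T, A, K, c, d> tuple described above can be sketched as a small record type. This is only an illustration, not the authors' code: the class name `CSLRecord` and its field names are hypothetical. The title and category come from an example on this page; the abstract, keywords and discipline are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CSLRecord:
    """Hypothetical container for one CSL instance <T, A, K, c, d>."""
    title: str            # T: paper title
    abstract: str         # A: paper abstract
    keywords: List[str]   # K: list of keywords
    category: str         # c: one of the 13 first-level categories
    discipline: str       # d: one of the 67 second-level disciplines


# Title and category are taken from this page; the other values are placeholders.
record = CSLRecord(
    title="正畸牵引联合牙槽外科矫治上颌尖牙埋伏阻生的临床观察",
    abstract="<abstract placeholder>",
    keywords=["<keyword 1>", "<keyword 2>"],
    category="医学",
    discipline="<discipline placeholder>",
)
print(record.category, len(record.keywords))
```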

Evaluation Tasks

We build a benchmark to facilitate the development of NLP for Chinese scientific literature. It contains diverse tasks, ranging from classification to text generation, that represent many practical scenarios. We randomly select 100k samples and split them into training, validation and test sets at a ratio of 0.8 : 0.1 : 0.1. The split is shared across tasks, which allows multitask training and evaluation. Datasets are presented in text2text format.
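
The sampling and splitting step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_indices`, the seed and the use of Python's `random` module are all hypothetical, so it will not reproduce the split shipped with the benchmark. It only shows how one shared 0.8 : 0.1 : 0.1 partition of 100k sampled indices can be derived.

```python
import random
from typing import List, Tuple


def split_indices(
    n_total: int,
    n_benchmark: int = 100_000,
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,  # assumption: the seed used by the authors is not stated
) -> Tuple[List[int], List[int], List[int]]:
    """Sample n_benchmark distinct indices and cut them into train/val/test."""
    rng = random.Random(seed)
    chosen = rng.sample(range(n_total), n_benchmark)  # already in random order
    n_train = round(n_benchmark * ratios[0])
    n_val = round(n_benchmark * ratios[1])
    train = chosen[:n_train]
    val = chosen[n_train:n_train + n_val]
    test = chosen[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(396_209)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because every task is built from the same index lists, a paper that lands in the test set for one task is in the test set for all of them, which is what makes multitask training safe from leakage.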

1. Text Summarization (Title Prediction)

Predict the paper title from the abstract.

Data examples:

  "prompt": "to title",
  "text_a": "多个相邻场景同时进行干涉参数外定标的过程称为联合定标,联合定标能够 \
            保证相邻场景的高程衔接性,能够实现无控制点场景的干涉定标.该文提出了 \
  "text_b": "基于加权最优化模型的机载InSAR联合定标算法"

2. Keyword Generation

Predict a list of keywords from a given paper title and abstract.

Data examples:

  "prompt": "to keywords",
  "text_a": "通过对72个圆心角为120°的双跨偏心支承弯箱梁桥模型的计算分析,以梁 \
            格系法为基础编制的3D-BSA软件系统为结构计算工具,用统计分析的方法建 \
            立双跨偏心支承弯箱梁桥结构反应在使用极限状态及承载能力极限状态下与 \
            桥梁跨长... 偏心支承对120°圆心角双跨弯箱梁桥的影响",
  "text_b": "曲线桥_箱形梁_偏心支承_设计_经验公式"
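
In the example above, the target string joins the keywords with underscores. A generated string can be turned back into a keyword list with a small helper. This is a sketch: the helper name `split_keywords` is hypothetical, and treating `_` as the only separator is an assumption based on this single example.

```python
from typing import List

KEYWORD_SEP = "_"  # separator observed in the example target string


def split_keywords(text_b: str) -> List[str]:
    """Split a text2text keyword target (or model output) into keywords."""
    return [k.strip() for k in text_b.split(KEYWORD_SEP) if k.strip()]


keywords = split_keywords("曲线桥_箱形梁_偏心支承_设计_经验公式")
print(keywords)
```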

3. Category Classification

Predict the category from the paper title (13 classes).

Data examples:

  "prompt": "to category",
  "text_a": "基于模糊C均值聚类的流动单元划分方法——以克拉玛依油田五3中区克下组为例",
  "text_b": "工学"

  "prompt": "to category",
  "text_a": "正畸牵引联合牙槽外科矫治上颌尖牙埋伏阻生的临床观察",
  "text_b": "医学"

4. Discipline Classification

Predict the discipline from the paper abstract (67 classes).

Data examples:

  "prompt": "to discipline",
  "text_a": "某铁矿选矿厂所产铁精矿含硫超过0.3%,而现场为了今后发展的需要,要 \
             求将含硫量降到0.1%以下.为此,针对该铁精矿中硫化物主要以磁黄铁矿 \
  "text_b": "矿业工程"

  "prompt": "to discipline",
  "text_a": "为了校正广角镜头的桶形畸变,提出一种新的桶形畸变数字校正方法.它 \
             使用点阵样板校正的方法,根据畸变图和理想图中圆点的位置关系,得出 \
  "text_b": "计算机科学与技术"
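
The four tasks share the same text2text layout (`prompt`, `text_a`, `text_b`), so one paper can be expanded into four records. The sketch below is illustrative only: the function name `to_text2text` is hypothetical, the input values other than the title and category are placeholders, and the way the title and abstract are combined for keyword generation (abstract followed by the title, separated by a space) is an assumption, since the page does not specify it.

```python
from typing import Dict, List


def to_text2text(
    title: str,
    abstract: str,
    keywords: List[str],
    category: str,
    discipline: str,
) -> List[Dict[str, str]]:
    """Expand one paper into one record per benchmark task."""
    return [
        # 1. Title prediction: abstract -> title
        {"prompt": "to title", "text_a": abstract, "text_b": title},
        # 2. Keyword generation: title and abstract -> keywords joined by "_"
        #    (the concatenation format of text_a is an assumption)
        {"prompt": "to keywords", "text_a": f"{abstract} {title}",
         "text_b": "_".join(keywords)},
        # 3. Category classification: title -> category
        {"prompt": "to category", "text_a": title, "text_b": category},
        # 4. Discipline classification: abstract -> discipline
        {"prompt": "to discipline", "text_a": abstract, "text_b": discipline},
    ]


records = to_text2text(
    title="正畸牵引联合牙槽外科矫治上颌尖牙埋伏阻生的临床观察",
    abstract="<abstract placeholder>",
    keywords=["<keyword 1>", "<keyword 2>"],
    category="医学",
    discipline="<discipline placeholder>",
)
for r in records:
    print(r["prompt"], "->", r["text_b"])
```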

