Asian Scientific Paper Excerpt Corpus (ASPEC)

Japan Science and Technology Agency (JST)
National Institute of Information and Communications Technology (NICT)
The Asian Scientific Paper Excerpt Corpus consists of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). This corpus is one of the achievements of the Japanese-Chinese machine translation project, which ran in Japan from 2006 to 2010.
Available for research purposes
3M parallel sentences and 680K parallel sentences
Cumbersome to download/implement
  • Texts are only useful when training a scientific vertical engine
  • A 4-page license agreement must be signed and sent to the Japan Science and Technology Agency before you receive an email with a download link. The process is smooth, however: an answer can arrive within the day.
  • Beware: the segmentation used is paragraph segmentation!
  • Besides the training data (3M Japanese-English sentence pairs and 672,315 Japanese-Chinese sentence pairs), there are also a DEV set, a DEVTEST set and a TEST set. No information is given on how these three sets were assembled.
  • Beware: delimiter in the training corpus is: |||
Japanese-Chinese corpus:
- Medicine 28%
- Computing 28%
- Biology 14%
- Environment 12%
- Chemistry 6%
- Materials 5%
- Agriculture 4%
- Energy 3%
Japanese-English corpus:
- Medicine 34%
- Physics 10%
- Biology 9%
- Electronic Engineering 7%
- Agriculture 6%
- Computer Science 4%
- Construction 4%
- Chemistry 4%
- Mechanical Engineering 3%
- Chemical Manufacturing 3%
- Metal Engineering 3%
- Space/Earth Science 2%
- Environmental Science 2%
- Systems Engineering 2%
- Thermodynamics 2%
- Engineering 1%
- General Science 1%
- Nuclear Science 1%
- Industrial Engineering 1%
- Chemical Engineering 1%
- Transportation Engineering 1%

This corpus was designed mainly for machine translation in the scientific domain, and can also be used for other NLP applications.

It consists of a Japanese-English scientific paper abstract
corpus (ASPEC-JE) of approximately 3 million parallel sentences and a
Chinese-Japanese scientific paper excerpt corpus (ASPEC-JC) of approximately
0.68 million parallel sentences.
It is divided as follows:

ASPEC/ASPEC-JC (Japanese-Chinese corpus)

- Training Data
    train/train.txt (672,315 sentences)
- Development Data
    dev/dev.txt (2,090 sentences)
- Development-Test Data
    devtest/devtest.txt (2,148 sentences)
- Test Data
    test/test.txt (2,107 sentences)

For all data, the format of each line is:

  Sentence ID ||| Japanese sentence ||| Chinese sentence
JST_JC_ENVI-abst-06A0281759-par1-sen1 ||| C&D管理施設の高度化 ||| C&D管理设施的高度化
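
The three-field layout above can be read with a few lines of Python. This is a minimal sketch, not an official loader: it assumes the files are UTF-8 encoded and that `|||` never occurs inside a sentence. The names `JCPair`, `parse_jc_line` and `read_jc` are hypothetical and introduced only for illustration.

```python
from typing import Iterator, NamedTuple


class JCPair(NamedTuple):
    sentence_id: str
    japanese: str
    chinese: str


def parse_jc_line(line: str) -> JCPair:
    """Split one ASPEC-JC line into its three '|||'-separated fields."""
    parts = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return JCPair(*parts)


def read_jc(path: str) -> Iterator[JCPair]:
    """Yield parsed pairs from a file such as train/train.txt (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_jc_line(line)


# The example line shown above.
sample = "JST_JC_ENVI-abst-06A0281759-par1-sen1 ||| C&D管理施設の高度化 ||| C&D管理设施的高度化"
pair = parse_jc_line(sample)
print(pair.sentence_id, pair.japanese, pair.chinese, sep="\n")
```

Raising on an unexpected field count, rather than silently skipping the line, makes it easier to notice a sentence that happens to contain the delimiter.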

ASPEC/ASPEC-JE (Japanese-English corpus)
- Training Data
    train/train-1.txt (1,000,000 sentence pairs)
    train/train-2.txt (1,000,000 sentence pairs)
    train/train-3.txt (1,008,500 sentence pairs)
- Development Data
    dev/dev.txt (1,790 sentence pairs)
- Development-Test Data
    devtest/devtest.txt (1,784 sentence pairs)
- Test Data
    test/test.txt (1,812 sentence pairs)

In the training data, each line is formatted as follows:

    Similarity score ||| Field symbol - Document ID ||| Sentence ID ||| Japanese sentence ||| English sentence
0.880570409982175 ||| G-03A0568930 ||| 0 ||| 現在,筋ジストロフィー患者の移動介助において文書マニュアルを使用している。 ||| At present, the document manual is used in transfer assistance of the muscular dystrophy patient.
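
A training line in this five-field layout could be parsed as sketched below. The sketch assumes the second field always has the form `<field symbol>-<document ID>` (as in `G-03A0568930`) and that `|||` never occurs inside a sentence. The names `JETrainPair` and `parse_je_train_line`, and the threshold `MIN_SIMILARITY`, are hypothetical; the threshold value is purely illustrative and not a recommendation from the corpus authors.

```python
from typing import NamedTuple


class JETrainPair(NamedTuple):
    similarity: float
    field_symbol: str
    document_id: str
    sentence_id: str
    japanese: str
    english: str


def parse_je_train_line(line: str) -> JETrainPair:
    """Split one ASPEC-JE training line into its five '|||'-separated fields."""
    parts = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    score, field_and_doc, sentence_id, japanese, english = parts
    # Assumption: the field symbol is everything before the first hyphen.
    field_symbol, _, document_id = field_and_doc.partition("-")
    return JETrainPair(float(score), field_symbol, document_id,
                       sentence_id, japanese, english)


# The example line shown above.
sample = (
    "0.880570409982175 ||| G-03A0568930 ||| 0 ||| "
    "現在,筋ジストロフィー患者の移動介助において文書マニュアルを使用している。 ||| "
    "At present, the document manual is used in transfer assistance of "
    "the muscular dystrophy patient."
)
pair = parse_je_train_line(sample)

MIN_SIMILARITY = 0.5  # hypothetical cut-off, chosen only for this example
kept = [p for p in [pair] if p.similarity >= MIN_SIMILARITY]
print(pair.field_symbol, pair.document_id, pair.similarity, len(kept))
```

Keeping the similarity score as a float makes it straightforward to drop low-scoring pairs before training, if that is desired.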

In the dev/test data, each line is formatted as follows:

    Document ID ||| Sentence ID ||| Japanese sentence ||| English sentence
A-02A0174441 ||| 0 ||| ナノテクノロジーの応用として,情報技術・電子分野では,次世代半導体,高密度情報記録技術,超小型集積回路素子,カーボンナノチューブを用いた省電力ディスプレイなどが期待できる。 ||| In information technology and electron field,the application of nanotechnology to next generation semiconductors,high-density information record technology,miniature integrated circuit elements,electric power saving displays using carbon nano-tube,etc. can be expected.
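
Many MT toolkits expect one plain sentence per line in two aligned files, so the dev/test layout usually needs to be split into its Japanese and English sides. The following is a minimal sketch under the assumption that `|||` never occurs inside a sentence; `parse_je_eval_line` and `to_parallel` are hypothetical helper names. Note that in the example line the first field (`A-02A0174441`) appears to include a field symbol as in the training data; the sketch keeps that field as-is.

```python
from typing import Iterable, List, Tuple


def parse_je_eval_line(line: str) -> Tuple[str, str, str, str]:
    """Return (document_id, sentence_id, japanese, english) for one dev/test line."""
    parts = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return parts[0], parts[1], parts[2], parts[3]


def to_parallel(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Collect the Japanese and English sides into two index-aligned lists."""
    japanese: List[str] = []
    english: List[str] = []
    for line in lines:
        if not line.strip():  # skip blank lines, if any
            continue
        _, _, ja, en = parse_je_eval_line(line)
        japanese.append(ja)
        english.append(en)
    return japanese, english


# The example line shown above.
sample = (
    "A-02A0174441 ||| 0 ||| "
    "ナノテクノロジーの応用として,情報技術・電子分野では,次世代半導体,"
    "高密度情報記録技術,超小型集積回路素子,"
    "カーボンナノチューブを用いた省電力ディスプレイなどが期待できる。 ||| "
    "In information technology and electron field,the application of "
    "nanotechnology to next generation semiconductors,high-density "
    "information record technology,miniature integrated circuit elements,"
    "electric power saving displays using carbon nano-tube,etc. can be expected."
)
ja_side, en_side = to_parallel([sample])
print(len(ja_side), len(en_side))
```

The two returned lists can then be written to separate files, one sentence per line, with the same index referring to the same sentence pair.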

After completing and signing the license agreement, it can take a couple of days (or more)
before you receive an email containing the download link, the web password and the ZIP password.