Corpora

In the context of ModelStudio, a corpus is a file within a dataset that contains a collection of segments. It can be a training corpus or a testing corpus (test sets are testing corpora).

Please refer to the following sections for more information on how to upload, delete, or download a corpus.

More information about the cleaning performed on corpora is available on the Data upload and cleaning page. More information on how to edit corpora is on the Segments page.

Upload a corpus

Corpora can only be uploaded within datasets. To populate an existing dataset with new corpora:

  • Go to the Resources page, and click on the Dataset to which you want to add new corpora.

  • Click on the + Upload a new corpus button.

An upload form will be displayed.

Either drag and drop your files into the Corpus box or click on it to open and browse your file explorer. The files you selected will then appear under the Corpus box. Click on the trash bin to delete a file.

Only two formats are accepted: application/x-tmx+xml (TMX files) and text/plain (raw bitext). See Data upload and cleaning for more information.

Test set creation

By default, ModelStudio lets you set aside part of the data you upload as a testing set; the rest is used as the training set.

Under Data allocated for testing, you can choose either a fixed number of lines to extract (Segments) or a percentage value (Percentage). Move the cursor to increase or decrease this value. Default values are 1000 segments (or 10%).

Tip

Please keep in mind that the BLEU score relies on the proximity between a translation and a reference. As such, it is subject to higher variation if:

  • The segments are very short.

  • The test set contains fewer segments.
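The effect of test set size can be illustrated with a small simulation. It is a rough sketch, not a BLEU computation: it uses the mean of random per-segment scores as a stand-in for a corpus-level metric, and the function name and parameters are hypothetical.

```python
import random
import statistics


def spread_of_mean_score(test_set_size, resamples=200, seed=0):
    """Estimate how much an averaged score varies between random test sets.

    Each simulated test set gets `test_set_size` scores drawn uniformly
    from [0, 1); the function returns the standard deviation of the
    per-test-set means across all resamples.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(resamples):
        scores = [rng.random() for _ in range(test_set_size)]
        means.append(statistics.fmean(scores))
    return statistics.stdev(means)


# A small test set gives a noticeably less stable score than a large one.
small = spread_of_mean_score(50)
large = spread_of_mean_score(1000)
```

In this simplified setting the spread shrinks roughly with the square root of the number of segments, which is why a score measured on a few dozen segments is harder to compare across models than one measured on a thousand.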

If you have already prepared a test set, tick the box indicating you want to use separate files for training and testing. A second drag-and-drop box will appear for the testing data.

Click on Submit. Clicking on Close closes the form and no file will be added to your Dataset.

Your new file(s) will appear in the corpora list of your Dataset. While the cleaning process is running, the upload status remains In progress; it changes to Success when the cleaning is done and the corpus is ready to use.

If an error occurs during your upload, the status column will display an error message explaining what happened.

Common corpus issues

Please be sure to check the following:

  • Files are properly encoded in UTF-8

  • Language codes are valid (TMX files)

  • Language codes correspond to the languages declared for the dataset

  • Language direction is correct (datasets are monodirectional)

  • Bitext files start with the right header

  • Segments do not contain tab characters (bitext files)

  • There are enough segment pairs in your file to extract the test subset you’re expecting
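Some of these checks can be run locally before uploading. The sketch below is a set of assumptions rather than ModelStudio's own validation: it supposes that bitext files are tab-separated with one source and one target segment per line, that the first line is a header (whose expected content is not specified here and is therefore not checked), and that TMX files carry language codes in the xml:lang (or older lang) attribute of each tuv element. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Fully qualified name ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_bitext(raw_bytes, has_header=True):
    """Return a list of problems found in a bitext file given as bytes.

    Assumption: each data line holds exactly one tab separating the
    source segment from the target segment.
    """
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err}"]
    lines = [line.rstrip("\r") for line in text.split("\n")]
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty string after a trailing newline
    problems = []
    start = 1 if has_header else 0  # the header line is skipped, not validated
    for number, line in enumerate(lines[start:], start=start + 1):
        tabs = line.count("\t")
        if tabs != 1:
            problems.append(f"line {number}: expected exactly one tab, found {tabs}")
    return problems


def tmx_languages(tmx_data):
    """Return the set of lower-cased language codes used in a TMX document."""
    root = ET.fromstring(tmx_data)
    codes = set()
    for tuv in root.iter("tuv"):
        code = tuv.get(XML_LANG) or tuv.get("lang")
        if code:
            codes.add(code.lower())
    return codes
```

Comparing the result of tmx_languages with the two languages declared for the dataset gives a quick indication of invalid or mismatched codes; it does not verify the language direction.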

Delete a corpus

There are two ways to delete a corpus.

  • Tick the box next to the corpus you want to delete, then click on the Delete button.

  • Click on the cog next to the corpus you want to delete, and choose Delete from the drop-down menu.

You will be asked to confirm your choice.

  • Click on Submit to confirm.

Clicking on Close or on the X in the top right corner will close this confirmation window and your corpus will not be deleted.

It is possible to delete several corpora at once.

Download a corpus

There are two ways to download a corpus:

  • Tick the box next to the corpus you want to download, then click on the Download button.

  • Click on the cog next to the corpus you want to download, and choose Download from the drop-down menu.

Your web browser should display a dialog box asking whether you want to open the file with an application or save it.

To avoid network issues, it is not possible to download several corpora at once.