Data upload and cleaning

In ModelStudio, training and testing data are called resources and are organized in Datasets > Corpora > Segments. Please refer to the page for each of these for more detailed information.

Data volume requirements

ModelStudio adapts the training parameters to your data volume. You can use up to one million segment pairs (i.e. lines). However, we recommend the following:

  • Your training data should contain at least 2,000 segment pairs for the model training to be efficient.

  • If possible, please split large corpora into smaller files to avoid timeout errors during the upload.
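The splitting recommended above can be scripted. The following is a minimal sketch for plain-text bitext files, not an official ModelStudio tool. It assumes that every part should repeat the two-line bitext header, and the function names and the default of 100,000 pairs per file are hypothetical choices made for illustration only.

```python
from pathlib import Path


def split_bitext(lines, max_pairs):
    """Split bitext lines (two header lines, then segment pairs) into chunks.

    Assumption: each chunk repeats the two-line header so that it can be
    uploaded as a bitext file on its own.
    """
    header, pairs = lines[:2], lines[2:]
    return [header + pairs[i:i + max_pairs] for i in range(0, len(pairs), max_pairs)]


def split_bitext_file(path, max_pairs=100_000):
    """Write the chunks next to the original file and return their paths.

    The part size and the ".partNNN" naming are illustrative, not limits
    or conventions documented by ModelStudio.
    """
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines()
    out_paths = []
    for n, chunk in enumerate(split_bitext(lines, max_pairs), start=1):
        out = path.with_name(f"{path.stem}.part{n:03d}{path.suffix}")
        out.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        out_paths.append(out)
    return out_paths
```

For example, calling split_bitext_file("corpus.txt") on a file with 250,000 pairs would write three parts, each starting with the header of the original file.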

File formats

Only bilingual files can be uploaded to ModelStudio, for both model training and model evaluation. Two file formats are currently accepted:

  • application/x-tmx+xml (TMX files)

  • text/plain (raw bitext)

Each line of a bitext file consists of a source segment and a target segment separated by a tab character.

About bitext files

Bitext files need to start with the following header.

#TM
#XX YY

Here, XX stands for the two-letter ISO code of the source language, and YY for that of the target language. The two language codes must be separated by a tab character.
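As an illustration, a bitext file with this header can be produced with a few lines of Python. This is a sketch under stated assumptions: the helper names make_bitext and has_valid_header are hypothetical, and the check accepts both upper-case and lower-case language codes because the expected case is not specified here.

```python
import re

# Second header line: "#", source code, a tab, target code.
# Assumption: the case of the two-letter codes is not enforced.
HEADER_LANGS = re.compile(r"#[A-Za-z]{2}\t[A-Za-z]{2}")


def make_bitext(source_lang, target_lang, pairs):
    """Return the text of a bitext file: two header lines, then one pair per line."""
    rows = ["#TM", f"#{source_lang}\t{target_lang}"]
    for source, target in pairs:
        # A tab or newline inside a segment would break the one-pair-per-line layout.
        if any(ch in segment for segment in (source, target) for ch in "\t\n"):
            raise ValueError("segments must not contain tabs or newlines")
        rows.append(f"{source}\t{target}")
    return "\n".join(rows) + "\n"


def has_valid_header(text):
    """Check that the text starts with the two header lines described above."""
    lines = text.splitlines()
    return (
        len(lines) >= 2
        and lines[0] == "#TM"
        and HEADER_LANGS.fullmatch(lines[1]) is not None
    )
```

For example, make_bitext("en", "fr", [("Hello", "Bonjour")]) returns three lines: "#TM", then "#en" and "fr" separated by a tab, then "Hello" and "Bonjour" separated by a tab.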

A Dataset can contain both TMX and plain text files.

If uploaded separately, training and testing files can be in different file formats. If the test set is extracted from the data during the upload, both the training and the test sets will be in the same file format.

Resources can be downloaded in the same format in which they were uploaded.

Data cleaning

ModelStudio performs some basic data cleaning during the upload. If the file is a TMX file, the text is first extracted. Then the cleaning procedure is applied following these steps:

  • HTML entity conversion (for example, &amp;, &lt;, and &gt; become &, <, and >)

  • Control and formatting character suppression (for example, newlines are converted to white spaces)

  • Filtering out segment pairs in which either the source or the target segment is missing

  • Deleting duplicate lines (identical source and target segments)

ModelStudio does not check for duplicates across files during the data cleaning step. However, such duplicates are filtered out during the training itself.
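To preview what these steps do to a single file before uploading it, they can be approximated locally. The sketch below is an approximation, not ModelStudio's actual implementation. In particular, it assumes that all control and formatting characters (Unicode categories Cc and Cf) are replaced by a space and that runs of white space are then collapsed. The names clean_segment and clean_pairs are hypothetical.

```python
import html
import unicodedata


def clean_segment(text):
    """Apply the per-segment steps: entity conversion, then character suppression."""
    text = html.unescape(text)  # e.g. "&amp;" becomes "&"
    # Assumption: every control (Cc) or formatting (Cf) character becomes a space.
    text = "".join(
        " " if unicodedata.category(ch) in ("Cc", "Cf") else ch for ch in text
    )
    # Assumption: runs of white space are collapsed and the ends are trimmed.
    return " ".join(text.split())


def clean_pairs(pairs):
    """Clean (source, target) pairs, drop incomplete pairs, and remove duplicates."""
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = clean_segment(source), clean_segment(target)
        if not source or not target:
            continue  # either the source or the target segment is missing
        if (source, target) in seen:
            continue  # duplicate line within this file
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned
```

In this sketch, duplicates are detected after cleaning, so two lines that differ only by an HTML entity or a stray newline count as the same pair. Pairs whose source and target are identical are kept.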

Note

For larger files, the duplicate suppression step can take some time; this is perfectly normal.

ModelStudio does not filter out segment pairs in which the source and target are identical, as such pairs can be correct (for instance, in terminologies).

The cleaning procedure is the same for the training and the testing data to make the Evaluations reliable.