Below is an overview of various terms to help you understand how to use ModelStudio.
Datasets, Corpora & Segments
A Dataset is essentially a directory containing a collection of files: training and testing data. Each of these files is a corpus. Corpora contain pairs of sentences in the source and target languages. These pairs are called segments. Each of these notions and their management in ModelStudio are detailed in the Resources section of this documentation.
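As an illustration only, the relationship between these three notions can be sketched as a small data model. The class and field names below (`Segment`, `Corpus`, `Dataset`, `role`, `segments_for`) are hypothetical and chosen for this example; they are not part of ModelStudio.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Segment:
    """One sentence pair: a source sentence and its translation."""
    source: str
    target: str


@dataclass
class Corpus:
    """One file of a Dataset. `role` is a hypothetical field: 'training' or 'testing'."""
    name: str
    role: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Dataset:
    """A collection of corpora, comparable to a directory of files."""
    name: str
    corpora: List[Corpus] = field(default_factory=list)

    def segments_for(self, role: str) -> List[Segment]:
        # Gather the segments of every corpus that has the requested role.
        return [s for c in self.corpora if c.role == role for s in c.segments]


# Example: one training corpus with two segments, one testing corpus with one.
dataset = Dataset(
    name="legal-en-fr",
    corpora=[
        Corpus("contracts.train", "training", [
            Segment("The contract is signed.", "Le contrat est signé."),
            Segment("The party agrees.", "La partie accepte."),
        ]),
        Corpus("contracts.test", "testing", [
            Segment("The clause applies.", "La clause s'applique."),
        ]),
    ],
)
print(len(dataset.segments_for("training")), len(dataset.segments_for("testing")))
```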
During the upload of your data into a Dataset, ModelStudio performs a basic cleaning, removing duplicates and obviously problematic sentences. More information is available on the Data cleaning page.
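To give an idea of what such a cleaning step can look like, here is a minimal sketch that drops exact duplicate pairs and pairs with an empty side. The function name `basic_clean` and its rules are assumptions made for this example; the actual rules applied by ModelStudio are described on the Data cleaning page and may differ.

```python
def basic_clean(pairs):
    """Return the (source, target) pairs with blanks and exact duplicates removed.

    Hypothetical rules for illustration only, not ModelStudio's actual cleaning:
    - surrounding whitespace is stripped from both sides,
    - pairs with an empty source or target are dropped,
    - only the first occurrence of an identical pair is kept.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue  # an empty side makes the pair unusable as an example
        key = (source, target)
        if key in seen:
            continue  # exact duplicate of a pair already kept
        seen.add(key)
        cleaned.append(key)
    return cleaned


raw = [
    ("Hello", "Bonjour"),
    ("Hello", "Bonjour"),   # duplicate
    ("", "Orphan target"),  # empty source
    ("  Thank you ", "Merci"),
]
print(basic_clean(raw))
```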
Training corpora are used as translation examples, to train the model. Testing corpora are used to evaluate the model. See the Evaluations section below for more information.
The training data should be representative of the type of content you want your model to be able to translate. To put it plainly, a model will perform poorly on text that is very different from the data it was trained on.
To keep your model robust, some generic data should always be used along with your data, and is automatically added by ModelStudio.
Please note that starting with ModelStudio 1.4.0, Datasets and Evaluations have an expiration date, after which they are automatically deleted.
A model is a translation engine. It is the result of a training process, using training data (for example corpora or terminologies) from which the engine learns how to translate some content from a source language to a target language.
Domains can be regarded as broad subject areas (for instance News, IT, Legal or Medical). A model trained with examples balanced between various domains is considered a Generic model. A model can also be tailored to perform especially well on a particular domain. This process is called domain specialization.
To train a domain specialized model, the training data should contain a majority of translation examples from that domain.
ModelStudio allows you to specialize a model, whether it be a Generic or a Domain model, fine-tuning it with your own data.
Training is the process during which the engine learns from the examples in the training data. The data is passed through various filters and undergoes several transformations to ensure both quality and robustness. All the SYSTRAN baseline models are the result of SYSTRAN’s expertise and are also used as starting points for model specializations internally trained for SYSTRAN’s customers.
At the end of the training, the generated translation model is used to translate the testing corpora, which allows both automatic and human evaluation.
Evaluation is the process of assessing a translation’s quality and thus a model’s performance. Within ModelStudio, this can correspond to:
A translation, and the evaluation of this translation, at the end of a training, if at least one test set was provided (or extracted during the upload). In this case, the evaluation is launched automatically, and the scores are available on the Models overview and detailed model pages.
An Evaluation, where you upload one or several files, and can choose up to three different models to compare the translations (manually and/or with the obtained scores). Please refer to the Evaluate a model page.
Evaluations can be run independently from a training. For example, you may want to translate a file with several already available models to help you decide which one you will specialize with your data. More information is available in the Evaluations section of this documentation.
BLEU score and other metrics
Several metrics are used to evaluate the quality of a translation. The most commonly used is the BLEU score. It measures how close a translation is to a reference translation of the same source text. The reference is typically a human-generated translation.
The closer the evaluated translation is to the reference, the higher the BLEU score will be.
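The idea can be illustrated with a simplified, sentence-level version of BLEU: the geometric mean of the n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. This is a sketch under simplifying assumptions (whitespace tokenization, a single reference, no smoothing); the function name `simple_bleu` is hypothetical, and the scores shown in ModelStudio are computed by its own tooling and may differ.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) found in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU between 0.0 and 1.0 for one candidate and one reference.

    No smoothing is applied: the score is 0.0 as soon as one n-gram order
    has no match, which includes candidates shorter than max_n tokens.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matches == 0:
            return 0.0
        log_precision_sum += math.log(matches / total)
    # Brevity penalty: candidates shorter than the reference are penalized.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision_sum / max_n)


reference = "the cat sat on the mat"
exact = simple_bleu("the cat sat on the mat", reference)
close = simple_bleu("the cat sat on a mat", reference)
far = simple_bleu("a dog slept under that tree", reference)
print(exact, close, far)
```

With these inputs, the identical translation gets the highest score, the translation with one differing word a lower one, and the unrelated sentence scores zero.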
Once your model is successfully trained on ModelStudio, you can publish it to use it on other platforms. Depending on your usage and on the offer you subscribed to, the procedure and the destination platform may vary.
Making the model available on said platform is called model deployment. See Deploy a model for more information.