Datasets

In ModelStudio, resources are organized into Datasets: directories that contain collections of training and testing corpora.

The following sections explain how to browse, create, and delete Datasets, and how to manage their expiration.

Datasets overview

Click on Resources in the header menu to go to the Resources page.

The table displays all the Datasets available to your Entity, with the following information:

  • The Dataset’s name

  • The language pair

  • The name of the user who created the Dataset

  • The Dataset’s expiration date

  • The date the Dataset was last modified

Browse a dataset

Click on the name of a Dataset to enter it and view its content (see Corpora for detailed information).

The following information is displayed for each corpus within the Dataset:

  • The corpus name

  • The corpus type (Train or Test)

  • The number of sentence (or segment) pairs the corpus contains

  • Its upload status (In progress, Success, Error)

  • The last time it was modified

Create a Dataset

To create a new Dataset, click on the “+ Create a new dataset” button. The creation form is displayed.

Complete the form:

  • Choose a name for your Dataset. It must not exceed 50 characters and can only contain alphanumeric characters, periods (.), hyphens (-), and underscores (_)

  • Either start typing the 2-letter ISO code of the source language, or select a language pair from the drop-down list

  • Set a date for the automatic deletion of the Dataset; you can change this date later (see Dataset expiration)

  • Choose the data you wish to upload. You can either drag and drop your files into the Corpus box or click on it to open and browse your file explorer. The files you selected will then appear under the Corpus box. Click on the trash bin to delete a file

  • By default, ModelStudio will extract a test set of 1000 segment pairs from the uploaded data. See Upload a corpus for more information

  • Click on Submit
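The naming rule in the first step can be checked before submitting the form. The snippet below is a minimal sketch, not ModelStudio code: the function name and pattern are hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits only.

```python
import re

# Assumed reading of the naming rule: 1 to 50 characters, each one an
# ASCII letter, a digit, or one of . - _
DATASET_NAME_PATTERN = re.compile(r"[A-Za-z0-9._-]{1,50}")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` satisfies the length and character constraints."""
    return DATASET_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_dataset_name("en-fr_legal.v2"))  # True
print(is_valid_dataset_name("my dataset"))      # False: contains a space
print(is_valid_dataset_name("a" * 51))          # False: longer than 50 characters
```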

Warning

If you choose to extract a test subset from your files, a test subset will be extracted from each file you upload.
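To illustrate the consequence of per-file extraction, here is a minimal sketch using synthetic data. How ModelStudio actually selects the test segments is not described on this page: the random sampling, the error raised for small files, and all names below are assumptions made for the example.

```python
import random

DEFAULT_TEST_SIZE = 1000  # default number of segment pairs, as stated above


def split_corpus(pairs, test_size=DEFAULT_TEST_SIZE, seed=0):
    """Split one file's segment pairs into (train, test).

    Random selection is an assumption; so is rejecting files that are
    too small to provide a test set.
    """
    if test_size >= len(pairs):
        raise ValueError("file is too small to extract the requested test set")
    rng = random.Random(seed)
    test_indices = set(rng.sample(range(len(pairs)), test_size))
    train = [p for i, p in enumerate(pairs) if i not in test_indices]
    test = [p for i, p in enumerate(pairs) if i in test_indices]
    return train, test


# Three uploaded files with 5000 synthetic segment pairs each.
files = {
    name: [(f"source {i}", f"target {i}") for i in range(5000)]
    for name in ("file_a", "file_b", "file_c")
}
splits = {name: split_corpus(pairs) for name, pairs in files.items()}
total_test = sum(len(test) for _, test in splits.values())
print(total_test)  # 3000: one 1000-pair test set per file
```

With three files, the Dataset therefore ends up with three test corpora rather than a single one.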

Your dataset will appear in the list displayed on the Resources page.

Dataset expiration

Starting with ModelStudio 1.4.0, customer data that has not been used for a configurable period of time is automatically deleted.

You remain in control of your data: you decide whether and when it is deleted:

  • 90 days (~ 3 months) after the upload

  • 180 days (~ 6 months) after the upload

  • no automatic deletion
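The resulting expiration date is simple date arithmetic. The sketch below assumes the period is counted in calendar days from the upload date; the function and constant names are hypothetical and not part of ModelStudio.

```python
from datetime import date, timedelta
from typing import Optional

# Retention choices listed above; None stands for "no automatic deletion".
RETENTION_CHOICES = (90, 180, None)


def expiration_date(upload_date: date, retention_days: Optional[int]) -> Optional[date]:
    """Return the expiration date, or None if the Dataset never expires."""
    if retention_days not in RETENTION_CHOICES:
        raise ValueError(f"unsupported retention period: {retention_days}")
    if retention_days is None:
        return None
    return upload_date + timedelta(days=retention_days)


print(expiration_date(date(2024, 1, 15), 90))    # 2024-04-14
print(expiration_date(date(2024, 1, 15), 180))   # 2024-07-13
print(expiration_date(date(2024, 1, 15), None))  # None
```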

Traffic-light indicators help you see quickly how much time is left before the Dataset expires.

Color Time left
🔴 7 days left or less
🟠 between 7 and 21 days left
🟢 more than 21 days or no expiration date
⚪️ Expired
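The color rules above can be expressed as a small function. This is a sketch under stated assumptions, not the product's implementation: the table does not say how boundary days are handled, so the code assumes that exactly 7 days left is red, exactly 21 days left is orange, and that a Dataset counts as expired once its expiration date has passed.

```python
from datetime import date
from typing import Optional


def expiration_color(expires_on: Optional[date], today: date) -> str:
    """Map a Dataset's expiration date to a traffic-light color."""
    if expires_on is None:
        return "green"   # no expiration date
    days_left = (expires_on - today).days
    if days_left < 0:
        return "white"   # expired (assumption: the expiration day itself is not yet expired)
    if days_left <= 7:
        return "red"     # 7 days left or less
    if days_left <= 21:
        return "orange"  # between 7 and 21 days left
    return "green"       # more than 21 days left


today = date(2024, 1, 1)
print(expiration_color(date(2024, 1, 8), today))    # red (7 days left)
print(expiration_color(date(2024, 1, 22), today))   # orange (21 days left)
print(expiration_color(date(2024, 1, 23), today))   # green (22 days left)
print(expiration_color(date(2023, 12, 31), today))  # white (expired)
```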

To change the expiration date of a Dataset, either:

  • Select the Dataset, then click on the Update expiration time button

  • Click on the cog icon button and select Update expiration time.

When a Dataset expires, the record of which Datasets were used in model trainings, along with the BLEU scores, is kept and remains available on the detailed trained model pages.

Delete a dataset

Warning

Depending on the permissions given to you, you may or may not be able to delete a Dataset.

  • Tick the box(es) corresponding to the Dataset(s) you want to delete, then click on the Delete button.

You will be asked to confirm the deletion of the selected Dataset(s).

The Dataset(s) to be removed are listed.

  • Click on Submit to confirm your choice.

Click on Close to modify your selection or to cancel the deletion.