Datasets
In the context of ModelStudio, resources are organized into Datasets: directories containing collections of training and testing corpora.
The sections below explain how to browse, create, set the expiration of, and delete Datasets.
Datasets overview
Click on Resources in the header menu to go to the Resources page.
The table displays all the Datasets available to your Entity, with the following information:
The Dataset’s name
The language pair
The name of the user who created the Dataset
The Dataset’s expiration date
The date the Dataset was last modified
Browse a Dataset
Click on the name of a Dataset to enter it and view its content (see Corpora for detailed information).
The following information is displayed for each corpus within the Dataset:
The corpus name
The corpus type (Train or Test)
The number of sentence (or segment) pairs the corpus contains
Its upload status (In progress, Success, Error)
The last time it was modified
Create a Dataset
To create a new Dataset, click on the “+ Create a new dataset” button. The creation form is displayed.
Complete the form:
Choose a name for your Dataset. It must not exceed 50 characters and can only contain alphanumeric characters and the characters . - _
Either start typing the 2-letter ISO code of the source language, or select a language pair from the drop-down list
Set a date for the automatic deletion of the Dataset; you can change this date later (see Dataset expiration)
Choose the data you wish to upload. You can either drag and drop your files into the Corpus box or click on it to browse your file explorer. The selected files then appear under the Corpus box. Click on the trash bin icon next to a file to remove it from the selection
By default, ModelStudio will extract a test set of 1000 segment pairs from the uploaded data. See Upload a corpus for more information
Click on Submit
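The naming rule from the first step of the form can be checked before you submit. The sketch below is only an illustration: the function name is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits, which ModelStudio may interpret more broadly.

```python
import re

# Assumed reading of the naming rule: 1 to 50 characters, each one an ASCII
# letter, a digit, or one of the characters . - _
DATASET_NAME_RE = re.compile(r"[A-Za-z0-9._-]{1,50}")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule as sketched here."""
    # fullmatch ensures the whole name, not just a prefix, fits the pattern.
    return DATASET_NAME_RE.fullmatch(name) is not None


print(is_valid_dataset_name("news_en-fr.v2"))  # True
print(is_valid_dataset_name("my dataset"))     # False: contains a space
```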
Warning
If you choose to extract a test subset from your files, a test subset will be extracted from each file you upload.
Your dataset will appear in the list displayed on the Resources page.
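To make the warning above concrete, the following sketch shows how a per-file extraction adds up. It is not ModelStudio's implementation: the function name, the random sampling, and the handling of files with fewer than 1000 pairs are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

SegmentPair = Tuple[str, str]  # (source segment, target segment)


def split_test_subset(
    pairs: List[SegmentPair], test_size: int = 1000, seed: int = 0
) -> Tuple[List[SegmentPair], List[SegmentPair]]:
    """Hold out up to `test_size` pairs; return (train, test).

    Assumption: pairs are sampled at random. If the file holds fewer than
    `test_size` pairs, this sketch puts all of them in the test subset.
    """
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    test_idx = set(indices[: min(test_size, len(pairs))])
    train = [p for i, p in enumerate(pairs) if i not in test_idx]
    test = [p for i, p in enumerate(pairs) if i in test_idx]
    return train, test


# Three hypothetical files of 5000 segment pairs each.
files: Dict[str, List[SegmentPair]] = {
    name: [(f"source {i}", f"target {i}") for i in range(5000)]
    for name in ("file_a", "file_b", "file_c")
}

# One test subset is extracted per file, not one for the whole upload.
splits = {name: split_test_subset(pairs) for name, pairs in files.items()}
total_test = sum(len(test) for _, test in splits.values())
total_train = sum(len(train) for train, _ in splits.values())
print(total_test, total_train)  # 3000 12000
```

With the default of 1000 segment pairs, uploading three files therefore sets aside 3000 test pairs in total rather than 1000.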
Dataset expiration
Starting with ModelStudio 1.4.0, customer data that has not been used for a configurable period of time is automatically deleted.
You keep control over your data and its deletion: you decide whether and when it will be deleted:
90 days (~ 3 months) after the upload
180 days (~ 6 months) after the upload
no automatic deletion
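As an illustration of the three options above, the sketch below computes the resulting expiration date from an upload date. The function name is hypothetical, and representing "no automatic deletion" as None is an assumption of this example.

```python
from datetime import date, timedelta
from typing import Optional


def expiration_date(upload_date: date, retention_days: Optional[int]) -> Optional[date]:
    """Return the date on which the Dataset expires, or None if it never does.

    `retention_days` is 90, 180, or None (no automatic deletion).
    """
    if retention_days is None:
        return None
    return upload_date + timedelta(days=retention_days)


uploaded = date(2024, 1, 1)
print(expiration_date(uploaded, 90))    # 2024-03-31
print(expiration_date(uploaded, 180))   # 2024-06-29
print(expiration_date(uploaded, None))  # None
```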
Traffic-light indicators show at a glance how much time is left before the Dataset expires.
Color | Time left |
---|---|
🔴 | 7 days left or less |
🟠 | between 7 and 21 days left |
🟢 | more than 21 days or no expiration date |
⚪️ | Expired |
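The table can be read as a simple threshold rule. The sketch below is one possible reading, not ModelStudio's actual logic: the function name is hypothetical, exactly 7 days is treated as red and exactly 21 days as orange, and a Dataset counts as expired only once its expiration date is strictly in the past.

```python
from datetime import date
from typing import Optional


def traffic_light(expires_on: Optional[date], today: date) -> str:
    """Map an expiration date to the color shown in the table."""
    if expires_on is None:
        return "green"  # no expiration date
    days_left = (expires_on - today).days
    if days_left < 0:
        return "white"  # expired (assumed: date strictly in the past)
    if days_left <= 7:
        return "red"    # 7 days left or less
    if days_left <= 21:
        return "orange"  # more than 7 and up to 21 days left
    return "green"      # more than 21 days left


today = date(2024, 3, 1)
print(traffic_light(date(2024, 3, 5), today))   # red
print(traffic_light(date(2024, 3, 15), today))  # orange
print(traffic_light(date(2024, 6, 1), today))   # green
print(traffic_light(date(2024, 2, 1), today))   # white
```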
To change the expiration date of a Dataset, either:
Select the Dataset, then click on the Update expiration time button
Click on the cog icon button and select Update expiration time.
When a Dataset expires, the record of which Datasets were used in model trainings and the associated BLEU scores are kept and remain available on the detailed trained model pages.
Delete a Dataset
Warning
Depending on the permissions given to you, you may or may not be able to delete a Dataset.
Tick the box(es) corresponding to the Dataset(s) you want to delete, then click on the Delete button.
You will be asked to confirm that you want to delete the selected Dataset(s).
The Dataset(s) to be removed are listed.
Click on Submit to confirm your choice.
Click on Close if you wish to modify your selection or simply to cancel the action.