Link to Colab Notebook (hosted on GitHub)
All the code needed to preprocess the data and train the model is available at the link above. Please run it on Colab to gain a better understanding of each step.
Model and Dataset Information
Translating from one language to another can be treated as a sequence-to-sequence task, i.e., mapping one sequence of text to another.
There are two approaches to training a translation model:
- Train from scratch – if you have a corpus large enough to train the model
- Fine-tune an existing model – faster and uses fewer resources
Model: We will use a Marian model that is already pre-trained to translate from English to French.
Dataset: To fine-tune the model we will use the KDE4 dataset.
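As a rough sketch of this setup (assuming the widely used Helsinki-NLP/opus-mt-en-fr checkpoint; the notebook may point at a different one), the model and dataset can be loaded as follows:

```python
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM

# Assumed checkpoint: a Helsinki-NLP Marian model pre-trained for English -> French
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"

# KDE4: English-French sentence pairs extracted from KDE localization files
raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
```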
The Colab notebook walks through the following steps for training the model (rough code sketches for each step follow the list):
- Preparing the Dataset – splitting into train/test sets and preprocessing/padding the text (preprocessing sketch below)
- Tokenizer – instantiating the tokenizer that matches the model; we also have to tell the tokenizer which texts are in the target language (preprocessing sketch below)
- Data Collation – used for padding the data when we use dynamic batching; labels are padded with -100 so that the padded positions are ignored in the loss computation. We use a special data collator, DataCollatorForSeq2Seq (collator sketch below)
- Evaluation Metrics – the SacreBLEU metric is used to evaluate the model's French translations; the score ranges from 0 to 100, and higher is better (metric sketch below)
- Fine-tuning (Training) the Model – we pass the following to the trainer to start training (trainer sketch below):
  - model
  - training arguments
  - train dataset
  - eval dataset
  - data collator
  - tokenizer
  - compute_metrics function
- Evaluating the Model – we use trainer.evaluate() to check the metrics and see how well the model has been trained (evaluation sketch below)
- Saving the Model – push the model to the Hugging Face Hub after each epoch (hub-push sketch below)
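A minimal preprocessing sketch, continuing from the loading code above; the maximum length, split ratio, and seed are illustrative values, not necessarily what the notebook uses:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_length = 128  # illustrative cap on input/target length

def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    # text_target marks these strings as labels in the target language
    # (on older transformers versions, use tokenizer.as_target_tokenizer() instead)
    return tokenizer(inputs, text_target=targets, max_length=max_length, truncation=True)

# Carve a test split out of the single "train" split that KDE4 ships with
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)
```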
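The data collator pads each batch on the fly and pads the labels with -100; a sketch:

```python
from transformers import DataCollatorForSeq2Seq

# Pads inputs to the longest sequence in the batch and pads labels with -100,
# which the loss function ignores
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
```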
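For SacreBLEU, a typical compute_metrics function decodes predictions and labels back to text before scoring (a sketch assuming the evaluate library; the notebook may load the metric differently):

```python
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace the -100 padding in the labels so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # SacreBLEU expects a list of references for each prediction
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}  # 0 to 100, higher is better
```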
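Putting it together with Seq2SeqTrainer; the output directory name and hyperparameters below are illustrative, not the notebook's exact values:

```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-fr",  # hypothetical output dir / Hub repo name
    save_strategy="epoch",             # save (and push) a checkpoint each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    num_train_epochs=3,
    predict_with_generate=True,        # evaluate with generate() so BLEU is meaningful
    push_to_hub=True,                  # upload checkpoints to the Hugging Face Hub
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
```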
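After training, evaluation and the final push to the Hub look roughly like this:

```python
# predict_with_generate=True makes evaluate() produce real translations,
# so the SacreBLEU score reflects generation quality
print(trainer.evaluate(max_length=max_length))

# Upload the final model and tokenizer to the Hugging Face Hub
trainer.push_to_hub(commit_message="Training complete")
```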
Accelerator – using the Accelerator class from the Hugging Face Accelerate library, we can also write a custom training loop instead of relying on the Trainer (sketched below).
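A minimal sketch of such a loop, reusing the tokenized dataset and collator from above; the batch size, learning rate, and schedule are illustrative:

```python
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import get_scheduler
from tqdm.auto import tqdm

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, collate_fn=data_collator, batch_size=8
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Accelerator handles device placement and (if configured) distributed training
accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```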