Link to Colab Notebook (hosted on GitHub)
All the code needed to preprocess the data and train the model is available at the link above. Please run it on Colab to gain a better understanding of each step.
Model and Dataset Information
Translating from one language to another can be treated as a sequence-to-sequence task, i.e., mapping one sequence of text to another.
There are two approaches to training a translation model:
- Train from scratch – if you have a corpus large enough to train the model
- Fine-tune an existing model – faster and uses fewer resources
Model: We will use a Marian model that is already pre-trained to translate from English to French.
Dataset: To fine-tune the model we will use the KDE4 dataset.
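As a rough sketch of this setup (assuming the widely used Helsinki-NLP/opus-mt-en-fr checkpoint; the notebook may point at a different one), the model and dataset can be loaded as follows:

```python
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM

# Assumed checkpoint: a Helsinki-NLP Marian model pre-trained for English -> French
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"

# KDE4: English-French sentence pairs extracted from KDE localization files
raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
```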
The Colab notebook walks through the following steps for training the model (rough code sketches for each step follow the list):
- Preparing the Dataset – splitting into train/test sets and preprocessing/padding the text (preprocessing sketch below)
- Tokenizer – instantiating the tokenizer that matches the model; we also have to tell the tokenizer which texts are in the target language (preprocessing sketch below)
- Data Collation – used for padding the data when we use dynamic batching; labels are padded with -100 so that the padded positions are ignored in the loss computation. We use a special data collator, DataCollatorForSeq2Seq (collator sketch below)
- Evaluation Metrics – the SacreBLEU metric is used to evaluate the model's French translations; the score ranges from 0 to 100, and higher is better (metric sketch below)
- Fine-tuning (Training) the Model – we pass the following to the trainer to start training (trainer sketch below):
  - model
  - training arguments
  - train dataset
  - eval dataset
  - data collator
  - tokenizer
  - compute_metrics function
- Evaluating the Model – we use trainer.evaluate() to check the metrics and see how well the model has been trained (evaluation sketch below)
- Saving the Model – push the model to the Hugging Face Hub after each epoch (hub-push sketch below)
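A minimal preprocessing sketch, continuing from the loading code above; the maximum length, split ratio, and seed are illustrative values, not necessarily what the notebook uses:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_length = 128  # illustrative cap on input/target length

def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    # text_target marks these strings as labels in the target language
    # (on older transformers versions, use tokenizer.as_target_tokenizer() instead)
    return tokenizer(inputs, text_target=targets, max_length=max_length, truncation=True)

# Carve a test split out of the single "train" split that KDE4 ships with
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)
```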
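The data collator pads each batch on the fly and pads the labels with -100; a sketch:

```python
from transformers import DataCollatorForSeq2Seq

# Pads inputs to the longest sequence in the batch and pads labels with -100,
# which the loss function ignores
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
```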
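For SacreBLEU, a typical compute_metrics function decodes predictions and labels back to text before scoring (a sketch assuming the evaluate library; the notebook may load the metric differently):

```python
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace the -100 padding in the labels so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # SacreBLEU expects a list of references for each prediction
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}  # 0 to 100, higher is better
```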
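Putting it together with Seq2SeqTrainer; the output directory name and hyperparameters below are illustrative, not the notebook's exact values:

```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-fr",  # hypothetical output dir / Hub repo name
    save_strategy="epoch",             # save (and push) a checkpoint each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    num_train_epochs=3,
    predict_with_generate=True,        # evaluate with generate() so BLEU is meaningful
    push_to_hub=True,                  # upload checkpoints to the Hugging Face Hub
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
```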
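After training, evaluation and the final push to the Hub look roughly like this:

```python
# predict_with_generate=True makes evaluate() produce real translations,
# so the SacreBLEU score reflects generation quality
print(trainer.evaluate(max_length=max_length))

# Upload the final model and tokenizer to the Hugging Face Hub
trainer.push_to_hub(commit_message="Training complete")
```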
Accelerator – using the Accelerator class from the Hugging Face Accelerate library, we can also write a custom training loop instead of relying on the Trainer (sketched below).
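A minimal sketch of such a loop, reusing the tokenized dataset and collator from above; the batch size, learning rate, and schedule are illustrative:

```python
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import get_scheduler
from tqdm.auto import tqdm

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, collate_fn=data_collator, batch_size=8
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Accelerator handles device placement and (if configured) distributed training
accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```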