
[Python] Masking and computing loss for a padded batch in a transformer architecture

Discussion in 'Python', started by Stack, September 10, 2024.

  1. Stack (Participating Member)

    I am trying to re-create a transformer model, basing it loosely on the Annotated Transformer. My question concerns padding:

    1. How does the Annotated Transformer deal with padded sequences? I can see the method Batch.make_std_mask being defined, which supposedly (also) masks all padded tokens, but it is only applied to the synthetic data.
    2. How should one generally proceed with padded sequences in a transformer-based architecture? I can see mentions (again in the Annotated Transformer, search for "Batching matters a ton for speed.") of minimising padding, by which, I guess, is meant choosing batches so that the amount of padding is minimised? I am fairly certain one has to pass the whole padded sequence into the model, otherwise the self-attention runs into problems. Apart from setting something like padding_idx in the embeddings (PyTorch), should the model itself treat padded tokens specially? Should the loss calculation index the target and the model output so that all padded tokens are ignored (see the sketch below)?
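
    To make the question concrete, here is a minimal sketch of what I currently have in mind, assuming PyTorch and a pad id of 0; PAD_IDX, make_pad_mask and make_causal_mask are names I made up for illustration, not code from the Annotated Transformer:

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding token id (whatever the vocabulary uses for <pad>)

def make_pad_mask(seq, pad_idx=PAD_IDX):
    # (batch, 1, 1, seq_len): True where a token is real, False where it is padding,
    # broadcastable over attention scores of shape (batch, heads, q_len, k_len)
    return (seq != pad_idx).unsqueeze(1).unsqueeze(2)

def make_causal_mask(size):
    # (1, 1, size, size): lower-triangular mask for decoder self-attention
    return torch.tril(torch.ones(size, size, dtype=torch.bool)).unsqueeze(0).unsqueeze(0)

# padding_idx keeps the pad embedding (and its gradient) at zero, but attention
# still needs an explicit mask, since pad positions would otherwise receive weight.
embed = nn.Embedding(num_embeddings=10_000, embedding_dim=512, padding_idx=PAD_IDX)

# ignore_index drops pad positions from the loss average, so no manual indexing is needed.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

tgt = torch.tensor([[5, 7, 2, PAD_IDX],
                    [3, 2, PAD_IDX, PAD_IDX]])          # (batch, seq_len), toy example
logits = torch.randn(2, 4, 10_000)                      # (batch, seq_len, vocab), stand-in for model output

tgt_mask = make_pad_mask(tgt) & make_causal_mask(tgt.size(1))   # combined decoder self-attention mask
loss = criterion(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
```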

    I have already seen this question (Masking and computing loss for a padded batch sent through an RNN with a linear output layer in pytorch) - is the procedure the same for attention-based models, or is there a difference because of how the attention mechanism 'sees' the whole sequence at once?

    I have also seen this question (Query padding mask and key padding mask in Transformer encoder), but I am not sure whether it addresses the same problem. If it does, I would still love a clarification (even if it comes in the form of an answer there), because I do not completely understand either the question or the currently accepted (only) answer.
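
    For reference, this is how I currently understand the key padding mask when using PyTorch's built-in nn.TransformerEncoder; the toy values are arbitrary, and I am only assuming the documented convention that True marks positions attention should ignore:

```python
import torch
import torch.nn as nn

PAD_IDX = 0                      # assumed pad token id
d_model, nhead = 512, 8

tokens = torch.tensor([[5, 7, 2, PAD_IDX],
                       [3, 2, PAD_IDX, PAD_IDX]])          # (batch, seq_len)

embed = nn.Embedding(10_000, d_model, padding_idx=PAD_IDX)

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# In PyTorch's convention, True marks positions attention should ignore (the pads).
src_key_padding_mask = tokens == PAD_IDX                    # (batch, seq_len), bool

out = encoder(embed(tokens), src_key_padding_mask=src_key_padding_mask)
# out: (batch, seq_len, d_model); padded positions still produce outputs, but no real
# token attends to them, and they can be excluded later when computing the loss.
```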

