
[Python] Fine-tune a Sentence Transformer with single-sentence and label data

Discussion in 'Python' started by Stack, September 12, 2024.

  1. Stack (Active Member)

    I am trying to fine-tune a sentence transformer model. My data contains the following columns:

    1. raw_text - the raw chunks of text
    2. label - the corresponding label for the text: True or False (1 or 0)

    I want to fine-tune the model so that the embeddings of all the True sentences end up closer to each other in the vector space than they are to the False sentences.
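
One way to make that goal measurable is to compare the average cosine similarity among True sentences with the average similarity between True and False sentences. The sketch below is only an illustration: the 2-D vectors are made-up stand-ins for real embeddings (which would come from `model.encode`), and the helper names `cosine` and `mean_similarity` are hypothetical.

```python
import math
from itertools import combinations, product

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_similarity(pairs):
    """Average cosine similarity over an iterable of (u, v) pairs."""
    sims = [cosine(u, v) for u, v in pairs]
    return sum(sims) / len(sims)

# Toy stand-ins for embeddings (assumption); real ones would come from model.encode(texts).
true_vecs = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
false_vecs = [[0.1, 1.0], [-0.5, 0.8]]

within_true = mean_similarity(combinations(true_vecs, 2))
true_vs_false = mean_similarity(product(true_vecs, false_vecs))

# A larger gap means True sentences sit closer to each other than to False ones.
gap = within_true - true_vs_false
print(f"within True: {within_true:.3f}, True vs False: {true_vs_false:.3f}, gap: {gap:.3f}")
```

Tracking this gap on a held-out split before and after training would show whether fine-tuning moves the embeddings in the intended direction.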

    I have been reading about the losses in the Loss Overview page of the Sentence-Transformers documentation.

    I am really confused about which loss to use for my type of data and use case. I am leaning towards the ones below:

    [IMG]

    since they match my data format. However, as I read more about these losses and how they are computed from anchor, positive and negative samples, I feel less confident about using them, since my data does not contain these kinds of pairs.

    Can someone help me understand whether what I am trying to do is feasible with the existing losses in the sentence-transformers library?

    Below is my code so far, which works:
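
For context, the batch triplet losses do not need pre-built pairs: they form (anchor, positive, negative) triplets inside each batch from the labels alone, where the anchor and positive share a label and the negative has a different one. A minimal sketch of that enumeration in plain Python follows; `batch_all_triplets` is a hypothetical helper written for illustration, not the library's implementation.

```python
def batch_all_triplets(labels):
    """Return every (anchor, positive, negative) index triplet in one batch.

    The anchor and positive are distinct items with the same label;
    the negative is any item with a different label.
    """
    n = len(labels)
    return [
        (a, p, neg)
        for a in range(n)
        for p in range(n)
        for neg in range(n)
        if a != p and labels[a] == labels[p] and labels[neg] != labels[a]
    ]

# A batch of four sentences labelled True, True, False, False (1, 1, 0, 0).
triplets = batch_all_triplets([1, 1, 0, 0])
print(len(triplets))   # 8
print(triplets[:2])    # [(0, 1, 2), (0, 1, 3)]
```

With only two classes, every True sentence can act as an anchor with another True sentence as its positive and any False sentence as its negative, and the same holds with the roles of the classes swapped.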

    from sentence_transformers import SentenceTransformer, InputExample, SentencesDataset, losses
    from torch.utils.data import DataLoader
    import pandas as pd

    # Load a pre-trained Sentence Transformer model
    # model = SentenceTransformer('stsb-roberta-base')  # Hugging Face says this model produces low-quality embeddings
    model = SentenceTransformer('all-mpnet-base-v2')

    # 'train_data' is my dataset (loaded earlier) containing the 'page_raw_text' and 'label' columns
    data = pd.DataFrame({'text': train_data['page_raw_text'], 'label': train_data['label']})

    # Create one InputExample per sentence; the label is cast to int (1 = True, 0 = False)
    examples = [InputExample(texts=[txt], label=int(label)) for txt, label in zip(data['text'], data['label'])]

    # Create a DataLoader object and a loss model
    train_dataset = SentencesDataset(examples=examples, model=model)
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)
    train_loss = losses.BatchAllTripletLoss(model=model)

    # Define the training arguments
    num_epochs = 10
    evaluation_steps = 1

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=num_epochs,
        evaluation_steps=evaluation_steps,
    )
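
Because a triplet needs two items with the same label plus one with a different label, a shuffled batch that happens to contain a single class contributes no triplets. The sketch below counts how many batches of one shuffled epoch are usable. The label counts, the seed, and the helper names `can_form_triplet` and `usable_batches` are illustrative assumptions, and the code only mimics the batching; it does not use the actual `DataLoader`.

```python
import random
from collections import Counter

def can_form_triplet(batch_labels):
    """True if some label occurs at least twice and at least one other label is present."""
    counts = Counter(batch_labels)
    return len(counts) >= 2 and max(counts.values()) >= 2

def usable_batches(labels, batch_size, seed=0):
    """Shuffle the labels once, split them into batches, and count the usable ones."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    batches = [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]
    return sum(can_form_triplet(b) for b in batches), len(batches)

# Hypothetical imbalanced dataset: 90 False (0) and 10 True (1) sentences.
labels = [0] * 90 + [1] * 10
usable, total = usable_batches(labels, batch_size=8)
print(f"{usable} of {total} batches can form at least one triplet")
```

If many batches turn out to be unusable on the real label distribution, a larger batch size or a sampler that guarantees both classes in every batch may be worth considering.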

