Week 9 - Diving Deeper into Evaluation and Post-Processing 🤿


Overview

The ninth week of the project was mainly focused on refining the evaluation process for our trained (fine-tuned) models. Our primary goal was to develop a reusable Python module to measure the effectiveness of current and future fine-tuned models in generating SPARQL queries over DBpedia. To meet this goal, we developed an evaluation module that includes an optional post-processing step, thereby enabling direct comparison between raw generated queries and post-processed ones.

Findings and challenges

A confusion resolved: An early finding of the week concerned the unexpected appearance of the Answer: token at the beginning of the generated queries. Upon deeper inspection, it turned out that the original fine-tuning code for StarCoder contains a pre-processing step in which the input/output values from the Prompt and Completion columns are formatted by prepending Question: and Answer: to the respective sequences. With the mystery uncovered, the next step was to remove this unwanted token through a post-processing step.
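For reference, the formatting step looks roughly like the sketch below (the column names prompt and completion are placeholders for the actual dataset fields); each training sample is turned into a single text sequence with the two markers prepended, which is why the model learns to emit Answer: at the start of its output:

[In]:

def prepare_sample_text(example):
    # Roughly how the fine-tuning script concatenates the two columns into one
    # training sequence (column names here are placeholders, not the exact ones).
    return f"Question: {example['prompt']}\n\nAnswer: {example['completion']}"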

Inspection of raw queries: A significant portion of the week was dedicated to examining the raw SPARQL queries generated by our fine-tuned model. This helped us pinpoint the specific areas where post-processing is needed to improve the quality of the generated output.

Refining post-processing: With the findings from the raw query inspection in hand, we were able to make the necessary adjustments to the post-processing steps, ensuring better query output. It should be noted, however, that all post-processing at this stage is of a purely syntactic nature, e.g., removing redundant tokens, fixing spacing, etc. The current implementation can be seen below:

[In]:

import re


class SPARQLPostProcessor:
    # Remove unwanted tokens such as the "Answer:" prefix and the EOS token
    def remove_tokens(self, generated_output, tokens_to_remove=["Answer:", "<|endoftext|>"]):
        for token in tokens_to_remove:
            generated_output = generated_output.replace(token, '').strip()
        return generated_output

    # Custom spacing rules to align with the NSpM dataset format
    def apply_spacing_rules(self, query):

        # Remove redundant spaces
        query = re.sub(r' +', ' ', query)
        query = query.strip()

        # Ensure space before "?"
        query = re.sub(r'(?<=[^\s])\?', ' ?', query)

        # Ensure no space after "?"
        query = re.sub(r'\?[\s]+', '?', query)

        # Ensure space before "}"
        query = re.sub(r'(?<=[^\s])}', ' }', query)

        # Ensure space before "."
        query = re.sub(r'(?<=[^\s])\.', ' .', query)

        # Ensure space after "."
        query = re.sub(r'\.(?=[^\s])', '. ', query)

        # Ensure no space before and after "{"
        query = re.sub(r'\s?{\s?', '{', query)

        # Ensure no space before and after ":"
        query = re.sub(r'\s?:\s?', ':', query)

        return query

    def post_process(self, query):
        query = self.remove_tokens(query)
        query = self.apply_spacing_rules(query)
        return query
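As a quick sanity check, here is how the post-processor might be applied to a hypothetical raw generation (the query below is illustrative, not an actual model output), which it normalizes to the spacing convention described above:

[In]:

processor = SPARQLPostProcessor()

raw = "Answer: SELECT DISTINCT ? uri WHERE { ? uri dbo : author dbr : Dan_Brown . }<|endoftext|>"
print(processor.post_process(raw))
# -> SELECT DISTINCT ?uri WHERE{?uri dbo:author dbr:Dan_Brown . }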

Evaluation metric: We opted for the BLEU metric as our primary evaluative measure due to its well-established reputation in gauging the similarity between generated text sequences and reference sequences. In our context, the metric determines how closely our model’s generated SPARQL queries align with the desired gold queries. Given the importance of ensuring accurate and semantically coherent translations from natural language to SPARQL, BLEU offers a robust, granular assessment, making it an optimal choice for our project’s needs.
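For concreteness, a corpus-level BLEU score can be computed along the following lines (shown here with NLTK purely for illustration; predictions and references are placeholders for the evaluation module’s actual inputs):

[In]:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_score(predictions, references):
    # predictions: list of generated SPARQL query strings
    # references: list of gold SPARQL query strings (one reference per prediction)
    hypotheses = [pred.split() for pred in predictions]
    list_of_references = [[ref.split()] for ref in references]
    smoothing = SmoothingFunction().method1  # avoids zero scores on short queries
    return corpus_bleu(list_of_references, hypotheses, smoothing_function=smoothing)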

Evaluation scheme:

  • Raw vs. Post-Processed Outputs: Our initial examinations revealed that the fine-tuned model already exhibits notable proficiency in crafting structurally sound SPARQL queries that are well-grounded in the prompt text. However, we recognized that the BLEU metric, given its exacting nature, can be particularly sensitive to minor syntactic deviations, even when these do not alter the semantic intent of the query. To account for this, we decided to compute scores for both the raw generated queries and their post-processed counterparts. This approach not only helps us gauge the efficacy of our post-processing techniques but also underscores the inherent strengths of the model’s raw outputs.

  • Checkpoint-Based Evaluation: Recognizing the potential variability in model performance across different training durations, we also decided to evaluate multiple checkpoints. This iterative evaluation will provide insights into the optimal training duration for our model, which is of crucial importance given our hardware-related limitations. A sketch of the combined evaluation loop follows this list.
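Putting the two ideas together, the evaluation loop conceptually looks like the following (the checkpoint names, the generate_query function, and the test_set structure are all placeholders for the actual pipeline, not its real names):

[In]:

# Hypothetical driver loop: score each checkpoint on raw and post-processed outputs.
processor = SPARQLPostProcessor()

for checkpoint in ["checkpoint-500", "checkpoint-1000", "checkpoint-1500"]:  # placeholder names
    raw_predictions = [generate_query(checkpoint, sample["prompt"]) for sample in test_set]
    references = [sample["completion"] for sample in test_set]

    raw_bleu = bleu_score(raw_predictions, references)
    processed_bleu = bleu_score([processor.post_process(q) for q in raw_predictions], references)

    print(f"{checkpoint}: raw BLEU = {raw_bleu:.4f}, post-processed BLEU = {processed_bleu:.4f}")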

Next week plan

  • Expand evaluation: Now that we have a refined evaluation module and improved post-processing, the plan is to extend the evaluation to include a larger set of test data to better understand the model’s strengths and weaknesses.

  • Optimize model training: Insights from this week’s evaluation will be fed back into the training process to improve the next round of model training.