We start from the state-of-the-art method in perplexity-based data selection and extend it to use word-level linguistic units (i.e., lemmas, named-entity categories, and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of...
We demonstrate that, for multiple dataset compositions, perplexity-based pruning of pretraining data can \emph{significantly} improve downstream task performance: pruning based on perplexities computed with a 125-million-parameter model improves the average downstream task performance of a 3 billion ...
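A minimal sketch of the pruning idea described above, assuming a small HuggingFace causal LM scores documents on behalf of a larger model to be trained later. The model name ("EleutherAI/pythia-160m" standing in for a ~125M reference model), the helper names, and the middle-percentile keep band are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: score each document's perplexity with a small reference LM and keep
# only documents whose perplexity falls inside a chosen percentile band.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def doc_perplexity(text, model, tokenizer, max_len=1024):
    """Per-document perplexity under the small reference model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # mean token cross-entropy -> perplexity

def prune_by_perplexity(docs, model, tokenizer, low_pct=25, high_pct=75):
    """Keep documents whose perplexity lies inside a middle percentile band."""
    scores = [doc_perplexity(d, model, tokenizer) for d in docs]
    ordered = sorted(scores)
    lo = ordered[int(len(scores) * low_pct / 100)]
    hi = ordered[int(len(scores) * high_pct / 100)]
    return [d for d, s in zip(docs, scores) if lo <= s <= hi]

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m").eval()
kept = prune_by_perplexity(["The cat sat on the mat.", "asdf qwer zxcv 1234 !!!"], model, tokenizer)
```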
The performance of automatic speech summarisation has been improved in previous experiments by using linguistic model adaptation. We extend such adaptation to the use of class models, whose robustness further improves summarisation performance on a wider variety of objective evaluation metrics...
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models - tianyi-lab/Cherry_LLM
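A hedged sketch in the spirit of such a perplexity-based difficulty score, not taken from the repository's code: it compares the model's perplexity on a response with and without the instruction as context, so responses that the instruction barely helps to predict score as harder. The model name, prompt template, and function names are assumptions for illustration.

```python
# Sketch: difficulty of an instruction-response pair as the ratio of
# conditioned to unconditioned answer perplexity under one causal LM.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_loss(prompt, answer):
    """Mean cross-entropy on answer tokens, conditioning on a (possibly empty) prompt."""
    prompt_ids = tok(prompt, return_tensors="pt")["input_ids"] if prompt else None
    answer_ids = tok(answer, return_tensors="pt")["input_ids"]
    ids = answer_ids if prompt_ids is None else torch.cat([prompt_ids, answer_ids], dim=1)
    labels = ids.clone()
    if prompt_ids is not None:
        labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    with torch.no_grad():
        return lm(ids, labels=labels).loss.item()

def difficulty_score(instruction, response):
    """Higher values mean the instruction helps less, i.e. a harder example."""
    conditioned = math.exp(answer_loss(instruction + "\n", response))
    unconditioned = math.exp(answer_loss("", response))
    return conditioned / unconditioned

print(difficulty_score("Translate to French: good morning", "bonjour"))
```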
In this work we propose an approach to estimate perplexity values for complex language models, such as a language model based on phrase classes. The perplexity values obtained with this method are compared to other typically employed approaches and to...
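For context, a minimal sketch of how perplexity is typically computed for a class-based LM, one of the commonly employed approaches such work compares against: the word probability is factored as p(class(w_i) | class(w_{i-1})) * p(w_i | class(w_i)). The class map and probability tables below are toy assumptions, not the phrase-class model of the snippet.

```python
# Sketch: perplexity of a word sequence under a word-class bigram model.
import math

word2class = {"new": "ADJ", "york": "CITY", "boston": "CITY", "in": "PREP"}
class_bigram = {("PREP", "ADJ"): 0.4, ("ADJ", "CITY"): 0.6, ("PREP", "CITY"): 0.4, ("CITY", "PREP"): 0.2}
word_given_class = {"york": 0.5, "boston": 0.5, "new": 1.0, "in": 1.0}

def class_lm_perplexity(words):
    log_prob = 0.0
    for prev, cur in zip(words, words[1:]):
        # p(w_i | w_{i-1}) approximated via the class bigram and class membership
        p = class_bigram.get((word2class[prev], word2class[cur]), 1e-6) * word_given_class[cur]
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(words) - 1))

print(class_lm_perplexity(["in", "new", "york"]))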
The strategy underlying our system is based on a language distance computed by means of model perplexity. The best model configuration we have tested is a voting system that makes use of several n-gram models of both words and characters, although word unigrams turned out to be a very ...
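A minimal sketch of the perplexity-as-language-distance idea: train one character n-gram model per language and rank candidate languages by the perplexity each model assigns to the test text. The trigram order, add-one smoothing, and toy corpora are illustrative choices, not the voting system described above.

```python
# Sketch: character trigram LMs per language; lowest perplexity = closest language.
import math
from collections import Counter

class CharNgramLM:
    def __init__(self, n=3):
        self.n, self.ngrams, self.contexts, self.vocab = n, Counter(), Counter(), set()

    def train(self, text):
        text = " " * (self.n - 1) + text
        self.vocab.update(text)
        for i in range(len(text) - self.n + 1):
            gram = text[i : i + self.n]
            self.ngrams[gram] += 1
            self.contexts[gram[:-1]] += 1

    def perplexity(self, text):
        text = " " * (self.n - 1) + text
        log_prob, count = 0.0, 0
        for i in range(len(text) - self.n + 1):
            gram = text[i : i + self.n]
            # add-one smoothed probability of the last character given its context
            p = (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + len(self.vocab))
            log_prob += math.log(p)
            count += 1
        return math.exp(-log_prob / max(count, 1))

models = {}
for lang, corpus in {"en": "the cat sat on the mat", "es": "el gato se sentó en la alfombra"}.items():
    models[lang] = CharNgramLM()
    models[lang].train(corpus)

test = "the dog sat on the rug"
print(min(models, key=lambda lang: models[lang].perplexity(test)))  # lowest perplexity wins
```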
Perplexity proves to be sufficiently correlated with the objective evaluation metrics used in the summarisation literature that it can be used in this fashion. For a much reduced computational cost (approximately 500 times faster), final relative improvements are very similar to those previously obtained,...
Our approach takes advantage of language models and their implicit knowledge about correctly formed text, and we demonstrate here that perplexity is a valuable artefact that can contribute to both effectiveness and efficiency. As a matter of fact, the removal of noisy part...
In the second one we make global decisions based on the optimization of the global perplexity of the combination of the cluster-related LMs. Our experiments show a 15.17% relative reduction in word error rate, which helps to improve the performance of the understanding and the dialogue...
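A hedged sketch of that global-decision idea: combine several cluster-specific LMs by linear interpolation and choose the interpolation weights that minimise perplexity on held-out data, here via the standard EM update for mixture weights. The per-token probabilities are assumed to come from existing cluster LMs; the function names and toy numbers are illustrative.

```python
# Sketch: EM estimation of linear-interpolation weights for a mixture of LMs,
# minimising the perplexity of the combined model on held-out tokens.
import math

def em_interpolation_weights(token_probs, iters=50):
    """token_probs[t][k] = probability of held-out token t under cluster LM k."""
    k = len(token_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each LM for each token
        counts = [0.0] * k
        for probs in token_probs:
            mix = [w * p for w, p in zip(weights, probs)]
            total = sum(mix)
            for j in range(k):
                counts[j] += mix[j] / total
        # M-step: renormalise responsibilities into new weights
        weights = [c / len(token_probs) for c in counts]
    return weights

def interpolated_perplexity(token_probs, weights):
    log_prob = sum(math.log(sum(w * p for w, p in zip(weights, probs))) for probs in token_probs)
    return math.exp(-log_prob / len(token_probs))

# toy held-out probabilities from two hypothetical cluster LMs
held_out = [(0.02, 0.001), (0.015, 0.03), (0.002, 0.05), (0.01, 0.02)]
w = em_interpolation_weights(held_out)
print(w, interpolated_perplexity(held_out, w))
```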