We depart from the state-of-the-art method in perplexity-based data selection, extending it to use word-level linguistic units (i.e. lemmas, named-entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of...
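The cross-entropy-difference selection this snippet builds on can be sketched as follows. The add-one-smoothed unigram models, the toy POS-tag corpora, and the `ce_difference` helper are illustrative simplifications for this listing, not the paper's actual implementation:

```python
import math
from collections import Counter

def unigram_logprob(corpus):
    """Add-one-smoothed unigram log-probability from a list of token sequences."""
    counts = Counter(tok for seq in corpus for tok in seq)
    total, vocab = sum(counts.values()), len(counts) + 1  # +1 for unseen tokens
    return lambda tok: math.log((counts.get(tok, 0) + 1) / (total + vocab))

def ce_difference(seq, lp_in, lp_gen):
    """Moore-Lewis-style score: in-domain minus general cross-entropy per token.
    Lower scores mean the sequence looks more in-domain."""
    n = len(seq)
    return (-sum(lp_in(t) for t in seq) + sum(lp_gen(t) for t in seq)) / n

# Illustrative twist from the snippet: score POS-tag sequences, not surface forms.
in_domain = [["DET", "NOUN", "VERB"], ["DET", "ADJ", "NOUN", "VERB"]]
general   = [["PRON", "VERB", "ADV"], ["NOUN", "NOUN", "PUNCT"]]
lp_in, lp_gen = unigram_logprob(in_domain), unigram_logprob(general)
print(ce_difference(["DET", "NOUN", "VERB"], lp_in, lp_gen))   # in-domain-looking: lower
print(ce_difference(["PRON", "VERB", "ADV"], lp_in, lp_gen))   # general-looking: higher
```

Sentences (here, tag sequences) are ranked by this score and a top fraction is kept for language-model training.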
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models - ura-hcmut/Cherry_LLM
Very simple example of a perplexity calculation based on a Wikipedia-trained model. Based on filtering Common Crawl from https://github.com/facebookresearch/cc_net and the accompanying paper: https://arxiv.org/pdf/1911.00359 ...
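The perplexity calculation that repo describes can be sketched with a self-contained toy model; the unigram training corpus and smoothing here are stand-ins for the repo's Wikipedia-trained model:

```python
import math
from collections import Counter

def train_unigram(corpus, smoothing=1.0):
    """Train an add-one-smoothed unigram model from a list of sentences."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unknown tokens
    return lambda tok: (counts.get(tok, 0) + smoothing) / (total + smoothing * vocab)

def perplexity(prob, sentence):
    """Perplexity = exp of the average negative log-probability per token."""
    toks = sentence.split()
    nll = -sum(math.log(prob(t)) for t in toks)
    return math.exp(nll / len(toks))

corpus = ["the cat sat on the mat", "the dog sat on the log"]
model = train_unigram(corpus)
print(perplexity(model, "the cat sat"))        # in-domain tokens: lower perplexity
print(perplexity(model, "quantum flux drive")) # unseen tokens: higher perplexity
```

Filtering then amounts to keeping documents whose perplexity under the in-domain model falls below a chosen threshold.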
We measure the model's perplexity over the input: tokens predicted with high probability are considered normal, while those exhibiting high perplexity are flagged as adversarial. Our method also integrates context understanding by incorporating neighboring token inf...
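One way to realize that idea, sketched below under stated assumptions: each token's negative log-likelihood (NLL) is blended with its neighbors' mean NLL for context, and z-score outliers are flagged. The blend weight `alpha` and threshold `z_thresh` are illustrative knobs, not values from the paper:

```python
import math

def flag_high_perplexity_tokens(token_nlls, alpha=0.7, z_thresh=2.0):
    """Blend each token's NLL with the mean NLL of its immediate neighbours,
    then flag tokens whose blended score is a z-score outlier."""
    n = len(token_nlls)
    scores = []
    for i, nll in enumerate(token_nlls):
        neighbours = [token_nlls[j] for j in (i - 1, i + 1) if 0 <= j < n]
        ctx = sum(neighbours) / len(neighbours) if neighbours else nll
        scores.append(alpha * nll + (1 - alpha) * ctx)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n) or 1e-9
    return [(s - mean) / std > z_thresh for s in scores]

# Token at index 3 has an anomalously high NLL (low model probability): flagged.
nlls = [1.0, 1.2, 0.9, 8.5, 1.1, 1.0]
print(flag_high_perplexity_tokens(nlls))  # [False, False, False, True, False, False]
```

Blending with neighbours keeps an isolated low-probability but contextually plausible token from being flagged as aggressively as a genuine outlier.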
In this work we propose an approach to estimate perplexity values for complex language models such as a language model based on phrase classes. The perplexity values obtained by using this method are compared to other typically employed approaches and to the perplexity obtained without any simplificat...
Social Media Relevance Filtering Using Perplexity-Based Positive-Unlabelled Learning. Internet user-generated data, like Twitter, offers data scientists a public real-time data source that can provide insights, supplementing traditional data. However, identifying relevant data for such analyses can be time-...
In the present work we used a word clustering algorithm based on the perplexity criterion, in a Dialogue Act detection framework in order to model the structure of the speech of a user at a dialogue system. Specifically, we constructed an n-gram based model for each target Dialogue Act, ...
Linguistically-augmented Perplexity-based Data Selection for Language Models. doi:10.1016/j.csl.2014.10.002. Antonio Toral, Pavel Pecina, Longyue Wang, Josef van Genabith.
Lucas-Cuesta, J. M., Fernandez-Martinez, F., Moreno, T., and Ferreiros, J. (2012a). Mutual information and perplexity based clustering of dialogue information for dynamic adaptation of language models. In Proc. VII Jornadas de Tecnologia del Habla (Iberspeech2012)....
Research limitations/implications: While in a state of perplexity, reflecting on the in-game information helps players think and make meaning, thus supporting learning. We provide suggestions for how to better utilize perplexity as an in-game design mechanism to encourage young players to reflect ...