We depart from the state-of-the-art method in perplexity-based data selection and extend it in order to use word-level linguistic units (i.e. lemmas, named entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of...
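As a rough illustration of the idea in this abstract, the sketch below ranks candidate sentences by their perplexity under an in-domain model, optionally mapping each token to a coarser linguistic unit first. It is not the authors' method: the add-one-smoothed unigram model, the function names, and the toy lemma table `TOY_LEMMAS` are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy lemma table; a real system would use a lemmatizer.
TOY_LEMMAS = {"doctors": "doctor", "treated": "treat", "patients": "patient"}


def to_lemma(token):
    """Map a surface form to its lemma (identity if unknown)."""
    return TOY_LEMMAS.get(token, token)


def train_unigram(sentences):
    """Count unit frequencies over tokenized in-domain sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())


def perplexity(sentence, counts, total):
    """Perplexity of one sentence under an add-one-smoothed unigram model."""
    vocab_size = len(counts) + 1  # one extra slot for unseen units
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in sentence
    )
    return math.exp(-log_prob / len(sentence))


def select(in_domain, candidates, k, to_unit=lambda tok: tok):
    """Return the k candidates with the lowest in-domain perplexity.

    to_unit maps each token to the unit the model is built on
    (surface form by default, lemma in the usage below).
    """
    mapped = [[to_unit(t) for t in sent] for sent in in_domain]
    counts, total = train_unigram(mapped)
    ranked = sorted(
        candidates,
        key=lambda s: perplexity([to_unit(t) for t in s], counts, total),
    )
    return ranked[:k]


in_domain = [["doctor", "treat", "patient"], ["patient", "take", "drug"]]
candidates = [["stocks", "fell", "sharply"], ["doctors", "treated", "patients"]]
best = select(in_domain, candidates, k=1, to_unit=to_lemma)
```

With surface forms, both toy candidates consist only of unseen tokens and score the same; after mapping to lemmas, the medical sentence matches the in-domain counts and ranks first.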
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models - tianyi-lab/Cherry_LLM
The strategy underlying our system is based on a language distance computed by means of model perplexity. The best model configuration we have tested is a voting system making use of several n-gram models of both words and characters, even if word unigrams turned out to be a very ...
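A minimal sketch of perplexity as a language distance: one character n-gram model is trained per language, and a text is assigned to the language whose model gives it the lowest perplexity. The class and function names, the add-one smoothing, and the tiny training strings are assumptions for illustration; the voting scheme over several word and character models described in the snippet is not reproduced here.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Character n-grams of text, left-padded so every character is predicted."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Add-one-smoothed character n-gram model (illustrative only)."""

    def __init__(self, text, n=2):
        self.n = n
        grams = char_ngrams(text, n)
        self.ngrams = Counter(grams)
        self.contexts = Counter(g[:-1] for g in grams)
        self.vocab = set(text) | {" "}

    def perplexity(self, text):
        """Per-character perplexity of a non-empty text under this model."""
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        log_prob = 0.0
        for g in grams:
            p = (self.ngrams[g] + 1) / (self.contexts[g[:-1]] + vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / len(grams))


def closest_language(models, text):
    """Name of the model with the lowest perplexity on text."""
    return min(models, key=lambda name: models[name].perplexity(text))


models = {
    "en": CharNgramModel("the cat sat on the mat and the dog ran"),
    "es": CharNgramModel("el gato se sentó en la alfombra y el perro corrió"),
}
guess = closest_language(models, "the dog sat")
```

A voting system could be sketched on top of this by training several such models per language (different `n`, word-level as well as character-level) and taking the majority of their individual `closest_language` decisions.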
Perplexity MCP Server: An MCP server that provides web search capabilities using Perplexity's API with automatic model selection based on query intent. Prerequisites: Node.js (v14 or higher); a Perplexity API key (get one at https://www.perplexity.ai/settings/api); Claude Desktop App. Installation: In...
In this work we propose an approach to estimate perplexity values for complex language models such as a language model based on phrase classes. The perplexity values obtained by using this method are compared to other typically employed approaches and to the perplexity obtained without any simplificat...
Social Media Relevance Filtering Using Perplexity-Based Positive-Unlabelled Learning. Internet user-generated data, like Twitter, offers data scientists a public real-time data source that can provide insights, supplementing traditional data. However, identifying relevant data for such analyses can be time-...
Our approach takes advantage of Language Models and their implicit knowledge about correctly formed text, and we demonstrate here that perplexity is a valuable artefact that can contribute in terms of effectiveness and efficiency. As a matter of fact, the removal of noisy part...
In the present work we used a word clustering algorithm based on the perplexity criterion in a Dialogue Act detection framework, in order to model the structure of the speech of a user of a dialogue system. Specifically, we constructed an n-gram based model for each target Dialogue Act, ...
Linguistically-augmented Perplexity-based Data Selection for Language Models. doi:10.1016/j.csl.2014.10.002. Antonio Toral, Pavel Pecina, Longyue Wang, Josef Genabith
Lucas-Cuesta, J. M., Fernandez-Martinez, F., Moreno, T., and Ferreiros, J. (2012a). Mutual information and perplexity based clustering of dialogue information for dynamic adaptation of language models. In Proc. VII Jornadas de Tecnologia del Habla (Iberspeech2012)....