My experience is that if you use a 12 GB GPU to load the llama13b model, generation basically gets stuck once the output exceeds 200 tokens.

Member jmorganca commented on Feb 20, 2024: This should be fixed as of 0.1.24. Please let me know if that isn't the case, and we'll re-open this (and...
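For runs on older versions, one possible client-side workaround is to cap the response length explicitly so generation stops before the stall point. Below is a minimal sketch assuming a local Ollama server on its default port; the model tag, prompt, and 200-token cap are illustrative, not from the thread above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OllamaCappedGenerate {
    public static void main(String[] args) throws Exception {
        // num_predict caps the number of generated tokens for this request.
        String body = """
            {"model": "llama2:13b",
             "prompt": "Summarize the benefits of quantization.",
             "stream": false,
             "options": {"num_predict": 200}}
            """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```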
("ANTHROPIC_API_KEY"); // HuggingFace API key here: https://huggingface.co/settings/tokens public static final String HF_API_KEY = System.getenv("HF_API_KEY"); // Judge0 RapidAPI key here: https://rapidapi.com/judge0-official/api/judge0-ce public static final String RAPID_API_KEY ...
- Parallel summarization and extraction, reaching an output of 80 tokens per second with the 13B LLaMa2 model
- HYDE (Hypothetical Document Embeddings) for enhanced retrieval based upon LLM responses (see the sketch after this list)
- Semantic Chunking for better document splitting (requires GPU)
- Variety of models supported (LLaMa2, Mistral...
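HYDE's core move is to embed a *hypothetical answer* generated by the LLM rather than the raw query, so the query vector lands nearer to documents that actually answer it. A minimal sketch of that idea; the `Llm` and `Embedder` interfaces are stand-ins for whatever LLM client and embedding model you use, not APIs from the project above:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for a real LLM client and embedding model.
interface Llm { String complete(String prompt); }
interface Embedder { float[] embed(String text); }

public class HydeRetriever {
    private final Llm llm;
    private final Embedder embedder;

    public HydeRetriever(Llm llm, Embedder embedder) {
        this.llm = llm;
        this.embedder = embedder;
    }

    // HYDE: generate a plausible answer first, then retrieve by similarity
    // to that answer instead of to the original question.
    public List<String> retrieve(String query, List<String> corpus, int topK) {
        String hypothetical = llm.complete(
                "Write a short passage that plausibly answers: " + query);
        float[] queryVec = embedder.embed(hypothetical);
        return corpus.stream()
                .sorted(Comparator.comparingDouble(
                        (String doc) -> -cosine(queryVec, embedder.embed(doc))))
                .limit(topK)
                .toList();
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
    }
}
```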
("ANTHROPIC_API_KEY"); // HuggingFace API key here: https://huggingface.co/settings/tokens public static final String HF_API_KEY = System.getenv("HF_API_KEY"); // Judge0 RapidAPI key here: https://rapidapi.com/judge0-official/api/judge0-ce public static final String RAPID_API_KEY ...