Reminder: I have read the README and searched the existing issues. System Info / Reproduction: using 2048*2048 images, a ShareGPT-format dataset of 30,000 image-text pairs in total. With preprocessing_num_workers=256 (or 128, 64, etc.), the run always stalls at "Running tokenizer on dataset" and, after a long time...
@SaulLu when I use the wikitext-103 dataset, the tokenizer hangs at "Running tokenizer on dataset" and shows no progress. This was not always an issue, but as of today it has become one. It will hang either at the end of tokenizing or at the very beginning. Any idea why this would be hanging?
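For context, the "Running tokenizer on dataset" progress bar in both reports above comes from datasets.Dataset.map; a minimal sketch of that step, where num_proc plays the role of preprocessing_num_workers (the gpt2 tokenizer here is an illustrative choice, not taken from either report):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"])

# num_proc forks worker processes; hangs like those above are often
# narrowed down by lowering it, or by passing num_proc=None to run
# everything in a single process.
tokenized = raw.map(tokenize, batched=True, num_proc=4,
                    desc="Running tokenizer on dataset")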
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import ml.dmlc.xgboost4j.scala.spark.{XGBoostEstimator, ...
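The same Tokenizer → HashingTF → classifier pipeline, as a minimal runnable sketch in PySpark (the toy DataFrame and column names are assumptions; the snippet's XGBoostEstimator stage from the separate xgboost4j-spark package is omitted here):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

spark = SparkSession.builder.master("local[2]").getOrCreate()
train = spark.createDataFrame(
    [("spark ml pipelines", 1.0), ("something unrelated", 0.0)],
    ["text", "label"],
)

# Tokenizer splits text into words; HashingTF hashes words into feature vectors.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(train)
model.transform(train).select("text", "prediction").show()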
It seems that either the tokenizer outputs or the embedding model is not being properly moved to the GPU. Could you try printing the device of the token embedder (with something like print(next(self.token_embedding.parameters()).device)) and the device of the input_ids (print(input_ids.device))?
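A minimal sketch of that debugging check, assuming a module with a token_embedding submodule (the names mirror the comment above; moving the inputs to the module's device is one common fix for a mismatch, not necessarily the right one here):

import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, dim)

    def forward(self, input_ids):
        # Compare where the weights and the inputs actually live.
        embed_device = next(self.token_embedding.parameters()).device
        print("embedding on:", embed_device, "| input_ids on:", input_ids.device)
        # Common fix for a mismatch: move the inputs to the module's device.
        return self.token_embedding(input_ids.to(embed_device))

model = TinyModel().to("cuda" if torch.cuda.is_available() else "cpu")
out = model(torch.randint(0, 100, (2, 8)))  # input_ids start on the CPU
print(out.device)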
import torch
from transformers import AutoConfig, AutoTokenizer
from transformers import AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

tokenizer = AutoTokenizer.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B')
...
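The snippet breaks off after loading the tokenizer; a sketch of how these accelerate imports are typically combined (the torch_dtype and GPTNeoXLayer no-split class are assumptions based on this being a GPT-NeoX model, not taken from the snippet):

import torch
from transformers import AutoModelForCausalLM
from accelerate import dispatch_model, infer_auto_device_map
from accelerate.utils import get_balanced_memory

model = AutoModelForCausalLM.from_pretrained(
    'togethercomputer/GPT-NeoXT-Chat-Base-20B', torch_dtype=torch.float16
)

# Balance the weights across the available GPUs without splitting a
# transformer block across devices, then attach hooks that route
# activations between devices at run time.
max_memory = get_balanced_memory(model, no_split_module_classes=['GPTNeoXLayer'])
device_map = infer_auto_device_map(model, max_memory=max_memory,
                                   no_split_module_classes=['GPTNeoXLayer'])
model = dispatch_model(model, device_map=device_map)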
Hi, I found multiple issues, such as BPE tokenizers not being found and problems loading the tokenizer, among others. I'd suggest redoing the blog code so it works with the current deployment and making it publicly available. Thanks.
My own task or dataset (give details below). Reproduction example code:

from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf", use_safetensors=True)
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-70b-...
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Create new index
train_idx = [i for i in range(len(train.index))]
test_idx = [i for i in range(len(test.index))]
val_idx = [i for i in range(len(val.index))]

# Convert...
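The snippet breaks off at "# Convert..."; a plausible continuation, assuming train/test/val are pandas DataFrames with a "text" column (the column name is a guess, not from the snippet):

# Convert each split's text into padded, truncated encodings.
train_enc = tokenizer(list(train["text"]), padding=True, truncation=True, return_tensors="pt")
test_enc = tokenizer(list(test["text"]), padding=True, truncation=True, return_tensors="pt")
val_enc = tokenizer(list(val["text"]), padding=True, truncation=True, return_tensors="pt")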