Hemant Vishwakarma: Hugging Face Transformers BART CUDA error: CUBLAS_STATUS_NOT

I'm trying to finetune the Facebook BART model, I'm following this article in order to classify text using my own dataset.

And I'm using the Trainer object in order to train:

training_args = TrainingArguments(
    output_dir=model_directory,      # output directory
    num_train_epochs=1,              # total number of training epochs - 3
    per_device_train_batch_size=4,  # batch size per device during training - 16
    per_device_eval_batch_size=16,   # batch size for evaluation - 64
    warmup_steps=50,                # number of warmup steps for learning rate scheduler - 500
    weight_decay=0.01,               # strength of weight decay
    logging_dir=model_logs,          # directory for storing logs
    logging_steps=10,
)

model = BartForSequenceClassification.from_pretrained("facebook/bart-large-mnli") # bart-large-mnli

trainer = Trainer(
    model=model,                          # the instantiated 🤗 Transformers model to be trained
    args=training_args,                   # training arguments, defined above
    compute_metrics=new_compute_metrics,  # a function to compute the metrics
    train_dataset=train_dataset,          # training dataset
    eval_dataset=val_dataset              # evaluation dataset
)

This is the tokenizer I used:

from transformers import BartTokenizerFast
tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-large-mnli')

But when I use trainer.train() I get the following:

Printing the following:

***** Running training *****
  Num examples = 172
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 11

Followed by this error:

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 1496, in forward
    outputs = self.model(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 1222, in forward
    encoder_outputs = self.encoder(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 846, in forward
    layer_outputs = encoder_layer(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 323, in forward
    hidden_states, attn_weights, _ = self.self_attn(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 191, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I've searched this site and GitHub and hugging face forum but still didn't find anything that helped me fix this for me (I tried adding more memory, lowering batches and warmup, restarting, specifying CPU or GPU, and more, but none worked for me)

I'm using Databricks for that, with the cluster: Standard_NC24s_v3 with 4 GPUs, 2 till 6 workers

If you require any other information, comment and I'll add it up asap

from Hugging Face Transformers BART CUDA error: CUBLAS_STATUS_NOT_INITIALIZE

Hemant Vishwakarma

Monday, 1 May 2023

Hugging Face Transformers BART CUDA error: CUBLAS_STATUS_NOT_INITIALIZE

No comments:

Post a Comment