Create an AI assistant for lawyers

:wave: Please read the topic category description to understand what this is all about

Description

The Contract Understanding Atticus Dataset (CUAD) is a new dataset for legal contract review. Legal contracts often contain a small number of important clauses that warrant review by lawyers. This is a time-intensive task that requires specialised knowledge, so the goal of this project is to see if Transformer models can be used to extract answers to a predefined set of legal questions.

Model(s)

Many of the Question Answering models on the Hub could serve as a good baseline to get started. Given the specialised domain, you will probably want to try:

  • Fine-tuning encoder-based models like BERT, RoBERTa, DeBERTa and friends
  • Performing domain adaptation by first fine-tuning the language model on the contracts before training the question-answering head
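To make the extractive QA setup concrete, here is a minimal, self-contained sketch (all names hypothetical, not tied to any particular library) of the core idea these models learn: turning per-token start/end scores into an answer span.

```python
# Hypothetical sketch: how an extractive QA head turns per-token start/end
# scores into an answer span. Real pipelines also mask question tokens and
# handle "no answer" cases; this only shows the argmax-over-spans idea.
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) token indices of the highest-scoring valid span."""
    best_start, best_end, best_score = 0, 0, float("-inf")
    for s, s_score in enumerate(start_scores):
        # only consider ends at or after the start, within max_answer_len tokens
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_start, best_end, best_score = s, e, score
    return best_start, best_end

print(best_span([0.1, 5.0, 0.2], [0.0, 0.1, 4.0]))  # → (1, 2)
```

The fine-tuned head produces exactly these two score vectors per passage; everything else (tokenization, sliding windows over long contracts) is plumbing around this search.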

Datasets

CUAD is available on the Hub.
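For orientation, CUAD follows the SQuAD-style extractive QA layout. A hand-written toy record (the real contexts are full contracts and the questions come from CUAD's fixed clause list):

```python
# Illustrative, hand-written example of a SQuAD-style record; field names
# follow the SQuAD convention, values here are invented for demonstration.
example = {
    "id": "demo-0",
    "title": "Example Agreement",
    "context": "This Agreement shall be governed by the laws of Delaware.",
    "question": "What is the governing law?",
    "answers": {
        "text": ["the laws of Delaware"],
        "answer_start": [36],  # character offset of the answer in `context`
    },
}

# Many CUAD passages contain no answer at all, in which case both lists are empty:
no_answer = {"text": [], "answer_start": []}
```

Note that `answer_start` is a character offset, so preprocessing has to map it to token positions after tokenization.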

Challenges

This is a highly specialised domain, so a vanilla Transformer may not obtain great results.

Desired project outcomes

Create a Streamlit or Gradio app on :hugs: Spaces that allows someone to select a legal contract and one or more questions, and provides the answers.

Additional resources

Discord channel

To chat and organise with other people interested in this project, head over to our Discord and:

  • Follow the instructions on the #join-course channel
  • Join the #ai-law-assistant channel

Just make sure you comment here to indicate that you’ll be contributing to this project :slight_smile:


Hi @lewtun, my name is Pavle and I would be interested in working on this project.


Hey @pmarkovic, I’ve just created a Discord channel (see topic description) in case you and others want to coordinate there :slight_smile:

Can somebody look at my notebook and help me resolve this issue? AI_assistant_law | Kaggle
I am using the blurr library, which depends on fastai.
The following is the error trace from `learn.summary()`:

```
/tmp/ipykernel_36/2746549787.py in <module>
----> 1 learn.summary()

/opt/conda/lib/python3.7/site-packages/fastai/callback/hook.py in summary(self)
    205     "Print a summary of the model, optimizer and loss function."
    206     xb = self.dls.train.one_batch()[:getattr(self.dls.train, "n_inp", 1)]
--> 207     res = module_summary(self, *xb)
    208     res += f"Optimizer used: {self.opt_func}\nLoss function: {self.loss_func}\n\n"
    209     if self.opt is not None:

/opt/conda/lib/python3.7/site-packages/fastai/callback/hook.py in module_summary(learn, *xb)
    173     #  thus are not counted inside the summary
    174     #TODO: find a way to have them counted in param number somehow
--> 175     infos = layer_info(learn, *xb)
    176     n,bs = 76,find_bs(xb)
    177     inp_sz = _print_shapes(apply(lambda x:x.shape, xb), bs)

/opt/conda/lib/python3.7/site-packages/fastai/callback/hook.py in layer_info(learn, *xb)
    157         train_only_cbs = [cb for cb in learn.cbs if hasattr(cb, '_only_train_loop')]
    158         with learn.removed_cbs(train_only_cbs), learn.no_logging(), learn as l:
--> 159             r = l.get_preds(dl=[batch], inner=True, reorder=False)
    160         return h.stored
    161 

/opt/conda/lib/python3.7/site-packages/fastai/learner.py in get_preds(self, ds_idx, dl, with_input, with_decoded, with_loss, act, inner, reorder, cbs, **kwargs)
    256             pred_i = 1 if with_input else 0
    257             if res[pred_i] is not None:
--> 258                 res[pred_i] = act(res[pred_i])
    259                 if with_decoded: res.insert(pred_i+2, getattr(self.loss_func, 'decodes', noop)(res[pred_i]))
    260             if reorder and hasattr(dl, 'get_idxs'): res = nested_reorder(res, tensor(idxs).argsort())

/opt/conda/lib/python3.7/site-packages/blurr/modeling/question_answering.py in activation(self, outs)
     77 
     78     def activation(self, outs):
---> 79         acts = [ self.loss_funcs[i].activation(o) for i, o in enumerate(outs) ]
     80         return acts
     81 

/opt/conda/lib/python3.7/site-packages/blurr/modeling/question_answering.py in <listcomp>(.0)
     77 
     78     def activation(self, outs):
---> 79         acts = [ self.loss_funcs[i].activation(o) for i, o in enumerate(outs) ]
     80         return acts
     81 

/opt/conda/lib/python3.7/site-packages/fastai/losses.py in activation(self, x)
     46     def __init__(self, *args, axis=-1, **kwargs): super().__init__(nn.CrossEntropyLoss, *args, axis=axis, **kwargs)
     47     def decodes(self, x):    return x.argmax(dim=self.axis)
---> 48     def activation(self, x): return F.softmax(x, dim=self.axis)
     49 
     50 # Cell

/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py in softmax(input, dim, _stacklevel, dtype)
   1677         dim = _get_softmax_dim("softmax", input.dim(), _stacklevel)
   1678     if dtype is None:
-> 1679         ret = input.softmax(dim)
   1680     else:
   1681         ret = input.softmax(dim, dtype=dtype)

AttributeError: 'str' object has no attribute 'softmax'
```

Hey @durgaamma2005, I suggest opening an issue on the blurr GitHub repository, since I’m sure they can help you with your query (I personally have no experience with blurr).

@lewtun can you give any hints on data preprocessing for CUAD?
Also, I am using Colab with 25 GB of RAM and the session is crashing during the preprocessing step. Is there a workaround for this?

Are you having trouble with the tokenizer or something else? If it’s the GPU RAM that’s the problem, one idea would be to reduce the batch size during training
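For example (made-up numbers), you can reduce the per-step batch size and compensate with gradient accumulation so the effective batch size the optimizer sees stays the same while peak memory drops:

```python
# Made-up numbers: trade per-step batch size for gradient accumulation.
# Peak GPU memory scales with the per-step batch size, while optimizer
# behaviour depends mainly on the effective batch size.
per_device_batch_size = 4        # reduced from 16 to fit in memory
gradient_accumulation_steps = 4  # accumulate gradients over 4 forward passes
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # → 16
```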

I am having an issue with the preprocessing step before tokenizing the dataset: because some of the questions have several possible answers (and some have none), I am getting a "list index out of range" error.

Hey @muhtasham one special aspect of CUAD is that it has long documents and produces a lot of passages for which there are no answers to be found. I suggest using the CUAD training script in their codebase (see topic description) as a starting point
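That error is typically hit when preprocessing indexes `answers["text"][0]` on a passage with no answer. A minimal, hypothetical guard (SQuAD-v2 style: unanswerable examples point at position 0, where the [CLS] token sits after tokenization):

```python
# Hypothetical sketch of guarding against unanswerable examples during
# preprocessing. Real scripts then map these character positions to token
# positions; this only shows the empty-answer check that avoids the
# "list index out of range" error.
def char_answer_span(example):
    answers = example["answers"]
    if len(answers["text"]) == 0:
        # no answer in this passage: point at position 0 (SQuAD-v2 convention)
        # instead of indexing into an empty list
        return 0, 0
    start = answers["answer_start"][0]       # take the first annotated answer
    end = start + len(answers["text"][0])    # exclusive character end offset
    return start, end

print(char_answer_span({"answers": {"text": [], "answer_start": []}}))            # → (0, 0)
print(char_answer_span({"answers": {"text": ["30 days"], "answer_start": [12]}})) # → (12, 19)
```

When a question has several annotated answers, training scripts usually just take the first one, as above.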
