Create an AI assistant for lawyers

:wave: Please read the topic category description to understand what this is all about

Description

The Contract Understanding Atticus Dataset (CUAD) is a new dataset for legal contract review. Legal contracts often contain a small number of important clauses that warrant review by lawyers. This is a time-intensive task that requires specialised knowledge, so the goal of this project is to see if Transformer models can be used to extract answers to a predefined set of legal questions.

Model(s)

Many of the Question Answering models on the Hub could serve as a good baseline to get started. Given the specialised domain, you will probably want to try:

  • Fine-tuning encoder-based models like BERT, RoBERTa, DeBERTa and friends
  • Performing domain adaptation by first fine-tuning the language model on the contracts before training the question-answering head
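To make the extractive QA setup concrete, here is a minimal, self-contained sketch (all names hypothetical, not tied to any particular library) of the core idea these models learn: turning per-token start/end scores into an answer span.

```python
# Hypothetical sketch: how an extractive QA head turns per-token start/end
# scores into an answer span. Real pipelines also mask question tokens and
# handle "no answer" cases; this only shows the argmax-over-spans idea.
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) token indices of the highest-scoring valid span."""
    best_start, best_end, best_score = 0, 0, float("-inf")
    for s, s_score in enumerate(start_scores):
        # only consider ends at or after the start, within max_answer_len tokens
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_start, best_end, best_score = s, e, score
    return best_start, best_end

print(best_span([0.1, 5.0, 0.2], [0.0, 0.1, 4.0]))  # → (1, 2)
```

The fine-tuned head produces exactly these two score vectors per passage; everything else (tokenization, sliding windows over long contracts) is plumbing around this search.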

Datasets

CUAD is available on the Hub.
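For orientation, CUAD follows the SQuAD-style extractive QA layout. A hand-written toy record (the real contexts are full contracts and the questions come from CUAD's fixed clause list):

```python
# Illustrative, hand-written example of a SQuAD-style record; field names
# follow the SQuAD convention, values here are invented for demonstration.
example = {
    "id": "demo-0",
    "title": "Example Agreement",
    "context": "This Agreement shall be governed by the laws of Delaware.",
    "question": "What is the governing law?",
    "answers": {
        "text": ["the laws of Delaware"],
        "answer_start": [36],  # character offset of the answer in `context`
    },
}

# Many CUAD passages contain no answer at all, in which case both lists are empty:
no_answer = {"text": [], "answer_start": []}
```

Note that `answer_start` is a character offset, so preprocessing has to map it to token positions after tokenization.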

Challenges

This is a highly specialised domain, so a vanilla Transformer may not obtain great results.

Desired project outcomes

Create a Streamlit or Gradio app on :hugs: Spaces that allows someone to select a legal contract and one or more questions, and provides the answers.

Additional resources

Discord channel

To chat and organise with other people interested in this project, head over to our Discord and:

  • Follow the instructions on the #join-course channel
  • Join the #ai-law-assistant channel

Just make sure you comment here to indicate that you’ll be contributing to this project :slight_smile:


Hi @lewtun, my name is Pavle and I would be interested in working on this project.


Hey @pmarkovic, I’ve just created a Discord channel (see topic description) in case you and others want to coordinate there :slight_smile:

Can somebody look at my notebook and help me resolve this issue? AI_assistant_law | Kaggle
I am using the blurr library, which depends on fastai.
The following is the error trace from `learn.summary()`:

```
/tmp/ipykernel_36/2746549787.py in <module>
----> 1 learn.summary()

/opt/conda/lib/python3.7/site-packages/fastai/callback/hook.py in summary(self)
    205     "Print a summary of the model, optimizer and loss function."
    206     xb = self.dls.train.one_batch()[:getattr(self.dls.train, "n_inp", 1)]
--> 207     res = module_summary(self, *xb)
    208     res += f"Optimizer used: {self.opt_func}\nLoss function: {self.loss_func}\n\n"
    209     if self.opt is not None:

/opt/conda/lib/python3.7/site-packages/fastai/callback/hook.py in module_summary(learn, *xb)
    173     #  thus are not counted inside the summary
    174     #TODO: find a way to have them counted in param number somehow
--> 175     infos = layer_info(learn, *xb)
    176     n,bs = 76,find_bs(xb)
    177     inp_sz = _print_shapes(apply(lambda x:x.shape, xb), bs)

/opt/conda/lib/python3.7/site-packages/fastai/callback/hook.py in layer_info(learn, *xb)
    157         train_only_cbs = [cb for cb in learn.cbs if hasattr(cb, '_only_train_loop')]
    158         with learn.removed_cbs(train_only_cbs), learn.no_logging(), learn as l:
--> 159             r = l.get_preds(dl=[batch], inner=True, reorder=False)
    160         return h.stored
    161 

/opt/conda/lib/python3.7/site-packages/fastai/learner.py in get_preds(self, ds_idx, dl, with_input, with_decoded, with_loss, act, inner, reorder, cbs, **kwargs)
    256             pred_i = 1 if with_input else 0
    257             if res[pred_i] is not None:
--> 258                 res[pred_i] = act(res[pred_i])
    259                 if with_decoded: res.insert(pred_i+2, getattr(self.loss_func, 'decodes', noop)(res[pred_i]))
    260             if reorder and hasattr(dl, 'get_idxs'): res = nested_reorder(res, tensor(idxs).argsort())

/opt/conda/lib/python3.7/site-packages/blurr/modeling/question_answering.py in activation(self, outs)
     77 
     78     def activation(self, outs):
---> 79         acts = [ self.loss_funcs[i].activation(o) for i, o in enumerate(outs) ]
     80         return acts
     81 

/opt/conda/lib/python3.7/site-packages/blurr/modeling/question_answering.py in <listcomp>(.0)
     77 
     78     def activation(self, outs):
---> 79         acts = [ self.loss_funcs[i].activation(o) for i, o in enumerate(outs) ]
     80         return acts
     81 

/opt/conda/lib/python3.7/site-packages/fastai/losses.py in activation(self, x)
     46     def __init__(self, *args, axis=-1, **kwargs): super().__init__(nn.CrossEntropyLoss, *args, axis=axis, **kwargs)
     47     def decodes(self, x):    return x.argmax(dim=self.axis)
---> 48     def activation(self, x): return F.softmax(x, dim=self.axis)
     49 
     50 # Cell

/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py in softmax(input, dim, _stacklevel, dtype)
   1677         dim = _get_softmax_dim("softmax", input.dim(), _stacklevel)
   1678     if dtype is None:
-> 1679         ret = input.softmax(dim)
   1680     else:
   1681         ret = input.softmax(dim, dtype=dtype)

AttributeError: 'str' object has no attribute 'softmax'
```

Hey @durgaamma2005, I suggest opening an issue on the blurr GitHub repository, since I’m sure they can help you with your query (I personally have no experience with blurr).

@lewtun can you give any hints on data preprocessing for CUAD?
Also, I am using Colab with 25 GB of RAM and the session is crashing during the preprocessing step. Is there a workaround for this?

Are you having trouble with the tokenizer or something else? If it’s the GPU RAM that’s the problem, one idea would be to reduce the batch size during training
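For example (made-up numbers), you can reduce the per-step batch size and compensate with gradient accumulation so the effective batch size the optimizer sees stays the same while peak memory drops:

```python
# Made-up numbers: trade per-step batch size for gradient accumulation.
# Peak GPU memory scales with the per-step batch size, while optimizer
# behaviour depends mainly on the effective batch size.
per_device_batch_size = 4        # reduced from 16 to fit in memory
gradient_accumulation_steps = 4  # accumulate gradients over 4 forward passes
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # → 16
```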

I am having an issue with the preprocessing step before tokenizing the dataset: because some of the questions have several possible answers (and some have none), I am getting a "list index out of range" error.

Hey @muhtasham one special aspect of CUAD is that it has long documents and produces a lot of passages for which there are no answers to be found. I suggest using the CUAD training script in their codebase (see topic description) as a starting point
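That error is typically hit when preprocessing indexes `answers["text"][0]` on a passage with no answer. A minimal, hypothetical guard (SQuAD-v2 style: unanswerable examples point at position 0, where the [CLS] token sits after tokenization):

```python
# Hypothetical sketch of guarding against unanswerable examples during
# preprocessing. Real scripts then map these character positions to token
# positions; this only shows the empty-answer check that avoids the
# "list index out of range" error.
def char_answer_span(example):
    answers = example["answers"]
    if len(answers["text"]) == 0:
        # no answer in this passage: point at position 0 (SQuAD-v2 convention)
        # instead of indexing into an empty list
        return 0, 0
    start = answers["answer_start"][0]       # take the first annotated answer
    end = start + len(answers["text"][0])    # exclusive character end offset
    return start, end

print(char_answer_span({"answers": {"text": [], "answer_start": []}}))            # → (0, 0)
print(char_answer_span({"answers": {"text": ["30 days"], "answer_start": [12]}})) # → (12, 19)
```

When a question has several annotated answers, training scripts usually just take the first one, as above.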
