Suraj and I started discussing a potential research project, and he suggested I start a thread here to talk it over. As a quick intro, I am an NLP hobbyist and consumer of NLP research, and Suraj is a software developer with a keen interest in NLP.
From my perspective, here are a few goals of the project:
- Upgrade our NLP skills in general
- Make an immediate contribution to an applied field by introducing modern NLP methods
- Dig deeper into NLP research and potentially make a minor advancement
My idea was to bring modern NLP methods to the Innovation Studies (or Economics of Innovation) field. I suggested this for two reasons. First, it is generally accepted that long-run economic growth, and with it the standard of living, is driven by innovation. Second, there are about 8 million published US patents, all freely available, that we can use as data.
I am open to any directions to take, but here are a few starting points:
Patent Classification
I can see two reasons for improving patent classifications. One is so that innovation researchers can use the improved classes in their research, rather than relying on the officially listed patent classes. The other is actual innovation policy. One consensus in the field is that basic research is drastically under-invested in, since companies do not directly benefit from its large spillovers, so the rate of return on basic research is much higher for society than for any single company. However, when governments try to encourage basic research by offering incentives for these types of patents, inventors can try to “cheat the system” by re-labeling their patents. As economists Ufuk Akcigit and Stefanie Stantcheva say, “Going forward, finding a feasible way to differentiate between basic and applied research is essential to better innovation tax policies.”
Estimating the “Impact” of a Patent
As far as I know, the vast majority of innovation studies that use patent data use the number of citations as a proxy for the impact of a patent. So a better “impact score” might help many innovation researchers. Professor Bryan Kelly et al. use a very clever modification of TF-IDF to compute similarity scores between patents. A patent’s impact is then estimated by comparing two things: the similarity between the target patent and all previous patents, and the similarity between the target patent and all future patents. A patent that looks unlike earlier work but similar to later work scores as high impact. This makes sense to me, and is well explained in their paper. However, I do think other ways of producing patent embeddings are worth investigating, such as AllenAI’s SPECTER document embedding approach. I’d also like to look into deep graph networks to see whether they can help estimate the impact of a patent without using citations.
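To make the backward/forward similarity idea concrete, here is a minimal sketch in plain Python. It is not the method from the paper: it uses ordinary TF-IDF over the whole corpus rather than their modified version, it takes the difference between mean forward and mean backward cosine similarity, and the function names (`tfidf_vectors`, `impact_score`) and the five toy "patents" are made up for illustration. It assumes the documents are already sorted by filing date.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document (plain TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_similarity(target, others):
    """Average cosine similarity of target to a list of vectors (0.0 if empty)."""
    if not others:
        return 0.0
    return sum(cosine(target, other) for other in others) / len(others)


def impact_score(vectors, index):
    """Mean similarity to later documents minus mean similarity to earlier ones.

    Assumes `vectors` is ordered chronologically (an assumption of this sketch).
    """
    target = vectors[index]
    backward = mean_similarity(target, vectors[:index])
    forward = mean_similarity(target, vectors[index + 1:])
    return forward - backward


# Toy corpus, oldest first. These abstracts are invented for illustration only.
patents = [
    "steam engine with piston valve",
    "steam engine with improved boiler",
    "transistor amplifier using semiconductor junction",
    "semiconductor junction transistor for radio amplifier",
    "integrated circuit of semiconductor transistor elements",
]
vectors = tfidf_vectors(patents)
scores = [impact_score(vectors, i) for i in range(len(patents))]
```

In this toy corpus the third document shares no words with the two earlier ones and several with the two later ones, so it gets a positive score, while the second document mostly resembles what came before it and scores negative. Swapping `tfidf_vectors` for a learned embedding (SPECTER, for example) would only require replacing the vectors and the similarity function.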
Patent Idea Generation
I think it would be cool to generate a patent abstract (or idea) either unconditionally, or conditioned on a sentence that would guide the generation. There are lots of directions we could pursue with this.
Anyway, sorry for the long post. Please let us know if you have ideas, suggestions, would like to participate, etc.