Modern NLP for "Economics of Innovation" (Open Research Project using Patent Data)

Hi all,

Suraj and I started discussing a potential research project, and he suggested I make a thread here to discuss it. As a quick intro, I am an NLP hobbyist and consumer of NLP research, and Suraj is a software developer with a keen interest in NLP.

From my perspective, here are a few goals of the project:

  • Upgrade our NLP skills in general
  • Make an immediate contribution to an applied field by introducing modern NLP methods
  • Dig deeper into NLP research and potentially make a minor advancement

My idea was to introduce the Innovation Studies (or Economics of Innovation) field to modern NLP methods. I suggested this for two reasons. First, it is generally accepted that long-run economic growth, and with it the standard of living, is driven by innovation. Second, there are about 8 million published US patents, all freely available, that we can use as data.

I am open to any directions to take, but here are a few starting points:

  1. Patent Classification
    I can see two reasons for improving patent classifications. One is for innovation researchers to use the improved patent classes in their research, rather than relying on the officially listed patent classes. The other is actual innovation policy. One consensus in the field is that basic research is drastically under-invested in, since companies do not directly benefit from its large spillovers. So the rate of return on basic research is much higher for society than for any single company. However, when governments try to encourage basic research by incentivizing these types of patents, inventors can try to “cheat the system” by re-labeling their patents. Economists Ufuk Akcigit and Stefanie Stantcheva [1] say “Going forward, finding a feasible way to differentiate between basic and applied research is essential to better innovation tax policies.”

  2. Estimating the “Impact” of a Patent
    As far as I know, the vast majority of innovation studies that use patent data use the number of citations as a proxy for a patent’s impact. So improving the “impact score” of a patent might help many innovation researchers. Professor Bryan Kelly et al. [2] use a very clever modification of TF-IDF to find similarity scores between patents. A patent’s impact is then estimated from the difference between two similarity scores: the target patent’s similarity to all previous patents, and its similarity to all future patents. This makes sense to me, and is well explained in their paper. However, I do think other ways of producing patent embeddings may be worth investigating, such as AllenAI’s SPECTER document embedding approach. I’d also like to look into deep graph networks to see if they can help estimate the impact of a patent without using citations.

  3. Patent Idea Generation
    I think it would be cool to generate a patent abstract (or idea) either unconditionally, or conditioned on a sentence that would guide the generation. There are lots of directions we could pursue with this.
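To make idea 2 a bit more concrete, here is a minimal sketch of a similarity-based impact score. It is only an illustration under several assumptions: it uses plain TF-IDF rather than the modified version from the paper, it takes the score as forward similarity minus backward similarity (the sign convention is my assumption), and the function names and toy patents are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse {term: weight} dict per text, using plain TF-IDF."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def impact_score(patents, target_index):
    """Mean similarity to later patents minus mean similarity to earlier ones.

    `patents` is a list of (year, text) pairs. Patents from the same year
    as the target are ignored.
    """
    vectors = tfidf_vectors([text for _, text in patents])
    target_year = patents[target_index][0]
    target_vec = vectors[target_index]
    backward = [cosine(target_vec, vectors[i])
                for i, (year, _) in enumerate(patents) if year < target_year]
    forward = [cosine(target_vec, vectors[i])
               for i, (year, _) in enumerate(patents) if year > target_year]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(forward) - mean(backward)


# Hypothetical toy corpus, not real patent text.
toy_patents = [
    (1990, "steam engine valve piston"),
    (1995, "steam engine boiler valve"),
    (2000, "neural network image classifier"),
    (2005, "neural network image segmentation"),
    (2010, "neural network speech classifier"),
]
```

On the toy corpus, `impact_score(toy_patents, 2)` comes out positive because the 2000 patent shares no terms with earlier ones but overlaps with later ones, while `impact_score(toy_patents, 1)` comes out negative. Swapping `tfidf_vectors` for a learned document embedding would leave the rest of the sketch unchanged.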

Anyway, sorry for the long post. Please let us know if you have ideas, suggestions, would like to participate, etc.


@VictorSanh, @joeddav, @yjernite we would love to hear your thoughts on this :hugs:

Fun idea – thanks for sharing! I think any of these directions would make for a fun and educational project. Some thoughts/questions on each:

  1. Do you have a good dataset with applied vs. basic annotations? If so, this should be pretty easy. If not, one direction would be to explore semi-supervised learning (Seb Ruder has a good blog post on it).
  2. This seems more interesting than #1 to me, but try to be careful with fairness and bias here. The model could easily learn to associate the race or gender of the patent holders, or the prestige of the organizations they come from, with the patent’s impact, since (I assume) these factors will correlate with citations. Removing bias completely won’t be possible, but it will add legitimacy to your project if you are careful & transparent about these biases. You wouldn’t want a scenario where companies use your tool to determine the value of their employees’ patents, which would likely end up disproportionately rewarding men over women, for example.
  3. After a quick google I found this paper, so I’d use that as a starting point and see what you could do that would be fun or interesting on top of that.

Thanks for the feedback!

  1. I believe there are a few small datasets that clearly label patents as applied vs basic research. There are also the official patent classes, which could help inform classification, but they do not contain a clear applied vs basic research distinction. Ideally, however, one would create a classifier that distinguishes between many hundreds of classes. This could allow policy makers to take advantage of the fact that, within applied or basic research, some areas have higher social returns or relate to a specific mission, like climate change. And thanks for that link!

  2. My original plan was to use only publication dates and the patent abstract and description text for estimating impact. This makes the task more challenging, but I believe it would remove as much bias as possible. I appreciate the recommendation and will keep it in mind.

Edit: To clarify, as far as I understand, the typical approach to analyzing the patent/innovation space is to create a network of individual inventors, institutions, and patent IDs. These nodes are then linked via citations, authorship and affiliations.

By contrast, I propose to ignore all of the above and focus only on the content of patents. This could help decrease the influence of the biases associated with citations, and increase the information associated with each patent. This latter point assumes that there is more information about a patent in the language embedding space than in the citation network space. To me, it’s a fair assumption, but I have no evidence yet :slight_smile:

  3. Yup! That’s a cool paper, and I agree it’s a great starting point.

Again, thanks for the feedback. Once Suraj and I decide on a starting point we can update this thread :slight_smile:


Hi @joeddav

Thanks for the feedback.

  2. This seems important to me as well. Fairness will be of the utmost concern; no private info (race, gender) will be visible to the model. And I think the embeddings should also help with discoverability, i.e. finding concepts/patents/papers that are similar to a particular paper.

  3. Generation is always fun, so will definitely start from there.
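For the discoverability point, a rough sketch of what "find similar patents" could look like once each patent has an embedding: rank all other patents by cosine similarity to the query. The IDs, the 3-dimensional vectors and the function names are hypothetical placeholders; real embeddings would come from whichever model we end up using.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, top_k=3):
    """Return up to `top_k` (patent_id, similarity) pairs, best match first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id  # never return the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings keyed by made-up patent IDs.
toy_embeddings = {
    "P1": [1.0, 0.0, 0.0],
    "P2": [0.9, 0.1, 0.0],
    "P3": [0.0, 1.0, 0.0],
    "P4": [0.0, 0.0, 1.0],
}

nearest = most_similar("P1", toy_embeddings, top_k=2)
```

A brute-force scan like this is fine for small experiments; with millions of patents an approximate nearest-neighbour index would probably be needed instead.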
