I am working on a project that requires an LLM to extract data lineage from many PySpark scripts. My plan is to take a Hugging Face LLM and fine-tune it on training data. The fields I am interested in are TargetTable, SourceTable, TargetColumn, SourceColumn, and the transformation logic; the extracted lineage will be sent to Collibra for a subsequent report.
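
To make the target output concrete, here is a rough sketch of what a single training example might look like. It is only an illustration under assumptions: the snake_case field names mirror the five items above, while the `LineageRecord` class, the `to_training_example` helper, the prompt wording, and the sample script and table names are all hypothetical placeholders.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass
class LineageRecord:
    """One column-level lineage edge (hypothetical schema for illustration)."""
    target_table: str
    source_table: str
    target_column: str
    source_column: str
    transformation: str


def to_training_example(script: str, records: List[LineageRecord]) -> Dict[str, str]:
    """Pair a PySpark script with its expected lineage as a prompt/completion dict."""
    return {
        "prompt": "Extract column-level data lineage from this PySpark script:\n" + script,
        "completion": json.dumps([asdict(r) for r in records]),
    }


# Made-up PySpark script, kept as plain text (it is never executed here).
sample_script = (
    'orders = spark.table("raw.orders")\n'
    'out = orders.withColumn("amount_usd", orders.amount * orders.fx_rate)\n'
    'out.write.saveAsTable("mart.orders_usd")\n'
)

# Hand-labelled lineage for the script above: one record per source column.
sample_records = [
    LineageRecord("mart.orders_usd", "raw.orders", "amount_usd", "amount", "amount * fx_rate"),
    LineageRecord("mart.orders_usd", "raw.orders", "amount_usd", "fx_rate", "amount * fx_rate"),
]

example = to_training_example(sample_script, sample_records)
print(json.dumps(example, indent=2))
```

Asking the model for a fixed JSON structure like this would make the output easy to validate before it is sent on to Collibra, though the exact format Collibra expects would still need to be checked.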
I’m relatively new to this field. Any advice or suggestions would be greatly appreciated!