Regarding the best way to vectorize Q&A for a form with close to 1,200 fields

Hi Team

I am working on a POC where I have an application form holding close to 1,200 fields, organized into sections and subsections, with multiple versions present as it gets submitted, approved, and reverted back for corrections.

Now I want to vectorize the form so a user can interact with it by asking questions and getting answers, or ask for the form data of a particular section/subsection so the values can be copied into another form.

I wanted to know if you have any suggestions. I am trying the LangGraph framework to organize the steps.

I am doing the vectorization with BGE-M3; any suggestions on metadata, chunking strategy, or overall approach would be helpful.

The form will have nested arrays and lists.
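
Right now my rough plan is one chunk per field, keeping section/subsection/pointer as metadata instead of burying them in the embedded text. A minimal sketch of what I mean (the FlagEmbedding call is just what I am trying; field names and values are placeholders):

# Sketch only: one chunk per form field, dense + sparse vectors from BGE-M3,
# with section/subsection kept as metadata. Assumes the FlagEmbedding package.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

fields = [  # placeholder rows; real ones come from the form definition
    {"label": "Applicant address line 1", "section": "Applicant", "subsection": "Address",
     "json_property": "addr1", "pointer": "/applicant/addresses/0/addr1", "value": "221B Baker St"},
]

texts = [f'{f["section"]} > {f["subsection"]} > {f["label"]} ({f["json_property"]})' for f in fields]
out = model.encode(texts, return_dense=True, return_sparse=True)

records = [
    {"dense": out["dense_vecs"][i], "sparse": out["lexical_weights"][i], "metadata": fields[i]}
    for i in range(len(fields))
]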

Please add approaches or any GitHub links if available so I can go through them.

Thanks in advance

I’m not too familiar with LangGraph, but is it something like this…?

Thank you for the reply.

I am using LangGraph to organize it as steps; yes, I think the example you pointed to is similar.

Mainly I want to see which approach is best for a form with 1,200 fields containing nested lists or arrays.

I want to implement a chatbot that can interact with the user and fill a new form using existing ones, or answer questions that the user asks.

As an example, I might ask: pull the address details from form A and fill them into the current form that I am editing.

Is it possible?

Yeah. Maybe possible.
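
Maybe something along these lines with LangGraph (a rough sketch only; the node bodies are stubs you would replace with your retrieval and form-update code):

# Sketch of a two-step LangGraph flow: retrieve fields from form A, then fill the current form.
from typing import List, TypedDict
from langgraph.graph import END, StateGraph

class FormState(TypedDict):
    query: str            # e.g. "pull me form A address details"
    fields: List[dict]    # retrieved rows: label, pointer, value
    result: str

def retrieve_fields(state: FormState) -> dict:
    # placeholder: look up the requested section of form A in the vector store
    return {"fields": [{"label": "Address line 1", "pointer": "/address/0/line1", "value": "..."}]}

def fill_current_form(state: FormState) -> dict:
    # placeholder: write each retrieved value into the form being edited
    return {"result": f"filled {len(state['fields'])} fields"}

graph = StateGraph(FormState)
graph.add_node("retrieve", retrieve_fields)
graph.add_node("fill", fill_current_form)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "fill")
graph.add_edge("fill", END)
app = graph.compile()

print(app.invoke({"query": "pull me form A address details", "fields": [], "result": ""}))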

This looks good; I can start looking into how it works.

One more thought I have: what if we create a CSV for each form, like below, with the headers

Label, section, subsection, value, json property name, status

I will have everything including the property name, and then the user can also query with natural language.

The JSON only contains property names, which are not semantically straightforward; for example it can have desc instead of description, or addr instead of address.
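
Something like this simple lookup is all I mean; the label/alias layer resolves the user's wording to the real JSON property (the names below are made up):

# Sketch: the CSV catalog acts as a semantic layer over non-obvious JSON keys.
catalog = [
    {"label": "Description", "aliases": ["description", "details"],
     "json_property": "desc", "section": "General", "subsection": "Summary"},
    {"label": "Mailing address", "aliases": ["address", "postal address"],
     "json_property": "addr", "section": "Contact", "subsection": "Address"},
]

def resolve(user_phrase: str):
    phrase = user_phrase.lower()
    for row in catalog:
        if phrase == row["label"].lower() or phrase in row["aliases"]:
            return row
    return None

print(resolve("address"))   # -> the Mailing address row, json_property "addr"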

Just a thought.

Seems possible. Chatbot code is rarely usable as-is, and when the task is difficult, the examples out there don't offer much code either…

Thank you for this

This looks good and will fit my case.

But is there a library that can easily convert JSON to CSV?

I can provide a template, and the library should find each JSON property and value and then write them to CSV following the template, adding a value column along with the JSON pointer.

If it sees an array or list in the JSON, it should produce multiple entries with the index numbers in another column.

So the output CSV will be something like

Label, propertyname, repeatable, array/listindex, property value, section, subsection, pointerforjson

Of course it can be done in Python. It seems like it could be done in bash too.


Convert JSON → CSV with an RFC-6901 pointer per row, then left-join your template to add Label/section/subsection. Below is a revised, template-aware solution plus a jq script. Arrays become multiple rows with an index column. This keeps reads/writes unambiguous and scales to 1,200+ fields.

Background that matters

  • Pointer is your stable key. RFC-6901 defines how to address any JSON value and how to escape ~ as ~0 and / as ~1. Use it as pointerforjson. (datatracker.ietf.org)
  • Edits stay safe. If you later write back, use RFC-6902 JSON Patch ops against those pointers. (datatracker.ietf.org)
  • CLI and libs exist. jq can walk scalars and emit CSV; Python has python-json-pointer for spec-accurate pointers and ijson for streaming large files; Node has json2csv if you’re in JS. (jqlang.org)

Option A — jq (quick, zero code)

Produces rows with pointerforjson, property value, repeatable, array index, propertyname. Join this with your template on pointerforjson.

# jq manual: https://jqlang.org/manual/                      # paths, getpath, @csv
# RFC 6901: https://datatracker.ietf.org/doc/html/rfc6901    # escaping rules

jq -r '
  def esc: gsub("~";"~0") | gsub("/";"~1");           # RFC 6901 escape
  def ptr($p): "/" + ($p | map(tostring | esc) | join("/"));

  # header matching your schema
  ("Label,propertyname,repeatable,array/listindex,property value,section,subsection,pointerforjson"),

  # one row per scalar
  ( paths(scalars) as $p
    | [
        "",                                            # Label (filled by template join)
        (([$p[] | select(type=="string")] | last) // ""),  # propertyname (last object key in path)
        (any($p[]; type=="number")),                  # repeatable
        ((first($p[]|select(type=="number"))) // ""), # array/listindex
        (getpath($p)),                                 # property value
        "", "",                                        # section, subsection (from template)
        ptr($p)                                        # pointerforjson
      ] | @csv
  )
' input.json > raw_extract.csv

jq’s paths(scalars) and getpath are stable primitives; @csv writes valid CSV. (jqlang.org)
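
For example, with a hypothetical input of {"client":{"addresses":[{"city":"Pune"}]}}, the script above would emit one data row roughly like this (Label/section/subsection stay empty until the template join):

"","city",true,0,"Pune","","","/client/addresses/0/city"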


Option B — Python (template-aware, spec-accurate, extendable)

Reads JSON, emits the exact CSV you asked for, and inlines a left-join with your template (keyed by pointer), so ambiguous keys like desc map to your canonical Label/section/subsection.

#!/usr/bin/env python3
# JSON→CSV with RFC 6901 pointers + template join
# RFC 6901: https://datatracker.ietf.org/doc/html/rfc6901
# python-json-pointer docs: https://python-json-pointer.readthedocs.io/
# ijson (streaming, optional): https://pypi.org/project/ijson/

import json, csv, sys
from typing import Any, List, Dict

def esc_token(s: str) -> str:               # RFC 6901 escaping
    return s.replace("~", "~0").replace("/", "~1")

def walk(doc: Any, base: List[str] = []):
    if isinstance(doc, dict):
        for k, v in doc.items():
            yield from walk(v, base + [k])
    elif isinstance(doc, list):
        for i, v in enumerate(doc):
            yield from walk(v, base + [str(i)])
    else:
        yield base, doc                     # scalar

def load_template(path: str) -> Dict[str, Dict[str,str]]:
    """
    Template CSV must have at least: pointer,label,section,subsection
    pointer must be RFC 6901 (same as we emit here)
    """
    m = {}
    with open(path, newline="", encoding="utf-8") as f:
        r = csv.DictReader(f)
        for row in r:
            m[row["pointer"]] = row
    return m

def main(in_json: str, out_csv: str, template_csv: str | None):
    template = load_template(template_csv) if template_csv else {}

    with open(in_json, "r", encoding="utf-8") as f:
        doc = json.load(f)

    cols = ["Label","propertyname","repeatable","array/listindex",
            "property value","section","subsection","pointerforjson"]
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=cols)
        w.writeheader()

        for path, val in walk(doc):
            ptr = "/" + "/".join(esc_token(p) for p in path)
            idxs = [int(p) for p in path if p.isdigit()]
            propname = next((p for p in reversed(path) if not p.isdigit()), "")

            t = template.get(ptr, {})
            w.writerow({
                "Label": t.get("label",""),
                "propertyname": propname,
                "repeatable": bool(idxs),
                "array/listindex": idxs[0] if idxs else "",
                "property value": val,
                "section": t.get("section",""),
                "subsection": t.get("subsection",""),
                "pointerforjson": ptr
            })

if __name__ == "__main__":
    # Usage: python json_to_csv.py input.json out.csv template.csv
    in_json   = sys.argv[1]
    out_csv   = sys.argv[2] if len(sys.argv)>2 else "out.csv"
    template  = sys.argv[3] if len(sys.argv)>3 else None
    main(in_json, out_csv, template)

Why these choices:

  • pointerforjson follows RFC-6901, so every CSV row maps back to an exact JSON value. (datatracker.ietf.org)
  • If your files are huge, swap json.load for ijson to stream without loading everything into memory (see the streaming sketch after this list). (PyPI)
  • If you prefer a JS stack, json2csv provides streaming conversion; pre-walk to add a pointer field, then feed objects to the parser. Docs and repo are active. (juanjodiaz.github.io)
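
A minimal streaming sketch with ijson, as a drop-in for the json.load + walk step above; ijson's own prefixes are dot-separated (with item for array elements), so the RFC 6901 pointer is rebuilt by hand here:

# Streaming variant (sketch): yields (RFC 6901 pointer, value) for every scalar
# without loading the whole document. pip install ijson
import ijson

def esc(tok: str) -> str:                        # RFC 6901 escaping
    return tok.replace("~", "~0").replace("/", "~1")

def stream_scalars(path: str):
    stack = []                                   # frames: ["obj", key] or ["arr", index]
    with open(path, "rb") as f:
        for _prefix, event, value in ijson.parse(f):
            if event == "start_map":
                stack.append(["obj", None])
            elif event == "map_key":
                stack[-1][1] = value
            elif event == "start_array":
                stack.append(["arr", 0])
            elif event in ("end_map", "end_array"):
                stack.pop()
                if stack and stack[-1][0] == "arr":
                    stack[-1][1] += 1            # finished one array element
            elif event in ("string", "number", "integer", "double", "boolean", "null"):
                ptr = "/" + "/".join(esc(str(tok)) for _kind, tok in stack)
                yield ptr, value
                if stack and stack[-1][0] == "arr":
                    stack[-1][1] += 1            # finished one array element

# for ptr, val in stream_scalars("input.json"): ...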

Notes, tips, pitfalls

  • Arrays become multiple rows. repeatable=true and array/listindex carries the index. That matches your requested schema and keeps edits precise.
  • Template is the semantic layer. Keep pointer,label,section,subsection,aliases in your catalog. Your chatbot retrieves rows by label or alias, then writes back by pointer.
  • If you later write back, generate JSON Patch like [{op:'replace', path:pointer, value:new}] using the same pointers; this is a standardized format (a small sketch follows this list). (datatracker.ietf.org)
  • Alternative CLIs. Miller (mlr) can convert JSON↔CSV fast but you still need to compute pointer strings yourself; docs and GitHub issues show JSON/CSV behavior. (miller.readthedocs.io)
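
For completeness, the write-back sketch mentioned above, using the python-json-patch package (pip install jsonpatch); the path strings are exactly the pointerforjson values from the CSV:

# Sketch: apply a JSON Patch built from (pointer, new value) pairs.
import jsonpatch

doc = {"applicant": {"addresses": [{"city": "Pune"}]}}

patch = jsonpatch.JsonPatch([
    {"op": "replace", "path": "/applicant/addresses/0/city", "value": "Mumbai"},
])
updated = patch.apply(doc)                             # returns a new document by default
print(updated["applicant"]["addresses"][0]["city"])    # Mumbai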

Thank you for the help.

Now we have the data in the right format, and we want to store it in Redis instead of Chroma.

Hopefully this will let us store both sparse and dense vectors, which can support search with the BGE-M3 model that we use with Ollama.

Any other inputs or guidance would help so I can also go down that route.

Just to add, our data will have 1,000+ fields in nested forms with sections and subsections.

So we created a CSV file from the JSON, which has multiple nested details.

The CSV will have headers like

Question label, json property name, value for the property, json pointer
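
Roughly what I am planning on the Redis side (a sketch with redis-py against Redis Stack: one dense vector per row plus full-text on the label for the lexical side, since as far as I understand RediSearch has no native sparse-vector field; index name, field names, and DIM are placeholders):

# Sketch: index the CSV rows in Redis Stack with a dense vector field plus text/tag fields.
import numpy as np
import redis
from redis.commands.search.field import TagField, TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host="localhost", port=6379)

schema = (
    TextField("label"),                        # question label, full-text searchable
    TagField("section"),
    TagField("subsection"),
    TextField("value"),                        # the JSON value as text
    TagField("pointer"),                       # RFC 6901 pointer, exact-match key
    VectorField("dense", "HNSW", {"TYPE": "FLOAT32", "DIM": 1024, "DISTANCE_METRIC": "COSINE"}),
)
r.ft("form_fields").create_index(
    schema, definition=IndexDefinition(prefix=["field:"], index_type=IndexType.HASH)
)

vec = np.random.rand(1024).astype(np.float32)  # stand-in for the BGE-M3 dense embedding
r.hset("field:0001", mapping={
    "label": "Mailing address line 1", "section": "Contact", "subsection": "Address",
    "value": "221B Baker St", "pointer": "/applicant/addresses/0/line1",
    "dense": vec.tobytes(),
})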

I don’t know much about Redis, but does it work like this?

Thank you, @John6666. We also wanted to do the same POC with Milvus; any suggestions on how to search a document with 1,200 fields?

Can we use BM25 or sparse-vector retrieval,

and dense vectors to retain the semantic info?

We can use both dense and sparse; I wanted to know if there is anything specific I can use in Milvus for better performance.

I also want to understand something, as I am dealing with a CSV which is a conversion of JSON.

JSON to CSV.

So the CSV has a value column, which is the JSON value we have for each key.

So I came to understand that it is better to keep the value in the sparse vector, and when creating dense vectors, not to add the value but only the question related to the form with the corresponding JSON property.
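
To make it concrete, this is roughly the collection layout and hybrid search I am trying (a sketch assuming pymilvus 2.4+; collection and field names are mine, and the query vectors below are stand-ins):

# Sketch: one Milvus collection holding a dense and a sparse vector per form field,
# queried with hybrid_search + reciprocal rank fusion. Assumes pymilvus >= 2.4.
from pymilvus import AnnSearchRequest, DataType, MilvusClient, RRFRanker

client = MilvusClient(uri="http://localhost:19530")

schema = MilvusClient.create_schema(auto_id=True)
schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("label", DataType.VARCHAR, max_length=512)
schema.add_field("pointer", DataType.VARCHAR, max_length=512)
schema.add_field("dense", DataType.FLOAT_VECTOR, dim=1024)        # question + property name
schema.add_field("sparse", DataType.SPARSE_FLOAT_VECTOR)          # lexical weights incl. value

index_params = client.prepare_index_params()
index_params.add_index(field_name="dense", index_type="AUTOINDEX", metric_type="IP")
index_params.add_index(field_name="sparse", index_type="SPARSE_INVERTED_INDEX", metric_type="IP")
client.create_collection("form_fields", schema=schema, index_params=index_params)

query_dense = [0.0] * 1024                    # stand-in for the BGE-M3 dense query vector
query_sparse = {42: 0.83, 1007: 0.21}         # stand-in sparse vector: {token_id: weight}

dense_req = AnnSearchRequest(data=[query_dense], anns_field="dense",
                             param={"metric_type": "IP"}, limit=10)
sparse_req = AnnSearchRequest(data=[query_sparse], anns_field="sparse",
                              param={"metric_type": "IP"}, limit=10)
hits = client.hybrid_search("form_fields", reqs=[dense_req, sparse_req],
                            ranker=RRFRanker(), limit=5, output_fields=["label", "pointer"])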

Is that the best way to do it, or is there something better?

Sorry if I am asking repetitive questions.

For Milvus.

Thank you @John6666

One more thought: since the user queries in natural language, can I do this step?

I will keep the labels and property names as a separate collection and create vectors from them; we will not have any values here, it is just metadata.

Now I will hit the NLP step with the matched metadata vectors to convert the user input into a query.

In the next step I will run the updated query against the Milvus database; this way the user's natural-language query gets converted into a metadata-specific one.

It is a kind of query transformation. Is it good, or do you have any suggestions?
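
As a rough sketch of that step (the metadata collection holds only labels and property names; embed() and the client setup are assumed to exist elsewhere, and the rewrite format is just an example):

# Sketch of the query-transformation step: resolve the user's wording against the
# metadata-only collection first, then query the main collection with canonical terms.
def transform_query(client, embed, user_query: str) -> str:
    hits = client.search(
        collection_name="field_metadata",      # labels + json property names, no values
        data=[embed(user_query)],
        limit=3,
        output_fields=["label", "json_property"],
    )
    matched = [hit["entity"] for hit in hits[0]]
    terms = ", ".join(f'{m["label"]} ({m["json_property"]})' for m in matched)
    return f"{user_query} [target fields: {terms}]"

# rewritten = transform_query(client, embed, "get addresses for client abc")
# then run `rewritten` (or its embedding) against the main form_fields collection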

Seems okay?

Thank you @John6666

I was able to build it, but I have one question about whether we can do better.

For example, I give a prompt like the ones below:

Get addresses for client abc, or

Get account number details for this name aaa

Now, as I am rewriting the user query with the property and question label, the code will look for the account number property, since it matches the schema.

But the client name and the person's name will not be in the schema, as they are actual values.

How can the LLM know that it should filter the results by this value, or can we do it better?

As it stands, it did not give me the right result.

Possible causes and fixes might be something like this?