Which chunker to use for code-based data

I am chunking and tokenizing this dataset: greengerong/leetcode

I am using tiktoken with a chunk size of 512 and the cl100k_base encoding. For embedding generation I am using my own model, fine-tuned on this dataset.

Is there a better way to chunk this dataset for indexing in ChromaDB? I have read in a few places to use AST-based chunking.

Please share sources for your suggestions.


AST-Based Chunking seems to be recommended…?


When dealing with code-based datasets like the one you are working with from https://huggingface.co/datasets/greengerong/leetcode, the choice of chunking method depends on your specific requirements, such as preserving code structure, maintaining semantic meaning, and ensuring efficient indexing in ChromaDB. AST-based chunking can be particularly advantageous for code data because it captures the hierarchical and structural relationships in source code, which might be lost with token-level chunking.

Suggestions for Chunking Code Data:

  1. AST-Based Chunking:

    • AST (Abstract Syntax Tree) chunking can help maintain the semantic relationships in code by chunking based on the tree structure of the code. This method is particularly useful for tasks that require understanding code structure, such as code summarization, change detection, and retrieval.
    • According to Source [1], HyperASTs are a novel approach to storing ASTs of multiple file versions, which can be beneficial for chunking code data while preserving structural information. This approach can help identify and chunk code segments that are semantically related, even across different versions of the code.
  2. Hybrid Approaches:

    • Combining AST-based chunking with tokenization methods like BPE (Byte Pair Encoding) can be effective. Source [2] discusses how recent pre-trained language models such as CodeBERT and TreeBERT leverage both token embeddings and AST-based structural information for tasks like code search and code completion. This hybrid approach can help maintain both the fine-grained token-level details and the broader structural context of the code.
  3. Token-Level Chunking with Advanced Tokenization:

    • If you prefer token-level chunking, consider using more advanced tokenization methods like BPE, which is widely adopted in code-centric language models such as CodeBERT and CuBERT (as discussed in Source [2]). These models use BPE to reduce vocabulary size while maintaining the ability to capture meaningful code segments.
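To make option 1 concrete, here is a minimal sketch of AST-based chunking using only Python's standard-library `ast` module. It is an illustration under simplifying assumptions: it handles Python source only (the leetcode dataset also contains other languages, which would need their own parsers), and it chunks at the granularity of top-level functions and classes. The function name `ast_chunks` and the sample snippet are hypothetical.

```python
import ast

def ast_chunks(source: str) -> list[str]:
    """Split Python source into top-level function/class chunks.

    Falls back to returning the whole source if it does not parse
    or contains no top-level definitions.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return [source]
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast.get_source_segment recovers the exact source text of a node
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks or [source]

code = '''
def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i

class Solution:
    def solve(self):
        return two_sum([2, 7], 9)
'''

for chunk in ast_chunks(code):
    print(chunk.splitlines()[0])  # prints the first line of each chunk
```

Each chunk is a complete, syntactically whole unit (one function or one class), which is exactly the structural alignment token-window chunking cannot guarantee. A production version would recurse into classes, attach docstrings/comments, and merge tiny chunks.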

Why AST-Based Chunking Might Be Better:

  • Preservation of Structure: AST-based chunking helps preserve the hierarchical structure of the code, which is critical for understanding code semantics.
  • Semantic Awareness: Unlike token-level chunking, AST-based chunking ensures that chunks are semantically meaningful and aligned with code constructs like functions, loops, and conditionals.
  • Improved Retrieval: For applications like code search and retrieval, AST-based chunks can lead to more accurate results because they capture the context and relationships between code elements.

Sources:

  • Source [1] discusses AST-based chunking approaches, like HyperASTs, which can help you represent multiple versions of code and chunk semantically related segments.
  • Source [2] highlights the effectiveness of combining token embeddings with AST-based structural information for tasks like code summarization and code completion.

In summary, if your goal is to preserve code structure and semantic meaning for downstream tasks like code search, retrieval, or summarization, AST-based chunking is a better approach. You can adapt the HyperAST approach mentioned in Source [1] or integrate AST-based chunking with tokenization methods like BPE for a hybrid solution.

If you’re constrained by time or resources, tiktoken is a reasonable starting point, but consider gradually integrating AST-based chunking for better performance on code-aware tasks.
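If you do stay with the tiktoken baseline for now, one common refinement is to overlap consecutive token windows so a code construct cut at a chunk boundary still appears with context in the neighboring chunk. The sketch below is generic over any encode/decode pair; a toy whitespace tokenizer stands in so the example is self-contained, and the overlap value is an arbitrary illustration. With tiktoken you would instead pass `enc.encode` / `enc.decode` from `enc = tiktoken.get_encoding("cl100k_base")`.

```python
def token_window_chunks(text, encode, decode, chunk_size=512, overlap=64):
    """Fixed-size token windows with `overlap` tokens shared between
    consecutive windows, to soften arbitrary cut points."""
    tokens = encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(decode(window))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end
    return chunks

# Toy whitespace tokenizer standing in for tiktoken's cl100k_base encoding.
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)

text = " ".join(f"tok{i}" for i in range(10))
print(token_window_chunks(text, encode, decode, chunk_size=4, overlap=1))
# → ['tok0 tok1 tok2 tok3', 'tok3 tok4 tok5 tok6', 'tok6 tok7 tok8 tok9']
```

Note that each chunk starts on the last token of the previous one, so every token is covered and boundary context is duplicated rather than lost.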
