Adding metadata columns to a .parquet file from .xlsx files

I just created a data set containing extracted text from the JFK Files.

Each release had an accompanying .xlsx file with a bunch of metadata including: Record Num, NARA Release Date, Formerly Withheld, Doc Date, Doc Type, File Num, To Name, From Name, Title, Num Pages, Originator, Record Series, Review Date, Comments, Pages Released

Record Num - Record Number; sometimes this is the filename less the extension, but sometimes not.
NARA Release Date - Date archives.gov released the file
Formerly Withheld - Reason for withholding the document
Doc Date - Original document date
Doc Type - Paper, audio tape, etc.
File Num - File Number
To Name - Who the document was addressed to
From Name - Who sent the document
Title - Document title
Num Pages - Total number of pages in the document
Originator - Where the document came from, often CIA or FBI
Record Series - In this case they may all be 'JFK'
Review Date - Date the document was reviewed for release
Comments - Comments
Pages Released - Number of pages released

It seems like the Parquet format is ideal for attaching all this metadata to the content of the files, and while this initially looks like a straightforward task, it's a bit more challenging because:

  1. The same record number can refer to multiple files, and a single file can have multiple record numbers (see the sketch after this list).

  2. Sometimes the record number is the file name (less the extension), sometimes it's a "docid" (whatever that is), and sometimes the files follow no standard naming convention at all.

  3. Each release has a different format for the .xlsx files.

  4. The 2025 release seems to have standardized on the record number as the file name, and no .xlsx is provided. We only have filenames and NARA Release Date. But many (maybe even all?) of these files were previously released (often with more redactions, blank or missing pages) and have metadata in the .xlsx files from previous releases.

  5. Many of the same files appear again and again in subsequent releases, usually with additional pages and/or fewer redactions.

  6. The 2017-2018 release is by far the largest and many files appear twice within the same release.
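
For point 1, here is a minimal sketch of one way to represent the many-to-many mapping: keep one row per (file, record) pair and aggregate record numbers into a list per file. All values below are hypothetical placeholders.

    import pandas as pd

    # Hypothetical long-format mapping: one row per (file, record) pair.
    mapping = pd.DataFrame({
        'File Name': ['a.pdf', 'a.pdf', 'b.pdf'],
        'Record Num': ['104-00001-00001', '104-00001-00002', '104-00001-00002'],
    })

    # One row per file, with every associated record number kept as a list.
    per_file = mapping.groupby('File Name')['Record Num'].agg(list).reset_index()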

This may be a trivial task for an experienced data scientist, but it's challenging for me, so I'm reaching out to see if anyone can suggest the best approach.


The xlsx format is often difficult to handle with software, so it would be better to convert it to csv (using Python or some kind of GUI tool) and then read it with the datasets library…

Incidentally, it will be converted to parquet format when it is read.

The text is small, so size is not really an issue, but I think it would be better to copy it for multiple references. Is there a good way to convert complex xlsx files…?
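
A minimal sketch of that workflow, assuming pandas (with openpyxl) is installed and using one of the release files named later in this thread:

    import pandas as pd
    from datasets import load_dataset

    # Convert one metadata sheet to CSV.
    xlsx = "national-archives-jfk-assassination-records-2021-release.xlsx"
    pd.read_excel(xlsx).to_csv("2021-release.csv", index=False)

    # The datasets library parses the CSV into an Arrow-backed dataset;
    # pushing it to the Hub stores it as Parquet.
    ds = load_dataset("csv", data_files="2021-release.csv")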


Hi again @John6666, converting to .csv is no problem using Python or just saving to CSV from Excel - there are only four files. They are large but not super massive. The problem arises from a few different issues: inconsistent spreadsheet formats, record numbers that refer to multiple files but also single files that have multiple record numbers, duplicate file listings in the spreadsheets (probably due to the record number issue), and some bad data:

34 files in the 2022 release and 5 files in the 2021 release tie to multiple record numbers listed in the .xlsx files, which have more rows than unique file names (13,263 and 1,491 respectively). The 2017-2018 release .xlsx file contains 6 bad links, and the 2017-2018 release website lists two files in the /additional path that are not included in the .xlsx. With two exceptions, all .md files match up to .pdf files; the two exceptions match to .mp3 files.
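
A quick way to surface the one-file-to-many-records cases (a sketch, assuming one sheet is already loaded into a DataFrame df with File Name and Record Num columns):

    # Count distinct record numbers per file; anything above 1 is a multi-record file.
    multi = df.groupby('File Name')['Record Num'].nunique()
    print(multi[multi > 1])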

national-archives-jfk-assassination-records-2017-2018-release.xlsx (17 columns, 54,636 data rows, 1 header)

Columns: File Name, Record Num, NARA Release Date, Formerly Withheld, Agency, Doc Date, Doc Type, File Num, To Name, From Name, Title, Num Pages, Originator, Record Series, Review Date, Comments, Pages Released

national-archives-jfk-assassination-records-2021-release.xlsx (16 columns, 1,491 data rows, 1 header)

Columns: Record Number, File Title, NARA Release Date, Formerly Withheld, Document Date, Document Type, File Number., To, From, Title, Original Document Pages, Originator, Record Series, Review Date, Comments, Document Pages in PDF

File Title is the same as File Name
Document Pages in PDF is the same as Pages Released
Agency is missing (often the same as "Originator" but sometimes different).

national-archives-jfk-assassination-records-2022-release.xlsx (16 columns, 13,264 data rows, 1 header)

Columns: File Name, Record Num, NARA Release Date, Formerly Withheld, Doc Date, Doc Type, File Num, To Name, From Name, Title, Num Pages, Originator, Record Series, Review Date, Comments, Pages Released

Format looks the same as the first file but is missing "Agency".

national-archives-jfk-assassination-records-2023-release.xlsx (17 columns, 2693 data rows, 1 header)

Columns: File Name, Record Num, NARA Release Date, Formerly Withheld, Agency, Doc Date, Doc Type, File Num, To Name, From Name, Title, Num Pages, Originator, Record Series, Review Date, Comments, Pages Released

Back to the first file's format; the Agency column is back, but it's blank for this release.

2025-release.xlsx (2 columns, 2,566 data rows, 1 header)

Columns: Record Number, NARA Release Date

There was no .xlsx provided for 2025; this is the only information available from the website, which mirrors the .xlsx data for previous years.

For an experienced developer I'm sure this is easy, but I'm not sure how to go about it because of all the inconsistencies and discrepancies. It's not a simple 1:1 mapping. But having all this metadata in the Parquet file, standardized as well as possible, would definitely make for a much better data set.

It would make sense to standardize on the column headings used in 3 out of the 4 files and to leave the columns blank where data wasn’t provided.

If anyone can offer some advice on the best way to do this without introducing a bunch of data errors it would be much appreciated.


I'm not a data scientist, so this is just a general observation, but when dealing with text-based data, it's easier for the computer to process if you align everything to the larger schema.
Regardless of whether individual data points exist or not, it's best to add all possible columns to all data.

And for complete irregularities like the mp3 part, it's faster and more reliable to handle them manually. Just because you have the tools doesn't mean you have to use them for everything; no one has decided that you can't do part of it by hand.
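
A sketch of that manual handling: a small hand-maintained override table, where the file names below are placeholders for the two real .mp3 exceptions.

    # Hypothetical overrides: extracted-text file -> actual source file.
    manual_overrides = {
        'example-tape-1.md': 'example-tape-1.mp3',
        'example-tape-2.md': 'example-tape-2.mp3',
    }
    df['File Name'] = df['File Name'].replace(manual_overrides)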


by HuggingChat:

To standardize the inconsistent spreadsheet data from the JFK assassination records releases, follow this structured approach:

Step-by-Step Solution

  1. Read and Load Data

    • Use Python's pandas library to read each Excel file into a DataFrame.
    import pandas as pd

    # One sheet per release. 2025 had no official .xlsx, so that entry
    # stands for a sheet reconstructed from the website listing (see above).
    files = ['2017-2018.xlsx', '2021.xlsx', '2022.xlsx', '2023.xlsx', '2025.xlsx']
    # pd.read_excel needs the openpyxl engine installed for .xlsx input.
    dfs = [pd.read_excel(f) for f in files]
    
  2. Standardize Column Names

    • Create a mapping dictionary that renames each release's variant headers to one standard schema - here the 2017-2018 headings, matching the suggestion above to standardize on the columns used in 3 of the 4 files.
    # Renames the 2021 release's variant headers onto the 2017-2018 schema;
    # headers already in that schema are left untouched by rename().
    column_mapping = {
        'Record Number': 'Record Num',
        'File Title': 'File Name',
        'Document Date': 'Doc Date',
        'Document Type': 'Doc Type',
        'File Number': 'File Num',
        'To': 'To Name',
        'From': 'From Name',
        'Original Document Pages': 'Num Pages',
        'Document Pages in PDF': 'Pages Released',
    }
    
    • Normalize the raw headers, then apply the mapping to each DataFrame.
    for df in dfs:
        # Strip whitespace and stray trailing periods (e.g. 'File Number.').
        df.columns = [str(c).strip().rstrip('.') for c in df.columns]
        df.rename(columns=column_mapping, inplace=True)
    
  3. Handle Missing Columns

    • Ensure all DataFrames share the same columns, adding missing ones as empty values (per the "leave blank" suggestion above).
    # Union of every column seen across the releases.
    all_columns = sorted(set().union(*(df.columns for df in dfs)))

    for df in dfs:
        # Add any column this release lacks, filled with missing values.
        for col in all_columns:
            if col not in df.columns:
                df[col] = pd.NA
    
  4. Identify and Remove Duplicates

    • Check for duplicates based on key columns like 'Record Num' and 'File Name'.
    for df in dfs:
        # After the renaming in step 2, the standard keys are 'Record Num' and 'File Name'.
        df.drop_duplicates(subset=['Record Num', 'File Name'], inplace=True)
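
    • Since later releases re-issue the same files with more pages and fewer redactions (points 5 and 6 in the original post), keeping only the newest row per file may fit better than a plain drop; a sketch, assuming 'NARA Release Date' parses as a date.
    combined = pd.concat(dfs, ignore_index=True)
    combined['NARA Release Date'] = pd.to_datetime(
        combined['NARA Release Date'], errors='coerce')
    # Sort oldest-to-newest, then keep the last (newest) row per file.
    latest = (combined.sort_values('NARA Release Date')
                      .drop_duplicates(subset=['File Name'], keep='last'))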
    
  5. Clean Data and Validate Paths

    • Validate that each listed file actually exists on disk, flagging rows for manual review rather than guessing corrections.
    import os

    # Hypothetical local folder holding the downloaded release files.
    DATA_DIR = 'data'

    for df in dfs:
        for index, row in df.iterrows():
            file_name = row['File Name']
            if pd.isna(file_name):
                continue  # 2025 rows carry only a record number
            file_path = os.path.join(DATA_DIR, str(file_name))
            if not os.path.exists(file_path):
                # Log the discrepancy; the handful of known exceptions
                # (e.g. the two .mp3 files) are best corrected by hand.
                print(f"Missing file: {file_path}")
    
  6. Concatenate DataFrames

    • Combine all DataFrames into a single standardized DataFrame.
    combined_df = pd.concat(dfs, ignore_index=True)
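
    • Because the same files recur across releases (point 5 in the original post), it may help to tag each frame with its release before the concat above; a sketch, with labels matching the file list in step 1.
    releases = ['2017-2018', '2021', '2022', '2023', '2025']
    for df, release in zip(dfs, releases):
        df['Release'] = release  # provenance column survives the concatenation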
    
  7. Save to Standardized CSV

    • Write the standardized data to a CSV file.
    combined_df.to_csv('standardized_jfk_records.csv', index=False)
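
    • Since the end goal is a Parquet file, the combined frame can also be written directly to Parquet; a sketch, assuming pyarrow (or fastparquet) is installed.
    combined_df.to_parquet('standardized_jfk_records.parquet', index=False)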
    

Considerations and Recommendations

  • Data Type Handling: Ensure consistent data types, especially for dates and numeric fields.
  • Testing and Validation: Regularly inspect DataFrames to verify data integrity post-processing.
  • Exception Handling: Implement logging for discrepancies and edge cases encountered.

This approach systematically addresses the inconsistencies and discrepancies in the data, ensuring a standardized and clean dataset is produced.