I just created a data set containing extracted text from the JFK Files.
Each release had an accompanying .xlsx file with a bunch of metadata including: Record Num, NARA Release Date, Formerly Withheld, Doc Date, Doc Type, File Num, To Name, From Name, Title, Num Pages, Originator, Record Series, Review Date, Comments, Pages Released
Record Num - Record Number; sometimes this is also the filename (less the extension), but not always.
NARA Release Date - Date archives(.)org released the file
Formerly Withheld - Reason for withholding the document
Doc Date - Original document date
Doc Type - Paper, audio tape, etc.
File Num - File Number
To Name - Who the document was addressed to
From Name - Who sent the document
Title - Document title
Num Pages - Total number of pages in the document
Originator - Where the document came from, often CIA or FBI
Record Series - In this case they may all be "JFK"
Review Date - Date the document was reviewed for release
Comments - Comments
Pages Released - Number of pages released
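
For concreteness, here's roughly the flat, one-row-per-record/file layout I'm picturing, written out as a pyarrow schema. The pyarrow types are just my guess at what would fit; the column names mirror the spreadsheet fields above plus the extracted text:

```python
import pyarrow as pa

# One row per (record number, source file) pair, with the extracted text inline.
# Types are guesses; dates are kept as strings to be safe and could be parsed later.
schema = pa.schema([
    ("record_num", pa.string()),         # Record Num
    ("filename", pa.string()),           # source file name
    ("nara_release_date", pa.string()),  # NARA Release Date
    ("formerly_withheld", pa.string()),
    ("doc_date", pa.string()),
    ("doc_type", pa.string()),
    ("file_num", pa.string()),
    ("to_name", pa.string()),
    ("from_name", pa.string()),
    ("title", pa.string()),
    ("num_pages", pa.int32()),
    ("originator", pa.string()),
    ("record_series", pa.string()),      # probably always "JFK"
    ("review_date", pa.string()),
    ("comments", pa.string()),
    ("pages_released", pa.int32()),
    ("text", pa.string()),               # the extracted text itself
])
```

(One row per record/file pair means the text gets duplicated when a file has several record numbers, but it keeps the table flat.)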
It seems like the parquet format is ideal for attaching all this metadata to the content of the files. While this initially looks like a straightforward task, it's a bit more challenging because (I've put a rough sketch of my current approach after the list):
- The same record number can refer to multiple files, and a single file can have multiple record numbers.
- Sometimes the record number is the file name (less the extension), sometimes it's a "dicid" (whatever that is), and sometimes the files follow no standard naming convention at all.
- Each release has a different format for the .xlsx files.
- 2025 seems to have standardized on the record number for the file name, and no .xlsx is provided, so we only have filenames and the NARA Release Date. But many (maybe even all?) of these files were previously released (often with more redactions, or with blank or missing pages) and have metadata in the .xlsx files from previous releases.
- Many of the same files appear again and again in subsequent releases, usually with additional pages and/or fewer redactions.
- The 2017-2018 release is by far the largest, and many files appear twice within the same release.
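Here's the very rough shape of what I've tried so far for the metadata side, assuming pandas (with openpyxl for the .xlsx files). The spreadsheet paths, the column mapping, and the `match_files_to_records` helper are all placeholders for the parts I haven't figured out, especially the filename/record-number matching:

```python
from pathlib import Path
import pandas as pd  # plus openpyxl for .xlsx reading

def normalize_columns(df: pd.DataFrame, release: str) -> pd.DataFrame:
    """Map one release's spreadsheet headers onto a single canonical set of names."""
    mapping = {
        "Record Num": "record_num",
        "NARA Release Date": "nara_release_date",
        "Num Pages": "num_pages",
        # ...one entry per column, and a different mapping per release's .xlsx layout
    }
    out = df.rename(columns=mapping)
    out["release"] = release  # remember which release each row came from
    return out

# Stack the per-release spreadsheets into one long metadata table.
releases = {"2017-2018": "meta_2017_2018.xlsx", "2022": "meta_2022.xlsx"}  # placeholder paths
metadata = pd.concat(
    [normalize_columns(pd.read_excel(path), release) for release, path in releases.items()],
    ignore_index=True,
)
metadata["nara_release_date"] = pd.to_datetime(metadata["nara_release_date"], errors="coerce")

def match_files_to_records(files: list[Path], metadata: pd.DataFrame) -> pd.DataFrame:
    """Placeholder: build a (record_num, filename) link table.

    Because the relationship is many-to-many, this stays its own table
    rather than being a column on either side.
    """
    raise NotImplementedError  # this is the part I'm unsure about

# When the same record shows up in several releases, keep the newest copy,
# which usually has more pages and fewer redactions.
latest = (
    metadata.sort_values("nara_release_date")
            .drop_duplicates(subset="record_num", keep="last")
)
```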
This may be a trivial task for an experienced data scientist, but it's challenging for me, so I'm reaching out to see if anyone can suggest the best approach.