AIP Independence Test for US GPO
The United States Government Printing Office (or GPO for short) is tasked with producing, protecting, preserving, and distributing the nation's federal documents. To quote their website, "The core mission of Keeping America Informed, dated to 1813 when Congress determined to make information regarding the work of the three branches of Government available to all Americans. The U.S Government Printing Office (GPO) provides publishing and dissemination services for the official and authentic government publications to Congress, Federal agencies, Federal depository libraries, and the American public."
To further these goals in a digital era, GPO has gradually rolled out FDsys, an open archival information system (OAIS) designed as an update to the old GPO Access system. Because FDsys will hold many potentially critical and sensitive documents, it must meet several requirements: it needs to distribute data to the public easily and efficiently, and it must also be able to preserve its data even in the event of a catastrophic failure in the system. The CSULA AIP Independence project was created to test whether FDsys can perform this last function, even if the software itself should go offline.
A well-designed OAIS archives its data in such a way that the contents remain locatable and decipherable to humans even if the OAIS itself is not available. To facilitate this, all related data is stored in packages. While stored within the OAIS, these packages are called Archival Information Packages (or AIPs). An AIP can contain multiple files that make up one piece of data, or multiple renditions of the same data in different formats. It also includes metadata files that identify the data, as well as binding files that describe the package's relationship to other packages. The AIP structure helps ensure that related data always remains grouped, and that connections between files are integrated into each package.
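As a rough illustration of the package structure described above, the following Python sketch models an AIP as a small data structure. The class and field names (`Rendition`, `Package`, `related_package_ids`, and so on) are hypothetical; they are not taken from FDsys or from the OAIS reference model, and only mirror the three kinds of files mentioned here: content renditions, metadata files, and binding information.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rendition:
    """One content file: a single rendition of the packaged data."""
    path: str
    mimetype: str


@dataclass
class Package:
    """Hypothetical in-memory model of an archival package."""
    package_id: str
    # Content files, possibly the same data in several formats.
    renditions: List[Rendition] = field(default_factory=list)
    # Files that identify and describe the content.
    metadata_files: List[str] = field(default_factory=list)
    # Identifiers of other packages this one is bound to.
    related_package_ids: List[str] = field(default_factory=list)

    def is_self_describing(self) -> bool:
        """True if the package carries both content and metadata."""
        return bool(self.renditions) and bool(self.metadata_files)


if __name__ == "__main__":
    pkg = Package(
        package_id="pkg-001",
        renditions=[
            Rendition("content/report.pdf", "application/pdf"),
            Rendition("content/report.txt", "text/plain"),
        ],
        metadata_files=["mets.xml"],
        related_package_ids=["pkg-000"],
    )
    print(pkg.is_self_describing())
```

The point of the model is that everything needed to interpret the content travels with the package itself, rather than living in the system that created it.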
In the CSULA project, the team was supplied with a subset of the raw contents of FDsys: a collection of AIPs severed from any access to or knowledge of FDsys itself. By parsing internal XML documents while using only openly known metadata standards as a reference, the team would attempt to locate and interpret any package in the archive. Whether they succeeded or failed, they would provide feedback to GPO to help the ongoing development of FDsys. Hence, the CSULA project could be boiled down to answering a simple motivational question: Is FDsys AIP independent? That is, are its AIPs independent of the OAIS that created them?
To accomplish this goal, the CSULA team had to design an experiment that would answer this question. With the guidance of our GPO liaison, we ended up following this course of action:
- Research the relevant metadata standards. FDsys uses a version of the METS standard for its XML, augmented with two other standards named MODS and PREMIS. Understanding the intricacies of these standards would be essential to parsing the archive.
- Inspect the overall structure of the data. The data that GPO provided was quite massive at over half a terabyte, and the organization was obscure. Clearly, a brute force solution would not suffice here.
- Formulate an algorithm for converting the metadata. With knowledge of the XML standards above, the team was able to parse several AIPs by hand. However, the versions of the standards used by FDsys were not identical to the open standards we had researched. We needed to eliminate the points of divergence and consolidate many files so that the parsing could be automated.
- Write code to locate and convert the metadata. The new versions of the metadata still link to the old file locations, but they should now be in a form ready to be parsed by an external source.
- Ingest the data into a third-party repository. This is a proof-of-concept, to demonstrate that the data was parsed correctly. We settled on using Fedora Commons as our repository of choice. If the ingestion succeeds and the data is searchable and viewable in the new repository, then we can conclude that our efforts were successful.
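The locating and parsing steps in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's actual code: it walks a directory tree looking for METS files, then reads a title from embedded MODS metadata and the file references from the METS file section. The namespace URIs are the published ones for METS, MODS, and XLink, but the file name `mets.xml`, the sample document, and the function names `find_mets_files` and `summarize_mets` are assumptions made for the example. The conversion and Fedora ingestion steps are not shown.

```python
import os
import xml.etree.ElementTree as ET

# Published namespace URIs for the standards named above.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Invented sample document, used only to exercise the parser.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:mods="http://www.loc.gov/mods/v3"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:dmdSec ID="DMD1">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:titleInfo><mods:title>Example Report</mods:title></mods:titleInfo>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:fileSec>
    <mets:fileGrp USE="content">
      <mets:file ID="F1" MIMETYPE="application/pdf">
        <mets:FLocat LOCTYPE="URL" xlink:href="content/report.pdf"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def find_mets_files(root_dir, filename="mets.xml"):
    """Yield paths of files called `filename` under root_dir.

    The file name is an assumption; a real archive may use another one.
    """
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if filename in filenames:
            yield os.path.join(dirpath, filename)


def summarize_mets(xml_text):
    """Return the MODS title and the file references found in a METS document."""
    root = ET.fromstring(xml_text)

    # Descriptive metadata: first MODS title, if any.
    title_el = root.find(".//mods:titleInfo/mods:title", NS)
    title = None
    if title_el is not None and title_el.text:
        title = title_el.text.strip()

    # File section: one entry per mets:file, with its xlink:href location.
    href_attr = "{%s}href" % NS["xlink"]
    files = []
    for file_el in root.iterfind(".//mets:file", NS):
        locat = file_el.find("mets:FLocat", NS)
        files.append({
            "id": file_el.get("ID"),
            "mimetype": file_el.get("MIMETYPE"),
            "href": locat.get(href_attr) if locat is not None else None,
        })
    return {"title": title, "files": files}


if __name__ == "__main__":
    print(summarize_mets(SAMPLE))
```

Because the sketch relies only on the open standards and on paths recorded inside each package, it needs nothing from the originating repository, which is the property the experiment set out to test.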
The experiment proved successful; the new Fedora repository is accessible online (albeit dependent on a local client machine to operate). After we cleared a few hurdles, ingestion succeeded completely. The CSULA team concludes that FDsys fulfills the criteria that we tested: its data should still be recoverable even if the repository structure is lost. We can therefore answer our motivational question in the affirmative: FDsys is indeed AIP independent.
- Antonio Castillo
- Johnny Ng
- Aram Weintraub
- Tin Wong