FileNet to Nuxeo: Lessons from Billion Document Migrations

Ethan Steiner
Published in Nuxeo Open Kitchen · Sep 13, 2022 · 5 min read

Legacy ECM Migration to Cloud Native Nuxeo

NOTICE: These lessons apply when migrating from any system into Nuxeo

For the past three years, I’ve been working on migrating several billion documents from IBM FileNet P8 systems into Nuxeo. My work has gotten a lot more interesting since both of the customers I’ve been working with decided to add Nuxeo’s DAM capabilities shortly after their migrations were underway. This blog describes some of the lessons learned along the way. It will be useful if you are contemplating a migration from a legacy ECM application into a more modern, cloud-first architecture like Nuxeo.

Nuxeo scales to billions of documents within a single repository, making it a wise choice for large-volume ECM users who want to leverage the many advantages that come from having all documents in one repo. Shortly after Nuxeo announced its intention to complete an 11-billion-document benchmark in late 2019, I began working to migrate 1.4 billion documents from a 2003 FileNet P8 ECM into Nuxeo. Once that job was underway, I started working with a different client who had the same need: migrating 1.6 billion documents from FileNet into Nuxeo.

This blog assumes you already have some basic knowledge about Nuxeo and data-modeling in Nuxeo.

Lesson 1: Build a fast importer for metadata migration

To minimize overall system downtime, the speed at which Nuxeo can import data is critical. With the addition of Nuxeo Stream, we were able to leverage Kafka and build a robust CSV import tool capable of importing 120+ million documents into Nuxeo per day. The baseline of this importer is published in the Nuxeo Marketplace, and it can be customized in many ways.

Client A had strict business rules that required data validation and importing the documents across 90+ business domains inside pre-existing folder structures, as well as importing proxies and creating initial versions of documents upon import. This required running scripts within Nuxeo to create folders pre-import as well as create versions and proxies post-import.
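As a rough illustration of that folder pre-creation step, here is a minimal sketch that posts Folder documents to the Nuxeo REST API before the bulk import runs. The server URL, credentials, parent path, and domain names are all hypothetical placeholders; a real run would be driven from the same extracts and would include error handling and retries.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class FolderPreCreator {
    // Hypothetical endpoint and credentials; adjust to your environment.
    private static final String NUXEO_API = "https://nuxeo.example.com/nuxeo/api/v1";
    private static final String AUTH = Base64.getEncoder()
            .encodeToString("Administrator:secret".getBytes());

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // One folder per business domain, created under an existing parent path.
        String[] domains = {"legal", "claims", "underwriting"}; // illustrative names
        for (String domain : domains) {
            String body = """
                {"entity-type": "document", "type": "Folder",
                 "name": "%s", "properties": {"dc:title": "%s"}}
                """.formatted(domain, domain);
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(NUXEO_API + "/path/default-domain/workspaces/import-root"))
                    .header("Content-Type", "application/json")
                    .header("Authorization", "Basic " + AUTH)
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(domain + " -> HTTP " + response.statusCode());
        }
    }
}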

Client B imported all documents into one folder for a flat structure. There were no versions and no proxies. It turns out that putting a billion-plus documents into one folder can lead to other problems that need to be addressed after the import is complete. More on this later…

Both clients leveraged the Bulk CSV Importer to import documents fast, so their internal teams did not lose time when switching from FileNet to Nuxeo. For example, the legal team had about 50 million documents. All documents were copied and extracted to CSV files in the background, during normal working hours. One evening, we migrated all the documents into Nuxeo in about 8 hours. The next morning the team was able to validate documents in both systems and perform new work exclusively in Nuxeo. In one day, Nuxeo became the source of truth. This process repeated, team by team, until all 1.4 billion documents were migrated.

Lesson 2: Prepare the data before it is imported

Legacy systems typically have some “legacy” data… Adding a new ECM system is a great opportunity to change the content model or modify legacy business rules to fit new needs. It’s also a great time to clean up bad data that users may have added mistakenly or improperly. These changes are best addressed with a “transformation” step applied to the data before it is imported.

If you can extract from your legacy system into CSV, there are many open source libraries written in your language of choice for transforming CSV files (e.g., Pandas, Apache Commons CSV, csv.js), with the proper data validation built into the transformation itself.

In my experience, the library used to transform the data should be the same one used to import the data into Nuxeo. We eventually chose the Apache Commons CSV Java library, as it worked for both projects. You could easily decide to use a JSON library instead, or better yet Avro, which is what Nuxeo Stream uses internally.
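To make that transformation step concrete, here is a minimal sketch using Apache Commons CSV. The legacy column names (DocumentId, DocumentTitle, DateCreated, ContentPath) and the target columns are hypothetical; the real mapping depends on your FileNet export and on the layout your import tool expects, and the validation rules will be specific to your business domains.

import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class LegacyCsvTransformer {

    public static void main(String[] args) throws Exception {
        Path in = Path.of("filenet-export.csv");  // extracted from the legacy system
        Path out = Path.of("nuxeo-import.csv");   // consumed by the import tooling

        try (Reader reader = Files.newBufferedReader(in);
             Writer writer = Files.newBufferedWriter(out);
             CSVParser parser = CSVFormat.DEFAULT.builder()
                     .setHeader().setSkipHeaderRecord(true).build().parse(reader);
             CSVPrinter printer = new CSVPrinter(writer, CSVFormat.DEFAULT.builder()
                     .setHeader("name", "type", "dc:title", "dc:created", "file:content")
                     .build())) {

            for (CSVRecord record : parser) {
                // Reject rows that fail basic validation instead of importing bad data.
                String docId = record.get("DocumentId").trim();
                String title = record.get("DocumentTitle").trim();
                if (docId.isEmpty() || title.isEmpty()) {
                    continue; // in practice, write these to a rejects file for review
                }
                // Map legacy columns onto the new content model.
                printer.printRecord(docId, "File", title,
                        normalizeDate(record.get("DateCreated")),
                        record.get("ContentPath"));
            }
        }
    }

    private static String normalizeDate(String legacyDate) {
        // Placeholder: convert whatever format the legacy export produces into ISO-8601.
        return legacyDate.trim();
    }
}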

Lesson 3: Move the binary files before moving the metadata

Moving terabytes or petabytes of data is time consuming. There’s no getting around it. Luckily, binary files are stored separately from the metadata, with a marker inside the metadata pointing to each binary’s location. This allows the binary files to be copied to their new location before the documents are loaded into Nuxeo.

Nuxeo uses the hash key of the binary file to locate it within its storage location. If your legacy file system also uses a hash key, you can simply copy the binaries as-is and load them into the Nuxeo file system (typically AWS S3 in Nuxeo Cloud). For larger loads, the transfer may be faster with a physical device (e.g., AWS Snowball).

If your system does not use hash keys, it’s a good idea, for security purposes, to convert the files to hash keys as you extract them from your legacy system. Make sure to include the hash key marker inside the exported metadata record so it can be referenced during import.
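As a rough sketch of that idea, the snippet below computes a digest for one extracted binary and stages it under a digest-based path before upload to the new binary store. The MD5 algorithm and the two-character prefix layout are assumptions; check which digest and directory layout your target Nuxeo binary store is actually configured to use, and carry the resulting digest through to the metadata record.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class BinaryHasher {

    public static void main(String[] args) throws Exception {
        Path source = Path.of("/mnt/filenet-export/binaries/doc-0001.pdf"); // legacy binary
        Path targetRoot = Path.of("/mnt/staging-binaries");                 // staging area before S3 upload

        String digest = digest(source, "MD5"); // assumed algorithm; match your binary store config
        // Stores keyed by digest typically shard files by a prefix of the digest.
        Path target = targetRoot.resolve(digest.substring(0, 2)).resolve(digest);
        Files.createDirectories(target.getParent());
        Files.copy(source, target);

        // Keep the digest in the exported metadata row so the import can reference the blob.
        System.out.println(source + " -> " + digest);
    }

    private static String digest(Path file, String algorithm) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }
}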

Lesson 4: Be prepared for anything (to be continued…)

As I mentioned at the beginning of this post, both of my ECM clients, each with over 1 billion documents currently in production, wanted to leverage some of Nuxeo’s excellent DAM capabilities.

Client A wanted advanced workflow tooling to leverage Nuxeo’s robust reporting and governance capabilities. Client B wanted automated publishing to an external source at scale, with the ability to process 1,000 rendition publications per second. I’ll dive deeper into both of these use cases in a future post, as they are not related to migration.

In conclusion, Nuxeo offers a robust tech stack that allows cloud-native management of multiple billions of documents. This makes it perfect for large ECM implementations that want to take advantage of the search and reporting benefits that come with a single repository. In a future post, I’ll dive deep into how Nuxeo’s highly adaptable, cloud-first architecture makes most future business needs possible, even if you begin with a billion-plus migration from your legacy system.
