Records Management Roadmap - AI for Documents Content and Records Management

The Journey Begins - Records Management System

Paper has long been the key way of storing documents. All types of documents have existed, and today we generate documents in electronics applications as well like messaging, emails, chats, and others. Although the documents or communications exist in paper and media, the media forms are dealt with in different ways, so they may be available to the end users, but the information is nonexistent in a cumulative format.

The Challenge for Records Management

The challenge with records management is the combination of paper plus electronic documents, even if they do exist in soft media, the various formats make it extremely difficult to manage them with any one application. The federal and other governments have a broad mix of document types but there is no system to manage these documents consistently, in a systematic way over long period of times.

  • Physical paper files such as office memos, contracts, marketing materials, Standard Operating Procedure (SOP) documents, reports, or plans
  • Electronic messages such as email content, their media attachments or files, or instant short messages
  • Content on the website, as well as eBooks, white papers, or important multimedia files on flash drives, desktops, servers, or in document management systems
  • Confidential or sensitive information in the organization’s various databases.

The White House Directive

In 2019, White House OMB (Office of Management and Budgeting) issued a directive to convert all types of documents that are generally provided to NARA (National Authority of Records Administration) in paper form. All government agencies’ documents have to be provided to NARA in electronic form (with standardized associated metadata) by the end of 2022. One of the key Paragraph states what the documents are.

Maintain permanent electronic records (e.g., emails, IMs, text messages, electronic documents, spreadsheets, presentations, images, maps, videos, blogs and other social media tools that generate communications) electronically, and if applicable, in an approved electronic records system.

The OMB-19-22 memo lists the types of documents that need to be stored in a way that take advantage of capabilities that are to be provided in one Document Management System that can collect, convert, transform, parse, analyze, and present in a way that ordinary users can search and review the documents. It is definitely a very complex task and there are no systems that can perform all these tasks or perform them in a secure, reliable, and distributed manner to a large audience in a collaborative manner.

Artificial Intelligence (AI) for Documents, Content and Records Management

Aretec saw the document space as an opportunity and created a platform called Context, which interfaces to Google’s AI-driven Document AI as the technology to perform all these functions. Google Document AI and the other AI functions like the Records AI and the ability to convert records or forms into a document format will help in organizing all records, paper document, and various content types into an electronic form for records management system. These document types can be collected, converted, transformed, parsed, analyzed, and presented as one document type.  This uses all the latest AI algorithms to convert a scan of the form into a record or an electronic document with a high degree of accuracy. Aretec also created a Human in the Loop function so that the Document or Record AI software learns from a human in the loop and its machine learning and AI capabilities improve with time.

An array of all the document types mentioned earlier can be taken as a pilot and various types of documents identified as candidates for creating an all-encompassing content management platform that helps in creating the compliance required by current and future needs of all Federal, State, County or City governments to serve their communities. Context AI is a system that provides all the capabilities, and Aretec is willing to work with any organization to provide them a demonstration of these capabilities. 

It will take a group of people or experts and teams to prepare the plans, deploy the infrastructure, and eventually get to the point where you can start to get some traction with the record management project. This effort may go on for a year or two before any tangible progress will show. This team is typically look like below.

Luckily there is a ray of light in the tunnel. Aretec looked at the problem and while we were all fighting with Covid for the past two years, and hired a large team that set out to take Google’s AI technology for documents and applied it to Content and Records Management. We developed the Context content platform that takes almost all the Google technology infrastructure and software AI assets and adds custom code to deliver a fully Self Serviced Context Platform. The plan was to build Context Self-Service to solve customer experience issues, and provide an easy-to-use interface for the citizen analyst.

Digitalization Process in Records Management System

AI for Documents

The process first step is to use a paper Digitalization process that takes all the paper forms and documents and scans them into a large repository. This technology area of scanning has made a lot of progress in the past few years. Although you can scan any form or document with your smartphone or any tabletop scanner, that is only applicable to individuals. Government agencies whether they are federal, state, county or city governments contains thousands or millions of document forms. The only caveat is to use someone who can deliver on scanning reliably. Aretec partners with the industry leading digitization providers with the equipment and expertise to deliver document and forms in a reliable manner.

Once the paper records or document or record forms are scanned, they can be loaded into Context with a simple drag and drop. Context also has an ever-growing list of pre-built connectors to whatever platform or system you use to store your records. Customers can upload thousands of documents or content in one go. Once the documents are loaded, Context takes over and it starts converting and indexing them. A few documents are done first with a human in the loop so that the Context AI gets trained and as the confidence improves, the number of documents increases substantially.

Document Artificial Intelligence Function for Records Management

As the Document AI scans the documents, it also creates all the ‘key’ and ‘value’ pairs that simple OCR technology cannot do.  This Metadata flows into a Metadata Repository Automatically with a human in the loop.  This Metadata Repository becomes the Data Catalog for the government agency, and they can then explore their data in a meaningful way. The documents become part of the cloud data with meaningful Metadata that is immediately available to the group of users who would like to review, modify, or change definition based on their unique needs. This Metadata can be made available to them in Visual BI dashboards that make it easier to handle the data. The Metadata is in plain English in easily understood terms and First Name appears as First Name and not Fname, FN, or all the other variations that database tables exhibit.

When the AI training for a form or record is completed, and accurate data conversion is successful, mass data ingestion and mass conversion will start, and all the records or forms are converted into a database schema. The user does not have to worry about and all of the data is stored in a Google Bigquery database which is a typical Big Data implementation with superior performance and continuous information delivery on a 24x7x365 timeline.

Support for Various Document Types

Context App has added the support of multiple types of documents in their application. These documents include:

  • Emails
  • IMs
  • text messages
  • electronic documents
  • spreadsheets
  • presentations
  • images
  • maps
  • videos
  • blogs
  • social media tools that generate communications

Any of the above content types may be present in the common document formats like .doc, .xls, .ppt, png, jpeg, mpu, etc. Context can extract any of the document type and then parse and store them in a Big Data like Google Big Query and the data can be analyzed using Context tools or dashboards or by using any of the customized data models that will apply to the documents for the government agency.

As the documents are indexed and transferred into the Big query database, Context can also apply the full assortment of Data Quality techniques to clean any special characters, any incorrect addresses, names, and other fields by creating a Data Quality Framework for this agency meeting and exceeding their data quality requirements. Context offers other features like the auto-redaction of information based on certain business rules, and built-in Data Loss Prevention (DLP). As the data accumulates, Context can provide you a dashboard that can bring your organizational structure into the Context space. It means that Context can take your organization, departments, offices, subdepartments, suboffices, and put them in a list or better yest sync with your LDAP (Lightweight Directory Access Protocol) servers for a content management system that aligns to your organizational structure.

It is also possible to use all the teams assigned to this project to be in the LDAP with the individual name, position, reporting to, reported by, and other parameters. All of the information contains full document management requirement that are sought by the agencies with full auditability. It means that all documents across the agency can be stored by the agency structure and there is full accountability built into the system. Context can then create Dashboards that report on the records or forms by all the distribution attributes mentioned above.