Create custom parsers in GCP Document AI automatically.

I recently had an opportunity to investigate what tools Google Cloud Platform offers for automatic data extraction from documents. Google groups its solutions to this problem into a section called Document AI. I first looked into a tool named Form Parser. It is a general-purpose parser suited for all kinds of documents and requires no training to work. It generally does a good job extracting properties and tables from a document, but there is a clear issue with it: we send it a document and the task amounts to "tell me what you see". Even a human will perform better if we clearly specify what we are after in a given document.

It turns out Document AI has exactly the tool for this: the Custom Extractor. Used with its pretrained generative AI engine, it takes a set of labels, called a dataset, and tries to fill them in with data from the supplied documents. It "understands" the label names semantically and tries to find the right values in the document it is dealing with. The Custom Extractor is a pretrained ML tool and does not require any uptraining to work.

I wanted to be able to create extractors, together with their corresponding datasets, automatically. This turned out to be quite challenging, as the documentation on the GCP Document AI API is a bit scarce and the offered functionality is not broad either. I first investigated the v1 API; though it does have an endpoint for creating processors, it seemingly does not offer one for updating the processor dataset. The v1beta3 API is better equipped for this. One can create a custom extractor by sending the request:

  1. POST https://{endpoint}/v1beta3/projects/{project-id}/locations/{location}/processors
    Body:
{
  "displayName": "...",
  "type": "CUSTOM_EXTRACTION_PROCESSOR"
}
https://cloud.google.com/document-ai/docs/reference/rest/v1beta3/projects.locations.processors/create
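As a sketch, this first call can be made from Python using only the standard library. The helper names here are my own, the regional endpoint host (`{location}-documentai.googleapis.com`) is my assumption about how the `{endpoint}` placeholder is filled in, and the access token could come from e.g. `gcloud auth print-access-token`:

```python
import json
import urllib.request

def processor_create_request(project_id: str, location: str, display_name: str):
    """Build the URL and JSON body for creating a Custom Extractor (step 1)."""
    url = (f"https://{location}-documentai.googleapis.com/v1beta3/"
           f"projects/{project_id}/locations/{location}/processors")
    body = {"displayName": display_name, "type": "CUSTOM_EXTRACTION_PROCESSOR"}
    return url, body

def post_json(url: str, body: dict, token: str) -> dict:
    """POST a JSON body with a bearer token and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# url, body = processor_create_request("my-project", "eu", "my-extractor")
# created = post_json(url, body, token)  # response includes the processor "name"
```

The returned resource name contains the generated processor id, which the following calls need.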

Then we need to set up the processor dataset with a call to:

2. PATCH https://{endpoint}/v1beta3/projects/{project-id}/locations/{location}/processors/{processor-id}/dataset

Body:

{
  "name": "projects/{project-id}/locations/{location}/processors/{processor-id}/dataset",
  "state": "INITIALIZED",
  "unmanagedDatasetConfig": {},
  "spannerIndexingConfig": {}
}
https://cloud.google.com/document-ai/docs/reference/rest/v1beta3/projects.locations.processors/updateDataset

Sending empty objects is vital here; if we send null instead, the processor will not work. This endpoint does not produce an immediate result but instead starts a long-running operation, whose state we may track by calling the endpoint:

3. GET https://{endpoint}/v1beta3/projects/{project}/locations/{location}/operations/{operation-id}

https://cloud.google.com/document-ai/docs/reference/rest/v1beta3/projects.locations.operations/get
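Steps 2 and 3 can be sketched in the same style. The function names and the regional host are again my assumptions; the dataset body must be sent with the PATCH verb (e.g. `urllib.request.Request(..., method="PATCH")`), and note the deliberately empty objects in it:

```python
import json
import time
import urllib.request

def dataset_update_request(project_id: str, location: str, processor_id: str):
    """Build the URL and body for initialising the processor dataset (step 2)."""
    name = (f"projects/{project_id}/locations/{location}/"
            f"processors/{processor_id}/dataset")
    url = f"https://{location}-documentai.googleapis.com/v1beta3/{name}"
    body = {
        "name": name,
        "state": "INITIALIZED",
        "unmanagedDatasetConfig": {},  # must be {}, not null
        "spannerIndexingConfig": {},   # must be {}, not null
    }
    return url, body

def wait_for_operation(operation_name: str, location: str, token: str,
                       poll_seconds: float = 2.0) -> dict:
    """Poll the long-running operation (step 3) until it reports done."""
    url = f"https://{location}-documentai.googleapis.com/v1beta3/{operation_name}"
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})
    while True:
        with urllib.request.urlopen(req) as resp:
            op = json.loads(resp.read())
        if op.get("done"):
            return op
        time.sleep(poll_seconds)
```

The `operation_name` to poll is the `name` field of the response returned by the PATCH call.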

The final call we need to make sets up the actual labels we want to extract:

4. PATCH https://{endpoint}/v1beta3/projects/{project-id}/locations/{location}/processors/{processor-id}/dataset/datasetSchema

Body:

{
  "documentSchema": {
    "entityTypes": [{
      "name": "custom_extraction_document_type",
      "baseTypes": ["documents"],
      "properties": [
        {
          "name": "ShippingCompany",
          "occurenceType": "REQUIRED_ONCE",
          "valueType": "String"
        },
        ....
      ]
    }]
  }
}
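Finally, the schema body can be assembled from a plain list of labels. This builder is a sketch under the same assumptions as before (hypothetical helper name, assumed regional host); the entity type and base type values simply mirror the body shown above, and the PATCH verb applies here as well:

```python
def schema_update_request(project_id: str, location: str, processor_id: str,
                          labels: list):
    """Build the URL and body for setting the dataset schema (step 4).

    `labels` is a list of (name, occurrenceType, valueType) tuples,
    e.g. ("ShippingCompany", "REQUIRED_ONCE", "String").
    """
    name = (f"projects/{project_id}/locations/{location}/"
            f"processors/{processor_id}/dataset/datasetSchema")
    url = f"https://{location}-documentai.googleapis.com/v1beta3/{name}"
    body = {
        "documentSchema": {
            "entityTypes": [{
                "name": "custom_extraction_document_type",
                "baseTypes": ["documents"],
                "properties": [
                    {"name": n, "occurrenceType": occ, "valueType": vt}
                    for n, occ, vt in labels
                ],
            }]
        }
    }
    return url, body
```

Once this call succeeds, the extractor has its labels and can be used to process documents.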
