How to redact data, count items, and calculate values in documents using Sensible

Updated on
March 29, 2024

With Sensible, you can extract data from documents in structured JSON format. Once you’ve extracted data with our developer platform, you can transform the extraction to add or remove document data or to conform to your desired schema. For example, you can transform the extracted data shown in the following image:

Extracted data field

The preceding image shows extracting a single dollar amount as a field (earnest_money) from a purchase contract. If you extract each currency amount as a field (earnest_money, seller_financing, and cash, respectively), you can sum the extracted fields using Sensible’s newly supported logic features, and get a total amount that’s missing from the source document:

Sum extracted data fields

Sensible supports such transformations with JsonLogic. With JsonLogic, you can apply boolean, logic, numeric, array, and string operations to your extracted data. For example, sum extracted numbers in a document with Reduce operations, redact strings with Replace operations, or use default text to populate a field if the extracted output is null.

The new Custom Computation method gives you access to JsonLogic and joins our existing built-in data schema transformation features such as Zip and Concatenate. Because JsonLogic isn’t a full programming language, it’s a small and safe way to introduce a lot of programming power into our existing domain-specific query language, SenseML. Sensible also extends JsonLogic with some new operations, such as Exists, Replace, and Match operations.
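To make the idea of a JsonLogic rule concrete, here is a minimal Python sketch of a toy evaluator. It is not Sensible's implementation or the official JsonLogic library, and its truthiness rules are Python's rather than JsonLogic's. It supports only the var, +, and if operations. The helper names (get_var, apply_rule), the buyer_name field, and the sample amounts are hypothetical.

```python
def get_var(data, path):
    """Resolve a dot-notation path such as 'cash.value'; None if missing."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return None
    return current


def apply_rule(rule, data):
    """Evaluate a rule: a one-key dict {operator: arguments}, or a literal."""
    if not isinstance(rule, dict) or len(rule) != 1:
        return rule  # literals (numbers, strings, null) evaluate to themselves
    operator, args = next(iter(rule.items()))
    if not isinstance(args, list):
        args = [args]
    if operator == "var":
        return get_var(data, args[0])
    if operator == "+":
        return sum(apply_rule(arg, data) for arg in args)
    if operator == "if":
        condition, then_rule, else_rule = args
        chosen = then_rule if apply_rule(condition, data) else else_rule
        return apply_rule(chosen, data)
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical extracted fields; the amounts are made up for illustration.
extracted = {
    "earnest_money": {"value": 100},
    "seller_financing": {"value": 200},
    "cash": {"value": 300},
    "buyer_name": None,  # a field whose extraction returned null
}

# Sum three extracted amounts into a total the document doesn't list.
sum_rule = {"+": [
    {"var": "earnest_money.value"},
    {"var": "seller_financing.value"},
    {"var": "cash.value"},
]}

# Fall back to default text when the extracted value is null.
default_rule = {"if": [
    {"var": "buyer_name.value"},
    {"var": "buyer_name.value"},
    "not found",
]}

total = apply_rule(sum_rule, extracted)
name = apply_rule(default_rule, extracted)
print(total, name)
```

The point of the sketch is that a rule is plain data: a JSON object whose single key names the operation and whose value holds the arguments, which can themselves be rules.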

What we'll cover

In this tutorial, we’ll briefly walk through transforming extracted data from an example document using the new Custom Computation method. 

Prerequisites

To follow along with this tutorial, take the following steps:

  • Get an account with Sensible. Or, read along for a rough idea of how things work.
  • Download an example document.
  • Adapt the steps in Configure the extraction to create an empty document type, create an empty config, and upload the document you downloaded in the previous step.

Example scenario 

You want to extract data from loss run insurance documents, which have the format shown in the following image:

Example document

The document format is missing data you’re interested in, such as the total number of claims, and contains data you want to redact. You’ll compute the following based on the extracted data:

  • Get the total number of claims listed in the document
  • Redact the claim IDs
  • Sum up the incurred cost for all claims listed

Count the instances of a term in a document using array length

The document doesn’t list the total number of claims. To output the total claims in a loss run, we’ll use custom logic to count the instances of the term “Claim ID” in the document. See the following screenshot for an overview of how to count the number of claims in the document:

Calculate number of claims

The preceding image shows a collection of extraction queries, or “config,” in the left pane, the PDF in the middle pane, and the extracted data in the right pane. The config in the left pane outputs an array of all the instances of the phrase “Claim ID” in the document. The config then uses the Custom Computation method to calculate the length of the array, resulting in a total claim count of 5.

To try this out yourself, paste the following config into the left pane of the Sensible app:


{
  "fields": [
    {
      /* source field for the jsonLogic computation */
      "id": "_claim_strings_raw",
      /* "all" outputs an array of each
       claim ID in the document */
      "match": "all",
      "anchor": {
        "match": {
          "text": "Claim ID",
          "type": "startsWith",
          "isCaseSensitive": true
        }
      },
      "method": {
        /* target data to extract is a single line 
        near anchor line ("Claim ID") */
        "id": "label",
        /* target data is to right of anchor */
        "position": "right"
      }
    },
    {
      /* use JsonLogic to perform custom
        data transformation */
      "id": "claim_count",
      "method": {
        "id": "customComputation",
        "jsonLogic": {
          /* output the number of claims in the document
             by taking the length of the source claims array */
          "var": "_claim_strings_raw.length"
        }
      }
    }
  ]
}

You should get output similar to the following:


{
  "_claim_strings_raw": [
    {
      "type": "string",
      "value": "1223456789"
    },
    {
      "type": "string",
      "value": "9876543211"
    },
    {
      "type": "string",
      "value": "6785439210"
    },
    {
      "type": "string",
      "value": "7235439210"
    },
    {
      "type": "string",
      "value": "8235439211"
    }
  ],
  "claim_count": {
    "value": 5,
    "type": "number"
  }
}
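If you consume this output in your own code, you can cross-check the computed count against the source array. The following is a minimal Python sketch, assuming you have the JSON above available as a string. The variable names are illustrative and not part of Sensible's API.

```python
import json

# The extraction output shown above, with whitespace condensed.
output_json = """
{
  "_claim_strings_raw": [
    {"type": "string", "value": "1223456789"},
    {"type": "string", "value": "9876543211"},
    {"type": "string", "value": "6785439210"},
    {"type": "string", "value": "7235439210"},
    {"type": "string", "value": "8235439211"}
  ],
  "claim_count": {"value": 5, "type": "number"}
}
"""

output = json.loads(output_json)

# The Custom Computation field reports the length of the source array,
# so the two numbers below should agree.
computed_count = len(output["_claim_strings_raw"])
reported_count = output["claim_count"]["value"]

print(computed_count == reported_count)
```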

Redact data using the Replace operation

You don’t want the full claim IDs to appear in the output because of privacy concerns. So let’s use JsonLogic operations combined with Sensible’s extended operations to redact the claim IDs.

First, let’s use best practices to improve on the preceding example. By using sections instead of the preceding example’s "match": "all", we can get an array of claim objects, where each object in the array can contain data such as the claim_id and incurred_amount.

See the following screenshot for an overview:

Extract each claim as an object

In the preceding image, the green brackets in the middle pane show the start and end of each claims section. The extracted output in the right pane shows the unredacted claim IDs.

Now that we’ve extracted the claim objects, let’s use the Custom Computation method to redact the claim IDs. See the following image for an overview:

Redact IDs

The preceding screenshot shows how to implement the following logic:

  • Add a computed field (redacted_id) inside the claims_sections definition so that it operates on each claim ID in the document. In the Custom Computation method, access each ID’s value with dot notation, for example raw_claim_id.value.
  • The Sensible-specific Replace JsonLogic operation uses a regular expression to find a 10-digit number and replaces its first 3 digits with *** to create redacted output like ***3456789.
  • The Suppress Output method removes the full claim ID from the output.
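The find-and-replace step above can be tried outside Sensible with an ordinary regex engine. The following Python sketch reuses the pattern from the config below; it illustrates the regex only, not Sensible's Replace implementation. Note that Python's re module writes the second capture group as \2, where the config uses $2. The function name redact is hypothetical.

```python
import re

# Same pattern as "find_regex" in the config: capture the first 3 digits
# (group 1) and the remaining 7 digits (group 2) of a 10-digit number.
FIND_REGEX = r".*(\d{3})(\d{7}).*"


def redact(claim_id: str) -> str:
    """Replace the whole match with '***' followed by the last 7 digits."""
    return re.sub(FIND_REGEX, r"***\2", claim_id, count=1)


# Claim IDs taken from the example document's extracted output.
claim_ids = ["1223456789", "9876543211", "6785439210", "7235439210", "8235439211"]
redacted_ids = [redact(claim_id) for claim_id in claim_ids]
print(redacted_ids)
```

One consequence worth noticing: the third and fourth IDs differ only in their first 3 digits, so both redact to ***5439210. Redacted IDs are therefore not guaranteed to be unique, and you shouldn't use them as keys.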

To try this out yourself, paste the following config into the left pane of the Sensible app:


{
  "fields": [
    {
      /* use sections to extract repeating data, 
      in this case, claims */
      "id": "claims_sections",
      "type": "sections",
      "range": {
        /* starting line of each claim is "claim id" */
        "anchor": "claim id",
        /* ending line of each claim is "incurred" */
        "stop": "incurred"
      },
      "fields": [
        {
          "id": "raw_claim_id",
          "anchor": "claim id",
          "method": {
            /* target data to extract is a single line 
            to right of anchor line ("claim id") */
            "id": "label",
            "position": "right"
          }
        },
        {
          "id": "incurred_amount",
          "type": "currency",
          "anchor": "incurred",
          "method": {
            /* target data to extract is a single line 
            to right of anchor line ("incurred") */
            "id": "label",
            "position": "right"
          }
        },
        {
          "id": "redacted_id",
          "method": {
            /* use JsonLogic to perform custom
            data transformation */
            "id": "customComputation",
            "jsonLogic": {
              /* Sensible's Replace operation extends JsonLogic
                 to enable regex find/replace operations  */
              "replace": {
                "source": {
                  /* the claim ID to redact */
                  "var": "raw_claim_id.value"
                },
                /* replace 1st 3 digits in each ID with '***' 
                to redact it */
                "find_regex": ".*(\\d{3})(\\d{7}).*",
                "replace": "***$2"
              }
            }
          }
        },
        /* suppress unredacted IDs from output */
        {
          "id": "hide_fields",
          "method": {
            "id": "suppressOutput",
            "source_ids": [
              "raw_claim_id"
            ]
          }
        }
      ]
    }
  ],
  "computed_fields": [
    {
      /* output the total claims in the document
      by counting the number of claims sections */
      "id": "claim_count",
      "method": {
        "id": "customComputation",
        "jsonLogic": {
          "var": "claims_sections.length"
        }
      }
    }
  ]
}

You should get output similar to the following:


{
  "claims_sections": [
    {
      "incurred_amount": {
        "source": "$3,053",
        "value": 3053,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***3456789",
        "type": "string"
      }
    },
    {
      "incurred_amount": {
        "source": "$251",
        "value": 251,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***6543211",
        "type": "string"
      }
    },
    {
      "incurred_amount": {
        "source": "$985",
        "value": 985,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***5439210",
        "type": "string"
      }
    },
    {
      "incurred_amount": {
        "source": "$581",
        "value": 581,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***5439210",
        "type": "string"
      }
    },
    {
      "incurred_amount": {
        "source": "$771",
        "value": 771,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***5439211",
        "type": "string"
      }
    }
  ],
  "claim_count": {
    "value": 5,
    "type": "number"
  }
}
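Before passing this output downstream, you may want to confirm that no unredacted ID slipped through. The following Python sketch checks the claim objects shown above, trimmed to the redacted_id fields. The helper name is_safe and the shape check are assumptions for illustration, not a Sensible feature.

```python
import json
import re

# Claim objects from the output above, trimmed to the redacted IDs.
output = json.loads("""
{
  "claims_sections": [
    {"redacted_id": {"value": "***3456789", "type": "string"}},
    {"redacted_id": {"value": "***6543211", "type": "string"}},
    {"redacted_id": {"value": "***5439210", "type": "string"}},
    {"redacted_id": {"value": "***5439210", "type": "string"}},
    {"redacted_id": {"value": "***5439211", "type": "string"}}
  ]
}
""")

REDACTED_ID = re.compile(r"\*{3}\d{7}")  # three asterisks, then seven digits


def is_safe(section):
    """True if the raw ID is absent and the redacted ID has the expected shape."""
    if "raw_claim_id" in section:
        return False
    redacted = section.get("redacted_id", {}).get("value", "")
    return REDACTED_ID.fullmatch(redacted) is not None


all_safe = all(is_safe(section) for section in output["claims_sections"])
print(all_safe)
```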

Sum amounts using Reduce and Accumulator operations

Each claim in the document lists an incurred dollar amount, but the document is missing a total. Here’s an overview of how to sum the total amount:

Sum amounts

The preceding screenshot shows how the Reduce operation visits each claim in the claims array, reads its dollar value through the current element, and adds that value to the running total held in the accumulator. The config outputs a total of 5641. Note that in this example, adding currencies results in a number, because the Custom Computation method doesn't infer Sensible types.
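The same fold can be written in a few lines of Python, which may help if the current and accumulator names are unfamiliar. This is an analogy using Python's standard library, not how Sensible evaluates the rule. The amounts are the incurred values from the example document.

```python
from functools import reduce
from itertools import accumulate

# Incurred amounts for the five claims in the example document.
incurred_amounts = [3053, 251, 985, 581, 771]

# Equivalent of the Reduce rule: start the accumulator at 0, then for each
# element add the current value to the running total.
total_incurred = reduce(
    lambda accumulator, current: current + accumulator,
    incurred_amounts,
    0,
)

# The accumulator's value after each step, to show how the total builds up.
running_totals = list(accumulate(incurred_amounts))

print(total_incurred)
print(running_totals)
```

The third argument to reduce plays the role of the trailing 0 in the config's reduce array: it is the accumulator's initial value, and it is also the result if the claims array is empty.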

To try out all the preceding examples, paste the following config into the left pane of the Sensible app. The field near the end of the config, total_incurred, shows how to sum the incurred amounts:


{
  "fields": [
    {
      /* use sections to extract repeating data, 
      in this case, claims */
      "id": "claims_sections",
      "type": "sections",
      "range": {
        /* starting line of each claim is "claim id" */
        "anchor": "claim id",
        /* ending line of each claim is "incurred" */
        "stop": "incurred"
      },
      "fields": [
        {
          "id": "raw_claim_id",
          "anchor": "claim id",
          "method": {
            /* target data to extract is a single line 
            to right of anchor line ("claim id") */
            "id": "label",
            "position": "right"
          }
        },
        {
          "id": "incurred_amount",
          "type": "currency",
          "anchor": "incurred",
          "method": {
            /* target data to extract is a single line 
            to right of anchor line ("incurred") */
            "id": "label",
            "position": "right"
          }
        },
        {
          "id": "redacted_id",
          "method": {
            /* use JsonLogic to perform custom
            data transformation */
            "id": "customComputation",
            "jsonLogic": {
              /* Sensible's Replace operation extends JsonLogic
                 to enable regex find/replace operations  */
              "replace": {
                "source": {
                  /* the claim ID to redact */
                  "var": "raw_claim_id.value"
                },
                /* replace 1st 3 digits in each ID with '***' 
                to redact it */
                "find_regex": ".*(\\d{3})(\\d{7}).*",
                "replace": "***$2"
              }
            }
          }
        },
        /* hide unredacted IDs from output */
        {
          "id": "hide_fields",
          "method": {
            "id": "suppressOutput",
            "source_ids": [
              "raw_claim_id"
            ]
          }
        }
      ]
    }
  ],
  "computed_fields": [
    {
      /* output the number of claims in the document
      by taking the length of the claims array */
      "id": "claim_count",
      "method": {
        "id": "customComputation",
        "jsonLogic": {
          "var": "claims_sections.length"
        }
      }
    },
    /* get the sum of all incurred dollar
    amounts in the document */
    {
      "id": "total_incurred",
      "method": {
        "id": "customComputation",
        "jsonLogic": {
          /* combine elements of array into a 
          single value with Reduce operation */
          "reduce": [
            {
              "var": "claims_sections"
            },
            {
              "+": [
                {
                  /* for the current element in the array .. */
                  "var": "current.incurred_amount.value"
                },
                {
                  /* ...add its value to the running total */
                  "var": "accumulator"
                }
              ]
            },
            0
          ]
        }
      }
    }
  ]
}

You should get output similar to the following:


{
  "claims_sections": [
    {
      "incurred_amount": {
        "source": "$3,053",
        "value": 3053,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***3456789",
        "type": "string"
      }
    },
    {
      "incurred_amount": {
        "source": "$251",
        "value": 251,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***6543211",
        "type": "string"
      }
    },
    {
      "incurred_amount": {
        "source": "$985",
        "value": 985,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***5439210",
        "type": "string"
      }
    },
    {
      "incurred_amount": {
        "source": "$581",
        "value": 581,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***5439210",
        "type": "string"
      }
    },
    {
      "incurred_amount": {
        "source": "$771",
        "value": 771,
        "unit": "$",
        "type": "currency"
      },
      "redacted_id": {
        "value": "***5439211",
        "type": "string"
      }
    }
  ],
  "claim_count": {
    "value": 5,
    "type": "number"
  },
  "total_incurred": {
    "value": 5641,
    "type": "number"
  }
}

Conclusion

Sensible’s new Custom Computation method gives you all the power of JsonLogic for transforming your extracted document data schemas. 

Explore our prebuilt open-source library for extracting from common business documents, check out our docs, and sign up for a free account to start extracting and transforming data from your own documents.

