how-to

Extracting data from rate confirmations

Frances Elliott
Tuesday, March 8, 2022

Use SenseML to extract structured data from a trucking rate confirmation PDF.

Related Documents

Why extract data from a trucking rate confirmation?

Better data and analytics is key to shock-proofing supply chains against future disruptions, as the COVID-19 pandemic brought home. But high-quality real-world data can be hard to come by in logistics. Take the trucking industry: it’s an essential part of the supply chain and, like many other transportation industries, still has an analog backbone. In particular, getting information about what’s on a truck and where it’s going might seem trivial. But freight brokers don’t often have APIs where this information is easily accessible. Instead, it’s reported in a PDF: the ubiquitous "rate confirmation". Anytime a truckload goes anywhere, a rate confirmation describes what’s on the truck, how much it costs to ship, and where it’s going. 

Fortunately, with Sensible you can easily extract key information out of rate confirmation PDFs using SenseML. SenseML is Sensible’s JSON-formatted query language for extracting data from documents.


What we'll cover

In this blog post, we’ll briefly walk through configuring extractions for rate confirmations. At the end, you’ll know a couple of SenseML methods and you’ll be on your way to extracting any data you choose using our documentation or our prebuilt rate confirmation configurations in our open-source library.

Write document extraction queries with SenseML

Let's walk through extracting specific pieces of data from a top global transportation company, CH Robinson. Here's an example of a USA-based trucking rate confirmation PDF with fake data:


CH Robinson rate confirmation

To follow along, you can download this example PDF and upload it to the Sensible app:

Download example PDF

 


Our open-source rate confirmation config extracts pretty much everything from this PDF, but for the example in this post, let's keep it simple. We'll extract just the:

  • Load ID and total rate
  • Receivers' names and delivery dates
  • Transportation equipment type

You can also skip ahead to the end of this post to see all the queries we cover in this post wrapped up in a single code sample.

Extract the load ID and total cost

To extract the load ID, #123456789:

Load ID number

Paste the following query, or "field" into the right pane of the Sensible app:


{
	"fields": [{
			"id": "load_id",
			"method": {
				"id": "label",
				"position": "right"
			},
			"anchor": {
				"match": {
					"text": "confirmation - #",
					"type": "includes"
				}

			}
		}

	]
}

The query treats the string "confirmation - #" as a label, and searches to the right of the label for the load ID. You'll get this output:


{
	"load_id": {
		"type": "string",
		"value": "328298459"
	}
}

To extract the total rate, $4,5000:

Total rate

Use a query that finds the text "total" preceded by the text "service for load," then treat that text as part of a row. Sensible searches for the first row cell to the right of the text that contains a currency. 


{
	"id": "rate",
	"type": "currency",
	"method": {
		"id": "row",
		"position": "right",
		"tiebreaker": "first"
	},
	"anchor": {
		"start": [{
			"text": "service for load",
			"type": "startsWith"
		}],
		"match": [{
			"text": "total",
			"type": "startsWith"
		}]
	}
}

The query outputs:


{
	"rate": {
		"source": "$4,500.00",
		"value": 4500,
		"unit": "$",
		"type": "currency"
	}
}

Extract receivers

Since a rate confirmation can list multiple receivers, let's output arrays of receiver data. You can either A) output separate arrays for each piece of receiver data, or you can B) output an array of receiver objects, each containing receiver data.

In our open-source library, our customer-contributed config uses approach A with a "matchAll" parameter. In this post, we'll instead illustrate approach B by treating each receiver as a repeating section. The following image shows the receiver sections, each of which contains name and delivery date properties.

Receiver sections have a repeated format

The following query looks for the receiver name and delivery date in a group of receiver sections that repeatedly start with the text "RECEIVER #". You can append this query after the fields array in the Sensible app:



	"sections": [{
		"id": "receivers",
		"range": {
			"anchor": {
				"match": {
					"pattern": "RECEIVER[ ]*\\#[0-9]{1,2}\\:",
					"type": "regex"
				}
			},
			"stop": {
				"type": "startsWith",
				"text": "Service for Load",
				"isCaseSensitive": true
			},
			"offsetY": -0.1
		},
		"fields": [{
				"id": "consignee_name",
				"method": {
					"id": "row",
					"position": "right",
					"tiebreaker": "first"
				},
				"anchor": {
					"match": [{
						"text": "RECEIVER",
						"type": "startsWith",
						"isCaseSensitive": true
					}]
				}
			},
			{
				"id": "delivery_date",
				"method": {
					"id": "row",
					"position": "right"
				},
				"anchor": {
					"match": [{
						"text": "delivery date",
						"type": "startsWith"
					}]
				}
			}
		]
	}]


The output is:


{
	"receivers": [{
			"consignee_name": {
				"value": "Anyco Clothes #176",
				"type": "string"
			},
			"delivery_date": {
				"type": "string",
				"value": "7/13/21"
			}
		},
		{
			"consignee_name": {
				"value": "Anyco Clothes #425",
				"type": "string"
			},
			"delivery_date": {
				"type": "string",
				"value": "7/14/21"
			}
		},
		{
			"consignee_name": {
				"value": "Anyco Clothes #255",
				"type": "string"
			},
			"delivery_date": {
				"type": "string",
				"value": "7/15/21"
			}
		},
		{
			...
		}
	]
}

Extracting trailer type

Finally, let's take a look at how to refine extracted data. We can get a string for the equipment, "Van - Min L=53":

Equipment type

Using this query:


{
	"id": "_trailer_raw",
	"method": {
		"id": "row",
		"position": "right",
		"tiebreaker": "first"
	},
	"anchor": {
		"match": [{
			"text": "Equipment:",
			"type": "startsWith"
		}]
	}
}
}

Let's refine this output so we only capture the trailer type ("van"). We'll use a computed field to split the field's output on the hyphen. You can append this query after the sections array in the Sensible app :


{
	"computed_fields": [{
		"id": "trailer_type",
		"method": {
			"id": "split",
			"separator": " - ",
			"source_id": "_trailer_raw",
			"index": 0
		}
	}]
}


To get output like this:


{
	"trailer_type": {
		"value": "Van",
		"type": "string"
	}

}

Wrapping up

To wrap it all up, you can paste all the queries mentioned in this post into the Sensible app like this:


{
  "fields": [
    {
      "id": "load_id",
      "method": {
        "id": "label",
        "position": "right"
      },
      "anchor": {
        "match": {
          "text": "confirmation - #",
          "type": "includes"
        }
      }
    },
    {
      "id": "rate",
      "type": "currency",
      "method": {
        "id": "row",
        "position": "right",
        "tiebreaker": "first"
      },
      "anchor": {
        "start": [
          {
            "text": "service for load",
            "type": "startsWith"
          }
        ],
        "match": [
          {
            "text": "total",
            "type": "startsWith"
          }
        ]
      }
    },
    {
      "id": "_trailer_raw",
      "method": {
        "id": "row",
        "position": "right",
        "tiebreaker": "first"
      },
      "anchor": {
        "match": [
          {
            "text": "Equipment:",
            "type": "startsWith"
          }
        ]
      }
    },
    {
      "id": "weight",
      "method": {
        "id": "label",
        "position": "below"
      },
      "anchor": {
        "match": [
          {
            "text": "Commodity",
            "type": "equals"
          },
          {
            "text": "Est wgt",
            "type": "equals"
          }
        ]
      }
    }
  ],
  "sections": [
    {
      "id": "receivers",
      "range": {
        "anchor": {
          "match": {
            "pattern": "RECEIVER[ ]*\\#[0-9]{1,2}\\:",
            "type": "regex"
          }
        },
        "stop": {
          "type": "startsWith",
          "text": "Service for Load",
          "isCaseSensitive": true
        },
        "offsetY": -0.1
      },
      "fields": [
        {
          "id": "consignee_name",
          "method": {
            "id": "row",
            "position": "right",
            "tiebreaker": "first"
          },
          "anchor": {
            "match": [
              {
                "text": "RECEIVER",
                "type": "startsWith",
                "isCaseSensitive": true
              }
            ]
          }
        },
        {
          "id": "delivery_date",
          "method": {
            "id": "row",
            "position": "right"
          },
          "anchor": {
            "match": [
              {
                "text": "delivery date",
                "type": "startsWith"
              }
            ]
          }
        }
      ]
    }
  ],
  "computed_fields": [
    {
      "id": "trailer_type",
      "method": {
        "id": "split",
        "separator": " - ",
        "source_id": "_trailer_raw",
        "index": 0
      }
    }
  ]
}

Start extracting

Congratulations, you've learned some key methods for extracting structured data from rate confirmation documents. There's more extraction power for you to uncover. Check out our prebuilt rate confirmation config in our open-source library, check out our docs, and sign up for a free account (150 docs a month, no credit card required) to start extracting data from your own documents.