How a Spanish compliance startup automated extraction from 80-page corporate documents

Updated on February 24, 2026
5 min read

Corporate compliance requires meticulous documentation. When companies transfer shares, reduce capital, or bring on new shareholders, the legal documents recording these transactions become permanent records that auditors, tax authorities, and compliance teams reference for years.

These documents are dense. A single corporate deed might run 80 pages, mixing company information, shareholder demographics, share allocations, and legal boilerplate across dozens of sections. For a Spanish compliance platform helping clients reconcile their records, extracting structured data from these documents was a manual nightmare, until they partnered with Sensible to combine LLMs with deterministic guardrails.


The challenge: Long documents, scattered data, multiple languages

The company builds software that helps Spanish businesses with tax compliance and regulatory reporting, including reconciling shareholder records against official corporate documents before audits surface discrepancies. Getting accurate data out of those source documents (known as "escrituras", or deeds, in Spain) was a challenge:

Sprawling and multilingual. A typical deed runs 50–80 pages, with shareholder demographics scattered far from their share allocations. Documents arrive in Spanish, Catalan, French, or a mix of languages, so country names, cities, and regional codes must be normalized regardless of source language.

Fuzzy identity matching. The same shareholder might appear as "María Cristina Beso Arnalot" on page 10 and "Maria C. Beso" on page 50. Middle names, accent marks, and abbreviations vary throughout.
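
To make the matching problem concrete, here is a small sketch that compares the two spellings quoted above using Python's standard `difflib`. It only illustrates why exact and character-level comparison are weak signals here; it is not part of the system described in this post.

```python
from difflib import SequenceMatcher

full = "María Cristina Beso Arnalot"   # spelling quoted for page 10
short = "Maria C. Beso"                # spelling quoted for page 50

# Exact comparison fails outright.
exact = full == short

# Character-level similarity yields a partial score between 0 and 1.
score = SequenceMatcher(None, full.lower(), short.lower()).ratio()

print(exact)            # False
print(round(score, 2))  # a partial score, well short of 1.0
```

A partial score forces a threshold choice: set it high and this genuine match is rejected, set it low and unrelated shareholders with similar names may be merged.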

The solution: LLMs constrained by structure

The implementation is approximately 80% LLM-based, but the remaining 20%, built on deterministic methods, makes all the difference:

Sections as subdocuments. Corporate deeds follow predictable structures, with clearly demarcated headers. The implementation uses these natural boundaries to create sections: document slices where LLM queries focus on specific content. Rather than asking an LLM to find shareholder demographics somewhere in an 80-page document, the system first deterministically identifies the relevant section, then queries only within it. The result is faster extraction, lower costs, and more reliable results.
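
The section-first idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sensible's implementation: the `split_sections` helper, the header names, and the sample text are all invented for the example, and it assumes each header sits alone on its own line.

```python
import re


def split_sections(text: str, headers: list[str]) -> dict[str, str]:
    """Slice text into sections keyed by header.

    Each section runs from its header to the next known header
    (or to the end of the text). A repeated header overwrites
    the earlier section in this simple sketch.
    """
    pattern = re.compile(
        r"^(" + "|".join(re.escape(h) for h in headers) + r")[ \t]*$",
        re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections


# Invented sample standing in for an 80-page deed.
deed = "\n".join([
    "SHAREHOLDERS",
    "María Cristina Beso Arnalot, resident of Barcelona.",
    "SHARE ALLOCATIONS",
    "Maria C. Beso: shares 1 to 500.",
])

sections = split_sections(deed, ["SHAREHOLDERS", "SHARE ALLOCATIONS"])

# Only this slice, not the whole document, would be sent to the LLM.
llm_context = sections["SHAREHOLDERS"]
print(llm_context)
```

Because the slicing is deterministic, a missing header surfaces as a missing key rather than as a silently wrong LLM answer.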

Merging scattered data with intelligent joining. Shareholder demographics appear early in a deed; share allocations appear much later. The system runs multiple LLM extraction passes, then merges results by shareholder name. But names aren't reliable join keys: spelling variations, missing middle names, and inconsistent formatting mean exact matching fails constantly. Sensible tested iteratively until the agentic LLMs reliably made the judgment calls that fuzzy string matching would get wrong. The LLM agents now merge five separate data lists into unified shareholder records.
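
The shape of that merge can be sketched with a pluggable matcher. In the sketch below, `merge_records` and `heuristic_same_person` are hypothetical names, the records are invented, and only two lists are joined rather than five. The heuristic is a deliberately simple stand-in for the LLM judgment described above, not a description of how Sensible's agents decide.

```python
import unicodedata
from typing import Callable


def tokens(name: str) -> list[str]:
    """Accent-fold, lowercase, drop periods, and split into tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return plain.lower().replace(".", "").split()


def heuristic_same_person(a: str, b: str) -> bool:
    """Stand-in matcher: each token of the shorter name must be a
    prefix of the token at the same position in the longer name."""
    shorter, longer = sorted((tokens(a), tokens(b)), key=len)
    return bool(shorter) and all(
        l.startswith(s) for s, l in zip(shorter, longer)
    )


def merge_records(
    demographics: list[dict],
    allocations: list[dict],
    same_person: Callable[[str, str], bool],
) -> list[dict]:
    """Attach the first matching allocation to each demographic record."""
    merged = []
    for person in demographics:
        record = dict(person)
        for alloc in allocations:
            if same_person(person["name"], alloc["name"]):
                record["shares"] = alloc["shares"]
                break
        merged.append(record)
    return merged


# Invented records for illustration only.
demographics = [{"name": "María Cristina Beso Arnalot", "city": "Barcelona"}]
allocations = [{"name": "Maria C. Beso", "shares": 500}]

merged = merge_records(demographics, allocations, heuristic_same_person)
print(merged)
```

The stand-in handles the abbreviation in this example but already rejects a dropped middle name such as "Maria Beso", which is the kind of case a rule-based matcher gets wrong. Keeping the matcher as a parameter lets a more capable judge replace it without touching the merge logic.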

Multilingual normalization. Deterministic lookup tables map language variants to standard codes across Spanish, Catalan, and French, so "ESPAÑA," "ESPANYA," and "SPAIN" all resolve to the same country code, and "ALEMANYA" and "ALEMANIA" both become "DE."
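
A lookup table of this kind might look like the sketch below. The table contents beyond the variants named above, the `normalize_country` helper, and the accent-folding step are assumptions made for illustration. The "ES" code assumes ISO 3166-1 alpha-2 codes, which is consistent with the "DE" example but is not stated in the source.

```python
import unicodedata
from typing import Optional

# Keys are accent-folded, uppercase variants; values are assumed to be
# ISO 3166-1 alpha-2 codes.
COUNTRY_CODES = {
    "ESPANA": "ES",
    "ESPANYA": "ES",
    "SPAIN": "ES",
    "ALEMANIA": "DE",
    "ALEMANYA": "DE",
}


def normalize_country(raw: str) -> Optional[str]:
    """Map a country name variant to its code, or None if unknown."""
    decomposed = unicodedata.normalize("NFKD", raw.strip().upper())
    key = "".join(c for c in decomposed if not unicodedata.combining(c))
    return COUNTRY_CODES.get(key)


print(normalize_country("ESPAÑA"))    # ES
print(normalize_country("Alemanya"))  # DE
print(normalize_country("Atlantis"))  # None
```

Returning `None` for an unknown variant, rather than guessing, lets the caller flag the value for review and extend the table.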


The results: From hours to minutes

Extraction that previously required reading an 80-page document cover to cover now completes in minutes. Approximately 50 data points are extracted per document, including company identifiers, incorporation details, shareholder demographics, share allocations, transaction specifics, and director information.

The volume is modest at roughly six documents per week, but the value per document is enormous. Each extraction replaces hours of manual work and feeds directly into compliance workflows where accuracy matters. The partnership has evolved over two years, with custom features built to handle edge cases and five document configurations in production.

Key takeaways for complex document extraction

  • Structure enables intelligence. LLMs work better when deterministically constrained to specific sections rather than navigating entire documents.
  • Low volume doesn't mean low value. A document that takes hours to process manually is worth automating even if you only see a few per week.
  • Complex use cases require partnership. Some documents are hard enough that off-the-shelf solutions won't work, and a partnership is needed to get to production-ready extraction.

Author: Frances Elliott

Turn documents into structured data

Stop relying on manual data entry. With Sensible, you can claim back valuable time, earn your ops team's thanks, and deliver a superior user experience. It’s a win-win.