January 20, 2025 · Case study

PDF to digital: automated compliance forms

Built an automated pipeline to transform complex PDF compliance forms into digital workflows, leveraging LLMs for structured data extraction and GraphQL for platform integration.

TypeScript · OpenAI · ETL · GraphQL · Automation

Automating PDF form digitization with LLMs

Context

While working on partnership development for our compliance platform, we identified an opportunity to demonstrate value to a potential enterprise partner. They had comprehensive compliance templates (QM10 and QM20) available only as PDFs. Instead of traditional sales outreach, I proposed building their templates directly on our platform as a proof of concept.

Note: While the specific partner and compliance details are kept confidential, this case study focuses on the technical approach to solving a common enterprise challenge: converting unstructured PDF forms into interactive digital workflows.

What I Built

I developed an automated pipeline that could:

Extract structured data from complex PDF compliance forms
Transform the data into a standardized format
Automatically generate digital workflows through our platform's GraphQL API
Create a complete, interactive compliance assessment ready for immediate use

The end result was a fully functional digital version of their compliance workflow that we could demonstrate during partnership discussions.

Technical Breakdown

Stack & Tools

TypeScript/Node.js: Core implementation
OpenAI GPT API: PDF content extraction and structuring
GraphQL: Platform API integration
PDF Processing: Initial exploration with PDF parsing libraries

Key Architecture Decisions

LLM-Based Data Extraction

After initially exploring traditional PDF parsing libraries, I pivoted to OpenAI's GPT for data extraction. The target format and a sample of the extracted output looked like this:

// Example of the structured data format we achieved
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

// Sample of extracted and structured data
const extractedData = [
  {
    title: "1.2 Informatiebeveiligingsbeleid en bestuurlijke goedkeuring",
    informationText: "Het management van de organisatie dient...",
    goal: "Voorkomen dat er informatiebeveiligingsincidenten...",
    booleanQuestions: [
      "Heeft de organisatie een gedetailleerd informatiebeveiligingsbeleid...?",
      // More questions...
    ],
    mappingIndication: [
      "ISO 27001: A.5.1 - Information security policies",
      // More mappings...
    ],
  },
  // More sections...
];

This approach provided superior results compared to traditional PDF parsing because:

The PDFs contained complex formatting and tables
The LLM could understand context and relationships between elements
The structured output was more reliable and required less cleanup
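Even when the structured output is reliable, it helps to check it before it reaches the API. The following is a minimal sketch of such a check, assuming the model returns a JSON array of sections matching the `ComplianceSection` interface above. The helper names `isComplianceSection` and `parseSections` are illustrative and not taken from the original pipeline.

```typescript
// Shape copied from the ComplianceSection interface shown earlier.
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

const isStringArray = (value: unknown): value is string[] =>
  Array.isArray(value) && value.every((item) => typeof item === "string");

// Type guard: true only when every expected field is present with the right type.
function isComplianceSection(value: unknown): value is ComplianceSection {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.title === "string" &&
    typeof candidate.informationText === "string" &&
    typeof candidate.goal === "string" &&
    isStringArray(candidate.booleanQuestions) &&
    isStringArray(candidate.mappingIndication)
  );
}

// Hypothetical helper: parses the raw model response and rejects malformed sections.
function parseSections(rawJson: string): ComplianceSection[] {
  const parsed: unknown = JSON.parse(rawJson);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of sections");
  }
  return parsed.map((item, index) => {
    if (!isComplianceSection(item)) {
      throw new Error(`Section at index ${index} does not match ComplianceSection`);
    }
    return item;
  });
}
```

Failing fast here means a missing or mistyped field surfaces before any form is created, rather than as a half-built form on the platform.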

Automated Form Generation Pipeline

// Note: baseUrl, tenant, and formId are defined elsewhere in the script (not shown here)
async function main() {
  // Authentication
  await signIn(baseUrl, process.env.USERNAME, process.env.PASSWORD);

  // Create base form structure
  const formCollection = await createFormCollection({
    tenantId: tenant.id,
    data: { name: "QM20 Assessment Form" },
  });

  // Process each section from extracted data
  for (const section of extractedData) {
    const formSection = await createFormSection({
      tenantId: tenant.id,
      formId: formId,
      data: {
        title: section.title,
        description: section.informationText,
      },
    });

    // Create dynamic form fields
    await createFormFields(tenant.id, formSection.id, section);
  }
}

The pipeline handles:

Authentication and session management
Hierarchical form creation (collections → sections → fields)
Dynamic field generation based on question types
Metadata and mapping preservation
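To illustrate the "dynamic field generation" and "mapping preservation" steps, here is a rough sketch of how one extracted section could be turned into field descriptions before any API call is made. The `FieldSpec` shape and the `"BOOLEAN"` type name are assumptions for illustration; the platform's real field types are not shown in this write-up.

```typescript
// Shape copied from the ComplianceSection interface shown earlier.
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

// Hypothetical field type; other question types would extend this union.
type FieldType = "BOOLEAN";

// Hypothetical intermediate description of a field to be created.
interface FieldSpec {
  type: FieldType;
  title: string;
  metadata: { mappings: string[] };
}

// One yes/no field per extracted question, each carrying the section's
// framework mappings so they are not lost in the conversion.
function buildFieldSpecs(section: ComplianceSection): FieldSpec[] {
  return section.booleanQuestions.map(
    (question): FieldSpec => ({
      type: "BOOLEAN",
      title: question,
      metadata: { mappings: [...section.mappingIndication] },
    }),
  );
}
```

Keeping this step pure (no network calls) makes it easy to inspect what will be created before running the pipeline against a live tenant.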

Error Handling and Validation

const createFormField = async ({
  tenantId,
  formSectionId,
  type,
  initialTitle,
}) => {
  try {
    const field = await createField(/* ... */);
    await updateFormField(tenantId, field.id, {
      richDescription: "",
      richTitle: initialTitle,
      disabled: false,
      metadata: {
        validation: {
          required: false,
          formats: ["PDF", "DOC", "DOCX" /* ... */],
        },
      },
    });
    return field;
  } catch (error) {
    console.error(`Failed to create field: ${initialTitle}`);
    throw error;
  }
};

Built-in safeguards include:

Proper error handling for API calls
Field validation rules
File format restrictions
Rich text support for complex content
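The `formats` list in the field metadata suggests a simple extension check. The snippet below is only a sketch of what such a check could look like on the client side; `isAllowedFormat` is a hypothetical helper, and the platform may well enforce the restriction differently.

```typescript
// Hypothetical helper: checks a file name's extension against an allow-list
// such as ["PDF", "DOC", "DOCX"], ignoring case.
function isAllowedFormat(fileName: string, allowedFormats: string[]): boolean {
  const dotIndex = fileName.lastIndexOf(".");
  // No extension, a leading-dot name like ".env", or a trailing dot: nothing to match.
  if (dotIndex <= 0 || dotIndex === fileName.length - 1) return false;
  const extension = fileName.slice(dotIndex + 1).toUpperCase();
  return allowedFormats.some((format) => format.toUpperCase() === extension);
}
```

An extension check is only a first filter; it says nothing about the actual file contents.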

What I Learned

PDF Data Extraction Strategy

The initial approach using PDF parsing libraries like pdf-parse or pdf2json proved challenging due to:

Inconsistent text extraction
Loss of formatting and structure
Difficulty handling tables and layouts

LLMs provided a more elegant solution by:

Understanding document context
Maintaining relationships between elements
Producing clean, structured output

GraphQL API Orchestration

Managing multiple dependent API calls required careful orchestration:

Sequential processing for proper parent-child relationships
Error handling with appropriate rollbacks
Rate limiting consideration
Progress tracking for long-running operations
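The four points above can be sketched in one function. This is an illustration under assumptions rather than the code used in the project: the `FormApi` interface and its three methods are hypothetical stand-ins for the platform's GraphQL mutations, and the rollback assumes that deleting a section also removes its fields.

```typescript
// Hypothetical operations, injected so the sketch stays independent of the real API client.
interface FormApi {
  createSection(title: string): Promise<string>; // resolves to the new section id
  createField(sectionId: string, question: string): Promise<void>;
  deleteSection(sectionId: string): Promise<void>; // assumed to remove child fields too
}

interface SectionInput {
  title: string;
  booleanQuestions: string[];
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function createAllSections(
  api: FormApi,
  sections: SectionInput[],
  delayMs = 0,
  onProgress: (done: number, total: number) => void = () => {},
): Promise<string[]> {
  const createdIds: string[] = [];
  try {
    for (const section of sections) {
      // Parent first: fields cannot be created without the section id.
      const sectionId = await api.createSection(section.title);
      createdIds.push(sectionId);
      for (const question of section.booleanQuestions) {
        await api.createField(sectionId, question);
        if (delayMs > 0) await sleep(delayMs); // crude rate limiting between calls
      }
      onProgress(createdIds.length, sections.length);
    }
    return createdIds;
  } catch (error) {
    // Roll back newest-first so a failed run does not leave a half-built form.
    for (const sectionId of [...createdIds].reverse()) {
      await api.deleteSection(sectionId);
    }
    throw error;
  }
}
```

Processing sequentially trades speed for predictable ordering, which matters here because every field depends on an id returned by an earlier call.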

Business Process Automation

The project highlighted how technical solutions can directly impact business development:

Reduced sales cycle by providing immediate value
Demonstrated platform capabilities effectively
Saved significant manual work for the CSM team
Created reusable automation patterns

What's Next?

Scalability Improvements
- Batch processing for multiple PDFs
- Parallel processing where possible
- Caching for improved performance
Enhanced Extraction
- Support for more complex PDF layouts
- Additional compliance template types
- Multi-language support
Integration Enhancements
- Automated testing for generated forms
- Version control for templates
- Change tracking and diff generation
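As a starting point for the batch and parallel processing items, a bounded-concurrency helper is one possible building block. The sketch below is not part of the current pipeline; `mapWithConcurrency` is a hypothetical helper that could process several PDFs at once while each individual form is still created sequentially.

```typescript
// Runs `worker` over all items with at most `limit` calls in flight,
// returning results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each runner pulls the next unclaimed index until none remain.
  async function runWorker(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index]);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => runWorker()));
  return results;
}
```

A small limit keeps pressure on the extraction and platform APIs predictable, which ties in with the rate-limiting concern noted earlier.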

Key Takeaways

I learned the power of combining modern AI tools with traditional automation to solve real business challenges. By thinking creatively about PDF data extraction and leveraging LLMs, I turned what could have been weeks of manual work into an automated process.

The solution saved time and resources immediately and created a repeatable pattern for future partner onboarding. Most importantly, it transformed a traditional sales approach into a value-first demonstration that resonated with our potential partner.

Note: This case study focuses on the technical implementation while respecting confidentiality around specific partner details and compliance requirements.