January 20, 2025 · Case study

PDF to digital: automated compliance forms

Built an automated pipeline to transform complex PDF compliance forms into digital workflows, leveraging LLMs for structured data extraction and GraphQL for platform integration.

TypeScript · OpenAI · ETL · GraphQL · Automation

Automating PDF form digitization with LLMs

Context

While working on partnership development for our compliance platform, we identified an opportunity to demonstrate value to a potential enterprise partner. They had comprehensive compliance templates (QM10 and QM20) available only as PDFs. Instead of traditional sales outreach, I proposed building their templates directly on our platform as a proof of concept.

Note: While the specific partner and compliance details are kept confidential, this case study focuses on the technical approach to solving a common enterprise challenge: converting unstructured PDF forms into interactive digital workflows.

What I Built

I developed an automated pipeline that could:

  1. Extract structured data from complex PDF compliance forms
  2. Transform the data into a standardized format
  3. Automatically generate digital workflows through our platform's GraphQL API
  4. Create a complete, interactive compliance assessment ready for immediate use

The end result was a fully functional digital version of their compliance workflow that we could demonstrate during partnership discussions.

Technical Breakdown

Stack & Tools

  • TypeScript/Node.js: Core implementation
  • OpenAI GPT API: PDF content extraction and structuring
  • GraphQL: Platform API integration
  • PDF Processing: Initial exploration with PDF parsing libraries

Key Architecture Decisions

  1. LLM-Based Data Extraction

After initially exploring traditional PDF parsing libraries, I pivoted to using OpenAI's GPT for data extraction. The goal was to get each section of the PDF into a structured format like this:

// Example of the structured data format we achieved
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

// Sample of extracted and structured data
const extractedData = [
  {
    title: "1.2 Informatiebeveiligingsbeleid en bestuurlijke goedkeuring",
    informationText: "Het management van de organisatie dient...",
    goal: "Voorkomen dat er informatiebeveiligingsincidenten...",
    booleanQuestions: [
      "Heeft de organisatie een gedetailleerd informatiebeveiligingsbeleid...?",
      // More questions...
    ],
    mappingIndication: [
      "ISO 27001: A.5.1 – Information security policies",
      // More mappings...
    ],
  },
  // More sections...
];

This approach provided superior results compared to traditional PDF parsing because:

  • The PDFs contained complex formatting and tables
  • The LLM could understand context and relationships between elements
  • The structured output was more reliable and required less cleanup
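
Even when the structured output is reliable, it is worth checking its shape before anything is sent to the platform. The following is a minimal sketch of that idea, not the code used in this project: `isComplianceSection` and `parseSections` are hypothetical helpers, and it assumes the model was asked to return a JSON array of sections matching the `ComplianceSection` interface.

```typescript
// Shape of one extracted section (same fields as the interface shown earlier).
interface ComplianceSection {
  title: string;
  informationText: string;
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

// True only if the value is an array whose items are all strings.
const isStringArray = (value: unknown): value is string[] =>
  Array.isArray(value) && value.every((item) => typeof item === "string");

// Hypothetical type guard: checks that every expected field is present and typed correctly.
function isComplianceSection(value: unknown): value is ComplianceSection {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.title === "string" &&
    typeof candidate.informationText === "string" &&
    typeof candidate.goal === "string" &&
    isStringArray(candidate.booleanQuestions) &&
    isStringArray(candidate.mappingIndication)
  );
}

// Hypothetical helper: parses the raw model output and rejects malformed sections early.
function parseSections(rawModelOutput: string): ComplianceSection[] {
  const parsed: unknown = JSON.parse(rawModelOutput);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of sections");
  }
  return parsed.map((item, index) => {
    if (!isComplianceSection(item)) {
      throw new Error(`Section at index ${index} does not match the expected shape`);
    }
    return item;
  });
}
```

Failing fast here means a malformed section is caught before any form is partially created on the platform.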

  2. Automated Form Generation Pipeline

async function main() {
  // Authentication
  await signIn(baseUrl, process.env.USERNAME, process.env.PASSWORD);

  // Create base form structure
  const formCollection = await createFormCollection({
    tenantId: tenant.id,
    data: { name: "QM20 Assessment Form" },
  });

  // Process each section from extracted data
  for (const section of extractedData) {
    const formSection = await createFormSection({
      tenantId: tenant.id,
      formId: formCollection.id, // assumption: the collection created above identifies the form
      data: {
        title: section.title,
        description: section.informationText,
      },
    });

    // Create dynamic form fields
    await createFormFields(tenant.id, formSection.id, section);
  }
}

The pipeline handles:

  • Authentication and session management
  • Hierarchical form creation (collections → sections → fields)
  • Dynamic field generation based on question types
  • Metadata and mapping preservation
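
To illustrate what dynamic field generation can look like, here is a minimal sketch that turns one extracted section into an ordered list of field specifications. It is an assumption-heavy illustration rather than the actual `createFormFields` implementation: the `FieldType` values, the `FieldSpec` shape, and the `planFields` helper are all hypothetical.

```typescript
// Hypothetical field types; the platform's real type names are not shown in this case study.
type FieldType = "BOOLEAN" | "INFO";

interface FieldSpec {
  type: FieldType;
  initialTitle: string;
}

// Only the parts of an extracted section that drive field creation.
interface SectionInput {
  goal: string;
  booleanQuestions: string[];
  mappingIndication: string[];
}

// Builds the list of fields for one section, in display order:
// the goal first, then one yes/no field per question, then the mappings.
function planFields(section: SectionInput): FieldSpec[] {
  const fields: FieldSpec[] = [];
  if (section.goal.trim() !== "") {
    fields.push({ type: "INFO", initialTitle: `Goal: ${section.goal}` });
  }
  for (const question of section.booleanQuestions) {
    fields.push({ type: "BOOLEAN", initialTitle: question });
  }
  if (section.mappingIndication.length > 0) {
    fields.push({
      type: "INFO",
      initialTitle: `Mapping: ${section.mappingIndication.join("; ")}`,
    });
  }
  return fields;
}
```

Separating the planning step from the API calls keeps the mapping logic pure and easy to test without touching the platform.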

  3. Error Handling and Validation

const createFormField = async ({
  tenantId,
  formSectionId,
  type,
  initialTitle,
}) => {
  try {
    const field = await createField(/* ... */);
    // Use the tenantId passed into this function rather than an outer-scope tenant object
    await updateFormField(tenantId, field.id, {
      richDescription: "",
      richTitle: initialTitle,
      disabled: false,
      metadata: {
        validation: {
          required: false,
          formats: ["PDF", "DOC", "DOCX" /* ... */],
        },
      },
    });
    return field;
  } catch (error) {
    console.error(`Failed to create field: ${initialTitle}`);
    throw error;
  }
};

Built-in safeguards include:

  • Proper error handling for API calls
  • Field validation rules
  • File format restrictions
  • Rich text support for complex content
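
As an illustration of how a file format restriction like the one in `metadata.validation` could be checked, here is a small sketch. How the platform actually enforces these rules is not covered in this case study, so the `FieldValidation` shape and the `isAllowedUpload` helper are assumptions made purely for the example.

```typescript
// Mirrors the validation metadata set on the field above (shape assumed for illustration).
interface FieldValidation {
  required: boolean;
  formats: string[];
}

// Hypothetical check: is this upload acceptable for a field with the given rules?
function isAllowedUpload(
  fileName: string | undefined,
  validation: FieldValidation
): boolean {
  // No file at all is fine only when the field is optional.
  if (fileName === undefined || fileName === "") {
    return !validation.required;
  }
  const dotIndex = fileName.lastIndexOf(".");
  // Reject names with no extension, such as "report" or ".pdf".
  if (dotIndex <= 0 || dotIndex === fileName.length - 1) {
    return false;
  }
  // Compare extensions case-insensitively against the allowed formats.
  const extension = fileName.slice(dotIndex + 1).toUpperCase();
  return validation.formats.some((format) => format.toUpperCase() === extension);
}
```

For example, with `formats: ["PDF", "DOC", "DOCX"]`, a file named `policy.pdf` passes while `notes.txt` is rejected.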

What I Learned

  1. PDF Data Extraction Strategy

The initial approach using PDF parsing libraries like pdf-parse or pdf2json proved challenging due to:

  • Inconsistent text extraction
  • Loss of formatting and structure
  • Difficulty handling tables and layouts

LLMs provided a more elegant solution by:

  • Understanding document context
  • Maintaining relationships between elements
  • Producing clean, structured output

  2. GraphQL API Orchestration

Managing multiple dependent API calls required careful orchestration:

  • Sequential processing for proper parent-child relationships
  • Error handling with appropriate rollbacks
  • Rate limiting consideration
  • Progress tracking for long-running operations
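
The points above can be sketched as a small orchestration helper. This is an illustrative pattern under stated assumptions, not the project's actual code: `SectionApi`, `createSection`, `deleteSection`, and `createSectionsWithRollback` are hypothetical names standing in for the platform's GraphQL mutations.

```typescript
interface Created {
  id: string;
}

// Hypothetical API surface; a real pipeline would wrap GraphQL mutations here.
interface SectionApi {
  createSection(title: string): Promise<Created>;
  deleteSection(id: string): Promise<void>;
}

// Creates sections one at a time so ordering and parent-child relationships are preserved.
// If any call fails, sections created so far are deleted in reverse order.
async function createSectionsWithRollback(
  api: SectionApi,
  titles: string[],
  onProgress?: (done: number, total: number) => void
): Promise<string[]> {
  const createdIds: string[] = [];
  try {
    for (const title of titles) {
      const section = await api.createSection(title);
      createdIds.push(section.id);
      // Report progress after each successful call.
      onProgress?.(createdIds.length, titles.length);
    }
    return createdIds;
  } catch (error) {
    for (const id of [...createdIds].reverse()) {
      // Best-effort cleanup: a failed delete should not hide the original error.
      await api.deleteSection(id).catch(() => undefined);
    }
    throw error;
  }
}
```

Processing sequentially also acts as a simple form of rate limiting, since only one request is in flight at a time.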

  3. Business Process Automation

The project highlighted how technical solutions can directly impact business development:

  • Reduced sales cycle by providing immediate value
  • Demonstrated platform capabilities effectively
  • Saved significant manual work for the CSM team
  • Created reusable automation patterns

What's Next?

  1. Scalability Improvements
    • Batch processing for multiple PDFs
    • Parallel processing where possible
    • Caching for improved performance
  2. Enhanced Extraction
    • Support for more complex PDF layouts
    • Additional compliance template types
    • Multi-language support
  3. Integration Enhancements
    • Automated testing for generated forms
    • Version control for templates
    • Change tracking and diff generation

Key Takeaways

I learned the power of combining modern AI tools with traditional automation to solve real business challenges. By thinking creatively about PDF data extraction and leveraging LLMs, I turned what could have been weeks of manual work into an automated process.

The solution saved time and resources immediately and created a repeatable pattern for future partner onboarding. Most importantly, it transformed a traditional sales approach into a value-first demonstration that resonated with our potential partner.


Note: This case study focuses on the technical implementation while respecting confidentiality around specific partner details and compliance requirements.