
Automating GDPR-Compliant Resume Keyword Extraction with LLM-Powered Python Pipelines on AWS Serverless Architecture

By Maria José González Antelo · June 18, 2026


Context & Architecture

From a product leadership perspective, automating the extraction of professional skills from resumes isn't a challenge of "prompting," but a challenge of data sovereignty and architectural latency.

To scale this for millions of users, as I have done on previous high-traffic platforms, you cannot rely on monolithic processing. The architecture must be event-driven. I use an AWS serverless stack so that PII (Personally Identifiable Information) is held only in transient memory and discarded as soon as the extraction phase completes, in line with GDPR's "purpose limitation" and "storage limitation" principles.

Technical Stack:

  • Compute: AWS Lambda (Python 3.11)
  • Orchestration: AWS Step Functions (for retry logic and error handling)
  • LLM Integration: LangChain + OpenAI GPT-4o-mini (via API) or Amazon Bedrock (for VPC-isolated data)
  • Storage: Amazon S3 (Encrypted at rest)
  • Compliance: Transient processing (no persistent PII storage in logs)
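
The "no persistent PII storage in logs" item can be partly enforced in code. The following is a minimal sketch, assuming the pipeline logs through Python's standard logging module. The logger name, the two regular expressions, and the `redact` helper are illustrative assumptions, not part of the original pipeline, and two patterns will not catch every kind of PII.

```python
import logging
import re

# Hypothetical patterns for illustration. Real resumes need broader coverage
# (names, addresses, national ID numbers) than two regexes can provide.
# The phone pattern is deliberately greedy and will also mask other long
# digit runs such as date ranges; over-redaction is acceptable in logs.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Mask e-mail addresses and phone-like digit runs in a string."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


class PiiRedactionFilter(logging.Filter):
    """Rewrite each log record so that its formatted message is redacted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = None  # args are already merged into msg
        return True  # keep the record; only its content changes


# Hypothetical logger name; attach the filter to whichever logger the pipeline uses.
logger = logging.getLogger("resume_pipeline")
logger.addFilter(PiiRedactionFilter())
```

A filter like this is a safety net, not a substitute for keeping resume text out of log calls in the first place.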

Implementation: The Python Pipeline

The following implementation focuses on a modular extraction pipeline. We use a structured-output approach: the parser validates the LLM response against a Pydantic schema and raises an error if the response does not conform, so only well-formed JSON reaches downstream database integration and search indexing.

import os
from typing import List

import boto3
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# --- Compliance Schema ---
class SkillExtraction(BaseModel):
    hard_skills: List[str] = Field(description="Technical skills, tools, and frameworks")
    soft_skills: List[str] = Field(description="Interpersonal and leadership capabilities")
    experience_level: str = Field(description="Junior, Mid, Senior, or Executive")
    confidence_score: float = Field(description="Confidence level of extraction 0.0-1.0")

# --- Configuration ---
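# ChatOpenAI reads OPENAI_API_KEY from the environment; fail fast if it is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")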
s3_client = boto3.client('s3')

def extract_text_from_s3(bucket: str, key: str) -> str:
    """Retrieves raw resume text. In production, integrate with AWS Textract."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response['Body'].read().decode('utf-8')

def lambda_handler(event, context):
    """
    Main entry point for AWS Lambda.
    Processes the resume and extracts keywords without persisting PII.
    """
    bucket = event['bucket']
    key = event['key']

    # 1. Data Retrieval
    raw_text = extract_text_from_s3(bucket, key)

    # 2. LLM Setup with Pydantic for structured output
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    parser = PydanticOutputParser(pydantic_object=SkillExtraction)

    prompt = PromptTemplate(
        template="Extract professional keywords from the following resume text.\n{format_instructions}\nText: {text}",
        input_variables=["text"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    # 3. Execution
    chain = prompt | llm | parser
    try:
        result = chain.invoke({"text": raw_text})

        # GDPR guardrail: return only the structured fields, never raw_text.
        # The model can still echo PII into a field, so validate the output downstream.
        return {
            "statusCode": 200,
            "body": result.model_dump()  # Pydantic v2; use .dict() on v1
        }
    except Exception as e:
        # Log only the exception type: parser errors can embed resume text.
        print(f"Error during extraction: {type(e).__name__}")
        return {"statusCode": 500, "body": "Processing Error"}

# Example Event: {"bucket": "resumes-encrypted-bucket", "key": "candidate_01.txt"}
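
The handler above expects a flat event with top-level `bucket` and `key` fields. When the function is triggered by an S3 "Object Created" event delivered through EventBridge, those values arrive nested under `detail`. The sketch below shows one way to accept both shapes. The function name `resolve_s3_location` is a hypothetical helper, not part of the original code.

```python
from typing import Any, Dict, Tuple


def resolve_s3_location(event: Dict[str, Any]) -> Tuple[str, str]:
    """Return (bucket, key) from either supported event shape."""
    # Shape 1: the flat test event used by the handler above.
    if "bucket" in event and "key" in event:
        return event["bucket"], event["key"]

    # Shape 2: S3 "Object Created" event delivered via EventBridge,
    # where the location is nested under "detail".
    detail = event.get("detail")
    try:
        return detail["bucket"]["name"], detail["object"]["key"]
    except (KeyError, TypeError):
        # Report only the top-level field names, never the payload values.
        raise ValueError(
            f"Unsupported event shape; top-level keys: {sorted(event)}"
        ) from None
```

Inside `lambda_handler`, the two lookups `event['bucket']` and `event['key']` could then be replaced by `bucket, key = resolve_s3_location(event)`.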

Strategic Execution & Compliance Guardrails

To transition this from a snippet to a production-grade system, the following RAID (Risks, Assumptions, Issues, Dependencies) considerations must be applied:

  1. Data Minimization: Do not pass the entire resume to the LLM if only the "Experience" section is required. Use a pre-processing step to truncate the text, reducing token costs and privacy risk.
  2. VPC Isolation: For enterprise-grade compliance, deploy the Lambda within a VPC and call Amazon Bedrock through a VPC interface endpoint (AWS PrivateLink). Traffic to the model then stays on the AWS network instead of crossing the public internet.
  3. Encryption: All S3 buckets must use server-side encryption (SSE-S3 with AES-256, or SSE-KMS). The Lambda execution role should have the minimum permissions it needs (s3:GetObject on the resume bucket, plus kms:Decrypt if SSE-KMS is used) to limit lateral movement.
  4. Latency Optimization: For high-volume platforms, avoid synchronous calls. Use an asynchronous pattern: S3 Upload → EventBridge → Lambda → DynamoDB.
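
The data-minimization step in item 1 can be sketched as a small pre-processor that keeps only the section the LLM needs. This assumes plain-text resumes whose sections start with a heading on its own line. The heading list, the `extract_section` name, and the 4000-character cap are illustrative assumptions, and real resumes will need a more tolerant heading detector.

```python
from typing import Iterable

# Hypothetical heading vocabulary; real resumes vary widely and may be multilingual.
SECTION_HEADINGS = (
    "summary", "experience", "work experience", "education",
    "skills", "projects", "certifications",
)


def extract_section(text: str, wanted: str = "experience",
                    headings: Iterable[str] = SECTION_HEADINGS,
                    max_chars: int = 4000) -> str:
    """Return only the lines under the `wanted` heading, capped at max_chars."""
    known = {h.lower() for h in headings}
    kept = []
    inside = False
    for line in text.splitlines():
        label = line.strip().rstrip(":").lower()
        if label in known:
            # A heading line switches capture on for the wanted section
            # and off for every other section.
            inside = wanted in label
            continue
        if inside:
            kept.append(line)
    section = "\n".join(kept).strip()
    # Fall back to plain truncation when no matching heading was recognised.
    return (section or text)[:max_chars]
```

With this in place, the handler would pass `extract_section(raw_text)` to the chain instead of `raw_text`, so contact details at the top of the resume are not sent to the model. The truncation fallback still forwards the start of the document, so a stricter deployment might reject resumes with no recognised heading instead.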

From Infrastructure to Outcome

Building the pipeline is the technical baseline, but the business value lies in how this data transforms the user experience. In the recruitment space, the gap between a "static PDF" and a "searchable profile" is where the conversion happens.

This is exactly the philosophy behind CVChatly. Instead of forcing candidates to manually update keywords, we leverage AI-driven tools to transform professional profiles into 24/7 recruiter-ready showcases. By combining structured data extraction with a conversational AI interface, we eliminate the friction of traditional applications.

If you are looking to scale your own AI-driven product or need strategic guidance on transforming a technical vision into a compliant, market-ready MVP, I offer consultancy focused on high-scale architecture and regulatory engineering.

Explore the future of AI-driven career tools at https://www.cvchatly.com.


About the Author: Maria José González Antelo is a CPO and ICT Project Director with 20+ years of experience in enterprise architecture and AI product strategy. She specializes in scaling high-traffic platforms and implementing complex compliance frameworks like GDPR and the UK Online Safety Act.
