
LangChain: Data Extraction Agent

Extracts structured data from unstructured documents like invoices, resumes, and reports.

Setup time: ~10 min
Model: GPT-4o
Cost: ~$0.15/day
Last updated: Mar 16, 2026
by Runbooks Community (contributor)

Template

agent.py
# Install: pip install langchain langchain-openai pydantic

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from typing import Optional
import json
import sys
import os

# --- Define extraction schemas ---
class Invoice(BaseModel):
    vendor: str = Field(description="Company or person who sent the invoice")
    invoice_number: Optional[str] = Field(default=None, description="Invoice or reference number")
    date: str = Field(description="Invoice date in YYYY-MM-DD format")
    due_date: Optional[str] = Field(default=None, description="Payment due date in YYYY-MM-DD format")
    total: float = Field(description="Total amount due")
    currency: str = Field(default="USD", description="Currency code")
    line_items: list[dict] = Field(default_factory=list, description="List of line items with description and amount")

class Contact(BaseModel):
    name: str = Field(description="Full name")
    email: Optional[str] = Field(default=None, description="Email address")
    phone: Optional[str] = Field(default=None, description="Phone number")
    company: Optional[str] = Field(default=None, description="Company name")
    role: Optional[str] = Field(default=None, description="Job title or role")

# --- Extraction chain ---
llm = ChatOpenAI(model="gpt-4o", temperature=0)

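def extract_data(text: str, schema_class: type[BaseModel]) -> dict:
    schema_json = json.dumps(schema_class.model_json_schema(), indent=2)
    # Pass the schema as a template variable instead of interpolating it with an
    # f-string: the JSON schema is full of braces, which ChatPromptTemplate would
    # otherwise try to parse as placeholders.
    prompt = ChatPromptTemplate.from_messages([
        ("system", """Extract structured data from the text. Return only valid JSON, with no Markdown code fences, matching this schema:
{schema_json}

If a field is not found, use null. Be precise with numbers and dates."""),
        ("human", "Text to extract from:\n\n{text}"),
    ])
    chain = prompt | llm
    result = chain.invoke({"schema_json": schema_json, "text": text})
    parsed = json.loads(result.content)
    # Validate against the schema before returning a plain dict.
    return schema_class(**parsed).model_dump()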

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python agent.py <invoice|contact> <file.txt>")
        sys.exit(1)
    schema_map = {"invoice": Invoice, "contact": Contact}
    schema = schema_map.get(sys.argv[1])
    if not schema:
        print(f"Unknown schema: {sys.argv[1]}. Use: {list(schema_map.keys())}")
        sys.exit(1)
    with open(sys.argv[2], encoding="utf-8") as f:
        text = f.read()
    result = extract_data(text, schema)
    print(json.dumps(result, indent=2))
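
The `json.loads(result.content)` call in `extract_data` assumes the model reply is bare JSON. Chat models sometimes wrap JSON in a Markdown code fence or add a sentence around it, in which case `json.loads` raises. A minimal sketch of a more tolerant parser follows. `parse_json_reply` is a hypothetical helper, not part of LangChain or this template, and it only covers the simple case of a single JSON object in the reply.

```python
import json


def parse_json_reply(reply: str) -> dict:
    """Parse a JSON object from a model reply, tolerating code fences or stray prose.

    Hypothetical helper for illustration; not part of LangChain or the template.
    """
    text = reply.strip()
    try:
        parsed = json.loads(text)  # fast path: the reply is already bare JSON
    except json.JSONDecodeError:
        # Fall back to the span between the first "{" and the last "}".
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model reply")
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed


if __name__ == "__main__":
    fenced = '```json\n{"vendor": "Acme", "total": 12.5}\n```'
    print(parse_json_reply(fenced))
```

If this approach suits your inputs, `extract_data` could call `parse_json_reply(result.content)` in place of `json.loads(result.content)`; Pydantic validation would still run afterwards.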

Setup

  1. Copy the agent.py content above.

  2. Create a Python virtual environment and install dependencies.

  3. Set your OPENAI_API_KEY environment variable.

  4. Run: python agent.py <invoice|contact> <file.txt>

Run with LangChain

This is a Python script using LangChain. Set up a virtual environment and install dependencies.

# 1. Create a virtual environment
python -m venv venv && source venv/bin/activate

# 2. Install dependencies
pip install langchain langchain-openai pydantic

# 3. Set your API key
export OPENAI_API_KEY="sk-..."

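# 4. Save the agent.py from above and run it with a schema name and a text file
#    (invoice.txt is a placeholder for your own file)
python agent.py invoice invoice.txt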
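
The script needs a plain-text input file to run against. If you have none at hand, the sketch below writes a small one. The invoice text, the `write_sample` helper, and the file name `sample_invoice.txt` are all invented for illustration.

```python
from pathlib import Path

# Invented sample text for illustration only; not real invoice data.
SAMPLE_INVOICE = """\
INVOICE #INV-0001
From: Example Supplies Ltd.
Date: 2026-03-01
Due: 2026-03-31

Widgets x 10    50.00
Shipping        10.00

Total due: 60.00 USD
"""


def write_sample(path: Path) -> Path:
    """Write the sample invoice text to `path` and return the path."""
    path.write_text(SAMPLE_INVOICE, encoding="utf-8")
    return path


if __name__ == "__main__":
    print(f"Wrote {write_sample(Path('sample_invoice.txt'))}")
```

Running `python agent.py invoice sample_invoice.txt` afterwards should print the extracted fields as JSON; the exact values depend on the model.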

Version History

v1.0.0: Initial release, Mar 16, 2026

Framework

LangChain

Requirements

Python 3.10+
OpenAI API key
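
Both requirements can be checked before the first run. The following is a rough pre-flight sketch using only the standard library; `check_requirements` is a hypothetical helper, and it only checks that `OPENAI_API_KEY` is set, not that the key is valid.

```python
import os
import sys


def check_requirements(env=None, version=None) -> list:
    """Return a list of human-readable problems; an empty list means ready to run.

    `env` and `version` default to the real environment and interpreter version;
    they are parameters only so the check can be exercised without changing either.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(f"Problem: {problem}")
```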

