MiruIQ Hub
A Quarkus-based REST API for intelligent document data extraction powered by LLMs. Upload documents, define extraction schemas, and get structured data back.
Features
- Document Extraction - Upload PDFs/images and extract structured data using AI
- Schema Management - Create, store, and reuse extraction schemas with `$ref` support
- AI Schema Generation - Generate schemas from natural language descriptions or sample documents
- Data Validation - Validate extracted data with text, numeric, and semantic validation strategies
- Multi-tenant - JWT and API key authentication with per-user rate limiting
Tech Stack
- Framework: Quarkus 3.19
- Language: Java 17
- Storage: Apache Paimon (data lake), MinIO/S3 (files), PostgreSQL (metadata)
- AI: OpenAI-compatible API (GPT-4, local models via OpenRouter, etc.)
- PDF Processing: Apache PDFBox
Quick Start
Prerequisites
- Java 17+
- Docker (for MinIO and PostgreSQL)
- An OpenAI-compatible API key
Configuration
Set the following environment variables:
# Required - OpenAI Configuration
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_MODEL="gpt-4o-mini"
# Optional - Database (defaults to local Docker)
export JDBC_URL="localhost:5432"
export JDBC_DATABASE="document_store"
export JDBC_USERNAME="paimon"
export JDBC_PASSWORD="paimon"
# Optional - S3/MinIO (defaults to local MinIO)
export S3_ENDPOINT="http://localhost:9000"
export S3_BUCKET="media-store"
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"
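Since only the three OpenAI variables are marked as required, a small pre-flight check can catch a missing one before startup. The following is a minimal sketch; `check_required_env` is a hypothetical helper, not something shipped with the project.

```shell
# Hypothetical helper (not part of the project): fail fast when a required
# variable from the list above is unset or empty.
check_required_env() {
  missing=0
  for name in OPENAI_API_KEY OPENAI_BASE_URL OPENAI_MODEL; do
    # Read the variable whose name is stored in $name (POSIX-compatible indirection).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is `check_required_env && ./mvnw quarkus:dev`, so the application only starts when all three values are present.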
Run in Development Mode
./mvnw quarkus:dev
The API will be available at http://localhost:8082.
Run with Docker
docker build -t miruiq-hub .
docker run -p 8082:8082 \
-e OPENAI_API_KEY="your-key" \
-e OPENAI_BASE_URL="https://api.openai.com/v1" \
-e OPENAI_MODEL="gpt-4o-mini" \
miruiq-hub
API Reference
Authentication
All endpoints require authentication via:
- `X-API-Key` header, or
- `Authorization` header (Bearer token)
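As a sketch of how a client script might choose between the two mechanisms, the helper below prints whichever header applies. The variable names `MIRUIQ_TOKEN` and `MIRUIQ_API_KEY` and the function `auth_header` are assumptions made for illustration; preferring the bearer token when both are set is also an arbitrary choice.

```shell
# Hypothetical helper: print the header line to pass to `curl -H`.
# Uses a bearer token when MIRUIQ_TOKEN is set, otherwise an API key.
auth_header() {
  if [ -n "${MIRUIQ_TOKEN:-}" ]; then
    printf 'Authorization: Bearer %s\n' "$MIRUIQ_TOKEN"
  elif [ -n "${MIRUIQ_API_KEY:-}" ]; then
    printf 'X-API-Key: %s\n' "$MIRUIQ_API_KEY"
  else
    echo "set MIRUIQ_TOKEN or MIRUIQ_API_KEY" >&2
    return 1
  fi
}
```

A request would then look like `curl -H "$(auth_header)" http://localhost:8082/schemas`.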
Endpoints
Extractions
| Method | Endpoint | Description |
|---|---|---|
| POST | `/extractions` | Create extraction job (multipart: schema + documents) |
| GET | `/extractions/{id}` | Get extraction status and results |
| GET | `/extractions?label={label}` | Find extraction by label |
| POST | `/extractions/{id}/validate` | Validate extraction results |
| GET | `/extractions/{id}/validation` | Get validation results |
Schemas
| Method | Endpoint | Description |
|---|---|---|
| POST | `/schemas` | Create schema template |
| GET | `/schemas` | List all schemas |
| GET | `/schemas/{id}` | Get schema by ID |
| GET | `/schemas/{id}/resolved` | Get schema with `$ref` expanded |
| GET | `/schemas/by-name/{name}` | Get schema by name |
| PUT | `/schemas/{id}` | Update schema |
| DELETE | `/schemas/{id}` | Delete schema |
| POST | `/schemas/generate` | Generate schema from description/images |
Validation
| Method | Endpoint | Description |
|---|---|---|
| POST | `/validation/validate/{requestId}` | Validate by request ID |
| POST | `/validation/validate/by-label/{label}` | Validate by label |
| POST | `/validation/validate/multi` | Validate multiple extractions |
| GET | `/validation/results/{requestId}` | Get validation results |
Example: Create an Extraction
curl -X POST http://localhost:8082/extractions \
-H "X-API-Key: your-api-key" \
-F 'schema={
"type": "object",
"properties": {
"invoice_number": { "type": "string" },
"total_amount": { "type": "number" },
"date": { "type": "string", "format": "date" }
}
}' \
-F "documents=@invoice.pdf" \
-F "label=invoice-001"
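Because the request above sets a label, the job can later be found through `GET /extractions?label={label}`. The sketch below wraps that lookup; `find_extraction_by_label`, `MIRUIQ_BASE_URL`, `MIRUIQ_API_KEY`, and the `DRY_RUN` switch are hypothetical names introduced for illustration, and the response format is not documented here, so the output is left unparsed.

```shell
# Assumed default; MIRUIQ_BASE_URL and MIRUIQ_API_KEY are hypothetical names.
MIRUIQ_BASE_URL="${MIRUIQ_BASE_URL:-http://localhost:8082}"

# Look up an extraction by label. With DRY_RUN=1 the curl command is printed
# instead of being sent, which is handy for checking the request first.
find_extraction_by_label() {
  # -G makes curl send the --data-urlencode pair as a query string on a GET.
  set -- curl -sS -G "$MIRUIQ_BASE_URL/extractions" \
    -H "X-API-Key: ${MIRUIQ_API_KEY:-}" \
    --data-urlencode "label=$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Calling `find_extraction_by_label invoice-001` sends the request. Letting curl encode the label avoids hand-escaping labels that contain spaces or other reserved characters.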
Example: Generate a Schema with AI
curl -X POST http://localhost:8082/schemas/generate \
-H "X-API-Key: your-api-key" \
-F "description=Extract customer name, order items with quantities and prices, and total amount" \
-F "images=@sample-receipt.jpg"
Testing
# Run unit tests only (default)
./mvnw test
# Run unit + LLM tests (requires a running LLM server)
./mvnw test -Pwith-llm
# Run all tests (requires LLM + Flink pipeline)
./mvnw test -Pfull
Configuration Reference
| Property | Default | Description |
|---|---|---|
| `HTTP_PORT` | 8082 | Main API port |
| `PDF_MAX_PAGES` | 100 | Maximum pages per PDF |
| `MAX_FILES_PER_REQUEST` | 10 | Maximum documents per extraction request |
| `RATE_LIMIT_PER_USER_PER_MINUTE` | 1000 | Maximum requests per user per minute |
| `HTTP_MAX_BODY_SIZE` | 100M | Maximum upload size |
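A client can check a batch against the file-count and upload-size defaults before sending it. The sketch below is only an approximation: `check_upload` is a hypothetical helper, it assumes `100M` means 100 MiB, and it ignores the small overhead that multipart encoding adds to the request body.

```shell
# Defaults copied from the table above; reading "100M" as 100 MiB is an assumption.
MAX_FILES="${MAX_FILES_PER_REQUEST:-10}"
MAX_BYTES=$((100 * 1024 * 1024))

# Hypothetical helper: check_upload FILE... returns non-zero when the batch
# exceeds the file-count limit or the total-size limit.
check_upload() {
  if [ "$#" -gt "$MAX_FILES" ]; then
    echo "too many files: $# (limit $MAX_FILES)" >&2
    return 1
  fi
  total=0
  for f in "$@"; do
    size=$(wc -c < "$f") || return 1   # byte count of one file
    total=$((total + size))
  done
  if [ "$total" -gt "$MAX_BYTES" ]; then
    echo "upload too large: $total bytes (limit $MAX_BYTES)" >&2
    return 1
  fi
}
```

For example, `check_upload invoice.pdf receipt.pdf` succeeds silently when both limits are respected and prints the reason to stderr otherwise.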
License
Proprietary - MiruIQ