Object Storage Integration Guide¶
Overview¶
This guide covers integrating S3/MinIO-compatible object storage into the backend document controller. With the integration, documents are stored in cloud object storage instead of on the local filesystem, which improves scalability, reliability, and cost efficiency.
Architecture¶
Components¶
- ObjectStorageService - High-level document storage abstraction
- StorageProvider - Abstract base class for storage implementations
- S3StorageProvider - AWS S3/MinIO compatible implementation
- LocalStorageProvider - Local filesystem implementation (fallback)
- ObjectStorageFactory - Factory for creating and managing storage providers
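To illustrate how the provider abstraction fits together, here is a minimal sketch of an abstract base class with a filesystem implementation. Only `upload_file` appears elsewhere in this guide (in the unit test); `download_file` and `delete_file`, and all signatures, are assumptions made for illustration and may differ from the real classes.

```python
import asyncio
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageProvider(ABC):
    """Interface each backend implements (method names partly assumed)."""

    @abstractmethod
    async def upload_file(self, path: str, data: bytes) -> str: ...

    @abstractmethod
    async def download_file(self, path: str) -> bytes: ...

    @abstractmethod
    async def delete_file(self, path: str) -> None: ...


class LocalStorageProvider(StorageProvider):
    """Filesystem backend; an S3 backend would implement the same methods."""

    def __init__(self, base_path: str) -> None:
        self.base = Path(base_path)

    async def upload_file(self, path: str, data: bytes) -> str:
        target = self.base / path
        target.parent.mkdir(parents=True, exist_ok=True)
        # Run blocking file I/O in a worker thread so the event loop stays free.
        await asyncio.to_thread(target.write_bytes, data)
        return str(target)

    async def download_file(self, path: str) -> bytes:
        return await asyncio.to_thread((self.base / path).read_bytes)

    async def delete_file(self, path: str) -> None:
        await asyncio.to_thread((self.base / path).unlink)


async def demo() -> bytes:
    """Round-trip one object through the provider interface."""
    with tempfile.TemporaryDirectory() as tmp:
        provider: StorageProvider = LocalStorageProvider(tmp)
        await provider.upload_file("documents/a.json", b"{}")
        data = await provider.download_file("documents/a.json")
        await provider.delete_file("documents/a.json")
        return data
```

Because callers depend only on the base class, switching between local and S3 storage becomes a configuration decision rather than a code change.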
Key Features¶
- ✅ Abstract storage layer supporting multiple backends
- ✅ Hybrid encryption: AES for documents + RSA for keys
- ✅ Presigned URLs for secure file access (S3 only)
- ✅ Hierarchical storage organization
- ✅ Backward compatibility with local filesystem
- ✅ Graceful fallback mechanisms
- ✅ Migration utilities for existing documents
Configuration¶
Environment Variables¶
# Storage type: "s3" or "local"
STORAGE_TYPE=s3
# S3 Configuration
S3_ACCESS_KEY=your_access_key
S3_SECRET_KEY=your_secret_key
S3_BUCKET=bank-documents
S3_REGION=us-east-1
# Optional; set a custom endpoint for MinIO
S3_ENDPOINT_URL=https://s3.amazonaws.com
# Local Storage Configuration (fallback)
LOCAL_STORAGE_PATH=encrypted_files/
.env Example¶
# For AWS S3
STORAGE_TYPE=s3
S3_ACCESS_KEY=AKIA1234567890ABCDEF
S3_SECRET_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
S3_BUCKET=bank-documents
S3_REGION=us-east-1
# For MinIO
STORAGE_TYPE=s3
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
S3_BUCKET=bank-documents
S3_ENDPOINT_URL=http://minio:9000
# Fallback to local storage
STORAGE_TYPE=local
LOCAL_STORAGE_PATH=encrypted_files/
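The variables above could be read and validated at startup along these lines. This is a sketch, not the project's actual settings module: `StorageSettings` and `load_storage_settings` are hypothetical names, and the defaults (local storage when `STORAGE_TYPE` is unset, `us-east-1` as region) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class StorageSettings:
    storage_type: str
    s3_access_key: Optional[str]
    s3_secret_key: Optional[str]
    s3_bucket: Optional[str]
    s3_region: str
    s3_endpoint_url: Optional[str]
    local_storage_path: str


def load_storage_settings(env: Mapping[str, str] = os.environ) -> StorageSettings:
    """Build settings from environment variables and fail fast on bad input."""
    storage_type = env.get("STORAGE_TYPE", "local").lower()  # assumed default
    if storage_type not in ("s3", "local"):
        raise ValueError(f"STORAGE_TYPE must be 's3' or 'local', got {storage_type!r}")

    settings = StorageSettings(
        storage_type=storage_type,
        s3_access_key=env.get("S3_ACCESS_KEY"),
        s3_secret_key=env.get("S3_SECRET_KEY"),
        s3_bucket=env.get("S3_BUCKET"),
        s3_region=env.get("S3_REGION", "us-east-1"),  # assumed default
        s3_endpoint_url=env.get("S3_ENDPOINT_URL") or None,  # optional
        local_storage_path=env.get("LOCAL_STORAGE_PATH", "encrypted_files/"),
    )

    if storage_type == "s3":
        required = {
            "S3_ACCESS_KEY": settings.s3_access_key,
            "S3_SECRET_KEY": settings.s3_secret_key,
            "S3_BUCKET": settings.s3_bucket,
        }
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError("Missing required S3 settings: " + ", ".join(missing))
    return settings
```

Validating at startup surfaces a missing bucket or key immediately instead of on the first upload.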
Database Schema Changes¶
New Columns on document.attachment¶
ALTER TABLE document.attachment ADD COLUMN storage_path VARCHAR(500);
ALTER TABLE document.attachment ADD COLUMN storage_type VARCHAR(50) DEFAULT 'local';
New Columns on document.attachment_staging¶
ALTER TABLE document.attachment_staging ADD COLUMN storage_path VARCHAR(500);
ALTER TABLE document.attachment_staging ADD COLUMN storage_type VARCHAR(50) DEFAULT 'local';
Usage¶
Initialization in Application¶
from core.config import settings
from services.object_storage_service import init_storage_service
# Initialize storage service during application startup
config = {
    "STORAGE_TYPE": settings.STORAGE_TYPE,
    "S3_ACCESS_KEY": settings.S3_ACCESS_KEY,
    "S3_SECRET_KEY": settings.S3_SECRET_KEY,
    "S3_BUCKET": settings.S3_BUCKET,
    "S3_REGION": settings.S3_REGION,
    "S3_ENDPOINT_URL": settings.S3_ENDPOINT_URL,
    "LOCAL_STORAGE_PATH": settings.LOCAL_STORAGE_PATH,
}
storage_service = init_storage_service(config)
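The pairing of `init_storage_service` at startup with `get_storage_service` elsewhere suggests a module-level singleton. The sketch below shows one way such a pair might work; the `ObjectStorageService` body here is a stand-in, and the real module may be organized differently.

```python
from typing import Any, Mapping, Optional


class ObjectStorageService:
    """Stand-in for the real service; it only remembers the chosen backend."""

    def __init__(self, config: Mapping[str, Any]) -> None:
        self.storage_type = config.get("STORAGE_TYPE", "local")


# Process-wide instance, created once during application startup.
_storage_service: Optional[ObjectStorageService] = None


def init_storage_service(config: Mapping[str, Any]) -> ObjectStorageService:
    """Create the shared service and remember it for later lookups."""
    global _storage_service
    _storage_service = ObjectStorageService(config)
    return _storage_service


def get_storage_service() -> ObjectStorageService:
    """Return the shared service, failing loudly if startup skipped init."""
    if _storage_service is None:
        raise RuntimeError(
            "Storage service not initialized; call init_storage_service() first"
        )
    return _storage_service
```

Raising on an uninitialized lookup makes a missing startup call visible at the first request rather than as a confusing `None` error later.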
In Document Controller¶
Files are automatically stored in object storage when uploaded:
# Upload automatically uses object storage
@r.post("/file/add")
async def upload_file(request: CustomRequest, payload: FileUploadRequest):
    return await upload_staged_file(
        request.state.db_session,
        request.state.current_user,
        payload,
    )
Direct Usage¶
from services.object_storage_service import get_storage_service
storage_service = get_storage_service()
# Store a document
storage_path = await storage_service.store_encrypted_document(
    file_id=123,
    filename="document.pdf",
    encrypted_data={"ciphertext": "...", "tag": "..."},  # remaining fields elided
    user_id=456,
)
# Retrieve a document
encrypted_data = await storage_service.retrieve_encrypted_document(storage_path)
# Delete a document
await storage_service.delete_document(storage_path)
# Get presigned URL (S3 only)
url = storage_service.get_download_url(storage_path, expiration_hours=24)
File Organization¶
Storage Path Hierarchy¶
documents/
├── user_123/
│   ├── 2024/01/15/
│   │   ├── 1_passport.pdf_pdf.json
│   │   └── 2_license.pdf_pdf.json
│   └── 2024/01/16/
│       └── 3_visa.pdf_pdf.json
├── user_456/
│   └── 2024/01/15/
│       └── 4_certificate.pdf_pdf.json
└── system/
    └── 2024/01/15/
        └── 5_template.pdf_pdf.json
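The hierarchy above can be reproduced by a small path builder. This is a sketch inferred from the example paths only: `build_storage_path` is a hypothetical name, and both the use of `system` when no user is given and the `bin` fallback for files without an extension are assumptions.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


def build_storage_path(
    file_id: int, filename: str, user_id: Optional[int], day: date
) -> str:
    """Return documents/<owner>/<YYYY>/<MM>/<DD>/<id>_<filename>_<ext>.json."""
    # Assumption: documents without an owning user are filed under "system".
    owner = f"user_{user_id}" if user_id is not None else "system"
    # Assumption: "bin" is used when the filename has no extension.
    extension = PurePosixPath(filename).suffix.lstrip(".") or "bin"
    object_name = f"{file_id}_{filename}_{extension}.json"
    return f"documents/{owner}/{day:%Y/%m/%d}/{object_name}"
```

For example, `build_storage_path(1, "passport.pdf", 123, date(2024, 1, 15))` yields `documents/user_123/2024/01/15/1_passport.pdf_pdf.json`, matching the first entry in the tree.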
Migration from Local to S3¶
Prerequisites¶
- ✅ boto3 installed (included in requirements.txt)
- ✅ AWS S3 bucket created and credentials configured
- ✅ S3 IAM permissions: s3:PutObject, s3:GetObject, s3:DeleteObject, s3:ListBucket
Running Migration¶
# Migrate to S3
python -m scripts.migrate_documents_to_storage s3
# Migrate to local storage (fallback)
python -m scripts.migrate_documents_to_storage local
Migration Script Features¶
- 📊 Batch processing to avoid memory issues
- ✅ Progress reporting
- 🔍 Verification of migration integrity
- ❌ Error handling and logging
- ⏭️ Skips already migrated files
Migration Report Example¶
==================================================
📋 MIGRATION REPORT
==================================================
Total files: 1523
✅ Successful: 1521
❌ Failed: 2
⏭️ Skipped: 0
❌ Errors:
- File not found: /path/to/missing_file.json
- Attachment 456: Connection timeout
==================================================
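The batch loop behind such a report might look like the sketch below. It is not the actual migration script: `Attachment`, `MigrationReport`, and `migrate` are hypothetical names, a plain dict stands in for the target backend, and the object key format is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional, Sequence


@dataclass
class Attachment:
    id: int
    filepath: str
    storage_path: Optional[str] = None  # set once the file has been migrated


@dataclass
class MigrationReport:
    total: int = 0
    successful: int = 0
    failed: int = 0
    skipped: int = 0
    errors: List[str] = field(default_factory=list)


def batched(items: Sequence[Attachment], size: int) -> Iterator[Sequence[Attachment]]:
    """Yield fixed-size slices so only one batch is processed at a time."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def migrate(
    attachments: Sequence[Attachment],
    read_local: Callable[[str], bytes],
    target: Dict[str, bytes],
    batch_size: int = 100,
) -> MigrationReport:
    report = MigrationReport(total=len(attachments))
    for batch in batched(attachments, batch_size):
        for att in batch:
            if att.storage_path is not None:
                report.skipped += 1  # already migrated
                continue
            try:
                data = read_local(att.filepath)
                key = f"documents/{att.id}.json"  # placeholder key format
                target[key] = data
                # Verify by reading back; a real backend would re-download
                # the object or compare checksums.
                if target[key] != data:
                    raise IOError("verification mismatch")
                att.storage_path = key
                report.successful += 1
            except Exception as exc:
                report.failed += 1
                report.errors.append(f"Attachment {att.id}: {exc}")
        done = report.successful + report.failed + report.skipped
        print(f"Processed {done}/{report.total}")
    return report
```

Catching errors per attachment lets one bad file be recorded in the report without aborting the rest of the batch.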
Error Handling¶
Graceful Fallback¶
The system includes automatic fallback mechanisms:
try:
    # Try to retrieve from object storage
    encrypted_data = await storage_service.retrieve_encrypted_document(storage_path)
except Exception as e:
    logger.warning(f"Failed to retrieve from storage: {e}")
    # Fallback to local filesystem
    if os.path.exists(attachment.filepath):
        with open(attachment.filepath, "rb") as f:
            encrypted_data = bytes_to_dict(f.read())
    else:
        # No local copy either: re-raise so encrypted_data is never left undefined
        raise
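The fallback pattern can be exercised end to end with a stand-in service that always fails. In this sketch `UnavailableStorage` and `retrieve_with_fallback` are hypothetical names, and `json.loads` stands in for the project's `bytes_to_dict` helper, whose real behavior is not shown in this guide.

```python
import asyncio
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


class UnavailableStorage:
    """Stand-in for a storage service whose backend is down."""

    async def retrieve_encrypted_document(self, storage_path: str) -> dict:
        raise ConnectionError("Connection refused")


async def retrieve_with_fallback(storage_service, storage_path: str, local_path: str) -> dict:
    """Try object storage first, then the local copy; fail if both are missing."""
    try:
        return await storage_service.retrieve_encrypted_document(storage_path)
    except Exception as exc:
        logger.warning(
            "Failed to retrieve from object storage, falling back to local: %s", exc
        )
        if not os.path.exists(local_path):
            # Surface the failure instead of returning an undefined result.
            raise FileNotFoundError(f"No local copy at {local_path}") from exc
        with open(local_path, "rb") as f:
            return json.loads(f.read())  # assumption: stand-in for bytes_to_dict
```

Chaining the `FileNotFoundError` from the original exception keeps the storage error in the traceback when neither copy is available.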
Performance Considerations¶
Upload Performance¶
- Local Storage: ~10-50ms per file
- S3 (same region): ~50-200ms per file
- S3 (cross-region): ~200-500ms per file
Optimization Tips¶
- Use multi-part upload for large files
- Batch uploads when possible
- Enable S3 Transfer Acceleration when clients upload from far outside the bucket's region
- Use CloudFront CDN for frequently accessed documents
- Consider S3 Intelligent-Tiering for cost optimization
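As one way to act on the batching tip, uploads can be issued concurrently with a cap on how many are in flight. The sketch below assumes an async `upload(path, data)` callable shaped like the provider's `upload_file`; `upload_many` and `max_concurrency` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def upload_many(
    files: Dict[str, bytes],
    upload: Callable[[str, bytes], Awaitable[str]],
    max_concurrency: int = 8,
) -> List[str]:
    """Upload all files concurrently, never exceeding max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def upload_one(path: str, data: bytes) -> str:
        # The semaphore bounds in-flight requests so a large batch does not
        # exhaust connections or trigger throttling.
        async with semaphore:
            return await upload(path, data)

    # gather() returns results in the same order the uploads were submitted.
    return await asyncio.gather(
        *(upload_one(path, data) for path, data in files.items())
    )
```

The limit of 8 is an arbitrary starting point; a suitable value depends on file sizes and the backend's rate limits.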
Security Best Practices¶
IAM Permissions (AWS S3)¶
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::bank-documents",
"arn:aws:s3:::bank-documents/*"
]
}
]
}
Encryption¶
- All documents are encrypted at rest using AES-256 in GCM mode
- AES keys are protected using RSA-2048 encryption
- Each document has a unique encryption key and nonce
Access Control¶
- Use presigned URLs with time-limited expiration (default 24 hours)
- Store credentials in environment variables or AWS Secrets Manager
- Enable S3 bucket versioning so that overwritten or deleted documents remain recoverable and their history can be reviewed
- Enable S3 MFA Delete for critical buckets
Monitoring and Logging¶
Key Metrics to Monitor¶
- Upload/download success rate
- Average response time per operation
- S3 API call volume
- Storage costs
- Failed migration attempts
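The first two metrics can be derived from per-operation timings recorded in process. This is a minimal in-memory sketch; `OperationMetrics` is a hypothetical helper, and a production setup would more likely export these values to an existing monitoring system.

```python
from collections import defaultdict
from typing import Dict, List


class OperationMetrics:
    """Collects per-operation outcomes in memory."""

    def __init__(self) -> None:
        self._durations_ms: Dict[str, List[float]] = defaultdict(list)
        self._successes: Dict[str, int] = defaultdict(int)

    def record(self, operation: str, duration_ms: float, success: bool) -> None:
        """Store one observation for an operation such as 'upload'."""
        self._durations_ms[operation].append(duration_ms)
        if success:
            self._successes[operation] += 1

    def success_rate(self, operation: str) -> float:
        """Fraction of recorded calls that succeeded (0.0 if none recorded)."""
        total = len(self._durations_ms[operation])
        return self._successes[operation] / total if total else 0.0

    def average_ms(self, operation: str) -> float:
        """Mean duration across all recorded calls (0.0 if none recorded)."""
        durations = self._durations_ms[operation]
        return sum(durations) / len(durations) if durations else 0.0
```

Recording failures alongside successes keeps the average response time honest when timeouts dominate.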
Log Examples¶
✅ S3 storage initialized: bucket=bank-documents, endpoint=None
✅ File uploaded to S3: s3://bank-documents/documents/user_123/2024/01/15/1_passport.pdf_pdf.json
✅ File downloaded from S3: documents/user_123/2024/01/15/1_passport.pdf_pdf.json
✅ Presigned URL generated: documents/user_123/2024/01/15/1_passport.pdf_pdf.json
❌ Failed to connect to S3: Connection refused
⚠️ Failed to retrieve from object storage, falling back to local
Troubleshooting¶
S3 Connection Issues¶
# Error: "Unable to locate credentials"
# Solution: Set the S3_ACCESS_KEY and S3_SECRET_KEY environment variables described under Configuration
# Error: "NoSuchBucket"
# Solution: Ensure bucket exists and is accessible with provided credentials
# Error: "AccessDenied"
# Solution: Check IAM permissions for the user/role
Migration Issues¶
# Error: "File not found on server"
# Solution: Verify LOCAL_STORAGE_PATH and check if files exist
# Error: "Connection timeout"
# Solution: Check S3 endpoint URL, network connectivity, and credentials
# Error: "Botocore parsing failed"
# Solution: Ensure S3_REGION is valid (e.g., us-east-1, eu-west-1)
Testing¶
Unit Tests¶
import pytest
from services.object_storage_service import (
    LocalStorageProvider,
    S3StorageProvider,
    ObjectStorageFactory,
)

@pytest.mark.asyncio
async def test_local_storage_upload():
    provider = LocalStorageProvider("test_storage/")
    path = await provider.upload_file(
        "test.txt",
        b"test content"
    )
    assert path.endswith("test.txt")

@pytest.mark.asyncio
async def test_s3_storage_fallback():
    # Test graceful fallback when S3 is unavailable
    ...
Cost Estimation¶
AWS S3 Pricing Example (us-east-1)¶
- Upload: $0.005 per 1,000 requests
- Download: $0.0004 per 1,000 requests
- Storage: $0.023 per GB/month
- Transfer out: $0.09 per GB
For 1,000 documents (100 MB total):
- Monthly storage: ~$0.0023 (0.1 GB × $0.023)
- Monthly API calls: ~$0.0005 (100 uploads + 100 downloads)
- Total: under $0.01/month
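For other volumes, the listed prices can be plugged into a small calculator. The sketch below uses only the four example rates above; `estimate_monthly_cost` is a hypothetical helper, and actual AWS pricing varies by region and changes over time.

```python
from typing import Dict

# Example us-east-1 rates taken from the list above.
PRICE_PER_1K_UPLOADS = 0.005
PRICE_PER_1K_DOWNLOADS = 0.0004
PRICE_PER_GB_MONTH = 0.023
PRICE_PER_GB_TRANSFER_OUT = 0.09


def estimate_monthly_cost(
    storage_gb: float,
    uploads: int,
    downloads: int,
    transfer_out_gb: float = 0.0,
) -> Dict[str, float]:
    """Return a per-category monthly estimate in USD, rounded to 4 decimals."""
    costs = {
        "storage": storage_gb * PRICE_PER_GB_MONTH,
        "requests": (uploads / 1000) * PRICE_PER_1K_UPLOADS
        + (downloads / 1000) * PRICE_PER_1K_DOWNLOADS,
        "transfer": transfer_out_gb * PRICE_PER_GB_TRANSFER_OUT,
    }
    costs["total"] = sum(costs.values())
    return {name: round(value, 4) for name, value in costs.items()}
```

As an illustration, `estimate_monthly_cost(storage_gb=100, uploads=1000, downloads=1000)` puts storage at about $2.30 and requests at well under one cent, which shows that stored volume dominates the bill at these request counts.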
Future Enhancements¶
- [ ] Support for Google Cloud Storage (GCS)
- [ ] Support for Azure Blob Storage
- [ ] Multi-cloud failover
- [ ] Automatic document expiration and cleanup
- [ ] Compression before upload
- [ ] Parallel upload/download
- [ ] Progress tracking for large files
- [ ] Bulk operations API
- [ ] Document version control
- [ ] Access audit logging
Support & Questions¶
For issues or questions about object storage integration:
1. Check the logs: tail -f logs/*.log
2. Review migration report: python -m scripts.migrate_documents_to_storage --verify
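3. Test connectivity (config is undefined in this one-liner as written; build it first as shown under "Initialization in Application"): python -c "from services.object_storage_service import ObjectStorageFactory; ObjectStorageFactory.initialize(config)"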