Architecture
Technical architecture and design decisions for PutPlace.
System Overview
PutPlace is a distributed system for storing file metadata and deduplicating file content, built in Python on FastAPI, Pydantic, and MongoDB.
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│                           Clients                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ Server   │  │ Server   │  │ Server   │  │ Laptop   │     │
│  │    A     │  │    B     │  │    C     │  │          │     │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘     │
└───────┼─────────────┼─────────────┼─────────────┼───────────┘
        │             │             │             │
        │    HTTPS + X-API-Key Authentication     │
        │             │             │             │
        └─────────────┴──────┬──────┴─────────────┘
                             │
               ┌─────────────▼─────────────┐
               │       PutPlace API        │
               │         (FastAPI)         │
               │                           │
               │  ┌─────────────────────┐  │
               │  │   Authentication    │  │
               │  │     (API Keys)      │  │
               │  └─────────────────────┘  │
               │  ┌─────────────────────┐  │
               │  │    File Metadata    │  │
               │  │     Processing      │  │
               │  └─────────────────────┘  │
               │  ┌─────────────────────┐  │
               │  │    Deduplication    │  │
               │  │        Logic        │  │
               │  └─────────────────────┘  │
               └────┬─────────────────┬────┘
                    │                 │
           ┌────────▼────────┐   ┌────▼────────┐
           │     MongoDB     │   │   Storage   │
           │   (Metadata)    │   │   Backend   │
           └─────────────────┘   └────┬────────┘
                                      │
                               ┌──────┴──────┐
                               │             │
                          ┌────▼────┐    ┌───▼────┐
                          │  Local  │    │  AWS   │
                          │   FS    │    │   S3   │
                          └─────────┘    └────────┘
Core Components
1. FastAPI Application (main.py)
Purpose: REST API server
Key Features:
Asynchronous request handling
Automatic OpenAPI documentation
Dependency injection
Lifespan event management
Endpoints:
Health checks (/, /health)
File operations (/put_file, /upload_file/{sha256}, /get_file/{sha256})
Authentication (/api_keys/*)
Technology Stack:
FastAPI: Web framework
Uvicorn: ASGI server (development)
Gunicorn: Process manager (production)
2. Data Models (models.py)
Purpose: Pydantic models for validation and serialization
Key Models:
FileMetadata: Client file metadata
FileMetadataResponse: Server response with MongoDB ID
FileMetadataUploadResponse: Response with upload requirement info
APIKeyCreate: API key creation request
APIKeyResponse: API key with actual key (shown once)
APIKeyInfo: API key metadata (without key)
Technology Stack:
Pydantic v2: Data validation and settings
Type hints for IDE support
JSON schema generation
3. Database Layer (database.py)
Purpose: MongoDB interface with async operations
Key Features:
Async MongoDB operations using PyMongo async (native asyncio)
Connection pooling
Automatic index creation
Error handling and logging
Collections:
file_metadata: File metadata and upload status
api_keys: API key hashes and metadata
Indexes:
file_metadata: sha256 (unique), hostname + filepath (compound), has_file_content
api_keys: key_hash (unique), is_active
Technology Stack:
PyMongo Async: Native async MongoDB driver (PyMongo 4.10+)
Native asyncio implementation, chosen over the deprecated Motor library for better performance
4. Authentication (auth.py)
Purpose: API key authentication and management
Key Features:
SHA256-hashed key storage
Token generation with the secrets module
API key verification
Usage tracking (last_used_at)
Security:
Keys hashed before storage (SHA256)
64-character hex tokens (256 bits of entropy)
Constant-time comparison
Automatic timestamp updates
Technology Stack:
secrets: Cryptographically secure random generation
hashlib: SHA256 hashing
FastAPI Security: Header-based authentication
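The key-handling scheme described above can be sketched with the standard library alone. This is a minimal illustration, not the code in auth.py: the function names `generate_api_key` and `verify_api_key` are hypothetical, and the sketch assumes the stored value is the hex SHA256 digest of the plaintext key.

```python
import hashlib
import hmac
import secrets


def generate_api_key() -> tuple[str, str]:
    """Return (plaintext_key, key_hash). Only the hash should be persisted."""
    plaintext = secrets.token_hex(32)  # 32 random bytes -> 64 hex characters
    key_hash = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, key_hash


def verify_api_key(candidate: str, stored_hash: str) -> bool:
    """Hash the presented key and compare against the stored hash in constant time."""
    candidate_hash = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate_hash, stored_hash)


# The plaintext key is shown to the caller once; the server keeps only `stored`.
key, stored = generate_api_key()
```

`hmac.compare_digest` is one standard-library way to get the constant-time comparison listed under Security.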
5. Storage Backends (storage.py)
Purpose: Abstract storage with multiple backend implementations
Architecture:
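from abc import ABC, abstractmethod
from typing import Optional

class StorageBackend(ABC):
    """Interface that every storage backend implements."""

    @abstractmethod
    async def store(self, sha256: str, content: bytes) -> bool: ...

    @abstractmethod
    async def retrieve(self, sha256: str) -> Optional[bytes]: ...

    @abstractmethod
    async def exists(self, sha256: str) -> bool: ...

    @abstractmethod
    async def delete(self, sha256: str) -> bool: ...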
Implementations:
LocalStorage
Stores files in local filesystem
Directory structure: {base_path}/{sha256[:2]}/{sha256}
256 subdirectories for distribution
Async file I/O with aiofiles
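To make the directory layout concrete, here is a small sketch of how a digest could map to a path. The helper name `local_path_for` and the base path `/var/lib/putplace` are assumptions made for illustration; validating the digest before building the path is a precaution added here, not something the source states.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def local_path_for(base_path: PurePosixPath, sha256: str) -> PurePosixPath:
    """Map a SHA256 hex digest to {base_path}/{sha256[:2]}/{sha256}."""
    digest = sha256.lower()
    # Rejecting anything that is not a 64-character hex string also keeps
    # path separators and ".." out of the resulting path.
    if len(digest) != 64 or not set(digest) <= HEX_DIGITS:
        raise ValueError("expected a 64-character hex SHA256 digest")
    return base_path / digest[:2] / digest


digest = "ab" + "0" * 62
path = local_path_for(PurePosixPath("/var/lib/putplace"), digest)
```

Two hex characters give 16 × 16 = 256 possible prefixes, which is where the 256 subdirectories come from.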
S3Storage
Stores files in AWS S3
Key structure: {prefix}{sha256[:2]}/{sha256}
Supports multiple credential methods
Async S3 operations with aioboto3
Technology Stack:
aiofiles: Async file operations
aioboto3: Async AWS SDK
Abstract Base Classes: Enforces interface
6. Configuration (config.py)
Purpose: Centralized configuration management
Key Features:
Environment variable loading
.env file support
Type validation with Pydantic
Default values
Configuration Groups:
API settings
MongoDB settings
Storage settings
AWS settings (optional)
Logging settings
Technology Stack:
Pydantic Settings: Configuration management
python-dotenv: .env file loading
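The pattern of environment loading with defaults and validation can be illustrated without third-party packages. The sketch below deliberately swaps Pydantic Settings for a plain dataclass so that it runs on the standard library; every field name, environment variable name, and default value in it is hypothetical rather than taken from config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # All names and defaults below are illustrative assumptions.
    mongodb_url: str = "mongodb://localhost:27017"
    storage_backend: str = "local"
    storage_path: str = "/var/lib/putplace"
    s3_bucket_name: Optional[str] = None
    log_level: str = "INFO"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Build settings from environment variables, falling back to defaults."""
        defaults = cls()
        backend = env.get("STORAGE_BACKEND", defaults.storage_backend)
        if backend not in ("local", "s3"):
            raise ValueError(f"unsupported storage backend: {backend!r}")
        bucket = env.get("S3_BUCKET_NAME")
        if backend == "s3" and not bucket:
            raise ValueError("S3_BUCKET_NAME is required when STORAGE_BACKEND=s3")
        return cls(
            mongodb_url=env.get("MONGODB_URL", defaults.mongodb_url),
            storage_backend=backend,
            storage_path=env.get("STORAGE_PATH", defaults.storage_path),
            s3_bucket_name=bucket,
            log_level=env.get("LOG_LEVEL", defaults.log_level).upper(),
        )


settings = Settings.from_env({})  # an empty mapping yields the defaults
```

Passing the mapping explicitly, as in the last line, keeps the loader easy to test without touching the real process environment.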
7. Client (ppclient.py)
Purpose: Command-line client for file scanning
Key Features:
Recursive directory scanning
SHA256 calculation
Pattern-based exclusion
Progress display with Rich
Configuration file support
Workflow:
Scan directory for files
Calculate file metadata (SHA256, size, permissions, etc.)
Send metadata to server
Upload file content if required
Display progress and results
Technology Stack:
httpx: HTTP client with async support
rich: Terminal output formatting
configargparse: Unified CLI/env/file configuration
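The first two workflow steps (scan, then compute metadata) might look roughly like the following. This is a standard-library sketch, not the ppclient.py implementation: the function names are hypothetical, the `permissions` field name is an assumption, and exclusion patterns are matched against file names only.

```python
import fnmatch
import hashlib
import os
from pathlib import Path
from typing import Iterator


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root: Path, exclude: tuple[str, ...] = ()) -> Iterator[dict]:
    """Walk `root` recursively and yield one metadata dict per regular file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Skip files whose name matches any exclusion pattern, e.g. "*.log".
            if any(fnmatch.fnmatch(name, pattern) for pattern in exclude):
                continue
            path = Path(dirpath) / name
            stat = path.stat()
            yield {
                "filepath": str(path),
                "sha256": sha256_of_file(path),
                "size": stat.st_size,
                "permissions": stat.st_mode & 0o777,
            }
```

A call such as `list(scan(Path("."), exclude=("*.log", "*.tmp")))` would produce the metadata records that the client then sends to the server.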
Data Flow
File Upload Workflow
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       │ 1. Scan file
       │    - Calculate SHA256
       │    - Get file stats
       ▼
┌───────────────┐
│ File Metadata │
│ {filepath,    │
│  hostname,    │
│  sha256,      │
│  size, ...}   │
└──────┬────────┘
       │
       │ 2. POST /put_file
       │    Headers: X-API-Key
       ▼
┌─────────────┐
│   FastAPI   │
│   Server    │
└──────┬──────┘
       │
       │ 3. Verify API key
       ▼
┌─────────────┐
│ Auth System │
│ - Hash key  │
│ - Check DB  │
│ - Update    │
│   last_used │
└──────┬──────┘
       │
       │ 4. Store metadata
       ▼
┌─────────────┐     ┌─────────────┐
│   MongoDB   │────▶│ Check for   │
│             │     │ duplicate   │
└─────────────┘     │ (SHA256)    │
                    └──────┬──────┘
                           │
               ┌───────────┴───────────┐
               │                       │
            Exists                 New File
               │                       │
               ▼                       ▼
        ┌─────────────┐         ┌─────────────┐
        │ Response:   │         │ Response:   │
        │ upload_     │         │ upload_     │
        │ required    │         │ required    │
        │ = false     │         │ = true      │
        └─────────────┘         └──────┬──────┘
                                       │
                           ┌───────────┘
                           │
                           │ 5. POST /upload_file/{sha256}
                           │    Multipart: file content
                           ▼
                    ┌─────────────┐
                    │   FastAPI   │
                    │   Server    │
                    └──────┬──────┘
                           │
                           │ 6. Verify SHA256 matches
                           ▼
                    ┌─────────────┐
                    │ Calculate   │
                    │ SHA256 of   │
                    │ uploaded    │
                    │ content     │
                    └──────┬──────┘
                           │
                           │ 7. Store file
                           ▼
                    ┌─────────────┐
                    │   Storage   │
                    │   Backend   │
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │             │
                    ▼             ▼
               ┌─────────┐   ┌─────────┐
               │  Local  │   │   S3    │
               │   FS    │   │         │
               └────┬────┘   └────┬────┘
                    └──────┬──────┘
                           │
                           │ 8. Mark uploaded
                           ▼
                    ┌─────────────┐
                    │   MongoDB   │
                    │   Update    │
                    │  has_file   │
                    │  _content   │
                    └─────────────┘
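The workflow above can be walked through end to end with an in-memory stand-in for the server. In this sketch, plain Python containers replace MongoDB and the storage backend, authentication is left out, and the class and function names are hypothetical. Only the field names `sha256`, `hostname`, `filepath`, `size`, and `upload_required` come from the diagram.

```python
import hashlib


class InMemoryServer:
    """Stand-in for the API: a list replaces MongoDB, a dict replaces storage."""

    def __init__(self) -> None:
        self.metadata: list[dict] = []
        self.content: dict[str, bytes] = {}

    def put_file(self, meta: dict) -> dict:
        """Steps 2-4: record metadata and report whether content is still needed."""
        self.metadata.append(dict(meta))
        return {"upload_required": meta["sha256"] not in self.content}

    def upload_file(self, sha256: str, data: bytes) -> None:
        """Steps 5-8: verify the digest, then store the content."""
        if hashlib.sha256(data).hexdigest() != sha256:
            raise ValueError("uploaded content does not match the declared SHA256")
        self.content[sha256] = data


def send_file(server: InMemoryServer, hostname: str, filepath: str, data: bytes) -> bool:
    """Client side of the exchange. Returns True if content had to be uploaded."""
    digest = hashlib.sha256(data).hexdigest()
    reply = server.put_file(
        {"hostname": hostname, "filepath": filepath, "sha256": digest, "size": len(data)}
    )
    if reply["upload_required"]:
        server.upload_file(digest, data)
    return reply["upload_required"]


server = InMemoryServer()
first = send_file(server, "server-a", "/etc/motd", b"welcome")
second = send_file(server, "server-b", "/etc/motd", b"welcome")
```

Here `first` is `True` and `second` is `False`: both hosts register metadata, but the bytes cross the wire only once.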
Design Decisions
1. Content-Addressable Storage (CAS)
Decision: Use SHA256 as file identifier
Rationale:
Enables automatic deduplication
Immutable file identification
Collision-resistant (an accidental SHA256 collision is practically impossible)
Standard in content-addressable systems
Trade-offs:
Must calculate SHA256 for every file (CPU cost)
Cannot store multiple versions of same content with different metadata
Metadata stored separately from content
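A few lines are enough to show why content addressing deduplicates automatically. The file names and byte strings below are made up for the illustration, and `content_id` is a hypothetical helper.

```python
import hashlib


def content_id(data: bytes) -> str:
    """The identifier depends only on the bytes, never on the file name."""
    return hashlib.sha256(data).hexdigest()


files = [
    ("report.pdf", b"same bytes"),
    ("copy-of-report.pdf", b"same bytes"),
    ("notes.txt", b"other bytes"),
]

store: dict[str, bytes] = {}
for _name, data in files:
    # Identical content maps to the same key, so it is kept only once.
    store.setdefault(content_id(data), data)
```

Three files go in, two blobs are stored. The file names live only in the metadata, which is why metadata has to be kept separately from content.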
2. Two-Phase Upload Protocol
Decision: Separate metadata upload from content upload
Rationale:
Allows server to check for duplicates before content upload
Saves bandwidth for duplicate files
Enables metadata tracking without content
Supports different storage backends
Trade-offs:
Two round trips instead of one
More complex client logic
Potential for metadata without content (if client fails)
3. Asynchronous Architecture
Decision: Use async/await throughout
Rationale:
Better concurrency for I/O-bound operations
Efficient handling of multiple clients
Non-blocking database and storage operations
Modern Python best practice
Trade-offs:
More complex code (async/await everywhere)
Requires async-compatible libraries
Debugging can be more challenging
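The concurrency benefit for I/O-bound work can be seen in a toy example. `asyncio.sleep` stands in for a database or storage call here; the function names and the 0.1 second delay are arbitrary choices for the demonstration.

```python
import asyncio
import time


async def fake_io(delay: float) -> float:
    """Simulate a non-blocking I/O call such as a database query."""
    await asyncio.sleep(delay)
    return delay


async def main() -> float:
    start = time.perf_counter()
    # Ten waits run concurrently on one thread instead of back to back.
    await asyncio.gather(*(fake_io(0.1) for _ in range(10)))
    return time.perf_counter() - start


elapsed = asyncio.run(main())
```

Run sequentially, the ten calls would take about one second. Run concurrently, the total stays close to the duration of a single call.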
4. Abstract Storage Backend
Decision: Storage backend abstraction with multiple implementations
Rationale:
Flexibility to switch storage backends
Easy to add new backends
Testability (mock storage)
Separation of concerns
Trade-offs:
Extra abstraction layer
Slightly more complex codebase
Must implement full interface for each backend
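The testability argument can be made concrete with a dictionary-backed backend. The interface below mirrors the `StorageBackend` methods listed in the Storage Backends section; `InMemoryStorage` and `demo` are hypothetical names for this sketch, which is not part of storage.py.

```python
import asyncio
import hashlib
from abc import ABC, abstractmethod
from typing import Optional


class StorageBackend(ABC):
    @abstractmethod
    async def store(self, sha256: str, content: bytes) -> bool: ...

    @abstractmethod
    async def retrieve(self, sha256: str) -> Optional[bytes]: ...

    @abstractmethod
    async def exists(self, sha256: str) -> bool: ...

    @abstractmethod
    async def delete(self, sha256: str) -> bool: ...


class InMemoryStorage(StorageBackend):
    """Mock backend for tests: keeps content in a dict keyed by digest."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    async def store(self, sha256: str, content: bytes) -> bool:
        self._blobs[sha256] = content
        return True

    async def retrieve(self, sha256: str) -> Optional[bytes]:
        return self._blobs.get(sha256)

    async def exists(self, sha256: str) -> bool:
        return sha256 in self._blobs

    async def delete(self, sha256: str) -> bool:
        # True only if something was actually removed.
        return self._blobs.pop(sha256, None) is not None


async def demo() -> tuple[bool, Optional[bytes], bool]:
    backend: StorageBackend = InMemoryStorage()
    digest = hashlib.sha256(b"payload").hexdigest()
    await backend.store(digest, b"payload")
    found = await backend.exists(digest)
    data = await backend.retrieve(digest)
    await backend.delete(digest)
    return found, data, await backend.exists(digest)


result = asyncio.run(demo())
```

Because the class derives from `ABC`, a subclass that omits any of the four methods cannot be instantiated. That is the "must implement full interface" trade-off in practice.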
5. API Key Authentication
Decision: Header-based API key authentication
Rationale:
Simple to implement and use
Standard for API authentication
No session management needed
Stateless (scales horizontally)
Trade-offs:
Less secure than OAuth2/JWT (no expiration)
No fine-grained permissions (future enhancement)
Keys must be rotated manually
6. MongoDB for Metadata
Decision: Use MongoDB instead of relational database
Rationale:
Flexible schema (easy to add fields)
Native JSON/Python dict mapping
Good performance for document lookups
Horizontal scaling with sharding
Trade-offs:
Multi-document ACID transactions exist (MongoDB 4.0+) but are not the default model and carry extra cost (not needed for this use case)
More complex aggregations than SQL
Requires MongoDB infrastructure
7. Pydantic for Validation
Decision: Use Pydantic v2 for all data models
Rationale:
Type safety and validation
Automatic API documentation
JSON schema generation
IDE support with type hints
Trade-offs:
Learning curve for Pydantic
More verbose than plain dicts
Runtime overhead (minimal)
Scalability
Horizontal Scaling
PutPlace is designed for horizontal scaling:
Stateless API:
No session state
All state in MongoDB
Multiple API instances can run in parallel
Load balancer distributes requests
MongoDB Scaling:
Replica sets for read scaling
Sharding for write scaling
Indexes for query performance
Storage Scaling:
Local: Limited to single server
S3: Virtually unlimited capacity
Vertical Scaling
API Server:
More CPU cores → More Gunicorn workers
More RAM → Larger connection pools
Formula: workers = (4 × CPU cores) + 1
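As a quick worked example of the worker formula, the helper below applies it to the local machine. The function name `gunicorn_workers` is hypothetical, and `os.cpu_count()` reports logical cores, which may differ from the number of cores actually available to a container.

```python
import os


def gunicorn_workers(cpu_cores: int) -> int:
    """Apply the document's sizing rule: workers = (4 x CPU cores) + 1."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return 4 * cpu_cores + 1


# os.cpu_count() can return None, so fall back to a single core.
workers = gunicorn_workers(os.cpu_count() or 1)
```

With 2 cores the rule gives 9 workers, and with 8 cores it gives 33. Gunicorn's own documentation suggests (2 × cores) + 1 as a starting point, so the multiplier of 4 is worth validating under real load.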
MongoDB:
More RAM → Larger working set in memory
More CPU → Better query performance
SSD → Faster I/O
Performance Optimizations
Database Indexes:
SHA256 index for fast duplicate detection
Compound index on hostname+filepath
has_file_content index for upload checks
Connection Pooling:
MongoDB connection pool (default: 10-100)
HTTP client connection reuse
Async I/O:
Non-blocking database operations
Non-blocking storage operations
Concurrent request handling
Content Distribution:
256 subdirectories for local storage
S3 automatic distribution
Security Architecture
Authentication
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       │ X-API-Key: abc123...
       ▼
┌─────────────┐
│   FastAPI   │
│  Security   │
└──────┬──────┘
       │
       │ get_current_api_key()
       ▼
┌─────────────┐
│  Hash key   │
│   SHA256    │
└──────┬──────┘
       │
       │ key_hash
       ▼
┌─────────────┐
│   MongoDB   │
│  api_keys   │
└──────┬──────┘
       │
       │ Find by key_hash
       ▼
┌─────────────┐
│   Verify    │
│  is_active  │
└──────┬──────┘
       │
       │ Update last_used_at
       ▼
┌─────────────┐
│   Return    │
│   API key   │
│  metadata   │
└─────────────┘
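The lookup sequence in the diagram can be traced in code with a dictionary standing in for the `api_keys` collection. This is a sketch under assumptions: the `authenticate` function, the record layout beyond `is_active` and `last_used_at`, and the use of `PermissionError` are all illustrative, and the real dependency (`get_current_api_key()`) would return an HTTP error instead of raising a built-in exception.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def _hash(key: str) -> str:
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Stand-in for the api_keys collection, indexed by key_hash.
API_KEYS: dict[str, dict] = {
    _hash("demo-key"): {"name": "demo", "is_active": True, "last_used_at": None},
    _hash("revoked-key"): {"name": "old", "is_active": False, "last_used_at": None},
}


def authenticate(presented_key: Optional[str]) -> dict:
    """Hash the key, find it by key_hash, check is_active, update last_used_at."""
    if not presented_key:
        raise PermissionError("missing X-API-Key header")
    record = API_KEYS.get(_hash(presented_key))
    if record is None or not record["is_active"]:
        # Same error for unknown and inactive keys, so callers learn nothing extra.
        raise PermissionError("invalid or inactive API key")
    record["last_used_at"] = datetime.now(timezone.utc)
    return dict(record)  # metadata only: the plaintext key is never stored


info = authenticate("demo-key")
```

Because the lookup is by hash, the server can authenticate a request without ever holding the plaintext key at rest.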
Data Protection
API Keys:
Hashed with SHA256 before storage
Never stored in plaintext
Shown only once during creation
Transport:
HTTPS enforced in production
TLS 1.2+ only
Strong cipher suites
Storage:
S3 encryption at rest
IAM roles instead of credentials
Bucket policies for access control
File Content:
SHA256 verification on upload
Content-addressable (immutable)
Duplicate detection prevents overwrite
Testing Architecture
Test Pyramid
     ┌────────┐
     │  E2E   │       (Integration tests)
     └────────┘
   ┌────────────┐
   │ API Tests  │     (Endpoint tests)
   └────────────┘
┌──────────────────┐
│    Unit Tests    │  (Component tests)
└──────────────────┘
Test Categories
Unit Tests:
test_database.py: Database operations
test_auth.py: Authentication logic
test_storage.py: Storage backends
API Tests:
test_api.py: API endpoints
Request/response validation
Authentication checks
Client Tests:
test_client.py: Client functionality
File scanning
Upload logic
Test Coverage
Target: 100% code coverage
Current coverage:
Core: 100%
API endpoints: 100%
Authentication: 100%
Storage backends: 100%
Client: 95% (excludes main)
Future Enhancements
Planned Features
Fine-grained Permissions:
Read-only API keys
Namespace-based access control
Per-bucket permissions
Chunked Uploads:
Support for very large files (>5GB)
Resumable uploads
Parallel chunk uploads
File Versioning:
Track multiple versions of same file
Version history
Rollback capability
Query API:
Search by metadata
Time-based queries
Aggregation API
Webhook Notifications:
File upload events
Duplicate detection events
Error notifications
Additional Storage Backends:
Google Cloud Storage
Azure Blob Storage
MinIO (S3-compatible)
Architectural Improvements
Caching Layer:
Redis for frequently accessed metadata
Reduce MongoDB load
Faster response times
Message Queue:
Async background processing
File content scanning
Thumbnail generation
Metrics and Monitoring:
Prometheus metrics
Grafana dashboards
Alert rules
Rate Limiting:
Per-API-key rate limits
Burst handling
Quotas