File Upload Workflow - Content Deduplicationď
Overviewď
PutPlace uses a two-phase approach for file uploads with automatic deduplication based on SHA256 hashes:
Phase 1: Metadata Registration - Client sends file metadata (including SHA256)
Phase 2: Conditional Upload - Server indicates if file content upload is needed
This approach prevents unnecessary file transfers when the server already has the file content.
How It Worksď
Client Workflowď
1. Client scans file and calculates SHA256 hash
2. Client sends metadata to POST /put_file
3. Server responds with upload_required flag:
- upload_required: true â File is unique, upload needed
- upload_required: false â File already exists, skip upload
4. If upload_required: Client uploads to POST /upload_file/{sha256}
5. Server verifies SHA256 matches and marks file as uploaded
Server Logicď
POST /put_file (Metadata Registration):
# Check if we already have this file content
has_content = await db.has_file_content(sha256)
if has_content:
return {upload_required: false} # Deduplication!
else:
return {upload_required: true, upload_url: "/upload_file/{sha256}"}
POST /upload_file/{sha256} (File Upload):
# Verify uploaded content matches claimed SHA256
calculated_hash = hashlib.sha256(content).hexdigest()
if calculated_hash != sha256:
raise HTTPException("Hash mismatch")
# Mark file as uploaded in database
await db.mark_file_uploaded(sha256, hostname, filepath)
API Examplesď
Example 1: First Upload (Unique File)ď
Step 1: Register Metadata
curl -X POST http://localhost:8000/put_file \
-H "Content-Type: application/json" \
-d '{
"filepath": "/var/log/app.log",
"hostname": "server01",
"ip_address": "192.168.1.100",
"sha256": "abc123...",
"file_size": 2048,
"file_mode": 33188,
"file_uid": 1000,
"file_gid": 1000,
"file_mtime": 1609459200.0,
"file_atime": 1609459200.0,
"file_ctime": 1609459200.0
}'
Response:
{
"filepath": "/var/log/app.log",
"hostname": "server01",
"sha256": "abc123...",
"upload_required": true,
"upload_url": "/upload_file/abc123...",
"has_file_content": false,
"id": "65f1234..."
}
Step 2: Upload File Content (because upload_required=true)
curl -X POST "http://localhost:8000/upload_file/abc123...?hostname=server01&filepath=/var/log/app.log" \
-F "file=@/var/log/app.log"
Response:
{
"message": "File uploaded successfully",
"sha256": "abc123...",
"size": 2048,
"hostname": "server01",
"filepath": "/var/log/app.log"
}
Example 2: Duplicate File (Already Exists)ď
Step 1: Register Metadata
curl -X POST http://localhost:8000/put_file \
-H "Content-Type: application/json" \
-d '{
"filepath": "/home/user/copy_of_app.log",
"hostname": "server02",
"sha256": "abc123...",
...
}'
Response:
{
"filepath": "/home/user/copy_of_app.log",
"hostname": "server02",
"sha256": "abc123...",
"upload_required": false,
"upload_url": null,
"has_file_content": false,
"id": "65f5678..."
}
Step 2: Skip Upload (because upload_required=false)
Client detects upload_required: false and skips the upload entirely!
Database Schemaď
FileMetadata Documentď
{
// Core metadata
"filepath": "/var/log/app.log",
"hostname": "server01",
"ip_address": "192.168.1.100",
"sha256": "abc123...",
// File stats
"file_size": 2048,
"file_mode": 33188,
"file_uid": 1000,
"file_gid": 1000,
"file_mtime": 1609459200.0,
"file_atime": 1609459200.0,
"file_ctime": 1609459200.0,
// Content tracking (NEW)
"has_file_content": true, // Whether server has file content
"file_uploaded_at": "2024-01-15...", // When content was uploaded
// Timestamps
"created_at": "2024-01-15..."
}
Client Implementationď
The client should:
â Calculate SHA256 before contacting server
â Send metadata to
/put_fileâ Check
upload_requiredin responseâ Only upload file if
upload_required == trueâ Use the provided
upload_url
Example Client Logic:
# Step 1: Calculate and send metadata
sha256 = calculate_sha256(filepath)
stats = get_file_stats(filepath)
metadata = {
"filepath": filepath,
"hostname": hostname,
"sha256": sha256,
**stats
}
response = httpx.post(f"{api_url}/put_file", json=metadata)
data = response.json()
# Step 2: Upload file only if needed
if data["upload_required"]:
with open(filepath, "rb") as f:
files = {"file": f}
params = {"hostname": hostname, "filepath": filepath}
upload_response = httpx.post(
f"{api_url}/upload_file/{sha256}",
files=files,
params=params
)
print(f"â File uploaded: {filepath}")
else:
print(f"â File already exists (deduplicated): {filepath}")
Benefitsď
1. Bandwidth Savingsď
Duplicate files across multiple hosts only uploaded once
Same file from different locations â single transfer
2. Storage Efficiencyď
Single copy of file content per unique SHA256
Metadata stored separately for each location
3. Fast Operationsď
Check for existing content is a simple database query
Skip upload for common files (system libraries, config files, etc.)
4. Integrity Verificationď
Server verifies SHA256 of uploaded content
Prevents corruption or tampering
Ensures uploaded file matches metadata
Security Considerationsď
â SHA256 Verification: Uploaded content must match claimed hash â No Trust on First Contact: Server verifies every upload â Metadata Binding: Upload requires matching hostname+filepath+SHA256 â ď¸ TODO: Add authentication/authorization for uploads â ď¸ TODO: Add file size limits â ď¸ TODO: Add storage quotas per user/hostname
Storage Optionsď
The current implementation marks files as uploaded but doesnât persist the actual content. You can extend this:
Option 1: Local Filesystemď
storage_path = Path(f"/var/putplace/files/{sha256[:2]}/{sha256}")
storage_path.parent.mkdir(parents=True, exist_ok=True)
with open(storage_path, "wb") as f:
f.write(content)
Option 2: S3/Object Storageď
import boto3
s3 = boto3.client('s3')
s3.put_object(
Bucket='putplace-files',
Key=f'files/{sha256}',
Body=content
)
Option 3: Content-Addressable Storage (CAS)ď
Use existing CAS solutions (Perkeep, Git-LFS, etc.)
Store files by their SHA256
Natural deduplication
Monitoring & Metricsď
Track these metrics:
Deduplication Rate: % of files that skip upload
Upload Success Rate: successful uploads / total attempts
Storage Savings: Total bytes saved by deduplication
Average File Size: Track typical file sizes
Hash Collisions: Should be zero (monitor anyway)
API Endpointsď
Endpoint |
Method |
Purpose |
|---|---|---|
|
POST |
Register file metadata, get upload decision |
|
POST |
Upload file content (if required) |
|
GET |
Retrieve file metadata by SHA256 |
|
GET |
Health check with database status |
Future Enhancementsď
Chunked uploads for large files
Resume interrupted uploads
Download endpoint to retrieve file content
File expiration/TTL
Storage backend plugins
Compression before storage
Encryption at rest
Access control lists (ACLs)
Audit logging