# Subscription Cleanup - Simplified Design

## Problem Summary

The c-relay Nostr relay experienced severe performance degradation (90-100% CPU) due to subscription accumulation in the database. Investigation revealed **323,644 orphaned subscriptions** that were never properly closed when WebSocket connections dropped.

## Solution: Two-Component Approach

This simplified design focuses on two pragmatic solutions that align with Nostr's stateless design:

1. **Startup Cleanup**: Close all subscriptions on relay restart
2. **Connection Age Limit**: Disconnect clients after a configurable time period

Both solutions force clients to reconnect and re-establish subscriptions, which is standard Nostr behavior.

---

## Component 1: Startup Cleanup

### Purpose
Ensure clean state on every relay restart by closing all subscriptions in the database.

### Implementation

**File:** [`src/subscriptions.c`](src/subscriptions.c)

**New Function:**
```c
void cleanup_all_subscriptions_on_startup(void) {
    if (!g_db) {
        DEBUG_ERROR("Database not initialized for startup cleanup");
        return;
    }

    DEBUG_LOG("Startup cleanup: Marking all active subscriptions as disconnected");

    // Mark all 'created' subscriptions as disconnected
    const char* update_sql =
        "UPDATE subscriptions "
        "SET ended_at = strftime('%s', 'now') "
        "WHERE event_type = 'created' AND ended_at IS NULL";

    sqlite3_stmt* stmt = NULL;
    int rc = sqlite3_prepare_v2(g_db, update_sql, -1, &stmt, NULL);
    if (rc != SQLITE_OK) {
        DEBUG_ERROR("Failed to prepare startup cleanup query: %s", sqlite3_errmsg(g_db));
        return;
    }

    rc = sqlite3_step(stmt);
    if (rc != SQLITE_DONE) {
        // The UPDATE failed, so sqlite3_changes() would not describe this statement
        DEBUG_ERROR("Startup cleanup update failed: %s", sqlite3_errmsg(g_db));
        sqlite3_finalize(stmt);
        return;
    }
    int updated_count = sqlite3_changes(g_db);
    sqlite3_finalize(stmt);

    if (updated_count > 0) {
        // Log a single 'disconnected' event for the startup cleanup
        const char* insert_sql =
            "INSERT INTO subscriptions (subscription_id, wsi_pointer, client_ip, event_type) "
            "VALUES ('startup_cleanup', '', 'system', 'disconnected')";

        rc = sqlite3_prepare_v2(g_db, insert_sql, -1, &stmt, NULL);
        if (rc == SQLITE_OK) {
            if (sqlite3_step(stmt) != SQLITE_DONE) {
                DEBUG_ERROR("Failed to log startup cleanup event: %s", sqlite3_errmsg(g_db));
            }
            sqlite3_finalize(stmt);
        }

        DEBUG_LOG("Startup cleanup: Marked %d subscriptions as disconnected", updated_count);
    } else {
        DEBUG_LOG("Startup cleanup: No active subscriptions found");
    }
}
```

**Integration Point:** [`src/main.c:1810`](src/main.c:1810)

```c
// Initialize subscription manager mutexes
if (pthread_mutex_init(&g_subscription_manager.subscriptions_lock, NULL) != 0) {
    DEBUG_ERROR("Failed to initialize subscriptions mutex");
    sqlite3_close(g_db);
    return 1;
}

if (pthread_mutex_init(&g_subscription_manager.ip_tracking_lock, NULL) != 0) {
    DEBUG_ERROR("Failed to initialize IP tracking mutex");
    pthread_mutex_destroy(&g_subscription_manager.subscriptions_lock);
    sqlite3_close(g_db);
    return 1;
}

// NEW: Startup cleanup - close all subscriptions
cleanup_all_subscriptions_on_startup();

// Start WebSocket relay server
DEBUG_LOG("Starting WebSocket relay server...");
if (start_websocket_relay(port_override, strict_port) != 0) {
    DEBUG_ERROR("Failed to start WebSocket relay");
    // ... cleanup code
}
```

### Benefits
- **Immediate relief**: Restarting the relay clears stale subscription state
- **Clean slate**: Every restart starts with zero active subscriptions
- **Simple**: One SQL UPDATE statement, plus one INSERT that records the cleanup
- **Nostr-aligned**: Clients are designed to reconnect after relay restart
- **No configuration needed**: Always runs on startup

---

## Component 2: Connection Age Limit

### Purpose
Automatically disconnect clients after a configurable time period, forcing them to reconnect and re-establish subscriptions.

### Why Disconnect Instead of Just Closing Subscriptions?

**Option 1: Send CLOSED message (keep connection)**
- ❌ Not all clients handle `CLOSED` messages properly
- ❌ Silent failure - client thinks it's subscribed but isn't
- ❌ Partial cleanup - connection still consumes resources
- ❌ More complex to implement

**Option 2: Disconnect client entirely (force reconnection)** ✅
- ✅ Universal compatibility - all clients handle WebSocket reconnection
- ✅ Complete resource cleanup (memory, file descriptors, etc.)
- ✅ Simple implementation - single operation
- ✅ Well-tested code path (same as network interruptions)
- ✅ Forces re-authentication if needed

### Implementation

**File:** [`src/websockets.c`](src/websockets.c)

**New Function:**
```c
/**
 * Check connection age and disconnect clients that have been connected too long.
 * This forces clients to reconnect and re-establish subscriptions.
 *
 * Assumes an lws_vhost_foreach_wsi() iterator is available to walk all active
 * connections (verify against the libwebsockets version in use) and checks the
 * connection_established timestamp stored in per_session_data.
 */
void check_connection_age(int max_connection_seconds) {
    if (max_connection_seconds <= 0 || !ws_context) {
        return;
    }

    static const char close_reason[] = "connection age limit";

    time_t now = time(NULL);
    time_t cutoff = now - max_connection_seconds;

    // Get the default vhost
    struct lws_vhost *vhost = lws_get_vhost_by_name(ws_context, "default");
    if (!vhost) {
        DEBUG_ERROR("Failed to get vhost for connection age check");
        return;
    }

    // Iterate through all active WebSocket connections.
    // Note: lws_vhost_foreach_wsi() is used here as a cursor-style iterator
    // (pass the previous wsi, get the next one back).
    struct lws *wsi = NULL;
    while ((wsi = lws_vhost_foreach_wsi(vhost, wsi)) != NULL) {
        // Get per-session data which contains connection_established timestamp
        struct per_session_data *pss = (struct per_session_data *)lws_wsi_user(wsi);

        // Only act on connections with a recorded start time older than the cutoff
        if (pss && pss->connection_established > 0 &&
            pss->connection_established < cutoff) {
            long age_seconds = (long)(now - pss->connection_established);

            DEBUG_LOG("Closing connection from %s (age: %lds, limit: %ds)",
                      pss->client_ip, age_seconds, max_connection_seconds);

            // Record the close status and reason (length excludes the trailing NUL)
            lws_close_reason(wsi, LWS_CLOSE_STATUS_NORMAL,
                             (unsigned char *)close_reason, sizeof(close_reason) - 1);

            // lws_close_reason() only records what to send; it does not close
            // the connection. Ask libwebsockets to close it from the service loop.
            lws_set_timeout(wsi, PENDING_TIMEOUT_CLOSE_SEND, LWS_TO_KILL_ASYNC);
        }
    }
}
```

**Key Implementation Details:**

1. **No database needed**: Active connections are tracked by libwebsockets itself
2. **Uses existing timestamp**: `pss->connection_established` is already set on line 456 of websockets.c
3. **Iterator**: The design assumes `lws_vhost_foreach_wsi()` iterates over all active connections on the vhost; confirm it is available in the libwebsockets version in use before relying on it
4. **Per-session data**: Each connection's `per_session_data` is accessible via `lws_wsi_user()`
5. **Safe closure**: `lws_close_reason()` sets the status code and message that will be sent in the close frame; the close itself still has to be triggered, for example by returning nonzero from the protocol callback or by calling `lws_set_timeout()` with `LWS_TO_KILL_ASYNC`
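
The age test inside `check_connection_age()` can be isolated into a pure function, which makes the boundary behaviour easy to unit test without a live WebSocket. The sketch below is illustrative only: `connection_is_expired()` is a hypothetical helper, not an existing c-relay function, and it assumes the same strict comparison as the function above (a connection is closed only once its age exceeds the limit).

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper (not part of c-relay): returns true when a connection
 * established at `established` is older than `max_seconds` as of `now`.
 * A limit <= 0 means the feature is disabled, and an unset timestamp (<= 0)
 * is never treated as expired. */
bool connection_is_expired(time_t established, time_t now, int max_seconds) {
    if (max_seconds <= 0 || established <= 0) {
        return false;
    }
    /* Strict comparison: a connection exactly at the limit is still kept */
    return established < now - (time_t)max_seconds;
}
```

With a 60-second limit, a connection established at `t = 1000` is kept at `now = 1060` and closed at `now = 1061`, which matches the 61-second wait used in the connection age tests.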

**Integration Point:** [`src/websockets.c:2176`](src/websockets.c:2176) - in existing event loop

```c
// Main event loop with proper signal handling
while (g_server_running && !g_shutdown_flag) {
    int result = lws_service(ws_context, 1000);

    if (result < 0) {
        DEBUG_ERROR("libwebsockets service error");
        break;
    }

    // Check if it's time to post status update
    time_t current_time = time(NULL);
    int status_post_hours = get_config_int("kind_1_status_posts_hours", 0);

    if (status_post_hours > 0) {
        int seconds_interval = status_post_hours * 3600;
        if (current_time - last_status_post_time >= seconds_interval) {
            last_status_post_time = current_time;
            generate_and_post_status_event();
        }
    }

    // NEW: Check for connection age limit
    int max_connection_seconds = get_config_int("max_connection_seconds", 86400); // Default 24 hours
    if (max_connection_seconds > 0) {
        check_connection_age(max_connection_seconds);
    }
}
```
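
Because the loop body runs on every return from `lws_service()`, the age check may execute more often than necessary on a busy relay. One way to bound that cost, sketched here under the assumption that a coarse check interval is acceptable, is a small gate that fires at most once per interval. `periodic_check_due()` and the 60-second interval mentioned afterwards are hypothetical, not part of the existing code.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: returns true at most once every `interval_seconds`.
 * When it returns true it also records `now` in *last_run so the next call
 * measures from this point. An interval <= 0 disables throttling. */
bool periodic_check_due(time_t now, time_t *last_run, int interval_seconds) {
    if (interval_seconds <= 0) {
        return true;
    }
    if (now - *last_run < (time_t)interval_seconds) {
        return false;   /* too soon since the last run */
    }
    *last_run = now;
    return true;
}
```

In the event loop this might be used as `if (periodic_check_due(current_time, &last_age_check, 60)) check_connection_age(max_connection_seconds);`, where `last_age_check` is a hypothetical `time_t` initialised to 0 before the loop.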

### Configuration

**Parameter:** `max_connection_seconds`
- **Default:** `86400` (24 hours)
- **Range:** `0` disables the limit; any positive value disconnects clients after that many seconds
- **Units:** Seconds (for consistency with other time-based configs)

**Example configurations:**

24 hours (default):
```json
{
  "max_connection_seconds": 86400
}
```

12 hours:
```json
{
  "max_connection_seconds": 43200
}
```

1 hour:
```json
{
  "max_connection_seconds": 3600
}
```

Disabled:
```json
{
  "max_connection_seconds": 0
}
```
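
When logging the effective setting at startup, it can help to render the raw seconds value in a readable form so that a mistyped limit is easy to spot. The following is a minimal sketch; `format_connection_limit()` is a hypothetical helper and treats any value `<= 0` as disabled, mirroring the guard in `check_connection_age()`.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes a human-readable form of max_connection_seconds
 * into buf, for example 86400 becomes "24h 0m 0s". Values <= 0 are reported
 * as "disabled". Returns buf for convenience. */
char *format_connection_limit(int seconds, char *buf, size_t len) {
    if (seconds <= 0) {
        snprintf(buf, len, "disabled");
    } else {
        snprintf(buf, len, "%dh %dm %ds",
                 seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }
    return buf;
}
```

A 32-byte buffer is enough for any `int` input, since the longest possible output is well under that size.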

### Client Behavior

When disconnected due to the age limit, clients are expected to:
1. Detect WebSocket closure
2. Wait briefly (exponential backoff)
3. Reconnect to relay
4. Re-authenticate if needed (NIP-42)
5. Re-establish all subscriptions
6. Resume normal operation

This is **exactly what happens** during network interruptions, so it exercises a reconnection path that Nostr clients already depend on.
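
Step 2 is client-side behaviour and is not controlled by the relay. As a rough illustration of what a capped exponential backoff looks like, the sketch below doubles the delay on each failed attempt up to a ceiling. The function name and the 1 s base / 60 s cap used in the note afterwards are assumptions for illustration, not values taken from any particular Nostr client.

```c
#include <stdint.h>

/* Hypothetical client-side helper: delay before reconnect attempt number
 * `attempt` (0-based), computed as min(cap_ms, base_ms * 2^attempt).
 * A 64-bit accumulator keeps the doubling from overflowing. */
uint32_t reconnect_delay_ms(unsigned attempt, uint32_t base_ms, uint32_t cap_ms) {
    if (base_ms == 0) {
        return 0;
    }
    uint64_t delay = base_ms;
    /* Stop doubling as soon as the cap is reached */
    for (unsigned i = 0; i < attempt && delay < cap_ms; i++) {
        delay *= 2;
    }
    return (uint32_t)(delay < cap_ms ? delay : cap_ms);
}
```

With a 1000 ms base and a 60000 ms cap, attempts 0 through 3 wait 1000, 2000, 4000, and 8000 ms, and the delay stays at 60000 ms from attempt 6 onward.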

### Benefits
- **No new threads**: Uses existing event loop
- **Low overhead**: The check runs once per event-loop pass, which is about once per second when the relay is idle (given the 1000 ms `lws_service` timeout) and more often under load
- **Simple implementation**: Iterate through active connections
- **Consistent pattern**: Matches existing status post checking
- **Universal compatibility**: All clients handle reconnection
- **Complete cleanup**: Frees all resources associated with connection
- **Configurable**: Can be adjusted per relay needs or disabled entirely

---

## Implementation Plan

### Phase 1: Startup Cleanup (1-2 hours)

1. **Add `cleanup_all_subscriptions_on_startup()` function**
   - File: [`src/subscriptions.c`](src/subscriptions.c)
   - SQL UPDATE to mark all active subscriptions as disconnected
   - Add logging for cleanup count

2. **Integrate in main()**
   - File: [`src/main.c:1810`](src/main.c:1810)
   - Call after mutex initialization, before WebSocket server start

3. **Test**
   - Create subscriptions in database
   - Restart relay
   - Verify all subscriptions marked as disconnected
   - Verify `active_subscriptions_log` shows 0 subscriptions

**Estimated Time:** 1-2 hours

### Phase 2: Connection Age Limit (2-3 hours)

1. **Add `check_connection_age()` function**
   - File: [`src/websockets.c`](src/websockets.c)
   - Iterate through active connections
   - Close connections older than limit

2. **Integrate in event loop**
   - File: [`src/websockets.c:2176`](src/websockets.c:2176)
   - Add check after status post check
   - Use same pattern as status posts

3. **Add configuration parameter**
   - Add `max_connection_seconds` to default config
   - Default: 86400 (24 hours)

4. **Test**
   - Connect client
   - Wait for timeout (or reduce timeout for testing)
   - Verify client disconnected
   - Verify client reconnects automatically
   - Verify subscriptions re-established

**Estimated Time:** 2-3 hours

---

## Testing Strategy

### Startup Cleanup Tests

```text
# Test 1: Clean startup with existing subscriptions
- Create 100 active subscriptions in database
- Restart relay
- Verify all subscriptions marked as disconnected
- Verify active_subscriptions_log shows 0 subscriptions

# Test 2: Clean startup with no subscriptions
- Start relay with empty database
- Verify no errors
- Verify startup cleanup logs "No active subscriptions found"

# Test 3: Clients reconnect after restart
- Create subscriptions before restart
- Restart relay
- Connect clients and create new subscriptions
- Verify new subscriptions tracked correctly
```

### Connection Age Limit Tests

```text
# Test 1: Connection disconnected after timeout
- Set max_connection_seconds to 60 (for testing)
- Connect client
- Wait 61 seconds
- Verify client disconnected
- Verify client reconnects automatically

# Test 2: Subscriptions re-established after reconnection
- Connect client with subscriptions
- Wait for timeout
- Verify client reconnects
- Verify subscriptions re-established
- Verify events still delivered

# Test 3: Disabled when set to 0
- Set max_connection_seconds to 0
- Connect client
- Wait extended period
- Verify client NOT disconnected
```

### Integration Tests

```text
# Test 1: Combined behavior
- Start relay (startup cleanup runs)
- Connect multiple clients
- Create subscriptions
- Wait for connection timeout
- Verify clients reconnect
- Restart relay
- Verify clean state

# Test 2: Load test
- Connect 100 clients
- Each creates 5 subscriptions
- Wait for connection timeout
- Verify all clients reconnect
- Verify all subscriptions re-established
- Monitor CPU usage (should remain low)
```

---

## Success Criteria

### Component 1: Startup Cleanup
- ✅ Relay starts with zero active subscriptions
- ✅ All previous subscriptions marked as disconnected on startup
- ✅ Clients successfully reconnect and re-establish subscriptions
- ✅ Relay restart can be used as emergency fix for subscription issues
- ✅ No errors during startup cleanup process

### Component 2: Connection Age Limit
- ✅ Clients disconnected after configured time period
- ✅ Clients automatically reconnect
- ✅ Subscriptions re-established after reconnection
- ✅ No impact on relay performance
- ✅ Configuration parameter works correctly (including disabled state)

### Overall Success
- ✅ CPU usage remains low (<10%)
- ✅ No orphaned subscriptions accumulate
- ✅ Database size remains stable
- ✅ No manual intervention required

---

## Configuration Reference

**New Configuration Parameters:**

```json
{
  "max_connection_seconds": 86400
}
```

**Recommended Settings:**

- **Production:**
  - `max_connection_seconds: 86400` (24 hours)

- **Development:**
  - `max_connection_seconds: 3600` (1 hour for faster testing)

- **High-traffic:**
  - `max_connection_seconds: 43200` (12 hours)

- **Disabled:**
  - `max_connection_seconds: 0`

---

## Rollback Plan

If issues arise after deployment:

1. **Disable connection age limit:**
   - Set `max_connection_seconds: 0` in config
   - Restart relay
   - Monitor for stability

2. **Revert code changes:**
   - Remove `check_connection_age()` call from event loop
   - Remove `cleanup_all_subscriptions_on_startup()` call from main
   - Restart relay

3. **Database cleanup (if needed):**
   - Manually clean up orphaned subscriptions using SQL:
     ```sql
     UPDATE subscriptions
     SET ended_at = strftime('%s', 'now')
     WHERE event_type = 'created' AND ended_at IS NULL;
     ```

---

## Comparison with Original Design

### Original Design (5 Components Plus 1 Optional)
1. Startup cleanup
2. Fix WebSocket disconnection logging
3. Enhance subscription removal with reason parameter
4. Periodic cleanup task (background thread)
5. Optimize database VIEW
6. Subscription expiration (optional)

### Simplified Design (2 Components)
1. Startup cleanup
2. Connection age limit

### Why Simplified is Better

**Advantages:**
- **Simpler**: 2 components vs 5-6 components
- **Faster to implement**: 3-5 hours vs 11-17 hours
- **Easier to maintain**: Less code, fewer moving parts
- **More reliable**: Fewer potential failure points
- **Nostr-aligned**: Leverages client reconnection behavior
- **No new threads**: Uses existing event loop
- **Universal compatibility**: All clients handle reconnection

**What We're Not Losing:**
- Startup cleanup is identical in both designs
- Connection age limit achieves the same goal as periodic cleanup + expiration
- Disconnection forces complete cleanup (better than just logging)
- Database VIEW optimization not needed if subscriptions don't accumulate

**Trade-offs:**
- Less granular logging (but simpler)
- No historical subscription analytics (but cleaner database)
- Clients must reconnect periodically (but this is standard Nostr behavior)

---

## Conclusion

This simplified design solves the subscription accumulation problem with two pragmatic solutions:

1. **Startup cleanup** ensures every relay restart starts with a clean slate
2. **Connection age limit** prevents long-term accumulation by forcing periodic reconnection

Both solutions align with Nostr's stateless design where clients are expected to handle reconnection. The implementation is simple, maintainable, and leverages existing code patterns.

**Key Benefits:**
- ✅ Solves the root problem (subscription accumulation)
- ✅ Simple to implement (3-5 hours total)
- ✅ Easy to maintain (minimal code)
- ✅ Universal compatibility (all clients handle reconnection)
- ✅ No new threads or background tasks
- ✅ Configurable and can be disabled if needed
- ✅ Relay restart as emergency fix

**Next Steps:**
1. Implement Component 1 (Startup Cleanup)
2. Test thoroughly
3. Implement Component 2 (Connection Age Limit)
4. Test thoroughly
5. Deploy to production
6. Monitor CPU usage and subscription counts