v0.7.34 - We seemed to maybe finally fixed the monitoring error?

This commit is contained in:
Your Name
2025-10-22 10:19:43 -04:00
parent 9cb9b746d8
commit 9179d57cc9
21 changed files with 2877 additions and 503 deletions

View File

@@ -0,0 +1,298 @@
# Libwebsockets Proper Pattern - Message Queue Design
## Problem Analysis
### Current Violation
We're calling `lws_write()` directly from multiple code paths:
1. **Event broadcast** (subscriptions.c:667) - when events arrive
2. **OK responses** (websockets.c:855) - when processing EVENT messages
3. **EOSE responses** (websockets.c:976) - when processing REQ messages
4. **COUNT responses** (websockets.c:1922) - when processing COUNT messages
This violates libwebsockets' design pattern which requires:
- **`lws_write()` ONLY called from `LWS_CALLBACK_SERVER_WRITEABLE`**
- Application queues messages and requests writeable callback
- Libwebsockets handles write timing and socket buffer management
### Consequences of Violation
1. Partial writes when socket buffer is full
2. Multiple concurrent write attempts before callback fires
3. "write already pending" errors with single buffer
4. Frame corruption from interleaved partial writes
5. "Invalid frame header" errors on client side
## Correct Architecture
### Message Queue Pattern
```
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
├─────────────────────────────────────────────────────────────┤
│ │
│ Event Arrives → Queue Message → Request Writeable Callback │
│ REQ Received → Queue EOSE → Request Writeable Callback │
│ EVENT Received→ Queue OK → Request Writeable Callback │
│ COUNT Received→ Queue COUNT → Request Writeable Callback │
│ │
└─────────────────────────────────────────────────────────────┘
lws_callback_on_writable(wsi)
┌─────────────────────────────────────────────────────────────┐
│ LWS_CALLBACK_SERVER_WRITEABLE │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. Dequeue next message from queue │
│ 2. Call lws_write() with message data │
│ 3. If queue not empty, request another callback │
│ │
└─────────────────────────────────────────────────────────────┘
libwebsockets handles:
- Socket buffer management
- Partial write handling
- Frame atomicity
```
## Data Structures
### Message Queue Node
```c
typedef struct message_queue_node {
unsigned char* data; // Message data (with LWS_PRE space)
size_t length; // Message length (without LWS_PRE)
enum lws_write_protocol type; // LWS_WRITE_TEXT, etc.
struct message_queue_node* next;
} message_queue_node_t;
```
### Per-Session Data Updates
```c
struct per_session_data {
// ... existing fields ...
// Message queue (replaces single buffer)
message_queue_node_t* message_queue_head;
message_queue_node_t* message_queue_tail;
int message_queue_count;
int writeable_requested; // Flag to prevent duplicate requests
};
```
## Implementation Functions
### 1. Queue Message (Application Layer)
```c
int queue_message(struct lws* wsi, struct per_session_data* pss,
const char* message, size_t length,
enum lws_write_protocol type)
{
// Allocate node
message_queue_node_t* node = malloc(sizeof(message_queue_node_t));
// Allocate buffer with LWS_PRE space
node->data = malloc(LWS_PRE + length);
memcpy(node->data + LWS_PRE, message, length);
node->length = length;
node->type = type;
node->next = NULL;
// Add to queue (FIFO)
pthread_mutex_lock(&pss->session_lock);
if (!pss->message_queue_head) {
pss->message_queue_head = node;
pss->message_queue_tail = node;
} else {
pss->message_queue_tail->next = node;
pss->message_queue_tail = node;
}
pss->message_queue_count++;
pthread_mutex_unlock(&pss->session_lock);
// Request writeable callback (only if not already requested)
if (!pss->writeable_requested) {
pss->writeable_requested = 1;
lws_callback_on_writable(wsi);
}
return 0;
}
```
### 2. Process Queue (Writeable Callback)
```c
int process_message_queue(struct lws* wsi, struct per_session_data* pss)
{
pthread_mutex_lock(&pss->session_lock);
// Get next message from queue
message_queue_node_t* node = pss->message_queue_head;
if (!node) {
pss->writeable_requested = 0;
pthread_mutex_unlock(&pss->session_lock);
return 0; // Queue empty
}
// Remove from queue
pss->message_queue_head = node->next;
if (!pss->message_queue_head) {
pss->message_queue_tail = NULL;
}
pss->message_queue_count--;
pthread_mutex_unlock(&pss->session_lock);
// Write message (libwebsockets handles partial writes)
int result = lws_write(wsi, node->data + LWS_PRE, node->length, node->type);
// Free node
free(node->data);
free(node);
// If queue not empty, request another callback
pthread_mutex_lock(&pss->session_lock);
if (pss->message_queue_head) {
lws_callback_on_writable(wsi);
} else {
pss->writeable_requested = 0;
}
pthread_mutex_unlock(&pss->session_lock);
return (result < 0) ? -1 : 0;
}
```
## Refactoring Changes
### Before (WRONG - Direct Write)
```c
// websockets.c:855 - OK response
int write_result = lws_write(wsi, buf + LWS_PRE, response_len, LWS_WRITE_TEXT);
if (write_result < 0) {
DEBUG_ERROR("Write failed");
} else if ((size_t)write_result != response_len) {
// Partial write - queue remaining data
queue_websocket_write(wsi, pss, ...);
}
```
### After (CORRECT - Queue Message)
```c
// websockets.c:855 - OK response
queue_message(wsi, pss, response_str, response_len, LWS_WRITE_TEXT);
// That's it! Writeable callback will handle the actual write
```
### Before (WRONG - Direct Write in Broadcast)
```c
// subscriptions.c:667 - EVENT broadcast
int write_result = lws_write(current_temp->wsi, buf + LWS_PRE, msg_len, LWS_WRITE_TEXT);
if (write_result < 0) {
DEBUG_ERROR("Write failed");
} else if ((size_t)write_result != msg_len) {
queue_websocket_write(...);
}
```
### After (CORRECT - Queue Message)
```c
// subscriptions.c:667 - EVENT broadcast
struct per_session_data* pss = lws_wsi_user(current_temp->wsi);
queue_message(current_temp->wsi, pss, msg_str, msg_len, LWS_WRITE_TEXT);
// Writeable callback will handle the actual write
```
## Benefits of Correct Pattern
1. **No Partial Write Handling Needed**
- Libwebsockets handles partial writes internally
- We just queue complete messages
2. **No "Write Already Pending" Errors**
- Queue can hold unlimited messages
- Each processed sequentially from callback
3. **Thread Safety**
- Queue operations protected by session lock
- Write only from single callback thread
4. **Frame Atomicity**
- Libwebsockets ensures complete frame transmission
- No interleaved partial writes
5. **Simpler Code**
- No complex partial write state machine
- Just queue and forget
6. **Better Performance**
- Libwebsockets optimizes write timing
- Batches writes when socket ready
## Migration Steps
1. ✅ Identify all `lws_write()` call sites
2. ✅ Confirm violation of libwebsockets pattern
3. ⏳ Design message queue structure
4. ⏳ Implement `queue_message()` function
5. ⏳ Implement `process_message_queue()` function
6. ⏳ Update `per_session_data` structure
7. ⏳ Refactor OK response to use queue
8. ⏳ Refactor EOSE response to use queue
9. ⏳ Refactor COUNT response to use queue
10. ⏳ Refactor EVENT broadcast to use queue
11. ⏳ Update `LWS_CALLBACK_SERVER_WRITEABLE` handler
12. ⏳ Add queue cleanup in `LWS_CALLBACK_CLOSED`
13. ⏳ Remove old partial write code
14. ⏳ Test with rapid multiple events
15. ⏳ Test with large events (>4KB)
16. ⏳ Test under load
17. ⏳ Verify no frame errors
## Testing Strategy
### Test 1: Multiple Rapid Events
```bash
# Send 10 events rapidly to same client
for i in {1..10}; do
echo '["EVENT",{"kind":1,"content":"test'$i'","created_at":'$(date +%s)',...}]' | \
websocat ws://localhost:8888 &
done
```
**Expected**: All events queued and sent sequentially, no errors
### Test 2: Large Events
```bash
# Send event >4KB (forces multiple socket writes)
nak event --content "$(head -c 5000 /dev/urandom | base64)" | \
websocat ws://localhost:8888
```
**Expected**: Event queued, libwebsockets handles partial writes internally
### Test 3: Concurrent Connections
```bash
# 100 concurrent connections, each sending events
for i in {1..100}; do
(echo '["REQ","sub'$i'",{}]'; sleep 1) | websocat ws://localhost:8888 &
done
```
**Expected**: All subscriptions work, events broadcast correctly
## Success Criteria
- ✅ No `lws_write()` calls outside `LWS_CALLBACK_SERVER_WRITEABLE`
- ✅ No "write already pending" errors in logs
- ✅ No "Invalid frame header" errors on client side
- ✅ All messages delivered in correct order
- ✅ Large events (>4KB) handled correctly
- ✅ Multiple rapid events to same client work
- ✅ Concurrent connections stable under load
## References
- [libwebsockets documentation](https://libwebsockets.org/lws-api-doc-main/html/index.html)
- [LWS_CALLBACK_SERVER_WRITEABLE](https://libwebsockets.org/lws-api-doc-main/html/group__callback-when-writeable.html)
- [lws_callback_on_writable()](https://libwebsockets.org/lws-api-doc-main/html/group__callback-when-writeable.html#ga96f3ad8e1e2c3e0c8e0b0e5e5e5e5e5e)

View File

@@ -0,0 +1,200 @@
# WebSocket Write Queue Design
## Problem Statement
The current partial write handling implementation uses a single buffer per session, which fails when multiple events need to be sent to the same client in rapid succession. This causes:
1. First event gets partial write → queued successfully
2. Second event tries to write → **FAILS** with "write already pending"
3. Subsequent events fail similarly, causing data loss
### Server Log Evidence
```
[WARN] WS_FRAME_PARTIAL: EVENT partial write, sub=1 sent=3210 expected=5333
[TRACE] Queued partial write: len=2123
[WARN] WS_FRAME_PARTIAL: EVENT partial write, sub=1 sent=3210 expected=5333
[WARN] queue_websocket_write: write already pending, cannot queue new write
[ERROR] Failed to queue partial EVENT write for sub=1
```
## Root Cause
WebSocket frames must be sent **atomically** - you cannot interleave multiple frames. The current single-buffer approach correctly enforces this, but it rejects new writes instead of queuing them.
## Solution: Write Queue Architecture
### Design Principles
1. **Frame Atomicity**: Complete one WebSocket frame before starting the next
2. **Sequential Processing**: Process queued writes in FIFO order
3. **Memory Safety**: Proper cleanup on connection close or errors
4. **Thread Safety**: Protect queue operations with existing session lock
### Data Structures
#### Write Queue Node
```c
struct write_queue_node {
unsigned char* buffer; // Buffer with LWS_PRE space
size_t total_len; // Total length of data to write
size_t offset; // How much has been written so far
int write_type; // LWS_WRITE_TEXT, etc.
struct write_queue_node* next; // Next node in queue
};
```
#### Per-Session Write Queue
```c
struct per_session_data {
// ... existing fields ...
// Write queue for handling multiple pending writes
struct write_queue_node* write_queue_head; // First item to write
struct write_queue_node* write_queue_tail; // Last item in queue
int write_queue_length; // Number of items in queue
int write_in_progress; // Flag: 1 if currently writing
};
```
### Algorithm Flow
#### 1. Enqueue Write (`queue_websocket_write`)
```
IF write_queue is empty AND no write in progress:
- Attempt immediate write with lws_write()
- IF complete:
- Return success
- ELSE (partial write):
- Create queue node with remaining data
- Add to queue
- Set write_in_progress flag
- Request LWS_CALLBACK_SERVER_WRITEABLE
ELSE:
- Create queue node with full data
- Append to queue tail
- IF no write in progress:
- Request LWS_CALLBACK_SERVER_WRITEABLE
```
#### 2. Process Queue (`process_pending_write`)
```
WHILE write_queue is not empty:
- Get head node
- Calculate remaining data (total_len - offset)
- Attempt write with lws_write()
IF write fails (< 0):
- Log error
- Remove and free head node
- Continue to next node
ELSE IF partial write (< remaining):
- Update offset
- Request LWS_CALLBACK_SERVER_WRITEABLE
- Break (wait for next callback)
ELSE (complete write):
- Remove and free head node
- Continue to next node
IF queue is empty:
- Clear write_in_progress flag
```
#### 3. Cleanup (`LWS_CALLBACK_CLOSED`)
```
WHILE write_queue is not empty:
- Get head node
- Free buffer
- Free node
- Move to next
Clear queue pointers
```
### Memory Management
1. **Allocation**: Each queue node allocates buffer with `LWS_PRE + data_len`
2. **Ownership**: Queue owns all buffers until write completes or connection closes
3. **Deallocation**: Free buffer and node when:
- Write completes successfully
- Write fails with error
- Connection closes
### Thread Safety
- Use existing `pss->session_lock` to protect queue operations
- Lock during:
- Enqueue operations
- Dequeue operations
- Queue traversal for cleanup
### Performance Considerations
1. **Queue Length Limit**: Implement max queue length (e.g., 100 items) to prevent memory exhaustion
2. **Memory Pressure**: Monitor total queued bytes per session
3. **Backpressure**: If queue exceeds limit, close connection with NOTICE
### Error Handling
1. **Allocation Failure**: Return error, log, send NOTICE to client
2. **Write Failure**: Remove failed frame, continue with next
3. **Queue Overflow**: Close connection with appropriate NOTICE
## Implementation Plan
### Phase 1: Data Structure Changes
1. Add `write_queue_node` structure to `websockets.h`
2. Update `per_session_data` with queue fields
3. Remove old single-buffer fields
### Phase 2: Queue Operations
1. Implement `enqueue_write()` helper
2. Implement `dequeue_write()` helper
3. Update `queue_websocket_write()` to use queue
4. Update `process_pending_write()` to process queue
### Phase 3: Integration
1. Update all `lws_write()` call sites
2. Update `LWS_CALLBACK_CLOSED` cleanup
3. Add queue length monitoring
### Phase 4: Testing
1. Test with rapid multiple events to same client
2. Test with large events (>4KB)
3. Test under load with concurrent connections
4. Verify no "Invalid frame header" errors
## Expected Outcomes
1. **No More Rejections**: All writes queued successfully
2. **Frame Integrity**: Complete frames sent atomically
3. **Memory Safety**: Proper cleanup on all paths
4. **Performance**: Minimal overhead for queue management
## Metrics to Monitor
1. Average queue length per session
2. Maximum queue length observed
3. Queue overflow events (if limit implemented)
4. Write completion rate
5. Partial write frequency
## Alternative Approaches Considered
### 1. Larger Single Buffer
**Rejected**: Doesn't solve the fundamental problem of multiple concurrent writes
### 2. Immediate Write Retry
**Rejected**: Could cause busy-waiting and CPU waste
### 3. Drop Frames on Conflict
**Rejected**: Violates reliability requirements
## References
- libwebsockets documentation on `lws_write()` and `LWS_CALLBACK_SERVER_WRITEABLE`
- WebSocket RFC 6455 on frame structure
- Nostr NIP-01 on relay-to-client communication