Troubleshooting Guide
Comprehensive troubleshooting guide for Axon OS, covering common issues, diagnostic procedures, and resolution steps.
Quick Diagnosis
System Health Check
# Check Axon OS status
axonos status
# View system resources
axonos health --verbose
# Check all services
axonos services status
# View recent logs
axonos logs --tail 100
Common Issues Quick Reference
Issue | Quick Check | Quick Fix |
---|---|---|
Service won't start | axonos status | axonos restart |
Database connection | axonos config test-db | Check credentials |
High memory usage | axonos monitor memory | Restart workflows |
Slow performance | axonos monitor performance | Clear cache |
Node errors | axonos nodes validate | Reinstall nodes |
Service Issues
Axon OS Won't Start
Symptoms
- Service fails to start
- Port binding errors
- Configuration errors
Diagnostic Steps
# Check service status
systemctl status axonos
# View startup logs
journalctl -u axonos -f
# Validate configuration
axonos config validate
# Check port availability
netstat -tlnp | grep :8080
# Verify permissions
ls -la /opt/axonos/
Common Causes & Solutions
- Port Already in Use
# Find process using port
lsof -i :8080
# Stop the process (plain kill sends SIGTERM; use kill -9 only if it does not exit)
kill <PID>
# OR update configuration
vim /etc/axonos/config.yml
- Configuration Errors
# Validate configuration syntax
axonos config validate
# Check for missing environment variables
axonos config check-env
# Restore default configuration (back up the current file first)
cp /etc/axonos/config.yml /etc/axonos/config.yml.bak
cp /opt/axonos/config.example.yml /etc/axonos/config.yml
- Permission Issues
# Fix ownership
chown -R axonos:axonos /opt/axonos/
chown -R axonos:axonos /var/lib/axonos/
# Fix permissions
chmod 755 /opt/axonos/bin/axonos
chmod 600 /etc/axonos/config.yml
- Dependency Issues
# Check PostgreSQL
systemctl status postgresql
systemctl start postgresql
# Check Redis
systemctl status redis
redis-cli ping
# Test database connection
axonos config test-db
Service Crashes
Symptoms
- Unexpected service termination
- Out of memory errors
- Segmentation faults
Diagnostic Steps
# Check crash logs
journalctl -u axonos --since "1 hour ago"
# Check system logs
tail -f /var/log/syslog
# Check memory usage
free -h
ps aux --sort=-%mem | head
# Check disk space
df -h
# Inspect an existing core dump (if core dumps are enabled)
gdb /opt/axonos/bin/axonos core
Solutions
- Memory Issues
# Increase memory limits in config.yml
memory:
  max_heap_size: "2g"
  max_workflow_memory: "1g"
# Enable memory monitoring
monitoring:
  memory_alerts: true
  memory_threshold: 80
- Resource Limits
# Check system limits
ulimit -a
# Update limits in /etc/security/limits.conf
axonos soft nofile 65536
axonos hard nofile 65536
axonos soft nproc 32768
axonos hard nproc 32768
Database Issues
Connection Problems
Symptoms
- "Connection refused" errors
- "Authentication failed" errors
- Timeouts
Diagnostic Steps
# Test database connectivity
psql -h localhost -U axonos_user -d axonos
# Check PostgreSQL status
systemctl status postgresql
# View PostgreSQL logs
tail -f /var/log/postgresql/postgresql-*.log
# Check connection parameters
axonos config show database
Solutions
- Connection Refused
# Start PostgreSQL
systemctl start postgresql
systemctl enable postgresql
# Check if PostgreSQL is listening
netstat -tlnp | grep :5432
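# Update postgresql.conf ('*' listens on all interfaces; restrict access in pg_hba.conf or a firewall)
echo "listen_addresses = '*'" >> /etc/postgresql/*/main/postgresql.conf
# listen_addresses only takes effect after a restart
systemctl restart postgresql
- Authentication Failed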
# Reset password
sudo -u postgres psql
ALTER USER axonos_user PASSWORD 'new_password';
# Update pg_hba.conf
echo "host axonos axonos_user 127.0.0.1/32 md5" >> /etc/postgresql/*/main/pg_hba.conf
# Reload configuration
systemctl reload postgresql
- Connection Pool Exhaustion
# Increase pool size in config.yml
database:
  pool:
    max_connections: 200
    min_connections: 10
    connection_timeout: 60s
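Raising the pool size only helps if PostgreSQL can accept that many connections. The sketch below shows the arithmetic, assuming one or more Axon OS instances share a single PostgreSQL server. The function name and the instance count are hypothetical; the default reserve of 3 matches PostgreSQL's default for `superuser_reserved_connections`.

```typescript
// Hypothetical helper: checks whether the combined pool size of all
// application instances fits within PostgreSQL's max_connections.
interface PoolBudget {
  required: number;  // connections the pools may open in total
  available: number; // connections PostgreSQL leaves for ordinary clients
  fits: boolean;
}

function checkPoolBudget(
  poolMax: number,              // database.pool.max_connections in config.yml
  instances: number,            // number of Axon OS instances (assumption)
  serverMaxConnections: number, // max_connections in postgresql.conf
  reserved: number = 3          // superuser_reserved_connections default
): PoolBudget {
  const required = poolMax * instances;
  const available = serverMaxConnections - reserved;
  return { required, available, fits: required <= available };
}

// A pool of 200 against a server left at PostgreSQL's default of 100
const budget = checkPoolBudget(200, 1, 100);
console.log(budget.fits); // false
```

If the check fails, either raise `max_connections` on the PostgreSQL side or keep the pool smaller.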
Database Performance Issues
Symptoms
- Slow query execution
- High CPU usage
- Lock timeouts
Diagnostic Steps
-- Check active connections
SELECT count(*) FROM pg_stat_activity;
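-- Find slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier the column is named mean_time)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;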
-- Check for locks
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;
Solutions
- Query Optimization
-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_workflows_user_id ON workflows(user_id);
CREATE INDEX CONCURRENTLY idx_executions_status ON executions(status);
CREATE INDEX CONCURRENTLY idx_executions_created_at ON executions(created_at);
-- Update table statistics
ANALYZE;
-- Vacuum tables
VACUUM ANALYZE workflows;
VACUUM ANALYZE executions;
- PostgreSQL Tuning
# postgresql.conf optimizations (the random_page_cost and effective_io_concurrency values assume SSD storage)
shared_buffers = 256MB
effective_cache_size = 1GB
maintenance_work_mem = 64MB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
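The memory values above are fixed numbers, so they need scaling to the host. The sketch below derives starting points from total RAM. The 25% figure for `shared_buffers` follows the starting value the PostgreSQL documentation suggests for a dedicated server; the 75% figure for `effective_cache_size` is only a commonly quoted rule of thumb and is an assumption here, as is the function name.

```typescript
// Hypothetical helper: derives starting values for two memory settings
// from the total RAM of a dedicated database host. Treat the results as
// a first guess to benchmark, not as tuned values.
interface MemorySettings {
  shared_buffers: string;
  effective_cache_size: string;
}

function suggestMemorySettings(totalRamMb: number): MemorySettings {
  if (totalRamMb <= 0) {
    throw new RangeError("totalRamMb must be positive");
  }
  const sharedBuffersMb = Math.floor(totalRamMb * 0.25);      // 25% of RAM
  const effectiveCacheSizeMb = Math.floor(totalRamMb * 0.75); // assumed 75%
  return {
    shared_buffers: `${sharedBuffersMb}MB`,
    effective_cache_size: `${effectiveCacheSizeMb}MB`,
  };
}

// A host with 4 GB of RAM
console.log(suggestMemorySettings(4096));
```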
Workflow Execution Issues
Workflows Fail to Start
Symptoms
- Workflows remain in "pending" status
- "Node not found" errors
- Resource allocation failures
Diagnostic Steps
# Check workflow status
axonos workflows list --status pending
# View workflow logs
axonos workflows logs <workflow_id>
# Check node availability
axonos nodes list --status available
# Check resource usage
axonos monitor resources
Solutions
- Node Registry Issues
# Refresh node registry
axonos nodes refresh
# Reinstall missing nodes
axonos nodes install <node_name>
# Check node dependencies
axonos nodes validate <node_name>
- Resource Constraints
# Increase resource limits
workflow:
  max_concurrent_executions: 100
  max_memory_per_execution: "1g"
  max_cpu_per_execution: "2"
Workflow Execution Errors
Symptoms
- Workflows fail during execution
- Node timeout errors
- Data transformation errors
Diagnostic Steps
# Get detailed workflow execution logs
axonos workflows logs <workflow_id> --verbose
# Check node execution logs
axonos nodes logs <node_id>
# View workflow execution timeline
axonos workflows timeline <workflow_id>
# Check data flow
axonos workflows data-flow <workflow_id>
Solutions
- Node Timeout Issues
# Increase timeouts in workflow definition
nodes:
  - id: data_processor
    timeout: 300s # 5 minutes
    retry_attempts: 3
    retry_delay: 30s
- Memory Issues
# Optimize memory usage
execution:
  memory_optimization: true
  garbage_collection: aggressive
  data_streaming: true
- Data Format Issues
// Add data validation
interface NodeInput {
  data: any;
  schema?: JsonSchema;
}
function validateInput(input: NodeInput): boolean {
  if (input.schema) {
    return validate(input.data, input.schema);
  }
  return true;
}
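The validation snippet above uses `JsonSchema` and `validate` without defining them. The sketch below is a minimal, self-contained stand-in, assuming a much-reduced schema that lists only required fields and primitive types. `SimpleSchema` and `findSchemaErrors` are hypothetical names; a real deployment would more likely use a JSON Schema library.

```typescript
// Reduced schema for illustration: required field names and expected
// primitive types. This is not JSON Schema.
type FieldType = "string" | "number" | "boolean" | "object";

interface SimpleSchema {
  required: string[];
  types: { [field: string]: FieldType };
}

// Returns a list of problems; an empty list means the input passed.
function findSchemaErrors(data: unknown, schema: SimpleSchema): string[] {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["input must be an object"];
  }
  const record = data as { [field: string]: unknown };
  const errors: string[] = [];
  for (const field of schema.required) {
    if (!(field in record)) {
      errors.push(`missing field: ${field}`);
    }
  }
  for (const field of Object.keys(schema.types)) {
    const expected = schema.types[field];
    // Note: typeof reports "object" for arrays and null as well
    if (field in record && typeof record[field] !== expected) {
      errors.push(`field ${field}: expected ${expected}, got ${typeof record[field]}`);
    }
  }
  return errors;
}

const schema: SimpleSchema = { required: ["id", "name"], types: { id: "string" } };
console.log(findSchemaErrors({ id: 1 }, schema));
// [ 'missing field: name', 'field id: expected string, got number' ]
```

Returning the list of problems instead of a bare boolean makes "Data transformation errors" easier to trace to a specific field; a boolean check is then `findSchemaErrors(data, schema).length === 0`.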
Performance Issues
High CPU Usage
Diagnostic Steps
# Check process CPU usage (pgrep -d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, axonos)"
# Check workflow execution load
axonos monitor cpu --breakdown
# Profile application for 30 seconds
perf record -p "$(pgrep -d, axonos)" -g -- sleep 30
perf report
Solutions
- Optimize Workflow Concurrency
workflow:
  max_concurrent_executions: 50 # Reduce from default
  node_parallelism: 5 # Reduce parallel nodes
- Enable CPU Throttling
execution:
  cpu_throttling: true
  cpu_limit_percent: 80
High Memory Usage
Diagnostic Steps
# Check memory usage by component
axonos monitor memory --breakdown
# Check for memory leaks
valgrind --tool=memcheck --leak-check=full axonos
# Monitor garbage collection
axonos monitor gc --real-time
Solutions
- Memory Optimization
memory:
  gc_strategy: "aggressive"
  max_heap_size: "2g"
  enable_memory_profiling: true
execution:
  data_streaming: true
  result_caching: false
  memory_cleanup_interval: 60s
- Workflow Optimization
// Use streaming for large datasets
async function processLargeDataset(input: DataStream): Promise<DataStream> {
  return input.pipe(
    transform(processChunk),
    batch(1000),
    compress()
  );
}
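The pipeline above relies on `transform`, `batch`, and `compress` helpers that are not defined in this guide. As a self-contained illustration of the batching step only, the sketch below uses a generator so that at most one batch is held in memory at a time. `inBatches` is a hypothetical name, not an Axon OS API.

```typescript
// Yields arrays of at most `size` items without materializing the whole
// input, so memory use is bounded by the batch size.
function* inBatches<T>(items: Iterable<T>, size: number): Generator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let current: T[] = [];
  for (const item of items) {
    current.push(item);
    if (current.length === size) {
      yield current;
      current = [];
    }
  }
  // Emit the final, possibly shorter, batch
  if (current.length > 0) {
    yield current;
  }
}

for (const chunk of inBatches([1, 2, 3, 4, 5], 2)) {
  console.log(chunk); // [ 1, 2 ] then [ 3, 4 ] then [ 5 ]
}
```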
Slow API Response Times
Diagnostic Steps
# Check API response times
axonos monitor api --metrics
# Profile API endpoints
axonos profile api --duration 60s
# Check database query performance
axonos monitor db --slow-queries
Solutions
- Enable Response Caching
api:
  caching:
    enabled: true
    ttl: 300s
    cache_headers: true
cache:
  redis:
    enabled: true
    max_memory: "512mb"
- Optimize Database Queries
-- Add composite indexes
CREATE INDEX CONCURRENTLY idx_workflows_user_status
  ON workflows(user_id, status);
-- PostgreSQL has no built-in optimizer hints (USE_INDEX is not recognized),
-- so confirm that the planner picks the index with EXPLAIN instead
EXPLAIN ANALYZE
SELECT * FROM workflows WHERE user_id = <user_id> AND status = '<status>';
Network Issues
Connection Timeouts
Symptoms
- API requests timeout
- WebSocket connections drop
- Inter-service communication failures
Diagnostic Steps
# Check network connectivity
ping <target_host>
telnet <target_host> <port>
# Check firewall rules
iptables -L -n
# Monitor network traffic
netstat -i
ss -tulnp
# Check DNS resolution
nslookup <hostname>
dig <hostname>
Solutions
- Increase Timeouts
network:
  connect_timeout: 30s
  read_timeout: 60s
  write_timeout: 30s
api:
  request_timeout: 120s
  keepalive_timeout: 75s
- Configure Load Balancing
upstream axonos_backend {
  least_conn;
  server axonos-1:8080 max_fails=3 fail_timeout=30s;
  server axonos-2:8080 max_fails=3 fail_timeout=30s;
  keepalive 32;
}
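The `least_conn` directive sends each new request to the backend with the fewest active connections. The sketch below shows that selection rule in simplified form, ignoring the server weights and failure tracking (`max_fails`, `fail_timeout`) that a real load balancer also applies. The `Backend` type and the connection counts are illustrative assumptions.

```typescript
interface Backend {
  name: string;
  activeConnections: number;
}

// Picks the backend with the fewest active connections.
// Ties go to the backend listed first.
function pickLeastConnections(backends: Backend[]): Backend {
  if (backends.length === 0) {
    throw new Error("no backends available");
  }
  return backends.reduce((best, candidate) =>
    candidate.activeConnections < best.activeConnections ? candidate : best
  );
}

const chosen = pickLeastConnections([
  { name: "axonos-1:8080", activeConnections: 12 },
  { name: "axonos-2:8080", activeConnections: 7 },
]);
console.log(chosen.name); // axonos-2:8080
```

Compared with round-robin, this rule keeps a backend that is stuck on slow requests from receiving an equal share of new ones.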
SSL/TLS Issues
Symptoms
- Certificate validation errors
- Handshake failures
- Mixed content warnings
Diagnostic Steps
# Test SSL connection
openssl s_client -connect axonos.example.com:443
# Check certificate validity
openssl x509 -in /etc/ssl/certs/axonos.crt -text -noout
# Verify certificate chain
openssl verify -CAfile /etc/ssl/certs/ca.crt /etc/ssl/certs/axonos.crt
Solutions
- Update Certificates
# Renew certificates
certbot renew
# Update certificate in configuration
cp /etc/letsencrypt/live/axonos.example.com/fullchain.pem /etc/ssl/certs/axonos.crt
cp /etc/letsencrypt/live/axonos.example.com/privkey.pem /etc/ssl/private/axonos.key
# Restart service
systemctl restart axonos
- Fix Certificate Chain
# Create proper certificate chain (server certificate first, then intermediates)
cat /etc/ssl/certs/axonos.crt /etc/ssl/certs/intermediate.crt > /etc/ssl/certs/axonos-chain.crt
Node Development Issues
Node Compilation Errors
Symptoms
- TypeScript compilation failures
- Runtime import errors
- Missing dependencies
Diagnostic Steps
# Check node compilation
axonos nodes compile <node_name>
# Validate node structure
axonos nodes validate <node_name>
# Check dependencies
npm ls --depth=0
# View compilation logs
axonos nodes logs <node_name> --level debug
Solutions
- Fix TypeScript Issues
// Ensure proper type definitions
interface NodeInput {
  [key: string]: any;
}
interface NodeOutput {
  [key: string]: any;
}
export default class MyNode implements AxonNode {
  async execute(input: NodeInput): Promise<NodeOutput> {
    // Implementation
    return {};
  }
}
- Resolve Dependencies
# Install missing dependencies
npm install --save <dependency>
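# Update installed packages within the version ranges in package.json
npm update
# Reinstall from scratch (deleting package-lock.json re-resolves every version)
rm -rf node_modules package-lock.json
npm install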
Node Execution Failures
Symptoms
- Nodes crash during execution
- Timeout errors
- Resource allocation failures
Diagnostic Steps
# Check node execution logs
axonos nodes logs <node_id> --execution <execution_id>
# Monitor node resource usage
axonos nodes monitor <node_id>
# Test node in isolation
axonos nodes test <node_name> --input test_data.json
Solutions
- Add Error Handling
export default class RobustNode implements AxonNode {
  async execute(input: NodeInput): Promise<NodeOutput> {
    try {
      // Node logic here
      const result = await this.processData(input);
      return { success: true, data: result };
    } catch (error) {
      // Under strict TypeScript the caught value is `unknown`, so narrow it first
      const err = error instanceof Error ? error : new Error(String(error));
      console.error('Node execution failed:', err);
      return {
        success: false,
        error: err.message,
        stack: err.stack
      };
    }
  }
}
- Optimize Resource Usage
// Use streaming for large data
import { Transform } from 'stream';
export default class StreamingNode implements AxonNode {
  async execute(input: NodeInput): Promise<NodeOutput> {
    const transform = new Transform({
      objectMode: true,
      transform(chunk, encoding, callback) {
        // Process chunk
        this.push(processChunk(chunk));
        callback();
      }
    });
    return { stream: transform };
  }
}
Log Analysis
Enabling Debug Logging
logging:
  level: "debug"
  components:
    workflow_engine: "debug"
    node_registry: "debug"
    api_server: "info"
    database: "warn"
Log Patterns to Watch
Error Patterns
# Database connection errors
grep "connection refused\|authentication failed" /var/log/axonos/app.log
# Memory issues
grep "OutOfMemoryError\|memory limit exceeded" /var/log/axonos/app.log
# Timeout errors
grep "timeout\|deadline exceeded" /var/log/axonos/app.log
# Permission errors
grep "permission denied\|access denied" /var/log/axonos/app.log
Performance Patterns
# Slow operations (assumes the duration, e.g. 1532ms, is the last field; "+ 0" forces a numeric comparison)
grep "slow query\|execution time" /var/log/axonos/app.log | grep -E "[0-9]+ms" | awk '$NF + 0 > 1000'
# High resource usage
grep "cpu usage\|memory usage" /var/log/axonos/app.log | grep -E "[8-9][0-9]%|100%"
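The shell filters above depend on where the number sits in each line. The sketch below performs the slow-operation check in TypeScript without that dependency, assuming durations are written as an integer immediately followed by `ms`. The function name and the sample log lines are hypothetical, since the actual log format is not shown in this guide.

```typescript
// Returns the lines whose first "<digits>ms" duration exceeds the threshold.
function findSlowOperations(lines: string[], thresholdMs: number): string[] {
  const durationPattern = /(\d+)ms\b/;
  return lines.filter((line) => {
    const match = durationPattern.exec(line);
    return match !== null && Number(match[1]) > thresholdMs;
  });
}

const sample = [
  "slow query took 1532ms",       // hypothetical log lines
  "execution time 200ms",
  "workflow finished",
];
console.log(findSlowOperations(sample, 1000));
// [ 'slow query took 1532ms' ]
```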
Log Analysis Tools
Using jq for JSON Logs
# Parse JSON logs
cat /var/log/axonos/app.log | jq 'select(.level == "error")'
# Filter by component
cat /var/log/axonos/app.log | jq 'select(.component == "workflow_engine")'
# Extract error messages
cat /var/log/axonos/app.log | jq -r 'select(.level == "error") | .message'
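When the same filtering is needed inside a script or a node, the JSON log can be summarized in TypeScript as well. The sketch below counts error entries per component, assuming one JSON object per line with the `level` and `component` fields used in the `jq` filters above. The function name is hypothetical, and lines that are not valid JSON are skipped.

```typescript
interface LogEntry {
  level?: string;
  component?: string;
  message?: string;
}

// Counts entries with level "error", grouped by component.
function countErrorsByComponent(rawLog: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of rawLog.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // not JSON, e.g. a stray stack-trace line
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const entry = parsed as LogEntry;
    if (entry.level !== "error") continue;
    const component = entry.component ?? "unknown";
    counts.set(component, (counts.get(component) ?? 0) + 1);
  }
  return counts;
}

const sampleLog = [
  '{"level":"error","component":"workflow_engine","message":"node timeout"}',
  '{"level":"info","component":"api_server","message":"request served"}',
  '{"level":"error","component":"workflow_engine","message":"node not found"}',
].join("\n");
console.log(countErrorsByComponent(sampleLog).get("workflow_engine")); // 2
```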
Custom Log Analysis Script
#!/bin/bash
# analyze_logs.sh
LOG_FILE="/var/log/axonos/app.log"
TIME_WINDOW="1h" # Label only: the greps below scan the whole file, not just this window
echo "=== Axon OS Log Analysis ==="
echo "Time window: last $TIME_WINDOW"
echo "Log file: $LOG_FILE"
echo
# Error summary
echo "=== Error Summary ==="
grep -E "ERROR|FATAL" "$LOG_FILE" | tail -20
echo -e "\n=== Top Error Types ==="
grep -E "ERROR|FATAL" "$LOG_FILE" | \
sed -E 's/.*(ERROR|FATAL)[^:]*: //' | \
sort | uniq -c | sort -nr | head -10
echo -e "\n=== Performance Issues ==="
grep -E "slow|timeout|memory" "$LOG_FILE" | tail -10
echo -e "\n=== Recent Warnings ==="
grep "WARN" "$LOG_FILE" | tail -10
Recovery Procedures
Database Recovery
Backup Restoration
# Stop Axon OS
systemctl stop axonos
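# Restore database from backup (a plain-SQL dump is loaded with psql;
# pg_restore only reads custom, directory, or tar archives from pg_dump)
psql -h localhost -U axonos_user -d axonos -f /backups/axonos_backup.sql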
# Verify data integrity
axonos db verify
# Start Axon OS
systemctl start axonos
Point-in-Time Recovery
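# pg_restore has no timestamp option. Point-in-time recovery (PostgreSQL 12+)
# needs a base backup plus archived WAL: restore the base backup, add the
# settings below to postgresql.conf (the archive path is an assumption),
# create recovery.signal in the data directory, then start PostgreSQL.
restore_command = 'cp /backups/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00'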
System Recovery
Configuration Recovery
# Restore configuration from backup
cp /backups/config/axonos.yml /etc/axonos/config.yml
# OR reset to factory defaults
axonos config reset --confirm
# Reinitialize system
axonos init --force
Emergency Recovery Mode
# Start in recovery mode
axonos --recovery-mode --safe-mode
# Access recovery console
axonos recovery console
# Run recovery commands
recovery> check-integrity
recovery> repair-database
recovery> rebuild-indexes
recovery> exit
Getting Help
Support Channels
- Documentation: docs.axonos.dev
- Community Forum: community.axonos.dev
- Discord: discord.gg/axonos
- GitHub Issues: github.com/axonos/axonos/issues
Creating Support Tickets
Required Information
# System information
axonos system info > system_info.txt
# Configuration (sanitized)
axonos config export --sanitize > config_export.yml
# Recent logs
axonos logs --since 1h > recent_logs.txt
# Performance metrics
axonos monitor report --duration 1h > performance_report.json
Support Template
## Issue Description
Brief description of the issue
## Environment
- Axon OS Version:
- Operating System:
- Database Version:
- Node.js Version:
## Steps to Reproduce
1.
2.
3.
## Expected Behavior
What should happen
## Actual Behavior
What actually happens
## Logs and Diagnostics
[Attach log files and diagnostic outputs]
## Additional Context
Any other relevant information