Troubleshooting Guide

Comprehensive troubleshooting guide for Axon OS, covering common issues, diagnostic procedures, and resolution steps.

Quick Diagnosis

System Health Check

# Check Axon OS status
axonos status

# View system resources
axonos health --verbose

# Check all services
axonos services status

# View recent logs
axonos logs --tail 100

Common Issues Quick Reference

Issue	Quick Check	Quick Fix
Service won't start	`axonos status`	`axonos restart`
Database connection	`axonos config test-db`	Check credentials
High memory usage	`axonos monitor memory`	Restart workflows
Slow performance	`axonos monitor performance`	Clear cache
Node errors	`axonos nodes validate`	Reinstall nodes

Service Issues

Axon OS Won't Start

Symptoms

Service fails to start
Port binding errors
Configuration errors

Diagnostic Steps

# Check service status
systemctl status axonos

# View startup logs
journalctl -u axonos -f

# Validate configuration
axonos config validate

# Check port availability
netstat -tlnp | grep :8080

# Verify permissions
ls -la /opt/axonos/

Common Causes & Solutions

Port Already in Use

# Find process using port
lsof -i :8080

# Stop the process or change the port (try a plain SIGTERM first; use kill -9 only if it does not exit)
kill <PID>
# OR update configuration
vim /etc/axonos/config.yml
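
A plain `grep :8080` also matches longer ports such as 80801 and matches the peer-address column, so an exact match on the local port can be more reliable. The following is a minimal sketch, assuming the column layout of `ss -tln` output; `port_listeners` is a hypothetical helper name, not part of Axon OS.

```bash
# Print local addresses listening on exactly the given port.
# Reads `ss -tln` output on stdin; field 4 is the local address:port.
port_listeners() {
  local port="$1"
  awk -v p="$port" 'NR > 1 {
    n = split($4, parts, ":")      # the port is the last ":"-separated piece
    if (parts[n] == p) print $4
  }'
}

# Usage (hypothetical): ss -tln | port_listeners 8080
```

Empty output means nothing is listening on that port.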

Configuration Errors

# Validate configuration syntax
axonos config validate

# Check for missing environment variables
axonos config check-env

# Restore default configuration (back up the current file first)
cp /etc/axonos/config.yml /etc/axonos/config.yml.bak
cp /opt/axonos/config.example.yml /etc/axonos/config.yml

Permission Issues

# Fix ownership
chown -R axonos:axonos /opt/axonos/
chown -R axonos:axonos /var/lib/axonos/

# Fix permissions
chmod 755 /opt/axonos/bin/axonos
chmod 600 /etc/axonos/config.yml

Dependency Issues

# Check PostgreSQL
systemctl status postgresql
systemctl start postgresql

# Check Redis
systemctl status redis
redis-cli ping

# Test database connection
axonos config test-db

Service Crashes

Symptoms

Unexpected service termination
Out of memory errors
Segmentation faults

Diagnostic Steps

# Check crash logs
journalctl -u axonos --since "1 hour ago"

# Check system logs
tail -f /var/log/syslog

# Check memory usage
free -h
ps aux --sort=-%mem | head

# Check disk space
df -h

# Inspect an existing core dump (if core dumps are enabled)
gdb /opt/axonos/bin/axonos core

Solutions

Memory Issues

# Increase memory limits in config.yml
memory:
  max_heap_size: "2g"
  max_workflow_memory: "1g"

# Enable memory monitoring
monitoring:
  memory_alerts: true
  memory_threshold: 80
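
To compare the host against a threshold such as the 80 percent configured above, system-wide memory use can be derived from /proc/meminfo. This is a minimal sketch, assuming a Linux host; `mem_used_percent` is a hypothetical helper, and how Axon OS itself measures memory for `memory_threshold` is not specified in this guide.

```bash
# Print used memory as a whole-number percentage.
# Reads /proc/meminfo-formatted text on stdin (values are in kB).
mem_used_percent() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }
  '
}

# Usage: mem_used_percent < /proc/meminfo
```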

Resource Limits

# Check system limits
ulimit -a

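# Update limits in /etc/security/limits.conf
# (applies to PAM login sessions only; a systemd-managed service needs LimitNOFILE= and LimitNPROC= in its unit file)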
axonos soft nofile 65536
axonos hard nofile 65536
axonos soft nproc 32768
axonos hard nproc 32768
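
`ulimit -a` reports the limits of the current shell, which can differ from those of the running service. As a hedged sketch, the effective limit of a live process can be read from /proc/<pid>/limits on Linux; `soft_nofile` is a hypothetical helper, and the process name axonos is assumed from the commands in this guide.

```bash
# Print the soft "Max open files" limit.
# Reads a /proc/<pid>/limits listing on stdin; the soft limit is field 4.
soft_nofile() {
  awk '/^Max open files/ { print $4 }'
}

# Usage (assumes a process named axonos is running):
#   soft_nofile < "/proc/$(pgrep -o axonos)/limits"
```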

Database Issues

Connection Problems

Symptoms

"Connection refused" errors
"Authentication failed" errors
Timeouts

Diagnostic Steps

# Test database connectivity
psql -h localhost -U axonos_user -d axonos

# Check PostgreSQL status
systemctl status postgresql

# View PostgreSQL logs
tail -f /var/log/postgresql/postgresql-*.log

# Check connection parameters
axonos config show database

Solutions

Connection Refused

# Start PostgreSQL
systemctl start postgresql
systemctl enable postgresql

# Check if PostgreSQL is listening
netstat -tlnp | grep :5432

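# Update postgresql.conf only if Axon OS connects from another host
# ('*' listens on every interface; prefer a specific address. This setting needs a restart, not a reload)
echo "listen_addresses = '*'" >> /etc/postgresql/*/main/postgresql.conf
systemctl restart postgresql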

Authentication Failed

# Reset password
sudo -u postgres psql -c "ALTER USER axonos_user PASSWORD 'new_password';"

# Update pg_hba.conf (the first matching rule wins, so check that no earlier line overrides this one)
echo "host axonos axonos_user 127.0.0.1/32 md5" >> /etc/postgresql/*/main/pg_hba.conf

# Reload configuration
systemctl reload postgresql

Connection Pool Exhaustion

# Increase pool size in config.yml
# (keep max_connections below PostgreSQL's own max_connections, which defaults to 100)
database:
  pool:
    max_connections: 200
    min_connections: 10
    connection_timeout: 60s

Database Performance Issues

Symptoms

Slow query execution
High CPU usage
Lock timeouts

Diagnostic Steps

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

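-- Find slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier the column is named mean_time)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;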

-- Check for locks
SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_statement,
       blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity 
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks 
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
  AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
  AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
  AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
  AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity 
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.GRANTED;

Solutions

Query Optimization

-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_workflows_user_id ON workflows(user_id);
CREATE INDEX CONCURRENTLY idx_executions_status ON executions(status);
CREATE INDEX CONCURRENTLY idx_executions_created_at ON executions(created_at);

-- Update table statistics
ANALYZE;

-- Vacuum tables
VACUUM ANALYZE workflows;
VACUUM ANALYZE executions;

PostgreSQL Tuning

# postgresql.conf optimizations (starting points; the random_page_cost and effective_io_concurrency values assume SSD storage)
shared_buffers = 256MB
effective_cache_size = 1GB
maintenance_work_mem = 64MB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200

Workflow Execution Issues

Workflows Fail to Start

Symptoms

Workflows remain in "pending" status
"Node not found" errors
Resource allocation failures

Diagnostic Steps

# Check workflow status
axonos workflows list --status pending

# View workflow logs
axonos workflows logs <workflow_id>

# Check node availability
axonos nodes list --status available

# Check resource usage
axonos monitor resources

Solutions

Node Registry Issues

# Refresh node registry
axonos nodes refresh

# Reinstall missing nodes
axonos nodes install <node_name>

# Check node dependencies
axonos nodes validate <node_name>

Resource Constraints

# Increase resource limits
workflow:
  max_concurrent_executions: 100
  max_memory_per_execution: "1g"
  max_cpu_per_execution: "2"

Workflow Execution Errors

Symptoms

Workflows fail during execution
Node timeout errors
Data transformation errors

Diagnostic Steps

# Get detailed workflow execution logs
axonos workflows logs <workflow_id> --verbose

# Check node execution logs
axonos nodes logs <node_id>

# View workflow execution timeline
axonos workflows timeline <workflow_id>

# Check data flow
axonos workflows data-flow <workflow_id>

Solutions

Node Timeout Issues

# Increase timeouts in workflow definition
nodes:
  - id: data_processor
    timeout: 300s  # 5 minutes
    retry_attempts: 3
    retry_delay: 30s
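
The retry settings above can be reproduced from the command line when testing a flaky dependency by hand. This is a minimal sketch, assuming `retry_attempts` counts total attempts (the guide does not say whether the first try is included) and a fixed delay between them; `retry` is a hypothetical helper, not an Axon OS command.

```bash
# Run a command up to <attempts> times, sleeping <delay> seconds between tries.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  while true; do
    "$@" && return 0
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage: retry 3 30 redis-cli ping
```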

Memory Issues

# Optimize memory usage
execution:
  memory_optimization: true
  garbage_collection: aggressive
  data_streaming: true

Data Format Issues

// Add data validation
interface NodeInput {
  data: any;
  schema?: JsonSchema;
}

function validateInput(input: NodeInput): boolean {
  if (input.schema) {
    return validate(input.data, input.schema);
  }
  return true;
}

Performance Issues

High CPU Usage

Diagnostic Steps

# Check process CPU usage (pgrep -d, joins multiple PIDs with commas, as top expects)
top -p "$(pgrep -d, axonos)"

# Check workflow execution load
axonos monitor cpu --breakdown

# Profile application for 30 seconds
perf record -g -p "$(pgrep -d, axonos)" -- sleep 30
perf report

Solutions

Optimize Workflow Concurrency

workflow:
  max_concurrent_executions: 50  # Reduce from default
  node_parallelism: 5            # Reduce parallel nodes

Enable CPU Throttling

execution:
  cpu_throttling: true
  cpu_limit_percent: 80

High Memory Usage

Diagnostic Steps

# Check memory usage by component
axonos monitor memory --breakdown

# Check for memory leaks
valgrind --tool=memcheck --leak-check=full axonos

# Monitor garbage collection
axonos monitor gc --real-time

Solutions

Memory Optimization

memory:
  gc_strategy: "aggressive"
  max_heap_size: "2g"
  enable_memory_profiling: true

execution:
  data_streaming: true
  result_caching: false
  memory_cleanup_interval: 60s

Workflow Optimization

// Use streaming for large datasets
async function processLargeDataset(input: DataStream): Promise<DataStream> {
  return input.pipe(
    transform(processChunk),
    batch(1000),
    compress()
  );
}

Slow API Response Times

Diagnostic Steps

# Check API response times
axonos monitor api --metrics

# Profile API endpoints
axonos profile api --duration 60s

# Check database query performance
axonos monitor db --slow-queries

Solutions

Enable Response Caching

api:
  caching:
    enabled: true
    ttl: 300s
    cache_headers: true

cache:
  redis:
    enabled: true
    max_memory: "512mb"

Optimize Database Queries

-- Add composite indexes
CREATE INDEX CONCURRENTLY idx_workflows_user_status 
ON workflows(user_id, status);

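-- PostgreSQL has no built-in query hints (a /*+ ... */ comment is ignored),
-- so confirm with EXPLAIN that the planner uses the composite index
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM workflows WHERE user_id = '<user_id>' AND status = '<status>';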

Network Issues

Connection Timeouts

Symptoms

API requests timeout
WebSocket connections drop
Inter-service communication failures

Diagnostic Steps

# Check network connectivity
ping <target_host>
telnet <target_host> <port>

# Check firewall rules
iptables -L -n

# Monitor network traffic
netstat -i
ss -tulnp

# Check DNS resolution
nslookup <hostname>
dig <hostname>

Solutions

Increase Timeouts

network:
  connect_timeout: 30s
  read_timeout: 60s
  write_timeout: 30s

api:
  request_timeout: 120s
  keepalive_timeout: 75s

Configure Load Balancing

upstream axonos_backend {
    least_conn;
    server axonos-1:8080 max_fails=3 fail_timeout=30s;
    server axonos-2:8080 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

SSL/TLS Issues

Symptoms

Certificate validation errors
Handshake failures
Mixed content warnings

Diagnostic Steps

# Test SSL connection
openssl s_client -connect axonos.example.com:443

# Check certificate validity
openssl x509 -in /etc/ssl/certs/axonos.crt -text -noout

# Verify certificate chain
openssl verify -CAfile /etc/ssl/certs/ca.crt /etc/ssl/certs/axonos.crt

Solutions

Update Certificates

# Renew certificates
certbot renew

# Update certificate in configuration
cp /etc/letsencrypt/live/axonos.example.com/fullchain.pem /etc/ssl/certs/axonos.crt
cp /etc/letsencrypt/live/axonos.example.com/privkey.pem /etc/ssl/private/axonos.key

# Restart service
systemctl restart axonos

Fix Certificate Chain

# Create proper certificate chain
cat /etc/ssl/certs/axonos.crt /etc/ssl/certs/intermediate.crt > /etc/ssl/certs/axonos-chain.crt

Node Development Issues

Node Compilation Errors

Symptoms

TypeScript compilation failures
Runtime import errors
Missing dependencies

Diagnostic Steps

# Check node compilation
axonos nodes compile <node_name>

# Validate node structure
axonos nodes validate <node_name>

# Check dependencies
npm ls --depth=0

# View compilation logs
axonos nodes logs <node_name> --level debug

Solutions

Fix TypeScript Issues

// Ensure proper type definitions
interface NodeInput {
  [key: string]: any;
}

interface NodeOutput {
  [key: string]: any;
}

export default class MyNode implements AxonNode {
  async execute(input: NodeInput): Promise<NodeOutput> {
    // Implementation
    return {};
  }
}

Resolve Dependencies

# Install missing dependencies
npm install --save <dependency>

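# Update installed packages within the version ranges in package.json
npm update

# Clean reinstall (deleting package-lock.json lets versions float; keep it for reproducible installs)
rm -rf node_modules package-lock.json
npm install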

Node Execution Failures

Symptoms

Nodes crash during execution
Timeout errors
Resource allocation failures

Diagnostic Steps

# Check node execution logs
axonos nodes logs <node_id> --execution <execution_id>

# Monitor node resource usage
axonos nodes monitor <node_id>

# Test node in isolation
axonos nodes test <node_name> --input test_data.json

Solutions

Add Error Handling

export default class RobustNode implements AxonNode {
  async execute(input: NodeInput): Promise<NodeOutput> {
    try {
      // Node logic here
      const result = await this.processData(input);
      return { success: true, data: result };
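    } catch (error) {
      // A caught value is `unknown` in strict TypeScript; normalize it to an Error first
      const err = error instanceof Error ? error : new Error(String(error));
      console.error('Node execution failed:', err);
      return {
        success: false,
        error: err.message,
        stack: err.stack
      };
    }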
  }
}

Optimize Resource Usage

// Use streaming for large data
import { Transform } from 'stream';

export default class StreamingNode implements AxonNode {
  async execute(input: NodeInput): Promise<NodeOutput> {
    const transform = new Transform({
      objectMode: true,
      transform(chunk, encoding, callback) {
        // Process chunk
        this.push(processChunk(chunk));
        callback();
      }
    });
    
    return { stream: transform };
  }
}

Log Analysis

Enabling Debug Logging

logging:
  level: "debug"
  components:
    workflow_engine: "debug"
    node_registry: "debug"
    api_server: "info"
    database: "warn"

Log Patterns to Watch

Error Patterns

# Database connection errors
grep "connection refused\|authentication failed" /var/log/axonos/app.log

# Memory issues
grep "OutOfMemoryError\|memory limit exceeded" /var/log/axonos/app.log

# Timeout errors
grep "timeout\|deadline exceeded" /var/log/axonos/app.log

# Permission errors
grep "permission denied\|access denied" /var/log/axonos/app.log

Performance Patterns

# Slow operations (over 1000 ms; assumes the duration is the last field, e.g. "1520ms")
grep "slow query\|execution time" /var/log/axonos/app.log | grep -E "[0-9]+ms" | awk '$NF+0 > 1000'

# High resource usage
grep "cpu usage\|memory usage" /var/log/axonos/app.log | grep -E "[8-9][0-9]%|100%"

Log Analysis Tools

Using `jq` for JSON Logs

# Parse JSON logs
jq 'select(.level == "error")' /var/log/axonos/app.log

# Filter by component
jq 'select(.component == "workflow_engine")' /var/log/axonos/app.log

# Extract error messages
jq -r 'select(.level == "error") | .message' /var/log/axonos/app.log

Custom Log Analysis Script

#!/bin/bash
# analyze_logs.sh

LOG_FILE="/var/log/axonos/app.log"

echo "=== Axon OS Log Analysis ==="
echo "Log file: $LOG_FILE (the entire file is scanned)"
echo

# Error summary
echo "=== Error Summary ==="
grep -E "ERROR|FATAL" "$LOG_FILE" | tail -20

echo -e "\n=== Top Error Types ==="
# Strip everything up to the log-level prefix, then count identical messages
grep -E "ERROR|FATAL" "$LOG_FILE" | \
  sed -E 's/^.*(ERROR|FATAL)[^:]*: //' | \
  sort | uniq -c | sort -nr | head -10

echo -e "\n=== Performance Issues ==="
grep -E "slow|timeout|memory" "$LOG_FILE" | tail -10

echo -e "\n=== Recent Warnings ==="
grep "WARN" "$LOG_FILE" | tail -10
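
To limit the analysis to a recent time window, the log can be pre-filtered by timestamp before it is summarized. This is a minimal sketch, assuming each line starts with a UTC ISO-8601 timestamp such as 2024-01-15T14:30:00Z (the actual Axon OS log format is not shown in this guide); `since_filter` is a hypothetical helper.

```bash
# Keep only lines whose first field is at or after the cutoff timestamp.
# ISO-8601 UTC timestamps of equal length sort correctly as plain strings.
since_filter() {
  local cutoff="$1"
  awk -v since="$cutoff" '($1 "") >= (since "")'
}

# Usage (GNU date):
#   since_filter "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" < /var/log/axonos/app.log
```

Appending an empty string to both operands forces awk to compare them as strings rather than numbers.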

Recovery Procedures

Database Recovery

Backup Restoration

# Stop Axon OS
systemctl stop axonos

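# Restore database from backup
# (pg_restore reads custom, directory, or tar archives created by pg_dump;
#  for a plain-SQL dump, run: psql -h localhost -U axonos_user -d axonos -f <dump.sql>)
pg_restore -h localhost -U axonos_user -d axonos /backups/axonos_backup.sql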

# Verify data integrity
axonos db verify

# Start Axon OS
systemctl start axonos

Point-in-Time Recovery

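# pg_restore cannot restore to a timestamp. Point-in-time recovery needs a base
# backup plus archived WAL, with the target set in postgresql.conf (PostgreSQL 12+):
restore_command = 'cp /backups/wal/%f %p'   # example WAL archive path
recovery_target_time = '2024-01-15 14:30:00'
# Then create an empty recovery.signal file in the data directory and start PostgreSQL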

System Recovery

Configuration Recovery

# Restore configuration from backup
cp /backups/config/axonos.yml /etc/axonos/config.yml

# If no backup is available, reset to factory defaults
axonos config reset --confirm

# Reinitialize system (last resort)
axonos init --force

Emergency Recovery Mode

# Start in recovery mode
axonos --recovery-mode --safe-mode

# Access recovery console
axonos recovery console

# Run recovery commands
recovery> check-integrity
recovery> repair-database
recovery> rebuild-indexes
recovery> exit

Getting Help

Support Channels

Documentation: docs.axonos.dev
Community Forum: community.axonos.dev
Discord: discord.gg/axonos
GitHub Issues: github.com/axonos/axonos/issues

Creating Support Tickets

Required Information

# System information
axonos system info > system_info.txt

# Configuration (sanitized)
axonos config export --sanitize > config_export.yml

# Recent logs
axonos logs --since 1h > recent_logs.txt

# Performance metrics
axonos monitor report --duration 1h > performance_report.json
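
The files collected above can be packed into a single archive for attaching to a ticket. This is a minimal sketch, assuming `tar` with gzip support is available; `bundle_diagnostics` is a hypothetical helper and the archive name is only an example.

```bash
# Pack diagnostic files into one gzip-compressed tar archive.
# Refuses to create the archive if any input file is missing or empty.
bundle_diagnostics() {
  local out="$1"
  shift
  local f missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] || return 1
  tar -czf "$out" "$@"
}

# Usage:
#   bundle_diagnostics axonos_support.tar.gz system_info.txt config_export.yml \
#     recent_logs.txt performance_report.json
```

Checking the inputs first avoids sending a bundle in which one of the diagnostic commands silently produced no output.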

Support Template

## Issue Description
Brief description of the issue

## Environment
- Axon OS Version: 
- Operating System: 
- Database Version: 
- Node.js Version: 

## Steps to Reproduce
1. 
2. 
3. 

## Expected Behavior
What should happen

## Actual Behavior
What actually happens

## Logs and Diagnostics
[Attach log files and diagnostic outputs]

## Additional Context
Any other relevant information
