Troubleshooting Guide

This guide covers common Axon OS issues, with diagnostic procedures and resolution steps for each.

Quick Diagnosis

System Health Check

# Check Axon OS status
axonos status

# View system resources
axonos health --verbose

# Check all services
axonos services status

# View recent logs
axonos logs --tail 100
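
The checks above can be run in one pass and summarized. The following is a minimal sketch, not part of the Axon OS CLI: `run_checks` is a hypothetical helper that runs each command string it is given, prints OK or FAIL per command, and returns the number of failures.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run each command string, report OK/FAIL,
# and return the number of failed commands as the exit status.
run_checks() {
  local failures=0 cmd
  for cmd in "$@"; do
    if bash -c "$cmd" >/dev/null 2>&1; then
      printf 'OK    %s\n' "$cmd"
    else
      printf 'FAIL  %s\n' "$cmd"
      failures=$((failures + 1))
    fi
  done
  return "$failures"
}

# Usage with the commands from this section:
# run_checks "axonos status" "axonos health --verbose" "axonos services status"
```

A non-zero exit status tells you how many checks failed, which makes the helper easy to call from a cron job or a monitoring probe.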

Common Issues Quick Reference

Issue               | Quick Check                | Quick Fix
Service won't start | axonos status              | axonos restart
Database connection | axonos config test-db      | Check credentials
High memory usage   | axonos monitor memory      | Restart workflows
Slow performance    | axonos monitor performance | Clear cache
Node errors         | axonos nodes validate      | Reinstall nodes

Service Issues

Axon OS Won't Start

Symptoms

  • Service fails to start
  • Port binding errors
  • Configuration errors

Diagnostic Steps

# Check service status
systemctl status axonos

# View startup logs
journalctl -u axonos -f

# Validate configuration
axonos config validate

# Check port availability
netstat -tlnp | grep :8080

# Verify permissions
ls -la /opt/axonos/
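
To script the port check, the listening-socket table can be parsed instead of read by eye. The sketch below assumes the column layout of `ss -tln` (state in column 1, local address and port in column 4); `port_listening` is a hypothetical helper name, and port 8080 is the port used in the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `ss -tln` output on stdin and succeed
# if any LISTEN socket is bound to the given port.
port_listening() {
  local port="$1"
  awk -v suffix=":$port" '
    $1 == "LISTEN" {
      # Compare only the trailing ":<port>" so 8080 does not match 80.
      start = length($4) - length(suffix) + 1
      if (start >= 1 && substr($4, start) == suffix) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage:
# ss -tln | port_listening 8080 && echo "port 8080 is already in use"
```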

Common Causes & Solutions

  1. Port Already in Use

    # Find process using port
    lsof -i :8080

    # Stop the process (try a graceful kill first; use kill -9 only if it does not exit)
    kill <PID>
    # OR change the port in the configuration
    vim /etc/axonos/config.yml
  2. Configuration Errors

    # Validate configuration syntax
    axonos config validate

    # Check for missing environment variables
    axonos config check-env

    # Restore default configuration (back up the current file first)
    cp /etc/axonos/config.yml /etc/axonos/config.yml.bak
    cp /opt/axonos/config.example.yml /etc/axonos/config.yml
  3. Permission Issues

    # Fix ownership
    chown -R axonos:axonos /opt/axonos/
    chown -R axonos:axonos /var/lib/axonos/

    # Fix permissions
    chmod 755 /opt/axonos/bin/axonos
    chmod 600 /etc/axonos/config.yml
  4. Dependency Issues

    # Check PostgreSQL
    systemctl status postgresql
    systemctl start postgresql

    # Check Redis
    systemctl status redis
    redis-cli ping

    # Test database connection
    axonos config test-db

Service Crashes

Symptoms

  • Unexpected service termination
  • Out of memory errors
  • Segmentation faults

Diagnostic Steps

# Check crash logs
journalctl -u axonos --since "1 hour ago"

# Check system logs
tail -f /var/log/syslog

# Check memory usage
free -h
ps aux --sort=-%mem | head

# Check disk space
df -h

# Analyze an existing core dump (if core dumps are enabled)
gdb /opt/axonos/bin/axonos core
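
A full disk is a common cause of crashes, so the `df` check is worth automating. This is a minimal sketch that assumes the POSIX output format of `df -P` (capacity in column 5, mount point in column 6, no spaces in mount paths); `disks_over` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose usage is at or above the given percentage.
disks_over() {
  local threshold="$1"
  awk -v t="$threshold" '
    NR > 1 {
      use = $5
      sub(/%$/, "", use)          # "93%" -> "93"
      if (use + 0 >= t + 0) print $6, $5
    }
  '
}

# Usage:
# df -P | disks_over 90
```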

Solutions

  1. Memory Issues

    # Increase memory limits in config.yml
    memory:
      max_heap_size: "2g"
      max_workflow_memory: "1g"

    # Enable memory monitoring
    monitoring:
      memory_alerts: true
      memory_threshold: 80
  2. Resource Limits

    # Check system limits
    ulimit -a

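    # Update limits in /etc/security/limits.conf
    # (applies to PAM login sessions; for a systemd-managed service, set
    # LimitNOFILE= and LimitNPROC= in the unit file instead)
    axonos soft nofile 65536
    axonos hard nofile 65536
    axonos soft nproc 32768
    axonos hard nproc 32768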
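
After changing limits, it helps to confirm what the running process actually received, since `ulimit -a` only reports the current shell. The sketch below reads the Linux `/proc/<pid>/limits` file; `soft_nofile` and `nofile_ok` are hypothetical helper names, and the process name `axonos` is taken from the commands in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: inspect the open-file limit in a
# /proc/<pid>/limits file (Linux only).

# Print the soft "Max open files" limit from the given limits file.
soft_nofile() {
  awk '/^Max open files/ { print $4 }' "$1"
}

# Succeed if the soft limit is "unlimited" or at least the given minimum.
nofile_ok() {
  local soft
  soft="$(soft_nofile "$1")"
  [ "$soft" = "unlimited" ] || [ "${soft:-0}" -ge "$2" ]
}

# Usage (pgrep -o selects the oldest matching process):
# nofile_ok "/proc/$(pgrep -o axonos)/limits" 65536 || echo "open-file limit too low"
```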

Database Issues

Connection Problems

Symptoms

  • "Connection refused" errors
  • "Authentication failed" errors
  • Timeouts

Diagnostic Steps

# Test database connectivity
psql -h localhost -U axonos_user -d axonos

# Check PostgreSQL status
systemctl status postgresql

# View PostgreSQL logs
tail -f /var/log/postgresql/postgresql-*.log

# Check connection parameters
axonos config show database

Solutions

  1. Connection Refused

    # Start PostgreSQL
    systemctl start postgresql
    systemctl enable postgresql

    # Check if PostgreSQL is listening
    netstat -tlnp | grep :5432

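    # Update postgresql.conf only if Axon OS connects from another host
    # ('*' listens on every interface; changing listen_addresses requires a restart)
    echo "listen_addresses = '*'" >> /etc/postgresql/*/main/postgresql.conf
    systemctl restart postgresql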
  2. Authentication Failed

    # Reset password
    sudo -u postgres psql -c "ALTER USER axonos_user PASSWORD 'new_password';"

    # Update pg_hba.conf
    echo "host axonos axonos_user 127.0.0.1/32 md5" >> /etc/postgresql/*/main/pg_hba.conf

    # Reload configuration
    systemctl reload postgresql
  3. Connection Pool Exhaustion

    # Increase pool size in config.yml
    # (keep max_connections below PostgreSQL's own max_connections setting, which defaults to 100)
    database:
      pool:
        max_connections: 200
        min_connections: 10
        connection_timeout: 60s

Database Performance Issues

Symptoms

  • Slow query execution
  • High CPU usage
  • Lock timeouts

Diagnostic Steps

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

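-- Find slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 13 and later the column is mean_exec_time, not mean_time)
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;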

-- Check for locks
SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_statement,
       blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
  AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
  AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
  AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
  AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

Solutions

  1. Query Optimization

    -- Add missing indexes
    CREATE INDEX CONCURRENTLY idx_workflows_user_id ON workflows(user_id);
    CREATE INDEX CONCURRENTLY idx_executions_status ON executions(status);
    CREATE INDEX CONCURRENTLY idx_executions_created_at ON executions(created_at);

    -- Update table statistics
    ANALYZE;

    -- Vacuum tables
    VACUUM ANALYZE workflows;
    VACUUM ANALYZE executions;
  2. PostgreSQL Tuning

    # postgresql.conf optimizations
    shared_buffers = 256MB
    effective_cache_size = 1GB
    maintenance_work_mem = 64MB
    checkpoint_completion_target = 0.7
    wal_buffers = 16MB
    default_statistics_target = 100
    random_page_cost = 1.1
    effective_io_concurrency = 200

Workflow Execution Issues

Workflows Fail to Start

Symptoms

  • Workflows remain in "pending" status
  • "Node not found" errors
  • Resource allocation failures

Diagnostic Steps

# Check workflow status
axonos workflows list --status pending

# View workflow logs
axonos workflows logs <workflow_id>

# Check node availability
axonos nodes list --status available

# Check resource usage
axonos monitor resources

Solutions

  1. Node Registry Issues

    # Refresh node registry
    axonos nodes refresh

    # Reinstall missing nodes
    axonos nodes install <node_name>

    # Check node dependencies
    axonos nodes validate <node_name>
  2. Resource Constraints

    # Increase resource limits
    workflow:
      max_concurrent_executions: 100
      max_memory_per_execution: "1g"
      max_cpu_per_execution: "2"

Workflow Execution Errors

Symptoms

  • Workflows fail during execution
  • Node timeout errors
  • Data transformation errors

Diagnostic Steps

# Get detailed workflow execution logs
axonos workflows logs <workflow_id> --verbose

# Check node execution logs
axonos nodes logs <node_id>

# View workflow execution timeline
axonos workflows timeline <workflow_id>

# Check data flow
axonos workflows data-flow <workflow_id>

Solutions

  1. Node Timeout Issues

    # Increase timeouts in workflow definition
    nodes:
      - id: data_processor
        timeout: 300s # 5 minutes
        retry_attempts: 3
        retry_delay: 30s
  2. Memory Issues

    # Optimize memory usage
    execution:
      memory_optimization: true
      garbage_collection: aggressive
      data_streaming: true
  3. Data Format Issues

    // Add data validation
    interface NodeInput {
      data: any;
      schema?: JsonSchema;
    }

    function validateInput(input: NodeInput): boolean {
      if (input.schema) {
        return validate(input.data, input.schema);
      }
      return true;
    }
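
When reproducing an intermittent node failure from the command line, the same retry-with-delay idea as `retry_attempts` and `retry_delay` can be applied to any diagnostic command. The following is a sketch under one assumption: the attempt count here means total tries, which may differ from how Axon OS counts `retry_attempts`. `retry` is a hypothetical helper, not an Axon OS command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command up to <attempts> times in total,
# sleeping <delay> seconds between tries. Succeeds on the first success.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1          # every attempt failed
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage: up to 3 tries, 30 seconds apart
# retry 3 30 axonos nodes validate <node_name>
```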

Performance Issues

High CPU Usage

Diagnostic Steps

# Check process CPU usage (pgrep -d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, axonos)"

# Check workflow execution load
axonos monitor cpu --breakdown

# Profile application for 30 seconds
perf record -p "$(pgrep -d, axonos)" -g -- sleep 30
perf report

Solutions

  1. Optimize Workflow Concurrency

    workflow:
      max_concurrent_executions: 50 # Reduce from default
      node_parallelism: 5 # Reduce parallel nodes
  2. Enable CPU Throttling

    execution:
      cpu_throttling: true
      cpu_limit_percent: 80

High Memory Usage

Diagnostic Steps

# Check memory usage by component
axonos monitor memory --breakdown

# Check for memory leaks
valgrind --tool=memcheck --leak-check=full axonos

# Monitor garbage collection
axonos monitor gc --real-time

Solutions

  1. Memory Optimization

    memory:
      gc_strategy: "aggressive"
      max_heap_size: "2g"
      enable_memory_profiling: true

    execution:
      data_streaming: true
      result_caching: false
      memory_cleanup_interval: 60s
  2. Workflow Optimization

    // Use streaming for large datasets
    async function processLargeDataset(input: DataStream): Promise<DataStream> {
      return input.pipe(
        transform(processChunk),
        batch(1000),
        compress()
      );
    }

Slow API Response Times

Diagnostic Steps

# Check API response times
axonos monitor api --metrics

# Profile API endpoints
axonos profile api --duration 60s

# Check database query performance
axonos monitor db --slow-queries

Solutions

  1. Enable Response Caching

    api:
      caching:
        enabled: true
        ttl: 300s
        cache_headers: true

    cache:
      redis:
        enabled: true
        max_memory: "512mb"
  2. Optimize Database Queries

    -- Add composite indexes
    CREATE INDEX CONCURRENTLY idx_workflows_user_status
      ON workflows(user_id, status);

    -- PostgreSQL has no built-in query hints; confirm the planner uses the index instead
    EXPLAIN (ANALYZE)
    SELECT * FROM workflows WHERE user_id = <user_id> AND status = '<status>';

Network Issues

Connection Timeouts

Symptoms

  • API requests timeout
  • WebSocket connections drop
  • Inter-service communication failures

Diagnostic Steps

# Check network connectivity
ping <target_host>
telnet <target_host> <port>

# Check firewall rules
iptables -L -n

# Monitor network traffic
netstat -i
ss -tulnp

# Check DNS resolution
nslookup <hostname>
dig <hostname>

Solutions

  1. Increase Timeouts

    network:
      connect_timeout: 30s
      read_timeout: 60s
      write_timeout: 30s

    api:
      request_timeout: 120s
      keepalive_timeout: 75s
  2. Configure Load Balancing

    upstream axonos_backend {
        least_conn;
        server axonos-1:8080 max_fails=3 fail_timeout=30s;
        server axonos-2:8080 max_fails=3 fail_timeout=30s;
        keepalive 32;
    }

SSL/TLS Issues

Symptoms

  • Certificate validation errors
  • Handshake failures
  • Mixed content warnings

Diagnostic Steps

# Test SSL connection
openssl s_client -connect axonos.example.com:443

# Check certificate validity
openssl x509 -in /etc/ssl/certs/axonos.crt -text -noout

# Verify certificate chain
openssl verify -CAfile /etc/ssl/certs/ca.crt /etc/ssl/certs/axonos.crt

Solutions

  1. Update Certificates

    # Renew certificates
    certbot renew

    # Update certificate in configuration
    cp /etc/letsencrypt/live/axonos.example.com/fullchain.pem /etc/ssl/certs/axonos.crt
    cp /etc/letsencrypt/live/axonos.example.com/privkey.pem /etc/ssl/private/axonos.key

    # Restart service
    systemctl restart axonos
  2. Fix Certificate Chain

    # Create proper certificate chain
    cat /etc/ssl/certs/axonos.crt /etc/ssl/certs/intermediate.crt > /etc/ssl/certs/axonos-chain.crt

Node Development Issues

Node Compilation Errors

Symptoms

  • TypeScript compilation failures
  • Runtime import errors
  • Missing dependencies

Diagnostic Steps

# Check node compilation
axonos nodes compile <node_name>

# Validate node structure
axonos nodes validate <node_name>

# Check dependencies
npm ls --depth=0

# View compilation logs
axonos nodes logs <node_name> --level debug

Solutions

  1. Fix TypeScript Issues

    // Ensure proper type definitions
    interface NodeInput {
      [key: string]: any;
    }

    interface NodeOutput {
      [key: string]: any;
    }

    export default class MyNode implements AxonNode {
      async execute(input: NodeInput): Promise<NodeOutput> {
        // Implementation
        return {};
      }
    }
  2. Resolve Dependencies

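    # Install missing dependencies
    npm install --save <dependency>

    # Update installed packages within the version ranges in package.json
    npm update

    # Reinstall from scratch (deleting package-lock.json lets dependency versions change)
    rm -rf node_modules package-lock.json
    npm install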

Node Execution Failures

Symptoms

  • Nodes crash during execution
  • Timeout errors
  • Resource allocation failures

Diagnostic Steps

# Check node execution logs
axonos nodes logs <node_id> --execution <execution_id>

# Monitor node resource usage
axonos nodes monitor <node_id>

# Test node in isolation
axonos nodes test <node_name> --input test_data.json

Solutions

  1. Add Error Handling

    export default class RobustNode implements AxonNode {
      async execute(input: NodeInput): Promise<NodeOutput> {
        try {
          // Node logic here
          const result = await this.processData(input);
          return { success: true, data: result };
        } catch (error) {
          // A caught value is `unknown` under strict TypeScript; normalize it first
          const err = error instanceof Error ? error : new Error(String(error));
          console.error('Node execution failed:', err);
          return {
            success: false,
            error: err.message,
            stack: err.stack
          };
        }
      }
    }
  2. Optimize Resource Usage

    // Use streaming for large data
    import { Transform } from 'stream';

    export default class StreamingNode implements AxonNode {
      async execute(input: NodeInput): Promise<NodeOutput> {
        const transform = new Transform({
          objectMode: true,
          transform(chunk, encoding, callback) {
            // Process chunk
            this.push(processChunk(chunk));
            callback();
          }
        });

        return { stream: transform };
      }
    }

Log Analysis

Enabling Debug Logging

logging:
  level: "debug"
  components:
    workflow_engine: "debug"
    node_registry: "debug"
    api_server: "info"
    database: "warn"

Log Patterns to Watch

Error Patterns

# Database connection errors (-i matches regardless of capitalization)
grep -i "connection refused\|authentication failed" /var/log/axonos/app.log

# Memory issues
grep -i "OutOfMemoryError\|memory limit exceeded" /var/log/axonos/app.log

# Timeout errors
grep -i "timeout\|deadline exceeded" /var/log/axonos/app.log

# Permission errors
grep -i "permission denied\|access denied" /var/log/axonos/app.log

Performance Patterns

# Slow operations (assumes the duration, e.g. "1500ms", is the last field on the line)
grep "slow query\|execution time" /var/log/axonos/app.log | grep -E "[0-9]+ms" | awk '$NF + 0 > 1000'

# High resource usage
grep "cpu usage\|memory usage" /var/log/axonos/app.log | grep -E "[8-9][0-9]%|100%"
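
The slow-operation filter can be wrapped in a function with an adjustable threshold. This is a sketch that assumes each log line ends with a duration such as `1500ms`; the real Axon OS log layout may differ, and `slow_ops` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read log lines on stdin and print those whose
# last field is a duration in milliseconds above the threshold
# (default 1000).
slow_ops() {
  awk -v t="${1:-1000}" '
    # "$NF + 0" converts "1500ms" to the number 1500.
    ($NF ~ /^[0-9]+ms$/) && ($NF + 0 > t + 0)
  '
}

# Usage:
# grep "slow query\|execution time" /var/log/axonos/app.log | slow_ops 1000
```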

Log Analysis Tools

Using jq for JSON Logs

# Parse JSON logs
jq 'select(.level == "error")' /var/log/axonos/app.log

# Filter by component
jq 'select(.component == "workflow_engine")' /var/log/axonos/app.log

# Extract error messages
jq -r 'select(.level == "error") | .message' /var/log/axonos/app.log

Custom Log Analysis Script

#!/bin/bash
# analyze_logs.sh

LOG_FILE="/var/log/axonos/app.log"

echo "=== Axon OS Log Analysis ==="
echo "Log file: $LOG_FILE (entire file is scanned)"
echo

# Error summary
echo "=== Error Summary ==="
grep -E "ERROR|FATAL" "$LOG_FILE" | tail -20

echo -e "\n=== Top Error Types ==="
grep -E "ERROR|FATAL" "$LOG_FILE" | \
sed -E 's/^.*(ERROR|FATAL)[^:]*: //' | \
sort | uniq -c | sort -nr | head -10

echo -e "\n=== Performance Issues ==="
grep -E "slow|timeout|memory" "$LOG_FILE" | tail -10

echo -e "\n=== Recent Warnings ==="
grep "WARN" "$LOG_FILE" | tail -10
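
The "Top Error Types" step is the part of this script most sensitive to log layout, so it is worth isolating and testing on sample lines. The sketch below assumes lines of the form `<timestamp> ERROR [component]: message`; `top_error_types` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read log lines on stdin and print the most
# frequent ERROR/FATAL messages as "<count><TAB><message>".
top_error_types() {
  grep -E 'ERROR|FATAL' |
    # Strip everything up to and including "ERROR ...: " or "FATAL ...: ".
    sed -E 's/^.*(ERROR|FATAL)[^:]*: //' |
    sort | uniq -c | sort -nr | head -n "${1:-10}" |
    # Normalize the padded "uniq -c" output to tab-separated columns.
    awk '{ count = $1; $1 = ""; sub(/^ +/, ""); print count "\t" $0 }'
}

# Usage:
# top_error_types 10 < /var/log/axonos/app.log
```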

Recovery Procedures

Database Recovery

Backup Restoration

# Stop Axon OS
systemctl stop axonos

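# Restore database from backup
# (pg_restore reads custom-, directory-, or tar-format archives from pg_dump;
# replay a plain SQL dump with psql -f instead)
pg_restore -h localhost -U axonos_user -d axonos /backups/axonos_backup.sql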

# Verify data integrity
axonos db verify

# Start Axon OS
systemctl start axonos
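
Which restore tool to use depends on how the backup was created: `pg_restore` reads pg_dump archive formats, while a plain SQL dump is replayed with `psql -f`. The sketch below guesses the format from the file itself, relying on the `PGDMP` signature at the start of custom-format archives. `dump_format` is a hypothetical helper, and tar-format archives are not detected (they are reported as `other`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a PostgreSQL backup as "directory",
# "custom", or "other" (plain SQL, tar, or unknown).
dump_format() {
  local file="$1"
  if [ -d "$file" ]; then
    echo "directory"                 # pg_dump -Fd output
  elif [ "$(head -c 5 "$file" 2>/dev/null)" = "PGDMP" ]; then
    echo "custom"                    # pg_dump -Fc output
  else
    echo "other"
  fi
}

# Usage:
# case "$(dump_format /backups/axonos_backup.sql)" in
#   custom|directory) echo "use pg_restore" ;;
#   *)                echo "likely plain SQL: use psql -f" ;;
# esac
```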

Point-in-Time Recovery

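# pg_restore has no timestamp option; point-in-time recovery needs a base backup plus archived WAL.
# After restoring the base backup into the data directory (PostgreSQL 12+), set in postgresql.conf:
#   restore_command = 'cp <wal_archive_dir>/%f %p'
#   recovery_target_time = '2024-01-15 14:30:00'
touch <data_directory>/recovery.signal
systemctl start postgresql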

System Recovery

Configuration Recovery

# Restore configuration from backup
cp /backups/config/axonos.yml /etc/axonos/config.yml

# OR reset to factory defaults
axonos config reset --confirm

# Reinitialize system
axonos init --force

Emergency Recovery Mode

# Start in recovery mode
axonos --recovery-mode --safe-mode

# Access recovery console
axonos recovery console

# Run recovery commands
recovery> check-integrity
recovery> repair-database
recovery> rebuild-indexes
recovery> exit

Getting Help

Support Channels

  1. Documentation: docs.axonos.dev
  2. Community Forum: community.axonos.dev
  3. Discord: discord.gg/axonos
  4. GitHub Issues: github.com/axonos/axonos/issues

Creating Support Tickets

Required Information

# System information
axonos system info > system_info.txt

# Configuration (sanitized)
axonos config export --sanitize > config_export.yml

# Recent logs
axonos logs --since 1h > recent_logs.txt

# Performance metrics
axonos monitor report --duration 1h > performance_report.json
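
The four files above are easier to attach as a single archive. This is a minimal sketch using `tar`; `bundle_diagnostics` is a hypothetical helper that refuses to build the archive if any input file is missing or empty, so an incomplete bundle is caught before it is sent.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pack the given files into a gzip-compressed tar
# archive, failing if any file is missing or empty.
bundle_diagnostics() {
  local archive="$1" f
  shift
  for f in "$@"; do
    [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
  done
  tar -czf "$archive" "$@"
}

# Usage:
# bundle_diagnostics axonos_support.tar.gz \
#   system_info.txt config_export.yml recent_logs.txt performance_report.json
```

Review `recent_logs.txt` for credentials or tokens before sharing the archive, since only the configuration export is sanitized.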

Support Template

## Issue Description
Brief description of the issue

## Environment
- Axon OS Version:
- Operating System:
- Database Version:
- Node.js Version:

## Steps to Reproduce
1.
2.
3.

## Expected Behavior
What should happen

## Actual Behavior
What actually happens

## Logs and Diagnostics
[Attach log files and diagnostic outputs]

## Additional Context
Any other relevant information
