Shardeum Documentation

Troubleshooting Common Validator Issues

Running a validator node can sometimes present challenges. Here’s a guide to troubleshooting some common issues you might encounter with your Shardeum node.

Node Stability & Unexpected Stops

Issue: Your validator node stops unexpectedly or fails to stay running.

Possible Causes & Solutions

  1. Network Version Mismatch:

    • Symptom: Node logs show errors like invoke-exit: exitUncleanly: isReadyToJoin: Not ready to join, or the node stops shortly after starting, especially after a known network upgrade.
    • Cause: The Shardeum network (testnet, stagenet) has been upgraded to a new version, and your validator is running an older, incompatible Docker image.
    • Solution:
      1. Check official Shardeum channels (Discord, Telegram, forums) for announcements about the latest validator image tag for the specific network you're on.
      2. Pull the new Docker image: docker pull ghcr.io/shardeum/shardeum-validator:NEW_TAG
      3. Update your docker-compose.yml to use the NEW_TAG.
      4. Recreate your container: docker-compose down && docker-compose up -d (see the upgrade sketch after this list).
  2. Port Accessibility Issues:

    • Symptom: Node status shows state: stopped with exitMessage: Unable to access external or internal ports... or the node is stuck in waiting-for-network.
    • Cause: The network cannot reach your validator on its configured P2P ports (SHMINT, SHMEXT).
    • Solution:
      1. Verify your docker-compose.yml: Ensure correct port mapping (e.g., "9001:9001", "10001:10001"). The host port and container port for SHMINT and SHMEXT must match the values set in the environment variables.
      2. Firewall: Check your server's firewall (e.g., ufw on Ubuntu, cloud provider security groups). Ensure the host ports used for SHMINT and SHMEXT are open for incoming TCP traffic from the internet.
      3. Public IP: Confirm your node is correctly detecting its public IP. If it's behind a complex NAT or VPN, this might be an issue.
      4. Test reachability: curl http://YOUR_SERVER_PUBLIC_IP:HOST_PORT_FOR_SHMEXT/nodeinfo (a fuller port-check sketch follows this list).
  3. Insufficient System Resources:

    • Symptom: Node crashes, becomes unresponsive, or logs show out-of-memory errors.
    • Cause: Your server doesn't meet the minimum CPU, RAM, or disk space requirements.
    • Solution: Upgrade your server resources according to Shardeum's official recommendations. Ensure ample free disk space.
  4. Corrupted Data / Database Issues:

    • Symptom: Node fails to start, errors related to database files in logs.
    • Cause: Improper shutdown, disk errors, or other issues might corrupt the node's local data.
    • Solution:
      1. Back up secrets.json! This file contains your validator's identity.
      2. You can try stopping the node, removing the contents of the data directory (the one mapped in your docker-compose.yml volume) except for secrets.json, and restarting; the node will then resync (see the reset sketch after this list).
      3. If secrets.json is corrupted and you have no backup, the stake associated with that specific validator identity might be difficult to recover without the programmatic unstaking methods (see advanced guides).
  5. Software Bugs:

    • Symptom: Unexplained crashes, persistent errors even with correct configuration.
    • Cause: A bug in the current validator software version.
    • Solution:
      1. Check official Shardeum channels for any known issues or patches.
      2. Provide detailed logs to the Shardeum team (see below).
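
The examples below sketch one way to carry out some of the fixes above; adjust them to your own setup. For the version mismatch in item 1, assuming you run Docker Compose from the directory containing your validator's docker-compose.yml and NEW_TAG stands for the tag announced for your network:

  # Pull the image tag announced for your network (NEW_TAG is a placeholder)
  docker pull ghcr.io/shardeum/shardeum-validator:NEW_TAG

  # Point docker-compose.yml at the new tag (edit by hand or with sed, and
  # verify the change), then recreate the container
  sed -i 's|shardeum-validator:.*|shardeum-validator:NEW_TAG|' docker-compose.yml
  docker-compose down && docker-compose up -d

  # Confirm the container is now running the new image
  docker ps --format '{{.Image}} {{.Status}}'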
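
For the port issues in item 2, a rough reachability check, assuming ufw as the firewall and SHMEXT on host port 9001 with SHMINT on 10001 (substitute your actual ports and public IP):

  # Open the SHMEXT/SHMINT host ports for incoming TCP traffic
  sudo ufw allow 9001/tcp
  sudo ufw allow 10001/tcp
  sudo ufw status

  # Confirm something is listening on those ports on the host
  ss -tlnp | grep -E ':9001|:10001'

  # From a machine outside your network, confirm the node answers on SHMEXT
  curl http://YOUR_SERVER_PUBLIC_IP:9001/nodeinfo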
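
For the data reset in item 4, a sketch assuming your data directory is mounted at /path/to/validator-data on the host (replace with the path from your docker-compose.yml volume) and that secrets.json sits at its top level; make sure the backup succeeds before deleting anything:

  # Stop the node
  docker-compose down

  # Back up the validator identity BEFORE touching anything else
  cp /path/to/validator-data/secrets.json ~/secrets.json.bak

  # Remove everything at the top level of the data directory except secrets.json
  find /path/to/validator-data -mindepth 1 -maxdepth 1 ! -name 'secrets.json' -exec rm -rf {} +

  # Restart; the node will resync from the network
  docker-compose up -d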

Providing Logs for Troubleshooting

When you encounter an issue, detailed logs are invaluable for diagnosis.

Key Logs

  1. PM2 Logs (inside the container): PM2 manages the validator processes within the Docker container.

    • Access the container shell: docker exec -it <your_container_name_or_id> /bin/bash (or shardeum/shell.sh if available in your config volume).
    • View logs: pm2 logs (tails live logs).
    • To get all logs for PM2 itself and the processes it manages (validator, operator-gui):
      pm2 logs --nostream > /home/node/config/pm2_all_logs.txt
      Then copy this pm2_all_logs.txt file from your mapped volume on the host (see the collection sketch after this list).
    • Specific log files are usually in /home/node/.pm2/logs/:
      `validator-error.log`
      `validator-out.log`
      `operator-gui-error.log`
      `operator-gui-out.log`
  2. Operator CLI Output: If you're running operator-cli commands, copy the full command and its complete output, including any error messages.
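
As a rough end-to-end sketch of collecting these logs from the host, assuming <container> is your container name or ID and pm2 is on the PATH for the container's default user:

  # Dump all PM2-managed process logs to a file inside the container
  docker exec <container> bash -c "pm2 logs --nostream > /home/node/config/pm2_all_logs.txt"

  # Copy the dump (and the raw PM2 log directory) out to the host
  docker cp <container>:/home/node/config/pm2_all_logs.txt ./pm2_all_logs.txt
  docker cp <container>:/home/node/.pm2/logs ./pm2-logs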

How to Share Logs

  1. Copy the relevant text from the log files or console.
  2. For larger log snippets, save them to a .txt file.
  3. When reporting an issue, provide the following (a small collection script is sketched after this list):
    • A clear description of what you were doing.
    • The exact command(s) you ran.
    • The full output/error message.
    • Relevant log file contents.
    • The validator image tag you are using (e.g., stagenet-v1.19.0).
    • Your node's public key (publicKey from /nodeinfo or operator-cli status).
    • Your nominator address (the wallet address used for staking).
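
If it helps, a small script along these lines can gather several of those details in one place; the container name, the SHMEXT port, and the assumption that operator-cli is available inside the container should all be adjusted to your setup:

  #!/usr/bin/env bash
  # Collect basic details for a Shardeum validator issue report (adjust placeholders).
  CONTAINER=shardeum-validator   # assumed container name
  SHMEXT_PORT=9001               # assumed SHMEXT host port

  echo "== Validator image tag =="
  docker inspect --format '{{.Config.Image}}' "$CONTAINER"

  echo "== Node info (includes publicKey) =="
  curl -s "http://localhost:${SHMEXT_PORT}/nodeinfo"

  echo "== Operator CLI status =="
  docker exec "$CONTAINER" operator-cli status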

Specific Error Messages

  1. Transaction timestamp out of range:
    • Usually means your server's clock is out of sync with the network. Ensure your system time is synchronized using NTP (see the time-sync sketch at the end of this list).
    • Can also be caused by high network latency or issues with the RPC endpoint.
  2. Error: No stake found (during unstake):
    • The wallet you're using (identified by its private key) has no SHM staked to any node.
    • A previous unstake attempt for this stake might have succeeded (even if the CLI seemed stuck). Check your wallet balance on the explorer.
  3. This node is in the network's Standby list. You can unstake only after the node leaves the Standby list! (or similar messages for active / ready state):
    • You are trying to unstake a node that has not been properly stopped or has not yet passed its stake lock period. Follow the correct unstaking sequence: wait for the node to reach standby, run operator-cli stop, then wait for stakeState.unlocked: true and stakeState.remainingTime: 0 before running operator-cli unstake (see the command sequence at the end of this list).
  4. Stake amount is less than minimum required stake amount:
    • You are trying to stake less than the network's minimum requirement (e.g., less than 2400 SHM on stagenet).
  5. AxiosError: timeout of XXXXms exceeded / Unable to fetch data from network (out of retries: unknown reason) (during stake/status):
    • Could be temporary network issues or problems with the archivers/RPC endpoint. Try again after a few minutes.
    • Can also occur if the staker wallet has insufficient funds (the error message isn't always direct for this case).
    • If trying to stake, and stakeable.reason in operator-cli status shows "Network request failed, allowing stake by default", it means the CLI couldn't verify the 30-min staking cooldown due to network issues. The stake might still fail if you're within the cooldown.
  6. TypeError: Cannot read properties of null (reading 'status') (from operator-cli status):
    • The node process might have crashed or is not running properly. Check pm2 logs inside the container.
    • Could also indicate an issue with the secrets.json file or the node's ability to read its own state.
  7. Failed to execute unstake transaction: Error: processing response error (body={\"jsonrpc\":\"2.0\",\"id\":53,\"error\":{\"code\":101,...}}):
    • This is a generic RPC error from the network; the actual cause is in the "message" field of the JSON body, for example: "message":"This node is still selected in the network. You can unstake only after the node leaves the network!".
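
For the timestamp errors in item 1, on a systemd-based server you can check and enable time synchronization roughly like this:

  # Check whether the system clock is NTP-synchronized
  timedatectl status

  # Enable NTP synchronization if it is off (uses systemd-timesyncd)
  sudo timedatectl set-ntp true

  # Alternatively, install chrony and verify it is tracking a time source
  sudo apt install -y chrony && chronyc tracking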
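
And for the standby/active unstake errors in item 3, the sequence described above might look like this in practice (the waiting between steps is manual; keep re-running status until the fields change):

  # 1. Wait until the node has left the active set and is in standby
  operator-cli status

  # 2. Stop the node
  operator-cli stop

  # 3. Re-run status until stakeState.unlocked is true and stakeState.remainingTime is 0
  operator-cli status

  # 4. Only then unstake
  operator-cli unstake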
