Cassini — Cronos Incentivised Testnet Event Highlights
Cassini, the Cronos Incentivised Testnet, is a competition aimed at stress testing Cronos in a practical, real-world setting before the mainnet launch on 8 November. It is a crucial step in preparation for the Cronos mainnet launch. The competition was open to the general public, particularly DApp users and developers, who participated as Testers or Builders. It ran for two weeks, from 28 September 2021 to 11 October 2021.
Cronos is the Ethereum Virtual Machine (EVM) chain running in parallel to the Crypto.org Chain. It scales the DeFi and decentralised application (DApp) ecosystem by giving developers the ability to instantly port apps from Ethereum and EVM-compatible chains. We have achieved remarkable progress since the first launch of the Cronos testnet back in July and its update in August. Before the official release of the Cronos mainnet, the Cassini event was a valuable experience for the Cronos team, as it provided us with an opportunity to observe the network's behaviour under stress. The dry run brought to light some expected issues as well as newly identified areas for improvement. These lessons will support our preparation for the Cronos mainnet launch.
What Were the Tasks in Cassini?
Cassini was divided into three phases:
1. Phase 0: Registration Period and Preparation (28 September to 18 October): participants signed up as a Builder or a Tester of their choice
2. Phase 1: Competition Begins (5 October to 11 October): the incentivised testnet launched, and Builders and Testers performed their tasks accordingly
3. Phase 2: Network Attack and Bug Finding (11 October to 18 October): the attack phase, in which participants were encouraged to launch attacks on the network as permitted by the Official Rules and Terms
Throughout the competition, participants were scored based on the completion of different tasks. Both Testers and Builders had various tasks to perform, earning scores according to the reward mechanism for each specific task, including but not limited to the total number of transactions conducted, the validity of their attacks, and the number of transactions on their smart contracts.
Statistics
The team received 7,723 registrations within the first three days after the announcement of the dry run. In total, we received 13,748 registrations, including those from long-time supporters of Crypto.org and blockchain enthusiasts with solid experience in setting up validator nodes on other blockchains. Among the registered participants, 12,510 joined the network as Testers and 122 joined as Builders.
By the end of the event, 200,000 blocks had been proposed and 15 million transactions broadcast to the network in just over 4 weeks, creating a highly stressed environment that helped us identify the network's capacity.
What We Learned
The Cassini dry run not only helped us prepare for the Cronos mainnet; it also let us observe and monitor the network's behaviour closely and discover issues that could help make Tendermint and the Cosmos SDK more secure.
Mempool Improvements and Network Pressure on Full Nodes
Before the launch of Cassini, the network was serving far fewer participants. The much higher load brought by the competition led to a few technical problems, and the network could not reflect its status in real time. On investigation, we found that the heavy traffic caused the mempool to fill with as many as 5,000 transactions. This put intense pressure on full nodes, which in turn caused synchronization problems between nodes and the network.
The team acted quickly and tried several fixes. First, we reduced the mempool size to 2,500 to prevent the P2P process from being blocked, which proved effective in alleviating the issue. Additionally, we upgraded all nodes to more robust instance types, namely c5a.8xlarge and c5a.4xlarge. During the process, we also tried setting mempool.recheck to false to re-activate the mempool, but this did not work as expected because committed transactions could not be flushed.
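The mempool adjustments above correspond to settings in Tendermint's config.toml. The following is a sketch of the relevant section, assuming the key names from Tendermint v0.34-era configuration:

```toml
# config.toml, [mempool] section (sketch; exact keys depend on the
# Tendermint version in use)
[mempool]
# Reduced from the common default of 5000 to keep the P2P process
# from being blocked under heavy load
size = 2500
# Disabling recheck was also tried, but committed transactions were
# not flushed from the mempool, so recheck stays enabled
recheck = true
```

Lowering `size` bounds how many pending transactions each node buffers, trading some throughput for stability under sustained traffic.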
These improvements will be further supported by the release of Tendermint v0.35, which includes a new p2p layer and mempool improvements, equipping transactions with priority levels and enabling timeout features.
Stabilizing the Performance of the Cassini Faucet
Some participants reported that the Cassini faucet was not functioning as expected: its performance was inconsistent, and it was not always able to allocate testing tokens. After investigating, we identified that the transaction failures were caused by a sequence mismatch. In addition, the nodes behind the faucet were not always able to remain synchronized.
Having identified the root causes, we made changes to the node handling the faucet: instead of sending a higher volume of transactions, the faucet was adjusted to point to a single node for transaction handling. We also worked to resolve the synchronization issue on that node.
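The sequence-mismatch class of failure can be illustrated with a small sketch. In Cosmos-based chains, each account transaction carries a strictly increasing sequence number, so a faucet that serves concurrent requests must hand out sequences atomically. The helper below is hypothetical (it is not the actual faucet code); it shows the idea of tracking the sequence locally and serializing its increments:

```python
# Hypothetical sketch: track the faucet account's sequence number locally
# and increment it under a lock, so concurrent faucet requests never reuse
# a sequence and trigger a sequence-mismatch rejection.
import threading

class FaucetSequencer:
    def __init__(self, start_sequence: int):
        # start_sequence would be fetched once from the single node the
        # faucet points at; here it is just a constructor argument.
        self._sequence = start_sequence
        self._lock = threading.Lock()

    def next_sequence(self) -> int:
        # Hand out strictly increasing sequence numbers, one per tx.
        with self._lock:
            seq = self._sequence
            self._sequence += 1
            return seq

sequencer = FaucetSequencer(start_sequence=42)
print([sequencer.next_sequence() for _ in range(3)])  # [42, 43, 44]
```

Pointing all sends at a single node keeps the locally tracked sequence consistent with what that node's mempool has seen, which is harder to guarantee when transactions fan out across several nodes.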
Eventually, the faucet worked more smoothly than before. Cassini helped the team identify the faucet's inefficiencies so that we could roll out further improvements.
Improvement on the Explorer
Problem 1: Explorer went down when the node was not synchronized
During the Cassini dry run, the explorer could not be displayed when the node was not synchronized: once the node synchronization issue occurred, the explorer went down and kept rebooting. This was due to failing health checks, which caused Kubernetes (k8s) to restart the pod.
Following that, we made a quick decision to scale up the explorer resources as well as the internal full nodes behind it. The health check was removed to keep the explorer page highly available, although the page was then not always in sync.
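In Kubernetes terms, the fix amounts to removing the liveness probe that treated sync lag as a crash. The fragment below is a hypothetical sketch (the container name, image, port, and `/health` path are illustrative, not taken from the actual deployment):

```yaml
# Hypothetical explorer Deployment fragment; names and paths are illustrative.
# The liveness probe that checked sync status is removed, so k8s no longer
# restarts the pod while its backing node catches up. A readiness probe can
# remain, so an out-of-sync pod is taken out of rotation without a restart.
containers:
  - name: explorer
    image: example/cronos-explorer:latest
    readinessProbe:
      httpGet:
        path: /health
        port: 3000
      periodSeconds: 10
      failureThreshold: 6
    # livenessProbe intentionally omitted: sync lag is not a crash condition
```

Keeping a readiness probe while dropping the liveness probe is one way to preserve availability without serving stale data from a pod that is still catching up.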
Problem 2: Explorer was not indexing new blocks/transactions (lagging)
Another issue we encountered was that the explorer lagged in syncing and reflecting the real block data. It failed to display live data and only showed blocks composed a few hours earlier. Upon investigation, we found that many queries timed out and returned error responses.
After debugging, we identified the load balancer's idle connection timeout period as one of the bottlenecks; adjusting it fixed the node sync issue on the internal full nodes for a while. However, as network traffic grew, the problem reappeared because the internal full nodes could not accept any more connections.
Future Optimisations on the Explorer
Following the lessons learned from running the explorer against a high-traffic network, the following improvements can be implemented in our future networks:
First of all, going forward, we will separate the indexing service and the web app service of the explorer. The indexing service can only have one running instance, so we will scale it vertically. As for the web app service, we will scale it horizontally (by adding multiple instances) and vertically, as deemed necessary. Secondly, we can run separate nodes dedicated to tracing behind a load balancer, although this is costly. Finally, for production environments, we will monitor the load on the internal full nodes that the explorer runs against, and we will adjust the concurrency settings to limit the batch size and the number of connections that the indexer creates against those nodes. This will slow down the indexing process but will ensure that the full nodes keep accepting connections and do not encounter failures.
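The concurrency limits described above can be sketched as follows. This is hypothetical indexer code (the real explorer's internals are not shown in this post); the point is that a connection cap plus a small batch size keeps the full nodes reachable for other clients:

```python
# Hypothetical sketch of throttling an indexer: a semaphore caps concurrent
# connections to the full nodes, and blocks are fetched in small batches so
# the nodes keep accepting connections from other clients.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS = 4   # illustrative cap on simultaneous node connections
BATCH_SIZE = 10       # illustrative number of blocks fetched per batch

node_connections = threading.Semaphore(MAX_CONNECTIONS)

def fetch_block(height: int) -> dict:
    # Placeholder for a real RPC call to a full node.
    with node_connections:
        return {"height": height}

def index_range(start: int, end: int) -> list:
    indexed = []
    with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
        for batch_start in range(start, end, BATCH_SIZE):
            batch = range(batch_start, min(batch_start + BATCH_SIZE, end))
            # Slower than unbounded fetching, but the nodes stay reachable.
            indexed.extend(pool.map(fetch_block, batch))
    return indexed

blocks = index_range(1, 26)
print(len(blocks))  # 25 blocks indexed
```

Tuning `MAX_CONNECTIONS` and `BATCH_SIZE` against the observed load on the full nodes is the trade-off the passage describes: indexing slows down, but the nodes never hit their connection ceiling.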
Attack Reward: Excessive Batch RPC Requests
During the Network Attack and Bug Finding phase of the Cassini dry run, an attack on the network revealed a potential problem. The attack sent multiple batch requests containing thousands of eth_estimateGas calls to the RPC server, causing the RPC server to fall out of sync. Specifically, the RPC server could not catch up with the latest block height, and behind a load-balanced setup, users would see inconsistent block heights.
The attack also appeared deliberate: by blocking other participants from sending transactions, the attackers would gain a higher success rate and improve their chances of qualifying for rewards.
Further investigation of the attack revealed several findings. The attack only had a negative impact on seed nodes; other nodes were not affected. It also caused MetaMask to fail when sending transactions to https://cassini.crypto.org:8545/, and the transaction log kept repeating the error shown below.
Oct 17 18:47:56 ip-10-202-1-21 cronosd[6405]: 6:47PM ERR account not found error="account tcrc1sntvtxh8pkrljnr7zddk05l4jh4c9d2h5ra5p2 does not exist: unknown address" cosmos-address=tcrc1sntvtxh8pkrljnr7zddk05l4jh4c9d2h5ra5p2 ethereum-address=0x84D6c59Ae70D87f94c7e135b67D3F595Eb82b557 module=evm
Oct 17 18:47:56 ip-10-202-1-21 cronosd[6405]: 6:47PM ERR account not found error="account tcrc1sntvtxh8pkrljnr7zddk05l4jh4c9d2h5ra5p2 does not exist: unknown address" cosmos-address=tcrc1sntvtxh8pkrljnr7zddk05l4jh4c9d2h5ra5p2 ethereum-address=0x84D6c59Ae70D87f94c7e135b67D3F595Eb82b557 module=evm
Oct 17 18:47:56 ip-10-202-1-21 cronosd[6405]: 6:47PM ERR account not found error="account tcrc1sntvtxh8pkrljnr7zddk05l4jh4c9d2h5ra5p2 does not exist: unknown address" cosmos-address=tcrc1sntvtxh8pkrljnr7zddk05l4jh4c9d2h5ra5p2 ethereum-address=0x84D6c59Ae70D87f94c7e135b67D3F595Eb82b557 module=evm
Our first step was to detach the affected nodes from the eth-rpc target group. This allowed the network to sync again, although those nodes could no longer serve eth-rpc traffic. In addition, we leveraged Cloudflare, adding rules to block excessive batched JSON requests and enhance the security of the network.
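A server-side guard against this class of attack can be sketched as follows. This is hypothetical middleware, not Cassini's actual mitigation (which used Cloudflare rules); it shows the principle of rejecting oversized JSON-RPC batches before any expensive eth_estimateGas work is done:

```python
# Hypothetical JSON-RPC batch guard: reject batches above a size cap before
# any eth_estimateGas work runs, mirroring the request-level blocking
# described above.
import json

MAX_BATCH_SIZE = 100  # illustrative limit

def check_rpc_body(raw_body: str):
    """Return (ok, error_response). Oversized batches get a JSON-RPC error."""
    payload = json.loads(raw_body)
    if isinstance(payload, list) and len(payload) > MAX_BATCH_SIZE:
        return False, {
            "jsonrpc": "2.0",
            "id": None,
            "error": {"code": -32600, "message": "batch too large"},
        }
    return True, None

# A batch of 1000 eth_estimateGas calls, like the attack traffic, is rejected.
attack = json.dumps([
    {"jsonrpc": "2.0", "id": i, "method": "eth_estimateGas", "params": [{}]}
    for i in range(1000)
])
ok, err = check_rpc_body(attack)
print(ok, err["error"]["message"])  # False batch too large
```

Capping batch size at the edge (whether in the RPC server itself or in a proxy such as Cloudflare) bounds the amount of work a single request can demand, which is exactly what the attack exploited.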
What’s Next?
The Cassini dry run served as a stress test, which was crucial for learning the network's potential issues and stress tolerance. We have applied what we learned from Cassini to optimise the Cronos mainnet, and we will strive to offer the most robust Cronos mainnet possible to the ecosystem. Thank you to all the Testers and Builders for participating; event results will be announced separately soon.
If you are interested in becoming a validator on the Cronos mainnet, we recommend trying the Cronos Testnet as a starting point, and our Cronos documentation can help you get started. We look forward to seeing you at the launch of the Cronos mainnet next week on 8 November!