As we move towards higher adoption of digital payments, the pursuit of exascale payment systems is paramount. During the 11th International Symposium on Applied Computing for Software and Smart Systems (ACSS) held on July 19-20, 2024, Suparna Mitra and Sumit Misra presented "Key Considerations for Building Exa-scale Payment Systems." Their insights shed light on the challenges and innovations needed to build faster payment systems capable of handling billions of transactions a month, a combination of requirements that appears to violate the CAP Theorem of computer science.
The Scale and Impact
According to the 2024 ACI Prime Time for Real-Time Report, India’s UPI, Brazil’s Pix, Thailand’s PromptPay, and China’s IBPS are leading the global real-time market with 129.3B, 37.4B, 20.4B, and 17.2B transactions respectively in 2023. The ubiquitous reach of these exascale real-time payment systems has made them the most popular payment choice in these countries. The societal impact of such systems is significant, touching the lives of approximately 40% of the population in India, 77% in Brazil, 80% in China, and 90% in Thailand.
The following critical factors need to be considered for building successful exascale payment systems:
- Assuming an average of about 3,000-6,000 transactions per second (TPS), with peaks that are typically 3 times the average or more, the peak load is 9K-18K TPS. To accommodate higher volumes during festive seasons and some growth in adoption, such platforms need to support 10K-30K TPS.
- A feature-rich financial transaction needs ~10 state transitions, such as authentication, authorization, debit, credit, and notification. At 10K-30K TPS, that is 100K-300K state transitions per second.
- Because this is a real-time payment system, execution latency cannot exceed a few seconds: at the consumer end, the whole experience has to complete in less than 6-8 seconds.
- Such nationally important financial services demand 99.999% availability, which means a maximum downtime of about 5 minutes a year.
- For payment messages, end-to-end security is a natural demand. The system will also generate several terabytes of data every day, which has to be handled efficiently.
- The industry trend is to move away from hardware and software vendor lock-in. Hence the system cannot rely on proprietary or provider-specific hardware or software; it should run on commodity hardware and open-source software. This means that the solution has to be distributed and support partition tolerance.
Handling 100K-300K state transitions per second with consistency, while providing 99.999% availability and operating with partition tolerance, is where innovation is needed, as this combination appears to violate a key tenet of computer science: the CAP Theorem.
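The sizing figures above can be checked with a few lines of arithmetic. This is a minimal sketch: the function and variable names are illustrative, and the only inputs are numbers quoted in this article (129.3B UPI transactions in 2023, a 3x peak factor, ~10 state transitions per transaction, 99.999% availability).

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000
MINUTES_PER_YEAR = 365 * 24 * 60        # 525,600


def average_tps(transactions_per_year: float) -> float:
    """Average transactions per second for a given annual volume."""
    return transactions_per_year / SECONDS_PER_YEAR


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability target (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# UPI's 2023 volume from the ACI report cited above.
upi_average_tps = average_tps(129.3e9)   # roughly 4,100 TPS
upi_peak_tps = upi_average_tps * 3       # peak factor of 3 from the list above

# Target range from the list above, at ~10 state transitions per transaction.
TRANSITIONS_PER_TXN = 10
transitions_per_second = (10_000 * TRANSITIONS_PER_TXN,
                          30_000 * TRANSITIONS_PER_TXN)

five_nines_budget = downtime_minutes_per_year(0.99999)  # roughly 5.26 minutes

if __name__ == "__main__":
    print(f"average TPS: {upi_average_tps:,.0f}")
    print(f"peak TPS (3x): {upi_peak_tps:,.0f}")
    print(f"state transitions/s: {transitions_per_second}")
    print(f"99.999% downtime budget: {five_nines_budget:.2f} min/year")
```

The UPI average lands inside the 3,000-6,000 TPS range used above, and the five-nines budget works out to a little over 5 minutes a year.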
Navigating the CAP Theorem (Consistency, Availability, and Partition tolerance)
The CAP Theorem posits that a distributed system can deliver only two of the three desired characteristics: consistency, availability, and partition tolerance. Financial systems cannot compromise on consistency, and high throughput requires a distributed design, which in turn requires partition tolerance. Yet to meet the 99.999% availability requirement, the system must also ensure seamless transaction processing even during data center outages.
So, how can we solve this conundrum? When faced with the challenges of quantum physics, physicists had to drop the idea that the “observer” remains outside the frame and instead bring the observer into the loop. Similarly, we included the “user” in the mix and reimagined both the ‘Availability’ of the CAP Theorem and the implementation of ‘Consistency’ to solve the conundrum.
Reimagining Consistency from ACID Compliance implementation perspective
As mentioned earlier, financial transactions typically undergo around ten state transitions. At 10K-30K TPS, this translates to 100K-300K state transitions per second. Standard relational database management systems (RDBMS) would struggle to implement ACID consistency under this load and would need a large infrastructure to support it. Instead, in-memory databases like Redis, complemented by stream-based persistence solutions such as Kafka, offer a viable alternative. In this design, Redis ensures Atomicity, Consistency, and Isolation, while Kafka provides Durability; together they achieve full ACID compliance.
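The split between a fast in-memory state store and a durable append-only stream can be illustrated without either product. The sketch below is an assumption-laden stand-in, not the presenters' implementation: a Python dict plays the role of the in-memory store, a Python list plays the role of the stream, and the state names and the `ProcessingStore` class are hypothetical (the article names only some of the ~10 transitions).

```python
from dataclasses import dataclass, field

# Hypothetical, shortened state sequence for illustration only.
STATES = ["RECEIVED", "AUTHENTICATED", "AUTHORIZED",
          "DEBITED", "CREDITED", "NOTIFIED"]
NEXT_STATE = dict(zip(STATES, STATES[1:]))


@dataclass
class ProcessingStore:
    state: dict = field(default_factory=dict)  # stand-in for the in-memory store
    log: list = field(default_factory=list)    # stand-in for the durable stream

    def advance(self, txn_id: str) -> str:
        """Move a transaction to its next state, logging the change first."""
        current = self.state.get(txn_id)
        new = STATES[0] if current is None else NEXT_STATE.get(current)
        if new is None:
            raise ValueError(f"{txn_id} is already in terminal state {current}")
        # Append to the log before updating memory, so that the in-memory
        # state can always be rebuilt from the log after a crash.
        self.log.append((txn_id, current, new))
        self.state[txn_id] = new
        return new

    @classmethod
    def replay(cls, log) -> "ProcessingStore":
        """Rebuild the in-memory state from a previously written log."""
        store = cls()
        for txn_id, _old, new in log:
            store.state[txn_id] = new
        store.log = list(log)
        return store


if __name__ == "__main__":
    store = ProcessingStore()
    for _ in range(3):
        store.advance("txn-1")
    store.advance("txn-2")
    recovered = ProcessingStore.replay(store.log)
    print(recovered.state)
```

Replaying the log reproduces the same state map, which is the durability property the stream is meant to supply in this arrangement.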
Reimagining Availability
The architecture of an exascale payment system needs to be redefined around Processing Units (PUs): self-sufficient software modules that include state machines and data. There are multiple PUs within and across the data centres. Transactions are sharded using a consistent hashing algorithm, ensuring that all messages related to a particular transaction are processed within the same PU. Thus, we take the processing to the data in the distributed environment rather than taking data to the processing. This approach offers linear scalability and rapid processing by co-locating the application and Online Transaction Processing (OLTP) within the same unit, reducing network overhead and enhancing performance.
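A minimal sketch of consistent hashing for this kind of sharding follows. The talk does not specify the hash function or ring layout, so the `HashRing` class, the use of SHA-256, the virtual-node count, and the PU names are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Maps transaction IDs to Processing Units on a consistent-hash ring."""

    def __init__(self, pu_names, vnodes: int = 64):
        if not pu_names:
            raise ValueError("at least one PU is required")
        # Each PU is placed on the ring at several points (virtual nodes)
        # so that load spreads more evenly.
        self._ring = sorted(
            (self._hash(f"{pu}#{i}"), pu)
            for pu in pu_names
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def pu_for(self, txn_id: str) -> str:
        """Return the PU owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, self._hash(txn_id))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["pu-a", "pu-b", "pu-c", "pu-d"])
    # Every message carrying the same transaction ID lands on the same PU.
    print(ring.pu_for("txn-42"), ring.pu_for("txn-42"))
```

The property that matters here is that removing a PU from the ring only moves the transactions that PU owned; all other transaction IDs keep their existing PU.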
Any PU can fail. When a PU fails, there are two options: (a) detect the failure early and notify, or (b) maintain the state-transition map redundantly across PU-pairs, either within a data centre or across data centres, and route the transaction to the companion PU.
The overhead of maintaining PU-pairs within a data centre is minimal, while doing the same across data centres is enormous. Here, we should bring the “user-in-the-loop”: instead of cross-site redundancy, we should inform the user that the transaction has failed and request the user to retry. During this period, we should mark the PU or even the data centre as “not responsive” and route the next attempt via a good PU / data centre.
Thus, the system is designed to auto-detect and isolate failed components within milliseconds. An active-active deployment across multiple data centers ensures that even if one site fails, new transactions are automatically rerouted to available sites within seconds. This is the fail-fast principle of execution.
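The fail-fast behaviour described above can be sketched as a small router that stops sending traffic to a PU once it has been marked "not responsive". This is an illustration under stated assumptions: `FailFastRouter`, `submit`, the 30-second cooldown, and the use of `TimeoutError` as the failure signal are hypothetical, and the simple modulo selection is used only to keep the example short (it is not the consistent hashing mentioned earlier).

```python
import time
import zlib


class FailFastRouter:
    """Routes transactions to PUs, skipping any marked unresponsive."""

    def __init__(self, pus, cooldown_s: float = 30.0, clock=time.monotonic):
        self._pus = list(pus)
        self._cooldown_s = cooldown_s
        self._clock = clock            # injectable so tests can control time
        self._down_until = {}          # PU name -> time it may be used again

    def mark_unresponsive(self, pu: str) -> None:
        self._down_until[pu] = self._clock() + self._cooldown_s

    def healthy(self):
        now = self._clock()
        return [p for p in self._pus
                if self._down_until.get(p, float("-inf")) <= now]

    def route(self, txn_id: str) -> str:
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no responsive PU available")
        return candidates[zlib.crc32(txn_id.encode("utf-8")) % len(candidates)]


def submit(router: FailFastRouter, txn_id: str, process):
    """Try once; on timeout, isolate the PU and let the caller ask for a retry."""
    pu = router.route(txn_id)
    try:
        return process(pu, txn_id)
    except TimeoutError:
        router.mark_unresponsive(pu)
        raise  # fail fast: the user is told to retry instead of waiting
```

When the user retries, `route` picks from the remaining healthy PUs, so the second attempt avoids the component that just failed.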
Another area of concern is a large bank showing signs of failure. This can create excessive backpressure on the system, to the extent that the system itself starts to crack. The fail-fast mechanism is therefore also applied to underperforming banks: transactions involving them are pre-emptively managed to prevent systemic congestion.
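One common way to apply fail-fast to a struggling participant is a circuit-breaker style check over its recent results. The talk does not say which mechanism is used, so the sketch below is an assumption: the `BankBreaker` class, the window size, and the thresholds are hypothetical, and recovery probing (a "half-open" state) is left out for brevity.

```python
from collections import deque


class BankBreaker:
    """Tracks recent outcomes for one bank and rejects early when it degrades."""

    def __init__(self, window: int = 100, max_failure_rate: float = 0.5,
                 min_samples: int = 20):
        self._results = deque(maxlen=window)   # True = success, False = failure
        self._max_failure_rate = max_failure_rate
        self._min_samples = min_samples

    def record(self, ok: bool) -> None:
        self._results.append(ok)

    def is_open(self) -> bool:
        """True when the bank should be bypassed (too many recent failures)."""
        n = len(self._results)
        if n < self._min_samples:
            return False                        # not enough evidence yet
        failures = n - sum(self._results)
        return failures / n > self._max_failure_rate

    def allow(self) -> bool:
        return not self.is_open()


if __name__ == "__main__":
    breaker = BankBreaker()
    for _ in range(20):
        breaker.record(False)
    # Reject immediately instead of queueing work behind a failing bank.
    print("accept new transactions:", breaker.allow())
```

Rejecting at the edge keeps requests for the degraded bank from piling up and consuming capacity that healthy banks need.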
Summary
In summary, by innovating the Consistency and Availability aspects in the context of the financial system, we can solve the CAP Theorem challenges, but we cannot stop there. Looking at the record-breaking pace at which these systems are growing, it is important to aim for an order-of-magnitude increase in scalability with newer innovations.
Conclusion
The presentation at ACSS 2024 highlighted a forward-thinking approach to building exascale payment systems that are not only technologically advanced but also socially and economically impactful. As the financial industry moves towards a future where transactions occur at lightning speed, the innovations discussed will play a pivotal role in shaping the global payment landscape. By prioritizing customer expectations and leveraging cutting-edge technology, exascale payment systems are setting new standards for trust, efficiency, and resilience in digital payments.
