PIR Search Using Hyperbolic Text Embeddings
Arthur Lu, Shufan Li
Private Information Retrieval (PIR) schemes such as SimplePIR allow users to privately fetch entries from a public database in an oblivious manner. Building on top of PIR, more complex systems such as Coeus and Tiptoe have achieved oblivious document ranking and retrieval. While effective and useful, existing systems only support ranking by cosine similarity of embeddings in Euclidean space. Hyperbolic embeddings have recently shown promising results in a wide range of applications such as hierarchical representation and source code search. In this work, we explore the problem of oblivious document ranking and retrieval based on hyperbolic embeddings. One of the key challenges is the complex, highly non-linear nature of the hyperbolic distance function, which is expensive to compute naively using existing schemes. To address this problem, we derive an equivalent, simpler surrogate metric that always yields the same ranking as the hyperbolic distance metric, as well as a scheme to compute this metric in an oblivious manner. Experimental results show that the proposed scheme is correct and achieves better performance thanks to multithreaded optimizations and the use of specialized hardware such as GPUs.
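As a minimal sketch of why a rank-preserving surrogate exists (not necessarily the exact metric derived in the work): the Poincaré-ball distance is d(u, v) = arcosh(1 + 2·||u−v||² / ((1−||u||²)(1−||v||²))). Since arcosh is monotonically increasing and (1−||q||²) is constant for a fixed query q, ranking documents by ||q−v||² / (1−||v||²) reproduces the hyperbolic ranking while avoiding arcosh entirely:

```python
import math
import random

def poincare_distance(u, v):
    """Poincaré-ball distance between two points inside the unit ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu2 = sum(a * a for a in u)
    nv2 = sum(b * b for b in v)
    return math.acosh(1 + 2 * diff2 / ((1 - nu2) * (1 - nv2)))

def surrogate(q, v):
    """Rank-equivalent surrogate for a fixed query q: arcosh is monotone
    and (1 - ||q||^2) is the same for every document, so this simpler
    quantity induces the same ordering as poincare_distance(q, v)."""
    diff2 = sum((a - b) ** 2 for a, b in zip(q, v))
    nv2 = sum(b * b for b in v)
    return diff2 / (1 - nv2)

# Sanity check: both metrics produce the same document ranking.
random.seed(0)
def rand_point(d=8, r=0.9):
    """Random point of norm at most r inside the unit ball."""
    p = [random.uniform(-1, 1) for _ in range(d)]
    n = math.sqrt(sum(x * x for x in p))
    s = random.uniform(0, r) / n
    return [x * s for x in p]

q = rand_point()
docs = [rand_point() for _ in range(50)]
rank_true = sorted(range(50), key=lambda i: poincare_distance(q, docs[i]))
rank_sur = sorted(range(50), key=lambda i: surrogate(q, docs[i]))
assert rank_true == rank_sur
```

Expanding the numerator as ||q||² − 2⟨q, v⟩ + ||v||² shows the surrogate is linear in per-document quantities once the query is fixed, which is the kind of structure inner-product-based PIR ranking schemes can handle.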
Automated Grammar Generation Using LLMs
Aman Ganapathy Manvattira, Annaelle Baiget, Arthur Lu, Emmett Cocke, Krish Patel, James Shiffer
Large language models (LLMs) show high potential for processing highly structured text inputs to generate grammar representations. Leveraging LLMs to generate grammars would reduce the time and effort required to create a grammar for fuzzing, unit testing, and input validation. In this project, we create a system that handles grammar creation using both automated and human feedback. We develop a pipeline for assisted grammar generation on unseen domains. We show the potential for LLMs to generate complex grammars usable in many software testing applications, and reflect on their limitations with complex unseen domains.
Evaluation of Caching Strategies for Microservices
Arthur Lu, Derek Wang, Isha Atul Pardikar, Purva Gaikwad, Xuanzhe Han
Microservices are a popular architecture due to their logical separation of concerns among multiple teams of developers. However, performance and scalability remain ongoing challenges with significant research focus. One approach to improving performance and scalability is caching. In this paper, we explore advanced caching strategies and evaluate their effectiveness for data access within microservices. We test different eviction policies, cache topologies, and data prefetching techniques on common access patterns. Our results show that these strategies perform well on select patterns, highlighting their potential to outperform state-of-the-art solutions such as MuCache. We hope that advanced strategies can serve as a drop-in upgrade for existing microservice caches.
Decision Tree Optimization for RMT Architectures
Arthur Lu, Nathan Huey, Jai Parera, Krish Patel
Processing packets in-network and learning meta-information about them is gaining traction in networking. By shifting computation from end nodes to the network, models can leverage the efficient computation afforded to routers by architectures such as the reconfigurable match-action table (RMT). By performing this shift while maintaining wire speed (line rate), model inference can be performed at no cost to latency. Furthermore, decision trees are naturally suited to the RMT model due to their layered, hierarchical structure. Previous work has proposed exactly this use of ternary content-addressable memory (TCAM). We investigate the proposed implementation by simulating it in a Python ideal-RMT simulator and propose new optimizations for TCAM usage: a priority-aware tree compression technique that significantly reduces TCAM usage without reducing model accuracy, and a range boundary optimization technique that further reduces TCAM usage at some cost to accuracy. We find these optimizations promising for reducing TCAM usage on RMT devices and discuss future directions for research.
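For context on why range handling drives TCAM usage (a standard prefix-expansion sketch, not the paper's specific optimization): a decision-tree threshold split becomes an integer range match on a feature field, and a TCAM encodes a range as a set of prefix entries, so the number of entries depends on how the range aligns with power-of-two boundaries:

```python
def range_to_prefixes(lo, hi, width):
    """Expand the integer range [lo, hi] over a `width`-bit field into
    ternary prefix entries (value, prefix_len) -- the standard way a
    TCAM encodes a range match."""
    entries = []
    while lo <= hi:
        # Largest power-of-two aligned block starting at lo that fits in the range.
        size = lo & -lo if lo > 0 else 1 << width
        while size > hi - lo + 1:
            size >>= 1
        entries.append((lo, width - (size.bit_length() - 1)))
        lo += size
    return entries

def prefix_matches(x, value, prefix_len, width):
    """True if x falls under the ternary entry (value, prefix_len)."""
    mask = ((1 << prefix_len) - 1) << (width - prefix_len)
    return (x & mask) == (value & mask)

# A hypothetical tree split "feature < 9" on a 4-bit field is the range [0, 8].
entries = range_to_prefixes(0, 8, width=4)
covered = {x for x in range(16)
           if any(prefix_matches(x, v, p, 4) for v, p in entries)}
assert covered == set(range(0, 9))
```

Here [0, 8] needs only two entries (0xxx and 1000), while a poorly aligned boundary can need up to 2·width − 2 entries per range, which is why nudging split thresholds toward aligned boundaries can trade a little accuracy for much less TCAM.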
ExplainFuzz: Explainability and Well-Formedness for Probabilistic Test Generation
Annaelle Baiget, London Bielicke, Arthur Lu, Brooke Simon
Understanding and explaining the structure of generated test inputs is crucial for effective testing, debugging, and analysis of software systems. However, existing approaches such as probabilistic context-free grammars (pCFGs) and large language models (LLMs) lack the ability to provide fine-grained statistical explanations about generated test inputs and their structure. We introduce ExplainFuzz, a novel framework that leverages probabilistic circuits (PCs) to model and query the distribution of grammar-based inputs in an interpretable manner. Starting from a context-free grammar (CFG), we refactor it to support PC compilation and train the resulting probabilistic circuit on a synthetically generated corpus produced with Grammarinator during a fuzzing campaign. The trained PC supports a variety of probabilistic queries, offering insight into the statistical distribution of generated inputs. Additionally, for the SQL domain, we demonstrate a custom generator that transforms PC-generated samples into executable queries, leveraging the PC's generation capabilities to enable concrete synthetic test input generation. We evaluate ExplainFuzz across multiple domains, including SQL, REDIS, and JANUS, highlighting its ability to provide explainable, grammar-aware insights into test input structure. Our results show that ExplainFuzz outperforms traditional pCFGs and LLMs in terms of log-likelihood estimation and interpretability, contributing a new direction for explainable grammar-based fuzzing.
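To make concrete the pCFG baseline that ExplainFuzz is compared against, here is a toy weighted-grammar sampler over a hypothetical arithmetic-expression grammar (illustrative only; the actual grammars are domain grammars such as SQL):

```python
import random

# Toy pCFG: each nonterminal maps to (weight, expansion) alternatives.
# This mini arithmetic grammar is a made-up example, not a project grammar.
PCFG = {
    "EXPR": [(0.4, ["TERM"]), (0.6, ["TERM", "+", "EXPR"])],
    "TERM": [(0.7, ["digit"]), (0.3, ["(", "EXPR", ")"])],
}

def sample(symbol, rng):
    """Recursively expand a symbol, choosing alternatives by weight."""
    if symbol == "digit":
        return str(rng.randint(0, 9))
    if symbol not in PCFG:          # literal terminal such as "+" or "("
        return symbol
    weights, expansions = zip(*PCFG[symbol])
    choice = rng.choices(expansions, weights=weights)[0]
    return "".join(sample(s, rng) for s in choice)

rng = random.Random(0)
generated = sample("EXPR", rng)
print(generated)
```

A pCFG like this can generate inputs and assign them likelihoods, but each rule choice is independent of context, which is exactly the limitation that motivates modeling the joint distribution with a probabilistic circuit instead.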