publications
publications by category in reverse chronological order. generated by jekyll-scholar.
2025
- Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
  Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, and 1 more author
  In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2025. Oral Presentation
Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also with employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals several interesting findings: 1) encoder-decoder models, such as those in the Flan-T5 family, generally outperform causal decoder-only LMs on MHQA tasks, despite being significantly smaller in size; 2) altering the order of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition to the above, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs’ performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
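To make finding (3) concrete, here is a minimal PyTorch sketch of the idea behind the mask modification, not the authors' implementation (see the linked repository for the real code): the lower-triangular causal mask is opened up over the context span, so retrieved documents attend to each other bidirectionally while later tokens remain autoregressive.

```python
# Sketch only: relax a causal mask so context tokens attend bidirectionally.
import torch

def build_hybrid_mask(seq_len: int, context_len: int) -> torch.Tensor:
    """Return a boolean mask; True marks positions a query may attend to."""
    # Standard causal mask: position i attends to positions <= i.
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    # Open up the context block so retrieved documents see each other fully.
    mask[:context_len, :context_len] = True
    return mask

print(build_hybrid_mask(seq_len=8, context_len=5).int())
```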
@inproceedings{huang-etal-2025-masking,
  title = {Masking in Multi-hop {QA}: An Analysis of How Language Models Perform with Context Permutation},
  author = {Huang, Wenyu and Vougiouklis, Pavlos and Lapata, Mirella and Pan, Jeff Z.},
  editor = {Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month = jul,
  year = {2025},
  address = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.acl-long.869/},
  pages = {17781--17795},
  isbn = {979-8-89176-251-0},
  note = {Oral Presentation}
}
- Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
  Yiming Du*, Wenyu Huang*, Danna Zheng*, and 5 more authors
  May 2025
@misc{du2025rethinkingmemoryaitaxonomy,
  title = {Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions},
  author = {Du, Yiming and Huang, Wenyu and Zheng, Danna and Wang, Zhaowei and Montella, Sebastien and Lapata, Mirella and Wong, Kam-Fai and Pan, Jeff Z.},
  year = {2025},
  eprint = {2505.00675},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2505.00675},
  month = may
}
- Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges
  Hongru Wang, Wenyu Huang, Yufei Wang, and 7 more authors
  In Findings of the Association for Computational Linguistics: ACL 2025, Jul 2025
Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) tool creation; 2) tool utilization: tool awareness, tool selection, tool execution; and 3) role-consistent response: response generation and role play. Furthermore, we build VirtualMobile, an embodied virtual mobile evaluation environment, to simulate API calls and assess the robustness of the created APIs. Taking advantage of these artifacts, we conduct a comprehensive evaluation of 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that existing state-of-the-art LLMs still cannot use tools well over long horizons.
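To illustrate what "stateful" means here, below is a hypothetical sketch; the data structures and tool names are illustrative inventions, not the DialogTool schema. A later tool call is only well-formed given state produced by an earlier turn, which is exactly what stateless single-turn evaluation misses.

```python
# Hypothetical illustration of stateful multi-turn tool use (not DialogTool's format).
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class DialogueState:
    history: list = field(default_factory=list)    # past tool calls and results
    variables: dict = field(default_factory=dict)  # state carried across turns

state = DialogueState()
# Turn 1: a call creates a resource whose id must be remembered.
state.history.append(ToolCall("create_booking", {"movie": "Dune", "seats": 2}))
state.variables["booking_id"] = "BK-1042"  # e.g. returned by the simulated API
# Turn 3: this call is only valid given the state produced in turn 1.
state.history.append(ToolCall("cancel_booking", {"id": state.variables["booking_id"]}))
```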
@inproceedings{wang-etal-2025-rethinking-stateful,
  title = {Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges},
  author = {Wang, Hongru and Huang, Wenyu and Wang, Yufei and Xi, Yuanhao and Lu, Jianqiao and Zhang, Huan and Hu, Nan and Liu, Zeming and Pan, Jeff Z. and Wong, Kam-Fai},
  editor = {Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
  month = jul,
  year = {2025},
  address = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.findings-acl.284/},
  pages = {5433--5453},
  isbn = {979-8-89176-256-5}
}
- Prompting large language models with knowledge graphs for question answering involving long-tail facts
  Wenyu Huang, Guancheng Zhou, Mirella Lapata, and 3 more authors
  Knowledge-Based Systems, May 2025
Although Large Language Models (LLMs) are effective at performing various NLP tasks, they still struggle to handle tasks that require extensive, real-world knowledge, especially when dealing with long-tail facts (facts related to long-tail entities). This limitation highlights the need to supplement LLMs with non-parametric knowledge. To address this issue, we analysed the effects of different types of non-parametric knowledge, including textual passages and knowledge graphs (KGs). Since LLMs have probably already seen the majority of factual question-answering datasets, to facilitate our analysis we propose a fully automatic pipeline for creating a benchmark whose questions require knowledge of long-tail facts. Using this pipeline, we introduce the LTGen benchmark. We evaluate state-of-the-art LLMs in different knowledge settings using the proposed benchmark. Our experiments show that LLMs alone struggle with answering these questions, especially when the long-tail level is high or rich knowledge is required. Nonetheless, the performance of the same models improves significantly when they are prompted with non-parametric knowledge. We observe that, in most cases, prompting LLMs with KG triples surpasses passage-based prompting using a state-of-the-art retriever. In addition, while prompting LLMs with both KG triples and documents does not consistently improve knowledge coverage, it can dramatically reduce hallucinations in the generated content.
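As a rough illustration of the knowledge settings compared above, the sketch below linearises KG triples into the prompt alongside, or instead of, retrieved passages. The templates and the example question are assumptions for illustration, not the paper's exact prompts.

```python
# Sketch: build prompts from passages and/or linearised KG triples.
def triples_to_text(triples):
    """Linearise (subject, relation, object) triples into prompt lines."""
    return "\n".join(f"({s}, {r}, {o})" for s, r, o in triples)

def build_prompt(question, passages=None, triples=None):
    parts = []
    if passages:
        parts.append("Passages:\n" + "\n\n".join(passages))
    if triples:
        parts.append("Knowledge graph facts:\n" + triples_to_text(triples))
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

print(build_prompt(
    "Where was the founder of Acme Corp born?",  # illustrative long-tail question
    triples=[("Acme Corp", "founded_by", "Jane Doe"),
             ("Jane Doe", "place_of_birth", "Leeds")],
))
```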
@article{HUANG2025113648,
  title = {Prompting large language models with knowledge graphs for question answering involving long-tail facts},
  journal = {Knowledge-Based Systems},
  volume = {324},
  pages = {113648},
  year = {2025},
  issn = {0950-7051},
  doi = {https://doi.org/10.1016/j.knosys.2025.113648},
  url = {https://www.sciencedirect.com/science/article/pii/S095070512500694X},
  author = {Huang, Wenyu and Zhou, Guancheng and Lapata, Mirella and Vougiouklis, Pavlos and Montella, Sebastien and Pan, Jeff Z.},
  keywords = {Large language models, Knowledge graphs, Retrieval-augmented generation, Evaluation},
  month = may
}
2024
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA
  Wenyu Huang, Guancheng Zhou, Hongru Wang, and 3 more authors
  In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024
Retrieval-Augmented Generation (RAG) is widely used to inject external non-parametric knowledge into large language models (LLMs). Recent works suggest that Knowledge Graphs (KGs) contain valuable external knowledge for LLMs, but retrieving information from KGs differs from extracting it from document sets. Most existing approaches seek to directly retrieve relevant subgraphs, thereby eliminating the need for the extensive SPARQL annotations traditionally required by semantic parsing methods. In this paper, we model the subgraph retrieval task as a conditional generation task handled by small language models. Specifically, we define a subgraph identifier as a sequence of relations, each represented as a special token stored in the language model. Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves retrieval performance competitive with state-of-the-art models relying on 7B parameters, demonstrating that small language models are capable of performing the subgraph retrieval task. Furthermore, our largest 3B model, when paired with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks. Our model and data will be made available online: https://github.com/hwy9855/GSR.
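The core recipe can be sketched as follows; this is a simplified illustration rather than the released code (see the repository above). Relations are registered as special tokens in a small seq2seq model, so retrieval reduces to generating a relation sequence conditioned on the question.

```python
# Sketch of generative subgraph retrieval with relation tokens (simplified).
from transformers import T5ForConditionalGeneration, T5Tokenizer

relations = ["<rel:founded_by>", "<rel:place_of_birth>", "<rel:spouse>"]

tokenizer = T5Tokenizer.from_pretrained("t5-small")  # small stand-in model
tokenizer.add_special_tokens({"additional_special_tokens": relations})
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.resize_token_embeddings(len(tokenizer))        # room for the new tokens

# After fine-tuning on (question -> relation sequence) pairs, retrieval is
# just generation; the decoded relation path identifies the subgraph to fetch.
inputs = tokenizer("Where was the founder of Acme Corp born?", return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(ids[0], skip_special_tokens=False))
```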
@inproceedings{huang-etal-2024-less,
  title = {Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop {KGQA}},
  author = {Huang, Wenyu and Zhou, Guancheng and Wang, Hongru and Vougiouklis, Pavlos and Lapata, Mirella and Pan, Jeff Z.},
  editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
  month = nov,
  year = {2024},
  address = {Miami, Florida, USA},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2024.findings-emnlp.927/},
  doi = {10.18653/v1/2024.findings-emnlp.927},
  pages = {15787--15803}
}
- UniMS-RAG: A Unified Multi-source Retrieval-Augmented Generation for Personalized Dialogue Systems
  Hongru Wang, Wenyu Huang, Yang Deng, and 6 more authors
  Jan 2024
@misc{wang2024unimsragunifiedmultisourceretrievalaugmented,
  title = {UniMS-RAG: A Unified Multi-source Retrieval-Augmented Generation for Personalized Dialogue Systems},
  author = {Wang, Hongru and Huang, Wenyu and Deng, Yang and Wang, Rui and Wang, Zezhong and Wang, Yufei and Mi, Fei and Pan, Jeff Z. and Wong, Kam-Fai},
  year = {2024},
  eprint = {2401.13256},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2401.13256},
  month = jan
}
- A Large-scale Offer Alignment Model for Partitioning Filtering and Matching Product Offers
  Wenyu Huang, André Melo, and Jeff Z. Pan
  In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington DC, USA, Jul 2024
Offer alignment is a key step in a product knowledge graph construction pipeline. It aims to align retailer offers of the same product for better coverage of product details. With the rapid development of online shopping services, the offer alignment task is being applied to ever larger datasets. This work aims to build an offer alignment system that can be used efficiently on large-scale offer data. The key components of this system are: 1) common offer encoders for encoding textual offer data into representations; 2) a trainable LSH partitioning module that divides similar offers into small blocks; 3) lightweight late interactions for efficient filtering and scoring of offer alignment candidate pairs. We evaluate the system on the public WDC offer alignment dataset, as well as on DBLP-Scholar and DBLP-ACM.
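For intuition about the partitioning step, here is a sketch using classic random-hyperplane LSH over offer embeddings; the paper's module is a trainable variant, and the late-interaction filtering and scoring stages are omitted. Offers hashed to the same bucket become candidate pairs, so only a small fraction of the O(n^2) pairs reaches the matcher.

```python
# Sketch: LSH blocking so only same-bucket offers become candidate pairs.
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(embeddings: np.ndarray, n_planes: int = 16) -> dict:
    """Group row vectors by the sign pattern of random-hyperplane projections."""
    planes = rng.standard_normal((embeddings.shape[1], n_planes))
    bits = (embeddings @ planes) > 0  # (n_offers, n_planes) boolean codes
    buckets: dict = {}
    for idx, code in enumerate(bits):
        buckets.setdefault(code.tobytes(), []).append(idx)
    return buckets

emb = rng.standard_normal((1000, 64))  # stand-in for learned offer encodings
pairs = [(a, b) for block in lsh_buckets(emb).values()
         for i, a in enumerate(block) for b in block[i + 1:]]
print(len(pairs), "candidate pairs instead of", 1000 * 999 // 2)
```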
@inproceedings{10.1145/3626772.3661351,
  author = {Huang, Wenyu and Melo, Andr\'{e} and Pan, Jeff Z.},
  title = {A Large-scale Offer Alignment Model for Partitioning Filtering and Matching Product Offers},
  year = {2024},
  isbn = {9798400704314},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3626772.3661351},
  doi = {10.1145/3626772.3661351},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages = {2880--2884},
  numpages = {5},
  keywords = {entity matching, offer alignment, product matching},
  location = {Washington DC, USA},
  series = {SIGIR '24},
  month = jul
}
2023
- Retrieval Augmented Generation with Rich Answer Encoding
  Wenyu Huang, Mirella Lapata, Pavlos Vougiouklis, and 2 more authors
  In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Nov 2023
@inproceedings{huang-etal-2023-retrieval,
  title = {Retrieval Augmented Generation with Rich Answer Encoding},
  author = {Huang, Wenyu and Lapata, Mirella and Vougiouklis, Pavlos and Papasarantopoulos, Nikos and Pan, Jeff},
  editor = {Park, Jong C. and Arase, Yuki and Hu, Baotian and Lu, Wei and Wijaya, Derry and Purwarianti, Ayu and Krisnadhi, Adila Alfa},
  booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month = nov,
  year = {2023},
  address = {Nusa Dua, Bali},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2023.ijcnlp-main.65/},
  doi = {10.18653/v1/2023.ijcnlp-main.65},
  pages = {1012--1025}
}