
References
Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal,
Sadhana Kumaravel, Matthew Stallone, Rameswar
Panda, Yara Rizk, GP Bhargav, Maxwell Crouse,
Chulaka Gunasekara, et al. 2024. Granite-function
calling model: Introducing function calling abilities
via multi-task learning of granular tasks. In EMNLP,
pages 1131–1139.
Mistral AI. 2024. Un ministral, des ministraux.
AI@Meta. 2024. Llama 3 model card.
Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng,
Jian-Guang Lou, and Weizhu Chen. 2023. Learning
from mistakes makes llm better reasoner. arXiv
preprint arXiv:2310.20689.
Anthropic. 2024. Claude 3.5 sonnet.
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu,
Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao
Liu, Aohan Zeng, Lei Hou, et al. 2023. Longbench:
A bilingual, multitask benchmark for long context
understanding. arXiv preprint arXiv:2308.14508.
Kinjal Basu, Ibrahim Abdelaziz, Kelsey Bradford,
Maxwell Crouse, Kiran Kate, Sadhana Kumaravel,
Saurabh Goyal, Asim Munawar, Yara Rizk, Xin
Wang, et al. 2024. Nestful: A benchmark for evaluating
llms on nested sequences of api calls. arXiv
preprint arXiv:2409.03797.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, et al. 2020. Language models are few-shot
learners. In NeurIPS.
Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui
He, Jiaqi Wang, Feng Zhao, and Dahua
Lin. 2023. Sharegpt4v: Improving large multi-modal
models with better captions. arXiv preprint
arXiv:2311.12793.
Sijia Chen, Yibo Wang, Yi-Feng Wu, Qing-Guo Chen,
Zhao Xu, Weihua Luo, Kaifu Zhang, and Lijun
Zhang. 2024a. Advancing tool-augmented large language
models: Integrating insights from errors in
inference trees. arXiv preprint arXiv:2406.07115.
Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun
Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo,
Songyang Zhang, Dahua Lin, Kai Chen, and Feng
Zhao. 2024b. T-eval: Evaluating the tool utilization
capability of large language models step by step. In
ACL, pages 9510–9529.
Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei
Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and
Feng Zhao. 2024c. Agent-FLAN: Designing data
and methods of effective agent tuning for large language
models. In ACL, pages 9354–9366.
Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui
Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu
Feng, Hanlin Zhao, et al. 2024. Chatglm: A family
of large language models from glm-130b to glm-4 all
tools. arXiv preprint arXiv:2406.12793.
Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong
Shen, Yujiu Yang, Nan Duan, and Weizhu Chen.
2023. Critic: Large language models can self-correct
with tool-interactive critiquing. arXiv preprint
arXiv:2305.11738.
Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang,
Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and
Yang Liu. 2024a. StableToolBench: Towards stable
large-scale benchmarking on tool learning of large
language models. In ACL, pages 11143–11156.
Zishan Guo, Yufei Huang, and Deyi Xiong. 2024b.
CToolEval: A Chinese benchmark for LLM-powered
agent evaluation in real-world API interactions. In
ACL, pages 15711–15724.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam
Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow,
Akila Welihinda, Alan Hayes, Alec Radford,
et al. 2024. Gpt-4o system card. arXiv preprint
arXiv:2410.21276.
Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim,
and Sunghun Kim. 2024. A survey on large language
models for code generation. arXiv preprint
arXiv:2406.00515.
Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei,
Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao
Dong, Hongning Wang, Jie Tang, and Minlie Huang.
2024. CritiqueLLM: Towards an informative critique
generation model for evaluation of large language
model generation. In ACL, pages 13034–13054.
Yilun Kong, Jingqing Ruan, YiHong Chen, Bin Zhang,
Tianpeng Bao, Shi Shiwei, du Guo Qing, Xiaoru Hu,
Hangyu Mao, Ziyue Li, Xingyu Zeng, Rui Zhao, and
Xueqian Wang. 2024. TPTU-v2: Boosting task planning
and tool usage of large language model-based
agents in real-world industry systems. In EMNLP,
pages 371–385.
Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang,
Dahua Lin, Kai Chen, and Xian-ling Mao. 2024. Criticbench:
Evaluating large language models as critic.
arXiv preprint arXiv:2402.13764.
Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song,
Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang,
and Yongbin Li. 2023. API-bank: A comprehensive
benchmark for tool-augmented LLMs. In EMNLP,
pages 3102–3116.
Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo,
Haowei Liu, and Yujiu Yang. 2024. CriticBench:
Benchmarking LLMs for critique-correct reasoning.
In ACL, pages 1552–1587.