Shiqi Jiang
I am a Senior Researcher with the Systems and Networking Research Group at Microsoft Research Asia (MSRA). I received my Ph.D. in Computer Science from Nanyang Technological University in 2018, supervised by Prof. Mo Li, and my Bachelor's degree in Computer Engineering from Zhejiang University.
My research interests broadly fall in edge computing, mobile sensing, the Internet of Things (IoT), and wearables. My recent research mainly focuses on Edge AI, where I especially investigate the following topics: efficient inference systems on the edge, continuous learning systems for the edge, and AI-powered sensing systems.
I am constantly recruiting research interns. If you are interested in our work, please feel free to contact me.
News
Mar 9, 2024 | One paper was conditionally accepted to MobiSys 2024. 🎊🎊 |
---|---|
Feb 8, 2024 | Our measurement study on “In-Browser Deep Learning Inference” was released. |
Jan 24, 2024 | One paper was accepted to MobiCom 2024 (Summer Round). |
Mar 3, 2023 | One paper was conditionally accepted to MobiSys 2023. |
Nov 22, 2022 | AdaptiveNet got accepted to MobiCom 2023 (Summer Round). |
Recent Publications
- MobiSys ’24: Empowering In-Browser Deep Learning Inference on Edge Devices with Just-In-Time Kernel Optimizations. Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Xu Cao, Yuanchun Li, Qipeng Wang, Deyun Zhang, Ju Ren, Yunxin Liu, Lili Qiu, and Mao Yang. 2024.
Web is increasingly becoming the primary platform to deliver AI services onto edge devices, making in-browser deep learning (DL) inference more prominent. Nevertheless, the heterogeneity of edge devices, combined with the underdeveloped state of Web hardware acceleration practices, hinders current in-browser inference from achieving its full performance potential on target devices. To address this issue, this paper presents the pioneering in-browser inference system, nnJIT, which enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices. nnJIT is built upon two novel techniques that significantly reduce kernel search and compilation overhead while consistently improving performance: Tensor-Web Compiling Co-Design lowers compiling costs by around 100× by eliminating redundant and ineffective compiling passes; Web-Specific Lite Kernel Optimization Space reduces kernel tuning costs by focusing on Web programming requirements and efficient device resource utilization, pruning the optimization space from millions to only dozens. nnJIT is evaluated on modern models, e.g., BART, T5, and Llama 2, on a range of edge devices including laptops and smartphones, using different browsers and hardware from ARM, Intel, AMD and Nvidia. The results show that nnJIT can achieve up to 8.2× speedup within 30 seconds compared to existing baselines.
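To give a feel for the pruned-space idea, here is a minimal sketch in Python. It is not nnJIT's actual API; the `generate_kernel` and `benchmark` helpers are hypothetical stand-ins for JIT kernel emission and on-device measurement, and the tile-size list plays the role of the "dozens of candidates" tuning space.

```python
import time

# Hypothetical pruned tuning space: only a handful of Web-friendly tile sizes,
# so the search space holds dozens of candidates instead of millions.
TILE_SIZES = [4, 8, 16, 32]

def generate_kernel(tile):
    """Stand-in for JIT kernel generation (e.g., emitting WebAssembly/WebGPU code)."""
    def matmul(a, b, n):
        c = [[0.0] * n for _ in range(n)]
        for i in range(0, n, tile):          # tiled over rows of the output
            for k in range(n):
                for ii in range(i, min(i + tile, n)):
                    row_a, row_c = a[ii], c[ii]
                    for j in range(n):
                        row_c[j] += row_a[k] * b[k][j]
        return c
    return matmul

def benchmark(kernel, n=64):
    """Measure one candidate kernel on the target device."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    kernel(a, b, n)
    return time.perf_counter() - start

# Just-in-time selection: measure every candidate and keep the fastest one.
best_kernel = min((generate_kernel(t) for t in TILE_SIZES), key=benchmark)
```

Because the candidate set is tiny, this kind of measure-and-pick loop can run within the page's startup budget rather than as an offline tuning job.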
- MobiCom ’24: AutoDroid: LLM-powered Task Automation in Android. Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024.
Mobile task automation is an attractive technique that aims to enable voice-based, hands-free user interaction with smartphones. However, existing approaches suffer from poor scalability due to the limited language understanding ability and the non-trivial manual efforts required from developers or end users. The recent advance of large language models (LLMs) in language understanding and reasoning inspires us to rethink the problem from a model-centric perspective, where task preparation, comprehension, and execution are handled by a unified language model. In this work, we introduce AutoDroid, a mobile task automation system capable of handling arbitrary tasks on any Android application without manual efforts. The key insight is to combine the commonsense knowledge of LLMs and the domain-specific knowledge of apps through automated dynamic analysis. The main components include a functionality-aware UI representation method that bridges the UI with the LLM, exploration-based memory injection techniques that augment the app-specific domain knowledge of the LLM, and a multi-granularity query optimization module that reduces the cost of model inference. We integrate AutoDroid with off-the-shelf LLMs including online GPT-4/GPT-3.5 and on-device Vicuna, and evaluate its performance on a new benchmark for memory-augmented Android task automation with 158 common tasks. The results demonstrate that AutoDroid is able to precisely generate actions with an accuracy of 90.9% and complete tasks with a success rate of 71.3%, outperforming the GPT-4-powered baselines by 36.4% and 39.7%, respectively.
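A minimal sketch of the underlying prompting idea, assuming hypothetical names such as `build_prompt`, `ui_elements`, and `app_memory` (these are not AutoDroid's actual interfaces): a text UI representation of the current screen is combined with app-specific knowledge gathered from offline exploration, and the LLM is asked to pick an action.

```python
# Sketch only: combine a functionality-aware UI representation with injected
# app memory in a single LLM prompt (names are illustrative, not AutoDroid's API).
def build_prompt(task, ui_elements, app_memory):
    ui_text = "\n".join(
        f"[{i}] {e['type']}: {e['text']}" for i, e in enumerate(ui_elements)
    )
    memory_text = "\n".join(app_memory)  # knowledge mined via offline app exploration
    return (
        f"Task: {task}\n"
        f"Known app behaviors:\n{memory_text}\n"
        f"Current screen:\n{ui_text}\n"
        "Respond with the index of the UI element to act on."
    )

prompt = build_prompt(
    task="Mute notifications for 1 hour",
    ui_elements=[{"type": "button", "text": "Settings"},
                 {"type": "switch", "text": "Do Not Disturb"}],
    app_memory=["The 'Do Not Disturb' switch silences notifications."],
)
# The prompt would then be sent to GPT-4/GPT-3.5 or an on-device Vicuna model.
```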
- MobiSys ’23: NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors. Jianyu Wei, Ting Cao, Shijie Cao, Shiqi Jiang, Shaowei Fu, Mao Yang, Yanyong Zhang, and Yunxin Liu. 2023.
Mobile devices are increasingly equipped with heterogeneous multi-processors, e.g., CPU + GPU + DSP. Yet existing Neural Network (NN) inference fails to fully utilize the computing power of heterogeneous multi-processors due to the sequential structures of NN models. To this end, this paper proposes NN-Stretch, a new model adaptation strategy, as well as the supporting system. It automatically branches a given model according to the processor architecture characteristics. Compared to other popular model adaptation techniques such as model pruning, which often sacrifices accuracy, NN-Stretch accelerates inference while preserving accuracy. The key idea of NN-Stretch is to horizontally stretch a model structure, from a long and narrow model to a short and wide one with multiple branches. We formulate the model branching into an optimization problem. NN-Stretch narrows down the design space by taking into account hard latency constraints, through varying where the branches converge and how each branch is scaled to fit heterogeneous processors, as well as soft accuracy constraints, through maintaining the model skeleton and the expressiveness of each branch. According to the constraints, NN-Stretch can efficiently generate accurate and efficient multi-branch models. To facilitate easy deployment, this paper also devises a subgraph-based spatial scheduler for existing inference frameworks to execute the multi-branch models in parallel. Our experimental results are very promising, with up to 3.85× speedup compared to single CPU/GPU/DSP execution and up to 0.8% accuracy improvement.
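The stretching idea can be sketched in a few lines of PyTorch. This is not NN-Stretch's implementation; `StretchedStage` and its branch widths are illustrative, showing how a deep sequential stage could be replaced by shallower parallel branches that an inference framework may then map to different processors.

```python
import torch
import torch.nn as nn

class StretchedStage(nn.Module):
    """Sketch: one stage stretched into several narrower, independent branches."""
    def __init__(self, channels, num_branches=3):
        super().__init__()
        width = channels // num_branches  # each branch scaled for its target processor
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, width, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(width, channels, 3, padding=1),
            )
            for _ in range(num_branches)
        ])

    def forward(self, x):
        # Branches are independent, so a scheduler can execute them concurrently
        # on CPU / GPU / DSP before their outputs converge here.
        return x + sum(branch(x) for branch in self.branches)

stage = StretchedStage(channels=64)
out = stage(torch.randn(1, 64, 56, 56))
```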
- MobiCom ’23: AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments. Hao Wen, Yuanchun Li, Zunshuai Zhang, Shiqi Jiang, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang, and Yunxin Liu. 2023.
Deep learning models are increasingly deployed to edge devices for real-time applications. To ensure stable service quality across diverse edge environments, it is highly desirable to generate tailored model architectures for different conditions. However, conventional pre-deployment model generation approaches are not satisfactory due to the difficulty of handling the diversity of edge environments and the demand for edge information. In this paper, we propose to adapt the model architecture after deployment in the target environment, where the model quality can be precisely measured and private edge data can be retained. To achieve efficient and effective edge model generation, we introduce a pretraining-assisted on-cloud model elastification method and an edge-friendly on-device architecture search method. Model elastification generates a high-quality search space of model architectures with the guidance of a developer-specified oracle model. Each subnet in the space is a valid model with a different environment affinity, and each device efficiently finds and maintains the most suitable subnet based on a series of edge-tailored optimizations. Extensive experiments on various edge devices demonstrate that our approach is able to achieve significantly better accuracy-latency tradeoffs (e.g., 46.74% higher average accuracy with a 60% latency budget) than strong baselines with minimal overhead (13 GPU hours in the cloud and 2 minutes on the edge server).
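A minimal sketch of on-device subnet selection, under the assumption of hypothetical helpers (`measure_latency`, `estimate_accuracy`) rather than AdaptiveNet's real optimizations: among the subnets elasticized from the oracle model, keep the most accurate one that still meets the device's latency budget.

```python
def select_subnet(subnets, measure_latency, estimate_accuracy, budget_ms):
    """Sketch: pick the most accurate subnet that fits the device's latency budget."""
    best, best_acc = None, -1.0
    for subnet in subnets:
        if measure_latency(subnet) > budget_ms:   # measured on the target device
            continue
        acc = estimate_accuracy(subnet)           # e.g., on a small on-device validation set
        if acc > best_acc:
            best, best_acc = subnet, acc
    return best

# Toy usage with hypothetical subnets described by (name, latency_ms, accuracy).
subnets = [("tiny", 18.0, 0.71), ("base", 42.0, 0.78), ("large", 95.0, 0.82)]
chosen = select_subnet(
    subnets,
    measure_latency=lambda s: s[1],
    estimate_accuracy=lambda s: s[2],
    budget_ms=50.0,
)  # -> ("base", 42.0, 0.78)
```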
- SenSys ’22: Turbo: Opportunistic Enhancement for Edge Video Analytics. Yan Lu, Shiqi Jiang, Ting Cao, and Yuanchao Shu. 2022.
Edge computing is being widely used for video analytics. To alleviate the inherent tension between accuracy and cost, various video analytics pipelines have been proposed to optimize the usage of GPU on edge nodes. Nonetheless, we find that GPU compute resources provisioned for edge nodes are commonly under-utilized due to video content variations, subsampling, and filtering at different places of a video analytics pipeline. As opposed to model and pipeline optimization, in this work we study the problem of opportunistic data enhancement using the non-deterministic and fragmented idle GPU resources. Specifically, we propose a task-specific discrimination and enhancement module and a model-aware adversarial training mechanism, providing a way to exploit idle resources to identify and transform pipeline-specific, low-quality images in an accurate and efficient manner. A multi-exit enhancement model structure and a resource-aware scheduler are further developed to make online enhancement decisions and fine-grained inference execution under latency and GPU resource constraints. Experiments across multiple video analytics pipelines and datasets reveal that our system boosts DNN object detection accuracy by 7.27-11.34% by judiciously allocating 15.81-37.67% of idle resources to frames that tend to yield greater marginal benefits from enhancement.
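A minimal sketch of the opportunistic scheduling idea, with purely illustrative names (`quality_score`, `exit_costs_ms`) rather than Turbo's actual scheduler: spend the fragmented idle GPU budget on the frames likely to benefit most, choosing a multi-exit enhancement depth that still fits the remaining budget.

```python
def schedule_enhancement(frames, idle_gpu_ms, quality_score, exit_costs_ms):
    """Sketch: assign enhancement exits to frames under an idle-GPU time budget."""
    plan = []
    # Lowest-quality frames first: they tend to yield the largest marginal gain.
    for frame in sorted(frames, key=quality_score):
        affordable = [c for c in exit_costs_ms if c <= idle_gpu_ms]
        if not affordable:
            break
        cost = max(affordable)       # deepest exit that still fits the leftover budget
        plan.append((frame, cost))
        idle_gpu_ms -= cost
    return plan

plan = schedule_enhancement(
    frames=["f1", "f2", "f3"],
    idle_gpu_ms=12.0,
    quality_score={"f1": 0.4, "f2": 0.9, "f3": 0.2}.get,
    exit_costs_ms=[3.0, 5.0, 8.0],
)
```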
- MobiSys ’22: CoDL: Efficient CPU-GPU Co-Execution for Deep Learning Inference on Mobile Devices. Fucheng Jia, Deyu Zhang, Ting Cao, Shiqi Jiang, Yunxin Liu, Ju Ren, and Yaoxue Zhang. 2022.
Concurrent inference execution on heterogeneous processors is critical to improve the performance of increasingly heavy deep learning (DL) models. However, available inference frameworks can only use one processor at a time, or hardly achieve speedup from concurrent execution compared to using one processor. This is due to the challenges of 1) reducing data sharing overhead and 2) properly partitioning each operator between processors. By solving these challenges, we propose CoDL, a concurrent DL inference framework for the CPU and GPU on mobile devices. It can fully utilize the heterogeneous processors to accelerate each operator of a model. It integrates two novel techniques: 1) hybrid-type-friendly data sharing, which allows each processor to use its efficient data type for inference; to reduce data sharing overhead, we also propose hybrid-dimension partitioning and operator chain methods; 2) non-linearity- and concurrency-aware latency prediction, which directs proper operator partitioning by building an extremely lightweight yet accurate latency predictor for different processors. Based on the two techniques, we build the end-to-end CoDL inference framework and evaluate it on different DL models. The results show up to 4.93× speedup and 62.3% energy saving compared with the state-of-the-art concurrent execution system.
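The partitioning step can be illustrated with a small sketch; the linear predictors below are toy assumptions (CoDL's real predictors are non-linear and concurrency-aware), and the channel-wise split is just one possible partitioning dimension.

```python
def partition_operator(total_channels, predict_cpu_ms, predict_gpu_ms):
    """Sketch: split one operator's output channels between CPU and GPU so that
    the slower side's predicted latency is minimized (both sides run concurrently)."""
    best_split, best_latency = 0, float("inf")
    for cpu_channels in range(total_channels + 1):
        gpu_channels = total_channels - cpu_channels
        # End-to-end latency of a concurrently executed operator is the max of the two sides.
        latency = max(predict_cpu_ms(cpu_channels), predict_gpu_ms(gpu_channels))
        if latency < best_latency:
            best_split, best_latency = cpu_channels, latency
    return best_split, best_latency

# Toy predictors: 0.9 ms per channel on CPU; 0.3 ms per channel plus 5 ms setup on GPU.
split, lat = partition_operator(64, lambda c: 0.9 * c, lambda c: 0.3 * c + 5.0)
```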
- MobiCom ’21: Flexible High-Resolution Object Detection on Edge Devices with Tunable Latency. Shiqi Jiang, Zhiqi Lin, Yuanchun Li, Yuanchao Shu, and Yunxin Liu. 2021.
Object detection is a fundamental building block of video analytics applications. While Neural Network (NN)-based object detection models have shown excellent accuracy on benchmark datasets, they are not well positioned for high-resolution image inference on resource-constrained edge devices. Common approaches, including down-sampling inputs and scaling up neural networks, fall short of adapting to video content changes and various latency requirements. This paper presents Remix, a flexible framework for high-resolution object detection on edge devices. Remix takes a latency budget as input and produces an image partition and model execution plan that runs off-the-shelf neural networks on non-uniformly partitioned image blocks. As a result, it maximizes the overall detection accuracy by allocating varying amounts of compute power to different areas of an image. We evaluate Remix on public datasets as well as real-world videos collected by ourselves. Experimental results show that Remix can either improve the detection accuracy by 18%-120% for a given latency budget, or achieve up to 8.1× inference speedup with accuracy on par with the state-of-the-art NNs.
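A minimal sketch of the planning idea, assuming hypothetical inputs (`importance` scores and per-block detector costs, not Remix's actual planner): important image blocks get a heavier detector and the rest a cheaper one, while the total cost stays within the latency budget.

```python
def plan_execution(blocks, importance, heavy_cost, light_cost, budget_ms):
    """Sketch: assign a heavy or light detector to each image block under a latency budget."""
    plan, spent = {}, 0.0
    for block in sorted(blocks, key=importance, reverse=True):
        remaining = len(blocks) - len(plan) - 1
        # Use the heavy detector only if the cheap detector still fits for all remaining blocks.
        if spent + heavy_cost + light_cost * remaining <= budget_ms:
            plan[block], spent = "heavy_nn", spent + heavy_cost
        else:
            plan[block], spent = "light_nn", spent + light_cost
    return plan

plan = plan_execution(
    blocks=["top_left", "top_right", "bottom"],
    importance={"top_left": 0.9, "top_right": 0.2, "bottom": 0.5}.get,
    heavy_cost=20.0, light_cost=5.0, budget_ms=35.0,
)
```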
- APSys ’20: Profiling and Optimizing Deep Learning Inference on Mobile GPUs. Shiqi Jiang, Lihao Ran, Ting Cao, Yusen Xu, and Yunxin Liu. 2020.
The mobile GPU, as ubiquitous computing hardware on almost every smartphone, is being exploited for deep learning inference. In this paper, we present our measurements of inference performance with mobile GPUs. Our observations suggest that mobile GPUs are underutilized. We study this inefficiency in depth and find that one of the root causes is the improper partition of the compute workload. To solve this, we propose a heuristics-based workload partitioning approach, considering both performance and overheads on mobile devices. Evaluation results show that our approach reduces the inference latency by up to 32.8% on various devices and neural networks.
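For illustration only, here is a sketch of heuristic workload partitioning with hypothetical cost functions (`predict_ms`, `overhead_ms`); the paper's heuristics and candidate sets are not reproduced here.

```python
def pick_work_group(global_size, candidates, predict_ms, overhead_ms):
    """Sketch: choose the work-group partition with the best predicted cost plus overhead."""
    def cost(wg):
        return predict_ms(global_size, wg) + overhead_ms(wg)
    return min(candidates, key=cost)

# Toy usage: fewer, larger work groups predict lower latency but may add launch overhead.
best_wg = pick_work_group(
    global_size=(224, 224),
    candidates=[(4, 4), (8, 8), (16, 16)],
    predict_ms=lambda g, wg: (g[0] // wg[0]) * (g[1] // wg[1]) * 0.001,
    overhead_ms=lambda wg: 0.05 if wg == (16, 16) else 0.0,
)
```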
- ACM TOSEM: Anatomizing Deep Learning Inference in Web Browsers. Qipeng Wang, Shiqi Jiang, Zhenpeng Chen, Xu Cao, Yuanchun Li, Aoyu Li, Yun Ma, Ting Cao, and Xuanzhe Liu. ACM Transactions on Software Engineering and Methodology, Aug 2024.
Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference is performed directly within Web browsers. The actual performance of in-browser inference and its impacts on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones that mainly focus on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers on 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify the contributing factors to this latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a significant 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology.
- ACM TOSN: Large-Scale Video Analytics with Cloud–Edge Collaborative Continuous Learning. Ya Nan, Shiqi Jiang, and Mo Li. ACM Transactions on Sensor Networks, Oct 2023.
Deep learning–based video analytics demands high network bandwidth to ferry large volumes of data when deployed on the cloud. When incorporated at the edge side, only lightweight deep neural network (DNN) models are affordable due to computational constraints. In this article, a cloud–edge collaborative architecture is proposed, combining edge-based inference with cloud-assisted continuous learning. Lightweight DNN models are maintained at the edge servers and continuously retrained with a more comprehensive model on the cloud, achieving high video analytics performance while reducing the amount of data transmitted between edge servers and the cloud. The proposed design faces the challenge of constraints on both computation resources at the edge servers and network bandwidth of the edge–cloud links. An accuracy gradient-based resource allocation algorithm is proposed to allocate the limited computation and network resources across different video streams to achieve the maximum overall performance. A prototype system is implemented and experimental results demonstrate the effectiveness of our system, with up to 28.6% absolute mean average precision gain compared with alternative designs.
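The allocation idea can be sketched as a greedy marginal-gain loop; the `accuracy_gain` estimator below is a hypothetical placeholder for the paper's accuracy gradient, not its actual algorithm.

```python
def allocate(streams, accuracy_gain, total_units):
    """Sketch: give each next unit of compute/bandwidth to the stream whose
    retraining accuracy is expected to improve the most per unit of resource."""
    allocation = {s: 0 for s in streams}
    for _ in range(total_units):
        # accuracy_gain(s, k) estimates the marginal gain of the (k+1)-th unit for stream s.
        best = max(streams, key=lambda s: accuracy_gain(s, allocation[s]))
        allocation[best] += 1
    return allocation

# Toy usage with diminishing returns per stream.
alloc = allocate(
    streams=["cam_1", "cam_2"],
    accuracy_gain=lambda s, k: (2.0 if s == "cam_1" else 1.0) / (k + 1),
    total_units=6,
)
```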
- ACM TOSN: Memento: An Emotion-Driven Lifelogging System with Wearables. Shiqi Jiang, Zhenjiang Li, Pengfei Zhou, and Mo Li. ACM Transactions on Sensor Networks, Jan 2019.
Due to the increasing popularity of mobile devices, the usage of lifelogging has dramatically expanded. People collect their daily memorable moments and share them with friends on social networks, which is an emerging lifestyle. We see great potential for lifelogging applications along with the rapid recent growth of the wearables market, where more sensors are introduced to wearables, e.g., electroencephalogram (EEG) sensors, that can further sense the user’s mental activities, such as emotions. In this article, we present the design and implementation of Memento, an emotion-driven lifelogging system on wearables. Memento integrates EEG sensors with smart glasses. Since memorable moments usually coincide with the user’s emotional changes, Memento leverages knowledge from the brain-computer interface domain to analyze the EEG signals, infer emotions, and automatically launch lifelogging based on that. Towards building Memento on commercial off-the-shelf wearable devices, we study EEG signals in mobility cases and propose a multi-sensor fusion based approach to estimate signal quality. We present a customized two-phase emotion recognition architecture, considering both the affordability and efficiency of wearable-class devices. We also discuss the optimization framework to automatically choose and configure the suitable lifelogging method (video, audio, or image) by analyzing the environment and system context. Finally, our experimental evaluation shows that Memento is responsive, efficient, and user-friendly on wearables.
- IEEE TITS: A Participatory Urban Traffic Monitoring System: The Power of Bus Riders. Zhidan Liu, Shiqi Jiang, Pengfei Zhou, and Mo Li. IEEE Transactions on Intelligent Transportation Systems, Jan 2017.
This paper presents a participatory sensing-based urban traffic monitoring system. Different from existing works that heavily rely on intrusive sensing or full cooperation from probe vehicles, our system exploits the power of participatory sensing and crowdsources the traffic sensing tasks to bus riders’ mobile phones. The bus riders are the information providers and, at the same time, the major consumers of the final traffic output. The system takes public buses as dummy probes to detect road traffic conditions, and collects a minimum set of cellular data together with some lightweight sensing hints from the bus riders’ mobile phones. Based on the crowdsourced data from participants, the system recovers the bus travel information and further derives the instant traffic conditions of roads covered by bus routes. Real-world experiments with a prototype implementation demonstrate the feasibility of our system, which achieves accurate and fine-grained traffic estimation with modest sensing and computation overhead on the crowd side.