Shiqi Jiang
I am a Senior Researcher with the Systems and Networking Research Group at Microsoft Research Asia (MSRA). I received my Ph.D. in Computer Science from Nanyang Technological University in 2018, supervised by Prof. Mo Li, and my Bachelor's degree in Computer Engineering from Zhejiang University.
My research interests broadly fall in edge computing, mobile sensing, the Internet of Things (IoT), and wearables. My recent research mainly focuses on Edge AI, where I especially investigate the following topics: efficient inference systems on the edge, continuous learning systems for the edge, and AI-powered sensing systems.
I am constantly recruiting research interns. If you are interested in our work, please feel free to contact me.
News
Mar 9, 2024 | One paper was conditionally accepted to MobiSys 2024. 🎊🎊 |
---|---|
Feb 8, 2024 | Our measurement study on “In-Browser Deep Learning Inference” was released. |
Jan 24, 2024 | One paper was accepted to MobiCom 2024 (Summer Round). |
Mar 3, 2023 | One paper was conditionally accepted to MobiSys 2023. |
Nov 22, 2022 | AdaptiveNet got accepted to MobiCom 2023 (Summer Round). |
Recent Publications
- MobiSys ’24: Empowering In-Browser Deep Learning Inference on Edge Devices with Just-In-Time Kernel Optimizations. Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Xu Cao, Yuanchun Li, Qipeng Wang, Deyun Zhang, Ju Ren, Yunxin Liu, Lili Qiu, and Mao Yang. 2024.
Web is increasingly becoming the primary platform to deliver AI services onto edge devices, making in-browser deep learning (DL) inference more prominent. Nevertheless, the heterogeneity of edge devices, combined with the underdeveloped state of Web hardware acceleration practices, hinders current in-browser inference from achieving its full performance potential on target devices. To address this issue, this paper presents the pioneering in-browser inference system, nnJIT, which enables just-in-time (JIT) auto-generation of optimized computing kernels for edge devices. nnJIT is built upon two novel techniques that significantly reduce kernel search and compilation overhead while consistently improving performance: Tensor-Web Compiling Co-Design lowers compiling costs by around 100× by eliminating redundant and ineffective compiling passes; Web-Specific Lite Kernel Optimization Space reduces kernel tuning costs by focusing on Web programming requirements and efficient device resource utilization, pruning the optimization space from millions to only dozens. nnJIT is evaluated on modern models, e.g., BART, T5, and Llama 2, on a range of edge devices including laptops and smartphones, using different browsers and hardware from ARM, Intel, AMD and Nvidia. The results show that nnJIT can achieve up to 8.2× speedup within 30 seconds compared to existing baselines.
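To give a feel for the pruned-space idea, here is a minimal sketch in Python. It is not nnJIT's actual API; the `generate_kernel` and `benchmark` helpers are hypothetical stand-ins for JIT kernel emission and on-device measurement, and the tile-size list plays the role of the "dozens of candidates" tuning space.

```python
import time

# Hypothetical pruned tuning space: only a handful of Web-friendly tile sizes,
# so the search space holds dozens of candidates instead of millions.
TILE_SIZES = [4, 8, 16, 32]

def generate_kernel(tile):
    """Stand-in for JIT kernel generation (e.g., emitting WebAssembly/WebGPU code)."""
    def matmul(a, b, n):
        c = [[0.0] * n for _ in range(n)]
        for i in range(0, n, tile):          # tiled over rows of the output
            for k in range(n):
                for ii in range(i, min(i + tile, n)):
                    row_a, row_c = a[ii], c[ii]
                    for j in range(n):
                        row_c[j] += row_a[k] * b[k][j]
        return c
    return matmul

def benchmark(kernel, n=64):
    """Measure one candidate kernel on the target device."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    kernel(a, b, n)
    return time.perf_counter() - start

# Just-in-time selection: measure every candidate and keep the fastest one.
best_kernel = min((generate_kernel(t) for t in TILE_SIZES), key=benchmark)
```

Because the candidate set is tiny, this kind of measure-and-pick loop can run within the page's startup budget rather than as an offline tuning job.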
- MobiCom ’24: AutoDroid: LLM-powered Task Automation in Android. Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024.
Mobile task automation is an attractive technique that aims to enable voice-based, hands-free user interaction with smartphones. However, existing approaches suffer from poor scalability due to the limited language understanding ability and the non-trivial manual efforts required from developers or end users. The recent advance of large language models (LLMs) in language understanding and reasoning inspires us to rethink the problem from a model-centric perspective, where task preparation, comprehension, and execution are handled by a unified language model. In this work, we introduce AutoDroid, a mobile task automation system capable of handling arbitrary tasks on any Android application without manual efforts. The key insight is to combine the commonsense knowledge of LLMs and the domain-specific knowledge of apps through automated dynamic analysis. The main components include a functionality-aware UI representation method that bridges the UI with the LLM, exploration-based memory injection techniques that augment the app-specific domain knowledge of the LLM, and a multi-granularity query optimization module that reduces the cost of model inference. We integrate AutoDroid with off-the-shelf LLMs including online GPT-4/GPT-3.5 and on-device Vicuna, and evaluate its performance on a new benchmark for memory-augmented Android task automation with 158 common tasks. The results demonstrate that AutoDroid is able to precisely generate actions with an accuracy of 90.9% and complete tasks with a success rate of 71.3%, outperforming the GPT-4-powered baselines by 36.4% and 39.7%, respectively.
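A minimal sketch of the underlying prompting idea, assuming hypothetical names such as `build_prompt`, `ui_elements`, and `app_memory` (these are not AutoDroid's actual interfaces): a text UI representation of the current screen is combined with app-specific knowledge gathered from offline exploration, and the LLM is asked to pick an action.

```python
# Sketch only: combine a functionality-aware UI representation with injected
# app memory in a single LLM prompt (names are illustrative, not AutoDroid's API).
def build_prompt(task, ui_elements, app_memory):
    ui_text = "\n".join(
        f"[{i}] {e['type']}: {e['text']}" for i, e in enumerate(ui_elements)
    )
    memory_text = "\n".join(app_memory)  # knowledge mined via offline app exploration
    return (
        f"Task: {task}\n"
        f"Known app behaviors:\n{memory_text}\n"
        f"Current screen:\n{ui_text}\n"
        "Respond with the index of the UI element to act on."
    )

prompt = build_prompt(
    task="Mute notifications for 1 hour",
    ui_elements=[{"type": "button", "text": "Settings"},
                 {"type": "switch", "text": "Do Not Disturb"}],
    app_memory=["The 'Do Not Disturb' switch silences notifications."],
)
# The prompt would then be sent to GPT-4/GPT-3.5 or an on-device Vicuna model.
```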
- MobiSys ’23: NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors. Jianyu Wei, Ting Cao, Shijie Cao, Shiqi Jiang, Shaowei Fu, Mao Yang, Yanyong Zhang, and Yunxin Liu. 2023.
Mobile devices are increasingly equipped with heterogeneous multi-processors, e.g., CPU + GPU + DSP. Yet existing Neural Network (NN) inference fails to fully utilize the computing power of heterogeneous multi-processors due to the sequential structures of NN models. To this end, this paper proposes NN-Stretch, a new model adaptation strategy, as well as the supporting system. It automatically branches a given model according to the processor architecture characteristics. Compared to other popular model adaptation techniques such as model pruning, which often sacrifices accuracy, NN-Stretch accelerates inference while preserving accuracy. The key idea of NN-Stretch is to horizontally stretch a model structure, from a long and narrow model to a short and wide one with multiple branches. We formulate the model branching into an optimization problem. NN-Stretch narrows down the design space by taking into account hard latency constraints, through varying where the branches converge and how each branch is scaled to fit heterogeneous processors, as well as soft accuracy constraints, through maintaining the model skeleton and the expressiveness of each branch. According to the constraints, NN-Stretch can efficiently generate accurate and efficient multi-branch models. To facilitate easy deployment, this paper also devises a subgraph-based spatial scheduler for existing inference frameworks to execute the multi-branch models in parallel. Our experimental results are very promising, with up to 3.85× speedup compared to single CPU/GPU/DSP execution and up to 0.8% accuracy improvement.
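The stretching idea can be sketched in a few lines of PyTorch. This is not NN-Stretch's implementation; `StretchedStage` and its branch widths are illustrative, showing how a deep sequential stage could be replaced by shallower parallel branches that an inference framework may then map to different processors.

```python
import torch
import torch.nn as nn

class StretchedStage(nn.Module):
    """Sketch: one stage stretched into several narrower, independent branches."""
    def __init__(self, channels, num_branches=3):
        super().__init__()
        width = channels // num_branches  # each branch scaled for its target processor
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, width, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(width, channels, 3, padding=1),
            )
            for _ in range(num_branches)
        ])

    def forward(self, x):
        # Branches are independent, so a scheduler can execute them concurrently
        # on CPU / GPU / DSP before their outputs converge here.
        return x + sum(branch(x) for branch in self.branches)

stage = StretchedStage(channels=64)
out = stage(torch.randn(1, 64, 56, 56))
```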
- MobiCom ’23: AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments. Hao Wen, Yuanchun Li, Zunshuai Zhang, Shiqi Jiang, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang, and Yunxin Liu. 2023.
Deep learning models are increasingly deployed to edge devices for real-time applications. To ensure stable service quality across diverse edge environments, it is highly desirable to generate tailored model architectures for different conditions. However, conventional pre-deployment model generation approaches are not satisfactory due to the difficulty of handling the diversity of edge environments and the demand for edge information. In this paper, we propose to adapt the model architecture after deployment in the target environment, where the model quality can be precisely measured and private edge data can be retained. To achieve efficient and effective edge model generation, we introduce a pretraining-assisted on-cloud model elastification method and an edge-friendly on-device architecture search method. Model elastification generates a high-quality search space of model architectures with the guidance of a developer-specified oracle model. Each subnet in the space is a valid model with a different environment affinity, and each device efficiently finds and maintains the most suitable subnet based on a series of edge-tailored optimizations. Extensive experiments on various edge devices demonstrate that our approach is able to achieve significantly better accuracy-latency tradeoffs (e.g., 46.74% higher average accuracy with a 60% latency budget) than strong baselines with minimal overhead (13 GPU hours in the cloud and 2 minutes on the edge server).
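A minimal sketch of on-device subnet selection, under the assumption of hypothetical helpers (`measure_latency`, `estimate_accuracy`) rather than AdaptiveNet's real optimizations: among the subnets elasticized from the oracle model, keep the most accurate one that still meets the device's latency budget.

```python
def select_subnet(subnets, measure_latency, estimate_accuracy, budget_ms):
    """Sketch: pick the most accurate subnet that fits the device's latency budget."""
    best, best_acc = None, -1.0
    for subnet in subnets:
        if measure_latency(subnet) > budget_ms:   # measured on the target device
            continue
        acc = estimate_accuracy(subnet)           # e.g., on a small on-device validation set
        if acc > best_acc:
            best, best_acc = subnet, acc
    return best

# Toy usage with hypothetical subnets described by (name, latency_ms, accuracy).
subnets = [("tiny", 18.0, 0.71), ("base", 42.0, 0.78), ("large", 95.0, 0.82)]
chosen = select_subnet(
    subnets,
    measure_latency=lambda s: s[1],
    estimate_accuracy=lambda s: s[2],
    budget_ms=50.0,
)  # -> ("base", 42.0, 0.78)
```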
- SenSys ’22: Turbo: Opportunistic Enhancement for Edge Video Analytics. Yan Lu, Shiqi Jiang, Ting Cao, and Yuanchao Shu. 2022.
Edge computing is being widely used for video analytics. To alleviate the inherent tension between accuracy and cost, various video analytics pipelines have been proposed to optimize the usage of GPU on edge nodes. Nonetheless, we find that GPU compute resources provisioned for edge nodes are commonly under-utilized due to video content variations, subsampling, and filtering at different places of a video analytics pipeline. As opposed to model and pipeline optimization, in this work we study the problem of opportunistic data enhancement using the non-deterministic and fragmented idle GPU resources. Specifically, we propose a task-specific discrimination and enhancement module and a model-aware adversarial training mechanism, providing a way to exploit idle resources to identify and transform pipeline-specific, low-quality images in an accurate and efficient manner. A multi-exit enhancement model structure and a resource-aware scheduler are further developed to make online enhancement decisions and fine-grained inference execution under latency and GPU resource constraints. Experiments across multiple video analytics pipelines and datasets reveal that our system boosts DNN object detection accuracy by 7.27-11.34% by judiciously allocating 15.81-37.67% of idle resources to frames that tend to yield greater marginal benefits from enhancement.
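A minimal sketch of the opportunistic scheduling idea, with purely illustrative names (`quality_score`, `exit_costs_ms`) rather than Turbo's actual scheduler: spend the fragmented idle GPU budget on the frames likely to benefit most, choosing a multi-exit enhancement depth that still fits the remaining budget.

```python
def schedule_enhancement(frames, idle_gpu_ms, quality_score, exit_costs_ms):
    """Sketch: assign enhancement exits to frames under an idle-GPU time budget."""
    plan = []
    # Lowest-quality frames first: they tend to yield the largest marginal gain.
    for frame in sorted(frames, key=quality_score):
        affordable = [c for c in exit_costs_ms if c <= idle_gpu_ms]
        if not affordable:
            break
        cost = max(affordable)       # deepest exit that still fits the leftover budget
        plan.append((frame, cost))
        idle_gpu_ms -= cost
    return plan

plan = schedule_enhancement(
    frames=["f1", "f2", "f3"],
    idle_gpu_ms=12.0,
    quality_score={"f1": 0.4, "f2": 0.9, "f3": 0.2}.get,
    exit_costs_ms=[3.0, 5.0, 8.0],
)
```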
- MobiSys ’22: CoDL: Efficient CPU-GPU Co-Execution for Deep Learning Inference on Mobile Devices. Fucheng Jia, Deyu Zhang, Ting Cao, Shiqi Jiang, Yunxin Liu, Ju Ren, and Yaoxue Zhang. 2022.
Concurrent inference execution on heterogeneous processors is critical to improve the performance of increasingly heavy deep learning (DL) models. However, available inference frameworks can only use one processor at a time, or hardly achieve speedup from concurrent execution compared to using one processor. This is due to the challenges of 1) reducing data sharing overhead and 2) properly partitioning each operator between processors. By solving these challenges, we propose CoDL, a concurrent DL inference framework for the CPU and GPU on mobile devices. It can fully utilize the heterogeneous processors to accelerate each operator of a model. It integrates two novel techniques: 1) hybrid-type-friendly data sharing, which allows each processor to use its efficient data type for inference; to reduce data sharing overhead, we also propose hybrid-dimension partitioning and operator chain methods; 2) non-linearity- and concurrency-aware latency prediction, which directs proper operator partitioning by building an extremely lightweight yet accurate latency predictor for different processors. Based on the two techniques, we build the end-to-end CoDL inference framework and evaluate it on different DL models. The results show up to 4.93× speedup and 62.3% energy saving compared with the state-of-the-art concurrent execution system.
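The partitioning step can be illustrated with a small sketch; the linear predictors below are toy assumptions (CoDL's real predictors are non-linear and concurrency-aware), and the channel-wise split is just one possible partitioning dimension.

```python
def partition_operator(total_channels, predict_cpu_ms, predict_gpu_ms):
    """Sketch: split one operator's output channels between CPU and GPU so that
    the slower side's predicted latency is minimized (both sides run concurrently)."""
    best_split, best_latency = 0, float("inf")
    for cpu_channels in range(total_channels + 1):
        gpu_channels = total_channels - cpu_channels
        # End-to-end latency of a concurrently executed operator is the max of the two sides.
        latency = max(predict_cpu_ms(cpu_channels), predict_gpu_ms(gpu_channels))
        if latency < best_latency:
            best_split, best_latency = cpu_channels, latency
    return best_split, best_latency

# Toy predictors: 0.9 ms per channel on CPU; 0.3 ms per channel plus 5 ms setup on GPU.
split, lat = partition_operator(64, lambda c: 0.9 * c, lambda c: 0.3 * c + 5.0)
```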
- MobiCom ’21: Flexible High-Resolution Object Detection on Edge Devices with Tunable Latency. Shiqi Jiang, Zhiqi Lin, Yuanchun Li, Yuanchao Shu, and Yunxin Liu. 2021.
Object detection is a fundamental building block of video analytics applications. While Neural Network (NN)-based object detection models have shown excellent accuracy on benchmark datasets, they are not well positioned for high-resolution image inference on resource-constrained edge devices. Common approaches, including down-sampling inputs and scaling up neural networks, fall short of adapting to video content changes and various latency requirements. This paper presents Remix, a flexible framework for high-resolution object detection on edge devices. Remix takes a latency budget as input and produces an image partition and model execution plan that runs off-the-shelf neural networks on non-uniformly partitioned image blocks. As a result, it maximizes the overall detection accuracy by allocating varying amounts of compute power to different areas of an image. We evaluate Remix on public datasets as well as real-world videos collected by ourselves. Experimental results show that Remix can either improve the detection accuracy by 18%-120% for a given latency budget, or achieve up to 8.1× inference speedup with accuracy on par with the state-of-the-art NNs.
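A minimal sketch of the planning idea, assuming hypothetical inputs (`importance` scores and per-block detector costs, not Remix's actual planner): important image blocks get a heavier detector and the rest a cheaper one, while the total cost stays within the latency budget.

```python
def plan_execution(blocks, importance, heavy_cost, light_cost, budget_ms):
    """Sketch: assign a heavy or light detector to each image block under a latency budget."""
    plan, spent = {}, 0.0
    for block in sorted(blocks, key=importance, reverse=True):
        remaining = len(blocks) - len(plan) - 1
        # Use the heavy detector only if the cheap detector still fits for all remaining blocks.
        if spent + heavy_cost + light_cost * remaining <= budget_ms:
            plan[block], spent = "heavy_nn", spent + heavy_cost
        else:
            plan[block], spent = "light_nn", spent + light_cost
    return plan

plan = plan_execution(
    blocks=["top_left", "top_right", "bottom"],
    importance={"top_left": 0.9, "top_right": 0.2, "bottom": 0.5}.get,
    heavy_cost=20.0, light_cost=5.0, budget_ms=35.0,
)
```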
- APSys ’20: Profiling and Optimizing Deep Learning Inference on Mobile GPUs. Shiqi Jiang, Lihao Ran, Ting Cao, Yusen Xu, and Yunxin Liu. 2020.
The mobile GPU, as ubiquitous computing hardware on almost every smartphone, is being exploited for deep learning inference. In this paper, we present our measurements of inference performance with mobile GPUs. Our observations suggest that mobile GPUs are underutilized. We study this inefficiency in depth and find that one of the root causes is the improper partition of the compute workload. To solve this, we propose a heuristics-based workload partitioning approach, considering both performance and overheads on mobile devices. Evaluation results show that our approach reduces the inference latency by up to 32.8% on various devices and neural networks.
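For illustration only, here is a sketch of heuristic workload partitioning with hypothetical cost functions (`predict_ms`, `overhead_ms`); the paper's heuristics and candidate sets are not reproduced here.

```python
def pick_work_group(global_size, candidates, predict_ms, overhead_ms):
    """Sketch: choose the work-group partition with the best predicted cost plus overhead."""
    def cost(wg):
        return predict_ms(global_size, wg) + overhead_ms(wg)
    return min(candidates, key=cost)

# Toy usage: fewer, larger work groups predict lower latency but may add launch overhead.
best_wg = pick_work_group(
    global_size=(224, 224),
    candidates=[(4, 4), (8, 8), (16, 16)],
    predict_ms=lambda g, wg: (g[0] // wg[0]) * (g[1] // wg[1]) * 0.001,
    overhead_ms=lambda wg: 0.05 if wg == (16, 16) else 0.0,
)
```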
- ACM TOSEM: Anatomizing Deep Learning Inference in Web Browsers. Qipeng Wang, Shiqi Jiang, Zhenpeng Chen, Xu Cao, Yuanchun Li, Aoyu Li, Yun Ma, Ting Cao, and Xuanzhe Liu. ACM Transactions on Software Engineering and Methodology, Aug 2024.
Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference is performed directly within Web browsers. The actual performance of in-browser inference and its impacts on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones that mainly focus on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers on 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify the contributing factors to this latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a significant 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology.
- ACM TOSN: Large-Scale Video Analytics with Cloud–Edge Collaborative Continuous Learning. Ya Nan, Shiqi Jiang, and Mo Li. ACM Transactions on Sensor Networks, Oct 2023.
Deep learning–based video analytics demands high network bandwidth to ferry large volumes of data when deployed on the cloud. When incorporated at the edge side, only lightweight deep neural network (DNN) models are affordable due to computational constraints. In this article, a cloud–edge collaborative architecture is proposed, combining edge-based inference with cloud-assisted continuous learning. Lightweight DNN models are maintained at the edge servers and continuously retrained with a more comprehensive model on the cloud, achieving high video analytics performance while reducing the amount of data transmitted between edge servers and the cloud. The proposed design faces the challenge of constraints on both computation resources at the edge servers and network bandwidth of the edge–cloud links. An accuracy gradient-based resource allocation algorithm is proposed to allocate the limited computation and network resources across different video streams to achieve the maximum overall performance. A prototype system is implemented and experimental results demonstrate the effectiveness of our system, with up to 28.6% absolute mean average precision gain compared with alternative designs.
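The allocation idea can be sketched as a greedy marginal-gain loop; the `accuracy_gain` estimator below is a hypothetical placeholder for the paper's accuracy gradient, not its actual algorithm.

```python
def allocate(streams, accuracy_gain, total_units):
    """Sketch: give each next unit of compute/bandwidth to the stream whose
    retraining accuracy is expected to improve the most per unit of resource."""
    allocation = {s: 0 for s in streams}
    for _ in range(total_units):
        # accuracy_gain(s, k) estimates the marginal gain of the (k+1)-th unit for stream s.
        best = max(streams, key=lambda s: accuracy_gain(s, allocation[s]))
        allocation[best] += 1
    return allocation

# Toy usage with diminishing returns per stream.
alloc = allocate(
    streams=["cam_1", "cam_2"],
    accuracy_gain=lambda s, k: (2.0 if s == "cam_1" else 1.0) / (k + 1),
    total_units=6,
)
```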
- ACM TOSN: Memento: An Emotion-Driven Lifelogging System with Wearables. Shiqi Jiang, Zhenjiang Li, Pengfei Zhou, and Mo Li. ACM Transactions on Sensor Networks, Jan 2019.
Due to the increasing popularity of mobile devices, the usage of lifelogging has dramatically expanded. People collect their daily memorable moments and share them with friends on social networks, which is an emerging lifestyle. We see great potential for lifelogging applications along with the rapid recent growth of the wearables market, where more sensors are introduced to wearables, e.g., electroencephalogram (EEG) sensors, that can further sense the user’s mental activities, such as emotions. In this article, we present the design and implementation of Memento, an emotion-driven lifelogging system on wearables. Memento integrates EEG sensors with smart glasses. Since memorable moments usually coincide with the user’s emotional changes, Memento leverages knowledge from the brain-computer interface domain to analyze the EEG signals, infer emotions, and automatically launch lifelogging based on that. Towards building Memento on commercial off-the-shelf wearable devices, we study EEG signals in mobility cases and propose a multi-sensor fusion based approach to estimate signal quality. We present a customized two-phase emotion recognition architecture, considering both the affordability and efficiency of wearable-class devices. We also discuss the optimization framework to automatically choose and configure the suitable lifelogging method (video, audio, or image) by analyzing the environment and system context. Finally, our experimental evaluation shows that Memento is responsive, efficient, and user-friendly on wearables.
- IEEE TITS: A Participatory Urban Traffic Monitoring System: The Power of Bus Riders. Zhidan Liu, Shiqi Jiang, Pengfei Zhou, and Mo Li. IEEE Transactions on Intelligent Transportation Systems, Jan 2017.
This paper presents a participatory sensing-based urban traffic monitoring system. Different from existing works that heavily rely on intrusive sensing or full cooperation from probe vehicles, our system exploits the power of participatory sensing and crowdsources the traffic sensing tasks to bus riders’ mobile phones. The bus riders are the information providers and, at the same time, the major consumers of the final traffic output. The system takes public buses as dummy probes to detect road traffic conditions, and collects a minimum set of cellular data together with some lightweight sensing hints from the bus riders’ mobile phones. Based on the crowdsourced data from participants, the system recovers the bus travel information and further derives the instant traffic conditions of roads covered by bus routes. Real-world experiments with a prototype implementation demonstrate the feasibility of our system, which achieves accurate and fine-grained traffic estimation with modest sensing and computation overhead on the crowd side.