NobleBlocks

State Key Laboratory of Computer Architecture

Facility · Beijing, China

Research output, citation impact, and the most-cited recent papers from State Key Laboratory of Computer Architecture. Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works: 119
Citations: 7.4K
h-index: 33
i10-index: 103
Also known as: State Key Lab of Computer Architecture; State Key Laboratory of Computer Architecture; 计算机体系结构国家重点实验室
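
For readers unfamiliar with the two indices listed above, the following is a minimal sketch of their standard definitions: the h-index is the largest h such that h works have at least h citations each, and the i10-index counts works with at least 10 citations. The citation counts in `sample` are hypothetical and are not taken from this index.

```python
def h_index(citations):
    """Largest h such that at least h works have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of works with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical citation counts, for illustration only.
sample = [25, 8, 5, 3, 3, 1, 0]
print(h_index(sample), i10_index(sample))
```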

Top-cited papers from State Key Laboratory of Computer Architecture

NONCODE 2016: an informative and valuable data source of long non-coding RNAs
Yi Zhao, Hui Li, Shuangsang Fang, Yue Kang +4 more
2015 · Nucleic Acids Research · 652 citations · doi:10.1093/nar/gkv1252

NONCODE (http://www.bioinfo.org/noncode/) is an interactive database that aims to present the most complete collection and annotation of non-coding RNAs, especially long non-coding RNAs (lncRNAs). The recently reduced cost of RNA sequencing has produced an explosion of newly identified data. Revolutionary third-generation sequencing methods have also contributed to more accurate annotations. Accumulated experimental data also provide more comprehensive knowledge of lncRNA functions. In this update, NONCODE has added six new species, bringing the total to 16 species. The number of lncRNAs in NONCODE has increased from 210,831 to 527,336. For human and mouse, the lncRNA numbers are 167,150 and 130,558, respectively. NONCODE 2016 has also introduced three important new features: (i) conservation annotation; (ii) the relationships between lncRNAs and diseases; and (iii) an interface to choose high-quality datasets through predicted scores, literature support and long-read sequencing method support. NONCODE is also accessible through http://www.noncode.org/.

Cambricon: An Instruction Set Architecture for Neural Networks
Shaoli Liu, Zidong Du, Jinhua Tao, Dong Seog Han +4 more
2016 · 118 citations · doi:10.1109/isca.2016.42

Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPU and GPGPU), which are usually not energy-efficient since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have been proposed recently to improve the energy-efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an NN (such as layers), or even an NN as a whole. Although straightforward and easy-to-implement for a limited set of similar NN techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency. In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over a total of ten representative yet distinct NN techniques has demonstrated that Cambricon exhibits strong descriptive capacity over a broad range of NN techniques, and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [5] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks.

Opposing roles and potential antagonistic mechanism between TGF-β and BMP pathways: Implications for cancer progression
Junya Ning, Yi Zhao, Yingnan Ye, Jinpu Yu
2019 · EBioMedicine · 96 citations · doi:10.1016/j.ebiom.2019.02.033

The transforming growth factor β (TGF-β) superfamily participates in tumour proliferation, apoptosis, differentiation, migration, invasion, immune evasion and extracellular matrix remodelling. Genetic deficiency in distinct components of TGF-β and BMP-induced signalling pathways or their excessive activation has been reported to regulate the development and progression of some cancers. As more in-depth studies about this superfamily have been conducted, more evidence suggests that the TGF-β and BMP pathways play opposing roles. The cross-talk of these 2 pathways has been widely studied in kidney disease and bone formation, and the opposing effects have also been observed in some cancers. However, the antagonistic mechanisms are still insufficiently investigated in cancer. In this review, we aim to present more evidence and possible mechanisms accounting for the antagonism between these 2 pathways, which might provide some clues for further study in cancer.

Protein‐coding genes combined with long noncoding RNA as a novel transcriptome molecular staging model to predict the survival of patients with esophageal squamous cell carcinoma
Jincheng Guo, Yang Wu, Yang Chen, Feng Pan +4 more
2018 · Cancer Communications · 75 citations · doi:10.1186/s40880-018-0277-0

BACKGROUND: Esophageal squamous cell carcinoma (ESCC) is the predominant subtype of esophageal carcinoma in China. This study aimed to develop a staging model to predict outcomes of patients with ESCC. METHODS: Using Cox regression analysis, principal component analysis (PCA), partitioning clustering, Kaplan-Meier analysis, receiver operating characteristic (ROC) curve analysis, and classification and regression tree (CART) analysis, we mined the Gene Expression Omnibus database to determine the expression profiles of genes in 179 patients with ESCC from the GSE63624 and GSE63622 datasets. RESULTS: Univariate Cox regression analysis of the GSE63624 dataset revealed that 2404 protein-coding genes (PCGs) and 635 long non-coding RNAs (lncRNAs) were associated with the survival of patients with ESCC. PCA categorized these PCGs and lncRNAs into three principal components (PCs), which were used to cluster the patients into three groups. ROC analysis demonstrated that the predictive ability of PCG-lncRNA PCs when applied to new patients was better than that of the tumor-node-metastasis staging (area under ROC curve [AUC]: 0.69 vs. 0.65, P < 0.05). Accordingly, we constructed a molecular disaggregated model comprising one lncRNA and two PCGs, which we designated as the LSB staging model, using CART analysis in the GSE63624 dataset. This LSB staging model classified the GSE63622 dataset of patients into three different groups, and its effectiveness was validated by analysis of another cohort of 105 patients. CONCLUSIONS: The LSB staging model has clinical significance for the prognosis prediction of patients with ESCC and may serve as a three-gene staging microarray.

PARSEC3.0
Xusheng Zhan, Yungang Bao, Christian Bienia, Kai Li
2017 · ACM SIGARCH Computer Architecture News · 69 citations · doi:10.1145/3053277.3053279

Benchmarks play a very important role in accelerating the development and research of CMPs. As one of them, the PARSEC suite continues to be updated and revised so that it can offer better support for researchers. Earlier versions of PARSEC have enough workloads to evaluate the CPU, cache, and memory behavior of CMPs, but they lack applications based on a network stack for assessing the network performance of CMPs. In this work, we introduce PARSEC3.0, a new version of the PARSEC suite that implements a user-level network stack and generates three network workloads with this stack to cover the network domain. We explore the input sets of splash-2 and expand them to multiple scales, a.k.a. splash-2x. We integrate splash-2 and splash-2x into the PARSEC framework so that researchers can use these benchmark suites conveniently. Finally, we evaluate the u-TCP/IP stack and new network workloads, and analyze the characteristics of splash-2 and splash-2x.

Deconstructing iterative optimization
Yang Chen, Shuangde Fang, Yuanjie Huang, Lieven Eeckhout +3 more
2012 · ACM Transactions on Architecture and Code Optimization · 49 citations · doi:10.1145/2355585.2355594

Iterative optimization is a popular compiler optimization approach that has been studied extensively over the past decade. In this article, we deconstruct iterative optimization by evaluating whether it works across datasets and by analyzing why it works. Up to now, most iterative optimization studies are based on a premise which was never truly evaluated: that it is possible to learn the best compiler optimizations across datasets. In this article, we evaluate this question for the first time with a very large number of datasets. We therefore compose KDataSets, a dataset suite with 1000 datasets for 32 programs, which we release to the public. We characterize the diversity of KDataSets, and subsequently use it to evaluate iterative optimization. For all 32 programs, we find that there exists at least one combination of compiler optimizations that achieves at least 83% of the best possible speedup across all datasets on two widely used compilers (Intel's ICC and GNU's GCC). This optimal combination is program-specific and yields speedups up to 3.75× (averaged across datasets of a program) over the highest optimization level of the compilers (-O3 for GCC and -fast for ICC). This finding suggests that optimizing programs across datasets might be much easier than previously anticipated. In addition, we evaluate the idea of introducing compiler choice as part of iterative optimization. We find that it can further improve the performance of iterative optimization because different programs favor different compilers. We also investigate why iterative optimization works by analyzing the optimal combinations. We find that only a handful of optimizations yield most of the speedup. Finally, we show that optimizations interact in a complex and sometimes counterintuitive way through two case studies, which confirms that iterative optimization is an irreplaceable and important compiler strategy.
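
One way to read the "fraction of the best possible speedup across all datasets" criterion in the abstract above is sketched below. This is an illustrative interpretation, not the authors' tooling: for each dataset the best speedup over all combinations is taken as the reference, and the combination with the highest worst-case fraction of that reference is selected. The combination names and speedup values in `measured` are hypothetical.

```python
def best_combination_across_datasets(speedups):
    """Pick the combination with the highest worst-case fraction of the best speedup.

    speedups maps a combination name to a list of speedups, one per dataset
    (same dataset order for every combination).
    Returns (name, worst_fraction).
    """
    names = list(speedups)
    n_datasets = len(speedups[names[0]])
    # Reference: the best speedup any combination reaches on each dataset.
    best_per_dataset = [
        max(speedups[name][d] for name in names) for d in range(n_datasets)
    ]

    def worst_fraction(name):
        return min(
            speedups[name][d] / best_per_dataset[d] for d in range(n_datasets)
        )

    winner = max(names, key=worst_fraction)
    return winner, worst_fraction(winner)


# Hypothetical measurements: three flag combinations on three datasets.
measured = {
    "combo_a": [1.20, 1.50, 1.10],
    "combo_b": [1.30, 1.20, 1.25],
    "combo_c": [1.00, 1.60, 1.00],
}
name, fraction = best_combination_across_datasets(measured)
print(name, round(fraction, 3))
```

In this toy input no combination is best on every dataset, yet one of them stays within 12% of the per-dataset best everywhere, which is the kind of robustness the abstract reports at a much larger scale.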

Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor
Mingzhe Zhang, Lunkai Zhang, Lei Jiang, Zhiyong Liu +1 more
2017 · 37 citations · doi:10.1109/hpca.2017.45

Multi Level Cell (MLC) Phase Change Memory (PCM) is an enhancement of PCM technology, which provides higher capacity by allowing multiple digital bits to be stored in a single PCM cell. However, the retention time of MLC PCM is limited by the resistance drift problem and refresh operations are required. Previous work shows that there exists a trade-off between write latency and retention: a write scheme with more SET iterations and smaller current provides a longer retention time but at the cost of a longer write latency. Conversely, a write scheme with fewer SET iterations achieves high performance for writes but requires a greater number of refresh operations due to its significantly reduced retention time, and this hurts the lifetime of MLC PCM. In this paper, we show that only a small part of memory (i.e., hot memory regions) will be frequently accessed in a given period of time. Based on this observation, we propose Region Retention Monitor (RRM), a novel structure that records and predicts the write frequency of memory regions. For every incoming memory write operation, RRM selects a proper write latency for it. Our evaluations show that RRM helps the system improve the balance between system performance and memory lifetime. On the performance side, the system with RRM bridges 77.2% of the performance gap between systems with long writes and systems with short writes. On the lifetime side, a system with RRM achieves a lifetime of 6.4 years, while systems using only long writes and short writes achieve lifetimes of 10.6 and 0.3 years, respectively. Also, we can easily control the aggressiveness of RRM through an attribute called hot threshold. A more aggressively configured RRM can achieve performance only 3.5% below that of the system using static short writes, while still achieving a lifetime of 5.78 years.

RaQu: An automatic high-utilization CNN quantization and mapping framework for general-purpose RRAM Accelerator
Songyun Qu, Bing Li, Ying Wang, Dawen Xu +2 more
2020 · 36 citations · doi:10.1109/dac18072.2020.9218724

Convolutional neural networks (CNNs) have become the state-of-the-art technique in many classification tasks in IoT systems. However, low-power and area-constrained edge devices are unable to afford the expensive cost of CNNs. Resistive random access memory (RRAM) is attractive for establishing the CNN accelerator at the edge end due to its scalability, low power and in-situ dot-product. However, mapping a random network architecture onto a general-purpose RRAM accelerator suffers from severe resource underutilization. Neural network quantization offers an opportunity to rescue the degraded resource utilization. Selecting the bit-width for the vast number of parameters by hand is impractical. This paper proposes an AutoML-based array-aware quantization and mapping framework that generates fine-grained mixed-precision neural networks to optimize resource utilization in RRAM. In this framework, we design a two-stage learning and array-aware grouping strategy to quickly explore the huge search space. The experimental results show that the proposed framework achieves 18.2%~36.1% improvement in resource utilization and 0.9%~3.3% increase in model accuracy over prior coarse-grained quantization methods.

Improving DNN Accuracy on MLC PIM via Non-Ideal PIM Device Fine-Tuning
Hao Lv, Lei Zhang, Ying Wang
2024 · IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems · 34 citations · doi:10.1109/tcad.2024.3521195

Resistive random access memory (RRAM) emerges as a promising technology for developing energy-efficient deep neural network (DNN) accelerators, owing to its analog computing paradigm for matrix-vector multiplication. However, the inherent nonideal device features of RRAM cells, such as device variation, read disturbances, and limited on/off ratio, present challenges for model deployment. Therefore, to ensure accurate storage and computing precision for RRAM-based accelerators, a widely used practice is encoding a DNN weight by multiple cells, resulting in significant memory overhead and underutilization. This challenge is further exacerbated by the rapid increases in model size witnessed in recent years. While the one-to-one weight-cell mapping strategy can improve memory utilization, it inevitably introduces deviations in the mapped DNN weight from the desired value due to RRAM variation issues, leading to model accuracy degradation. In response to this challenge, we abstract the model optimization on RRAM chips as a non-ideal PIM device optimization problem, aimed at optimizing model accuracy without the requirement of precise weight programming. We systematically analyze the model optimization behavior on multilevel RRAM devices by investigating the accuracy recovery process of various fine-tuning strategies in recovering model performance under the non-ideal PIM device setting. Based on the analysis, we propose a non-ideal PIM device fine-tuning scheme to recover the model performance for multilevel RRAM under the non-ideal PIM device setting. Our proposed scheme leverages knowledge distillation and exploits input/output information of the model on RRAM to guide the fine-tuning process, finally restoring its accuracy. Experimental results demonstrate the efficacy of our non-ideal PIM device fine-tuning scheme, achieving nearly complete recovery of model performance. Our approach yields over a 3% improvement in model accuracy compared to variation-aware training approaches.

Joint Optimization of Operational Cost and Performance Interference in Cloud Data Centers
Xibo Jin, Fa Zhang, Lin Wang, Songlin Hu +2 more
2015 · IEEE Transactions on Cloud Computing · 34 citations · doi:10.1109/tcc.2015.2449839

Virtual machine (VM) scheduling is an important technique for the efficient operation of the computing resources in a data center. Previous work has mainly focused on consolidating VMs to improve resource utilization and to optimize energy consumption. However, the interference between collocated VMs is usually ignored, which can result in much worse performance degradation of the applications running on the VMs due to the contention of the shared resources. Based on this observation, we aim at designing efficient VM assignment and scheduling strategies in which we consider optimizing both the operational cost of the data center and the performance degradation of the running applications. We then propose a general model that captures the tradeoff between the two contradictory objectives. We present offline and online solutions for this problem by exploiting the spatial and temporal information of performance interference of VM collocation, where VM scheduling is performed by jointly considering the combinations and the life-cycle overlap of the VMs. Evaluation results show that the proposed methods can generate efficient schedules for VMs, achieving low operational cost while significantly reducing the performance degradation of applications in cloud data centers.

Supporting Differentiated Services in Computers via Programmable Architecture for Resourcing-on-Demand (PARD)
Jiuyue Ma, Xiufeng Sui, Ninghui Sun, Yupeng Li +4 more
2015 · 33 citations · doi:10.1145/2694344.2694382

This paper presents PARD, a programmable architecture for resourcing-on-demand that provides a new programming interface to convey an application's high-level information like quality-of-service requirements to the hardware. PARD enables new functionalities like fully hardware-supported virtualization and differentiated services in computers. PARD is inspired by the observation that a computer is inherently a network in which hardware components communicate via packets (e.g., over the NoC or PCIe). We apply principles of software-defined networking to this intra-computer network and address three major challenges. First, to deal with the semantic gap between high-level applications and underlying hardware packets, PARD attaches a high-level semantic tag (e.g., a virtual machine or thread ID) to each memory-access, I/O, or interrupt packet. Second, to make hardware components more manageable, PARD implements programmable control planes that can be integrated into various shared resources (e.g., cache, DRAM, and I/O devices) and can differentially process packets according to tag-based rules. Third, to facilitate programming, PARD abstracts all control planes as a device file tree to provide a uniform programming interface via which users create and apply tag-based rules.

A High-Throughput Neural Network Accelerator
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang +3 more
2015 · IEEE Micro · 32 citations · doi:10.1109/mm.2015.41

The authors designed an accelerator architecture for large-scale neural networks, with an emphasis on the impact of memory on accelerator design, performance, and energy. In this article, they present a concrete design at 65 nm that can perform 496 16-bit fixed-point operations in parallel every 1.02 ns, that is, 452 GOP/s, in a 3.02 mm², 485 mW footprint (excluding main memory accesses).

Statistical Performance Comparisons of Computers
Tianshi Chen, Qi Guo, Olivier Temam, Yue Wu +3 more
2014 · IEEE Transactions on Computers · 30 citations · doi:10.1109/tc.2014.2315614

As a fundamental task in computer architecture research, performance comparison has been continuously hampered by the variability of computer performance. In traditional performance comparisons, the impact of performance variability is usually ignored (i.e., the means of performance observations are compared regardless of the variability), or in a few cases directly addressed with t-statistics without checking the number and normality of performance observations. In this paper, we formulate a performance comparison as a statistical task, and empirically illustrate why and how common practices can lead to incorrect comparisons. We propose a non-parametric hierarchical performance testing (HPT) framework for performance comparison, which is significantly more practical than standard t-statistics because it does not require collecting a large number of performance observations in order to achieve a normal distribution of the sample mean. In particular, the proposed HPT can facilitate quantitative performance comparison, in which the performance speedup of one computer over another is statistically evaluated. Compared with the HPT, a common practice which uses geometric mean performance scores to estimate the performance speedup has errors of 8.0 to 56.3 percent on SPEC CPU2006 or SPEC MPI2007, which demonstrates the necessity of using appropriate statistical techniques. This HPT framework has been implemented as open-source software, and integrated in the PARSEC 3.0 benchmark suite.

BitPruner: Network Pruning for Bit-serial Accelerators
Xiandong Zhao, Ying Wang, Cheng Liu, Cong Shi +2 more
2020 · 30 citations · doi:10.1109/dac18072.2020.9218534

Bit-serial architectures (BSAs) are becoming increasingly popular in low power neural network processor (NNP) design. However, the performance and efficiency of state-of-the-art BSA NNPs depend heavily on the distribution of ineffectual weight-bits of the running neural network. To boost the efficiency of third-party BSA accelerators, this work presents Bit-Pruner, a software approach to learn BSA-favored neural networks without resorting to hardware modifications. The techniques proposed in this work not only progressively prune but also structure the non-zero bits in weights, so that the number of zero-bits in the model can be increased and also load-balanced to suit the architecture of the target BSA accelerators. According to our experiments on a set of representative neural networks, Bit-Pruner increases the bit-sparsity up to 94.4% with negligible accuracy degradation. When the bit-pruned models are deployed onto typical BSA accelerators, the average performance is 2.1X and 1.5X higher than the baselines running non-pruned and weight-pruned networks, respectively.
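
The bit-sparsity figure in the abstract above can be read as the share of zero bits among the stored weight bits. The sketch below measures that share under stated assumptions: weights are already quantized to integers, and only the magnitude is counted, in `bits` bits per weight. The abstract does not give the paper's exact number format or definition, so both the encoding and the `quantized` values are hypothetical.

```python
def bit_sparsity(weights, bits=8):
    """Fraction of zero bits across the magnitudes of integer weights.

    Assumes each weight magnitude fits in `bits` bits (an illustrative
    encoding choice, not one taken from the paper).
    """
    total_bits = bits * len(weights)
    if total_bits == 0:
        return 0.0
    limit = 1 << bits
    ones = 0
    for w in weights:
        magnitude = abs(w)
        if magnitude >= limit:
            raise ValueError(f"weight {w} does not fit in {bits} bits")
        # Count the non-zero bits of this weight.
        ones += bin(magnitude).count("1")
    return 1.0 - ones / total_bits


# Hypothetical quantized weights.
quantized = [0, 1, 3, -8, 127]
print(bit_sparsity(quantized))
```

A bit-serial accelerator does work roughly in proportion to the non-zero bits it has to process, which is why this ratio, rather than the count of zero-valued weights, is the quantity of interest.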

Empirical study of redo and undo logging in persistent memory
Hu Wan, Youyou Lu, Yuanchao Xu, Jiwu Shu
2016 · 26 citations · doi:10.1109/nvmsa.2016.7547178

Atomic and durable transactions are widely used to ensure crash consistency in persistent memory (PM). However, whether to use redo or undo logging is still a hotly debated topic in persistent memory systems. In this paper, we empirically study the performance of both redo and undo logging using NVML, a persistent memory transactional object store framework. Our results on an NVDIMM server show that redo logging significantly outperforms undo logging for workloads in which a transaction updates a large number of different objects, while it underperforms undo logging for workloads with intensive read operations. Furthermore, undo logging is more sensitive to the read-to-write ratio than redo logging. Finally, our experiments also demonstrate that asynchronous log truncation is very helpful in redo logging for log-heavy transactions.
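
The difference between the two logging schemes compared in the abstract above can be sketched with a toy in-memory store. This is an illustration of the general techniques only: it does not use NVML, has no persistence or flush ordering, and the class names `UndoLogTx` and `RedoLogTx` are hypothetical. With undo logging, updates go in place and the log keeps old values for rollback; with redo logging, updates are buffered in the log and applied at commit, so reads inside a transaction must consult the log first.

```python
class UndoLogTx:
    """Undo logging: write in place, remember old values for rollback."""

    def __init__(self, store):
        self.store = store
        self.log = []  # (key, old_value, key_existed), in update order

    def write(self, key, value):
        self.log.append((key, self.store.get(key), key in self.store))
        self.store[key] = value

    def read(self, key):
        # Data is always up to date in place, so reads go straight to the store.
        return self.store.get(key)

    def commit(self):
        self.log.clear()

    def abort(self):
        # Restore old values in reverse order of the updates.
        for key, old_value, existed in reversed(self.log):
            if existed:
                self.store[key] = old_value
            else:
                self.store.pop(key, None)
        self.log.clear()


class RedoLogTx:
    """Redo logging: buffer new values in the log, apply them at commit."""

    def __init__(self, store):
        self.store = store
        self.log = {}  # key -> new value

    def write(self, key, value):
        self.log[key] = value

    def read(self, key):
        # Reads must check the log first to see the transaction's own writes.
        if key in self.log:
            return self.log[key]
        return self.store.get(key)

    def commit(self):
        self.store.update(self.log)
        self.log.clear()

    def abort(self):
        self.log.clear()


store = {"a": 1}
tx = UndoLogTx(store)
tx.write("a", 2)
tx.abort()
print(store)
```

The extra log lookup on every read in `RedoLogTx.read` is one plausible contributor to the read-heavy slowdown the abstract reports for redo logging, while the per-update old-value copy in `UndoLogTx.write` hints at why update-heavy transactions may favor redo; the paper itself should be consulted for the measured causes.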

A P53‐related microRNA model for predicting the prognosis of hepatocellular carcinoma patients
Shuangsang Fang, Jincheng Guo, Jianhua Zhang, Jinna Liu +4 more
2019 · Journal of Cellular Physiology · 24 citations · doi:10.1002/jcp.29245

Studies have shown that microRNAs (miRNAs) play a vital role in tumor progression and patients' prognosis. Therefore, we aimed to construct a miRNA model for forecasting the survival of hepatocellular carcinoma (HCC) patients. The gene expression data of 433 patients with HCC from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus public databases were re-mined using survival analysis and receiver operating characteristic (ROC) curve analysis. A prognostic model including six miRNAs (hsa-mir-26a-1-3p, hsa-mir-188-5p, hsa-mir-212-5p, hsa-mir-149-5p, hsa-mir-105-5p, and hsa-mir-132-5p) was constructed in the training dataset (TCGA, n = 333). HCC patients were stratified into a high-risk group and a low-risk group with significantly different survival (median: 2.75 vs. 8.93 years, log-rank test p < .001). We then confirmed its stratification performance in an independent dataset (GSE116182, median: 2.55 vs 6.96 years, log-rank test p = .008). Cox regression analysis showed that the prognostic model was an independent prognostic indicator for HCC patients. Time-dependent ROC analyses comparing the prognostic ability of the model with that of TNM staging showed that the model performed better, especially at 5 years (AUC = 0.76). Functional prediction showed that the genes targeted by the six prognostic miRNAs in the prognostic model were highly expressed in the P53-related pathway. In conclusion, we constructed a prognostic miRNA model that could indicate the survival of HCC patients.

A novel multidimensional signature predicts prognosis in hepatocellular carcinoma patients
Song Wang, Jianhua Zhang, Huan Wang, Lu Yang +4 more
2018 · Journal of Cellular Physiology · 23 citations · doi:10.1002/jcp.27818

The abnormal expression of microRNAs (miRNAs) or protein‐coding genes (PCGs) has been found to be associated with the prognosis of hepatocellular carcinoma (HCC) patients. Using bioinformatics analysis methods including Cox’s proportional hazards regression analysis, the random survival forest algorithm, Kaplan–Meier, and receiver operating characteristic (ROC) curve analysis, we mined the gene expression profiles of 469 HCC patients from The Cancer Genome Atlas (n = 379) and Gene Expression Omnibus (GSE14520; n = 90) public databases. We selected a signature comprising one protein‐coding gene (PCG; DNA polymerase μ) and three miRNAs (hsa‐miR‐149‐5p, hsa‐miR‐424‐5p, hsa‐miR‐579‐5p) with the highest prediction accuracy (area under the ROC curve [AUC] = 0.72; n = 189) from the training data set. The signature stratified patients into high‐ and low‐risk groups with significantly different survival (median 27.9 vs. 55.2 months, log‐rank test, p < 0.001) in the training data set, and its risk stratification ability was validated in the test data set (median 47.4 vs. 84.4 months, log‐rank test, p = 0.03) and an independent data set (median 31.0 vs. 46.0 months, log‐rank test, p = 0.01). Multivariable Cox regression analysis showed that the signature was an independent prognostic factor. The signature also showed better survival prediction power than tumor–node–metastasis (TNM) stage (AUC signature = 0.72/0.64/0.62 vs. AUC TNM = 0.65/0.61/0.61; p < 0.05). Moreover, we validated the expression of these prognostic genes from the PCG‐miRNA signature in Huh‐7 cells by real‐time polymerase chain reaction. In conclusion, we found a signature that can predict survival of HCC patients and serve as a prognostic marker for HCC.

BPM/BPM+
Lei Liu, Zehan Cui, Yong Li, Yungang Bao +2 more
2014 · ACM Transactions on Architecture and Code Optimization · 21 citations · doi:10.1145/2579672

The main memory system is a shared resource in modern multicore machines that can result in serious interference leading to reduced throughput and unfairness. Many new memory scheduling mechanisms have been proposed to address the interference problem. However, these mechanisms usually employ relatively complex scheduling logic and need modifications to Memory Controllers (MCs), which incur expensive hardware design and manufacturing overheads. This article presents a practical software approach to effectively eliminate the interference without any hardware modifications. The key idea is to modify the OS memory management system and adopt a page-coloring-based Bank-level Partitioning Mechanism (BPM) that allocates dedicated DRAM banks to each core (or thread). By using BPM, memory requests from distinct programs are segregated across multiple memory banks to promote locality/fairness and reduce interference. We further extend BPM to BPM+ by incorporating channel-level partitioning, on which we demonstrate additional gain over BPM in many cases. To achieve benefits in the presence of diverse application memory needs and avoid performance degradation due to resource underutilization, we propose a dynamic mechanism upon BPM/BPM+ that assigns appropriate bank/channel resources based on application memory/bandwidth demands monitored through the PMU (performance-monitoring unit) and a low-overhead OS page table scanning process. We implement BPM/BPM+ in the Linux 2.6.32.15 kernel and evaluate the technique on four-core and eight-core real machines by running a large number of randomly generated multiprogrammed and multithreaded workloads. Experimental results show that BPM/BPM+ can improve the overall system throughput by 4.7%/5.9% on average (up to 8.6%/9.5%) and reduce the unfairness by an average of 4.2%/6.1% (up to 15.8%/13.9%).
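
The page-coloring idea behind bank-level partitioning can be sketched as follows. This is a simplified user-space model, not the kernel implementation described in the abstract: the bank index is assumed to be a fixed bit field of the physical frame number, and the constants `BANK_BITS` and `BANK_SHIFT`, the `ColoredAllocator` class, and the core-to-color assignment are all hypothetical. Real address-to-bank mappings depend on the memory controller.

```python
from collections import defaultdict

BANK_BITS = 3   # hypothetical: 8 bank colors
BANK_SHIFT = 1  # hypothetical position of the bank bits in the frame number


def bank_color(frame_number):
    """Bank color of a physical frame under the assumed bit layout."""
    return (frame_number >> BANK_SHIFT) & ((1 << BANK_BITS) - 1)


class ColoredAllocator:
    """Hands each core frames only from the bank colors dedicated to it."""

    def __init__(self, free_frames, colors_per_core):
        self.free = defaultdict(list)  # color -> list of free frame numbers
        for frame in free_frames:
            self.free[bank_color(frame)].append(frame)
        self.colors_per_core = colors_per_core

    def alloc(self, core):
        for color in self.colors_per_core[core]:
            if self.free[color]:
                return self.free[color].pop()
        raise MemoryError(f"no free frame in the banks assigned to core {core}")


# Two cores, each with two dedicated bank colors out of eight.
allocator = ColoredAllocator(range(64), {0: [0, 1], 1: [2, 3]})
frame = allocator.alloc(0)
print(frame, bank_color(frame))
```

Because the two cores draw from disjoint color sets, their pages never land in the same bank, which is the isolation property the abstract relies on. The dynamic mechanism it describes would correspond to changing `colors_per_core` at run time based on observed demand.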

Memos: A full hierarchy hybrid memory management framework
Lei Liu, Hao Yang, Yong Li, Mengyao Xie +2 more
2016 · 19 citations · doi:10.1109/iccd.2016.7753305

In this paper, we introduce memos, which integrates suitable memory management policies and schedules resources over the entire memory hierarchy in a hybrid memory system. Powered by an OS kernel level monitoring tool, memos captures memory patterns online, and then leverages them to guide the memory page placement and data mapping. Experimental results show that, on average, memos can benefit memory utilization, contributing to system throughput and QoS by 19.1% and 23.6%, respectively. Moreover, memos can reduce the NVM side memory latency by 3∼83.3%, energy consumption by 25.1∼99%, and benefit the NVM lifetime significantly (40× improvement on average).

Network-on-Interposer Design for Agile Neural-Network Processor Chip Customization
Mengdi Wang, Ying Wang, Cheng Liu, Lei Zhang
2021 · 16 citations · doi:10.1109/dac18074.2021.9586261

Chiplet-based multi-die integration has been regarded as a key enabler of the agile chip development flow. For 2.5D-based multi-die systems, the Network on Interposer (NoI) plays an essential role in the performance and the development cost of the chips. This work proposes a reusable NoI design for agile AI chip customization. The proposed NoI design can self-adapt to the inter-die communication patterns of various neural network applications, so the produced interposers can be reused across different AI chip specifications. Experimental results show the proposed NoI design brings 42.7%~79.5% of total data communication latency reduction in different scenarios, and it also decreases the area overhead by 26.4%.