Dear DATE community,
We, the DATE Sponsors Committee (DSC) and the DATE Executive Committee (DEC), are deeply shocked and saddened by the tragedy currently unfolding in Ukraine, and we would like to express our full solidarity with all the people and families affected by the war.
Our thoughts also go out to everyone in Ukraine and Russia, whether they are directly or indirectly affected by the events, and we extend our deep sympathy.
We condemn Russia’s military action in Ukraine, which violates international law, and we call on governments to take immediate action to protect everyone in the country, in particular its civilian population and the people affiliated with its universities.
Now more than ever, our DATE community must promote our societal values (justice, freedom, respect, community, and responsibility) and confront this situation collectively and peacefully to end this senseless war.
DATE Sponsors and Executive Committees.
Kindly note that all times on the virtual conference platform are displayed in the user's time zone.
The time zone for all times mentioned on the DATE website is CET – Central European Time (UTC+1).
O.1 Opening
Add this session to my calendar
Date: Monday, 14 March 2022
Time: 08:30 CET - 09:15 CET
Session chair:
Cristiana Bolchini, Politecnico di Milano, IT
Session co-chair:
Ingrid Verbauwhede, KU Leuven, BE
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
08:30 CET | O.1.1 | OPENING Speakers: Cristiana Bolchini1 and Ingrid Verbauwhede2 1Politecnico di Milano, IT; 2KU Leuven - COSIC, BE Abstract DATE 2022 opening |
09:00 CET | O.1.2 | AWARDS Speakers: Donatella Sciuto1, David Atienza2 and Yervant Zorian3 1Politecnico di Milano, IT; 2École Polytechnique Fédérale de Lausanne (EPFL), CH; 3Synopsys, US Abstract DATE 2022 awards presentation |
K.1 Opening keynote #1: "What is beyond AI? Societal opportunities and electronic design automation"
Add this session to my calendar
Date: Monday, 14 March 2022
Time: 09:20 CET - 10:10 CET
Session chair:
Cristiana Bolchini, Politecnico di Milano, IT
The success of hardware in enabling AI acceleration and broadening its scope has been nothing short of remarkable. How do we use the power of hardware design and electronic design automation to instead make the world a better place? EDA will be the cornerstone of innovative solutions in ensuring data privacy, sustainable computing and taming the data flood.
Speaker's bio: Valeria Bertacco is Thurnau Professor of Computer Science and Engineering at the University of Michigan, and Adjunct Professor of Computer Engineering at the Addis Ababa Institute of Technology. Her research interests are in the area of computer design, with emphasis on specialized architecture solutions and design viability, in particular reliability, validation, and hardware-security assurance. Her research endeavors are supported by the Applications Driving Architectures (ADA) Research Center, which Valeria directs. The ADA Center, sponsored by a consortium of semiconductor companies, has the goal of reigniting computing-systems design and innovation for the 2030s and 2040s, through specialized heterogeneity, domain-specific language abstractions, and new silicon devices that show benefit to applications. Valeria joined the University of Michigan in 2003. She currently serves as the Vice Provost for Engaged Learning at the University of Michigan, supporting all co-curricular engagements and international partnerships for the institution, and facilitating the work of several central units, whose goals range from promoting environmental sustainability, to promoting the arts in research universities, to increasing the participation of gender minorities in the academy.
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
09:20 CET | K.1.1 | WHAT IS BEYOND AI? SOCIETAL OPPORTUNITIES AND ELECTRONIC DESIGN AUTOMATION Speaker and Author: Valeria Bertacco, University of Michigan, US Abstract The success of hardware in enabling AI acceleration and broadening its scope has been nothing short of remarkable. How do we use the power of hardware design and electronic design automation to instead make the world a better place? EDA will be the cornerstone of innovative solutions in ensuring data privacy, sustainable computing and taming the data flood. |
10:00 CET | K.1.2 | Q&A SESSION Author: Cristiana Bolchini, Politecnico di Milano, IT Abstract Questions and answers with the speaker |
K.2 Opening keynote #2: "Cryo-CMOS Quantum Control: from a Wild Idea to Working Silicon"
Add this session to my calendar
Date: Monday, 14 March 2022
Time: 10:10 CET - 11:00 CET
Session chair:
Giovanni De Micheli, EPFL, CH
The core of a quantum processor is generally an array of qubits that need to be controlled and read out by a classical processor. This processor operates on the qubits with nanosecond latency, several million times per second, with tight constraints on noise and power, because the extremely weak signals involved in the process require highly sensitive circuits and systems along with very precise timing capability. We advocate the use of CMOS technologies to achieve these goals, with the circuits operated at deep-cryogenic temperatures. We believe that these circuits, collectively known as cryo-CMOS control, will make future qubit arrays scalable, enabling a faster growth in qubit count. In the lecture, the challenges of designing and operating complex circuits and systems at 4 K and below will be outlined, along with preliminary results achieved in the control and read-out of qubits by ad hoc integrated circuits.
Speaker's bio: Edoardo Charbon (SM’00, F’17) received the Diploma from ETH Zurich, the M.S. from the University of California at San Diego, and the Ph.D. from the University of California at Berkeley in 1988, 1991, and 1995, respectively, all in electrical engineering and EECS. He has consulted with numerous organizations, including Bosch, X-Fab, Texas Instruments, Maxim, Sony, Agilent, and the Carlyle Group. He was with Cadence Design Systems from 1995 to 2000, where he was the architect of the company's initiative on information hiding for intellectual property protection. In 2000, he joined Canesta Inc. as Chief Architect, where he led the development of wireless 3-D CMOS image sensors. Since 2002 he has been a member of the faculty of EPFL. From 2008 to 2016 he was with Delft University of Technology as full professor and Chair of VLSI design. He has been the driving force behind the creation of deep-submicron CMOS SPAD technology, which has been mass-produced since 2015 and is present in telemeters, proximity sensors, and medical diagnostics tools. His interests span from 3-D vision, LiDAR, FLIM, FCS, and NIROT to super-resolution microscopy, time-resolved Raman spectroscopy, and cryo-CMOS circuits and systems for quantum computing. He has authored or co-authored over 400 papers and two books, and he holds 23 patents. Dr. Charbon is a distinguished visiting scholar of the W. M. Keck Institute for Space at Caltech, a fellow of the Kavli Institute of Nanoscience Delft, a distinguished lecturer of the IEEE Photonics Society, and a fellow of the IEEE.
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
10:10 CET | K.2.1 | CRYO-CMOS QUANTUM CONTROL: FROM A WILD IDEA TO WORKING SILICON Speaker and Author: Edoardo Charbon, École Polytechnique Fédérale de Lausanne (EPFL), CH Abstract The core of a quantum processor is generally an array of qubits that need to be controlled and read out by a classical processor. This processor operates on the qubits with nanosecond latency, several million times per second, with tight constraints on noise and power, because the extremely weak signals involved in the process require highly sensitive circuits and systems along with very precise timing capability. We advocate the use of CMOS technologies to achieve these goals, with the circuits operated at deep-cryogenic temperatures. We believe that these circuits, collectively known as cryo-CMOS control, will make future qubit arrays scalable, enabling a faster growth in qubit count. In the lecture, the challenges of designing and operating complex circuits and systems at 4 K and below will be outlined, along with preliminary results achieved in the control and read-out of qubits by ad hoc integrated circuits. |
10:50 CET | K.2.2 | Q&A SESSION Author: Giovanni De Micheli, École Polytechnique Fédérale de Lausanne (EPFL), CH Abstract Questions and answers with the speaker |
1.1 Scalable quantum stacks: current status and future prospects
Add this session to my calendar
Date: Monday, 14 March 2022
Time: 11:00 CET - 12:30 CET
Session chair:
Fabio Sebastiano, TU Delft, NL
Session co-chair:
Giovanni De Micheli, EPFL, CH
In this session we explore quantum computing from the quantum algorithm to the qubit, going through the compilation process. In this context, we look at similarities with conventional computing in the overall quantum stack architecture and differences in the control of qubit processors. From these and other perspectives, the session will offer a view into the future of quantum computers.
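As background for the compilation theme of this session (in particular talk 1.1.2, which lets designers compile classical logic such as a plain Python function into quantum circuits), the following minimal, library-agnostic sketch shows the mapping such compilers automate: a Boolean function f on n bits becomes the reversible oracle U_f with U_f|x, y⟩ = |x, y ⊕ f(x)⟩, i.e., a permutation of the computational basis states. The code is illustrative only; the names and structure are not tweedledum's actual API, and a real compiler emits a gate-level circuit rather than an explicit permutation.

```python
# Minimal, library-agnostic sketch: turning a classical Boolean function into
# the permutation realised by its reversible/quantum oracle U_f,
#   U_f |x, y> = |x, y XOR f(x)>.
# This is the standard construction that quantum compilers automate; names and
# structure here are illustrative only, not any library's API.

def majority(bits):
    """Example classical logic: 3-input majority, written as ordinary Python."""
    a, b, c = bits
    return (a & b) | (a & c) | (b & c)

def oracle_permutation(f, n):
    """Return U_f as a permutation of the 2**(n+1) computational basis states.

    The basis-state index encodes (x, y): the top n bits are x, the last bit is y.
    """
    perm = []
    for index in range(2 ** (n + 1)):
        x = [(index >> (n - i)) & 1 for i in range(n)]  # input bits, MSB first
        y = index & 1                                   # target/ancilla bit
        y_out = y ^ f(x)                                # y XOR f(x)
        perm.append((index >> 1) * 2 + y_out)           # same x, new y
    return perm

if __name__ == "__main__":
    perm = oracle_permutation(majority, n=3)
    # A reversible oracle must be a bijection on basis states.
    assert sorted(perm) == list(range(16))
    print(perm)
```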
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
11:00 CET | 1.1.1 | FULL-STACK QUANTUM COMPUTING SYSTEMS IN THE NISQ ERA: ALGORITHM-DRIVEN AND HARDWARE-AWARE COMPILATION TECHNIQUES Speaker: Carmen G. Almudever, TU Valencia, ES Authors: Medina Bandic1, Sebastian Feld1 and Carmen G. Almudever2 1Delft University of Technology, NL; 2TU Valencia, ES Abstract The progress in developing quantum hardware, with functional quantum processors integrating tens of noisy qubits, together with the availability of near-term quantum algorithms, has led to the release of the first quantum computers. These quantum computing systems already integrate different software and hardware components of the so-called "full-stack", bridging quantum applications to quantum devices. In this paper, we will provide an overview of current full-stack quantum computing systems. We will emphasize the need for tight co-design among adjacent layers as well as vertical cross-layer design to extract the most from noisy intermediate-scale quantum (NISQ) processors, which are both error-prone and severely constrained in resources. As an example of co-design, we will focus on the development of hardware-aware and algorithm-driven compilation techniques. |
11:30 CET | 1.1.2 | TWEEDLEDUM: A COMPILER COMPANION FOR QUANTUM COMPUTING Speaker: Bruno Schmitt, EPFL, CH Authors: Bruno Schmitt and Giovanni De Micheli, École Polytechnique Fédérale de Lausanne (EPFL), CH Abstract This work presents tweedledum—an extensible open-source library aiming at narrowing the gap between high-level algorithms and physical devices by enhancing the expressive power of existing frameworks. For example, it allows designers to insert classical logic (defined at a high abstraction level, e.g., a Python function) directly into quantum circuits. We describe its design principles, concrete implementation, and, in particular, the library's core: an intuitive and flexible intermediate representation (IR) that supports different abstraction levels across the same circuit structure. |
12:00 CET | 1.1.3 | A CRYO-CMOS TRANSMON QUBIT CONTROLLER AND VERIFICATION WITH FPGA EMULATION Speaker: Kevin Tien, IBM Research, US Authors: Kevin Tien1, Ken Inoue1, Scott Lekuch1, David Frank1, Sudipto Chakraborty1, Pat Rosno2, Thomas Fox1, Mark Yeck1, Joseph Glick1, Raphael Robertazzi1, Ray Richetta2, John Bulzacchelli1, Daniel Ramirez2, Dereje Yilma2, Andy Davies2, Rajiv Joshi1, Devin Underwood1, Dorothy Wisnieff1, Chris Baks1, Donald Bethune3, John Timmerwilke1, Blake Johnson1, Brian Gaucher1 and Daniel Friedman1 1IBM T.J. Watson Research Center, US; 2IBM Systems, US; 3IBM Almaden Research Center, US Abstract Future generations of quantum computers are expected to operate in a paradigm where multi-qubit devices will predominantly perform circuits to support quantum error correction. Highly integrated cryogenic electronics are a key enabling technology to support the control of the large numbers of physical qubits that will be required in this fault-tolerant, error-corrected regime. Here, we describe our perspectives on cryoelectronics-driven qubit control architectures, and will then describe an implementation of a scalable, low-power, cryogenic qubit state controller that includes a domain-specific processor and an SSB upconversion I/Q-mixer-based RF AWG. We will also describe an FPGA-based emulation platform that is able to closely reproduce the system intention, and which was used to verify different aspects of the ASIC system design in in situ transmon qubit control experiments. |
K.3 Lunch Keynote: "Batteries: powering up the next generations"
Add this session to my calendar
Date: Monday, 14 March 2022
Time: 13:10 CET - 14:00 CET
Session chair:
Marco Casale-Rossi, Synopsys, IT
Session co-chair:
Enrico Macii, Politecnico di Torino, IT
The quest for energy, ideally from renewable sources, is rapidly increasing, driven by new digital technologies that take up more and more space in our lives and by electric vehicles that are expected to replace combustion ones. However, today’s battery technology is lagging behind adjacent technological advances: most devices use lithium-ion batteries, which bring with them a number of concerns, not least their availability in Europe. To create a European energy platform for the future, bringing together renewable energy sources, electric transportation and a connected Internet of Things, a new solution for battery technology needs to be found. This keynote will explore how current challenges can be overcome through the application of advances in new materials, what Europe is doing in the field of batteries, the need for skilled people, and how the future of battery technology can contribute to building a better, greener and more connected world.
Speaker's bio: Silvia Bodoardo is professor at Politecnico di Torino, where she is responsible for the task force on batteries and leads the Electrochemistry Group@Polito. Her research activity is mainly focused on the study of materials for Li-ion and post-Li-ion batteries; the research also deals with cell production and battery testing. She is participating in several EU-funded projects (coordinator of the STABLE project), as well as national and regional ones. She is leader of WP3 on Education in the Battery2030+ initiative and co-chair of WG3 of Batteries Europe. Silvia has organized many conferences and workshops on materials with electrochemical applications and was chairwoman at the launch of the Horizon Prize on Innovative Batteries.
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
13:10 CET | K.3.1 | BATTERIES: POWERING UP THE NEXT GENERATIONS Speaker and Author: Silvia Bodoardo, Politecnico di Torino, IT Abstract The quest for energy, ideally from renewable sources, is rapidly increasing, driven by new digital technologies that take up more and more space in our lives and by electric vehicles that are expected to replace combustion ones. However, today’s battery technology is lagging behind adjacent technological advances: most devices use lithium-ion batteries, which bring with them a number of concerns, not least their availability in Europe. To create a European energy platform for the future, bringing together renewable energy sources, electric transportation and a connected Internet of Things, a new solution for battery technology needs to be found. This keynote will explore how current challenges can be overcome through the application of advances in new materials, what Europe is doing in the field of batteries, the need for skilled people, and how the future of battery technology can contribute to building a better, greener and more connected world. |
13:50 CET | K.3.2 | Q&A SESSION Author: Marco Casale-Rossi, Synopsys, IT Abstract Questions and answers with the speaker |
2.1 Energy-autonomous systems for next generation of IoT
Add this session to my calendar
Date: Monday, 14 March 2022
Time: 14:30 CET - 16:00 CET
Session chair:
Marco Casale-Rossi, Synopsys, IT
Session co-chair:
Giovanni De Micheli, EPFL, CH
Energy-autonomous systems hold the promise of perpetual operation for low-power sensing systems and the next generation of the Internet of Things. The key enabling technologies towards this vision are energy-harvesting transducers and energy-efficient converters, including micro-power management, energy storage and ultra-low-power electronics. Harvesting the power required for operation from the surrounding environment exploits several physical effects and specific energy transducers (electromechanical, thermoelectric, photovoltaic, etc.). The limited and intermittent nature of the available power requires dedicated micro-power management circuits for proper interfacing with conventional electronic loads. However, the success of an application and of energy-autonomous systems rests on energy-aware and low-power design from the very beginning. This session will review the main technologies supporting energy-autonomous systems, and will focus on advances in micro-power management circuits and successful applications of energy-harvesting technologies to achieve the next generation of IoT based on perpetual, connected, intelligent devices.
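The session's emphasis on energy-aware design from the very beginning can be made concrete with a back-of-the-envelope energy-neutrality check: the average harvested power, after converter losses, must cover the duty-cycled load. The sketch below is a generic illustration with made-up placeholder numbers; it is not taken from any of the talks in this session.

```python
# Illustrative energy-neutrality check for a duty-cycled IoT node.
# All numbers are hypothetical placeholders, not taken from the talks.

P_HARVEST_AVG = 120e-6   # average harvested power [W], e.g. small indoor PV cell
P_ACTIVE      = 15e-3    # node power while sensing/transmitting [W]
P_SLEEP       = 3e-6     # deep-sleep power [W]
EFF_CONVERTER = 0.80     # end-to-end efficiency of the micro-power manager

def max_duty_cycle(p_harvest, p_active, p_sleep, eff):
    """Largest active duty cycle d such that the average load power
    d * p_active + (1 - d) * p_sleep does not exceed eff * p_harvest."""
    budget = eff * p_harvest
    if budget <= p_sleep:
        return 0.0                      # cannot even sustain sleep mode
    return min(1.0, (budget - p_sleep) / (p_active - p_sleep))

d = max_duty_cycle(P_HARVEST_AVG, P_ACTIVE, P_SLEEP, EFF_CONVERTER)
print(f"energy-neutral duty cycle = {d*100:.2f}%")
print(f"i.e. roughly {d*3600:.1f} s of activity per hour")
```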
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
14:30 CET | 2.1.1 | MICROPOWER MANAGEMENT TECHNIQUES FOR ENERGY HARVESTING APPLICATIONS Speaker and Author: Aldo Romani, University of Bologna, IT Abstract This talk will review the main technologies adopted for energy harvesting with different types of transducers and the associated power conversion techniques targeting the most efficient trade-offs between maximum power point tracking, efficiency and internal consumption. Some specific implementations will be reviewed. Finally, the emerging technology trends will be discussed along with application perspectives. |
15:00 CET | 2.1.2 | FULLY SELF-POWERED WIRELESS SENSORS ENABLED BY OPTIMIZED POWER MANAGEMENT MODULES Speaker and Author: Peter Spies, Fraunhofer IIS, DE Abstract The power supply of wireless sensors can be assisted or completely covered by energy harvesting technologies. Whether fully self-powered operation by energy harvesting is feasible depends strongly on the ambient conditions, the use-case requirements and the available board space for harvesting building blocks. Besides these conditions and requirements, the efficiency of the power-supply functional blocks and the system control can play a major role in achieving fully self-powered and unlimited operation time. The talk will introduce building blocks for energy harvesting power supplies to reach the goal of full autonomy. It will also discuss wireless technologies and system control strategies which are of paramount importance in self-powered wireless sensors. Different application examples will illustrate the introduced building blocks and technologies, with a focus on condition monitoring and predictive maintenance use cases. |
15:30 CET | 2.1.3 | DESIGN OF SELF-SUSTAINING CONNECTED SMART DEVICES Speaker and Author: Michele Magno, ETH Zürich, CH Abstract The Internet of Things is a revolutionary technology which aims to create an ecosystem of connected smart devices and smart sensors providing ubiquitous connectivity between trillions of devices. Recent advancements in the miniaturization of devices with higher computational capabilities and ultra-low-power technology have enabled the vast deployment of sensors, with significant changes in hardware design, software, network architecture, data analytics, data storage and power sources. However, the largest portion of IoT devices is still powered by batteries. This talk will focus on the viable solution of harvesting energy from the environment to provide enough energy to smart devices, achieving self-sustaining smart sensors by combining energy harvesting, low-power devices, and edge computing, including machine learning on low-power processors and even directly on MEMS sensors. |
3.1 Panel: Quantum Software Toolchain
Add this session to my calendar
Date: Monday, 14 March 2022
Time: 16:30 CET - 18:00 CET
Session chair:
Aida Todri Sanial, LIRMM, FR
Session co-chair:
Anne Matsuura, Intel, US
Panellists:
Xin-Chuan (Ryan) Wu, Intel, US
Ali Javadi-Abhari, IBM Research, US
Ross Duncan, Cambridge Quantum Computing / University of Strathclyde, GB
Carmen G. Almudever, TU Valencia, ES
Today’s quantum software toolchains are integral to system-level design of quantum computers. Compilers, system software, qubit simulators, and other software tools are being used to develop and execute quantum workloads and drive architectural research and design of both software and hardware. In this session, industry experts cover the latest software research and development for quantum computing systems.
4.1 Panel: Quantum Hardware
Add this session to my calendar
Date: Tuesday, 15 March 2022
Time: 09:00 CET - 10:30 CET
Session chair:
Anne Matsuura, Intel, US
Session co-chair:
Aida Todri Sanial, LIRMM, FR
Panellists:
Lieven Vandersypen, Delft University of Technology, NL
Lotte Geck, Forschungszentrum Jülich, DE
Steven Brebels, IMEC, BE
Heike Riel, IBM Research, CH
This session highlights recent advancements in qubits and qubit control. Industrial and academic experts present the latest hardware developments for quantum computing, from materials and qubit devices to qubit control systems.
5.1 Novel Design Techniques for Emerging Technologies in Computing
Add this session to my calendar
Date: Tuesday, 15 March 2022
Time: 11:00 CET - 12:30 CET
Session chair:
Scott Robertson Temple, University of Utah, US
This session is devoted to innovations in design techniques for emerging technologies in computing. The first paper proposes a new security locking scheme based on a hybrid CMOS/nanomagnet logic system. The second paper introduces automated methodologies for standard-cell design using reconfigurable transistors. The third paper reports advances in the design of complementary FET devices, which show promise for sub-5nm nodes. The fourth and last paper presents an industrial RTL-to-GDSII flow for the AQFP superconducting logic family, also discussing novel synthesis opportunities for this technology.
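For readers unfamiliar with logic locking, which the first paper (5.1.1) builds on, the toy sketch below recalls the classic XOR key-gate idea: the correct key restores the original function, while a wrong key corrupts the outputs. This is generic background under simplifying assumptions, not the hybrid CMOS/nanomagnet mechanism presented in the paper.

```python
# Toy illustration of XOR-based logic locking (classic key-gate idea):
# an XOR key gate is inserted on an internal wire, so the circuit only
# computes the intended function when the correct key bit is applied.
# Generic background only, not the CMOS/NML scheme presented in 5.1.1.

def original(a, b, c):
    """The IP owner's intended function."""
    return (a & b) ^ c

KEY = 1  # secret key bit chosen at design time

def locked(a, b, c, k):
    """Locked netlist: a key gate XOR is inserted on the (a & b) wire.
    The '^ KEY' term models the re-synthesis that absorbs the correct key,
    so applying k == KEY restores the original function."""
    w = (a & b) ^ k ^ KEY
    return w ^ c

if __name__ == "__main__":
    inputs = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    assert all(locked(a, b, c, KEY) == original(a, b, c) for a, b, c in inputs)
    wrong = sum(locked(a, b, c, 1 - KEY) != original(a, b, c) for a, b, c in inputs)
    print(f"wrong key corrupts {wrong} of {len(inputs)} input patterns")
```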
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
11:00 CET | 5.1.1 | PHYSICALLY & ALGORITHMICALLY SECURE LOGIC LOCKING WITH HYBRID CMOS/NANOMAGNET LOGIC CIRCUITS Speaker: Alexander J. Edwards, University of Texas at Dallas, US Authors: Alexander Edwards1, Naimul Hassan1, Dhritiman Bhattacharya2, Mustafa Shihab1, Peng Zhou1, Xuan Hu1, Jayasimha Atulasimha2, Yiorgos Makris1 and Joseph Friedman1 1University of Texas at Dallas, US; 2Virginia Commonwealth University, US Abstract The successful logic locking of integrated circuits requires that the system be secure against both algorithmic and physical attacks. In order to provide resilience against imaging techniques that can detect electrical behavior, we recently proposed an approach for physically and algorithmically secure logic locking with strain-protected nanomagnet logic (NML). While this NML system exhibits physical and algorithmic security, the fabrication imprecision, noise-related errors, and slow speed of NML incur a significant security overhead cost. In this paper, we therefore propose a hybrid CMOS/NML logic locking solution in which NML islands provide security within a system primarily composed of CMOS, thereby providing physical and algorithmic security with minimal overhead. In addition to describing this proposed system, we also develop a framework for device/system co-design techniques that consider trade-offs between efficiency and security. |
11:20 CET | 5.1.2 | EXPLORING STANDARD-CELL DESIGN FOR RECONFIGURABLE NANOTECHNOLOGIES: A FORMAL APPROACH Speaker: Michael Raitza, TU Dresden, DE Authors: Michael Raitza, Steffen Märcker, Shubham Rai and Akash Kumar, TU Dresden, DE Abstract Standard-cell design has always been a craft, and common field-effect transistors span only a narrow design space. This has changed with reconfigurable transistors. Boolean functions that exhibit multiple dual product terms in their sum-of-product form yield various beneficial circuit implementations with reconfigurable transistors. In this work, we present an approach to automatically generate these implementations through a formal modeling approach. Using the 3-input XOR function as an example, we discuss the variations and show how to quantify properties like worst-case delay and power dissipation, as well as averages of delay and energy consumption per operation over different scenarios. The quantification runs fully automated on charge-transport network models employing probabilistic model checking. This yields exact results instead of approximations obtained from experiments and sampling. Our results show several benefits of reconfigurable transistor circuits over static CMOS implementations. |
11:40 CET | 5.1.3 | DESIGN ENABLEMENT OF CFET DEVICES FOR SUB-2NM CMOS NODES Speaker: Odysseas Zografos, imec, BE Authors: Odysseas Zografos, Bilal Chehab, Pieter Schuddinck, Gioele Mirabeli, Naveen Kakarla, Yang Xiang, Pieter Weckx and Julien Ryckaert, imec, BE Abstract Novel devices that optimize their structure in a three-dimensional fashion and offer significant area gains by reducing standard cell track height are adopted to scale silicon technologies beyond the 5nm node. Such a device is the Complementary FET (CFET), which consists of an n-type channel stacked vertically over a p-type channel. In this paper we review the significant benefits of CFET devices as well as the challenges that arise with their use. More specifically, we focus on the standard cell design challenges as well as the physical implementation ones. We show that to fully exploit the area benefits of the CFET devices, one must carefully select the metal stack used for the physical implementation of a large design. |
12:00 CET | 5.1.4 | MAJORITY-BASED DESIGN FLOW FOR AQFP SUPERCONDUCTING FAMILY Speaker: Giulia Meuli, Synopsys, IT Authors: Giulia Meuli1, Vinicius Possani2, Rajinder Singh2, Siang-Yun Lee3, Alessandro Tempia Calvino4, Dewmini Marakkalage4, Patrick Vuillod5, Luca Amarù6, Scott Chase6, Jamil Kawa7 and Giovanni De Micheli8 1Synopsys, IT; 2Synopsys Inc., US; 3École Polytechnique Fédérale de Lausanne, CH; 4EPFL, CH; 5Synopsys Inc., FR; 6Synopsys Inc, US; 7Synopsys, Inc., US; 8École Polytechnique Fédérale de Lausanne (EPFL), CH Abstract Adiabatic superconducting devices are promising candidates to develop high-speed/low-power electronics. Advances in physical technology must be matched with a systematic development of comprehensive design and simulation tools to bring superconducting electronics to a commercially viable state. Being the technology fundamentally different from CMOS, new challenges are posed to design automation tools: library cells are controlled by multi-phase clocks, they implement the majority logic function, and they have limited fanout. We present a product-level RTL-to-GDSII flow for the design of Adiabatic Quantum-Flux-Parametron (AQFP) electronic circuits, with a focus on the special techniques used to comply with these challenges. In addition, we demonstrate new optimization opportunities for graph matching, resynthesis, and buffer/splitter insertion, improving the state-of-the-art. |
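Talk 5.1.4 above notes that AQFP library cells natively implement the majority function and have limited fanout. As a brief, generic reminder of why majority is a convenient synthesis primitive (textbook majority-logic background, not the flow presented in the talk), MAJ-3 subsumes AND and OR by tying one input to a constant:

```python
# Generic majority-logic background for AQFP-style libraries (not the 5.1.4 flow):
# the 3-input majority gate MAJ(a, b, c) subsumes AND and OR by tying one input
# to a constant, which is why majority-based synthesis suits such technologies.

def maj(a, b, c):
    """3-input majority: 1 iff at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)

def and2(a, b):
    return maj(a, b, 0)   # MAJ(a, b, 0) = a AND b

def or2(a, b):
    return maj(a, b, 1)   # MAJ(a, b, 1) = a OR b

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            assert and2(a, b) == (a & b)
            assert or2(a, b) == (a | b)
    print("MAJ reproduces AND/OR on all input patterns")
```

The limited fanout of the cells additionally forces buffer/splitter trees on high-fanout nets, which is one of the technology-specific steps the presented flow inserts and optimizes.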
K.4 Lunch Keynote: "AI in the edge; the edge of AI"
Add this session to my calendar
Date: Tuesday, 15 March 2022
Time: 13:10 CET - 14:00 CET
Session chair:
Gi-Joon Nam, IBM, US
Session co-chair:
Marian Verhelst, KU Leuven, BE
In the world of IoT, both humans and objects are continuously connected, collecting and communicating data, in a rising number of applications including Industry 4.0, biomedical, environmental monitoring, and smart houses and offices. Local computation in the edge has become a necessity to limit data traffic. Additionally, embedding AI processing in the edge adds potentially high levels of smart autonomy to these IoT 2.0 systems. Progress in nanoelectronic technology allows this to be done with power- and hardware-efficient architectures and designs. This keynote gives an overview of key solutions, but also describes the main limitations and risks, exploring the edge of edge AI.
Speaker's bio: Georges G.E. Gielen received the MSc and PhD degrees in Electrical Engineering from the Katholieke Universiteit Leuven (KU Leuven), Belgium, in 1986 and 1990, respectively. He currently is Full Professor in the MICAS research division at the Department of Electrical Engineering (ESAT) at KU Leuven. From August 2013 until July 2017 he was also appointed at KU Leuven as Vice-Rector for the Group of Sciences, Engineering and Technology, where he was also responsible for academic human resource management. He was a visiting professor at UC Berkeley and Stanford University. Since 2020 he has been Chair of the Department of Electrical Engineering. His research interests are in the design of analog and mixed-signal integrated circuits, and especially in analog and mixed-signal CAD tools and design automation. He is a frequently invited speaker/lecturer and coordinator/partner of several (industrial) research projects in this area, including several European projects. He has (co-)authored 10 books and more than 600 papers in edited books, international journals and conference proceedings. He is a 1997 Laureate of the Belgian Royal Academy of Sciences, Literature and Arts in the discipline of Engineering. He has been a Fellow of the IEEE since 2002, and received the IEEE CAS Mac Van Valkenburg award in 2015 and the IEEE CAS Charles Desoer award in 2020. He is an elected member of the Academia Europæa.
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
13:10 CET | K.4.1 | AI IN THE EDGE; THE EDGE OF AI Speaker and Author: Georges Gielen, KU Leuven, BE Abstract In the world of IoT, both humans and objects are continuously connected, collecting and communicating data, in a rising number of applications including Industry 4.0, biomedical, environmental monitoring, and smart houses and offices. Local computation in the edge has become a necessity to limit data traffic. Additionally, embedding AI processing in the edge adds potentially high levels of smart autonomy to these IoT 2.0 systems. Progress in nanoelectronic technology allows this to be done with power- and hardware-efficient architectures and designs. This keynote gives an overview of key solutions, but also describes the main limitations and risks, exploring the edge of edge AI. |
13:50 CET | K.4.2 | Q&A SESSION Author: Gi-Joon Nam, IBM Research, US Abstract Questions and answers with the speaker |
6.1 Alternative design paradigms for sustainable IoT nodes
Add this session to my calendar
Date: Tuesday, 15 March 2022
Time: 14:30 CET - 16:00 CET
Session chair:
David Atienza, EPFL, CH
Session co-chair:
Ayse Coskun, Boston University, US
While the potential influence of AI in the context of IoT on our daily life is enormous, there are significant challenges related to the ethics and interpretability of AI results, as well as the ecological implications of system design for deep-learning technologies. This special session investigates how the progress in AI technologies can be combined with alternative design paradigms for smart nodes so that the future of IoT can be nurtured and cultivated in a sustainable way for the benefit of society.
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
14:30 CET | 6.1.1 | BIO-INSPIRED ENERGY EFFICIENT ALL-SPIKING INTERNET OF THINGS NODES Speaker: Adrian M. Ionescu, EPFL, CH Author: Adrian Ionescu, EPFL, CH Abstract In this talk we will present bio-inspired innovations exploiting phase-change and ferroelectric materials and devices for all-spiking IoT nodes and Edge AI event detection applications. In particular, we will report new progress in (i) electromagnetic and optical spiking sensors based on vanadium dioxides, and (ii) ferroelectric neurons and synapses built with doped high-k dielectrics on 2D semiconducting materials. The future implications for improving the energy efficiency of IoT nodes will be discussed. |
15:00 CET | 6.1.2 | HYBRID DIGITAL-ANALOG SYSTEMS-ON-CHIP FOR EFFICIENT EDGE AI Speaker: Marian Verhelst, KU Leuven, BE Authors: Marian Verhelst1, Kodai Ueyoshi1, Giuseppe Sarda1, Pouya Houshmand1, Ioannis Papistas2, Vikram Jain1, Man Shi1, Peter Vrancx3, Debjyoti Bhattacharjee3, Stefan Cosemans2, Arindam Mallik3 and Peter Debacker3 1KU Leuven, BE; 2Imec and Axelera, BE; 3imec, BE Abstract Deep inference workloads at the edge are characterized by a wide variety of neural network layer topologies and characteristics. While large convolutional layers execute very efficiently on the dense compute-in-memory co-processors appearing in the literature, other layer types (grouped convolutions, layers with low channel count or high precision requirements) benefit from digital execution. This talk discusses a new breed of heterogeneous SoCs, integrating co-processors of different natures into a common processing system with tightly coupled shared memory, to be able to dispatch every layer to the most suitable accelerator. |
15:30 CET | 6.1.3 | 3D COMPUTE CUBES FOR EDGE INTELLIGENCE: NANOELECTRONIC-ENABLED ADAPTIVE SYSTEMS BASED ON JUNCTIONLESS, AMBIPOLAR, AND FERROELECTRIC VERTICAL FETS Speaker: Ian O'Connor, Lyon Institute of Nanotechnology, FR Authors: Ian O'Connor1, David Atienza2, Jens Trommer3, Oskar Baumgartner4, Guilhem Larrieu5 and Cristell Maneux6 1Lyon Institute of Nanotechnology, FR; 2École Polytechnique Fédérale de Lausanne (EPFL), CH; 3Namlab gGmbH, DE; 4Global TCAD Solutions, AT; 5LAAS – CNRS, FR; 6University of Bordeaux, FR Abstract New computing paradigms and technologies are required to respond to the challenges of data-intensive edge intelligence. We propose a triple combination of emerging technologies for the fine interweaving of versatile logic functionality and memory for reconfigurable in-memory computing: vertical junctionless gate-all-around nanowire transistors for ultimate downscaling; ambipolar functionality enhancement for fine-grain flexibility; ferroelectric oxides for non-volatile logic operation. Through a DTCO approach, this talk will describe the design of 3D compute cubes naturally suited to the hardware acceleration of computation-intensive kernels, as well as their integration into computing systems, introducing a system-wide exploration framework to assess their effectiveness. HW/SW optimization will also be described, with a focus on Transformer and Conformer networks and the matrix multiplication kernel, which dominates their run-time. |
7.1 Panel: Autonomous Systems Design as a Research Challenge
Add this session to my calendar
Date: Tuesday, 15 March 2022
Time: 16:30 CET - 18:00 CET
Session chair:
Selma Saidi, TU Dortmund, DE
Session co-chair:
Rolf Ernst, TU Braunschweig, DE
Panellists:
Karl-Erik Arzen, Lund University, SE
Peter Liggesmeyer, Fraunhofer Institute for Experimental Software Engineering IESE, DE
Axel Jantsch, TU Wien, AT
Autonomous systems require specific design methods that leave behavioral freedom and plan for the unexpected without losing trustworthiness and dependability. How does this requirement influence research at major research institutions? How is it reflected in public funding? Should autonomous systems design become a new discipline or should the regular design process be adapted to handle autonomy? The panel will begin with position statements by the panelists, followed by an open discussion with the hybrid audience.
8.1 Young People Program: Career Fair
Add this session to my calendar
Date: Wednesday, 16 March 2022
Time: 16:00 CET - 17:00 CET
Session chair:
Anton Klotz, Cadence, DE
Session co-chair:
Xavier Salazar, Barcelona Supercomputing Center & HiPEAC, ES
The Career Fair aims at bringing together Ph.D. students and potential job seekers with recruiters from EDA and microelectronics companies. In this slot, sponsoring companies present themselves to job seekers and to the DATE community.
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
16:00 CET | 8.1.1 | INTRODUCTION TO THE CAREER FAIR Speaker and Author: Anton Klotz, Cadence Design Systems, DE Abstract Introduction to Career Fair. How to apply for listed positions |
16:10 CET | 8.1.2 | CADENCE DESIGN SYSTEMS Speaker and Author: Anton Klotz, Cadence Design Systems, DE Abstract Introducing Cadence Design Systems as employer for young talents |
16:17 CET | 8.1.3 | IMMS Speaker and Author: Eric Schaefer, IMMS, DE Abstract Introducing IMMS as employer for young talents |
16:23 CET | 8.1.4 | SIEMENS EDA Speaker and Author: Janani Muruganandam, Siemens, NL Abstract Introducing Siemens EDA as employer for young talents |
16:30 CET | 8.1.5 | SYNOPSYS Speaker and Author: Markus Wedler, Synopsys, DE Abstract Introducing Synopsys as employer for young talents |
16:37 CET | 8.1.6 | ANSYS Speaker and Author: Helene Tabourier, Ansys, DE Abstract Introducing Ansys as employer for young talents |
16:43 CET | 8.1.7 | INTEL Speaker and Author: Pablo Herrero, INTEL, DE Abstract Introducing Intel as employer for young talents |
16:50 CET | 8.1.8 | BOSCH Speaker and Author: Atefe Dalirsani, BOSCH, DE Abstract Introducing Bosch as employer for young talents |
9.1 Young People Program: Sponsorship Fair
Add this session to my calendar
Date: Wednesday, 16 March 2022
Time: 17:00 CET - 18:30 CET
Session chair:
Sara Vinco, Politecnico di Torino, IT
Session co-chair:
Anton Klotz, Cadence, DE
The Sponsorship Fair aims at bringing together university student teams involved in international competitions and personnel from EDA and microelectronics companies. In this slot, student teams present their activities, success stories and challenges to the DATE audience and to sponsoring companies, in order to build new collaborations.
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
17:00 CET | 9.1.1 | DUTCH NAO TEAM Speaker and Author: Thomas Wiggers, University of Amsterdam, NL Abstract Dutch Nao Team is a team of bachelor and master students from the University of Amsterdam that program robots to play football autonomously. Dutch Nao Team competes in the RoboCup SPL League and competitions around the world. |
17:10 CET | 9.1.2 | SQUADRA CORSE POLITO Speaker and Author: Enrico Salvatore, Politecnico di Torino, IT Abstract Squadra Corse PoliTO is the Formula SAE team of the Politecnico di Torino. The team is entirely run by students of the Politecnico di Torino who design, manufacture, test, and compete with formula style race cars in the Formula Student competitions. The team qualified for all the major Formula SAE student competitions of the 2021-2022 season. |
17:20 CET | 9.1.3 | DYNAMIS PRC Speaker and Author: Ishac Oursana, Politecnico di Milano, IT Abstract Dynamis PRC is the Formula Student team of Politecnico di Milano. Originally working on combustion-engine prototypes, Dynamis PRC now also works on electric prototypes and autonomous driving. Dynamis PRC classified 1st in Overall FSN and Overall FSATA in 2019. |
17:30 CET | 9.1.4 | HYPED Speaker and Author: Marina Antonogiannaki, University of Edinburgh, GB Abstract HYPED is the Edinburgh University Hyperloop Team. HYPED co-organises the European Hyperloop Week to promote the development of Hyperloop and connect students with the industry. HYPED has been among the finalists of the SpaceX Hyperloop Pod Competition from 2017 to 2019 and won the Virgin Hyperloop One Global Challenge. |
17:40 CET | 9.1.5 | ONELOOP AT UC DAVIS Speaker and Author: Zbynka Kekula, UC Davis, US Abstract OneLoop is a student-run organization at UC Davis working on developing a Hyperloop pod. Since the first SpaceX competition in 2017, the team has continued to excel in competitions and in furthering Hyperloop research. |
17:50 CET | 9.1.6 | NEUROTECH LEUVEN Speaker and Author: Jonah Van Assche, KU Leuven, BE Abstract NeuroTech Leuven is a team of students from KU Leuven, Belgium, who are interested in all things "neuro", ranging from neuroscience to neurotechnology. The NeuroTech Leuven team takes part in the NeuroTechX competition. |
18:00 CET | 9.1.7 | Q&A SESSION Authors: Sara Vinco1 and Anton Klotz2 1Politecnico di Torino, IT; 2Cadence Design Systems, DE Abstract This poster session allows a closer interaction of student teams with EDA and microelectronic companies, to allow discussion of sponsorship opportunities, e.g., in terms of monetary sponsorships, licenses, and tutorials. |
10.1 PhD Forum
Add this session to my calendar
Date: Wednesday, 16 March 2022
Time: 18:30 CET - 20:30 CET
Session chair:
Gabriela Nicolescu, École Polytechnique de Montréal, CA
Session co-chair:
Mahdi Nikdast, Colorado State University, US
The PhD Forum is an online poster session hosted by EDAA, ACM-SIGDA, and IEEE CEDA for PhD students who have completed their PhD thesis within the last 12 months or who are close to completing their thesis work. It represents an excellent opportunity for them to get feedback on their research and for industry to get a glance at the state of the art in system design and design automation.
Time | Label | Presentation Title, Authors and Abstract |
---|---|---|
18:30 CET | 10.1.1 | NOVEL ATTACK AND DEFENSE STRATEGIES FOR ENHANCED LOGIC LOCKING SECURITY Speaker: Lilas Alrahis, New York University Abu Dhabi, AE Authors: Lilas Alrahis1 and Hani Saleh2 1New York University Abu Dhabi, AE; 2Khalifa University, AE Abstract The globalized and, thus, distributed semiconductor supply chain creates an attack vector for untrusted entities to steal the intellectual property (IP) of a design. To ward off the threat of IP theft, researchers have developed various countermeasures such as state-space obfuscation, split manufacturing, and logic locking (LL). LL is a holistic design-for-trust technique that aims to protect the design IP from untrustworthy entities throughout the IC supply chain by locking the functionality of the design. State-of-the-art LL solutions such as provably secure logic locking (PSLL) and scan locking/obfuscation aim to offer protection against immediate attacks such as the Boolean satisfiability (SAT)-based attack. However, these implementations mostly focus on thwarting the SAT-based attack, leaving them vulnerable to other unexplored threats. The underlying research objective of this Ph.D. work is enhancing the security of LL by exposing and then addressing its security vulnerabilities. |
18:30 CET | 10.1.2 | PROPER ABSTRACTIONS FOR DIGITAL ELECTRONIC CIRCUITS: A PHYSICALLY GUIDED APPROACH Speaker: Jurgen Maier, TU Wien, AT Author: Jürgen Maier, TU Wien, AT Abstract In this thesis I show that developing abstractions, which are able to describe the behavior of digital electronic circuits in a simple yet accurate fashion, can be efficiently guided by identifying the underlying physical processes. Based on transistor-level analysis, analog SPICE simulations and even formal proofs, I provide approximations of the analog signal trajectories inside a circuit and of the signal propagation delay in the digital domain. In addition, I introduce methods for an efficient characterization of the Schmitt trigger, including its metastable and dynamic behavior. Overall, the developed abstractions are highly faithful in the sense that only physically reasonable behavior can be modeled, and vice versa. This leads to more powerful, accurate and trustworthy results, which allows one to identify problematic spots in a circuit with higher confidence in less time. Nevertheless, no "silver bullet" w.r.t. modeling abstractions could be found, meaning that each abstraction requires careful analysis of the physical behavior to achieve the optimal performance, accuracy and coverage. |
18:30 CET | 10.1.3 | RETRAINING-FREE WEIGHT-SHARING FOR CNN COMPRESSION Speaker: Etienne Dupuis, Lyon Institute of Nanotechnology, FR Authors: Etienne Dupuis1, David Novo2, Alberto Bosio3 and Ian O'Connor3 1Institut des Nanotechnologies de Lyon, FR; 2CNRS, LIRMM, University of Montpellier, FR; 3Lyon Institute of Nanotechnology, FR Abstract The Weight-Sharing (WS) technique gives promising results in compressing Convolutional Neural Networks (CNNs), but it requires careful determination of the shared values for each layer of a given CNN. The WS Design Space Exploration (DSE) time can easily explode for state-of-the-art CNNs. We propose a new heuristic approach to drastically reduce the exploration time without sacrificing the quality of the output. The results, obtained on recent CNNs (GoogleNet, ResNet50V2, MobileNetV2, InceptionV3, and EfficientNet) trained with the ImageNet dataset, show over 5× memory compression at an acceptable accuracy loss (complying with the MLPerf quality target) without any retraining step. (A generic illustration of the weight-sharing idea follows this session's table.) Index Terms—Convolutional Neural Network, Deep Learning, Computer Vision, Hardware Accelerator, Design Space Exploration, Approximate Computing, Weight-Sharing |
18:30 CET | 10.1.4 | INTELLIGENT CIRCUIT DESIGN AND IMPLEMENTATION WITH MACHINE LEARNING IN EDA Speaker and Author: Zhiyao Xie, Duke University, US Abstract EDA (Electronic Design Automation) technology has achieved remarkable progress over the past decades, from attaining merely functionally correct designs to handling multi-million-gate circuits. However, chip design is not completely automatic yet in general and the gap is not easily surmountable. For example, automation of EDA flow is still largely restricted to individual point tools with little interplay across different tools and design steps. Tools in early steps cannot well judge if their solutions may eventually lead to satisfactory designs, and the consequence of a poor solution cannot be found until very late. A major weakness of these traditional EDA technologies is the insufficient prior design knowledge reuse. Conventional optimization techniques construct solutions from scratch even if similar optimizations have already been performed, perhaps even repeatedly. Predictive models are either inaccurate or dependent on trial designs, which are very time- and resource-consuming. These limitations point to a major strength of machine learning (ML) – the capability to explore highly complex correlations between two design stages based on prior data. During my Ph.D. study, I construct multiple fast yet accurate models for various design objectives in EDA with customized ML algorithms. |
18:30 CET | 10.1.5 | CROSS-LAYER TECHNIQUES FOR ENERGY-EFFICIENCY AND RESILIENCY OF ADVANCED MACHINE LEARNING ARCHITECTURES Speaker: Alberto Marchisio, TU Wien, AT Authors: Alberto Marchisio1 and Muhammad Shafique2 1TU Wien (TU Wien), AT; 2New York University Abu Dhabi, AE Abstract Machine Learning (ML) algorithms have shown a high level of accuracy in several tasks; therefore, ML-based applications are widely used in many systems and platforms. However, the development of efficient ML-based systems requires addressing two key research problems: energy efficiency and security. Current trends show the growing interest in the community for complex ML models, such as Deep Neural Networks (DNNs), Capsule Networks (CapsNets), and Spiking Neural Networks (SNNs). Besides their high learning capabilities, their complexity poses several challenges to addressing the above-discussed research problems. In this work, we explore cross-layer concepts engaging both hardware- and software-level techniques to build resilient and energy-efficient architectures for these networks. |
18:30 CET | 10.1.6 | DESIGN & ANALYSIS OF AN ON-CHIP PROCESSOR FOR THE AUTISM SPECTRUM DISORDER (ASD) CHILDREN ASSISTANCE USING THEIR EMOTIONS Speaker: Abdul Rehman Aslam, Lahore University of Management Sciences, PK Authors: Abdul Rehman Aslam and Muhammad Awais Bin Altaf, Lahore University of Management Sciences, Pakistan, PK Abstract Autism Spectrum Disorder (ASD) is a neurological disorder that affects the cognitive and emotional abilities of children. The number of ASD patients has increased drastically in the past decade. The World Health Organization estimates that around 1 out of every 160 children is an ASD patient in the United States. The actual number of patients may be substantially higher, as many patients are not reported due to the stigma associated with the ASD diagnosis methods. The ASD statistics can be more severe in underdeveloped and third-world countries that lack basic health facilities for a major part of the population. The conventional Autism Diagnostic Observation Schedule (ADOS-2) diagnosis methods require extensive behavioral evaluations and frequent visits of the children to neurologists. These extensive evaluations lead to late diagnosis and hence late treatment. The chronic ailment of the central nervous system in ASD causes the degradation of emotional and cognitive abilities. ASD patients suffer from attention deficit hyperactivity disorder, memory issues, inability to take decisions, emotional issues, and lack of self-control. This lack of self-control is most pronounced in their emotions: they have highly imbalanced emotions and face negative emotional outbursts (NEOBs). NEOBs are impulses of negative emotions causing self-injuries and suicide attempts leading to death. Long-term continuous monitoring with neurofeedback of human emotions is therefore crucial for ASD patients. The timely prediction of NEOBs is crucial in mitigating their harmful effects, and emotion prediction can be used to regulate the emotions by controlling these NEOBs. This need can be addressed by an electroencephalography (EEG)-based, non-invasive, real-time and continuous emotion-prediction system on chip (SoC) embedded inside a headband. This work targets the design and analysis of the digital back-end (DBE) processor for a fully integrated wearable emotion-prediction SoC. The SoC involves an analog front-end (AFE) for EEG data acquisition and a DBE processor for emotion prediction. The miniaturized low-power processor can be embedded in a headband (patch sensor) for the timely prediction of NEOBs. An SoC that predicts NEOBs and records their pattern was designed and implemented in a 0.18µm 1P6M CMOS process. The dual-channel deep neural network (DNN)-based emotion-classification processor utilizes only two EEG channels for emotion classification. The lowest number of channels minimizes the patient's discomfort while wearing the headband SoC. The DBE classification processor utilizes only two features per channel to minimize area and power and overcome overfitting problems. The proposed approximated skewness indicator feature was implemented using an 86X lower area (gate count) after tuning the conventional mathematical formula for skewness. The DNN classifier was implemented in a semi-pipelined manner after instruction rescheduling and a customized arithmetic and logic unit implementation with a 34X lower area (gate count). The sigmoid activation function was implemented with 50% lower memory resources due to the symmetry between positive and negative sigmoid values (σ(−x) = 1 − σ(x), so only one half of the value range needs to be stored). An overall area efficiency of 71% was achieved for the DNN classification unit. The 16 mm2 SoC is implemented in a 0.18µm 1P6M CMOS process and consumes 10.13 µJ/classification for 2-channel operation while achieving an average accuracy of >85% on multiple emotion databases and in real-time testing. The DBE processor for the wearable non-invasive emotion-classification system was fabricated using a 0.18µm CMOS process. The processor has an overall energy efficiency of 10.13 µJ per classification. This is the world's first SoC for emotion prediction targeting ASD patients with minimal hardware resources. The SoC can also be used for ASD prediction with an excellent classification accuracy of 95%. |
18:30 CET | 10.1.7 | RESILIENCE AND ENERGY-EFFICIENCY FOR DEEP LEARNING AND SPIKING NEURAL NETWORKS FOR EMBEDDED SYSTEMS Speaker: Rachmad Vidya Wicaksana Putra, TU Wien, AT Authors: Rachmad Vidya Wicaksana Putra1 and Muhammad Shafique2 1TU Wien, AT; 2New York University Abu Dhabi, AE Abstract Neural networks (NNs) have become prominent machine learning (ML) algorithms because they achieve state-of-the-art accuracy for various data analytic applications, such as object recognition, healthcare, and autonomous driving. However, deploying the advanced NN algorithms, such as deep neural networks (DNNs) and spiking neural networks (SNNs), to the resource-constrained embedded systems is challenging because of their memory- and compute-intensive nature. Moreover, the existing SNN-based systems still cannot adapt to dynamic operating environments that make the offline-learned knowledge obsolete, and suffer from the negative impact of hardware-induced faults, thereby degrading the accuracy. Therefore, in this PhD work, we explore cross-layer hardware (HW)- and software (SW)-level techniques for building resilient and energy-efficient NN-based systems to enable their deployment for embedded applications in a reliable manner under diverse operating conditions. |
18:30 CET | 10.1.8 | MODELING AND OPTIMIZATION OF EMERGING AI ACCELERATORS UNDER RANDOM UNCERTAINTIES Speaker and Author: Sanmitra Banerjee, Duke University, US Abstract Artificial intelligence (AI) accelerators based on carbon nanotube FETs (CNFETs) and silicon-photonic neural networks (SPNNs) enable ultra-low-energy and ultra-high-speed matrix multiplication. However, these emerging technologies are susceptible to inevitable fabrication-process variations and manufacturing defects. My Ph.D. dissertation focuses on the development of a comprehensive modeling framework to analyze such uncertainties and their impact on emerging AI accelerators. We show that the nature of uncertainties in CNFETs and SPNNs differs from that in Si CMOS circuits and, as such, the application and effectiveness of conventional EDA and test approaches are significantly restricted when applied to such emerging technologies. To address this, we also propose several novel technology-aware design optimization and test generation methods to facilitate yield ramp-up of next-generation AI accelerators. |
18:30 CET | 10.1.9 | LOGIC SYNTHESIS IN THE MACHINE LEARNING ERA: IMPROVING CORRELATION AND HEURISTICS Speaker: Walter Lau Neto, University of Utah, US Authors: Walter Lau Neto and Pierre-Emmanuel Gaillardon, University of Utah, US Abstract This extended abstract proposes to explore current advances in Machine Learning (ML) techniques to enhance both abstraction and heuristics in logic synthesis. We start by proposing a Convolutional Neural Network (CNN) model to predict, early in the flow, post-Place & Route (PnR) critical paths, and a method to use this information and optimize these paths, achieving a 15.3% improvement in ADP and an 18.5% improvement in EDP. We also present a CNN model to be used during technology mapping that features a novel cut-pruning policy, improving the mapping delay by an average of 10% when compared to the ABC tool, the state-of-the-art open-source technology mapper, at a cost of 2% area. Our model for technology mapping replaces a core heuristic, which to the best of our knowledge is a novel contribution. Most previous work on ML in EDA uses ML to forecast metrics and tune the flow, not embedded as a core heuristic. |
18:30 CET | 10.1.10 | ACCELERATING CNN INFERENCE NEAR TO THE MEMORY BY EXPLOITING PARALLELISM, SPARSITY, AND REDUNDANCY Speaker: Palash Das, Indian Institute of Technology, Guwahati, IN Authors: Palash Das and Hemangee Kapoor, Indian Institute of Technology, Guwahati, IN Abstract Convolutional Neural Networks (CNNs) have become a promising tool for deep learning, specifically in the domain of computer vision. Deep CNNs have widespread use in real-life applications like image classification, object detection, and image segmentation. The inference phase of CNNs is often used in real time for faster prediction and classification and hence demands high performance and energy efficiency from the system. Towards designing such systems, we implement multiple strategies that make real-time inference exceptionally faster in exchange for minimal area/power overhead. We implement multiple custom accelerators with various capabilities and integrate them close to the main memory to reduce the memory access latency/energy using the near-memory processing (NMP) concept. In our first contribution, we design custom hardware, the convolutional logic unit (CLU), and integrate it close to a 3D memory, specifically the hybrid memory cube (HMC). We propose a dataflow that helps in parallelizing the CNN tasks for their concurrent execution. In the second contribution, we propose an architecture that leverages the benefits of NMP using HMC, exploiting parallelism and data sparsity. In the third contribution, apart from NMP and parallelism, the proposed hardware can also remove the redundant multiplications of inference by a lookaside memory (LAM)-based search technique. This makes the inference substantially faster because of the reduced number of costly multiplication operations. Lastly, we investigate the efficacy of NMP with conventional DRAM while accelerating the inference. While implementing NMP in DRAM, we also explore the design space with our designed hardware modules based on parameters like performance, power consumption, and area overhead. |
18:30 CET | 10.1.11 | DESIGN AUTOMATION FOR ADVANCED MICROFLUIDIC BIOCHIPS Speaker and Author: Debraj Kundu, IITR, IN Abstract The science behind handling fluids at the nanoliter-to-femtoliter scale in order to automate a bio-application is termed microfluidics, and the devices used in this process are generally called biochips. Due to recent advancements in the fabrication technologies of these biochips, their design automation field has boomed over the last decade. Integration, precision, and high throughput are the main advantages of biochips over lab-based macro systems. Based on their working principle, biochips can be broadly classified as continuous flow-based microfluidic biochips (CFMBs) and digital microfluidic biochips (DMFBs). To automate various bio-applications on a biochip, different design automation methodologies are required for different kinds of biochips. We provide rigorous and elegant design automation techniques for sample preparation, fluid loading, placement of mixers, and scheduling of mixing graphs in MEDA, PMD and CMF biochips. |
18:30 CET | 10.1.12 | ULTRA-FAST TEMPERATURE ESTIMATION METHODS FOR ARCHITECTURE-LEVEL THERMAL MODELING Speaker and Author: Hameedah Sultan, Indian Institute of Technology Delhi, IN Abstract As the power density of modern-day chips has increased, the chip temperature, too, has risen steadily. High temperature causes several adverse effects, degrading the chip's performance and reliability. It also increases the leakage power, which further increases the on-chip temperature, resulting in a feedback effect. In order to carry out temperature-aware design optimization, it is often necessary to conduct thousands of temperature simulations at various stages of the design cycle, and thus fast simulation without a concomitant loss in accuracy is essential. State-of-the-art works in thermal estimation have serious limitations in modeling some important thermal effects. Additionally, these methods are slow. We overcome the limitations of these works by developing fast Green's function-based analytical methods. |
18:30 CET | 10.1.13 | MULTI-OBJECTIVE DIGITAL VLSI DESIGN OPTIMISATION Speaker and Author: Linan Cao, University of York, GB Abstract Modern VLSI design's complexity and density have been increasing exponentially over the past 50 years, recently reaching a stage that allows heterogeneous, many-core systems and numerous functions to be integrated into a tiny silicon die. These achievements are accomplished by pushing process technology to its physical limits. Transistor shrinking has succeeded with continuous improvements in the physical dimension, switching frequency and power efficiency of integrated circuits (ICs), allowing embedded electronic systems to be used in more and more real-world automated applications. However, as advanced semiconductor technologies come ever closer to the atomic scale, the transistor scaling challenge and stochastic performance variations intrinsic to fabrication emerge. Electronic design automation (EDA) tools handle the growing size and complexity of modern electronic designs by breaking down systems into smaller blocks or cells, introducing different levels of abstraction. In the field of digital very large scale integration (VLSI) design, comprehensive and mature industry-standard design flows are available to tape out chips. This complex process consists of several steps including logic design, logic synthesis, physical implementation and pre-silicon physical verification. However, in this staged, hierarchical design approach, where each step is optimised independently, overheads and inefficiency can accumulate in the resulting overall design. Designers and EDA vendors have to handle these challenges from process technology, design complexity and growing scale, which may otherwise result in inferior design quality, even failures, and lower design yields under time-to-market pressure. Multiple or many design objectives and constraints emerge during the design process and often need to be dealt with simultaneously. Multi-objective evolutionary algorithms (MOEAs) show flexible capabilities in maintaining multiple variable components and factors in uncertain environments. The VLSI design process involves a large number of available parameters, both from designs and from EDA tools. This provides many potential optimisation avenues where evolutionary algorithms can excel. This PhD work investigates the application of evolutionary techniques for digital VLSI design optimisation. Automated multi-objective optimisation frameworks, compatible with industrial design flows and foundry technologies, are proposed to improve solution performance, expand the feasible design space, and handle complex physical floorplan constraints by tuning designs at gate level. Methodologies for enriching standard cell libraries with additional drive strengths are also introduced to cooperate with the multi-objective optimisation frameworks, e.g., via subsequent hill-climbing, providing a richer pool of solutions optimised for different trade-offs. The experiments in this thesis demonstrate that multi-objective evolutionary algorithms, derived from biological inspiration, can assist the digital VLSI design process, in an industrial design context, to more efficiently search for well-balanced trade-off solutions as well as optimised design-space coverage. The expanded drive granularity of standard cells can push the performance of silicon technologies by offering improved solutions for critical objectives. The achieved optimisation results deliver better trade-offs in power, performance and area (PPA) than using standard EDA tools alone, not only for a single circuit solution but across the entire standard-tool-produced design space. |
18:30 CET | 10.1.14 | TINYDL: EFFICIENT DESIGN OF SCALABLE DEEP NEURAL NETWORKS FOR RESOURCE-CONSTRAINED EDGE DEVICES Speaker and Author: Mohammad Loni, Mälardalen University, SE Abstract The main aim of my Ph.D. thesis is to develop theoretical foundations and practical algorithms that (i) enable designing scalable and energy-efficient DNNs with a low energy footprint, (ii) facilitate fast deployment of complicated DL models for a diverse set of Edge devices while satisfying given hardware constraints, and (iii) improve the accuracy of network quantization methods for large-scale datasets. To address these research challenges, I developed (i) the ADONN, DeepMaker, NeuroPower, DenseDisp and FastStereoNet frameworks during my Ph.D. studies to design hardware-friendly NAS methods with minimum design cost, and (ii) novel ternarization frameworks named TOT-Net and TAS that prevent the accuracy degradation of quantization techniques. |
18:30 CET | 10.1.15 | DECISION DIAGRAMS IN QUANTUM DESIGN AUTOMATION Speaker and Author: Stefan Hillmich, Johannes Kepler University Linz, AT Abstract The impact quantum computing may achieve hinges on Computer-Aided Design (CAD) keeping up with the increasing power of physical realizations. The complexity of quantum computing has to be tackled with dedicated methods and data structures as well as a close cooperation between the CAD community and physicists. The main contribution of the thesis is to narrow the emerging design gap for quantum computing by bringing established methods of the CAD community to the quantum world. More precisely, the work focuses on the application of decision diagrams to the areas of quantum circuit simulation, estimation of observables in quantum chemistry, and technology mapping. The supporting paper is attached to the extended abstract. |
18:30 CET | 10.1.16 | DEPENDABLE RECONFIGURABLE SCAN NETWORKS Speaker: Natalia Lylina, University of Stuttgart, DE Authors: Natalia Lylina and Hans-Joachim Wunderlich, University of Stuttgart, DE Abstract The dependability of modern devices is enhanced by integrating an extensive number of non-functional instruments. These are needed to facilitate cost-efficient bring-up, debug, test, diagnosis, and adaptivity in the field and might include, e.g., sensors, aging monitors, and Logic and Memory Built-In Self-Test (BIST) registers. Reconfigurable Scan Networks (RSNs) provide a flexible way to access such instruments as well as the device's registers throughout the lifetime, starting from PSV through manufacturing test and finally during in-field test. At the same time, the dependability properties of the device-under-test (DUT) can be affected by an improper RSN integration. This doctoral project overcomes these problems and establishes a methodology to integrate dependable RSNs for a given device, considering dependability aspects such as accessibility via RSNs, testability of RSNs, and security compliance of RSNs with the underlying device-under-test. The remainder of this extended abstract is structured as follows. First, background information about RSNs is provided, followed by the challenges of dependability-aware RSN integration. Next, the objectives and contributions of this work are summarized for specific dependability properties. |
18:30 CET | 10.1.17 | BREAKING THE ENERGY CAGE OF INSECT-SCALE AUTONOMOUS DRONES: INTERPLAY OF PROBABILISTIC HARDWARE AND CO-DESIGNED ALGORITHMS Speaker: Priyesh Shukla, University of Illinois at Chicago, US Authors: Priyesh Shukla and Amit Trivedi, University of Illinois at Chicago, US Abstract Autonomy in insect-scale drones is challenged by highly constrained area and power budgets. Robustness amidst noisy sensory inputs and surroundings is also critical. To address this, we present two compute-in-memory (CIM) frameworks for insect-scale drone localization. Our first framework is a floating-gate (FG) inverter-array-based CIM (for Bayesian particle filtering) that efficiently evaluates the log-likelihood of the drone's pose, which otherwise demands a heavy computational workload using conventional digital processing. Our second method is Monte-Carlo dropout (MC-Dropout)-based deep neural network (DNN) inference in an all-digital 8T-SRAM (static random access memory) CIM. The CIM is equipped with additional MC-Dropout inference primitives to account for uncertainty in the drone's pose prediction. We discuss compute reuse and optimization strategies for MC-Dropout schedules to gain a significant reduction in this (approximated Bayesian) DNN workload. FG-CIM-based localization is 25x more energy efficient than conventional digital processing, and the SRAM-CIM for MC-Dropout inference consumes 28pJ for 30 MC-Dropout inference iterations (3 TOPS/W). |
18:30 CET | 10.1.18 | RESILIENT: PROTECTING DESIGN IP FROM MALICIOUS ENTITIES Speaker: Nimisha Limaye, New York University, US Authors: Nimisha Limaye1 and Ozgur Sinanoglu2 1New York University, US; 2New York University Abu Dhabi, AE Abstract The globalization of the integrated circuit (IC) supply chain has opened up avenues for untrusted entities with the malicious intent of intellectual property (IP) piracy and overproduction of ICs. These malicious entities encompass the foundry, the test facility, and the end user. An untrusted foundry can readily obtain the unprotected design IP from the design house, and a test facility or an end user can reverse-engineer the chip using widely available tools and extract the underlying design IP to pirate or overproduce the ICs. We first perform an exhaustive security analysis of the state-of-the-art logic locking techniques and propose various attacks. Further, we propose countermeasures to thwart attacks from all the malicious entities in the supply chain. Through our solutions, we allow security-enforcing designers to protect their design IP at various abstraction levels. Our solution can protect not just digital designs but also mixed-signal designs. |
18:30 CET | 10.1.19 | ALGORITHM-ARCHITECTURE CO-DESIGN FOR ENERGY-EFFICIENT, ROBUST, AND PRIVACY-PRESERVING MACHINE LEARNING Speaker and Author: Souvik Kundu, USC, US Abstract My Ph.D. research includes three major aspects of algorithm-architecture co-design for machine learning accelerators: (1) energy efficiency via novel training-efficient pruning, quantization, and distillation; (2) robust model training for safety-critical edge applications; and (3) analysis of model and data privacy of the associated IPs. |
18:30 CET | 10.1.20 | PERFORMANCE-AWARE DESIGN-SPACE OPTIMIZATION AND ATTACK MITIGATION FOR EMERGING HETEROGENEOUS ARCHITECTURES Speaker and Author: Mitali Sinha, IIIT Delhi, IN Abstract The growing system sizes and time-to-market pressure of heterogeneous SoCs compel the chip designers to analyze only part of the design space, leading to suboptimal Intellectual Property (IP) designs. Hence, different processing cores like accelerators are generally designed as standalone IP blocks by third-party vendors and chip designers often over-provision the amount of on-chip resources required to add flexibility to each IP design. Although this modularity simplifies IP design, integrating these off-the-shelf IP blocks into a single SoC may overshoot the resource budget of the underlying system. Furthermore, the integration of third-party IPs alongside other on-chip modules makes the system vulnerable to security threats. This work addresses the challenges involved in designing efficient heterogeneous SoCs by optimizing the utilization of on-chip resources and mitigating performance-based security threats. |
18:30 CET | 10.1.21 | PRACTICAL SIDE-CHANNEL AND FAULT ATTACKS ON LATTICE-BASED CRYPTOGRAPHY Speaker: Prasanna Ravi, Nanyang Technological University, SG Authors: Prasanna Ravi1, Anupam Chattopadhyay1 and Shivam Bhasin2 1Nanyang Technological University, SG; 2Temasek Laboratories, Nanyang Technological University, SG Abstract The possibility of large-scale quantum computers in the future has been an ever-growing threat to the existing public-key infrastructure, which is predominantly based on classical RSA and ECC-based public-key cryptography. This prompted NIST to initiate a global standardization process for alternative quantum-attack-resistant Public Key Encryption (PKE), Key Encapsulation Mechanisms (KEM) and Digital Signatures (DSS), better known as Post-Quantum Cryptography (PQC). The PQC standardization process started in 2017 with 69 submissions and is currently in its third and final round with seven (7) main finalist candidates and eight (8) alternate finalist candidates. Among these fifteen (15) finalist candidates, seven (7) belong to a single category, referred to as lattice-based cryptography. These schemes are based on hard geometric problems that are conjectured to be computationally intractable for quantum computers. NIST laid out several evaluation criteria for the standardization process, which include theoretical Post-Quantum (PQ) security guarantees, implementation cost and performance. Along with them, resistance against physical attacks such as Side-Channel Analysis (SCA) and Fault Injection Analysis (FIA) has also emerged as an important criterion for the standardization process. This is especially relevant for the adoption of PQC in embedded devices, which will be used in environments where an attacker can have unimpeded physical access to the target device. We therefore focus on evaluating the security of practical implementations of lattice-based schemes against SCA and FIA. We have identified novel SCA and FIA vulnerabilities that led to practical attacks on implementations of several lattice-based schemes. Most of our attacks exploit vulnerabilities inherent in the algorithms of lattice-based schemes, which makes our attacks adaptable to different implementation platforms (hardware and software). |
18:30 CET | 10.1.22 | MEMORY INTERFERENCE AND MITIGATIONS IN RECONFIGURABLE HESOCS FOR EMBEDDED AI Speaker: Gianluca Brilli, University of Modena and Reggio Emilia, IT Authors: Gianluca Brilli, Alessandro Capotondi, Paolo Burgio, Andrea Marongiu and Marko Bertogna, University of Modena and Reggio Emilia, IT Abstract Recent advances in high-performance embedded systems have paved the way for next-generation applications that were impractical a few decades ago, such as Deep Neural Networks (DNNs). DNNs are widely adopted in several embedded domains and in particular in so-called Cyber Physical Systems (CPS). Examples of CPS are autonomous robots, which typically integrate one or more neural networks into their navigation systems for perception and localization tasks. To match this need, manufacturers of high-performance embedded chips are increasingly adopting heterogeneous designs (HeSoCs), in which sequential processors are combined with massively parallel accelerators used to perform ML tasks in an energy-efficient manner. These systems typically follow a Commercial-Off-The-Shelf (COTS) organization, where the memory hierarchy, composed of multiple cache layers and a main memory (DRAM), is shared among the computational engines of the system. On the one hand, this scheme improves time-to-market and system scalability, and in general provides good average-case performance. However, it is not always adequate for applications where, by construction, the system must guarantee bounded performance even in the worst case. A shared memory organization creates contention on shared resources [1]–[3], where the execution time of a task also depends on the number of other tasks that access a given shared resource in the same time interval. The main aspects addressed in this work are: (i) a characterization of state-of-the-art embedded neural network engines, to study the typical workload of a DNN and the impact it can have on the system; (ii) a deep memory-interference characterization on HeSoCs, with particular reference to FPGA-based ones; (iii) architectural solutions to mitigate memory interference and improve the low memory-bandwidth utilization of PREM-like schemes. |
IP.1_1 Interactive presentations
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.1_1.1 | (Best Paper Award Candidate) A SOFTWARE ARCHITECTURE TO CONTROL SERVICE-ORIENTED MANUFACTURING SYSTEMS Speaker: Sebastiano Gaiardelli, Università di Verona, IT Authors: Sebastiano Gaiardelli1, Stefano Spellini1, Marco Panato2, Michele Lora3 and Franco Fummi1 1Università di Verona, IT; 2Universita' di Verona, IT; 3University of Southern California, US Abstract This paper presents a software architecture extending the classical automation pyramid to control and reconfigure flexible, service-oriented manufacturing systems. At the Planning level, the architecture requires a Manufacturing Execution System (MES) consistent with the International Society of Automation (ISA) standard. Then, the Supervisory level is automated by introducing a novel component, called Automation Manager. The new component interacts upward with the MES, and downward with a set of servers providing access to the manufacturing machines. The communication with machines relies on the OPC Unified Architecture (OPC UA) standard protocol, which allows exposing production tasks as “services”. The proposed software architecture has been prototyped to control a real production line, originally controlled by a commercial MES, unable to fully exploit the flexibility provided by the case study manufacturing system. Meanwhile, the proposed architecture is fully exploiting the production line’s flexibility. |
IP.1_1.2 | (Best Paper Award Candidate) COMPREHENSIVE AND ACCESSIBLE CHANNEL ROUTING FOR MICROFLUIDIC DEVICES Speaker: Philipp Ebner, Johannes Kepler University, AT Authors: Gerold Fink, Philipp Ebner and Robert Wille, Johannes Kepler University Linz, AT Abstract Microfluidics is an emerging field that allows processes usually conducted with unwieldy laboratory equipment to be miniaturized, integrated, and automated inside a single device, resulting in so-called "Labs-on-a-Chip" (LoCs). The design process of channel-based LoCs is thus far still mainly conducted manually, resulting in time-consuming tasks and error-prone designs. This also holds for the routing process, where multiple components inside an LoC should be connected according to a specification. In this work, we present a routing tool which considers the particular requirements of microfluidic applications and automates the routing process. In order to make the tool more accessible (even to users with little to no EDA expertise), it is incorporated into a user-friendly and intuitive online interface. |
IP.1_1.3 | (Best Paper Award Candidate) XST: A CROSSBAR COLUMN-WISE SPARSE TRAINING FOR EFFICIENT CONTINUAL LEARNING Speaker: Fan Zhang, Arizona State University, US Authors: Fan Zhang, Li Yang, Jian Meng, Jae-sun Seo, Yu Cao and Deliang Fan, Arizona State University, US Abstract Leveraging ReRAM crossbar-based In-Memory-Computing (IMC) to accelerate single-task DNN inference has been widely studied. However, using the ReRAM crossbar for continual learning has not been explored yet. In this work, we propose XST, a novel crossbar column-wise sparse training framework for continual learning. XST significantly reduces the training cost and saves inference energy. More importantly, it is friendly to existing crossbar-based convolution engines with almost no hardware overhead. Compared with the state-of-the-art CPG method, experiments show that XST achieves 4.95% higher accuracy. Furthermore, XST demonstrates a ~5.59X training speedup and 1.5X inference energy saving. |
IP.1_2 Interactive presentations
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.1_2.1 | (Best Paper Award Candidate) ENERGY-EFFICIENT BRAIN-INSPIRED HYPERDIMENSIONAL COMPUTING USING VOLTAGE SCALING Speaker: Xun Jiao, Villanova University, US Authors: Sizhe Zhang1, Ruixuan Wang1, Dongning Ma1, Jeff Zhang2, Xunzhao Yin3 and Xun Jiao1 1Villanova University, US; 2Harvard University, US; 3Zhejiang University, CN Abstract Brain-inspired hyperdimensional computing (HDC) is an emerging computational paradigm that mimics brain cognition and leverages hyperdimensional vectors with fully distributed holographic representation and (pseudo) randomness. Recently, HDC has demonstrated promising capability in a wide range of applications such as medical diagnosis, human activity recognition, and voice classification. Despite the growing popularity of HDC, its memory-centric computing characteristics make the associative memory implementation consume significant energy due to the massive data storage and processing. While voltage scaling has been studied intensively to reduce memory energy dissipation, it can introduce errors that degrade the output quality. In this paper, we systematically study and leverage the application-level error resilience of HDC to reduce the energy consumption of HDC associative memory by using voltage scaling. Evaluation results on various applications show that our proposed approach can achieve 47.6% energy saving on associative memory with a negligible accuracy loss (<1%). We further explore two low-cost error masking methods, i.e., word masking and bit masking, to mitigate the impact of voltage-scaling-induced errors. Experimental results show that the proposed word masking (bit masking) method can further enhance energy saving up to 62.3% (72.5%) with accuracy loss <1%. |
IP.1_2.2 | ERROR GENERATION FOR 3D NAND FLASH MEMORY Speaker: Weihua Liu, Huazhong University of Science and Technology, CN Authors: Weihua Liu, Fei Wu, Songmiao Meng, Xiang Chen and Changsheng Xie, Huazhong University of Science and Technology, CN Abstract Three-dimensional (3D) NAND flash memory is the preferred storage component of solid-state drives (SSDs) for its high ratio of capacity to cost. Optimizing the reliability of modern SSDs requires testing and collecting a large amount of real-world error data from 3D NAND flash memory. However, test costs have surged dozens of times as its capacity increases. It is imperative to reduce the costs of testing denser, high-capacity flash memory. To facilitate this, in this paper we aim to enable reproducing error data efficiently for 3D NAND flash memory. We use a conditional generative adversarial network (cGAN) to learn the error distribution with multiple interferences and generate diverse error data comparable to real-world data. Evaluation results demonstrate that error generation with cGAN is feasible and efficient. |
IP.1_2.3 | ESTIMATING VULNERABILITY OF ALL MODEL PARAMETERS IN DNN WITH A SMALL NUMBER OF FAULT INJECTIONS Speaker: Yangchao Zhang, Osaka University, JP Authors: Yangchao Zhang1, Hiroaki Itsuji2, Takumi Uezono2, Tadanobu Toba2 and Masanori Hashimoto3 1Osaka University, JP; 2Hitachi Ltd., JP; 3Kyoto University, JP Abstract The reliability of deep neural networks (DNNs) against hardware errors is essential as DNNs are increasingly employed in safety-critical applications such as automatic driving. Transient errors in memory, such as radiation-induced soft error, may propagate through the inference computation, resulting in unexpected output, which can adversely trigger catastrophic system failures. As a first step to tackle this problem, this paper proposes constructing a vulnerability model (VM) with a small number of fault injections to identify vulnerable model parameters in DNN. We reduce the number of bit locations for fault injection significantly and develop a flow to incrementally collect the training data, i.e., the fault injection results, for VM accuracy improvement. Experimental results show that VM can estimate vulnerabilities of all DNN model parameters only with 1/3490 computations compared with traditional fault injection-based vulnerability estimation. |
IP.1_3 Interactive presentations
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.1_3.1 | EXPLOITING ARBITRARY PATHS FOR THE SIMULATION OF QUANTUM CIRCUITS WITH DECISION DIAGRAMS Speaker: Lukas Burgholzer, Johannes Kepler University Linz, Austria, AT Authors: Lukas Burgholzer, Alexander Ploier and Robert Wille, Johannes Kepler University Linz, AT Abstract The classical simulation of quantum circuits is essential in the development and testing of quantum algorithms. Methods based on tensor networks or decision diagrams have proven to alleviate the inevitable exponential growth of the underlying complexity in many cases. But the complexity of these methods is very sensitive to so-called contraction plans or simulation paths, respectively, which define the order in which the respective operations are applied. While a plethora of strategies has been developed for tensor networks, simulation based on decision diagrams has thus far mostly been conducted in a straightforward fashion. In this work, we envision a flow that allows strategies from the domain of tensor networks to be translated to decision diagrams. Preliminary results indicate that a substantial advantage may be gained by employing suitable simulation paths, motivating a thorough consideration. |
IP.1_3.2 | A NOVEL NEUROMORPHIC PROCESSORS REALIZATION OF SPIKING DEEP REINFORCEMENT LEARNING FOR PORTFOLIO MANAGEMENT Speaker: Seyyed Amirhossein Saeidi, Amirkabir University of Technology (Tehran Polytechnic), IR Authors: Seyyed Amirhossein Saeidi, Forouzan Fallah, Soroush Barmaki and Hamed Farbeh, Amirkabir University of Technology, IR Abstract The process of constantly reallocating budgets into financial assets, aiming to increase the anticipated return of assets and minimizing the risk, is known as portfolio management. Processing speed and energy consumption of portfolio management have become crucial as the complexity of their real-world applications increasingly involves high-dimensional observation and action spaces and environment uncertainty, which their limited onboard resources cannot offset. Emerging neuromorphic chips inspired by the human brain increase processing speed by up to 500 times and reduce power consumption by several orders of magnitude. This paper proposes a spiking deep reinforcement learning (SDRL) algorithm that can predict financial markets based on unpredictable environments and achieve the defined portfolio management goal of profitability and risk reduction. This algorithm is optimized for Intel’s Loihi neuromorphic processor and provides 186x and 516x energy consumption reduction compared to a high-end processor and GPU, respectively. In addition, a 1.3x and 2.0x speed-up is observed over the high-end processors and GPUs, respectively. The evaluations are performed on cryptocurrency market benchmark between 2016 and 2021. |
IP.1_3.3 | IN-SITU TUNING OF PRINTED NEURAL NETWORKS FOR VARIATION TOLERANCE Speaker: Mehdi Tahoori, Karlsruhe Institute of Technology, DE Authors: Michael Hefenbrock, Dennis Weller, Jasmin Aghassi, Michael Beigl and Mehdi Tahoori, Karlsruhe Institute of Technology, DE Abstract Printed electronics (PE) can meet requirements on cost, conformity, and non-toxicity in many application domains that silicon-based computing systems cannot achieve. A typical computational task to be performed in many such applications is classification. Therefore, printed Neural Networks (pNNs) have been proposed to meet these requirements. However, PE suffers from high process variations due to low-resolution printing in low-cost additive manufacturing. This can severely impact the inference accuracy of pNNs. In this work, we show how a unique feature of PE, namely additive printing, can be leveraged to perform in-situ tuning of pNNs to compensate for accuracy losses induced by device variations. The experiments show that, even under 30% variation of the conductances, up to 90% of the initial accuracy can be recovered. |
IP.1_4 Interactive presentations
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.1_4.1 | PRACTICAL IDENTITY RECOGNITION USING WIFI'S CHANNEL STATE INFORMATION Speaker: Cristian Turetta, University of Verona, IT Authors: Cristian Turetta1, Florenc Demrozi1, Philipp H. Kindt2, Alejandro Masrur3 and Graziano Pravadelli1 1Università di Verona, IT; 2TU Munich, DE; 3TU Chemnitz, DE Abstract Identity recognition is increasingly used to control access to sensitive data and restricted areas in industrial, healthcare, and defense settings, as well as in consumer electronics. To this end, existing approaches are typically based on collecting and analyzing biometric data and raise severe privacy concerns. Particularly when cameras are involved, users might even reject or dismiss an identity recognition system. Furthermore, iris or fingerprint scanners, cameras, microphones, etc., imply installation and maintenance costs and require the user's active participation in the recognition procedure. This paper proposes a non-intrusive identity recognition system based on analyzing WiFi's Channel State Information (CSI). We show that CSI data attenuated by a person's body and typical movements allows for reliable identification, even in a sitting posture. We further propose a lightweight deep learning algorithm trained using CSI data, which we implemented and evaluated on an embedded platform (i.e., a Raspberry Pi 4B). Our results obtained in real-world experiments suggest a high accuracy in recognizing people's identity, with a specificity of 98% and a sensitivity of 99%, while requiring a low training effort and negligible cost. |
IP.1_4.2 | A RDMA INTERFACE FOR ULTRA-FAST ULTRASOUND DATA-STREAMING OVER AN OPTICAL LINK Speaker: Andrea Cossettini, ETH Zurich, CH Authors: Andrea Cossettini, Konstantin Taranov, Christian Vogt, Michele Magno, Torsten Hoefler and Luca Benini, ETH Zürich, CH Abstract Digital ultrasound (US) probes integrate the analog-to-digital conversion directly on the probe and can be conveniently connected to commodity devices. Existing digital probes are however limited to a relatively small number of channels, do not guarantee access to the raw US data, or cannot operate at very high frame rates (e.g., due to exhaustion of computing and storage units on the receiving device). In this work, we present an open, compact, power-efficient, 192-channel digital US data acquisition system capable of streaming US data at transfer rates greater than 80 Gbps towards a host PC for ultra-high frame rate imaging (in the multi-kHz range). Our US probe is equipped with two power-efficient Field Programmable Gate Arrays (FPGAs) and is interfaced to the host PC with two optical-link 100G Ethernet connections. The high-speed performance is enabled by implementing a Remote Direct Memory Access (RDMA) communication protocol between the probe and the controlling PC, which utilizes a high-performance Non-Volatile Memory Express (NVMe) interface to store the streamed data. To the best of our knowledge, thanks to the achieved data rates, this is the first high-channel-count compact digital US platform capable of raw data streaming at frame rates of 20 kHz (for imaging at 3.5 cm depths), without the need for sparse sampling, consuming less than 40 W. |
IP.1_4.3 | ROBUST HUMAN ACTIVITY RECOGNITION USING GENERATIVE ADVERSARIAL IMPUTATION NETWORKS Speaker: Dina Hussein, Washington State University, US Authors: Dina Hussein1, Aaryan Jain2 and Ganapati Bhat1 1Washington State University, US; 2Nikola Tesla STEM High School, US Abstract Human activity recognition (HAR) is widely used in applications ranging from activity tracking to rehabilitation of patients. HAR classifiers are typically trained with data collected from a known set of users while assuming that all the sensors needed for activity recognition are working perfectly and there are no missing samples. However, real-world usage of the HAR classifier may encounter missing data samples due to user error, device error, or battery limitations. The missing samples, in turn, lead to a significant reduction in accuracy. To address this limitation, we propose an adaptive method that either uses low-power mean imputation or generative adversarial imputation networks (GAIN) to recover the missing data samples before classifying the activities. Experiments on a public HAR dataset with 22 users show that the proposed robust HAR classifier achieves 94% classification accuracy with as much as 20% missing samples from the sensors with 390 µJ energy consumption per imputation. |
IP.1_5 Interactive presentations
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.1_5.1 | HYPERX: A HYBRID RRAM-SRAM PARTITIONED SYSTEM FOR ERROR RECOVERY IN MEMRISTIVE XBARS Speaker: Adarsh Kosta, Purdue University, US Authors: Adarsh Kosta, Efstathia Soufleri, Indranil Chakraborty, Amogh Agrawal, Aayush Ankit and Kaushik Roy, Purdue University, US Abstract Memristive crossbars based on Non-volatile Memory (NVM) technologies such as RRAM, have recently shown great promise for accelerating Deep Neural Networks (DNNs). They achieve this by performing efficient Matrix-Vector-Multiplications (MVMs) while offering dense on-chip storage and minimal off-chip data movement. However, their analog nature of computing introduces functional errors due to non-ideal RRAM devices, significantly degrading the application accuracy. Further, RRAMs suffer from low endurance and high write costs, hindering on-chip trainability. To alleviate these limitations, we propose HyperX, a hybrid RRAM-SRAM system that leverages the complementary benefits of NVM and CMOS technologies. Our proposed system consists of a fixed RRAM block offering area and energy-efficient MVMs and an SRAM block enabling on-chip training to recover the accuracy drop due to the RRAM non-idealities. The improvements are reported in terms of energy and product of latency and area (ms x mm^2), termed as area-normalized latency. Our experiments on CIFAR datasets using ResNet-20 show up to 2.88x and 10.1x improvements in inference energy and area-normalized latency, respectively. In addition, for a transfer learning task from ImageNet to CIFAR datasets using ResNet-18, we observe up to 1.58x and 4.48x improvements in energy and area-normalized latency, respectively. These improvements are with respect to an all-SRAM baseline. |
IP.1_5.2 | A RESOURCE-EFFICIENT SPIKING NEURAL NETWORK ACCELERATOR SUPPORTING EMERGING NEURAL ENCODING Speaker: Daniel Gerlinghoff, Agency for Science, Technology and Research, SG Authors: Daniel Gerlinghoff1, Zhehui Wang1, Xiaozhe Gu2, Rick Siow Mong Goh1 and Tao Luo1 1Agency for Science, Technology and Research, SG; 2Chinese University of Hong Kong, Shenzhen, CN Abstract Spiking neural networks (SNNs) recently gained momentum due to their low-power, multiplication-free computing and their closer resemblance to biological processes in the human nervous system. However, SNNs require very long spike trains (up to 1000) to reach an accuracy similar to their artificial neural network (ANN) counterparts for large models, which offsets their efficiency and inhibits their application to low-power systems for real-world use cases. To alleviate this problem, emerging neural encoding schemes have been proposed to shorten the spike train while maintaining high accuracy. However, current SNN accelerators cannot adequately support these emerging encoding schemes. In this work, we present a novel hardware architecture that can efficiently support SNNs with emerging neural encoding. Our implementation features energy- and area-efficient processing units with increased parallelism and reduced memory accesses. We verified the accelerator on an FPGA and achieve 25% and 90% improvement over previous work in power consumption and latency, respectively. At the same time, high area efficiency allows us to scale to large neural network models. To the best of our knowledge, this is the first work to deploy the large neural network model VGG on physical FPGA-based neuromorphic hardware. |
IP.1_5.3 | SCALABLE HARDWARE ACCELERATION OF NON-MAXIMUM SUPPRESSION Speaker: Chunyun Chen, Nanyang Technological University, SG Authors: Chunyun Chen1, Tianyi Zhang2, Zehui Yu1, Adithi Raghuraman1, Shwetalaxmi Udayan1, Jie Lin2 and Mohamed Aly1 1Nanyang Technological University, SG; 2Institute for Infocomm Research, ASTAR, SG Abstract Non-maximum Suppression (NMS) in one- and two-stage object detection deep neural networks (e.g., SSD and Faster-RCNN) is becoming the computation bottleneck. In this paper, we introduce a hardware accelerator for the scalable PSRR-MaxpoolNMS algorithm. Our architecture shows 75.0× and 305× speedups compared to the software implementation of PSRR-MaxpoolNMS as well as the hardware implementations of GreedyNMS, respectively, while simultaneously achieving Mean Average Precision (mAP) comparable to software-based floating-point implementations. Our architecture is 13.4× faster than the state-of-the-art NMS accelerator. Our accelerator supports both one- and two-stage detectors, while supporting very high input resolutions (i.e., FHD), an essential input size for better detection accuracy. |
IP.1_6 Interactive presentations
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.1_6.1 | ACTIVE LEARNING OF ABSTRACT SYSTEM MODELS FROM TRACES USING MODEL CHECKING Speaker: Natasha Yogananda Jeppu, University of Oxford, GB Authors: Natasha Yogananda Jeppu1, Tom Melham1 and Daniel Kroening2 1University of Oxford, GB; 2Amazon, Inc, GB Abstract We present a new active model-learning approach to generating abstractions of a system implementation, as finite state automata (FSAs), from execution traces. Given an implementation and a set of observable system variables, the generated automata admit all system behaviours over the given variables and provide useful insight in the form of invariants that hold on the implementation. To achieve this, the proposed approach uses a pluggable model learning component that can generate an FSA from a given set of traces. Conditions that encode a completeness hypothesis are then extracted from the FSA under construction and used to evaluate its degree of completeness by checking their truth value against the system using software model checking. This generates new traces that express any missing behaviours. The new trace data is used to iteratively refine the abstraction, until all system behaviours are admitted by the learned abstraction. To evaluate the approach, we reverse-engineer a set of publicly available Simulink Stateflow models from their C implementations. |
IP.1_6.2 | REDUCING THE CONFIGURATION OVERHEAD OF THE DISTRIBUTED TWO-LEVEL CONTROL SYSTEM Speaker: Yu Yang, KTH Royal Institute of Technology, SE Authors: Yu Yang, Dimitrios Stathis and Ahmed Hemani, KTH Royal Institute of Technology, SE Abstract With the growing demand for more efficient hardware accelerators for streaming applications, a novel Coarse-Grained Reconfigurable Architecture (CGRA) that uses a Distributed Two-Level Control (D2LC) system has been proposed in the literature. Even though its highly distributed and parallel structure makes it fast and energy-efficient, the single-issue instruction channel between the level-1 and level-2 controllers in each D2LC cell becomes the bottleneck of its performance. In this paper, we improve the design to mimic a multi-issue architecture by inserting shadow instruction buffers between the level-1 and level-2 controllers. Together with a zero-overhead hardware loop, the improved D2LC architecture enables efficient overlap between loop iterations. We also propose a complete constraint-programming-based instruction scheduling algorithm to support the above hardware features. The experimental results show that the improved D2LC architecture can achieve up to a 25% reduction in instruction execution cycles and a 35% reduction in energy-delay product. |
IP.1_6.3 | BATCHLENS: A VISUALIZATION APPROACH FOR ANALYZING BATCH JOBS IN CLOUD SYSTEMS Speaker: Qiang Guan, Kent State University, US Authors: Shaolun Ruan1, Yong Wang1, Hailong Jiang2, Weijia Xu3 and Qiang Guan2 1Singapore Management University, SG; 2Kent State University, US; 3TACC, US Abstract Cloud systems are becoming increasingly powerful and complex. It is highly challenging to identify anomalous execution behaviors and pinpoint problems by examining the overwhelming intermediate results/states in complex application workflows. Domain scientists urgently need a friendly and functional interface to understand the quality of the computing services and the performance of their applications in real time. To meet these needs, we explore data generated by job schedulers and investigate general performance metrics (e.g., utilization of CPU, memory and disk I/O). Specifically, we propose an interactive visual analytics approach, BatchLens, to provide both providers and users of cloud service with an intuitive and effective way to explore the status of system batch jobs and help them conduct root-cause analysis of anomalous behaviors in batch jobs. We demonstrate the effectiveness of BatchLens through a case study on the public Alibaba bench workload trace datasets. |
IP.1_7 Interactive presentations
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.1_7.1 | FLOWACC: REAL-TIME HIGH-ACCURACY DNN-BASED OPTICAL FLOW ACCELERATOR IN FPGA Speaker: Yehua Ling, Sun Yat-sen University, CN Authors: Yehua Ling, Yuanxing Yan, Kai Huang and Gang Chen, Sun Yat-sen University, CN Abstract Recently, accelerator architectures have been designed that use deep neural networks (DNNs) to accelerate computer vision tasks, possessing the advantages of both accuracy and speed. Optical flow accelerators, however, are not among the architectures in which DNNs have been successfully deployed. Existing hardware accelerators for optical flow estimation are all designed for classic methods and generally perform poorly in estimation accuracy. In this paper, we present FlowAcc, a dedicated hardware accelerator for DNN-based optical flow estimation, adopting a pipelined hardware design for real-time processing of image streams. We design an efficient multiplexing binary neural network (BNN) architecture for pyramidal feature extraction to significantly reduce the hardware cost and make it independent of the number of pyramid levels. Furthermore, efficient Hamming distance calculation and competent flow regularization are utilized for hierarchical optical flow estimation to greatly improve the system efficiency. Comprehensive experimental results demonstrate that FlowAcc achieves state-of-the-art estimation accuracy and real-time performance on the Middlebury dataset when compared with existing optical flow accelerators. |
IP.1_7.2 | ON EXPLOITING PATTERNS FOR ROBUST FPGA-BASED MULTI-ACCELERATOR EDGE COMPUTING SYSTEMS Speaker: Seyyed Ahmad Razavi, University of California, Irvine, US Authors: Seyyed Ahmad Razavi, Hsin-Yu Ting, Tootiya Giyahchi and Eli Bozorgzadeh, University of California, Irvine, US Abstract Edge computing plays a key role in providing services for emerging compute-intensive applications while bringing computation close to end devices. FPGAs have been deployed to provide custom acceleration services due to their reconfigurability and support for multi-tenancy in sharing the computing resource. This paper explores an FPGA-based Multi-Accelerator Edge Computing System that serves various DNN applications from multiple end devices simultaneously. To dynamically maximize the responsiveness to end devices, we propose a system framework that exploits patterns in application characteristics and employs a staggering module coupled with a mixed offline/online multi-queue scheduling method to alleviate resource contention and the uncertain delay caused by network delay variation. Our evaluation shows the framework can significantly improve responsiveness and robustness in serving multiple end devices. |
IP.1_7.3 | RLPLACE: DEEP RL GUIDED HEURISTICS FOR DETAILED PLACEMENT OPTIMIZATION Speaker: Uday Mallappa, UC San Diego, US Authors: Uday Mallappa1, Sreedhar Pratty2 and David Brown2 1University of California San Diego, US; 2Nvidia, US Abstract The solution space of detailed placement becomes intractable with an increase in the number of placeable cells and their possible locations. So, existing works focus either on sliding-window-based optimization or row-based optimization. Though these region-based methods enable us to use linear-programming, pseudo-greedy or dynamic-programming algorithms, locally optimal solutions from these methods are globally sub-optimal due to their inherent heuristics. Heuristics such as the order in which we choose these local problems or the size of each sliding window (runtime vs. optimality tradeoff) account for the degradation of solution quality. Our hypothesis is that learning-based techniques (with their richer representation ability) have shown great success in problems with huge solution spaces, and can offer an alternative to the existing rudimentary heuristics. We propose a two-stage detailed-placement algorithm, RLPlace, that uses reinforcement learning (RL) for coarse re-arrangement and Satisfiability Modulo Theories (SMT) for fine-grain refinement. With the global placement output of two critical IPs as the start point, RLPlace achieves up to 1.35% HPWL improvement compared to the commercial tool's detailed-placement result. In addition, RLPlace shows at least 1.2% HPWL improvement over highly optimized detailed-placement variants of the two IPs. |
IP.ASD Interactive presentations
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 11:30 CET - 12:15 CET
Session chair:
Philipp Mundhenk, Bosch, DE
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.ASD.1 | DEADLOCK ANALYSIS AND PREVENTION FOR INTERSECTION MANAGEMENT BASED ON COLORED TIMED PETRI NETS Speaker: Tsung-Lin Tsou, National Taiwan University, TW Authors: Tsung-Lin Tsou, Chung-Wei Lin and Iris Hui-Ru Jiang, National Taiwan University, TW Abstract We propose a Colored Timed Petri Net (CTPN) based model for intersection management. With the expressiveness of the CTPN-based model, we can consider timing, vehicle-specific information, and different types of vehicles. We then design deadlock-free policies and guarantee deadlock-freeness for intersection management. To the best of our knowledge, this is the first work on CTPN-based deadlock analysis and prevention for intersection management. |
IP.ASD.2 | ATTACK DATA GENERATION FRAMEWORK FOR AUTONOMOUS VEHICLE SENSORS Speaker: Jan Lauinger, TU Munich, DE Authors: Jan Lauinger1, Andreas Finkenzeller1, Henrik Lautebach2, Mohammad Hamad1 and Sebastian Steinhorst1 1TU Munich, DE; 2ZF Group, DE Abstract Driving scenarios of autonomous vehicles combine many data sources with new networking requirements in highly dynamic system setups. To keep security mechanisms applicable to new application fields in the automotive domain, our work introduces a security framework to generate, attack, and validate realistic data sets at rest and in transit. Concerning realistic data sets, our framework leverages autonomous driving simulators as well as static data sets of vehicle sensors. A configurable networking setup enables flexible data encapsulation to perform and validate networking attacks on data in transit. We validate our results with intrusion detection algorithms and simulation environments. Generated data sets and configurations are reproducible, portable, storable, and support iterative security testing of scenarios. |
IP.ASD.3 | CONTRACT-BASED QUALITY-OF-SERVICE ASSURANCE IN DYNAMIC DISTRIBUTED SYSTEMS Speaker: Lea Schönberger, TU Dortmund University, DE Authors: Lea Schönberger1, Susanne Graf2, Selma Saidi3, Dirk Ziegenbein4 and Arne Hamann4 1TU Dortmund University, DE; 2University Grenoble Alpes, CNRS, FR; 3TU Dortmund, DE; 4Robert Bosch GmbH, DE Abstract To offer an infrastructure for autonomous systems offloading parts of their functionality, dynamic distributed systems must be able to satisfy non-functional quality-of-service (QoS) requirements. However, providing hard QoS guarantees that hold even under uncertain conditions, without complex global verification, is very challenging. In this work, we propose contract-based QoS assurance for centralized, hierarchical systems, which requires local verification only and has the potential to cope with dynamic changes and uncertainties. |
K.5 Lunch Keynote: "Probabilistic and Deep Learning Techniques for Robot Navigation and Automated Driving"
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 13:00 CET - 13:50 CET
Session chair:
Rolf Ernst, TU Braunschweig, DE
Session co-chair:
Selma Saidi, TU Dortmund, DE
For autonomous robots and automated driving, the capability to robustly perceive environments and execute their actions is the ultimate goal. The key challenge is that no sensors and actuators are perfect, which means that robots and cars need the ability to properly deal with the resulting uncertainty. In this presentation, I will introduce the probabilistic approach to robotics, which provides a rigorous statistical methodology to deal with state estimation problems. I will furthermore discuss how this approach can be extended using state-of-the-art technology from machine learning to deal with complex and changing real-world environments.
Speaker's bio: Wolfram Burgard is a Professor for Robotics and Artificial Intelligence at the Technical University of Nuremberg. His interests lie in Robotics, Artificial Intelligence, Machine Learning, and Computer Vision. He has published over 400 publications, more than 15 of which received best paper awards. In 2009, he was awarded the Gottfried Wilhelm Leibniz Prize, the most prestigious German research award. In 2010, he received an Advanced Grant from the European Research Council. In 2021, he received the IEEE Technical Field Award for Robotics and Automation. He is a Fellow of the IEEE, the AAAI, the EurAI, and a member of the German Academy of Sciences Leopoldina as well as of the Heidelberg Academy of Sciences and Humanities.
Time | Label | Presentation Title Authors |
---|---|---|
13:00 CET | K.5.1 | PROBABILISTIC AND DEEP LEARNING TECHNIQUES FOR ROBOT NAVIGATION AND AUTOMATED DRIVING Speaker and Author: Wolfram Burgard, TU Nuremberg, DE Abstract For autonomous robots and automated driving, the capability to robustly perceive environments and execute their actions is the ultimate goal. The key challenge is that no sensors and actuators are perfect, which means that robots and cars need the ability to properly deal with the resulting uncertainty. In this presentation, I will introduce the probabilistic approach to robotics, which provides a rigorous statistical methodology to deal with state estimation problems. I will furthermore discuss how this approach can be extended using state-of-the-art technology from machine learning to deal with complex and changing real-world environments. |
11.1 Analog / mixed-signal EDA from system level to layout level
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Manuel Barragan, Universite Grenoble Alpes, CNRS, Grenoble INP, TIMA, FR
Session co-chair:
Lars Hedrich, Goethe University of Frankfurt/Main, DE
The first paper in the session explores the high-level design of a mixed-signal system. Topology generation and sizing for an OPAMP is discussed next. The following four papers deal with various issues in placement: placement guided by circuit simulations, a discussion of models for placement, routability issues, and finally placement and routing of capacitor arrays.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 11.1.1 | EFFICSENSE: AN ARCHITECTURAL PATHFINDING FRAMEWORK FOR ENERGY-CONSTRAINED SENSOR APPLICATIONS Speaker: Jonah Van Assche, KU Leuven, BE Authors: Jonah Van Assche, Ruben Helsen and Georges Gielen, KU Leuven, BE Abstract This paper introduces EffiCSense, an architectural pathfinding framework for mixed-signal sensor front-ends for both regular and compressive sensing systems. Since sensing systems are often energy constrained, finding a suitable architecture can be a long iterative process between high-level modeling and circuit design. We present a Simulink-based framework that allows for architectural pathfinding with high-level functional models while also including power consumption models of the different circuit blocks. This makes it possible to directly model the impact of design specifications on power consumption and speeds up the overall design process significantly. Both architectures with and without compressive sensing can be handled. The framework is demonstrated for the processing of EEG signals for epilepsy detection, comparing solutions with and without analog compressive sensing. Simulations show that, using compression, an optimal design can be found that is estimated to be 3.6 times more power-efficient than a system without compression, consuming 2.44 µW at a detection accuracy of 99.3%. |
14:34 CET | 11.1.2 | TOPOLOGY OPTIMIZATION OF OPERATIONAL AMPLIFIER IN CONTINUOUS SPACE VIA GRAPH EMBEDDING Speaker: Jialin Lu, Fudan University, CN Authors: Jialin Lu, Liangbo Lei, Fan Yang, Li Shang and Xuan Zeng, Fudan University, CN Abstract The operational amplifier is a key building block in analog circuits. However, the design process of the operational amplifier is complex and time-consuming, as there are no practical automation tools available in the industry. This paper presents a new topology optimization method for operational amplifiers. The behavior of the operational amplifier is described using a directed acyclic graph (DAG), which is then transformed into a low-dimensional embedding in continuous space using a variational graph autoencoder. Topology search is performed in the continuous embedding space using stochastic optimization methods, such as Bayesian Optimization. The search results are then transformed back to operational amplifier topologies using a graph decoder. The proposed method is also equipped with a surrogate model for performance prediction. Experimental results show that the proposed approach can achieve significant speedup over genetic search algorithms. The produced three-stage operational amplifiers offer competitive performance compared to manual designs. |
14:38 CET | 11.1.3 | A CHARGE FLOW FORMULATION FOR GUIDING ANALOG/MIXED-SIGNAL PLACEMENT Speaker: Tonmoy Dhar, University of Minnesota Twin Cities, US Authors: Tonmoy Dhar1, Ramprasath S2, Jitesh Poojary2, Soner Yaldiz3, Steven Burns3, Ramesh Harjani2 and Sachin S. Sapatnekar2 1University of Minnesota Twin Cities, US; 2University of Minnesota, US; 3Intel Corporation, US Abstract An analog/mixed-signal designer typically performs circuit optimization, involving intensive SPICE simulations, on a schematic netlist and then sends the optimized netlist to layout. During the layout phase, it is vital to maintain symmetry requirements to avoid performance degradation due to mismatch: these constraints are usually specified using user input or by invoking an external tool. Moreover, to achieve high performance, the layout must avoid large interconnect parasitics on critical nets. Prior works that optimize parasitics during placement work with coarse metrics such as the half-perimeter wire length, but these metrics do not appropriately emphasize performance-critical nets. The novel charge flow (CF) formulation in this work addresses both symmetry detection and parasitic optimization. By leveraging schematic-level simulations, which are available “for free” from the circuit optimization step, the approach (a) alters the objective function to emphasize the reduction of parasitics on performance-critical nets, and (b) identifies symmetric elements/element groups. The effectiveness of the CF-based approach is demonstrated on a variety of circuits within a stochastic placement engine. |
14:42 CET | 11.1.4 | (Best Paper Award Candidate) ARE ANALYTICAL TECHNIQUES WORTHWHILE FOR ANALOG IC PLACEMENT? Speaker: Yishuang Lin, Texas A&M University, US Authors: Yishuang Lin1, Yaguang Li1, Donghao Fang1, Meghna Madhusudan2, Sachin S. Sapatnekar2, Ramesh Harjani2 and Jiang Hu1 1Texas A&M University, US; 2University of Minnesota, US Abstract Analytical techniques have long been a prevailing approach to digital IC placement due to their advantage in handling huge problem sizes. Recently, they were adopted for analog IC placement, where prior methods were mostly based on simulated annealing. However, a comparative study between the two approaches has been lacking. Moreover, the impact of different analytical techniques is not clear. This work attempts to shed light on both issues by studying existing methods and developing a new analytical technique. Circuit performance is a critical concern for automated analog layout. To this end, we propose a performance driven analytical analog placement technique, which has not been studied in the past to the best of our knowledge. Experiments were performed on various testcase circuits. For the conventional formulation without considering performance, the proposed analytical technique achieves 55 times speedup and 12% wirelength reduction compared to simulated annealing. For performance driven placement, the proposed technique outperforms simulated annealing in terms of circuit performance, area and runtime. Moreover, the proposed technique generally provides better solution quality than a recent previous analytical technique. |
14:46 CET | 11.1.5 | ROUTABILITY-AWARE PLACEMENT FOR ADVANCED FINFET MIXED-SIGNAL CIRCUITS USING SATISFIABILITY MODULO THEORIES Speaker: Hao Chen, University of Texas at Austin, US Authors: Hao Chen1, Walker Turner2, David Z. Pan1 and Haoxing Ren2 1University of Texas at Austin, US; 2NVIDIA Corporation, US Abstract Due to the increasingly complex design rules and geometric layout constraints within advanced FinFET nodes, automated placement of full-custom analog/mixed-signal (AMS) designs has become increasingly challenging. Compared with traditional planar nodes, AMS circuit layout is dramatically different for FinFET technologies due to strict design rules and grid-based restrictions for both placement and routing. This limits previous analog placement approaches in effectively handling all of the new constraints while adhering to the new layout style. Additionally, limited work has demonstrated effective routability modeling, which is crucial for successful routing. This paper presents a robust analog placement framework using satisfiability modulo theories (SMT) for efficient constraint handling and routability modeling. Experimental results based on industrial designs show the effectiveness of the proposed framework in optimizing placement metrics while satisfying the specified constraints. |
14:50 CET | 11.1.6 | CONSTRUCTIVE COMMON-CENTROID PLACEMENT AND ROUTING FOR BINARY-WEIGHTED CAPACITOR ARRAYS Speaker: Nibedita Karmokar, University of Minnesota, Twin Cities, US Authors: Nibedita Karmokar, Arvind Kumar Sharma, Jitesh Poojary, Meghna Madhusudan, Ramesh Harjani and Sachin S. Sapatnekar, University of Minnesota, US Abstract The accuracy and linearity of capacitive digital-to-analog converters (DACs) depend on precise capacitor ratios, but these ratios are perturbed by process variations and parasitics. This paper develops fast constructive procedures for common-centroid placement and routing for binary-weighted capacitors in charge-sharing DACs. Parasitics also degrade the switching speed of a capacitor array, particularly in FinFET nodes with severe wire/via resistances. To overcome this, the capacitor array is placed and routed to optimize switching speed, measured by the 3dB frequency. A balance between 3dB frequency and DAC INL/DNL is shown by trading off via counts with dispersion. The approach delivers high-quality results with low runtimes. |
14:54 CET | 11.1.7 | Q&A SESSION Authors: Manuel Barragan1 and Lars Hedrich2 1Universite Grenoble Alpes, CNRS, Grenoble INP, TIMA, FR; 2Goethe University of Frankfurt/Main, DE Abstract Questions and answers with the authors |
11.2 Approximate Computing Everywhere
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Jie Han, University of Alberta, CA
Session co-chair:
Ilaria Scarabottolo, Università della Svizzera, CH
New automated synthesis and optimization methods targeting approximate circuits are presented in the first part of this session. Application papers then deal with new approximation techniques developed for deep neural network accelerators, printed circuit optimization, speech processing, and approximate solutions for stochastic computing. The first paper introduces a new logic synthesis method, which utilizes formal verification engines to generate approximate circuits satisfying quality constraints by construction. The second paper presents a method for optimizing approximate compressor trees in multipliers. The third paper addresses a method developed to automatically generate approximate low-power deep learning accelerators based on TPUs. A new application of approximate computing – the optimization of printed circuits – is introduced in the fourth paper. The fifth paper proposes a speech recognition ASIC based on a target-separable binarized weight network, capable of performing speaker verification and keyword spotting. The authors of the last paper combine approximate and stochastic computing principles in coarse-grained reconfigurable architectures to reduce circuit complexity and power consumption. IP papers deal with a probabilistic-oriented approximate computing method for DNN accelerators and a Learned Approximate Computing method capable of tuning the application parameters to maximize the output quality without changing the computation.
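As a frame of reference for the error metrics that recur in this session (for example the mean error distance of an approximate multiplier), the sketch below exhaustively evaluates a deliberately naive truncation-based approximate multiplier in Python. It is a generic illustration with assumed bit widths, not a design from any of the papers:

```python
# Exhaustive error evaluation of a toy approximate multiplier (illustration only).
# The "approximation" simply zeroes the k least-significant bits of both operands
# before multiplying, a deliberately naive stand-in for the compressor-tree and
# synthesis techniques discussed in the papers of this session.

BITS, K = 8, 2                      # 8-bit operands, 2 truncated LSBs (assumed)
MASK = ~((1 << K) - 1)

def approx_mul(a, b):
    return (a & MASK) * (b & MASK)

errors = [abs(a * b - approx_mul(a, b))
          for a in range(1 << BITS) for b in range(1 << BITS)]

med = sum(errors) / len(errors)     # mean error distance over all input pairs
wce = max(errors)                   # worst-case error
print(f"MED = {med:.1f}, worst-case error = {wce}")
```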
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 11.2.1 | MUSCAT: MUS-BASED CIRCUIT APPROXIMATION TECHNIQUE Speaker: Linus Witschen, Paderborn University, DE Authors: Linus Witschen, Tobias Wiersema, Matthias Artmann and Marco Platzner, Paderborn University, DE Abstract Many applications show an inherent resiliency against inaccuracies and errors in their computations. The design paradigm approximate computing exploits this fact by trading off the application’s accuracy against a target metric, e.g., hardware area. This work focuses on approximate computing on the hardware level, where approximate logic synthesis seeks to generate approximate circuits under user-defined quality constraints. We propose the novel approximate logic synthesis method MUSCAT to generate approximate circuits which are valid-by-construction. MUSCAT inserts cutpoints into the netlist to employ the commonly-used concept of substituting connections between gates by constant values, which offers potential for subsequent logic minimization. MUSCAT’s novelty lies in utilizing formal verification engines to identify minimal unsatisfiable subsets. These subsets determine a maximal number of cutpoints that can be activated together without resulting in a violation against the user-defined quality constraints. As a result, MUSCAT determines an optimal solution w.r.t. the number of activated cutpoints while providing a guarantee on the quality constraints. We present the method and experimentally compare MUSCAT’s open-source implementation to AIG rewriting and components from the EvoApproxLib. We show that our method improves upon these state-of-the-art methods by achieving up to 80 % higher savings in circuit area at typically much lower computation times. |
14:34 CET | 11.2.2 | OPACT: OPTIMIZATION OF APPROXIMATE COMPRESSOR TREE FOR APPROXIMATE MULTIPLIER Speaker: Xiao Weihua, Shanghai Jiao Tong University, CN Authors: Weihua Xiao1, Cheng Zhuo2 and Weikang Qian1 1Shanghai Jiao Tong University, CN; 2Zhejiang University, CN Abstract Approximate multipliers have attracted significant attention from researchers for designing low-power systems. The most area-consuming part of a multiplier is its compressor tree (CT). Hence, the prior works proposed various approximate compressors to reduce the area of the CT. However, the compression strategy for the approximate compressors has not been systematically studied: Most of the prior works apply their ad hoc strategies to arrange approximate compressors. In this work, we propose OPACT, a method for optimizing the approximate compressor tree of an approximate multiplier. An integer linear programming problem is first formulated to co-optimize CT’s area and error. Moreover, since different connection orders of the approximate compressors can affect the error of an approximate multiplier, we formulate another mixed-integer programming problem for optimizing the connection order. The experimental results showed that OPACT can produce approximate multipliers with an average reduction of 24.4% and 8.4% in power-delay product and mean error distance, respectively, compared to the best existing designs with the same types of approximate compressors used. |
14:38 CET | 11.2.3 | LEARNING TO DESIGN ACCURATE DEEP LEARNING ACCELERATORS WITH INACCURATE MULTIPLIERS Speaker: Paras Jain, UC Berkeley, US Authors: Paras Jain1, Safeen Huda2, Martin Maas3, Joseph Gonzalez1, Ion Stoica1 and Azalia Mirhoseini4 1UC Berkeley, US; 2University of Toronto, CA; 3Google, Inc., US; 4Google, US Abstract Approximate computing is a promising way to improve the power efficiency of deep learning. While recent work proposes new arithmetic circuits (adders and multipliers) that consume substantially less power at the cost of computation errors, these approximate circuits decrease the end-to-end accuracy of common models. We present AutoApprox, a framework to automatically generate approximate low-power deep learning accelerators without any accuracy loss. AutoApprox generates a wide range of approximate ASIC accelerators with a TPUv3 systolic-array template. AutoApprox uses a learned router to assign each DNN layer to an approximate systolic array from a bank of arrays with varying approximation levels. By tailoring this routing for a specific neural network architecture, we discover circuit designs without the accuracy penalty from prior methods. Moreover, AutoApprox optimizes for the end-to-end performance, power and area of the whole chip and PE mapping rather than simply measuring the performance of the arithmetic units in isolation. To our knowledge, our work is the first to demonstrate the effectiveness of custom-tailored approximate circuits in delivering significant chip-level energy savings with zero accuracy loss on a large-scale dataset such as ImageNet. AutoApprox synthesizes a novel approximate accelerator based on the TPU that reduces end-to-end power consumption by 3.2% and area by 5.2% at a sub-10nm process with no degradation in ImageNet validation top-1 and top-5 accuracy. |
14:42 CET | 11.2.4 | (Best Paper Award Candidate) CROSS-LAYER APPROXIMATION FOR PRINTED MACHINE LEARNING CIRCUITS Speaker: Giorgos Armeniakos, NTUA / KIT, GR Authors: Giorgos Armeniakos1, Georgios Zervakis2, Dimitrios Soudris3, Mehdi Tahoori2 and Joerg Henkel4 1National Technical University of Athens, GR; 2Karlsruhe Institute of Technology, DE; 3National TU Athens, GR; 4Karlsruhe Institute of Technology, DE Abstract Printed electronics (PE) feature low non-recurring engineering costs and low per unit-area fabrication costs, thus enabling extremely low-cost and on-demand hardware. Such low-cost fabrication allows for high customization that would be infeasible in silicon, and bespoke architectures prevail to improve the efficiency of emerging PE machine learning (ML) applications. However, even with bespoke architectures, the large feature sizes in PE constrain the complexity of the ML models that can be implemented. In this work, we bring together, for the first time, approximate computing and PE design, aiming to enable complex ML models, such as Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs), in PE. To this end, we propose and implement a cross-layer approximation, tailored for bespoke ML architectures. At the algorithmic level we apply a hardware-driven coefficient approximation of the ML model and at the circuit level we apply a netlist pruning through a full search exploration. In our extensive experimental evaluation we consider 14 MLPs and SVMs and evaluate more than 4300 approximate and exact designs. Our results demonstrate that our cross-layer approximation delivers Pareto optimal designs that, compared to the state-of-the-art exact designs, feature 47% and 44% average area and power reduction, respectively, and less than 1% accuracy loss. |
14:46 CET | 11.2.5 | A TARGET-SEPARABLE BWN INSPIRED SPEECH RECOGNITION PROCESSOR WITH LOW-POWER PRECISION-ADAPTIVE APPROXIMATE COMPUTING Speaker: Bo Liu, Southeast University, CN Authors: Bo Liu1, Hao Cai1, Xuan Zhang1, Haige Wu1, Anfeng Xue1, Zilong Zhang1, Zhen Wang2 and Jun Yang1 1Southeast University, CN; 2Nanjing Prochip Electronic Technology Co. Ltd, CN Abstract This paper proposes a speech recognition processor based on a target-separable binarized weight network (BWN), capable of performing both speaker verification (SV) and keyword spotting (KWS). In a traditional speech recognition system, the SV based on a traditional model and the KWS based on a neural network (NN) model are two independent hardware modules. In this work, both SV and KWS are processed by the proposed BWN with a unified training and optimization framework which can be applied to various application scenarios. Through system-architecture co-design, SV and KWS share most of the feature extraction network parameters, and the classification part is calculated separately according to the different targets. An energy-efficient NN accelerator which can be dynamically reconfigured to process different layers of the BWN with splitting calculation of the frequency domain convolution is proposed. SV and KWS can be achieved with only a single calculation per input speech frame, which greatly improves the computing energy efficiency. The computing units of the NN accelerator are optimized using a precision-adaptive approximate addition tree architecture with a Dual-VDD method to further reduce the energy cost. Compared to the state of the art, this work can achieve about 4x reduction in power consumption while maintaining high system adaptability and accuracy. |
14:50 CET | 11.2.6 | TOWARDS ENERGY-EFFICIENT CGRAS VIA STOCHASTIC COMPUTING Speaker: Bo Wang, Chongqing University, CN Authors: Bo Wang1, Rong Zhu1, Jiaxing Shang2 and Dajiang Liu1 1Chongqing University, CN; 2Chongqing University, CN Abstract Stochastic computing (SC) is a promising computing paradigm for low-power and low-cost applications with the added benefit of high error tolerance. Meanwhile, Coarse-Grained Reconfigurable Architecture (CGRA) is also a promising platform for domain-specific applications for its combination of energy efficiency and flexibility. Intuitively, introducing SC to CGRA would synergistically reinforce the strengths of both paradigms. Accordingly, this paper proposes an SC-based CGRA by replacing the exact multiplication in a traditional CGRA with an SC-based multiplication, where both accuracy and latency are improved using parallel stochastic sequence generators and leading-zero shifters. In addition, with the flexible connections among PEs, high-accuracy operation can be easily achieved by combining neighboring PEs without switching costs like power-gating. Compared to the state-of-the-art approximate computing design of CGRA, our proposed CGRA achieves 16% more energy reduction and a 34% energy efficiency improvement while keeping high configuration flexibility. |
14:54 CET | 11.2.7 | Q&A SESSION Authors: Jie Han1 and Ilaria Scarabottolo2 1University of Alberta, CA; 2USI Lugano, CH Abstract Questions and answers with the authors |
11.3 Advanced Mapping and Optimization for Emerging ML Hardware
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Jan Moritz Joseph, RWTH AACHEN, DE
Session co-chair:
Elnaz Ansari, Meta/Facebook, US
In this session, we present six papers on advanced optimization techniques for model-mapping-hardware co-design. The first paper focuses on optimizing the performance of sparse ML models with conventional DRAM. The second broadens the scope to emerging PIM architectures. The third introduces a latency model for diverse hardware architectures. The last three papers introduce novel evolutionary/genetic algorithmic methods for co-optimizing the model, mapping, and hardware.
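To give a concrete flavour of analytical cost models and evolutionary mapping search (without reproducing any of the frameworks below), here is a minimal Python sketch that evolves loop-tiling factors against an invented cycle-count model; all dimensions, buffer sizes, and cost terms are assumed:

```python
import random

# Toy mapping search (illustration only): choose tiling factors for a GEMM-like
# layer so that a tile fits in a small on-chip buffer while keeping a coarse
# analytical cycle estimate low. All model parameters below are assumed.

M, N, K = 256, 256, 256            # layer dimensions
PES, BUFFER = 64, 4096             # number of PEs and buffer capacity in words
FACTORS = (2, 4, 8, 16, 32, 64)    # candidate tiling factors (all divide 256)

def cycles(tm, tn, tk):
    """Very coarse analytical cost: parallel compute time plus a refill penalty."""
    tile_words = tm * tk + tk * tn + tm * tn
    if tile_words > BUFFER:
        return float("inf")                      # mapping does not fit on chip
    n_tiles = (M // tm) * (N // tn) * (K // tk)
    compute = (tm * tn * tk) / PES               # ideal compute cycles per tile
    refill = tile_words                          # one word per cycle to refill the buffer
    return n_tiles * (compute + refill)

def mutate(parent):
    return tuple(g if random.random() < 0.7 else random.choice(FACTORS) for g in parent)

# Simple elitist evolutionary loop over (tm, tn, tk) mappings.
population = [tuple(random.choice(FACTORS) for _ in range(3)) for _ in range(16)]
for _ in range(30):
    parents = sorted(population, key=lambda m: cycles(*m))[:4]
    population = parents + [mutate(random.choice(parents)) for _ in range(12)]

best = min(population, key=lambda m: cycles(*m))
print("best tiling (tm, tn, tk):", best, "-> estimated cycles:", cycles(*best))
```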
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 11.3.1 | DASC: A DRAM DATA MAPPING METHODOLOGY FOR SPARSE CONVOLUTIONAL NEURAL NETWORKS Speaker: Bo-Cheng Lai, National Yang Ming Chiao Tung University, TW Authors: Bo-Cheng Lai1, Tzu-Chieh Chiang1, Po-Shen Kuo1, Wan-Ching Wang1, Yan-Lin Hung1, Hung-Ming Chen2, Chien-Nan Liu1 and Shyh-Jye Jou1 1National Yang Ming Chiao Tung University, TW; 2Institute of Electronics, National Chiao Tung University, TW Abstract Transferring the sheer model size of a CNN (convolutional neural network) has become one of the main performance challenges in modern intelligent systems. Although pruning can trim down a substantial amount of non-effective neurons, the excessive DRAM accesses of the non-zero data in a sparse network still dominate the overall system performance. Proper data mapping can enable efficient DRAM accesses for a CNN. However, previous DRAM mapping methods focus on dense CNNs and become less effective when handling the compressed format and irregular accesses of sparse CNNs. The extensive design space search for mapping parameters also results in a time-consuming process. This paper proposes DASC: a DRAM data mapping methodology for sparse CNNs. DASC is designed to handle the data patterns and block schedule of sparse CNNs to attain good spatial locality and efficient DRAM accesses. The bank-group feature in modern DDR is further exploited to enhance processing parallelism. DASC also introduces an analytical model to facilitate fast exploration and quick convergence of parameter search in minutes instead of the days needed by previous work. When compared with the state-of-the-art, DASC decreases the total DRAM latencies and attains an average of 17.1x, 14.3x, and 14.6x better DRAM performance for sparse AlexNet, VGG-16, and ResNet-50, respectively. |
14:34 CET | 11.3.2 | VW-SDK: EFFICIENT CONVOLUTIONAL WEIGHT MAPPING USING VARIABLE WINDOWS FOR PROCESSING-IN-MEMORY ARCHITECTURES Speaker: Johnny Rhe, Sungkyunkwan University, KR Authors: Johnny Rhe, Sungmin Moon and Jong Hwan Ko, Sungkyunkwan University, KR Abstract With their high energy efficiency, processing-in-memory (PIM) arrays are increasingly used for convolutional neural network (CNN) inference. In PIM-based CNN inference, the computational latency and energy are dependent on how the CNN weights are mapped to the PIM array. A recent study proposed shifted and duplicated kernel (SDK) mapping that reuses the input feature maps with a unit of a parallel window, which is convolved with duplicated kernels to obtain multiple output elements in parallel. However, the existing SDK-based mapping algorithm does not always result in the minimum computing cycles because it only maps a square-shaped parallel window with the entire channels. In this paper, we introduce a novel mapping algorithm called variable-window SDK (VW-SDK), which adaptively determines the shape of the parallel window that leads to the minimum computing cycles for a given convolutional layer and PIM array. By allowing rectangular-shaped windows with partial channels, VW-SDK utilizes the PIM array more efficiently, thereby further reducing the number of computing cycles. The simulation with a 512x512 PIM array and Resnet-18 shows that VW-SDK improves the inference speed by 1.69x compared to the existing SDK-based algorithm. |
14:38 CET | 11.3.3 | A UNIFORM LATENCY MODEL FOR DNN ACCELERATORS WITH DIVERSE ARCHITECTURES AND DATAFLOWS Speaker: Linyan Mei, KU Leuven, CN Authors: Linyan Mei1, Huichu Liu2, Tony Wu3, H. Ekin Sumbul2, Marian Verhelst1 and Edith Beigne2 1KU Leuven, BE; 2Facebook Inc., US; 3Meta/Facebook, US Abstract In the early design phase of a Deep Neural Network (DNN) acceleration system, fast energy and latency estimation are important to evaluate the optimality of different design candidates on algorithm, hardware, and algorithm-to-hardware mapping, given the gigantic design space. This work proposes a uniform intra-layer analytical latency model for DNN accelerators that can be used to evaluate diverse architectures and dataflows. It employs a 3-step approach to systematically estimate the latency breakdown of different system components, capture the operation state of each memory component, and identify stall-induced performance bottlenecks. To achieve high accuracy, different memory attributes, operands' memory sharing scenarios, as well as dataflow implications have been taken into account. Validation against an in-house taped-out accelerator across various DNN layers has shown an average latency model accuracy of 94.3%. To showcase the capability of the proposed model, we carry out 3 case studies to assess respectively the impact of mapping, workloads, and diverse hardware architectures on latency, driving design insights for algorithm-hardware-mapping co-optimization. |
14:42 CET | 11.3.4 | MEDEA: A MULTI-OBJECTIVE EVOLUTIONARY APPROACH TO DNN HARDWARE MAPPING Speaker: Enrico Russo, University of Catania, IT Authors: Enrico Russo1, Maurizio Palesi1, Salvatore Monteleone2, Davide Patti1, Giuseppe Ascia1 and Vincenzo Catania1 1University of Catania, IT; 2Università Niccolò Cusano, IT Abstract Embedded domain-specific accelerators for Deep Neural Networks (DNNs) enable inference on resource-constrained devices. Making optimal design choices and efficiently scheduling neural network algorithms on these specialized architectures is challenging. Many choices can be made to schedule computation spatially and temporally on the accelerator. Each choice influences the access pattern to the buffers of the architectural hierarchy, affecting the energy and latency of the inference. Each mapping also requires specific buffer capacities and a number of spatial component instances that translate into different chip area occupation. The space of possible combinations, the mapping space, is so large that automatic tools are needed for its rapid exploration and simulation. This work presents MEDEA, an open-source multi-objective evolutionary-algorithm-based approach to DNN accelerator mapping space exploration. MEDEA leverages the Timeloop analytical cost model. Unlike other schedulers that optimize towards a single objective, MEDEA allows deriving the Pareto set of mappings to optimize towards multiple, sometimes conflicting, objectives simultaneously. We found that the solutions found by MEDEA dominate, in most cases, those found by state-of-the-art mappers. |
14:46 CET | 11.3.5 | DIGAMMA: DOMAIN-AWARE GENETIC ALGORITHM FOR HW-MAPPING CO-OPTIMIZATION FOR DNN ACCELERATORS Speaker: Sheng-Chun Kao, Georgia Institute of Technology, US Authors: Sheng-Chun Kao1, Michael Pellauer2, Angshuman Parashar2 and Tushar Krishna1 1Georgia Institute of Technology, US; 2Nvidia, US Abstract The design of DNN accelerators includes two key parts: HW resource configuration and mapping strategy. Intensive research has been conducted to optimize each of them independently. Unfortunately, optimizing both together is extremely challenging due to the large cross-coupled search space. To address this, in this paper, we propose a HW-Mapping co-optimization framework, an efficient encoding of the immense design space constructed by HW and Mapping, and a domain-aware genetic algorithm, named DiGamma, with specialized operators for improving search efficiency. We evaluate DiGamma with seven popular DNN models with different properties. Our evaluations show DiGamma can achieve (geomean) 3.0x and 10.0x speedup, compared to the best-performing baseline optimization algorithms, in edge and cloud settings. |
14:50 CET | 11.3.6 | (Best Paper Award Candidate) ANACONGA: ANALYTICAL HW-CNN CO-DESIGN USING NESTED GENETIC ALGORITHMS Speaker: Nael Fasfous, TU Munich, DE Authors: Nael Fasfous1, Manoj Rohit Vemparala2, Alexander Frickenstein2, Emanuele Valpreda3, Driton Salihu1, Julian Höfer4, Anmol Singh2, Naveen-Shankar Nagaraja2, Hans-Joerg Voegel2, Nguyen Anh Vu Doan1, Maurizio Martina3, Juergen Becker4 and Walter Stechele1 1TU Munich, DE; 2BMW Group, DE; 3Politecnico di Torino, IT; 4Karlsruhe Institute of Technology, DE Abstract We present AnaCoNGA, an analytical co-design methodology, which enables two genetic algorithms to evaluate the fitness of design decisions on layer-wise quantization of a neural network and hardware (HW) resource allocation. We embed a hardware architecture search (HAS) algorithm into a quantization strategy search (QSS) algorithm to evaluate the hardware design Pareto-front of each considered quantization strategy. We harness the speed and flexibility of analytical HW-modeling to enable parallel HW-CNN co-design. With this approach, the QSS is focused on seeking high-accuracy quantization strategies which are guaranteed to have efficient hardware designs at the end of the search. Through AnaCoNGA, we improve the accuracy by 2.88 p.p. with respect to a uniform 2-bit ResNet20 on CIFAR-10, and achieve a 35% and 37% improvement in latency and DRAM accesses, while reducing LUT and BRAM resources by 9% and 59% respectively, when compared to a standard edge variant of the accelerator. |
14:54 CET | 11.3.7 | Q&A SESSION Authors: Jan Moritz Joseph1 and Elnaz Ansari2 1RWTH Aachen University, DE; 2Meta/Facebook, US Abstract Questions and answers with the authors |
11.4 Reconfigurable Systems
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Michaela Blott, Xilinx, IE
Session co-chair:
Shreejith Shanker, Trinity College Dublin, IE
This session presents six papers, three of which discuss innovative applications including adaptive CNN acceleration for edge scenarios, filtering for big data applications, and a graph processing accelerator. Two papers explore extensions to CGRA hardware to enable improved mapping, and a fast mapping algorithm for CGRAs. Finally, one paper explores technology mapping for FPGAs based on And-Inverter Cones.
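As a purely software analogue of the raw-filtering idea in paper 11.4.2 below, the following Python sketch shows an approximate pre-filter that discards JSON records before full parsing while tolerating false positives; it only illustrates the concept and is not the FPGA design itself:

```python
import json

# Approximate "raw filter" (software illustration of the concept in 11.4.2):
# cheap substring tests drop most irrelevant records before the costly parse.
# False positives are acceptable -- the exact predicate is re-checked after
# parsing -- but false negatives must never occur.

records = [
    '{"sensor": "temp", "value": 17, "room": "lab"}',
    '{"sensor": "humidity", "value": 55, "room": "lab"}',
    '{"sensor": "temp", "value": 23, "room": "office"}',
]

def raw_filter(line):
    """Keep the line only if it could match sensor == "temp"."""
    return '"temp"' in line            # may over-approximate, never under-approximates

def exact_predicate(obj):
    return obj.get("sensor") == "temp"

matches = [json.loads(line) for line in records if raw_filter(line)]
matches = [obj for obj in matches if exact_predicate(obj)]   # remove false positives
print(matches)
```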
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 11.4.1 | (Best Paper Award Candidate) ADAFLOW: A FRAMEWORK FOR ADAPTIVE DATAFLOW CNN ACCELERATION ON FPGAS Speaker: Guilherme Korol, Federal University of Rio Grande do Sul - Brazil, BR Authors: Guilherme Korol1, Michael Jordan2, Mateus Beck Rutzig3 and Antonio Carlos Schneider Beck1 1Universidade Federal do Rio Grande do Sul, BR; 2UFRGS, BR; 3UFSM, BR Abstract To meet latency and privacy requirements, resource-hungry deep learning applications have been migrating to the Edge, where IoT devices can offload the inference processing to local Edge servers. Since FPGAs have successfully accelerated an increasing number of deep learning applications (especially CNN-based ones), they emerge as an effective alternative for Edge platforms. However, Edge applications may present highly unpredictable workloads, requiring runtime adaptability in the inference processing. Although some works apply model switching on CPU and GPU platforms by exploiting different pruning rates at runtime, so the inference can adapt according to some quality-performance trade-off, FPGA-based accelerators refrain from this approach since they are synthesized to specific CNN models. In this context, this work enables model switching on FPGAs by adding to the well-known FINN accelerator an extra level of adaptability (i.e., flexibility) and support to the dynamic use of pruning via fast model switch on flexible accelerators, at the cost of some extra logic, or via FPGA reconfigurations of fixed accelerators. From that, we developed AdaFlow: a framework that automatically builds, at design time, a library from these new available versions (flexible and fixed, pruned or not) that will be used, at runtime, to dynamically select a given version according to a user-configurable accuracy threshold and current workload conditions. We have evaluated AdaFlow under a smart Edge surveillance application with two CNN models and two datasets, showing that AdaFlow processes, on average, 1.3x more inferences and increases, on average, 1.4x the power efficiency over state-of-the-art statically deployed dataflow accelerators. |
14:34 CET | 11.4.2 | RAW FILTERING OF JSON DATA ON FPGAS Speaker: Tobias Hahn, FAU, DE Authors: Tobias Hahn, Andreas Becher, Stefan Wildermann and Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Abstract Many Big Data applications include the processing of data streams on semi-structured data formats such as JSON. A disadvantage of such formats is that an application may spend a significant amount of processing time just on unselectively parsing all data. To alleviate this issue, the concept of raw filtering is proposed with the idea to remove data from a stream prior to the costly parsing stage. However, as accurate filtering of raw data is often only possible after the data has been parsed, raw filters are designed to be approximate in the sense of allowing false-positives in order to be implemented efficiently. Contrary to previously proposed CPU-based raw filtering techniques that are restricted to string matching, we present FPGA-based primitives for filtering strings, numbers and also number ranges. In addition, a primitive respecting the basic structure of JSON data is proposed that can be used to further increase the accuracy of introduced raw filters. The proposed raw filter primitives are designed to allow for their composition according to a given filter expression of a query. Thus, complex raw filters can be created for FPGAs which enable a drastic decrease in the amount of generated false-positives, particularly for IoT workloads. As there exists a trade-off between accuracy and resource consumption, we evaluate primitives as well as composed raw filters using different queries from the RiotBench benchmark. Our results show that up to 94.3% of the raw data can be filtered without producing any observed false-positives using only a few hundred LUTs. |
14:38 CET | 11.4.3 | GRAPHWAVE: A HIGHLY-PARALLEL COMPUTE-AT-MEMORY GRAPH PROCESSING ACCELERATOR Speaker: Jinho Lee, National University of Singapore, SG Authors: Jinho Lee, Burin Amornpaisannon, Tulika Mitra and Trevor E. Carlson, National University of Singapore, SG Abstract The fast, efficient processing of graphs is needed to quickly analyze and understand connected data, from large social network graphs, to edge devices performing timely, local data analytics. But, as graph data tends to exhibit poor locality, designing graph accelerators that are both high-performance and efficient has been difficult. In this work, GraphWave, we take a different approach compared to previous research and focus on maximizing accelerator parallelism with a compute-at-memory approach, where each vertex is paired with a dedicated functional unit. We also demonstrate that this work can improve performance and efficiency by optimizing the accelerator's interconnect with multi-level multi-casting to minimize congestion. Taken together, this work achieves, to the best of our knowledge, a state-of-the-art efficiency of up to 63.94 GTEPS/W with a throughput of 97.80 GTEPS (billion traversed edges per second). |
14:42 CET | 11.4.4 | RF-CGRA: A ROUTING-FRIENDLY CGRA WITH HIERARCHICAL REGISTER CHAINS Speaker: Dajiang Liu, Chongqing University, CN Authors: Rong Zhu, Bo Wang and Dajiang Liu, Chongqing University, CN Abstract CGRAs are promising architectures to accelerate domain-specific applications as they combine high energy-efficiency and flexibility. With either isolated register files (RFs) or link-consuming distributed registers in each processing element (PE), existing CGRAs are not friendly to data routing for data-flow graphs (DFGs) with a high edge/node ratio, since there are many multi-cycle dependences. To this end, this paper proposes a Routing-Friendly CGRA (RF-CGRA) where hierarchical (intra-PE or inter-PE) register chains can be achieved for data routing both flexibly (with a wide range of chain lengths) and compactly (consuming fewer links among PEs), resulting in a new mapping problem that requires an improved compiler. Experimental results show that RF-CGRA achieves 1.19X the performance and 1.14X the energy efficiency of the state-of-the-art CGRA with single-cycle multi-hop connections (HyCUBE) while keeping a moderate compilation time. |
14:46 CET | 11.4.5 | PATHSEEKER: A FAST MAPPING ALGORITHM FOR CGRAS Speaker: Mahesh Balasubramanian, Arizona State University, US Authors: Mahesh Balasubramanian and Aviral Shrivastava, Arizona State University, US Abstract Coarse-grained reconfigurable arrays (CGRAs) have gained traction over the years as a low-power accelerator due to the efficient mapping of the compute-intensive loops onto the 2-D array by the CGRA compiler. When encountering a mapping failure for a given node, existing mapping techniques either exit and retry the mapping anew, or perform backtracking, i.e., recursively remove the previously mapped node to find a valid mapping. Abandoning mapping and starting afresh can deteriorate the quality of mapping and the compilation time. Even backtracking may not be the best choice since the previous node may not be the incorrectly placed node. To tackle this issue, we propose PathSeeker -- a mapping approach that analyzes mapping failures and performs local adjustments to the schedule to obtain a mapping. Experimental results on 35 top performance-critical loops from MiBench, Rodinia, and Parboil benchmark suites demonstrate that PathSeeker can map all of them with better mapping quality and dramatically less compilation time than the previous state-of-the-art approaches -- GraphMinor and RAMP, which were unable to map 20 and 5 loops, respectively. Over these benchmarks, PathSeeker achieves 28% better performance at 550x compilation speedup over GraphMinor and 3% better performance at 10x compilation speedup over RAMP on a 4x4 CGRA. |
14:50 CET | 11.4.6 | IMPROVING TECHNOLOGY MAPPING FOR AIC-BASED FPGAS Speaker: Shubham Rai, TU Dresden, DE Authors: Martin Thümmler, Shubham Rai and Akash Kumar, TU Dresden, DE Abstract Commonly, LUTs are used in FPGAs as their main source of configurability. But these large multiplexers have only one output and their area scales exponentially with the number of inputs. As a counterpart, AND-Inverter Cones (AICs) were proposed in 2012. They are a cone-like structure of configurable gates. AICs are not as flexibly configurable as LUTs, but have multiple major benefits: First, their structure is inspired by And-Inverter Graphs, which are currently the predominant form to represent and optimize digital hardware circuits. Second, they provide multiple outputs and are intrinsically fracturable. Therefore, logic duplication can be reduced. Additionally, physical AICs can be split into multiple smaller ones without any additional hardware effort. Third, their area scales linearly with the exponentially increasing number of inputs. Additionally, a special form of AICs called Nand-Nor-Cones can be implemented very efficiently, especially for the newly emerging RFET technologies. Technology mapping is one of the crucial tasks to release the full power of AIC-based FPGAs. In this work, the current technology mapping algorithms are reviewed and the following improvements are proposed: First, instead of calculating the required time by choices, a direct required time calculation method is presented. This ensures that every node has a sensible required time assigned. Second, it is shown that the priority cut calculation method can be replaced by a much simpler direct cut selection method with reduced runtime and similar quality of results. Third, a local subgraph balancing is proposed to reduce the cone sizes to which cuts get mapped. Combining all of these improvements leads to an average area reduction of over 20% for the MCNC benchmarks compared to the previous technology mapper, while not increasing the average circuit delay. Similar improvements are presented for the VTR benchmarks. Additionally, a mapping algorithm to NNCs with three inputs per gate is provided for the first time. Finally, the technology mapper is integrated open-source into the logic synthesis and verification system ABC. |
14:54 CET | 11.4.7 | Q&A SESSION Authors: Michaela Blott1 and Shanker Shreejith2 1Xilinx, IE; 2Trinity College Dublin, IE Abstract Questions and answers with the authors |
11.5 An Industrial Perspective on Autonomous Systems Design
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Rolf Ernst, TU Braunschweig, DE
Session co-chair:
Selma Saidi, TU Dortmund, DE
This session presents four talks from industry sharing current practices and perspectives on autonomous systems and their design. The session discusses several challenges related to software architecture solutions for safe and efficient operational autonomous systems, novel rule-based methods for guaranteeing safety, and requirements on the infrastructure for autonomy, which is currently merging the CPS and IT domains.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 11.5.1 | SYMBIOTIC SAFETY: SAFE AND EFFICIENT HUMAN-MACHINE COLLABORATION BY UTILIZING RULES Speaker: Tasuku Ishigooka, Hitachi, Ltd., JP Authors: Tasuku Ishigooka, Hiroyuki Yamada, Satoshi Otsuka, Nobuyasu Kanekawa and Junya Takahashi, Hitachi, Ltd., JP Abstract Collaborative work between workers and autonomous systems in the same area is required to improve operation efficiency. However, there exist collision risks caused by the coexistence of workers and autonomous systems. The safety functions of the autonomous systems, such as emergency stops, can reduce the risks but may decrease the operation efficiency. Therefore, we propose a novel safety concept called Symbiotic Safety. The concept improves both safety and operation efficiency by transformation of the action plan, e.g., adjustment of the action plan or update of the safety rules, which reduces the frequency of risk occurrence and suppresses efficiency loss due to safety functions. In this paper, we explain the symbiotic safety technologies and share the results of an evaluation experiment using our prototype system. |
14:45 CET | 11.5.2 | A MIDDLEWARE JOURNEY FROM MICROCONTROLLERS TO MICROPROCESSORS Speaker: Alban Tamisier, Apex.AI, FR Authors: Michael Pöhnl, Alban Tamisier and Tobias Blaß, Apex.AI, DE Abstract This paper discusses some of the challenges we encountered when developing Apex.OS, an automotive grade version of the Robot Operating System (ROS) 2. To better understand these challenges, we look back at the best practices used for data communication and software execution in OSEK-based systems. Finally we describe the extensions made in ROS 2, Apex.OS and Apex.Middleware to meet the real-time constraints of the targeted automotive systems. |
15:00 CET | 11.5.3 | RELIABLE DISTRIBUTED SYSTEMS Speaker: Philipp Mundhenk, Robert Bosch GmbH, DE Authors: Philipp Mundhenk, Arne Hamann, Andreas Heyl and Dirk Ziegenbein, Robert Bosch GmbH, DE Abstract The domains of Cyber-Physical Systems (CPSs) and Information Technology (IT) are converging. Driven by the need for increased compute performance, as well as the need for increased connectivity and runtime flexibility, IT hardware, such as microprocessors and Graphics Processing Units (GPUs), as well as software abstraction layers are introduced to CPS. These systems and components are being enhanced for the execution of hard real-time applications. This enables the convergence of embedded and IT: Embedded workloads can be executed reliably on top of IT infrastructure. This is the dawn of Reliable Distributed Systems (RDSs), a technology that combines the performance and cost of IT systems with the reliability of CPSs. The Fabric is a global RDS runtime environment, weaving the interconnections between devices and enabling abstractions for compute, communication, storage, sensing & actuation. This paper outlines the vision of RDS, introduces the aspects required for implementing RDSs and the Fabric, relates existing technologies, and outlines open research challenges. |
15:15 CET | 11.5.4 | PAVE 360 - A PARADIGM SHIFT IN AUTONOMOUS DRIVING VERIFICATION WITH A DIGITAL TWIN Speaker and Author: Tapan Vikas, Siemens EDA GmbH, DE Abstract The talk will showcase the benefits of architectural exploration based on Digital Twin approaches. The challenges involved in state-of-the-art Digital Twins will be highlighted. Hardware/software co-design challenges will be discussed briefly. |
12.1 AI as a Driver for Innovative Applications
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Xun Jiao, University of Villanova, US
Session co-chair:
Srinivas Katkoori, University of South Florida, US
This session exploits different AI architectures and methodologies for creating innovative applications. These impact several fields, ranging from brain-inspired computing through the Internet of Things to Industry 4.0.
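Two of the papers below build on brain-inspired hyperdimensional computing (HDC). The following Python sketch shows the basic HDC operations (random bipolar hypervectors, binding, bundling, similarity) as textbook-style background; it is not the authors' code, and the toy record it encodes is assumed for illustration:

```python
import random

DIM = 10000                                   # typical hypervector dimensionality

def random_hv():
    """Random bipolar hypervector (+1/-1 entries)."""
    return [random.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):                               # element-wise multiply: associates two items
    return [x * y for x, y in zip(a, b)]

def bundle(vectors):                          # element-wise majority: superposes items
    return [1 if sum(col) >= 0 else -1 for col in zip(*vectors)]

def similarity(a, b):                         # normalized dot product in [-1, 1]
    return sum(x * y for x, y in zip(a, b)) / DIM

# Encode a toy record {colour: red, shape: square} and query it.
colour, shape = random_hv(), random_hv()
red, square, circle = random_hv(), random_hv(), random_hv()

record = bundle([bind(colour, red), bind(shape, square)])

# Unbinding with the "shape" key recovers something close to "square".
probe = bind(record, shape)
print("square:", round(similarity(probe, square), 2),
      " circle:", round(similarity(probe, circle), 2))
```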
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 12.1.1 | (Best Paper Award Candidate) ALGORITHM-HARDWARE CO-DESIGN FOR EFFICIENT BRAIN-INSPIRED HYPERDIMENSIONAL LEARNING ON EDGE Speaker: Yang Ni, University of California, Irvine, US Authors: Yang Ni1, Yeseong Kim2, Tajana S. Rosing3 and Mohsen Imani4 1University of California, Irvine, US; 2DGIST, KR; 3UCSD, US; 4University of California Irvine, US Abstract Machine learning methods have been widely utilized to provide high quality for many cognitive tasks. Running sophisticated learning tasks requires high computational costs to process a large amount of learning data. Brain-inspired Hyperdimensional (HD) computing is introduced as an alternative solution for lightweight learning on edge devices. However, HD computing models still rely on accelerators to ensure real-time and efficient learning. These hardware designs are not commercially available and need a relatively long period to synthesize and fabricate after deriving the new applications. In this paper, we propose an efficient framework for accelerating the HD computing at the edge by fully utilizing the available computing power. We optimize the HD computing through algorithm-hardware co-design of the host CPU and existing low-power machine learning accelerators, such as Edge TPU. We interpret the lightweight HD learning model as a hyper-wide neural network to take advantage of the accelerator and machine learning platform. We further improve the runtime cost of training by employing a bootstrap aggregating algorithm called bagging while maintaining the learning quality. We evaluate the performance of the proposed framework with several applications. Joint experiments on mobile CPU and the Edge TPU show that our framework achieves 4.5× faster training and 4.2× faster inference compared to the baseline platform. In addition, our framework achieves 19.4× faster training and 8.9× faster inference as compared to embedded ARM CPU, Raspberry Pi, that consumes similar power consumption. |
15:44 CET | 12.1.2 | POISONHD: POISON ATTACK ON BRAIN-INSPIRED HYPERDIMENSIONAL COMPUTING Speaker: Xun Jiao, Villanova University, US Authors: Ruixuan Wang1 and Xun Jiao2 1VU, US; 2Villanova University, US Abstract While machine learning (ML) methods, especially deep neural networks (DNNs), promise enormous societal and economic benefits, their deployments present daunting challenges due to intensive computational demands and high storage requirements. Brain-inspired hyperdimensional computing (HDC) has recently been introduced as an alternative computational model that mimics the "human brain" at the functionality level. HDC has already demonstrated promising accuracy and efficiency in multiple application domains including healthcare and robotics. However, the robustness and security aspects of HDC have not been systematically investigated and sufficiently examined. The poison attack is a commonly seen attack on various ML models including DNNs. It injects noise into the labels of training data to introduce classification errors in ML models. This paper presents PoisonHD, an HDC-specific poison attack framework that maximizes its effectiveness in degrading the classification accuracy by leveraging the internal structural information of HDC models. By applying PoisonHD on three datasets, we show that PoisonHD can cause a significantly greater accuracy drop on the HDC model than a random label flipping approach. We further develop a defense mechanism by designing an HDC-based data sanitization that can fully recover the accuracy loss caused by the poison attack. To the best of our knowledge, this is the first paper that studies the poison attack on HDC models. |
15:48 CET | 12.1.3 | AIME: WATERMARKING AI MODELS BY LEVERAGING ERRORS Speaker: Dhwani Mehta, University of Florida, US Authors: Dhwani Mehta, Nurun Mondol, Farimah Farahmandi and Mark Tehranipoor, University of Florida, US Abstract The recent evolution of deep neural networks (DNNs) has made it feasible to run complex data analytics tasks, which range from natural language processing and object detection to autonomous cars, artificial intelligence (AI) warfare, cloud, healthcare, industrial robots, and edge devices. The benefits of AI are indisputable. However, there are several concerns regarding the security of the deployed AI models, such as reverse engineering and Intellectual Property (IP) piracy. Accumulating a sufficiently large amount of data, then building, training, and improving the model accuracy, and finally deploying the model requires immense human and computational power, making the process expensive. Therefore, it is of utmost importance to protect the model against IP infringement. We propose AIME, a novel watermarking framework that captures model inaccuracy during the training phase and converts it into an owner-specific unique signature. The watermark is embedded within the class mispredictions of the DNN model. Watermark extraction is performed when the model is queried by an owner-specific sequence of key inputs, and the signature is decoded from the sequence of model predictions. AIME works with negligible watermark embedding runtime overhead while preserving the accurate functionality of the DNN. We have performed a comprehensive evaluation of AIME with models on the MNIST, Fashion-MNIST, and CIFAR-10 datasets and corroborated its effectiveness, robustness, and performance. |
15:52 CET | 12.1.4 | THINGNET: A LIGHTWEIGHT REAL-TIME MIRAI IOT VARIANTS HUNTER THROUGH CPU POWER FINGERPRINTING Speaker: Zhuoran Li, Old Dominion University, US Authors: Zhuoran Li and Danella Zhao, Old Dominion University, US Abstract Internet of Things (IoT) devices have become attractive targets of cyber criminals, and attackers have been leveraging these vulnerable devices most notably via the infamous Mirai-based botnets, accounting for nearly 90% of IoT malware attacks in 2020. In this work, we propose a robust, universal and non-invasive Mirai-based malware detection engine employing a compact deep neural network architecture. Our design allows programmatic collection of CPU power footprints with integrated current sensors under various device states, such as idle, service and attack. A lightweight online inference model is deployed in the CPU for on-the-fly classification. Our model is robust against noisy environments thanks to a lucid design of the noise reduction function. This work appears to be the first step towards a viable CPU malware detection engine based on power fingerprinting. The extensive simulation study under the ARM architecture that is widely used in IoT devices demonstrates a high detection accuracy of 99.1% at a latency of less than 1 ms. By analyzing Mirai-based infection under distinguishable phases for power feature extraction, our model has further demonstrated an accuracy of 96.3% on the detection of model-unknown variants. |
15:56 CET | 12.1.5 | M2M-ROUTING: ENVIRONMENTAL ADAPTIVE MULTI-AGENT REINFORCEMENT LEARNING BASED MULTI-HOP ROUTING POLICY FOR SELF-POWERED IOT SYSTEMS Speaker: Wen Zhang, Texas A&M- Corpus Christi, US Authors: Wen Zhang1, Jun Zhang2, Mimi Xie3, Tao Liu4, Wenlu Wang1 and Chen Pan5 1Texas A&M University--Corpus Christi, US; 2Harvard University, US; 3University of Texas at San Antonio, US; 4Lawrence Technological University, US; 5Texas A&M University-Corpus Christi, US Abstract Energy harvesting (EH) technologies facilitate the trending proliferation of IoT devices with sustainable power supplies. However, the intrinsic weak and unstable nature of EH results in frequent and unpredictable power interruptions in EH IoT devices, which further causes unpleasant packet loss or reconnection failures in the IoT network. Therefore, conventional routing and energy allocation methods are inefficient in EH environments. The complexity of the EH environment is a stumbling block to intelligent routing and energy allocation. To address these problems, this work proposes an environment-adaptive Deep Reinforcement Learning (DRL)-based multi-hop routing policy, M2M-Routing, to jointly optimize energy allocation and the routing policy, which conquers these challenges by leveraging offline computation resources. We prepare multiple models offline for the complicated energy harvesting environment. By searching for a similar historical power trace to identify the model ID, the prepared DRL model is selected to manage energy allocation and the routing policy for the query power traces. Simulation results indicate that M2M-Routing improves the amount of data delivery by 3 times to 4 times compared with baselines. |
16:00 CET | 12.1.6 | Q&A SESSION Authors: Xun Jiao1 and Srinivas Katkoori2 1Villanova University, US; 2University of South Florida, US Abstract Questions and answers with the authors |
12.2 Applications of optimized quantum and probabilistic circuits in emergent computing systems
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Giulia Meuli, Synopsys, IT
Session co-chair:
Yvain Thonnart, CEA, FR
Emerging computing platforms such as near-term quantum computers are currently based on the execution of circuits with a gate set that is specific to the corresponding hardware platform. These systems are still strongly impacted by noise and decoherence, and the success of such calculations depends strongly on the circuit depth and on the effort required to input data. This session discusses the use of classical and machine learning approaches to optimize these circuits both in complexity and in noise resilience.
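The session also touches on stochastic computing (paper 12.2.5). As orientation only, the Python sketch below shows the classical stochastic-computing trick of multiplying two probabilities with a bitwise AND of their random bitstreams; the stream length and operand values are assumed, and the sketch does not reflect the architecture proposed in the paper:

```python
import random

# Stochastic computing basics (generic illustration, not the architecture of 12.2.5):
# a value p in [0, 1] is encoded as a bitstream whose bits are 1 with probability p,
# and multiplication of two such values reduces to a bitwise AND of their streams.

N = 4096                                       # stream length (assumed)

def encode(p):
    return [1 if random.random() < p else 0 for _ in range(N)]

def decode(stream):
    return sum(stream) / len(stream)

a, b = 0.75, 0.40
product_stream = [x & y for x, y in zip(encode(a), encode(b))]

print("exact  :", a * b)
print("SC est.:", round(decode(product_stream), 3))   # accuracy grows with stream length
```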
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 12.2.1 | MUZZLE THE SHUTTLE: EFFICIENT COMPILATION FOR MULTI-TRAP TRAPPED-ION QUANTUM COMPUTERS Speaker: Abdullah Ash Saki, Pennsylvania State University, US Authors: Abdullah Ash- Saki1, Rasit Onur Topaloglu2 and Swaroop Ghosh1 1Pennsylvania State University, US; 2IBM, US Abstract Trapped-ion systems can have a limited number of ions (qubits) in a single trap. Increasing the qubit count to run meaningful quantum algorithms would require multiple traps where ions need to shuttle between traps to communicate. The existing compiler has several limitations, which result in a high number of shuttle operations and degraded fidelity. In this paper, we target this gap and propose compiler optimizations to reduce the number of shuttles. Our technique achieves a maximum reduction of 51.17% in shuttles (average ~ 33%) tested over 125 circuits. Furthermore, the improved compilation enhances the program fidelity up to 22.68X with a modest increase in the compilation time. |
15:44 CET | 12.2.2 | CIRCUITS FOR MEASUREMENT BASED QUANTUM STATE PREPARATION Speaker: Niels Gleinig, ETH Zurich, DE Authors: Niels Gleinig and Torsten Hoefler, ETH Zürich, CH Abstract In quantum computing, state preparation is the problem of synthesizing circuits that initialize quantum systems to specific states. It has been shown that there are states that require circuits of exponential size to be prepared (when not using measurements), and consequently, despite extensive research on this problem, the existing computer-aided design (CAD) methods produce circuits of exponential size. This is even the case for the methods that solve this problem on the important subclass of uniform states, which for example need to be prepared when using Quantum Simulated Annealing algorithms to solve combinatorial optimization problems. In this paper, we show how CAD based state preparation can be made scalable by using techniques that are unique to quantum computing: amplitude amplification, measurements, and the resulting state collapses. With this approach, we are able to produce wide classes of states in polynomial time, resulting in an exponential improvement over existing CAD methods. |
15:48 CET | 12.2.3 | OPTIC: A PRACTICAL QUANTUM BINARY CLASSIFIER FOR NEAR-TERM QUANTUM COMPUTERS Speaker: Daniel Silver, Northeastern University, US Authors: Tirthak Patel, Daniel Silver and Devesh Tiwari, Northeastern University, US Abstract Quantum computers can theoretically speed up optimization workloads such as variational machine learning and classification workloads over classical computers. However, in practice, proposed variational algorithms have not been able to run on existing quantum computers for practical-scale problems owing to their error-prone hardware. We propose OPTIC, a framework to effectively execute quantum binary classification on real noisy intermediate-scale quantum (NISQ) computers. |
15:52 CET | 12.2.4 | SCALABLE VARIATIONAL QUANTUM CIRCUITS FOR AUTOENCODER-BASED DRUG DISCOVERY Speaker: Junde Li, Pennsylvania State University, US Authors: Junde Li and Swaroop Ghosh, Pennsylvania State University, US Abstract The de novo design of drug molecules is recognized as a time-consuming and costly process, and computational approaches have been applied in each stage of the drug discovery pipeline. The variational autoencoder is one of the computer-aided design methods that explore the chemical space based on an existing molecular dataset. Quantum machine learning has emerged as an atypical learning method that may speed up some classical learning tasks because of its strong expressive power. However, near-term quantum computers suffer from a limited number of qubits, which hinders representation learning in high-dimensional spaces. We present a scalable quantum generative autoencoder (SQ-VAE) for simultaneously reconstructing and sampling drug molecules, and a corresponding vanilla variant (SQ-AE) for better reconstruction. Architectural strategies for hybrid quantum-classical networks, such as adjustable quantum layer depth, heterogeneous learning rates, and patched quantum circuits, are proposed to learn high-dimensional datasets such as ligand-targeted drugs. Extensive experimental results are reported for different dimensions including 8x8 and 32x32 after choosing suitable architectural strategies. The performance of the quantum generative autoencoder is compared with the corresponding classical counterpart throughout all experiments. The results show that quantum computing advantages can be achieved for normalized low-dimension molecules, and that high-dimension molecules generated from quantum generative autoencoders have better drug properties within the same learning period. |
15:56 CET | 12.2.5 | TOWARDS LOW-COST HIGH-ACCURACY STOCHASTIC COMPUTING ARCHITECTURE FOR UNIVARIATE FUNCTIONS: DESIGN AND DESIGN SPACE EXPLORATION Speaker: Kuncai Zhong, Shanghai Jiao Tong University, CN Authors: Kuncai Zhong, Zexi Li and Weikang Qian, Shanghai Jiao Tong University, CN Abstract Univariate functions are widely used. Several recent works propose to implement them by an unconventional computing paradigm, stochastic computing (SC). However, existing SC designs either have a high hardware cost due to the area consuming randomizer or a low accuracy. In this work, we propose a low-cost high-accuracy SC architecture for univariate functions. It consists of only a single stochastic number generator and a minimum number of D flip-flops. We also apply three methods, random number source (RNS) negating, RNS scrambling, and input scrambling, to improve the accuracy of the architecture. To efficiently configure the architecture to achieve a high accuracy, we further propose a design space exploration algorithm. The experimental results show that compared to the conventional architecture, the area of the proposed architecture is reduced by up to 76%, while its accuracy is close to or sometimes even higher than that of the conventional architecture. |
16:00 CET | 12.2.6 | Q&A SESSION Authors: Giulia Meuli1 and Yvain Thonnart2 1Synopsys, IT; 2CEA-Leti, FR Abstract Questions and answers with the authors |
12.3 Reliable, safe and approximate systems
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Angeliki Kritikakou, IRISA, FR
Session co-chair:
Marcello Traiola, INRIA, FR
This session presents techniques for reliable, safe, and approximate computing across many different architectures, ranging from traditional systems to neural network accelerators and hyper-dimensional computing.
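Several papers in this session assess resilience through fault injection. As a hedged illustration of the general idea (not the methodology of any specific paper), the snippet below flips a single bit in a float32 weight array, the kind of transient fault such campaigns emulate.

```python
# Minimal single-bit fault-injection sketch on a float32 weight tensor
# (hypothetical helper, for illustration only).
import numpy as np

def flip_bit(weights: np.ndarray, index: int, bit: int) -> np.ndarray:
    """Return a copy of `weights` with one bit flipped in element `index`."""
    faulty = weights.copy()
    as_int = faulty.view(np.uint32)       # reinterpret the float32 bit patterns
    as_int[index] ^= np.uint32(1 << bit)  # inject the transient fault
    return faulty

w = np.random.randn(1024).astype(np.float32)
w_faulty = flip_bit(w, index=42, bit=30)  # flip a high-order exponent bit
print("max abs deviation:", np.abs(w_faulty - w).max())
```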
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 12.3.1 | (Best Paper Award Candidate) DO TEMPERATURE AND HUMIDITY EXPOSURES HURT OR BENEFIT YOUR SSDS? Speaker: Adnan Maruf, Florida International University, US Authors: Adnan Maruf1, Sashri Brahmakshatriya1, Baolin Li2, Devesh Tiwari2, Gang Quan1 and Janki Bhimani1 1Florida International University, US; 2Northeastern University, US Abstract SSDs are becoming mainstream data storage devices, replacing HDDs in most data centers, consumer goods, and IoT gadgets. In this work, we ask an uncharted research question: What is the environmental conditions' impact on SSD performance? To answer it, we systematically measure, quantify, and characterize the impact of various commonly changing environmental conditions such as temperature and humidity on the performance of SSDs. Our experiments and analysis uncover that exposure to changes in temperature and humidity can significantly affect SSD performance. |
15:44 CET | 12.3.2 | SAFEDM: A HARDWARE DIVERSITY MONITOR FOR REDUNDANT EXECUTION ON NON-LOCKSTEPPED CORES Speaker: Francisco Bas, Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC), ES Authors: Francisco Bas1, Pedro Benedicte2, Sergi Alcaide1, Guillem Cabo2, Fabio Mazzocchetti2 and Jaume Abella2 1Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 2Barcelona Supercomputing Center, ES Abstract Computing systems in the safety domain, such as those in avionics or space, require specific safety measures related to the criticality of the deployment. A problem these systems face is that of transient failures in hardware. A solution commonly used to tackle potential failures is to introduce redundancy in these systems, for example 2 cores that execute the same program at the same time. However, redundancy does not solve all potential failures, such as Common Cause Failures (CCF), where a single fault affects both cores identically (e.g. a voltage droop). If both redundant cores have identical state when the fault occurs, then there may be a CCF since the fault can affect both cores in the same way. To avoid CCF it is critical to know that there is diversity in the execution amongst the redundant cores. In this paper we introduce SafeDM, a hardware Diversity Monitor that quantifies the diversity of each redundant processor to guarantee that CCF will not go unnoticed, and without needing to deploy lockstepped cores. SafeDM computes data and instruction diversity separately, using different techniques appropriate for each case. We integrate SafeDM in a RISC-V FPGA space MPSoC from Cobham Gaisler where SafeDM is proven effective with a large benchmark suite, incurring low area and power overheads. Overall, SafeDM is an effective hardware solution to quantify diversity in cores performing redundant execution. |
15:48 CET | 12.3.3 | IS APPROXIMATION UNIVERSALLY DEFENSIVE AGAINST ADVERSARIAL ATTACKS IN DEEP NEURAL NETWORKS? Speaker: Ayesha Siddique, University of Missouri, US Authors: Ayesha Siddique and Khaza Anuarul Hoque, University of Missouri, US Abstract Approximate computing is known for its effectiveness in improving the energy efficiency of deep neural network (DNN) accelerators at the cost of slight accuracy loss. Very recently, the inexact nature of approximate components, such as approximate multipliers, has also been reported successful in defending against adversarial attacks on DNN models. Since the approximation errors traverse through the DNN layers as masked or unmasked, this raises a key research question: can approximate computing always offer a defense against adversarial attacks in DNNs, i.e., are they universally defensive? Towards this, we present an extensive adversarial robustness analysis of different approximate DNN accelerators (AxDNNs) using the state-of-the-art approximate multipliers. In particular, we evaluate the impact of ten adversarial attacks on different AxDNNs using the MNIST and CIFAR-10 datasets. Our results demonstrate that adversarial attacks on AxDNNs can cause 53% accuracy loss whereas the same attack may lead to almost no accuracy loss (as low as 0.06%) in the accurate DNN. Thus, approximate computing cannot be referred to as a universal defense strategy against adversarial attacks. |
15:52 CET | 12.3.4 | RELIABILITY ANALYSIS OF A SPIKING NEURAL NETWORK HARDWARE ACCELERATOR Speaker: Theofilos Spyrou, Sorbonne University, CNRS, LIP6, FR Authors: Theofilos Spyrou1, Sarah A. Elsayed1, Engin Afacan2, Luis A. Camuñas Mesa3, Barnabé Linares-Barranco3 and Haralampos-G. Stratigopoulos1 1Sorbonne Université, CNRS, LIP6, FR; 2Gebze TU, TR; 3IMSE-CNM, CSIC, University of Sevilla, ES Abstract Despite the parallelism and sparsity in neural network models, their transfer into hardware unavoidably makes them susceptible to hardware-level faults. Hardware-level faults can occur either during manufacturing, such as physical defects and process-induced variations, or in the field due to environmental factors and aging. The performance under fault scenarios needs to be assessed so as to develop cost-effective fault-tolerance schemes. In this work, we assess the resilience characteristics of a hardware accelerator for Spiking Neural Networks (SNNs) designed in VHDL and implemented on an FPGA. The fault injection experiments pinpoint the parts of the design that need to be protected against faults, as well as the parts that are inherently fault-tolerant. |
15:56 CET | 12.3.5 | RELIABILITY OF GOOGLE’S TENSOR PROCESSING UNITS FOR EMBEDDED APPLICATIONS Speaker: Rubens Luiz Rech Junior, Institute of Informatics, UFRGS, BR Authors: Rubens Luiz Rech Junior1 and Paolo Rech2 1UFRGS, BR; 2LANL/UFRGS, US Abstract Convolutional Neural Networks (CNNs) have become the most used and efficient way to identify and classify objects in a scene. CNNs are today fundamental not only for autonomous vehicles, but also for Internet of Things (IoT) and smart cities or smart homes. Vendors are developing low-power, efficient, and low-cost dedicated accelerators to allow the execution of the computationally demanding CNNs even in embedded applications with strict power and cost budgets. Google's Coral Tensor Processing Unit (TPU) is one of the latest low-power accelerators for CNNs. In this paper we investigate the reliability of TPUs to atmospheric neutrons, reporting experimental data equivalent to more than 30 million years of natural irradiation. We analyze the behavior of TPUs executing atomic operations (standard or depthwise convolutions) with increasing input sizes as well as eight CNN designs typical of embedded applications, including transfer learning and reduced data-set configurations. We found that, despite the high error rate, most neutron-induced errors only slightly modify the convolution output and do not change the CNNs' detection or classification. By reporting details about the fault model and error rate, we provide valuable information on how to evaluate and improve the reliability of CNNs executed on a TPU. |
16:00 CET | 12.3.6 | Q&A SESSION Authors: Angeliki Kritikakou1 and Marcello Traiola2 1Univ Rennes, Inria, CNRS, IRISA, FR; 2Inria / IRISA, FR Abstract Questions and answers with the authors |
12.4 Raising Performance and Reliability of the Memory Subsystem
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Leonidas Kosmidis, Barcelona Supercomputing Center, ES
Session co-chair:
Thaleia Dimitra Doudali, IMDEA Software Institute, ES
Performance and reliability are important considerations for modern architectures. This session includes papers addressing these concerns with novel physical design paradigms in EDA and emerging memory technologies. The first three papers lie in the intersection of architecture and physical design with 3D stacking, Silicon Carbide and an automated flow for taping-out GPU designs. The next three papers include a solution for an adaptive error correction scheme in DRAM, a reduced latency logging for crash recovery in systems based on persistent memory, as well as an in-memory accelerator based on Resistive RAM for bioinformatics.
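To make the ECC discussion concrete, here is a toy single-error-correcting, double-error-detecting (SECDED) code over a 4-bit data word; real DRAM ECC, including the adaptive scheme presented in this session, works on 64-bit words and adds further mechanisms, so this is only a didactic sketch.

```python
# Toy SECDED example: Hamming(7,4) plus an overall parity bit.
def encode(d):                                   # d: list of 4 data bits
    c = [0, 0, d[0], 0, d[1], d[2], d[3]]        # codeword positions 1..7
    c[0] = c[2] ^ c[4] ^ c[6]                    # p1 covers positions 1,3,5,7
    c[1] = c[2] ^ c[5] ^ c[6]                    # p2 covers positions 2,3,6,7
    c[3] = c[4] ^ c[5] ^ c[6]                    # p3 covers positions 4,5,6,7
    overall = 0
    for b in c:
        overall ^= b                             # extra parity bit enables DED
    return c + [overall]

def correct(word):                               # word: 8 bits from encode()
    c, overall = word[:7], word[7]
    syndrome = 0
    for pos in range(1, 8):                      # recompute the syndrome
        if c[pos - 1]:
            syndrome ^= pos
    parity_ok = (sum(c) + overall) % 2 == 0
    if syndrome and not parity_ok:               # single-bit error: flip it back
        c[syndrome - 1] ^= 1
    elif syndrome and parity_ok:
        raise ValueError("double-bit error detected")
    return [c[2], c[4], c[5], c[6]]              # recovered data bits

cw = encode([1, 0, 1, 1])
cw[5] ^= 1                                       # inject a single-bit error
assert correct(cw) == [1, 0, 1, 1]
```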
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 12.4.1 | STEALTH ECC: A DATA-WIDTH AWARE ADAPTIVE ECC SCHEME FOR DRAM ERROR RESILIENCE Speaker: Young Seo Lee, Korea University, KR Authors: Young Seo Lee1, Gunjae Koo1, Young-Ho Gong2 and Sung Woo Chung1 1Korea University, KR; 2KwangWoon University, KR Abstract As DRAM process technology scales down and DRAM density continues to grow, DRAM errors have become a primary concern in modern data centers. Typically, data centers have adopted memory systems with a single error correction double error detection (SECDED) code. However, the SECDED code is not sufficient to satisfy DRAM reliability demands as memory systems get more vulnerable. Though the servers in data centers employ strong ECC schemes such as Chipkill, such ECC schemes lead to substantial performance and/or storage overhead. In this paper, we propose Stealth ECC, a cost-effective memory protection scheme providing stronger error correctability than the conventional SECDED code, with negligible performance overhead and without storage overhead. Depending on the data-width (either narrow-width or full-width), Stealth ECC adaptively selects ECC schemes. For narrow-width values, Stealth ECC provides multi-bit error correctability by storing more parity bits in MSB side, instead of zeros. Furthermore, with bitwise interleaved data placement between x4 DRAM chips, Stealth ECC is robust to a single DRAM chip error for narrow-width values. On the other hand, for full-width values, Stealth ECC adopts the SECDED code, which maintains DRAM reliability comparable to the conventional SECDED code. As a result, thanks to the reliability improvement of narrow-width values, Stealth ECC enhances overall DRAM reliability, while incurring negligible performance overhead as well as no storage overhead. Our simulation results show that Stealth ECC reduces the probability of system failure (caused by DRAM errors) by 47.9%, on average, with only 0.9% performance overhead compared to the conventional SECDED code. |
15:44 CET | 12.4.2 | ACCELERATE HARDWARE LOGGING TO EFFICIENTLY GUARANTEE PM CRASH CONSISTENCY Speaker: Zhiyuan Lu, Michigan Tech. University, US Authors: Zhiyuan Lu1, Jianhui Yue1, Yifu Deng1 and Yifeng Zhu2 1Michigan Tech. University, US; 2University of Maine, US Abstract While logging has been adopted in persistent memory (PM) to support crash consistency, logging incurs severe performance overhead. This paper discovers two common factors that contribute to the inefficiency of logging: (1) load imbalance among memory banks, and (2) constraints of intra-record ordering. Over-loaded memory banks may significantly prolong the waiting time of log requests targeting these banks. To address this issue, we propose a novel log entry allocation scheme (LALEA) that reshapes the traffic distribution over PM banks. In addition, the intra-record ordering between a header and its log entries decreases the degree of parallelism in log operations. We design a log metadata buffering scheme (BLOM) that totally eliminates the intra-record ordering constraints. These two proposed log optimizations are general and can be applied to many existing designs. We evaluate our designs using both micro-benchmarks and real PM applications. Our experimental results show that LALEA and BLOM can achieve 54.04% and 17.16% higher transaction throughput on average, compared to two state-of-the-art designs, respectively. |
15:48 CET | 12.4.3 | (Best Paper Award Candidate) MEMPOOL-3D: BOOSTING PERFORMANCE AND EFFICIENCY OF SHARED-L1 MEMORY MANY-CORE CLUSTERS WITH 3D INTEGRATION Speaker: Matheus Cavalcante, ETH Zürich, CH Authors: Matheus Cavalcante1, Anthony Agnesina2, Samuel Riedel1, Moritz Brunion3, Alberto Garcia-Ortiz4, Dragomir Milojevic5, Francky Catthoor5, Sung Kyu Lim2 and Luca Benini6 1ETH Zürich, CH; 2Georgia Tech, US; 3University of Bremen, DE; 4ITEM (U.Bremen), DE; 5IMEC, BE; 6Università di Bologna and ETH Zürich, IT Abstract Three-dimensional integrated circuits promise power, performance, and footprint gains compared to their 2D counterparts, thanks to drastic reductions in the interconnects' length through their smaller form factor. We can leverage the potential of 3D integration by enhancing MemPool, an open-source many-core design with 256 cores and a shared pool of L1 scratchpad memory connected with a low-latency interconnect. MemPool's baseline 2D design is severely limited by routing congestion and wire propagation delay, making the design ideal for 3D integration. In architectural terms, we increase MemPool's scratchpad memory capacity beyond the sweet spot for 2D designs, improving performance in a common digital signal processing kernel. We propose a 3D MemPool design that leverages a smart partitioning of the memory resources across two layers to balance the size and utilization of the stacked dies. In this paper, we explore the architectural and the technology parameter spaces by analyzing the power, performance, area, and energy efficiency of MemPool instances in 2D and 3D with 1 MiB, 2 MiB, 4 MiB, and 8 MiB of scratchpad memory in a commercial 28 nm technology node. We observe a performance gain of 9.1 % when running a matrix multiplication on the MemPool-3D design with 4 MiB of scratchpad memory compared to the MemPool 2D counterpart. In terms of energy efficiency, we can implement the MemPool-3D instance with 4 MiB of L1 memory on an energy budget 15 % smaller than its 2D counterpart, and even 3.7 % smaller than the MemPool-2D instance with one-fourth of the L1 scratchpad memory capacity. |
15:52 CET | 12.4.4 | REPAIR: A RERAM-BASED PROCESSING-IN-MEMORY ACCELERATOR FOR INDEL REALIGNMENT Speaker: Chin-Fu Nien, Academia Sinica, TW Authors: Ting Wu1, Chin-Fu Nien2, Kuang-Chao Chou3 and Hsiang-Yun Cheng2 1Electrical and Computer Engineering, Carnegie Mellon University, US; 2Academia Sinica, TW; 3Graduate Institute of Electronics Engineering, National Taiwan University, TW Abstract Genomic analysis has attracted a lot of interest recently since it is the key to realizing precision medicine for diseases such as cancer. Among all the genomic analysis pipeline stages, Indel Realignment is the most time-consuming and induces intensive data movements. Thus, we propose RePAIR, the first ReRAM-based processing-in-memory accelerator targeting the Indel Realignment algorithm. To further increase the computation parallelism, we design several mapping and scheduling optimization schemes. RePAIR achieves a 7443x speedup and is 27211x more energy-efficient than GATK3.8 running on a CPU server, significantly outperforming the state-of-the-art. |
15:56 CET | 12.4.5 | SIC PROCESSORS FOR EXTREME HIGH-TEMPERATURE VENUS SURFACE EXPLORATION Speaker: Heewoo Kim, University of Michigan, Ann Arbor, US Authors: Heewoo Kim, Javad Bagherzadeh and Ronald Dreslinski, University of Michigan, US Abstract Being the ‘sister planet’ of the Earth, surface exploration of Venus is expected to provide valuable scientific insights into the history and the environment of the Earth. Despite the benefits, the surface temperature of Venus, at 450C, poses a large challenge for any surface exploration. In particular, conventional Silicon electronics do not properly function under such high temperatures. Due to this constraint, the most prolonged previous surface exploration lasted only for 2 hours. Silicon Carbide (SiC) electronics, which can endure and function properly in high-temperature environments, is proposed as a strong candidate to be used in Venus surface explorations. However, this technology is still immature and associated with limiting factors, such as slower speed, power constraint, limited die area, and approximately 1,000 times longer channel than the state-of-the-art Si transistors. In this paper, we configure a computing infrastructure for high-temperature SiC-based technology, conduct design space exploration, and evaluate the performance of different SiC processors when used in Venus surface landers. Our evaluation shows that the SiC processor has an average 16.6X lower throughput than the RAD6000 Si processor used in the previous Mars rover. The Venus rover with SiC processor is expected to have a moving speed of 0.6 meters per hour and visual odometry processing time of 50 minutes. Lastly, we provide the design guidelines to improve the SiC processors at the microarchitecture and the instruction set architecture levels. |
16:00 CET | 12.4.6 | Q&A SESSION Authors: Leonidas Kosmidis1 and Thaleia Dimitra Doudali2 1Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 2IMDEA Software Institute, ES Abstract Questions and answers with the authors |
12.5 Bringing Robust Deep Learning to the Autonomous Edge: New Challenges and Algorithm-Hardware Solutions
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE
Session co-chair:
Chung-Wei Lin, National Taiwan University, TW
Deep neural networks (DNNs) are increasingly deployed on autonomous edge systems for many applications, such as speech recognition, image classification, and object detection. While DNNs have proven to be effective in handling these tasks, their robustness (i.e., accuracy) can suffer post-deployment at the edge. Moreover, designing robust deep learning algorithms for the autonomous edge is highly challenging because such systems are severely resource-constrained. This session includes four invited talks that present the challenges and propose novel, lightweight algorithm-hardware co-design methods to improve DNN robustness at the edge. The first paper evaluates the effectiveness of various unsupervised DNN adaptation methods on real-world edge systems and selects the best technique in terms of accuracy, performance and energy. The second paper explores a lightweight image super-resolution technique to prevent adversarial attacks, which is also characterized on an Arm neural processing unit. The third paper tackles the loss in DNN prediction accuracy in resistive memory-based in-memory accelerators by proposing a stochastic fault-tolerant training scheme. The final paper focuses on robust distributed reinforcement learning for swarm intelligence, where it analyzes and mitigates the effect of various transient/permanent faults.
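As a hedged illustration of the test-time adaptation idea discussed in the first talk (updating only batch-norm parameters with a single backpropagation pass that minimizes prediction entropy on unlabeled data), the PyTorch sketch below performs one such adaptation step; the model, batch, and learning rate are placeholders rather than the paper's setup.

```python
# One entropy-minimization adaptation step restricted to batch-norm affine
# parameters ("BN-Tune"-style); illustrative sketch only.
import torch
import torch.nn as nn

def bn_tune_step(model: nn.Module, batch: torch.Tensor, lr: float = 1e-3) -> float:
    bn_params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)) and m.weight is not None:
            bn_params += [m.weight, m.bias]       # only BN scale/shift adapt
    for p in model.parameters():
        p.requires_grad_(False)
    for p in bn_params:
        p.requires_grad_(True)

    opt = torch.optim.SGD(bn_params, lr=lr)
    logits = model(batch)                         # unlabeled test batch
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    opt.zero_grad()
    entropy.backward()                            # single backpropagation pass
    opt.step()
    return entropy.item()
```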
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 12.5.1 | UNSUPERVISED TEST-TIME ADAPTATION OF DEEP NEURAL NETWORKS AT THE EDGE: A CASE STUDY Speaker: Kshitij Bhardwaj, Lawrence Livermore National Laboratory, US Authors: Kshitij Bhardwaj, James Diffenderfer, Bhavya Kailkhura and Maya Gokhale, LLNL, US Abstract Deep learning is being increasingly used in mobile and edge autonomous systems. The prediction accuracy of deep neural networks (DNNs), however, can degrade after deployment due to encountering data samples whose distributions are different from the training samples. To continue to robustly predict, DNNs must be able to adapt themselves post-deployment. Such adaptation at the edge is challenging as new labeled data may not be available, and it has to be performed on a resource-constrained device. This paper performs a case study to evaluate the cost of test-time fully unsupervised adaptation strategies on a real-world edge platform: Nvidia Jetson Xavier NX. In particular, we adapt pretrained state-of-the-art robust DNNs (trained using data augmentation) to improve the accuracy on image classification data that contains various image corruptions. During this prediction-time on-device adaptation, the model parameters of a DNN are updated using a single backpropagation pass while optimizing entropy loss. The effects of the following three simple model updates are compared in terms of accuracy, adaptation time and energy: updating only convolutional (Conv-Tune); only fully-connected (FC-Tune); and only batch-norm parameters (BN-Tune). Our study shows that BN-Tune and Conv-Tune are more effective than FC-Tune in terms of improving accuracy for corrupted image data (average of 6.6%, 4.97%, and 4.02%, respectively, over no adaptation). However, FC-Tune leads to a significantly faster and more energy-efficient solution with a small loss in accuracy. Even when using FC-Tune, the extra overheads of on-device fine-tuning are significant when tight real-time deadlines (209 ms) must be met. This study motivates the need for designing hardware-aware robust algorithms for efficient on-device adaptation at the autonomous edge. |
15:50 CET | 12.5.2 | SUPER-EFFICIENT SUPER RESOLUTION FOR FAST ADVERSARIAL DEFENSE AT THE EDGE Speaker: Kartikeya Bhardwaj, Arm Inc., US Authors: Kartikeya Bhardwaj1, Dibakar Gope2, James Ward3, Paul Whatmough2 and Danny Loh4 1Arm Inc., US; 2Arm Research, US; 3Arm Inc., IE; 4Arm Inc., GB Abstract Autonomous systems are highly vulnerable to a variety of adversarial attacks on Deep Neural Networks (DNNs). Training-free model-agnostic defenses have recently gained popularity due to their speed, ease of deployment, and ability to work across many DNNs. To this end, a new technique has emerged for mitigating attacks on image classification DNNs, namely, preprocessing adversarial images using super resolution -- upscaling low-quality inputs into high-resolution images. This defense requires running both image classifiers and super resolution models on constrained autonomous systems. However, super resolution incurs a heavy computational cost. Therefore, in this paper, we investigate the following question: Does the robustness of image classifiers suffer if we use tiny super resolution models? To answer this, we first review a recent work called Super-Efficient Super Resolution (SESR) that achieves similar or better image quality than prior art while requiring 2x to 330x fewer Multiply-Accumulate (MAC) operations. We demonstrate that despite being orders of magnitude smaller than existing models, SESR achieves the same level of robustness as significantly larger networks. Finally, we estimate end-to-end performance of super resolution-based defenses on a commercial Arm Ethos-U55 micro-NPU. Our findings show that SESR achieves nearly 3x higher FPS than a baseline while achieving similar robustness. |
16:00 CET | 12.5.3 | FAULT-TOLERANT DEEP NEURAL NETWORKS FOR PROCESSING-IN-MEMORY BASED AUTONOMOUS EDGE SYSTEMS Speaker: Xue Lin, Northeastern University, US Authors: Siyue Wang1, Geng Yuan1, Xiaolong Ma1, Yanyu Li1, Xue Lin1 and Bhavya Kailkhura2 1Northeastern University, US; 2LLNL, US Abstract In-memory deep neural network (DNN) accelerators will be the key for energy-efficient autonomous edge systems. The resistive random access memory (ReRAM) is a potential solution for the non-CMOS-based in-memory computing platform for energy-efficient autonomous edge systems, thanks to its promising characteristics, such as near-zero leakage-power and non-volatility. However, due to the hardware instability of ReRAM, the weights of the DNN model may deviate from the originally trained weights, resulting in accuracy loss. To mitigate this undesirable accuracy loss, we propose two stochastic fault-tolerant training methods to generally improve the models' robustness without dealing with individual devices. Moreover, we propose Stability Score -- a comprehensive metric that serves as an indicator to the instability problem. Extensive experiments demonstrate that the DNN models trained using our proposed stochastic fault-tolerant training method achieve superior performance, which provides better flexibility, scalability, and deployability of ReRAM on the autonomous edge systems. |
16:10 CET | 12.5.4 | FRL-FI: TRANSIENT FAULT ANALYSIS FOR FEDERATED REINFORCEMENT LEARNING-BASED NAVIGATION SYSTEMS Speaker: Arijit Raychowdhury, Georgia Institute of Technology, US Authors: Zishen Wan1, Aqeel Anwar1, Abdulrahman Mahmoud2, Tianyu Jia3, Yu-Shun Hsiao2, Vijay Reddi2 and Arijit Raychowdhury1 1Georgia Institute of Technology, US; 2Harvard University, US; 3Carnegie Mellon University, US Abstract Swarm intelligence is being increasingly deployed in autonomous systems, such as drones and unmanned vehicles. Federated reinforcement learning (FRL), a key swarm intelligence paradigm where agents interact with their own environments and cooperatively learn a consensus policy while preserving privacy, has recently shown potential advantages and gained popularity. However, transient faults are increasing in the hardware system with continuous technology node scaling and can pose threats to FRL systems. Meanwhile, conventional redundancy-based protection methods are challenging to deploy on resource-constrained edge applications. In this paper, we experimentally evaluate the fault tolerance of FRL navigation systems at various scales with respect to fault models, fault locations, learning algorithms, layer types, communication intervals, and data types at both training and inference stages. We further propose two cost-effective fault detection and recovery techniques that can achieve up to 3.3x improvement in resilience with <2.7% overhead in FRL systems. |
16:20 CET | 12.5.5 | Q&A SESSION Authors: Dirk Ziegenbein1 and Chung-Wei Lin2 1Robert Bosch GmbH, DE; 2National Taiwan University, TW Abstract Questions and answers with the authors |
13.1 New Perspectives in Test and Diagnosis
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Melanie Schillinsky, NXP Semiconductors Germany GmbH, DE
Session co-chair:
Riccardo Cantoro, Politecnico di Torino, IT
This session covers new techniques for cell-aware test, fault modeling, test and diagnosis for hardware security primitives, machine-learning enabled diagnosis for monolithic 3D circuits, as well as static compaction for SBST in GPU architectures.
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 13.1.1 | IMPROVING CELL-AWARE TEST FOR INTRA-CELL SHORT DEFECTS Speaker: Dong-Zhen Lee, National Yang Ming Chiao Tung University, TW Authors: Dong-Zhen Li1, Ying-Yen Chen2, Kai-Chiang Wu3 and Chia-Tso Chao1 1National Yang Ming Chiao Tung University, TW; 2Realtek Semiconductor Corporation, TW; 3Department of Computer Science, National Chiao Tung University, TW Abstract Conventional fault models define their faulty behavior at the IO ports of standard cells with simple rules of fault activation and fault propagation. However, there still exist some defects inside a cell (intra-cell) that cannot be effectively detected by the test patterns of conventional fault models and hence become a source of DPPM. In order to further increase the defect coverage, many research works have been conducted to study the fault models resulting from different types of intra-cell defects, by SPICE-simulating each targeted defect with its equivalent circuit-level defect model. In this paper, we propose to improve cell-aware (CA) test methodology by concentrating on intra-cell bridging faults due to short defects inside standard cells. The faults extracted are based on examining the actual physical proximity of polygons in the layout of a cell, and are thus more realistic and reasonable than those (faults) determined by RC extraction. Experimental results on a set of industrial designs show that the proposed methodology can indeed improve the test quality of intra-cell bridging faults. On average, 0.36% and 0.47% increases in fault coverage can be obtained for 1-time-frame and 2-time-frame CA tests, respectively. In addition to short defects between two metal polygons, short defects among three metal polygons are also considered in our methodology for another 9.33% improvement in fault coverage. |
16:44 CET | 13.1.2 | APUF FAULTS: IMPACT, TESTING, AND DIAGNOSIS Speaker: Wenjing Rao, University of Illinois Chicago, US Authors: Natasha Devroye, Vincent Dumoulin, Tim Fox, Wenjing Rao and Yeqi Wei, University of Illinois at Chicago, US Abstract Arbiter Physically Unclonable Functions (APUFs) are hardware security primitives that exploit manufacturing randomness to generate unique digital fingerprints for ICs. This paper theoretically and numerically examines the impact of faults native to APUFs -- mask parameter faults from the design phase, or process variation (PV) during the manufacturing phase. We model them statistically, and explain quantitatively how these faults affect the resulting PUF bias and uniqueness. When given access to only a single PUF instance, we focus on abnormal delta elements that are outliers in magnitude, as this is how the statistically modeled faults manifest at the individual level. To detect such bad PUF instances and diagnose the abnormal delta elements, we propose a testing methodology which partitions a random set of challenges so that a specific delta element can be targeted, forming a perceivable bias in the responses over these sets. This low-cost approach is highly effective in detecting and diagnosing bad PUFs with abnormal delta element(s). |
16:48 CET | 13.1.3 | GRAPH NEURAL NETWORK-BASED DELAY-FAULT LOCALIZATION FOR MONOLITHIC 3D ICS Speaker: Shao-Chun Hung, Department of Electrical and Computer Engineering, Duke University, US Authors: Shao-Chun Hung, Sanmitra Banerjee, Arjun Chaudhuri and Krishnendu Chakrabarty, Duke University, US Abstract Monolithic 3D (M3D) integration is a promising technology for achieving high performance and low power consumption. However, the limitations of current M3D fabrication flows lead to performance degradation of devices in the top tier and unreliable interconnects between tiers. Fault localization at the tier level is therefore necessary to enhance yield learning. For example, tier-level localization can enable targeted diagnosis and process optimization efforts. In this paper, we develop a graph neural network-based diagnosis framework to efficiently localize faults to a device tier. The proposed framework can be used to provide rapid feedback to the foundry and help enhance the quality of diagnosis reports generated by commercial tools. Results for four M3D benchmarks, with and without response compaction, show that the proposed solution achieves up to 39.19% improvement in diagnostic resolution with less than 1% loss of accuracy, compared to results from commercial tools. |
16:52 CET | 13.1.4 | A COMPACTION METHOD FOR STLS FOR GPU IN-FIELD TEST Speaker: Juan David Guerrero Balaguera, Politecnico di Torino, IT Authors: Juan Guerrero Balaguera, Josie Rodriguez Condia and Matteo Sonza Reorda, Politecnico di Torino, IT Abstract Nowadays, Graphics Processing Units (GPUs) are effective platforms for implementing complex algorithms (e.g., for Artificial Intelligence) in different domains (e.g., automotive and robotics), where massive parallelism and high computational effort are required. In some domains, strict safety-critical requirements exist, mandating the adoption of mechanisms to detect faults during the operational phases of a device. An effective test solution is based on Self-Test Libraries (STLs) aiming at testing devices functionally. This solution is frequently adopted for CPUs, but can also be used with GPUs. Nevertheless, the in-field constraints restrict the size and duration of acceptable STLs. This work proposes a method to automatically compact the test programs of a given STL targeting GPUs. The proposed method combines a multi-level abstraction analysis resorting to logic simulation to extract the microarchitectural operations triggered by the test program and the information about the thread-level activity of each instruction and to fault simulation to know its ability to propagate faults to an observable point. The main advantage of the proposed method is that it requires a single fault simulation to perform the compaction. The effectiveness of the proposed approach was evaluated, resorting to several test programs developed for an open-source GPU model (FlexGripPlus) compatible with NVIDIA GPUs. The results show that the method can compact test programs by up to 98.64% in code size and by up to 98.42% in terms of duration, with minimum effects on the achieved fault coverage. |
16:56 CET | 13.1.5 | Q&A SESSION Authors: Melanie Schillinsky1 and Riccardo Cantoro2 1NXP Germany GmbH, DE; 2Politecnico di Torino, IT Abstract Questions and answers with the authors |
13.2 From system-level specification to RTL and back
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Andy Pimentel, University of Amsterdam, NL
Session co-chair:
Matthias Jung, Fraunhofer IESE, DE
This session highlights the importance of system modeling for efficient design. The first three papers showcase solutions for generating system-level models from RTL descriptions and back. The last paper presents a cost-sensitive model and learning engine for disk failure prediction that reduces misclassification costs while maintaining a high fault detection rate.
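For readers unfamiliar with cost-sensitive learning as used in the last paper, the sketch below shows the general idea with scikit-learn on synthetic data; the feature layout, class weights, and model choice are placeholders and are not the CSLE design itself.

```python
# Cost-sensitive disk-failure classifier sketch: missing a failing disk is
# penalized much more heavily than raising a false alarm.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))               # 12 SMART-like features (synthetic)
y = (rng.random(5000) < 0.02).astype(int)     # ~2% failing disks (imbalanced)

clf = RandomForestClassifier(
    n_estimators=200,
    class_weight={0: 1.0, 1: 50.0},           # assumed misclassification costs
    random_state=0,
)
clf.fit(X, y)
print("predicted failure rate:", clf.predict(X).mean())
```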
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 13.2.1 | AUTOMATIC GENERATION OF ARCHITECTURE-LEVEL MODELS FROM RTL DESIGNS FOR PROCESSORS AND ACCELERATORS Speaker: Yu Zeng, Princeton University, US Authors: Yu Zeng, Aarti Gupta and Sharad Malik, Princeton University, US Abstract Hardware platforms comprise general-purpose processors and application-specific accelerators. Unlike processors, application-specific accelerators often do not have clearly specified architecture-level models/specifications (the instruction set architecture or ISA). This poses challenges to the development and verification/validation of firmware/software for these accelerators. Manually writing architecture-level models takes great effort and is error-prone. When Register-Transfer Level (RTL) designs are available, they can be a source from which to automatically derive the architecture-level models. In this work, we propose an approach for automatically generating architecture-level models for processors as well as accelerators from their RTL designs. In previous work, we showed how to automatically extract the architectural state variables (ASVs) from RTL designs. (These are the state variables that are persistent across instructions.) In this work, we present an algorithm for generating the update functions of the model: how the ASVs and outputs are updated by each instruction. Experiments on several processors and accelerators demonstrate that our approach can cover a wide range of hardware features and generate high-quality architecture-level models within reasonable computing time. |
16:44 CET | 13.2.2 | TWINE: A CHISEL EXTENSION FOR COMPONENT-LEVEL HETEROGENEOUS DESIGN Speaker: Shibo Chen, University of Michigan, US Authors: Shibo Chen, Yonathan Fisseha, Jean-Baptiste Jeannin and Todd Austin, University of Michigan, US Abstract Algorithm-oriented heterogeneous hardware design has been one of the major driving forces for hardware improvement in the post-Moore's Law era. To achieve the swift development of heterogeneous designs, designers reuse existing hardware components to craft their systems. However, current hardware design languages either require tremendous efforts to customize designs, or sacrifice quality for simplicity. Chisel, while attracting more users for its capability to easily reconfigure designs, lacks a few key features to further expedite the heterogeneous design flow. In this paper, we introduce Twine—a Chisel extension that provides high-level semantics to efficiently generate heterogeneous designs. Twine standardizes the interface for better reusability and supports control-free specification with flexible data type conversion, which saves designers from the busy-work of interconnecting modules. Our results show that Twine provides a smooth on-boarding experience for hardware designers, considerably improves reusability, and reduces design complexity for heterogeneous designs while maintaining high design quality. |
16:48 CET | 13.2.3 | TOWARDS IMPLEMENTING RTL MICROPROCESSOR AGILE DESIGN USING FEATURE ORIENTED PROGRAMMING Speaker: Tun Li, National University of Defense Technology, CN Authors: Hongji Zou, Mingchuan Shi, Tun Li and Wanxia Qu, National University of Defense Technology, CN Abstract Recently, hardware agile design methods have been developed to improve the design productivity. However, the modeling methods hinder further design productivity improvements. In this paper, we propose and implement a microprocessor agile design method using feature oriented programming technology to improve design productivity. In this method, designs could be uniquely partitioned and constructed incrementally to explore various functional design features flexibly and efficiently. The key techniques to improve design productivity are flexible modeling extension and on-the-fly feature composing mechanisms. The evaluations on RISC-V and OR1200 CPU pipelines show the effectiveness of the proposed method on duplicate codes reduction and flexible feature composing while avoiding design resource overheads. |
16:52 CET | 13.2.4 | CSLE: A COST-SENSITIVE LEARNING ENGINE FOR DISK FAILURE PREDICTION IN LARGE DATA CENTERS Speaker: Xinyan Zhang, Huazhong University of Science and Technology, CN Authors: Xinyan Zhang1, Kai Shan2, Zhipeng Tan3 and Dan Feng3 1Wuhan National Laboratory for Optoelectronics, Huazhong University of Science & Technology, CN; 2Huawei Technologies, CN; 3Huazhong University of Science and Technology, CN Abstract As the principal failure in data centers, disk failure may pose the risk of data loss, increase the maintenance cost, and affect system availability. As a proactive fault tolerance technology, disk failure prediction can minimize the loss before failure occurs. However, a weak prediction model with a low Failure Detection Rate (FDR) and high False Alarm Rate (FAR) may substantially increase the system cost due to inadequate consideration or misperception of the misclassification cost. To address these challenges, we propose a cost-sensitive learning engine CSLE for disk failure prediction, which combines a two-phase feature selection based on Cohen’s D and Genetic Algorithm, a meta-algorithm based on cost-sensitive learning, and an adaptive optimal classifier for heterogeneous and homogeneous disk series. Experimental results on real datasets show that the AUC of CSLE is increased by 2%-42% compared with the commonly used rank-sum test. CSLE can reduce the misclassification cost by 52%-96% compared with the rank model. Besides, CSLE has better pervasiveness than the traditional prediction model: it can reduce both the misclassification cost and the FAR by 16%-70% for heterogeneous disk series, and increase the FDR by 3%-29% for homogeneous disk series. |
16:56 CET | 13.2.5 | Q&A SESSION Authors: Andy Pimentel1 and Matthias Jung2 1University of Amsterdam, NL; 2Fraunhofer IESE, DE Abstract Questions and answers with the authors |
13.3 Advances in permanent storage efficiency and NN-in-memory
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Yi Wang, Shenzhen University, CN
Session co-chair:
Zili Shao, The Chinese University of Hong Kong, HK
In this session we present several hardware- and software-based advances in permanent storage. The solutions build on technologies such as emerging persistent memories, flash, and shingled magnetic recording (SMR) disks to improve the overall bandwidth, latency, capacity, and resilience of permanent storage. They do so by analyzing current bottlenecks and combining several of these technologies to raise performance at the overall system level, by developing a new framework that revisits FTL firmware organization for future open-source multicore architectures, and by presenting a robust implementation of binary neural networks for computing-in-memory.
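As context for the binary-neural-network paper that closes this session, the following sketch shows how a BNN reduces a multiply-accumulate to an XNOR-style operation plus a popcount on bit-packed ±1 vectors; it is a plain software illustration, whereas the paper targets a noisy analog computing-in-memory realization of this idea.

```python
# Toy binary-neural-network dot product on bit-packed +/-1 vectors.
def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two length-n +/-1 vectors packed LSB-first as integers."""
    disagreements = bin((a_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * disagreements              # agreements minus disagreements

# +1,+1,-1,+1 (packed 0b1011) against +1,-1,-1,+1 (packed 0b1001):
# element-wise products are +1,-1,+1,+1, so the dot product is 2.
assert binary_dot(0b1011, 0b1001, 4) == 2
```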
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 13.3.1 | ROBUST BINARY NEURAL NETWORK AGAINST NOISY ANALOG COMPUTATION Speaker: Zong-Han Lee, National Tsing-Hua University, TW Authors: Zong-Han Lee1, Fu-Cheng Tsai2 and Shih-Chieh Chang1 1National Tsing-Hua University, TW; 2Industrial Technology Research Institute, TW Abstract Computing in memory (CIM) technology has shown promising results in reducing the energy consumption of a battery-powered device. On the other hand, to reduce MAC operations, Binary neural networks (BNN) show the potential to catch up with a full-precision model. This paper proposes a robust BNN model applied to the CIM framework, which can tolerate analog noises. These analog noises caused by various variations, such as process variation, can lead to low inference accuracy. We first observe that the traditional batch normalization can cause a BNN model to be susceptible to analog noise. We then propose a new approach to replace the batch normalization while maintaining the advantages. Secondly, in BNN, since noises can be removed when inputs are zeros during the multiplication and accumulation (MAC) operation, we also propose novel methods to increase the number of zeros in a convolution output. We apply our new BNN model in the keyword spotting application. Our results are very exciting. |
16:44 CET | 13.3.2 | (Best Paper Award Candidate) MU-RMW: MINIMIZING UNNECESSARY RMW OPERATIONS IN THE EMBEDDED FLASH WITH SMR DISK Speaker: Chenlin Ma, Shenzhen University, CN Authors: Chenlin Ma, Zhuokai Zhou, Yingping Wang, Yi Wang and Rui Mao, Shenzhen University, CN Abstract Emerging Shingled Magnetic Recording (SMR) Disk can improve the storage capacity significantly by overlapping multiple tracks with the shingled direction. However, the shingled-like structure leads to severe write amplification caused by RMW operations inner SMR disks. As the mainstream solid-state storage technology, NAND flash has the advantages of tiny size, cost-effective, high performance, making it suitable and promising to be incorporated into SMR disks to boost the system performance. In this hybrid embedded storage system (i.e., the Embedded Flash with SMR disk (EF-SMR) system), we observe that physical flash blocks can contain a mixture of data associated with different SMR data bands; when garbage collecting such flash blocks, multiple RMW operations are triggered to rewrite the involved SMR bands and the performance is further exacerbated. Therefore, in this paper, we for the first time present MU-RMW to guarantee data from different SMR bands will not be mixed up within the flash blocks with an aim at minimizing unnecessary RMW operations. The effectiveness of MU-RMW was evaluated with realistic and intensive I/O workloads and the results are encouraging. |
16:48 CET | 13.3.3 | OPTIMIZING COW-BASED FILE SYSTEMS ON OPEN-CHANNEL SSDS WITH PERSISTENT MEMORY Speaker: Runyu Zhang, Chongqing University, CN Authors: Runyu Zhang1, Duo Liu2, Chaoshu Yang3, Xianzhang Chen2, Lei Qiao4 and Yujuan Tan2 1College of Computer Science, Chongqing University, CN; 2Chongqing University, CN; 3Guizhou University, CN; 4Beijing Institute of Control Engineering, CN Abstract Block-based file systems, such as Btrfs, utilize the copy-on-write (CoW) mechanism to guarantee data consistency on solid-state drives (SSDs). Open-channel SSD provides opportunities for in-depth optimization of block-based file systems. However, existing systems fail to co-design the two-layer semantics and cannot take full advantage of the open-channel characteristics. Specifically, synchronizing an overwrite in Btrfs will copy-on-write all pages in the update path and induce severe write amplification. In this paper, we propose a hybrid fine-grained copy-on-write and journaling mechanism (HyFiM) to address these problems. We first utilize persistent memories to preserve the address mapping table of open-channel SSD. Then, we design an intra-FTL copy-on-write mechanism (IFCoW) that eliminates the recursive updates caused by overwrites. Finally, we devise fine-grained metadata journals (FGMJ) to guarantee the consistency of metadata with minimum overhead. We prototype HyFiM based on Btrfs in the Linux kernel. Comprehensive evaluations demonstrate that HyFiM can outperform over Btrfs by 30.77% and 33.82% for sequential and random overwrites, respectively. |
16:52 CET | 13.3.4 | MCMQ: SIMULATION FRAMEWORK FOR SCALABLE MULTI-CORE FLASH FIRMWARE OF MULTI-QUEUE SSDS Speaker: Jin Xue, The Chinese University of Hong Kong, HK Authors: Jin Xue, Tianyu Wang and Zili Shao, The Chinese University of Hong Kong, HK Abstract Solid-state drives (SSDs) have been used in a wide range of emerging data processing systems. To fully utilize the massive internal parallelism delivered by SSDs, manufacturers have begun to utilize high-performance multi-core microprocessors in scalable flash firmware to process I/O requests concurrently. Designing scalable multi-core flash firmware requires simulation tools that can model the features of a multi-core environment. However, existing SSD simulators assume a single-threading execution model and are not capable of modelling overheads incurred by multi-threading firmware execution such as lock contention. In this paper, we propose MCMQ, a novel framework for simulating scalable multi-core flash firmware. The framework is based on an emulated multi-core RISC processor and supports executing multiple I/O traces in parallel through a multi-queue interface. Experiment results show the effectiveness of the proposed framework. We have released the open-source code of MCMQ for public access. |
16:56 CET | 13.3.5 | Q&A SESSION Authors: Yi Wang1 and Zili Shao2 1Shenzhen University, CN; 2The Chinese University of Hong Kong, HK Abstract Questions and answers with the authors |
13.4 System-level security
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Pascal Benoit, University of Montpellier, FR
Session co-chair:
Mike Hamburg, Cryptography Research, US
The session focuses on security from a high-level perspective. It covers improvements to Intel Software Guard Extensions (one ensuring that pages are available in secure memory when needed, and another extending an existing secure key-value store), new protections against transient execution and fault injection attacks, and a new dynamic attack that can evade hardware-assisted attack/intrusion detection.
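As background on the hardware-assisted detection (HID) setting that the first talk attacks, the sketch below trains an anomaly detector over synthetic hardware performance counter readings; the counter selection, value ranges, and model are assumptions, not the detection pipeline evaluated in the paper.

```python
# Illustrative HPC-based anomaly detector (synthetic data, assumed features).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Rows = profiled execution intervals; columns = counters such as cache
# misses, branch mispredictions, and TLB misses (values are synthetic).
benign = rng.normal(loc=[1e5, 2e3, 5e2], scale=[1e4, 2e2, 50.0], size=(1000, 3))
detector = IsolationForest(contamination=0.01, random_state=0).fit(benign)

suspect = np.array([[4.0e5, 9.0e3, 2.5e3]])   # unusually high counter values
print("flagged as anomalous:", detector.predict(suspect)[0] == -1)
```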
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 13.4.1 | CR-SPECTRE: DEFENSE-AWARE ROP INJECTED CODE-REUSE BASED DYNAMIC SPECTRE Speaker: Abhijitt Dhavlle, George Mason University, US Authors: Abhijitt Dhavlle1, Setareh Rafatirad2, Houman Homayoun2 and Sai Manoj Pudukotai Dinakarrao3 1George Mason University, US; 2University of California Davis, US; 3George Mason University, US Abstract Side-channel attacks have been a constant threat to computing systems. In recent times, vulnerabilities in the architecture were discovered and exploited to mount and execute a state-of-the-art attack such as Spectre. The Spectre attack exploits a vulnerability in Intel-based processors to leak confidential data through a covert channel. There exist some defenses to mitigate the Spectre attack. Among multiple defenses, hardware-assisted attack/intrusion detection (HID) systems have received an overwhelming response due to their low overhead and efficient attack detection. The HID systems deploy machine learning (ML) classifiers to perform anomaly detection to determine whether the system is under attack. For this purpose, a performance monitoring tool profiles the applications to record hardware performance counters (HPCs), which are then used for anomaly detection. Previous HID systems assume that Spectre is executed as a standalone application. In contrast, we propose an attack that dynamically generates variations in the injected code to evade detection. The attack is injected into a benign application. In this manner, the attack conceals itself as a benign application and generates perturbations to avoid detection. For the attack injection, we exploit a return-oriented programming (ROP)-based code-injection technique that reuses the code, called gadgets, present in the exploited victim's (host) memory to execute the attack, which, in our case, is the CR-Spectre attack to steal sensitive data from a target victim (target) application. Our work focuses on proposing a dynamic attack that can evade HID detection by injecting perturbations, and its dynamically generated variations thereof, under the cloak of a benign application. We evaluate the proposed attack on the MiBench suite as the host. From our experiments, the HID performance degrades from 90% to 16%, indicating our CR-Spectre attack avoids detection successfully. |
16:44 CET | 13.4.2 | CACHEREWINDER: REVOKING SPECULATIVE CACHE UPDATES EXPLOITING WRITE-BACK BUFFER Speaker: Jongmin Lee, Korea University, KR Authors: Jongmin Lee1, Junyeon Lee2, Taeweon Suh1 and Gunjae Koo1 1Korea University, KR; 2Samsung Advanced Institute of Technology, KR Abstract Transient execution attacks are critical security threats since those attacks exploit speculative execution which is an essential architectural solution that can improve the performance of out-of-order processors significantly. Such attacks change cache state by accessing secret data during speculative executions, then the attackers leak the secret information exploiting cache timing side-channels. Even though software patches against transient execution attacks have been proposed, the software solutions significantly slow down the performance of a system. In this paper, we propose CacheRewinder, an efficient hardware-based defense mechanism against transient execution attacks. CacheRewinder prevents leakage of secret information by revoking the cache updates done by speculative executions. To restore the cache state efficiently, CacheRewinder exploits the underutilized write-back buffer space as the temporary storage for victimized cache blocks that are evicted during speculative executions. Hence when speculation fails CacheRewinder can quickly restore the cache state using the evicted cache blocks held in the write-back buffer. Our evaluation exhibits that CacheRewinder can effectively defend the transient execution attacks. The performance overhead by CacheRewinder is only 0.6%, which is negligible compared to the unprotected baseline processor. CacheRewinder also requires minimal storage cost since it exploits unused write-back buffer entries as storage for evicted cache blocks. |
16:48 CET | 13.4.3 | SAFETEE: COMBINING SAFETY AND SECURITY ON ARM-BASED MICROCONTROLLERS Speaker: Martin Schönstedt, TU Darmstadt, DE Authors: Martin Schönstedt, Ferdinand Brasser, Patrick Jauernig, Emmanuel Stapf and Ahmad-Reza Sadeghi, TU Darmstadt, DE Abstract From industry automation to smart home, embedded devices are already ubiquitous, and the number of applications continues to grow rapidly. However, the plethora of embedded devices used in these systems leads to considerable hardware and maintenance costs. To reduce these costs, it is necessary to consolidate applications and functionalities that are currently implemented on individual embedded devices. Especially in mixed-criticality systems, consolidating applications on a single device is highly challenging and requires strong isolation to ensure the security and safety of each application. Existing isolation solutions, such as partitioning designs for ARM-based microcontrollers, do not meet these requirements. In this paper, we present SafeTEE, a novel approach to enable security- and safety-critical applications on a single embedded device. We leverage hardware mechanisms of commercially available ARM-based microcontrollers to strongly isolate applications on individual cores. This makes SafeTEE the first solution to provide strong isolation for multiple applications in terms of security as well as safety. We thoroughly evaluate our prototype of SafeTEE for the most recent ARM microcontrollers using a standard microcontroller benchmark suite. |
16:52 CET | 13.4.4 | Q&A SESSION Authors: Pascal Benoit1 and Mike Hamburg2 1University of Montpellier, FR; 2Cryptography Research, US Abstract Questions and answers with the authors |
13.5 Safe and Efficient Engineering of Autonomous Systems
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Sebastian Steinhorst, TU Munich, DE
Session co-chair:
Sharon Hu, University of Notre Dame, US
This session discusses novel approaches for engineering autonomous systems, considering safety and validation aspects as well as efficiency. The first paper uses ontology-based perception for autonomous vehicles, which enables a comprehensive safety analysis; the second paper relies on formal approaches for generating relevant critical scenarios for automated driving. The last paper proposes an efficient method for recharging unmanned aerial vehicles (UAVs) to perform large-scale remote sensing with maximal coverage.
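Purely as an illustration of the coverage-partitioning idea behind the last paper (and not its actual two-stage algorithm or Diffusion Heuristic), the sketch below splits hypothetical points of interest into per-sortie clusters whose centroids could serve as candidate rendezvous locations with the mobile recharge vehicle.

```python
# Partition sensing targets into one cluster per battery discharge cycle.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
points = rng.uniform(0, 1000, size=(300, 2))  # synthetic points of interest (m)

n_sorties = 8                                 # assumed number of discharge cycles
labels = KMeans(n_clusters=n_sorties, n_init=10, random_state=0).fit_predict(points)
for k in range(n_sorties):
    centroid = points[labels == k].mean(axis=0)
    print(f"sortie {k}: {np.sum(labels == k)} targets, rendezvous near {centroid.round(1)}")
```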
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 13.5.1 | USING ONTOLOGIES FOR DATASET ENGINEERING IN AUTOMOTIVE AI APPLICATIONS Speaker: Martin Herrmann, Robert Bosch GmbH, DE Authors: Martin Herrmann1, Christian Witt2, Laureen Lake1, Stefani Guneshka3, Christian Heinzemann1, Frank Bonarens4, Patrick Feifel4 and Simon Funke5 1Robert Bosch GmbH, DE; 2Valeo Schalter und Sensoren GmbH, DE; 3Understand AI, DE; 4Stellantis, Opel Automobile GmbH, DE; 5Understand AI, DE Abstract The basis of a robust safety strategy for an automated driving function based on neural networks is a detailed description of its input domain, i.e. a description of the environment in which the function is used. This is required to describe its functional system boundaries and to perform a comprehensive safety analysis. Moreover, it allows datasets to be tailored specifically for safety-related validation tests. Ontologies gather expert knowledge and model information to enable computer-aided processing, while using a notation understandable to humans. In this contribution, we propose a methodology for domain analysis to build up an ontology for the perception of autonomous vehicles, including characteristic features that become important when dealing with neural networks. Additionally, the method is demonstrated by the creation of a synthetic test dataset for a Euro NCAP-like use case. |
16:53 CET | 13.5.2 | USING FORMAL CONFORMANCE TESTING TO GENERATE SCENARIOS FOR AUTONOMOUS VEHICLES Speaker: Lucie Muller, INRIA, FR Authors: Jean-Baptiste Horel1, Christian Laugier1, Lina Marsso2, Radu Mateescu3, Lucie Muller3, Anshul Paigwar1, Alessandro Renzaglia1 and Wendelin Serwe3 1University Grenoble Alpes, Inria, FR; 2University of Toronto, CA; 3INRIA, FR Abstract Simulation, a common practice for evaluating autonomous vehicles, requires specifying realistic scenarios, in particular critical ones, which correspond to corner-case situations that occur rarely and are potentially dangerous to reproduce in real environments. Such simulation scenarios may be either generated randomly or specified manually. Randomly generated scenarios are easy to produce, but their relevance might be difficult to assess, for instance when many slightly different scenarios target one feature. Manually specified scenarios can focus on a given feature, but their design might be difficult and time-consuming, especially to achieve satisfactory coverage. In this work, we propose an automatic approach to generate a large number of relevant critical scenarios for autonomous driving simulators. The approach is based on the generation of behavioural conformance tests from a formal model (specifying the ground-truth configuration with the range of vehicle behaviours) and a test purpose (specifying the critical feature to focus on). The obtained abstract test cases cover, by construction, all possible executions exercising a given feature, and can be automatically translated into the inputs of autonomous driving simulators. We illustrate our approach by generating hundreds of behaviour trees for the CARLA simulator for several realistic configurations. |
17:06 CET | 13.5.3 | REMOTE SENSING WITH UAV AND MOBILE RECHARGING VEHICLE RENDEZVOUS Speaker: Michael Ostertag, University of California, San Diego, US Authors: Michael Ostertag1, Jason Ma1 and Tajana S. Rosing2 1University of California, San Diego, US; 2UCSD, US Abstract Small unmanned aerial vehicles (UAVs) equipped with sensors offer an effective way to perform high-resolution environmental monitoring in remote areas but suffer from limited battery life. In order to perform large-scale remote sensing, a UAV must cover the area using multiple discharge cycles. A practical and efficient method to achieve full coverage is for the sensing UAV to rendezvous with a mobile recharge vehicle (MRV) for a battery exchange, which is an NP-hard problem. Existing works tackle this problem using slow genetic algorithms or greedy heuristics. We propose an alternative approach: a two-stage algorithm that iterates between dividing a region into independent subregions aligned to MRV travel and a new Diffusion Heuristic that performs a local exchange of points of interest between neighboring subregions. The algorithm outperforms existing state-of-the-art planners for remote sensing applications, creating more fuel efficient paths that better align with MRV travel. |
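The abstract of 13.5.3 above describes a two-stage structure: partition the region into subregions aligned to MRV travel, then locally exchange points of interest between neighbouring subregions (the Diffusion Heuristic). The snippet below is a heavily simplified, hypothetical sketch of that partition-then-exchange pattern; the real cost model, MRV alignment, and heuristic are far richer, and every function here is invented for illustration.

```python
# Toy sketch of a two-stage "partition then locally exchange" planner, loosely
# inspired by the abstract above. Everything here is an invented illustration.
import random

def partition(points, num_subregions):
    """Stage 1: split points of interest into bands along the MRV travel axis."""
    pts = sorted(points, key=lambda p: p[0])
    size = len(pts) // num_subregions
    return [pts[i * size:(i + 1) * size] for i in range(num_subregions - 1)] + \
           [pts[(num_subregions - 1) * size:]]

def workload(subregion):
    """Stand-in cost: number of points a UAV must visit in one discharge cycle."""
    return len(subregion)

def diffuse(subregions, rounds=10):
    """Stage 2: move single points between neighbouring subregions whenever
    that reduces the workload imbalance (a crude local exchange)."""
    for _ in range(rounds):
        for i in range(len(subregions) - 1):
            a, b = subregions[i], subregions[i + 1]
            if workload(a) > workload(b) + 1 and a:
                b.append(a.pop())          # shift one point towards the lighter side
            elif workload(b) > workload(a) + 1 and b:
                a.append(b.pop())
    return subregions

random.seed(0)
points = [(random.random(), random.random()) for _ in range(23)]
subregions = diffuse(partition(points, 4))
print([workload(s) for s in subregions])   # workloads after local exchange, roughly balanced
```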
A.1 Panel on Quantum and Neuromorphic Computing: Designing Brain-Inspired Chips
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 17:30 CET - 19:00 CET
Session chair:
Aida Todri Sanial, LIRMM, FR
Session co-chair:
Anne Matsuura, Intel, US
Panellists:
Bhavin J. Shastri, Queen’s University, CA
Giacomo Indiveri, ETH Zürich, CH
Mike Davies, INTEL, US
In this session, invited speakers from industry and academia will cover topics ranging from neuro-inspired computing chips, neuromorphic engineering and photonics to organic electronics for neuromorphic computing.
14.1 University Fair
Add this session to my calendar
Date: Thursday, 17 March 2022
Time: 19:00 CET - 20:30 CET
Session chair:
Ioannis Sourdis, Chalmers, SE
Session co-chair:
Nele Mentens, KU Leuven, BE
The University Fair is a forum for disseminating academic research activities. Its goal is twofold:
(1) to foster the transfer of mature academic work to a large audience of industrial parties.
(2) to advertise new or upcoming research plans associated with new open research positions to a large audience of graduate students.
To this end, the University Fair program includes talks that describe (1) pre-commercial mature academic research results and/or prototypes with technology transfer potential as well as (2) new upcoming research initiatives associated with openings of academic research positions.
Time | Label | Presentation Title Authors |
---|---|---|
19:00 CET | 14.1.1 | CHALMERS ACTIVITIES IN EUROHPC JU Speaker and Author: Per Stenstrom, Chalmers University of Technology, SE Abstract . |
19:10 CET | 14.1.2 | HARDWARE DESIGNS FOR HIGH PERFORMANCE AND RELIABLE SPACE PROCESSORS Authors: Leonidas Kosmidis and Marc Solé Bonet, Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES Abstract . |
19:20 CET | 14.1.3 | NEW POSITION IN THE SSH TEAM OF TÉLÉCOM PARIS Speaker and Author: Jean Luc Danger, Télécom ParisTech, FR Abstract . |
19:30 CET | 14.1.4 | A TOOLCHAIN FOR LIBRARY CELL CHARACTERIZATION FOR RFET TECHNOLOGIES Speaker: Steffen Märcker, TU Dresden, DE Authors: Steffen Märcker, Akash Kumar, Michael Raitza and Shubham Rai, TU Dresden, DE Abstract . |
19:40 CET | 14.1.5 | SAFETY-RELATED OPEN SOURCE HARDWARE MODULES Speaker: Jaume Abella, Barcelona Supercomputing Center, ES Authors: Jaume Abella1, Sergi Alcaide2 and Pedro Benedicte1 1Barcelona Supercomputing Center, ES; 2Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES Abstract . |
19:50 CET | 14.1.6 | RESEARCH @NECSTLAB IN A NUTSHELL AKA RESEARCH ACTIVITIES AND OPPORTUNITIES FOR PROSPECTIVE PHD STUDENTS Speaker and Author: Marco D. Santambrogio, Politecnico di Milano, IT Abstract . |
20:00 CET | 14.1.7 | POWER-OFF LASER ATTACKS ON SECURITY PRIMITIVES Speaker and Author: Giorgio Di Natale, TIMA, FR Abstract . |
W01 European Automotive Reliability, Test and Safety (eARTS)
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:00 CET - 18:00 CET
Organisers:
Riccardo Cantoro, Politecnico di Torino, IT
Giusy Iaria, Politecnico di Torino, IT
Speakers:
Alberto Bosio, École Centrale de Lyon, FR
Fabien Bouquillon, Université de Lille, FR
Marc Hutner, proteanTecs, CA
Sarah Seifi, Infineon Technologies AG, DE
Haralampos-G. Stratigopoulos, Sorbonne Universités, CNRS, LIP6, FR
Lee Harrison, Siemens EDA, GB
Paolo Bernardi, Politecnico di Torino, IT
Teo Cupaiuolo, Synopsys, IT
Lieyi Sheng, onsemi, BE
Shivakumar Chonnad, Synopsys Inc, US
Alessandra Nardi, Cadence, US
Giovanni Corrente, STMicroelectronics, IT
Tobias Kilian, TU Munich / Infineon Technologies AG, DE
Haolan Liu, University of California, US
Adrian Evans, CEA-Leti, FR
Mayukh Bhattacharya, Synopsys, US
Yousef Abdulhammed, BMW Motorsport, DE
Information
Automotive electronics is becoming more and more relevant in daily life, especially with the advent of autonomous driving, and people will become 100% dependent on the proper operation of these electronic systems. The 2nd European Automotive Reliability, Test and Safety workshop (eARTS) focuses on test, reliability, and safety of automotive electronics, including IC design, test development, system-level integration, production testing, in-field test, diagnosis and repair solutions, cybersecurity, as well as architectures and methods for reliable, safe, and secure operation in the field.
The eARTS Workshop offers a forum for industry specialists and academic researchers to present and discuss these challenges and emerging solutions. For this second edition in the frame of the DATE conference, special focus will be given to Design-for-Test solutions and system-level test.
Topic Areas – You are invited to participate and submit your contributions to the eARTS Workshop. The workshop’s areas of interest include (but are not limited to) the following topics:
- Automotive Design-for-Test: enable high quality at low cost
- Statistical post-processing, Machine Learning, and AI for test and reliability
- Latent defect activation during production testing
- Built-In Self-Test in automotive systems: digital, analog, mixed-signal
- Reuse of test infrastructure and New Product Development acceleration
- Dependability challenges of autonomous driving and e-mobility
- Functional safety and cyber-security
- Automotive standards and certification – ISO 26262, AEC-Q100
- Approximate computing for automotive
- Verification and validation of automotive systems
- Fault tolerance and self-checking circuits
- Aging effects on automotive electronics
- Power-up, power-down and periodic test
- System level test
- Functional and structural test generation
- Automotive production testing
Key dates
- Submission deadline: January 27, 2022
- Notification of acceptance: February 15, 2022
Program
Morning sessions |
|
---|---|
8:30 - 8:40 CET | Opening |
8:40 - 9:40 CET | Keynote: From combustion towards electrical cars |
9:40 - 10:00 CET | Break |
10:00 - 11:00 CET | Technical Session 1 |
11:00 - 12:00 CET | Invited Session 1: Design-for-dependability for AI hardware accelerators in the edge |
12:00 - 13:00 CET | Technical Session 2 |
13:00 - 14:30 CET | Lunch break |
Afternoon sessions |
|
---|---|
14:30 - 15:30 CET | Technical Session 3 |
15:30 - 16:10 CET | Embedded Tutorial: IEEE P2851 advancements |
16:10 - 16:30 CET | Break |
16:30 - 17:30 CET | Invited Session 2: The challenges of reaching zero defect and functional safety – and how the EDA industry tackles them |
17:30 - 18:30 CET | Panel: What are the limitations of EDA tools with respect to zero defects and FuSa? |
18:30 - 18:45 CET | Closing |
W01.0 Opening
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:30 CET - 08:40 CET
Chairs:
Paolo Bernardi, Politecnico di Torino, IT
Yervant Zorian, Synopsys, US
Riccardo Cantoro, Politecnico di Torino, IT
Wim Dobbelaere, onsemi, BE
W01.1 Keynote: From combustion towards electrical cars
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:40 CET - 09:40 CET
Keynote Speaker:
Riccardo Groppo, Ideas and Motion, IT
Short bio: Riccardo Groppo received his MSc degree in Electronic Engineering from the Politecnico di Torino (Torino, Italy). He is the co-founder and CEO of Ideas & Motion, a high-tech company focused on IP development on silicon and the design of complex automotive control systems for niche applications. He is the Chairman of the Transportation Working Group and Board Vice-Chairman within EPoSS (the European Platform on Smart Systems Integration). He is a member of the Technical Committee of several relevant events worldwide (SAE World Congress, AMAA Conference and Smart Systems Integration Conference). He started his career with Honeywell Bull and then joined Centro Ricerche FIAT (CRF) in 1989, where he was involved in the design of innovative engine/vehicle automotive control systems. He was a member of the CRF team that developed the first automotive Common Rail system for a direct-injection Diesel engine. He was then involved in the design and industrialization of the MultiAir technology and the dry dual-clutch transmission. He was Head of the Automotive Electronics Design and Development Department at CRF (2002-2013), where he promoted the design of IP building blocks in ASIC technology in cooperation with Freescale Semiconductor and Robert BOSCH for FIAT/Chrysler applications. Those smart drivers are the de facto standard in automotive powertrain applications, with volumes exceeding 17 million parts/year. He holds more than 31 patents in the field of automotive electronics and embedded systems, most of which are currently in production on passenger cars.
W01.T1 Technical Session 1 - Applications, Machine Learning, and System-level Test
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:00 CET - 11:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
10:00 CET | W01.T1.1 | TOWARDS FAST AND EFFICIENT SCENARIO GENERATION FOR AUTONOMOUS VEHICLES Speaker: Haolan Liu, University of California, US |
10:15 CET | W01.T1.2 | DEEP LEARNING BASED DRIVER MODEL AND FAULT DETECTION FOR AUTOMATED RACECAR SYSTEM TESTING Speaker: Yousef Abdulhammed, BMW Motorsport, DE |
10:30 CET | W01.T1.3 | UNSUPERVISED CLUSTERING OF ACOUSTIC EMISSION SIGNALS FOR SEMICONDUCTOR THIN LAYER CRACK DETECTION AND DAMAGE EVENT INTERPRETATION Speaker: Sarah Seifi, Infineon Technologies AG, DE |
10:45 CET | W01.T1.4 | ONLINE SCHEDULING OF MEMORY BISTS EXECUTION AT REAL-TIME OPERATING-SYSTEM LEVEL Speaker: Paolo Bernardi, Politecnico di Torino, IT |
W01.2 Invited Session 1: Design-for-dependability for AI hardware accelerators in the edge
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:00 CET - 12:00 CET
Organiser:
Haralampos-G. Stratigopoulos, Sorbonne Universités, CNRS, LIP6, FR
Abstract: AI has seen an explosion of real-world applications in recent years. For example, it is the backbone of self-driving and connected cars. The design of AI hardware accelerators to support intensive and memory-hungry AI workloads is an ongoing effort aimed at optimizing the energy-area trade-off. This special session will focus on dependability aspects in the design of AI hardware accelerators. It is often tacitly assumed that neural networks in hardware inherit the remarkable fault-tolerance capabilities of the biological brain. This assumption has been proven false in recent years by a number of fault-injection experiments. The three talks will cover reliability assessment and fault tolerance of Artificial Neural Networks and Spiking Neural Networks implemented in hardware, as well as the impact of approximate computing on fault-tolerance capabilities.
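The abstract above refers to fault-injection experiments showing that hardware neural networks do not automatically inherit the brain's fault tolerance. As a generic, hypothetical illustration of such an experiment (not any presenter's methodology), the sketch below flips random bits in the int8 weight memory of a tiny linear classifier and reports the resulting accuracy drop.

```python
# Generic bit-flip fault-injection experiment on quantized weights.
# Illustrative only; the session's actual fault models and networks differ.
import numpy as np

rng = np.random.default_rng(0)

# A tiny "trained" linear classifier on synthetic 2-class data.
X = rng.normal(size=(200, 8))
w_true = rng.normal(size=8)
y = (X @ w_true > 0).astype(int)
w = w_true.copy()                      # pretend these are the trained weights

def quantize(weights, scale=64):
    """Map float weights to int8, the kind of format an edge accelerator stores."""
    return np.clip(np.round(weights * scale), -128, 127).astype(np.int8)

def accuracy(q_weights, scale=64):
    pred = (X @ (q_weights.astype(float) / scale) > 0).astype(int)
    return (pred == y).mean()

def inject_bit_flips(q_weights, n_flips):
    """Flip n random bits across the weight memory (single-event-upset style)."""
    faulty = q_weights.copy().view(np.uint8)   # work on the raw bit pattern
    for _ in range(n_flips):
        idx = rng.integers(len(faulty))
        bit = rng.integers(8)
        faulty[idx] ^= np.uint8(1 << bit)      # flipping bit 7 changes the sign
    return faulty.view(np.int8)

qw = quantize(w)
print("fault-free accuracy:", accuracy(qw))
for n in (1, 4, 16):
    accs = [accuracy(inject_bit_flips(qw, n)) for _ in range(100)]
    print(f"{n:2d} bit flips -> mean accuracy {np.mean(accs):.3f}")
```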
Presentations:
- Fault Tolerance of Neural Network Hardware Accelerators for Autonomous Driving
Adrian Evans (CEA-Leti, Grenoble, France), Lorena Anghel (Grenoble-INP, SPINTEC, Grenoble, France), and Stéphane Burel (CEA-Leti, Grenoble, France)
- Exploiting Approximate Computing for Efficient and Reliable Convolutional Neural Networks
Alberto Bosio (École Centrale de Lyon, INL, Lyon, France)
- Reliability Assessment and Fault Tolerance of Spiking Neural Network Hardware Accelerators
Haralampos-G. Stratigopoulos (Sorbonne University, CNRS, LIP6, Paris, France)
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | W01.2.1 | FAULT TOLERANCE OF NEURAL NETWORK HARDWARE ACCELERATORS FOR AUTONOMOUS DRIVING Speaker: Adrian Evans, CEA-Leti, FR |
11:20 CET | W01.2.2 | EXPLOITING APPROXIMATE COMPUTING FOR EFFICIENT AND RELIABLE CONVOLUTIONAL NEURAL NETWORKS Speaker: Alberto Bosio, École Centrale de Lyon, FR |
11:40 CET | W01.2.3 | RELIABILITY ASSESSMENT AND FAULT TOLERANCE OF SPIKING NEURAL NETWORK HARDWARE ACCELERATORS Speaker: Haralampos-G. Stratigopoulos, Sorbonne Universités, CNRS, LIP6, FR |
W01.T2 Technical Session 2 - Testing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:00 CET - 13:00 CET
Session Chair:
Melanie Schillinsky, NXP, DE
Time | Label | Presentation Title Authors |
---|---|---|
12:00 CET | W01.T2.1 | PERCEPTION AND REALITY CHECK INTO V-STRESS FOR SCREENING DEFECTIVE PARTS IN AUTOMOTIVE RELIABILITY Speaker: Lieyi Sheng, onsemi, BE |
12:15 CET | W01.T2.2 | POWER CYCLING BODY DIODE CURRENT FLOW ON SIC MOSFET DEVICE Speaker: Giovanni Corrente, STMicroelectronics, IT |
12:30 CET | W01.T2.3 | REDUCING ROUTING OVERHEAD USING NATURAL LOOPS Speaker: Tobias Kilian, TU Munich / Infineon Technologies AG, DE |
12:45 CET | W01.T2.4 | A NOVEL METHOD FOR DISCOVERING ELECTRICALLY EQUIVALENT DEFECTS IN ANALOG/MIXED-SIGNAL CIRCUITS Speaker: Mayukh Bhattacharya, Synopsys, US |
W01.T3 Technical Session 3 - Reliability and Safety
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:30 CET - 15:30 CET
Session Chair:
Michelangelo Grosso, STMicroelectronics, IT
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | W01.T3.1 | IMPROVING INSTRUCTION CACHE MEMORY RELIABILITY UNDER REAL-TIME CONSTRAINTS Speaker: Fabien Bouquillon, Université de Lille, FR |
14:45 CET | W01.T3.2 | COMMON DATA LANGUAGE CONNECTING HTOL TESTING TO IN-FIELD USE Speaker: Marc Hutner, proteanTecs, CA |
15:00 CET | W01.T3.3 | EFFICIENT USE OF ON-LINE LOGICBIST TO ACHIEVE ASIL B IN A GPU IP Speaker: Lee Harrison, Siemens EDA, GB |
15:15 CET | W01.T3.4 | VERIFICATION AND VALIDATION OF SAFETY ELEMENT OUT OF CONTEXT Speaker: Shivakumar Chonnad, Synopsys Inc, US |
W01.ET Embedded Tutorial - IEEE P2851 advancements
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:30 CET - 16:10 CET
Session Chair:
Oscar Ballan, Ethernovia, US
Organiser:
Jyotika Athavale, NVIDIA, US
Speakers:
Bernhard Bauer, Synopsys, DK
Meirav Nitzan, Synopsys, US
W01.3 Invited Session 2: The challenges of reaching zero defect and functional safety – and how the EDA industry tackles them
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 16:30 CET - 17:30 CET
Session Chair:
Daniel Tille, Infineon, DE
Organisers:
Riccardo Cantoro, Politecnico di Torino, IT
Daniel Tille, Infineon, DE
Abstract: Automotive microcontrollers have become very complex Systems-on-Chip (SoCs). The megatrends of Advanced Driver-Assistance Systems (ADAS) and Automated Driving (AD), but also traditional applications such as powertrain and steering, require ever-increasing functionality. However, these safety-critical environments require zero defects, and the implementation of functional safety measures together with the rising complexity poses significant challenges to satisfying these requirements. This special session addresses these challenges and shows potential solutions to overcome them with the help of the EDA industry.
Presentations:
- Automated solutions for safety and security vulnerabilities
Teo Cupaiuolo (Synopsys)
- Functional Safety: an EDA perspective
Alessandra Nardi (Cadence)
- The Zero Defect Goal For Automotive ICs
Lee Harrison (Siemens EDA); Nilanjan Mukherjee (Siemens)
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | W01.3.1 | AUTOMATED SOLUTIONS FOR SAFETY AND SECURITY VULNERABILITIES Speaker: Teo Cupaiuolo, Synopsys, IT |
16:50 CET | W01.3.2 | FUNCTIONAL SAFETY: AN EDA PERSPECTIVE Speaker: Alessandra Nardi, Cadence, US |
17:10 CET | W01.3.3 | THE ZERO DEFECT GOAL FOR AUTOMOTIVE ICS Speaker: Lee Harrison, Siemens EDA, GB |
W01.4 Panel: What are the limitations of EDA tools with respect to zero defects and FuSa?
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 17:30 CET - 18:30 CET
Session Chair:
Wim Dobbelaere, onsemi, BE
Organiser:
Davide Appello, STMicroelectronics, IT
Panellists:
Antonio Priore, ARM, GB
Georges Gielen, KU Leuven, BE
Chen He, NXP, US
Mauro Pipponzi, ELES, IT
Vladimir Zivkovic, Infineon, DK
Om Ranjan, STMicroelectronics, IN
Abstract: Product segments with high quality demands, such as automotive, transportation, and aerospace, have been characterized by persistent needs over several years:
- Zero defects, or in general very low defective levels
- Accurate modeling and prediction of product reliability
The sustainability of these objectives is challenged by the relentless demand for higher-performance products and the consequent move to higher complexity and advanced technology nodes.
Functional safety standards and requirements aim to guarantee the usability of products in safety-critical applications and add several requirements whose satisfaction is a key challenge during the development of a new product.
This panel session will debate with the experts how effectively the available EDA tools help to face the described challenges.
As an example, these are suitable questions that anyone in the field may need answering:
- How does EDA help to effectively resolve requirements traceability "end-to-end"? Does this represent a sustainable effort?
- Is DFT effective enough in addressing fault models to reach target quality?
- Is verification/simulation/validation effective with respect to transient faults?
W01.5 Closing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 18:30 CET - 18:45 CET
Chairs:
Paolo Bernardi, Politecnico di Torino, IT
Riccardo Cantoro, Politecnico di Torino, IT
Yervant Zorian, Synopsys, US
Wim Dobbelaere, onsemi, BE
W07 European Workshop on Silicon Lifecycle Management (eSLM)
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:00 CET - 18:00 CET
Organisers:
Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Yervant Zorian, Synopsys, US
Aim and Scope
With increasing system complexity, security concerns, stringent runtime requirements for functional safety, and the cost constraints of a mass market, the reliable and secure operation of electronics in safety-critical, enterprise-server, and cloud-computing domains remains a major challenge. While design-time and test-time solutions were traditionally supposed to guarantee the in-field dependability and security of electronic systems, the complex interaction of runtime effects from running workloads and the environment creates a great need for a holistic approach to silicon lifecycle management, spanning from design time to in-field monitoring and adaptation. Therefore, solutions for lifecycle management should include various sensors and monitors embedded at different levels of the design stack, access mechanisms and standards for such on-chip and in-system sensor networks, as well as data analytics on the edge and in the cloud. This European edition of the eSLM Workshop aims to build a community around this topic and to offer a forum where researchers and practitioners alike can present and discuss these challenges and emerging solutions.
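As a minimal illustration of the edge-side data analytics mentioned above (and of the anomaly-detection topic listed below), the following sketch flags readings from a hypothetical on-chip sensor stream that drift outside a rolling mean-and-deviation band; the window size, threshold, and sensor model are illustrative assumptions, not part of any workshop contribution.

```python
# Minimal sketch of edge-side anomaly detection on an on-chip sensor stream,
# e.g. a ring-oscillator-based aging or temperature monitor. Thresholds, window
# size and the sensor model are invented for illustration only.
from collections import deque
import math, random

def rolling_zscore_alarm(samples, window=50, threshold=4.0):
    """Yield (index, value) for samples that deviate strongly from the recent mean."""
    history = deque(maxlen=window)
    for i, x in enumerate(samples):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((v - mean) ** 2 for v in history) / window
            std = math.sqrt(var) or 1e-9
            if abs(x - mean) / std > threshold:
                yield i, x
        history.append(x)

random.seed(1)
readings = [1.00 + random.gauss(0, 0.01) for _ in range(300)]
readings[200] += 0.3          # injected drift / fault, should be flagged
print(list(rolling_zscore_alarm(readings)))
```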
Topic Areas
The workshop’s areas of interest include (but are not limited to) the following topics:
- Design and placement of various sensors and monitors for functional safety and security
- Standards for sensor data aggregation
- Data analytics for sensor data processing
- Anomaly detection for security and functional safety
- Machine learning for in-field system health monitoring
- Multi-layer dependability evaluation
- In-field verification and validation
- Fault tolerance and self-checking circuits
- Aging effects on electronics
- Reuse and extension of test, debug and repair infrastructure for in-field management
- Power-up, power-down and periodic tests
- System level test
- Preventive Maintenance
- Concurrent and periodic checking
- Functional and structural test generation
- Graceful degradation
- Useful remaining lifetime prediction
- Failure prediction and forecasting
- Attack prediction and prevention
- In-field configuration and adaptation
- Cross-layer solutions
Preliminary Program
Panel Information
Panel: "Challenges of the SLM ecosystem"
Organizer: Hans-Joachim Wunderlich
Silicon lifecycle management covers a broad variety of aspects and goals, which may complement each other but can also be in conflict. Examples are runtime test and diagnosis versus security, data collection in the cloud versus privacy, BIST versus monitoring, or on-chip infrastructure versus reliability. Renowned experts, mainly from industry, will discuss various challenges of the different aspects of SLM:
Panelists:
- Dan Alexandrescu, IROC Technologies
- Jürgen Alt, Infineon
- Sonny Banwari, Advantest
- Artur Jutman, Testonica
- Martino Quattrocchi or Antonio Scrofani, ST
- Aileen Ryan, Siemens (Mentor)
- Mark Tehranipoor, FICS
W08 Workshop on Ferroelectronics
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:00 CET - 18:00 CET
Organisers:
Ian O'Connor, INL, FR
Stefan Slesazeck, NaMLab GmbH, DE
Bastien Giraud, CEA, FR
The Design, Automation, and Test in Europe conference and exhibition is the main European event bringing together researchers, vendors and specialists in hardware and software design, test and manufacturing of electronic circuits and systems. Friday Workshops are dedicated to emerging research and application topics. At DATE 2022, one of the Friday Workshops is devoted to the emerging field of ferroelectronics.
DESCRIPTION
Ferroelectric capacitors (FeCaps) and ferroelectric field-effect transistors (FeFETs) have attracted attention as next-generation devices, as they can serve as synaptic devices for neuromorphic implementations as well as multi-level memories and non-volatile transistors to achieve high integration. The elegant, fundamental physics of ferroelectricity is a century old, but the game-changer came with the discovery of hafnium-zirconium oxide (HZO) with high ferroelectricity (even at a thickness of several nanometers) that can be fabricated in a CMOS-compatible process. Ferroelectric devices are a versatile and energy-efficient approach with immense potential to revolutionize machine learning at the edge in a wide variety of architectures for multiple AI applications.
The goal of the Ferroelectronics workshop is to bring together experts from both academia and industry, interested in this exciting and rapidly evolving field, in order to foster exchanges and ideas around the latest state-of-the-art and discuss future challenges. This one-day event consists of a plenary keynote and invited talks, as well as two open-call sessions for regular and poster presentations.
SCHEDULE
9:00-9:30 Welcome and introduction (Ian O'Connor)
9:30-10:30 FeFET-based memory and logic design (moderator: Bastien Giraud)
- Ferroelectronic devices - electrical properties and design constraints (S. Slesazeck, NaMLab, DE)
- The FeFET - learning to handle this new powerful device available in 2x CMOS platforms (S. Beyer, GlobalFoundries, DE)
Abstract: With the discovery of ferroelectricity in HfO2-based thin films in 2011 and the co-integration of ferroelectric field-effect transistors (FeFETs) into standard high-k metal gate (HKMG) CMOS platforms in 2016/17 by GLOBALFOUNDRIES, the FeFET has evolved from a theoretical dream to an applicable reality. Having matured initially as a low-cost, low-power eFLASH replacement, the FeFET is much more than a classical, rigid eNVM cell. With its great HKMG CMOS compatibility, its flexibility and its unique switching properties, it is rather to be seen as a new versatile device that promises to open up new worlds. The neuromorphic design community in particular has shifted its focus towards this novel device with game-changing potential. In this talk we will discuss the current status of GlobalFoundries' FeFET technology, investigate the operation and use of this device, and discuss the remaining challenges and outlook.
Bio: Sven Beyer received his master's and PhD degrees in Physics from the University of Hamburg, Germany. He started his career with Infineon as a manufacturing engineer in the etch department in 2003. He joined the integration department of AMD in 2005. He spent a year in the ASTA alliance in 2007 working on the 45 nm node and has held many roles since then, throughout the separation of GLOBALFOUNDRIES from AMD. Today he serves as DMTS in GLOBALFOUNDRIES FAB1, overseeing mainly the eNVM roadmap and development in Dresden.
10:30-11:00 Coffee break
11:00-11:30 Regular paper session (moderator: Stefan Slesazeck)
- 2- and 3-Terminals BEOL-Compatible Ferroelectric Synaptic Weights: Scalability and Functionality (Laura Bégon-Lours, IBM Research, CH)
- FeFETs for Phase Encoded Oscillatory based Computing (Juan Núñez, IMSE-CNM, ES)
11:30-12:30 FeRAM-based memory arrays and new devices (moderator: Ian O'Connor)
- HfO2-based FeRAM arrays: current performance and perspectives for scaling (L. Grenouillet, CEA-Leti, FR)
Bio: Laurent Grenouillet received the Engineer degree in physics in 1998 from the National Institute of Applied Sciences (INSA) in Lyon, France, and the PhD degree in electronic devices in 2001 for his work on the optical spectroscopy of diluted nitrides grown on GaAs substrates. After a post-doctoral position in the field of Molecular Beam Epitaxy, he joined CEA-Leti in 2002 and worked on GaAs-based VCSELs emitting in the 1.1-1.3 μm range and on single-photon sources with quantum dots. In 2006, he joined the Silicon Photonics group, where he developed CMOS-compatible hybrid III-V on silicon lasers. In 2009, he joined the IBM Alliance in Albany as a Leti assignee to contribute to the development of FDSOI technology. Within Albany's state-of-the-art facilities, he worked extensively on device integration to improve the performance of FDSOI devices (28 nm and 14 nm nodes). Back in France at CEA-Leti in 2013, he focused on performance boosters for the 10 nm node FDSOI technology and took part in the FDSOI technology transfer to GlobalFoundries (22FDX) in 2015. During that period he joined the Advanced Memory Device Laboratory at CEA-Leti. His current research interests include resistive switching memory devices and ferroelectric HfO2-based memories. Laurent Grenouillet has authored or co-authored over 80 papers (conferences and journals) and has filed over 40 patents. He serves as a committee member of the Solid-State Devices and Materials (SSDM) conference.
- Comparison of FeFET memory cells: Performance metrics and applications (S. Muller, FMC, DE)
Abstract: Over the last decade, more and more research and development effort has been devoted to a memory cell called the Ferroelectric Field-Effect Transistor (FeFET). The FeFET was invented as early as 1957; however, it took more than five decades until the device could be demonstrated on a semiconductor production line in 2011. This milestone was achieved through the discovery of ferroelectric hafnium oxide, which finally enables the use of ferroelectric materials in standard production environments.
In this talk, we will review the progress that has been made on different FeFET memory cells, in particular FeFETs of the MFIS and MFMIS types. These two types of memory cells differ mainly in that the second type incorporates a floating gate within its gate stack. We will give an overview of the challenges and opportunities of both cell types and propose paths for further development and optimization.
Bio: Dr. Stefan Müller received the joint master’s degree in Microelectronics from Technical University Munich, Germany, and Nanyang Technological University Singapore in 2011. He also holds a German diploma degree in Mechatronics and Information Technology as well as a bachelor’s degree in Mechanical Engineering both from Technical University Munich, Germany (2011/2008). In 2011, he joined NaMLab gGmbH, a research institute of University of Technology Dresden. In 2015, he received his PhD degree for his work on HfO2-based ferroelectric devices. In 2016, he co-founded FMC – The Ferroelectric Memory Company where he currently holds the position of CTO.
12:30-13:30 Lunch break
13:30-14:30 Keynote – In-Memory Computing with Ferroelectronics (moderator: Ian O'Connor)
- M. Niemier, U. Notre Dame, US
Abstract: Researchers are working to build more efficient and/or higher performance logic devices, memory devices, and/or fabrics where processing logic and data storage are integrated at finer granularities. The latter is especially appealing owing to challenges with the omnipresent processor-memory bottleneck, which exacerbates the efficient and expeditious processing of modern workloads.
A new device may (1) serve as a replacement for an existing technology, in an existing architecture (e.g., a new memory cell in a traditional array), or (2) serve as an “enabler” of a new circuit architecture and/or compute functionality (e.g., a new memory cell that can natively perform an important compute kernel). For (2), it is imperative to consider if (a) a new technology can perform said kernel more efficiently when compared to either CMOS and/or an existing architectural solution, (b) if said kernel can be used broadly – for a range of applications and/or within an application to justify investment, and (c) if said kernel fundamentally changes an existing algorithm – e.g., which may impact accuracy in a machine learning task.
In this talk, we consider use case (2). We study the impact of different technology-enabled, content addressable memory (CAM)-based, in-memory matching functions as applied to hyperdimensional computing problems. (1) We highlight the efficacy of multi-bit and analog CAMs when used to implement and analyze hypervectors via native distance functions, and compare solutions to approaches with higher dimensional precision/cosine distance functions; (2) We discuss how realistic implementation constraints including (a) the inherent precision of technology-based CAM solutions, (b) CAM architectures with appropriately sized sub-arrays to accommodate realistically-sized hypervectors, and (c) inherent device variations can be overcome to match the accuracy of GPU-based realizations; (3) We quantify (a) the impact of tradeoffs associated with technology-based solutions (e.g., longer hypervectors, peripheral circuitry to aggregate results from CAM sub-arrays, etc.) that may be needed to achieve iso-accuracy with existing solutions, (b) the resulting impact of energy and latency at the application-level, and (c) what parts of an existing workload technology-based solutions can accelerate / what the next design targets should be to achieve further improvements.
We conclude with a roadmap that illustrates how technology-based CAM solutions are applicable to a broad set of “at-scale” problems (e.g., applications in the MLPerf suite, bioinformatics workloads, etc.), how we might perform targeted/smart searches in extremely large subarrays (either in-memory, or in-storage), etc.
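As a rough software-level illustration of the hypervector matching the keynote abstract refers to (not the speaker's CAM designs), the sketch below classifies binary hypervectors by Hamming distance, the nearest-neighbour search that an associative, CAM-like array would perform in memory; the dimensionality, encoding, and data are invented for illustration.

```python
# Toy hyperdimensional classification via Hamming-distance matching, the kind of
# associative search a (multi-bit/analog) CAM would perform in memory.
# Encoding, dimensionality and data are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)
D = 4096                                   # hypervector dimensionality

def random_hv():
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bundle(hvs):
    """Majority vote: the bundled vector is 1 where most inputs are 1."""
    return (np.sum(hvs, axis=0) * 2 > len(hvs)).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

# "Train": each class prototype is a bundle of noisy samples of a base pattern.
bases = {label: random_hv() for label in ("A", "B", "C")}

def noisy(hv, flip_frac=0.2):
    mask = rng.random(D) < flip_frac
    return np.where(mask, 1 - hv, hv).astype(np.uint8)

prototypes = {label: bundle([noisy(hv) for _ in range(10)])
              for label, hv in bases.items()}

# "Inference": a CAM row-match is a nearest-neighbour search by distance.
query = noisy(bases["B"], flip_frac=0.3)
best = min(prototypes, key=lambda label: hamming(prototypes[label], query))
print("predicted class:", best)            # should print B (nearest prototype)
```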
Bio: Michael Niemier is a Professor at the University of Notre Dame. He is interested in designing, facilitating, benchmarking, and evaluating circuits and architectures based on emerging logic, memory, and storage technologies for at-scale workloads/problems. He is the recipient of multiple IBM Faculty Awards, the Rev. Edmund P. Joyce, C.S.C. Award for Excellence in Undergraduate Teaching, and best paper awards such as at ISLPED. Niemier has served on numerous TPCs for design related conferences, and has chaired the emerging technologies track at DATE, DAC, and ICCAD. He is an associate editor for IEEE Transactions on Nanotechnology, as well as the ACM Journal of Emerging Technologies in Computing.
14:30-15:00 Regular paper session (moderator: Bastien Giraud)
- Modeling of Fe-FDSOI FET for Memory and Neuromorphic Applications (S. Chatterjee, IIT Kanpur, IN)
- TC-MEM improvement: TCAM and normal memory in the same circuit (C. Marchand, ECL-INL, FR)
15:00-16:00 Panel – future challenges for ferroelectronics (moderator: Ian O'Connor)
Participants (to be confirmed):
S. Beyer, GlobalFoundries, DE
S. Slesazeck, NaMLab, DE
L. Grenouillet, CEA-Leti, FR
S. Muller, FMC, DE
Laura Bégon-Lours, IBM Research, CH
M. Niemier, U. Notre Dame, US
TOPIC AREAS
You are invited to participate at the DATE 2022 Friday Workshop on Ferroelectronics. The areas of interest include (but are not limited to) the following topics:
- Ferroelectric devices and integration (FeCap, FeFET, BEoL, FEoL …) for digital, analog and bio-inspired computing
- Ferroelectric device modeling
- Non-volatile ferroelectronic memory circuits – bitcells, arrays, peripheral circuitry
- Non-volatile ferroelectronic logic – digital / multi-valued gates and datapaths
- Reconfigurable ferroelectronics
- Analog/digital/bio-inspired in-memory computation with ferroelectrics
- Architectural-level design for processing-in-memory and compute-in-memory with ferroelectronics
- Benchmarking tools for ferroelectronic hardware accelerators
- Fault-tolerance, test and reliability
- Ferroelectronics for hardware security
W03 NeurONN Workshop on Neuromorphic Computing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:30 CET - 17:00 CET
This workshop is supported by Fraunhofer EMFT, Research Institution for Microsystems and Solid State Technologies
Participants can register for the workshop free of charge via the online registration platform.
Workshop Description
The NeurONN project aims to implement a novel and alternative energy-efficient neuromorphic computing paradigm based on oscillatory neural networks (ONNs), using energy-efficient devices such as metal-insulator-transition (MIT) devices to emulate "neurons" and 2D-material memristors to emulate "synapses", in order to achieve truly neuro-inspired computing.
At month 18 (M18) of the NeurONN project, we have developed a digital oscillatory neural network as a proof of concept of the computing-in-phase paradigm. The digital ONN has been implemented on an FPGA and tested on various tasks, such as image recognition on a live camera stream and robot obstacle avoidance with embedded proximity sensors. A robot with eight proximity sensors for obstacle avoidance was developed, and this work is now being transferred to the E4 robot from the AIM partner to embed the ONN in their existing E4 system.
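As a loosely related, purely illustrative sketch of phase-domain computing (not the project's FPGA design), the snippet below simulates a small Kuramoto-style oscillator network whose Hebbian coupling weights store one pattern; starting from a corrupted phase pattern, the oscillators settle back into the stored phase relations, which is the associative-memory flavour of computing in phase. All parameters are invented.

```python
# Toy phase-domain associative memory with coupled oscillators (Kuramoto-style).
# Coupling weights follow a Hebbian rule for one stored +-1 pattern; the network
# is purely illustrative and unrelated to the project's actual FPGA implementation.
import math, random

pattern = [1, -1, 1, 1, -1, -1, 1, -1]          # stored pattern (+-1 per oscillator)
n = len(pattern)
weights = [[pattern[i] * pattern[j] / n for j in range(n)] for i in range(n)]

# Start from a corrupted version (two elements flipped), encoded as phases 0 or pi,
# plus a small random perturbation so the dynamics can leave the unstable state.
random.seed(0)
corrupted = pattern[:]
corrupted[0] *= -1
corrupted[5] *= -1
phase = [(0.0 if s > 0 else math.pi) + random.uniform(-0.1, 0.1) for s in corrupted]

dt, steps = 0.05, 400
for _ in range(steps):
    dphase = [sum(weights[i][j] * math.sin(phase[j] - phase[i]) for j in range(n))
              for i in range(n)]
    phase = [p + dt * d for p, d in zip(phase, dphase)]

# Read out: oscillators in phase with the reference (+1) oscillator represent +1,
# anti-phase oscillators represent -1.
ref = phase[2]                                   # pattern[2] == +1 by construction
readout = [1 if math.cos(p - ref) > 0 else -1 for p in phase]
print(readout == pattern)                        # expected: True, the flips are corrected
```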
W03.0 Welcome Note
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:30 CET - 08:35 CET
Chair:
Jamila Boudaden, Fraunhofer EMFT, DE
Co-Chair:
Eirini Karachristou, CNRS, FR
W03.1 NeurONN Project Overview
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:35 CET - 09:00 CET
Speaker:
Aida Todri-Sanial, CNRS, FR
Time | Label | Presentation Title Authors |
---|---|---|
08:35 CET | W03.1.1 | NEURONN PROJECT OVERVIEW Speaker: Aida Todri-Sanial, CNRS, FR |
W03.2 Projects related to Neuromorphic computing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:00 CET - 10:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
09:00 CET | W03.2.1 | PHOTONIC NEUROMORPHIC COMPUTING Speaker: Frank Brückerhoff-Plückelmann, University of Münster, DE |
09:30 CET | W03.2.2 | ALGORITHM-CIRCUITS-DEVICE CO-DESIGN FOR EDGE NEUROMORPHIC INTELLIGENCE – MEMSCALE PROJECT Speaker: Melika Payvand, University of Zurich and ETH Zurich, CH |
W03.3 Materials and Devices
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:30 CET - 12:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
10:30 CET | W03.3.1 | MODELING UNCONVENTIONAL NANOSCALED DEVICE FABRICATION – MUNDFAB PROJECT Speaker: Peter Pichler, Fraunhofer IISB, DE |
11:10 CET | W03.3.2 | RESISTANCE SWITCHING MATERIALS AND DEVICES FOR NEUROMORPHIC COMPUTING Speaker: Sabina Spiga, CNR - IMM, IT |
11:50 CET | W03.3.3 | OSCILLATING NEURAL NETWORKS POWERED BY PHASE-TRANSITION VO2 NANODEVICES Speaker: Oliver Maher, IBM, CH |
W03.LB Lunch Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:30 CET - 13:30 CET
W03.4 Demonstrators
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 14:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
13:30 CET | W03.4.1 | NEURONN LIVE DEMONSTRATORS Presenter: Madeleine Abernot, CNRS, FR |
13:30 CET | W03.4.2 | NEURONN LIVE DEMONSTRATORS Presenter: Theophile Gonos, A.I.Mergence, FR |
13:30 CET | W03.4.3 | NEURONN LIVE DEMONSTRATORS Speaker: Thierry Gil, CNRS, FR |
W03.5 Neuromorphic Architecture & Design
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:00 CET - 15:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | W03.5.1 | EFFECT OF DEVICE MISMATCHES IN DIFFERENTIAL OSCILLATORY NEURAL NETWORKS Speaker: Jafar Shamsi, University of Calgary, CA |
14:30 CET | W03.5.2 | MACHINE LEARNING FOR THE DESIGN OF WAVE AND OSCILLATOR-BASED COMPUTING DEVICES Speaker: Gyorgy Csaba, Pázmány University Budapest, HU |
W03.CB Coffee Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:00 CET - 15:30 CET
W03.6 Neuromorphic Computing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:30 CET - 17:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
15:30 CET | W03.6.1 | FULLY SPINTRONIC RADIOFREQUENCY NEURAL NETWORKS Speaker: Alice Mizrahi, CNRS/Thales, FR |
16:00 CET | W03.6.2 | ANALOG OSCILLATORY NEURAL NETWORKS FOR ENERGY-EFFICIENT COMPUTING AT THE EDGE Speaker: Corentin Delacour, CNRS, FR |
16:30 CET | W03.6.3 | RELIABLE PROCESSING-IN-MEMORY BASED MANYCORE ARCHITECTURES FOR DEEP LEARNING: FROM CNNS TO GNNS Speaker: Partha Pratim Pande, Washington State University, US |
W04 OSHEAN - Open Source Hardware European Alliances and iNitiatives
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:30 CET - 17:15 CET
Organiser:
Christian Fabre, CEA, FR
The OSHEAN (Open Source Hardware European Alliances Network) workshop aims to contribute to the emergence of a European open source hardware ecosystem for ultra-low-power (ULP), secure microprocessors, microcontrollers and accelerators. The presentations and panels will cover the full value chain of open source hardware, with speakers from academia, foundations and forums, to industry.
They will enable discussion with the audience on how to develop and grow a European open source hardware ecosystem. We will address key issues such as the technical roadmap and gaps for open source hardware, opportunities in academic research, open source development sustainability, reliability of open source IPs, and funding and business aspects. The addressed topics will also cover legal and governance aspects of open source hardware, e.g. licensing, governance, liability and regulatory issues.
Organizers and speakers originate from across the open source value chain and cover key technologies for edge applications across all power and performance ranges, from deeply embedded to high-end computing, as well as other aspects of hardware development that are key to open source acceptance: compilation, open source design tools, simulation, verification, real-time and mixed criticality, including certification guidelines for the design of IPs for safe/secure applications.
Attendees will be able to participate in the start-up of a community whose objective is to create a strong European ecosystem in open source hardware.
08:30 | Opening: | ||||
08:30 | 08:40 | Christian Fabre (CEA) | Welcome & introduction | ||
08:40 | 09:00 | John D. Davis (BSC) | How do we see the challenges in open source hardware? | ||
09:00 | Session on Open Source Hardware Technologies, moderated by Davide Schiavone (OpenHW Group) |
Open Source Hardware (HW) is becoming a de facto industrial standard, with a growing population of available IPs ranging from ultra-low-power edge-computing platforms to high-performance, high-end systems. In particular, RISC-V systems attract interest among processor developers and users thanks to their simple and extensible, yet performant, instruction-set architecture (ISA) across a wide range of performance requirements. In this session, alongside a strategic view of RISC-V in industry, several IPs will be presented to show how open source is shaping the future of academic and industrial solutions. There will be showcases of open source artificial intelligence architectures as well as secure microprocessors. |
09:00 | 09:20 | Davide Rossi (Univ. di Bologna) | Open Source HW for IoT and its impact on the Industrial Ecosystem: the PULP Experience | ||
09:20 | 09:40 | Yves Durand (CEA) | RISC-V & accelerators: enabling variable precision FP computing | ||
09:40 | 10:00 | Gavin Ferris (lowRISC) | lowRISC’s Collaborative Framework for Open Source Silicon Design | ||
10:00 | 10:30 | Q&A and discussion with speakers on Open Source HW technologies. | |||
10:30 | 11:00 | Break | |||
11:00 | Session on Open Source Software & Support Technologies for Open Source Hardware, moderated by Jérôme Quévremont (Thales) |
Besides the IP blocks addressed in the previous session, open source hardware requires several software technologies that benefit from open source collaboration: first of all, software development tools such as compilers, debuggers, bootloaders, operating systems, runtime frameworks, etc.; then design tools such as FPGA compilers, CAD tools, simulators, emulators, validation tools, etc. Last but not least, security benefits from open source at every level of the hardware and software stack. This session will report on several contributions to these support technologies and discuss the main upcoming issues in this field. |
11:00 | 11:15 | Michael Gielda (Antmicro) | Supporting open source hardware to enable commercial adoption | ||
11:15 | 11:30 | Roger Ferrer Ibanez (BSC) | Adding support for RISC-V “V” vector extension in LLVM. | ||
11:30 | 11:45 | Stefan Mangard (TU Graz) | Improving security through open source hardware | ||
11:45 | 12:00 | Frédéric Pétrot (Grenoble INP/TIMA) | Simulation of RISC-V 128-bit extension in QEMU | ||
12:00 | 12:30 | Q&A and discussion with the speakers on Open Software & Support Technologies for Open Source Hardware. | |||
12:30 | 13:30 | Lunch | |||
13:30 | Keynote by Calista Redmond (RISC-V International): RISC-V Open Era of Computing: Innovation, adoption, and opportunity in Europe and beyond, moderated by Christian Fabre (CEA) |
RISC-V is the undisputed lead architecture that has ushered in a profound new open era in compute. The innovations and implementations of RISC-V span from embedded to enterprise, from IoT to HPC. RISC-V is delivering on the extensions, tools, and investments of a global community ranging from start-ups to multi-nationals, from students to research fellows. This talk will highlight that progress and opportunity, with an invitation to engage. |
14:00 | Panel on Licensing, Funding, Cooperation & Regulation, moderated by Andrew Katz (Open Forum Europe) |
There are several non-technical challenges for Open Source Hardware that should be addressed at the regulatory and policy level. There is a need to research funding mechanisms that could increase Europe's influence and the level of participation of, and cooperation between, SMEs, academia and large companies, as well as to clarify the regulatory and licensing issues that might stifle innovation. Are there policy solutions that can support such developments and Open Source Hardware initiatives? Who can be a driver of change? In this panel the panellists will discuss these pertinent issues and share their experiences. |
14:00 | 14:05 | Andrew Katz (Open Forum Europe) | Introduction of the panel's topics and panelists | ||
14:05 | 14:10 | Mike Milinkovich (Eclipse Foundation) | After eating software, Open is “eating” everything! | ||
14:10 | 14:15 | Romano Hoofman (Europractice) | EUROPRACTICE as Breeding Ground for European Open Source Hardware Initiatives | ||
14:15 | 14:20 | Javier Serrano (CERN) | Funding open source hardware: getting the best from publicly funded research through commercial partnerships | ||
14:20 | 14:25 | Arian Zwegers (Head of Sector, European Commission) | Reinforcing large-scale design capacities: a partial view from a funding agency | ||
14:25 | 14:30 | Calista Redmond (RISC-V International) | Building European RISC-V Leadership in Global Open Source Hardware | ||
14:30 | 14:35 | Andrew Katz (Open Forum Europe) | What are legal challenges for widespread use of open source HW? What are the licensing issues? | ||
14:35 | 15:30 | Panel on Licensing, Funding, Cooperation & Regulation. | |||
15:30 | 15:45 | Break | |||
15:45 | Panel on Industrial Concerns, moderated by Frank K. Gürkaynak (ETH Zürich) |
Open source hardware, especially around the RISC-V architecture, has been a talking point in recent years. Development has been rapid: from humble beginnings, when open source hardware was a niche shared by enthusiasts and academics, we now have multi-billion dollar companies and, recently, a commitment from the European Commission to support work on open source hardware. In this session, we would like to go beyond the buzz and discuss with people involved in industry what opportunities they see, what the potential roadblocks are, and what they think is still missing. |
15:45 | 15:50 | Frank K. Gürkaynak (ETH Zürich) | Introduction of the panel's topics and panelists | ||
15:50 | 15:55 | Rick O'Connor (OpenHW Group) | OpenHW CORE-V: RISC-V open-source cores for high volume production SoCs | ||
15:55 | 16:00 | Loïc Lietar (GreenWaves) | Leveraging open source hardware in commercial products: benefits and challenges | ||
16:00 | 16:05 | Matthias Hiller (Fraunhofer, ECSO) | Can open source HW address industrial concerns for cybersecurity and trusted electronics? | ||
16:05 | 16:10 | Jean-Christian Kircher(Bosch France) | Industrial requirements for open source hardware | ||
16:10 | 16:15 | Thierry Collette (Thales) | THALES' perspectives on open source hardware | ||
16:15 | 16:20 | Zdeněk Přikryl (Codasip) | Industrial concerns about open hardware | ||
16:20 | 17:15 | Panel with the speakers on Industrial concerns of open source hardware. | |||
17:15 | Closing |
The organizers of the OSHEAN workshop are:
- John D. Davis, Group Manager of the European Exascale Accelerator at Barcelona Supercomputing Center (Spain). Organiser of workshops on RISC-V and OpenPOWER.
- Christian Fabre, Research Engineer at CEA LIST (Grenoble, France). Organiser of workshops on RISC-V and open source hardware.
- Benedikt Gierlichs, research expert on embedded security at KU Leuven (Belgium).
- Paula Grzegorzewska, Senior Policy Advisor at OpenForum Europe (Brussels, Belgium).
- Frank K. Gürkaynak, senior scientist of the group of Digital Circuits & Systems, Department of Information Technology and Electrical Engineering (D-ITET), ETH Zurich (Switzerland).
- Jérôme Quévremont, open hardware project leader at Thales R&T (Palaiseau, France), contributor to RISC-V International and the OpenHW Group.
- Davide Rossi, Associate Professor at the University of Bologna (Italy), contributing to open source hardware since 2013 through the PULP project.
- Davide Schiavone is Director of Engineering at OpenHW Group and coordinator of the OpenHW Europe Working Group under the Eclipse Foundation (Germany).
- Stefan Wallentowitz, professor at Munich University of Applied Sciences (Germany), director, Free and Open Source Silicon Foundation, & board member, RISC-V International.
W04.0 Opening
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:30 CET - 09:00 CET
Session chair:
Christian Fabre, CEA, FR
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | W04.0.1 | WELCOME & INTRODUCTION Speaker: Christian Fabre, CEA, FR |
08:40 CET | W04.0.2 | HOW DO WE SEE THE CHALLENGES IN OPEN SOURCE HARDWARE? Speaker: John Davis, BSC, ES |
W04.1 Open Source Hardware Technologies
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:00 CET - 10:30 CET
Session chair:
Davide Schiavone, OpenHW Group, IT
Open Source Hardware (HW) is becoming a de facto industrial standard, with a growing population of available IPs ranging from ultra-low-power edge-computing platforms to high-performance, high-end systems. In particular, RISC-V systems attract interest among processor developers and users thanks to their simple and extensible, yet performant, instruction-set architecture (ISA) across a wide range of performance requirements. In this session, alongside a strategic view of RISC-V in industry, several IPs will be presented to show how open source is shaping the future of academic and industrial solutions. There will be showcases of open source artificial intelligence architectures as well as secure microprocessors.
Time | Label | Presentation Title Authors |
---|---|---|
09:00 CET | W04.1.1 | OPEN SOURCE HW FOR IOT AND ITS IMPACT ON THE INDUSTRIAL ECOSYSTEM: THE PULP EXPERIENCE Speaker: Davide Rossi, Univ. di Bologna, IT |
09:20 CET | W04.1.2 | RISC-V & ACCELERATORS: ENABLING VARIABLE PRECISION FP COMPUTING Presenter: Yves Durand, CEA, FR |
09:40 CET | W04.1.3 | LOWRISC’S COLLABORATIVE FRAMEWORK FOR OPEN SOURCE SILICON DESIGN Presenter: Gavin Ferris, lowRISC, GB |
10:00 CET | W04.1.4 | PANEL WITH SPEAKERS ON OPEN SOURCE HW TECHNOLOGIES. Panellists: Yves Durand1, Davide Rossi2 and Gavin Ferris3 1CEA, FR; 2Univ. di Bologna, IT; 3lowRISC, GB Moderator: Davide Schiavone, OpenHW Group, IT Abstract Q&A session with the audience and panel discussion on Open Source HW technologies. |
W04.2 Open Source Software & Support Technologies for Open Source Hardware
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:00 CET - 12:30 CET
Session chair:
Jérôme Quévremont, Thales, FR
Besides the IP blocks addressed in the previous session, open source hardware requires several software technologies that benefit from open source collaboration: first of all, software development tools such as compilers, debuggers, bootloaders, operating systems, runtime frameworks, etc.; then design tools such as FPGA compilers, CAD tools, simulators, emulators, validation tools, etc. Last but not least, security benefits from open source at every level of the hardware and software stack. This session will report on several contributions to these support technologies and discuss the main upcoming issues in this field.
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | W04.2.1 | SUPPORTING OPEN SOURCE HARDWARE TO ENABLE COMMERCIAL ADOPTION Presenter: Michael Gielda, Antmicro, PL |
11:15 CET | W04.2.2 | ADDING SUPPORT FOR RISC-V “V” VECTOR EXTENSION IN LLVM. Presenter: Roger Ferrer Ibanez, BSC, ES |
11:30 CET | W04.2.3 | IMPROVING SECURITY THROUGH OPEN SOURCE HARDWARE Presenter: Stefan Mangard, TU Graz, AT |
11:45 CET | W04.2.4 | SIMULATION OF RISC-V 128-BIT EXTENSION IN QEMU Presenter: Frédéric Pétrot, Grenoble INP/TIMA, FR |
12:00 CET | W04.2.5 | PANEL WITH THE SPEAKERS ON OPEN SOFTWARE & SUPPORT TECHNOLOGIES FOR OPEN SOURCE HARDWARE Panellists: Michael Gielda1, Roger Ferrer Ibanez2, Stefan Mangard3 and Frédéric Pétrot4 1Antmicro, PL; 2BSC, ES; 3TU Graz, AT; 4Grenoble INP/TIMA, FR Moderator: Jérôme Quévremont, Thales, FR Abstract Q&A session with the audience and panel discussion on Open Software & Support Technologies for Open Source Hardware. |
W04.3 Keynote - RISC-V Open Era of Computing: Innovation, adoption, and opportunity in Europe and beyond
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 14:00 CET
Keynote Speaker:
Calista Redmond, RISC-V International, US
RISC-V is the undisputed lead architecture that has ushered in a profound new open era in compute. The innovations and implementations of RISC-V span from embedded to enterprise, from IoT to HPC. RISC-V is delivering on the extensions, tools, and investments of a global community ranging from start-ups to multi-nationals, from students to research fellows. This talk will highlight that progress and opportunity, with an invitation to engage.
W04.4 Panel on Licensing, Funding, Cooperation & Regulation
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:00 CET - 15:30 CET
Session chair:
Andrew Katz, OpenForum Europe, GB
There are several non-technical challenges for Open Source Hardware that should be addressed at the regulatory and policy level. There is a need to research funding mechanisms that could increase Europe's influence and the level of participation of, and cooperation between, SMEs, academia and large companies, as well as to clarify the regulatory and licensing issues that might stifle innovation. Are there policy solutions that can support such developments and Open Source Hardware initiatives? Who can be a driver of change? In this panel the panellists will discuss these pertinent issues and share their experiences.
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | W04.4.1 | INTRODUCTION OF THE PANEL'S TOPICS AND PANELISTS Presenter: Andrew Katz, OpenForum Europe, GB |
14:05 CET | W04.4.2 | AFTER EATING SOFTWARE, OPEN IS “EATING” EVERYTHING! Panellist: Mike Milinkovich, Eclipse Foundation, BE |
14:10 CET | W04.4.3 | EUROPRACTICE AS BREEDING GROUND FOR EUROPEAN OPEN SOURCE HARDWARE INITIATIVES Presenter: Romano Hoofman, Europractice, BE |
14:15 CET | W04.4.4 | FUNDING OPEN SOURCE HARDWARE: GETTING THE BEST FROM PUBLICLY FUNDED RESEARCH THROUGH COMMERCIAL PARTNERSHIPS Presenter: Javier Serrano, CERN, CH |
14:20 CET | W04.4.5 | REINFORCING LARGE-SCALE DESIGN CAPACITIES: A PARTIAL VIEW FROM A FUNDING AGENCY Presenter: Arian Zwegers, European Commission, BE |
14:25 CET | W04.4.6 | BUILDING EUROPEAN RISC-V LEADERSHIP IN GLOBAL OPEN SOURCE HARDWARE Presenter: Calista Redmond, RISC-V International, US |
14:30 CET | W04.4.7 | WHAT ARE LEGAL CHALLENGES FOR WIDESPREAD USE OF OPEN SOURCE HW? WHAT ARE THE LICENSING ISSUES? Presenter: Andrew Katz, OpenForum Europe, GB |
14:35 CET | W04.4.8 | PANEL DISCUSSION ON LICENSING, FUNDING, COOPERATION & REGULATION. Panellists: Romano Hoofman1, Javier Serrano2, Arian Zwegers3, Calista Redmond4 and Mike Milinkovich5 1Europractice, BE; 2CERN, CH; 3European Commission, BE; 4RISC-V International, US; 5Eclipse Foundation, BE Moderator: Andrew Katz, OpenForum Europe, GB Abstract Q&A session with the audience and panel discussion on Licensing, Funding, Cooperation & Regulation |
W04.5 Panel on Industrial Concerns
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:45 CET - 17:15 CET
Session chair:
Frank Gürkaynak, ETH Zürich, CH
Open source hardware, especially around the RISC-V architecture, has been a talking point in recent years. Development has been rapid: from humble beginnings, when open source hardware was a niche shared by enthusiasts and academics, we now have multi-billion-dollar companies and, recently, a commitment from the European Commission to support work in open source hardware. In this session, we would like to go beyond the buzz and discuss with people involved in industry what opportunities they see, what the potential roadblocks are, and what they think is still missing.
Time | Label | Presentation Title Authors |
---|---|---|
15:45 CET | W04.5.1 | INTRODUCTION OF THE PANEL'S TOPICS AND PANELISTS Presenter: Frank Gürkaynak, ETH Zürich, CH |
15:50 CET | W04.5.2 | OPENHW CORE-V: RISC-V OPEN-SOURCE CORES FOR HIGH VOLUME PRODUCTION SOCS Presenter: Rick O'Connor, OpenHW Group, CA |
15:55 CET | W04.5.3 | LEVERAGING OPEN SOURCE HARDWARE IN COMMERCIAL PRODUCTS: BENEFITS AND CHALLENGES Presenter: Loïc Lietar, GreenWaves, FR |
16:00 CET | W04.5.4 | CAN OPEN SOURCE HW ADDRESS INDUSTRIAL CONCERNS FOR CYBERSECURITY AND TRUSTED ELECTRONICS? Presenter: Matthias Hiller, Fraunhofer, ECSO, DE |
16:05 CET | W04.5.5 | INDUSTRIAL REQUIREMENTS FOR OPEN SOURCE HARDWARE Presenter: Jean-Christian Kircher, Bosch, FR |
16:10 CET | W04.5.6 | THALES' PERSPECTIVES ON OPEN SOURCE HARDWARE Presenter: Thierry Collette, Thales, FR |
16:15 CET | W04.5.7 | INDUSTRIAL CONCERNS ABOUT OPEN HARDWARE Presenter: Zdeněk Přikryl, Codasip, CZ |
16:20 CET | W04.5.8 | PANEL DISCUSSION ON INDUSTRIAL CONCERNS Panellists: Rick O'Connor1, Loïc Lietar2, Matthias Hiller3, Jean-Christian Kircher4, Thierry Collette5 and Zdeněk Přikryl6 1OpenHW Group, CA; 2GreenWaves, FR; 3Fraunhofer, ECSO, DE; 4Bosch, FR; 5Thales, FR; 6Codasip, CZ Moderator: Frank Gürkaynak, ETH Zürich, CH Abstract Q&A session with the audience and panel discussion on industrial concerns of open source hardware. |
W04.0 Opening
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:30 CET - 09:00 CET
Session chair:
Christian Fabre, CEA, FR
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | W04.0.1 | WELCOME & INTRODUCTION Speaker: Christian Fabre, CEA, FR |
08:40 CET | W04.0.2 | HOW DO WE SEE THE CHALLENGES IN OPEN SOURCE HARDWARE? Speaker: John Davis, BSC, ES |
W09 Sustainability in Security, Security for Sustainability
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:30 CET - 17:00 CET
Organisers:
Elif Bilge Kavun, University of Passau, DE
Francesco Regazzoni, University of Amsterdam, NL and Università della Svizzera italiana, CH
Speakers:
Nikolaos Athanasios Anagnostopoulos, University of Passau & TU Darmstadt, DE
Md Masoom Rabbani, ES&S, imec-COSIC, ESAT, BE
Owen Millwood, University of Sheffield, GB
Harishma Boyapally, Indian Institute of Technology Kharagpur, IN
Kanad Basu, University of Texas at Dallas, US
Apostolos Fournaris, ISI, GR
Keynote Speakers:
Jan Tobias Muehlberg, IMEC-DistriNet, KU Leuven, BE
Paola Grosso, University of Amsterdam, NL
Ilia Polian, University of Stuttgart, DE
Scope and aim: Security is a fundamental extra-functional requirement that systems should provide. As such, it should be implemented in a sustainable way, namely with very limited energy consumption and with at least some support for crypto-agility (so that security primitives can be updated rather than whole devices replaced). These two properties are challenging to achieve, since new attacks and weaknesses are discovered every day and simple updates may not be sufficient to defeat them. The situation is further complicated by the fact that, at this moment, families of cryptographic algorithms are being replaced by new standards (such as the post-quantum ones).
Security can in turn be of great help in supporting sustainability, for instance by allowing secure updates of devices and enabling maintenance that extends the devices' lifetime. Yet, support for these features should be studied in depth and fully understood to avoid the involuntary introduction of security weaknesses.
This workshop addresses the relation between sustainability and security from both sides, discussing what can be done to make security more sustainable and presenting what security can offer to make electronic devices more sustainable.
Topic Areas – You are invited to participate and submit your contributions to the Sustainability in Security, Security for Sustainability Workshop. The workshop’s areas of interest include (but are not limited to) the following topics:
- Low-energy cryptographic implementations
- Reusable and backward compatible security solutions
- Crypto agility
- Lightweight cryptography
- Support for secure and reliable updates
The workshop program is available on the workshop website: http://sussec22.alari.ch/
W09.0 Openings and Welcome
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:00 CET - 09:15 CET
Organisers:
Francesco Regazzoni, University of Amsterdam, NL and Università della Svizzera italiana, CH
Elif Bilge Kavun, University of Passau, DE
W09.1 Keynote 1
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:15 CET - 10:00 CET
Session chair:
Subhadeep Banik, Università della Svizzera italiana, CH
Time | Label | Presentation Title Authors |
---|---|---|
09:15 CET | W09.1.1 | NANOELECTRONICS: AN ENABLER FOR SUSTAINABLE SECURITY Keynote Speaker: Ilia Polian, University of Stuttgart, DE |
W09.2 Session 1: Lightweight Physical Primitives for Security
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:15 CET - 11:15 CET
Session chair:
Apostolos Fournaris, ISI, GR
Time | Label | Presentation Title Authors |
---|---|---|
10:15 CET | W09.2.1 | ON THE SUSTAINABILITY OF LIGHTWEIGHT CRYPTOGRAPHY BASED ON FLASH PUF Speaker: Nikolaos Athanasios Anagnostopoulos, University of Passau & TU Darmstadt, DE |
10:35 CET | W09.2.2 | A ONE-TIME PUF PROTOCOL FOR IMPROVING SUSTAINABILITY AND SECURITY FOR HARDWARE-ROOTED LIGHTWEIGHT SECURITY SYSTEMS Speaker: Owen Millwood, University of Sheffield, GB |
10:55 CET | W09.2.3 | PHYSICALLY RELATED FUNCTIONS: A NEW HARDWARE SECURITY PRIMITIVE FOR LIGHTWEIGHT CRYPTOGRAPHIC PROTOCOLS Speaker: Harishma Boyapally, Indian Institute of Technology Kharagpur, IN |
W09.3 Keynote 2
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:30 CET - 12:30 CET
Session chair:
Francesco Regazzoni, University of Amsterdam, NL and Università della Svizzera italiana, CH
Time | Label | Presentation Title Authors |
---|---|---|
11:30 CET | W09.3.1 | SUSTAINABLE SECURITY: WHAT DO WE SUSTAIN, AND FOR WHOM? Keynote Speaker: Jan Tobias Muehlberg, IMEC-DistriNet, KU Leuven, BE |
W09.4 Keynote 3
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 14:30 CET
Session chair:
Elif Bilge Kavun, University of Passau, DE
Time | Label | Presentation Title Authors |
---|---|---|
13:30 CET | W09.4.1 | PROGRAMMABILITY AND SUSTAINABILITY IN THE FUTURE INTERNET Keynote Speaker: Paola Grosso, University of Amsterdam, NL |
W09.5 Session 2: Lightweight Security & Safety for Emerging Technologies
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:45 CET - 15:45 CET
Session chair:
Ilia Polian, University of Stuttgart, DE
Time | Label | Presentation Title Authors |
---|---|---|
14:45 CET | W09.5.1 | FUNCTIONAL SAFETY OF DEEP NEURAL NETWORK ACCELERATOR Speaker: Kanad Basu, University of Texas at Dallas, US |
15:05 CET | W09.5.2 | SUSTAINABLE LATTICE BASED CRYPTOGRAPHY USING OPENCL NUMBER-THEORETIC TRANSFORM Speaker: Apostolos Fournaris, ISI, GR |
15:25 CET | W09.5.3 | REMOTE ATTESTATION OF IOT DEVICES: PAST, PRESENT AND FUTURE Speaker: Md Masoom Rabbani, ES&S, imec-COSIC, ESAT, BE |
W09.6 Panel: Is Security An Enabler, An Enemy or Simply A Nightmare for Sustainability?
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 16:00 CET - 17:00 CET
Session Chairs:
Francesco Regazzoni, University of Amsterdam, NL and Università della Svizzera italiana, CH
Elif Bilge Kavun, University of Passau, DE
Panellists:
David Bol, UC Louvain, BE
Yuri Demchenko, University of Amsterdam, NL
Marc Stoettinger, Rhein Main University of Applied Sciences, DE
Ruggero Susella, ST Microelectronics, IT
W10 Friday Interactive Day of the Special Initiative on Autonomous Systems Design
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:30 CET - 17:00 CET
Organisers:
Chung-Wei Lin, National Taiwan University, TW
Sebastian Steinhorst, TU Munich, DE
Sponsors
Thanks to the sponsorship, participants can register for the workshop free of charge via the online registration platform.
Program
W10.S1 Opening & Keynote
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:30 CET - 09:30 CET
Session chair:
Rolf Ernst, TU Braunschweig, DE
Speaker:
Joseph Sifakis, Verimag Laboratory, FR
8:30 - 8:45: Opening of ASD Friday Interactive Day, Introduction of Sponsors
8:45 - 9:30: Keynote: Trustworthy Autonomous Systems Development
Keynote Speaker: Joseph Sifakis, Verimag Laboratory
Session Chair: Rolf Ernst, TU Braunschweig, DE
Abstract: Autonomous systems emerge from the need to automate existing organizations by progressively replacing human operators with autonomous agents. Their development raises multi-faceted challenges, which go well beyond the limits of weak AI.
We attempt an analysis of the current state of the art, focusing on design and validation.
First, we explain that existing approaches to agent design are unsatisfactory. Traditional model-based approaches are defeated by the complexity of the problem, while solutions based on end-to-end machine learning fail to provide the necessary trustworthiness guarantees. We advocate "hybrid design" solutions that take the best of each approach and seek tradeoffs between trustworthiness and performance. In addition, we argue that traditional case-by-case risk analysis and mitigation techniques are failing to scale, and we discuss the trend away from correctness at design time and toward reliance on runtime assurance techniques.
Second, we explain that simulation and testing remain the only realistic approach for global validation, and we show how current methods and practices can be transposed to autonomous systems by identifying the technical requirements involved.
We conclude by discussing the factors that will play a decisive role in the acceptance of autonomous systems, and by arguing for the urgent need for new theoretical foundations.
W10.S2 Human-Machine Systems
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:30 CET - 10:45 CET
Session chair:
Chung-Wei Lin, National Taiwan University, TW
Speaker & Panellists:
Meike Jipp, German Aerospace Center (DLR), DE
Lv Chen, Nanyang Technological University, SG
Shunsuke Aoki, National Institute of Informatics, JP
Autonomous vehicle technology continues to advance, but it will be a long time before human-driven vehicles are completely replaced by fully autonomous ones. Therefore, mixed traffic environments need to be handled to provide safe, efficient, and comfortable transportation. In this session, the experts will discuss the roles of autonomous vehicles, human-driven vehicles, and roadside units and share their visions on human-machine systems. The ultimate goal is to support a smooth transition from current transportation to autonomous transportation.
Agenda of Talks and Speakers/Panelists:
- 09:30 - 09:45: "Towards Cooperative Autonomous Vehicles for Mixed Traffic Environments", Shunsuke Aoki, National Institute of Informatics, JP
- 09:45 - 10:00: Meike Jipp, German Aerospace Center (DLR), DE
- 10:00 - 10:15: "Human-Like Autonomous Driving and Human-Machine Systems", Chen Lv, Nanyang Technological University, SG
- 10:15 - 10:45: Interactive panel discussion on Human-Machine Systems
W10.B1 Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:45 CET - 11:00 CET
W10.S3 Hardware and Components
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:00 CET - 12:15 CET
Session chair:
Sebastian Steinhorst, TU Munich, DE
Speaker & Panellists:
Bart Vermeulen, NXP, NL
Hai (Helen) Li, Duke University, US
Luca Benini, ETH Zurich, CH
For the design of autonomous systems, powerful hardware and system components are as much a core enabler of advanced autonomy as the software running on them. With the increase in cognitive capabilities enabled by integrated computation and sensing platforms, many opportunities as well as challenges arise in making hardware and AI-centric software operate fully synergistically and, hence, reach their full potential. The purpose of this session is to discuss the latest trends in hardware components and their design aspects for the efficient and holistic integration of computation, sensing and communication.
Agenda of Talks and Speakers/Panelists:
- 11:00 - 11:15: "PULP: An Open Ultra Low-Power Platform for Autonomous Nano Drones", Luca Benini, ETH Zürich, CH
- 11:15 - 11:30: "Challenges & Solutions for Next-Generation E/E Architectures", Bart Vermeulen, NXP, NL
- 11:30 - 11:45: "Efficient Machine Learning: Algorithms-Circuits-Devices Co-design", Hai (Helen) Li, Duke University, US
- 11:45 - 12:15: Interactive panel discussion on Hardware and Components
W10.B2 Lunch Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:15 CET - 13:30 CET
W10.S4 Panel on Autonomous Systems Design
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 15:00 CET
Session Chair:
David Harel, Weizmann Institute of Science, IL
Panellists:
Sandeep Neema, DARPA, US
Alberto Sangiovanni, UC Berkeley, US
Carlo Ghezzi, Politecnico di Milano, IT
Simon Burton, Fraunhofer, DE
Michael Paulitsch, Intel, DE
Arne Haman, Bosch Research, DE
Organizers:
- David Harel, Weizmann Institute of Science
- Joseph Sifakis, Verimag Laboratory
Moderator:
- David Harel, Weizmann Institute of Science
Topics to be discussed by the panel:
1) What is your vision for AS? For example, the role of these systems in the IoT and AI revolutions; autonomy as a step from weak to strong AI; the gap between automated and autonomous systems.
2) What challenges do you see in AS design? For example, AI-enabled end-to-end solutions; "hybrid design" approaches, integrating model- and data-driven components; systems engineering issues.
3) How should we ensure the reliability of AS? For example, achieving explainable AI; adapting and extending rigorous V&V techniques to ASs; ensuring safety based exclusively on simulation and testing.
4) Looking to the future, is the vision of total autonomy viable? How can we make it happen? For example, decisive factors for acceptance; research challenges; ethical issues; "easy" total autonomy categories.
W10.B3 Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:00 CET - 15:10 CET
W10.S5 V2X, Edge Computing and Connected Applications
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:10 CET - 16:50 CET
Session chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE
Speaker & Panellists:
Frank Hofmann, Bosch, DE
Chun-Ting Chou, OmniEyes, TW
Ziran Wang, Toyota Motor North America R&D, US
Stefano Marzani, Amazon, US
Connectivity enables many advanced applications for vehicles. In particular, the interactions between vehicles and edge servers (or roadside units) further boost this trend and involve more players in the business. In this session, experts from an automaker, a supplier, a high-tech company, and a start-up will come together, describe their roles in the connected and edge-computing environment, and discuss potential integration or competition.
Agenda of Talks and Speakers/Panelists:
- 15:10 - 15:25: "Video Uberization Using Edge AI and Mobile Video", Chun-Ting Chou, OmniEyes, TW
- 15:25 - 15:40: "Connected Applications as Driver for Automation", Frank Hofmann, Bosch, DE
- 15:40 - 15:55: "Environmental parity between cloud and embedded edge as a foundation for software-defined vehicles", Stefano Marzani, Amazon, US
- 15:55 - 16:10: "Mobility Digital Twin with Connected Vehicles and Cloud Computing", Ziran Wang, Toyota Motor North America R&D, US
- 16:10 - 16:50: Interactive panel discussion on V2X, Edge Computing, and Connected Applications
W10.S6 Closing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 16:50 CET - 17:00 CET
W03.1 NeurONN Project Overview
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:35 CET - 09:00 CET
Speaker:
Aida Todri-Sanial, CNRS, FR
Time | Label | Presentation Title Authors |
---|---|---|
08:35 CET | W03.1.1 | NEURONN PROJECT OVERVIEW Speaker: Aida Todri-Sanial, CNRS, FR |
W01.1 Keynote: From combustion towards electrical cars
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:40 CET - 09:40 CET
Keynote Speaker:
Riccardo Groppo, Ideas and Motion, IT
Short bio: Riccardo Groppo received his MSc degree in Electronic Engineering from the Politecnico di Torino (Turin, Italy). He is the co-founder and CEO of Ideas & Motion, a high-tech company focused on IP development on silicon and the design of complex automotive control systems for niche applications. He is the Chairman of the Transportation Working Group and Board Vice-Chairman within EPoSS (European Platform on Smart Systems Integration). He is a member of the Technical Committee of several major international events (SAE World Congress, AMAA Conference and Smart System Integration Conference). He started his career with Honeywell Bull and then joined Centro Ricerche FIAT (CRF) in 1989, where he was involved in the design of innovative engine/vehicle automotive control systems. He was a member of the CRF team that developed the first automotive Common Rail system for a direct-injection Diesel engine. He was then involved in the design and industrialization of the MultiAir technology and the dry dual-clutch transmission. He was Head of the Automotive Electronics Design and Development Dept. at CRF (2002-2013), where he promoted the design of IP building blocks by means of ASIC technology in cooperation with Freescale Semiconductor and Robert BOSCH for FIAT/Chrysler applications. Those smart drivers are the de facto standard in automotive powertrain applications, with volumes exceeding 17 million parts/year. He holds more than 31 patents in the field of automotive electronics and embedded systems, most of which are currently in production on passenger cars.
W02 3D Integration: Heterogeneous 3D Architectures and Sensors
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:45 CET - 17:30 CET
Organiser:
Pascal VIVET, CEA-LIST, IRT Nanoelec, FR
Speakers:
Gianna Paulin, ETHZ, CH
Mauricio Altieri, CEA, FR
Prachi Shukla, Boston University, US
Denis Dutoit, CEA, FR
Shimeng Yu, GeorgiaTech, US
Andy Heinig, Fraunhofer, DE
Tathagata Srimani, Stanford University, US
Ricardo Carmona-Galán, CSIC-University of Seville, ES
Rajiv Mongia, INTEL, US
Jean-Luc Jaffard, PROPHESEE, FR
Pascal Vivet, CEA, FR
Anthony Mastroianni, SIEMENS EDA, US
Cédric Tubert, STMicroelectronics, FR
Sung-Kyu Lim, GeorgiaTech, US
Umesh Chand, NTU, SG
Keynote Speaker:
Seung-Chul (SC) Song, QUALCOMM, US
Workshop Description
3D technologies are becoming more and more pervasive in digital architectures, as a strong enabler for heterogeneous integration. Due to the large amount of data they require and the associated memory capacity, machine learning and AI accelerators could benefit from 3D integration not only in High Performance Computing (HPC), but also at the edge and in embedded HPC. 3D integration and the associated architectures open a wide spectrum of system solutions, from chiplet-based partitioning for High Performance Computing to various sensors such as fully integrated image sensors embedding AI features, and to the tight 3D coupling of computing and memory enabling an efficient in-memory-computing paradigm.
The goal of the 3D Integration Workshop is to bring together experts from both academia and industry, interested in this exciting and rapidly evolving field, in order to update each other on the latest state-of-the-art, exchange ideas, and discuss future challenges.
This one-day event consists of two keynotes and four sessions, with invited and submitted presentations. Previous editions of this workshop took place regularly in conjunction with earlier editions of the DATE conference.
Workshop Committee
- General co-Chairs:
- P. Vivet, CEA-LIST, IRT Nanoelec (FR)
- M. Badaroglu, Qualcomm (BE)
- Program co-Chairs :
- P. Ramm, Fraunhofer EMFT (DE)
- S. Mukhopadhyay, Georgia Tech (USA)
- Special Session Chair
- S. Mitra, Stanford University (USA)
- Industrial Liaison Chair
- Eric Ollier, CEA-Leti, IRT Nanoelec (FR)
Sponsorship
The DATE'2022 3D Integration: Heterogeneous 3D Architectures and Sensors workshop is technically co-sponsored by IRT Nanoelec.
Registration
For workshop registration, please follow the regular DATE registration web site: online registration platform.
Technical Program
(All times are given in CET time, European Time, UTC+1).
For time zones, note that the USA switches to daylight saving time on 13 March, while France switches on 27 March. For the workshop on 18 March, there will therefore be an 8-hour difference between PST and CET (instead of the usual 9 hours).
Workshop Start
8:45 – 9:00 Welcome Note from Organizers
First Keynote
Session Chair : Mustafa Badaroglu, Qualcomm, BE
9:00 – 9:45 “System Design Technology Co-Optimization for 3D Integration”,
Seung-Chul (SC) Song, Qualcomm, USA
Session 1 : Chiplet Partitioning and System Level Design
Session Chair : Pascal Vivet, CEA-LIST, France
09:45 – 10:05 “Occamy - A 2.5D Chiplet System for Ultra-Efficient Floating-Point Computing”,
Gianna Paulin, ETHZ, Switzerland.
10:05 – 10:25 “Chiplet Based Architecture : an answer for Europe Sovereignty in Computing ?”,
Denis Dutoit, CEA, France.
10:25 – 10:45 “Automotive electronic control unit (ECU) for ADAS application based on a Chiplet approach”,
Andy Heinig, Fabian Hopsch, Fraunhofer, Germany
10:45 – 11:15 Coffee break
Session 2 : 3D and Image Sensors
Session Chair : Eric Ollier, CEA-Leti, France
11:15 – 11:35 “Efficient image feature extraction exploiting massive parallelism through 3D-integration”,
Ricardo Carmona-Galán, CSIC-University of Seville, Spain.
11:35 – 11:55 “3D Integration for Smart Event Vision Sensors”,
Jean-Luc Jaffard, PROPHESEE, France.
11:55 – 12:15 “3D-stacked CMOS Image Sensors for High Performance Indirect Time-of-Flight”,
Cédric Tubert, STMicroelectronics, France
12:15 – 13:30 Lunch
Session 3 : Ultra High Density of 3D, Monolithic 3D
Session Chair : Saibal Mukhopadhyay, GeorgiaTech, USA
13:30 – 13:50 “Thin-film based monolithic 3D systems”,
Umesh Chand, Sonu Devi, Hasita Veluri, Aaron Thean, Mohamed Sabry Aly, NTU, Singapore.
13:50 – 14:10 “Temperature-Aware Monolithic 3D DNN Accelerators for Biomedical Applications”,
Prachi Shukla, Vasilis F. Pavlidis, Emre Salman, Ayse K. Coskun, Boston Univ., USA; Univ. of Manchester, UK; Stony Brook Univ., USA.
14:10 – 14:30 “A Compute-in-Memory Hardware Accelerator Design with BEOL Transistor based Reconfigurable Interconnect”,
Shimeng Yu, GeorgiaTech, USA; Suman Datta, Notre Dame Univ., USA.
14:30 - 14:50 “Nanosystems for Energy-Efficient Computing using Carbon Nanotube FETs and Monolithic 3D Integration”,
Tathagata Srimani, Stanford University, USA.
14:50 – 15:15 Coffee break
Second Keynote
Session Chair : Subhasish Mitra, Univ. Stanford, USA.
15:15 – 15:45 “3D stacking opportunities for Augmented Reality hardware systems”,
Edith Beigne, META, USA
Session 4 : 3D Design, Methodology and Thermal
Session Chair : Peter Ramm, Fraunhofer EMFT, Germany.
15:45 – 16:05 “EDA Tools and PPA Tradeoff Studies for Micro-bump and Hybrid Bond 3D ICs”,
Sung-Kyu Lim, GeorgiaTech, USA.
16:05 – 16:25 “Heterogeneous Packaging Design and Verification Workflows”,
Anthony Mastroianni, SIEMENS EDA, USA.
16:25 – 16:45 “Towards a Place and Route Flow for High Density 3D-ICs”,
Mauricio Altieri, Olivier Billoint, Sebastien Thuries and Pascal Vivet, CEA, France.
16:45 – 17:05 “Challenges and Opportunities for Thermals in Heterogeneous 3D Packaging”,
Rajiv Mongia, INTEL, USA.
Closing
17:05 - 17:30 Closing Remarks
Technical Program
(All times are given in CET time, European Time, UTC+1).
(same information but using DATE web format for information replication within the DATE'2022 Virtual Showcase)
W02.0 Workshop Introduction
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 08:45 CET - 09:45 CET
Session chair:
Mustafa Badaroglu, QUALCOMM, BE
Time | Label | Presentation Title Authors |
---|---|---|
08:45 CET | W02.0.1 | WELCOME NOTE FROM ORGANISERS Speaker: Pascal Vivet, CEA, FR |
09:00 CET | W02.0.2 | KEYNOTE: SYSTEM DESIGN TECHNOLOGY CO-OPTIMIZATION FOR 3D INTEGRATION Keynote Speaker: Seung-Chul (SC) Song, QUALCOMM, US |
W02.1 Session 1: Chiplet Partitioning and System Level Design
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:45 CET - 10:45 CET
Session chair:
Pascal Vivet, CEA, FR
Time | Label | Presentation Title Authors |
---|---|---|
09:45 CET | W02.1.1 | OCCAMY - A 2.5D CHIPLET SYSTEM FOR ULTRA-EFFICIENT FLOATING-POINT COMPUTING Speaker: Gianna Paulin, ETHZ, CH |
10:05 CET | W02.1.2 | CHIPLET BASED ARCHITECTURE : AN ANSWER FOR EUROPE SOVEREIGNTY IN COMPUTING ? Speaker: Denis Dutoit, CEA, FR |
10:25 CET | W02.1.3 | AUTOMOTIVE ELECTRONIC CONTROL UNIT (ECU) FOR ADAS APPLICATION BASED ON A CHIPLET APPROACH Speaker: Andy Heinig, Fraunhofer, DE |
W02.2 Session 2: 3D and Image Sensors
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:15 CET - 12:15 CET
Session chair:
Eric Ollier, CEA, FR
Time | Label | Presentation Title Authors |
---|---|---|
11:15 CET | W02.2.1 | EFFICIENT IMAGE FEATURE EXTRACTION EXPLOITING MASSIVE PARALLELISM THROUGH 3D-INTEGRATION Speaker: Ricardo Carmona-Galán, CSIC-University of Seville, ES |
11:35 CET | W02.2.2 | 3D INTEGRATION FOR SMART EVENT VISION SENSORS Speaker: Jean-Luc Jaffard, PROPHESEE, FR |
11:55 CET | W02.2.3 | 3D-STACKED CMOS IMAGE SENSORS FOR HIGH PERFORMANCE INDIRECT TIME-OF-FLIGHT Speaker: Cédric Tubert, STMicroelectronics, FR |
W02.LB Lunch Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:15 CET - 13:30 CET
W02.3 Session 3: Ultra High Density of 3D, Monolithic 3D
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 15:15 CET
Session chair:
Saibal Mukhopadhyay, GeorgiaTech, US
Time | Label | Presentation Title Authors |
---|---|---|
13:30 CET | W02.3.1 | THIN-FILM BASED MONOLITHIC 3D SYSTEMS Speaker: Umesh Chand, NTU, SG |
13:50 CET | W02.3.2 | TEMPERATURE-AWARE MONOLITHIC 3D DNN ACCELERATORS FOR BIOMEDICAL APPLICATIONS Speaker: Prachi Shukla, Boston University, US |
14:10 CET | W02.3.3 | A COMPUTE-IN-MEMORY HARDWARE ACCELERATOR DESIGN WITH BEOL TRANSISTOR BASED RECONFIGURABLE INTERCONNECT Speaker: Shimeng Yu, GeorgiaTech, US |
14:30 CET | W02.3.4 | NANOSYSTEMS FOR ENERGY-EFFICIENT COMPUTING USING CARBON NANOTUBE FETS AND MONOLITHIC 3D INTEGRATION Speaker: Tathagata Srimani, Stanford University, US |
W02.CB Coffee Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:50 CET - 15:15 CET
W02.K KEYNOTE: 3D stacking architectures: opportunities for Augmented Reality applications
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:15 CET - 15:45 CET
Session chair:
Subhasish Mitra, Stanford University, US
Speaker:
Edith Beigné, META, US
W02.4 Session 4: 3D Design, Methodology and Thermal
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:45 CET - 17:05 CET
Session chair:
Peter Ramm, Fraunhofer, DE
Time | Label | Presentation Title Authors |
---|---|---|
15:45 CET | W02.4.3 | EDA TOOLS AND PPA TRADEOFF STUDIES FOR MICRO-BUMP AND HYBRID BOND 3D ICS Speaker: Sung-Kyu Lim, GeorgiaTech, US |
16:05 CET | W02.4.2 | HETEROGENEOUS PACKAGING DESIGN AND VERIFICATION WORKFLOWS Speaker: Anthony Mastroianni, SIEMENS EDA, US |
16:25 CET | W02.4.4 | TOWARDS A PLACE AND ROUTE FLOW FOR HIGH DENSITY 3D-ICS Speaker: Mauricio Altieri, CEA, FR |
16:45 CET | W02.4.1 | CHALLENGES AND OPPORTUNITIES FOR THERMALS IN HETEROGENEOUS 3D PACKAGING Speaker: Rajiv Mongia, INTEL, US |
W02.C Workshop Closing Remarks
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 17:05 CET - 17:15 CET
15.1 Young People Program: BarCamp
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:00 CET - 17:30 CET
Session chair:
Anton Klotz, Cadence, DE
Session co-chair:
Georg Glaeser, Institut für Mikroelektronik- und Mechatronik-Systeme, DE
The BarCamp is an open research meeting in which participants present, discuss and jointly develop ideas and results of their ongoing scientific work in an interactive way. Characterized by an informal atmosphere, the goal of the BarCamp is to generate new and out-of-the-box ideas, and to allow networking and interaction between participants.
W03.2 Projects related to Neuromorphic computing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:00 CET - 10:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
09:00 CET | W03.2.1 | PHOTONIC NEUROMORPHIC COMPUTING Speaker: Frank Brückerhoff-Plückelmann, University of Münster, DE |
09:30 CET | W03.2.2 | ALGORITHM-CIRCUITS-DEVICE CO-DESIGN FOR EDGE NEUROMORPHIC INTELLIGENCE – MEMSCALE PROJECT Speaker: Melika Payvand, University of Zurich and ETH Zurich, CH |
W05 Cross-layer algorithm & circuit design for signal processing with special emphasis on communication systems
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:00 CET - 14:40 CET
Organisers:
Raymond Leung, Huawei Technologies Co., Ltd., CN
Norbert Wehn, University of Kaiserslautern, DE
Leibin Ni, Huawei Technologies Co., Ltd., CN
Christian Weis, University of Kaiserslautern, DE
Speakers:
Shaodi Wang, Zhicun (WITINMEM) Technology Co. Ltd., CN
Qinhui Huang, Huawei Technologies Co., Ltd., CN
Zhihang Wu, Huawei Technologies Co., Ltd., CN
Chixiao Chen, Fudan University, CN
Zhiyi Yu, Sun-Yat Sen University, Zhuhai, CN
Ningyuan Yin, Sun-Yat Sen University, Zhuhai, CN
Kechao Huang, Huawei Technologies Co., Ltd., CN
Leibin Ni, Huawei Technologies Co., Ltd., CN
Keynote Speakers:
Andreas P. Burg, Ecole Polytechnique Federale de Lausanne (EPFL), CH
Stephan ten Brink, Institute of Telecommunications, University of Stuttgart, DE
Content/Context:
The scaling and evolution of semiconductor manufacturing technologies are triggering intense interdisciplinary and cross-layer activities; these have the potential to provide many benefits, such as much increased energy efficiency and resilience in the context of only partially reliable hardware circuit designs.
Signal processing, particularly in the field of communication systems, can benefit greatly from these developments due to 1) rapidly increasing energy efficiency requirements as a consequence of the demand for higher data rates and 2) an inherent fault tolerance of the underlying signal processing algorithms. In the context of a communication system, the reliability and robustness requirements can vary widely depending on the considered target application: while they are rather relaxed for wireless communication systems, due to a high overall acceptable error rate in the outcome of the processing, they are rather stringent in the field of optical communications, which necessitates operation at very low error rates. These specific characteristics and requirements foster cross-layer approaches that jointly consider the algorithm and the hardware circuit design.
The robustness of the hardware technology and processing architecture used (classical signal processing, ML computing, or in-memory processing), together with the resilience of the applied algorithm, determines the configuration and parameters of the complete application as well as the chosen algorithm. The further scaling of technology nodes and the slow-down of Moore's law may force designers to deeply revisit existing and future signal processing and communication systems, increasing the potential need for paradigm changes in the design of such systems.
This workshop aims at providing a forum to discuss challenges, trends, solutions and applications of these rapidly evolving cross-layer approaches for the algorithm and circuit design of communication and signal processing systems by gathering researchers and engineers from academia and industry; it also aims at creating a unique network of competence and experts in all aspects of cross-layer solutions and technologies including manufacturing reliability, architectures, design, algorithms, automation and test. The workshop will therefore give an opportunity for the contributors to share/discuss the state-of-the-art knowledge and their work in progress.
The topics that will be discussed in this workshop include but are not limited to:
- Cross-layer design approaches
- Approximate computing in signal processing and communication systems
- Algorithm design for communication systems and optimization for hardware implementation
- ML and CNN computing for communication and signal processing systems
- In-memory computing approaches for signal processing and communication systems
Keynote speakers:
- Professor Andreas P. Burg, Telecommunications Circuits Laboratory, Ecole Polytechnique Federale de Lausanne (EPFL)
- Professor Stephan ten Brink, Director Institute of Telecommunications (INÜ), University of Stuttgart
This workshop is supported by TU Kaiserslautern, Department of Electrical and Computer Engineering, Division of Microelectronic Systems Design
Participants can register for the workshop free of charge via the online registration platform.
TECHNICAL PROGRAM
W05.1 Welcome Address
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:00 CET - 09:10 CET
Speaker:
Norbert Wehn, TU Kaiserslautern, DE
W05.2 Keynote I and Invited Talk
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 09:10 CET - 10:30 CET
Session chair:
Norbert Wehn, TU Kaiserslautern, DE
Time | Label | Presentation Title Authors |
---|---|---|
09:10 CET | W05.2.1 | KEYNOTE: "ON THE CURSE AND THE BEAUTY OF RANDOMNESS FOR PROVIDING RELIABLE QUALITY GUARANTEES WITH UNRELIABLE SILICON" Keynote Speaker: Andreas P. Burg, Ecole Polytechnique Federale de Lausanne (EPFL), CH Abstract Abstract: Silicon implementations of complex algorithms (for communications and other applications) are burdened by extensive safety margins to ensure 100% reliable operation. These margins limit voltage scaling at the cost of energy/power consumption and require conservative layout rules such as double-fins or the use of static memories for storage that are costly in area. "Approximate computing" or "computing on unreliable silicon" promotes the idea to compromise reliability and tolerate occasional errors or parameter variations for the benefit of area and power/energy. This idea is especially relevant for applications such as communications or machine learning, where systems are anyway tolerant to noise or apply only stochastic quality metrics such as BER, FER, PSNR, or MSE (see the illustrative sketch after this table). Bio: Andreas Burg (S'97-M'05) was born in Munich, Germany, in 1975. He received his Dipl.-Ing. degree from the Swiss Federal Institute of Technology (ETH) Zurich, Switzerland, in 2000, and the Dr. sc. techn. degree from the Integrated Systems Laboratory of ETH Zurich, in 2006. |
10:00 CET | W05.2.2 | INVITED TALK: "IMPLEMENTATION OF MULTI-HUNDRED-GIGABIT THROUGHPUT OPTICAL FEC CODEC WITH NON-REFRESH EDRAM" Speakers: Qinhui Huang and Kechao Huang, Huawei Technologies Co., Ltd., CN Abstract Abstract: Forward-error-correction codes (FECs) are essential elements in the field of optical communication to deliver ultra-reliable transmission. In the last decade, communication engineers have become unprecedentedly eager for power-efficient FECs due to the slow-down of Moore's law and the increasing demand for data rate. Typically, the area and power consumption of today's high-speed FECs are dominated by memories. Embedded DRAM (eDRAM) is a promising approach to deal with this issue due to its lower transistor count. Algorithm and circuit can be co-designed, and the refresh module can be removed in such a domain-specific eDRAM. By using non-refresh eDRAM instead of conventional SRAM, significant power reduction and area saving can be achieved in high-speed FECs. |
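Purely as an editorial illustration of the stochastic quality metrics named in the keynote abstract above (BER, MSE and the like), the following minimal Python sketch injects random bit flips into a quantized signal buffer and reports the resulting metrics. The error model, flip probability and all function names are hypothetical assumptions made for this example and are not taken from the talk.

```python
import numpy as np

def inject_bit_flips(samples, bits=8, p_flip=1e-3, rng=None):
    """Flip each stored bit of an unsigned fixed-point buffer with probability
    p_flip -- a toy stand-in for errors caused by aggressive voltage scaling."""
    rng = np.random.default_rng() if rng is None else rng
    flips = rng.random((samples.size, bits)) < p_flip           # which bits flip
    masks = (flips * (1 << np.arange(bits))).sum(axis=1).astype(samples.dtype)
    return samples ^ masks

rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=10_000, dtype=np.uint8)       # 8-bit samples
noisy = inject_bit_flips(clean, bits=8, p_flip=1e-3, rng=rng)

ber = np.unpackbits(clean ^ noisy).mean()                        # bit error rate
mse = np.mean((clean.astype(float) - noisy.astype(float)) ** 2)  # distortion
print(f"BER = {ber:.2e}, MSE = {mse:.2f}")
```

Raising p_flip mimics more aggressive voltage scaling: the BER grows roughly proportionally, while the MSE depends on which bit positions happen to flip.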
W05.3 Coffee Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:30 CET - 10:45 CET
W05.4 Keynote II and Invited Talk
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:45 CET - 12:00 CET
Session chair:
Christian Weis, University of Kaiserslautern, DE
Time | Label | Presentation Title Authors |
---|---|---|
10:45 CET | W05.4.1 | KEYNOTE: "DEEP LEARNING APPLICATIONS IN WIRELESS COMMUNICATIONS BASED ON DISTRIBUTED MASSIVE MIMO CHANNEL SOUNDING DATA" Keynote Speaker: Stephan ten Brink, Institute of Telecommunications, University of Stuttgart, DE Abstract Abstract: A distributed massive MIMO channel sounder for acquiring CSI datasets is presented. The measured data has several applications in the study of different machine learning algorithms. Each individual single-antenna receiver is completely autonomous, enabling arbitrary grouping into spatially distributed antenna deployments, and offering virtually unlimited scalability in the number of antennas. Some of the deep learning applications presented include absolute and relative user localization like “channel charting”, and CSI inference for UL/DL FDD massive MIMO operation. Bio: Stephan ten Brink has been a faculty member at the University of Stuttgart, Germany, since July 2013, where he is head of the Institute of Telecommunications. From 1995 to 1997 and 2000 to 2003, Dr. ten Brink was with Bell Laboratories in Holmdel, New Jersey, conducting research on multiple antenna systems. From July 2003 to March 2010, he was with Realtek Semiconductor Corp., Irvine, California, as Director of the wireless ASIC department, developing WLAN and UWB single chip MAC/PHY CMOS solutions. In April 2010 he returned to Bell Laboratories as Department Head of the Wireless Physical Layer Research Department in Stuttgart, Germany. Dr. ten Brink is an IEEE Fellow, and recipient and co-recipient of several awards, including the Vodafone Innovation Award, the IEEE Stephen O. Rice Paper Prize, and the IEEE Communications Society Leonard G. Abraham Prize for contributions to channel coding and signal detection for multiple-antenna systems. He is best known for his work on iterative decoding (EXIT charts), MIMO communications (soft sphere detection, massive MIMO), and deep learning applied to communications. |
11:30 CET | W05.4.2 | INVITED TALK: "COMMUNICATION-AWARE CROSS-LAYER CODESIGN STRATEGY FOR ENERGY EFFICIENT MACHINE LEARNING SOC" Speaker: Chixiao Chen, Fudan University, CN Abstract Abstract: With the great success of artificial intelligence algorithms, machine learning SoCs have recently become a significant class of high-performance processors. However, the limited power budget of edge devices cannot support GPUs and intensive DRAM access. The talk will discuss multiple energy-efficient codesign examples that avoid power-hungry hardware. First, on-chip incremental learning is performed on an SoC without dedicated backpropagation computing, where algorithm-architecture codesign is involved. Second, low bit-width quantization schemes are applied to a computing-in-memory based SoC, where algorithm-circuit codesign is investigated. Moreover, data flow optimization is mapped onto a multi-chiplet-module system, where architecture-package codesign is discussed. |
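As a rough, purely illustrative companion to the low bit-width quantization mentioned in the invited talk above, the Python sketch below applies uniform symmetric 4-bit quantization to a weight matrix and evaluates the integer matrix-vector product that a computing-in-memory macro would perform, rescaling afterwards. The bit width, tensor shapes and function names are assumptions chosen for the example, not details of the talk.

```python
import numpy as np

def quantize_symmetric(w, n_bits=4):
    """Uniform symmetric quantization of a weight tensor to n_bits signed
    integers, as might precede mapping onto a CIM macro (illustrative only)."""
    q_max = 2 ** (n_bits - 1) - 1                   # e.g. 7 for 4-bit weights
    scale = np.abs(w).max() / q_max
    w_q = np.clip(np.round(w / scale), -q_max, q_max).astype(np.int8)
    return w_q, scale

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64)).astype(np.float32)    # hypothetical layer weights
w_q, scale = quantize_symmetric(w, n_bits=4)

x = rng.normal(size=64).astype(np.float32)
y_ref = w @ x                                       # full-precision reference
y_cim = (w_q @ x) * scale                           # integer MACs, rescaled after
print("relative error:", np.linalg.norm(y_ref - y_cim) / np.linalg.norm(y_ref))
```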
W05.5 Lunch Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:00 CET - 13:00 CET
W05.6 Invited Talks: From Pass-Transistor-Logic to Computing-In-Memory
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:00 CET - 14:30 CET
Session chairs:
Leibin Ni, Huawei Technologies Co., Ltd., CN
Christian Weis, University of Kaiserslautern, DE
Time | Label | Presentation Title Authors |
---|---|---|
13:00 CET | W05.6.1 | INVITED TALK I: "RESEARCH AND DESIGN OF PASS TRANSISTOR BASED MULTIPLIERS AND THEIR DESIGN FOR TEST FOR CONVOLUTIONAL NEURAL NETWORK COMPUTATION" Speakers: Zhiyi Yu and Ningyuan Yin, Sun-Yat Sen University, Zhuhai, CN Abstract Abstract: Convolutional Neural Networks (CNNs) use different bit widths at different layers and have been widely used in mobile and embedded applications. The implementation of a CNN may include multipliers, which can incur large overheads and suffer from a high timing error rate due to their large delay. The pass-transistor-logic (PTL) based multiplier is a promising solution to such issues: it uses fewer transistors, and it reduces the number of gates in the critical path and thus the worst-case delay, so the timing error rate is reduced. In this talk, we present PTL-based multipliers and their design for test (DFT). An error model is built to analyze the error rate and to help with DFT. According to the simulation results, compared to a traditional CMOS-based multiplier, the energy per operation (J/OP) of PTL multipliers can be reduced by over 20%. |
13:30 CET | W05.6.2 | INVITED TALK II: "WTM2101: COMPUTING-IN-MEMORY SOC" Speaker: Shaodi Wang, Zhicun (WITINMEM) Technology Co. Ltd., CN Abstract Abstract: In this talk, we will introduce an ultra-low-power neural processing SoC chip with computing-in-memory technology. We have designed, fabricated, and tested chips based on a nonvolatile floating-gate technology node. The chip simultaneously solves the data processing and communication bottlenecks in NNs. Furthermore, thanks to the nonvolatility of the floating-gate cell, the computing-in-memory macros can be powered down during the idle state, which saves leakage power for IoT uses, e.g., voice command recognition. The chip supports multiple NNs, including DNN, TDNN, and RNN, for different applications. |
14:00 CET | W05.6.3 | INVITED TALK III: "IMPLEMENTATION AND PERFORMANCE ANALYSIS OF COMPUTING-IN-MEMORY TOWARDS COMMUNICATION SYSTEMS" Speakers: Zhihang Wu and Leibin Ni, Huawei Technologies Co., Ltd., CN Abstract Abstract: Computing-in-memory (CIM) is an emerging technique to overcome the memory-wall bottleneck. It can reduce data movement between memory and processor and bring significant power reduction to neural network accelerators, especially in edge devices. Communication systems face power and heat dissipation problems when implementing DSP algorithms in ASICs, so applying the CIM technique to communication systems to improve energy efficiency could have a great impact. The talk will discuss the computing-in-memory technique for communication systems. As examples, some DSP modules (such as FIR, MIMO and FEC) will be re-organized and mapped onto computing-in-memory units. |
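As a toy illustration of the kind of mapping described in the last abstract above, the following Python snippet writes a FIR filter as the matrix-vector products a computing-in-memory crossbar naturally evaluates: the taps form the stored vector and each input window drives the rows. The filter, sizes and function names are hypothetical and chosen only for this example.

```python
import numpy as np

def fir_direct(x, h):
    """Reference FIR filter: y[n] = sum_k h[k] * x[n - k]."""
    return np.convolve(x, h, mode="valid")

def fir_as_matvec(x, h):
    """The same FIR written as matrix-vector products -- the form a
    computing-in-memory crossbar evaluates: the taps are the stored vector,
    each reversed input window is applied to the rows."""
    k = len(h)
    windows = np.stack([x[n:n + k][::-1] for n in range(len(x) - k + 1)])
    return windows @ h                               # one MAC pass per output

rng = np.random.default_rng(2)
x = rng.normal(size=32)
h = np.array([0.25, 0.5, 0.25])                      # hypothetical 3-tap filter
assert np.allclose(fir_direct(x, h), fir_as_matvec(x, h))
print(fir_as_matvec(x, h)[:5])
```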
W05.7 Closing Notes
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:30 CET - 14:40 CET
Speakers:
Christian Weis, University of Kaiserslautern, DE
Leibin Ni, Huawei Technologies Co., Ltd., CN
W01.T1 Technical Session 1 - Applications, Machine Learning, and System-level Test
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:00 CET - 11:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
10:00 CET | W01.T1.1 | TOWARDS FAST AND EFFICIENT SCENARIO GENERATION FOR AUTONOMOUS VEHICLES Speaker: Haolan Liu, University of California, US |
10:15 CET | W01.T1.2 | DEEP LEARNING BASED DRIVER MODEL AND FAULT DETECTION FOR AUTOMATED RACECAR SYSTEM TESTING Speaker: Yousef Abdulhammed, BMW Motorsport, DE |
10:30 CET | W01.T1.3 | UNSUPERVISED CLUSTERING OF ACOUSTIC EMISSION SIGNALS FOR SEMICONDUCTOR THIN LAYER CRACK DETECTION AND DAMAGE EVENT INTERPRETATION Speaker: Sarah Seifi, Infineon Technologies AG, DE |
10:45 CET | W01.T1.4 | ONLINE SCHEDULING OF MEMORY BISTS EXECUTION AT REAL-TIME OPERATING-SYSTEM LEVEL Speaker: Paolo Bernardi, Politecnico di Torino, IT |
W03.3 Materials and Devices
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:30 CET - 12:30 CET
Time | Label | Presentation Title Authors |
---|---|---|
10:30 CET | W03.3.1 | MODELING UNCONVENTIONAL NANOSCALED DEVICE FABRICATION – MUNDFAB PROJECT Speaker: Peter Pichler, Fraunhofer IISB, DE |
11:10 CET | W03.3.2 | RESISTANCE SWITCHING MATERIALS AND DEVICES FOR NEUROMORPHIC COMPUTING Speaker: Sabina Spiga, CNR - IMM, IT |
11:50 CET | W03.3.3 | OSCILLATING NEURAL NETWORKS POWERED BY PHASE-TRANSITION VO2 NANODEVICES Speaker: Oliver Maher, IBM, CH |
W05.3 Coffee Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:30 CET - 10:45 CET
W05.4 Keynote II and Invited Talk
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:45 CET - 12:00 CET
Session chair:
Christian Weis, University of Kaiserslautern, DE
Time | Label | Presentation Title Authors |
---|---|---|
10:45 CET | W05.4.1 | KEYNOTE: "DEEP LEARNING APPLICATIONS IN WIRELESS COMMUNICATIONS BASED ON DISTRIBUTED MASSIVE MIMO CHANNEL SOUNDING DATA" Keynote Speaker: Stephan ten Brink, Institute of Telecommunications, University of Stuttgart, DE Abstract Abstract: A distributed massive MIMO channel sounder for acquiring CSI datasets is presented. The measured data has several applications in the study of different machine learning algorithms. Each individual single-antenna receiver is completely autonomous, enabling arbitrary grouping into spatially distributed antenna deployments, and offering virtually unlimited scalability in the number of antennas. Some of the deep learning applications presented include absolute and relative user localization like “channel charting”, and CSI inference for UL/DL FDD massive MIMO operation. Bio: Stephan ten Brink has been a faculty member at the University of Stuttgart, Germany, since July 2013, where he is head of the Institute of Telecommunications. From 1995 to 1997 and 2000 to 2003, Dr. ten Brink was with Bell Laboratories in Holmdel, New Jersey, conducting research on multiple antenna systems. From July 2003 to March 2010, he was with Realtek Semiconductor Corp., Irvine, California, as Director of the wireless ASIC department, developing WLAN and UWB single chip MAC/PHY CMOS solutions. In April 2010 he returned to Bell Laboratories as Department Head of the Wireless Physical Layer Research Department in Stuttgart, Germany. Dr. ten Brink is an IEEE Fellow, and recipient and co-recipient of several awards, including the Vodafone Innovation Award, the IEEE Stephen O. Rice Paper Prize, and the IEEE Communications Society Leonard G. Abraham Prize for contributions to channel coding and signal detection for multiple-antenna systems. He is best known for his work on iterative decoding (EXIT charts), MIMO communications (soft sphere detection, massive MIMO), and deep learning applied to communications. |
11:30 CET | W05.4.2 | INVITED TALK: "COMMUNICATION-AWARE CROSS-LAYER CODESIGN STRATEGY FOR ENERGY EFFICIENT MACHINE LEARNING SOC" Speaker: Chixiao Chen, Fudan University, CN Abstract Abstract: With the great success of artificial intelligence algorithms, machine learning SoCs have recently become a significant class of high-performance processors. However, the limited power budget of edge devices cannot support GPUs and intensive DRAM access. The talk will discuss multiple energy-efficient codesign examples that avoid power-hungry hardware. First, on-chip incremental learning is performed on an SoC without dedicated backpropagation computing, where algorithm-architecture codesign is involved. Second, low bit-width quantization schemes are applied to a computing-in-memory based SoC, where algorithm-circuit codesign is investigated. Moreover, data flow optimization is mapped onto a multi-chiplet-module system, where architecture-package codesign is discussed. |
W10.B1 Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 10:45 CET - 11:00 CET
W01.2 Invited Session 1: Design-for-dependability for AI hardware accelerators in the edge
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:00 CET - 12:00 CET
Organiser:
Haralampos-G. Stratigopoulos, Sorbonne Universités, CNRS, LIP6, FR
Abstract: AI has seen an explosion in real-world applications in recent years. For example, it is the backbone of self-driving and connected cars. The design of AI hardware accelerators to support the compute-intensive and memory-hungry AI workloads is an on-going effort aiming at optimizing the energy-area trade-off. This special session will focus on dependability aspects in the design of AI hardware accelerators. It is often tacitly assumed that neural networks on hardware inherit the remarkable fault tolerance capabilities of the biological brain. In recent years, this assumption has been proven false by a number of fault injection experiments. The three talks will cover reliability assessment and fault tolerance of Artificial Neural Networks and Spiking Neural Networks implemented in hardware, as well as the impact of approximate computing on fault tolerance capabilities.
Presentations:
- Fault Tolerance of Neural Network Hardware Accelerators for Autonomous Driving
Adrian Evans (CEA-Leti, Grenoble, France), Lorena Anghel (Grenoble-INP, SPINTEC, Grenoble, France), and Stéphane Burel (CEA-Leti, Grenoble, France)
- Exploiting Approximate Computing for Efficient and Reliable Convolutional Neural Networks
Alberto Bosio (École Centrale de Lyon, INL, Lyon, France)
- Reliability Assessment and Fault Tolerance of Spiking Neural Network Hardware Accelerators
Haralampos-G. Stratigopoulos (Sorbonne University, CNRS, LIP6, Paris, France)
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | W01.2.1 | FAULT TOLERANCE OF NEURAL NETWORK HARDWARE ACCELERATORS FOR AUTONOMOUS DRIVING Speaker: Adrian Evans, CEA-Leti, FR |
11:20 CET | W01.2.2 | EXPLOITING APPROXIMATE COMPUTING FOR EFFICIENT AND RELIABLE CONVOLUTIONAL NEURAL NETWORKS Speaker: Alberto Bosio, École Centrale de Lyon, FR |
11:40 CET | W01.2.3 | RELIABILITY ASSESSMENT AND FAULT TOLERANCE OF SPIKING NEURAL NETWORK HARDWARE ACCELERATORS Speaker: Haralampos-G. Stratigopoulos, Sorbonne Universités, CNRS, LIP6, FR |
W04.2 Open Source Software & Support Technologies for Open Source Hardware
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:00 CET - 12:30 CET
Session chair:
Jérôme Quévremont, Thales, FR
Besides the IP blocks addressed in the previous session, open source hardware requires several software technologies that benefit from open source collaboration. First of all, software development tools such as compilers, debuggers, bootloaders, operating systems, runtime frameworks, etc. Then design tools such as FPGA compilers, CAD tools, simulators, emulators, validation tools, etc. Last but not least, security benefits from open source at every level of the hardware and software stack. This session will report on several contributions to these support technologies and discuss the main upcoming issues in this field.
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | W04.2.1 | SUPPORTING OPEN SOURCE HARDWARE TO ENABLE COMMERCIAL ADOPTION Presenter: Michael Gielda, Antmicro, PL |
11:15 CET | W04.2.2 | ADDING SUPPORT FOR RISC-V “V” VECTOR EXTENSION IN LLVM. Presenter: Roger Ferrer Ibanez, BSC, ES |
11:30 CET | W04.2.3 | IMPROVING SECURITY THROUGH OPEN SOURCE HARDWARE Presenter: Stefan Mangard, TU Graz, AT |
11:45 CET | W04.2.4 | SIMULATION OF RISC-V 128-BIT EXTENSION IN QEMU Presenter: Frédéric Pétrot, Grenoble INP/TIMA, FR |
12:00 CET | W04.2.5 | PANEL WITH THE SPEAKERS ON OPEN SOFTWARE & SUPPORT TECHNOLOGIES FOR OPEN SOURCE HARDWARE Panellists: Michael Gielda1, Roger Ferrer Ibanez2, Stefan Mangard3 and Frédéric Pétrot4 1Antmicro, PL; 2BSC, ES; 3TU Graz, AT; 4Grenoble INP/TIMA, FR Moderator: Jérôme Quévremont, Thales, FR Abstract Q&A session with the audience and panel discussion on Open Software & Support Technologies for Open Source Hardware. |
W10.S3 Hardware and Components
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:00 CET - 12:15 CET
Session chair:
Sebastian Steinhorst, TU Munich, DE
Speaker & Panellists:
Bart Vermeulen, NXP, NL
Hai (Helen) Li, Duke University, US
Luca Benini, ETH Zurich, CH
For the design of autonomous systems, powerful hardware and system components are as much a core enabler of advanced autonomy as the software running on them. With the increase in cognitive capabilities through the integration of computation and sensing platforms, many opportunities as well as challenges arise in making hardware and AI-centric software operate fully synergistically and, hence, reach their full potential. The purpose of this session is to discuss the latest trends in hardware components and their design aspects for the efficient and holistic integration of computation, sensing and communication.
Agenda of Talks and Speakers/Panelists:
- 11:00 - 11:15: "PULP: An Open Ultra Low-Power Platform for Autonomous Nano Drones", Luca Benini, ETH Zürich, CH
- 11:15 - 11:30: "Challenges & Solutions for Next-Generation E/E Architectures", Bart Vermeulen, NXP, NL
- 11:30 - 11:45: "Efficient Machine Learning: Algorithms-Circuits-Devices Co-design", Hai (Helen) Li, Duke University, US
- 11:45 - 12:15: Interactive panel discussion on Hardware and Components
W02.2 Session 2: 3D and Image Sensors
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:15 CET - 12:15 CET
Session chair:
Eric Ollier, CEA, FR
Time | Label | Presentation Title Authors |
---|---|---|
11:15 CET | W02.2.1 | EFFICIENT IMAGE FEATURE EXTRACTION EXPLOITING MASSIVE PARALLELISM THROUGH 3D-INTEGRATION Speaker: Ricardo Carmona-Galán, CSIC-University of Seville, ES |
11:35 CET | W02.2.2 | 3D INTEGRATION FOR SMART EVENT VISION SENSORS Speaker: Jean-Luc Jaffard, PROPHESEE, FR |
11:55 CET | W02.2.3 | 3D-STACKED CMOS IMAGE SENSORS FOR HIGH PERFORMANCE INDIRECT TIME-OF-FLIGHT Speaker: Cedric Tubert, STMicroelectronics, FR |
W09.3 Keynote 2
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 11:30 CET - 12:30 CET
Session chair:
Francesco Regazzoni, University of Amsterdam, NL and Università della Svizzera italiana, CH
Time | Label | Presentation Title Authors |
---|---|---|
11:30 CET | W09.3.1 | SUSTAINABLE SECURITY: WHAT DO WE SUSTAIN, AND FOR WHOM? Keynote Speaker: Jan Tobias Muehlberg, IMEC-DistriNet, KU Leuven, BE |
W01.T2 Technical Session 2 - Testing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:00 CET - 13:00 CET
Session Chair:
Melanie Schillinsky, NXP, DE
Time | Label | Presentation Title Authors |
---|---|---|
12:00 CET | W01.T2.1 | PERCEPTION AND REALITY CHECK INTO V-STRESS FOR SCREENING DEFECTIVE PARTS IN AUTOMOTIVE RELIABILITY Speaker: Lieyi Sheng, onsemi, BE |
12:15 CET | W01.T2.2 | POWER CYCLING BODY DIODE CURRENT FLOW ON SIC MOSFET DEVICE Speaker: Giovanni Corrente, STMicroelectronics, IT |
12:30 CET | W01.T2.3 | REDUCING ROUTING OVERHEAD USING NATURAL LOOPS Speaker: Tobias Kilian, TU Munich / Infineon Technologies AG, DE |
12:45 CET | W01.T2.4 | A NOVEL METHOD FOR DISCOVERING ELECTRICALLY EQUIVALENT DEFECTS IN ANALOG/MIXED-SIGNAL CIRCUITS Speaker: Mayukh Bhattacharya, Synopsys, US |
W05.5 Lunch Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:00 CET - 13:00 CET
W02.LB Lunch Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:15 CET - 13:30 CET
W10.B2 Lunch Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:15 CET - 13:30 CET
W03.LB Lunch Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 12:30 CET - 13:30 CET
W05.6 Invited Talks: From Pass-Transistor-Logic to Computing-In-Memory
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:00 CET - 14:30 CET
Session chairs:
Leibin Ni, Huawei Technologies Co., Ltd., CN
Christian Weis, University of Kaiserslautern, DE
Time | Label | Presentation Title Authors |
---|---|---|
13:00 CET | W05.6.1 | INVITED TALK I: "RESEARCH AND DESIGN OF PASS TRANSISTOR BASED MULTIPLIERS AND THEIR DESIGN FOR TEST FOR CONVOLUTIONAL NEURAL NETWORK COMPUTATION" Speakers: Zhiyi Yu and Ningyuan Yin, Sun Yat-sen University, Zhuhai, CN Abstract Abstract: Convolutional Neural Networks (CNNs) use different bit widths at different layers and have been widely adopted in mobile and embedded applications. The implementation of a CNN may include multipliers, which can incur large overheads and suffer from a high timing error rate due to their large delay. The pass-transistor logic (PTL) based multiplier is a promising solution to these issues. It uses fewer transistors and reduces the number of gates in the critical path, thus reducing the worst-case delay; as a result, the timing error rate is reduced. In this talk, we present PTL-based multipliers and their design for test (DFT). An error model is built to analyze the error rate and to help with DFT. According to the simulation results, compared to a traditional CMOS-based multiplier, the energy per operation (Joules per operation, J/OP) of PTL multipliers could be reduced by over 20%. |
13:30 CET | W05.6.2 | INVITED TALK II: "WTM2101: COMPUTING-IN-MEMORY SOC" Speaker: Shaodi Wang, Zhicun (WITINMEM) Technology Co. Ltd., CN Abstract Abstract: In this talk, we will introduce an ultra-low-power neural processing SoC chip with computing-in-memory technology. We have designed, fabricated, and tested chips based on nonvolatile floating-gate technology nodes. It simultaneously solves the data processing and communication bottlenecks in NNs. Furthermore, thanks to the nonvolatility of the floating-gate cell, the computing-in-memory macros can be powered down during the idle state, which saves leakage power in IoT uses, e.g., voice command recognition. The chip supports multiple NNs including DNN, TDNN, and RNN for different applications. |
14:00 CET | W05.6.3 | INVITED TALK III: "IMPLEMENTATION AND PERFORMANCE ANALYSIS OF COMPUTING-IN-MEMORY TOWARDS COMMUNICATION SYSTEMS" Speakers: Zhihang Wu and Leibin Ni, Huawei Technologies Co., Ltd., CN Abstract Abstract: Computing-in-memory (CIM) is an emerging technique to solve the memory-wall bottleneck. It reduces data movement between memory and processor and achieves significant power reduction in neural network accelerators, especially in edge devices. Communication systems face power and heat dissipation problems when implementing DSP algorithms in ASICs, so applying the CIM technique to communication systems to improve energy efficiency would have a great impact. The talk will discuss the computing-in-memory technique for communication systems. As examples, some DSP modules (such as FIR, MIMO and FEC) will be re-organized and mapped onto computing-in-memory units. |
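As a toy illustration of such a mapping (not taken from the talk; the filter taps, input samples, and variable names are chosen purely for illustration), the sketch below expresses an FIR filter as the matrix-vector product that a compute-in-memory macro natively performs, with the filter taps stored as rows of a weight matrix:

```python
# Illustrative sketch: an FIR filter re-organized as a matrix-vector product,
# the operation a compute-in-memory (CIM) array performs natively.
import numpy as np

taps = np.array([0.25, 0.5, 0.25])        # FIR coefficients ("weights in memory")
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input samples (applied as the vector)

n_out = len(x) - len(taps) + 1
# Each row holds the reversed taps aligned with one output sample (Toeplitz structure).
H = np.zeros((n_out, len(x)))
for i in range(n_out):
    H[i, i:i + len(taps)] = taps[::-1]

y_cim = H @ x                               # what the CIM macro would compute
y_ref = np.convolve(x, taps, mode="valid")  # reference FIR output
print(np.allclose(y_cim, y_ref))            # True
```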
W06 Data-driven applications for industrial and societal challenges: Problems, methods, and computing platforms
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:00 CET - 18:00 CET
Organisers:
Jeronimo Castrillon, TU Dresden, DE
Christoph Hagleitner, IBM Research -- Zurich Research Laboratory, CH
Christian Pilato, Politecnico di Milano, IT
Speakers:
Diana Göhringer, TU Dresden, DE
Hannes Vogt, ETH Zurich / CSCS, CH
Tobias Grosser, University of Edinburgh, GB
Nuria de Lama, Consulting Director, EU Government Consulting IDC, ES
Gianluca Palermo, Politecnico di Milano - DEIB, IT
Kentaro Sano, Center for Computational Science, RIKEN, JP
Luca Carloni, Columbia University, US
Zhiru Zhang, Cornell University, US
This workshop is supported by
TU Dresden, Chair for Compiler Construction
Participants can register for the workshop free of charge via the online registration platform.
W06.1 Big data, HPC and FPGAs
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:00 CET - 15:15 CET
Session chair:
Fabrizio Ferrandi, Politecnico di Milano, IT
This session gives an overview of the challenges of modern big-data applications, reports on state-of-the-art FPGA-based HPC systems, describes an open-source reconfigurable hardware ecosystem and design flow, and closes with recent research on hardware acceleration for sparse big-data workloads.
Time | Label | Presentation Title Authors |
---|---|---|
13:00 CET | W06.1.1 | WORKSHOP INTRODUCTION Organisers: Jeronimo Castrillon1, Christoph Hagleitner2 and Christian Pilato3 1TU Dresden, DE; 2IBM Research -- Zurich Research Laboratory, CH; 3Politecnico di Milano, IT Abstract Brief introduction by the organizers on the motivation for as well as the format and the contents of the DATA-DREAM workshop in DATE 2022. |
13:15 CET | W06.1.2 | EVOLUTION OF THE DATA MARKET: HIGHLIGHTS AND PROJECTIONS Speaker: Nuria de Lama, Consulting Director, EU Government Consulting IDC, ES Abstract Taking decisions in the context of the data economy requires a deep understanding of the market and its projections. Investment in technologies should consider the value of indicators associated to the potential growth of the market, competitors, size of the ecosystem or a view on the skills gap. This presentation will offer updated figures elaborated by the Data Market Study run by IDC for 2021-2023 and will position the figures in a set of potential scenarios that will define the performance of the EU in a data-driven economy. Attendees will learn about the value of indicators such as data professionals and the skills gap, data companies, data suppliers, data economy, the value of the data market or the international dimension bringing some knowledge on markets outside the EU (US, Brazil, Japan, China). |
13:45 CET | W06.1.3 | SYSTEM AND APPLICATIONS OF FPGA CLUSTER "ESSPER" FOR RESEARCH ON RECONFIGURABLE HPC Speaker: Kentaro Sano, Center for Computational Science, RIKEN, JP Abstract At RIKEN Center for Computational Science (R-CCS), we have been developing an experimental FPGA cluster named "ESSPER (Elastic and Scalable System for high-PErformance Reconfigurable computing)," which is a research platform for reconfigurable HPC. ESSPER is composed of sixteen Intel Stratix 10 SX FPGAs which are connected to each other by a dedicated 100Gbps inter-FPGA network. We have developed our own Shell (SoC) and its software APIs for the FPGAs supporting inter-FPGA communication. The FPGA host servers are connected to a 100Gbps InfiniBand switch, which allows distant servers to remotely access the FPGAs by using a software-bridged version of Intel's OPAE FPGA driver, called R-OPAE. Through the 100Gbps InfiniBand network and R-OPAE, ESSPER is connected to the world's fastest supercomputer, Fugaku, deployed at RIKEN, so that, using Fugaku, we can program bitstreams onto the FPGAs remotely via R-OPAE and off-load tasks to the FPGAs. In this talk, I introduce ESSPER's concept, its hardware and software system stack, the programming environment, applications under development, as well as our future prospects for reconfigurable HPC. |
14:15 CET | W06.1.4 | OPEN-SOURCE HARDWARE FOR HETEROGENEOUS COMPUTING Speaker: Luca Carloni, Columbia University , US Abstract Information technology has entered the age of heterogeneous computing. Across a variety of application domains, computer systems rely on highly heterogeneous architectures that combine multiple general-purpose processors with many specialized hardware accelerators. The complexity of these systems, however, threatens to widen the gap between the capabilities provided by semiconductor technologies and the productivity of computer engineers. Open-source hardware is a promising avenue to address this challenge by enabling design reuse and collaboration. ESP is an open-source research platform for system-on-chip design that combines a scalable tile-based architecture and a flexible system-level design methodology. Conceived as a heterogeneous system integration platform, ESP is intrinsically suited to foster collaborative engineering across the open-source hardware community. |
14:45 CET | W06.1.5 | NEAR-MEMORY HARDWARE ACCELERATION OF SPARSE WORKLOADS Speaker: Zhiru Zhang, Cornell University, US Abstract Sparse linear algebra operations are widely used in numerous application domains such as graph processing, machine learning, and scientific computing. These operations are typically more challenging to accelerate due to low operational intensity and irregular data access patterns. This talk presents our recent investigation into near-memory hardware acceleration for sparse processing. Specifically, I will discuss the importance of co-designing the sparse storage format and accelerator architecture to maximize the bandwidth utilization and compute occupancy. As a case study, I will introduce GraphLily, a graph linear algebra overlay for accelerating graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra operations. |
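To make the GraphBLAS formulation mentioned in the last talk above more concrete, here is a minimal illustrative sketch (not from the talk; it uses a toy graph and plain SciPy routines in place of GraphBLAS or GraphLily primitives) that expresses breadth-first search as one sparse matrix-vector product per frontier level:

```python
# Illustrative sketch: BFS as repeated sparse matrix-vector products (SpMV),
# the linear-algebra view of graph traversal used by GraphBLAS-style systems.
import numpy as np
import scipy.sparse as sp

# Toy directed graph: 0->1, 0->2, 1->3, 2->3, 3->4
rows = [0, 0, 1, 2, 3]
cols = [1, 2, 3, 3, 4]
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(5, 5))

def bfs_levels(A, source):
    """Return the BFS level of each vertex (-1 if unreachable)."""
    n = A.shape[0]
    levels = np.full(n, -1)        # -1 means "not reached yet"
    frontier = np.zeros(n)
    frontier[source] = 1.0
    level = 0
    while frontier.any():
        levels[frontier > 0] = level
        # One SpMV per BFS level: A^T * frontier reaches the out-neighbours of
        # the current frontier; the mask keeps only unvisited vertices.
        reached = A.T @ frontier
        frontier = np.where(levels == -1, reached, 0.0)
        level += 1
    return levels

print(bfs_levels(A, 0))  # -> [0 1 1 2 3]
```

Accelerators such as the GraphLily overlay described above target exactly this kind of sparse matrix-vector kernel, which is why the co-design of sparse storage format and architecture matters for bandwidth utilization.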
W02.3 Session 3: Ultra High Density of 3D, Monolithic 3D
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 15:15 CET
Session chair:
Saibal Mukhopadhyay, GeorgiaTech, US
Time | Label | Presentation Title Authors |
---|---|---|
13:30 CET | W02.3.1 | THIN-FILM BASED MONOLITHIC 3D SYSTEMS Speaker: Umesh Chand, NTU, SG |
13:50 CET | W02.3.2 | TEMPERATURE-AWARE MONOLITHIC 3D DNN ACCELERATORS FOR BIOMEDICAL APPLICATIONS Speaker: Prachi Shukla, Boston University, US |
14:10 CET | W02.3.3 | A COMPUTE-IN-MEMORY HARDWARE ACCELERATOR DESIGN WITH BEOL TRANSISTOR BASED RECONFIGURABLE INTERCONNECT Speaker: Shimeng Yu, GeorgiaTech, US |
14:30 CET | W02.3.4 | NANOSYSTEMS FOR ENERGY-EFFICIENT COMPUTING USING CARBON NANOTUBE FETS AND MONOLITHIC 3D INTEGRATION Speaker: Tathagata Srimani, Stanford University, US |
W03.4 Demonstrators
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 14:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
13:30 CET | W03.4.1 | NEURONN LIVE DEMONSTRATORS Presenter: Madeleine Abernot, CNRS, FR |
13:30 CET | W03.4.2 | NEURONN LIVE DEMONSTRATORS Presenter: Theophile Gonos, A.I.Mergence, FR |
13:30 CET | W03.4.3 | NEURONN LIVE DEMONSTRATORS Speaker: Thierry Gil, CNRS, FR |
W04.3 Keynote - RISC-V Open Era of Computing: Innovation, adoption, and opportunity in Europe and beyond
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 14:00 CET
Keynote Speaker:
Calista Redmond, RISC-V International, US
RISC-V is the undisputed lead architecture that has ushered in a profound new open era in computing. The innovations and implementations of RISC-V span from embedded to enterprise, from IoT to HPC. RISC-V is delivering on extensions, tools, and investments of a global community ranging from start-ups to multi-nationals, from students to research fellows. This talk will highlight that progress and opportunity, with an invitation to engage.
W09.4 Keynote 3
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 14:30 CET
Session chair:
Elif Bilge Kavun, University of Passau, DE
Time | Label | Presentation Title Authors |
---|---|---|
13:30 CET | W09.4.1 | PROGRAMMABILITY AND SUSTAINABILITY IN THE FUTURE INTERNET Keynote Speaker: Paola Grosso, University of Amsterdam, NL |
W10.S4 Panel on Autonomous Systems Design
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 13:30 CET - 15:00 CET
Session Chair:
David Harel, Weizmann Institute of Science, IL
Panellists:
Sandeep Neema, DARPA, US
Alberto Sangiovanni, UC Berkeley, US
Carlo Ghezzi, Politecnico di Milano, IT
Simon Burton, Fraunhofer, DE
Michael Paulitsch, Intel, DE
Arne Haman, Bosch Research, DE
Organizers:
- David Harel, Weizmann Institute of Science
- Joseph Sifakis, Verimag Laboratory
Moderator:
- David Harel, Weizmann Institute of Science
Topics to be discussed by the panel:
1) What is your vision for AS? For example, the role of these systems in the IoT and AI revolutions; autonomy as a step from weak to strong AI; the gap between automated and autonomous systems.
2) What challenges do you see in AS design? For example, AI-enabled end-to-end solutions; "hybrid design" approaches, integrating model- and data-driven components; systems engineering issues.
3) How should we ensure the reliability of AS? For example, achieving explainable AI; adapting and extending rigorous V&V techniques to ASs; ensuring safety based exclusively on simulation and testing.
4) Looking to the future, is the vision of total autonomy viable? How can we make it happen? For example, decisive factors for acceptance; research challenges; ethical issues; "easy" total autonomy categories.
W03.5 Neuromorphic Architecture & Design
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:00 CET - 15:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | W03.5.1 | EFFECT OF DEVICE MISMATCHES IN DIFFERENTIAL OSCILLATORY NEURAL NETWORKS Speaker: Jafar Shamsi, University of Calgary, CA |
14:30 CET | W03.5.2 | MACHINE LEARNING FOR THE DESIGN OF WAVE AND OSCILLATOR-BASED COMPUTING DEVICES Speaker: Gyorgy Csaba, Pázmány University Budapest, HU |
W04.4 Panel on Licensing, Funding, Cooperation & Regulation
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:00 CET - 15:30 CET
Session chair:
Andrew Katz, OpenForum Europe, GB
There are several non-technical challenges for Open Source Hardware that should be addressed on the regulatory and policy level. There is a need for researching funding mechanisms that could increase Europe's influence and the level of participation of and cooperation between SMEs, academia and large companies, as well as a clarification of the regulatory and licensing issues that might stifle innovation. Are there policy solutions that can support such developments and Open Source Hardware initiatives? Who can be a driver of change? In this panel the panellists will discuss these pertinent issues and share their experiences.
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | W04.4.1 | INTRODUCTION OF THE PANEL'S TOPICS AND PANELISTS Presenter: Andrew Katz, OpenForum Europe, GB |
14:05 CET | W04.4.2 | AFTER EATING SOFTWARE, OPEN IS “EATING” EVERYTHING! Panellist: Mike Milinkovich, Eclipse Foundation, BE |
14:10 CET | W04.4.3 | EUROPRACTICE AS BREEDING GROUND FOR EUROPEAN OPEN SOURCE HARDWARE INITIATIVES Presenter: Romano Hoofman, Europractice, BE |
14:15 CET | W04.4.4 | FUNDING OPEN SOURCE HARDWARE: GETTING THE BEST FROM PUBLICLY FUNDED RESEARCH THROUGH COMMERCIAL PARTNERSHIPS Presenter: Javier Serrano, CERN, CH |
14:20 CET | W04.4.5 | REINFORCING LARGE-SCALE DESIGN CAPACITIES: A PARTIAL VIEW FROM A FUNDING AGENCY Presenter: Arian Zwegers, European Commission, BE |
14:25 CET | W04.4.6 | BUILDING EUROPEAN RISC-V LEADERSHIP IN GLOBAL OPEN SOURCE HARDWARE Presenter: Calista Redmond, RISC-V International, US |
14:30 CET | W04.4.7 | WHAT ARE LEGAL CHALLENGES FOR WIDESPREAD USE OF OPEN SOURCE HW? WHAT ARE THE LICENSING ISSUES? Presenter: Andrew Katz, OpenForum Europe, GB |
14:35 CET | W04.4.8 | PANEL DISCUSSION ON LICENSING, FUNDING, COOPERATION & REGULATION. Panellists: Romano Hoofman1, Javier Serrano2, Arian Zwegers3, Calista Redmond4 and Mike Milinkovich5 1Europractice, BE; 2CERN, CH; 3European Commission, BE; 4RISC-V International, US; 5Eclipse Foundation, BE Moderator: Andrew Katz, OpenForum Europe, GB Abstract Q&A session with the audience and panel discussion on Licensing, Funding, Cooperation & Regulation. |
W01.T3 Technical Session 3 - Reliability and Safety
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:30 CET - 15:30 CET
Session Chair:
Michelangelo Grosso, STMicroelectronics, IT
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | W01.T3.1 | IMPROVING INSTRUCTION CACHE MEMORY RELIABILITY UNDER REAL-TIME CONSTRAINTS Speaker: Fabien Bouquillon, Université de Lille, FR |
14:45 CET | W01.T3.2 | COMMON DATA LANGUAGE CONNECTING HTOL TESTING TO IN-FIELD USE Speaker: Marc Hutner, proteanTecs, CA |
15:00 CET | W01.T3.3 | EFFICIENT USE OF ON-LINE LOGICBIST TO ACHIEVE ASIL B IN A GPU IP Speaker: Lee Harrison, Siemens EDA, GB |
15:15 CET | W01.T3.4 | VERIFICATION AND VALIDATION OF SAFETY ELEMENT OUT OF CONTEXT Speaker: Shivakumar Chonnad, Synopsys Inc, US |
W05.7 Closing Notes
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:30 CET - 14:40 CET
Speakers:
Christian Weis, University of Kaiserslautern, DE
Leibin Ni, Huawei Technologies Co., Ltd., CN
W09.5 Session 2: Lightweight Security & Safety for Emerging Technologies
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:45 CET - 15:45 CET
Session chair:
Ilia Polian, University of Stuttgart, DE
Time | Label | Presentation Title Authors |
---|---|---|
14:45 CET | W09.5.1 | FUNCTIONAL SAFETY OF DEEP NEURAL NETWORK ACCELERATOR Speaker: Kanad Basu, University of Texas at Dallas, US |
15:05 CET | W09.5.2 | SUSTAINABLE LATTICE BASED CRYPTOGRAPHY USING OPENCL NUMBER-THEORETIC TRANSFORM Speaker: Apostolos Fournaris, ISI, GR |
15:25 CET | W09.5.3 | REMOTE ATTESTATION OF IOT DEVICES: PAST, PRESENT AND FUTURE Speaker: Md Masoom Rabbani, ES&S, imec-COSIC, ESAT, BE |
W02.CB Coffee Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 14:50 CET - 15:15 CET
W03.CB Coffee Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:00 CET - 15:30 CET
W10.B3 Break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:00 CET - 15:10 CET
W10.S5 V2X, Edge Computing and Connected Applications
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:10 CET - 16:50 CET
Session chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE
Speaker & Panellists:
Frank Hofmann, Bosch, DE
Chun-Ting Chou, OmniEyes, TW
Ziran Wang, Toyota Motor North America R&D, US
Stefano Marzani, Amazon, US
Connectivity enables many advanced applications for vehicles. In particular, the interactions between vehicles and edge servers (or roadside units) further boost this trend and involve more players in the business. In this session, experts from an auto-maker, a supplier, a high-tech company, and a start-up will come together, describe their roles in the connected and edge-computing environment, and discuss potential integration or competition.
Agenda of Talks and Speakers/Panelists:
- 15:10 - 15:25: "Video Uberization Using Edge AI and Mobile Video", Chun-Ting Chou, OmniEyes, TW
- 15:25 - 15:40: "Connected Applications as Driver for Automation", Frank Hofmann, Bosch, DE
- 15:40 - 15:55: "Environmental parity between cloud and embedded edge as a foundation for software-defined vehicles", Stefano Marzani, Amazon, US
- 15:55 - 16:10: "Mobility Digital Twin with Connected Vehicles and Cloud Computing", Ziran Wang, Toyota Motor North America R&D, US
- 16:10 - 16:50: Interactive panel discussion on V2X, Edge Computing, and Connected Applications
W02.K KEYNOTE: 3D stacking architectures: opportunities for Augmented Reality applications
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:15 CET - 15:45 CET
Session chair:
Subhasish Mitra, Stanford University, US
Speaker:
Edith Beigné, META, US
W06 Break 1 Coffee break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:15 CET - 15:30 CET
W01.ET Embedded Tutorial - IEEE P2851 advancements
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:30 CET - 16:10 CET
Session Chair:
Oscar Ballan, Ethernovia, US
Organiser:
Jyotika Athavale, NVIDIA, US
Speakers:
Bernhard Bauer, Synopsys, DK
Meirav Nitzan, Synopsys, US
W03.6 Neuromorphic Computing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:30 CET - 17:00 CET
Time | Label | Presentation Title Authors |
---|---|---|
15:30 CET | W03.6.1 | FULLY SPINTRONIC RADIOFREQUENCY NEURAL NETWORKS Speaker: Alice Mizrahi, CNRS/Thales, FR |
16:00 CET | W03.6.2 | ANALOG OSCILLATORY NEURAL NETWORKS FOR ENERGY-EFFICIENT COMPUTING AT THE EDGE Speaker: Corentin Delacour, CNRS, FR |
16:30 CET | W03.6.3 | RELIABLE PROCESSING-IN-MEMORY BASED MANYCORE ARCHITECTURES FOR DEEP LEARNING: FROM CNNS TO GNNS Speaker: Partha Pratim Pande, Washington State University, US |
W06.2 Software development, libraries and languages
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:30 CET - 17:30 CET
Session chair:
Dionysios Diamantopoulos, Zurich Research Laboratory, CH
This session turns to software support in the form of programming models, runtime adaptability, modern libraries and programming abstractions for heterogeneous HPC and big data systems.
Time | Label | Presentation Title Authors |
---|---|---|
15:30 CET | W06.2.1 | METHODS AND TOOLS FOR ACCELERATING IMAGE PROCESSING APPLICATIONS ON FPGA-BASED SYSTEMS Speaker: Diana Göhringer, TU Dresden, DE Abstract Field Programmable Gate Arrays (FPGAs) are a promising platform for accelerating image processing as well as machine learning applications due to their parallel architecture, reconfigurability and energy-efficiency. However, programming such platforms can be quite cumbersome and time consuming compared to CPUs or GPUs. This presentation shows methods and tools for reducing the programming effort for image processing applications on FPGA-based systems. Our design methodology is based on the Open-VX standard and includes an open-source High Level Synthesis (HLS) library for generating image processing and neural network accelerators called HiFlipVX. The importance of such an approach is shown with application examples from different research projects. |
16:00 CET | W06.2.2 | GRIDTOOLS: HIGH-LEVEL HPC LIBRARIES FOR WEATHER AND CLIMATE Speaker: Hannes Vogt, ETH Zurich / CSCS, CH Abstract GridTools is a set of C++ libraries and Python tools to enable weather and climate scientists to express their computations in a high-level hardware-agnostic way, while providing highly efficient execution of the codes. |
16:30 CET | W06.2.3 | DOMAIN-SPECIFIC MULTI-LEVEL IR REWRITING FOR GPU: THE OPEN EARTH COMPILER FOR GPU-ACCELERATED CLIMATE SIMULATION Speaker: Tobias Grosser, University of Edinburgh, GB Abstract Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM’s extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure. |
17:00 CET | W06.2.4 | CLIMBING EVEREST: DESIGN ENVIRONMENT FOR EXTREME-SCALE BIG DATA ANALYTICS ON HETEROGENEOUS PLATFORMS Speaker: Gianluca Palermo, Politecnico di Milano - DEIB, IT Abstract This talk introduces the consortium-wide effort carried out within the EVEREST H2020 project. The EVEREST project aims at developing a holistic design environment that simplifies the programmability of High-Performance Big Data analytics for heterogeneous, distributed, scalable, and secure systems. Our effort is concentrated on the use of a “data-driven” design approach together with domain-specific language extensions, hardware-accelerated AI, and efficient run-time monitoring while considering a unified hardware/software paradigm. The project targets a wide range of applications, from weather-analysis-based production for renewable energy market trading, to air-quality monitoring of industrial sites, and real-time traffic modeling for transportation in smart cities. |
W02.4 Session 4: 3D Design, Methodology and Thermal
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:45 CET - 17:05 CET
Session chair:
Peter Ramm, Fraunhofer, DE
Time | Label | Presentation Title Authors |
---|---|---|
15:45 CET | W02.4.3 | EDA TOOLS AND PPA TRADEOFF STUDIES FOR MICRO-BUMP AND HYBRID BOND 3D ICS Speaker: SungKyu Lim, GeorgiaTech, US |
16:05 CET | W02.4.2 | HETEROGENEOUS PACKAGING DESIGN AND VERIFICATION WORKFLOWS Speaker: Anthony Mastroianni, SIEMENS EDA, US |
16:25 CET | W02.4.4 | TOWARDS A PLACE AND ROUTE FLOW FOR HIGH DENSITY 3D-ICS Speaker: Mauricio Altieri, CEA, FR |
16:45 CET | W02.4.1 | CHALLENGES AND OPPORTUNITIES FOR THERMALS IN HETEROGENEOUS 3D PACKAGING Speaker: Rajiv Mongia, INTEL, US |
W04.5 Panel on Industrial Concerns
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 15:45 CET - 17:15 CET
Session chair:
Frank Gürkaynak, ETH Zürich, CH
Open source hardware, especially around the RISC-V architecture, has been a talking point in recent years. Development has been rapid: from humble beginnings, when open source hardware was a niche pursuit of enthusiasts and academics, we now have multi-billion-dollar companies and, recently, a commitment from the European Commission to support work in open source hardware. In this session, we would like to go beyond the buzz and discuss with people involved in industry what opportunities they see, what the potential roadblocks are, and what they think is still missing.
Time | Label | Presentation Title Authors |
---|---|---|
15:45 CET | W04.5.1 | INTRODUCTION OF THE PANEL'S TOPICS AND PANELISTS Presenter: Frank Gürkaynak, ETH Zürich, CH |
15:50 CET | W04.5.2 | OPENHW CORE-V: RISC-V OPEN-SOURCE CORES FOR HIGH VOLUME PRODUCTION SOCS Presenter: Rick O'Connor, OpenHW Group, CA |
15:55 CET | W04.5.3 | LEVERAGING OPEN SOURCE HARDWARE IN COMMERCIAL PRODUCTS: BENEFITS AND CHALLENGES Presenter: Loïc Lietar, GreenWaves, FR |
16:00 CET | W04.5.4 | CAN OPEN SOURCE HW ADDRESS INDUSTRIAL CONCERNS FOR CYBERSECURITY AND TRUSTED ELECTRONICS? Presenter: Matthias Hiller, Fraunhofer, ECSO, DE |
16:05 CET | W04.5.5 | INDUSTRIAL REQUIREMENTS FOR OPEN SOURCE HARDWARE Presenter: Jean-Christian Kircher, Bosch, FR |
16:10 CET | W04.5.6 | THALES' PERSPECTIVES ON OPEN SOURCE HARDWARE Presenter: Thierry Collette, Thales, FR |
16:15 CET | W04.5.7 | INDUSTRIAL CONCERNS ABOUT OPEN HARDWARE Presenter: Zdeněk Přikryl, Codasip, CZ |
16:20 CET | W04.5.8 | PANEL DISCUSSION ON INDUSTRIAL CONCERNS Panellists: Rick O'Connor1, Loïc Lietar2, Matthias Hiller3, Jean-Christian Kircher4, Thierry Collette5 and Zdeněk Přikryl6 1OpenHW Group, CA; 2GreenWaves, FR; 3Fraunhofer, ECSO, DE; 4Bosch, FR; 5Thales, FR; 6Codasip, CZ Moderator: Frank Gürkaynak, ETH Zürich, CH Abstract Q&A session with the audience and panel discussion on industrial concerns of open source hardware. |
W09.6 Panel: Is Security An Enabler, An Enemy or Simply A Nightmare for Sustainability?
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 16:00 CET - 17:00 CET
Session Chairs:
Francesco Regazzoni, University of Amsterdam, NL and Università della Svizzera italiana, CH
Elif Bilge Kavun, University of Passau, DE
Panellists:
David Bol, UC Louvain, BE
Yuri Demchenko, University of Amsterdam, NL
Marc Stoettinger, Rhein Main University of Applied Sciences, DE
Ruggero Susella, ST Microelectronics, IT
W01.3 Invited Session 2: The challenges of reaching zero defect and functional safety – and how the EDA industry tackles them
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 16:30 CET - 17:30 CET
Session Chair:
Daniel Tille, Infineon, DE
Organisers:
Riccardo Cantoro, Politecnico di Torino, IT
Daniel Tille, Infineon, DE
Abstract: Automotive microcontrollers have become very complex Systems-on-Chip (SoCs). Especially the megatrends of Assisted Driving (ADAS) and Automated Driving (AD), but also traditional applications such as power-train and steering, require ever-increasing functionality. However, these safety-critical environments require zero defects, and the implementation of functional safety measures together with the rising complexity poses significant challenges to satisfying these requirements. This special session addresses these challenges and shows potential solutions to overcome them with the help of the EDA industry.
Presentations:
- Automated solutions for safety and security vulnerabilities
Teo Cupaiuolo (Synopsys)
- Functional Safety: an EDA perspective
Alessandra Nardi (Cadence)
- The Zero Defect Goal For Automotive ICs
Lee Harrison (Siemens EDA); Nilanjan Mukherjee (Siemens)
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | W01.3.1 | AUTOMATED SOLUTIONS FOR SAFETY AND SECURITY VULNERABILITIES Speaker: Teo Cupaiuolo, Synopsys, IT |
16:50 CET | W01.3.2 | FUNCTIONAL SAFETY: AN EDA PERSPECTIVE Speaker: Alessandra Nardi, Cadence, US |
17:10 CET | W01.3.3 | THE ZERO DEFECT GOAL FOR AUTOMOTIVE ICS Speaker: Lee Harrison, Siemens EDA, GB |
W10.S6 Closing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 16:50 CET - 17:00 CET
W02.C Workshop Closing Remarks
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 17:05 CET - 17:15 CET
W01.4 Panel: What are the limitations of EDA tools with respect to zero defects and FuSa?
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 17:30 CET - 18:30 CET
Session Chair:
Wim Dobbelaere, onsemi, BE
Organiser:
Davide Appello, STMicroelectronics, IT
Panellists:
Antonio Priore, ARM, GB
Georges Gielen, KU Leuven, BE
Chen He, NXP, US
Mauro Pipponzi, ELES, IT
Vladimir Zivkovic, Infineon, DK
Om Ranjan, STMicroelectronics, IN
Abstract: Product segments with high quality demands, such as automotive, transportation, and aerospace, have been characterized by persistent needs across several years:
- Zero defects, or in general very low defective levels
- Accurate modeling and prediction of product reliability
The sustainability of these objectives is challenged by the relentless demand for higher-performance products and the consequent move to higher complexity and advanced technology nodes.
Functional safety standards and requirements aim to guarantee the usability of products in safety-critical applications and add several requirements whose satisfaction is a key concern during the development of a new product.
The proposed panel session will debate with the experts how effectively the available EDA tools are helping to face the described challenges.
As an example, these are suitable questions that anyone in the field may need answering:
- How does EDA help to effectively resolve, "end-to-end", the traceability of defined requirements? Does this represent a sustainable effort?
- Is DFT effective enough in addressing fault models to reach target quality?
- Is verification/simulation/validation effective with respect to transient fault modes?
W06 Break 2 Coffee break
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 17:30 CET - 17:40 CET
W06.3 Discussion and closing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 17:40 CET - 18:00 CET
15.2 Panel: Forum on Advancing Diversity in EDA (DivEDA)
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 18:30 CET - 20:30 CET
Session chair:
Ayse K Coskun, Boston University, US
Session co-chair:
Nele Mentens, KU Leuven, BE
Panellists:
Ileana Buhan, Radboud University Nijmegen, NL
Michaela Blott, Xilinx, IE
Andreia Cathelin, STMicroelectronics, FR
Marian Verhelst, KU Leuven, BE
The 3rd Advancing Diversity in EDA (DivEDA) forum is co-sponsored by IEEE CEDA and ACM SIGDA. The goal of DivEDA is to help women and underrepresented minorities (URM) advance their careers in academia and industry, and hence, to help increase diversity in the EDA community. A more diverse community will then help accelerate innovation in the EDA ecosystem and benefit societal progress. Through an interactive medium, our aim is to provide practical tips to women and URM on how to succeed and to overcome possible hurdles in their career growth, while at the same time, connecting senior and junior researchers to enable a growing diverse community. We are excited to build upon earlier diversity-focused efforts in EDA and create a venue that aims to make a difference. Prior DivEDA editions were held at DATE’18 and DAC’19. This year’s forum will be held as a single 2-hour virtual session, including a 1-hour panel followed by smaller group mentoring and Q&A sessions. The topic of the forum is “Addressing career challenges during the pandemic: work-life balance, networking, and more”. Registration to the event is free of charge.
W01.5 Closing
Add this session to my calendar
Date: Friday, 18 March 2022
Time: 18:30 CET - 18:45 CET
Chairs:
Paolo Bernardi, Politecnico di Torino, IT
Riccardo Cantoro, Politecnico di Torino, IT
Yervant Zorian, Synopsys, US
Wim Dobbelaere, onsemi, BE
M02 Tutorial: Computing with High-dimensional Vectors for Energy-Efficient AI: Hyperdimensional Computing aka Vector Symbolic Architectures
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 09:00 CET - 13:00 CET
Organisers:
Abbas Rahimi, IBM Research Zurich, CH
Denis Kleyko, University of California, Berkeley, US and Research Institutes of Sweden, Kista, SE
Evgeny Osipov, Luleå University of Technology, SE
Jan M. Rabaey, University of California, Berkeley, US
Abstract
This tutorial will introduce an emerging computing framework that is based on using high-dimensional distributed vectors to represent and manipulate symbols, data structures, and functions. This framework, commonly known as either Hyperdimensional Computing or Vector Symbolic Architectures, originated at the intersection of symbolic and connectionist approaches to Artificial Intelligence but has turned into a research area of its own. Hyperdimensional Computing/Vector Symbolic Architectures is a highly interdisciplinary area with connections to neuroscience, computer science, electrical engineering, mathematics, and cognitive science. This fact makes it challenging to form a thorough picture of the area. At the same time, we believe that it is extremely important to facilitate the entry of new researchers into the area. Therefore, the purpose of this tutorial is to convey the framework and recent developments to interested researchers.
The tutorial will cover such aspects of the area as: known computational models, transformations of various input data to high-dimensional distributed representations, applications with a focus on machine learning, and efficient hardware implementations using emerging technologies.
Motivation
There is a global trend of searching for computing paradigms alternative to the conventional one, such as the emerging fields of neuromorphic and nanoscalable hardware. Moreover, there is a strong demand for low-cost data-driven approaches, referred to by the terms “tiny machine learning” and “edge machine learning”. We foresee that Hyperdimensional Computing/Vector Symbolic Architectures are going to play an important role in providing tiny machine learning on unconventional hardware. The tutorial is timely, as we see a sharp peak of interest in the topic from the electrical engineering community.
Goal
We see that the major problem for researchers new to the area is that the works on Hyperdimensional Computing/Vector Symbolic Architectures are spread across many disciplines and cannot be tracked easily. Thus, understanding the state of the art of the area is not trivial due to the wide spread of venues. Therefore, in this tutorial we aim at covering the main topics within Hyperdimensional Computing/Vector Symbolic Architectures as well as providing the broad coverage of the area, which is often missing. The tutorial will have hands-on elements, but many of its parts will focus on fundamentals and presenting the state of the art.
Necessary background
The tutorial is intended for participants with basic knowledge of linear algebra, probability theory, elementary logic, as well as basic programming skills.
Content
The tutorial will consist of 4 main parts and a panel discussion at the end.
Part 1 - Introduction to Computing with High-dimensional Vectors: Concepts, Models, Primitives for Data Structures, and Locality-Preserving Encoding
Speaker: Denis Kleyko
Duration: 50 minutes
This part will provide a gentle introduction to Hyperdimensional Computing/Vector Symbolic Architectures by first giving an overview of the main principles and basic operations. It will also touch on ideas for transforming data from its original representation into a high-dimensional space. For example, we will discuss how to use the basic operations to represent a large variety of data structures and how to form locality-preserving representations that are essential when working with, e.g., ordinal or ratio data.
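For readers new to the area, the following is a minimal, self-contained sketch (our own illustration, not part of the tutorial materials) of the basic operations on random bipolar hypervectors — binding, bundling, and similarity — used here to represent and query a simple key-value record.

```python
# Minimal sketch of basic Hyperdimensional Computing / Vector Symbolic
# Architectures operations, using random bipolar (+1/-1) hypervectors.
# Illustrative only: the names and the bipolar (MAP-style) model are our
# own choices, not taken from the tutorial materials.
import numpy as np

D = 10_000                              # dimensionality of the hypervectors
rng = np.random.default_rng(0)

def random_hv():
    """Draw a random bipolar hypervector; any two are quasi-orthogonal."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding (elementwise multiplication): associates two hypervectors."""
    return a * b

def bundle(*vs):
    """Bundling (elementwise majority vote): superposes several hypervectors."""
    return np.sign(np.sum(vs, axis=0))

def similarity(a, b):
    """Normalized dot product; close to 0 for unrelated random hypervectors."""
    return float(a @ b) / D

# Represent the record {colour: red, shape: round} as a single hypervector.
colour, shape, red, round_ = (random_hv() for _ in range(4))
record = bundle(bind(colour, red), bind(shape, round_))

# Unbinding with the 'colour' key recovers something similar to 'red',
# because binding with a bipolar vector is (approximately) its own inverse.
query = bind(record, colour)
print(similarity(query, red))     # noticeably above chance (~0.5 here)
print(similarity(query, round_))  # close to 0
```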
Part 2 - Computing with High-dimensional Vectors in Machine Learning
Speaker: Evgeny Osipov
Duration: 50 minutes
This part will focus on the use of Hyperdimensional Computing/Vector Symbolic Architectures in the context of machine learning. We will first survey use cases where the high-dimensional representations introduced in the previous part are used as input to classical machine learning algorithms, discussing the pros and cons of such approaches. We will then describe how entire algorithms for randomly connected artificial neural networks (Random Vector Functional Link networks, Echo State Networks) can be implemented purely with Hyperdimensional Computing/Vector Symbolic Architectures operations, and discuss the implications of such a design for power-efficient implementations of neural algorithms. We conclude by presenting a holistic approach to unsupervised learning that uses high-dimensional representations as input and Hyperdimensional Computing/Vector Symbolic Architectures operations to realize the algorithm.
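As a concrete, hypothetical illustration of this kind of machine-learning use, the sketch below builds a simple prototype-based classifier: features are encoded into bipolar hypervectors, the encodings of each class are bundled into a class prototype, and queries are assigned to the most similar prototype. The names and the random-projection encoding are our own choices, not code from the tutorial.

```python
# Sketch of a prototype-based classifier in the Hyperdimensional Computing /
# Vector Symbolic Architectures style (our own illustration, not code from
# the tutorial): real-valued features are projected into a high-dimensional
# bipolar space, training encodings of each class are bundled into a class
# prototype, and queries go to the prototype with the highest dot product.
import numpy as np

D, n_features = 10_000, 16
rng = np.random.default_rng(1)
projection = rng.standard_normal((n_features, D))   # fixed random projection

def encode(x):
    """Locality-preserving encoding: random projection followed by sign."""
    return np.sign(x @ projection)

def train(X, y):
    """Bundle the encodings of each class into one prototype per class."""
    return {label: np.sign(encode(X[y == label]).sum(axis=0))
            for label in np.unique(y)}

def predict(prototypes, x):
    """Return the label of the most similar class prototype."""
    hv = encode(x)
    return max(prototypes, key=lambda label: prototypes[label] @ hv)

# Toy usage on two Gaussian blobs.
X0 = rng.normal(-1.0, 1.0, size=(50, n_features))
X1 = rng.normal(+1.0, 1.0, size=(50, n_features))
X, y = np.vstack([X0, X1]), np.array([0] * 50 + [1] * 50)
prototypes = train(X, y)
print(predict(prototypes, rng.normal(+1.0, 1.0, size=n_features)))  # expected: 1
```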
Part 3 - Computing with High-dimensional Vectors: From Efficient Classifiers to Efficient General AI
Speaker: Abbas Rahimi
Duration: 50 minutes
This part first focuses on how to build efficient Hyperdimensional Computing/Vector Symbolic Architectures-based classifiers, with main emphasis on few-shot learning and implementations on analog in-memory computing hardware. Next, it will provide insights on how to expand the application area of Hyperdimensional Computing/Vector Symbolic Architectures beyond narrow AI towards general AI.
Part 4 - Hardware for Computing with High-dimensional Vectors: State-of-the-Art, Challenges, and Perspectives
Speaker: Jan M. Rabaey
Duration: 50 minutes
Part 5 - Panel discussion
Speakers: Denis Kleyko, Evgeny Osipov, Abbas Rahimi, Jan M. Rabaey.
Duration: 30 minutes
The time allocated for the panel discussion will be used to elaborate on aspects of Hyperdimensional Computing/Vector Symbolic Architectures that might not be covered in great detail by the tutorial but would be of interest to parts of the audience.
Schedule (all times in CET)
- 09:00 - 09:50 Part 1 – Denis Kleyko. Introduction to Computing with High-dimensional Vectors: Concepts, Models, Primitives for Data Structures, and Locality-Preserving Encoding
- 09:50 - 10:40 Part 2 – Evgeny Osipov. Computing with High-dimensional Vectors in Machine Learning
- 10:40 - 10:50 Break
- 10:50 - 11:40 Part 3 – Abbas Rahimi. Computing with High-dimensional Vectors: From Efficient Classifiers to Efficient General AI
- 11:40 - 12:30 Part 4 – Jan M. Rabaey. Hardware for Computing with High-dimensional Vectors: State-of-the-Art, Challenges, and Perspectives
- 12:30 - 13:00 Panel discussion
Tutorial material
We plan to provide attendees with the following materials:
- A list of recommended reading;
- Jupyter notebook for basic data structures;
- Jupyter notebook for an example of a simple classification model;
- A comprehensive list of GitHub projects related to machine learning applications.
M02.1 Part 1: Introduction to Computing with High-dimensional Vectors: Concepts, Models, Primitives for Data Structures, and Locality-Preserving Encoding
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 09:00 CET - 09:50 CET
Presenter:
Denis Kleyko, University of California, Berkeley, US and Research Institutes of Sweden, Kista, SE
M02.2 Part 2: Computing with High-dimensional Vectors in Machine Learning
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 09:50 CET - 10:40 CET
Presenter:
Evgeny Osipov, Luleå University of Technology, SE
M02.3 Part 3: Computing with High-dimensional Vectors: From Efficient Classifiers to Efficient General AI
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 10:50 CET - 11:40 CET
Presenter:
Abbas Rahimi, IBM Research Zurich, CH
M02.4 Part 4: Hardware for Computing with High-dimensional Vectors: State-of-the-Art, Challenges, and Perspectives
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 11:40 CET - 12:30 CET
Presenter:
Jan M. Rabaey, University of California, Berkeley, US
M02.5 Part 5: Panel discussion
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 12:30 CET - 13:00 CET
Panellists:
Abbas Rahimi, IBM Research Zurich, CH
Denis Kleyko, University of California, Berkeley, US and Research Institutes of Sweden, Kista, SE
Evgeny Osipov, Luleå University of Technology, SE
Jan M. Rabaey, University of California, Berkeley, US
M03 Accelerating Inferencing on Embedded Systems
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 09:00 CET - 13:00 CET
Organiser:
Mathilde Karsenti, Siemens EDA, US
Speakers:
Russell Klein, Program Director, Siemens EDA, US
Petri Solanti, Engineer, Siemens EDA, DE
Motivation:
Machine learning requires significant computational capabilities. Inferencing often needs to be done on embedded systems, for response-time or privacy reasons, where compute resources are limited. The compute demands can be addressed through parallel computation on multi-core systems, GPUs, or even TPUs (machine learning accelerators). However, optimal performance is achieved through the development of bespoke accelerators deployed in FPGA or ASIC hardware.
Goal:
The goal of this tutorial is to present and discuss how to achieve high-performance, yet low-power inferencing on computationally and power-constrained embedded edge systems.
Technical Details:
This tutorial will describe how High Level Synthesis (HLS) can be used to implement and verify accelerators for machine learning algorithms. It will cover how to explore architectural alternatives and comparisons with more traditional acceleration approaches.
High-Level Synthesis offers an alternative implementation method that allows developers to go from algorithm to hardware in a much shorter time and with significantly less human (and error-prone) interpretation. As machine learning makes rapid advances, older algorithms are quickly discarded as more effective ones are developed. Having a fast path from algorithm to implementation ensures that systems will incorporate the latest advances available. High-Level Synthesis is a mature technology, is available from the major EDA and FPGA vendors, and has proven capable of accelerating algorithms.
This tutorial will take the inferencing portion of a machine learning object-recognition algorithm and show how it can be developed, optimized, and verified through high-level synthesis. It will start with the algorithm running in a desktop environment in a high-level machine learning framework and will demonstrate the steps needed to move that algorithm into a hardware implementation suitable for deployment in an embedded system.
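To give a flavour of one of the steps covered (quantization of the inferencing algorithm), the following is a small, self-contained sketch, assuming a generic trained layer rather than the tutorial's actual object-recognition example: float32 weights and activations are mapped to 8-bit integers with per-tensor scales, and the error of the resulting integer multiply-accumulate is measured against the floating-point reference.

```python
# Hypothetical sketch of the quantization step (not Siemens EDA material):
# a trained float32 layer is mapped to 8-bit signed integers with per-tensor
# scales, and the error of the integer multiply-accumulate — what a hardware
# datapath would actually compute — is measured against the float reference.
import numpy as np

rng = np.random.default_rng(42)
weights = rng.normal(0.0, 0.5, size=(64, 64)).astype(np.float32)
activations = rng.uniform(0.0, 1.0, size=64).astype(np.float32)

def quantize(tensor, n_bits=8):
    """Symmetric per-tensor quantization to signed n_bits integers."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = float(np.max(np.abs(tensor))) / qmax
    q = np.clip(np.round(tensor / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

w_q, w_scale = quantize(weights)
a_q, a_scale = quantize(activations)

reference = weights @ activations                    # float32 result
approx = (w_q @ a_q) * (w_scale * a_scale)           # integer MAC + rescale
print("max abs error:", np.max(np.abs(reference - approx)))
```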
Schedule:
- 09:00 - 09:15: Opening
- 09:15 - 09:45: Introduction to Embedded Inferencing, Survey of Acceleration Techniques
- 09:45 - 10:00: Profiling Embedded System Execution
- 10:00 - 11:00: High-Level Synthesis of Inferencing Algorithms
- 11:00 - 11:15: Break/Coffee
- 11:15 - 11:45: Quantization of Inferencing Algorithm, Area and Power Savings
- 11:45 - 12:15: Power, Performance, and Area optimization of Inferencing Accelerator
- 12:30 - 12:50: Verification of Accelerator
- 12:50 - 13:00: Summary and Q & A
Attendees will receive:
- A copy of presentation materials
- Source code for an exemplary project
Necessary background:
- An understanding of hardware development methodologies, either ASIC or FPGA
- Some understanding of machine learning algorithms, specifically how inferencing is performed
- Some understanding of video processing algorithms would be helpful
M03.0 Opening/Introduction
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 09:00 CET - 09:15 CET
Speaker:
Mathilde Karsenti, Siemens EDA, US
M03.1 Introduction to Embedded Inferencing, Survey of Acceleration Techniques
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 09:15 CET - 09:45 CET
Speaker:
Russell Klein, Siemens EDA, US
M03.2 Profiling Embedded System Execution
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 09:45 CET - 10:00 CET
Speaker:
Russell Klein, Siemens EDA, US
M03.3 High-Level Synthesis of Inferencing Algorithms
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 10:00 CET - 11:00 CET
Speaker:
Petri Solanti, Siemens EDA, DE
M03.4 Quantization of Inferencing Algorithm
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 11:15 CET - 11:45 CET
Speaker:
Petri Solanti, Siemens EDA, DE
M03.5 Power, Performance, and Area optimization of Inferencing Accelerator
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 11:45 CET - 12:15 CET
Speaker:
Russell Klein, Siemens EDA, US
M03.6 Verification of Accelerator
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 12:30 CET - 12:50 CET
Speaker:
Petri Solanti, Siemens EDA, DE
M03.7 Question and Answer Period
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 12:50 CET - 13:00 CET
Panellists:
Russell Klein, Siemens EDA, US
Petri Solanti, Siemens EDA, DE
Mathilde Karsenti, Siemens EDA, US
M01 Using Organic Printed Electronics PDK (OPDK) to design circuits for Integrated Sensor Platforms
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 13:15 CET - 17:15 CET
Organisers:
Jasmin Aghassi, Karlsruhe Institute of Technology, DE
Anton Klotz, Cadence Design Systems, DE
Josef Mittermaier, Cadence Design Systems, DE
Mark Willoughby, Europractice, GB
Kai Exner, BASF, DE
Karl-Philipp Strunk, Innovation Lab, DE
Speakers:
Palak Gupta, Innovation Lab, DE
Justas Lukosiunas, Cadence Design Systems, DE
Kai Exner, BASF, DE
Sebastian Stehlin, Innovation Lab, DE
Gabriel Cadilha Marques, Karlsruhe Institute of Technology, DE
Motivation
Organic printed electronics is a rapidly emerging industrial field with numerous applications such as the Internet of Things (IoT) and wearable devices. However, to further aid this growth, a strong need for robust, automated electronic design tools has been recognized.
As a part of the 2-HORISONS project (an international collaboration between Innovation Lab, Karlsruhe Institute of Technology, University of Heidelberg, BASF, Cadence Design Systems, Centre for advanced soft electronics (South Korea) and Nextflex (USA)), the cluster partners have been collaborating to create a comprehensive process design kit for printed organic electronics (OPDK).
To showcase current functionality and features of the OPDK, we would like to demonstrate how to use the full front-to-back design flow to craft manufacturable standard logic circuits.
Necessary background
Basic knowledge of analog and digital circuits as well as printed electronics would be appreciated.
Additionally, familiarity with the Cadence Virtuoso design platform will be helpful but is not strictly required.
Content
The tutorial will consist of 3 main parts:
Part 1 – General overview & introduction to materials, processes and modelling
Speakers: Dr. Kai Exner, Dr. Sebastian Stehlin, Dr. Gabriel Cadilha Marques
Duration: 1hr
The targeted technology platform relies on inkjet- and screen-printed electrical components, such as resistors, thin-film transistors, and sensors, which can be modularly employed to realize electronic circuitry. The manufacturing process on flexible foil substrates utilizes polymer-based p-type semiconductors, dielectric materials for insulation, and multiple metal layers to construct device terminals and interconnections.
To enable circuit simulation, test structures are fabricated and characterized, and device models (DC, AC [2], and variation [3]) are extracted from them.
Section 1: Introduction to materials and manufacturing processes
Section 2: Special features of the printed electronics technology
Section 3: Modelling approach
Part 2 - Basic OPDK usage and design of a standard cell (inverter)
Speakers: Justas Lukosiunas & Palak Gupta
Duration: 1.5hr
Part 2 will cover the basics of the Cadence Virtuoso-based design environment used to design an inverter circuit from start to finish. The full front-to-back design flow will be exercised, including schematic entry, pre-layout simulation, layout creation, Design Rule Check (DRC) and Layout Versus Schematic (LVS) checks, parasitic extraction, and post-layout simulation. This part will be split into a presentation and a hands-on section.
Section 1: Introduction to the tools and front-to-back design flow
Section 2: (Hands-on) Cloud-based practical session for creating an inverter circuit
Part 3 – Designing a binary decoder circuit as part of readout circuitry for an active sensor matrix
Speakers: Justas Lukosiunas & Palak Gupta & Dr. Sebastian Stehlin
Duration: 1hr
Part 3 will focus on using existing standard cells to form a more complex logic circuit – a binary decoder. This design is a critical part of the readout circuitry for an active sensor matrix. The proposed schematic architecture will be simulated. On the layout side, important points on floorplanning, cell placement, and routing will be discussed.
Section 1: Motivation: Large area active matrix sensor platforms
Section 2: Schematic & pre-layout simulation
Section 3: Layout creation, parasitic extraction & post-layout simulation
Agenda (all times in CET):
13:15 - 13:25 Tutorial Welcome Coffee
13:25 - 13:45 Part 1, Section 1 - Dr. Kai Exner
13:45 - 14:05 Part 1, Section 2 - Dr. Sebastian Stehlin
14:05 - 14:25 Part 1, Section 3 - Dr. Gabriel Cadilha Marques
14:25 - 14:35 Coffee Break
14:35 - 15:05 Part 2, Section 1 - Justas Lukosiunas & Palak Gupta
15:05 - 16:05 Part 2, Section 2 - Justas Lukosiunas & Palak Gupta (Hands-on)
16:05 - 16:15 Coffee Break
16:15 - 16:25 Part 3, Section 1 - Dr. Sebastian Stehlin
16:25 - 16:50 Part 3, Section 2 - Justas Lukosiunas & Palak Gupta
16:50 - 17:15 Part 3, Section 3 - Justas Lukosiunas & Palak Gupta
The current version of the OPDK can be downloaded here: https://www.int.kit.edu/7730.php
M04 Modern High-Level Synthesis for Complex Data Science Applications
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 13:15 CET - 17:15 CET
Organisers:
Serena Curzel, Pacific Northwest National Laboratory, US and Politecnico di Milano, IT
Nicolas Bohm Agostini, Pacific Northwest National Laboratory and Northeastern University, US
Michele Fiorito, Politecnico di Milano, IT
Marco Minutoli, Pacific Northwest National Laboratory, US
Vito Giovanni Castellana, Pacific Northwest National Laboratory, US
Fabrizio Ferrandi, Politecnico di Milano, IT
Antonino Tumeo, Pacific Northwest National Laboratory, US
Motivations
- Data science is a key application area that benefits from domain-specific accelerators. However, domain scientists program in high-level frameworks and develop new algorithms extremely quickly, while implementing specialized accelerators typically requires significant effort from experienced hardware designers
- Multi-level, modular, extensible, compiler-based frameworks that automate the translation of algorithms from high-level frameworks into specialized circuit designs can bridge the design productivity gap
- High-level compiler frameworks and High-Level Synthesis play a critical role in such a toolchain
Goals
The goal of this tutorial is to introduce participants to the challenges and opportunities in implementing compilers from high-level, productive programming frameworks down to silicon, and to provide hands-on experience with a set of open-source, state-of-the-art tools (SODA-OPT and PandA-Bambu HLS) whose integration enables such a no-human-in-the-loop compilation framework.
Technical Details
Data Science applications (machine learning, graph analytics) today are the main drivers for designing domain-specific accelerators, both for reconfigurable devices such as Field Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs). As data analysis and machine learning methods keep evolving, we are experiencing a renewed interest in high-level synthesis (HLS) and automated accelerator generation to reduce development effort and allow quick transition from the algorithmic formulation to hardware implementation. This tutorial will discuss the use of modern HLS techniques to generate domain-specific accelerators, explicitly focusing on accelerators for data science, highlighting key methodologies, trends, advantages, benefits, and gaps that still need to be closed. The tutorial will provide a direct hands-on experience with Bambu, one of the most advanced open-source HLS tools currently available, and SODA-OPT, an open-source frontend tool for HLS developed in MLIR. Bambu supports many logic synthesis and simulation tools by integrating various compiler frontends, generating accelerators targeting a variety of FPGA devices and ASIC flows, and introducing new methodologies for parallel accelerators (dataflow and multithreaded designs). SODA-OPT performs hardware/software partitioning of specifications derived from popular high-level data science and machine learning Python frameworks used in high-level data-driven applications. Additionally, it provides domain-specific optimizations to improve the high-level synthesis process of the identified hardware components. Integrating SODA-OPT with Bambu allows the generation of highly efficient accelerators for complex graph analysis and machine learning algorithms.
M04.1 Agile Hardware Design for Complex Data Science Applications: Opportunities and Challenges.
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 13:15 CET - 13:45 CET
Speaker:
Antonino Tumeo, Pacific Northwest National Laboratory, US
M04.2 Bambu: an Open-Source Research Framework for the High-Level Synthesis of Complex Applications.
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 13:45 CET - 14:15 CET
Speaker:
Fabrizio Ferrandi, Politecnico di Milano, IT
M04.3 Hands-on: Productive High-Level Synthesis with Bambu
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 14:15 CET - 15:00 CET
Speaker:
Serena Curzel, Pacific Northwest National Laboratory, US and Politecnico di Milano, IT
M04.4 Hands-on: Compiler Based Optimizations, Tuning and Customization of Generated Accelerators
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 15:15 CET - 16:00 CET
Speaker:
Michele Fiorito, Politecnico di Milano, IT
M04.5 Hands-on: SODA-OPT: Enabling System-Level Design in MLIR for High-Level Synthesis and Beyond
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 16:00 CET - 16:45 CET
Speaker:
Nicolas Bohm Agostini, Pacific Northwest National Laboratory and Northeastern University, US
M04.6 Tech: Svelto: High-Level Synthesis of Multi-Threaded Accelerators for Graph Analytics
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 16:45 CET - 17:15 CET
Speakers:
Marco Minutoli, Pacific Northwest National Laboratory, US
Vito Giovanni Castellana, Pacific Northwest National Laboratory, US
M05 Security of quantum computing
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 13:15 CET - 17:15 CET
Organisers:
Swaroop Ghosh, Pennsylvania State University, US
Rasit Topaloglu, IBM, US
Quantum bits (qubits) are prone to errors such as relaxation/dephasing, gate error, readout error, and crosstalk. These noise sources (e.g., crosstalk) create a new attack surface (e.g., fault injection and information leakage), especially for future large-scale quantum computers that may employ multi-programming access. Furthermore, the success of quantum computing relies on efficient compilation of quantum programs. New but untrusted compilers may be more efficient, motivating designers to use their services; such untrusted compilers can then steal sensitive intellectual property embedded within quantum programs. This tutorial will cover the basics of quantum computing through hands-on activities in Qiskit, followed by an in-depth analysis of the vulnerabilities of quantum computers (both superconducting and trapped-ion), attack models, and countermeasures. Various demonstrations and hands-on activities will reinforce these theoretical concepts.
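As a taste of the Qiskit hands-on portion, here is a minimal sketch (our own example, using the Qiskit API as of early 2022) that prepares a Bell state and samples measurement outcomes on an ideal simulator; on real hardware, readout error and crosstalk — the noise sources discussed above — would also show up in the counts.

```python
# Minimal Qiskit sketch in the spirit of the hands-on part (our own example,
# using the Qiskit API as of early 2022): prepare a Bell state and sample
# measurement outcomes on an ideal simulator.
from qiskit import QuantumCircuit, Aer, execute

qc = QuantumCircuit(2, 2)
qc.h(0)                      # put qubit 0 into an equal superposition
qc.cx(0, 1)                  # entangle qubit 0 with qubit 1
qc.measure([0, 1], [0, 1])   # measure both qubits into classical bits

backend = Aer.get_backend("qasm_simulator")
counts = execute(qc, backend, shots=1024).result().get_counts()
print(counts)  # ideally only '00' and '11'; on real hardware, readout error
               # and crosstalk also produce some '01'/'10' outcomes
```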
M05.1 Basics of quantum computing
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 13:15 CET - 13:40 CET
Speaker:
Swaroop Ghosh, Pennsylvania State University, US
M05.2 Hands-on activity using Qiskit
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 13:40 CET - 14:05 CET
Speaker:
Swaroop Ghosh, Pennsylvania State University, US
M05.3 Fault injection attacks on quantum computing and countermeasures
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 14:05 CET - 14:45 CET
Speaker:
Swaroop Ghosh, Pennsylvania State University, US
M05.4 Demonstration of fault injection attack
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 14:45 CET - 15:00 CET
Speaker:
Swaroop Ghosh, Pennsylvania State University, US
M05.5 Compilation oriented attacks on quantum computing and countermeasures
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 15:15 CET - 16:00 CET
Speaker:
Swaroop Ghosh, Pennsylvania State University, US
M05.6 Quantum PUF and TRNG
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 16:00 CET - 16:45 CET
Speakers:
Swaroop Ghosh, Pennsylvania State University, US
Rasit Topaloglu, IBM, US
M05.7 Discussion and wrap up
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 16:45 CET - 17:00 CET
Speakers:
Swaroop Ghosh, Pennsylvania State University, US
Rasit Topaloglu, IBM, US
M06 Approximate Computing: Circuits, Systems and Applications
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 13:15 CET - 17:15 CET
Location / Room: Virtual
Organisers:
Weiqiang Liu, Nanjing University of Aeronautics and Astronautics, CN
Jie Han, University of Alberta, CA
Alberto Bosio, Lyon Institute of Nanotechnology, FR
Fabrizio Lombardi, Northeastern University, US
Motivation:
Approximate computing has been proposed as a novel paradigm for efficient, low-power design at nanoscale. Its efficiency stems from computing approximate results with at least comparable performance and lower power dissipation than the fully accurate counterpart; approximate computing therefore generates results that are good enough rather than always fully accurate. Although computational errors are generally undesirable, applications such as multimedia, signal processing, machine learning, pattern recognition, and data mining tolerate the occurrence of some errors.
Approximate computing has received significant attention from both the research and industrial communities in the past few years, driven by the challenge of designing power-efficient computing circuits/systems for most emerging applications. The EDA research community has investigated approximate techniques at different levels; many papers on approximate computing have been published at DATE, DAC, ICCAD and ASP-DAC, in IEEE and ACM periodicals (such as TCAD), and in sessions of past DATE editions. In this tutorial, we present a comprehensive treatment of approximate computing. Starting from arithmetic circuits and modules, the tutorial expands its technical coverage to emerging and safety-critical applications, so that the audience can appreciate the potential benefits of this new computational paradigm and its implications for circuit design.
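As a simple, self-contained illustration of the paradigm (our own example, not part of the tutorial materials), the sketch below models one well-known approximate adder idea: the lower k bits are combined with a cheap bitwise OR instead of a full carry chain, while the upper bits are added exactly, trading a small, bounded error for reduced hardware cost.

```python
# Our own illustration (not from the tutorial) of a well-known approximate
# adder idea: the lower k bits are combined with a cheap bitwise OR (no carry
# chain), while the upper bits use an exact adder. The error is small and
# bounded, which error-tolerant applications can trade for power and area.
import random

def approx_add(a: int, b: int, k: int = 4, width: int = 16) -> int:
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)           # approximate lower part
    high = ((a >> k) + (b >> k)) << k       # exact upper part (carry-in dropped)
    return (high | low) & ((1 << width) - 1)

random.seed(0)
errors = [abs((a + b) - approx_add(a, b))
          for a, b in ((random.getrandbits(12), random.getrandbits(12))
                       for _ in range(10_000))]
print("mean abs error:", sum(errors) / len(errors))
print("max abs error:", max(errors))        # bounded by 2**k - 1 for this adder
```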
Tutorial Abstract:
Computing systems are conventionally designed to operate as accurately as possible. However, this trend faces severe technology challenges, such as power dissipation, circuit reliability, and performance. There are a number of pervasive computing applications (such as machine learning, pattern recognition, digital signal processing, communication, robotics, and multimedia), which are inherently error-tolerant or error-resilient, i.e., in general, they require acceptable results rather than fully exact results. Approximate computing has been proposed for highly energy-efficient systems targeting the above-mentioned emerging error-tolerant applications; approximate computing consists of approximately (inexactly) processing data to save power and achieve high performance, while results remain at an acceptable level for subsequent use. This tutorial starts with the motivation of approximate computing and then it reviews current techniques for approximate hardware designs. This tutorial will cover the following topics:
- Approximate Computing: Introduction and Principles (by Fabrizio Lombardi)
- Characterization of Approximate Arithmetic Circuits (by Jie Han)
- Error Compensation Techniques and Approximate DSP Modules (by Fabrizio Lombardi)
- Machine Learning Applications and Security (by Weiqiang Liu)
- Approximate Computing for Safety-Critical Applications (by Alberto Bosio)
Directions for future work in approximate computing will also be provided. The tutorial is presented and tailored to the EDA community and its technical interests.
Schedule:
13:15 - 13:20 Start of the Tutorial
13:20 - 13:45 Topic 1 - Fabrizio Lombardi
Title: Approximate Computing: Introduction and Principles
13:45 - 14:35 Topic 2 - Jie Han
Title: Characterization of Approximate Arithmetic Circuits
14:35 - 15:00 Topic 3 - Fabrizio Lombardi
Title: Error Compensation Techniques and Approximate DSP Modules
15:00 - 15:30 Coffee Break
15:30 - 16:20 Topic 4 - Weiqiang Liu
Title: Approximate Computing: Machine Learning Applications and Security
16:20 - 17:10 Topic 5 - Alberto Bosio
Title: Approximate Computing for Safety-Critical Applications
17:10 - 17:15 End of the Tutorial
Necessary background:
Knowledge of Computer Architecture and Computer Arithmetic
Acknowledgement:
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 956090.
A.2 Disruptive and Nanoelectronics-based edge AI computing systems
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 17:30 CET - 19:00 CET
Session chair:
David Atienza, EPFL, CH
Session co-chair:
Ayse Coskun, Boston University, US
Progress in process technology has enabled the miniaturization of data processing elements, radio transceivers, and sensors for a large set of physiological phenomena. Autonomous sensor nodes, also called edge computing systems, can monitor and react unobtrusively during our daily lives. Nonetheless, the need for automated analysis and interpretation of complex signals poses critical design challenges, which can potentially be addressed (in terms of power consumption, performance, or size) by using nanoelectronics. These new technologies can enable us to go beyond key limitations of CMOS-based technology for particular applications, such as healthcare. This special session covers the latest trends towards bringing AI/ML to edge computing, as well as alternative design paradigms and the use of nanoelectronic technologies for the next generation of edge AI systems.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | A.2.1 | TINY MACHINE LEARNING FOR IOT 2.0 Speaker and Author: Vijay Janapa Reddi, Harvard University, US Abstract Tiny machine learning (TinyML) is a fast-growing field at the intersection of ML algorithms and low-cost embedded systems. TinyML enables a rich and wide array of on-device sensor data analysis (vision, audio, IMU, etc.) at ultra-low-power consumption. Processing data close to the sensor allows for an expansive new variety of always-on ML use-cases that preserve bandwidth, latency, and energy while improving responsiveness and maintaining data privacy. This talk introduces the vision behind TinyML and showcases some of the exciting applications that TinyML is enabling in the field, from supporting personalized health initiatives to unlocking the massive potential to improve manufacturing efficiencies. Yet, there are still numerous technical hardware and software challenges to address. Tight memory and storage constraints, extreme hardware heterogeneity, software fragmentation and a lack of relevant and commercially viable large-scale datasets pose a substantial barrier to unlocking TinyML for IoT 2.0. To this end, the talk also touches on the opportunities and future directions for unlocking the full potential of TinyML. |
18:00 CET | A.2.2 | HD COMPUTING WITH APPLICATIONS Speaker and Author: Tajana S. Rosing, UCSD, US Abstract Hyperdimensional (HD) computing is a class of brain-inspired learning algorithms that uses high dimensional random vectors (e.g. ~10,000 bits) to represent data along with simple and highly parallelizable operations. In this talk I will present some of my team’s recent work on hyperdimensional computing software and hardware infrastructure, including: i) novel algorithms supporting key cognitive computations in high-dimensional space, ii) novel HW systems for efficient HD computing on sensors and mobile devices that are orders of magnitude more efficient than the state of the art, at comparable accuracy. |
18:30 CET | A.2.3 | IMPROVING WEIGHT PERTURBATION ROBUSTNESS FOR MEMRISTOR-BASED HARDWARE DEPLOYMENT Speaker: Yiran Chen, Duke University, US Authors: Yiran Chen, Huanrui Yang and Xiaoxuan Yang, Duke University, US Abstract Crossbar-based memristors, owing to the advantages in executing vector-matrix multiplication, enable highly power-efficient and area-efficient neuromorphic system designs. However, deploying deep learning applications on memristor-based neuromorphic computing devices may lead to noticeable programming and runtime noises on the deployed model’s parameters, resulting in a significant performance degradation. In this talk, we will discuss algorithmic and system solutions to improve robustness of memristor-based designs. We tackle this problem by modeling the distribution of parameter noise and accounting for it in the model training process. More generally, we derive a theoretical robustness guarantee against weight perturbation from curvature perspective, leading to general robustness against hardware noise, quantization noise, and generalization noise. |
16.1 Young People Program Keynote: "Engineering skills that will advance quantum computing"
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 19:00 CET - 19:45 CET
Session chair:
Sara Vinco, Politecnico di Torino, IT
Session co-chair:
Anton Klotz, Cadence Design Systems, DE
Quantum computing is a computing paradigm that exploits fundamental principles of quantum mechanics to tackle problems in mathematics, chemistry, and material science that require particularly extensive computational resources. Its power is derived from a quantum bit (qubit), a physical system that can be in a superposition state and entangled with other qubits. Quantum computing is the main driver behind the phenomenal development of selected areas in electronic engineering (such as cryogenic CMOS), computer sciences, machine learning, material sciences, etc. Many of the challenges in creating practical quantum computers are engineering challenges. In this talk, we would like to discuss what challenges quantum computing has brought to the fields of electronic and computer engineering. We would also like to discuss quantum engineering and what skills are required to begin a career in quantum engineering.
Speaker's bio: Elena Blokhina (Senior Member, IEEE) received the M.Sc. degree in physics and the Ph.D. degree in physical and mathematical sciences from Saratov State University, Russia, in 2002 and 2006, respectively, and the Habilitation HDR degree in electronic engineering from UPMC Sorbonne Universities, France, in 2017. Since 2007, she has been with University College Dublin, where she is currently an Associate Professor. Since 2019, she has also been with Equal1 Labs, where she is CTO. Her current research interests focus on the theory, modelling and characterisation of semiconductor quantum devices, quantum computing, modelling and simulations of nonlinear systems and multi-physics simulations. Prof. Blokhina was elected to serve as a member of the Board of Governors of the IEEE Circuits and Systems Society from 2013 to 2015 and was re-elected for the term 2015 to 2017. She has served as the Programme Co-Chair and General Co-Chair of multiple editions of the IEEE International Conference on Electronics, Circuits and Systems and the IEEE International Symposium on Integrated Circuits and Systems. From 2016 to 2017, she was an Associate Editor for IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, and from 2018 to 2021 she was the Deputy Editor-in-Chief of that journal. She has served as a member of organizing committees, review and programme committees, a session chair, and a track chair at many leading international conferences on microelectronic circuits and systems and device physics.
Robert Bogdan Staszewski (Fellow, IEEE) received the B.Sc. (summa cum laude), M.Sc., and Ph.D. degrees in electrical engineering from The University of Texas at Dallas, Richardson, TX, USA, in 1991, 1992, and 2002, respectively. From 1991 to 1995, he was with Alcatel Network Systems, Richardson, Texas, involved in SONET cross-connect systems for fiber optics communications. He joined Texas Instruments Incorporated, Dallas, TX, USA, in 1995, where he was elected as a Distinguished Member of Technical Staff (limited to 2% of technical staff). From 1995 to 1999, he was engaged in advanced CMOS read channel development for hard disk drives. In 1999, he co-started the Digital RF Processor (DRP) group within Texas Instruments with a mission to invent new digitally intensive approaches to traditional RF functions for integrated radios in deeply-scaled CMOS technology. He was appointed as a CTO of the DRP Group from 2007 to 2009. In 2009, he joined Delft University of Technology, Delft, The Netherlands, where he currently holds a guest appointment of a Full Professor (Antoni van Leeuwenhoek Hoogleraar). Since 2014, he has been a Full Professor with University College Dublin (UCD), Dublin, Ireland. He is also a Co-Founder of a startup company, Equal1 Labs, with design centers located in Silicon Valley and Dublin, Ireland, aiming to produce single-chip CMOS quantum computers. He has authored or coauthored five books, seven book chapters, 140 journal and 210 conference publications, and holds 210 issued U.S. patents. His research interests include nanoscale CMOS architectures and circuits for frequency synthesizers, transmitters and receivers, and quantum computers. Prof. Staszewski was a recipient of the 2012 IEEE Circuits and Systems Industrial Pioneer Award. In May 2019, he received the title of Professor from the President of the Republic of Poland. He was also the TPC Chair of the 2019 European Solid-State Circuits Conference (ESSCIRC), Krakow, Poland.
Time | Label | Presentation Title Authors |
---|---|---|
19:00 CET | 16.1.1 | ENGINEERING SKILLS THAT WILL ADVANCE QUANTUM COMPUTING Speaker and Authors: Elena Blokhina and Robert Staszewski, University College Dublin, IE Abstract Quantum computing is a computing paradigm that exploits fundamental principles of quantum mechanics to tackle problems in mathematics, chemistry, and material science that require particularly extensive computational resources. Its power is derived from a quantum bit (qubit), a physical system that can be in a superposition state and entangled with other qubits. Quantum computing is the main driver behind the phenomenal development of selected areas in electronic engineering (such as cryogenic CMOS), computer sciences, machine learning, material sciences, etc. Many of the challenges in creating practical quantum computers are engineering challenges. In this talk, we would like to discuss what challenges quantum computing has brought to the fields of electronic and computer engineering. We would also like to discuss quantum engineering and what skills are required to begin a career in quantum engineering. |
16.2 Young People Program Panel
Add this session to my calendar
Date: Monday, 21 March 2022
Time: 19:45 CET - 20:30 CET
Session chair:
Anton Klotz, Cadence, DE
Session co-chair:
Xavier Salazar, Barcelona Supercomputing Center & HiPEAC, ES
Panellists:
Antonia Schmalz, SPRIND.org, DE
Ari Kulmala, Tampere University, FI
Anna Puig-Centelles, HADEA, ES
Alba Cervera, Barcelona Supercomputing Center, ES
The session will feature a round-table discussion on different views of, and opportunities in, high-end computer science research and careers. Speakers with heterogeneous backgrounds and positions have been invited to share their insights and valuable knowledge on these different paths.
IP.2_1 Interactive presentations
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.2_1.1 | (Best Paper Award Candidate) G-GPU: A FULLY-AUTOMATED GENERATOR OF GPU-LIKE ASIC ACCELERATORS Speaker: TIAGO DIADAMI PEREZ, Tallinn University of Technology (TalTech), EE Authors: Tiago Diadami Perez1, Márcio Gonçalves2, José Rodrigo Azambuja2, Leonardo Gobatto2, Marcelo Brandalero3 and Samuel Pagliarini1 1Tallinn University of Technology (TalTech), EE; 2UFRGS, BR; 3Brandenburg University of Technology, DE Abstract Modern Systems on Chip (SoC), almost as a rule, require accelerators for achieving energy efficiency and high performance for specific tasks that are not necessarily well suited for execution in standard processing units. Considering the broad range of applications and necessity for specialization, the design of SoCs has thus become expressively more challenging. In this paper, we put forward the concept of G-GPU, a general-purpose GPU-like accelerator that is not application-specific but still gives benefits in energy efficiency and throughput. Furthermore, we have identified an existing gap for these accelerators in ASIC, for which no known automated generation platform/tool exists. Our solution, called GPUPlanner, is an open-source generator of accelerators, from RTL to GDSII, that addresses this gap. Our analysis results show that our automatically generated G-GPU designs are remarkably efficient when compared against the popular CPU architecture RISC-V, presenting speed-ups of up to 223 times in raw performance and up to 11 times when the metric is performance derated by area. These results are achieved by executing a design space exploration of the GPU-like accelerators, where the memory hierarchy is broken in a smart fashion and the logic is pipelined on demand. Finally, tapeout-ready layouts of the G-GPU in 65nm CMOS are presented. |
IP.2_1.2 | (Best Paper Award Candidate) EFFICIENT TRAVELING SALESMAN PROBLEM SOLVERS USING THE ISING MODEL WITH SIMULATED BIFURCATION Speaker: Tingting Zhang, University of Alberta, CA Authors: Tingting Zhang and Jie Han, University of Alberta, CA Abstract An Ising model-based solver has shown efficiency in obtaining suboptimal solutions for combinatorial optimization problems. As an NP-hard problem, the traveling salesman problem (TSP) plays an important role in various routing and scheduling applications. However, the execution speed and solution quality significantly deteriorate using a solver with simulated annealing (SA) due to the quadratically increasing number of spins and strong constraints placed on the spins. The ballistic simulated bifurcation (bSB) algorithm utilizes the signs of Kerr-nonlinear parametric oscillators’ positions as the spins’ states. It can update the states in parallel to alleviate the time explosion problem. In this paper, we propose an efficient method for solving TSPs by using the Ising model with bSB. Firstly, the TSP is mapped to an Ising model without external magnetic fields by introducing a redundant spin. Secondly, various evolution strategies for the introduced position and different dynamic configurations of the time step are considered to improve the efficiency in solving TSPs. The effectiveness is specifically discussed and evaluated by comparing the solution quality to SA. Experiments on benchmark datasets show that the proposed bSB-based TSP solvers offer superior performance in solution quality and achieve a significant speed up in runtime than recent SA-based ones. |
IP.2_1.3 | (Best Paper Award Candidate) PROVIDING RESPONSE TIMES GUARANTEES FOR MIXED-CRITICALITY NETWORK SLICING IN 5G Speaker: Andrea Nota, TU Dortmund, DE Authors: Andrea Nota, Selma Saidi, Dennis Overbeck, Fabian Kurtz and Christian Wietfeld, TU Dortmund, DE Abstract Mission critical applications in domains such as Industry 4.0, autonomous vehicles or Smart Grids are increasingly dependent on flexible, yet highly reliable communication systems. In this context, Fifth Generation of mobile Communication Networks (5G) promises to support mixed-criticality applications on a single unified physical communication network. This is achieved by a novel approach known as network slicing, that promises to fulfil diverging requirements while providing strict separation between network tenants. We focus in this work on hard performance guarantees by formalizing an analytical method for bounding response times in mixed-criticality 5G network slicing. We reduce pessimism considering models on workload variations. |
IP.2_2 Interactive presentations
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.2_2.1 | (Best Paper Award Candidate) SCI-FI: CONTROL SIGNAL, CODE, AND CONTROL FLOW INTEGRITY AGAINST FAULT INJECTION ATTACKS Speaker: Thomas Chamelot, University Grenoble Alpes, CEA, List, FR Authors: Thomas Chamelot1, Damien Couroussé1 and Karine Heydemann2 1University Grenoble Alpes, CEA, LIST, FR; 2Sorbonne Université, CNRS, FR Abstract Fault injection attacks have become a serious threat against embedded systems. Recently, Laurent et al. have reported that some faults inside the microarchitecture escape all typical software fault models and so software counter-measures. Moreover, state-of-the-art counter-measures, hardware-only or with hardware support, do not consider the integrity of microarchitectural control signals that are the target of these faults. We present SCI-FI, a counter-measure for Control Signal, Code, and Control-Flow Integrity against Fault Injection attacks. SCI-FI combines the protection of pipeline control signals with a fine-grained code and control-flow integrity mechanism, and can additionally provide code authentication. We evaluate SCI-FI by extending a RISC-V core. The average hardware area overheads range from 6.5% to 23.8%, and the average code size and execution time increase by 25.4% and 17.5% respectively. |
IP.2_2.2 | XTENSTORE: FAST SHIELDED IN-MEMORY KEY-VALUE STORE ON A HYBRID X86-FPGA SYSTEM Speaker: Hyungon Moon, UNIST, KR Authors: Hyunyoung Oh1, Dongil Hwang2, Maja Malenko3, Myunghyun Cho2, Hyungon Moon4, Marcel Baunach3 and Yunheung Paek2 1Seoul National University, KR; 2Dept. of Electrical and Computer Engineering and Inter-University Semiconductor Research Center (ISRC), Seoul National University, KR; 3Graz University of Technology, AT; 4UNIST, KR Abstract We propose XtenStore, a system that extends the existing SGX-based secure in-memory key-value store with an external hardware accelerator in order to ensure comparable security guarantees with lower performance degradation. The accelerator is implemented on a commodity FPGA card that is readily connected with the x86 CPU via PCIe interconnect to form a hybrid x86-FPGA system. In comparison to the prior SGX-based work, XtenStore improves the throughput by 4-33x, and exhibits considerably shorter tail latency (>23x, 99th-percentile). |
IP.2_2.3 | LEARNING TO MITIGATE ROWHAMMER ATTACKS Speaker: Biresh Kumar Joardar, Duke University, US Authors: Biresh Kumar Joardar, Tyler Bletsch and Krishnendu Chakrabarty, Duke University, US Abstract Rowhammer is a security vulnerability that arises due to the undesirable electrical interaction between physically adjacent rows in DRAMs. Rowhammer attacks cause bit flips in the neighboring rows by repeatedly accessing (hammering) a DRAM row. This phenomenon has been exploited to craft many types of attacks in platforms ranging from edge devices to datacenter servers. Existing DRAM protections using error-correction codes and targeted row refresh are not adequate for defending against Rowhammer attacks. In this work, we propose a Rowhammer-detection solution using machine learning (ML). Experimental evaluation shows that the proposed technique can reliably detect different types of Rowhammer attacks (both real and artificially engineered) and prevent bit flips. Moreover, the ML model introduces less power and performance overheads on average compared to two recently proposed Rowhammer mitigation techniques, namely Graphene and Blockhammer, for 26 different applications from the Parsec, Pampar, and Splash-2 benchmark suites |
IP.2_3 Interactive presentations
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.2_3.1 | ONCE FOR ALL SKIP: EFFICIENT ADAPTIVE DEEP NEURAL NETWORKS Speaker: Yu Yang, Yunnan University, CN Authors: Yu Yang, Di Liu, Hui Fang, Yi-Xiong Huang, Ying Sun and Zhi-Yuan Zhang, Yunnan University, CN Abstract In this paper, we propose a new module, namely extit{once for all skip} (OFAS), for adaptive deep neural networks to efficiently control the block skip within a DNN model. The novelty of OFAS is that it only needs to compute once for all skippable blocks to determine their execution states. Moreover, since adaptive DNN models with OFAS cannot achieve the best accuracy and efficiency in end-to-end training, we propose a reinforcement learning-based training method to enhance the training procedure. The experimental results with different models and datasets demonstrate the effectiveness and efficiency in comparison to the state of the arts. The code is available at url{https://github.com/ieslab-ynu/OFAS}. |
IP.2_3.2 | SELF-AWARE MIMO BEAMFORMING SYSTEMS: DYNAMIC ADAPTATION TO CHANNEL CONDITIONS AND MANUFACTURING VARIABILITY Speaker: Suhasini Komarraju, Georgia Institute of Technology, US Authors: Suhasini Komarraju and Abhijit Chatterjee, Georgia Institute of Technology, US Abstract Emerging wireless technologies employ MIMO beamforming antenna arrays to improve channel Signal-to-Noise Ratio (SNR). The increased dynamic range of channel SNR values that can be accommodated, creates power stress on Radio Frequency (RF) electronic circuitry. To alleviate this, we propose an approach in which the circuitry along with other transmission coding parameters can be dynamically tuned in response to channel SNR and beam-steering angle to either minimize power consumption or maximize throughput in the presence of manufacturing process variations while meeting a specified Bit Error Rate (BER) limit. The adaptation control policy is learned online and is facilitated by information obtained from testing of the RF circuitry before deployment. |
IP.2_3.3 | SALVAGING RUNTIME BAD BLOCKS BY SKIPPING BAD PAGES FOR IMPROVING SSD PERFORMANCE Speaker: Mincheol Kang, KAIST, KR Authors: Junoh Moon, Mincheol Kang, Wonyoung Lee and Soontae Kim, KAIST, KR Abstract Recent research has revealed that runtime bad blocks are found in the early lifespan of solid state drives. The reduction in overprovisioning space due to runtime bad blocks may well have a negative impact on performance as it weakens the chances of selecting a better victim block during garbage collection. Moreover, previous studies focused on reusing worn- out bad blocks exceeding a program/erase cycle threshold, leaving the problem of runtime bad blocks unaddressed. Based on this observation, we present a salvation scheme for runtime bad blocks. This paper reveals that these blocks can be identified when a page write fails at runtime. Furthermore, we introduce a method to salvage functioning pages from runtime bad blocks. Consequently, the loss in the overprovisioning space can be minimized even after the occurrence of runtime bad blocks. Experimental results show a 26.3% reduction in latency and a 25.6% increase in throughput compared to the baseline at a conservative bad block ratio of 0.45%. Additionally, our results confirm that almost no overhead was observed. |
IP.2_4 Interactive presentations
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.2_4.1 | SACC: SPLIT AND COMBINE APPROACH TO REDUCE THE OFF-CHIP MEMORY ACCESSES OF LSTM ACCELERATORS Speaker: Saurabh Tewari, Indian Institute Of Technology, IN Authors: Saurabh Tewari1, Anshul Kumar2 and Kolin Paul3 1I.I.T.Delhi, IN; 2I.I.T. Delhi, IN; 3IIT Delhi, IN Abstract Long Short-Term Memory (LSTM) networks are widely used in speech recognition and natural language processing. Recently, a large number of LSTM accelerators have been proposed for the efficient processing of LSTM networks. The high energy consumption of these accelerators limits their usage in energy-constrained systems. LSTM accelerators repeatedly access large weight matrices from off-chip memory, significantly contributing to energy consumption. Reducing off-chip memory access is the key to improving the energy efficiency of these accelerators. We propose a data reuse approach that splits and combines the LSTM cell computations in a way that reduces the off-chip memory accesses of LSTM hidden state matrices by 50%. In addition, the data reuse efficiency of our approach is independent of on-chip memory size, making it more suitable for small on-chip memory LSTM accelerators. Experimental results show that our approach reduces off-chip memory access by 28% and 32%, and energy consumption by 13% and 16%, respectively, compared to conventional approaches for character level Language Modelling and Speech Recognition LSTM models. |
IP.2_4.2 | NPU-ACCELERATED IMITATION LEARNING FOR THERMAL- AND QOS-AWARE OPTIMIZATION OF HETEROGENEOUS MULTI-CORES Speaker: Martin Rapp, Karlsruhe Institute of Technology, DE Authors: Martin Rapp1, Nikita Krohmer1, Heba Khdr1 and Joerg Henkel2 1Karlsruhe Institute of Technology, DE; 2Karlsruhe Institute of Technology, DE Abstract Task migration and dynamic voltage and frequency scaling (DVFS) are indispensable means in thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets. However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) V/f levels are often shared between cores on a cluster, which requires a global optimization considering all running applications. State-of-the-art techniques for power or temperature minimization either rely on measurements that are often not available (such as power) or fail to consider all the dimensions of the problem (e.g., by using simplified analytical models). Imitation learning (IL) makes it possible to exploit the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by using a neural network (NN) model and accelerate the NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread on end devices, they are so far only used to accelerate user applications. In contrast, we use an accelerator on a real platform to accelerate NN-based resource management. Our evaluation on a HiKey970 board with an Arm big.LITTLE CPU and an NPU shows significant temperature reductions at a negligible overhead while satisfying QoS targets. |
IP.2_4.3 | BMPQ: BIT-GRADIENT SENSITIVITY DRIVEN MIXED-PRECISION QUANTIZATION OF DNNS FROM SCRATCH Speaker: Souvik Kundu, University of Southern California, US Authors: Souvik Kundu1, Shikai Wang2, Qirui Sun2, Peter Beerel2 and Massoud Pedram1 1USC, US; 2University of Southern California, US Abstract Large DNNs with mixed-precision quantization can achieve ultra-high compression while retaining high classification performance. However, because of the challenges in finding an accurate metric that can guide the optimization process, these methods either sacrifice significant performance compared to the 32-bit floating-point (FP-32) baseline or rely on compute expensive iterative training policy that requires the availability of a pre-trained baseline. To address this issue, this paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models. BMPQ requires a single training iteration but does not need a pre-trained baseline. It uses an integer linear program (ILP) to dynamically adjust the precision of layers during training subject to a fixed hardware budget. To evaluate the efficacy of BMPQ, we conduct extensive experiments with VGG16 and ResNet18 on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Compared to the baseline FP-32 models, BMPQ can yield models that are 15.4× fewer parameter bits with a negligible drop in accuracy. Compared to the SOTA “during training” mixed-precision training scheme, our models are 2.1x, 2.2x, and 2.9x smaller, on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, with improved accuracy of up to 14.54%. We have open-sourced our trained models and test code for reproducibility. |
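For readers who want a concrete feel for the ILP-based bit-width assignment that IP.2_4.3 (BMPQ) describes, the following minimal Python sketch illustrates the general idea of picking one precision per layer under a hardware bit budget. The layer names, sensitivity scores, candidate precisions, and budget below are illustrative assumptions, not values or code from the paper.

```python
# Hypothetical sketch of sensitivity-driven mixed-precision bit allocation via ILP.
import pulp

layers = ["conv1", "conv2", "fc1"]
params = {"conv1": 1_000, "conv2": 4_000, "fc1": 2_000}   # parameter count per layer (assumed)
sens   = {"conv1": 0.9,   "conv2": 0.5,   "fc1": 0.2}     # bit-gradient sensitivity scores (assumed)
bits   = [2, 4, 8]                                        # candidate precisions
budget = 30_000                                           # total parameter-bit budget (assumed)

prob = pulp.LpProblem("bit_allocation", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", [(l, b) for l in layers for b in bits], cat="Binary")

# exactly one precision per layer
for l in layers:
    prob += pulp.lpSum(x[(l, b)] for b in bits) == 1
# stay within the fixed hardware bit budget
prob += pulp.lpSum(params[l] * b * x[(l, b)] for l in layers for b in bits) <= budget
# objective: give more bits to the layers with higher sensitivity
prob += pulp.lpSum(sens[l] * b * x[(l, b)] for l in layers for b in bits)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({l: next(b for b in bits if x[(l, b)].value() > 0.5) for l in layers})
```

Running the sketch prints one chosen precision per layer; in an actual bit-gradient-driven training flow the sensitivity scores would be re-estimated during training and the ILP re-solved as the budget is enforced.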
IP.2_5 Interactive presentations
Date: Tuesday, 22 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.2_5.1 | EM SCA & FI SELF-AWARENESS AND RESILIENCE WITH SINGLE ON-CHIP LOOP & ML CLASSIFIERS Speaker: Archisman Ghosh, Purdue University, US Authors: Archisman Ghosh1, Debayan Das2, Santosh Ghosh2 and Shreyas Sen1 1Purdue University, US; 2Intel Corp., US Abstract Securing ICs is becoming increasingly challenging with improvements in electromagnetic (EM) side-channel analysis (SCA) and fault injection (FI) attacks. In this work, we develop a pro-active approach to detect and counter these attacks by embedding a single on-chip integrated loop around a crypto core (AES-256), designed and fabricated using TSMC 65nm process. The measured results demonstrate that the proposed system 1) provides EM-Self-awareness by acting as an on-chip H-field sensor, detecting voltage/clock glitching fault-attacks; 2) senses an approaching EM probe to detect an incoming threat, and 3) can be used to induce EM noise to increase resilience against EM attacks. |
IP.2_5.2 | RTSEC: AUTOMATED RTL CODE AUGMENTATION FOR HARDWARE SECURITY ENHANCEMENT Speaker: Orlando Arias, University of Florida, US Authors: Orlando Arias1, Zhaoxiang Liu2, Xiaolong Guo3, Yier Jin1 and Shuo Wang1 1University of Florida, US; 2Kansas State University, US; 3Electrical and Computer Engineering Department, Kansas State University, US Abstract Current hardware designs have increased in complexity, resulting in a reduced ability to perform security checks on them. Further, the addition of any security features to these designs is still largely manual which further complicates the design and integration process. In this paper, we address these shortcomings by introducing RTSec as a framework which is capable of performing security analysis on designs as well as integrating security features directly into the HDL code, a feature that commercial EDA tools do not provide. RTSec first breaks down HDL code into an Abstract Syntax Tree which is then used to infer the logic of the design. We demonstrate how RTSec can be utilized to automatically include security mechanisms in RTL designs: watermarking and logic locking. We also compare the efficacy of our analysis algorithms with state of the art tools, demonstrating that RTSec has capabilities equal or superior to those of state of the art tools while also providing the means of enhancing security features to the design. |
IP.2_5.3 | INTER-IP MALICIOUS MODIFICATION DETECTION THROUGH STATIC INFORMATION FLOW TRACKING Speaker: Zhaoxiang Liu, Kansas State University, CN Authors: Zhaoxiang Liu1, Orlando Arias2, Weimin Fu1, Yier Jin2 and Xiaolong Guo3 1Kansas State University, US; 2University of Florida, US; 3Electrical and Computer Engineering Department, Kansas State University, US Abstract To help expand the usage of formal methods in the hardware security domain, we propose a static register-transfer level (RTL) security analysis framework and an electronic design automation (EDA) tool named If-Tracker to support the proposed framework. Through this framework, a data-flow model will be automatically extracted from the RTL description of the SoC. Information flow security properties will then be generated. The tool checks all possible inter-IP paths to verify whether any property violations exist. The effectiveness of the proposed framework is demonstrated on customized SoC designs using AMBA bus where malicious modifications are inserted across multiple IPs. Existing IP level security analysis tools cannot detect such Trojans. Compared to commercial formal tools such as Cadence JasperGold and Synopsys VC-Formal, our framework provides a much simpler user interface and can identify more types of malicious modifications. |
IP.2_6 Interactive presentations
Date: Tuesday, 22 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.2_6.1 | MANY-LAYER HOTSPOT DETECTION BY LAYER-ATTENTIONED VISUAL QUESTION ANSWERING Speaker: Yen-Shuo Chen, National Taiwan University, TW Authors: Yen-Shuo Chen and Iris Hui-Ru Jiang, National Taiwan University, TW Abstract Exploring hotspot patterns and correcting them as early as possible is crucial to guarantee yield and manufacturability. Hotspot patterns can be classified into various types according to potentially induced defects. In modern layouts, defects are caused by not only the geometry on one specific layer but also the accumulated influence from other layers. Existing hotspot detection and pattern classification methods, however, consider only the geometry on one single layer or one main layer with adjacent layers. They cannot recognize the corresponding defect type for a hotspot pattern, either. Therefore, in this paper, we investigate the linkage between many-layer hotspot patterns and corresponding potentially induced defect types. We first cast the many-layer critical hotspot pattern extraction task as a visual question answering (VQA) problem: Considering a many-layer layout pattern an image and a defect type a question, we devise a layer-attentioned VQA model to answer whether the pattern is critical to the queried defect type. Simply considering all layers equally may dilute the key features of hotspot patterns. Thus, our layer attention mechanism attempts to identify the importance and relevance of each layer for different types. Experimental results show that the proposed model has superior performance and question-answering ability based on modern layouts with more than thirty layout layers. |
IP.2_6.2 | RESTORE: REAL-TIME TASK SCHEDULING ON A TEMPERATURE AWARE FINFET BASED MULTICORE Speaker: Shounak Chakraborty, Department of Computer Science, Norwegian University of Science and Technology (NTNU), NO Authors: Yanshul Sharma1, Sanjay Moulik1 and Shounak Chakraborty2 1IIIT Guwahati, IN; 2Norwegian University of Science and Technology, NO Abstract In this work, we propose RESTORE that exploits the unique thermal feature of FinFET based multicore platforms, where processing speed increases with temperature, in the context of time-criticality to meet other design constraints of real-time systems. RESTORE is a temperature aware real-time scheduler for FinFET based multicore systems that first derives a task-to-core allocation, and prepares a schedule. Next, it balances the performance and temperature on the fly by incorporating a prudential temperature cognizant voltage/frequency scaling while guaranteeing task deadlines. Simulation results show that RESTORE is able to maintain a safe and stable thermal status (peak temperature below 80 °C), hence the frequency (3.7 GHz on average), that ensures legitimate time-critical performance for a variety of workloads while surpassing the state of the art. |
IP.2_6.3 | ONLINE PERFORMANCE AND POWER PREDICTION FOR EDGE TPU VIA COMPREHENSIVE CHARACTERIZATION Speaker: Yang Ni, University of California, Irvine, US Authors: Yang Ni1, Yeseong Kim2, Tajana S. Rosing3 and Mohsen Imani4 1University of California, Irvine, US; 2DGIST, KR; 3UCSD, US; 4University of California Irvine, US Abstract In this paper, we characterize and model the performance and power consumption of Edge TPU, which efficiently accelerates deep learning (DL) inference in a low-power environment. The systolic array is a high-throughput computation architecture, and its usage at the edge excites our interest in its performance and power patterns. We perform an extensive study for various neural network settings and sizes using more than 10,000 DL models. Through comprehensive exploration, we profile which factors highly influence the inference time and power to run DL models. We present our key observations on the relation between the performance/power and DL model complexity to enable hardware-aware optimization and design decisions. For example, our measurement shows that energy/performance is not linearly-proportional to the number of MAC operations. In fact, as the computation and DL model size increase, the performance follows a stepped pattern. Hence, an accurate estimate should consider other features of DL models such as on-chip/off-chip memory usages. Based on the characterization, we propose a modeling framework, called PETET, which performs online predictions for the performance and power of Edge TPU. The proposed method automatically identifies the relationship of the performance, power, and memory usages to the DL model settings based on machine learning techniques. |
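As a rough illustration of the kind of learned performance model that IP.2_6.3 argues for (latency is not explained by MAC count alone, so a predictor should also see memory-related features), the sketch below fits a regressor on synthetic data. The feature set, the 8 MB on-chip threshold, and the toy latency formula are assumptions made purely for illustration and are unrelated to the actual PETET implementation.

```python
# Hypothetical sketch: predict accelerator latency from model features beyond MACs.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
macs    = rng.uniform(1e6, 5e8, size=200)    # multiply-accumulate counts of 200 toy models
weights = rng.uniform(1e5, 2e7, size=200)    # parameter footprint in bytes (assumed feature)

# Synthetic latency in ms: roughly linear in MACs, plus a step once the weights
# no longer fit an assumed 8 MB of on-chip memory -- mimicking a "stepped" pattern.
latency = macs / 4e8 + np.where(weights > 8e6, 2.0, 0.0) + rng.normal(0, 0.05, 200)

X = np.column_stack([macs, weights])
model = GradientBoostingRegressor().fit(X, latency)
print(model.predict([[2e8, 1e7]]))           # latency estimate for an unseen model
```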
IP.2_7 Interactive presentations
Date: Tuesday, 22 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.2_7.1 | PROACTIVE RUN-TIME MITIGATION FOR TIME-CRITICAL APPLICATIONS USING DYNAMIC SCENARIO METHODOLOGY Speaker: Ji-Yung Lin, IMEC, TW Authors: Ji-Yung Lin1, Pieter Weckx2, Subrat Mishra2, Alessio Spessot3 and Francky Catthoor2 1KU Leuven, BE; 2IMEC, BE; 3Imec, BE Abstract Energy saving is important for both high-end processors and battery-powered devices. However, for time-critical application such as car auto-driving systems and multimedia streaming, saving energy by slowing down speed poses a threat to timing guarantee of the applications. The worst-case execution time (WCET) method is a widespread solution to this problem, but its static execution time model is not sufficient anymore for highly dynamic hardware and applications nowadays. In this work, a fully proactive run-time mitigation methodology is proposed for energy saving while ensuring timing guarantee. This methodology introduces heterogeneous datapath options, a fast fine-grained knob which enables processors to switch between datapaths of different speed and energy levels with a switching time of only tens of clock cycles. In addition, a run-time controller using a dynamic scenario methodology is developed. This methodology incorporates execution time prediction and timing guarantee criteria calculation, so it can dynamically switch knobs for energy saving while rigorously still ensuring all timing guarantees. Simulation shows that the proposed methodology can mitigate a dynamic workload without any deadline misses, and at the same time energy can be saved. |
IP.2_7.2 | ANALYZING CAN'S TIMING UNDER PERIODICALLY AUTHENTICATED ENCRYPTION Speaker: Mingqing Zhang, TU Chemnitz, DE Authors: Mingqing Zhang1, Philip Parsch1, Henry Hoffmann2 and Alejandro Masrur1 1TU Chemnitz, DE; 2University of Chicago, US Abstract With increasing connectivity in the automotive domain, it has become easier to remotely access in-vehicle buses like CAN (Controller Area Network). This not only jeopardizes security, but it also exposes CAN's limitations. In particular, to reject replay and spoofing attacks, messages need to be authenticated, i.e., an authentication tag has to be included. As a result, messages become larger and need to be split in at least two frames due to CAN's restrictive payload. This increases the delay on the bus and, thus, some deadlines may start being missed compromising safety. In this paper, we propose a Periodically Authenticated Encryption (PAE) based on the observation that we do not need to send authentication tags with every single message on the bus, but only with a configurable frequency that allows meeting both safety and security requirements. Plausibility checks can then be used to detect whether non-authenticated messages sent in between two authenticated ones have been altered or are being replayed, e.g., the transmitted values exceed a given range or are not in accordance with previous ones. We extend CAN's known schedulability analysis to consider PAE and analyze its timing behavior based on an implementation on real hardware and on extensive simulations. |
IP.2_7.3 | TOWARDS ADC-LESS COMPUTE-IN-MEMORY ACCELERATORS FOR ENERGY EFFICIENT DEEP LEARNING Speaker: Utkarsh Saxena, Purdue University, US Authors: Utkarsh Saxena, Indranil Chakraborty and Kaushik Roy, Purdue University, US Abstract Compute-in-Memory (CiM) hardware has shown great potential in accelerating Deep Neural Networks (DNNs). However, most CiM accelerators for matrix vector multiplication rely on costly analog to digital converters (ADCs) which becomes a bottleneck in achieving high energy efficiency. In this work, we propose a hardware-software co-design approach to reduce the aforementioned ADC costs through partial-sum quantization. Specifically, we replace ADCs with 1-bit sense amplifiers and develop a quantization aware training methodology to compensate for the loss in representation ability. We show that the proposed ADC-less DNN model achieves 1.1x-9.6x reduction in energy consumption while maintaining accuracy within 1\% of the DNN model without partial-sum quantization. |
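The following numpy sketch illustrates, in the spirit of IP.2_7.3, what replacing a multi-bit ADC with a 1-bit sense amplifier on crossbar partial sums can look like numerically. The crossbar slice size, the sign-based readout, and the averaging step are illustrative assumptions rather than the authors' scheme; in practice a quantization-aware training step would learn how to recombine the binarized partial sums.

```python
# Hypothetical sketch: 1-bit (sign) readout of crossbar partial sums vs. exact dot product.
import numpy as np

rng = np.random.default_rng(0)
w = rng.choice([-1.0, 1.0], size=256)    # one output column of binary weights
x = rng.standard_normal(256)             # input activations
rows = 64                                # rows summed per crossbar access (assumed)

exact = w @ x                            # what a high-resolution ADC would recover

# ADC-less readout: each 64-row partial sum is reduced to its sign by a
# 1-bit sense amplifier; a simple mean stands in for a learned recombination scale.
partials = [np.sign(w[i:i + rows] @ x[i:i + rows]) for i in range(0, 256, rows)]
adc_less = float(np.mean(partials))

print(f"exact={exact:.3f}  adc_less_readout={adc_less:.3f}")
```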
IP.MPP Multi-Partner Projects – Interactive Presentations
Date: Tuesday, 22 March 2022
Time: 11:30 CET - 12:15 CET
The session is dedicated to multi-partner innovative and high-tech research projects addressing the DATE 2022 topics. The types of collaboration covered are projects funded by EU schemes (H2020, ESA, EIC, MSCA, COST, etc.), nationally- and regionally-funded projects, and collaborative research projects funded by industry. Depending on the stage of the project, the papers present the novelty of the project concepts, the relevance of the technical objectives to the DATE community, technical highlights of the project results, and insights into the lessons learnt in the project or the issues that remain open until the end of the project. In particular, three interactive presentations cover concepts for the embedded FPGA tile of the European Processor Initiative, a training network view on approximate computing trade-offs, and an open-source RISC-V SoC with an AI accelerator.
Label | Presentation Title Authors |
---|---|
IP.MPP.1 | TOWARDS RECONFIGURABLE ACCELERATORS IN HPC: DESIGNING A MULTIPURPOSE EFPGA TILE FOR HETEROGENEOUS SOCS Speaker: Juan Miguel de Haro Ruiz, Barcelona Supercomputing Center, ES Authors: Tim Hotfilter1, Juan Miguel de Haro Ruiz2, Fabian Kreß3, Carlos Alvarez4, Fabian Kempf3, Daniel Jimenez-Gonzalez5, Miquel Moreto2, imen baili6, Jesus Labarta2 and Juergen Becker3 1Karlsruhe institute of technology, DE; 2Barcelona Supercomputing Center, ES; 3Karlsruhe Institute of Technology, DE; 4Universitat Politècnica de Catalunya, ES; 5Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 6Technical Product Marketing, FR Abstract The goal of modern high performance computing platforms is to combine low power consumption and high throughput. Within the European Processor Initiative (EPI), such an SoC platform to meet the novel exascale requirements is built and investigated. As part of this project, we introduce an embedded Field Programmable Gate Array (eFPGA), adding flexibility to accelerate various workloads. In this article, we show our approach to design the eFPGA tile that supports the EPI SoC. While eFPGAs are inherently reconfigurable, their initial design has to be determined for tape-out. The design space of the eFPGA is explored and evaluated with different configurations of two HPC workloads, covering control and dataflow heavy applications. As a result, we present a well-balanced eFPGA design that can host several use cases and potential future ones by allocating 1% of the total EPI SoC area. Finally, our simulation results of the architectures on the eFPGA show great performance improvements over their software counterparts. |
IP.MPP.2 | TOWARDS APPROXIMATE COMPUTING FOR ACHIEVING ENERGY VS. ACCURACY TRADE-OFFS Speaker: Jari Nurmi, Tampere University, FI Authors: Jari Nurmi and Aleksandr Ometov, Tampere University, FI Abstract Despite the recent advances in semiconductor technology and energy-aware system design, the overall energy consumption of computing and communication systems is rapidly growing. On the one hand, the pervasiveness of these technologies everywhere in the form of mobile devices, cyber-physical embedded systems, sensor networks, wearables, social media and context-awareness, intelligent machines, broadband cellular networks, Cloud computing, and Internet of Things (IoT) has drastically increased the demand for computing and communications. On the other hand, the user expectations on features and battery life of online devices are increasing all the time, and it creates another incentive for finding good trade-offs between performance and energy consumption. One of the opportunities to address this growing demand is to utilize an Approximate Computing approach through software and hardware design. The APROPOS project aims at finding the balance between accuracy and energy consumption, and this short paper provides an initial overview of the corresponding roadmap, as the project is still in the initial stage. |
IP.MPP.3 | THE SELENE DEEP LEARNING ACCELERATION FRAMEWORK FOR SAFETY-RELEVANT APPLICATIONS Speaker: Laura Medina, Universitat Politècnica de València, ES Authors: Laura Medina1, Salvador Carrión1, Pablo Cerezo2, Tomás Picornell1, Josè Flich3, Carles Hernandez1, Markel Sainz4, Michael Sandoval4, Charles-Alexis Lefebvre4, Martin Ronnback5, Martin Matschnig6, Matthias Wess6 and Herber Taucher6 1Universitat Politècnica de València, ES; 2Universidad Politécnica de Valencia, ES; 3Associate Professor, Universitat Politècnica de València, ES; 4Ikerlan Technology Research Centre, Basque Research and Technology Alliance (BRTA), ES; 5Cobham Gaisler, SE; 6Siemens Technology, DE Abstract The goal of the H2020 SELENE project is the development of a flexible computing platform for autonomous applications that includes built-in hardware support for safety. The SELENE computing platform is an open-source RISC-V heterogeneous multicore system-on-chip (SoC) that includes 6 NOEL-V RISC-V cores and artificial intelligence accelerators. In this paper, we describe the approach we have followed in the SELENE project to accelerate neural network inference processes. Our intermediate results show that both the FPGA and ASIC accelerators provide real-time inference performance for the analyzed network models at a reasonable implementation cost. |
L.1 Panel on Quantum and Neuromorphic Computing: "What’s it like to be an Engineer for Emerging Computing Technologies?"
Date: Tuesday, 22 March 2022
Time: 12:30 CET - 14:00 CET
Session chair:
Anne Matsuura, Intel, US
Session co-chair:
Aida Todri Sanial, LIRMM, FR
Panellists:
Fernando Gonzalez Zalba, Quantum Motion Technologies, GB
Théophile Gonos, A.I. Mergence, FR
Robert Wille, Johannes Kepler University Linz, AT
In this session, we invite four neuromorphic and quantum engineers to share their experiences of becoming engineers and of working on emerging computing technologies. After the presentations, the floor will be opened for discussion and exchange with the moderator and the audience.
17.1 Brain- and Bio-inspired architectures and applications
Date: Tuesday, 22 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Michael Niemier, University of Notre Dame, US
Session co-chair:
François Rummens, CEA, FR
This session focuses on architectures and applications in the context of biochips and neural networks. It includes discussions of solutions for adaptive droplet routing and contamination-free switches for biochips. Another aspect is the combination of graph convolutional networks and processing-in-memory. Spiking neural networks try to replicate brain-like behavior; this session shows how this emerging technology can be combined with the concept of hyperdimensional computing and how the backpropagation-through-time approach can be applied more efficiently.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 17.1.1 | (Best Paper Award Candidate) ADAPTIVE DROPLET ROUTING FOR MEDA BIOCHIPS VIA DEEP REINFORCEMENT LEARNING Speaker: Mahmoud Elfar, Duke University, US Authors: Mahmoud Elfar, Tung-Che Liang, Krishnendu Chakrabarty and Miroslav Pajic, Duke University, US Abstract Digital microfluidic biochips (DMFBs) based on a micro-electrode-dot-array (MEDA) architecture provide fine-grained control and sensing of droplets in real-time. However, excessive actuation of microelectrodes in MEDA biochips can lead to charge trapping during bioassay execution, causing the failure of microelectrodes and erroneous bioassay outcomes. A recently proposed enhancement to MEDA allows run-time measurement of microelectrode health information, thereby enabling synthesis of adaptive routing strategies for droplets. However, existing synthesis solutions are computationally infeasible for large MEDA biochips that have been commercialized. In this paper, we propose a synthesis framework for adaptive droplet routing in MEDA biochips via deep reinforcement learning (DRL). The framework utilizes the real-time microelectrode health feedback to synthesize droplet routes that proactively minimize the likelihood of charge trapping. We show how the adaptive routing strategies can be synthesized using DRL. We implement the DRL agent, the MEDA simulation environment, and the bioassay scheduler using the OpenAI Gym environment. Our framework obtains adaptive routing policies efficiently for COVID-19 testing protocols on large arrays that reflect the sizes of commercial MEDA biochips available in the marketplace, significantly increasing probabilities of successful bioassay completion compared to existing methods. |
14:34 CET | 17.1.2 | CONTAMINATION-FREE SWITCH DESIGN AND SYNTHESIS FOR MICROFLUIDIC LARGE-SCALE INTEGRATION Speaker: Duan Shen, TU Munich, DE Authors: Duan Shen, Yushen Zhang, Mengchu Li, Tsun-Ming Tseng and Ulf Schlichtmann, TU Munich, DE Abstract Microfluidic large-scale integration (mLSI) biochips have developed rapidly in recent decades. The gap between design efficiency and application complexity has led to a growing interest in mLSI design automation. The state-of-the-art design automation tools for mLSI focus on the simultaneous co-optimisation of the flow and control layers but neglect potential contamination between different fluid reagents and products. Microfluidic switches, as fluid routers at the intersection of flow paths, are especially prone to contamination. State-of-the-art tools design the switches as spines with junctions, which aggravates the contamination problem. In this work, we present a contamination-free microfluidic switch design and a synthesis method to generate application-specific switches that can be employed by physical design tools for mLSI. We also propose a scheduling and binding method to transport the fluids with the least time and fewest resources. To reduce the number of pressure inlets, we consider pressure sharing between valves within the switch. Experimental results demonstrate that our methods show advantages in avoiding contamination and improving transportation efficiency over conventional methods. |
14:38 CET | 17.1.3 | EXPLOITING PARALLELISM WITH VERTEX-CLUSTERING IN PROCESSING-IN-MEMORY-BASED GCN ACCELERATORS Speaker: Yu Zhu, Tsinghua University, CN Authors: Yu Zhu, Zhenhua Zhu, Guohao Dai, Kai Zhong, Huazhong Yang and Yu Wang, Tsinghua University, CN Abstract Recently, Graph Convolutional Networks (GCNs) have shown powerful learning capabilities in graph processing tasks. Computing GCNs with conventional von Neumann architectures usually suffers from limited memory bandwidth due to the irregular memory access. Recent work has proposed Processing-In-Memory (PIM) architectures to overcome the bandwidth bottleneck in Convolutional Neural Networks (CNNs) by performing in-situ matrix-vector multiplication. However, the performance improvement and computation parallelism of existing CNN-oriented PIM architectures is hindered when performing GCNs because of the large scale and sparsity of graphs. To tackle these problems, this paper presents a parallelism enhancement framework for PIM-based GCN architectures. At the software level, we propose a fixed-point quantization method for GCNs, which reduces the PIM computation overhead with little accuracy loss. We also introduce the vertex clustering algorithm to the graph, minimizing the inter-cluster links and realizing cluster-level parallel computing on multi-core systems. At the hardware level, we design a Resistive Random Access Memory (RRAM) based multi-core PIM architecture for GCN, which supports the cluster-level parallelism. Besides, we propose a coarse-grained pipeline dataflow to cover the RRAM write costs and improve the GCN computation throughput. At the software/hardware interface level, we propose a PIM-aware GCN mapping strategy to achieve the optimal tradeoff between resource utilization and computation performance. We also propose edge dropping methods to reduce the inter-core communications with little accuracy loss. We evaluate our framework on typical datasets with multiple widely-used GCN models. Experimental results show that the proposed framework achieves 698x, 89x, and 41x speedup with 7108x, 255x, and 31x energy efficiency enhancement compared with CPUs, GPUs, and ASICs, respectively. |
14:42 CET | 17.1.4 | ACCELERATING SPATIOTEMPORAL SUPERVISED TRAINING OF LARGE-SCALE SPIKING NEURAL NETWORKS ON GPU Speaker: LING LIANG, University of California Santa Barbara, CN Authors: LING LIANG1, Zhaodong Chen1, Lei Deng2, Fengbin Tu1, Guoqi Li3 and Yuan Xie4 1UCSB, US; 2Tsinghua, CN; 3Tsinghua, CN; 4UCSB, US Abstract Spiking neural networks (SNNs) have great potential to achieve brain-like intelligence; however, they suffer from the low accuracy of conventional synaptic plasticity rules and low training efficiency on GPUs. Recently, the emerging backpropagation through time (BPTT) inspired learning algorithms bring new opportunities to boost the accuracy of SNNs, while training on GPUs still remains inefficient due to the complex spatiotemporal dynamics and huge memory consumption, which restricts the model exploration for SNNs and prevents the advance of neuromorphic computing. In this work, we build a framework to solve the inefficiency of BPTT-based SNN training on modern GPUs. To reduce the memory consumption, we optimize the dataflow by abandoning a part of intermediate data in the forward pass and recomputing them in the backward pass. Then, we customize kernel functions to accelerate the neural dynamics for all training stages. Finally, we provide a Pytorch interface to make our framework easy-to-deploy in real systems. Compared to a vanilla Pytorch implementation, our framework can achieve up to 2.13x end-to-end speedup and consume only 0.41x peak memory on the CIFAR10 dataset. Moreover, for the distributed training on the large ImageNet dataset, we can achieve up to 1.81x end-to-end speedup and consume only 0.38x peak memory. |
14:46 CET | 17.1.5 | HYPERSPIKE: HYPERDIMENSIONAL COMPUTING FOR MORE EFFICIENT AND ROBUST SPIKING NEURAL NETWORKS Speaker: Justin Morris, University of California, San Diego, US Authors: Justin Morris1, Hin Wai Lui2, Kenneth Stewart2, Behnam Khaleghi1, Anthony Thomas1, Thiago Marback1, Baris Aksanli3, Emre Neftci4 and Tajana S. Rosing5 1University of California, San Diego, US; 2University of California, Irvine, US; 3San Diego State University, US; 4UC Irvine, US; 5UCSD, US Abstract Today’s Machine Learning (ML) systems, especially those in server farms running workloads such as Deep Neural Networks, which require billions of parameters and many hours to train a model, consume a significant amount of energy. To combat this, researchers have been focusing on new emerging neuromorphic computing models. Two of those models are Hyperdimensional Computing (HDC) and Spiking Neural Networks (SNNs), both with their own benefits. HDC has various desirable properties that other ML algorithms lack, such as robustness to noise in the system, simple operations, and high parallelism. SNNs are able to process event based signal data in an efficient manner. In this paper, we combine these two neuromorphic methods to create HyperSpike. We utilize a single SNN layer to first process the event based data and transform it into a more traditional feature vector that HDC can interpret. Then, an HDC classifier is used to enable more efficient classification as well as robustness to errors. We additionally test HyperSpike against different levels of bit error rates to experimentally show that HyperSpike is on average 31.5× more robust to errors than SNNs using other classifiers as the last layer. We also propose an ASIC accelerator for HyperSpike that provides a 10× speedup and 19.3× more energy efficiency over traditional SNN networks run on Loihi chips. |
14:50 CET | 17.1.6 | Q&A SESSION Authors: Michael Niemier1 and François Rummens2 1University of Notre Dame, US; 2CEA, FR Abstract Questions and answers with the authors |
17.2 Attacks on Secure and Trustworthy Systems
Date: Tuesday, 22 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Emanuele Valea, CEA LIST, FR
Session co-chair:
Francesco Regazzoni, University of Amsterdam and Università della Svizzera italiana, CH
In the last two decades we have witnessed a massive development of devices containing different types of valuable assets. Moreover, the globalization of the semiconductor industry has led to new trust risks. This session includes five presentations proposing novel methods to bypass state-of-the-art countermeasures against security threats (cache timing attacks and side-channel-based CPU disassembly) and trust threats (hardware Trojans and overproduction).
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 17.2.1 | A DEEP-LEARNING APPROACH TO SIDE-CHANNEL BASED CPU DISASSEMBLY AT DESIGN TIME Speaker: Hedi Fendri, ALaRI, Universita della Svizzera italiana, CH Authors: Hedi Fendri1, Marco Macchetti2, Jerome Perrine2 and Mirjana Stojilovic3 1ALaRI, Universita della Svizzera italiana, CH; 2Kudelski Group, CH; 3EPFL, CH Abstract Side-channel CPU disassembly is a side-channel attack that allows an adversary to recover instructions executed by a processor. Not only does such an attack compromise code confidentiality, it can also reveal critical information on the system’s internals. Being easily accessible to a vast number of end users, modern embedded devices are highly vulnerable against disassembly attacks. To protect them, designers deploy countermeasures and verify their efficiency in security laboratories. Clearly, any vulnerability discovered at that point, after the integrated circuit has been manufactured, represents an important setback. In this paper, we address the above issues in two steps: Firstly, we design a framework that takes a design netlist and outputs simulated power side-channel traces, with the goal of assessing the vulnerability of the device at design time. Secondly, we propose a novel side-channel disassembler, based on multilayer perceptron and sparse dictionary learning for feature engineering. Experimental results on simulated and measured side-channel traces of two commercial RISC-V devices, both working on operating frequencies of at least 100 MHz, demonstrate that our disassembler can recognize CPU instructions with success rates of 96.01% and 93.16%, respectively. |
14:34 CET | 17.2.2 | (Best Paper Award Candidate) A CROSS-PLATFORM CACHE TIMING ATTACK FRAMEWORK VIA DEEP LEARNING Speaker: Ruyi Ding, Northeastern University, US Authors: Ruyi Ding, Ziyue Zhang, Xiang Zhang, Cheng Gongye, Yunsi Fei and A. Adam Ding, Northeastern University, US Abstract While deep learning methods have been adopted in power side-channel analysis, they have not been applied to cache timing attacks due to the limited dimension of cache timing data. This paper proposes a persistent cache monitor based on cache line flushing instructions, which runs concurrently to a victim execution and captures detailed memory access patterns in high-dimensional timing traces. We discover a new cache timing side-channel across both inclusive and non-inclusive caches, different from the traditional "Flush+Flush" timing leakage. We then propose a non-profiling differential deep learning analysis strategy to exploit the cache timing traces for key recovery. We further propose a framework for cross-platform cache timing attack via deep learning. Knowledge learned from profiling a common reference device can be transferred to build models to attack many other victim devices, even in different processor families. We take the OpenSSL AES-128 encryption algorithm as an example victim and deploy an asynchronous cache attack. We target three different devices from Intel, AMD, and ARM processors. We examine various scenarios for assigning the teacher role to one device and the student role to other devices and evaluate the cross-platform deep-learning attack framework. Experimental results show that this new attack is easily extendable to victim devices and is more effective than attacks without any prior knowledge. |
14:38 CET | 17.2.3 | DESIGN OF AI TROJANS FOR EVADING MACHINE LEARNING-BASED DETECTION OF HARDWARE TROJANS Speaker: Prabhat Mishra, University of Florida, US Authors: Zhixin Pan and Prabhat Mishra, University of Florida, US Abstract The globalized semiconductor supply chain significantly increases the risk of exposing System-on-Chip (SoC) designs to malicious implants, popularly known as hardware Trojans. Traditional simulation-based validation is unsuitable for detection of carefully-crafted hardware Trojans with extremely rare trigger conditions. While machine learning (ML) based Trojan detection approaches are promising due to their scalability as well as detection accuracy, ML methods themselves are vulnerable from Trojan attacks. In this paper, we propose a robust backdoor attack on ML-based HT detection algorithms to demonstrate this serious vulnerability. The proposed framework is able to design an AI Trojan and implant it inside the ML model that can be triggered by specific inputs. Experimental results demonstrate that the proposed AI Trojans can bypass state-of-the-art defense algorithms. Moreover, our approach provides a fast and cost-effective solution in achieving 100% attack success rate that significantly outperforms state-of-the art approaches based on adversarial attacks. |
14:42 CET | 17.2.4 | DIP LEARNING ON CAS-LOCK: USING DISTINGUISHING INPUT PATTERNS FOR ATTACKING LOGIC LOCKING Speaker: Akashdeep Saha, Indian Institute of Technology, Kharagpur, IN Authors: Akashdeep Saha1, Urbi Chatterjee2, Debdeep Mukhopadhyay3 and Rajat Subhra Chakraborty4 1Indian Institute of Technology, Kharagpur, IN; 2Indian Institute of Technology Kanpur, IN; 3Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, IN; 4Associate Professor, Computer Science and Engineering, IIT Kharagpur, IN Abstract The globalization of the integrated circuit (IC) manufacturing industry has lured the adversary to come up with numerous malicious activities in the IC supply chain. Logic locking has risen to prominence as a proactive defense strategy against such threats. CAS-Lock (proposed in CHES'20) is an advanced logic locking technique that harnesses the concept of single-point function in providing SAT-attack resiliency. It is claimed to be powerful and efficient enough in mitigating existing state-of-the-art attacks against logic locking techniques. Despite the security robustness of CAS-Lock as claimed by the authors, we expose a serious vulnerability and by exploiting the same we devise a novel attack algorithm against CAS-Lock. The proposed attack can not only reveal the correct key but also the exact AND/OR structure of the implemented CAS-Lock design along with all the key gates utilized in both the blocks of CAS-Lock. It simply relies on the externally observable Distinguishing Input Patterns (DIPs) pertaining to a carefully chosen key simulation of the locked design without the requirement of structural analysis of any kind of the locked netlist. Our attack is successful against various AND/OR cascaded-chain configurations of CAS-Lock and reports a 100% success rate in recovering the correct key. It has an attack complexity of O(n), where n denotes the number of DIPs obtained for an incorrect key simulation. |
14:46 CET | 17.2.5 | MUXLINK: CIRCUMVENTING LEARNING-RESILIENT MUX-LOCKING USING GRAPH NEURAL NETWORK-BASED LINK PREDICTION Speaker: Lilas Alrahis, New York University Abu Dhabi, AE Authors: Lilas Alrahis1, Satwik Patnaik2, Muhammad Shafique1 and Ozgur Sinanoglu1 1New York University Abu Dhabi, AE; 2Texas A&M University, US Abstract Logic locking has received considerable interest as a prominent technique for protecting the design intellectual property from untrusted entities, especially the foundry. Recently, machine learning (ML)-based attacks have questioned the security guarantees of logic locking, and have demonstrated considerable success in deciphering the secret key without relying on an oracle, hence, proving to be very useful for an adversary in the fab. Such ML-based attacks have triggered the development of learning-resilient locking techniques. The most advanced state-of-the-art deceptive MUX-based locking (D-MUX) and the symmetric MUX-based locking techniques have recently demonstrated resilience against existing ML-based attacks. Both defense techniques obfuscate the design by inserting key-controlled MUX logic, ensuring that all the secret inputs to the MUXes are equiprobable. In this work, we show that these techniques primarily introduce local and limited changes to the circuit without altering the global structure of the design. By leveraging this observation, we propose a novel graph neural network (GNN)-based link prediction attack, MuxLink, that successfully breaks both the D-MUX and symmetric MUX-locking techniques, relying only on the underlying structure of the locked design, i.e., in an oracle-less setting. Our trained GNN model learns the structure of the given circuit and the composition of gates around the non-obfuscated wires, thereby generating meaningful link embeddings that help decipher the secret inputs to the MUXes. The proposed MuxLink achieves key prediction accuracy and precision up to 100% on D-MUX and symmetric MUX-locked ISCAS-85 and ITC-99 benchmarks, fully unlocking the designs. We open-source MuxLink [1]. |
14:50 CET | 17.2.6 | Q&A SESSION Authors: Emanuele Valea1 and Francesco Regazzoni2 1CEA LIST, FR; 2University of Amsterdam and ALaRI - USI, CH Abstract Questions and answers with the authors |
17.3 Algorithmic techniques for efficient and robust ML hardware
Date: Tuesday, 22 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Giulio Gambardella, Synopsys, IE
Session co-chair:
Tony Wu, Meta/Facebook, US
In this session we present results from five papers on algorithmic techniques for efficient and robust ML hardware. The first paper introduces a dynamic token-based compression technique for efficient acceleration of the attention mechanism in DNNs. The second paper sheds light on the negative effect that adversarial training has on the fault resilience of deep neural networks (DNNs) and proposes a simple weight-decay remedy that lets adversarially trained models maintain both adversarial robustness and fault resilience. The third paper proposes a joint variability- and quantization-aware DNN training algorithm and self-tuning strategy to overcome accuracy loss in highly quantized analog PIM-based models. The fourth paper presents a new training algorithm that converts deep neural networks to spiking neural networks with low latency and high spike sparsity, demonstrating 2.5-8X faster inference than prior SNN models. Finally, the last paper introduces a technique for zero-overhead ECC embedding in DNN models.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 17.3.1 | (Best Paper Award Candidate) DTQATTEN: LEVERAGING DYNAMIC TOKEN-BASED QUANTIZATION FOR EFFICIENT ATTENTION ARCHITECTURE Speaker: Tao Yang, Shanghai Jiao Tong University, CN Authors: Tao Yang, Dongyue Li, Zhuoran Song, Yilong Zhao, Fangxin Liu, Zongwu Wang, Zhezhi He and Li Jiang, Shanghai Jiao Tong University, CN Abstract Models based on the attention mechanism, i.e. transformers, have shown extraordinary performance in Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are still prohibitive for efficient inference at edge devices, even at data centers. To tackle this issue, we present an algorithm-architecture co-design with dynamic and mixed-precision quantization, DTQAtten. We present empirically that the tolerance to the noise varies from token to token in attention-based models. This finding leads us to quantize different tokens with mixed levels of bits. Thus, we design a compression framework that (i) dynamically quantizes tokens while they are forwarded in the models and (ii) jointly determines the ratio of each precision. Moreover, due to the dynamic mixed-precision tokens caused by our framework, previous matrix-multiplication accelerators (e.g. systolic array) cannot effectively exploit the benefit of the compressed attention computation. We thus design our accelerator with the Variable Speed Systolic-Array (VSSA) and propose an effective optimization strategy to alleviate the pipeline-stall problem in VSSA without hardware overhead. We conduct experiments with existing attention-based models, including BERT and GPT-2 on various language tasks. Our results show that DTQAtten outperforms the previous neural network accelerator Eyeriss by 13.12x in terms of speedup and 3.8x on average, in terms of energy-saving. Compared with the state-of-the-art attention accelerator SpAtten, our DTQAtten achieves at least 2.65x speedup and 3.38x energy efficiency improvement. |
14:34 CET | 17.3.2 | MIND THE SCALING FACTORS: RESILIENCE ANALYSIS OF QUANTIZED ADVERSARIALLY ROBUST CNNS Speaker: Nael Fasfous, TU Munich, DE Authors: Nael Fasfous1, Lukas Frickenstein2, Michael Neumeier1, Manoj Rohit Vemparala2, Alexander Frickenstein2, Emanuele Valpreda3, Maurizio Martina3 and Walter Stechele1 1TU Munich, DE; 2BMW Group, DE; 3Politecnico di Torino, IT Abstract As more deep learning algorithms enter safety-critical application domains, the importance of analyzing their resilience against hardware faults cannot be overstated. Most existing works focus on bit-flips in memory, fewer focus on compute errors, and almost none study the effect of hardware faults on adversarially trained convolutional neural networks (CNNs). In this work, we show that adversarially trained CNNs are more susceptible to failure due to hardware errors when compared to vanilla-trained models. We identify large differences in the quantization scaling factors of the CNNs which are resilient to hardware faults and those which are not. As adversarially trained CNNs learn robustness against input attack perturbations, their internal weight and activation distributions open a backdoor for injecting large magnitude hardware faults. We propose a simple weight decay remedy for adversarially trained models to maintain adversarial robustness and hardware resilience in the same CNN. We improve the fault resilience of an adversarially trained ResNet56 by 25% for large-scale bit-flip benchmarks on activation data while gaining slightly improved accuracy and adversarial robustness. |
14:38 CET | 17.3.3 | VARIABILITY-AWARE TRAINING AND SELF-TUNING OF HIGHLY QUANTIZED DNNS FOR ANALOG PIM Speaker: Zihao Deng, University of Texas at Austin, US Authors: Zihao Deng and Michael Orshansky, University of Texas at Austin, US Abstract DNNs deployed on analog processing in memory (PIM) architectures are subject to fabrication-time variability. We developed a new joint variability- and quantization-aware DNN training algorithm for highly quantized analog PIM-based models that is significantly more effective than prior work. It outperforms variability-oblivious and post-training quantized models on multiple computer vision datasets/models. For low-bitwidth models and high variation, the gain in accuracy is up to 35.7% for ResNet-18 over the best alternative. We demonstrate that, under a realistic pattern of within- and between-chip components of variability, training alone is unable to prevent large DNN accuracy loss (of up to 54% on CIFAR-100/ResNet-18). We introduce a self-tuning DNN architecture that dynamically adjusts layer-wise activations during inference and is effective in reducing accuracy loss to below 10%. |
14:42 CET | 17.3.4 | CAN DEEP NEURAL NETWORKS BE CONVERTED TO ULTRA LOW-LATENCY SPIKING NEURAL NETWORKS? Speaker: Gourav Datta, University of Southern California, US Authors: Gourav Datta and Peter Beerel, University of Southern California, US Abstract Spiking neural networks (SNNs), that operate via binary spikes distributed over time, have emerged as a promising energy efficient ML paradigm for resource-constrained devices. However, the current state-of-the-art (SOTA) SNNs require multiple time steps for acceptable inference accuracy, increasing spiking activity and, consequently, energy consumption. SOTA training strategies for SNNs involve conversion from a non-spiking deep neural network (DNN). In this paper, we determine that SOTA conversion strategies cannot yield ultra low latency because they incorrectly assume that the DNN and SNN pre-activation values are uniformly distributed. We propose a new training algorithm that accurately captures these distributions, minimizing the error between the DNN and converted SNN. The resulting SNNs have ultra low latency and high activation sparsity, yielding significant improvements in compute efficiency. In particular, we evaluate our framework on image recognition tasks from CIFAR-10 and CIFAR-100 datasets on several VGG and ResNet architectures. We obtain top-1 accuracy of 64.19% with only 2 time steps on the CIFAR-100 dataset with 159.2x lower compute energy compared to an iso-architecture standard DNN. Compared to other SOTA SNN models, our models perform inference 2.5-8x faster (i.e., with fewer time steps). |
14:46 CET | 17.3.5 | VALUE-AWARE PARITY INSERTION ECC FOR FAULT-TOLERANT DEEP NEURAL NETWORK Speaker: Seo-Seok Lee, Samsung Electronics, KR Authors: Seo-Seok Lee1 and Joon-Sung Yang2 1Samsung Electronics Co.Ltd, KR; 2Yonsei University, KR Abstract Deep neural networks (DNNs) are deployed on hardware devices and are widely used in various fields to perform inference from inputs. Unfortunately, hardware devices can become unreliable by incidents such as unintended process, voltage and temperature variations, and this can introduce the occurrence of erroneous weights. Prior study reports that the erroneous weights can cause a significant accuracy degradation. In safety-critical applications such as autonomous driving, it can bring catastrophic results. Retraining or fine-tuning can be used to adjust corrupted weights to prevent the accuracy degradation. However, training-based approaches would incur a significant computational overhead due to a massive size of training datasets and intensive training operations. Thus, this paper proposes a value-aware parity insertion error correction code (ECC) to recover erroneous weights with a reduced parity storage overhead and no additional training processes. Previous ECC-based reliability improvement methods, Weight Nulling and In-place Zero-space ECC, are compared with the proposed method. Experimental results demonstrate that DNNs with the value-aware parity insertion ECC can perform inference without the accuracy degradation, on average, in 122.5x and 15.1x higher bit error rate conditions over Weight Nulling and In-place Zero-space ECC, respectively. |
14:50 CET | 17.3.6 | Q&A SESSION Authors: Giulio Gambardella1 and Tony Wu2 1Synopsys, IE; 2Meta/Facebook, US Abstract Questions and answers with the authors |
17.4 Energy Efficiency with Emerging Technologies for the Edge and the Cloud
Date: Tuesday, 22 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Qinru Qiu, Syracuse University, US
Session co-chair:
Iraklis Anagnostopoulos, SIU, US
Papers in this session discuss approaches for reducing energy on edge devices and in the cloud by optimizing hardware and software architectures and memory management methodologies. The first paper presents a precision-scalable architecture for edge DNN accelerators. The second paper proposes energy-efficient classification for an event-based vision sensor using ternary convolutional networks. The third paper addresses the read/write overheads of NVM with an extendible hashing methodology. The fourth paper presents a new memory allocation technique based on data-structure refinement for hybrid NVM/DRAM systems. The fifth paper reduces energy cost and total cost of ownership by replacing x86-based rack servers with a large number of ARM-based single-board computers for serverless Function-as-a-Service platforms.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 17.4.1 | A PRECISION-SCALABLE ENERGY-EFFICIENT BIT-SPLIT-AND-COMBINATION VECTOR SYSTOLIC ACCELERATOR FOR NAS-OPTIMIZED DNNS ON EDGE Speaker: Junzhuo Zhou, Southern University of Science and Technology, CN Authors: Kai Li, Junzhuo Zhou, Yuhang Wang, Junyi Luo, Zhengke Yang, Shuxin Yang, Wei Mao, Mingqiang Huang and Hao Yu, Southern University of Science and Technology, CN Abstract Optimized model and energy-efficient hardware are both required for deep neural networks (DNNs) in edge-computing area. Neural architecture search (NAS) methods are employed for DNN model optimization with resulted multi-precision networks. Previous works have proposed low-precision-combination (LPC) and high-precision-split (HPS) methods for multi-precision networks, which are not energy-efficient for precision-scalable vector implementation. In this paper, a bit-split-and-combination (BSC) based vector systolic accelerator is developed for a precision-scalable energy-efficient convolution on edge. The maximum energy efficiency of the proposed BSC vector processing element (PE) is up to 1.95x higher in 2-bit, 4-bit and 8-bit operations when compared with LPC and HPS PEs. Further with NAS optimized multi-precision CNN networks, the averaged energy efficiency of the proposed vector systolic BSC PE array achieves up to 2.18x higher in 2-bit, 4-bit and 8-bit operations than that of LPC and HPS PE arrays. |
14:34 CET | 17.4.2 | TERNARIZED TCN FOR μJ/INFERENCE GESTURE RECOGNITION FROM DVS EVENT FRAMES Speaker: Georg Rutishauser, ETH Zürich, CH Authors: Georg Rutishauser1, Moritz Scherer1, Tim Fischer1 and Luca Benini2 1ETH Zürich, CH; 2Università di Bologna and ETH Zürich, IT Abstract Dynamic Vision Sensors (DVS) offer the opportunity to scale the energy consumption in image acquisition proportionally to the activity in the captured scene by only transmitting data when the captured image changes. Their potential for energy-proportional sensing makes them highly attractive for severely energy-constrained sensing nodes at the edge. Most approaches to the processing of DVS data employ Spiking Neural Networks to classify the input from the sensor. In this paper, we propose an alternative, event frame-based approach to the classification of DVS video data. We assemble ternary video frames from the event stream and process them with a fully ternarized Temporal Convolutional Network which can be mapped to CUTIE, a highly energy-efficient Ternary Neural Network accelerator. The network mapped to the accelerator achieves a classification accuracy of 94.5 %, matching the state of the art for embedded implementations. We implement the processing pipeline in a modern 22 nm FDX technology and perform post-synthesis power simulation of the network running on the system, achieving an inference energy of 1.7 μJ, which is 647× lower than previously reported results based on Spiking Neural Networks. |
14:38 CET | 17.4.3 | REH: REDESIGNING EXTENDIBLE HASHING FOR COMMERCIAL NON-VOLATILE MEMORY Speaker: Zhengtao Li, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, CN Authors: Zhengtao Li, Zhipeng Tan and Jianxi Chen, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, CN Abstract Emerging Non-volatile Memory (NVM) is attractive because of its byte-addressability, durability, and DRAM-scale latency. Hashing indexes have been extensively used to provide fast query services in the storage system. Recent research proposes crash-consistent and write-optimized hashing indexes for NVM. However, existing NVM-based hashing indexes suffer from limited scalability when running on a Commercial Non-Volatile Memory product, named Intel Optane DC Persistent Memory Module (DCPMM), due to the limited bandwidth of Optane DCPMM. To achieve a high load factor, existing NVM-based hashing indexes often evict an existing item to its alternative position, which incurs extra write and will consume the limited bandwidth. Moreover, the lock operations and metadata updates further saturate the limited bandwidth and prevent the hash table from scaling. In order to achieve scalability performance as well as a high load factor for the NVM-based hashing index, we design a new persistent hashing index, called REH, based on extendible hashing. REH (1) proposes a selective persistence scheme that stores buckets in NVM and places directory and metadata in DRAM to reduce both unnecessary NVM reads and writes, (2) uses 256B sized-buckets, as 256B is the internal data access size in Optane DCPMM, and the buckets are directly pointed to by directory entries, (3) leverages fingerprinting to further reduce unnecessary NVM reads, (4) employs failure-atomic bucket split to reduce bucket split overhead. Evaluations show that REH outperforms the state-of-the-art NVM-based hashing indexes by up to 1.68∼7.78×. In the meantime, REH can achieve a high load factor. |
14:42 CET | 17.4.4 | MEMORY MANAGEMENT METHODOLOGY FOR APPLICATION DATA STRUCTURE REFINEMENT AND PLACEMENT ON HETEROGENEOUS DRAM/NVM SYSTEMS Speaker: Manolis Katsaragakis, National TU Athens and KU Leuven, GR Authors: Manolis Katsaragakis1, Lazaros Papadopoulos2, Christos Baloukas2 and Dimitrios Soudris2 1National TU Athens and KU Leuven, GR; 2National TU Athens, GR Abstract Memory systems that combine multiple memory technologies with different performance and energy characteristics are becoming mainstream. Existing data placement strategies evolve to map application requirements to the underlying heterogeneous memory systems. In this work, we propose a memory management methodology that leverages a data structure refinement approach to improve data placement results, in terms of execution time and energy consumption. The methodology is evaluated on three machine learning algorithms deployed on various NVM technologies, both on emulated and on real DRAM/NVM systems. Results show execution time improvements of up to 57% and energy consumption gains of up to 41%. |
14:46 CET | 17.4.5 | MICROFAAS: ENERGY-EFFICIENT SERVERLESS ON BARE-METAL SINGLE-BOARD COMPUTERS Speaker: Anthony Byrne, Boston University, US Authors: Anthony Byrne1, Yanni Pang1, Allen Zou1, Shripad Nadgowda2 and Ayse Coskun1 1Boston University, US; 2IBM T.J. Watson Research Center, US Abstract Serverless function-as-a-service (FaaS) platforms offer a radically-new paradigm for cloud software development, yet the hardware infrastructure underlying these platforms is based on a decades-old design pattern. The rise of FaaS presents an opportunity to reimagine cloud infrastructure to be more energy-efficient, cost-effective, reliable, and secure. In this paper, we show how replacing handfuls of x86-based rack servers with hundreds of ARM-based single-board computers could lead to a virtualization-free, energy-proportional cloud that achieves this vision. We call our systematically-designed implementation MicroFaaS, and we conduct a thorough evaluation and cost analysis comparing MicroFaaS to a throughput-matched FaaS platform implemented in the style of conventional virtualization-based cloud systems. Our results show a 5.6x increase in energy efficiency and 34.2% decrease in total cost-of-ownership compared to our baseline. |
14:50 CET | 17.4.6 | Q&A SESSION Authors: Qinru Qiu1 and Iraklis Anagnostopoulos2 1Syracuse University, US; 2Southern Illinois University Carbondale, US Abstract Questions and answers with the authors |
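As a purely illustrative aside on 17.4.1 above, the sketch below shows the arithmetic idea behind a bit-split-and-combination multiply: an 8-bit multiplication is decomposed into 2-bit partial products that are shifted and accumulated. The function names and chunk width are our own choices for the example and do not describe the BSC PE hardware itself.

```python
# Toy numeric sketch of bit-split-and-combine multiplication (not the BSC PE itself):
# an 8-bit x 8-bit unsigned multiply built from 2-bit sub-multiplications.

def split_2bit(value, n_chunks=4):
    """Split an unsigned integer into little-endian 2-bit chunks."""
    return [(value >> (2 * i)) & 0b11 for i in range(n_chunks)]

def bsc_multiply(a, b, n_chunks=4):
    """Multiply two 8-bit unsigned operands via shifted 2-bit partial products."""
    acc = 0
    for i, ai in enumerate(split_2bit(a, n_chunks)):
        for j, bj in enumerate(split_2bit(b, n_chunks)):
            acc += (ai * bj) << (2 * (i + j))  # shift-and-accumulate each partial product
    return acc

# Exhaustive check over all 8-bit operand pairs.
assert all(bsc_multiply(a, b) == a * b for a in range(256) for b in range(256))
```

Lower-precision operands simply use fewer chunks, which is what makes a split-and-combine organization precision-scalable.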
17.5 Putting Place and Route research on the right track
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Behjat Laleh, University of Calgary, CA
Session co-chair:
Jens Lienig, TU Dresden, DE
This session discusses how placement and routing can be done more efficiently. The first paper presents a global routing framework running on hybrid CPU-GPU platforms with a heterogeneous task scheduler achieving considerable speedup over sequential implementations and state-of-the-art routers. The second paper addresses the track assignment during detailed routing. The third and the fourth papers show that routing violations can be reduced if the root causes of the problems are tackled during the placement stage. The last paper brings us back to the use of GPU and CPU and discusses how they can be employed during legalization to reduce runtime.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 17.5.1 | (Best Paper Award Candidate) FASTGR: GLOBAL ROUTING ON CPU-GPU WITH HETEROGENEOUS TASK GRAPH SCHEDULER Speaker: Siting Liu, The Chinese University of Hong Kong, HK Authors: Siting Liu1, Peiyu Liao1, Rui Zhang2, Zhitang Chen3, Wenlong Lv4, Yibo Lin5 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2HiSilicon Technologies Co. Ltd., CN; 3Huawei Noah's Ark Lab, HK; 4Huawei Noah's Ark Lab, CN; 5Peking University, CN Abstract Routing is an essential step to integrated circuits (IC) design closure. With the rapid increase of design scales, routing has become the runtime bottleneck in the physical design flow. Thus, accelerating routing becomes a vital and urgent task for IC design automation. This paper proposes a global routing framework running on hybrid CPU-GPU platforms with a heterogeneous task scheduler and a GPU-accelerated pattern routing algorithm. We demonstrate that the task scheduler can lead to 2.307× speedup compared with the widely-adopted batch-based parallelization strategy on CPU and the GPU-accelerated pattern routing algorithm can contribute to 10.877× speedup over the sequential algorithm on CPU. Finally, the combined techniques can achieve 2.426× speedup without quality degradation compared with the state-of-the-art global router. |
14:34 CET | 17.5.2 | TRADER: A PRACTICAL TRACK-ASSIGNMENT-BASED DETAILED ROUTER Speaker: Zhen Zhuang, Fuzhou University, CN Authors: Zhen Zhuang1, Genggeng Liu1, Tsung-Yi Ho2, Bei Yu2 and Wenzhong Guo1 1Fuzhou University, CN; 2The Chinese University of Hong Kong, HK Abstract As the last stage of VLSI routing, detailed routing should consider complicated design rules in order to meet the manufacturability of chips. With the continuous development of VLSI technology node, the design rules are changing and increasing which makes detailed routing a hard task. In this paper, we present a practical track-assignment-based detailed router to deal with the most representative design rules in modern designs. The proposed router consists of four major stages: (1) a graph-based track assignment algorithm is proposed to optimize the design rule violations of an entire die area; (2) an effective rip-up and reroute method is used to reduce the design rule violations in local regions; (3) a segment migration algorithm is proposed to reduce short violations; and (4) a stack via optimization technique is proposed to reduce minimum area violations. Practical benchmarks from 2019 ISPD contest are used to evaluate the proposed router. Compared with the state-of-the-art detailed router, Dr. CU 2.0, the number of violations can be reduced by up to 35.11% with an average reduction rate of 10.08%. The area of short can be reduced by up to 61.49% with an average reduction rate of 44.80%. |
14:38 CET | 17.5.3 | CR&P: AN EFFICIENT CO-OPERATION BETWEEN ROUTING AND PLACEMENT Speaker: Erfan Aghaeekiasaraee, University of Calgary, CA Authors: Erfan Aghaeekiasaraee1, Aysa Fakheri Tabrizi1, Tiago Fontana2, Renan Netto3, Sheiny Almeida3, Upma Gandhi1, Jose Guntzel3, David Westwick1 and Laleh Behjat1 1University of Calgary, CA; 2Federal University of Santa Catarina (UFSC), BR; 3Federal University of Santa Catarina, BR Abstract Placement and Routing (P&R) are two main steps of the physical design flow implementation. Traditionally, because of their complexity, these two steps are performed separately. But the implementation of the physical design in advanced technology nodes shows that the performance of these two steps is tied to each other. Therefore creating efficient co-operation between the routing and placement steps has become a hot topic in Electronic Design Automation (EDA). In this work, to achieve an efficient collaboration between the routing and placement engines, an iterative replacement and rerouting framework facilitated with an Integer Linear Programming (ILP)-based legalizer is proposed and tested on the ACM/IEEE International Symposium on Physical Design (ISPD) 2018 contest's benchmarks. Numerical results show that the proposed framework can improve detailed routing vias and wirelength by 2.06% and 0.14% on average in a reasonable runtime without adding new Design Rule Violations (DRVs). The proposed framework can be considered as an add-on to the physical design flow between global routing and detailed routing. |
14:42 CET | 17.5.4 | PIN ACCESSIBILITY-DRIVEN PLACEMENT OPTIMIZATION WITH ACCURATE AND COMPREHENSIVE PREDICTION MODEL Speaker: Suwan Kim, Seoul National University, KR Authors: Suwan Kim and Taewhan Kim, Seoul National University, KR Abstract The significantly increased density of pins of standard cells and the reduced number of routing tracks at sub-10nm nodes have made the pin access problem in detailed routing very difficult. To alleviate this pin accessibility problem in detailed routing, recent works have proposed to apply small perturbations such as cell shifting, cell flipping, and adjacent-cell swapping in the detailed placement stage. Here, an essential element for the success of pin accessibility aware detailed placement is the cost function employed, which should be sufficiently accurate in predicting the degree of routing difficulty in accessing pins. In this work, we propose a new model of cost function that is comprehensively devised to overcome the limitations of the prior ones. Precisely, unlike the conventional cost functions, our proposed cost function model is based on empirical routing data in order to fully reflect the potential outcomes of detailed routing. Through experiments with benchmark circuits, it is shown that using our proposed cost function in detailed placement reduces the routing errors by 44% on average, while the existing cost functions reduce the routing errors on average by at most 15%. |
14:46 CET | 17.5.5 | MIXED-CELL-HEIGHT LEGALIZATION ON CPU-GPU HETEROGENEOUS SYSTEMS Speaker: Haoyu Yang, NVIDIA Corp., US Authors: Haoyu Yang1, Kit Fung2, Yuxuan Zhao2, Yibo Lin3 and Bei Yu4 1NVIDIA Corp., US; 2Chinese University of Hong Kong, HK; 3Peking University, CN; 4The Chinese University of Hong Kong, HK Abstract Legalization refines post-global-placement cell locations to reconcile design constraints and parameters. These include placement fence regions, power/ground rail alignments, timing, wirelength, etc. In advanced technology nodes, designs can easily contain millions of multiple-row standard cells, which challenges the scalability of modern legalization algorithms. In this paper, for the first time, we investigate dedicated legalization algorithms on heterogeneous platforms, which promise intelligent usage of CPU and GPU resources and hence provide new algorithm design methodologies for large-scale physical design problems. Experimental results on ICCAD 2017 and ISPD 2015 contest benchmarks demonstrate the effectiveness and the efficiency of the proposed algorithm, compared to the state-of-the-art legalization solution for mixed-cell-height designs. |
14:50 CET | 17.5.6 | Q&A SESSION Authors: Laleh Behjat1 and Jens Lienig2 1University of Calgary, CA; 2TU Dresden, DE Abstract Questions and answers with the authors |
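For readers unfamiliar with pattern routing, the technique accelerated on GPU in 17.5.1 above, the toy sketch below routes a single two-pin net by choosing the cheaper of its two L-shaped candidates on a congestion grid. It is only a minimal illustration under assumed names and costs, and omits everything that makes FastGR fast (the task-graph scheduler, batching, and fallback routing).

```python
# Minimal L-shaped pattern routing for one two-pin net on a congestion grid
# (an illustrative sketch, not the FastGR algorithm).

def l_route(src, dst, bend):
    """Grid cells covered by an L-shaped route going src -> bend -> dst."""
    cells = set()
    for (ax, ay), (bx, by) in ((src, bend), (bend, dst)):
        if ax == bx:   # vertical segment
            cells |= {(ax, y) for y in range(min(ay, by), max(ay, by) + 1)}
        else:          # horizontal segment (ay == by for an L route)
            cells |= {(x, ay) for x in range(min(ax, bx), max(ax, bx) + 1)}
    return cells

def route_two_pin_net(src, dst, congestion):
    """Pick the cheaper of the two L patterns under a per-cell congestion cost."""
    candidates = [l_route(src, dst, (dst[0], src[1])),   # horizontal first
                  l_route(src, dst, (src[0], dst[1]))]   # vertical first
    return min(candidates, key=lambda cells: sum(congestion.get(c, 0.0) for c in cells))

congestion = {(1, 0): 5.0, (2, 0): 5.0}                  # a congested row near the source
print(route_two_pin_net((0, 0), (3, 3), congestion))     # the vertical-first L wins
```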
17.6 Multi-Partner Projects – Session 1
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Leticia Maria Bolzani Poehls, RWTH Aachen University, DE
Session co-chair:
Maksim Jenihhin, Tallinn UT, EE
The session is dedicated to multi-partner innovative and high-tech research projects addressing the DATE 2022 topics. The types of collaboration covered are projects funded by EU schemes (H2020, ESA, EIC, MSCA, COST, etc.), nationally- and regionally-funded projects, and collaborative research projects funded by industry. Depending on the stage of the project, the papers present the novelty of the project concepts, the relevance of the technical objectives to the DATE community, technical highlights of the project results, and insights into the lessons learnt in the project or the work remaining until the end of the project. In particular, this session discusses projects for automotive and safety-critical systems covering security aspects, RISC-V architecture platforms and cross-layer concepts for reliability analysis.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 17.6.1 | A COMPREHENSIVE SOLUTION FOR SECURING CONNECTED AND AUTONOMOUS VEHICLES Speaker: Theocharis Theocharides, KIOS Research and Innovation Center of Excellence, University of Cyprus, CY Authors: Mohsin Kamal1, Christos Kyrkou2, Nikos Piperigkos3, Andreas Papandreou3, Andreas Kloukiniotis3, Jordi Casademont4, Natalia Mateu5, Daniel Castillo5, Rodrigo Rodriguez6, Nicola Durante6, Peter Hofmann7, Petros Kapsalas8, Aris Lalos9, Konstantinos Moustakas9, Christos Laoudias1, Theocharis Theocharides2 and Georgios Ellinas10 1KIOS Research and Innovation Center of Excellence, University of Cyprus, CY; 2University of Cyprus, CY; 3Department of Electrical and Computer Engineering, University of Patras, Greece, GR; 4Universitat Politecnica de Catalunya and Fundacio i2CAT, Barcelona, ES; 5Nextium by Idneo, ES; 6Atos IT Solutions and Services Iberia S.L., Madrid, ES; 7Deutsche Telekom Security GmbH, T-Systems, Berlin, DE; 8Panasonic Automotive, Langen, DE; 9Department of Electrical and Computer Engineering, University of Patras, GR; 10Department of Electrical and Computer Engineering, University of Cyprus, Nicosia, CY Abstract With the advent of Connected and Autonomous Vehicles (CAVs) comes the very real risk that these vehicles will be exposed to cyber-attacks by exploiting various vulnerabilities. This paper gives a technical overview of the H2020 CARAMEL project (currently in its intermediate stage), in which Artificial Intelligence (AI)-based cybersecurity for CAVs is the main goal. Most of the possible scenarios by which an adversary can generate attacks on CAVs are considered, such as attacks on camera sensors, GPS location, Vehicle-to-Everything (V2X) message transmission, the vehicle's On-Board Unit (OBU), etc. The countermeasures to these attacks and vulnerabilities are presented via the current results of the CARAMEL project, achieved by implementing the designed security algorithms. |
14:34 CET | 17.6.2 | PHYSICAL AND FUNCTIONAL REVERSE ENGINEERING CHALLENGES FOR ADVANCED SEMICONDUCTOR SOLUTIONS Speaker: Bernhard Lippmann, Infineon, DE Authors: Bernhard Lippmann1, Matthias Ludwig1, Johannes Mutter1, Ann-Christin Bette1, Alexander Hepp2, Johanna Baehr2, Martin Rasche3, Oliver Kellermann3, Horst Gieser4, Tobias Zweifel4 and Nicola Kovač4 1Infineon, DE; 2TU Munich, DE; 3RAITH, DE; 4Fraunhofer, DE Abstract Motivated by the threats of malicious modification and piracy arising from worldwide distributed supply chains, the goal of RESEC is the creation, verification, and optimization of a complete reverse engineering process for integrated circuits manufactured in technology nodes of 40 nm and below. Building upon the presentation of individual reverse engineering process stages, this paper connects analysis efforts and yields with their impact on hardware security, demonstrated on a design with implemented hardware Trojans. We outline the interim stage of our research activities and present our future targets linking chip design and physical verification processes. |
14:38 CET | 17.6.3 | DE-RISC: A COMPLETE RISC-V BASED SPACE-GRADE PLATFORM Speaker: Jaume Abella, Barcelona Supercomputing Center, ES Authors: Nils-Johan Wessman1, Fabio Malatesta1, Stefano Ribes1, Jan Andersson1, Antonio Garcia-Vilanova2, Miguel Masmano2, Vicente Nicolau2, Paco Gomez2, Jimmy Le Rhun3, Sergi Alcaide4, Guillem Cabo5, Francisco Bas4, Pedro Benedicte5, Fabio Mazzocchetti5 and Jaume Abella5 1CAES Gaisler, SE; 2fentISS, ES; 3Thales Research and Technology, FR; 4Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 5Barcelona Supercomputing Center, ES Abstract The H2020 EIC-FTI De-RISC project develops a RISC-V space-grade platform to jointly respond to several emerging, as well as longstanding needs in the space domain such as: (1) higher performance than that of monocore and basic multicore space-grade processors in the market; (2) access to an increasingly rich software ecosystem rather than sticking to the slowly fading SPARC and PowerPC-based ones; (3) freedom (or drastic reduction) of export and license restrictions imposed by commercial ISAs such as ARM; and (4) improved support for the design and validation of safety-related real-time applications, (5) being the platform with software qualified and hardware designed per established space industry standards. De-RISC partners have set up the different layers of the platform during the first phases of the project. However, they have recently boosted integration and assessment activities. This paper introduces the De-RISC space platform, presents recent progress such as enabling virtualization and software qualification, new MPSoC features, and use case deployment and evaluation, including a comparison against other commercial platforms. Finally, this paper introduces the ongoing activities that will lead to the hardware and fully qualified software platform at TRL8 on FPGA by September 2022. |
14:42 CET | 17.6.4 | THE SCALE4EDGE RISC-V ECOSYSTEM Speaker: Wolfgang Ecker, Infineon Technologies AG, DE Authors: Wolfgang Ecker1, Milos Krstic2, Andreas Mauderer3, Eyck Jentzsch4, Andreas Koch5, Wolfgang Müller6, Vladimir Herdt7, Daniel Mueller-Gritschneder8, Rafael Stahl8, Kim Grüttner9, Jörg Bormann10, Wolfgang Kunz11, Reinhold Heckmann12, Ralf Wimmer13, Bernd Becker14, Philipp Scholl14, Oliver Bringmann15, Johannes Partzsch16 and Christian Mayr16 1Infineon Technologies AG, DE; 2IHP, DE; 3Robert Bosch GmbH, DE; 4MINRES Technologies GmbH, DE; 5TU Darmstadt, DE; 6Paderborn University, DE; 7University Bremen, DE; 8TU Munich, DE; 9OFFIS - Institute for Information Technology, DE; 10Siemens EDA, DE; 11TU Kaiserslautern, DE; 12AbsInt Angewandte Informatik GmbH, DE; 13Concept Engineering GmbH, DE; 14University of Freiburg, DE; 15University of Tuebingen / FZI, DE; 16TU Dresden, DE Abstract This paper introduces the project Scale4Edge. The project is focused on enabling an effective RISC-V ecosystem for optimization of edge applications. We describe the basic components of this ecosystem and introduce the envisioned demonstrators, which will be used in their evaluation. |
14:46 CET | 17.6.5 | XANDAR: EXPLOITING THE X-BY-CONSTRUCTION PARADIGM IN MODEL-BASED DEVELOPMENT OF SAFETY-CRITICAL SYSTEMS Speaker: Leonard Masing, Karlsruhe Institute of Technology, DE Authors: Leonard Masing1, Tobias Dörr1, Florian Schade2, Juergen Becker1, Georgios Keramidas3, Christos Antonopoulos3, Michail Mavropoulos3, Efstratios Tiganourias3, Vasilios Kelefouras3, Konstantinos Antonopoulos3, Nikolaos Voros3, Umut Durak4, Alexander Ahlbrecht4, Wanja Zaeske4, Christos Panagiotou5, Dimitris Karadimas5, Nico Adler6, Andreas Sailer6, Raphael Weber6, Thomas Wilhelm6, Geza Nemeth7, Fahad Siddiqui8, Rafiullah Khan8, Vahid Garousi8, Sakir Sezer8 and Victor Morales9 1Karlsruhe Institute of Technology, DE; 2Karlsruhe Institute of Technology, DE; 3University of Peloponnese, GR; 4German Aerospace Center (DLR), DE; 5AVN Innovative Technology Solutions Limited, CY; 6Vector Informatik GmbH, DE; 7Bayerische Motoren Werke Aktiengesellschaft, DE; 8Queen’s University, Belfast, GB; 9fentISS, ES Abstract Realizing desired properties “by construction” is a highly appealing goal in the design of safety-critical embedded systems. As verification and validation tasks in this domain are often both challenging and time-consuming, the by-construction paradigm is a promising solution to increase design productivity and reduce design errors. In the XANDAR project, partners from industry and academia develop a toolchain that will advance current development processes by employing a model-based X-by-Construction (XbC) approach. XANDAR defines a development process, metamodel extensions, a library of safety and security patterns, and investigates many further techniques for design automation, verification, and validation. The developed toolchain will use a hypervisor-based platform, targeting future centralized, AI-capable high-performance embedded processing systems. It is co-developed and validated in both an avionics use case for situation perception and pilot assistance as well as an automotive use case for autonomous driving. |
14:50 CET | 17.6.6 | FLODAM: CROSS-LAYER RELIABILITY ANALYSIS FLOW FOR COMPLEX HARDWARE DESIGNS Speaker: Angeliki Kritikakou, Univ Rennes, Inria, CNRS, IRISA, FR Authors: Angeliki Kritikakou1, Olivier Sentieys2, Guillaume Hubert3, Youri Helen4, Jean-francois Coulon5 and Patrice Deroux-Dauphin5 1Univ Rennes, Inria, CNRS, IRISA, FR; 2INRIA, FR; 3ONERA, FR; 4DGA, FR; 5Temento, FR Abstract Modern technologies make hardware designs more and more sensitive to radiation particles and related faults. As a result, analysing the behavior of a system under radiation-induced faults has become an essential part of the system design process. Existing approaches either focus on analysing the radiation impact at the lower hardware design layers, without further propagating any radiation-induced fault to the system execution, or analyse system reliability at higher hardware or application layers, based on fault models that are agnostic of the fabrication technology and the radiation environment. FLODAM combines the benefits of existing approaches by providing a novel cross-layer reliability analysis from the semiconductor layer up to the application layer, able to quantify the risks of faults under a given context, taking into account the environmental conditions, the physical hardware design and the application under study. |
14:54 CET | 17.6.7 | Q&A SESSION Authors: Leticia Maria Bolzani Poehls1 and Maksim Jenihhin2 1RWTH Aachen University, DE; 2Tallinn University of Technology, EE Abstract Questions and answers with the authors |
18.1 Domain-specific co-design: From sensors to graph analytics
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Jeronimo Castrillon, TU Dresden, DE
Session co-chair:
Paula Herber, WWU Munster, DE
This session demonstrates how domain-specific knowledge can be leveraged to design algorithms and micro-architectures with improved computational efficiency. The presentations touch upon CNN optimizations, custom vectorization for sparse solvers and cache-aware data management for graph analytics. For instance, the authors exploit the structure of matrices in circuit simulation to better use modern vector instructions, propose highly energy-efficient architectures for spiking neural networks by modifying the order in which loops are processed, design predictors to reduce CNN operations at runtime, and improve the utilization of the memory subsystem by judiciously bypassing hierarchy levels for graph analytics. The session also describes how a generative adversarial network, trained on input and output traffic patterns, can detect hardware Trojans in router designs.
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 18.1.1 | SNE: AN ENERGY-PROPORTIONAL DIGITAL ACCELERATOR FOR SPARSE EVENT-BASED CONVOLUTIONS Speaker: Alfio Di Mauro, ETH Zürich, CH Authors: Alfio Di Mauro1, Arpan Prasad1, Zhikai Huang1, Matteo Spallanzani1, Francesco Conti2 and Luca Benini3 1ETH Zürich, CH; 2University of Bologna, IT; 3Università di Bologna and ETH Zürich, IT Abstract Event-based sensors are drawing increasing attention due to their high temporal resolution, low power consumption, and low bandwidth. To efficiently extract semantically meaningful information from sparse data streams produced by such sensors, we present a 4.5TOP/s/W digital accelerator capable of performing 4-bits-quantized event-based convolutional neural networks (eCNN). Compared to standard convolutional engines, our accelerator performs a number of operations proportional to the number of events contained into the input data stream, ultimately achieving a high energy-to-information processing proportionality. On the IBM-DVS-Gesture dataset, we report 80uJ/inf to 261uJ/inf, respectively, when the input activity is 1.2% and 4.9%. Our accelerator consumes 0.221pJ/SOP, to the best of our knowledge it is the lowest energy/OP reported on a digital neuromorphic engine. |
15:44 CET | 18.1.2 | LRP: PREDICTIVE OUTPUT ACTIVATION BASED ON SVD APPROACH FOR CNNS ACCELERATION Speaker: Xinxin Wu, Institute of Computing Technology, Chinese Academy of Sciences, CN Authors: Xinxin Wu, Zhihua Fan, Tianyu Liu, Wenming Li, Xiaochun Ye and Dongrui Fan, Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract Convolutional Neural Networks (CNNs) achieve state-of-the-art performance in a wide range of applications. CNNs contain millions of parameters, and their large number of computations challenges hardware design. In this paper, we take advantage of the output activation sparsity of CNNs to reduce the execution time and energy consumption of the network. We propose Low Rank Prediction (LRP), an effective prediction method that leverages the output activation sparsity. LRP first predicts the output activation polarity of the convolutional layer based on a singular value decomposition (SVD) of the convolution kernel. It then uses the predicted negative values to skip invalid computations in the original convolution. In addition, an effective accelerator, LRPPU, is proposed to take advantage of sparsity to achieve network inference acceleration. Experiments show that our LRPPU achieves 1.48× speedup and 2.02× energy reduction compared with dense networks with slight loss of accuracy. Also, it achieves on average 2.57× speedup over Eyeriss and has similar performance and less accuracy loss compared with SnaPEA. |
15:48 CET | 18.1.3 | EXPLOITING ARCHITECTURE ADVANCES FOR SPARSE SOLVERS IN CIRCUIT SIMULATION Speaker: Zhiyuan Yan, Institute of Computing Technology, Chinese Academy of Sciences, CN Authors: Zhiyuan Yan1, Biwei Xie1, Xingquan Li2 and Yungang Bao1 1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Peng Cheng Laboratory, CN Abstract Sparse direct solvers provide vital functionality for a wide variety of scientific applications. The dominant part of the sparse direct solver, LU factorization, suffers heavily from the irregularity of sparse matrices. Meanwhile, the specific characteristics of sparse solvers in circuit simulation and the unique sparse pattern of circuit matrices provide more design space but also great challenges. In this paper, we propose a sparse solver named FLU and re-examine the performance of LU factorization from the perspectives of vectorization, parallelization, and data locality. To improve vectorization efficiency and data locality, FLU introduces a register-level supernode computation method by delicately manipulating data movement. With alternating multiple-column computation, FLU further reduces off-chip memory accesses greatly. Furthermore, we implement a fine-grained elimination-tree-based parallelization scheme to fully exploit task-level parallelism. Compared with PARDISO and NICSLU, experimental results show that FLU achieves a speedup of up to 19.51× (3.86× on average) and 2.56× (1.66× on average) on Intel Xeon, respectively. |
15:52 CET | 18.1.4 | DATA-AWARE CACHE MANAGEMENT FOR GRAPH ANALYTICS Speaker: Varun Venkitaraman, Indian Institute of Technology Bombay, IN Authors: Neelam Sharma1, Varun Venkitaraman1, Newton Singh2, Vikash Kumar2, Shubham Singhania2 and Chandan Kumar Jha2 1Indian Institute of Technology, Bombay, IN; 2IIT Bombay, IN Abstract Graph analytics is powering a wide variety of applications in the domains of cybersecurity, contact tracing, and social networking. It consists of various algorithms (or workloads) that investigate the relationships between entities involved in transactions, interactions, and organizations. CPU-based graph analytics is inefficient because their cache hierarchy performs poorly owing to highly irregular memory access patterns of graph workloads. Policies managing the cache hierarchy in such systems are ignorant to the locality demands of different data types within graph workloads, and therefore are suboptimal. In this paper, we conduct an in-depth data type aware characterization of graph workloads to better understand the cache utilization of various graph data types. We find that different levels of the cache hierarchy are more sensitive to the locality demands of certain graph data types than others. Hence, we propose GRACE, a graph data-aware cache management technique, to increase cache hierarchy utilization, thereby minimizing off-chip memory traffic and enhancing performance. Our thorough evaluations show that GRACE, when augmented with a vertex reordering algorithm, outperforms a recent cache management scheme by up to 1.4x, with up to 27% reduction in expensive off-chip memory accesses. Thus, our work demonstrates that awareness of different graph data types is critical for effective cache management in graph analytics. |
15:56 CET | 18.1.5 | AGAPE: ANOMALY DETECTION WITH GENERATIVE ADVERSARIAL NETWORK FOR IMPROVED PERFORMANCE, ENERGY, AND SECURITY IN MANYCORE SYSTEMS Speaker: Ke Wang, The George Washington University, US Authors: Ke Wang1, Hao Zheng2, Yuan Li1, Jiajun Li3 and Ahmed Louri1 1The George Washington University, US; 2University of Central Florida, US; 3Beihang University, CN Abstract The security of manycore systems has become increasingly critical. In system-on-chips (SoCs), Hardware Trojans (HTs) manipulate the functionalities of the routing components to saturate the on-chip network, degrade performance, and result in the leakage of sensitive data. Existing HT detection techniques, including runtime monitoring and state-of-the-art learning-based methods, are unable to timely and accurately identify the implanted HTs, due to the increasingly dynamic and complex nature of on-chip communication behaviors. We propose AGAPE, a novel Generative Adversarial Network (GAN)-based anomaly detection and mitigation method against HTs for secured on-chip communication. AGAPE learns the distribution of the multivariate time series of a number of NoC attributes captured by on-chip sensors under both HT-free and HT-infected working conditions. The proposed GAN can learn the potential latent interactions among different runtime attributes concurrently, accurately distinguish abnormal attacked situations from normal SoC behaviors, and identify the type and location of the implanted HTs. Using the detection results, we apply the most suitable protection techniques to each type of detected HTs instead of simply isolating the entire HT-infected router, with the aim to mitigate security threats as well as reducing performance loss. Simulation results show that AGAPE enhances the HT detection accuracy by 19%, reduces network latency and power consumption by 39% and 30%, respectively, as compared to state-of-the-art security designs. |
16:00 CET | 18.1.6 | Q&A SESSION Authors: Jeronimo Castrillon1 and Paula Herber2 1TU Dresden, DE; 2University of Münster, DE Abstract Questions and answers with the authors |
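To make the energy-proportionality argument of 18.1.1 above concrete, the sketch below shows an event-driven convolution whose work grows with the number of input events rather than with the frame size. It is only a conceptual illustration (function names and shapes are our own), not the SNE datapath.

```python
# Event-driven 2D convolution sketch: each event scatters a scaled copy of the
# kernel onto the output map, so the work is proportional to the event count.
import numpy as np

def event_conv2d(events, kernel, out_shape):
    """events: iterable of (row, col, polarity); kernel: odd-sized 2D array."""
    out = np.zeros(out_shape)
    kh, kw = kernel.shape
    rh, rw = kh // 2, kw // 2
    for r, c, p in events:
        for dr in range(-rh, rh + 1):
            for dc in range(-rw, rw + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < out_shape[0] and 0 <= cc < out_shape[1]:
                    out[rr, cc] += p * kernel[rh + dr, rw + dc]
    return out

# A single unit event reproduces the kernel around its location.
kernel = np.arange(9, dtype=float).reshape(3, 3)
out = event_conv2d([(2, 2, 1.0)], kernel, (5, 5))
assert np.allclose(out[1:4, 1:4], kernel)
```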
18.2 Memory-centric and neural network systems: architectures, tools, and profilers
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Mohamed M. Sabry Aly, Nanyang Technological University, SG
Session co-chair:
Huichu Liu, Meta, Inc., US
This session focuses on two domains: neural networks (NN) and processing-in-memory (PIM) systems. The first paper introduces a profiler to aid in the decision-making process of migrating tasks from CPUs to PIM. The second paper analyses the security and resilience of spiking neural network architectures. The third paper provides a framework for efficient design-space exploration of NN mapping to PIM fabrics. The fourth paper investigates circuit-level techniques to enhance nonlinear operations in SRAM-based NN kernels. The session also includes a tool for content-addressable memory and a hybrid in-memory computing architecture.
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 18.2.1 | (Best Paper Award Candidate) PIMPROF: AN AUTOMATED PROGRAM PROFILER FOR PROCESSING-IN-MEMORY OFFLOADING DECISIONS Speaker: Yizhou Wei, University of Virginia, US Authors: Yizhou Wei1, Minxuan Zhou2, Sihang Liu1, Korakit Seemakhupt1, Tajana S. Rosing2 and Samira Khan1 1University of Virginia, US; 2UCSD, US Abstract Processing-in-memory (PIM) architectures reduce the data movement overhead by bringing computation closer to the memory. However, a key challenge is to decide which code regions of a program should be offloaded to PIM for the best performance. The goal of this work is to help programmers leverage PIM architectures by automatically profiling legacy workloads to find PIM-friendly code regions for offloading. We propose PIMProf, an automated profiling and offloading tool to determine PIM offloading regions for CPU-PIM hybrid architectures. PIMProf efficiently models the comprehensive cost related to PIM offloading and makes the offloading decision using an effective and computationally tractable algorithm. We demonstrate the effectiveness of PIMProf by evaluating the GAP graph benchmark suite and the PARSEC benchmark suite under different PIM and CPU configurations. Our evaluation shows that, compared to the CPU baseline and a PIM-only configuration, the offloading decisions by PIMProf provide 5.33x and 1.39x speedups in the GAP graph workloads, respectively, and 2.22x and 1.74x speedups in the PARSEC benchmarks, respectively. |
15:44 CET | 18.2.2 | ANALYSIS OF POWER-ORIENTED FAULT INJECTION ATTACKS ON SPIKING NEURAL NETWORKS Speaker: Karthikeyan Nagarajan, Pennsylvania State University, US Authors: Karthikeyan Nagarajan1, Junde Li1, Sina Sayyah Ensan1, Mohammad Nasim Imtiaz Khan2, Sachhidh Kannan3 and Swaroop Ghosh1 1Pennsylvania State University, US; 2Intel Corporation, US; 3Ampere Computing LLC, US Abstract Spiking Neural Networks (SNN) are quickly gaining traction as a viable alternative to Deep Neural Networks (DNN). In comparison to DNNs, SNNs are more computationally powerful and provide superior energy efficiency. SNNs, while exciting at first appearance, contain security-sensitive assets (e.g., neuron threshold voltage) and vulnerabilities (e.g., sensitivity of classification accuracy to neuron threshold voltage change) that adversaries can exploit. We investigate global fault injection attacks by employing external power supplies and laser-induced local power glitches to corrupt crucial training parameters such as spike amplitude and neuron's membrane threshold potential on SNNs developed using common analog neurons. We also evaluate the impact of power-based attacks on individual SNN layers for 0% (i.e., no attack) to 100% (i.e., whole layer under attack). We investigate the impact of the attacks on digit classification tasks and find that in the worst-case scenario, classification accuracy is reduced by 85.65%. We also propose defenses e.g., a robust current driver design that is immune to power-oriented attacks, improved circuit sizing of neuron components to reduce/recover the adversarial accuracy degradation at the cost of negligible area and 25% power overhead. We also present a dummy neuron-based voltage fault injection detection system with 1% power and area overhead. |
15:48 CET | 18.2.3 | GIBBON: EFFICIENT CO-EXPLORATION OF NN MODEL AND PROCESSING-IN-MEMORY ARCHITECTURE Speaker: Hanbo Sun, Tsinghua University, CN Authors: Hanbo Sun, Chenyu Wang, Zhenhua Zhu, Xuefei Ning, Guohao Dai, Huazhong Yang and Yu Wang, Tsinghua University, CN Abstract The memristor-based Processing-In-Memory (PIM) architectures have shown great potential to boost the computing energy efficiency of Neural Networks (NNs). Existing work concentrates on hardware architecture design and algorithm-hardware co-optimization, but neglects the non-negligible impact of the correlation between NN models and PIM architectures. To ensure high accuracy and energy efficiency, it is important to co-design the NN model and PIM architecture. However, on the one hand, the co-exploration space of NN model and PIM architecture is extremely tremendous, making searching for the optimal results difficult. On the other hand, during the co-exploration process, PIM simulators pose a heavy computational burden and runtime overhead for evaluation. To address these problems, in this paper, we propose an efficient co-exploration framework for the NN model and PIM architecture, named Gibbon. In Gibbon, we propose an evolutionary search algorithm with adaptive parameter priority, which focuses on subspace of high priority parameters and alleviates the problem of vast co-design space. Besides, we design a Recurrent Neural Network (RNN) based predictor for accuracy and hardware performances. It substitutes for a large part of the PIM simulator workload and reduces the long simulation time. Experimental results show that the proposed co-exploration framework can find better NN models and PIM architectures than existing studies in only seven GPU hours (8.4∼41.3× speedup). At the same time, Gibbon can improve the accuracy of co-design results by 10.7% and reduce the energy-delay-product by 6.48× compared with existing work. |
15:52 CET | 18.2.4 | AID: ACCURACY IMPROVEMENT OF ANALOG DISCHARGE-BASED IN-SRAM MULTIPLICATION ACCELERATOR Speaker: Saeed Seyedfaraji, Vienna University of Technology (TU-Wien), AT Authors: Saeed Seyedfaraji1, Baset Mesgari2 and Semeen Rehman3 1Institute of Computer Technology, TU Wien (TU Wien), AT; 2Vienna University of Technology, AT; 3TU Wien, AT Abstract This paper presents a novel technique to improve the accuracy of an energy-efficient in-memory multiplier using a standard 6T-SRAM. The state-of-the-art discharge-based in-SRAM multiplication accelerators suffer from a non-linear behavior in their bit-line (BL, BLB) due to the quadratic nature of the access transistor that leads to a poor signal-to-noise ratio (SNR). In order to achieve linearity in the BLB voltage, we propose a novel root function voltage technique on the access transistor's gate that results in an average SNR improvement of 10.77 dB compared to state-of-the-art discharge-based topologies. Our analytical methods and a circuit simulation in a 65 nm CMOS technology verify that the proposed technique consumes 0.523 pJ per computation (multiplication, accumulation, and preset) from a power supply of 1V, which is 51.18% lower compared to other state-of-the-art techniques. We have performed an extensive Monte Carlo based simulation for a 4x4 multiplication operation, and our novel technique presents less than 0.086 standard deviations for the worst-case incorrect output scenario. |
15:56 CET | 18.2.5 | Q&A SESSION Authors: Mohamed M. Sabry Aly1 and Huichu Liu2 1Nanyang Technological University, SG; 2Facebook Inc., US Abstract Questions and answers with the authors |
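The sketch below illustrates, in the simplest possible terms, the kind of offloading decision that a profiler such as PIMProf (18.2.1 above) automates: a region is worth offloading when the data movement it saves outweighs the slower in-memory compute. The cost model and every constant here are hypothetical placeholders, not PIMProf's actual model.

```python
# Back-of-the-envelope offloading decision for one code region
# (hypothetical cost model and constants, for illustration only).

def offload_to_pim(cpu_cycles, bytes_moved, pim_slowdown=4.0,
                   cpu_bytes_per_cycle=8.0, offload_overhead=10_000):
    """Return True if the estimated PIM cost beats the CPU cost for this region."""
    cpu_cost = cpu_cycles + bytes_moved / cpu_bytes_per_cycle   # compute + data movement
    pim_cost = cpu_cycles * pim_slowdown + offload_overhead     # slower cores, no movement
    return pim_cost < cpu_cost

# A memory-bound region favours PIM; a compute-bound one does not.
print(offload_to_pim(cpu_cycles=1_000_000, bytes_moved=512_000_000))  # True
print(offload_to_pim(cpu_cycles=1_000_000, bytes_moved=1_000_000))    # False
```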
18.3 Persistent Memory
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Joseph Friedman, UT Dallas, US
Session co-chair:
Chengmo Yang, University of Delaware, US
The non-volatility of emerging switching devices creates the opportunity for persistent memory that stores data without requiring a continuous supply of energy. This session therefore explores the implications of non-volatility and persistent memory for memory and cache architecture design. In particular, the presentations focus on locality, bandwidth, granularity, wearing, and application awareness.
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 18.3.1 | CHARACTERIZING AND OPTIMIZING HYBRID DRAM-PM MAIN MEMORY SYSTEM WITH APPLICATION AWARENESS Speaker: Yongfeng Wang, Sun Yat-Sen University, CN Authors: Yongfeng Wang, Yinjin Fu, Yubo Liu, Zhiguang Chen and Nong Xiao, Sun Yat-Sen University, CN Abstract Persistent memory (PM) is typically used in combination with DRAM to configure hybrid main memory systems that can obtain both the high performance of DRAM and the large capacity of PM. There are critical management challenges in data placement, memory concurrency and workload scheduling for the concurrent execution of multiple application workloads. But the non-negligible performance gap between DRAM and PM makes the existing application-agnostic management strategies inefficient in reaching the full potential of hybrid memory. In this paper, we propose a series of application-aware optimization strategies, including application-aware data placement, adaptive thread allocation and inter-application interference avoidance, to improve the concurrent performance of different application workloads on hybrid memory. Finally, we provide the performance evaluation for our application-aware solutions on real hybrid memory hardware with some comprehensive benchmark suites. Our experimental results show that the duration of multi-application concurrent execution on hybrid memory can be reduced by up to 60.7% for application-aware data placement, 37.7% for adaptive thread allocation and 34.8% for workload scheduling with inter-application interference avoidance, respectively. The additive effects of all three optimization methods can reach 62.8% performance improvement with negligible overheads. |
15:44 CET | 18.3.2 | PATS: TAMING BANDWIDTH CONTENTION BETWEEN PERSISTENT AND DYNAMIC MEMORIES Speaker: Shucheng Wang, Huazhong University of Science and Technology, CN Authors: Shu Cheng Wang1, Qiang Cao1, Hong Jiang2 and Yuanyuan Dong3 1Huazhong University of Science and Technology, CN; 2University of Texas at Arlington, US; 3Alibaba Group, CN Abstract Emerging persistent memory (PM) with fast persistence and byte-addressability physically shares the memory channel with DRAM-based main memory. We experimentally uncover that the throughput of applications accessing DRAM collapses when multiple threads access PM, due to head-of-line blockage in the memory controller within the CPU. To address this problem, we design a PM-Accessing Thread Scheduling (PATS) mechanism that is guided by a contention model, to adaptively tune the maximum number of contention-free concurrent PM-threads. Experimental results show that even with 14 concurrent threads accessing PM, PATS allows only up to an 8% decrease in the DRAM throughput of the front-end applications (e.g., Memcached), while gaining a 1.5x PM-throughput speedup over the default configuration. |
15:48 CET | 18.3.3 | UNIFYING TEMPORAL AND SPATIAL LOCALITY FOR CACHE MANAGEMENT INSIDE SSDS Speaker: Jianwei Liao, Southwest University of China, CN Authors: Zhibing Sha1, Zhigang Cai1, Dong Yin2, Jianwei Liao1 and Francois Trahay3 1Southwest University of China, CN; 2Huaihua University, CN; 3Telecom Sudparis, FR Abstract To ensure better I/O performance of solid-state drives (SSDs), a dynamic random access memory (DRAM) is commonly equipped as a cache to absorb overwrites or writes and then avoid flushing them onto underlying SSD cells. This paper focuses on the management of the small cache inside SSDs. First, we propose to unify the factors of temporal and spatial locality using the visibility graph technique when running user applications, for directing cache management. Next, we propose to support batch adjustment of adjacent or nearby (hot) cached data pages by referring to the connection situations in the visibility graph of all cached pages. Finally, we propose to evict the buffered data pages in batches, to maximize the internal flushing parallelism of SSD devices, without worsening I/O congestion. The trace-driven simulation experiments show that our proposal can improve cache hits by more than 2.0% and the overall I/O latency by 19.3% on average, in contrast to conventional cache schemes inside SSDs. |
15:52 CET | 18.3.4 | DWR: DIFFERENTIAL WEARING FOR READ PERFORMANCE OPTIMIZATION ON HIGH-DENSITY NAND FLASH MEMORY Speaker: Liang Shi, School of Computer Science and Technology, East China Normal University, CN Authors: Yunpeng Song1, Qiao Li2, Yina Lv1, Changlong Li1 and Liang Shi1 1School of Computer Science and Technology, East China Normal University, CN; 2City University of Hong Kong, HK Abstract With the cost reduction and density optimization, the read performance and lifetime of high-density NAND flash memory have been significantly degraded during the last decade. Previous works proposed to optimize lifetime with wear leveling and optimize read performance with reliability improvement. However, with wearing, the reliability and read performance will be degraded along with the life of the device. To solve this problem, a differential wearing scheme (DWR) is proposed to optimize the read performance. The basic idea of DWR is to partition the flash memory into two areas and wear them at different speeds. For the area with low wearing speed, read operations are scheduled for read performance optimization. For the area with high wearing speed, write operations are scheduled but designed to avoid generating bad blocks early. Through careful design and real workloads evaluation on 3D TLC NAND flash, DWR achieves encouraging read performance optimization with negligible impacts to the lifetime. |
15:56 CET | 18.3.5 | GATLB: A GRANULARITY-AWARE TLB TO SUPPORT MULTI-GRANULARITY PAGES IN HYBRID MEMORY SYSTEM Speaker: Yujie Xie, Chongqing University, CN Authors: Yujuan Tan1, Yujie Xie1, Zhulin Ma1, Zhichao Yan2, Zhichao Zhang1, Duo Liu1 and Xianzhang Chen1 1Chongqing University, CN; 2Hewlett Packard Enterprise, US Abstract The parallel hybrid memory system that combines Non-volatile Memory (NVM) and DRAM can effectively expand the memory capacity. However, it puts significant pressure on the TLB due to its limited capacity. The superpage technology that manages pages with a large granularity (e.g., 2MB) is usually used to improve TLB performance. However, its coarse granularity conflicts with the fine-grained page migration in the hybrid memory system, resulting in serious invalid migration and page fragmentation problems. To solve these problems, we propose to maintain the coexistence of multi-granularity pages, and design a smart TLB called GATLB to support multi-granularity page management, coalesce consecutive pages and adapt to various changes in page size. Compared with the existing TLB technologies, GATLB can not only perceive page granularity to effectively expand the TLB coverage and reduce the miss rate, but also provide faster address translation with a much lower overhead. Our experimental evaluations show that GATLB can expand the TLB coverage by 7.09x, reduce the TLB miss rate by 91.1%, and shorten the address translation cycle by 49.41%. |
16:00 CET | 18.3.6 | Q&A SESSION Authors: Joseph Friedman1 and Chengmo Yang2 1University of Texas at Dallas, US; 2University of Delaware, US Abstract Questions and answers with the authors |
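The visibility-graph construction used in 18.3.3 above to unify temporal and spatial locality can be sketched in a few lines: each cached page's access count becomes a point in a series, and two pages are connected when the straight line between their points is not blocked by any point in between. The page-hotness values below are made up for illustration, and the in-SSD cache policy built on top of the graph is not reproduced.

```python
# Natural visibility graph over a series of per-page access counts (toy sketch).

def visibility_edges(series):
    """Connect i and j when no intermediate sample blocks the line between them."""
    edges = []
    n = len(series)
    for i in range(n):
        for j in range(i + 1, n):
            visible = all(
                series[k] < series[i] + (series[j] - series[i]) * (k - i) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.append((i, j))
    return edges

access_counts = [3, 1, 4, 1, 5, 2, 6]      # hypothetical hotness of 7 cached pages
print(visibility_edges(access_counts))     # well-connected pages are "hot" candidates
```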
18.4 Energy Efficient Platforms: from Autonomous Vehicles to Intermittent Computing
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Domenico Balsamo, Newcastle University, GB
Session co-chair:
Bart Vermeulen, NXP Semiconductors, NL
This session focuses on energy-efficient platforms and presents four papers. The first paper presents an efficient accelerator that enables real-time probabilistic 3D mapping at the edge for autonomous machines. Staying with autonomous vehicles, the second paper presents an FPGA solution for efficient real-time localization. Moving to tinier devices, the last two papers focus on energy-harvesting IoT devices that operate intermittently without batteries: the first presents a deep learning approach, while the second closes the session with an FPGA-based emulation of non-volatile digital logic for intermittent computing.
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 18.4.1 | OMU: A PROBABILISTIC 3D OCCUPANCY MAPPING ACCELERATOR FOR REAL-TIME OCTOMAP AT THE EDGE Speaker: Tianyu Jia, Peking University, CN Authors: Tianyu Jia1, En-Yu Yang2, Yu-Shun Hsiao2, Jonathan Cruz2, David Brooks2, Gu-Yeon Wei2 and Vijay Janapa Reddi2 1Peking University, CN; 2Harvard University, US Abstract Autonomous machines (e.g., vehicles, mobile robots, drones) require sophisticated 3D mapping to perceive the dynamic environment. However, maintaining a real-time 3D map is expensive both in terms of compute and memory requirements, especially for resource-constrained edge machines. Probabilistic OctoMap is a reliable and memory-efficient 3D dense map model to represent the full environment, with dynamic voxel node pruning and expansion capacity. This paper presents the first efficient accelerator solution, i.e. OMU, to enable real-time probabilistic 3D mapping at the edge. To improve the performance, the input map voxels are updated via parallel PE units for data parallelism. Within each PE, the voxels are stored using a specially developed data structure in parallel memory banks. In addition, a pruning address manager is designed within each PE unit to reuse the pruned memory addresses. The proposed 3D mapping accelerator is implemented and evaluated using a commercial 12 nm technology. Compared to the ARM Cortex-A57 CPU in the Nvidia Jetson TX2 platform, the proposed accelerator achieves up to 62× performance and 708× energy efficiency improvement. Furthermore, the accelerator provides 63 FPS throughput, more than 2× higher than a real-time requirement, enabling real-time perception for 3D mapping. |
15:44 CET | 18.4.2 | AN FPGA OVERLAY FOR EFFICIENT REAL-TIME LOCALIZATION IN 1/10TH SCALE AUTONOMOUS VEHICLES Speaker: Paolo Burgio, University of Modena and Reggio Emilia, IT Authors: Andrea Bernardi1, Gianluca Brilli2, Alessandro Capotondi3, Andrea Marongiu4 and Paolo Burgio1 1University of Modena and Reggio Emilia, IT; 2Unimore, IT; 3Università di Modena e Reggio Emilia, IT; 4Università di Modena e Reggio Emilia, IT Abstract Heterogeneous systems-on-chip (HeSoC) based on reconfigurable accelerators, such as Field-Programmable Gate Arrays (FPGA), represent an appealing option to deliver the performance/Watt required by the advanced perception and localization tasks employed in the design of Autonomous Vehicles. Different from software-programmed GPUs, FPGA development involves significant hardware design effort, which in the context of HeSoCs is further complicated by the system-level integration of HW and SW blocks. High-Level Synthesis is increasingly being adopted to ease hardware IP design, allowing engineers to quickly prototype their solutions. However, automated tools still lack the required maturity to efficiently build the complex hardware/software interaction between the host CPU and the FPGA accelerator(s). In this paper we present a fully integrated system design where a particle filter for LiDAR-based localization is efficiently deployed as FPGA logic, while the rest of the compute pipeline executes on programmable cores. This design constitutes the heart of a fully-functional 1/10th-scale racing autonomous car. In our design, accelerated IPs are controlled locally to the FPGA via a proxy core. Communication between the two and with the host CPU happens via shared memory banks also implemented as FPGA IPs. This allows for a scalable and easy-to-deploy solution both from the hardware and software viewpoint, while providing better performance and energy efficiency compared to state-of-the-art solutions. |
15:48 CET | 18.4.3 | ENABLING FAST DEEP LEARNING ON TINY ENERGY-HARVESTING IOT DEVICES Speaker: Sahidul Islam, University of Texas at San Antonio, US Authors: Sahidul Islam1, Jieren Deng2, Shanglin Zhou2, Chen Pan3, Caiwen Ding2 and Mimi Xie1 1University of Texas at San Antonio, US; 2University of Connecticut, US; 3Texas A&M University-Corpus Christi, US Abstract Energy harvesting (EH) IoT devices that operate intermittently without batteries, coupled with advances in deep neural networks (DNNs), have opened up new opportunities for enabling sustainable smart applications. Nevertheless, implementing those computation- and memory-intensive intelligent algorithms on EH devices is extremely difficult due to the challenges of limited resources and an intermittent power supply that causes frequent failures. To address those challenges, this paper proposes a methodology that enables fast deep learning with low-energy accelerators for tiny energy harvesting devices. We first propose RAD, a resource-aware structured DNN training framework, which employs block-circulant matrices and structured pruning to achieve high compression for leveraging the advantage of various vector operation accelerators. A DNN implementation method, ACE, is then proposed that employs low-energy accelerators to achieve maximum performance with small energy consumption. Finally, we further design FLEX, the system support for intermittent computation in energy harvesting situations. Experimental results from three different DNN models demonstrate that RAD, ACE, and FLEX can enable fast and correct inference on energy harvesting devices with up to 4.26X runtime reduction and up to 7.7X energy reduction, with higher accuracy than the state-of-the-art. |
15:52 CET | 18.4.4 | EMULATION OF NON-VOLATILE DIGITAL LOGIC FOR BATTERYLESS INTERMITTENT COMPUTING Speaker: Simone Ruffini, University of Trento, IT Authors: Simone Ruffini, Kasim Sinan Yildirim and Davide Brunelli, University of Trento, IT Abstract Recent engineering efforts have given rise to devices that operate only by harvesting power from ambient energy sources, such as radiofrequency and solar energy. Due to the sporadic ambient energy sources, frequent power failures are inevitable for these devices that rely only on energy harvesting. These devices lose the values maintained in volatile hardware state elements upon a power failure. This situation leads to intermittent execution, which prevents the forward progress of computing operations. To counter power failures, these devices require non-volatile memory elements, e.g., FRAM, to store the computational state. However, hardware designers can only represent volatile state elements using FPGAs in the market and current hardware description languages. As of now, there is no existing solution for fast-prototyping non-volatile digital logic. This paper enables FPGA-based emulation of any custom non-volatile digital logic for intermittent computing. Therefore, our proposal can be a standard part of the current FPGA libraries provided by the vendors to design and validate future non-volatile logic designs targeting intermittent computing. |
15:56 CET | 18.4.5 | Q&A SESSION Authors: Domenico Balsamo1 and Bart Vermeulen2 1Newcastle University, GB; 2NXP Semiconductors, NL Abstract Questions and answers with the authors |
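Block-circulant weight compression, the structured-compression ingredient of RAD in 18.4.3 above, rests on a simple identity: a circulant block is fully described by one vector, and its matrix-vector product is a circular convolution computable with FFTs. The sketch below only verifies that identity; the training, pruning and accelerator mapping described in the paper are not reproduced.

```python
# Circulant matrix-vector product via FFT, the building block of block-circulant
# weight compression (illustrative sketch only).
import numpy as np

def circulant(first_col):
    """Dense circulant matrix whose j-th column is first_col rolled down by j."""
    n = len(first_col)
    return np.stack([np.roll(first_col, j) for j in range(n)], axis=1)

def circulant_matvec(first_col, x):
    """y = C @ x computed as a circular convolution in O(n log n)."""
    return np.real(np.fft.ifft(np.fft.fft(first_col) * np.fft.fft(x)))

rng = np.random.default_rng(0)
c, x = rng.standard_normal(8), rng.standard_normal(8)
assert np.allclose(circulant(c) @ x, circulant_matvec(c, x))
# A k x k weight block is stored as a single length-k vector instead of k*k entries.
```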
18.5 Circuit Optimization and Analysis: No Time to Lose
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Eleonora Testa, Synopsys Inc., CH
Session co-chair:
Ibrahim Elfadel, Khalifa University, AE
This session presents papers that focus on timing optimization in both logic and physical synthesis. Also, it presents a new approach to flip-chip routing. The first paper proposes an algorithm that can fix minimum implant area violations in a timing-aware fashion without displacing cells, fixing the violations only by applying cell swapping. The second paper proposes a momentum-based timing-driven global placement algorithm, and the third paper describes an efficient way to parallelize dynamic timing computation on the critical path. The last paper shows a substrate routing algorithm using a novel ring routing model that handles symmetry and shielding constraints.
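As a purely illustrative aside on the momentum-based net weighting mentioned above, the update can be pictured as an exponential moving average of each net's timing criticality. The update rule and constants below are our own illustrative choices and may differ from the scheme actually used in DREAMPlace 4.0.

```python
# Momentum-style net weighting sketch for timing-driven placement (illustrative only).

def update_net_weights(weights, criticality, beta=0.9, boost=1.0):
    """Blend each net's previous weight with its current timing criticality."""
    return {net: beta * w + (1.0 - beta) * (1.0 + boost * criticality.get(net, 0.0))
            for net, w in weights.items()}

weights = {"net_a": 1.0, "net_b": 1.0}
criticality = {"net_a": 0.8}          # e.g. a normalized negative-slack value in [0, 1]
for _ in range(5):
    weights = update_net_weights(weights, criticality)
print(weights)   # net_a's weight drifts towards 1.8, net_b's stays at 1.0
```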
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 18.5.1 | A SYSTEMATIC REMOVAL OF MINIMUM IMPLANT AREA VIOLATIONS UNDER TIMING CONSTRAINT Speaker: Eunsol Jeong, Seoul National University, KR Authors: Eunsol Jeong, Heechun Park and Taewhan Kim, Seoul National University, KR Abstract Fixing minimum implant area (MIA) violations in post-route layout is an essential and inevitable task for the high-performance designs employing multiple threshold voltages. Unlike the conventional approaches, which have tried to locally move cells or reassign Vt (threshold voltage) of some cells in a way to resolve the MIA violations with little or no consideration of timing constraint, our proposed approach fully and systematically controls the timing budget during the removal of MIA violations. Precisely, our solution consists of three sequential steps: (1) performing critical path aware cell selection for Vt reassignment to fix the intra-row MIA violations while considering timing constraint and minimal power increments; (2) performing a theoretically optimal Vt reassignment to fix the inter-row MIA violations while satisfying both of the intra-row MIA and timing constraints; (3) refining Vt reassignment to further reduce the power consumption while meeting intra- and inter-row MIA constraints as well as timing constraint. Experiments through benchmark circuits show that our proposed approach is able to completely resolve MIA violations while ensuring no timing violation and achieving much less power increments over that by the conventional approaches. |
15:44 CET | 18.5.2 | DREAMPLACE 4.0: TIMING-DRIVEN GLOBAL PLACEMENT WITH MOMENTUM-BASED NET WEIGHTING Speaker: Peiyu Liao, The Chinese University of Hong Kong, HK Authors: Peiyu Liao1, Siting Liu1, Zhitang Chen2, Wenlong Lv3, Yibo Lin4 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2Huawei Noah's Ark Lab, HK; 3Huawei Noah's Ark Lab, CN; 4Peking University, CN Abstract Timing optimization is critical to integrated circuit (IC) design closure. Existing global placement algorithms mostly focus on wirelength optimization without considering timing. In this paper, we propose a timing-driven global placement algorithm leveraging a momentum-based net weighting strategy. In addition, we improve the preconditioner to incorporate our net weighting scheme. Experimental results on ICCAD 2015 contest benchmarks demonstrate that our algorithm can significantly improve total negative slack (TNS) and meanwhile be beneficial to worst negative slack (WNS). |
15:48 CET | 18.5.3 | EVENTTIMER: FAST AND ACCURATE EVENT-BASED DYNAMIC TIMING ANALYSIS Speaker: Zuodong Zhang, Institute of Microelectronics, Peking University, CN Authors: Zuodong Zhang, Zizheng Guo, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN Abstract As the transistor shrinks to nanoscale, the overhead of ensuring circuit functionality becomes extremely large due to the increasing timing variations. Thus, better-than-worst-case design (BTWC) has attracted more and more attention. Many of these techniques utilize dynamic timing slack (DTS) and activity information for design optimization and runtime tuning. Existing DTS computation methods are essentially a modification to the worst-case delay information, which cannot guarantee exact DTS and activity simulation, causing performance degradation in timing optimization. Therefore, in this paper, we propose EventTimer, a dynamic timing analysis engine based on event propagation to accurately compute DTS and activity information. We evaluate its accuracy and efficiency on different benchmark circuits. The experimental results show that EventTimer can achieve exact DTS computation with high efficiency. They also show that EventTimer scales well with the circuit size and the number of CPU threads, which makes it possible to use it in application-level analysis. |
15:52 CET | 18.5.4 | PRACTICAL SUBSTRATE DESIGN CONSIDERING SYMMETRICAL AND SHIELDING ROUTES Speaker: Hung-Ming Chen, National Yang Ming Chiao Tung University, TW Authors: Hao-Yu Chi1, Yi-Hung Chen2, Hung-Ming Chen1, Chien-Nan Liu3, Yun-Chih Kuo2, Ya-Hsin Chang2 and Kuan-Hsien Ho2 1National Yang Ming Chiao Tung University, TW; 2Mediatek Inc. Taiwan, TW; 3National Yang Ming Chiao Tung University, TW Abstract In modern package design, the flip-chip package has become mainstream because of the benefit of its high I/O pin count. However, the package design is still done manually in the industry. The lack of automation tools makes the package design cycle longer due to complex routing constraints and frequent modification requests. In this work, we propose yet another routing framework for substrate routing. Compared with previous works, our routing algorithm generates a feasible routing solution in a few seconds for industrial designs and considers important symmetry and shielding constraints that have not been handled before. Benefiting from the efficiency of our routing algorithm, the designer can get the result immediately and accommodate some modifications to reduce the cost. The experimental results show that the routing result generated by our router is of good quality, very close to the manual design. |
15:56 CET | 18.5.5 | Q&A SESSION Authors: Eleonora Testa1 and Ibrahim (Abe) Elfadel2 1Synopsys Inc., CH; 2Khalifa University, AE Abstract Questions and answers with the authors |
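As background for presentation 18.5.2, a momentum-based net weighting scheme can be pictured as an exponential moving average that gradually boosts the weight of timing-critical nets between placement iterations. The minimal sketch below illustrates only that idea; the criticality measure, parameters, and update rule are assumptions for illustration and are not taken from the DREAMPlace 4.0 implementation.

```python
import numpy as np

def update_net_weights(weights, net_slacks, momentum=0.9, boost=1.0):
    """Illustrative momentum-style net-weight update for timing-driven placement.

    weights    : current per-net weights (1-D array)
    net_slacks : worst slack observed on each net this iteration (negative = critical)
    The criticality measure and the update rule are illustrative assumptions only.
    """
    # Map slack to a criticality score in [0, 1]; nets with negative slack are critical.
    criticality = np.clip(-net_slacks / (np.abs(net_slacks).max() + 1e-12), 0.0, 1.0)
    # Momentum-style (exponential moving average) accumulation of criticality into weights.
    target = 1.0 + boost * criticality          # desired emphasis for critical nets
    return momentum * weights + (1.0 - momentum) * target

# Toy usage: three nets, the second one violates timing.
w = np.ones(3)
slacks = np.array([0.2, -0.5, 0.05])
for _ in range(5):
    w = update_net_weights(w, slacks)
print(w)  # the critical net's weight grows smoothly across iterations
```

In a placer, such weights would scale each net's contribution to the wirelength objective, so critical nets are pulled tighter in subsequent iterations.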
18.6 Multi-Partner Projects – Session 2
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Ernesto Sanchez, Politecnico di Torino, IT
Session co-chair:
Maksim Jenihhin, Tallinn UT, EE
The session is dedicated to multi-partner innovative and high-tech research projects addressing the DATE 2022 topics. The types of collaboration covered include projects funded by EU schemes (H2020, ESA, EIC, MSCA, COST, etc.), nationally and regionally funded projects, and collaborative research projects funded by industry. Depending on the stage of the project, the papers present the novelty of the project concepts, the relevance of the technical objectives to the DATE community, technical highlights of the project results, and insights into the lessons learnt or the issues that remain open until the end of the project. In particular, this session focuses on projects tackling the challenges of artificial intelligence and deep learning and the integration of hardware and software layers, and it also presents a cross-sectoral collaboration for a graduate school project.
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 18.6.1 | NEUROTEC I: NEURO-INSPIRED ARTIFICIAL INTELLIGENCE TECHNOLOGIES FOR THE ELECTRONICS OF THE FUTURE Speaker: Christopher Bengel, Institute of Materials in Electrical Engineering II, RWTH Aachen University, DE Authors: Melvin Galicia1, Stephan Menzel2, Farhad Merchant1, Maximilian Müller3, Hsin-Yu Chen2, Qing-Tai Zhao4, Felix Cüppers2, Abdur R. Jalil4, Qi Shu5, Peter Schüffelgen4, Gregor Mussler4, Carsten Funck6, Christian Lanius7, Stefan Wiefels2, Moritz von Witzleben6, Christopher Bengel6, Nils Kopperberg6, Tobias Ziegler6, Rana Ahmad2, Alexander Krüger2, Leticia Pöhls7, Regina Dittmann2, Susanne Hoffmann-Eifert2, Vikas Rana2, Detlev Grützmacher4, Matthias Wuttig3, Dirk Wouters6, Andrei Vescan8, Tobias Gemmeke7, Joachim Knoch9, Max Lemme10, Rainer Leupers1 and Rainer Waser6 1Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE; 2Peter-Grünberg Institut-7, Forschungszentrum Jülich GmbH, DE; 3Institute of Physics, Physics of Novel Materials, RWTH Aachen University,, DE; 4Peter-Grünberg Institut-9, Forschungszentrum Jülich GmbH, DE; 5Peter-Grünberg Institut-10, Forschungszentrum Jülich GmbH, DE; 6Institute of Materials in Electrical Engineering II, RWTH Aachen University, DE; 7Institute of Integrated Digital Systems and Circuit Design, RWTH Aachen University, DE; 8Compound Semiconductor Technology, RWTH Aachen University, DE; 9Institute of Semiconductor Electronics, RWTH Aachen University, DE; 10Chair of Electronic Devices, RWTH Aachen University, DE Abstract The field of neuromorphic computing is approaching an era of rapid adoption driven by the urgent need of a substitute for the von Neumann computing architecture. NEUROTEC I: "Neuro-inspired Artificial Intelligence Technologies for the Electronics of the Future" project is an initiative sponsored by the German Federal Ministry of Education and Research (BMBF for its initials in German), that aims to effectively advance the foundations for the utilization and exploitation of neuromorphic computing. NEUROTEC I stands at its successful "final stage" driven by the collaboration from more than 8 institutes from the Jülich Research Center and the RWTH Aachen University, as well as collaboration from several high-tech industry partners. The NEUROTEC I project considers the field interplay among materials, circuits, design and simulation tools. This paper provides an overview of the project’s overall structure and discusses the scientific achievements of its individual activities. |
15:44 CET | 18.6.2 | VEDLIOT: VERY EFFICIENT DEEP LEARNING IN IOT Speaker: Jens Hagemeyer, Bielefeld University, DE Authors: Martin Kaiser1, Rene Griessl1, Nils Kucza1, Carola Haumann1, Lennart Tigges1, Kevin Mika1, Jens Hagemeyer1, Florian Porrmann1, Ulrich Rückert1, Micha vor dem Berge2, Stefan Krupop3, Mario Porrmann4, Marco Tassemeier4, Pedro Trancoso5, Fareed Qararyah5, Stavroula Zouzoula5, Antonio Casimiro6, Alysson Bessani6, José Cecilio6, Stefan Andersson7, Oliver Brunnegard7, Olof Eriksson7, Roland Weiss8, Franz Meierhöfer8, Hans Salomonsson9, Elaheh Malekzadeh9, Daniel Ödman9, Anum Khurshid10, Pascal Felber11, Marcelo Pasin11, Valerio Schiavoni11, James Menetrey11, Karol Gugula12, Piotr Zierhoffer12, Eric Knauss13 and Hans-Martin Heyn13 1Bielefeld University, DE; 2Christmann Informationstechnik, DE; 3christmann informationstechnik, DE; 4Osnabrück University, DE; 5Chalmers University of Technology, SE; 6University of Lisbon, PT; 7VEONEER Inc., SE; 8Siemens AG, DE; 9EMBEDL AB, SE; 10Research Institutes of Sweden AB (RISE), SE; 11University of Neuchatel, CH; 12Antmicro, PL; 13Göteborg University, SE Abstract The VEDLIoT project targets the development of energy-efficient Deep Learning for distributed AIoT applications. A holistic approach is used to optimize algorithms while also dealing with safety and security challenges. The approach is based on a modular and scalable cognitive IoT hardware platform. Using modular microserver technology enables the user to configure the hardware to satisfy a wide range of applications. VEDLIoT offers a complete design flow for Next-Generation IoT devices required for collaboratively solving complex Deep Learning applications across distributed systems. The methods are tested on various use-cases ranging from Smart Home to Automotive and Industrial IoT appliances. VEDLIoT is an H2020 EU project which started in November 2020. It is currently in an intermediate stage with the first results available. |
15:48 CET | 18.6.3 | INTELLIGENT METHODS FOR TEST AND RELIABILITY Speaker: Hussam Amrouch, University of Stuttgart, DE Authors: Hussam Amrouch1, Jens Anders1, Steffen Becker1, Maik Betka1, Gerd Bleher2, Peter Domanski1, Nourhan Elhamawy1, Thomas Ertl1, Athanasios Gatzastras1, Paul R. Genssler1, Sebastian Hasler1, Martin Heinrich2, Andre van Hoorn1, Hanieh Jafarzadeh1, Ingmar Kallfass1, Florian Klemme1, Steffen Koch1, Ralf Küsters1, Andrés Lalama1, Raphael Latty2, Yiwen Liao1, Natalia Lylina1, Zahra Paria Najafi-Haghi1, Dirk Pflüger1, Ilia Polian1, Jochen Rivoir2, Matthias Sauer2, Denis Schwachhofer1, Steffen Templin2, Christian Volmer2, Stefan Wagner1, Daniel Weiskopf1, Hans-Joachim Wunderlich1, Bin Yang1 and Martin Zimmermann2 1University of Stuttgart, DE; 2Advantest Corporation, DE Abstract Test methods that can keep up with the ongoing increase in complexity of semiconductor products and their underlying technologies are an essential prerequisite for maintaining quality and safety of our daily lives and for continued success of our economies and societies. There is a huge potential how test methods can benefit from recent breakthroughs in domains such as artificial intelligence, data analytics, virtual/augmented reality, and security. The Graduate School on “Intelligent Methods for Semiconductor Test and Reliability” (GS-IMTR) at the University of Stuttgart is a large-scale, radically interdisciplinary effort to address the scientific-technological challenges in this domain. It is funded by Advantest, one of the world leaders in automatic test equipment. In this paper, we describe the overall philosophy of the Graduate School and the specific scientific questions targeted by its ten projects. |
15:52 CET | 18.6.4 | EVOLVE: TOWARDS CONVERGING BIG-DATA, HIGH-PERFORMANCE AND CLOUD-COMPUTING WORLDS Speaker: Achilleas Tzenetopoulos, National TU Athens, GR Authors: Achilleas Tzenetopoulos1, Dimosthenis Masouros1, Konstantina Koliogeorgi1, Sotirios Xydis2, Dimitrios Soudris1, Antony Chazapis3, Christos Kozanitis3, Angelos Bilas4, Christian Pinto5, Huy-Nam Nguyen6, Stelios Louloudakis7, Georgios Gardikis8, George Vamvakas8, Michelle Aubrun9, Christy Symeonidou10, Vassilis Spitadakis10, Konstantinos Xylogiannopoulos11, Bernhard Peischl11, Tahir Kalayci12, Alexander Stocker12 and Jean-Thomas Acquaviva13 1National TU Athens, GR; 2Harokopio University of Athens, GR; 3Institute of Computer Science, FORTH, GR; 4FORTH and University of Crete, GR; 5IBM Research, IE; 6Atos/BULL, FR; 7Sunlight.io, GR; 8Space Hellas S.A., GR; 9Thales Alenia Space, FR; 10Neurocom, LU; 11AVL List GmbH, AT; 12Virtual Vehicle Research GmbH, AT; 13DataDirect Networks, FR Abstract EVOLVE is a pan-European Innovation Action that aims to fully-integrate High-Performance-Computing (HPC) hardware with state-of-the-art software technologies under a unique testbed, that enables the convergence of HPC, Cloud, and Big-Data worlds and increases our ability to extract value from massive and demanding datasets. EVOLVE's advanced compute platform combines HPC-enabled capabilities, with transparent deployment in high abstraction level, and a versatile Big-Data processing stack for end-to-end workflows. Hence, domain experts have the potential to improve substantially the efficiency of existing services or introduce new models in the respective domains, e.g., automotive services, bus transportation, maritime surveillance, and others. In this paper, we describe EVOLVE's testbed and evaluate the performance of the integrated pilots from different domains. |
15:56 CET | 18.6.5 | SDK4ED: ONE-CLICK PLATFORM FOR ENERGY-AWARE, MAINTAINABLE AND DEPENDABLE APPLICATIONS Speaker: Charalampos Marantos, National TU Athens, GR Authors: Charalampos Marantos1, Miltiadis Siavvas2, Dimitrios Tsoukalas2, Christos Lamprakos3, Lazaros Papadopoulos1, Paweł Boryszko4, Katarzyna Filus4, Joanna Domańska4, Apostolos Ampatzoglou5, Alexander Chatzigeorgiou6, Erol Gelenbe4, Dionysios Kehagias2 and Dimitrios Soudris1 1National TU Athens, GR; 2Centre for Research and Technology Hellas, Thessaloniki, GR; 3School of ECE, National TU Athens, GR; 4Institute of Theoretical & Applied Computer Science, IITIS-PAN, Gliwice, PL; 5University of Macedonia, GR; 6Department of Applied Informatics, University of Macedonia, GR Abstract Developing modern secure and low-energy applications in a short time imposes new challenges and creates the need of designing new software tools to assist developers in all phases of application development. The design of such tools cannot be considered a trivial task, as they should be able to provide optimization of multiple quality requirements. In this paper, we introduce the SDK4ED platform, which incorporates advanced methods and tools for measuring and optimizing maintainability, dependability and energy. The presented solution offers a complete tool-flow for providing indicators and optimization methods with emphasis on embedded software. Effective forecasting models and decision-making solutions are also implemented to improve the quality of the software, respecting the constraints imposed on maintenance standards, energy consumption limits and security vulnerabilities. The use of the SDK4ED platform is demonstrated in a healthcare embedded application. |
16:00 CET | 18.6.6 | Q&A SESSION Authors: Ernesto Sanchez1 and Maksim Jenihhin2 1Politecnico di Torino, IT; 2Tallinn University of Technology, EE Abstract Questions and answers with the authors |
19.1 Hardware security primitives and attacks
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Johanna Sepulveda, Airbus Defense and Space, DE
Session co-chair:
Jorge Guajardo, Bosch, US
The first three papers in this session discuss hardware security attacks. The first paper presents a novel methodology for verifying the DPA security of masked hardware circuits. The second paper discusses a mechanism to activate capacitive triggers for Hardware Trojans. The third paper presents an attack based on the voltage-drop effect in an SoC composed of an FPGA and a CPU. The last paper in the session is on physically unclonable functions; more precisely, it proposes three new evaluation methods aimed at higher-order alphabet PUFs (a simple bias-test sketch follows this session's table).
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 19.1.1 | (Best Paper Award Candidate) ADD-BASED SPECTRAL ANALYSIS OF PROBING SECURITY Speaker: Maria Chiara Molteni, Universita' degli Studi di Milano, IT Authors: Maria Chiara Molteni1, Vittorio Zaccaria2 and Valentina Ciriani1 1Universita' degli Studi di Milano, IT; 2Politecnico di Milano, IT Abstract In this paper, we introduce a novel exact verification methodology for non-interference properties of cryptographic circuits. The methodology exploits the Algebraic Decision Diagram representation of the Walsh spectrum to overcome the potential slow down associated with its exact verification against noninterference constraints. Benchmarked against a standard set of use cases, the methodology speeds-up 1.88x the median verification time over the existing state-of-the art tools for exact verification. |
16:44 CET | 19.1.2 | GUARANTEED ACTIVATION OF CAPACITIVE TROJAN TRIGGERS DURING POST PRODUCTION TEST VIA SUPPLY PULSING Speaker: Sule Ozev, ASU, US Authors: Bora Bilgic and Sule Ozev, ASU, US Abstract Involvement of many parties in the production of ICs makes the process more vulnerable to tampering. Consequently, IC security has become an important challenge to tackle. One of the threat models in the hardware security domain is the insertion of unwanted and malicious hardware components, known as Hardware Trojans. A malicious attacker can insert a small modification into the functional circuit that can cause havoc in the field. To make the Trojan circuit stealthy, typically trigger circuits are used, which not only hide the Trojan activity during post-production testing, but also randomize activation conditions, thereby making it very difficult to diagnose even after failures. Trigger mechanisms for Trojans typically delay and randomize the outcome based on a subset of internal digital signals. While there are many different ways of implementing the trigger mechanisms, charge based mechanisms have gained popularity due to their small size. In this paper, we propose a scheme to ensure that the trigger mechanisms are activated during production testing even if the conditions specified by the malicious attacker are not met. By disabling the mechanism by which the Trojan remains stealthy, any of the parametric techniques can be used to detect potential Trojans at production time. The proposed technique relies on supply pulsing, where we generate a potential differential between the non-active input and output of any digital gate regardless of the signal pattern that the trigger mechanism is tied to. SPICE simulations show that our method works well even for the smallest Trojan trigger mechanisms. |
16:48 CET | 19.1.3 | FPGA-TO-CPU UNDERVOLTING ATTACKS Speaker: Dina Mahmoud, EPFL, CH Authors: Dina Mahmoud1, Samah Hussein1, Vincent Lenders2 and Mirjana Stojilovic1 1EPFL, CH; 2Armasuisse, CH Abstract FPGAs are proving useful and attractive for many applications, thanks to their hardware reconfigurability, low power, and high-degree of parallelism. As a result, modern embedded systems are often based on systems-on-chip (SoCs), where CPUs and FPGAs share the same die. In this paper, we demonstrate the first undervolting attack in which the FPGA acts as an aggressor while the CPU, residing on the same SoC, is the victim. We show that an adversary can use the FPGA fabric to create a significant supply voltage drop which, in turn, faults the software computation performed by the CPU. Additionally, we show that an attacker can, with an even higher success rate, execute a denial-of-service attack, without any modification of the underlying hardware or the power distribution network. Our work exposes a new electrical-level attack surface, created by tight integration of CPUs and FPGAs in modern SoCs, and incites future research on countermeasures. |
16:52 CET | 19.1.4 | BEWARE OF THE BIAS - STATISTICAL PERFORMANCE EVALUATION OF HIGHER-ORDER ALPHABET PUFS Speaker: Christoph Frisch, TU Munich, DE Authors: Christoph Frisch and Michael Pehl, TU Munich, DE Abstract Physical Unclonable Functions (PUFs) derive unpredictable and device-specific responses from uncontrollable manufacturing variations. While most of the PUFs provide only one response bit per PUF cell, deriving more bits such as a symbol from a higher-order alphabet would make PUF designs more efficient. This type of PUFs is thus suggested in some applications and subject to current research. However, only a few methods are available to analyze the statistical performance of such higher-order alphabet PUFs. This work, therefore, introduces various novel schemes. Unlike previous works, the new approaches involve statistical hypothesis testing. This facilitates more refined and statistically significant statements about the PUF regarding bias effects. We utilize real-world PUF data to illustrate the capabilities of the tests. In comparison to state-of-the-art approaches, our methods indeed capture more aspects of bias. Overall, this work is a step towards improved quality control of higher-order alphabet PUFs. |
16:56 CET | 19.1.5 | Q&A SESSION Authors: Johanna Sepúlveda1 and Jorge Guajardo2 1Airbus Defence and Space, DE; 2Bosch Research and Technology Center, Robert Bosch LLC, US Abstract Questions and answers with the authors |
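Presentation 19.1.4 advocates statistical hypothesis testing for evaluating PUF bias. The sketch below shows the simplest instance of that idea for single-bit responses, a two-sided binomial test (it assumes SciPy ≥ 1.7 is available); the paper's methods target higher-order alphabets and are more elaborate, so this is an illustrative assumption rather than the authors' procedure.

```python
import random
from scipy.stats import binomtest

def bias_test(responses, alpha=0.01):
    """Two-sided binomial test: are the PUF response bits biased away from 0.5?

    responses : iterable of 0/1 response bits collected from one PUF instance.
    Returns (estimated bias, p-value, reject_null). Illustrative only; the
    paper targets higher-order alphabets, not single bits.
    """
    ones = sum(responses)
    n = len(responses)
    result = binomtest(ones, n, p=0.5, alternative='two-sided')
    return ones / n, result.pvalue, result.pvalue < alpha

# Toy usage with a mildly biased response sequence.
random.seed(0)
bits = [1 if random.random() < 0.58 else 0 for _ in range(2000)]
print(bias_test(bits))  # bias estimate near 0.58, small p-value, null rejected
```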
19.2 Hardware components and architectures for Machine Learning
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Charles Mackin, IBM, US
Session co-chair:
Mladen Berekovic, University of Lübeck, DE
This session is dedicated to new advances in hardware components and architectures for ML. The first paper focuses on improving the energy consumption, latency, and application throughput of neuromorphic implementations; the second one proposes a near hybrid memory accelerator integrated close to the DRAM to improve inference; the third one presents a novel tensor processor with superior PPA metrics compared to the state of the art; the fourth paper presents a new mixed-signal architecture for implementing Quantized Neural Networks (QNNs) using flash transistors (a software sketch of quantized-weight inference follows this session's table). Two IP papers complete the session: the first one presents a hybrid RRAM-SRAM system for Deep Neural Networks, while the second one is the first work to deploy a large neural network on FPGA-based neuromorphic hardware.
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 19.2.1 | DESIGN OF MANY-CORE BIG LITTLE μBRAINS FOR ENERGY-EFFICIENT EMBEDDED NEUROMORPHIC COMPUTING Speaker: Lakshmi Varshika Mirtinti, Drexel University, US Authors: M. Lakshmi Varshika1, Adarsha Balaji1, Federico Corradi2, Anup Das1, Jan Stuijt2 and Francky Catthoor3 1Drexel University, US; 2Imec, NL; 3Imec, BE Abstract As spiking-based deep learning inference applications are increasing in embedded systems, these systems tend to integrate neuromorphic accelerators such as μBrain to improve energy efficiency. We propose a μBrain-based scalable many-core neuromorphic hardware design to accelerate the computations of spiking deep convolutional neural networks (SDCNNs). To increase energy efficiency, cores are designed to be heterogeneous in terms of their neuron and synapse capacity (i.e., big vs. little cores), and they are interconnected using a parallel segmented bus interconnect, which leads to lower latency and energy compared to a traditional mesh-based Network-on-Chip (NoC). We propose a system software framework called SentryOS to map SDCNN inference applications to the proposed design. SentryOS consists of a compiler and a run-time manager. The compiler compiles an SDCNN application into sub-networks by exploiting the internal architecture of big and little μBrain cores. The run-time manager schedules these sub-networks onto cores and pipeline their execution to improve throughput. We evaluate the proposed big little many-core neuromorphic design and the system software framework with five commonly-used SDCNN inference applications and show that the proposed solution reduces energy (between 37% and 98%), reduces latency (between 9% and 25%), and increases application throughput (between 20% and 36%). We also show that SentryOS can be easily extended for other spiking neuromorphic accelerators such as Loihi and DYNAPs. |
16:44 CET | 19.2.2 | HYDRA: A NEAR HYBRID MEMORY ACCELERATOR FOR CNN INFERENCE Speaker: Palash Das, Indian Institute of Technology, Guwahati, IN Authors: Palash Das1, Ajay Joshi2 and Hemangee Kapoor1 1Indian Institute of Technology, Guwahati, IN; 2Boston University, US Abstract Convolutional neural network (CNN) accelerators often suffer from limited off-chip memory bandwidth and on-chip capacity constraints. One solution to this problem is near-memory or in-memory processing. Non-volatile memory, such as phase-change memory (PCM), has emerged as a promising DRAM alternative. It is also used in combination with DRAM, forming a hybrid memory. Though near-memory processing (NMP) has been used to accelerate the CNN inference, the feasibility/efficacy of NMP remained unexplored for a hybrid main memory system. Additionally, PCMs are also known to have low write endurance, and therefore, the tremendous amount of writes generated by the accelerators can drastically hamper the longevity of the PCM memory. In this work, we propose Hydra, a near hybrid memory accelerator integrated close to the DRAM to execute inference. The PCM banks store the models that are only read by the memory controller during the inference. For entire forward propagation (inference), the intermediate writes from Hydra are entirely performed to the DRAM, eliminating PCM-writes to enhance PCM lifetime. Unlike the other in-DRAM processing-based works, Hydra does not eliminate any multiplication operations by using binary or ternary neural networks, making it more suitable for the requirement of high accuracy. We also exploit inter- and intra-chip (DRAM chip) parallelism to improve the system's performance. On average, Hydra achieves around 20x performance improvements over the in-DRAM processing-based state-of-the-art works while accelerating the CNN inference. |
16:48 CET | 19.2.3 | TCX: A PROGRAMMABLE TENSOR PROCESSOR Speaker: Tailin Liang, University of Science and Technology Beijing, CN Authors: Tailin Liang1, Lei Wang1, Shaobo Shi2, John Glossner1 and Xiaotong Zhang1 1University of Science and Technology Beijing, CN; 2Hua Xia General Processor Technologies, CN Abstract Neural network processors and accelerators are domain-specific architectures deployed to solve the high computational requirements of deep learning algorithms. This paper proposes a new instruction set extension for tensor computing, TCX, with RISC-style instructions and variable length tensor extensions. It features a multi-dimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC ISAs and provides software compatibility for scalable hardware implementations. We present an implementation of the TCX tensor computing accelerator using an out-of-order microarchitecture implementation. The tensor accelerator is scalable in computation units from several hundred to tens of thousands. An optimized register renaming mechanism is described which allows for many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements based on tensor dimensions. Implementations may balance data bandwidth and computation utilization for different types of tensor computations such as element-wise, depth-wise, and matrix-multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 Tera operations per second using a 4096 multiplication-accumulation compute unit with up to 98.83% MAC utilization. It consumes 12.8 square millimeters while dissipating 0.46 Watts per TOP in TSMC 28nm technology. |
16:52 CET | 19.2.4 | A FLASH-BASED CURRENT-MODE IC TO REALIZE QUANTIZED NEURAL NETWORKS Speaker: Kyler Scott, Texas A&M University, US Authors: Kyler Scott1, Cheng-Yen Lee1, Sunil Khatri1 and Sarma Vrudhula2 1Texas A&M University, US; 2Arizona State University, US Abstract This paper presents a mixed-signal architecture for implementing Quantized Neural Networks (QNNs) using flash transistors to achieve extremely high throughput with extremely low power, energy and memory requirements. Its low resource consumption makes our design especially suited for use in edge devices. The network weights are stored in-memory using flash transistors, and nodes perform operations in the analog current domain. Our design can be programmed with any QNN whose hyperparameters (the number of layers, filters, or filter size, etc) do not exceed the maximum provisioned. Once the flash devices are programmed with a trained model and the IC is given an input, our architecture performs inference with zero access to off-chip memory. We demonstrate the robustness of our design under current-mode non-linearities arising from process and voltage variations. We test validation accuracy on the ImageNet dataset, and show that our IC suffers only 0.6% and 1.0% reduction in classification accuracy for Top-1 and Top-5 outputs, respectively. Our implementation results in a ~50x reduction in latency and energy when compared to a recently published mixed-signal ASIC implementation, with similar power characteristics. Our approach provides layer partitioning and node sharing possibilities, which allow us to trade off latency, power, and area amongst each other. |
16:56 CET | 19.2.5 | Q&A SESSION Authors: Charles Mackin1 and Mladen Berekovic2 1IBM, US; 2University of Lübeck, DE Abstract Questions and answers with the authors |
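Presentation 19.2.4 maps quantized neural network weights onto flash transistors and computes in the analog current domain. As a rough software analogue only, the sketch below quantizes weights to a few discrete levels and evaluates a layer with them; the level count, scaling, and rounding scheme are illustrative assumptions and do not reflect the paper's circuit behaviour.

```python
import numpy as np

def quantize_weights(w, max_level=2):
    """Uniformly quantize weights to integer levels in [-max_level, max_level] (illustrative)."""
    scale = np.abs(w).max() / max_level
    q = np.clip(np.round(w / scale), -max_level, max_level)
    return q.astype(np.int8), scale

def qnn_layer(x, w_q, scale):
    """Matrix-vector product with quantized weights, then rescale to real-valued outputs."""
    return (w_q.astype(np.float32) @ x) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)).astype(np.float32)
x = rng.normal(size=16).astype(np.float32)
Wq, s = quantize_weights(W)
# Quantization error of the layer output versus the full-precision reference.
print(np.max(np.abs(W @ x - qnn_layer(x, Wq, s))))
```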
19.3 NoC optimization with emerging technologies
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Romain Lemaire, CEA, FR
Session co-chair:
Sebastien Le Beux, Concordia University, CA
Networks-on-chip, and more generally on-chip communication architectures, have to be constantly improved to address new application constraints and to take advantage of innovations in system integration technologies. This session presents various approaches illustrating these topics. First, at design time, a framework is proposed to estimate NoC performance using a graph neural network (a sketch of the underlying attributed-graph model follows this session's table). Then, at execution time, two adaptive routing algorithms are detailed: one based on an optimized credit flow control between routers and the other targeting 2.5D topologies in the presence of faulty links. Finally, in a prospective way, phase-change material is considered to build a complete optical NoC system. By optimizing established approaches and introducing emerging technologies, NoCs clearly remain on an innovative path.
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 19.3.1 | NOCEPTION: A FAST PPA PREDICTION FRAMEWORK FOR NETWORK-ON-CHIPS USING GRAPH NEURAL NETWORK Speaker: Fuping Li, Institute of Computing Technology, CN Authors: Fuping Li, Ying Wang, Cheng Liu, Huawei Li and Xiaowei Li, Institute of Computing Technology, Chinese Academy of Sciences, CN Abstract Network-on-chips (NoCs) have been viewed as a promising alternative to traditional on-chip communication architecture for the increasing number of IPs in modern chips. To support the vast design space exploration of application-specific NoC characteristics with arbitrary topologies, in this paper, we propose a fast estimation framework to predict power, performance, and area (PPA) of NoCs based on graph neural networks (GNNs). We present a general way of modeling the application and the NoC with user-defined parameters as an attributed graph, which can be learned by the GNN model. Experimental results show that on the unseen realistic applications, the proposed method achieves the accuracy of 97.36% on power estimation, 97.83% on area estimation, and improves the accuracy of the network-level and system-level performance predictor over the topology-constrained baseline method by 6.52% and 4.73% respectively. |
16:44 CET | 19.3.2 | (Best Paper Award Candidate) AN EASY-TO-IMPLEMENT AND EFFICIENT FLOW CONTROL FOR DEADLOCK-FREE ADAPTIVE ROUTING Speaker: Yi Dai, National University of Defense Technology, CN Authors: Yi Dai, Kai Lu, Sheng Ma and Junsheng Chang, National University of Defense Technology, CN Abstract Deadlock-free adaptive routing is extensively adopted in interconnection networks to improve communication bandwidth and reduce latency. However, existing deadlock-free flow control schemes either underutilize memory resources due to inefficient buffer management for simple hardware implementations, or rely on complicated coordination and synchronization mechanisms with high hardware complexity. In this work, we solve the deadlock problem from a different perspective by considering the deadlock as a lack of credit. With minor modifications of the credit accumulation procedure, our proposed full-credit flow control (FFC) ensures atomic buffer usage only based on local credit status while making full use of the buffer space. FFC can be easily integrated in the industrial router to achieve deadlock freedom with less area and power consumption, but 112% higher throughput, compared to the critical bubble scheme (CBS). We further propose a credit reservation strategy to eliminate the escape virtual channel (VC) cost for fully adaptive routing implementation. The synthesizing results demonstrate that FFC along with credit reservation (FFC-CR) can effectively reduce the area by 29% and power consumption by 26% compared with CBS. |
16:48 CET | 19.3.3 | DEFT: A DEADLOCK-FREE AND FAULT-TOLERANT ROUTING ALGORITHM FOR 2.5D CHIPLET NETWORKS Speaker: Ebadollah Taheri, Colorado State University, US Authors: Ebadollah Taheri, Sudeep Pasricha and Mahdi Nikdast, Colorado State University, US Abstract By interconnecting smaller chiplets through an interposer, 2.5D integration offers a cost-effective and high-yield solution to implement large-scale modular systems. Nevertheless, the underlying network is prone to deadlock, despite deadlock-free chiplets, and to different faults on the vertical links used for connecting the chiplets to the interposer. Unfortunately, existing fault-tolerant routing techniques proposed for 2D and 3D on-chip networks cannot be applied to chiplet networks. To address these problems, this paper presents the first deadlock-free and fault-tolerant routing algorithm, called DeFT, for 2.5D integrated chiplet systems. DeFT improves the redundancy in vertical-link selection to tolerate faults in vertical links while considering network congestion. Moreover, DeFT can tolerate different vertical-link-fault scenarios while accounting for vertical-link utilization. Compared to the state-of-the-art routing algorithms in 2.5D chiplet systems, our simulation results show that DeFT improves network reachability by up to 75% with a fault rate of up to 25% and reduces the network latency by up to 40% for multi-application execution scenarios with less than 2% area overhead. |
16:52 CET | 19.3.4 | NON-VOLATILE PHASE CHANGE MATERIAL BASED NANOPHOTONIC INTERCONNECT Speaker: Parya Zolfaghari, Concordia University, CA Authors: Parya Zolfaghari1, Joel Ortiz2, Cedric Killian2 and Sébastien Le Beux3 1Concordia University, CA; 2University of Rennes 1, Inria, CNRS/IRISA Lannion, FR; 3Department of Electrical & Computer Engineering Concordia University, CA Abstract Integrated optics is a promising technology to take advantage of light propagation for high throughput chip-scale interconnects in many core architectures. A key challenge for the deployment of nanophotonic interconnects is their high static power, which is induced by signal losses and device calibration. To tackle this challenge, we propose to use Phase Change Material (PCM) to configure optical paths between writers and readers. The non-volatility of PCM elements and the high contrast between crystalline and amorphous phase states allow unused readers to be bypassed, thus reducing losses and calibration requirements. We evaluate the efficiency of the proposed PCM-based interconnects using system-level simulations carried out with the SNIPER manycore simulator. For this purpose, we have modified the simulator to partition clusters according to executed applications. Simulation results show that bypassing readers using PCM leads to up to 52% communication power savings. |
16:56 CET | 19.3.5 | Q&A SESSION Authors: Romain Lemaire1 and Sébastien Le Beux2 1CEA-List, FR; 2Concordia University, CA Abstract Questions and answers with the authors |
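Presentation 19.3.1 models the application and the NoC with user-defined parameters as an attributed graph that a GNN can learn from. The sketch below builds such a graph for a small mesh using plain dictionaries; the chosen node and edge attributes are assumptions for illustration, since the paper's exact feature set is not reproduced here.

```python
def mesh_noc_graph(rows, cols, buffer_depth=4, link_width=128):
    """Build an attributed graph for a rows x cols mesh NoC (illustrative features only)."""
    nodes, edges = {}, []
    for r in range(rows):
        for c in range(cols):
            nodes[r * cols + c] = {"buffer_depth": buffer_depth, "ports": 0}
    for r in range(rows):
        for c in range(cols):
            nid = r * cols + c
            for dr, dc in ((0, 1), (1, 0)):          # east and south neighbours
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    nbr = rr * cols + cc
                    edges.append((nid, nbr, {"width_bits": link_width}))
                    edges.append((nbr, nid, {"width_bits": link_width}))
                    nodes[nid]["ports"] += 1
                    nodes[nbr]["ports"] += 1
    return nodes, edges

nodes, edges = mesh_noc_graph(2, 2)
print(len(nodes), len(edges))  # 4 routers, 8 directed links
```

A GNN-based PPA predictor would consume such node and edge attribute dictionaries (typically converted to feature tensors) together with an application traffic graph.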
19.4 Emerging devices for new computing paradigms
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Georgiev Vihar, University of Glasgow, GB
Session co-chair:
Gabriele Boschetto, CNRS-LIRMM, FR
This session covers new computing ideas and approaches. The first paper describes the impact of reliability on the performance of photonic neural networks. The next paper shows how parallel circuit execution can be applied in the quantum computing domain. The session also covers work on optical logic circuits for area and power reduction, a ternary processor (a short balanced-ternary encoding sketch follows this session's table), and the application of the Ising model to solving the traveling salesman problem.
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 19.4.1 | A RELIABILITY CONCERN ON PHOTONIC NEURAL NETWORKS Speaker: Yinyi Liu, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, HK Authors: Yinyi Liu1, Jiaxu Zhang1, Jun Feng1, Shixi Chen1 and Jiang Xu2 1Electronic and Computer Engineering Department, The Hong Kong University of Science and Technology, HK; 2Microelectronics Thrust, Electronic and Computer Engineering Department, AI Chip Center for Emerging Smart Systems, The Hong Kong University of Science and Technology, HK Abstract Emerging integrated photonic neural networks have experimentally proved to achieve an ultra-high speedup of deep neural network training and inference in the optical domain. However, photonic devices suffer from inherent crosstalk noise and loss, inevitably leading to reliability concerns. This paper systematically analyzes the impacts of crosstalk and loss on photonic computing systems. We propose a crosstalk-aware model for reliability estimation and find out the worst-case bounds as we increase the footprints and scales of the photonic chips. Our evaluations show that -30dB crosstalk noise can cause the maximal photonic chip integration scale to drop sharply, by 109x. To facilitate very-large-scale photonic integration for future computing, we further propose multiple heterogeneous bijou photonic-cores to address the crosstalk-aware reliability concern. |
16:44 CET | 19.4.2 | HOW PARALLEL CIRCUIT EXECUTION CAN BE USEFUL FOR NISQ COMPUTING? Speaker: Siyuan Niu, LIRMM, University of Montpellier, FR Authors: Siyuan Niu1 and Aida Todri-Sanial2 1LIRMM, University of Montpellier, FR; 2LIRMM, University of Montpellier, CNRS, FR Abstract Quantum computing is performed on Noisy Intermediate-Scale Quantum (NISQ) hardware in the short term. Only small circuits can be executed reliably on a quantum machine due to the unavoidable noisy quantum operations on NISQ devices, leading to the under-utilization of hardware resources. With the growing demand to access quantum hardware, how to utilize it more efficiently while maintaining output fidelity is becoming a timely issue. A parallel circuit execution technique has been proposed to address this problem by executing multiple programs on hardware simultaneously. It can improve the hardware throughput and reduce the overall runtime. However, accumulative noises such as crosstalk can decrease the output fidelity in parallel workload execution. In this paper, we first give an in-depth overview of state-of-the-art parallel circuit execution methods. Second, we propose a Quantum Crosstalk-aware Parallel workload execution method (QuCP) without the overhead of crosstalk characterization. Third, we investigate the trade-off between hardware throughput and fidelity loss to explore the hardware limitation with parallel circuit execution. Finally, we apply parallel circuit execution to VQE and zero-noise extrapolation error mitigation method to showcase its various applications on advancing NISQ computing. |
16:48 CET | 19.4.3 | SPACE AND POWER REDUCTION IN BDD-BASED OPTICAL LOGIC CIRCUITS EXPLOITING DUAL PORTS Speaker: Ryosuke Matsuo, Kyoto University, JP Authors: Ryosuke Matsuo and Shin-ichi Minato, Kyoto University, JP Abstract Optical logic circuits based on integrated nanophotonics have attracted significant interest due to their ultra-high-speed operation. A synthesis method based on the Binary Decision Diagram (BDD) has been studied, as BDD-based optical logic circuits can take advantage of the speed of light. However, a fundamental disadvantage of BDD-based optical logic circuits is a large number of splitters, which results in large power consumption. In BDD-based circuits, the dual port of each logic gate is not used. We propose a method for eliminating a splitter by exploiting this dual port. We define a BDD node corresponding to a dual port as a dual port node (DP node) and call the proposed method DP node sharing. We demonstrate that DP node sharing significantly reduces the power consumption and, to a lesser extent, the circuit size, without increasing delay. We conducted an experiment involving 10-input logic functions obtained by applying an LUT technology mapper to an ISCAS'85 C7552 benchmark circuit to evaluate our DP node sharing. The experimental results demonstrate that DP node sharing reduces the power consumption by two orders of magnitude for circuits that consume a large amount of power. |
16:52 CET | 19.4.4 | DESIGN AND EVALUATION FRAMEWORKS FOR ADVANCED RISC-BASED TERNARY PROCESSOR Speaker: Dongyun Kam, Pohang University of Science and Technology, KR Authors: Dongyun Kam, Jung Gyu Min, Jongho Yoon, Sunmean Kim, Seokhyeong Kang and Youngjoo Lee, Pohang University of Science and Technology, KR Abstract In this paper, we introduce the design and verification frameworks for developing a fully-functional emerging ternary processor. Based on the existing compiling environments for binary processors, for the given ternary instructions, the software-level framework provides an efficient way to convert the given programs to the ternary assembly codes. We also present a hardware-level framework to rapidly evaluate the performance of a ternary processor implemented in arbitrary design technology. As a case study, the fully-functional 9-trit advanced RISC-based ternary (ART-9) core is newly developed by using the proposed frameworks. Utilizing 24 custom ternary instructions, the 5-stage ART-9 prototype architecture is successfully verified by a number of test programs including dhrystone benchmark in a ternary domain, achieving the processing efficiency of 57.8 DMIPS/W and 3.06 x 10^6 DMIPS/W in the FPGA-level ternary-logic emulations and the emerging CNTFET ternary gates, respectively. |
16:56 CET | 19.4.5 | Q&A SESSION Authors: Vihar Georgiev1 and Gabriele Boschetto2 1University of Glasgow, GB; 2CNRS-LIRMM, FR Abstract Questions and answers with the authors |
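Presentation 19.4.4 develops a processor that operates on trits rather than bits. As background only, the sketch below converts integers to and from balanced-ternary digits (-1, 0, +1); the ART-9 instruction set and its actual number representation are not reproduced here, so this is purely an illustration of ternary encoding.

```python
def to_balanced_ternary(n, width=9):
    """Encode an integer as 'width' balanced-ternary trits (-1, 0, +1), LSB first."""
    trits = []
    for _ in range(width):
        n, r = divmod(n, 3)
        if r == 2:          # digit 2 becomes -1 with a carry into the next trit
            r, n = -1, n + 1
        trits.append(r)
    if n != 0:
        raise OverflowError("value does not fit in the given number of trits")
    return trits

def from_balanced_ternary(trits):
    """Decode LSB-first balanced-ternary trits back to an integer."""
    return sum(t * (3 ** i) for i, t in enumerate(trits))

v = 2022
t = to_balanced_ternary(v)
assert from_balanced_ternary(t) == v
print(t)  # nine trits covering the range -9841 .. 9841
```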
19.5 Dealing with Correct Design and Robustness analysis for Complex Systems, MPSoCs and Circuits
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Chung-Wei Lin, National Taiwan University, TW
Session co-chair:
Dionisios N. Pnevmatikatos, NTUA, GR
This session contains two parts; the first is dedicated to complex systems, where the presence of many different subsystems is an important concern. The first paper addresses the global robustness of deep neural networks (a sampling-based sketch of this property follows this session's table). The second paper tackles the problem of planning communications in wireless networks with energy harvesting. The second part presents two industrial briefs on different levels of chip design: one is technology-oriented and focuses on transistor-level design, while the other deals with statistical analysis for the robustness of MPSoCs. Both briefs provide detailed results on performance or energy.
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 19.5.1 | REVISITING PASS-TRANSISTOR LOGIC STYLES IN A 12NM FINFET TECHNOLOGY NODE Speaker: Jan Lappas, TU Kaiserslautern, DE Authors: Jan Lappas1, André Chinazzo1, Christian Weis2, Chenyang Xia3, Zhihang Wu3, Leibin Ni3 and Norbert Wehn2 1TU Kaiserslautern, DE; 2University of Kaiserslautern, DE; 3Huawei Technologies Co., Ltd., CN Abstract With the slow-down of Moore’s law and the increasing requirements on energy efficiency, alternative logic styles compared to complementary static CMOS have to be revisited for digital circuit implementations. Pass Transistor Logic (PTL) gained much attention in the ‘90s, however, only a limited number of recent investigations and publications regarding PTL exist that use advanced technology nodes. This paper compares key performance metrics of 22 different PTL based 1-bit full adder designs to a complementary static CMOS logic reference, using a recent 12nm FinFET technology. The figures of merit are the propagation delay, the energy consumption, and the energy-delay-product (EDP). Our investigations show that PTL based adder circuits can have an up to 49% decreased delay and a 48% and 63% reduced energy consumption and EDP, respectively, compared to a state-of-the-art complementary CMOS logic reference. In addition, we analyzed the impact of PVT variations on the delay for selected PTL full adder designs. |
16:44 CET | 19.5.2 | SAFESU-2: A SAFE STATISTICS UNIT FOR SPACE MPSOCS Speaker: Guillem Cabo, Barcelona Supercomputing Center, ES Authors: Guillem Cabo1, Sergi Alcaide2, Carles Hernandez3, Pedro Benedicte1, Francisco Bas1, Fabio Mazzocchetti1 and Jaume Abella1 1Barcelona Supercomputing Center, ES; 2Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 3Universitat Politecnica de Valencia, ES Abstract Advanced statistics units (SUs) have been proven effective for the verification, validation and implementation of safety measures as part of safety-related MPSoCs. This is the case, for instance, of the RISC-V MPSoC by Cobham Gaisler based on NOEL-V cores that will become commercially ready by the end of 2022. However, while those SUs support safety in the rest of the SoC, they must be built to be safe to be deployed in real products. This paper presents the SafeSU-2, the safety-compliant version of the SafeSU. In particular, we develop the safety concept of the SafeSU for relevant fault models, and implement fault detection and fault tolerance features needed to make it compliant with the requirements of safety-related devices in general, and of space MPSoCs in particular. |
16:48 CET | 19.5.3 | (Best Paper Award Candidate) EFFICIENT GLOBAL ROBUSTNESS CERTIFICATION OF NEURAL NETWORKS VIA INTERLEAVING TWIN-NETWORK ENCODING Speaker: Zhilu Wang, Northwestern University, US Authors: Zhilu Wang1, Chao Huang2 and Qi Zhu1 1Northwestern University, US; 2University of Liverpool, Northwestern University, GB Abstract The robustness of deep neural networks has received significant interest recently, especially when being deployed in safety-critical systems, as it is important to analyze how sensitive the model output is under input perturbations. While most previous works focused on the local robustness property around an input sample, the studies of the global robustness property, which bounds the maximum output change under perturbations over the entire input space, are still lacking. In this work, we formulate the global robustness certification for neural networks with ReLU activation functions as a mixed-integer linear programming (MILP) problem, and present an efficient approach to address it. Our approach includes a novel interleaving twin-network encoding scheme, where two copies of the neural network are encoded side-by-side with extra interleaving dependencies added between them, and an over-approximation algorithm leveraging relaxation and refinement techniques to reduce complexity. Experiments demonstrate the timing efficiency of our work when compared with previous global robustness certification methods and the tightness of our over-approximation. A case study of closed-loop control safety verification is conducted, and demonstrates the importance and practicality of our approach for certifying the global robustness of neural networks in safety-critical systems. |
16:52 CET | 19.5.4 | OPPORTUNISTIC COMMUNICATION WITH LATENCY GUARANTEES FOR INTERMITTENTLY-POWERED DEVICES Speaker: Kacper Wardega, Boston University, US Authors: Kacper Wardega1, Wenchao Li1, Hyoseung Kim2, Yawen Wu3, Zhenge Jia3 and Jingtong Hu3 1Boston University, US; 2University of California, Riverside, US; 3University of Pittsburgh, US Abstract Energy-harvesting wireless sensor nodes have found widespread adoption due to their low cost and small form factor. However, uncertainty in the available power supply introduces significant challenges in engineering communications between intermittently- powered nodes. We propose a constraint-based model for energy harvests that together with a hardware model can be used to enable safe, opportunistic communication with worst-case latency guarantees. We show that greedy approaches that attempt communication whenever energy is available lead to prolonged latencies in real-world environments. Our approach offers bounded worst-case latency while providing a performance improvement over a conservative, offline approach planned around the worst-case energy harvest. |
16:56 CET | 19.5.5 | Q&A SESSION Authors: Chung-Wei Lin1 and Dionisios Pnevmatikatos2 1National Taiwan University, TW; 2School of ECE, National TU Athens & FORTH-ICS, GR Abstract Questions and answers with the authors |
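Presentation 19.5.3 certifies global robustness, i.e., a bound on the maximum output change over all input pairs within a given perturbation, exactly via an MILP over an interleaving twin-network encoding. The sketch below is only a sampling-based lower-bound proxy for the same quantity on a toy ReLU network; the network, perturbation bound, and sample count are illustrative assumptions, not the paper's method.

```python
import numpy as np

def relu_net(x, weights, biases):
    """Tiny fully-connected ReLU network used only for this illustration."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(W @ x + b, 0.0)
    return weights[-1] @ x + biases[-1]

def sampled_global_robustness(weights, biases, dim, eps, n_samples=10000, seed=0):
    """Lower-bound estimate of max |f(x) - f(x')| over pairs with ||x - x'||_inf <= eps.

    An exact certificate (as in the paper) requires an MILP over a twin-network
    encoding; random sampling can only under-approximate the true global bound.
    """
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0, dim)
        x2 = x + rng.uniform(-eps, eps, dim)
        delta = np.max(np.abs(relu_net(x, weights, biases) - relu_net(x2, weights, biases)))
        worst = max(worst, delta)
    return worst

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]
bs = [np.zeros(8), np.zeros(2)]
print(sampled_global_robustness(Ws, bs, dim=4, eps=0.1))
```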
20.1 Panel: The Good, the Bad and the Trendy of Multi-Partner Research Projects in Europe
Add this session to my calendar
Date: Tuesday, 22 March 2022
Time: 18:00 CET - 20:30 CET
Session chair:
Lorena Anghel, Grenoble INP, FR
Session co-chair:
Maksim Jenihhin, Tallinn University of Technology, EE
Panellists:
Yves Gigase, KDT Joint Undertaking, BE
Anton Chichkov, KDT Joint Undertaking, BE
Daniel Watzenig, Virtual Vehicle Research GmbH, AT
Said Hamdioui, Delft University of Technology, NL
Peter Hofmann, Deutsche Telekom Security, DE
Christoph Grimm, TU Kaiserslautern, DE
Dirk Pflueger, University of Stuttgart, DE
The panel establishes an open discussion of opportunities and approaches to collaborative research and innovation in Europe. In addition, it features an invited talk by Yves Gigase entitled “KDT JU and the Chips Act: Opportunities for the DATE Community”. The panel speakers include representatives of the European Commission and distinguished experts in multi-partner research, notably representatives of the projects CARAMEL, GENIAL! and GS-IMTR. The online live debate will address the balance between blue-sky and applied research in Europe, the next killer trends, the protection of EU interests, challenges exacerbated by COVID-19, and a number of other exciting questions.
IP.3_1 Interactive presentations
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer to questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.3_1.1 | REDMULE: A COMPACT FP16 MATRIX-MULTIPLICATION ACCELERATOR FOR ADAPTIVE DEEP LEARNING ON RISC-V-BASED ULTRA-LOW-POWER SOCS Speaker: Yvan Tortorella, University of Bologna, IT Authors: Yvan Tortorella1, Luca Bertaccini2, Davide Rossi3, Luca Benini4 and Francesco Conti1 1University of Bologna, IT; 2ETH Zürich, CH; 3University Of Bologna, IT; 4Università di Bologna and ETH Zürich, IT Abstract The fast proliferation of extreme-edge applications using Deep Learning (DL) based algorithms required dedicated hardware to satisfy extreme-edge applications’ latency, throughput, and precision requirements. While inference is achievable in practical cases, online finetuning and adaptation of general DL models are still highly challenging. One of the key stumbling stones is the need for parallel floating-point operations, which are considered unaffordable on sub-100mW extreme-edge SoCs. We tackle this problem with RedMulE (Reduced-precision matrix Multiplication Engine), a parametric low-power hardware accelerator for FP16 matrix multiplications - the main kernel of DL training and inference - conceived for tight integration within a cluster of tiny RISC-V cores based on the PULP (Parallel Ultra-Low-Power) architecture. In 22nm technology, a 32-FMA RedMulE instance occupies just 0.07mm^2 (14% of an 8-core RISC-V cluster) and achieves up to 666MHz maximum operating frequency, for a throughput of 31.6 MAC/cycle (98.8% utilization). We reach a cluster-level power consumption of 43.5mW and a full-cluster energy efficiency of 688 16-bit GFLOPS/W. Overall, RedMulE features up to 4.65× higher energy efficiency and 22× speedup over SW execution on 8 RISC-V cores. |
IP.3_1.2 | INCREASING CELLULAR NETWORK ENERGY EFFICIENCY FOR RAILWAY CORRIDORS Speaker: Adrian Schumacher, Swisscom (Switzerland) Ltd., CH Authors: Adrian Schumacher1, Ruben Merz1 and Andreas Burg2 1Swisscom (Switzerland) Ltd., CH; 2EPFL-TCL, CH Abstract Modern trains act as Faraday cages making it challenging to provide high cellular data capacities to passengers. A solution is the deployment of linear cells along railway tracks, forming a cellular corridor. To provide a sufficiently high data capacity, many cell sites need to be installed at regular distances. However, such cellular corridors with high power sites in short distance intervals are not sustainable due to the infrastructure power consumption. To render railway connectivity more sustainable, we propose to deploy fewer high-power radio units with intermediate low-power support repeater nodes. We show that these repeaters consume only 5% of the energy of a regular cell site and help to maintain the same data capacity in the trains. In a further step, we introduce a sleep mode for the repeater nodes that enables autonomous solar powering and even eases installation because no cables to the relays are needed. |
IP.3_1.3 | HEALTH MONITORING OF MILLING TOOLS UNDER DISTINCT OPERATING CONDITIONS BY A DEEP CONVOLUTIONAL NEURAL NETWORK MODEL Speaker: Priscile Suawa, Brandenburg TU, Cottbus–Senftenberg, DE Authors: Priscile Suawa and Michael Hübner, Brandenburg TU Cottbus, DE Abstract One of the most popular manufacturing techniques is milling. It can be used to make a variety of geometric components, such as flat grooves, surfaces, etc. The condition of the milling tool has a major impact on the quality of milling processes, hence the importance of monitoring it. When working on monitoring solutions, it is crucial to take into account different operating variables, such as rotational speed, especially in real-world settings. This work addresses the topic of predictive maintenance by exploiting the fusion of sensor data and the artificial intelligence-based analysis of signals measured by sensors. With a set of data such as vibration and sound reflection from the sensors, we focus on finding solutions for the task of detecting the health condition of machines. A Deep Convolutional Neural Network (DCNN) model is provided with fusion at the sensor data level to detect five consecutive health states of a milling tool, from a healthy state to a degraded state. In addition, a demonstrator is built with Simulink to simulate and visualize the detection process. To examine the capacity of our model, the signal data was processed individually and subsequently merged. Experiments were carried out on three sets of data recorded during a real milling process. Results using the proposed DCNN architecture with raw data reached an accuracy of more than 94% for all data sets. |
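IP.3_1.3 fuses vibration and sound data at the sensor-data level before a deep CNN classifies the milling tool's health state. One plausible form of such fusion is sketched below, stacking the two signals as input channels of a small 1-D CNN (PyTorch); the architecture, window length, and five-class output are assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn

class TinyFusionCNN(nn.Module):
    """Illustrative 1-D CNN over channel-stacked vibration + sound windows."""
    def __init__(self, n_classes=5, window=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(2, 16, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.classifier = nn.Linear(32 * (window // 16), n_classes)

    def forward(self, x):                      # x: (batch, 2, window)
        h = self.features(x)
        return self.classifier(h.flatten(1))

# Fuse by stacking the two sensor streams as channels of one input tensor.
vibration = torch.randn(8, 1, 1024)
sound = torch.randn(8, 1, 1024)
fused = torch.cat([vibration, sound], dim=1)   # (8, 2, 1024)
logits = TinyFusionCNN()(fused)
print(logits.shape)                            # torch.Size([8, 5])
```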
IP.3_2 Interactive presentations
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer to questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.3_2.1 | GRADIENT-BASED BIT ENCODING OPTIMIZATION FOR NOISE-ROBUST BINARY MEMRISTIVE CROSSBAR Speaker: Youngeun Kim, Yale University, US Authors: Youngeun Kim1, Hyunsoo Kim2, Seijoon Kim3, Sang Joon Kim2 and Priyadarshini Panda1 1Yale University, US; 2Samsung Advanced Institute of Technology, KR; 3Seoul National University, KR Abstract Binary memristive crossbars have gained huge attention as an energy-efficient deep learning hardware accelerator. Nonetheless, they suffer from various noises due to the analog nature of the crossbars. To overcome such limitations, most previous works train weight parameters with noise data obtained from a crossbar. These methods are, however, ineffective because it is difficult to collect noise data in large-volume manufacturing environment where each crossbar has a large device/circuit level variation. Moreover, we argue that there is still room for improvement even though these methods somewhat improve accuracy. This paper explores a new perspective on mitigating crossbar noise in a more generalized way by manipulating input binary bit encoding rather than training the weight of networks with respect to noise data. We first mathematically show that the noise decreases as the number of binary bit encoding pulses increases when representing the same amount of information. In addition, we propose Gradient-based Bit Encoding Optimization (GBO) which optimizes a different number of pulses at each layer, based on our in-depth analysis that each layer has a different level of noise sensitivity. The proposed heterogeneous layer-wise bit encoding scheme achieves high noise robustness with low computational cost. Our experimental results on public benchmark datasets show that GBO improves the classification accuracy by ~ 5-40% in severe noise scenarios. |
IP.3_2.2 | TAS: TERNARIZED NEURAL ARCHITECTURE SEARCH FOR RESOURCE-CONSTRAINED EDGE DEVICES Speaker: Mohammad Loni, MDH, SE Authors: Mohammad Loni1, Hamid Mousavi2, Mohammad Riazati2, Masoud Daneshtalab2 and Mikael Sjodin3 1Mälardalen University, SE; 2MDH, SE; 3Mälardalen Real-Time Research Centre, SE Abstract Ternary Neural Networks (TNNs) compress network weights and activation functions into a 2-bit representation, resulting in remarkable network compression and energy efficiency. However, there remains a significant gap in accuracy between TNNs and full-precision counterparts. Recent advances in Neural Architecture Search (NAS) promise opportunities in automated optimization for various deep learning tasks. Unfortunately, this area is unexplored for optimizing TNNs. This paper proposes TAS, a framework that drastically reduces the accuracy gap between TNNs and their full-precision counterparts by integrating quantization into the network design. We observed that directly applying NAS to the ternary domain causes accuracy degradation, as the search settings are customized for full-precision networks. To address this problem, we propose (i) a new cell template for ternary networks with maximum gradient propagation; and (ii) a novel learnable quantizer that adaptively relaxes the ternarization mechanism from the distribution of the weights and activation functions. Experimental results reveal that TAS delivers 2.64% higher accuracy and ≈2.8x memory saving over competing methods with the same bit-width resolution on the CIFAR-10 dataset. These results suggest that TAS is an effective method that paves the way for the efficient design of the next generation of quantized neural networks. |
IP.3_2.3 | EXAMINING AND MITIGATING THE IMPACT OF CROSSBAR NON-IDEALITIES FOR ACCURATE IMPLEMENTATION OF SPARSE DEEP NEURAL NETWORKS Speaker: Abhiroop Bhattacharjee, Yale University, US Authors: Abhiroop Bhattacharjee1, Lakshya Bhatnagar2 and Priyadarshini Panda1 1Yale University, US; 2IIT Delhi, IN Abstract Recently, several structured pruning techniques have been introduced for energy-efficient implementation of Deep Neural Networks (DNNs) with a smaller number of crossbars. Although these techniques claim to preserve the accuracy of sparse DNNs on crossbars, none has studied the impact of the inexorable crossbar non-idealities on the actual performance of the pruned networks. To this end, we perform a comprehensive study to show how highly sparse DNNs, that result in significant crossbar-compression-rates, can lead to severe accuracy losses compared to unpruned DNNs mapped onto non-ideal crossbars. We perform experiments with multiple structured-pruning approaches (such as C/F pruning, XCS and XRS) on VGG11 and VGG16 DNNs with benchmark datasets (CIFAR10 and CIFAR100). We propose two mitigation approaches - Crossbar-column rearrangement and Weight-Constrained-Training (WCT) - that can be integrated with the crossbar-mapping of the sparse DNNs to minimize accuracy losses incurred by the pruned models. These help in mitigating non-idealities by increasing the proportion of low conductance synapses on crossbars, thereby improving their computational accuracies. |
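To make the ternarization idea behind IP.3_2.2 concrete, the following sketch shows a common threshold-based weight ternarization, mapping weights to {-α, 0, +α}. The 0.7·mean(|W|) threshold is a heuristic borrowed from the ternary-network literature and is only an assumption here; it is not the learnable quantizer proposed by TAS.

```python
# Minimal sketch of threshold-based weight ternarization, assuming the common
# 0.7*mean(|W|) threshold heuristic. Not the TAS learnable quantizer.
import numpy as np

def ternarize(weights: np.ndarray) -> np.ndarray:
    delta = 0.7 * np.mean(np.abs(weights))            # ternarization threshold
    mask = np.abs(weights) > delta                    # weights that stay nonzero
    alpha = np.mean(np.abs(weights[mask])) if mask.any() else 0.0
    return alpha * np.sign(weights) * mask            # values in {-alpha, 0, +alpha}

w = np.random.randn(4, 4)
print(ternarize(w))
```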
IP.3_3 Interactive presentations
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.3_3.1 | CROSS-LEVEL PROCESSOR VERIFICATION VIA ENDLESS RANDOMIZED INSTRUCTION STREAM GENERATION WITH COVERAGE-GUIDED AGING Speaker: Niklas Bruns, University of Bremen, DE Authors: Niklas Bruns1, Vladimir Herdt2, Eyck Jentzsch3 and Rolf Drechsler4 1University of Bremen, DE; 2DFKI, DE; 3MINRES Technologies GmbH, DE; 4University of Bremen/DFKI, DE Abstract We propose a novel cross-level verification approach for processor verification at the Register-Transfer Level (RTL). The foundation is a randomized coverage-guided instruction stream generator that produces one endless and unrestricted instruction stream that evolves dynamically at runtime. We leverage an Instruction Set Simulator (ISS) as a reference model in a tight co-simulation setting. Coverage information is continuously updated based on the execution state of the ISS, and we employ Coverage-guided Aging to smooth out the coverage distribution of the randomized instruction stream over time. Our case study with an industrial pipelined 32-bit RISC-V processor demonstrates the effectiveness of our approach. |
IP.3_3.2 | HARDWARE ACCELERATION OF EXPLAINABLE MACHINE LEARNING Speaker: Prabhat Mishra, University of Florida, US Authors: Zhixin Pan and Prabhat Mishra, University of Florida, US Abstract Machine learning (ML) is successful in achieving human-level performance in various fields. However, it lacks the ability to explain an outcome due to its black-box nature. While recent efforts on explainable ML have received significant attention, the existing solutions are not applicable in real-time systems since they cast interpretability as an optimization problem, which leads to numerous iterations of time-consuming complex computations. To make matters worse, existing implementations are not amenable to hardware-based acceleration. In this paper, we propose an efficient framework to enable acceleration of the explainable ML procedure with hardware accelerators. We explore the effectiveness of both Tensor Processing Unit (TPU) and Graphics Processing Unit (GPU) based architectures in accelerating explainable ML. Specifically, this paper makes three important contributions. (1) To the best of our knowledge, our proposed work is the first attempt in enabling hardware acceleration of explainable ML. (2) Our proposed solution exploits the synergy between matrix convolution and Fourier transform, and therefore, it takes full advantage of TPU’s inherent ability in accelerating matrix computations. (3) Our proposed approach can lead to real-time outcome interpretation. Extensive experimental evaluation demonstrates that the proposed approach deployed on a TPU can provide drastic improvement in interpretation time (39x on average) as well as energy efficiency (69x on average) compared to existing acceleration techniques. |
IP.3_3.3 | FAST SIMULATION OF FUTURE 128-BIT ARCHITECTURES Speaker: Frédéric Pétrot, University Grenoble Alpes, Grenoble INP, FR Authors: Fabien Portas1 and Frédéric Pétrot2 1TIMA lab, University Grenoble Alpes, CNRS, Grenoble-INP, FR; 2TIMA Lab, Université Grenoble Alpes, FR Abstract Whether 128-bit architectures will some day hit the market or not is an open question. There is however a trend towards that direction: virtual addresses grew from 34 to 48 bits in 1999 and then to 57 bits in 2019. The impact of a virtually infinite addressable space on software is hard to predict, but it will most likely be major. Simulation tools are therefore needed to support research and experimentation for tooling and software. In this paper, we present the implementation of the 128-bit extension of the RISC-V architecture in the QEMU functional simulator and report first performance evaluations. On our limited set of programs, simulation is slowed down by a factor of at worst 5 compared to 64-bit simulation, making the tool still usable for executing large software codes. |
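The "synergy between matrix convolution and Fourier transform" mentioned in IP.3_3.2 refers to the convolution theorem: a convolution can be computed as an element-wise product in the frequency domain. The NumPy sketch below illustrates that identity only; it is not the paper's TPU implementation.

```python
# Sketch of the convolution theorem: 2-D convolution == element-wise
# multiplication in the Fourier domain (after zero-padding to the full size).
import numpy as np

def conv2d_fft(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    s = (image.shape[0] + kernel.shape[0] - 1,
         image.shape[1] + kernel.shape[1] - 1)       # full linear-convolution size
    return np.real(np.fft.ifft2(np.fft.fft2(image, s) * np.fft.fft2(kernel, s)))

img = np.random.rand(32, 32)
ker = np.random.rand(3, 3)
out = conv2d_fft(img, ker)                            # shape (34, 34)
```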
IP.3_4 Interactive presentations
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.3_4.1 | A GENERATIVE AI FOR HETEROGENEOUS NETWORK-ON-CHIP DESIGN SPACE PRUNING Speaker: Maxime France-Pillois, LIRMM, FR Authors: Maxime Mirka1, Maxime France-Pillois1, Gilles Sassatelli2 and Abdoulaye Gamatie3 1LIRMM CNRS / University of Montpellier, FR; 2LIRMM CNRS / University of Montpellier 2, FR; 3CNRS LIRMM / University of Montpellier, FR Abstract Often suffering from under-optimization, Networks-on-Chip (NoCs) heavily impact the efficiency of domain-specific Systems-on-Chip. To cope with this issue, heterogeneous NoCs are promising alternatives. Nevertheless, the design of optimized NoCs satisfying multiple performance objectives, e.g. throughput, power and area, is extremely challenging and requires significant expertise. While some approaches have been proposed to deal with the design space of NoCs, most fail to meet some expectations such as tractable exploration time and handling of multi-objective optimization. In this paper, we propose an approach based on generative artificial intelligence to help pruning complex design spaces for heterogeneous NoCs, according to configurable performance objectives. This is made possible by the ability of Generative Adversarial Networks to learn and generate relevant design candidates for the target NoCs. The speed and flexibility of our solution enable a fast generation of optimized NoCs that fit users' expectations. Through some experiments, we show how to obtain competitive NoC designs reducing the power consumption with no communication performance or area penalty compared to a given conventional NoC design. |
IP.3_4.2 | SPARROW: A LOW-COST HARDWARE/SOFTWARE CO-DESIGNED SIMD MICROARCHITECTURE FOR AI OPERATIONS IN SPACE PROCESSORS Speaker: Marc Solé Bonet, Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES Authors: Marc Solé Bonet and Leonidas Kosmidis, Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES Abstract Recently there is an increasing interest in the use of artificial intelligence for on-board processing as indicated by the latest space missions, which cannot be satisfied by existing low-performance space-qualified processors. Although COTS AI accelerators can provide the required performance, they are not designed to meet space requirements. In this work, we co-design a low-cost SIMD micro-architecture integrated in a space qualified processor, which can significantly increase its performance. Our solution has no impact on the processor's 100 MHz frequency and consumes minimal area thanks to its innovative design compared to conventional vector micro-architectures. For the minimum configuration of our baseline space processor, our results indicate a performance boost of up to 9.3x for commonly used AI-related and image processing algorithms and 5.5x faster for a complex, space-relevant inference application with just 30% area increase. |
IP.3_4.3 | A PLUGGABLE VECTOR UNIT FOR RISC-V VECTOR EXTENSION Speaker: Vincenzo Maisto, Hensoldt Cyber GmbH, and University of Naples Federico II, IT Authors: Vincenzo Maisto1 and Alessandro Cilardo2 1University of Naples Federico II and Hensoldt Cyber GmbH, IT; 2University of Naples Federico II, IT Abstract Vector extensions have become increasingly important for accelerating data-parallel applications in areas like multimedia, data-streaming, and Machine Learning. This interactive presentation introduces a microarchitectural design of a vector unit compliant with the RISC-V vector extension v1.0. While we targeted a specific core for demonstration, CVA6, our architecture is designed so as to ensure extensibility, maintainability, and re-usability in other cores. Furthermore, as a distinctive feature, we support speculative execution and precise vector traps. The paper provides an overview of the main motivation, design choices, and implementation details, followed by a qualitative and quantitative discussion of the results collected from the synthesis of the extended CVA6 RISC-V core. |
IP.3_5 Interactive presentations
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.3_5.1 | ROBUST RECONFIGURABLE SCAN NETWORKS Speaker: Natalia Lylina, University of Stuttgart, DE Authors: Natalia Lylina, Chih-Hao Wang and Hans-Joachim Wunderlich, University of Stuttgart, DE Abstract Reconfigurable Scan Networks (RSNs) access the evaluation results from embedded instruments and control their operation throughout the device lifetime. At the same time, a single fault in an RSN may dramatically reduce the accessibility of the instruments. During post-silicon validation, it may prevent extracting the complete data from a device. During online operation, the inaccessibility of runtime-critical instruments via a defective RSN may eventually result in a system failure. This paper addresses both scenarios above by presenting robust RSNs. We show that by making a small number of carefully selected spots in RSNs more robust, the entire access mechanism becomes significantly more reliable. A flexible cost function assesses the importance of specific control primitives for the overall accessibility of the instruments. Following the cost function, a minimized number of spots is hardened against permanent faults. All the critical instruments as well as most of the remaining instruments are accessible through the resulting RSNs even in the presence of defects. In contrast to state-of-the-art fault-tolerant RSNs, the presented scheme does not change the RSN topology and needs less hardware overhead. Selective hardening is formulated as a multi-objective optimization problem and solved by using an evolutionary algorithm. The experimental results validate the efficiency and the scalability of the approach. |
IP.3_5.2 | SYNCLOCK: RF TRANSCEIVER SECURITY USING SYNCHRONIZATION LOCKING Speaker: Alan Rodrigo Díaz Rizo, Sorbonne University, CNRS, LIP6, FR Authors: Alan Rodrigo Díaz Rizo, Hassan Aboushady and Haralampos-G. Stratigopoulos, Sorbonne Université, CNRS, LIP6, FR Abstract We present an anti-piracy locking-based design methodology for RF transceivers, called SyncLock. SyncLock acts on the synchronization of the transmitter with the receiver. If a key other than the secret one is applied, the synchronization and, thereby, the communication fail. SyncLock is implemented using a novel locking concept. A hard-coded error is hidden in the design while the unlocking, i.e., the error correction, takes place at another part of the design upon application of the secret key. SyncLock presents several advantages. It is generally applicable, incorrect keys result in denial-of-service, it incurs no performance penalty and minimal overheads, and it offers maximum security, thwarting all known counter-attacks. We demonstrate SyncLock with hardware measurements. |
IP.3_5.3 | DEEP REINFORCEMENT LEARNING FOR ANALOG CIRCUIT STRUCTURE SYNTHESIS Speaker: Zhenxin Zhao, Memorial University of Newfoundland, CA Authors: Zhenxin Zhao and Lihong Zhang, Memorial University of Newfoundland, CA Abstract This paper presents a novel deep-reinforcement-learning-based method for analog circuit structure synthesis. It behaves like a designer, who learns from trials, derives design knowledge and experience, and evolves gradually to eventually figure out a way to construct circuit structures that can meet the given design specifications. Necessary design rules are defined and applied to set up the specialized environment of reinforcement learning in order to reasonably construct circuit structures. The produced circuit structures are then verified by the simulation-in-loop sizing. In addition, hash table and symbolic analysis techniques are employed to significantly promote the evaluation efficiency. Our experimental results demonstrate the sound efficiency, strong reliability, and wide applicability of the proposed method. |
IP.3_6 Interactive presentations
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.3_6.1 | COMPATIBILITY CHECKING FOR AUTONOMOUS LANE-CHANGING ASSISTANCE SYSTEMS Speaker: Chung-Wei Lin, National Taiwan University, TW Authors: Po-Yu Huang1, Kai-Wei Liu1, Zong-Lun Li1, Sanggu Park2, Edward Andert2, Chung-Wei Lin1 and Aviral Shrivastava2 1National Taiwan University, TW; 2Arizona State University, US Abstract Different types of lane-changing assistance systems are usually developed separately by different automotive makers or suppliers. A lane-changing model can meet its own requirements, but it may be incompatible with another lane-changing model. In this paper, we verify if two lane-changing models are compatible so that the two corresponding vehicles on different lanes can exchange their lanes successfully. We propose a methodology and an algorithm to perform the verification on the combinations of four lane-changing models. Experimental results demonstrate the compatibility (or incompatibility) between the models. The verification results can be utilized during runtime to prevent incompatible vehicles from entering a lane-changing road segment. To the best of our knowledge, this is the first work considering the compatibility issue for lane-changing models. |
IP.3_6.2 | PAXC: A PROBABILISTIC-ORIENTED APPROXIMATE COMPUTING METHODOLOGY FOR ANN LEARNING Speaker: Pengfei Huang, Nanjing University of Aeronautics and Astronautics, CN Authors: Pengfei Huang, Chenghua Wang, Ke Chen and Weiqiang Liu, Nanjing University of Aeronautics and Astronautics, CN Abstract In spite of the rapidly increasing number of approximate designs in the circuit logic stack for Artificial Neural Network (ANN) learning, a principled and systematic approximate hardware methodology incorporating domain knowledge is still lacking. As the layers of an ANN become deeper, the errors introduced by approximate hardware accumulate quickly, which can lead to unexpected results. In this paper, we propose a probabilistic-oriented approximate computing (PAxC) methodology based on the notion of approximate probability to overcome the conceptual and computational difficulties inherent to probabilistic ANN learning. The PAxC makes use of minimum likelihood error at both the circuit and application levels to maintain aggressive approximate datapaths and boost the benefits from the trade-off between accuracy and energy. Compared with a baseline design, the proposed method significantly reduces the power-delay product (PDP) with a negligible accuracy loss. Simulation and a case study of image processing validate the effectiveness of the proposed methodology. |
IP.3_6.3 | LAC: LEARNED APPROXIMATE COMPUTING Speaker: Tianmu Li, University of California, Los Angeles, US Authors: Vaibhav Gupta1, Tianmu Li2 and Puneet Gupta1 1UCLA, US; 2University of California, Los Angeles, US Abstract Approximate hardware trades acceptable error for improved performance, and previous literature focuses on optimizing this trade-off in the hardware. We show in this paper that the application (i.e., the software) can be optimized for better accuracy without losing any performance benefits of the approximate hardware. We propose LAC: learned approximate computing as a method of tuning the application parameters to compensate for hardware errors. Our approach showed improvements across a variety of standard signal/image processing applications, delivering an average improvement of 5.82 dB in PSNR and 0.23 in SSIM of the outputs. This translates to up to 87% power reduction and 83% area reduction for similar application quality. LAC allows the same approximate hardware to be used for multiple applications. |
IP.3_7 Interactive presentations
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.3_7.1 | EVA-CAM: A CIRCUIT/ARCHITECTURE-LEVEL EVALUATION TOOL FOR GENERAL CONTENT ADDRESSABLE MEMORIES Speaker: Liu Liu, University of Notre Dame, US Authors: Liu Liu1, Mohammad Mehdi Sharifi1, Ramin Rajaei2, Arman Kazemi1, Kai Ni3, Xunzhao Yin4, Michael Niemier1 and X. Sharon Hu1 1University of Notre Dame, US; 2Department of Computer Science and Engineering, University of Notre Dame, US; 3Rochester Institute of Technology, US; 4Zhejiang University, CN Abstract Content addressable memories (CAMs), a special-purpose in-memory computing (IMC) unit, support parallel searches directly in memory. There is growing interest in CAMs for data-intensive applications such as machine learning and bioinformatics. The design space for CAMs is rapidly expanding. In addition to traditional ternary CAMs (TCAMs), analog CAM (ACAM) and multi-bit CAM (MCAM) designs based on various non-volatile memory (NVM) devices have been recently introduced and may offer higher density, better energy efficiency, and non-volatility. Furthermore, aside from the widely-used exact match based search, CAM-based approximate matches have been proposed to further extend the utility of CAMs to new application spaces. For this memory architecture, evaluating different CAM design options for a given application is becoming more challenging. This paper presents Eva-CAM, a circuit/architecture-level modeling and evaluation tool for CAMs. Eva-CAM supports TCAM, ACAM, and MCAM designs implemented in non-volatile memories, for both exact and approximate match types. It also allows for the exploration of CAM array structures and sensing circuits. Eva-CAM has been validated with HSPICE simulation results and chip measurements. A comprehensive case study is described for FeFET CAM design space exploration. |
IP.3_7.2 | HYBRID DIGITAL-DIGITAL IN-MEMORY COMPUTING Speaker: Muhammad Rashedul Haq Rashed, University of Central Florida, US Authors: Muhammad Rashedul Haq Rashed1, Sumit Kumar Jha2, Fan Yao1 and Rickard Ewetz1 1University of Central Florida, US; 2University of Texas at San Antonio, US Abstract In-memory computing (IMC) using emerging non-volatile memory promises exascale computing capabilities for a number of data-intensive workloads. The state-of-the-art solution to accelerating high-assurance applications is based on digital in-memory computing. Digital in-memory computing can be WRITE-based or READ-based, i.e., logic is evaluated while switching or without switching the state of the non-volatile resistive devices. All prominent studies for accelerating matrix-vector multiplication (MVM) based applications utilize a single digital logic style. However, we observe that WRITE-based and READ-based digital in-memory computing are advantageous for dense and sparse matrices, respectively. In this paper, we propose a new computing paradigm called hybrid digital-digital in-memory computing. The paper also introduces an automated synthesis tool for mapping computation to a hybrid architecture. The key idea is to first decompose the matrix into dense and sparse blocks. Next, bit-slicing is used to further decompose the dense blocks into sparse and dense parts. The dense (sparse) blocks are mapped to WRITE-based (READ-based) digital in-memory accelerators. The proposed paradigm is evaluated using 12 applications from various domains. Compared with WRITE-based IMC, the hybrid digital-digital paradigm improves energy and speed by 13X and 20X, at the expense of increasing the area by 151X. Compared with READ-based IMC, the hybrid paradigm improves energy, speed, and area by 264X, 198X, and 2996X, respectively. |
IP.3_7.3 | NEUROHAMMER: INDUCING BIT-FLIPS IN MEMRISTIVE CROSSBAR MEMORIES Speaker: Felix Staudigl, Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE Authors: Felix Staudigl1, Hazem al Indari1, Daniel Schön2, Dominik Sisejkovic1, Farhad Merchant1, Jan Moritz Joseph1, Vikas Rana3, Stephan Menzel2 and Rainer Leupers1 1Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE; 2Peter Grünberg Institut (PGI-7), Forschungszentrum Jülich GmbH & JARA-FIT, DE; 3Peter Grünberg Institut (PGI-10), Forschungszentrum Jülich GmbH, DE Abstract Emerging non-volatile memory (NVM) technologies offer unique advantages in energy efficiency, latency, and features such as computing-in-memory. Consequently, emerging NVM technologies are considered an ideal substrate for computation and storage in future-generation neuromorphic platforms. These technologies need to be evaluated for fundamental reliability and security issues. In this paper, we present NeuroHammer, a security threat in ReRAM crossbars caused by thermal crosstalk between memory cells. We demonstrate that bit-flips can be deliberately induced in ReRAM devices in a crossbar by systematically writing adjacent memory cells. A simulation flow is developed to evaluate NeuroHammer and the impact of physical parameters on the effectiveness of the attack. Finally, we discuss the security implications in the context of possible attack scenarios. |
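As background for the CAM designs evaluated in IP.3_7.1, the short model below illustrates ternary-CAM match semantics: each stored row may contain "don't care" bits, and every row matching a query is returned (in hardware, this search happens in parallel across all rows). It is an illustrative software model only, not part of Eva-CAM.

```python
# Minimal software model of ternary-CAM (TCAM) exact-match semantics.
# Rows are strings over {'0','1','X'}, where 'X' means "don't care".
def tcam_match(rows, query):
    """Return the indices of all stored rows that match the binary query."""
    def matches(row):
        return all(r == 'X' or r == q for r, q in zip(row, query))
    return [i for i, row in enumerate(rows) if matches(row)]

table = ["10XX", "1101", "0XX1"]
print(tcam_match(table, "1011"))   # -> [0]; only the first row matches
```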
IP.3_8 Interactive presentations
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 11:30 CET - 12:15 CET
Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.
Label | Presentation Title Authors |
---|---|
IP.3_8.1 | A LOW-COST METHODOLOGY FOR EM FAULT EMULATION ON FPGA Speaker: Paolo Maistri, TIMA Laboratory, FR Authors: Paolo Maistri and Jiayun Po, TIMA Laboratory, FR Abstract In embedded systems, the presence of a security layer is now a well-established requirement. In order to guarantee the suitable level of performance and resistance against attacks, dedicated hardware implementations are often proposed to accelerate cryptographic computations in a controllable environment. On the other hand, these same implementations may be vulnerable to physical attacks, such as side channel analysis or fault injections. In this scenario, the designer must hence be able to assess the robustness of the implementation (and of the adopted countermeasures) as soon as possible in the design flow against several different threats. In this paper, we propose a methodology to characterize the robustness of a generic hardware design described at RTL against EM fault injections. Thanks to our framework, we are able to emulate the EM faults on FPGA platforms, without the need for expensive equipment or lengthy experimental campaigns. We present a tool supporting our methodology and the first validation tests done on several AES designs, confirming the feasibility of the proposed approach. |
IP.3_8.2 | RELIABILITY ANALYSIS OF FINFET-BASED SRAM PUFS FOR 16NM, 14NM, AND 7NM TECHNOLOGY NODES Speaker: Shayesteh Masoumian, Intrinsic ID, NL Authors: Shayesteh Masoumian1, Georgios Selimis1, Rui Wang1, Geert-Jan Schrijen1, Said Hamdioui2 and Mottaqiallah Taouil2 1Intrinsic ID, NL; 2Delft University of Technology, NL Abstract SRAM Physical Unclonable Functions (PUFs) are today commercially used, among other things, for security primitives such as key generation and authentication. The quality of the PUFs, and hence of the security primitives, depends on intrinsic variations which are technology dependent. Therefore, to sustain the commercial usage of PUFs for cutting-edge technologies, it is important to properly model and evaluate their reliability. In this work, we evaluate the SRAM PUF reliability using the within-class Hamming distance (WCHD) for 16nm, 14nm, and 7nm using simulations and silicon validation for both low-power and high-performance designs. The results show that our simulation models and expectations match the silicon measurements. From the experiments, we conclude the following: (1) SRAM PUF is reliable in advanced FinFET technology nodes, i.e., the noise is low in 16nm, 14nm, and 7nm, (2) temperature variations have a marginal impact on the reliability, and (3) both low-power and high-performance SRAMs can be used as a PUF without excessive need for error-correcting codes (ECCs). |
IP.3_8.3 | BOILS: BAYESIAN OPTIMISATION FOR LOGIC SYNTHESIS Speaker: Antoine Grosnit, Huawei Noah's Ark Lab, FR Authors: Antoine Grosnit1, Cedric Malherbe2, Xingchen Wan1, Rasul Tutunov1, Jun Wang3 and Haitham Bou Ammar1 1Huawei R&D London, GB; 2Huawei R&D Paris, FR; 3University College London, GB Abstract Optimising the quality-of-results (QoR) of circuits during logic synthesis is a formidable challenge necessitating the exploration of exponentially sized search spaces. While expert-designed operations aid in uncovering effective sequences, the increase in complexity of logic circuits favours automated procedures. To enable efficient and scalable solvers, we propose BOiLS, the first algorithm adapting Bayesian optimisation to navigate the space of synthesis operations. BOiLS requires no human intervention and trades off exploration versus exploitation through novel Gaussian process kernels and trust-region constrained acquisitions. In a set of experiments on EPFL benchmarks, we demonstrate BOiLS's superior performance compared to state-of-the-art in terms of both sample efficiency and QoR values. |
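The within-class Hamming distance (WCHD) metric used in IP.3_8.2 can be illustrated in a few lines: it is the fraction of bits that differ between an enrollment readout of an SRAM PUF and a later re-evaluation of the same device. The readout size and noise level below are assumed values chosen only for illustration.

```python
# Sketch of the within-class (fractional) Hamming distance between a reference
# PUF readout and a noisy re-evaluation. The 1-kbit size and ~3% noise are
# assumptions for the example, not measured figures from the paper.
import numpy as np

def fractional_hd(a: np.ndarray, b: np.ndarray) -> float:
    return np.count_nonzero(a != b) / a.size

reference = np.random.randint(0, 2, 1024)              # enrollment readout (1 kbit)
noisy = reference.copy()
flip = np.random.choice(1024, size=30, replace=False)  # ~3% noisy cells (assumed)
noisy[flip] ^= 1
print(f"WCHD = {fractional_hd(reference, noisy):.3f}") # ~0.029
```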
L.2 Panel: The future of conferences - what will DATE and the others be like?
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 13:00 CET - 14:00 CET
Session chair:
Ian O'Connor, Lyon Institute of Nanotechnology, FR
Panellists:
David Atienza, École Polytechnique Fédérale de Lausanne (EPFL), CH
Enrico Macii, Politecnico di Torino, IT
Yiran Chen, Duke University, US
Tulika Mitra, National University of Singapore, SG
The panel aims at exploring how conferences will be organized and attended after the Covid-19 pandemic, in order to meet time and cost sustainability, attendees' interests and needs, as well as the opportunities offered by technology.
21.1 Self-adaptive and Dynamic Resource Management, Learning at the Edge and Applications
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Heba Khdr, Karlsruhe Institute of Technology, DE
Session co-chair:
Federico Corradi, iMEC, NL
Self-adaptive and runtime decision making is increasingly important for optimizing the extra-functional behaviour of modern systems. The first part of this session covers various techniques for optimizing specific objectives, such as performance, energy consumption, and accuracy, and applies these techniques to different parts of the systems. The second part of this session includes papers that advance the state of the art of machine learning and its applications at the edge. The fourth paper improves localization through an encoder-based framework, the fifth one proposes novel and efficient accelerator architectures for probabilistic reasoning models, and the sixth one presents a high-efficiency and low-cost framework for federated learning.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 21.1.1 | (Best Paper Award Candidate) ACCURATE PROBABILISTIC MISS RATIO CURVE APPROXIMATION FOR ADAPTIVE CACHE ALLOCATION IN BLOCK STORAGE SYSTEMS Speaker: Yingtian Tang, University of Pennsylvania, US Authors: Rongshang Li1, Yingtian Tang2, QIQUAN SHI3, Hui Mao4, Lei Chen4, Jikun Jin5, Peng Lu5 and Zhuo Cheng6 1University of Sydney, AU; 2University of Pennsylvania, US; 3Huawei Noah's Ark Lab, HK; 4Huawei Noah's Ark Lab, CN; 5Huawei Storage Product Line, CN; 6Tsinghua University and Huawei Storage Product Line, CN Abstract Cache plays an important role in storage systems. With better allocation of cache space to each storage device, total I/O latency can be reduced remarkably. To achieve this goal, we propose an Accurate Probabilistic miss ratio curve approximation for Adaptive Cache allocation (APAC) system. APAC can obtain near-optimal performance for allocating cache space with low overhead. Specifically, with a linear-time probabilistic approximation of reuse distance of all blocks inside each device, APAC can accurately estimate the miss ratio curve (MRC). Furthermore, APAC utilizes the MRCs to obtain the near-optimal configuration of cache allocation by dynamic programming. Experimental results show that APAC achieves higher accuracy in MRC approximation compared to the state-of-the-art methods, leading to higher hit ratio and lower latency of the block storage systems. |
14:34 CET | 21.1.2 | SGRM: STACKELBERG GAME-BASED RESOURCE MANAGEMENT FOR EDGE COMPUTING SYSTEMS Speaker: Manolis Katsaragakis, National TU Athens and KU Leuven, GR Authors: Antonios Karteris1, Manolis Katsaragakis2, Dimosthenis Masouros1 and Dimitrios Soudris1 1National TU Athens, GR; 2National TU Athens and KU Leuven, GR Abstract The incessant technological advancements of recent Internet of Things (IoT) networks have led to a rapidly increasing number of connected devices and workloads. Resource management is a key technique for such systems to operate efficiently. In this paper, we present SGRM, a game theory-based framework for dynamic resource management of IoT networks under CPU, memory, bandwidth and latency constraints. SGRM combines a novel execution time prediction mechanism along with Stackelberg games and Vickrey auctions in order to tackle the multi-objective problem of task offloading in a competitive Edge Computing system. We design, implement and evaluate our novel game theory-based framework over a real IoT system for a diverse set of interference scenarios and varying devices, showing that i) the proposed prediction mechanism can provide accurate predictions, achieving 2.3% absolute percentage error on average, ii) SGRM achieves near-optimal results and outperforms alternative solutions by up to 66.6% and iii) SGRM provides scalable, real-time and lightweight performance characteristics. |
14:38 CET | 21.1.3 | RUNTIME ENERGY MINIMIZATION OF DISTRIBUTED MANY-CORE SYSTEMS USING TRANSFER LEARNING Speaker: Dainius Jenkus, Newcastle University, GB Authors: Dainius Jenkus, Fei Xia, Rishad Shafik and Alex Yakovlev, Newcastle University, GB Abstract The heterogeneity of computing resources continues to permeate into many-core systems making energy-efficiency a challenging objective. Existing rule-based and model-driven methods return sub-optimal energy-efficiency and limited scalability as system complexity increases to the domain of distributed systems. This is exacerbated further by dynamic variations of workloads and quality-of-service (QoS) demands. This work presents a QoS-aware runtime management method for energy minimization using a transfer learning (TL) driven exploration strategy. It enhances standard Q-learning to improve both learning speed and operational optimality (i.e., QoS and energy). The core to our approach is a multi-dimensional knowledge transfer across a task's state-action space. It accelerates the learning of dynamic voltage/frequency scaling (DVFS) control actions for tuning power/performance trade-offs. Firstly, the method identifies and transfers already learned policies between explored and behaviorally similar states referred to as Intra-Task Learning Transfer (ITLT). Secondly, if no similar “expert” states are available, it accelerates exploration at a local state's level through what’s known as Intra-State Learning Transfer (ISLT). A comparative evaluation of the approach indicates faster and more balanced exploration. This is shown through energy savings ranging from 7.30% to 18.06%, and improved QoS from 10.43% to 14.3%, when compared to existing exploration strategies. This method is demonstrated under WordPress and TensorFlow workloads on a server cluster. |
14:42 CET | 21.1.4 | SIAMESE NEURAL ENCODERS FOR LONG-TERM INDOOR LOCALIZATION WITH MOBILE DEVICES Speaker: Saideep Tiku, Colorado State University, US Authors: Saideep Tiku and Sudeep Pasricha, Colorado State University, US Abstract WiFi fingerprinting-based indoor localization on smartphones is an emerging application domain for enhanced positioning and tracking of people and assets within indoor locales. Unfortunately, the transmitted signal characteristics from independently maintained WiFi access points (APs) vary greatly over time. Moreover, some of the WiFi APs visible at the initial deployment phase may be replaced or removed over time. These factors are often ignored and cause gradual and cata-strophic degradation of indoor localization accuracy post-deployment, over weeks and months. We propose a Siamese neural encoder-based framework that offers up to 40% reduction in degradation of localization accuracy over time compared to the state-of-the-art in the area, without requiring any re-training. |
14:46 CET | 21.1.5 | DISCRETE SAMPLERS FOR APPROXIMATE INFERENCE IN PROBABILISTIC MACHINE LEARNING Speaker: Shirui Zhao, KU Leuven, BE Authors: Shirui Zhao1, Nimish Shah1, Wannes Meert2 and Marian Verhelst3 1Department of Electrical Engineering, ESAT-MICAS, KU Leuven, BE; 2Departement of Computer Science, KU Leuven, BE; 3KU Leuven, BE Abstract Probabilistic reasoning models (PMs) and probabilistic inference bring advantages when dealing with small datasets or uncertainty on the observed data, and allow to integrate expert knowledge and create interpretable models. The main challenge of using these PMs in practice is that their inference is very compute-intensive. Therefore, custom hardware architectures for the exact and approximate inference of PMs have been proposed in the SotA. The throughput, energy and area efficiency of approximate PM inference accelerators are strongly dominated by the sampler blocks required to sample arbitrary discrete distributions. This paper proposes and studies novel discrete sampler architectures towards efficient and flexible hardware implementations for PM accelerators. Both cumulative distribution table (CDT) and Knuth-Yao (KY) based sampling algorithms are assessed, based on which different sampler hardware architectures were implemented. Innovation is brought in terms of a reconfigurable CDT sampling architecture with a flexible range and a reconfigurable Knuth-Yao sampling architecture that supports both flexible range and dynamic precision. All architectures are benchmarked on real-world Bayesian Networks, demonstrating up to 13x energy efficiency benefits and 11x area efficiency improvement of the optimized reconfigurable Knuth-Yao sampler over the traditional linear CDT-based samplers used in the PM SotA. |
14:50 CET | 21.1.6 | HELCFL: HIGH-EFFICIENCY AND LOW-COST FEDERATED LEARNING IN HETEROGENEOUS MOBILE-EDGE COMPUTING Speaker: Yangguang Cui, East China Normal University, CN Authors: Yangguang Cui1, Kun Cao2, Junlong Zhou3 and Tongquan Wei1 1East China Normal University, CN; 2Jinan University, CN; 3Nanjing University of Science and Technology, CN Abstract Federated Learning (FL), an emerging distributed machine learning (ML), empowers a large number of embedded devices (e.g., phones and cameras) and a server to jointly train a global ML model without centralizing user private data on a server. However, when deploying FL in a mobile-edge computing (MEC) system, restricted communication resources of the MEC system, heterogeneity and constrained energy of user devices have a severe impact on FL training efficiency. To address these issues, in this article, we design a distinctive FL framework, called HELCFL, to achieve high-efficiency and low-cost FL training. Specifically, by analyzing the theoretical foundation of FL, our HELCFL first develops a utility-driven and greedy-decay user selection strategy to enhance FL performance and reduce training delay. Subsequently, by analyzing and utilizing the slack time in FL training, our HELCFL introduces a device operating frequency determination approach to reduce training energy costs. Experiments verify that our HELCFL can enhance the highest accuracy by up to 43.45%, realize the training speedup of up to 275.03%, and save up to 58.25% training energy costs compared to state-of-the-art baselines. |
14:54 CET | 21.1.7 | Q&A SESSION Authors: Heba Khdr1 and Federico Corradi2 1Karlsruhe Institute of Technology, DE; 2IMEC, NL Abstract Questions and answers with the authors |
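For readers unfamiliar with the sampler blocks discussed in presentation 21.1.5 above, the sketch below illustrates a plain cumulative-distribution-table (CDT) sampler, the kind of baseline against which reconfigurable Knuth-Yao designs are compared. The 16-bit fixed-point precision is an assumption; hardware implementations replace the scan loop with comparators.

```python
# Minimal sketch of CDT-based sampling of an arbitrary discrete distribution,
# assuming 16-bit fixed-point probabilities. Illustrative only.
import random

def build_cdt(probs, precision_bits=16):
    scale, acc, table = (1 << precision_bits), 0, []
    for p in probs:
        acc += round(p * scale)
        table.append(acc)
    table[-1] = scale                          # guard against rounding drift
    return table

def cdt_sample(table, precision_bits=16):
    r = random.getrandbits(precision_bits)     # uniform draw in [0, 2^16)
    for symbol, threshold in enumerate(table):
        if r < threshold:
            return symbol
    return len(table) - 1

cdt = build_cdt([0.5, 0.25, 0.125, 0.125])
counts = [0] * 4
for _ in range(10000):
    counts[cdt_sample(cdt)] += 1
print(counts)                                  # roughly [5000, 2500, 1250, 1250]
```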
21.2 Advances in defect detection and dependability
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Leticia Maria Bolzani Poehls, RWTH Aachen University, DE
Session co-chair:
Ernesto Sanchez, Politecnico di Torino, IT
This session addresses advances in defect detection and dependability improvement. We cover a wide range of aspects: from hotspot detection and variability reduction to minimize the influence of processing, up to design techniques that make latches resilient to radiation upsets of up to three nodes and improve security by masking the power signature.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 21.2.1 | HOTSPOT DETECTION VIA GRAPH NEURAL NETWORK Speaker: Shuyuan Sun, Fudan University, CN Authors: Shuyuan Sun1, Yiyang Jiang1, Fan Yang1, Bei Yu2 and Xuan Zeng1 1Fudan University, CN; 2The Chinese University of Hong Kong, HK Abstract Lithography hotspot detection is of great importance in chip manufacturing. It aims to find patterns that may incur defects in the early design stage. Inspired by the success of deep learning in computer vision, many works convert layouts into images, turning the hotspot detection problem into an image classification task. Traditional graph-based methods consume fewer computer resources and less detection time compared to image-based methods, but they have too many false alarms. In this paper, a hotspot detection approach via the graph neural network (GNN) is proposed. We also propose a novel representation model to map a layout to one graph, in which we introduce multi-dimensional features to encode components of the layout. Then we use a modified GNN to further process the extracted layout features and get an embedding of the local geometric relationship. Experimental results on the ICCAD2012 Contest benchmarks show our proposed approach can achieve over 10× speedup and fewer false alarms without loss of accuracy. On the ICCAD2020 benchmark, our model can achieve 2.10% higher accuracy compared with the previous approach. |
14:34 CET | 21.2.2 | FITACT: ERROR RESILIENT DEEP NEURAL NETWORKS VIA FINE-GRAINED POST-TRAINABLE ACTIVATION FUNCTIONS Speaker: Behnam Ghavami, Simon Fraser University, CA Authors: Behnam Ghavami1, Mani Sadati2, Zhenman Fang1 and Lesley Shannon1 1Simon Fraser University, CA; 2Independent Researcher, IR Abstract Deep neural networks (DNNs) are increasingly being deployed in safety-critical systems such as personal healthcare devices and self-driving cars. In such DNN-based systems, error resilience is a top priority since faults in DNN inference could lead to mispredictions and safety hazards. For latency-critical DNN inference on resource-constrained edge devices, it is nontrivial to apply conventional redundancy-based fault tolerance techniques. In this paper, we propose FitAct, a low-cost approach to enhance the error resilience of DNNs by deploying fine-grained post-trainable activation functions. The main idea is to precisely bound the activation value of each individual neuron via neuron-wise bounded activation functions, so that it could prevent the fault propagation in the network. To avoid complex DNN model re-training, we propose to decouple the accuracy training and resilience training, and develop a lightweight post-training phase to learn these activation functions with precise bound values. Experimental results on widely used DNN models such as AlexNet, VGG16, and ResNet50 demonstrate that FitAct outperforms state-of-the-art studies such as Clip-Act and Ranger in enhancing the DNN error resilience for a wide range of fault rates, while adding manageable runtime and memory space overheads. |
14:38 CET | 21.2.3 | WRAP: WEIGHT REMAPPING AND PROCESSING IN RRAM-BASED NEURAL NETWORK ACCELERATORS CONSIDERING THERMAL EFFECT Speaker: Ing-Chao Lin, National Cheng Kung University, TW Authors: Po-Yuan Chen, Fang-Yi Gu, Yu-Hong Huang and Ing-Chao Lin, National Cheng Kung University, TW Abstract Resistive random-access memory (RRAM) has shown great potential for computing in memory (CIM) to support the requirements of high memory bandwidth and low power in neuromorphic computing systems. However, the accuracy of RRAM-based neural network (NN) accelerators can degrade significantly due to the intrinsic statistical variations of the resistance of RRAM cells, as well as the negative effects of high temperatures. In this paper, we propose a subarray-based thermal-aware weight remapping and processing framework (WRAP) to map the weights of a neural network model into RRAM subarrays. Instead of dealing with each weight individually, this framework maps weights into subarrays and performs subarray-based algorithms to reduce computational complexity while maintaining accuracy under thermal impact. Experimental results demonstrate that using our framework, inference accuracy losses of four DNN models are less than 2% compared to the ideal results, and less than 1% with compensation applied, even when the surrounding temperature is around 360 K. |
14:42 CET | 21.2.4 | (Best Paper Award Candidate) SELF-TERMINATED WRITE OF MULTI-LEVEL CELL RERAM FOR EFFICIENT NEUROMORPHIC COMPUTING Speaker: Zongwu Wang, Shanghai Jiao Tong University, CN Authors: Zongwu Wang1, Zhezhi He1, Rui Yang1, Shiquan Fan2, Jie Lin3, Fangxin Liu1, Yueyang Jia1, Chenxi Yuan2, Qidong Tang1 and Li Jiang1 1Shanghai Jiao Tong University, CN; 2Xi’an Jiaotong University, CN; 3University of Central Florida, US Abstract The Resistive Random-Access-Memory (ReRAM) in crossbar structure has shown great potential in accelerating the vector-matrix multiplication, owing to the fascinating computing complexity reduction (from O(n^2) to O(1)). Nevertheless, the ReRAM cells still encounter device programming variation and resistance drifting during computation (known as read disturbance), which significantly hamper its analog computing precision. Inspired by prior precise memory programming works, we propose a Self-Terminating Write (STW) circuit for Multi-Level Cell (MLC) ReRAM. In order to minimize the area overhead, the design heavily reuses inherent computing peripherals (e.g., Analog-to-Digital Converter and Trans-Impedance Amplifier) in conventional dot-product engine. Thanks to the fast and precise programming capability of our design, the ReRAM cell can possess 4 linear distributed conductance levels, with minimum latency used for intermediate resistance refreshing. Our comprehensive cross-layer (device/circuit/architecture) simulation indicates that the proposed MLC STW scheme can effectively obtain 2-bit precision via a single programming pulse. Besides, our design outperforms the prior write & verify scheme by 4.7x and 2x in programming latency and energy, respectively. |
14:46 CET | 21.2.5 | SCLCRL: SHUTTLING C-ELEMENTS BASED LOW-COST AND ROBUST LATCH DESIGN PROTECTED AGAINST TRIPLE NODE UPSETS IN HARSH RADIATION ENVIRONMENTS Speaker: Aibin Yan, Anhui University, CN Authors: Aibin Yan1, Zhixing Li1, Shiwei Huang1, Zijie Zhai1, Xiangyu Cheng1, Jie Cui1, Tianming Ni2, Xiaoqing Wen3 and Patrick Girard4 1Anhui University, CN; 2Anhui Polytechnic University, CN; 3Kyushu Institute of Technology, JP; 4LIRMM / CNRS, FR Abstract As the CMOS technology is continuously scaling down, nano-scale integrated circuits are becoming susceptible to harsh-radiation induced soft errors, such as double-node upsets (DNUs) and triple-node upsets (TNUs). This paper presents a shuttling C-elements based low-cost and robust latch (namely SCLCRL) that can recover from any TNU in harsh radiation environments. The latch comprises seven primary storage nodes and seven secondary storage nodes. Each pair of primary nodes feeds a secondary node through one C-element (CE) and each pair of secondary nodes feeds a primary node through another CE, forming redundant feedback loops to robustly retain values. Simulation results validate all key TNU-recoverability features of the proposed latch. Simulation results also demonstrate that the proposed SCLCRL latch can approximately save 29% silicon area and 47% D-Q delay on average at the cost of moderate power, compared with the state-of-the-art TNU-recoverable reference latches of the same type. |
14:50 CET | 21.2.6 | LEAKAGE POWER ANALYSIS IN DIFFERENT S-BOX MASKING PROTECTION SCHEMES Speaker: Javad Bahrami, University of Maryland Baltimore County, US Authors: Javad Bahrami1, Mohammad Ebrahimabadi1, Jean Luc Danger2, Sylvain Guilley3 and Naghmeh Karimi1 1University of Maryland Baltimore County, US; 2Télécom ParisTech, FR; 3Secure-IC, FR Abstract Internet-of-Things (IoT) devices are natural targets for side-channel attacks. Still, side-channel leakage can be complex: its modeling can be assisted by statistical tools. Projection of the leakage into an orthonormal basis allows to understand its structure, typically linear (1st-order leakage) or non-linear (sometimes referred to as glitches). In order to ensure cryptosystems protection, several masking methods have been published. Unfortunately, they follow different strategies; thus it is hard to compare them. Namely, ISW is constructive, GLUT is systematic, RSM is a low-entropy version of GLUT, RSM-ROM is a further optimization aiming at balancing the leakage further, and TI aims at avoiding, by design, the leakage arising from the glitches. In practice, no study has compared these styles on an equal basis. Accordingly, in this paper, we present a consistent methodology relying on a Walsh-Hadamard transform in this respect. We consider different masked implementations of substitution boxes of PRESENT algorithm, as this function is the most leaking in symmetric cryptography. We show that ISW is the most secure among the considered masking implementations. For sure, it takes strong advantage of the knowledge of the PRESENT substitution box equation. Tabulated masking schemes appear as providing a lesser amount of security compared to unprotected counterparts. The leakage is assessed over time, i.e., considering device aging which contributes to mitigate the leakage differently according to the masking style. |
14:54 CET | 21.2.7 | Q&A SESSION Authors: Leticia Maria Bolzani Poehls1 and Ernesto Sanchez2 1RWTH Aachen University, DE; 2Politecnico di Torino, IT Abstract Questions and answers with the authors |
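The neuron-wise bounded activation idea behind presentation 21.2.2 (FitAct) can be pictured as a ReLU whose output is clamped to a per-neuron upper bound, so that a fault-corrupted value cannot propagate unbounded through the network. The PyTorch sketch below uses an assumed initial bound of 6.0; FitAct instead learns the bounds in a lightweight post-training phase, which is not reproduced here.

```python
# Illustrative sketch of a per-neuron bounded ReLU. The initial bound of 6.0 is
# an assumption; the paper learns precise bounds post-training.
import torch
import torch.nn as nn

class BoundedReLU(nn.Module):
    def __init__(self, num_neurons):
        super().__init__()
        # one trainable upper bound per neuron (initialised to an assumed 6.0)
        self.bound = nn.Parameter(torch.full((num_neurons,), 6.0))

    def forward(self, x):                          # x: (batch, num_neurons)
        return torch.minimum(torch.relu(x), self.bound)

act = BoundedReLU(4)
x = torch.tensor([[0.5, -1.0, 7.0, 1e6]])          # 1e6 emulates a fault-corrupted value
print(act(x))                                      # -> [[0.5, 0.0, 6.0, 6.0]]
```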
21.3 Real-Time Systems and Technology
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Renato Mancuso, Boston University, US
Session co-chair:
Yasmina Abdeddaim, UGE, FR
Modern embedded real-time systems are facing multiple challenges related to predictably scheduling concurrent and parallel task systems upon multi-core and heterogeneous platforms. In this session, we present a number of exciting papers addressing timing-predictability challenges related to memory-aware scheduling, parallel task systems, partitioning hypervisors, controller-area network (CAN) and energy optimization for embedded systems.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 21.3.1 | (Best Paper Award Candidate) CACHE-AWARE SCHEDULABILITY ANALYSIS OF PREM COMPLIANT TASKS Speaker: Syed Aftab Rashid, CISTER, ISEP Polytechnic Institute of Porto, PT Authors: Syed Aftab Rashid1, Muhammad Ali Awan1, Pedro Souto2, Konstantinos Bletsas1 and Eduardo Tovar1 1CISTER, ISEP Polytechnic Institute of Porto, PT; 2University of Porto, PT Abstract The Predictable Execution Model (PREM) is useful for mitigating inter-core interference due to shared resources such as the main memory. However, it is cache-agnostic, which makes schedulability analysis pessimistic, via overestimation of prefetches and write-backs. In response, we present cache-aware schedulability analysis for PREM tasks on fixed-task-priority partitioned multicores, that bounds the number of cache prefetches and write-backs. Our approach identifies memory blocks loaded in the execution of a previous scheduling interval of each task, that remain in the cache until its next scheduling interval. Doing so greatly reduces the estimated prefetches and write-backs. In experimental evaluations, our analysis improves the schedulability of PREM tasks by up to 55 percentage points. |
14:34 CET | 21.3.2 | RECONCILING QOS AND CONCURRENCY IN NVIDIA GPUS VIA WARP-LEVEL SCHEDULING Speaker: Jayati Singh, University of Illinois Urbana-Champaign, US Authors: Jayati Singh1, Ignacio Sañudo Olmedo2, Nicola Capodieci2, Andrea Marongiu2 and Marco Caccamo3 1University of Illinois Urbana-Champaign, US; 2University of Modena and Reggio Emilia, IT; 3TU Munich, DE Abstract The widespread deployment of NVIDIA GPUs in latency-sensitive systems today requires predictable GPU multi-tasking, which cannot be trivially achieved. The NVIDIA CUDA API allows programmers to easily exploit the processing power provided by these massively parallel accelerators and is one of the major reasons behind their ubiquity. However, NVIDIA GPUs and the CUDA programming model favor throughput instead of latency and timing predictability. Hence, providing real-time and quality-of-service (QoS) properties to GPU applications presents an interesting research challenge. Such a challenge is paramount when considering simultaneous multikernel (SMK) scenarios, wherein kernels are executed concurrently within each streaming multiprocessor (SM). In this work, we explore QoS-based fine-grained multitasking in SMK via job arbitration at the lowest level of the GPU scheduling hierarchy, i.e., between warps. We present QoS-aware warp scheduling (QAWS) and evaluate it against state-of-the-art, kernel-agnostic policies seen in NVIDIA hardware today. Since the NVIDIA ecosystem lacks a mechanism to specify and enforce kernel priority at the warp granularity, we implement and evaluate our proposed warp scheduling policy on GPGPU-Sim. QAWS not only improves the response time of the higher priority tasks but also has comparable or better throughput than the state-of-the-art policies. |
14:38 CET | 21.3.3 | COUNTING PRIORITY INVERSIONS: COMPUTING MAXIMUM ADDITIONAL CORE REQUESTS OF DAG TASKS Speaker: Morteza Mohaqeqi, Uppsala University, SE Authors: Morteza Mohaqeqi, Gaoyang Dai and Wang Yi, Uppsala University, SE Abstract Many parallel real-time applications can be modeled as DAG tasks. Guaranteeing timing constraints of such applications executed on multicore systems is challenging, especially for the applications with non-preemptive execution blocks. The existing approach for timing analysis of such tasks with sporadic release relies on computing a bound on the interfering workload on a task, which depends on the number of priority inversions the task may experience. The number of priority inversions, in turn, is a function of the total number of additional cores a task instance may request after each node spawning. In this paper, we show that the previously proposed polynomial-time algorithm to compute the maximum number of additional core requests of a DAG is not correct, providing a counter example. We show that the problem is in fact NP-hard. We then present an ILP formulation as an exact solution to the problem. Our evaluations show that the problem can be solved in a few minutes even for DAGs with hundreds of nodes. |
14:42 CET | 21.3.4 | SHYPER: AN EMBEDDED HYPERVISOR APPLYING HIERARCHICAL RESOURCE ISOLATION STRATEGIES FOR MIXED-CRITICALITY SYSTEMS Speaker: Siran Li, Beihang University, CN Authors: YiCong Shen, Lei Wang, YuanZhi Liang, SiRan Li and Bo Jiang, School of Computer Science and Engineering, Beihang University, CN Abstract With the development of the IoT, modern embedded systems are evolving to general-purpose and mixed-criticality systems, where virtualization has become the key to guarantee the isolation between tasks with different criticality. Traditional server-based hypervisors (KVM and Xen) are difficult to use in embedded scenarios due to performance and security reasons. As a result, several new hypervisors (Jailhouse and Bao) have been proposed in recent years, which effectively solve the problems above through static partitioning. However, this inflexible resource isolation strategy assumes no resource sharing across guests, which greatly reduces resource utilization and VM scalability. This prevents them from simultaneously fulfilling the differentiated demands from VMs conducting different tasks. This paper proposes an efficient and real-time embedded hypervisor "Shyper", aiming at providing differentiated services for VMs with different criticality. To achieve that, Shyper supports fine-grained hierarchical resource isolation strategies and introduces several novel "VM-Exit-less" real-time virtualization techniques, which grant users the flexibility to strike a trade-off between VM resource utilization and real-time performance. In this paper, we also compare Shyper with other mainstream hypervisors (KVM, Jailhouse, etc.) to evaluate its feasibility and effectiveness. |
14:46 CET | 21.3.5 | RESPONSE TIME ANALYSIS FOR ENERGY-HARVESTING MIXED-CRITICALITY SYSTEMS Speaker: Kankan Wang, Northeastern University, CN Authors: Kankan Wang, Yuhan Lin and Qingxu Deng, Northeastern University, CN Abstract With the increasing demand for real-time computing applications on energy-harvesting embedded devices, which are deployed wherever recharging is not possible or practical, worst-case performance analysis becomes crucial. However, it is difficult to bound the worst-case response time of tasks under both timing and energy constraints due to the uncertainty of the harvested energy. Motivated by this, this paper studies response time analysis for Energy-Harvesting Mixed-Criticality (EHMC) systems. We present a schedulability analysis algorithm that extends the Adaptive Mixed Criticality (AMC) approach to EHMC systems. Furthermore, we develop two response time bounds for it. To the best of our knowledge, this is the first work on response time analysis for EHMC systems. Finally, we examine both the effectiveness and the tightness of the bounds by experiments. |
14:50 CET | 21.3.6 | LATENCY ANALYSIS OF SELF-SUSPENDING TASK CHAINS Speaker: Tomasz Kloda, TU Munich, DE Authors: Tomasz Kloda1, Jiyang Chen2, Antoine Bertout3, Lui Sha2 and Marco Caccamo1 1TU Munich, DE; 2University of Illinois at Urbana-Champaign, US; 3LIAS, Université de Poitiers, ISAE-ENSMA, FR Abstract Many cyber-physical systems are offloading computation-heavy programs to hardware accelerators (e.g., GPU and TPU) to reduce execution time. These applications will self-suspend between offloading data to the accelerators and obtaining the returned results. Previous efforts have shown that self-suspending tasks can cause scheduling anomalies, but none has examined inter-task communication. This paper aims to explore self-suspending tasks' data chain latency with periodic activation and asynchronous message passing. We first present the cause for suspension-induced delays and a worst-case latency analysis. We then propose a rule for utilizing the hardware co-processors to reduce data chain latency, together with a schedulability analysis. Simulation results show that the proposed strategy can improve overall latency while preserving system schedulability. (For orientation, a sketch of the classic latency bound for asynchronous periodic chains follows this session's table.) |
14:54 CET | 21.3.7 | Q&A SESSION Authors: Renato Mancuso1 and Yasmina ABDEDDAÏM2 1Boston University, US; 2LIGM, Univ Gustave Eiffel, CNRS, FR Abstract Questions and answers with the authors |
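For readers who want a concrete reference point for the data-chain latency discussion in 21.3.6, the sketch below computes the classic end-to-end latency bound for a chain of periodic tasks communicating asynchronously through shared registers: the sum over the chain of each task's period plus its worst-case response time. This is a well-known baseline bound, not the suspension-aware analysis contributed by the paper, and the task parameters are invented for illustration.

```python
# Illustrative only: classic end-to-end latency bound for an asynchronous,
# register-based cause-effect chain of periodic tasks. Each task i has a
# period T_i and a worst-case response time R_i; in the worst case a sample
# waits almost a full period before the next task reads it, plus that task's
# response time. This is NOT the suspension-aware analysis of paper 21.3.6.

def chain_latency_bound(tasks):
    """tasks: ordered list of (period, wcrt) pairs along the chain."""
    return sum(period + wcrt for period, wcrt in tasks)

if __name__ == "__main__":
    # (period, worst-case response time) in milliseconds; values are invented.
    chain = [(10.0, 3.0), (20.0, 7.0), (10.0, 4.0)]
    print("End-to-end latency bound:", chain_latency_bound(chain), "ms")
```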
21.4 Defense Techniques for Secure and Trustworthy Systems
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 14:30 CET - 15:30 CET
Session chair:
Sophie Dupuis, LIRMM, University of Montpellier, FR
Session co-chair:
Elif Bilge Kavun, Univ Passau, DE
Building novel defense mechanisms to thwart attacks on real-world systems is vital, given the valuable assets in such systems that need to be protected. This session focuses on defense techniques providing countermeasures against side-channel attacks and hardware Trojans. The contributions in this session include conventional as well as novel machine learning methods for the detection of hardware Trojans, protection mechanisms against side-channel analysis of chips and neural networks, and the exploitation of sequentiality and synthesis flexibility in logic obfuscation for thwarting SAT attacks.
Time | Label | Presentation Title Authors |
---|---|---|
14:30 CET | 21.4.1 | COUNTERACT SIDE-CHANNEL ANALYSIS OF NEURAL NETWORKS BY SHUFFLING Speaker: Manuel Brosch, TU Munich, DE Authors: Manuel Brosch1, Matthias Probst1 and Georg Sigl2 1TU Munich, DE; 2TU Munich / Fraunhofer Institute for Applied and Integrated Security (AISEC), DE Abstract Machine learning is becoming an essential part of almost every electronic device. Implementations of neural networks are mostly targeted towards computational performance or memory footprint. Nevertheless, security is also important in order to keep the network secret and protect the intellectual property associated with it. In particular, since neural network implementations have been demonstrated to be vulnerable to side-channel analysis, powerful and computationally cheap countermeasures are in demand. In this work, we apply a shuffling countermeasure to a microcontroller implementation of a neural network to prevent side-channel analysis. The countermeasure is effective while the computational overhead is low. We investigate the extensions necessary for our countermeasure, and how shuffling increases the effort for an attack in theory. In addition, we demonstrate the increase in effort for an attacker through experiments on real side-channel measurements. Based on the mechanism of shuffling and our experimental results, we conclude that an attack on a commonly used neural network with shuffling is no longer feasible in a reasonable amount of time. (A generic illustration of the shuffling idea follows this session's table.) |
14:34 CET | 21.4.2 | GNN4GATE: A BI-DIRECTIONAL GRAPH NEURAL NETWORK FOR GATE-LEVEL HARDWARE TROJAN DETECTION Speaker: Dong Cheng, College of Computer and Data Science, Fuzhou University, Fuzhou, China, CN Authors: Dong Cheng1, Chen Dong1, Wenwu He2, Zhenyi Chen3 and Yi Xu1 1Fuzhou University, CN; 2Fujian University of Technology, CN; 3University of South Florida, US Abstract Hardware is the physical foundation of cyberspace, and chips are its core components; security risks in chips can therefore have severe, far-reaching consequences. Hardware Trojans (HTs) are malicious circuits and a primary security issue for chips. Recently, a series of machine learning-based HT detection methods were proposed. However, some shortcomings still deserve further consideration, such as relying too heavily on manual feature extraction, losing signal-propagation structure information, and difficulty in locating HTs and adapting to various HT types. To address the above challenges, this paper proposes a gate-level HT detection method based on Graph Neural Networks (GNNs), named GNN4Gate, which is a golden-free Trojan-gate identification technology. Specifically, a special coding method combining logic gate type and port connection information is developed for circuit graph modeling. Based on this, taking logic gates as the classification object, an automatic GNN detection architecture based on a Bi-directional Graph Convolutional Network (Bi-GCN) is developed to aggregate both the circuit signal propagation (forward) and dispersion (backward) structure features from the circuit graph. The proposed method is evaluated on Trusthub benchmarks with different functional HTs; the average True Positive Rate (Recall) is 87.14%, and the average True Negative Rate is 99.73%. The experimental results demonstrate that GNN4Gate is sufficiently accurate compared to the state-of-the-art detection works at gate level. |
14:38 CET | 21.4.3 | GOLDEN MODEL-FREE HARDWARE TROJAN DETECTION BY CLASSIFICATION OF NETLIST MODULE GRAPHS Speaker: Alexander Hepp, TU Munich, DE Authors: Alexander Hepp1, Johanna Baehr1 and Georg Sigl2 1TU Munich, DE; 2TU Munich/Fraunhofer AISEC, DE Abstract In a world where increasingly complex integrated circuits are manufactured in supply chains across the globe, hardware Trojans are an omnipresent threat. State-of-the-art methods for Trojan detection often require a golden model of the device under test. Other methods that operate on the netlist without a golden model can not handle complex designs and operate on Trojan-specific sets of netlist graph features. In this work, we propose a novel machine-learning-based method for hardware Trojan detection. Our method first uses a library of known malicious and benign modules in hierarchical designs to train an eXtreme Gradient Boosted Tree Classifier (XGBClassifier). For training, we generate netlist graphs of each hierarchical module and calculate feature vectors comprising structural characteristics of these graphs. After the training phase, we can analyze the synthesized hierarchical modules of an unknown design under test. The method calculates a feature vector for each module. With this feature vector, each module can be classified into either benign or malicious by the previously trained XGBClassifier. After classifying all modules, we derive a classification for all standard cells in the design under test. This technique allows the identification of hardware Trojan cells in a design and highlights regions of interest to direct further reverse engineering efforts. Experiments show that this approach performs with >97% Sensitivity and Specificity across available and generated hardware Trojan benchmarks and can be applied to more complex designs than previous netlist-based methods while maintaining similar computational complexity. |
14:42 CET | 21.4.4 | JANUS-HD: EXPLOITING FSM SEQUENTIALITY AND SYNTHESIS FLEXIBILITY IN LOGIC OBFUSCATION TO THWART SAT ATTACK WHILE OFFERING STRONG CORRUPTION Speaker: Leon Li, University of California, San Diego, US Authors: Leon Li1 and Alex Orailoglu2 1University of California, San Diego, US; 2UC San Diego, US Abstract Logic obfuscation has been proposed as a countermeasure against chip counterfeiting and IP piracy by obfuscating circuit designs with a key-controlled locking mechanism. However, the extensive output corruption of early key gate based logic obfuscation techniques has exposed them to effective SAT attacks. While current SAT resilient logic obfuscation techniques succeed in undermining the attack by offering near-trivial output corruption, they do so at the expense of a drastic reduction in functional and structural protection scope. In this work, we present JANUS-HD, based on novel insights that succeed in delivering the heretofore elusive goal of simultaneously boosting corruptibility and foiling SAT attacks. JANUS-HD obfuscates an FSM through diverse FF configurations for different transitions with the overall configuration setting as the obfuscation secret. A key-controlled Hamming distance comparator controls the obfuscation status at the minimized number of entrance states identified through a custom graph partitioning algorithm. Reliance on the inherent state transition patterns extends the obfuscation benefits to non-entrance states without exposing any additional key space pruning trace. We leverage the flexibility of state encoding and equivalence-based FSM transformations to generate an obfuscated netlist at low overhead using standard synthesis tools. Finally, we present a scan chain crippling mechanism that delivers unfettered scan chain access while eradicating any key trace leakage in the scan mode, thus thwarting chosen-input attacks aimed at the Hamming distance comparator. We illustrate through experiments that JANUS-HD delivers obfuscation scope improvements of up to 45.5x over the state-of-the-art, establishing the first cost-effective solution to offer a broad yet attack-resilient obfuscation scope against supply chain threats. |
14:46 CET | 21.4.5 | TRILOCK: IC PROTECTION WITH TUNABLE CORRUPTIBILITY AND RESILIENCE TO SAT AND REMOVAL ATTACKS Speaker: Yuke Zhang, University of Southern California, US Authors: Yuke Zhang, Yinghua Hu, Pierluigi Nuzzo and Peter Beerel, University of Southern California, US Abstract Sequential logic locking has been studied over the last decade as a method to protect sequential circuits from reverse engineering. However, most of the existing sequential logic locking techniques are threatened by increasingly more sophisticated SAT-based attacks, efficiently using input queries to a SAT solver to rule out incorrect keys, as well as removal attacks based on structural analysis. In this paper, we propose TriLock, a sequential logic locking method that simultaneously addresses these vulnerabilities. TriLock can achieve high, tunable functional corruptibility while still guaranteeing exponential queries to the SAT solver in a SAT-based attack. Further, it adopts a state re-encoding method to obscure the boundary between the original state registers and those inserted by the locking method, thus making it more difficult to detect and remove the locking-related components. |
14:50 CET | 21.4.6 | Q&A SESSION Authors: Sophie Dupuis1 and Elif Bilge Kavun2 1LIRMM, FR; 2University of Passau, DE Abstract Questions and answers with the authors |
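As a hedged illustration of the shuffling idea referenced in 21.4.1 (not the authors' microcontroller implementation), the sketch below evaluates the neurons of a fully-connected layer in a fresh random order on every inference. The numerical result is unchanged, but the mapping between points in time and specific weights is randomized, which is what raises the attack effort; all sizes and values are arbitrary.

```python
# Illustrative only: generic shuffling of the neuron evaluation order in a
# dense layer. The output equals the unshuffled computation, but an attacker
# can no longer assume that a given instant corresponds to a fixed neuron.
# This is not the implementation evaluated in paper 21.4.1.
import numpy as np

rng = np.random.default_rng()

def dense_shuffled(x, W, b):
    """x: (n_in,), W: (n_out, n_in), b: (n_out,) -> (n_out,)."""
    y = np.empty_like(b)
    order = rng.permutation(len(b))   # fresh random permutation per inference
    for j in order:                   # neurons evaluated in shuffled order
        y[j] = W[j] @ x + b[j]
    return y

x = rng.standard_normal(8)
W = rng.standard_normal((4, 8))
b = rng.standard_normal(4)
assert np.allclose(dense_shuffled(x, W, b), W @ x + b)  # same result
```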
22.1 Heterogeneous system-on-chip design methods
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Lana Josipovic, ETH Zurich, CH
Session co-chair:
John Wickerson, Imperial College, GB
The session presents various design methods addressing an array of important challenges in heterogeneous system-on-chip design. They cover not only system-level techniques for FPGA and NoC architectures, but also high-level synthesis solutions for performance improvement, power estimation, energy efficiency, and data/IP protection. We complete the session with two interactive presentations about coarse-grained reconfigurable architectures and cloud systems.
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 22.1.1 | UNDERSTANDING AND MITIGATING MEMORY INTERFERENCE IN FPGA-BASED HESOCS Speaker: Gianluca Brilli, University of Modena and Reggio Emilia, IT Authors: Gianluca Brilli, Alessandro Capotondi, Paolo Burgio and Andrea Marongiu, Unimore, IT Abstract Like most high-end embedded systems, FPGA-based systems-on-chip (SoCs) are increasingly adopting heterogeneous designs, where CPU cores, the configurable logic and other ICs all share the interconnect and the main memory (DRAM) controller. This paradigm is scalable and reduces production costs and time-to-market, but creates resource contention issues, which ultimately affect the programs' timing. This problem has been widely studied on CPU- and GPU-based systems, along with strategies to mitigate such effects, but little has been done so far to systematically study the problem on FPGA-based SoCs. This work provides an in-depth analysis of memory interference on such systems, targeting two state-of-the-art commercial FPGA SoCs. We also discuss architectural support for Controlled Memory Request Injection (CMRI), a technique that has proven effective at reducing the bandwidth under-utilization implied by naive schemes that solve the interference problem by only allowing mutually exclusive access to the shared resources. Our experimental results show that: i) memory interference can slow down CPU tasks by up to 16x in the tested FPGA-based SoCs; ii) CMRI makes it possible to exploit more than 40% of the memory bandwidth available to FPGA accelerators (normally completely unused in PREM-like schemes), keeping the slowdown due to interference below 10%. (A generic sketch of budget-regulated request injection follows this session's table.) |
15:44 CET | 22.1.2 | (Best Paper Award Candidate) POWERGEAR: EARLY-STAGE POWER ESTIMATION IN FPGA HLS VIA HETEROGENEOUS EDGE-CENTRIC GNNS Speaker: Zhe Lin, Peng Cheng Laboratory, CN Authors: Zhe Lin1, Zike Yuan2, Jieru Zhao3, Wei Zhang4, Hui Wang1 and Yonghong Tian5 1Peng Cheng Laboratory, CN; 2University of Auckland, NZ; 3Shanghai Jiao Tong University, CN; 4Hong Kong University of Science and Technology, HK; 5Peking University & Peng Cheng Laboratory, CN Abstract Power estimation is the basis of many hardware optimization strategies. However, it is still challenging to offer accurate power estimation at an early stage such as high-level synthesis (HLS). In this paper, we propose PowerGear, a graph-learning-assisted power estimation approach for FPGA HLS, which features high accuracy, efficiency and transferability. PowerGear comprises two main components: a graph construction flow and a customized graph neural network (GNN) model. Specifically, in the graph construction flow, we introduce buffer insertion, datapath merging, graph trimming and feature annotation techniques to transform HLS designs into graph-structured data, which encode both intra-operation micro-architectures and inter-operation interconnects annotated with switching activities. Furthermore, we propose a novel power-aware heterogeneous edge-centric GNN model which effectively learns heterogeneous edge semantics and structural properties of the constructed graphs via edge-centric neighborhood aggregation, and fits the formulation of dynamic power. Compared with on-board measurement, PowerGear estimates total and dynamic power for new HLS designs with errors of 3.60% and 8.81%, respectively, which outperforms the prior arts in research and the commercial product Vivado. In addition, PowerGear demonstrates a speedup of 4x over Vivado power estimator. Finally, we present a case study in which PowerGear is exploited to facilitate design space exploration for FPGA HLS, leading to a performance gain of up to 11.2%, compared with methods using state-of-the-art predictive models. |
15:48 CET | 22.1.3 | ENERGY EFFICIENT, REAL-TIME AND RELIABLE TASK DEPLOYMENT ON NOC-BASED MULTICORES WITH DVFS Speaker: Lei Mo, Southeast University, CN Authors: Lei Mo1, Qi Zhou1, Angeliki Kritikakou2 and Ji Liu3 1Southeast University, CN; 2Univ Rennes, Inria, CNRS, IRISA, FR; 3Baidu Research, CN Abstract Task deployment plays an important role in the overall system performance, especially for complex architectures, including several cores with Dynamic Voltage and Frequency Scaling (DVFS) and Network-on-Chips (NoC). Task deployment affects not only the energy consumption but also the real-time response and reliability of the system. In this work, a task deployment approach is proposed to optimize the overall system energy consumption, including computation of the cores and communication of the NoC, under task reliability and real-time constraints. More precisely, the task deployment approach combines task allocation and scheduling, frequency assignment, task duplication, and multi-path data routing. The task deployment problem is formulated using mixed-integer non-linear programming. To find the optimal solution, the original problem is equivalently transformed to mixed-integer linear programming, and solved by state-of-the-art solvers. Furthermore, a decomposition-based heuristic, with low computational complexity, is proposed to deal with scalability. Finally, extended simulations evaluate the proposed methods. |
15:52 CET | 22.1.4 | COXHE: A SOFTWARE-HARDWARE CO-DESIGN FRAMEWORK FOR FPGA ACCELERATION OF HOMOMORPHIC COMPUTATION Speaker: Mingqin Han, Shandong University, CN Authors: Mingqin Han1, Yilan Zhu1, Qian Lou2, Zimeng Zhou1, Shanqing Guo1 and Lei Ju1 1Shandong University, CN; 2Indiana University, US Abstract Data privacy becomes a crucial concern in the AI and big data era. Fully homomorphic encryption (FHE) is a promising data privacy protection technique where the entire computation is performed on encrypted data. However, the dramatic increase of the computation workload restrains the usage of FHE in real-world applications. In this paper, we propose an FPGA accelerator design framework for CKKS-based HE. Since the key-switch operations are the primary performance bottleneck of FHE computation, we propose a low-latency design of the key-switch module with reduced intra-operation data dependency. Compared with the state-of-the-art FPGA-based key-switch implementation that is based on Verilog, the proposed high-level synthesis (HLS) based design reduces the operation latency by 40%. Furthermore, we propose an automated design space exploration framework which generates optimal encryption parameters and accelerators for a given application kernel and the target FPGA device. Experimental results for a set of real HE application kernels on different FPGA devices show that our HLS-based flexible design framework produces substantially better accelerator designs compared with a fixed-parameter HE accelerator in terms of security, approximation error, and overall performance. |
15:56 CET | 22.1.5 | A COMPOSABLE DESIGN SPACE EXPLORATION FRAMEWORK TO OPTIMIZE BEHAVIORAL LOCKING Speaker: Christian Pilato, Politecnico di Milano, IT Authors: Luca Collini1, Ramesh Karri2 and Christian Pilato1 1Politecnico di Milano, IT; 2NYU, US Abstract Globalization of the integrated circuit (IC) supply chain exposes designs to security threats such as reverse engineering and intellectual property (IP) theft. Designers may want to protect specific high-level synthesis (HLS) optimizations or micro-architectural solutions of their designs. Hence, protecting the IP of ICs is essential. Behavioral locking is an approach to thwart these threats by operating at high levels of abstraction instead of reasoning on the circuit structure. Like any security protection, behavioral locking requires additional area. Existing locking techniques have a different impact on security and overhead, but they do not explore the effects of alternatives when making locking decisions. We develop a design-space exploration (DSE) framework to optimize behavioral locking for a given security metric. For instance, we optimize differential entropy under area or key-bit constraints. We define a set of heuristics to score each locking point by analyzing the system dependence graph of the design. The solution yields better results for 92% of the cases when compared to baseline, state-of-the-art (SOTA) techniques. The approach has results comparable to evolutionary DSE while requiring 100x to 400x less computational time. |
16:00 CET | 22.1.6 | Q&A SESSION Authors: Lana Josipovic1 and John Wickerson2 1ETH Zurich, CH; 2Imperial College London, GB Abstract Questions and answers with the authors |
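To make the idea of regulated memory request injection in 22.1.1 more tangible, here is a hedged software sketch of a budget-per-period regulator: requests are delayed so that no more than a fixed number of them are issued in any regulation window. CMRI itself is architectural support evaluated on real FPGA SoCs, not this Python model, and the budget, period and arrival times below are invented.

```python
# Illustrative only: a budget-per-period request regulator, in the spirit of
# controlled memory request injection but NOT the CMRI hardware mechanism of
# paper 22.1.1. All parameters are invented.

def regulate(arrivals, budget, period):
    """Return issue times so that at most `budget` requests are issued within
    any aligned window of length `period`; requests are served FIFO."""
    issue_times = []
    window_start, used = 0.0, 0
    for t in sorted(arrivals):
        ready = max(t, window_start)           # cannot issue before arrival
        while ready >= window_start + period:  # advance to the current window
            window_start += period
            used = 0
        if used == budget:                     # window budget exhausted
            window_start += period
            used = 0
            ready = window_start
        issue_times.append(ready)
        used += 1
    return issue_times

# Five requests, at most two issued per 1.0-time-unit window.
print(regulate([0.0, 0.1, 0.2, 0.3, 1.1], budget=2, period=1.0))
# -> [0.0, 0.1, 1.0, 1.0, 2.0]
```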
22.2 Power, Thermal and Performance Management for Advanced Computing Systems
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Pascal Vivet, CEA-LIST, FR
Session co-chair:
Andrea Bartolini, Bologna University, IT
This session discusses power and temperature management and performance gain for computing systems. The first two papers aim to advance energy management for energy-harvesting wearable devices and multi-core systems using federated reinforcement learning. The following two papers present thermal management methods for processor systems, focusing on 3D integration and cache contention modeling, respectively. The last paper boosts performance with smart cache prefetching.
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 22.2.1 | DIET: A DYNAMIC ENERGY MANAGEMENT APPROACH FOR WEARABLE HEALTH MONITORING DEVICES Speaker: Nuzhat Yamin, Washington State University, US Authors: Nuzhat Yamin, Ganapati Bhat and Jana Doppa, Washington State University, US Abstract Wearable devices are becoming increasingly popular for health and activity monitoring applications. These devices typically include small rechargeable batteries to improve user comfort. However, the small battery capacity leads to limited operating life, requiring frequent recharging. Recent research has proposed energy harvesting using light and user motion to improve the lifetime of wearable devices. Most energy harvesting approaches assume that the placement of the energy harvesting device and the sensors required for health monitoring are the same. However, this assumption does not hold for several real-world applications. For example, motion energy harvesting using piezoelectric sensors is limited to the knees and elbows, while a sensor for heart rate monitoring must be placed on the chest for optimal performance. To address this challenge, we propose a novel dynamic energy management approach referred to as DIET for wearable health applications enabled by multiple sensors and energy harvesting devices. The key idea behind DIET is to harvest energy from multiple sources and optimally allocate it to each sensor using a lightweight optimization algorithm such that the overall utility for applications is maximized. Experiments on real-world data from four users over 30 days show that the DIET approach achieves utility within 10% of an offline oracle. (A generic utility-per-energy allocation sketch follows this session's table.) |
15:44 CET | 22.2.2 | IMPROVE THE STABILITY AND ROBUSTNESS OF POWER MANAGEMENT THROUGH MODEL-FREE DEEP REINFORCEMENT LEARNING Speaker: Lin Chen, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, HK Authors: Lin Chen1, Xiao Li2 and Jiang Xu3 1Electronic and Computer Engineering Department, AI Chip Center for Emerging Smart Systems, The Hong Kong University of Science and Technology, HK; 2Electronic and Computer Engineering Department, The Hong Kong University of Science and Technology, HK; 3Microelectronics Thrust, Electronic and Computer Engineering Department, AI Chip Center for Emerging Smart Systems, The Hong Kong University of Science and Technology, HK Abstract Achieving high performance with low energy consumption has become a primary design objective in multi-core systems. Recently, power management based on reinforcement learning has shown great potential in adapting to dynamic environments without much prior knowledge. However, conventional Q-learning (QL) algorithms adopted in most existing works encounter serious problems with scalability, instability, and overestimation. In this paper, we present a deep reinforcement learning-based approach to improve the stability and robustness of power management while reducing the energy-delay product (EDP) under user-specified performance requirements. The comprehensive status of the system is monitored periodically, making our controller sensitive to environmental changes. To further improve the learning effectiveness, knowledge sharing among multiple devices is implemented in our approach. Experimental results on multiple realistic applications show that the proposed method can reduce instability by up to 68% compared with QL. Through knowledge sharing among multiple devices, our federated approach achieves around 4.8% EDP improvement over QL on average. |
15:48 CET | 22.2.3 | (Best Paper Award Candidate) COREMEMDTM: INTEGRATED PROCESSOR CORE AND 3D MEMORY DYNAMIC THERMAL MANAGEMENT FOR IMPROVED PERFORMANCE Speaker: Lokesh Siddhu, Indian Institute of Technology, Delhi, IN Authors: Lokesh Siddhu1, Rajesh Kedia1 and Preeti Ranjan Panda2 1Indian Institute of Technology Delhi, IN; 2Indian Institute of Technology, Delhi, IN Abstract The growing performance of processors and 3D memories has resulted in higher power densities and temperatures. Dynamic thermal management (DTM) policies for processor cores and memory have received significant research attention, but existing solutions address processors and 3D memories independently, which causes overcompensation, and there is a need to coordinate the DTM of the two subsystems. Further, existing CPU DTM policies slow down heated cores significantly, increasing the overall execution time and performance overheads. We propose CoreMemDTM, a technique for integrating processor core and 3D memory DTM policies that attempts to minimize performance overheads. We suggest employing DTM depending on the thermal margin since safe temperature thresholds might differ for the two subsystems. We propose a stall-balanced core DVFS policy for core thermal management that enables distributed cooling, decreasing overheads. We evaluate CoreMemDTM using ten different SPEC CPU2017 workloads across various safe temperature thresholds and observe average execution time and energy improvements of 14% and 36% compared to state-of-the-art thermal management policies. |
15:52 CET | 22.2.4 | THERMAL- AND CACHE-AWARE RESOURCE MANAGEMENT BASED ON ML-DRIVEN CACHE CONTENTION PREDICTION Speaker: Mohammed Bakr Sikal, Karlsruhe Institute of Technology, DE Authors: Mohammed Bakr Sikal1, Heba Khdr1, Martin Rapp1 and Joerg Henkel2 1Karlsruhe Institute of Technology, DE; 2Karlsruhe Institute of Technology, DE Abstract While on-chip many-core systems enable a large number of applications to run in parallel, the increased overall performance may come at the cost of complicating the performance constraints of individual applications due to contention on shared resources. For instance, the competition for last-level cache by concurrently-running applications may lead to slowing down the execution and to potentially violating individual performance constraints. Clustered many-cores reduce cache contention at chip level by sharing caches only at cluster level. To reduce cache contention within a cluster, state-of-the-art techniques aim to co-map a memory-intensive application with a compute-intensive application onto one cluster. However, compute-intensive applications typically consume high power, and therefore, executing another application on their nearby cores may lead to high temperatures. Hence, there is a trade-off between cache contention and temperature. This paper is the first to consider this trade-off through a novel thermal- and cache-aware resource management technique. We build a neural network (NN)-based model to predict the slowdown of the application execution induced by cache contention, feeding our resource management technique that then optimizes the application mapping and selects the voltage/frequency levels of the clusters to compensate for the potential contention-induced slowdown. Thereby, it meets the performance constraints, while minimizing temperature. Compared to the state of the art, our technique significantly reduces the temperature by 30% on average, while satisfying the performance constraints of all individual applications. |
15:56 CET | 22.2.5 | T-SKID: PREDICTING WHEN TO PREFETCH SEPARATELY FROM ADDRESS PREDICTION Speaker: Toru Koizumi, University of Tokyo, JP Authors: Toru Koizumi, Tomoki Nakamura, Yuya Degawa, Hidetsugu Irie, Shuichi Sakai and Ryota Shioya, University of Tokyo, JP Abstract Prefetching is an important technique for reducing the number of cache misses and improving processor performance, and thus various prefetchers have been proposed. Many prefetchers are focused on issuing prefetches sufficiently earlier than demand accesses to hide miss latency. In contrast, we propose the T-SKID prefetcher, which focuses on delaying prefetching. If a prefetcher issues prefetches for demand accesses too early, the prefetched line will be evicted before it is referenced. We found that existing prefetchers often issue such too-early prefetches, and this observation offers new opportunities to improve performance. To tackle this issue, T-SKID performs timing prediction independently of address prediction. In addition to issuing prefetches sufficiently early as existing prefetchers do, T-SKID can delay the issue of prefetches until an appropriate time if necessary. We evaluated T-SKID by simulations using SPEC CPU 2017. The result shows that T-SKID achieves a 5.6% performance improvement in a multi-core environment, compared to Instruction Pointer Classifier based Prefetching, which is a state-of-the-art prefetcher. |
16:00 CET | 22.2.6 | Q&A SESSION Authors: Pascal Vivet1 and Andrea Bartolini2 1CEA-Leti, FR; 2University of Bologna, IT Abstract Questions and answers with the authors |
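As a rough illustration of the allocation problem behind DIET in 22.2.1 (and not the paper's optimization algorithm), the sketch below greedily spends a harvested-energy budget on the sensors with the highest utility per unit of energy. The sensor names, costs and utilities are made up.

```python
# Illustrative only: greedy allocation of a harvested-energy budget across
# sensors by utility per unit of energy (a knapsack-style heuristic). This is
# NOT the DIET algorithm of paper 22.2.1; all names and numbers are invented.

def allocate(budget_mj, sensors):
    """sensors: dict name -> (energy_cost_mJ, utility). Returns the chosen
    sensors and the leftover budget."""
    chosen, remaining = [], budget_mj
    ranked = sorted(sensors.items(),
                    key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    for name, (cost, utility) in ranked:
        if cost <= remaining:
            chosen.append(name)
            remaining -= cost
    return chosen, remaining

sensors = {"heart_rate": (3.0, 9.0), "spo2": (5.0, 10.0), "accel": (1.0, 2.5)}
print(allocate(budget_mj=6.0, sensors=sensors))  # -> (['heart_rate', 'accel'], 2.0)
```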
22.3 Compute in- and near-memory
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Jean-Philippe Noel, CEA, FR
Session co-chair:
Pierre-Emmanuel Gaillardon, University of Utah, US
This session deals with design issues around the concepts of in- and near-memory computing. These range from optimizing digital synthesis for crossbar-based IMC to optimizing the analog design of both RRAM-based IMC and MRAM-based NMC circuits. In addition, the issue of non-volatility of data in cache memories is also tackled with innovative solutions.
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 22.3.1 | LIM-HDL: HDL-BASED SYNTHESIS FOR IN-MEMORY COMPUTING Speaker: Saman Froehlich, University of Bremen, DE Authors: Saman Froehlich1 and Rolf Drechsler2 1Department of Mathematics and Computer Science, University of Bremen, DE; 2University of Bremen/DFKI, DE Abstract HDLs are widely used in EDA for abstract specification and synthesis of logic circuits, as well as validation by simulation or formal verification techniques. Despite the popularity and the many benefits of HDL-based synthesis, it has not yet been performed for in-memory computing. Hence, there is a need to design a particular HDL which supplies efficient and compatible descriptions. In this paper, we enable HDL-based synthesis for the Programmable Logic-in-Memory (PLiM) computer architecture. The starting point to allow HDL-based synthesis for the PLiM computer architecture is to provide abstract descriptions of the final program similarly to the conventional logic synthesis approaches using standard HDLs such as VHDL or Verilog. We present LiM-HDL - a Verilog-based HDL - which allows for the detailed description of programs for in-memory computation. Having the description given in LiM-HDL, we propose a synthesis scheme which translates the description into PLiM programs, i.e., a sequence of resistive majority operations. This includes lexical and syntax analysis as well as preprocessing, custom levelization and a compiler. In our experiments, we show the benefits of LiM-HDL compared to classical Verilog-based synthesis. We show in a case study that LiM-HDL can be used to implement programs with respect to constraints of specific applications such as edge computing in IoT, for which the PLiM computer is of particular interest and where low area is a key requirement. In our case study, we show that we can reduce the number of ReRAM devices needed for the computation of an encryption module by 69%. |
15:44 CET | 22.3.2 | TRIPLE-SKIPPING NEAR-MRAM COMPUTING FRAMEWORK FOR AIOT ERA Speaker: Juntong Chen, Southeast University, CN Authors: Juntong Chen, Hao Cai, Bo Liu and Jun Yang, Southeast University, CN Abstract The near-memory computing (NMC) paradigm shows great significance in non-von Neumann architectures to reduce data movement. The normally-off and instant-on characteristics of spin-transfer torque magnetic random access memory (STT-MRAM) promise energy-efficient storage in the AIoT era. To avoid unnecessary memory-related processing, we propose a novel write-read-calculation triple-skipping (TS) NMC framework for multiply-accumulate (MAC) operations with minimally modified peripheral circuits. The proposed TS-NMC is evaluated with a custom microcontroller unit (MCU) in a 28-nm high-K metal gate (HKMG) CMOS process and a foundry-announced universal two-transistor two-magnetic-tunnel-junction (2T-2MTJ) MRAM cell. The framework consists of a sparse flag, defined in extra STT-MRAM columns with only 0.73% area overhead, and a calculation block for the NMC logic with 9.9% overhead. The TS-NMC works at a 0.6-V supply voltage at 20 MHz. The framework offers up to ∼95.6% energy saving compared to commercial SRAM on the ultra-low-power benchmark (ULPBenchmark). A classification task on MNIST takes 13 nJ/pattern. With the TS scheme, memory-access energy, calculation energy, and total energy are reduced by 52.49×, 2.7×, and 11.3×, respectively. (An illustrative sketch of operand skipping in a MAC kernel follows this session's table.) |
15:48 CET | 22.3.3 | ACHIEVING CRASH CONSISTENCY BY EMPLOYING PERSISTENT L1 CACHE Speaker: Akshay Krishna Ramanathan, Pennsylvania State University, US Authors: Akshay Krishna Ramanathan1, Sara Mahdizadeh Shahri2, Yi Xiao1 and Vijaykrishnan Narayanan1 1Pennsylvania State University, US; 2University of Michigan, US Abstract Emerging non-volatile memory technologies promise the opportunity for maintaining persistent data in memory. However, providing crash consistency in such systems can be costly, as any update to the persistent data has to reach the persistent domain in a specific order, imposing high overhead. Prior works proposed solutions in both software (SW) and hardware (HW) to address this problem but fall short of removing this overhead completely. In this work, we propose a Non-Volatile Cache (NVC) architecture design that employs a hybrid volatile, non-volatile memory cell employing monolithic 3D and ferroelectric technology in the L1 data cache to guarantee crash consistency with almost no performance overhead. We show that NVC achieves up to 5.1x speedup over state-of-the-art (SOTA) SW undo logging and 11% improvement over the SOTA HW solution without yielding the conventional architecture, while incurring 7% hardware overhead. |
15:52 CET | 22.3.4 | REFERENCING-IN-ARRAY SCHEME FOR RRAM-BASED CIM ARCHITECTURE Speaker: Abhairaj Singh, Delft University of Technology, NL Authors: Abhairaj Singh, Rajendra Bishnoi and Said Hamdioui, Delft University of Technology, NL Abstract Resistive random access memory (RRAM) based computation-in-memory (CIM) architectures are attracting a lot of attention due to their potential in performing fast and energy-efficient computing. However, RRAM variability and non-idealities limit the computing accuracy of such architectures, especially for multi-operand logic operations. This paper proposes a voltage-based differential referencing-in-array scheme that enables accurate two- and multi-operand logic operations for RRAM-based CIM architectures. The scheme makes use of a 2T2R cell configuration to create a complementary bitcell structure that inherently also acts as a reference during operation execution; this results in a high sensing margin. Moreover, the variation-sensitive multi-operand (N)AND operation is implemented using a complementary-input (N)OR operation to further improve its accuracy. Simulation results for a post-layout extracted 512x512 (256Kb) RRAM-based CIM array show that (N)OR/(N)AND operations with up to 56 operands can be accurately and reliably performed, as opposed to a maximum of 4 operands supported by state-of-the-art solutions, while offering up to 11.4X better energy efficiency. |
15:56 CET | 22.3.5 | Q&A SESSION Authors: Jean-Philippe Noel1 and Pierre-Emmanuel Gaillardon2 1CEA, FR; 2University of Utah, US Abstract Questions and answers with the authors |
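To convey the arithmetic intent behind the skipping scheme in 22.3.2, here is a hedged software sketch of a multiply-accumulate loop that skips work whenever a stored operand is flagged as zero. The actual triple-skipping mechanism lives in the MRAM macro's peripheral circuitry and also skips writes and reads; the data below are invented.

```python
# Illustrative only: operand skipping in a multiply-accumulate (MAC) kernel,
# mimicking the role of a sparse flag that marks zero operands. The real
# write-read-calculation triple-skipping of paper 22.3.2 is implemented in
# the MRAM macro's peripheral circuitry, not in software.

def mac_with_skipping(weights, activations):
    sparse_flags = [w == 0 for w in weights]   # analogous to the sparse-flag column
    acc, skipped = 0, 0
    for w, a, is_zero in zip(weights, activations, sparse_flags):
        if is_zero:                            # skip read, multiply and accumulate
            skipped += 1
            continue
        acc += w * a
    return acc, skipped

print(mac_with_skipping([0, 3, 0, -1, 2], [5, 1, 7, 4, 2]))  # -> (3, 2)
```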
22.4 Formal Methods in Design and Verification of Software and Hardware Systems
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 15:40 CET - 16:30 CET
Session chair:
Stefano Quer, Politecnico di Torino, IT
Session co-chair:
Christoph Scholl, University Freiburg, DE
The ever-growing complexity of software and hardware systems requires an increasing level of automation and more scalable design and verification methods. We will learn how Bounded Model Checking can be combined with Coverage Guided Fuzzing into an efficient and effective tool for software verification. We will be introduced to an FPGA-based swarm verification engine that acts as a model checker capable of proving liveness properties. Multiplier verification is pushed forward through the clever use of dual variables and tail substitution in the algebraic encoding. Finally, BDDs remain on the scene: for some applications, it is shown how to construct provably optimal variable orders in polynomial time.
Time | Label | Presentation Title Authors |
---|---|---|
15:40 CET | 22.4.1 | (Best Paper Award Candidate) BMC+FUZZ : EFFICIENT AND EFFECTIVE TEST GENERATION Speaker: Ravindra Metta, TCS, IN Authors: Ravindra Metta1, Raveendra Medicherla1 and Samarjit Chakraborty2 1TCS, IN; 2UNC Chapel Hill, US Abstract Coverage Guided Fuzzing (CGF) is a greybox test generation technique. Bounded Model Checking (BMC) is a whitebox test generation technique. Both of these have been highly successful at program coverage as well as error detection. It is well known that CGF fails to cover complex conditionals and deeply nested program points. BMC, on the other hand, fails to scale for programming features such as large loops and arrays. To alleviate the above problems, we propose (1) to combine BMC and CGF by using BMC for a short and potentially incomplete unwinding of a given program to generate effective initial test prefixes, which are then extended into complete test inputs for CGF to fuzz, and (2) in case BMC gets stuck even for the short unwinding, to automatically identify the reason and rerun BMC with a corresponding remedial strategy. We call this approach BMCFuzz and implemented it in the VeriFuzz framework. This implementation was experimentally evaluated by participating in Test-Comp 2021, and the results show that BMCFuzz is both effective and efficient at covering branches as well as exposing errors. In this paper, we present the details of BMCFuzz and our analysis of the experimental results. |
15:44 CET | 22.4.2 | DOLMEN: FPGA SWARM FOR SAFETY AND LIVENESS VERIFICATION Speaker: Emilien Fournier, ENSTA Bretagne, FR Authors: Emilien Fournier, Ciprian Teodorov and Loïc Lagadec, ENSTA Bretagne, FR Abstract To ensure correctness of critical systems, swarm verification produces proofs of failure on systems too large to be verified using model-checking. Recent research efforts exploit both intrinsic parallelism and low-latency on-chip memory offered by FPGAs to achieve 3 orders of magnitude speedups over software. However, these approaches are limited to safety verification that encodes only what the system should not do. Liveness properties express what the system should do, and are widely used in the verification of operating systems, distributed systems, and communication protocols. Both safety and liveness properties are of paramount importance to ensure systems correctness. This paper presents Dolmen, the first FPGA implementation of a swarm verification engine that supports both safety and liveness properties. Dolmen features a deeply pipelined verification core, along with a scalable architecture to allow high-frequency synthesis on large FPGAs. Our experimental results, on a Xilinx Virtex Ultrascale+ FPGA, show that the Dolmen architecture can achieve up to 4 orders of magnitude speedups compared to software model-checking. |
15:48 CET | 22.4.3 | ADDING DUAL VARIABLES TO ALGEBRAIC REASONING FOR GATE-LEVEL MULTIPLIER VERIFICATION Speaker: Daniela Kaufmann, Johannes Kepler University Linz, AT Authors: Daniela Kaufmann1, Paul Beame2, Armin Biere3 and Jakob Nordström4 1Johannes Kepler University Linz, AT; 2University of Washington, US; 3Albert-Ludwigs-University Freiburg, DE; 4Københavns Universitet (DIKU), DK Abstract Algebraic reasoning has proven to be one of the most effective approaches for verifying gate-level integer multipliers, but it struggles with certain components, necessitating the complementary use of SAT solvers. For this reason validation certificates require proofs in two different formats. Approaches to unify the certificates are not scalable, meaning that the validation results can only be trusted up to the correctness of compositional reasoning. We show in this paper that using dual variables in the algebraic encoding, together with a novel tail substitution and carry rewriting method, removes the need for SAT solvers in the verification flow and yields a single, uniform proof certificate. |
15:52 CET | 22.4.4 | ON THE OPTIMAL OBDD REPRESENTATION OF 2-XOR BOOLEAN AFFINE SPACES Speaker: Valentina Ciriani, Università degli Studi di Milano, IT Authors: Anna Bernasconi1, Valentina Ciriani2 and Marco Longhi2 1Università di Pisa, IT; 2Università degli Studi di Milano, IT Abstract A Reduced Ordered Binary Decision Diagram (ROBDD) is a data structure widely used in an increasing number of fields of Computer Science. In general, ROBDD representations of Boolean functions have a tractable size, polynomial in the number of input variables, for many practical applications. However, the size of a ROBDD, and consequently the complexity of its manipulation, strongly depends on the variable ordering: depending on the initial ordering of the input variables, the size of a ROBDD representation can grow from linear to exponential. In this paper, we study the ROBDD representation of Boolean functions that describe a special class of Boolean affine spaces, which play an important role in some logic synthesis applications. We first discuss how the ROBDD representations of these functions are very sensitive to variable ordering, and then we provide an efficient linear-time algorithm for computing an optimal variable ordering that always guarantees a ROBDD of size linear in the number of input variables. (A standard textbook illustration of this ordering sensitivity follows this session's table.) |
15:56 CET | 22.4.5 | Q&A SESSION Authors: Stefano Quer1 and Christoph Scholl2 1Politecnico di Torino, IT; 2University Freiburg, DE Abstract Questions and answers with the authors |
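The ordering sensitivity studied in 22.4.4 can be appreciated through a standard textbook example (not taken from the paper): a conjunction of 2-XOR constraints, whose satisfying assignments form an affine space over GF(2), has a linear-size ROBDD under one variable order and an exponential-size ROBDD under another.

```latex
% Standard illustration (not from the paper): ordering sensitivity of ROBDDs
% for a conjunction of 2-XOR constraints.
\[
  f(x_1,\dots,x_n,y_1,\dots,y_n) \;=\; \bigwedge_{i=1}^{n} \left( x_i \oplus y_i \right)
\]
% Interleaved order x_1 < y_1 < x_2 < y_2 < \dots < x_n < y_n:
%   after reading x_i, only its value must be remembered until y_i is read,
%   so the ROBDD has O(n) nodes.
% Separated order x_1 < \dots < x_n < y_1 < \dots < y_n:
%   each of the 2^n assignments of (x_1,\dots,x_n) yields a distinct
%   subfunction of (y_1,\dots,y_n), so the ROBDD has at least 2^n nodes.
```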
23.1 Artificial Intelligence for embedded systems in healthcare
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Marina Zapater, University of Applied Sciences Western Switzerland, CH
Session co-chair:
Daniele Pagliari, Politecnico di Torino, IT
Health-related applications need more and more intelligence at the edge to process data efficiently. This session will explain how artificial intelligence can help.
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 23.1.1 | (Best Paper Award Candidate) BIOFORMERS: EMBEDDING TRANSFORMERS FOR ULTRA-LOW POWER SEMG-BASED GESTURE RECOGNITION Speaker: Alessio Burrello, University of Bologna, IT Authors: Alessio Burrello1, Francesco Bianco Morghet2, Moritz Scherer3, Simone Benatti4, Luca Benini5, Enrico Macii2, Massimo Poncino2 and Daniele Jahier Pagliari2 1Department of Electrical and Electronic Engineering, University of Bologna, IT; 2Politecnico di Torino, IT; 3ETH Zürich, CH; 4University of Bologna, IT; 5Università di Bologna and ETH Zürich, IT Abstract Human-machine interaction is gaining traction in rehabilitation tasks, such as controlling prosthetic hands or robotic arms. Gesture recognition exploiting surface electromyographic (sEMG) signals is one of the most promising approaches, given that sEMG signal acquisition is non-invasive and is directly related to muscle contraction. However, the analysis of these signals still presents many challenges, since similar gestures result in similar muscle contractions. Thus the resulting signal shapes are almost identical, leading to low classification accuracy. To tackle this challenge, complex neural networks are employed, which require large memory footprints, consume relatively high energy and limit the maximum battery life of devices used for classification. This work addresses this problem with the introduction of the Bioformers, a new family of ultra-small attention-based architectures that approaches state-of-the-art performance while reducing the number of parameters and operations by 4.9X. Additionally, by introducing a new inter-subject pre-training, we improve the accuracy of our best Bioformer by 3.39%, matching state-of-the-art accuracy without any additional inference cost. Deploying our best performing Bioformer on a Parallel, Ultra-Low Power (PULP) microcontroller unit (MCU), the GreenWaves GAP8, we achieve an inference latency and energy of 2.72 ms and 0.14 mJ, respectively, 8.0X lower than the previous state-of-the-art neural network, while occupying just 94.2 kB of memory. |
16:44 CET | 23.1.2 | INCLASS: INCREMENTAL CLASSIFICATION STRATEGY FOR SELF-AWARE EPILEPTIC SEIZURE DETECTION Speaker: Lorenzo Ferretti, University of California Los Angeles (UCLA), US Authors: Lorenzo Ferretti1, Giovanni Ansaloni2, Renaud Marquis3, Tomas Teijeiro4, Philippe Ryvlin3, David Atienza4 and Laura Pozzi5 1University of California Los Angeles, US; 2EPFL, CH; 3CHUV, CH; 4École Polytechnique Fédérale de Lausanne (EPFL), CH; 5USI Lugano, CH Abstract Wearable Health Companions allow the unobtrusive monitoring of patients affected by chronic conditions. In particular, by acquiring and interpreting bio-signals, they enable the detection of acute episodes in cardiac and neurological ailments. Nevertheless, the processing of bio-signals is computationally complex, especially when a large number of features are required to obtain reliable detection outcomes. Addressing this challenge, we present a novel methodology, named INCLASS, that iteratively extends the employed feature sets at run-time, until a confidence condition is satisfied. INCLASS builds such sets at design time based on code analysis and profiling information. When applied to the challenging scenario of detecting epileptic seizures based on ECG and SpO2 acquisitions, INCLASS obtains savings of up to 54%, while incurring a negligible loss of detection performance (1.1% degradation of specificity and sensitivity) with respect to always computing and evaluating all features. (A generic sketch of confidence-driven incremental classification follows this session's table.) |
16:48 CET | 23.1.3 | AMSER: ADAPTIVE MULTI-MODAL SENSING FOR ENERGY EFFICIENT AND RESILIENT EHEALTH SYSTEMS Speaker: Emad Kasaeyan Naeini, University of California, Irvine, US Authors: Emad Kasaeyan Naeini1, Sina Shahhosseini1, Anil Kanduri2, Pasi Liljeberg2, Amir M. Rahmani1 and Nikil Dutt1 1University of California Irvine, US; 2University of Turku, FI Abstract eHealth systems deliver critical digital healthcare and wellness services for users by continuously monitoring physiological and contextual data. eHealth applications use multi-modal machine learning kernels to analyze data from different sensor modalities and automate decision-making. Noisy inputs and motion artifacts during sensory data acquisition affect the i) prediction accuracy and resilience of eHealth services and ii) energy efficiency in processing garbage data. Monitoring raw sensory inputs to identify and drop data and features from noisy modalities can improve prediction accuracy and energy efficiency. We propose a closed-loop monitoring and control framework for multi-modal eHealth applications, AMSER, that can mitigate garbage-in garbage-out by i) monitoring input modalities, ii) analyzing raw input to selectively drop noisy data and features, and iii) choosing appropriate machine learning models that fit the configured data and feature vector - to improve prediction accuracy and energy efficiency. We evaluate our AMSER approach using multi-modal eHealth applications of pain assessment and stress monitoring over different levels and types of noisy components incurred via different sensor modalities. Our approach achieves up to 22% improvement in prediction accuracy and 5.6x energy consumption reduction in the sensing phase against the state-of-the-art multi-modal monitoring application. |
16:52 CET | 23.1.4 | Q&A SESSION Authors: Marina Zapater1 and Daniele Jahier Pagliari2 1University of Applied Sciences Western Switzerland (HES-SO), CH; 2Politecnico di Torino, IT Abstract Questions and answers with the authors |
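To illustrate the generic pattern behind INCLASS in 23.1.2 (not the paper's actual feature sets, classifier or thresholds), the sketch below accumulates a decision score from progressively more expensive feature groups and stops as soon as a confidence margin is reached.

```python
# Illustrative only: confidence-driven incremental classification. Cheap
# feature groups are evaluated first; more expensive ones are computed only
# while the accumulated score is not yet confident. The scoring functions,
# threshold and data are invented, not those of INCLASS (paper 23.1.2).
import numpy as np

def incremental_classify(feature_groups, score_fns, confidence=2.0):
    """feature_groups: feature vectors, cheapest first; one score_fn each.
    Positive accumulated score -> class 1. Returns (class, groups used)."""
    score, used = 0.0, 0
    for feats, fn in zip(feature_groups, score_fns):
        score += fn(feats)
        used += 1
        if abs(score) >= confidence:       # confident enough: stop early
            break
    return int(score > 0), used

# Hypothetical two-stage example with linear scoring functions.
g1, g2 = np.array([1.2, -0.3]), np.array([0.5, 0.9, -1.1])
fns = [lambda f: float(f @ np.array([1.0, 0.5])),
       lambda f: float(f @ np.array([0.2, 1.0, 0.3]))]
print(incremental_classify([g1, g2], fns))  # e.g. (1, 2)
```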
23.2 Performance Evaluation & Optimization using Modeling, Simulation & Benchmarking
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Avi Ziv, IBM, IL
Session co-chair:
Daniel Grosse, Johannes Kepler University, AT
This session introduces solutions that increase the accuracy and/or the speed of assessing the performance of future designs. The solutions cover simple and accurate modeling of the delay of multi-input gates, high-level modeling of the non-idealities of computation-in-memory components, a platform to assess PiMs, and a method to decide on the quantization of DNNs.
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 23.2.1 | A SIMPLE HYBRID MODEL FOR ACCURATE DELAY MODELING OF A MULTI-INPUT GATE Speaker: Arman Ferdowsi, TU Wien, AT Authors: Arman Ferdowsi, Juergen Maier, Daniel Oehlinger and Ulrich Schmid, TU Wien, AT Abstract Faithfully representing small delay variations caused by transitions on different inputs in close temporal proximity is a challenging task for digital circuit delay models. In this paper, we show that a simple hybrid model, derived from considering transistors as ideal switches in a simple RC model, leads to a surprisingly accurate model. By analytically solving the resulting ODEs for a NOR gate, explicit expressions for the delay are derived. In addition, we experimentally compare our model's predictions to SPICE simulations and to existing delay models. (A textbook first-order RC delay sketch follows this session's table.) |
16:44 CET | 23.2.2 | SYSCIM: SYSTEMC-AMS SIMULATION OF MEMRISTIVE COMPUTATION IN MEMORY Speaker: Ali BanaGozar, Eindhoven University of Technology, NL Authors: Seyed Hossein Hashemi Shadmehri1, Ali BanaGozar2, Mehdi Kamal1, Sander Stuijk2, Ali Afzali-Kusha1, Massoud Pedram3 and Henk Corporaal4 1University of Tehran, IR; 2Eindhoven University of Technology, NL; 3USC, US; 4TU/e (Eindhoven University of Technology), NL Abstract Computation-in-memory (CIM) is one of the most appealing computing paradigms, especially for implementing artificial neural networks. Non-volatile memories like ReRAMs, PCMs, etc., have proven to be promising candidates for the realization of CIM processors. However, these devices and their driving circuits are subject to non-idealities. This paper presents a comprehensive platform, named SySCIM, for simulating memristor-based CIM systems. SySCIM considers the impact of the non-idealities of the CIM components, including memristor device, memristor crossbar (interconnects), analog-to-digital converter, and transimpedance amplifier, on the vector-matrix multiplication performed by the CIM unit. The CIM modules are described in SystemC and SystemC-AMS to reach a higher simulation speed while maintaining high simulation accuracy. Experiments under different crossbar sizes show that SySCIM performs simulations up to 117x faster than HSPICE with less than 4% accuracy loss. The modular design of SySCIM provides researchers with an easy design-space exploration tool to investigate the effects of various non-idealities. |
16:48 CET | 23.2.3 | PIMULATOR: A FAST AND FLEXIBLE PROCESSING-IN-MEMORY EMULATION PLATFORM Speaker: Sergiu Mosanu, University of Virginia, US Authors: Sergiu Mosanu, Mohammad Nazmus Sakib, Tommy Tracy II, Ersin Cukurtas, Alif Ahmed, Preslav Ivanov, Samira Khan, Kevin Skadron and Mircea Stan, University of Virginia, US Abstract Motivated by the memory wall problem, researchers propose many new Processing-in-Memory (PiM) architectures to bring computation closer to data. However, evaluating the performance of these emerging architectures involves using a myriad of tools, including circuit simulators, behavioral RTL or software simulation models, hardware approximations, etc. It is challenging to mimic both software and hardware aspects of a PiM architecture using the currently available tools with high performance and fidelity. Until and unless actual products that include PiM become available, the next best thing is to emulate various hardware PiM solutions on FPGA fabric and boards. This paper presents a modular, parameterizable, FPGA synthesizable soft PiM model suitable for prototyping and rapid evaluation of Processing-in-Memory architectures. The PiM model is implemented in System Verilog and allows users to generate any desired memory configuration on the FPGA fabric with complete control over the structure and distribution of the PiM logic units. Moreover, the model is compatible with the LiteX framework, which provides a high degree of usability and compatibility with the FPGA and RISC-V ecosystem. Thus, the framework enables architects to easily prototype, emulate and evaluate a wide range of emerging PiM architectures and designs. We demonstrate strategies to model several pioneering bitwise-PiM architectures and provide detailed benchmark performance results that demonstrate the platform's ability to facilitate design space exploration. We observe an emulation vs. simulation weighted-average speedup of 28x when running a memory benchmark workload. The model can utilize 100% BRAM and only 1% FF and LUT of an Alveo U280 FPGA board. The project is entirely open-source. |
16:52 CET | 23.2.4 | BENQ: BENCHMARKING AUTOMATED QUANTIZATION ON DEEP NEURAL NETWORK ACCELERATORS Speaker: Zheng Wei, Xi’an Jiaotong University, CN Authors: Zheng Wei1, Xingjun Zhang1, Jingbo Li2, Zeyu Ji1 and Jia Wei2 1Xi’an Jiaotong University, CN; 2Xi'an Jiaotong University, CN Abstract Hardware-aware automated quantization promises to unlock an entirely new algorithm-hardware co-design paradigm for efficiently accelerating deep neural network (DNN) inference by incorporating the hardware cost into the reinforcement learning (RL) -based quantization strategy search process. Existing works usually design an automated quantization algorithm targeting one hardware accelerator with a device-specific performance model or pre-collected data. However, determining the hardware cost is non-trivial for algorithm experts due to their lack of cross-disciplinary knowledge in computer architecture, compiler, and physical chip design. Such a barrier limits reproducibility and fair comparison. Moreover, it is notoriously challenging to interpret the results due to the lack of quantitative metrics. To this end, we first propose BenQ, which includes various RL-based automated quantization algorithms with aligned settings and encapsulates two off-the-shelf performance predictors with standard OpenAI Gym API. Then, we leverage cosine similarity and manhattan distance to interpret the similarity between the searched policies. The experiments show that different automated quantization algorithms can achieve near equivalent optimal trade-offs because of the high similarity between the searched policies, which provides insights for revisiting the innovations in automated quantization algorithms. |
16:56 CET | 23.2.5 | Q&A SESSION Authors: Avi Ziv1 and Daniel Grosse2 1IBM Research - Haifa, IL; 2Johannes Kepler University Linz, AT Abstract Questions and answers with the authors |
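As an illustration of the policy-similarity analysis described in the BenQ abstract (23.2.4), the following minimal Python sketch compares two per-layer bitwidth policies with cosine similarity and Manhattan distance, the two metrics named in the abstract. The policies, values, and helper names are hypothetical examples for illustration only, not data or code from the paper.

```python
# Illustrative sketch only: comparing two hypothetical per-layer quantization
# policies (bitwidth vectors) with cosine similarity and Manhattan distance.
# The policies below are made-up examples, not results from the BenQ paper.
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two policy vectors (1.0 = same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def manhattan_distance(a, b):
    """Sum of absolute per-layer bitwidth differences (0.0 = identical policies)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))

# Hypothetical per-layer weight bitwidths found by two RL-based search runs.
policy_a = [8, 6, 4, 4, 6, 8]
policy_b = [8, 6, 4, 5, 6, 8]

print(cosine_similarity(policy_a, policy_b))   # close to 1.0 -> very similar policies
print(manhattan_distance(policy_a, policy_b))  # small distance -> near-equivalent trade-off
```

A high cosine similarity together with a small Manhattan distance would, in this reading, indicate that two search algorithms converged to near-equivalent quantization policies.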
23.3 New Methods and Tools using Machine Learning
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Muhammad Shafique, New York University Abu Dhabi, AE
Session co-chair:
Smail Niar, Université Polytechnique Hauts-de-France, FR
This session includes four regular and two IP papers that improve the state of the art of methods and tools for Machine Learning. Among the regular papers, the first one presents a new approach for graph classification with hyperdimensional computing, the second one takes inspiration from deep learning techniques in natural language processing to improve energy estimation, the third one presents an in situ training framework for memristive crossbar structures, and the fourth one offers new perspectives on compression algorithms for quantized neural networks. Among the IP papers, the first one proposes a neural approach to improving thermal management, while the second one presents a training method that uses bit gradients to obtain mixed-precision quantized models.
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 23.3.1 | (Best Paper Award Candidate) GRAPHHD: EFFICIENT GRAPH CLASSIFICATION USING HYPERDIMENSIONAL COMPUTING Speaker: Igor Nunes, University of California, Irvine, US Authors: Igor Nunes, Mike Heddes, Tony Givargis, Alex Nicolau and Alex Veidenbaum, University of California, Irvine, US Abstract Hyperdimensional Computing (HDC), developed by Kanerva, is a computational model for machine learning inspired by neuroscience. HDC exploits characteristics of biological neural systems such as high dimensionality, randomness and a holographic representation of information to achieve a good balance between accuracy, efficiency and robustness. HDC models have already been proven to be useful in different learning applications, especially in resource-limited settings such as the increasingly popular Internet of Things (IoT). One class of learning tasks that is missing from the current body of work on HDC is graph classification. Graphs are among the most important forms of information representation, yet, to this day, HDC algorithms have not been applied to the graph learning problem in a general sense. Moreover, graph learning in IoT and sensor networks, with their limited compute capabilities, introduces challenges to the overall design methodology. In this paper, we present GraphHD, a baseline approach for graph classification with HDC. We evaluate GraphHD on real-world graph classification problems. Our results show that, when compared to state-of-the-art Graph Neural Networks (GNNs), the proposed model achieves comparable accuracy, while training and inference times are on average 14.6X and 2.0X faster, respectively. (See the illustrative encoding sketch after this session's table.) |
16:44 CET | 23.3.2 | DEEPPM: TRANSFORMER-BASED POWER AND PERFORMANCE PREDICTION FOR ENERGY-AWARE SOFTWARE Speaker: Jun S. Shim, Seoul National University, KR Authors: Jun S. Shim1, Bogyeong Han1, Yeseong Kim2 and Jihong Kim1 1Seoul National University, KR; 2DGIST, KR Abstract Many system-level management and optimization techniques need accurate estimates of power consumption and performance. Earlier research has proposed many high-level/source-level estimation models, particularly for basic blocks. However, most of them still need to execute the target software at least once on a fine-grained simulator or real hardware to extract the required features. This paper proposes a performance/power prediction framework, called Deep Power Meter (DeepPM), which produces accurate estimates using only the compiled binary. Inspired by deep learning techniques in natural language processing, we convert the program instructions into vector form and predict the average power and performance of basic blocks with a transformer model. In addition, unlike existing works based on a Long Short-Term Memory (LSTM) model structure, which only work for basic blocks with a small number of instructions, DeepPM provides highly accurate results for long basic blocks, which take the majority of the execution time in actual application runs. In our evaluation with the SPEC2006 benchmark suite, we show that DeepPM can provide accurate predictions for performance and power consumption with 10.2% and 12.3% error, respectively. DeepPM also outperforms the LSTM-based model by up to 67.2% and 34.9% in terms of error for performance and power, respectively. |
16:48 CET | 23.3.3 | QUANTIZATION-AWARE IN-SITU TRAINING FOR RELIABLE AND ACCURATE EDGE AI Speaker: João Paulo de Lima, Federal University of Rio Grande do Sul, BR Authors: João Paulo de Lima and Luigi Carro, Federal University of Rio Grande do Sul, BR Abstract In-memory analog computation based on memristor crossbars has become the most promising approach for DNN inference. Because compute and memory requirements are larger during training, memristive crossbars are also an alternative for training DNN models within a feasible energy budget for edge devices, especially in light of trends towards security, privacy, latency, and energy reduction by avoiding data transfer over the Internet. To enable online training and inference on the same device, however, there are still challenges related to the different minimum bitwidths needed in each phase and to memristor non-idealities. We provide an in-situ training framework that allows the network to adapt to hardware imperfections while practically eliminating errors from weight quantization. We validate our methodology on image classification tasks, namely MNIST and CIFAR10, by training NN models with 8-bit weights and quantizing them to 2 bits. The training algorithm recovers up to 12% of the accuracy lost to quantization errors even under high variability, reduces training energy by up to 6x, and allows for energy-efficient inference using a single cell per synapse, hence enhancing robustness and accuracy for a smooth training-to-inference transition. |
16:52 CET | 23.3.4 | ENCORE COMPRESSION: EXPLOITING NARROW-WIDTH VALUES FOR QUANTIZED DEEP NEURAL NETWORKS Speaker: Myeongjae Jang, KAIST, KR Authors: Myeongjae Jang, Jinkwon Kim, Jesung Kim and Soontae Kim, KAIST, KR Abstract Deep Neural Networks (DNNs) have become a practical machine learning approach running on various Neural Processing Units (NPUs). For higher performance and lower hardware overheads, DNN datatype reduction through quantization has been proposed. Moreover, to solve the memory bottleneck caused by the large data size of DNNs, several zero-value-aware compression algorithms are used. However, these compression algorithms do not compress modern quantized DNNs well because of the decreased number of zero values. We find that the latest quantized DNNs have data redundancy due to frequent narrow-width values. Because low-precision quantization reduces DNN datatypes to a simple datatype with fewer bits, scattered DNN data are gathered into a small number of discrete values, which incurs a biased data distribution. Narrow-width values occupy a large proportion of the biased distribution. Moreover, the appropriate number of zero run-length bits can be changed dynamically according to DNN sparsity. Based on these observations, we propose a compression algorithm that exploits narrow-width values and variable zero run-lengths for quantized DNNs. In experiments with three quantized DNNs, our proposed scheme yields an average compression ratio of 2.99. (See the illustrative compression sketch after this session's table.) |
16:56 CET | 23.3.5 | Q&A SESSION Authors: Muhammad Shafique1 and Smail Niar2 1New York University Abu Dhabi, AE; 2Université Polytechnique Hauts-de-France, FR Abstract Questions and answers with the authors |
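The following minimal Python sketch illustrates the general hyperdimensional-computing recipe that GraphHD (23.3.1) builds on: assign random hypervectors to nodes, bind the two endpoints of each edge, and bundle the results into a single graph hypervector that can then be compared against class prototypes. The dimensionality, the bipolar encoding, and the toy graphs are illustrative assumptions and do not reproduce the paper's exact algorithm.

```python
# Minimal sketch of a generic HDC-style graph encoding, in the spirit of
# GraphHD (23.3.1). This is NOT the paper's exact algorithm: the dimension,
# the bipolar node hypervectors, and the bind/bundle choices are assumptions.
import numpy as np

D = 10_000                       # hypervector dimensionality
rng = np.random.default_rng(0)

def node_hv():
    """Random bipolar hypervector representing one node."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Bind two hypervectors (element-wise product for bipolar vectors)."""
    return a * b

def encode_graph(edges, node_hvs):
    """Bundle (sum) the bound edge hypervectors into one graph hypervector."""
    acc = np.zeros(D)
    for u, v in edges:
        acc += bind(node_hvs[u], node_hvs[v])
    return np.sign(acc)          # back to a (near-)bipolar representation

def similarity(x, y):
    """Normalized dot product; class prototypes would be compared this way."""
    return float(np.dot(x, y)) / D

# Toy example: a triangle vs. a path on the same three nodes.
hvs = {n: node_hv() for n in ("a", "b", "c")}
triangle = encode_graph([("a", "b"), ("b", "c"), ("a", "c")], hvs)
path = encode_graph([("a", "b"), ("b", "c")], hvs)
print(similarity(triangle, path))   # high but below 1: structurally related graphs
```

In a classification setting, one would bundle the hypervectors of all training graphs of a class into a prototype and assign a test graph to the most similar prototype; this sketch only shows the encoding step.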
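The sketch below illustrates, at a high level, the two ideas combined by EnCore compression (23.3.4): run-length encoding of zero values and a shorter code for frequent narrow-width values. The token format, field widths, and example data are invented for illustration; the paper's actual hardware encoding is not specified here.

```python
# Illustrative sketch of the two ideas in the EnCore abstract (23.3.4):
# run-length encoding of zeros plus a shorter code for narrow-width values.
# The (kind, payload) token format is invented for illustration and is not
# the encoding used in the paper.
def compress(values, narrow_bits=4):
    """Encode a list of unsigned quantized values as (kind, payload) tokens."""
    tokens, i = [], 0
    narrow_max = (1 << narrow_bits) - 1
    while i < len(values):
        if values[i] == 0:                       # collapse a run of zeros
            run = 0
            while i < len(values) and values[i] == 0:
                run += 1
                i += 1
            tokens.append(("zero_run", run))
        elif values[i] <= narrow_max:            # frequent narrow-width value
            tokens.append(("narrow", values[i]))
            i += 1
        else:                                    # fall back to the full width
            tokens.append(("full", values[i]))
            i += 1
    return tokens

def decompress(tokens):
    out = []
    for kind, payload in tokens:
        out.extend([0] * payload if kind == "zero_run" else [payload])
    return out

data = [0, 0, 0, 3, 1, 0, 0, 14, 200, 2]
tokens = compress(data)
assert decompress(tokens) == data
print(tokens)
```

A real implementation would additionally size the zero run-length field according to the observed sparsity, which is the "variable zero run-length" aspect the abstract mentions.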
23.4 Side-channel attacks and beyond
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 16:40 CET - 17:20 CET
Session chair:
Begül Bilgin, Rambus Cryptography Research, NL
Session co-chair:
Maria Mushtaq, Telecom Paristech, FR
This session covers a variety of attacks and defense mechanisms: a side-channel attack on DNNs, a thermal covert channel on Xeon processors, a power-based side-channel attack on homomorphic encryption, a mitigation technique against Rowhammer attacks, and a secure prefetcher against cache side-channel attacks.
Time | Label | Presentation Title Authors |
---|---|---|
16:40 CET | 23.4.1 | (Best Paper Award Candidate) PREFENDER: A PREFETCHING DEFENDER AGAINST CACHE SIDE CHANNEL ATTACKS AS A PRETENDER Speaker: Lang Feng, Nanjing University, CN Authors: Luyi Li1, Jiayi Huang2, Lang Feng1 and Zhongfeng Wang1 1Nanjing University, CN; 2University of California, Santa Barbara, US Abstract Cache side channel attacks are increasingly alarming in modern processors due to the recent emergence of Spectre and Meltdown attacks. A typical attack performs intentional cache access and manipulates cache states to leak secrets by observing the victim's cache access patterns. Different countermeasures have been proposed to defend against both general and transient execution based attacks. Despite their effectiveness, they all trade some level of performance for security. In this paper, we seek an approach to enforcing security while maintaining performance. We leverage the insight that attackers need to access cache in order to manipulate and observe cache state changes for information leakage. Specifically, we propose Prefender, a secure prefetcher that learns and predicts attack-related accesses for prefetching the cachelines to simultaneously help security and performance. Our results show that Prefender is effective against several cache side channel attacks while maintaining or even improving performance for SPEC CPU2006 benchmarks. |
16:44 CET | 23.4.2 | STEALTHY INFERENCE ATTACK ON DNN VIA CACHE-BASED SIDE-CHANNEL ATTACKS Speaker: Han Wang, University of California Davis, US Authors: Han Wang, Syed Mahbub Hafiz, Kartik Patwari, Chen-Nee Chuah, Zubair Shafiq and Houman Homayoun, University of California Davis, US Abstract The advancement of deep neural networks (DNNs) motivates their deployment in various domains, including image classification, disease diagnosis, voice recognition, etc. Since some tasks that DNNs undertake are very sensitive, the label information is confidential and carries commercial value or critical privacy. The leakage of label information can enable further crimes, like intentionally causing a collision with DNN-enabled autonomous systems, disrupting energy networks with DNN-based control systems, etc. This paper demonstrates that DNNs also bring a new security threat, leading to the leakage of label information of input instances to the DNN models. In particular, we leverage a cache-based side-channel attack (SCA), i.e., Flush+Reload, on the DNN (victim) models to observe the execution of their computation graphs and create a database of them for building a classifier that the attacker can use to determine the label information of (unknown) input instances for victim models. Then we deploy the cache-based SCA on the same host machine as the victim models and deduce the labels with the attacker’s classification model to compromise the privacy and confidentiality of the victim models. We explore different settings and classification techniques to achieve a high success rate in stealing label information from the victim models. Additionally, we consider two attack scenarios: a binary attack that distinguishes specific sensitive labels from all others, and a multi-class attack that recognizes all classes the victim DNNs provide. Last, we implement the attack on both static DNN models, with identical architectures for all inputs, and dynamic DNN models, which adapt their architectures to different inputs, to demonstrate the broad applicability of the proposed attack, including DenseNet 121, DenseNet 169, VGG 16, VGG 19, MobileNet v1, and MobileNet v2. Our experiments show that MobileNet v1 is the most vulnerable, with 99% and 75.6% attack success rates for the binary and multi-class attack scenarios, respectively. |
16:48 CET | 23.4.3 | KNOW YOUR NEIGHBOR: PHYSICALLY LOCATING XEON PROCESSOR CORES ON THE CORE TILE GRID Speaker and Author: Hyungmin Cho, Sungkyunkwan University, KR Abstract The physical locations of the processor cores in multi- or many-core CPUs are often hidden from the users. Current-generation Intel Xeon CPUs accommodate many processor cores on a tile grid, but the exact locations of the individual cores are not plainly visible. We present a methodology for physically locating the cores in Intel Xeon CPUs. Using this method, we collect core-location samples from 300 CPU instances deployed in a commercial cloud platform, which reveal a wide variety of core map patterns. The locations of the individual processor cores are not contiguously mapped, and the mapping pattern can differ for each CPU instance. We also demonstrate that an attacker can exploit an inter-core thermal covert channel using the identified core locations. The attacker can increase the channel capacity by strategically placing multiple sender and receiver nodes. Our evaluation shows that up to 15 bps of data transfer is possible with less than 1% bit error rate in a cloud environment, which is 3 times higher than previously reported results. |
16:52 CET | 23.4.4 | REVEAL: SINGLE-TRACE SIDE-CHANNEL LEAKAGE OF THE SEAL HOMOMORPHIC ENCRYPTION LIBRARY Speaker: Furkan Aydin, North Carolina State University, US Authors: Furkan Aydin1, Emre Karabulut1, Seetal Potluri1, Erdem Alkim2 and Aydin Aysu1 1North Carolina State University, US; 2Department of Computer Science, Dokuz Eylul University, TR Abstract This paper demonstrates the first side-channel attack on homomorphic encryption (HE), which allows computing on encrypted data. We reveal a power-based side-channel leakage of Microsoft’s Simple Encrypted Arithmetic Library (SEAL) that implements the Brakerski/Fan-Vercauteren (BFV) protocol. Our proposed attack targets the discrete Gaussian sampling in the SEAL’s encryption phase and can extract the entire message with a single power measurement. Our attack works by (1) identifying each coefficient index being sampled, (2) extracting the sign value of the coefficients from control-flow variations, (3) recovering the coefficients with a high probability from data-flow variations, and (4) using a Blockwise Korkine-Zolotarev (BKZ) algorithm to efficiently explore and estimate the remaining search space. Using real power measurements, the results on a RISC-V FPGA implementation of the Microsoft SEAL show that the proposed attack can reduce the plaintext encryption security level from 2^{128} to 2^{4.4}. Therefore, as HE gears toward real-world applications, such attacks and related defenses should be considered. |
16:56 CET | 23.4.5 | Q&A SESSION Authors: Begul Bilgin1 and Maria Mushtaq2 1Rambus Cryptography Research, NL; 2Telecom Paristech, FR Abstract Questions and answers with the authors |
C.1 Closing
Add this session to my calendar
Date: Wednesday, 23 March 2022
Time: 18:00 CET - 19:00 CET
Session chair:
Cristiana Bolchini, Politecnico di Milano, IT
Session co-chair:
Ingrid Verbauwhede, KU Leuven, BE
Time | Label | Presentation Title Authors |
---|---|---|
18:00 CET | C.1.1 | CLOSING Speaker: Cristiana Bolchini, Politecnico di Milano, IT Abstract Closing session |
18:30 CET | C.1.2 | AWARDS Speakers: Ingrid Verbauwhede1, Jan Madsen2 and Antonio Miele3 1KU Leuven - COSIC, BE; 2TU Denmark, DK; 3Politecnico di Milano, IT Abstract Award session Jan Madsen: EDAA Dissertation Awards Antonio Miele: Best IP Award |
18:55 CET | C.1.3 | SAVE THE DATE - DATE 2023 Speakers: Ian O'Connor1 and Robert Wille2 1Lyon Institute of Nanotechnology, FR; 2Johannes Kepler University Linz, AT Abstract See you at DATE 2023! |