
D. Jeon

DOI: 10.1109/jssc.2014.2364036
2015
Cited 151 times
An Injectable 64 nW ECG Mixed-Signal SoC in 65 nm for Arrhythmia Monitoring
A syringe-implantable electrocardiography (ECG) monitoring system is proposed. The noise optimization and circuit techniques in the analog front-end (AFE) enable 31 nA current consumption while a minimum energy computation approach in the digital back-end reduces digital energy consumption by 40%. The proposed SoC is fabricated in 65 nm CMOS and consumes 64 nW while successfully detecting atrial fibrillation arrhythmia and storing the irregular waveform in memory in experiments using an ECG simulator, a live sheep, and an isolated sheep heart.
DOI: 10.1109/isscc.2019.8662398
2019
Cited 61 times
7.6 A 65nm 236.5nJ/Classification Neuromorphic Processor with 7.5% Energy Overhead On-Chip Learning Using Direct Spike-Only Feedback
Advances in neural network and machine learning algorithms have sparked a wide array of research in specialized hardware, ranging from high-performance convolutional neural network (CNN) accelerators to energy-efficient deep-neural network (DNN) edge computing systems [1]. While most studies have focused on designing inference engines, recent works have shown that on-chip training could serve practical purposes such as compensating for process variations of in-memory computing [2] or adapting to changing environments in real time [3]. However, these successes were limited to relatively simple tasks mainly due to the large energy overhead of the training process. These problems arise primarily from the high-precision arithmetic and memory required for error propagation and weight updates, in contrast to error-tolerant inference operation; the capacity requirements of a learning system are significantly higher than those of an inference system [4].
DOI: 10.1109/16.30959
1989
Cited 135 times
MOSFET electron inversion layer mobilities-a physically based semi-empirical model for a wide temperature range
A physically based semiempirical model for electron mobilities of MOSFET inversion layers that is valid over a large temperature range (77 K ≤ T ≤ 370 K) is discussed. It is based on a reciprocal sum of three scattering mechanisms, i.e. phonon, Coulomb, and surface roughness scattering, and is explicitly dependent on temperature and transverse electric field. The model is more physically based than other semiempirical models, but has an equivalent number of extracted parameters. It is shown that this model compares more favorably with the experimental data than previous models. The implicit dependencies of the model parameters on oxide charge density and surface roughness are confirmed.
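The reciprocal-sum combination of the three scattering mechanisms described in the abstract (Matthiessen's rule) can be sketched as follows. The functional forms follow the abstract's description, but every prefactor, exponent, and the default charge density are illustrative placeholders, not the paper's extracted parameters.

```python
def effective_mobility(T, E_eff, N_c=1e11):
    """Effective inversion-layer electron mobility (cm^2/V.s) from a
    reciprocal sum (Matthiessen's rule) of three scattering mechanisms.

    T is temperature in K, E_eff the transverse effective field in V/cm,
    N_c a Coulomb-scattering charge density per cm^2. All constants are
    illustrative placeholders, not the paper's fitted parameters.
    """
    mu_ph = 1.3e8 * T**-1.5 * E_eff**-0.3    # phonon scattering
    mu_c = 6.7e11 * T / N_c                  # Coulomb scattering (screening improves with T)
    mu_sr = 2.5e14 / E_eff**2                # surface roughness, dominant at high fields
    return 1.0 / (1.0 / mu_ph + 1.0 / mu_c + 1.0 / mu_sr)

# The combined mobility always sits below the weakest single mechanism
# and falls as the transverse field increases.
mu_300 = effective_mobility(300.0, 5e5)
```

The reciprocal sum is what gives the model its explicit temperature and field dependence: whichever mechanism is weakest at a given (T, E_eff) point limits the total.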
DOI: 10.1109/jssc.2011.2169311
2012
Cited 67 times
A Super-Pipelined Energy Efficient Subthreshold 240 MS/s FFT Core in 65 nm CMOS
This paper proposes a design approach targeting circuits operating at extremely low supply voltages, with the goal of reducing the voltage at which energy is minimized, thereby improving the achievable energy efficiency of the circuit. The proposed methods accomplish this by minimizing the circuit's ratio of leakage to active current. The first method, super pipelining, increases the number of pipeline stages compared to conventional ultra-low-voltage (ULV) pipelining strategies, reducing the leakage/dynamic energy ratio and simultaneously improving performance and energy efficiency. Measurements of super-pipelined multipliers demonstrate 30% energy savings and 1.6× performance improvement. Since super pipelining reduces the logic depth between registers, two-phase latch-based design is employed to compensate for reduced averaging effects and provide better variation tolerance. The second technique introduces a parallel-pipelined architecture that suppresses leakage energy by ensuring full utilization of functional units and reduces memory size. We apply these techniques to a 16-b 1024-pt complex-valued Fast Fourier Transform (FFT) core along with a low-power first-in first-out (FIFO) design and a robust clock distribution network. The FFT core is fabricated in 65 nm CMOS and consumes 15.8 nJ/FFT with a clock frequency of 30 MHz and throughput of 240 Msamples/s at Vdd = 270 mV, providing 2.4× better energy efficiency than the current state of the art and >10× higher throughput than typical ULV designs. Measurements of 60 dies show modest frequency (energy) σ/μ spreads of 7% (2%).
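The claim that lowering the leakage-to-active ratio lowers the minimum-energy voltage can be illustrated with a standard first-order ULV energy model. All constants below are illustrative, normalized placeholders, not measurements from the paper.

```python
import math

def energy_per_op(vdd, depth, c_eff=1.0, i_leak0=1e-3, n=0.035):
    """First-order energy-per-operation model in the ultra-low-voltage
    regime (normalized units; constants are illustrative placeholders).

    Dynamic energy scales as C*Vdd^2, while leakage energy is leakage
    power integrated over a cycle time that grows exponentially as Vdd
    drops into subthreshold. `depth` is the logic depth per pipeline
    stage: super-pipelining shrinks it, cutting leakage energy per op.
    """
    e_dyn = c_eff * vdd**2
    t_cycle = depth * math.exp((0.3 - vdd) / n)   # subthreshold delay growth
    e_leak = i_leak0 * vdd * t_cycle
    return e_dyn + e_leak

def min_energy_vdd(depth):
    """Supply voltage minimizing energy, found by a coarse sweep."""
    vs = [0.10 + 0.001 * i for i in range(400)]   # sweep 0.10-0.50 V
    return min(vs, key=lambda v: energy_per_op(v, depth))

v_shallow = min_energy_vdd(depth=1.0)   # conventional pipelining
v_deep = min_energy_vdd(depth=0.5)      # super-pipelined: half the logic depth
```

Halving the per-stage logic depth halves the leakage energy paid each cycle, so the energy minimum moves to a lower supply voltage, which is the mechanism the paper exploits.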
DOI: 10.1109/jssc.2019.2942367
2020
Cited 39 times
A 65-nm Neuromorphic Image Classification Processor With Energy-Efficient Training Through Direct Spike-Only Feedback
Recent advances in neural network (NN) and machine learning algorithms have sparked a wide array of research in specialized hardware, ranging from high-performance NN accelerators for use inside server systems to energy-efficient edge computing systems. While most of these studies have focused on designing inference engines, implementing the training process of an NN on energy-constrained mobile devices has remained a challenge due to the requirement of higher numerical precision. In this article, we aim to build an on-chip learning system that achieves highly energy-efficient training for NNs without degradation in performance on machine learning tasks. To achieve this goal, we adapt and optimize a neuromorphic learning algorithm and propose hardware design techniques to fully exploit the properties of the modifications. We verify that our system achieves energy-efficient training with only 7.5% more energy consumption compared with its highly efficient inference of 236 nJ/image on handwritten digit [Modified National Institute of Standards and Technology database (MNIST)] images. Moreover, our system achieves 97.83% classification accuracy on the MNIST test data set, which outperforms prior neuromorphic on-chip learning systems and is close to the performance of the conventional method for training deep NNs, backpropagation.
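The direct spike-only feedback idea can be sketched as a direct-feedback-alignment-style update: the binary output error is projected straight to the hidden layer through a fixed random matrix, avoiding the high-precision backpropagated errors the abstract identifies as the main training overhead. The shapes, the feedback matrix, and the exact update rule below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def dfa_spike_update(w_hidden, x, h_spikes, out_spikes, target, B, lr=0.01):
    """One hidden-layer weight update driven only by output spikes.

    err is a ternary spike error (-1, 0, +1); B is a fixed random
    feedback matrix that replaces the transposed weights of backprop.
    Only hidden units that spiked receive an update. Illustrative
    sketch, not the paper's exact rule.
    """
    err = out_spikes - target                  # ternary spike error
    delta_h = (B @ err) * h_spikes             # gate by hidden spike activity
    return w_hidden - lr * np.outer(delta_h, x)

B = np.array([[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]])  # fixed feedback
w = np.zeros((4, 3))
x = np.ones(3)
h = np.array([1.0, 0.0, 1.0, 0.0])             # hidden spike pattern
w_same = dfa_spike_update(w, x, h, np.array([1.0, 0.0]), np.array([1.0, 0.0]), B)
w_diff = dfa_spike_update(w, x, h, np.array([0.0, 1.0]), np.array([1.0, 0.0]), B)
```

Because the error and activities are spikes, the update needs no high-precision multiply, which is what keeps the training energy overhead small.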
DOI: 10.1109/isscc42613.2021.9366031
2021
Cited 33 times
9.3 A 40nm 4.81TFLOPS/W 8b Floating-Point Training Processor for Non-Sparse Neural Networks Using Shared Exponent Bias and 24-Way Fused Multiply-Add Tree
Recent works on mobile deep-learning processors have presented designs that exploit sparsity [2, 3], which is commonly found in various neural networks. However, due to the shift in the machine learning community towards using non-sparse activation functions such as Leaky ReLU or Swish for better training convergence, state-of-the-art models no longer exhibit the sparsity found in conventional ReLU-based models (Fig. 9.3.1, top). Moreover, contrary to error-tolerant image classification tasks, more difficult tasks such as image super-resolution require higher precision than plain 8b integers, not just for training but also for inference without large accuracy degradation (Fig. 9.3.1, bottom). These changes pose new challenges for mobile deep-learning processors: they must process non-sparse networks efficiently and maintain higher precision for more challenging tasks.
DOI: 10.1109/icassp39728.2021.9414852
2021
Cited 32 times
Real-Time Denoising and Dereverberation with Tiny Recurrent U-Net
Modern deep learning-based models have seen outstanding performance improvement on speech enhancement tasks. The number of parameters of state-of-the-art models, however, is often too large to be deployed on devices for real-world applications. To this end, we propose Tiny Recurrent U-Net (TRU-Net), a lightweight online inference model that matches the performance of current state-of-the-art models. The size of the quantized version of TRU-Net is 362 kilobytes, which is small enough to be deployed on edge devices. In addition, we combine the small-sized model with a new masking method called the phase-aware β-sigmoid mask, which enables simultaneous denoising and dereverberation. Results of both objective and subjective evaluations show that our model achieves competitive performance with current state-of-the-art models on benchmark datasets with orders of magnitude fewer parameters.
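The general idea of masking-based enhancement can be sketched as below. This is a generic illustration, not the paper's exact phase-aware β-sigmoid formulation: a sigmoid squashed to (0, β) scales each time-frequency bin of the complex STFT, with β > 1 allowing the mask to boost bins attenuated by reverberation; the noisy phase is left untouched in this simplified version.

```python
import numpy as np

def apply_tf_mask(spec, mask_logits, beta=2.0):
    """Scale each time-frequency bin of a complex STFT by a sigmoid mask
    squashed into (0, beta).

    Generic sketch of mask-based speech enhancement; the paper's
    phase-aware beta-sigmoid mask is not reproduced here.
    """
    mask = beta / (1.0 + np.exp(-mask_logits))   # values in (0, beta)
    return mask * spec                            # phase of `spec` is kept

rng = np.random.default_rng(0)
noisy = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
enhanced = apply_tf_mask(noisy, np.zeros((257, 100)))        # unity mask at beta=2
boosted = apply_tf_mask(noisy, 100.0 * np.ones((257, 100)))  # mask saturates near beta
```

In a real system the `mask_logits` would come from the network's output for each frame, so the masking runs online, one STFT frame at a time.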
DOI: 10.1109/isscc.2011.5746346
2011
Cited 60 times
A 0.27V 30MHz 17.7nJ/transform 1024-pt complex FFT core with super-pipelining
In this paper, the authors show how clocking overhead can be reduced through circuit techniques to facilitate super pipelining, while process variation is addressed through the use of latch-based design. Additionally, architecture modifications are proposed to improve energy efficiency and throughput. Measurements show that the FFT core consumes 17.7nJ per 1024-pt complex FFT while operating at 30MHz at Vdd = 0.27V, demonstrating improved energy efficiency over prior FFT designs.
DOI: 10.1109/isscc.2014.6757494
2014
Cited 56 times
24.3 An implantable 64nW ECG-monitoring mixed-signal SoC for arrhythmia diagnosis
Electrocardiography (ECG) is a critical source of information for a number of heart disorders. In arrhythmia studies and treatment, long-term observation is critical to determine the nature of the abnormality and its severity. However, even small body-wearable systems can impact a patient's everyday life and signals captured using such systems are prone to noise from sources such as 60Hz power and body movement. In contrast, implanted devices are less susceptible to these noise sources and, while having closer-spaced electrodes, can obtain similar quality ECG signals due to their proximity to the heart [1]. In addition, implanted devices enable continuous monitoring without affecting patient quality of life. As in other implantable systems, low power consumption is a critical factor; in this case to provide a sufficiently long operating time between wireless recharge events.
DOI: 10.1109/jssc.2021.3103603
2022
Cited 14 times
A Neural Network Training Processor With 8-Bit Shared Exponent Bias Floating Point and Multiple-Way Fused Multiply-Add Trees
Recent advances in deep neural networks (DNNs) and machine learning algorithms have driven demand for services that require large numbers of computations, and specialized hardware ranging from accelerators for data centers to on-device computing systems has been introduced. Low-precision math such as 8-bit integers has been used for energy-efficient neural network inference, but training with low-precision numbers without performance degradation has remained a challenge. To overcome this challenge, this article presents an 8-bit floating-point neural network training processor for state-of-the-art non-sparse neural networks. As naïve 8-bit floating-point numbers are insufficient for training DNNs robustly, two additional methods are introduced to ensure high-performance DNN training. First, a novel numeric system which we dub 8-bit floating point with shared exponent bias (FP8-SEB) is introduced. Moreover, multiple-way fused multiply-add (FMA) trees are used in FP8-SEB's hardware implementation to ensure higher numerical precision and reduced energy. The FP8-SEB format combined with multiple-way FMA trees is evaluated under various scenarios to show trained-from-scratch performance that is close to, or even surpasses, that of networks trained with full precision (FP32). Our silicon-verified DNN training processor utilizes 24-way FMA trees implemented with FP8-SEB math and flexible 2-D routing schemes to achieve 2.48× higher energy efficiency than prior low-power neural network training processors and 78.1× lower energy than standard GPUs.
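The shared-exponent-bias idea can be sketched as follows: a per-tensor bias re-centers the narrow FP8 exponent window on the tensor's actual magnitude before each value is rounded to its few mantissa bits. The 1-4-3 sign/exponent/mantissa split and the bias-selection rule here are illustrative assumptions, not the paper's exact FP8-SEB definition.

```python
import math

def quantize_fp8_seb(values, exp_bits=4, man_bits=3):
    """Quantize a list of floats to an 8-bit float format whose exponent
    window is re-centered by a per-tensor shared exponent bias.

    Illustrative sketch: the bias is chosen so the largest magnitude in
    the tensor lands at the top of the representable exponent window.
    """
    max_abs = max(abs(v) for v in values)
    bias = math.floor(math.log2(max_abs)) - (2**exp_bits - 1)  # shared exponent bias
    out = []
    for v in values:
        if v == 0.0:
            out.append(0.0)
            continue
        e = math.floor(math.log2(abs(v)))
        e = max(bias, min(bias + 2**exp_bits - 1, e))  # clamp to the biased window
        frac = abs(v) / 2.0**e                          # in [1, 2) when unclamped
        m = round(frac * 2**man_bits) / 2**man_bits     # keep man_bits fraction bits
        out.append(math.copysign(m * 2.0**e, v))
    return out

q = quantize_fp8_seb([0.5, -1.0, 0.123])
```

Because gradients and activations in different layers live at very different scales, moving the window per tensor is what lets only 8 bits cover them all; the FMA trees then accumulate the products at higher internal precision.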
DOI: 10.1109/lra.2023.3251193
2023
Cited 5 times
Curriculum Reinforcement Learning From Avoiding Collisions to Navigating Among Movable Obstacles in Diverse Environments
Curriculum learning has proven highly effective at speeding up training convergence and improving performance in a variety of tasks. Researchers have studied how a curriculum can be constituted to train reinforcement learning (RL) agents in various application domains. However, curriculum sequencing requires ranking sub-tasks or samples in order of difficulty, which has not yet been sufficiently studied for robot navigation problems. It is still an open question what navigation strategies can be learned and transferred during multi-stage transfer learning from easy to hard. Furthermore, despite some attempts at learning real robot manipulation tasks using a curriculum, most existing works are limited to toy or simulated settings rather than realistic scenarios. To address these issues, we first investigated how model convergence in diverse environments relates to navigation strategies and difficulty metrics. We found that only some of the environments can be trained from scratch, such as a relatively open tunnel-like environment that only requires wall following. We then carried out two-stage transfer learning for more difficult environments. We found this approach effective for goal navigation, but it failed for more complex tasks where movable obstacles may be on the navigation path. To facilitate more complex policies in the navigation among movable obstacles (NAMO) task, another curriculum with distance and pace functions appropriate to the difficulty of the environment was developed. The proposed scheme proved effective, and the learned strategies are discussed via comprehensive evaluations conducted in simulated and real environments.
DOI: 10.1109/16.83736
1991
Cited 76 times
A temperature-dependent SOI MOSFET model for high-temperature application (27 °C–300 °C)
A temperature-dependent model for long-channel silicon-on-insulator (SOI) MOSFETs for use in the temperature range 27 °C–300 °C, suitable for circuit simulators such as SPICE, is presented. The model physically accounts for the temperature-dependent effects in SOI MOSFETs (such as threshold-voltage reduction, increase of leakage current, decrease of generation due to impact ionization, and channel mobility degradation with increasing temperature) which are influenced by the uniqueness of the SOI device structure, i.e. the back gate and the floating film body. The model is verified by the good agreement of the simulations with the experimental data. The model is implemented in SPICE2 to be used for circuit and device CAD. Simple SOI CMOS circuits are successfully simulated at different temperatures.
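Two of the competing temperature effects named in the abstract, threshold-voltage reduction and channel-mobility degradation, can be sketched with a textbook first-order model. All parameter values below are generic placeholders, not the paper's extracted fits.

```python
def vth(T, vth0=0.7, k=2e-3, T0=300.0):
    """First-order linear threshold-voltage reduction with temperature (V)."""
    return vth0 - k * (T - T0)

def soi_id_sat(vgs, T, mu0=400.0, cox=3.45e-7, w_over_l=10.0, T0=300.0):
    """Square-law saturation current with T-dependent threshold voltage
    and phonon-limited mobility (illustrative placeholder parameters).

    Near threshold, the Vth drop dominates and current grows with T;
    in strong inversion, mobility degradation wins and current falls.
    """
    mu = mu0 * (T / T0) ** -1.5          # mobility degrades as T rises
    vov = vgs - vth(T)
    return 0.0 if vov <= 0 else 0.5 * mu * cox * w_over_l * vov**2
```

Capturing both trends, with opposite signs depending on the bias point, is what a SPICE-ready high-temperature model has to get right.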
DOI: 10.1109/tvlsi.2018.2875934
2019
Cited 33 times
An Area-Efficient 128-Channel Spike Sorting Processor for Real-Time Neural Recording With 0.175 μW/Channel in 65-nm CMOS
This paper presents a power- and area-efficient spike sorting processor (SSP) for real-time neural recording. The proposed SSP includes novel detection, feature extraction, and improved K-means algorithms for better clustering accuracy, online clustering performance, and lower power and smaller area per channel. Time-multiplexed registers are utilized in the detector for dynamic power reduction. Finally, an ultra-low-voltage 8T static random access memory (SRAM) is developed to reduce area and leakage compared to D flip-flop-based memory. The proposed SSP, fabricated in a 65-nm CMOS process, consumes only 0.175 μW/channel when processing 128 input channels at 3.2 MHz and 0.54 V, which is the lowest among the compared state-of-the-art SSPs. The proposed SSP also occupies 0.003 mm²/channel, which allows 333 channels/mm².
DOI: 10.1109/tvlsi.2019.2893256
2019
Cited 28 times
Enhancing Reliability of Analog Neural Network Processors
Recently, analog and mixed-signal neural network processors have been extensively studied due to their better energy efficiency and small footprint. However, analog computing is more vulnerable to circuit nonidealities, such as process variation, than its digital counterpart. On-chip calibration circuits can be adopted to measure and compensate for those effects, but this leads to unavoidable area and power overheads. In this brief, we propose a variation- and noise-tolerant learning algorithm and a post-silicon process variation compensation technique that does not require any additional monitoring circuitry. The proposed techniques reduce the accuracy degradation of the corrupted fully connected network to 1% under a large amount of variation, including 10% unit-capacitor mismatch, 8-mV rms comparator noise, and 20-mV rms comparator offset.
DOI: 10.1002/aelm.202200378
2022
Cited 11 times
Vertical Metal‐Oxide Electrochemical Memory for High‐Density Synaptic Array Based High‐Performance Neuromorphic Computing
Abstract Cross‐point arrays of analog synaptic devices are expected to realize neuromorphic computing hardware for neural network computations with a compelling speed boost and superior energy efficiency, as opposed to conventional hardware based on the von Neumann architecture. To achieve the desired characteristics of analog synaptic devices for fully parallel vector–matrix multiplication and vector–vector outer‐product updates, metal‐oxide‐based electrochemical random‐access memory (ECRAM) is proposed as a promising synaptic device due to its complementary metal‐oxide‐semiconductor compatibility and outstanding synaptic characteristics compared with other non‐volatile memory candidates. In this work, ECRAM devices with a 3D vertical structure are fabricated to demonstrate a minimal 4F² cell size, a highly scalable channel volume, and low programming energy, providing optimized synaptic device performance and characteristics as well as high integrity as a cross‐point array. Various weight‐update profiles of the vertical ECRAM devices are obtained by adjusting programming voltage pulses, exhibiting trade‐offs among dynamic range, linearity, symmetry, and update deviation. Based on simulation with advanced algorithms for analog cross‐point array and neural network designs, the potential of vertical ECRAM for high‐density arrays is evaluated. Simulation studies suggest that neuromorphic computing performance can be improved further by balancing the weight‐update characteristics of vertical ECRAM.
DOI: 10.1109/jssc.2017.2661838
2017
Cited 31 times
A 23-mW Face Recognition Processor with Mostly-Read 5T Memory in 40-nm CMOS
This paper presents an energy-efficient face detection and recognition processor aimed at mobile applications. Algorithmic optimizations, including a hybrid search scheme for face detection, significantly reduce computational complexity, while architectural modifications such as feature memory segmentation further reduce energy consumption. We utilize characteristics of the implemented algorithm and propose a 5T SRAM design heavily optimized for mostly-read operation. Systematic reset and write schemes allow for reliable data write operation. The 5T SRAM reduces the cell area by 7.2% compared to a conventional 6T bit cell in logic rule while significantly improving read margin and voltage scalability due to a decoupled read path. The fabricated processor consumes only 23 mW while performing both face detection and recognition in real time at 5.5 frames/s.
DOI: 10.1109/jssc.2014.2309692
2014
Cited 25 times
An Energy Efficient Full-Frame Feature Extraction Accelerator With Shift-Latch FIFO in 28 nm CMOS
This paper presents an energy-efficient feature extraction accelerator design aimed at visual navigation. Hardware-oriented algorithmic modifications, such as a circular-shaped sampling region and unified description, are proposed to minimize area and energy consumption while maintaining feature extraction quality. A matched-throughput accelerator employs fully-unrolled filters and a single-stream descriptor enabled by algorithm-architecture co-optimization, which requires a lower clock frequency for the given throughput requirement and reduces the hardware cost of description processing elements. Due to the large number of FIFO blocks, a robust low-power FIFO architecture for the ultra-low voltage (ULV) regime is also proposed. This approach leverages shift-latch delay elements and a balanced-leakage readout technique to achieve 62% energy savings and 37% delay reduction. We apply these techniques to a feature extraction accelerator that can process 30 fps VGA video in real time and is fabricated in 28 nm LP CMOS technology. The design consumes 2.7 mW with a clock frequency of 27 MHz at Vdd = 470 mV, providing 3.5× better energy efficiency than previous state-of-the-art while extracting features from the entire image.
DOI: 10.1109/tcsii.2012.2231036
2012
Cited 25 times
Design Methodology for Voltage-Overscaled Ultra-Low-Power Systems
This paper proposes a design methodology for voltage overscaling (VOS) of ultra-low-power systems. This paper first proposes a probabilistic model of the timing error rate for basic arithmetic units and validates it using both simulations and silicon measurements of multipliers in 65-nm CMOS. The model is then applied to a modified K-best decoder that employs error tolerance to reveal the potential of the framework. With simple modifications and timing error detection-only circuitry, the conventional K-best decoder improves its error tolerance in child node expansion modules by up to 30% with less than 0.4-dB SNR degradation. With this error tolerance, the supply voltage can be overscaled by 12.1%, leading to 22.5% energy savings.
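The probabilistic timing-error model at the heart of the methodology can be illustrated with a small Monte Carlo sketch: critical-path delay follows an alpha-power-law nominal value plus Gaussian variation, and a timing error occurs whenever a sampled delay exceeds the clock period. All constants are illustrative placeholders, not the paper's validated model.

```python
import random

def timing_error_rate(vdd, t_clk, trials=20000, seed=0):
    """Monte Carlo estimate of an arithmetic unit's timing-error rate
    under voltage overscaling (VOS). Illustrative constants only.
    """
    rng = random.Random(seed)
    v_th, alpha = 0.35, 1.5
    t_nom = vdd / (vdd - v_th) ** alpha       # normalized alpha-power-law delay
    errors = sum(1 for _ in range(trials) if rng.gauss(t_nom, 0.05 * t_nom) > t_clk)
    return errors / trials

rate_nominal = timing_error_rate(vdd=1.0, t_clk=2.2)   # comfortable timing margin
rate_scaled = timing_error_rate(vdd=0.85, t_clk=2.2)   # overscaled: margin gone
```

The sharp rise of the error rate once the margin vanishes is exactly why an error-tolerant block such as the modified K-best decoder can absorb moderate overscaling while saving energy.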
DOI: 10.1109/vlsic.2015.7231322
2015
Cited 21 times
A 23mW face recognition accelerator in 40nm CMOS with mostly-read 5T memory
This paper presents a face recognition accelerator for HD (1280×720) images. The proposed design detects faces from the input image using cascaded classifiers. An SVM (Support Vector Machine) performs face recognition based on features extracted by PCA (Principal Component Analysis). Algorithm optimizations include a hybrid search scheme that reduces the workload for face detection by 12×. A new mostly-read 5T memory reduces bitcell area by 7.2% compared to a conventional 6T bitcell while achieving significantly better read reliability and voltage scalability due to a decoupled read path. The resulting design consumes 23mW while processing both face detection and recognition in real time at 5.5 frames/s.
DOI: 10.1109/lca.2018.2889042
2019
Cited 18 times
Improving GPU Multitasking Efficiency Using Dynamic Resource Sharing
As GPUs have become essential components of embedded computing systems, a GPU shared by multiple CPU cores needs to efficiently support concurrent execution of multiple different applications. Spatial multitasking, which assigns a different number of streaming multiprocessors (SMs) to each application, is one of the most common solutions, but it is not a panacea for maximizing total resource utilization. An SM consists of many different sub-resources, such as caches, execution units, and scheduling units, and the per-kernel requirements for these sub-resources are not well matched to their fixed sizes inside an SM. To solve this resource-requirement mismatch problem, this paper proposes GPU Weaver, a dynamic sub-resource management system for multitasking GPUs. GPU Weaver maximizes sub-resource utilization through a shared resource controller (SRC) added between neighboring SMs. The SRC dynamically identifies idle sub-resources of an SM and allows them to be used by the neighboring SM when possible. Experiments show that the combination of multiple sub-resource borrowing techniques enhances total throughput by up to 26 percent, and by 9.5 percent on average, over the baseline spatial multitasking GPU.
DOI: 10.1093/wber/13.1.67
1999
Cited 39 times
The Efficient Mechanism for Downsizing the Public Sector
This article analyzes the efficient mechanism for downsizing the public sector, focusing on adverse selection in productive efficiency. Each worker is assumed to have two type-dependent reservation utilities: the status quo utility in the public sector before downsizing and the utility that the worker expects to obtain by entering the private sector. The efficient mechanism consists of a menu of probability (of remaining in the public sector) and transfer pairs that induces self-selection. A worker's full cost is defined by the sum of production cost in the public sector and reservation utility in the private sector. It is optimal to start by laying off the agents with higher full cost. When the public sector before downsizing is discriminating as the differential of private information about productive efficiency suggests, there are countervailing incentives. This makes the size of downsizing smaller under asymmetric information than under complete information.
DOI: 10.1111/j.1601-183x.2006.00267.x
2006
Cited 34 times
Impaired long‐term memory and long‐term potentiation in N‐type Ca²⁺ channel‐deficient mice
Voltage‐dependent N‐type Ca²⁺ channels, along with the P/Q‐type, have a crucial role in controlling the release of neurotransmitters or neuromodulators at presynaptic terminals. However, their role in hippocampus‐dependent learning and memory has never been examined. Here, we investigated hippocampus‐dependent learning and memory and synaptic plasticity at hippocampal CA3–CA1 synapses in mice deficient for the α1B subunit of N‐type Ca²⁺ channels. The mutant mice exhibited impaired learning and memory in the Morris water maze and the social transmission of food preference tasks. In particular, long‐term memory was impaired in the mutant mice. Interestingly, among activity‐dependent long‐lasting synaptic changes, theta burst‐ or 200‐Hz‐stimulation‐induced long‐term potentiation (LTP) was decreased in the mutant, compared with the wild‐type mice. This type of LTP is known to require brain‐derived neurotrophic factor (BDNF). It was found that both BDNF‐induced potentiation of field excitatory postsynaptic potentials and facilitation of the frequency of miniature excitatory postsynaptic currents (mEPSCs) were reduced in the mutant. Taken together, these results demonstrate that N‐type Ca²⁺ channels are required for hippocampus‐dependent learning and memory, and certain forms of LTP.
DOI: 10.1145/2024724.2024943
2011
Cited 23 times
Pipeline strategy for improving optimal energy efficiency in ultra-low voltage design
This paper investigates pipelining methodologies for the ultra low voltage regime. Based on an analytical model and simulations, we propose a pipelining technique that provides higher energy efficiency and performance than conventional approaches to ultra low voltage design. Two-phase latch based design and sequential circuit optimizations are also proposed to further improve energy efficiency and performance. Silicon results demonstrate a 16b multiplier using the approaches in 65nm CMOS improve energy efficiency by 30% and performance by 60%.
DOI: 10.1109/tbcas.2021.3134660
2021
Cited 11 times
A Multi-Channel Spike Sorting Processor With Accurate Clustering Algorithm Using Convolutional Autoencoder
This paper presents a spike sorting processor based on an accurate spike clustering algorithm. The proposed spike sorting algorithm employs an L2-normalized convolutional autoencoder to extract features from the input, where the autoencoder is trained using the proposed spike sorting-aware loss. In addition, we propose a similarity-based K-means clustering algorithm that conditionally updates the means by observing the cosine similarity. The modified K-means algorithm exhibits better convergence and enables online clustering with higher classification accuracy. We implement a spike sorting processor based on the proposed algorithm using an efficient time-multiplexed hardware architecture in a 40-nm CMOS process. Experimental results show that the processor consumes 224.75 μW/mm² when processing 16 input channels at 7.68 MHz and 0.55 V. Our design achieves 95.54% clustering accuracy, outperforming prior spike sorting processor designs.
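The similarity-gated mean update can be sketched as a small variant of K-means: a sample contributes to its nearest mean only when their cosine similarity exceeds a threshold, keeping outlier spikes from dragging the cluster centers. The threshold value and the initialization below are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def similarity_kmeans(feats, k, sim_thresh=0.9, iters=10):
    """K-means whose mean updates are gated by cosine similarity.

    Sketch of the conditional update described in the abstract;
    threshold and initialization are illustrative assumptions.
    """
    step = max(1, len(feats) // k)
    means = feats[::step][:k].astype(float)            # spread-out initialization
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        fn = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        mn = means / np.linalg.norm(means, axis=1, keepdims=True)
        sims = fn @ mn.T                               # cosine similarity matrix
        labels = sims.argmax(axis=1)
        best = sims.max(axis=1)
        for j in range(k):
            members = feats[(labels == j) & (best > sim_thresh)]
            if len(members):                           # conditional mean update
                means[j] = members.mean(axis=0)
    return labels, means

# Two well-separated groups of toy 2-D spike features.
feats = np.vstack([np.tile([1.0, 0.1], (20, 1)), np.tile([0.1, 1.0], (20, 1))])
labels, means = similarity_kmeans(feats, k=2)
```

Because each update touches only one mean and one similarity row per incoming spike, the same rule maps naturally onto the time-multiplexed hardware the abstract describes.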
DOI: 10.1109/tcsii.2023.3252501
2023
A 65 nm 12.92-nJ/Inference Mixed-Signal Neuromorphic Processor for Image Classification
Spiking neural networks are a promising candidate for next-generation machine learning and are suitable for power-constrained edge devices. In this paper, we present a mixed-signal neuromorphic processor that efficiently implements an echo state network (ESN) and achieves high accuracy without a costly on-chip training process. The design employs a charge-domain computation circuit that efficiently realizes a leaky integrate-and-fire neuron. Combined with optimized sparse connections, the proposed algorithm-hardware co-design approach results in highly energy-efficient operation while delivering high accuracy. Fabricated in a 65nm LP process, the processor is measured to achieve 95.35% MNIST classification accuracy, which closely matches the software model, and an energy efficiency of 12.92nJ per classification.
DOI: 10.1109/isscc.2017.7870304
2017
Cited 16 times
8.4 A 2.5ps 0.8-to-3.2GHz bang-bang phase- and frequency-detector-based all-digital PLL with noise self-adjustment
Digital PLLs are popular for on-chip clock generation due to their small size and technology portability. Variability tolerance is a key design challenge when designing such PLLs in an advanced CMOS technology. Environmental variations, such as mismatch, process, supply voltage, and temperature (PVT), perturb device characteristics and result in performance changes, such as in DCO gain and noise. Another consideration is the wide range of operating modes in which modern digital circuits (e.g., processors) operate. For instance, a clock generator for a processor may produce a range of frequencies from tens of MHz to several GHz depending on required processor performance. In low-frequency mode, power consumption is a greater concern than noise. Therefore, we seek to design a PLL that is both insensitive to environmental variations and reconfigurable to changing noise and power specifications.
DOI: 10.1109/jssc.2017.2776313
2018
Cited 15 times
A Noise Reconfigurable All-Digital Phase-Locked Loop Using a Switched Capacitor-Based Frequency-Locked Loop and a Noise Detector
Programmability is one of the most significant advantages of a digital phase-locked loop (PLL) compared with a charge-pump PLL. In this paper, a digital PLL that extends programmability to include noise is introduced. A digitally controlled oscillator (DCO) using a switched capacitor for frequency feedback is proposed to maintain a constant figure of merit while reconfiguring its noise performance. The proposed DCO offers an accurate and linear frequency tuning curve that is insensitive to environmental changes. A noise detection circuit using the statistical property of a bang-bang phase and frequency detector is proposed to autonomously adjust the output noise level depending on the noise specification. A prototype design is fabricated in a 28-nm FDSOI process. The integrated phase noise of the proposed PLL can be configured from 2.5 to 15 ps, while the power consumption ranges from 1.7 to 5 mW.
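The idea of a noise detector built from the statistics of a bang-bang phase and frequency detector can be sketched as follows: with Gaussian jitter and a deliberate static phase offset, the fraction of "late" decisions follows the Gaussian CDF of the jitter, so inverting it recovers the rms jitter. The simulation model and offset scheme below are illustrative assumptions, not the paper's circuit.

```python
import random
from statistics import NormalDist

def bbpd_decisions(sigma, offset, n=50000, seed=1):
    """Simulate a bang-bang PFD: emit 1 ('late') when the phase error,
    Gaussian jitter plus a deliberate static offset, is positive."""
    rng = random.Random(seed)
    return [1 if rng.gauss(offset, sigma) > 0 else 0 for _ in range(n)]

def estimate_jitter(decisions, offset):
    """Recover rms jitter from the late-decision probability:
    P(late) = Phi(offset / sigma), so sigma = offset / Phi^-1(P(late))."""
    p_late = sum(decisions) / len(decisions)
    return offset / NormalDist().inv_cdf(p_late)

true_sigma = 2.5e-12
sigma_hat = estimate_jitter(bbpd_decisions(true_sigma, offset=1e-12), offset=1e-12)
```

A jitter estimate obtained this way needs only a counter on the BBPD's single-bit output, which is why such a detector can autonomously steer the DCO's noise/power configuration.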
DOI: 10.1109/vlsic.2015.7231327
2015
Cited 14 times
A 120nW 8b sub-ranging SAR ADC with signal-dependent charge recycling for biomedical applications
We present an 8-bit sub-ranging SAR ADC designed for bursty signals with long time periods of small code spread. A modified capacitive DAC (CDAC) saves the previous sample's MSB voltage and reuses it throughout subsequent conversions. This avoids unnecessary switching of the large MSB capacitors as well as redundant conversion cycles, reducing the energy consumed in the comparator and digital logic and yielding total energy savings of 2.6×. In 0.18μm CMOS, the ADC consumes 120nW at 0.6V and 100kS/s with 46.9dB SNDR.
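The cycle-saving mechanism can be approximated with a toy model (the bit split and cycle counts are illustrative, not the paper's exact scheme): a sample landing in the previous sample's MSB sub-range skips the MSB decisions.

```python
import math

def conversion_cycles(samples, msb_bits=3, total_bits=8):
    # Toy model: when a sample falls in the previous sample's MSB
    # sub-range, the stored MSB decision is reused and only the LSB
    # cycles run; otherwise a full conversion is performed.
    prev_range, cycles = None, 0
    for s in samples:
        sub = s >> (total_bits - msb_bits)
        cycles += (total_bits - msb_bits) if sub == prev_range else total_bits
        prev_range = sub
    return cycles

# A slowly varying signal with small code spread mostly reuses the MSBs:
sig = [128 + round(4 * math.sin(0.01 * i)) for i in range(1000)]
print(conversion_cycles(sig) / (8 * len(sig)))  # fraction of conventional SAR cycles
```

For this signal the model spends roughly 5/8 of the conventional cycles; the paper's 2.6× figure also includes the switching energy saved in the MSB capacitors, which this cycle count alone does not capture.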
DOI: 10.1109/isscc.2013.6487684
2013
Cited 14 times
A 470mV 2.7mW feature extraction-accelerator for micro-autonomous vehicle navigation in 28nm CMOS
This paper proposes a power-efficient speeded-up robust features (SURF) extraction accelerator targeted primarily at micro air vehicles (MAVs) with autonomous navigation (Fig. 9.7.1). Typical object recognition SoCs [4-6] employ an application-specific algorithm to choose specific regions of interest (ROIs), reducing computation by focusing on a small portion of the image. However, this approach is not feasible in applications where the whole image must be analyzed, such as visual navigation, which requires the extraction of general features to determine location or movement. In addition, multicore architectures need to run at high clock frequencies to meet high peak performance requirements, and the power consumption of inter-core communication becomes prohibitive. Since feature extraction algorithms require significant memory accesses across a large area, parallelization in a multicore system requires costly high-bandwidth memories for the massive intermediate data.
DOI: 10.1109/esscirc55480.2022.9911450
2022
Cited 5 times
A 28nm 1.644TFLOPS/W Floating-Point Computation SRAM Macro with Variable Precision for Deep Neural Network Inference and Training
This paper presents a digital compute-in-memory (CIM) macro for accelerating deep neural networks. The macro provides high-precision computation required for training deep neural networks and running state-of-the-art models by supporting floating-point MAC operations. Additionally, the design supports variable computation precision, enabling optimized processing for different models and tasks. The design achieves 1.644TFLOPS/W energy efficiency and 57.9GFLOPS/mm <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> computation density while supporting a wide range of floating-point data formats and computation precisions.
DOI: 10.1109/tcsii.2020.3007760
2021
Cited 8 times
A 65nm 0.6–1.2V Low-Dropout Regulator Using Voltage-Difference-to-Time Converter With Direct Output Feedback
This brief proposes a fully integrated, output-capacitor-less low-dropout regulator (LDO) using a voltage-difference-to-time converter (VDTC). The proposed dynamic-amplifier-based VDTC allows low-voltage operation and significantly reduces the quiescent current. The linear characteristics of the VDTC result in ripple-free output operation and good regulation performance. Using direct output feedback through a small coupling capacitor, the gate voltage of the power transistor is instantly compensated to mitigate output-voltage fluctuation when a sharp load transient occurs. Fabricated in 65 nm LP CMOS, the proposed LDO demonstrates a wide operating range, with an input voltage range of 0.6-1.2 V and a load current of over 30 mA across all voltages, without an output capacitor. With reduced output impedance due to direct output feedback, the measured undershoot is 158 mV, recovered in 9.6 μs, when the load current changes by 28 mA in 1 ns. The peak current efficiency is more than 99.99% and the figure of merit (FOM) is 0.202 fs. The active area of the control block is 0.002 mm<sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup>.
DOI: 10.1109/icicdt.2012.6232880
2012
Cited 10 times
Extending energy-saving voltage scaling in ultra low voltage integrated circuit designs
In this paper, we propose several design approaches to extend useful voltage scaling (i.e., voltage scaling with net energy savings) beyond the conventional limit, which is imposed by the rapid increase of leakage energy overhead in ultra-low-voltage regimes. We achieve this extra voltage scaling, and thus energy savings, without compromising performance or variability by minimizing the ratio of leakage to dynamic energy in a circuit. Novel design approaches in pipeline, clocking, and architecture optimization are investigated and applied during the design of a 16b 1024pt complex FFT core. Measurement results from the prototyped FFT core in a 65nm CMOS show an energy consumption of 15.8nJ/FFT at a clock frequency of 30MHz and a throughput of 240Msamples/s at a supply voltage of 270mV, which exhibits 2.4× higher energy efficiency and >10× higher throughput than previous low-power FFT designs. Measurement of 60 dies shows modest frequency and energy σ/μ spreads of 7% and 2%, respectively.
DOI: 10.1109/isscc42615.2023.10067646
2023
22.8 A 0.81mm<sup>2</sup> 740μW Real-Time Speech Enhancement Processor Using Multiplier-Less PE Arrays for Hearing Aids in 28nm CMOS
Speech enhancement (SE) is a task to improve voice quality and intelligibility by removing noise from audio, and it is widely adopted in hearing assistive devices. Hearing aids are generally worn in or behind the ear, requiring real-time processing within a limited power budget. Deep learning-based algorithms provide excellent SE quality, but their large model size and high computational complexity make them unsuitable for wearable hearing assistive devices. Recent hardware-oriented works mitigate these issues through algorithm and hardware optimization [1–3]. Nonetheless, they exhibit inferior SE performance relative to state-of-the-art models or rely on large neural network models, limiting overall processing efficiency. This paper presents an end-to-end SE system that delivers high-quality SE with low power consumption and small area, while meeting real-time processing constraints. Our main contributions are: 1) an importance-aware neural network optimization method that significantly reduces computational costs while maintaining enhancement quality, 2) a reconfigurable processing element (PE) for efficiently processing both the coordinate rotation digital computer (CORDIC) algorithm and neural network layers, and 3) PE input routing and weight mapping schemes that minimize processing latency by enhancing PE utilization. Based on these design techniques, our processor fabricated in a 28nm CMOS process fulfills real-time speech enhancement by processing each frame within 8ms while consuming only 740μW. Also, the design achieves the highest objective evaluation score compared to previous SE processors.
DOI: 10.1109/isscc.2016.7418084
2016
Cited 8 times
24.1 A 0.6V 8mW 3D vision processor for a navigation device for the visually impaired
3D imaging devices, such as stereo and time-of-flight (ToF) cameras, measure distances to the observed points and generate a depth image where each pixel represents a distance to the corresponding location. The depth image can be converted into a 3D point cloud using simple linear operations. This spatial information provides detailed understanding of the environment and is currently employed in a wide range of applications such as human motion capture [1]. However, its distinct characteristics from conventional color images necessitate different approaches to efficiently extract useful information. This paper describes a low-power vision processor for processing such 3D image data. The processor achieves high energy-efficiency through a parallelized reconfigurable architecture and hardware-oriented algorithmic optimizations. The processor will be used as a part of a navigation device for the visually impaired (Fig. 24.1.1). This handheld or body-worn device is designed to detect safe areas and obstacles and provide feedback to a user. We employ a ToF camera as the main sensor in this system since it has a small form factor and requires relatively low computational complexity [2].
DOI: 10.1109/cicc.2014.6946070
2014
Cited 7 times
Circuit techniques for miniaturized biomedical sensors
DOI: 10.1109/vlsic.2016.7573467
2016
Cited 7 times
A 128-channel spike sorting processor featuring 0.175 µW and 0.0033 mm<sup>2</sup>per channel in 65-nm CMOS
This paper presents a power- and area-efficient processor for real-time neural spike sorting. We propose a robust spike detector (SD), a feature extractor (FE), and an improved k-means algorithm for better clustering accuracy. Furthermore, a time-multiplexing architecture is used in the SD for dynamic power reduction. A customized 39kb 8T SRAM is also implemented to minimize leakage and storage area. The proposed processor consumes 0.175 µW/ch, with leakage of 0.03 µW/ch, at 0.54 V and occupies 0.0033 mm<sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup>/ch.
DOI: 10.48550/arxiv.2102.03207
2021
Cited 6 times
Real-time Denoising and Dereverberation with Tiny Recurrent U-Net
Modern deep learning-based models have shown outstanding performance improvements on speech enhancement tasks. The number of parameters of state-of-the-art models, however, is often too large for them to be deployed on devices for real-world applications. To this end, we propose Tiny Recurrent U-Net (TRU-Net), a lightweight online inference model that matches the performance of current state-of-the-art models. The size of the quantized version of TRU-Net is 362 kilobytes, which is small enough to be deployed on edge devices. In addition, we combine the small-sized model with a new masking method called the phase-aware $β$-sigmoid mask, which enables simultaneous denoising and dereverberation. Results of both objective and subjective evaluations show that our model achieves competitive performance with current state-of-the-art models on benchmark datasets while using orders of magnitude fewer parameters.
DOI: 10.1109/icassp.2011.5946828
2011
Cited 7 times
Energy-optimized high performance FFT processor
This paper proposes an ultra low energy FFT processor suitable for sensor applications. The processor is based on R4MDC but achieves full utilization of computational elements. It has two parallel datapaths that increase throughput by a factor of 2 and also enable high memory utilization. The proposed design is implemented in 65nm CMOS technology and post-layout simulation including parasitic capacitances shows it achieves 9.25× higher energy efficiency than state-of-the-art FFT processors and high throughput relative to past subthreshold circuit implementations.
DOI: 10.1016/j.nima.2018.10.151
2020
Cited 6 times
TID-Tolerant Inverter Designs for Radiation-Hardened Digital Systems
This work experimentally compares total ionizing dose (TID) effects on various inverter designs, which are fundamental components for implementing radiation-hardening-by-design (RHBD) digital circuits. Based on prior works reporting that the leakage current variation of NMOS transistors is significantly larger than that of PMOS transistors, this work suggests design methodologies to alleviate TID effects on NMOS transistors with the following inverter topologies: the stacked NMOS inverter, pseudo-PMOS inverter, PMOS-only inverter, and dummy-transistor inverter. We also investigated different inverter sizes and PN ratios to optimize them for a more robust design that can operate in high-radiation environments. These designs were fabricated in a 180 nm CMOS process, and performance degradation was measured using a 60Co source. Experimental results show that the stacked NMOS inverter provides the best performance in terms of switching point variation, area, and power consumption. In addition, larger transistor sizes and PN ratios further improve TID hardness. Given that the inverter is an essential and basic building block of digital systems, the proposed techniques can be adopted in any system requiring operation in radiation-emitting environments, e.g., measurement devices in nuclear power plants or electronics in space.
DOI: 10.1109/tvlsi.2022.3180774
2022
Cited 3 times
A 270-mA Self-Calibrating-Clocked Output-Capacitor-Free LDO With 0.15–1.15V Output Range and 0.183-fs FoM
This article proposes a fully integrated output-capacitor-free low-dropout regulator (LDO) for mobile applications. To overcome the limited output voltage range of typical analog LDOs, our design uses a rail-to-rail voltage-difference-to-time converter (VDTC) and a charge pump (CP) to achieve a wide output range. Using a self-calibrating clock generator (SCCG) removes the need for an external clock source and adaptively tunes the clock frequency, enabling fast transient responses while minimizing quiescent current. A tunable undershoot compensator (TUC) mitigates voltage droop by detecting the drop in the output voltage due to a sharp increase in load current and compensating the output voltage immediately. The proposed LDO is fabricated in a 65-nm low power (LP) CMOS process and demonstrates a maximum load current capacity of 270 mA. The input and output voltage ranges of the LDO are 0.5–1.2 and 0.15–1.15 V, respectively, with 12.7-μA quiescent current and 99.99% peak current efficiency. The measured undershoot and settling time are 150 mV and 100 ns at a slew rate of 200 mA/3 ns, respectively, achieving a figure of merit (FoM) of 0.183 fs.
DOI: 10.1109/sips50750.2020.9195236
2020
Cited 5 times
A Modified Serial Commutator Architecture for Real-Valued Fast Fourier Transform
This paper presents a modified real-valued serial commutator (mRSC) architecture for the real-valued fast Fourier transform (RFFT). The mRSC architecture is based on an optimized data management scheme, which reduces processing latency as well as hardware resources. Combined with a new rotator circuit that requires fewer delay cells, the mRSC architecture achieves lower latency and implementation cost than prior art. In addition, we show that the mRSC architecture can be further modified to accommodate the real-valued inverse FFT (RIFFT) with minimal changes, and we propose an invertible mRSC (imRSC) architecture. For an N-point FFT, the mRSC and imRSC architectures require N + 5log<sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sub>N - 9 and N + 6log<sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sub>N - 11 real delay cells, respectively, while both architectures have a latency of N + 2log<sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sub>N - 5.
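The delay-cell and latency expressions above are easy to tabulate; a quick check for a 64-point RFFT (formulas taken directly from the abstract):

```python
import math

def mrsc_costs(N):
    # Delay-cell and latency counts quoted for an N-point real-valued FFT.
    n = int(math.log2(N))
    return {"mRSC_cells": N + 5 * n - 9,
            "imRSC_cells": N + 6 * n - 11,
            "latency": N + 2 * n - 5}

print(mrsc_costs(64))  # → {'mRSC_cells': 85, 'imRSC_cells': 89, 'latency': 71}
```

All three costs grow essentially linearly in N, with only logarithmic additive terms, which is what makes the serial-commutator family attractive for small hardware budgets.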
DOI: 10.1109/isocc53507.2021.9613990
2021
Cited 4 times
Reducing Refresh Overhead with In-DRAM Error Correction Codes
DRAM technology scaling has continuously improved memory density, but the limited cell capacitance makes cells more susceptible to reliability issues. Hence, employing in-DRAM ECC has become inevitable. Moreover, the performance and power overheads of refresh operations have become a critical issue as DRAM density increases. Therefore, it is important to reduce the refresh overhead without sacrificing DRAM reliability. In this paper, we propose a retention-aware refresh scheme with in-DRAM ECC. The key idea of our proposed method is that the in-DRAM ECC can correct a single-bit error, which effectively reduces the number of weak rows that have to be refreshed every 64ms. A runtime profiler is also proposed to keep up-to-date information on weak rows and address the variable retention time problem. Our experiments with SPEC benchmarks show up to a 6.8% improvement in performance and up to a 15.4% reduction in power consumption compared with conventional refresh schemes.
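The key idea — that single-error correction exempts rows with at most one weak cell from the fast refresh rate — can be illustrated with a toy Monte Carlo model (row size and weak-cell probability are illustrative, not DRAM data):

```python
import random

def weak_row_counts(rows, cells_per_row, weak_prob, seed=7):
    # Without ECC, any weak cell forces the fast 64ms refresh for its row;
    # with single-error-correcting in-DRAM ECC, only rows holding two or
    # more weak cells still need it.
    rng = random.Random(seed)
    no_ecc = with_ecc = 0
    for _ in range(rows):
        weak = sum(rng.random() < weak_prob for _ in range(cells_per_row))
        no_ecc += weak >= 1
        with_ecc += weak >= 2
    return no_ecc, with_ecc

n, w = weak_row_counts(rows=5000, cells_per_row=256, weak_prob=5e-4)
print(n, w)  # ECC leaves far fewer rows on the fast refresh schedule
```

Because weak cells are rare and roughly independent, rows with two or more of them are quadratically rarer than rows with one, which is why a single-bit-correcting code removes most of the refresh burden.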
DOI: 10.1587/elex.16.20190296
2019
Cited 4 times
Power-up control techniques for reliable SRAM PUF
A physically unclonable function (PUF) is a widely used hardware-level identification method. SRAM PUFs are the most well-known PUF topology, but they typically suffer from low reproducibility due to non-deterministic behavior and noise during the power-up process. In this work, we propose two power-up control techniques that effectively improve the reproducibility of SRAM PUFs. The techniques reduce undesirable bit flipping during evaluation by controlling either the evaluation region or the power-supply ramp-up speed. Measurement results from the 180 nm test chip confirm that native unstable bits (NUBs) are reduced by 54.87% and the bit error rate (BER) decreases by 55.05%, while reproducibility increases by 2.2×.
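The NUB and BER metrics above can be computed from repeated power-up readouts; a minimal sketch (a majority-vote golden response is a common convention, not necessarily the paper's exact procedure):

```python
def puf_stats(readouts):
    # readouts: equal-length bit-strings from repeated power-ups of one chip.
    n_bits, n_reads = len(readouts[0]), len(readouts)
    # Golden response by per-bit majority vote.
    golden = "".join(
        "1" if sum(r[i] == "1" for r in readouts) * 2 > n_reads else "0"
        for i in range(n_bits))
    # BER: fraction of readout bits disagreeing with the golden response.
    ber = sum(r[i] != golden[i]
              for r in readouts for i in range(n_bits)) / (n_reads * n_bits)
    # Native unstable bits: positions that ever change across readouts.
    nub = sum(len({r[i] for r in readouts}) > 1 for i in range(n_bits))
    return golden, ber, nub

print(puf_stats(["1011", "1011", "1001"]))  # one flaky bit at position 2
```

The power-up control techniques in the paper aim to shrink exactly these two quantities: fewer positions that ever flip (NUBs), and fewer per-readout mismatches against the golden response (BER).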
DOI: 10.1109/access.2019.2910559
2019
Cited 3 times
Microarchitecture-Aware Code Generation for Deep Learning on Single-ISA Heterogeneous Multi-Core Mobile Processors
While single-ISA heterogeneous multi-core processors are widely used in mobile computing, typical code generations optimize the code for a single target core, leaving it less suitable for the other cores in the processor. We present a microarchitecture-aware code generation methodology to mitigate this issue. We first suggest adopting Function-Multi-Versioning (FMV) to execute application codes utilizing a core at full capacity regardless of its microarchitecture. We also propose to add a simple but powerful backend optimization pass in the compiler to further boost the performance of applicable cores. Based on these schemes, we developed an automated flow that analyzes the program and generates multiple versions of hot functions tailored to different microarchitectures. At runtime, the running core chooses an optimal version to maximize computation performance. The measurements confirm that the methodology improves the performance of Cortex-A55 and Cortex-A75 cores in Samsung's next-generation Exynos 9820 processor by 11.2% and 17.9%, respectively, while running TensorFlow Lite.
DOI: 10.1109/tvlsi.2023.3257198
2023
A Real-Time Object Detection Processor With xnor-Based Variable-Precision Computing Unit
Object detection algorithms using convolutional neural networks (CNNs) achieve high detection accuracy, but it is challenging to realize real-time object detection due to their high computational complexity, especially on resource-constrained mobile platforms. In this article, we propose an algorithm-hardware co-optimization approach to designing a real-time object detection system. We first develop a compact object detection model based on a binarized neural network (BNN), which employs a new layer structure, the DenseToRes layer, to mitigate information loss due to deep quantization. We also propose an efficient object detection processor that runs object detection with high throughput using limited hardware resources. We develop a resource-efficient processing unit supporting variable precision with minimal hardware overheads. Implemented on a field-programmable gate array (FPGA), the object detection processor achieves 64.51 frames/s throughput with 64.92 mean average precision (mAP) accuracy. Compared to prior FPGA-based designs for object detection, our design achieves high throughput with competitive accuracy and lower hardware implementation costs.
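The xnor-based computing unit in the title builds on a standard BNN identity: with ±1 values encoded as single bits, a dot product reduces to an XNOR and a popcount. A sketch of that identity (not the processor's actual datapath):

```python
def xnor_dot(a_bits, b_bits, n):
    # n-bit integers where bit=1 encodes +1 and bit=0 encodes -1.
    # The +/-1 dot product reduces to counting bitwise matches:
    #   dot = n - 2 * popcount(a XOR b)   (XNOR counts the matches)
    return n - 2 * bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")

# Cross-check against explicit +/-1 arithmetic:
a, b, n = 0b1011, 0b0011, 4
ref = sum((1 if (a >> i) & 1 else -1) * (1 if (b >> i) & 1 else -1)
          for i in range(n))
print(xnor_dot(a, b, n), ref)  # → 2 2
```

Replacing multipliers with XNOR gates and popcount trees is what makes deeply quantized detection networks cheap enough for real-time FPGA implementation.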
DOI: 10.1109/esscirc59616.2023.10268686
2023
A 4.27TFLOPS/W FP4/FP8 Hybrid-Precision Neural Network Training Processor Using Shift-Add MAC and Reconfigurable PE Array
This paper presents an energy-efficient FP4/FP8 hybrid-precision training processor. Through hardware-software co-optimization, the design efficiently implements all general matrix multiply (GEMM) operations required for training using only shift-add multiply-accumulate (MAC) units. The reconfigurable processing element (PE) array further improves efficiency by significantly reducing on-chip memory access. The on-chip convolution decomposition technique supports a wide range of kernels using simple homogeneous data routing. Fabricated in 40nm CMOS, the processor achieves 2.61TFLOPS/W real-model efficiency for ResNet-18 training, outperforming prior art by 59%.
DOI: 10.37736/kjlr.2023.10.14.5.09
2023
Research Trend Analysis and Implications on Writing Education for Foreign Students
This study examines trends in writing-related educational research for foreign students and discusses the results. To this end, a comprehensive review of 95 papers published from 2001 to August 2023 was conducted. Various data analysis techniques were employed, including frequency analysis, centrality analysis, and topic modeling, to extract valuable insights. The keywords obtained from the frequency analysis were “university” (190 times), “purpose” (168 times), “competence” (151 times), “academic” (146 times), “target” (143 times), “course” (139 times), “study abroad” (136 times), “analysis” (129 times), “content” (124 times), and “textbook” (117 times). The centrality analysis yielded similar words. The topics extracted through topic modeling were “selection of textbooks suitable for student ability,” “strengthening writing skills using reading materials,” “writing education through cultural understanding,” “utilization of self-reflection and narrative writing,” and “academic writing for academic purposes.” Based on these results, research trends were analyzed and their implications discussed.
DOI: 10.33645/cnc.2023.11.45.11.65
2023
Research Trend Analysis on Assessment of Online Korean Education
This paper examines the evaluation of online Korean language education during the pandemic, when non-face-to-face classes became widespread, and identifies research trends in the area. Research on online Korean language education evaluation began in 2006, when the influx of foreigners became active, and various big data analysis methods were used to draw conclusions and gain insight into educational research trends. 88 academic papers were retrieved through RISS (Research Information Sharing Service); the extracted words were limited to nouns, and frequency analysis, centrality analysis (degree, closeness, and betweenness centrality), and topic modeling were conducted to derive and discuss the results.
DOI: 10.1109/nssmicrtsd49126.2023.10338357
2023
A Boosted Active Quenching Circuit with Detector Capacitance Compensation
This paper presents a detector capacitance compensation (DCC) technique that achieves double active quenching of a single-photon avalanche diode (SPAD). A well-designed active quenching circuit offers a higher maximum photon-counting rate and lower afterpulsing probability, and quenching time and area are important criteria in its implementation. The proposed active quenching circuit employing the DCC technique achieves a 2.5× shorter quenching time than the conventional design, with an area increase of just 23%. The schematic and simulation results are presented in this paper. The designed circuits are fabricated in 180 nm CMOS technology and verified through simulation.
DOI: 10.1109/tcsii.2021.3132063
2022
An In-Memory Computing SRAM Macro for Memory-Augmented Neural Network
In-Memory Computing (IMC) has been widely studied to mitigate data transfer bottlenecks in von Neumann architectures. Recently proposed IMC circuit topologies dramatically reduce data transfer requirements by performing various operations such as Multiply-Accumulate (MAC) inside the memory. In this paper, we present an SRAM macro designed for accelerating Memory-Augmented Neural Network (MANN). We first propose algorithmic optimizations for a few-shot learning algorithm employing MANN for efficient hardware implementation. Then, we present an SRAM macro that efficiently accelerates the algorithm by realizing key operations such as L1 distance calculation and Winner-Take-All (WTA) operation through mixed-signal computation circuits. Fabricated in 40nm LP CMOS technology, the design demonstrates 27.7 TOPS/W maximum energy efficiency, while achieving 93.40% and 98.28% classification accuracy for 5-way 1-shot and 5-way 5-shot learning on the Omniglot dataset, which closely matches the accuracy of the baseline algorithm.
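The macro's key operations — L1 distance calculation followed by winner-take-all — implement nearest-neighbor matching against stored memory keys. A behavioral sketch of that computation (the labels and vectors are made up; the chip realizes this with mixed-signal circuits, not software):

```python
def wta_classify(query, memory):
    # memory: list of (key_vector, label) pairs stored in the macro.
    # L1 distances are computed against every key; the winner-take-all
    # stage keeps only the closest key's label.
    best_label, best_dist = None, float("inf")
    for key, label in memory:
        d = sum(abs(q - k) for q, k in zip(query, key))
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label

mem = [([0, 0, 1], "cat"), ([1, 1, 0], "dog")]
print(wta_classify([1, 0, 0], mem))  # → dog
```

In a few-shot setting the memory holds one key per support example, so classification cost is dominated by the distance computation — exactly the part the SRAM macro accelerates in place.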
DOI: 10.1109/tcad.2022.3155444
2022
An Automatic Circuit Design Framework for Level Shifter Circuits
Although design automation is a key enabler of modern large-scale digital systems, automating the transistor-level circuit design process still remains a challenge. Some recent works suggest that deep learning algorithms could be adopted to find optimal transistor dimensions in relatively small circuits such as analog amplifiers. However, those approaches are not capable of exploring different circuit structures to meet the given design constraints. In this work, we propose an automatic circuit design framework that can generate practical circuit structures from scratch as well as optimize the size of each transistor, considering performance and reliability. We employ the framework to design level shifter circuits, and the experimental results show that the framework produces novel level shifter circuit topologies and that the automatically optimized designs achieve 2.8×–5.3× lower power-delay product (PDP) than prior designs by human experts.
2014
Energy-Efficient Digital Signal Processing Hardware Design.
DOI: 10.1109/isocc53507.2021.9613943
2021
Fast Automatic Circuit Optimization Using Deep Learning
With recent advances in deep learning, a wide range of automatic circuit optimization techniques have been reported. While optimization algorithms such as reinforcement learning can be effectively accelerated on GPUs, they still require numerous SPICE simulations for training. Running SPICE simulations is time-consuming and hence becomes the bottleneck in circuit optimization, so it is crucial to reduce the number of simulations required. In this paper, we propose a method to minimize SPICE simulations in the circuit design automation process. Specifically, we develop a data augmentation method that increases the amount of training data without additional SPICE simulations, resulting in faster circuit optimization.
DOI: 10.1109/isocc.2016.7799746
2016
Design techniques for ultra-efficient computing
This paper reviews various energy saving techniques at different design levels. Aside from conventional voltage scaling and low power circuit techniques, systematic approaches including energy-aware architecture design, system-level power optimization and application-specific low power circuit design are presented along with demonstration systems.
DOI: 10.1109/apccas.2016.7804030
2016
Live demonstration: A 128-channel spike sorting processor featuring 0.175 μW and 0.0033 mm<sup>2</sup>per Channel in 65-nm CMOS
Multi-electrode intracranial recording technology is in high demand for various applications such as brain studies, brain-machine interfaces (BMI), and the treatment of disorders like epilepsy, memory loss, and paralysis. Recording is done by inserting electrodes into the extracellular tissue of the brain to capture single-unit activity. However, the recorded signal is the summation of the activity of several nearby neurons and background noise. Therefore, after the signal is recorded and digitized by the analog front-end, the first crucial step is to extract the information from the extracellular recording [1]. This process, called spike sorting, is shown in Fig. 1. It assigns each detected spike to its source neuron located near the corresponding recording electrode.
DOI: 10.1109/icassp.2013.6638152
2013
A low-power VGA full-frame feature extraction processor
This paper proposes an energy-efficient VGA full-frame feature extraction processor. It is based on the SURF algorithm and makes various algorithmic modifications to improve efficiency and reduce hardware overhead while maintaining extraction performance. The low clock frequency and deep parallelism derived from a one-sample-per-cycle matched-throughput architecture provide significantly larger room for voltage scaling and enable full-frame extraction. The proposed design consumes 4.7mW at 400mV and achieves 72% higher energy efficiency than prior work.
DOI: 10.1109/aicas54282.2022.9869947
2022
A low power neural network training processor with 8-bit floating point with a shared exponent bias and fused multiply add trees
This demonstration showcases a neural network training processor implemented in 40nm LP CMOS technology. Based on a custom 8-bit floating-point format with a shared exponent bias and efficient tree-based processing schemes and dataflow, we achieve 2.48× higher energy efficiency than a prior low-power neural network training processor.
DOI: 10.1049/el.2018.0997
2018
Autonomous high‐speed serial link power management depending on required link performance for HMC
Many studies on 3D-stacked dynamic RAMs (DRAMs) have been conducted to overcome the shortcomings of conventional DRAM. The hybrid memory cube (HMC) is one of the most promising 3D-stacked DRAMs thanks to its high bandwidth and expandable structure. However, the high-speed serial link that interfaces the CPU and HMC consumes significant power, primarily because of the high overhead incurred in synchronizing its clock. Although the link provides low-power modes, managing them is difficult because of their long mode-transition times. We propose an autonomous power management method for the high-speed link. The proposed method determines the optimal number of active links while satisfying the required link performance. Simulations demonstrate that the proposed method reduces link power consumption by an average of 63.06% with a performance degradation of only 1.36%, making autonomous link power management an attractive option for low-power HMC-based systems.
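The core policy — keep only as many links active as the required bandwidth demands — can be sketched as follows (the bandwidth numbers are illustrative, not HMC specifications):

```python
def active_links(required_gbps, per_link_gbps, max_links):
    # Smallest link count whose aggregate bandwidth meets the requirement;
    # the remaining links can be placed in a low-power mode.
    for n in range(1, max_links + 1):
        if n * per_link_gbps >= required_gbps:
            return n
    return max_links  # requirement exceeds capacity: keep everything on

print(active_links(required_gbps=45, per_link_gbps=20, max_links=4))  # → 3
```

The hard part the paper addresses is deciding *when* to apply this policy autonomously, since the long mode-transition times make naive switching counterproductive; the sketch only captures the steady-state link-count choice.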
DOI: 10.48550/arxiv.2006.14317
2020
A Fast Finite Field Multiplier for SIKE
Various post-quantum cryptography algorithms have recently been proposed. Supersingular isogeny Diffie-Hellman key exchange (SIKE) is one of the most promising candidates due to its small key size. However, the SIKE scheme requires numerous finite field multiplications for its isogeny computation, and hence suffers from a slow encryption and decryption process. In this paper, we propose a fast finite field multiplier design that performs multiplications in GF(p) with high throughput and low latency. The design accelerates the computation by adopting deep pipelining and achieves high hardware utilization through data interleaving. The proposed finite field multiplier demonstrates 4.48× higher throughput than prior work based on the identical fast multiplication algorithm and 1.43× higher throughput than the state-of-the-art fast finite field multiplier design aimed at SIKE.
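As background for the GF(p) datapath, a bit-serial interleaved modular multiplier — a classic hardware-friendly formulation, not necessarily the exact algorithm this design pipelines — keeps the partial product reduced at every step:

```python
def gf_mul_interleaved(a, b, p, bits):
    # Interleaved shift-and-add modular multiplication: each iteration
    # doubles the partial product, conditionally adds a, and reduces
    # mod p, so the full-width product a*b is never formed.
    r = 0
    for i in range(bits - 1, -1, -1):
        r = (r << 1) + (a if (b >> i) & 1 else 0)  # r <= 3p - 3 here
        if r >= p: r -= p
        if r >= p: r -= p                          # two conditional subtractions
    return r

p = 2**13 - 1  # small Mersenne prime for illustration; SIKE primes are far larger
print(gf_mul_interleaved(5000, 7777, p, 13) == (5000 * 7777) % p)  # → True
```

Because each iteration depends only on the previous partial result, such a loop maps naturally onto a deep pipeline, and independent multiplications can be interleaved through its stages to keep the hardware fully utilized, as the abstract describes.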
DOI: 10.1109/tvlsi.2021.3097341
2021
Dynamic Block-Wise Local Learning Algorithm for Efficient Neural Network Training
In the backpropagation algorithm, the error calculated at the output of the neural network must propagate back through the layers to update the weights of each layer, making it difficult to parallelize the training process and requiring frequent off-chip memory access. Local learning algorithms locally generate the error signals used for weight updates, removing the need for backpropagation of error signals. However, prior works rely on large, complex auxiliary networks for reliable training, which results in large computational overhead and undermines the advantages of local learning. In this work, we propose a local learning algorithm that significantly reduces computational complexity as well as improves training performance. Our algorithm combines multiple consecutive layers into a block and performs local learning on a block-by-block basis, while dynamically changing block boundaries during training. In experiments, our approach achieves 95.68% and 79.42% test accuracy on the CIFAR-10 and CIFAR-100 datasets, respectively, using a small fully connected layer as the auxiliary network, closely matching the performance of the backpropagation algorithm. Multiply-accumulate (MAC) operations and off-chip memory accesses are also reduced by up to 15% and 81%, respectively, compared to backpropagation.
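The block structure is easy to picture: consecutive layers are grouped into blocks, each trained with its own local loss, and the boundaries move during training. A structural sketch (the boundary schedule is illustrative, not the paper's):

```python
def make_blocks(n_layers, boundaries):
    # Split layer indices [0, n_layers) into consecutive blocks; each
    # block gets its own local loss, so no error signal has to cross
    # a block boundary during training.
    cuts = [0] + sorted(boundaries) + [n_layers]
    return [list(range(cuts[i], cuts[i + 1])) for i in range(len(cuts) - 1)]

# Dynamically moving the boundary over training epochs:
for epoch, cut in enumerate([2, 3, 4]):
    print(epoch, make_blocks(6, [cut]))
```

Shorter blocks mean cheaper, more parallel updates but weaker gradients; moving the boundaries during training is the paper's way of trading between the two regimes over time.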
2021
Activation Sharing with Asymmetric Paths Solves Weight Transport Problem without Bidirectional Connection