Bimonthly    Since 1986
ISSN 1004-9037
Indexed in:
SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
 
  • Table of Contents
      05 November 2014, Volume 29 Issue 6   
    Computer Architectures and Systems
    A General Low-Cost Indirect Branch Prediction Using Target Address Pointers
    Zi-Chao Xie, Dong Tong, Ming-Kai Huang
    Journal of Data Acquisition and Processing, 2014, 29 (6): 929-946. 
    Abstract   PDF(4463KB)
    Nowadays, energy efficiency has become the primary design metric in chip development. To pursue higher energy efficiency, processor architects should reduce or eliminate unnecessary energy dissipation. Indirect-branch prediction has become a performance bottleneck, especially for applications written in object-oriented languages. Previous hardware-based indirect-branch predictors are generally inefficient: they either require significant hardware storage or predict indirect-branch targets slowly. In this paper, we propose an energy-efficient indirect-branch prediction technique called TAP (target address pointer) prediction. Its key idea has two parts: utilizing specific hardware pointers to accelerate the indirect-branch prediction flow, and reusing existing processor components to reduce additional hardware cost and power consumption. When fetching an indirect branch, TAP prediction first obtains specific pointers, called target address pointers, from the conditional branch predictor, and then uses these pointers to generate virtual addresses that index the indirect-branch targets. This technique achieves prediction latency comparable to that of dedicated-storage techniques without requiring large amounts of additional storage. Our evaluation shows that TAP prediction combined with representative state-of-the-art branch predictors improves performance significantly over the baseline processor. Compared with previous hardware-based indirect-branch predictors, the TAP-Perceptron scheme achieves a performance improvement equivalent to that of an 8K-entry TTC predictor, and also outperforms the VPC predictor.
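    A minimal software sketch of the prediction flow may help readers unfamiliar with it; the structure sizes, hash function, and table organization below are illustrative assumptions for exposition, not the hardware parameters evaluated in the paper.

        # Toy model of the TAP prediction flow (hypothetical parameters).
        class TAPPredictor:
            def __init__(self, num_pointers=4):
                self.num_pointers = num_pointers
                # Stands in for reused processor storage holding indirect-branch
                # targets, indexed by small "virtual addresses".
                self.target_store = {}

            def _pointers(self, branch_pc, history):
                # Assumption: pointers are derived from the same PC/history hash
                # a conditional branch predictor would use.
                base = hash((branch_pc, history)) & 0xFFFF
                return [(base + i) & 0xFFFF for i in range(self.num_pointers)]

            def predict(self, branch_pc, history):
                # Use the first pointer that maps to a stored target.
                for ptr in self._pointers(branch_pc, history):
                    if ptr in self.target_store:
                        return self.target_store[ptr]
                return None  # no prediction available

            def update(self, branch_pc, history, actual_target):
                # Train on the resolved target.
                self.target_store[self._pointers(branch_pc, history)[0]] = actual_target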
    Retention Benefit Based Intelligent Cache Replacement
    Ling-Da Li, Jun-Lin Lu, Xu Cheng
    Journal of Data Acquisition and Processing, 2014, 29 (6): 947-961. 
    Abstract   PDF(4461KB)
    The performance loss resulting from different cache misses is variable in modern systems for two reasons: 1) memory access latency is not uniform, and 2) the latency tolerance of processor cores varies across different misses. Compared with parallel misses and store misses, isolated fetch and load misses are more costly. This variation in cache miss penalty suggests that the cache replacement policy should take it into account. To that end, we first propose the notion of retention benefit. Retention benefits can evaluate not only the increase in processor stall cycles on cache misses, but also the reduction in processor stall cycles due to cache hits. We then propose Retention Benefit Based Replacement (RBR), which aims to maximize the aggregate retention benefit of the blocks kept in the cache. RBR keeps track of the total retention benefit of each block in the cache, and on replacement it preferentially evicts the block with the minimum total retention benefit. The evaluation shows that RBR can improve cache performance significantly in both single-core and multi-core environments while requiring a low storage overhead. It also outperforms other state-of-the-art techniques.
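    The eviction rule can be illustrated with a minimal sketch; the benefit-update values passed in below are placeholders, whereas the paper derives them from miss penalties and hit savings.

        # Minimal sketch of retention-benefit-based eviction (illustrative only).
        class RBRCacheSet:
            def __init__(self, num_ways=8):
                self.num_ways = num_ways
                self.benefit = {}  # block tag -> accumulated retention benefit

            def access(self, tag, benefit_delta):
                if tag in self.benefit:                 # hit: credit the block
                    self.benefit[tag] += benefit_delta
                    return True
                if len(self.benefit) >= self.num_ways:  # miss with a full set:
                    victim = min(self.benefit, key=self.benefit.get)
                    del self.benefit[victim]            # evict minimum-benefit block
                self.benefit[tag] = benefit_delta       # insert the new block
                return False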
    A Static Greedy and Dynamic Adaptive Thread Spawning Approach for Loop-Level Parallelism
    Mei-Rong Li, Yin-Liang Zhao, You Tao, Qi-Ming Wang
    Journal of Data Acquisition and Processing, 2014, 29 (6): 962-975. 
    Abstract   PDF(3783KB)
    Thread-level speculation has become increasingly attractive for exploiting thread-level parallelism in irregular sequential applications. However, speculative threads commonly fail to reach the expected parallel performance. The reason is that their performance not only suffers from the imprecision of compiler-directed performance estimation caused by ambiguous control and data dependences, but also depends on the underlying hardware configuration and program behavior. This paper therefore proposes a statically greedy and dynamically adaptive approach for loop-level speculation that determines the best loop level at runtime. It relies on the compiler to greedily select and optimize all loop candidates, which then undergo a cost-benefit analysis across different loop nesting levels to determine the order of loop speculation. Guided by runtime loop execution prediction, we dynamically schedule and update the order of loop speculation, ensuring that the best loop level is always the one parallelized. Two different policies are also examined to maximize overall performance. Compared with traditional static loop selection techniques, our approach achieves comparable or better performance.
    An Intra-Server Interconnect Fabric for Heterogeneous Computing
    Zheng Cao, Xiao-Li Liu, Qiang Li, Xiao-Bing Liu, Zhan Wang, Xue-Jun An
    Journal of Data Acquisition and Processing, 2014, 29 (6): 976-988. 
    Abstract   PDF(3697KB)
    With the increasing diversity of application needs and computing units, servers with heterogeneous processors are more and more widespread. However, the conventional SMP/ccNUMA server architecture introduces a communication bottleneck between heterogeneous processors and uses them only as coprocessors, which limits the efficiency and flexibility with which they can be employed. To solve this problem, this paper proposes an intra-server interconnect fabric that supports both intra-server peer-to-peer interconnection and I/O resource sharing among heterogeneous processors. By connecting processors and I/O devices with the proposed fabric, heterogeneous processors can communicate directly with each other and run in stand-alone mode with shared intra-server resources. We design the proposed fabric by extending the de-facto system I/O bus protocol PCIe (Peripheral Component Interconnect Express) and implement it in a single chip, cZodiac. By making full use of PCIe's original advantages, the interconnection and I/O sharing mechanisms are lightweight and efficient. Evaluations carried out on both an FPGA (Field Programmable Gate Array) prototype and a cycle-accurate simulator demonstrate that our design is feasible and scalable. In addition, our design is suitable not only for heterogeneous servers but also for high-density servers.
    Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor
    Yi-Qun Liu, Yan Li, Yun-Quan Zhang, Xian-Yi Zhang
    Journal of Data Acquisition and Processing, 2014, 29 (6): 989-1002. 
    Abstract   PDF(2825KB)
    Equipped with 512-bit wide SIMD instructions and a large number of computing cores, the emerging x86-based Intel® Many Integrated Core (MIC) architecture provides not only high floating-point performance but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely studied algorithm; however, the conventional algorithm needs to traverse the data array three times. In each pass, it computes multiple 1D FFTs along one of the three dimensions, giving rise to many non-unit-stride memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the remaining dimensions, respectively. The difference in the number of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better reuse of data in the local cache. On top of this, a number of optimization techniques, such as memory padding, loop transformation, and vectorization, are employed in our implementation to further enhance performance. We evaluate the algorithm on the Intel® Xeon Phi™ coprocessor 7110P and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which beats the vendor-specific Intel® MKL library by a factor of up to 2.22.
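    For reference, the conventional three-pass baseline that the paper improves on can be sketched in a few lines of NumPy; the two-pass variant itself (splitting one axis and fusing each sub-transform with one of the remaining passes) is not reproduced here.

        import numpy as np

        # Conventional three-pass 3D FFT: one sweep of strided 1D FFTs per axis,
        # so the whole data array is traversed three times.
        def fft3d_three_pass(x):
            x = np.fft.fft(x, axis=0)   # pass 1
            x = np.fft.fft(x, axis=1)   # pass 2
            x = np.fft.fft(x, axis=2)   # pass 3
            return x

        a = np.random.rand(32, 32, 32) + 1j * np.random.rand(32, 32, 32)
        assert np.allclose(fft3d_three_pass(a), np.fft.fftn(a))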
    Improved Blocking Time Analysis and Evaluation for the Multiprocessor Priority Ceiling Protocol
    Mao-Lin Yang, Hang Lei, Yong Liao, Furkan Rabee
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1003-1013. 
    Abstract   PDF(1176KB)
    The Multiprocessor Priority Ceiling Protocol (MPCP) is a classic suspension-based real-time locking protocol for partitioned fixed-priority (P-FP) scheduling. However, the existing blocking time analysis is pessimistic under P-FP + MPCP scheduling, which negatively impacts the schedulability of real-time tasks. In this paper, we model each task as an alternating sequence of normal and critical sections, and use both the best-case execution time (BCET) and the worst-case execution time (WCET) to describe the execution requirement of each section. Based on this model, a novel analysis is proposed to bound shared resource requests. The analysis uses the BCET to derive a lower bound on the inter-arrival time of shared resource requests, and uses the WCET to obtain an upper bound on the time a task spends executing critical sections during an arbitrary time interval Δt. Based on this analysis, an improved blocking analysis and its associated worst-case response time (WCRT) analysis are proposed for P-FP + MPCP scheduling. Schedulability experiments indicate that the proposed method outperforms existing methods and improves schedulability significantly.
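    A simplified version of the bounding idea, stated as code: the BCET of the sections separating two requests lower-bounds their inter-arrival time, and the WCET of a critical section caps the per-request cost, so together they bound the shared-resource demand in any window Δt. The formulas below are an illustrative sketch, not the exact equations derived in the paper.

        import math

        def max_requests(delta_t, min_inter_arrival):
            # Upper bound on the number of resource requests a task can issue in
            # any window of length delta_t (min_inter_arrival is BCET-derived).
            return math.ceil(delta_t / min_inter_arrival) + 1

        def max_cs_execution(delta_t, min_inter_arrival, cs_wcet):
            # Upper bound on critical-section execution demanded in that window.
            return max_requests(delta_t, min_inter_arrival) * cs_wcet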
    Computer Graphics and Multimedia
    Feature-Adaptive Rendering of Loop Subdivision Surfaces on Modern GPUs
    Yun-Cen Huang, Jie-Qing Feng, Matthias Nießner, Yuan-Min Cui, Baoguang Yang
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1014-1025. 
    Abstract   PDF(15180KB)
    We present a novel approach for real-time rendering of Loop subdivision surfaces on modern graphics hardware. Our algorithm evaluates both positions and normals accurately, thus providing the true Loop subdivision surface. The core idea is to recursively refine irregular patches using a GPU compute kernel. All generated regular patches are then directly evaluated and rendered using the hardware tessellation unit. Our approach handles triangular control meshes of arbitrary topology and incorporates common subdivision surface features such as semi-sharp creases and hierarchical edits. While surface rendering is accurate up to machine precision, we also enforce a consistent bitwise evaluation of positions and normals at patch boundaries. This is particularly useful in the context of displacement mapping, which strictly requires matching surface normals. Furthermore, we incorporate efficient level-of-detail rendering where subdivision depth and tessellation density can be adjusted on the fly. Overall, our algorithm provides high-quality results at real-time frame rates, making it ideally suited to interactive rendering applications such as video games or authoring tools.
    A Two-Step Regularization Framework for Non-Local Means
    Zhong-Gui Sun, Song-Can Chen, Li-Shan Qiao
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1026-1037. 
    Abstract   PDF(2829KB)
    As an effective patch-based denoising method, the non-local means (NLM) method achieves favorable denoising performance over its local counterparts and has drawn wide attention in the image processing community. The implementation of NLM can formally be decomposed into two sequential steps, i.e., computing the weights and then using the weights to compute the weighted means. In the first step, the weights can be obtained by solving a regularized optimization problem, and in the second step, the means can be obtained by solving a weighted least squares problem. Motivated by these observations, we establish a two-step regularization framework for NLM in this paper. Using the framework, we reinterpret several non-local filters in a unified view. Further, taking the framework as a design platform, we develop a novel non-local median filter for removing salt-and-pepper noise, with encouraging experimental results.
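    A minimal one-dimensional sketch of the two steps, using the usual exponential kernel for the weights; the patch size and filtering parameter h are illustrative choices, and no search-window restriction is applied.

        import numpy as np

        def nlm_1d(signal, patch_radius=2, h=0.1):
            signal = np.asarray(signal, dtype=float)
            n = len(signal)
            pad = np.pad(signal, patch_radius, mode='reflect')
            patches = np.array([pad[i:i + 2 * patch_radius + 1] for i in range(n)])
            out = np.empty(n)
            for i in range(n):
                # Step 1: weights from patch distances (a regularized problem in
                # the framework above; here the standard exponential kernel).
                d2 = np.mean((patches - patches[i]) ** 2, axis=1)
                w = np.exp(-d2 / h ** 2)
                # Step 2: the weighted least squares solution is the weighted mean.
                out[i] = np.sum(w * signal) / np.sum(w)
            return out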
    A Holistic Approach for Efficient Contour Detection
    Hong Cheng, Lin Chen
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1038-1047. 
    Abstract   PDF(7648KB)
    Object contours contain important visual information that can be applied to numerous vision tasks. Recent algorithms focus on the accuracy of contour detection, but the entailed time complexity is significantly high. In this paper, we propose an efficient and effective contour extraction method based on both local cues from pixels and global cues from saliency. Experimental results demonstrate that the proposed approach achieves a good trade-off between accuracy and speed for contour detection.
    Mammogram Enhancement Using Lifting Dyadic Wavelet Transform and Normalized Tsallis Entropy
    Muhammad Hussain
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1048-1057. 
    Abstract   PDF(9767KB)
    In this paper, we present a new technique for mammogram enhancement using the fast dyadic wavelet transform (FDyWT) based on lifted spline dyadic wavelets and normalized Tsallis entropy. First, a mammogram image is decomposed into a multiscale hierarchy of low-subband and high-subband images using FDyWT. Then noise is suppressed using the normalized Tsallis entropy of the local variance of the modulus of the oriented high-subband images. After that, the wavelet coefficients of the high subbands are modified using a non-linear operator, and finally the low-subband image at the first scale is modified with a power-law transformation to suppress the background. Though FDyWT is shift-invariant and has better potential for detecting singularities such as edges, its performance depends on the choice of dyadic wavelets. On the other hand, the number of vanishing moments is an important characteristic of dyadic wavelets for singularity analysis because it provides an upper-bound measurement for singularity characterization. Using lifting dyadic schemes, we construct lifted spline dyadic wavelets of different degrees with an increased number of vanishing moments, and we examine the effect of these wavelets on mammogram enhancement. The method is tested on mammogram images taken from the MIAS (Mammographic Image Analysis Society) database, which have various background tissue types and contain different abnormalities. The comparison with state-of-the-art contrast enhancement methods reveals that the proposed method performs better, and the difference is statistically significant.
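    For readers unfamiliar with the entropy measure used above, a small sketch of the normalized Tsallis entropy of a discrete distribution follows; normalizing by the uniform-distribution maximum is one common convention (an assumption here), and the entropic index q is a free parameter.

        import numpy as np

        def normalized_tsallis_entropy(p, q=2.0):
            p = np.asarray(p, dtype=float)
            p = p / p.sum()                                   # ensure a distribution
            s_q = (1.0 - np.sum(p ** q)) / (q - 1.0)          # Tsallis entropy
            s_max = (1.0 - len(p) ** (1.0 - q)) / (q - 1.0)   # uniform-distribution maximum
            return s_q / s_max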
    Artificial Intelligence and Pattern Recognition
    Semisupervised Sparse Multilinear Discriminant Analysis
    Kai Huang, Li-Qing Zhang
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1058-1071. 
    Abstract   PDF(7527KB)
    Various problems are encountered when adopting ordinary vector-space algorithms for high-order tensor data input. Namely, one must overcome the Small Sample Size (SSS) and overfitting problems. In addition, the structural information of the original tensor signal is lost during the vectorization process. Therefore, comparable methods using a direct tensor input are more appropriate. In the case of electrocardiograms (ECGs), another problem must be overcome: the manual diagnosis of ECG data is expensive and time consuming, rendering it difficult to acquire data with diagnosis labels. Moreover, the features effective for classification are very sparse in the original data. We therefore propose a semisupervised sparse multilinear discriminant analysis (SSSMDA) method, which uses the distribution of both the labeled and the unlabeled data, together with labels discovered through a label propagation algorithm. In practice, we use 12-lead ECGs collected from a remote diagnosis system and apply a short-time Fourier transform (STFT) to obtain third-order tensors. The experimental results highlight the sparsity of the ECG data and the ability of our method to extract sparse and effective features that can be used for classification.
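    As an illustration of the tensor construction step, the sketch below turns a 12-lead ECG record into a third-order tensor with SciPy's STFT; the sampling rate, window length, and the use of magnitude spectrograms are assumptions, not necessarily the paper's settings.

        import numpy as np
        from scipy.signal import stft

        def ecg_to_tensor(ecg, fs=500, nperseg=128):
            # ecg: array of shape (12, num_samples), one row per lead
            slices = []
            for lead in ecg:
                _, _, zxx = stft(lead, fs=fs, nperseg=nperseg)
                slices.append(np.abs(zxx))     # magnitude spectrogram per lead
            return np.stack(slices)            # axes: (lead, frequency, time)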
    Merge-Weighted Dynamic Time Warping for Speech Recognition
    Xiang-Lilan Zhang, Zhi-Gang Luo, Ming Li
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1072-1082. 
    Abstract   PDF(1625KB)
    Obtaining training material for rarely used English words and for common given names from countries where English is not spoken is difficult due to excessive time, storage, and cost factors. Taking personal privacy into account as well, a language-independent (LI), lightweight speaker-dependent (SD) automatic speech recognition (ASR) system is a convenient option for solving this problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR in real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control in vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, coarse approximation induced by its constraints, and inaccuracy. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
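    For context, the core DTW recurrence that MWDTW builds on is sketched below; the merge weighting and the template confidence index themselves are not reproduced here.

        import numpy as np

        def dtw_distance(x, y):
            # x, y: sequences of feature vectors (e.g., per-frame acoustic features)
            n, m = len(x), len(y)
            d = np.full((n + 1, m + 1), np.inf)
            d[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(np.asarray(x[i - 1]) - np.asarray(y[j - 1]))
                    d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
            return d[n, m]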
    CPL: Detecting Protein Complexes by Propagating Labels on Protein-Protein Interaction Network
    Qi-Guo Dai, Mao-Zu Guo, Xiao-Yan Liu, Zhi-Xia Teng, Chun-Yu Wang
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1083-1093. 
    Abstract   PDF(2184KB)
    Proteins usually bind together to form complexes, which play an important role in cellular activities. Many graph clustering methods have been proposed to identify protein complexes by finding dense regions in protein-protein interaction networks. We present a novel framework (CPL) that detects protein complexes by propagating labels through the interactions in a network, where the labels denote complex identifiers. With proper propagation in CPL, proteins in the same complex will be assigned the same label. Unlike previous methods, CPL does not make any strong assumptions about the topological structures of the complexes. The CPL algorithm is tested on several publicly available yeast protein-protein interaction networks and compared with several state-of-the-art methods. The results suggest that CPL performs better than the existing methods. An analysis of functional homogeneity based on gene ontology shows that the complexes detected by CPL are highly biologically relevant.
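    A toy version of label propagation on an interaction graph, in the spirit of CPL, is sketched below; the initialization, tie-breaking, and weighting rules here are simplifications, not the paper's exact rules.

        from collections import Counter

        def propagate_labels(adjacency, max_iters=20):
            # adjacency: dict mapping every protein to a list of interacting proteins
            labels = {p: p for p in adjacency}          # unique initial labels
            for _ in range(max_iters):
                changed = False
                for p, neighbors in adjacency.items():
                    if not neighbors:
                        continue
                    # Adopt the label most common among neighbors.
                    best = Counter(labels[n] for n in neighbors).most_common(1)[0][0]
                    if labels[p] != best:
                        labels[p], changed = best, True
                if not changed:
                    break
            return labels   # proteins sharing a label form a predicted complex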
    Regular Paper
    Complete Bipartite Anonymity for Location Privacy
    Kai Dong, Tao Gu, Xian-Ping Tao, Jian Lv
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1094-1110. 
    Abstract   PDF(4158KB)
    Users are vulnerable to privacy risks when providing their location information to location-based services (LBS). Existing work sacrifices the quality of LBS by degrading spatial and temporal accuracy to ensure user privacy. In this paper, we propose a novel approach, Complete Bipartite Anonymity (CBA), aiming to achieve both user privacy and quality of service. The theoretical basis of CBA is that if the bipartite graph of k nearby users' paths can be transformed into a complete bipartite graph, then these users achieve k-anonymity, since the set of "points connecting to a specific start point in a graph" is an equivalence class. To achieve CBA, we design a Collaborative Path Confusion (CPC) protocol which enables nearby users to discover and authenticate each other without knowing their real identities or accurate locations, to predict the encounter location using the users' moving-pattern information, and to generate fake traces obfuscating the real ones. We evaluate CBA using a real-world dataset and compare its privacy performance with an existing path confusion approach. The results show that CBA enhances location privacy by increasing the chance of a user confusing his/her path with others' by 4 to 16 times in low user density areas. We also demonstrate that CBA is secure under the trace identification attack.
    Automated Power Control for Virtualized Infrastructures
    Yu Wen, Wei-Ping Wang, Li Guo, Dan Meng
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1111-1122. 
    Abstract   PDF(1185KB)
    Power control for virtualized environments has gained much attention recently. One of the major challenges is keeping the underlying infrastructure in reasonably low power states while still achieving the service-level objectives (SLOs) of the applications above it. Existing solutions, however, cannot effectively tackle this problem for virtualized environments. In this paper, we propose an automated power control solution for such scenarios. The major advantage of our solution is its ability to precisely control the CPU frequency levels of a physical environment and the CPU power allocations among virtual machines with respect to the SLOs of multiple applications. Based on control theory and online model estimation, our solution can adapt to variations in application power demands. Additionally, our solution can simultaneously manage CPU power control for all virtual machines according to their dependencies at either the application level or the infrastructure level. The experimental evaluation demonstrates that our solution outperforms three state-of-the-art methods in terms of achieving the application SLOs with low infrastructure power consumption.
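    A highly simplified feedback step in the spirit of the controller described above is shown below; the gain, units, and single-VM scope are illustrative assumptions, whereas the paper's controller coordinates multiple VMs and performs online model estimation.

        def control_step(allocation, measured_perf, slo_target, gain=0.1,
                         min_alloc=0.05, max_alloc=1.0):
            # Integral action: raise the VM's CPU power allocation when the SLO
            # is being missed, lower it when there is headroom.
            error = slo_target - measured_perf
            allocation += gain * error
            return min(max(allocation, min_alloc), max_alloc)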
    Improved Linear Attacks on the Chinese Block Cipher Standard
    Ming-Jie Liu, Jia-Zhe Chen
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1123-1133. 
    Abstract   PDF(409KB)
    The block cipher used in the Chinese Wireless LAN Standard (WAPI), SMS4, was recently renamed SM4 and became the block cipher standard issued by the Chinese government. This paper gives a method for finding linear approximations of SMS4. With this method, 19-round one-dimensional approximations are given, which are used to improve the previous linear cryptanalysis of SMS4. The 19-round approximations hold with bias 2^-62.27; we use one of them to mount a linear attack on 23-round SMS4. Our attack improves on the previous 23-round attacks by reducing the time complexity. Furthermore, the data complexity of our attack is further improved by a multidimensional linear approach.
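    As background (not the paper's derivation), the piling-up lemma is the standard tool for chaining single-round linear approximations into multi-round ones such as the 19-round approximations above: n independent approximations with biases eps_i combine to a bias of 2^(n-1) * prod(eps_i).

        from functools import reduce

        def combined_bias(biases):
            # Piling-up lemma for independent linear approximations.
            return 2 ** (len(biases) - 1) * reduce(lambda a, b: a * b, biases, 1.0)

        # Example: four approximations, each with bias 2^-3, combine to bias 2^-9.
        print(combined_bias([2 ** -3] * 4))   # 0.001953125 == 2**-9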
    Improved Linear Cryptanalysis of CAST-256
    Jing-Yuan Zhao, Mei-Qin Wang, Long Wen
    Journal of Data Acquisition and Processing, 2014, 29 (6): 1134-1139. 
    Abstract   PDF(412KB)
    CAST-256, a first-round AES (Advanced Encryption Standard) candidate, is designed based on CAST-128. It is a 48-round Generalized Feistel Network cipher with a 128-bit block, accepting keys of 128, 160, 192, 224, or 256 bits. Its S-boxes are non-surjective, with 8-bit input and 32-bit output. Wang et al. identified a 21-round linear approximation and gave a key recovery attack on 24-round CAST-256. At ASIACRYPT 2012, Bogdanov et al. presented a multidimensional zero-correlation linear cryptanalysis of 28 rounds of CAST-256. By observing the property of the concatenation of a forward quad-round and a reverse quad-round and choosing the proper active round function, we construct a linear approximation of 26-round CAST-256 and recover partial key information for 32 rounds of CAST-256. In terms of the number of rounds attacked, this is so far the best attack on CAST-256 without a weak-key assumption.
E-mail: info@sjcjycl.cn
 
  Copyright ©2015 JCST, All Rights Reserved