Monday, April 27, 2020

Parallelization of PageRank and HITS Algorithm on CUDA

The PageRank and HITS algorithms are widely known approaches to determining the importance and popularity of web pages. Because of the large number of documents available on the World Wide Web, a huge amount of computation is required to rank web pages, which makes the process very time consuming. Researchers have devoted much attention to parallelizing PageRank on PC clusters, grids, and multi-core processors like the Cell Broadband Engine to overcome this issue, but with little or no success. In this paper, we discuss the issues in porting these algorithms to the Compute Unified Device Architecture (CUDA) and introduce an efficient parallel implementation of these algorithms on CUDA that exploits the block structure of the web. It not only cuts down the computation time but also significantly reduces the cost of the hardware required (only a few thousand).

1. Introduction

At present, the unceasing growth of the World Wide Web has led to a lot of research on the page ranking algorithms used by search engines to provide the most relevant results for any particular query. The dynamic and diverse nature of the web graph further exacerbates the challenge of achieving optimum results. Web link analysis provides a way to order web pages by studying the link structure of web graphs. PageRank and HITS (Hyperlink Induced Topic Search) are the two most popular such algorithms, widely used by current search engines, either in the same or in modified form, to rank documents based on their link structure. PageRank, originally introduced by Brin and Page, is based on the idea that a web page is more important if many other web pages link to it.
At its core, it consists of iterating continuously over the web graph until the rank assigned to each page converges to a stable value. In contrast to PageRank, the similar HITS algorithm, developed by Kleinberg [11], ranks documents on the basis of two scores that it assigns to a particular set of documents dependent on a specific query, although the basis of the computation is the same for both. The enormous size of the web [15] makes the need for fast implementations of these algorithms very clear. To date, several approaches have been designed to accelerate this algorithm, like exploiting the block structure of the web [6] and running in parallel environments like PC clusters [4, 5, 10], but they have brought their own overheads, like huge hardware cost and approximate results. Some research has also been done on implementing this algorithm on multi-core processors like the Cell Broadband Engine [9], but issues such as random reads and writes to a large memory have led to even poorer performance. In [9], it is shown that the implementation on Cell is 22 times slower than on an Intel Xeon Quad Core 2. GHz. Parallel implementation of these algorithms involves issues like the lack of any specific order in the number of pages that point to a particular page and randomness in the links between nodes, both of which hinder load balancing, which is the basis of any parallel implementation. This paper addresses these issues and proposes a way of exploiting the block structure of the web that exists at a much lower level.
Our parallel implementation of these algorithms on NVIDIA's multi-core CUDA architecture not only reduces the computation time but also requires much cheaper hardware. In our study, we have used a standard input of approximately one million documents, generated with the widely accepted WebGraph framework from the publicly available datasets in [16].

This paper is organized as follows. Section 2 describes the block structure of the web, the issues involved in porting PageRank to the CUDA architecture, the proposed parallel implementation, and results. Section 3 discusses the parallel implementation of the HITS algorithm and its results. Finally, Section 4 concludes.

2. PageRank

2.1. Algorithm

Let Rank(p) denote the rank of web page p from the set of all web pages P. Let S_p be the set of all web pages that point to page p, and let N_u be the out-degree of a page u ∈ S_p. The importance given by a page u to the page p through its link is measured as Rank(u)/N_u, so the total importance given to page p is the sum of the importance contributed by all of its incoming links. This is computed iteratively n times until the rank of each page converges. The iteration is as follows:

$$\forall p \in P: \quad Rank_{i}(p) = (1 - d) + d \sum_{u \in S_p} \frac{Rank_{i-1}(u)}{N_u} \qquad (1)$$

where d is the damping factor from the random surfer model [1]. The value of d ranges from 0 to 1. It is the probability that a random surfer at page p follows one of its links; with probability (1 - d) the surfer gets bored and jumps to another random page. We use 0.85 as the value of d in the rest of this paper, as given in [1]. The use of d ensures the convergence of the PageRank algorithm [5].

2.2. Sequential implementation of PageRank

[Algorithm 1: sequential PageRank]

The input file containing the web graph has to be converted to a binary link structure file. This file consists of all nodes as numbers, each with its number of out-links and the pages (also in numerical form) to which it points, as shown in Fig. 1.
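The sequential iteration over such a link structure can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the paper's Algorithm 1: the function name `pagerank_sequential`, the in-memory list-of-lists representation of the link structure, and the tiny four-page example graph are all hypothetical. It uses d = 0.85, an initial rank of 1, and a fixed number of iterations, as described in the text.

```python
def pagerank_sequential(out_links, d=0.85, iterations=32):
    """Jacobi-style PageRank over an out-link structure.

    out_links[u] is the list of pages that page u points to
    (a hypothetical in-memory form of the link structure file).
    """
    n = len(out_links)
    rank = [1.0] * n  # every page starts with rank 1
    for _ in range(iterations):
        received = [0.0] * n  # importance received in this iteration
        for u, targets in enumerate(out_links):
            if not targets:
                continue  # assumption: a page without out-links gives nothing
            share = rank[u] / len(targets)  # Rank(u) / N_u
            for p in targets:
                received[p] += share
        # New ranks depend only on the ranks of the previous iteration.
        rank = [(1.0 - d) + d * r for r in received]
    return rank


# Hypothetical example: page 0 -> 1, 2; page 1 -> 2; page 2 -> 0; page 3 -> 0
example_out_links = [[1, 2], [2], [0], [0]]
example_ranks = pagerank_sequential(example_out_links)
print(example_ranks)
```

Because every new rank is computed from the previous iteration's array only, two arrays are needed, which is the property that later makes the iteration straightforward to parallelize.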
Equation (1) is the Jacobi iterative solution to a system of linear equations. The Jacobi method is used extensively because it can be easily parallelized: the calculation of the rank of a page depends only on the ranks of pages from the previous iteration. Another efficient method for solving systems of linear equations is the Gauss-Seidel method. In the Gauss-Seidel method, calculating the rank of page p in iteration i uses the newly calculated ranks of all pages before p and the previous iteration's ranks for the pages after p. The iterative formula is as follows:

$$\forall p \in P: \quad Rank_{i}(p) = (1 - d) + d \left( \sum_{u \in S_p,\, u < p} \frac{Rank_{i}(u)}{N_u} + \sum_{u \in S_p,\, u > p} \frac{Rank_{i-1}(u)}{N_u} \right) \qquad (2)$$

The main advantage of this algorithm is that it uses less space during the calculation: only one array containing the ranks of pages is needed, and both reading and updating ranks can be done on the same array. Its disadvantage is that it cannot be easily parallelized.

2.3. Comparison with Related Works

Since PageRank involves a huge amount of computation, many researchers have attempted parallel implementations with their own approaches. Here we list the most notable works and discuss the advantages of our approach over them.

1. BlockRank: This algorithm by Kamvar et al. [6] splits the web graph into blocks according to their domain and then calculates the PageRank of each block locally. It then uses an approximation to merge these results and calculate the global PageRank. This implementation improves performance by a factor of 2.
2. Partition-Based Parallel PageRank Algorithm: Rungsawang and Manaskasemsak [3] discuss three algorithms and their implementation on a PC cluster. They compare PC cluster implementations of a block-based algorithm, a split-accumulate algorithm, and a partition-based parallel algorithm, with results favoring the last.
3. PageRank Computation Using PC Cluster: This is another implementation by Rungsawang and Manaskasemsak [4], in which they divide the input graph into equal sets and allocate them to the PC cluster nodes. Each cluster node solves its set locally for 5 iterations and then updates the new ranks on the other nodes. This implementation achieves a speedup of 4 times.
4. Another efficient parallel implementation of PageRank on a PC cluster, by Kohlschütter et al. [5], achieves a gain of 10 times by using the block structure of the web and reformulating the algorithm as a combination of the Jacobi and Gauss-Seidel methods.

Most researchers have implemented the PageRank algorithm on PC clusters, which increases efficiency but not in proportion to the added hardware cost. Very little successful research has been done on implementing the PageRank algorithm on rapidly evolving multi-core architectures. The implementation on the multi-core, 6-SPU Cell BE has shown that the PageRank algorithm runs 22 times slower. We, however, implement it on the more efficient SIMT-based multi-core CUDA architecture, which contains a large number of processors, using an entirely new approach. Our approach also exploits the block structure of the web but does not involve an approximation as in [6]. We further show that combining the PC cluster implementation with a CUDA device yields a huge increase in performance for a very small increase in hardware cost.

2.4. The Block Structure of Web Graph

As discussed in section 2.3, several studies have utilized the block structure of the web for efficient implementation of PageRank. They reveal that in most cases the number of intra-host links is much larger than the number of inter-host links, which creates block structures in the web graph. Here, we take this further and analyze the property in a more magnified view, which reveals that this kind of block structure also exists at lower levels of the hierarchy. For instance, most of the links for a certain block of pages, say BBC.Ex./docs, are in and around BBC.Ex./docs. To study the structure of the web, we used the link structure file generated with WebGraph.
To visualize it further, we construct dot plots such that if there is a link from node j to node i, there is a black point in the graph at (i, j). Since our full dataset is too large for individual points to be observed, a slice of the graph is shown in Figure 2. The following things are noticeable:

1. There is a dense diagonal line, indicating that most pages link in and around themselves.
2. There are several blocks of points, which show that certain blocks of pages have a large number of links among themselves. This clearly shows that the block structure at the domain or host level is also prevalent at a smaller level.
3. There are several horizontal lines, highlighting that some blocks of pages are pointed to by a narrow set of pages, and a few isolated vertical lines, indicating that a certain block of pages points to a very narrow set of pages.
4. There are a number of isolated points, which highlight the degree of randomness in the link structure.

2.5. CUDA Architecture

CUDA, introduced by NVIDIA, is a general purpose parallel computing architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. These GPUs are used as coprocessors to assist the CPU with computationally intensive tasks. More details about this architecture can be found in [10]. Here, we highlight the features that need special mention in relation to our work.

SIMT Architecture: CUDA employs a Single Instruction Multiple Thread architecture, in which a single instruction is executed by a huge number of threads.
Asynchronous Concurrent Execution: To facilitate concurrent execution between host and device, kernel launches are asynchronous, i.e. control is returned to the host thread before the device has completed the requested task.
Warps: A warp is a group of 32 parallel threads that execute one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path.
Memory Coalescing: Global memory bandwidth is used most efficiently if the simultaneous memory accesses by threads in a half-warp (during the execution of a single read or write instruction) can be coalesced into a single memory transaction of 32, 64, or 128 bytes, leading to minimum access time.

2.6. Porting issues on CUDA Architecture

The issues in porting the PageRank algorithm are mainly concerned with the hardware restrictions of the CUDA architecture. CUDA requires all threads in a warp to follow the same execution path in order to run in parallel. If the execution paths of threads in a warp diverge, the device suspends parallel execution and runs the divergent paths sequentially (they become serialized), which decreases throughput. As the number of in-links can be very dissimilar from node to node, the loop involved in the calculation, which iterates over a node's in-links, can make a thread's control flow diverge from that of the other threads.

Another constraint of the device is related to memory accesses. Because of the huge size of the link structure arrays containing the in-links and ranks of nodes, they have to be stored in global memory. But the latency of global memory is very high, hundreds of clock cycles, compared to shared memory. The protocol followed by the CUDA architecture for memory transactions ensures that all threads referencing memory in the same segment are serviced simultaneously. Bandwidth is therefore used most efficiently only if the simultaneous memory accesses by the threads in a half-warp belong to one segment.
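As a rough illustration of the segment rule, the following host-side Python model counts how many memory segments the accesses of one half-warp fall into. It is a simplified sketch, not a description of the actual hardware protocol: the 4-byte element size, the 64-byte segment size, the aligned base address, and the function name `segments_touched` are assumptions made for this example.

```python
HALF_WARP = 16  # threads whose accesses are considered together


def segments_touched(indices, element_size=4, segment_size=64):
    """Count the distinct memory segments referenced by one half-warp.

    indices[t] is the array element read by thread t. Assumes the array
    starts at a segment-aligned address (a simplification).
    """
    return len({(i * element_size) // segment_size for i in indices})


# Neighbouring threads read neighbouring rank entries: one segment.
contiguous = list(range(HALF_WARP))
# Threads follow links to far-apart nodes: one segment per thread.
scattered = [t * 1000 for t in range(HALF_WARP)]

print(segments_touched(contiguous), segments_touched(scattered))
```

In this model, a count of one corresponds to the coalesced case, while each additional segment stands for an extra memory transaction. Reading the ranks of linking nodes that lie far apart in the array produces the scattered pattern.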
However, because of the uneven and random nature of the in-links of nodes, memory references sometimes become non-coalesced, which hinders the simultaneous servicing of memory transactions and wastes memory bandwidth.

There are different structures for the input file of the PageRank algorithm: it can list either nodes with their in-links or nodes with their out-links. The problem with out-links is that, while iterating through the list of out-links of a particular node, the calculation requires dividing the initial PageRank of the node by its number of out-links and adding the result to the memory location that stores the new PageRank of the node it points to. The pseudocode for a parallel implementation of this on the CUDA architecture is described in pseudocode [1].

Since step 2 in [1] will be executed simultaneously by a large number of threads, the threads may conflict with each other while updating the same memory location, producing unpredictable results. Although this problem could be solved with atomic operations, there are no atomic operations for floating point values incorporated in the architecture yet. Hence the input file format should list each node with its number of in-links and the list of nodes pointing to it. The out-degree of each node, which is required in the calculation, can be stored in a separate file or in the same file. The structure of the input file used in our experiment is shown in Figure 1.

Figure 1: Input File Structure

2.7. Parallel implementation using Jacobi method

The problem with using the Gauss-Seidel method to calculate PageRank is that calculating the rank of a page requires the new ranks of all pages before it, but because of parallelization the calculation of some of them may still be in progress. This happens especially for threads belonging to the same block, as all threads in a block execute in parallel.

In the naive implementation, each thread is assigned one node for the calculation. The thread iterates through all of the node's in-links, fetching the initial rank of each linking node, dividing it by that node's out-degree, and adding up the results. Finally, it multiplies the sum by the damping factor d and adds (1 - d). The initial PageRank before the iterations start is set to 1. The default number of iterations is 32, as the algorithm normally converges in 32 iterations.

When this implementation is executed on the device, the running time is longer than that of the sequential implementation on the hardware specified in section 2. The reason can be attributed to the porting issues mentioned in section 2.6.

2.7.1. Solving the problem of non coalesced memory access

One of the reasons for the poor performance of the naive implementation is that the global memory accesses were not coalesced. As discussed under the porting issues in section 2.6, for better performance the simultaneous memory accesses by all threads in a half-warp should belong to the same segment, so that fewer transactions occur. Nodes generally link within their locality, with few links to farther nodes, as described in section 2.4. To improve the rank calculation of a node, say p, we process on the kernel only those linking nodes that belong to the locality of p, determined by a range. The rest of the nodes are processed on the host processor, as shown in Figure 4. So we create two link structure input files: one to be processed by the kernel, which contains the nodes lying in the locality, and another containing the rest of the nodes, to be processed on the host processor.

Figure 2: Dotplot of all links

Figure 3: Dotplot of links calculated on device

2.7.2. Solving the problem of divergence in control flow

Another reason for the poor performance of the naive implementation is divergence in the program control flow. The main cause of divergence is the uneven number of in-links, which makes different threads end up with different numbers of iterations, as also discussed in section 2.
The solution to this problem is to let as many threads as possible iterate the same number of times, so that the program execution path does not diverge much from the normal flow. For this, we calculate the average number of in-links over all nodes in the kernel's link structure input file. Each thread is then allowed to process at most the average number of in-links (or fewer, if its node has fewer than the average), and the rest of the calculation is done simultaneously on the host processor. Figure 3 shows the points that are included in the calculation of rank on the device, and Figure 5 those calculated on the host.

Figure 5: Dotplot of links calculated on host

The number of calculations on the host can be decreased further if we use some constant multiple of the average value. The constant that gives peak performance differs between input graphs, depending on the distribution of the number of in-links among the nodes. If this constant is too large, it can also increase the number of divergent threads. But if a balance is struck between the extra time spent on host calculations and the number of divergent threads, a further decrease in running time can be achieved. This constant can be called the average factor; it is a function of the distribution of the number of in-links among nodes and of the block size. The final implementation is shown in Parallel Algorithm 1.

Figure 4: Processing the input file

The total number of calculations done on the host processor is considerably decreased. But to increase performance further, the block structure of the web can be exploited. As we showed in section 2, block structure exists even at a small level. So in our next improvement, instead of calculating the average of the in-links of all nodes, we divide all the nodes into blocks.
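The average-based split of work between device and host described in this section can be sketched in Python as follows. This is a minimal host-side sketch under stated assumptions, not the paper's Parallel Algorithm 1: the function name `split_by_average`, the list-of-lists form of the in-link structure, and the example graph are hypothetical, and the global average is used rather than per-block averages.

```python
def split_by_average(in_links, average_factor=1.0):
    """Split each node's in-link list into a device part and a host part.

    in_links[p] is the list of nodes pointing to node p. Every device
    thread handles at most `cap` in-links, so threads run a similar
    number of loop iterations; the remainder is left for the host.
    """
    degrees = [len(links) for links in in_links]
    average = sum(degrees) / len(degrees)
    cap = int(average_factor * average)  # per-thread iteration limit
    device_part = [links[:cap] for links in in_links]
    host_part = [links[cap:] for links in in_links]
    return cap, device_part, host_part


# Hypothetical graph with in-degrees 3, 1, 1, 3 (average 2).
example_in_links = [[1, 2, 3], [0], [0], [0, 1, 2]]

cap, on_device, on_host = split_by_average(example_in_links)
print(cap, on_device, on_host)

# A larger average factor moves more links to the device.
cap_large, _, on_host_large = split_by_average(example_in_links, 1.5)
print(cap_large, on_host_large)
```

With an average factor of 1, nodes whose in-degree exceeds the average leave their extra in-links to the host. Raising the factor shrinks the host's share at the price of a longer and less uniform loop on the device, which is the trade-off described above.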