
Commit

deploy: 9c88abd
learning2hash committed May 22, 2024
1 parent f283e67 commit 7506480
Showing 228 changed files with 672 additions and 672 deletions.
2 changes: 1 addition & 1 deletion index.html
Original file line number Diff line number Diff line change
@@ -177,7 +177,7 @@ <h3 id="-browse-papers-by-tag">🏷 Browse Papers by Tag</h3>
<tag><a href="/tags.html#KDD">KDD</a></tag>
<tag><a href="/tags.html#LSH">LSH</a></tag>
<tag><a href="/tags.html#MM">MM</a></tag>
<tag><a href="/tags.html#Minhash">Minhash</a></tag>
<tag><a href="/tags.html#MSR">MSR</a></tag>
<tag><a href="/tags.html#NAACL">NAACL</a></tag>
<tag><a href="/tags.html#NIPS">NIPS</a></tag>
<tag><a href="/tags.html#NeurIPS">NeurIPS</a></tag>
2 changes: 1 addition & 1 deletion paper-abstracts.json
@@ -154,7 +154,7 @@
{"key": "shrivastava2014asymmetric", "year": "2014", "title":"Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS).", "abstract": "<p>We present the first provably sublinear time hashing algorithm for approximate\nMaximum Inner Product Search (MIPS). Searching with (un-normalized) inner\nproduct as the underlying similarity measure is a known difficult problem and\nfinding hashing schemes for MIPS was considered hard. While the existing Locality\nSensitive Hashing (LSH) framework is insufficient for solving MIPS, in this\npaper we extend the LSH framework to allow asymmetric hashing schemes. Our\nproposal is based on a key observation that the problem of finding maximum inner\nproducts, after independent asymmetric transformations, can be converted into\nthe problem of approximate near neighbor search in classical settings. This key\nobservation makes efficient sublinear hashing scheme for MIPS possible. Under\nthe extended asymmetric LSH (ALSH) framework, this paper provides an example\nof explicit construction of provably fast hashing scheme for MIPS. Our proposed\nalgorithm is simple and easy to implement. The proposed hashing scheme\nleads to significant computational savings over the two popular conventional LSH\nschemes: (i) Sign Random Projection (SRP) and (ii) hashing based on p-stable\ndistributions for L2 norm (L2LSH), in the collaborative filtering task of item recommendations\non Netflix and Movielens (10M) datasets.</p>\n", "tags": [] },
{"key": "shrivastava2014densifying", "year": "2014", "title":"Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search", "abstract": "<p>The query complexity of locality sensitive hashing\n(LSH) based similarity search is dominated\nby the number of hash evaluations, and this number\ngrows with the data size (Indyk &amp; Motwani,\n1998). In industrial applications such as search\nwhere the data are often high-dimensional and\nbinary (e.g., text n-grams), minwise hashing is\nwidely adopted, which requires applying a large\nnumber of permutations on the data. This is\ncostly in computation and energy-consumption.\nIn this paper, we propose a hashing technique\nwhich generates all the necessary hash evaluations\nneeded for similarity search, using one\nsingle permutation. The heart of the proposed\nhash function is a “rotation” scheme which densifies\nthe sparse sketches of one permutation\nhashing (Li et al., 2012) in an unbiased fashion\nthereby maintaining the LSH property. This\nmakes the obtained sketches suitable for hash table\nconstruction. This idea of rotation presented\nin this paper could be of independent interest for\ndensifying other types of sparse sketches.\nUsing our proposed hashing method, the query\ntime of a (K, L)-parameterized LSH is reduced\nfrom the typical O(dKL) complexity to merely\nO(KL + dL), where d is the number of nonzeros\nof the data vector, K is the number of hashes\nin each hash table, and L is the number of hash\ntables. Our experimental evaluation on real data\nconfirms that the proposed scheme significantly\nreduces the query processing time over minwise\nhashing without loss in retrieval accuracies.</p>\n", "tags": [] },
{"key": "sift1m2009searching", "year": "2009", "title":"Searching with quantization: approximate nearest neighbor search using short codes and distance estimators", "abstract": "<p>We propose an approximate nearest neighbor search method based\non quantization. It uses, in particular, product quantizer to produce short codes\nand corresponding distance estimators approximating the Euclidean distance\nbetween the original vectors. The method is advantageously used in an asymmetric\nmanner, by computing the distance between a vector and code, unlike\ncompeting techniques such as spectral hashing that only compare codes.\nOur approach approximates the Euclidean distance based on memory efficient codes and, thus, permits efficient nearest neighbor search. Experiments\nperformed on SIFT and GIST image descriptors show excellent search accuracy.\nThe method is shown to outperform two state-of-the-art approaches of the literature.\nTimings measured when searching a vector set of 2 billion vectors are\nshown to be excellent given the high accuracy of the method.</p>\n", "tags": [] },
{"key": "silavong2021deskew", "year": "2021", "title":"DeSkew-LSH based Code-to-Code Recommendation Engine", "abstract": "<p>Machine learning on source code (MLOnCode) is a popular research field that has been driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse and concise code snippets that usefully extend the code currently being written by a developer in their development environment (IDE). Code-to-code recommendation engines hold the promise of increasing developer productivity by reducing context switching from the IDE and increasing code-reuse. Existing code-to-code recommendation engines do not scale gracefully to large codebases, exhibiting a linear growth in query time as the code repository increases in size. In addition, existing code-to-code recommendation engines fail to account for the global statistics of code repositories in the ranking function, such as the distribution of code snippet lengths, leading to sub-optimal retrieval results. We address both of these weaknesses with \\emph{Senatus}, a new code-to-code recommendation engine. At the core of Senatus is \\emph{De-Skew} LSH a new locality sensitive hashing (LSH) algorithm that indexes the data for fast (sub-linear time) retrieval while also counteracting the skewness in the snippet length distribution using novel abstract syntax tree-based feature scoring and selection algorithms. We evaluate Senatus via automatic evaluation and with an expert developer user study and find the recommendations to be of higher quality than competing baselines, while achieving faster search. For example, on the CodeSearchNet dataset we show that Senatus improves performance by 6.7% F1 and query time 16x is faster compared to Facebook Aroma on the task of code-to-code recommendation.</p>\n", "tags": ["Minhash"] },
{"key": "silavong2021deskew", "year": "2021", "title":"DeSkew-LSH based Code-to-Code Recommendation Engine", "abstract": "<p>Machine learning on source code (MLOnCode) is a popular research field that has been driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse and concise code snippets that usefully extend the code currently being written by a developer in their development environment (IDE). Code-to-code recommendation engines hold the promise of increasing developer productivity by reducing context switching from the IDE and increasing code-reuse. Existing code-to-code recommendation engines do not scale gracefully to large codebases, exhibiting a linear growth in query time as the code repository increases in size. In addition, existing code-to-code recommendation engines fail to account for the global statistics of code repositories in the ranking function, such as the distribution of code snippet lengths, leading to sub-optimal retrieval results. We address both of these weaknesses with \\emph{Senatus}, a new code-to-code recommendation engine. At the core of Senatus is \\emph{De-Skew} LSH a new locality sensitive hashing (LSH) algorithm that indexes the data for fast (sub-linear time) retrieval while also counteracting the skewness in the snippet length distribution using novel abstract syntax tree-based feature scoring and selection algorithms. We evaluate Senatus via automatic evaluation and with an expert developer user study and find the recommendations to be of higher quality than competing baselines, while achieving faster search. For example, on the CodeSearchNet dataset we show that Senatus improves performance by 6.7% F1 and query time 16x is faster compared to Facebook Aroma on the task of code-to-code recommendation.</p>\n", "tags": ["MSR"] },
{"key": "song2011random", "year": "2011", "title":"Random Maximum Margin Hashing", "abstract": "<p>Following the success of hashing methods for multidimensional\nindexing, more and more works are interested\nin embedding visual feature space in compact hash codes.\nSuch approaches are not an alternative to using index structures\nbut a complementary way to reduce both the memory\nusage and the distance computation cost. Several data\ndependent hash functions have notably been proposed to\nclosely fit data distribution and provide better selectivity\nthan usual random projections such as LSH. However, improvements\noccur only for relatively small hash code sizes\nup to 64 or 128 bits. As discussed in the paper, this is mainly\ndue to the lack of independence between the produced hash\nfunctions. We introduce a new hash function family that\nattempts to solve this issue in any kernel space. Rather\nthan boosting the collision probability of close points, our\nmethod focus on data scattering. By training purely random\nsplits of the data, regardless the closeness of the training\nsamples, it is indeed possible to generate consistently\nmore independent hash functions. On the other side, the\nuse of large margin classifiers allows to maintain good generalization\nperformances. Experiments show that our new\nRandom Maximum Margin Hashing scheme (RMMH) outperforms\nfour state-of-the-art hashing methods, notably in\nkernel spaces.</p>\n", "tags": [] },
{"key": "song2013intermedia", "year": "2013", "title":"Inter-Media Hashing for Large-Scale Retrieval from Heterogeneous Data Sources", "abstract": "<p>In this paper, we present a new multimedia retrieval paradigm to innovate large-scale search of heterogenous multimedia data. It is able to return results of different media types from heterogeneous data sources, e.g., using a query image to retrieve relevant text documents or images from different data sources. This utilizes the widely available data from different sources and caters for the current users’ demand of receiving a result list simultaneously containing multiple types of data to obtain a comprehensive understanding of the query’s results. To enable large-scale inter-media retrieval, we propose a novel inter-media hashing (IMH) model to explore the correlations among multiple media types from different data sources and tackle the scalability issue. To this end, multimedia data from heterogeneous data sources are transformed into a common Hamming space, in which fast search can be easily implemented by XOR and bit-count operations. Furthermore, we integrate a linear regression model to learn hashing functions so that the hash codes for new data points can be efficiently generated. Experiments conducted on real-world large-scale multimedia datasets demonstrate the superiority of our proposed method compared with state-of-the-art techniques.</p>\n", "tags": ["SIGMOD","Image Retrieval","Cross-Modal","Has Code"] },
{"key": "song2015rank", "year": "2015", "title":"Top Rank Supervised Binary Coding for Visual Search", "abstract": "<p>In recent years, binary coding techniques are becoming\nincreasingly popular because of their high efficiency in handling large-scale computer vision applications. It has been\ndemonstrated that supervised binary coding techniques that\nleverage supervised information can significantly enhance\nthe coding quality, and hence greatly benefit visual search\ntasks. Typically, a modern binary coding method seeks\nto learn a group of coding functions which compress data\nsamples into binary codes. However, few methods pursued\nthe coding functions such that the precision at the top of\na ranking list according to Hamming distances of the generated binary codes is optimized.\nIn this paper, we propose a novel supervised binary coding approach, namely\nTop Rank Supervised Binary Coding (Top-RSBC), which\nexplicitly focuses on optimizing the precision of top positions in a Hamming-distance ranking list towards preserving the supervision information. The core idea is to train\nthe disciplined coding functions, by which the mistakes at\nthe top of a Hamming-distance ranking list are penalized\nmore than those at the bottom. To solve such coding functions, we relax the original discrete optimization objective\nwith a continuous surrogate, and derive a stochastic gradient descent to optimize the surrogate objective. To further reduce the training time cost, we also design an online\nlearning algorithm to optimize the surrogate objective more\nefficiently. Empirical studies based upon three benchmark\nimage datasets demonstrate that the proposed binary coding approach achieves superior image search accuracy over\nthe state-of-the-arts.</p>\n", "tags": ["ICCV","Supervised"] },
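The ALSH abstract in this hunk hinges on one trick: after independent asymmetric transformations of the database items and the query, maximum inner product search (MIPS) reduces to classical near-neighbor search. A minimal sketch of that reduction, using the simpler norm-completion variant rather than the paper's explicit L2-ALSH construction (function names here are illustrative):

```python
import numpy as np

def preprocess_items(X):
    """Append sqrt(M^2 - ||x||^2) so every transformed item has norm M.

    With all norms equal, the item maximizing the inner product is exactly
    the item maximizing cosine similarity, i.e. a classical NNS problem.
    """
    norms = np.linalg.norm(X, axis=1)
    M = norms.max()
    pad = np.sqrt(np.maximum(M**2 - norms**2, 0.0))
    return np.hstack([X, pad[:, None]])

def preprocess_query(q):
    """Append a zero so inner products with transformed items are unchanged."""
    return np.append(q, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))   # database items
q = rng.normal(size=8)          # query

mips_answer = int(np.argmax(X @ q))          # exact MIPS by brute force

P, Q = preprocess_items(X), preprocess_query(q)
cos = (P @ Q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(Q))
assert int(np.argmax(cos)) == mips_answer    # the reduction preserves the argmax
```

In the full ALSH framework this cosine problem is then handed to a standard LSH family (e.g. sign random projections), which is what makes the query time sublinear.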
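Several abstracts in this hunk reduce retrieval to comparisons in a common Hamming space; the inter-media hashing entry notes that search there "can be easily implemented by XOR and bit-count operations". A minimal sketch of that comparison over integer-packed binary codes (names illustrative):

```python
def hamming(code_a: int, code_b: int) -> int:
    """Hamming distance via XOR then bit-count, as the IMH abstract describes."""
    return bin(code_a ^ code_b).count("1")

def nearest(query: int, codes: list) -> int:
    """Linear scan returning the index of the closest binary code."""
    return min(range(len(codes)), key=lambda i: hamming(query, codes[i]))

# 8-bit toy codes.
db = [0b10110100, 0b01001011, 0b00010110]
q = 0b10110110
# q vs db[0]: differ in 1 bit; vs db[1]: 7 bits; vs db[2]: 2 bits.
assert nearest(q, db) == 0
```

Real systems pack codes into machine words and use a hardware popcount instruction, but the XOR-then-count structure is the same.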
2 changes: 1 addition & 1 deletion papers.html
@@ -314,7 +314,7 @@ <h1>
</span>
</td>
<td>Fran Silavong, Sean Moran, Antonios Georgiadis, Rohan Saphal, Robert Otter</td>
<td>Arxiv</td>
<td>MSR</td>
<td><p>Machine learning on source code (MLOnCode) is a popular research field that has been driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse and concise code snippets that usefully extend the code currently being written by a developer in their development environment (IDE). Code-to-code recommendation engines hold the promise of increasing developer productivity by reducing context switching from the IDE and increasing code-reuse. Existing code-to-code recommendation engines do not scale gracefully to large codebases, exhibiting a linear growth in query time as the code repository increases in size. In addition, existing code-to-code recommendation engines fail to account for the global statistics of code repositories in the ranking function, such as the distribution of code snippet lengths, leading to sub-optimal retrieval results. We address both of these weaknesses with \emph{Senatus}, a new code-to-code recommendation engine. At the core of Senatus is \emph{De-Skew} LSH a new locality sensitive hashing (LSH) algorithm that indexes the data for fast (sub-linear time) retrieval while also counteracting the skewness in the snippet length distribution using novel abstract syntax tree-based feature scoring and selection algorithms. We evaluate Senatus via automatic evaluation and with an expert developer user study and find the recommendations to be of higher quality than competing baselines, while achieving faster search. For example, on the CodeSearchNet dataset we show that Senatus improves performance by 6.7% F1 and query time 16x is faster compared to Facebook Aroma on the task of code-to-code recommendation.</p>
</td>
</tr>
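The Senatus engine in the row above builds on minwise hashing over AST-derived feature sets (hence the Minhash tag in the first hunk). A minimal sketch of plain MinHash for Jaccard-similarity sketching, not the paper's De-Skew variant, with illustrative names throughout:

```python
import random

def minhash_signature(tokens, num_hashes=128, seed=7):
    """Plain MinHash: for each of num_hashes random hash functions
    h(x) = (a*x + b) mod p, keep the minimum value over the token set."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # large Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    # Note: Python's hash() of strings is salted per process, so signatures
    # are only comparable within a single run.
    return [min((a * hash(t) + b) % p for t in tokens) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

s1 = {"def", "load", "open", "read", "return"}
s2 = {"def", "load", "open", "write", "return"}
sig1, sig2 = minhash_signature(s1), minhash_signature(s2)
true_j = len(s1 & s2) / len(s1 | s2)  # 4/6
est = estimated_jaccard(sig1, sig2)   # close to true_j with high probability
```

Signatures of equal length can then be banded into hash tables for sub-linear candidate lookup, which is the scaling property the abstract contrasts with linear-time engines.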
4 changes: 2 additions & 2 deletions publications/andoni2006near/index.html
@@ -50,11 +50,11 @@
<meta property="og:url" content="https://learning2hash.github.io/publications/andoni2006near/" />
<meta property="og:site_name" content="Awesome Learning to Hash" />
<meta property="og:type" content="article" />
<meta property="article:published_time" content="2024-05-22T05:47:54-05:00" />
<meta property="article:published_time" content="2024-05-22T05:48:43-05:00" />
<meta name="twitter:card" content="summary" />
<meta property="twitter:title" content="Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions" />
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"BlogPosting","dateModified":"2024-05-22T05:47:54-05:00","datePublished":"2024-05-22T05:47:54-05:00","description":"We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of O(dn^{1/c^2+o(1)}) and space O(dn + n^{1+1/c^2+o(1)}). This almost matches the lower bound for hashing-based algorithms recently obtained in (R. Motwani et al., 2006). We also obtain a space-efficient version of the algorithm, which uses dn + n log^{O(1)} n space, with a query time of dn^{O(1/c^2)}. Finally, we discuss practical variants of the algorithms that utilize fast bounded-distance decoders for the Leech lattice","headline":"Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions","mainEntityOfPage":{"@type":"WebPage","@id":"https://learning2hash.github.io/publications/andoni2006near/"},"url":"https://learning2hash.github.io/publications/andoni2006near/"}</script>
{"@context":"https://schema.org","@type":"BlogPosting","dateModified":"2024-05-22T05:48:43-05:00","datePublished":"2024-05-22T05:48:43-05:00","description":"We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of O(dn^{1/c^2+o(1)}) and space O(dn + n^{1+1/c^2+o(1)}). This almost matches the lower bound for hashing-based algorithms recently obtained in (R. Motwani et al., 2006). We also obtain a space-efficient version of the algorithm, which uses dn + n log^{O(1)} n space, with a query time of dn^{O(1/c^2)}. Finally, we discuss practical variants of the algorithms that utilize fast bounded-distance decoders for the Leech lattice","headline":"Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions","mainEntityOfPage":{"@type":"WebPage","@id":"https://learning2hash.github.io/publications/andoni2006near/"},"url":"https://learning2hash.github.io/publications/andoni2006near/"}</script>
<!-- End Jekyll SEO tag -->


