Many ML algorithms, e.g. NLP and information retrieval ML algorithms, output high dimensional feature vectors for their respective objects of interests. Information sources are then transformed into a vector database. When an object is to be searched, the ML algorithm would similarly infer a feature vector, a.k.a a query point for the object. Then nearest neighbors, which are the nearest to the query point in terms of a certain distance measure from the information source database, are returned as the search results. As the popularity and maturity of ML increase in the field of information retrieval, we expect more and more usage of the aforementioned mechanism.
Recommendation algorithms also make heavy use of high dimensional feature vectors to represent users of interest and items to be recommended. Those items, whose feature vectors are nearest to the feature vector of a target user, are often recommended to the target user. Recommendation systems are vital components of today’s commerce and web portals.
However, computing distances between two high dimensional vectors can be computationally expensive. In the era of big data, we often need to deal with databases of billions of records. A linear scan of the database records, followed by expensive distance computations, is simply impractical. This is where hashing algorithms, or more accurately, learn-to-hash algorithms come to rescue. The simple idea behind these classes of hashing algorithms is that, for each high dimensional vector, we can derive a short binary hashing code, with the characteristics of preserving similarities in the original vector space using a simple Hamming distance in the hash code space. This way computation of distances among data points in the code space is trivially fast. Combined with other techniques, we can deliver sub-linear and fast search implementation.
Combining powerful ML modeling algorithms which produce feature vectors of high quality, and powerful learn-to-hash algorithms that produce short and efficient binary codes, we can design highly accurate and yet highly scalable information retrieval systems. And that is why I surveyed the field of learning to hash, and the result is the following PPT:
I recently published a conference paper accessible at IEEEXplore.
Published in: 2018 21st International Conference on Information Fusion (FUSION)
In the presence of unknown correlations, the optimal data fusion, in the sense of Minimum Mean Square Error, can be formulated as a problem of minimizing a nondifferentiable but convex function. The popular projected subgradient methods are known to converge slowly. The single-projection optimal subgradient method, OSGA-V, is known to be numerically more efficient. This paper presents necessary formulations and methods for the application of the OSGA-V algorithm in the minimization of the optimal data fusion problem, achieving much faster convergence rate than the projected subgradient method. We expect this method to significantly reduce the computational cost and time to achieve optimal data fusion in the presence of unknown correlations.
Abstract: In this article, we first presented a mean 3D face model from , , with 21 facial landmark coordinates, in a easy-to-use CSV file format. We reviewed the popular POSIT algorithm for head pose estimation. We then presented our simplified derivation of the POSIT algorithm. At the end, we provided a Python implementation of the modernPosit algorithm, and demonstrated the computation of face bounding rectangle, based on computed head pose. The novelty lies in the simplified interpretation of the POSIT algorithm.
There are two classes of Python users, those who use Python for management scripting and web development, and those who use Python for scientific computing. According to many, Python is winning the war against R in the field of machine learning. As machine learning becomes ever more important and popular, so is Python.
This article is for those just got into the Python world. For those experienced Python developers, you are excused to leave now. Python is extremely simple to start with, the traditional Hello World program takes only one line. Before no time, you would be writing classes, methods, system calls, and you really enjoy the convenience and power of the language. But quickly, you also realize that, there are many strange codes floating around. And you hear about lots of unacquainted jargons. You could feel a little puzzled, and a little confused about the big picture of the Python world.
That is where this article comes to help. I intend to draw a broad picture of the Python world, with pointers for you to explore in depth. At the end of the article, you won’t become an expert on anything with Python. But, hopefully, you will be satisfied in understanding the big picture, and know where to go for further information and to gain expertise. Continue reading →
Micro-services are the norm nowadays when building backend internet applications or business applications. Of course we would not want to build everything from scratch. What micro-service framework should we use then? If you work in Java, your choices basically boil down to either Dropwizard or Spring. If you work in Scala, you have more choices like Twitter’s Finagle, Play framework and others.
Between Dropwizard and Spring, I personally prefer Dropwizard. For Dropwizard seems simpler and has a more elegant design. Then, what will Dropwizard do for us as a developer? Let’s quickly run through a micro-service building virtual exercise.
I recently did some Pandas-based analysis of Uber trip data for trip prediction purpose. During the process, I found Panads is really helpful in data processing and visualization, and thus this post.
However, due to obvious reasons, I am not supposed to publish internal business data online. And hence here I stripped down the original codes and changed the data source to stock prices.
Unfortunately, Uber trip data and stock price data are fundamentally different. For example, Uber trip data exhibit strong weekly pattern while stock prices do not. As the side effect, some of the analysis in this post is not practically useful, and some of it might not even make sense. But I still consider it valuable for practicing Pandas.
Pandas is really a great tool for data transformation, analyzing and visualization, as long as the data set can fit in memory.
To better understand Pandas, one needs first to have a grasp of the basic concepts. I found the best starting point was to read through Introduction to Data Structures. Once you understand the basic data structures, it should be easier to understand the others.
It is always a good idea to setup a SSL certificate for your site. SSL certificate is an online identification document issued by CAs (Certificate Authority). There are many CA vendors to choose from. I bought from register.com with an annual cost of around $27. You should stay away from StartCom & WoSign because they are blacklisted by major browsers (read here).