Hub Miner Development
Hub Miner (https://github.com/datapoet/hubminer) has been significantly improved since its initial release. It now has full OpenML support for networked experiments in classification and a detailed user manual for all common use cases.
There have also been many new method implementations, especially for data filtering, reduction and outlier detection.
I have ambitious implementation plans for future versions.
If you would like to join the project as a contributor, let me know!
While I am still dedicated to the project, I have somewhat less time than before since I joined Google in January 2015, so I have decided to open up the project to new contributors who can help in making this an awesome machine learning library.
I am also interested in developing Python/R/Julia/C++ implementations of hubness-aware approaches, so feel free to ping me if you would be interested in that as well.
First Hub Miner release
This is the announcement for the first release of Hub Miner code.
Hub Miner is the machine learning library that I have been working on during the course of my Ph.D. research. It is written in Java and released as open source on GitHub. This is the first release and updates are already underway, so please be a little patient. The code is well documented, with many comments, but the library is fairly large and it is not that easy to navigate without a manual.
Fortunately, a full manual should be done by the end of October and will also appear on GitHub along with the code, as well as on this website, under the Hub Miner page.
Hub Miner is a hubness-aware machine learning library and it implements methods for classification, clustering, instance selection, metric learning, stochastic optimization, and more. It handles standard data types and can handle both dense and sparse data representations, with continuous, discrete and discretized features. There is some basic implemented support for text and image data processing.
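To make the notion of hubness concrete, here is a minimal Python sketch. It is not Hub Miner's Java API, and the function names are illustrative assumptions. It counts how often each point occurs in the k-nearest neighbor lists of the other points and measures the skewness of that occurrence distribution. A strongly positive skewness suggests that a few hub points dominate the neighbor lists.

```python
import math


def knn_indices(points, k):
    """Brute-force Euclidean kNN: for each point, the indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Distances to all other points; ties are broken by index.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def occurrence_counts(neighbors, n):
    """Count how many kNN lists each of the n points appears in."""
    counts = [0] * n
    for nbrs in neighbors:
        for j in nbrs:
            counts[j] += 1
    return counts


def skewness(values):
    """Standardized third moment of the values (0.0 if they are all equal)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


# Toy usage on four 2-D points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (6.0, 6.0)]
neighbors = knn_indices(points, k=1)
counts = occurrence_counts(neighbors, len(points))
print(counts, skewness(counts))
```

The brute-force search is quadratic in the number of points, which is fine for a toy illustration but not for large datasets.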
Image Hub Explorer is also within the Hub Miner source, a GUI for visual hubness inspection in image data.
A powerful experimentation framework under learning.supervised.evaluation.cv.BatchClassifierTester and learning.unsupervised.evaluation.BatchClusteringTester allows for testing the various baselines in challenging conditions.
OpenML support is also under way and should be finished by the end of October, so expect it to show up in the next release.
Two of our new papers have recently been accepted.
The paper titled Boosting for Vote Learning in High-dimensional kNN Classification has been accepted for presentation at the International Conference on Data Mining (ICDM) workshop on High-dimensional Data Analysis. In the paper, the possibility of using boosting for vote learning in high-dimensional data is examined, since it has been determined that hubness-aware k-nearest neighbor classifiers permit boosting in the classical sense. Standard kNN baselines are known to be robust to training data sub-sampling, and the instance sampling and instance re-weighting approaches to boosting do not typically work on kNN, which can be boosted by feature sub-sampling instead. In the case of hubness-aware classifiers, it is possible to use the re-weighting type of boosting without greatly increasing the computational complexity (as the kNN graph only needs to be calculated once on the training data for the neighbor occurrence model). We have extended the basic neighbor occurrence models by introducing instance weights and weighted neighbor occurrences, with trivial changes to the hubness-aware voting frameworks. The results look promising, though we have only tried the AdaBoost.M2 boosting approach so far, and other boosting variants are less prone to over-fitting and more robust to noise. So, there is more work to be done here.
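The idea of weighted neighbor occurrences can be sketched as follows. This is a minimal Python illustration under stated assumptions (a hypothetical function name, and weights simply summed per class), not the exact formulation from the paper. Each training point contributes its current boosting weight, rather than a unit count, to the class-conditional occurrence profile of every point in its kNN list. Since the kNN lists stay fixed between boosting rounds, only this cheap accumulation has to be repeated when the weights change.

```python
def weighted_class_occurrences(neighbors, labels, weights, num_classes):
    """Accumulate instance weights into class-conditional neighbor occurrence profiles.

    neighbors[i] holds the indices of the k nearest neighbors of point i.
    The result occ[j][c] is the total weight of points of class c that
    have point j in their kNN list (an assumed, simplified definition).
    """
    occ = [[0.0] * num_classes for _ in labels]
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            occ[j][labels[i]] += weights[i]
    return occ


# Toy usage: 1-NN lists for four points, two classes, non-uniform weights.
neighbors = [[1], [0], [0], [2]]
labels = [0, 0, 1, 1]
weights = [0.5, 1.0, 2.0, 1.5]
print(weighted_class_occurrences(neighbors, labels, weights, num_classes=2))
```

With all weights set to 1.0, the profiles reduce to the plain class-conditional occurrence counts.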
Speaking of noise, our paper on Hubness-aware kNN Classification of High-dimensional Data in Presence of Label Noise has just been accepted for publication in the Neurocomputing Special Issue on Learning from Label Noise. It is an in-depth study of the influence of data hubness and the curse of dimensionality on classification performance and the inherent robustness of hubness-aware approaches in particular. Additionally, we have introduced a novel concept of hubness-proportional random label noise as a way to test for worst-case scenarios. To show that this noise model is realistic, we have demonstrated an adversarial label-flip attack based on the estimated TF-IDF message weights that were inversely correlated with point-wise hubness in SMS spam data under standard TF-IDF normalization. We hope to do more work on hubness-aware learning under label noise soon.
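One plausible reading of hubness-proportional random label noise is sketched below in Python. The sampling scheme is an assumption made for illustration (points are drawn without replacement with probability proportional to their neighbor occurrence count, then assigned a different class uniformly at random), and the function name is hypothetical. The paper's exact noise model may differ.

```python
import random


def hubness_proportional_label_noise(labels, occurrence_counts, noise_rate,
                                     num_classes, seed=0):
    """Flip labels of points sampled proportionally to their occurrence counts.

    Assumes num_classes >= 2 and class labels in range(num_classes).
    Returns the noisy label list and the indices that were flipped.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    # Only points that occur in at least one kNN list can be selected.
    remaining = [i for i, c in enumerate(occurrence_counts) if c > 0]
    num_flips = min(round(noise_rate * len(labels)), len(remaining))
    flipped = []
    for _ in range(num_flips):
        weights = [occurrence_counts[i] for i in remaining]
        chosen = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(chosen)
        other_classes = [c for c in range(num_classes) if c != labels[chosen]]
        noisy[chosen] = rng.choice(other_classes)
        flipped.append(chosen)
    return noisy, flipped


# Toy usage: the two hubs (indices 1 and 3) are the only candidates.
labels = [0, 0, 1, 1, 0]
counts = [0, 5, 0, 3, 0]
print(hubness_proportional_label_noise(labels, counts, 0.4, num_classes=2))
```

Concentrating the flips on hubs means that each corrupted label is propagated to many kNN lists, which is what makes this a worst-case style of test compared to uniform random noise.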
A Novel Kernel Clustering Algorithm
We have a new book chapter coming out now on high-dimensional data clustering in the book on partitional clustering algorithms. It is titled ‘Hubness-Based Clustering of High-Dimensional Data’ and it is an extension of our earlier work where we have shown that it is possible to exploit kNN hubs for effective data clustering in many dimensions.
In our chapter, we have extended the original algorithm to incorporate a ‘kernel trick’ in order to be able to handle non-hyperspherical clusters in the data. This has resulted in the Kernel Global Hubness-proportional K-Means algorithm (Kernel-GHPKM) that our experiments show to be very promising and preferable to standard kernel K-means on some high-dimensional datasets.
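For readers unfamiliar with the kernel trick in clustering, the Python sketch below shows the standard kernel K-means building block: the squared distance between a point and a cluster mean in the kernel-induced feature space, computed from kernel evaluations only. It is not the Kernel-GHPKM algorithm itself, and the function names are illustrative assumptions.

```python
import math


def linear_kernel(x, y):
    """Plain dot product; with it, the kernel distance equals the Euclidean one."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel with width parameter gamma."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_distance_to_centroid(x, cluster, kernel):
    """Squared feature-space distance from x to the mean of a non-empty cluster.

    Uses ||phi(x) - mu||^2 = K(x, x) - (2/m) * sum_y K(x, y)
                             + (1/m^2) * sum_{y, z} K(y, z).
    """
    m = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, y) for y in cluster) / m
    within_term = sum(kernel(y, z) for y in cluster for z in cluster) / (m * m)
    return self_term - 2.0 * cross_term + within_term


# With the linear kernel this is the squared Euclidean distance to the centroid (0, 1).
print(kernel_distance_to_centroid((2.0, 0.0), [(0.0, 0.0), (0.0, 2.0)], linear_kernel))
```

Swapping in a non-linear kernel such as rbf_kernel lets the same assignment rule separate clusters that are not hyperspherical in the original space.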
The implementation is available in Hub Miner and will be released very soon along with the rest of the library.
Stay tuned for more updates.
PhD thesis: The Role of Hubness in High-dimensional Data Analysis
On December 18th, 2013, I am scheduled to present my PhD thesis titled “The Role of Hubness in High-dimensional Data Analysis”.
The thesis discusses the issues involving similarity-based inference in intrinsically high-dimensional data and the consequences of emerging hub points. It integrates the work presented in my journal and conference papers, and proposes and discusses novel techniques for designing nearest-neighbor based learning models in many dimensions. Lastly, it mentions potential practical applications and promising future research directions.
I would like to thank everyone who gave me advice and helped in shaping this thesis.
The full text of the thesis is available here.
@ ECML PKDD 2013
This year’s ECML/PKDD has certainly exceeded my expectations. Excellent talks, an inspiring atmosphere, an engaging poster session, and lots of feedback from different people.
My own presentations also went fairly well.
Here are the posters that I’ve used to present the papers about the Image Hub Explorer and the Augmented Naive Hubness-Bayesian k-Nearest Neighbor classifier for intrinsically high-dimensional data.
Learning under Class Imbalance in High-dimensional Data: a new journal paper
I am pleased to say that our paper titled ‘Class Imbalance and The Curse of Minority Hubs’ just got accepted for publication in Knowledge-Based Systems (IF 4.1 (2012)). The research presented in the paper is one of the pillars of the work in my PhD thesis, so I am glad to have gotten some more quality feedback from the reviews and further improved the paper during the entire process.
In the paper, we examine a novel aspect of the well known curse of dimensionality, one that we have named ‘The Curse of Minority Hubs’. It has to do with learning under class imbalance in high-dimensional data. Class-imbalanced problems have been known to pose great difficulties to standard machine learning approaches, just as it was known that a high number of features and sparsity poses problems of its own. Surprisingly, these two phenomena haven’t been often considered at the same time. In our analysis, we have focused on the high-dimensional phenomenon of hubness, the skewness of the distribution of influence/relevance in similarity-based models. Hubs emerge as centers of influence within the data, and a small number of points determines most properties of the model. In the case of classification, a small number of points is responsible for most correct or incorrect classification decisions. The points that cause many label mismatches in k-nearest neighbor sets are called ‘bad hubs’, while the others are referred to as ‘good hubs’.
It just so happens that the points that belong to the minority classes have a tendency to become prominent bad hubs in high-dimensional data, hence the phrase ‘The Curse of Minority Hubs’. This is surprising for several reasons. First of all, it is exactly the opposite of what is usually the standard assumption in class-imbalanced problems: that most misclassification is caused by the majority class, due to an average relative difference in density in the borderline regions. In low or medium-dimensional data, this is indeed the case. Therefore, most machine learning methods that are tailored for class-imbalanced data attempt to improve the classification of the minority class points by penalizing the majority votes.
However, it seems that, in high-dimensional data, it is often the case that the minority hubs induce a disproportionately large percentage of all misclassifications. Therefore, standard methods for class-imbalanced classification face great difficulties when learning in many dimensions, as their base premise is turned upside-down.
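The good and bad occurrence counts described above can be computed directly from the kNN lists. The Python sketch below is a minimal illustration with hypothetical function names, not code from the paper or from Hub Miner. It splits each point's neighbor occurrences into label matches and mismatches, then totals the bad occurrences by the class of the point that induces them, which is the quantity one would inspect to see whether minority points act as bad hubs.

```python
def good_bad_occurrences(neighbors, labels):
    """Split neighbor occurrences of each point into label matches and mismatches.

    neighbors[i] holds the indices of the k nearest neighbors of point i.
    good[j] counts kNN lists of same-class points that contain j,
    bad[j] counts kNN lists of different-class points that contain j.
    """
    good = [0] * len(labels)
    bad = [0] * len(labels)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            if labels[i] == labels[j]:
                good[j] += 1
            else:
                bad[j] += 1
    return good, bad


def bad_occurrences_by_class(bad, labels):
    """Total bad occurrences induced by the points of each class."""
    totals = {}
    for count, label in zip(bad, labels):
        totals[label] = totals.get(label, 0) + count
    return totals


# Toy usage: 1-NN lists for four points with two classes.
neighbors = [[1], [0], [0], [2]]
labels = [0, 0, 1, 1]
good, bad = good_bad_occurrences(neighbors, labels)
print(good, bad, bad_occurrences_by_class(bad, labels))
```

Comparing each class's share of the bad occurrences with its share of the data gives a simple check of whether a class induces disproportionately many label mismatches.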
In our paper, we take a closer look at all the related phenomena and also propose that the hubness-aware kNN classification methods could be used in conjunction with other class-imbalanced learning strategies in order to alleviate the arising difficulties.
If you’re working in one of these areas and feel that this might be relevant for your work, you can have a look at the paper here.
It seems that Knowledge-Based Systems also encourages authors to associate a brief (less than 5 min.) presentation with the papers, in order to clarify the main points. This is a cool feature and we have also added a few slides with brief explanations and suggestions.
Upcoming ECML talks
I was just notified that two of my papers got accepted for presentation at the European Conference on Machine Learning (ECML). This is good news and I am looking forward to the conference and the chance to share my results and get some valuable feedback.
The regular paper that got accepted is titled “Hub Co-occurrence Modeling for Robust High-dimensional kNN Classification” and has to do with learning from the second-order neighbor dependencies (co-occurrences) in intrinsically high-dimensional data. We have analyzed the consequences of hubness for the neighbor co-occurrence distributions and utilized them in a novel kNN classification method, the Augmented Naive Hubness-Bayesian k-NN (ANHBNN). The method is based on the Hidden Naive Bayes model and introduces hidden nodes in order to model dependencies between individual attributes. The attributes of the model are the neighbor occurrences themselves. This paper solves some problems but also raises new issues and it shows how difficult and multi-faceted the hubness issue can become.
The other paper that got accepted is actually a demo paper on the Image Hub Explorer tool, which means that I will get the chance to present my software at the conference and demonstrate its capabilities in front of the gathered audience. I am really happy about this and I am certain that it will be a great experience. The demo paper is titled: Image Hub Explorer: Evaluating Representations and Metrics for Content-based Image Retrieval and Object Recognition.
Image Hub Explorer Demo Video
We have finished the initial demo video of the Image Hub Explorer system and it is now available on YouTube.
The demo covers the basic functionality of the system, demonstrating some of the expected use cases.
The user interface will undergo further changes and improvements and the functions will be extended by new models and learning approaches.
Improving the semantic representations for cross-lingual document retrieval
I have had the pleasure of presenting some of our latest results at PAKDD 2013 in Gold Coast, Australia. The conference was good and the location couldn’t have been better, so we were able to catch some sun and walk along the beaches while discussing future collaboration, theory and applications.
Hubs are known to be the centers of influence and are known to arise in textual data. Also, they are known to cause problems by being frequent neighbors (= very similar) to semantically different types of documents. However, it was previously unknown whether this property is language-dependent and how it affects the cross-lingual information retrieval process.
What we have shown by analyzing aligned text corpora can be summarized by the following: hubs in one language are not necessarily hubs in another language, as different documents become influential. However, surprisingly, the percentage of label mismatches in reverse neighbor sets remains more or less unchanged. In other words, the nature of occurrences is preserved across different languages. This comes as a bit of a surprise, since hubness is arguably a geometric property arising from the interplay of metrics and data representations. Yet, it seems that more semantics than was previously thought remains hidden there, captured and preserved across different languages.
We have used this observation to show that it was possible to improve the common semantic representation built via the CCA method (canonical correlation analysis) by simply introducing some hubness-aware instance weights. This is certainly not the only way to go about it and most likely not the best one, but it served as a good proof-of-concept.