Information Retrieval · Shipped · 2026
Scientific Citation Retrieval
Information Retrieval · NLP · Ensembles · Embeddings
Built with: Python, PyTorch, Transformers, BM25, scikit-learn
This was a competition task: given a query paper, retrieve the 100 papers it is most likely to cite from a corpus of 20,000, scored on NDCG@10.
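For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10. It assumes binary relevance (a candidate either is or is not cited by the query paper) and the common log2 discount; the competition's exact gain scheme is not stated here, and the function names are illustrative rather than taken from the project code.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k positions (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary gains (assumption: 1.0 if cited, else 0.0)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # The ideal ranking places every relevant paper first, capped at k slots.
    ideal_dcg = dcg_at_k([1.0] * min(len(relevant_ids), k), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Because only the top 10 positions contribute, the order of the remaining 90 retrieved papers does not affect the score.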
No single retriever wins everywhere, so I built an ensemble of eight. Dense models (SPECTER2, SciNCL, MiniLM) capture meaning; sparse methods (BM25 over title, abstract, full text, and per-section, plus TF-IDF) capture exact terms and rare entities; a citation-context pass reads how papers actually reference each other.
The eight rankings are merged with weighted reciprocal rank fusion, with domain and venue boosting on top. I tuned the fusion weights by coordinate descent against held-out relevance judgements, then submitted the final pipeline to the public leaderboard.
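The fusion and tuning steps can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the smoothing constant `k=60` is the conventional RRF default rather than a value stated above, the candidate weight grid and the names `weighted_rrf`, `tune_weights`, `runs`, `qrels` and `metric` are hypothetical, and domain and venue boosting is left out.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, k=60, top_n=100):
    """Fuse ranked lists: score(d) = sum over retrievers of w_r / (k + rank_r(d)).

    rankings: {retriever_name: [doc_id, ...]} ordered best first.
    weights:  {retriever_name: float}; missing names default to 1.0.
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by descending score; break ties by doc id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

def tune_weights(runs, qrels, metric, grid=(0.0, 0.5, 1.0, 1.5, 2.0), sweeps=3):
    """Coordinate descent: adjust one retriever's weight at a time.

    runs:   {query_id: {retriever_name: [doc_id, ...]}}
    qrels:  {query_id: set of relevant doc ids} (held-out judgements)
    metric: callable(fused_ids, relevant_ids) -> float, e.g. NDCG@10
    """
    names = sorted({name for per_query in runs.values() for name in per_query})
    weights = {name: 1.0 for name in names}

    def mean_score(w):
        total = sum(metric(weighted_rrf(runs[q], w), qrels[q]) for q in runs)
        return total / len(runs)

    best = mean_score(weights)
    for _ in range(sweeps):
        improved = False
        for name in names:
            for candidate in grid:
                trial = {**weights, name: candidate}
                score = mean_score(trial)
                if score > best:  # keep a change only if it strictly helps
                    best, weights, improved = score, trial, True
        if not improved:
            break  # a full sweep changed nothing, so stop early
    return weights, best
```

Coordinate descent suits this setting because the objective is a ranking metric with no useful gradient, and each evaluation only re-fuses cached rankings, so no retriever has to be re-run while the weights are tuned.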