Silcock, Emily; D'Amico-Wong, Luca; Yang, Jinglin; … - National Bureau of Economic Research - 2022
Identifying near duplicates within large, noisy text corpora has myriad applications, ranging from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage to identifying reproduced news articles and literature within large corpora. Across these diverse...