From 26aa121848d41860a398cac8b549531e5f21f03e Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Mon, 28 Sep 2020 19:56:37 +0200
Subject: update notes

---
 projects/grobid_refs/README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/projects/grobid_refs/README.md b/projects/grobid_refs/README.md
index 15eaae0..498e68b 100644
--- a/projects/grobid_refs/README.md
+++ b/projects/grobid_refs/README.md
@@ -2,6 +2,19 @@
 
 References extracted from [grobid](https://grobid.readthedocs.io).
 
+## TODO
+
+* For a given reference string in grobid, find a matching release in fatcat.
+
+## Approach
+
+Two general ways:
+
+* do queries against elasticsearch, which would max out at a few hundred queries/s
+* offline compute a key (e.g. title, title ngram plus authors, etc.); then do comparisons
+
+## Misc
+
 Example grobid outputs:
 
 * [grobid.tei.xml](grobid.tei.xml), [pdf](http://dss.in.tum.de/files/brandt-research/me.pdf) -- here grobid does not extract many refs; GS looks ok
-- 
cgit v1.2.3
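
Note on the second approach in the patch (compute a key offline, then compare): a minimal sketch of what this could look like, not the implementation used in this project. The record layout (dicts with `id`, `title`, `authors`), the function names, the normalization rules, and the choice of three title tokens plus the first author's last name token are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Lowercase, strip accents, and collapse non-alphanumeric runs to one space."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def title_key(record):
    """Key built from the whole normalized title, with spaces removed."""
    return normalize(record.get("title", "")).replace(" ", "")


def title_ngram_author_key(record, n=3):
    """Key built from the first n title tokens plus the last name token
    of the first author (assumes names are written 'First Last')."""
    tokens = normalize(record.get("title", "")).split()[:n]
    if not tokens:
        return ""
    authors = record.get("authors") or []
    name_tokens = normalize(authors[0]).split() if authors else []
    surname = name_tokens[-1] if name_tokens else ""
    return "-".join(tokens + [surname])


def candidate_pairs(refs, releases, key_func):
    """Yield (ref id, release id) pairs whose keys are equal.

    Releases are indexed by key once; each reference is then a dict lookup.
    Records with an empty key are skipped so they do not match each other.
    """
    index = defaultdict(list)
    for release in releases:
        key = key_func(release)
        if key:
            index[key].append(release["id"])
    for ref in refs:
        key = key_func(ref)
        if key:
            for release_id in index.get(key, []):
                yield ref["id"], release_id
```

Calling `list(candidate_pairs(refs, releases, title_key))` gives candidate matches only; pairs with equal keys would still need a stricter comparison step, since different works can share a key.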