update notes

author: Martin Czygan <martin.czygan@gmail.com> 2020-09-28 19:56:37 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2020-09-28 19:56:37 +0200
commit: 26aa121848d41860a398cac8b549531e5f21f03e (patch)
tree: 78fbad88ca7a887d5d9d9cb3ba12a525fc8f6ba6 /projects
parent: de839f7da2de11a8baf73a611706c886a2754953 (diff)
download: fuzzycat-26aa121848d41860a398cac8b549531e5f21f03e.tar.gz
fuzzycat-26aa121848d41860a398cac8b549531e5f21f03e.zip
1 files changed, 13 insertions, 0 deletions
diff --git a/projects/grobid_refs/README.md b/projects/grobid_refs/README.md
index 15eaae0..498e68b 100644
--- a/projects/grobid_refs/README.md
+++ b/projects/grobid_refs/README.md
@@ -2,6 +2,19 @@
 
 References extracted from [grobid](https://grobid.readthedocs.io).
 
+## TODO
+
+* For a given reference string in grobid, find a matching release in fatcat.
+
+## Approach
+
+Two general ways:
+
+* do queries against elasticsearch, which would max out at a few hundred queries/s
+* offline compute a key (e.g. title, title ngram plus authors, etc.); then do comparisons
+
+## Misc
+
 Example grobid outputs:
 
 * [grobid.tei.xml](grobid.tei.xml), [pdf](http://dss.in.tum.de/files/brandt-research/me.pdf) -- here grobid does not extract many refs; GS looks ok
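
The second approach in the README above (compute a key offline, then compare) could look roughly like the sketch below. It is only an illustration, not code from fuzzycat: the function names (`title_key`, `build_index`, `candidates`), the normalization rules and the release identifiers are all assumptions. It uses the title-only key; a key built from title ngrams plus authors would slot into the same structure.

```python
import re
import unicodedata
from collections import defaultdict


def title_key(title: str) -> str:
    """Reduce a title to a lowercase, accent-free, alphanumeric-only key.

    Hypothetical normalization; the README only says "e.g. title".
    """
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def build_index(releases):
    """Offline step: map each key to the release identifiers sharing it.

    `releases` is an iterable of (identifier, title) pairs; the
    identifiers used here are made up for illustration.
    """
    index = defaultdict(list)
    for ident, title in releases:
        index[title_key(title)].append(ident)
    return index


def candidates(index, ref_title: str):
    """Comparison step: look up releases whose key equals the reference's key."""
    return index.get(title_key(ref_title), [])


if __name__ == "__main__":
    index = build_index([
        ("release_a", "Deep Learning: A Review"),
        ("release_b", "Another Paper"),
    ])
    print(candidates(index, "deep learning - a review"))
```

Exact key equality only yields candidates. Different works can share a title, so a match would still need verification against other fields such as authors or year.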