author Martin Czygan <martin.czygan@gmail.com> 2020-10-31 00:42:08 +0100
committer Martin Czygan <martin.czygan@gmail.com> 2020-10-31 00:42:08 +0100
commit 250181aead188499ce8a567183d1287289a127b5
tree 8525eb47beea6b47051d45e8843ede7432d55b68 /extra
parent db62473041ade8315a75e07b2898908438f71e60
download fuzzycat-250181aead188499ce8a567183d1287289a127b5.tar.gz
download fuzzycat-250181aead188499ce8a567183d1287289a127b5.zip
cleanup dirs
Diffstat (limited to 'extra')
-rw-r--r-- extra/grobid_references/README.md 24
-rw-r--r-- extra/oai_metadata/.gitignore 1
-rw-r--r-- extra/oai_metadata/Makefile 5
-rw-r--r-- extra/oai_metadata/README.md 2
4 files changed, 32 insertions, 0 deletions
diff --git a/extra/grobid_references/README.md b/extra/grobid_references/README.md
index e69de29..c880f3b 100644
--- a/extra/grobid_references/README.md
+++ b/extra/grobid_references/README.md
@@ -0,0 +1,24 @@
+# Grobid refs
+
+References extracted from [grobid](https://grobid.readthedocs.io).
+
+## TODO
+
+* For a given reference string in grobid, find a matching release in fatcat.
+
+## Approach
+
+Two general ways:
+
+* do queries against elasticsearch, which would max out at a few hundred queries/s
+* offline compute a key (e.g. title, title ngram plus authors, etc.); then do comparisons
+
+## Misc
+
+Example grobid outputs:
+
+* [grobid.tei.xml](grobid.tei.xml),
+ [pdf](http://dss.in.tum.de/files/brandt-research/me.pdf) -- here grobid does
+not extract many refs; GS looks ok
+* [pdf](https://ia803202.us.archive.org/21/items/jstor-1064270/1064270.pdf)
+
diff --git a/extra/oai_metadata/.gitignore b/extra/oai_metadata/.gitignore
new file mode 100644
index 0000000..96077ed
--- /dev/null
+++ b/extra/oai_metadata/.gitignore
@@ -0,0 +1 @@
+oai.ndjson.zst
diff --git a/extra/oai_metadata/Makefile b/extra/oai_metadata/Makefile
new file mode 100644
index 0000000..f9deb7f
--- /dev/null
+++ b/extra/oai_metadata/Makefile
@@ -0,0 +1,5 @@
+SHELL := /bin/bash
+
+oai_harvest_20200215.ndjson.zst:
+	wget -c https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst
+
diff --git a/extra/oai_metadata/README.md b/extra/oai_metadata/README.md
index 865311f..9bd8497 100644
--- a/extra/oai_metadata/README.md
+++ b/extra/oai_metadata/README.md
@@ -16,3 +16,5 @@ Siempre hay que defender la poesía
Faúndez, Edson. Bajo la piel de tu capa
```
+* [https://archive.org/details/oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
+* [oai.ndjson.zst](https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst)
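
Note on the "Approach" section of the new `extra/grobid_references/README.md`: the second option (compute a key offline, then compare) could look roughly like the sketch below. This is only an illustration under assumptions, not the project's implementation. The names `normalize`, `match_key`, and `group_by_key` are hypothetical, and the key uses the normalized title alone, while the README also mentions title ngrams plus authors as candidates.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Strip accents, lowercase, drop punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def match_key(title):
    """Hypothetical key: the normalized title with all spaces removed."""
    return normalize(title).replace(" ", "")


def group_by_key(records):
    """Group (identifier, title) pairs by key; a shared key marks match candidates."""
    groups = defaultdict(list)
    for ident, title in records:
        groups[match_key(title)].append(ident)
    return groups


# Assumed toy input: one reference string title and two release titles.
records = [
    ("ref-1", "Bajo la Piel de tu Capa"),
    ("release-a", "Bajo la piel de tu capa."),
    ("release-b", "Siempre hay que defender la poesía"),
]
groups = group_by_key(records)
candidates = groups[match_key("Bajo la piel de tu capa")]
```

Records that share a key are only candidates; a real pipeline would still need a stricter comparison step (authors, year, and so on) before declaring a match.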