author | Martin Czygan <martin.czygan@gmail.com> | 2020-10-31 00:42:08 +0100 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2020-10-31 00:42:08 +0100 |
commit | 250181aead188499ce8a567183d1287289a127b5 (patch) | |
tree | 8525eb47beea6b47051d45e8843ede7432d55b68 /extra | |
parent | db62473041ade8315a75e07b2898908438f71e60 (diff) | |
download | fuzzycat-250181aead188499ce8a567183d1287289a127b5.tar.gz fuzzycat-250181aead188499ce8a567183d1287289a127b5.zip |
cleanup dirs
Diffstat (limited to 'extra')
-rw-r--r-- | extra/grobid_references/README.md | 24 |
-rw-r--r-- | extra/oai_metadata/.gitignore | 1 |
-rw-r--r-- | extra/oai_metadata/Makefile | 5 |
-rw-r--r-- | extra/oai_metadata/README.md | 2 |
4 files changed, 32 insertions, 0 deletions
diff --git a/extra/grobid_references/README.md b/extra/grobid_references/README.md
index e69de29..c880f3b 100644
--- a/extra/grobid_references/README.md
+++ b/extra/grobid_references/README.md
@@ -0,0 +1,24 @@
+# Grobid refs
+
+References extracted from [grobid](https://grobid.readthedocs.io).
+
+## TODO
+
+* For a given reference string in grobid, find a matching release in fatcat.
+
+## Approach
+
+Two general ways:
+
+* do queries against elasticsearch, which would max out at a few hundred queries/s
+* offline compute a key (e.g. title, title ngram plus authors, etc.); then do comparisons
+
+## Misc
+
+Example grobid outputs:
+
+* [grobid.tei.xml](grobid.tei.xml),
+ [pdf](http://dss.in.tum.de/files/brandt-research/me.pdf) -- here grobid does
+not extract many refs; GS looks ok
+* [pdf](https://ia803202.us.archive.org/21/items/jstor-1064270/1064270.pdf)
+
diff --git a/extra/oai_metadata/.gitignore b/extra/oai_metadata/.gitignore
new file mode 100644
index 0000000..96077ed
--- /dev/null
+++ b/extra/oai_metadata/.gitignore
@@ -0,0 +1 @@
+oai.ndjson.zst
diff --git a/extra/oai_metadata/Makefile b/extra/oai_metadata/Makefile
new file mode 100644
index 0000000..f9deb7f
--- /dev/null
+++ b/extra/oai_metadata/Makefile
@@ -0,0 +1,5 @@
+SHELL := /bin/bash
+
+oai_harvest_20200215.ndjson.zst:
+	wget -c https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst
+
diff --git a/extra/oai_metadata/README.md b/extra/oai_metadata/README.md
index 865311f..9bd8497 100644
--- a/extra/oai_metadata/README.md
+++ b/extra/oai_metadata/README.md
@@ -16,3 +16,5 @@ Siempre hay que defender la poesía
 Faúndez, Edson. Bajo la piel de tu capa
 ```
 
+* [https://archive.org/details/oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
+* [oai.ndjson.zst](https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst)
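
Note on the "offline compute a key" approach listed in extra/grobid_references/README.md: a minimal sketch of what that could look like, assuming the key is just the normalized title. The names `title_key` and `group_by_key` are hypothetical and are not part of this commit; a real key would probably also include title ngrams or authors, as the README suggests.

```python
import re
import unicodedata
from collections import defaultdict


def title_key(title: str) -> str:
    """Normalize a title into a crude match key (hypothetical scheme)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase, then keep only ASCII letters and digits.
    return re.sub(r"[^a-z0-9]", "", stripped.lower())


def group_by_key(docs):
    """Group (id, title) pairs by key; ids sharing a key are match candidates."""
    groups = defaultdict(list)
    for doc_id, title in docs:
        key = title_key(title)
        if key:  # skip titles that normalize to an empty key
            groups[key].append(doc_id)
    return dict(groups)
```

Reference strings and releases would be keyed in one offline pass; only ids that land in the same group need a more expensive pairwise comparison afterwards, which avoids the per-query limit of the elasticsearch approach.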