Diffstat (limited to 'projects/oai_harvest_md')
-rw-r--r-- | projects/oai_harvest_md/.gitignore | 1
-rw-r--r-- | projects/oai_harvest_md/Makefile | 5
-rw-r--r-- | projects/oai_harvest_md/README.md | 20
3 files changed, 0 insertions, 26 deletions
diff --git a/projects/oai_harvest_md/.gitignore b/projects/oai_harvest_md/.gitignore
deleted file mode 100644
index 96077ed..0000000
--- a/projects/oai_harvest_md/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-oai.ndjson.zst
diff --git a/projects/oai_harvest_md/Makefile b/projects/oai_harvest_md/Makefile
deleted file mode 100644
index f9deb7f..0000000
--- a/projects/oai_harvest_md/Makefile
+++ /dev/null
@@ -1,5 +0,0 @@
-SHELL := /bin/bash
-
-oai_harvest_20200215.ndjson.zst:
-	wget -c https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst
-
diff --git a/projects/oai_harvest_md/README.md b/projects/oai_harvest_md/README.md
deleted file mode 100644
index 5f2b655..0000000
--- a/projects/oai_harvest_md/README.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# OAI metadata matching
-
-Goal: end-to-end data workflow (acquisition, harvest, matching, new release entities).
-
-## Plan
-
-* [ ] get JSON version, via [oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
-* [ ] filter out out of scope data
-* [ ] (a) for items that have a doi, figure out, whether we already have md for this doi via API
-* [ ] (b) for items w/o doi, get a list of (id, title)
-* [ ] run fuzzy matching over title list to find out which one we have
-
-## Get data
-
-```
-$ make
-```
-
-* compressed 12G, around 100G uncompressed
-
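
For reference, steps (a) and (b) of the plan in the removed README could be sketched roughly as follows. This is not code from the repository, only a minimal illustration. The field names `doi`, `id` and `title` are assumptions, since the record layout of `oai.ndjson.zst` is not shown here, and `classify`, `normalize_title` and `title_similarity` are hypothetical helper names. The sketch reads decompressed NDJSON lines from any iterable and yields results one by one, so the roughly 100G of uncompressed data never needs to fit in memory.

```python
import difflib
import json
import re
import unicodedata


def normalize_title(title):
    """Lowercase, drop accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def classify(lines, doi_field="doi", id_field="id", title_field="title"):
    """Yield ("doi", doi) for records with a DOI, else ("title", (id, title)).

    The field names are assumptions; adjust them to the real record layout.
    Records with neither a DOI nor a title are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        doi = record.get(doi_field)
        title = record.get(title_field)
        if isinstance(doi, str) and doi.strip():
            # Step (a): DOIs are case-insensitive, so compare them lowercased.
            yield ("doi", doi.strip().lower())
        elif isinstance(title, str) and title.strip():
            # Step (b): keep (id, normalized title) for later fuzzy matching.
            yield ("title", (record.get(id_field), normalize_title(title)))


def title_similarity(a, b):
    """Similarity in [0, 1] between two normalized titles (stdlib difflib)."""
    return difflib.SequenceMatcher(None, a, b).ratio()
```

One way to feed it would be `zstd -dc oai.ndjson.zst | python script.py`, with the script iterating over `sys.stdin`. Pairwise `title_similarity` is only practical on a pre-filtered candidate set; at this scale an exact match on the normalized title is a cheaper first pass.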