author | Martin Czygan <martin.czygan@gmail.com> | 2020-08-27 16:52:13 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2020-08-27 16:52:13 +0200 |
commit | 4ab53ddfeef8fa99f5cf507f582c224a32e4c8b9 (patch) | |
tree | 99b9669335bcf1eb282b731eeec19b3b79343763 /projects | |
parent | ce6a2ee453d29d0521c1dc3672363ec8934d2f2a (diff) | |
update project README
Diffstat (limited to 'projects')
-rw-r--r-- | projects/oai_harvest_md/.gitignore | 1 |
-rw-r--r-- | projects/oai_harvest_md/Makefile | 5 |
-rw-r--r-- | projects/oai_harvest_md/README.md | 10 |
-rw-r--r-- | projects/titlelist/README.md | 6 |
4 files changed, 22 insertions, 0 deletions
diff --git a/projects/oai_harvest_md/.gitignore b/projects/oai_harvest_md/.gitignore
new file mode 100644
index 0000000..96077ed
--- /dev/null
+++ b/projects/oai_harvest_md/.gitignore
@@ -0,0 +1 @@
+oai.ndjson.zst
diff --git a/projects/oai_harvest_md/Makefile b/projects/oai_harvest_md/Makefile
new file mode 100644
index 0000000..f9deb7f
--- /dev/null
+++ b/projects/oai_harvest_md/Makefile
@@ -0,0 +1,5 @@
+SHELL := /bin/bash
+
+oai_harvest_20200215.ndjson.zst:
+	wget -c https://archive.org/download/oai_harvest_20200215/oai.ndjson.zst
+
diff --git a/projects/oai_harvest_md/README.md b/projects/oai_harvest_md/README.md
index bbaa915..5f2b655 100644
--- a/projects/oai_harvest_md/README.md
+++ b/projects/oai_harvest_md/README.md
@@ -1,5 +1,7 @@
 # OAI metadata matching
 
+Goal: end-to-end data workflow (acquisition, harvest, matching, new release entities).
+
 ## Plan
 
 * [ ] get JSON version, via [oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
@@ -8,3 +10,11 @@
 * [ ] (b) for items w/o doi, get a list of (id, title)
 * [ ] run fuzzy matching over title list to find out which one we have
 
+## Get data
+
+```
+$ make
+```
+
+* compressed 12G, around 100G uncompressed
+
diff --git a/projects/titlelist/README.md b/projects/titlelist/README.md
new file mode 100644
index 0000000..a58f50e
--- /dev/null
+++ b/projects/titlelist/README.md
@@ -0,0 +1,6 @@
+# Title list
+
+Given a list of over 20M publication titles, determine matches, to generate a
+seed list.
+
+
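
The plan in the OAI README (list `(id, title)` for items without a DOI, then match titles against a known list) could be sketched roughly as follows. This is not code from the commit. The field names `id`, `title` and `doi` are hypothetical, since the record schema of the harvest is not shown here, and the matching step is a simple exact match on a normalized title key, a stand-in for real fuzzy matching.

```python
import json
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", no_marks.lower()).strip()


def titles_without_doi(lines):
    """Yield (id, title) for NDJSON records that carry a title but no DOI.

    Assumption: records use the keys "id", "title" and "doi".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("doi") or not record.get("title"):
            continue
        yield record["id"], record["title"]


def match_titles(candidates, known_titles):
    """Map candidate ids to known titles that share the same normalized key."""
    index = defaultdict(list)
    for title in known_titles:
        key = normalize_title(title)
        if key:  # skip titles that normalize to an empty string
            index[key].append(title)
    matches = {}
    for ident, title in candidates:
        hits = index.get(normalize_title(title))
        if hits:
            matches[ident] = hits
    return matches
```

Since the file is around 100G uncompressed, the generator is meant to be fed a stream, for example `sys.stdin` with the data piped in from `zstd -dc oai.ndjson.zst`, rather than a fully loaded file.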