# OAI metadata matching

Goal: end-to-end data workflow (acquisition, harvest, matching, new release entities).

## Plan

* [ ] get the JSON version via [oai_harvest_20200215](https://archive.org/details/oai_harvest_20200215)
* [ ] filter out out-of-scope data
* [ ] (a) for items that have a DOI, check via the API whether we already have metadata for this DOI
* [ ] (b) for items without a DOI, get a list of (id, title)
* [ ] run fuzzy matching over the title list to find out which ones we already have
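
Steps (a), (b) and the fuzzy matching could look roughly like the sketch below. It is only an illustration: the record fields `id`, `doi` and `title`, the function names, and the `0.9` threshold are assumptions, not taken from the harvest data or any existing code. The API lookup for DOIs is left out; title similarity uses `difflib` from the Python standard library.

```python
import difflib
import json
import re


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def split_by_doi(lines):
    """Partition JSON lines into ({doi: id}, [(id, title), ...]).

    The field names "id", "doi" and "title" are assumed, not confirmed.
    """
    with_doi, without_doi = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        doi = record.get("doi")
        if doi:
            # DOIs are case-insensitive, so compare them lowercased.
            with_doi[doi.lower()] = record["id"]
        else:
            without_doi.append((record["id"], record.get("title") or ""))
    return with_doi, without_doi


def fuzzy_match(title, known_titles, threshold=0.9):
    """Return the most similar known title, or None if below threshold."""
    target = normalize_title(title)
    best, best_score = None, 0.0
    for candidate in known_titles:
        score = difflib.SequenceMatcher(
            None, target, normalize_title(candidate)
        ).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

Comparing every title against every known title is quadratic, so for the full dump this would need some kind of blocking or indexing first (for example, grouping by a normalized title prefix).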

## Get data

```
$ make
```

* compressed 12G, around 100G uncompressed
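
Given the size, it may be easier to stream the compressed file than to unpack it to disk. A minimal sketch, assuming the download is gzip-compressed, newline-delimited JSON (both the format and the `iter_records` helper are assumptions; adjust to whatever `make` actually produces):

```python
import gzip
import json


def iter_records(path):
    """Yield one parsed JSON record per non-empty line of a gzip file.

    Assumes gzip compression and one JSON document per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because this is a generator, only one record is held in memory at a time.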