| author | Bryan Newbold <bnewbold@archive.org> | 2023-01-02 19:16:09 -0800 |
|---|---|---|
| committer | Bryan Newbold <bnewbold@archive.org> | 2023-01-02 19:16:09 -0800 |
| commit | 99cc7de073baee53bb97075377906743d364ab84 (patch) | |
| tree | 12b68a9695097c69eed68b1f8ece12b3007e3d4c /proposals/20201103_xml_ingest.md | |
| parent | e433990172c157707d92452652aefe2f21b6a4a0 (diff) | |
proposals: update status; include some brainstorm-only docs
Diffstat (limited to 'proposals/20201103_xml_ingest.md')
| -rw-r--r-- | proposals/20201103_xml_ingest.md | 19 |
1 files changed, 1 insertions, 18 deletions
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index 25ec973..34e00b0 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -1,22 +1,5 @@
-status: wip
-
-TODO:
-x XML fulltext URL extractor (based on HTML biblio metadata, not PDF url extractor)
-x differential JATS XML and scielo XML from generic XML?
-    application/xml+jats is what fatcat is doing for abstracts
-    but it should be application/jats+xml?
-    application/tei+xml
-    if startswith "<article " and "<article-meta>" => JATS
-x refactor ingest worker to be more general
-x have ingest code publish body to kafka topic
-x write a persist worker
-/ create/configure kafka topic
-- test everything locally
-- fatcat: ingest tool to create requests
-- fatcat: entity updates worker creates XML ingest requests for specific sources
-- fatcat: ingest file import worker allows XML results
-- ansible: deployment of persist worker
+status: deployed
 
 XML Fulltext Ingest
 ====================
 
