author Bryan Newbold <bnewbold@archive.org> 2023-01-02 19:16:09 -0800
committer Bryan Newbold <bnewbold@archive.org> 2023-01-02 19:16:09 -0800
commit 99cc7de073baee53bb97075377906743d364ab84
tree 12b68a9695097c69eed68b1f8ece12b3007e3d4c /proposals/20201103_xml_ingest.md
parent e433990172c157707d92452652aefe2f21b6a4a0
proposals: update status; include some brainstorm-only docs
Diffstat (limited to 'proposals/20201103_xml_ingest.md')
-rw-r--r-- proposals/20201103_xml_ingest.md 19
1 files changed, 1 insertions, 18 deletions
diff --git a/proposals/20201103_xml_ingest.md b/proposals/20201103_xml_ingest.md
index 25ec973..34e00b0 100644
--- a/proposals/20201103_xml_ingest.md
+++ b/proposals/20201103_xml_ingest.md
@@ -1,22 +1,5 @@
-status: wip
-
-TODO:
-x XML fulltext URL extractor (based on HTML biblio metadata, not PDF url extractor)
-x differential JATS XML and scielo XML from generic XML?
-    application/xml+jats is what fatcat is doing for abstracts
-    but it should be application/jats+xml?
-    application/tei+xml
-    if startswith "<article " and "<article-meta>" => JATS
-x refactor ingest worker to be more general
-x have ingest code publish body to kafka topic
-x write a persist worker
-/ create/configure kafka topic
-- test everything locally
-- fatcat: ingest tool to create requests
-- fatcat: entity updates worker creates XML ingest requests for specific sources
-- fatcat: ingest file import worker allows XML results
-- ansible: deployment of persist worker
+status: deployed
XML Fulltext Ingest
====================