update notes/TODO

author: Bryan Newbold <bnewbold@robocracy.org> 2018-11-21 22:21:52 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2018-11-21 22:22:32 -0800
commit: 3f6a7483e64c907e4af2e87427321e65e77b169c (patch)
tree: dd9ea5a3ed000015cd3b36c8ea938622b27822aa
parent: fb80eab1bdae2d21a3dda2e82230b7477ed41ebc (diff)
download: fatcat-3f6a7483e64c907e4af2e87427321e65e77b169c.tar.gz fatcat-3f6a7483e64c907e4af2e87427321e65e77b169c.zip
3 files changed, 34 insertions, 0 deletions
diff --git a/TODO b/TODO
index c27374bf..704b003d 100644
--- a/TODO
+++ b/TODO
@@ -65,6 +65,8 @@ new importers:
 
 ## Schema / Entity Fields
 
+- arxiv_id field (keep flip-flopping)
+- original_title field (?)
 - FileSet and WebSnapshot entities
 - `doi` field for containers (at least for "journal" type; maybe for "series"
   as well?)
diff --git a/notes/crossref_types.txt b/notes/crossref_types.txt
new file mode 100644
index 00000000..9823ab04
--- /dev/null
+++ b/notes/crossref_types.txt
@@ -0,0 +1,16 @@
+
+https://www.crossref.org/services/content-registration/
+
+We store metadata and DOIs for many types of research-related content. The content types that we currently accept are below. If you have a content type that isn’t listed please contact us. At the moment we’re developing schemas for grants, conferences, and projects.
+
+- Journals and journal articles: at the journal title and article level and includes supplemental materials.
+- Books, chapters, and reference works: book title and/or chapter-level records, books can be deposited as a monograph, series, or set. Read our best practice for book content.
+- Conference proceedings: information about a single conference and records for each conference paper/proceeding.
+- Reports/working papers: this includes content that is formally published and is published with an ISSN or ISBN.
+- Standards: includes publications from Standards Development Organizations and Standards Setting Organizations.
+- Datasets: includes database records or collections. (See also DataCite.)
+- Dissertations: includes single dissertations and theses - not collections.
+- Preprints: consists of preprints, eprints, working papers, reports, and other types of content that has been posted but not formally published.
+- Peer reviews: any number of reviews, reports, or comments attached to an associated article.
+- Components: typically assigned to parts of a whole, most commonly including figures, tables, and supplemental materials for a journal article or book chapter.
+
diff --git a/notes/performance/kafka_pipeline.txt b/notes/performance/kafka_pipeline.txt
index 0a503a18..0ff2e411 100644
--- a/notes/performance/kafka_pipeline.txt
+++ b/notes/performance/kafka_pipeline.txt
@@ -29,3 +29,19 @@ elastic-release python processing is at 66% (of one core) CPU! and elastic at
 ~30%. Huh.
 
 But, in general, "seems to be working".
+
+## End-To-End
+
+release-updates: 40/sec
+api-crossref: 40/sec
+api-datacite: 15/sec
+changelog: 11/sec
+consumer_offsets: 0.5/sec
+
+elastic indexing looks like only 8/sec or so. Probably need to batch.
+
+Tried running additional fatcat-elasticsearch-release-worker processes, and
+throughput scales linearly.
+
+Are consumer group names not actually topic-dependent? Hrm, might need to
+rename them all for prod/qa split.
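
The "Probably need to batch" note in the kafka_pipeline hunk above could be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: the index name `fatcat_release`, the `ident` key, and both helper functions are assumptions made up for the example. It groups documents into fixed-size batches and builds a newline-delimited body in the shape the Elasticsearch bulk endpoint expects (one action line followed by one source line per document).

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of up to `size` items drawn from `items`, in order."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_body(index: str, docs: List[dict]) -> str:
    """Build a newline-delimited bulk body: an action line, then the document.

    Assumes (hypothetically) that each document carries its id under "ident".
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["ident"]}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in for release documents read from a Kafka topic.
    releases = ({"ident": "r{}".format(i), "title": "example {}".format(i)} for i in range(5))
    for batch in batched(releases, 2):
        body = bulk_body("fatcat_release", batch)
        # A real worker would POST `body` to the cluster's /_bulk endpoint here.
        print(len(batch), "docs in one request")
```

With one request per batch instead of one per document, per-request overhead is spread over the whole batch; the right batch size would need to be measured against the 8/sec baseline noted above.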
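
On the prod/qa question in the kafka_pipeline notes: Kafka consumer group IDs are scoped to the cluster rather than to a topic, with committed offsets tracked per group, topic, and partition. One way to keep the environments separate is to derive the group name from the environment. The sketch below is only an illustration; the `fatcat-{env}-{worker}` pattern and the `consumer_group` helper are hypothetical, not names taken from the repository.

```python
# Hypothetical set of deployment environments sharing one Kafka cluster.
ENVIRONMENTS = ("prod", "qa")


def consumer_group(env: str, worker: str) -> str:
    """Return an environment-specific consumer group name for a worker."""
    if env not in ENVIRONMENTS:
        raise ValueError("unknown environment: {!r}".format(env))
    return "fatcat-{}-{}".format(env, worker)


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(consumer_group(env, "elasticsearch-release-worker"))
```

The returned string would be passed as the group id when constructing each worker's consumer, so prod and qa workers never join the same group.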