-rw-r--r-- | TODO | 2
-rw-r--r-- | notes/crossref_types.txt | 16
-rw-r--r-- | notes/performance/kafka_pipeline.txt | 16
3 files changed, 34 insertions, 0 deletions
@@ -65,6 +65,8 @@ new importers:
 ## Schema / Entity Fields
+- arxiv_id field (keep flip-flopping)
+- original_title field (?)
 - FileSet and WebSnapshot entities
 - `doi` field for containers (at least for "journal" type; maybe for "series" as well?)
diff --git a/notes/crossref_types.txt b/notes/crossref_types.txt
new file mode 100644
index 00000000..9823ab04
--- /dev/null
+++ b/notes/crossref_types.txt
@@ -0,0 +1,16 @@
+
+https://www.crossref.org/services/content-registration/
+
+We store metadata and DOIs for many types of research-related content. The content types that we currently accept are below. If you have a content type that isn’t listed please contact us. At the moment we’re developing schemas for grants, conferences, and projects.
+
+- Journals and journal articles: at the journal title and article level and includes supplemental materials.
+- Books, chapters, and reference works: book title and/or chapter-level records, books can be deposited as a monograph, series, or set. Read our best practice for book content.
+- Conference proceedings: information about a single conference and records for each conference paper/proceeding.
+- Reports/working papers: this includes content that is formally published and is published with an ISSN or ISBN.
+- Standards: includes publications from Standards Development Organizations and Standards Setting Organizations.
+- Datasets: includes database records or collections. (See also DataCite.
+- Dissertations: includes single dissertations and theses - not collections.
+- Preprints: consists of preprints, eprints, working papers, reports, and other types of content that has been posted but not formally published.
+- Peer reviews: any number of reviews, reports, or comments attached to an associated article.
+- Components: typically assigned to parts of a whole, most commonly including figures, tables, and supplemental materials for a journal article or book chapter.
+
diff --git a/notes/performance/kafka_pipeline.txt b/notes/performance/kafka_pipeline.txt
index 0a503a18..0ff2e411 100644
--- a/notes/performance/kafka_pipeline.txt
+++ b/notes/performance/kafka_pipeline.txt
@@ -29,3 +29,19 @@ elastic-release python processing is at 66% (of one core) CPU! and elastic at
 ~30%. Huh.
 But, in general, "seems to be working".
+
+## End-To-End
+
+release-updates: 40/sec
+api-crossref: 40/sec
+api-datacite: 15/sec
+changelog: 11/sec
+consumer_offsets: 0.5/sec
+
+elastic indexing looks like only 8/sec or so. Probably need to batch.
+
+Tried running additional fatcat-elasticsearch-release-worker processes, and
+throughput goes linearly.
+
+Are consumer group names not actually topic-dependent? Hrm, might need to
+rename them all for prod/qa split.
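The "Probably need to batch" note in the kafka_pipeline.txt hunk could be approached roughly as follows. This is only a sketch, not the worker's actual code: the helper names `batched` and `bulk_body`, the `ident` id field, and the `fatcat_release` index name are hypothetical. It groups documents into fixed-size chunks and builds the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint accepts (one action line followed by one source line per document). Some older Elasticsearch versions also expect a `_type` key in the action line, which is left out here.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batched(items: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield lists of up to `size` items from `items`, preserving order."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(index: str, docs: List[Dict], id_field: str = "ident") -> str:
    """Build an NDJSON request body for the Elasticsearch _bulk endpoint.

    `id_field` is a hypothetical name for the key holding the document id.
    """
    lines = []
    for doc in docs:
        # Action line: index this document under the given index and id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical documents standing in for messages read from Kafka.
    docs = [{"ident": str(i), "title": "example"} for i in range(5)]
    for chunk in batched(docs, 2):
        print(bulk_body("fatcat_release", chunk), end="")
```

Each request then carries several documents instead of one, which is the usual way to amortize per-request overhead. Whether it lifts the 8/sec figure, and by how much, would need to be measured.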