status: in progress DBLP Metadata Import ==================== ~5.3 million publications, ~2.6 million authors, ~5k conferences, ~1.7k journals. All metadata is explicitly CC-0 Container metadata: - journals: match via ISSN (?) - create containers for all conferences (at least series), and make a series/container/dblp/name/publisher mapping - make some decision about conference series vs. conference instance vs. published proceedings - TBD: lookups Release metadata: x add `dblp` as a release identifier type to fatcat schema - look at CSL fields: conference series? book series? etc - if arxiv.org, skip import for now => though note could disambiguate authors - if has a DOI: fetch fatcat record. if no stage/type/`container_id`, update record - always fuzzy match? experiment first Author metadata: - TBD Fulltext ingest: - XML to ingest requests - article key, DOI, arxiv, other repo identifiers ## Plan x get martin review of this plan x read full XML DTD x scrape container metadata (for ~6k containers): ISSN, Wikidata QID, name => selectolax? => title, issn, wikidata x implement basic release import, with tests (no container/creator linking) => surface any unexpected issues x estimate number of entities with/without external identifier (DOI) Counter({'total': 7953365, 'has-doi': 4277307, 'skip': 2953841, 'skip-key-type': 2640968, 'skip-arxiv-corr': 312872, 'skip-title': 1, 'insert': 0, 'update': 0, 'exists': 0}) / update container and creator schemas to have lookup-able dblp identifiers (creator:`dblp_pid`, container:`dblp_prefix`) . run orcid import/update of creators - container creator/update for `dblp_prefix` => chocula import first? - investigate journal+conference ISSN mapping ## Creator Metadata There is a "person ID" system. These can be just numbers (new records), just names, or alphanumeric disambiguated names. ## Container Metadata Types: - journal - book-series - proceedings - conference-series (?) TBD: - conference series or individual instances? if series, can use volume/year to distinguish, seems best - workshops as separate containers? probably yes - proceedings vs. papers vs. abstracts? Going to have many containers with no ISSN. Do we need dblp-specific lookup? Or do a special-case mapping file for expediency? Journals do not have explicit entities in the database. They do have names, in the form of URL prefix to article keys. Additionally, there are (often?) HTML pages with things like ISSN ("BHT" files). There may be a dump of these? ## Release Metadata Schema is basically BibTeX. Types: - article -> journal-article (if 'journal'), article, others - inproceedings -> conference-paper - proceedings -> (container) - book -> book - incollection -> chapter (or part?) - phdthesis -> thesis - mastersthesis -> thesis - www => often a person, if key starts with "homepages" - data (?) - publtype sub-type: encyclopedia/"encyclopedia entry" -> entry-encyclopedia (?) informal/"informal publication" (greylit) edited (editorial or news) survey (survey/review article) data (dataset) software withdrawn Future: person Fields: - element type (one of the above) - key (eg, "journals/cacm/Szalay08") - title => may contain , , , - author (multiple; each a single string) => may have HTML entities => may have a number at the end, to aid with identifier creation => orcid - editor (same as author) => orcid - journal (abbrev?) - volume, pages, number (number -> issue) - publisher - year => for conferences, year of conference not of publication - month - crossref (from inproceedings to specific proceedings volume) - booktitle => for inproceedings, this is the name of conference or workshop. acronym. - isbn - ee (electronic edition; often DOI?) => in some cases a "local" URL => publisher URL; often DOI => type attr - url => dblp internal link to table-of-contents - publnr => alternative identifier - note => for persons (www), may be name in non-Latin character set - series: ? => has href attr - cite: ? - school: ? - chapter: ? Notable CSL "extra" fields: => 'event': name of conference/workshop => 'event-place': location of conference/workshop => 'collection-title' (eg, book series) => 'container-title' (eg, book for a chapter) ## Resources "DBLP — Some Lessons Learned" https://dblp.org/xml/docu/dblpxml.pdf https://blog.dblp.org/2020/08/18/new-dblp-url-scheme-and-api-changes/