From 543dab55ade2cf2d4744c478691f085297b2545a Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 2 Dec 2020 11:30:41 -0800
Subject: dblp import proposal

Had notes on this floating around since August (not in git), but mostly
rewrote these in past couple days.
---
 proposals/20200807_dblp.md | 159 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 159 insertions(+)
 create mode 100644 proposals/20200807_dblp.md

(limited to 'proposals')

diff --git a/proposals/20200807_dblp.md b/proposals/20200807_dblp.md
new file mode 100644
index 00000000..b955268f
--- /dev/null
+++ b/proposals/20200807_dblp.md
@@ -0,0 +1,159 @@
+
+status: in progress
+
+DBLP Metadata Import
+====================
+
+~5.3 million publications, ~2.6 million authors, ~5k conferences, ~1.7k journals.
+
+All metadata is explicitly CC-0
+
+Container metadata:
+
+- journals: match via ISSN (?)
+- create containers for all conferences (at least series), and make a series/container/dblp/name/publisher mapping
+- make some decision about conference series vs. conference instance vs. published proceedings
+- TBD: lookups
+
+Release metadata:
+
+x add `dblp` as a release identifier type to fatcat schema
+- look at CSL fields: conference series? book series? etc
+- if arxiv.org, skip import for now
+    => though note could disambiguate authors
+- if has a DOI: fetch fatcat record. if no stage/type/`container_id`, update record
+- always fuzzy match? experiment first
+
+Author metadata:
+
+- TBD
+
+Fulltext ingest:
+
+- XML to ingest requests
+- article key, DOI, arxiv, other repo identifiers
+
+## Plan
+
+- get martin review of this plan
+x read full XML DTD
+- scrape container metadata (for ~6k containers): ISSN, Wikidata QID, name
+    => selectolax?
+    => title, issn, wikidata, "is OA"
+- implement basic release import, with tests (no container/creator linking)
+    => surface any unexpected issues
+- estimate number of entities with/without external identifier (DOI)
+- investigate journal+conference ISSN mapping
+- run orcid import/update of creators
+- update container and creator schemas to have lookup-able dblp identifiers (creator:`dblp_pid`, container:`dblp_prefix`)
+
+
+## Creator Metadata
+
+There is a "person ID" system. These can be just numbers (new records), just
+names, or alphanumeric disambiguated names.
+
+
+## Container Metadata
+
+Types:
+
+- journal
+- book-series
+- proceedings
+- conference-series (?)
+
+TBD:
+
+- conference series or individual instances? if series, can use volume/year to
+  distinguish, seems best
+- workshops as separate containers? probably yes
+- proceedings vs. papers vs. abstracts?
+
+Going to have many containers with no ISSN. Do we need dblp-specific lookup? Or
+do a special-case mapping file for expediency?
+
+Journals do not have explicit entities in the database. They do have names, in
+the form of URL prefix to article keys. Additionally, there are (often?) HTML
+pages with things like ISSN ("BHT" files). There may be a dump of these?
+
+
+## Release Metadata
+
+Schema is basically BibTeX.
+
+Types:
+
+- article -> journal-article (if 'journal'), article, others
+- inproceedings -> conference-paper
+- proceedings -> (container)
+- book -> book
+- incollection -> chapter (or part?)
+- phdthesis -> thesis
+- mastersthesis -> thesis
+- www
+    => often a person, if key starts with "homepages"
+- data (?)
+- publtype sub-type:
+    encyclopedia/"encyclopedia entry" -> entry-encyclopedia (?)
+    informal/"informal publication" (greylit)
+    edited (editorial or news)
+    survey (survey/review article)
+    data (dataset)
+    software
+    withdrawn
+
+Future: person
+
+Fields:
+
+- element type (one of the above)
+- key (eg, "journals/cacm/Szalay08")
+- title
+    => may contain inline markup tags
+- author (multiple; each a single string)
+    => may have HTML entities
+    => may have a number at the end, to aid with identifier creation
+    => orcid
+- editor (same as author)
+    => orcid
+- journal (abbrev?)
+- volume, pages, number (number -> issue)
+- publisher
+- year
+    => for conferences, year of conference not of publication
+- month
+- crossref (from inproceedings to specific proceedings volume)
+- booktitle
+    => for inproceedings, this is the name of conference or workshop. acronym.
+- isbn
+- ee (electronic edition; often DOI?)
+    => in some cases a "local" URL
+    => publisher URL; often DOI
+    => type attr
+- url
+    => dblp internal link to table-of-contents
+- publnr
+    => alternative identifier
+- note
+    => for persons (www), may be name in non-Latin character set
+
+- series: ?
+    => has href attr
+- cite: ?
+- school: ?
+- chapter: ?
+
+Notable CSL "extra" fields:
+    => 'event': name of conference/workshop
+    => 'event-place': location of conference/workshop
+    => 'collection-title' (eg, book series)
+    => 'container-title' (eg, book for a chapter)
+
+
+## Resources
+
+"DBLP — Some Lessons Learned"
+https://dblp.org/xml/docu/dblpxml.pdf
+
+https://blog.dblp.org/2020/08/18/new-dblp-url-scheme-and-api-changes/
-- 
cgit v1.2.3
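
A minimal sketch of the element-type mapping listed under "Release Metadata" in the patch above, assuming each dblp record has already been parsed into an `xml.etree.ElementTree.Element`. The names `DBLP_TYPE_MAP`, `release_type`, and `key_prefix` are hypothetical and not part of fatcat. The `publtype` sub-types and the `www`/`data` elements are left unhandled because the notes mark them as open questions.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Direct element -> release type mappings copied from the notes.
# Elements absent from this table ("proceedings", "www", "data")
# are not treated as releases in this sketch.
DBLP_TYPE_MAP = {
    "inproceedings": "conference-paper",
    "book": "book",
    "incollection": "chapter",
    "phdthesis": "thesis",
    "mastersthesis": "thesis",
}


def release_type(elem: ET.Element) -> Optional[str]:
    """Return a release type for a dblp record element, or None."""
    if elem.tag == "article":
        # Per the notes: 'journal-article' only if a <journal> field exists.
        has_journal = elem.find("journal") is not None
        return "journal-article" if has_journal else "article"
    return DBLP_TYPE_MAP.get(elem.tag)


def key_prefix(key: str) -> str:
    """Drop the last path segment of a dblp key.

    The notes say journal names appear as a prefix of article keys;
    using that prefix as a container lookup value is an assumption.
    """
    return key.rsplit("/", 1)[0]


# Made-up record for illustration; only the key comes from the notes.
sample = ET.fromstring(
    '<article key="journals/cacm/Szalay08">'
    "<title>Example title</title><journal>Example Journal</journal>"
    "</article>"
)
print(release_type(sample), key_prefix(sample.get("key")))
```

Returning `None` for unmapped elements keeps the decision about `proceedings` (a container, per the notes) and `www` (often a person) out of the release importer.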