author: Bryan Newbold <bnewbold@robocracy.org> 2020-12-02 11:30:41 -0800
committer: Bryan Newbold <bnewbold@robocracy.org> 2020-12-17 23:03:08 -0800
commit: 543dab55ade2cf2d4744c478691f085297b2545a
tree: 0b5a28e7809a0209798d8c2e7aaa14c0b0991f73
parent: c66f9b2d98de88a98d3a1737d415bdab4e89027c
dblp import proposal
Had notes on this floating around since August (not in git), but mostly rewrote them in the past couple of days.
Diffstat (limited to 'proposals')
-rw-r--r-- proposals/20200807_dblp.md 159
1 files changed, 159 insertions, 0 deletions
diff --git a/proposals/20200807_dblp.md b/proposals/20200807_dblp.md
new file mode 100644
index 00000000..b955268f
--- /dev/null
+++ b/proposals/20200807_dblp.md
@@ -0,0 +1,159 @@
+
+status: in progress
+
+DBLP Metadata Import
+====================
+
+~5.3 million publications, ~2.6 million authors, ~5k conferences, ~1.7k journals.
+
+All metadata is explicitly CC-0.
+
+Container metadata:
+
+- journals: match via ISSN (?)
+- create containers for all conferences (at least series), and make a series/container/dblp/name/publisher mapping
+- make some decision about conference series vs. conference instance vs. published proceedings
+- TBD: lookups
+
+Release metadata:
+
+x add `dblp` as a release identifier type to fatcat schema
+- look at CSL fields: conference series? book series? etc
+- if arxiv.org, skip import for now
+ => though note could disambiguate authors
+- if has a DOI: fetch fatcat record. if no stage/type/`container_id`, update record
+- always fuzzy match? experiment first
+
+Author metadata:
+
+- TBD
+
+Fulltext ingest:
+
+- XML to ingest requests
+- article key, DOI, arxiv, other repo identifiers
+
+## Plan
+
+- get Martin to review this plan
+x read full XML DTD
+- scrape container metadata (for ~6k containers): ISSN, Wikidata QID, name
+ => selectolax?
+ => title, issn, wikidata, "is OA"
+- implement basic release import, with tests (no container/creator linking)
+ => surface any unexpected issues
+- estimate number of entities with/without external identifier (DOI)
+- investigate journal+conference ISSN mapping
+- run orcid import/update of creators
+- update container and creator schemas to have lookup-able dblp identifiers (creator:`dblp_pid`, container:`dblp_prefix`)
+
+
+## Creator Metadata
+
+There is a "person ID" system. These can be just numbers (new records), just
+names, or alphanumeric disambiguated names.
+
+
+## Container Metadata
+
+Types:
+
+- journal
+- book-series
+- proceedings
+- conference-series (?)
+
+TBD:
+
+- conference series or individual instances? if series, can use volume/year to
+ distinguish, seems best
+- workshops as separate containers? probably yes
+- proceedings vs. papers vs. abstracts?
+
+Going to have many containers with no ISSN. Do we need dblp-specific lookup? Or
+do a special-case mapping file for expediency?
+
+Journals do not have explicit entities in the database. They do have names, in
+the form of the URL prefix of article keys. Additionally, there are (often?)
+HTML pages with things like ISSN ("BHT" files). There may be a dump of these?
+
+
+## Release Metadata
+
+Schema is basically BibTeX.
+
+Types:
+
+- article -> journal-article (if 'journal'), article, others
+- inproceedings -> conference-paper
+- proceedings -> (container)
+- book -> book
+- incollection -> chapter (or part?)
+- phdthesis -> thesis
+- mastersthesis -> thesis
+- www
+ => often a person, if key starts with "homepages"
+- data (?)
+- publtype sub-type:
+    encyclopedia/"encyclopedia entry" -> entry-encyclopedia (?)
+    informal/"informal publication" (greylit)
+    edited (editorial or news)
+    survey (survey/review article)
+    data (dataset)
+    software
+    withdrawn
+
+Future: person
+
+Fields:
+
+- element type (one of the above)
+- key (eg, "journals/cacm/Szalay08")
+- title
+ => may contain <i>, <sub>, <sup>, <tt>
+- author (multiple; each a single string)
+ => may have HTML entities
+ => may have a number at the end, to aid with identifier creation
+ => orcid
+- editor (same as author)
+ => orcid
+- journal (abbrev?)
+- volume, pages, number (number -> issue)
+- publisher
+- year
+ => for conferences, year of conference not of publication
+- month
+- crossref (from inproceedings to specific proceedings volume)
+- booktitle
+ => for inproceedings, this is the name of the conference or workshop (acronym)
+- isbn
+- ee (electronic edition; often DOI?)
+ => in some cases a "local" URL
+ => publisher URL; often DOI
+ => type attr
+- url
+ => dblp internal link to table-of-contents
+- publnr
+ => alternative identifier
+- note
+ => for persons (www), may be name in non-Latin character set
+
+- series: ?
+ => has href attr
+- cite: ?
+- school: ?
+- chapter: ?
+
+Notable CSL "extra" fields:
+ => 'event': name of conference/workshop
+ => 'event-place': location of conference/workshop
+ => 'collection-title' (eg, book series)
+ => 'container-title' (eg, book for a chapter)
+
+
+## Resources
+
+"DBLP — Some Lessons Learned"
+https://dblp.org/xml/docu/dblpxml.pdf
+
+https://blog.dblp.org/2020/08/18/new-dblp-url-scheme-and-api-changes/
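
The element-type mapping listed under "Release Metadata" in the proposal above could be sketched roughly as follows. This is a minimal illustration and not part of the commit: the function names `guess_release_type` and `container_prefix` are hypothetical, only the mappings the proposal states outright are encoded, and the rule of treating the first two segments of a key as the container prefix is an assumption based on the example key `journals/cacm/Szalay08`.

```python
from typing import Optional

# Element types that have a single target in the proposal's "Types" list.
SIMPLE_TYPE_MAP = {
    "inproceedings": "conference-paper",
    "book": "book",
    "incollection": "chapter",  # the proposal leaves "or part?" open
    "phdthesis": "thesis",
    "mastersthesis": "thesis",
}


def guess_release_type(
    element_type: str,
    has_journal: bool = False,
    publtype: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper: map a DBLP element type to a release type.

    Returns None for element types the proposal does not map to a release
    type (e.g. proceedings, which it treats as containers, and www records).
    """
    if publtype == "encyclopedia":
        # Marked "(?)" in the proposal, so this mapping is tentative.
        return "entry-encyclopedia"
    if element_type == "article":
        return "journal-article" if has_journal else "article"
    return SIMPLE_TYPE_MAP.get(element_type)


def container_prefix(key: str) -> Optional[str]:
    """Hypothetical helper: first two segments of a DBLP key.

    Assumes keys shaped like "journals/cacm/Szalay08"; keys with fewer
    than three segments yield None.
    """
    parts = key.split("/")
    if len(parts) < 3:
        return None
    return "/".join(parts[:2])


print(guess_release_type("article", has_journal=True))
print(container_prefix("journals/cacm/Szalay08"))
```

Returning `None` for unmapped types keeps the open questions in the proposal (`www`, `data`, most `publtype` values) visible to the caller instead of guessing a type for them.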