status: in progress
DBLP Metadata Import
====================
~5.3 million publications, ~2.6 million authors, ~5k conferences, ~1.7k journals.
All metadata is explicitly CC-0
Container metadata:
- journals: match via ISSN (?)
- create containers for all conferences (at least series), and make a series/container/dblp/name/publisher mapping
- make some decision about conference series vs. conference instance vs. published proceedings
- TBD: lookups
Release metadata:
x add `dblp` as a release identifier type to fatcat schema
- look at CSL fields: conference series? book series? etc
- if the record is an arxiv.org preprint, skip import for now
    => though note these records could help disambiguate authors
- if the record has a DOI: fetch the existing fatcat release. if it has no stage/type/`container_id`, update it
- always fuzzy match? experiment first
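
Rough sketch of the release decision rules above. `plan_release_action` and its return labels are hypothetical names, and two points are assumptions not settled in these notes: "no stage/type/`container_id`" is read as "any of the three is missing", and records with no existing match are simply inserted (fuzzy matching is left out until the experiment is done).

```python
from typing import Optional

# Fields checked on an existing fatcat release (assumed field names)
CHECKED_FIELDS = ("release_stage", "release_type", "container_id")


def plan_release_action(
    is_arxiv: bool, doi: Optional[str], existing: Optional[dict]
) -> str:
    """Return one of "skip", "update", "exists", "insert" (hypothetical labels)."""
    if is_arxiv:
        # arxiv.org records are skipped for now
        return "skip"
    if doi and existing is not None:
        # Assumption: update when ANY of the checked fields is empty
        missing = [f for f in CHECKED_FIELDS if not existing.get(f)]
        return "update" if missing else "exists"
    # Assumption: no known match, so create a new release
    return "insert"
```

Here `existing` stands in for whatever the DOI lookup returns, reduced to a plain dict (or `None` when nothing was found).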
Author metadata:
- TBD
Fulltext ingest:
- XML to ingest requests
- article key, DOI, arxiv, other repo identifiers
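
Possible shape of the "XML to ingest requests" step, as a sketch only. The function name, the dict fields, and the `"pdf"` ingest type are assumptions for illustration, not a confirmed request schema; the URL prefixes are just the common DOI and arXiv abstract forms.

```python
from typing import Dict, List

DOI_PREFIX = "https://doi.org/"
ARXIV_PREFIX = "https://arxiv.org/abs/"


def build_ingest_requests(key: str, ee_urls: List[str]) -> List[Dict]:
    """Turn one dblp record (article key plus its <ee> URLs) into request dicts."""
    requests = []
    for url in ee_urls:
        # Always carry the dblp article key; add DOI/arxiv when the URL reveals one
        ext_ids = {"dblp": key}
        if url.startswith(DOI_PREFIX):
            ext_ids["doi"] = url[len(DOI_PREFIX):]
        elif url.startswith(ARXIV_PREFIX):
            ext_ids["arxiv"] = url[len(ARXIV_PREFIX):]
        requests.append(
            {
                "ingest_type": "pdf",  # assumption
                "base_url": url,
                "link_source": "dblp",
                "link_source_id": key,
                "ext_ids": ext_ids,
            }
        )
    return requests
```

Other repository identifiers would need their own prefix rules; none are guessed at here.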
## Plan
- get Martin's review of this plan
x read full XML DTD
- scrape container metadata (for ~6k containers): ISSN, Wikidata QID, name
=> selectolax?
=> title, issn, wikidata, "is OA"
- implement basic release import, with tests (no container/creator linking)
=> surface any unexpected issues
- estimate number of entities with/without external identifier (DOI)
- investigate journal+conference ISSN mapping
- run orcid import/update of creators
- update container and creator schemas to have lookup-able dblp identifiers (creator:`dblp_pid`, container:`dblp_prefix`)
## Creator Metadata
There is a "person ID" system. These can be just numbers (new records), just
names, or alphanumeric disambiguated names.
## Container Metadata
Types:
- journal
- book-series
- proceedings
- conference-series (?)
TBD:
- conference series or individual instances? series seems best, since
  volume/year can distinguish the instances
- workshops as separate containers? probably yes
- proceedings vs. papers vs. abstracts?
Going to have many containers with no ISSN. Do we need dblp-specific lookup? Or
do a special-case mapping file for expediency?
Journals do not have explicit entities in the database. They do have names, in
the form of the URL prefix of article keys (eg, the "journals/cacm" part of
"journals/cacm/Szalay08"). Additionally, there are (often?) HTML pages with
things like ISSN ("BHT" files). There may be a dump of these?
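
Sketch of deriving a container prefix from an article key, which could feed either a `dblp_prefix` lookup or a special-case mapping file. The helper name is hypothetical, and treating only `journals/` and `conf/` keys as container-bearing is an assumption.

```python
from typing import Optional

# Assumption: only these top-level key spaces map to containers
CONTAINER_KEY_SPACES = ("journals", "conf")


def container_prefix(key: str) -> Optional[str]:
    """Return the first two path segments of a dblp key, or None."""
    parts = key.split("/")
    if len(parts) >= 3 and parts[0] in CONTAINER_KEY_SPACES:
        return "/".join(parts[:2])
    return None
```

With the example key used in these notes, `container_prefix("journals/cacm/Szalay08")` gives `"journals/cacm"`.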
## Release Metadata
Schema is basically BibTeX.
Types:
- article -> journal-article (if 'journal'), article, others
- inproceedings -> conference-paper
- proceedings -> (container)
- book -> book
- incollection -> chapter (or part?)
- phdthesis -> thesis
- mastersthesis -> thesis
- www
    => often a person, if key starts with "homepages"
- data (?)
- publtype sub-type:
    - encyclopedia/"encyclopedia entry" -> entry-encyclopedia (?)
    - informal/"informal publication" (greylit)
    - edited (editorial or news)
    - survey (survey/review article)
    - data (dataset)
    - software
    - withdrawn
Future: person
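
The type list above as a lookup table, as a minimal sketch. Target strings are copied from the list and may not match the final fatcat vocabulary; `release_type_for` is a hypothetical helper, and returning `None` for `proceedings`, `www`, and `data` (not imported as releases for now) is an assumption.

```python
from typing import Optional

# dblp element type -> release type, taken from the list above
DBLP_TYPE_MAP = {
    "inproceedings": "conference-paper",
    "book": "book",
    "incollection": "chapter",
    "phdthesis": "thesis",
    "mastersthesis": "thesis",
}


def release_type_for(element_type: str, has_journal: bool = False) -> Optional[str]:
    """Map a dblp element type to a release type, or None if not a release."""
    if element_type == "article":
        # 'article' is only a journal article when a <journal> field is present
        return "journal-article" if has_journal else "article"
    # proceedings (container), www (person), data: no release type (assumption)
    return DBLP_TYPE_MAP.get(element_type)
```

The `publtype` sub-types would refine this result in a second step; they are not handled here.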
Fields:
- element type (one of the above)
- key (eg, "journals/cacm/Szalay08")
- title
=> may contain <i>, <sub>, <sup>, <tt>
- author (multiple; each a single string)
=> may have HTML entities
=> may have a number at the end, to aid with identifier creation
=> orcid
- editor (same as author)
=> orcid
- journal (abbrev?)
- volume, pages, number (number -> issue)
- publisher
- year
=> for conferences, year of conference not of publication
- month
- crossref (from inproceedings to specific proceedings volume)
- booktitle
=> for inproceedings, this is the name of conference or workshop. acronym.
- isbn
- ee (electronic edition; often DOI?)
=> in some cases a "local" URL
=> publisher URL; often DOI
=> type attr
- url
=> dblp internal link to table-of-contents
- publnr
=> alternative identifier
- note
=> for persons (www), may be name in non-Latin character set
- series: ?
=> has href attr
- cite: ?
- school: ?
- chapter: ?
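
Sketch of pulling the fields above out of a single record with the standard library. The sample record is made up (placeholder values on the example key), `orcid` is assumed to be an attribute on `<author>`, and output field names are hypothetical. The sample avoids HTML entities, since resolving those would need the DTD.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample record, for illustration only
SAMPLE = """<article key="journals/cacm/Szalay08">
  <author orcid="0000-0000-0000-0000">Example Author 0001</author>
  <title>A placeholder title with <i>inline</i> markup.</title>
  <journal>Example J.</journal>
  <volume>1</volume>
  <number>2</number>
  <pages>10-20</pages>
  <year>2008</year>
  <ee>https://doi.org/10.0000/example</ee>
</article>"""


def clean_name(raw: str) -> str:
    """Drop a trailing disambiguation number ("Example Author 0001")."""
    match = re.match(r"^(.+?)\s+\d+$", raw)
    return match.group(1) if match else raw


def parse_record(xml_text: str) -> dict:
    elem = ET.fromstring(xml_text)
    title = elem.find("title")
    return {
        "element_type": elem.tag,
        "key": elem.get("key"),
        # itertext() flattens nested <i>/<sub>/<sup>/<tt> into plain text
        "title": "".join(title.itertext()) if title is not None else None,
        "authors": [
            {"raw_name": a.text, "name": clean_name(a.text or ""), "orcid": a.get("orcid")}
            for a in elem.findall("author")
        ],
        "journal": elem.findtext("journal"),
        "volume": elem.findtext("volume"),
        "issue": elem.findtext("number"),  # number -> issue
        "pages": elem.findtext("pages"),
        "year": elem.findtext("year"),
        "ee": [e.text for e in elem.findall("ee")],
    }
```

For the full dump, a streaming parser would be needed instead of `ET.fromstring`; this only shows the per-record field handling.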
Notable CSL "extra" fields:
=> 'event': name of conference/workshop
=> 'event-place': location of conference/workshop
=> 'collection-title' (eg, book series)
=> 'container-title' (eg, book for a chapter)
## Resources
"DBLP — Some Lessons Learned"
https://dblp.org/xml/docu/dblpxml.pdf
https://blog.dblp.org/2020/08/18/new-dblp-url-scheme-and-api-changes/