status: implemented

DBLP Metadata Import
====================

~5.3 million publications, ~2.6 million authors, ~5k conferences, ~1.7k journals.

All metadata is explicitly CC-0.

Container metadata:

- journals: match via ISSN, if there is one
- create containers for all conferences (at least series), and make a series/container/dblp/name/publisher mapping
- make some decision about conference series vs. conference instance vs. published proceedings

Release metadata:

x add `dblp` as a release identifier type to fatcat schema
- look at CSL fields: conference series? book series? etc
- if arxiv.org, skip import for now
    => though note these records could help disambiguate authors
- if has a DOI: fetch fatcat record. if no stage/type/`container_id`, update record
- always fuzzy match? experiment first
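
The DOI rule above could be sketched as a small predicate. This is only a sketch: it assumes the existing record is available as a plain dict, that the three fields are named `release_stage`, `release_type`, and `container_id`, and that "no stage/type/`container_id`" means *any* of the three is missing. The function name `needs_update` is hypothetical.

```python
def needs_update(existing: dict) -> bool:
    """Return True if any field that dblp could fill in is missing or empty.

    Assumption: `existing` is a dict view of the fatcat release found by DOI.
    """
    fields = ("release_stage", "release_type", "container_id")
    return any(not existing.get(f) for f in fields)


# A record with no container would be updated; a complete one is left alone.
partial = {"release_stage": "published", "release_type": "article-journal"}
complete = dict(partial, container_id="example-container-id")
print(needs_update(partial), needs_update(complete))
```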

Author metadata won't be imported in this iteration.

Fulltext ingest:

- XML to ingest requests
- article key, DOI, arxiv, other repo identifiers

## Plan

x get Martin's review of this plan
x read full XML DTD
x scrape container metadata (for ~6k containers): ISSN, Wikidata QID, name
    => selectolax?
    => title, issn, wikidata
x implement basic release import, with tests (no container/creator linking)
    => surface any unexpected issues
x estimate number of entities with/without external identifier (DOI)
    Counter({'total': 7953365, 'has-doi': 4277307, 'skip': 2953841, 'skip-key-type': 2640968, 'skip-arxiv-corr': 312872, 'skip-title': 1, 'insert': 0, 'update': 0, 'exists': 0})
/ update container and creator schemas to have lookup-able dblp identifiers (creator:`dblp_pid`, container:`dblp_prefix`)
. run orcid import/update of creators
- container create/update for `dblp_prefix`
    => chocula import first?
- investigate journal+conference ISSN mapping
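
The counts from the estimation step above can be sanity-checked with a few lines. The sketch assumes that `has-doi` is only counted for records that were not skipped; if the importer counts it earlier, the "without DOI" figure does not hold.

```python
from collections import Counter

# Counts copied from the estimation run in the plan.
counts = Counter({
    "total": 7953365, "has-doi": 4277307, "skip": 2953841,
    "skip-key-type": 2640968, "skip-arxiv-corr": 312872, "skip-title": 1,
})

# The individual skip reasons should add up to the overall skip count.
skip_parts = (counts["skip-key-type"] + counts["skip-arxiv-corr"]
              + counts["skip-title"])
assert skip_parts == counts["skip"]

# Candidate releases, split by whether they carry a DOI (see assumption above).
candidates = counts["total"] - counts["skip"]
without_doi = candidates - counts["has-doi"]
print(f"candidates={candidates} with_doi={counts['has-doi']} without_doi={without_doi}")
```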


## Creator Metadata

There is a "person ID" system. These can be just numbers (new records), just
names, or alphanumeric disambiguated names.


## Container Metadata

Types:

- journal
- book-series
- proceedings
- conference-series (?)

TBD:

- conference series or individual instances? if series, can use volume/year to
  distinguish, seems best
- workshops as separate containers? probably yes
- proceedings vs. papers vs. abstracts?

Going to have many containers with no ISSN. Do we need dblp-specific lookup? Or
do a special-case mapping file for expediency?

Journals do not have explicit entities in the database. They do have names, in
the form of the URL prefix of article keys. Additionally, there are (often?) HTML
pages with details like ISSN ("BHT" files). There may be a dump of these?
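
Since containers are only named by the key prefix, a lookup value could be derived from each article key. A minimal sketch, assuming the prefix is always the first two path segments (true for the example key `journals/cacm/Szalay08`, not verified for every key type); the function name `dblp_prefix_from_key` is hypothetical.

```python
from typing import Optional


def dblp_prefix_from_key(key: str) -> Optional[str]:
    """Derive a container prefix such as 'journals/cacm' from an article key.

    Assumption: keys look like '<kind>/<venue>/<id>'; anything shorter
    is treated as having no usable prefix.
    """
    parts = key.split("/")
    if len(parts) < 3 or not all(parts[:2]):
        return None
    return "/".join(parts[:2])


print(dblp_prefix_from_key("journals/cacm/Szalay08"))
```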


## Release Metadata

Schema is basically BibTeX.

Types:

- article -> journal-article (if 'journal'), article, others
- inproceedings -> conference-paper
- proceedings -> (container)
- book -> book
- incollection -> chapter (or part?)
- phdthesis -> thesis
- mastersthesis -> thesis
- www
    => often a person, if key starts with "homepages"
- data (?)
- publtype sub-type:
    encyclopedia/"encyclopedia entry" -> entry-encyclopedia (?)
    informal/"informal publication" (greylit)
    edited (editorial or news)
    survey (survey/review article)
    data (dataset)
    software
    withdrawn
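
The element-type mapping above could be written down as a lookup table plus one special case for `article`. This is a sketch only: the target strings are copied from the list above and may not match the exact release type names in the fatcat schema, `publtype` sub-types are not handled, and `map_release_type` is a hypothetical name.

```python
from typing import Optional

# Element types that map one-to-one, as listed above.
SIMPLE_TYPES = {
    "inproceedings": "conference-paper",
    "book": "book",
    "incollection": "chapter",
    "phdthesis": "thesis",
    "mastersthesis": "thesis",
}


def map_release_type(element: str, has_journal: bool = False) -> Optional[str]:
    """Map a dblp element type to a release type, or None if not a release.

    'proceedings' (a container) and 'www' (often a person) return None.
    """
    if element == "article":
        return "journal-article" if has_journal else "article"
    return SIMPLE_TYPES.get(element)


print(map_release_type("article", has_journal=True))
print(map_release_type("proceedings"))
```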

Future: person

Fields:

- element type (one of the above)
- key (eg, "journals/cacm/Szalay08")
- title
    => may contain <i>, <sub>, <sup>, <tt>
- author (multiple; each a single string)
    => may have HTML entities
    => may have a number at the end, to aid with identifier creation
    => orcid
- editor (same as author)
    => orcid
- journal (abbrev?)
- volume, pages, number (number -> issue)
- publisher
- year
    => for conferences, year of conference not of publication
- month
- crossref (from inproceedings to specific proceedings volume)
- booktitle
    => for inproceedings, this is the name of conference or workshop. acronym.
- isbn
- ee (electronic edition; often DOI?)
    => in some cases a "local" URL
    => publisher URL; often DOI
    => type attr
- url
    => dblp internal link to table-of-contents
- publnr
    => alternative identifier
- note
    => for persons (www), may be name in non-Latin character set

- series: ?
    => has href attr
- cite: ?
- school: ?
- chapter: ?
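
To make the field notes above concrete, here is a rough sketch of parsing one record with the standard library. The sample record is made up for illustration (it is not real dblp data), the `orcid` attribute on `author` is an assumption based on the notes, and the function names are hypothetical. The real dump also relies on DTD-defined character entities, which this sketch does not handle.

```python
import re
import xml.etree.ElementTree as ET

# Synthetic record, shaped like the fields listed above.
SAMPLE = """<article key="journals/example/Doe20">
  <author orcid="0000-0000-0000-0000">Jane Doe 0001</author>
  <title>Notes on H<sub>2</sub>O <i>in silico</i>.</title>
  <journal>Example J.</journal>
  <volume>1</volume>
  <number>2</number>
  <year>2020</year>
  <ee>https://doi.org/10.1234/example</ee>
</article>"""


def doi_from_ee(url):
    """Return the DOI if the 'ee' URL is a doi.org link, else None."""
    prefix = "https://doi.org/"
    return url[len(prefix):] if url.startswith(prefix) else None


def parse_record(xml_text):
    elem = ET.fromstring(xml_text)
    title_elem = elem.find("title")
    # itertext() flattens inline markup like <i>, <sub>, <sup>, <tt>.
    title = "".join(title_elem.itertext()) if title_elem is not None else None
    authors = []
    for a in elem.findall("author"):
        raw = (a.text or "").strip()
        # Drop the trailing disambiguation number, if any.
        authors.append({"raw_name": re.sub(r"\s+\d+$", "", raw),
                        "orcid": a.get("orcid")})
    ee_urls = [e.text for e in elem.findall("ee") if e.text]
    dois = [d for d in map(doi_from_ee, ee_urls) if d]
    return {
        "type": elem.tag,
        "key": elem.get("key"),
        "title": title,
        "authors": authors,
        "volume": elem.findtext("volume"),
        "issue": elem.findtext("number"),  # number -> issue
        "year": elem.findtext("year"),
        "doi": dois[0] if dois else None,
    }


record = parse_record(SAMPLE)
print(record["title"], record["authors"][0]["raw_name"], record["doi"])
```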

Notable CSL "extra" fields:
    => 'event': name of conference/workshop
    => 'event-place': location of conference/workshop
    => 'collection-title' (eg, book series)
    => 'container-title' (eg, book for a chapter)
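
One possible way to fill these CSL fields from a parsed record is sketched below. Which dblp field feeds which CSL field is an assumption here (`booktitle` as `event` for inproceedings, `booktitle` as `container-title` for incollection, `series` as `collection-title`); `event-place` is left out because the notes do not name a source field for it. The function name `csl_extra` is hypothetical.

```python
def csl_extra(element: str, fields: dict) -> dict:
    """Build a dict of CSL-style 'extra' fields from dblp record fields.

    Assumption: `fields` maps dblp field names to already-extracted strings.
    """
    extra = {}
    booktitle = fields.get("booktitle")
    if booktitle and element == "inproceedings":
        # For inproceedings, booktitle names the conference or workshop.
        extra["event"] = booktitle
    elif booktitle and element == "incollection":
        # For a chapter, booktitle is taken to be the containing book.
        extra["container-title"] = booktitle
    if fields.get("series"):
        extra["collection-title"] = fields["series"]
    return extra


print(csl_extra("inproceedings", {"booktitle": "EXAMPLE", "series": "Example Series"}))
```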


## Resources

"DBLP — Some Lessons Learned"
https://dblp.org/xml/docu/dblpxml.pdf

https://blog.dblp.org/2020/08/18/new-dblp-url-scheme-and-api-changes/