author bnewbold <bnewbold@archive.org> 2021-11-11 01:12:18 +0000
committer bnewbold <bnewbold@archive.org> 2021-11-11 01:12:18 +0000
commit 6ad9d24e4d7d901d6fc394e6e91575f6acba7ff4
tree 1b80344125152b46ae727dc8bbff73cc12abfd3e /notes
parent 7e3f91f1a49ea85707cae31125021ba761f5373d
parent 6eaf4f57c1f92b6f4f46adc38e5b39fd30b65d81
Merge branch 'bnewbold-import-refactors' into 'master'
import refactors and deprecations

Some of these are from old stale branches (the datacite subject metadata patch), but most are from yesterday and today. Sort of a hodge-podge, but the general theme is getting around to deferred cleanups and refactors specific to importer code before making some behavioral changes.

The Datacite-specific stuff could use review here.

Remove unused/deprecated/dead code:

- cdl_dash_dat and wayback_static importers, which were for specific early example entities and have been superseded by other importers
- "extid map" sqlite3 feature from several importers, was only used for initial bulk imports (and maybe should not have been used)

Refactors:

- moved a number of large datastructures out of importer code and into a dedicated static file (`biblio_lookup_tables.py`). Didn't move all, just the ones that were either generic or very large (making it hard to read code)
- shuffled around relative imports and some function names ("clean_str" vs. "clean")

Some actual behavioral changes:

- remove some Datacite-specific license slugs
- stop trying to fix double-slashes in DOIs, that was causing more harm than help (some DOIs do actually have double-slashes!)
- remove some excess metadata from datacite 'extra' fields
Diffstat (limited to 'notes')
-rw-r--r-- notes/cleanups/double_slash_dois.md 46
1 files changed, 46 insertions, 0 deletions
diff --git a/notes/cleanups/double_slash_dois.md b/notes/cleanups/double_slash_dois.md
new file mode 100644
index 00000000..d4e9ded6
--- /dev/null
+++ b/notes/cleanups/double_slash_dois.md
@@ -0,0 +1,46 @@
+
+Relevant GitHub issue: https://github.com/internetarchive/fatcat/issues/48
+
+
+## Investigate
+
+At least some of these DOIs actually seem valid, like
+`10.1026//1616-1041.3.2.86`. So we shouldn't be re-writing them!
+
+    zcat release_extid.tsv.gz \
+        | cut -f1,3 \
+        | rg '\t10\.\d+//' \
+        | wc -l
+    # 59,904
+
+    zcat release_extid.tsv.gz \
+        | cut -f1,3 \
+        | rg '\t10\.\d+//' \
+        | pv -l \
+        > doubleslash_dois.tsv
+
+Which prefixes have the most double slashes?
+
+    cat doubleslash_dois.tsv | cut -f2 | cut -d/ -f1 | sort | uniq -c | sort -nr | head
+      51220 10.1037
+       2187 10.1026
+       1316 10.1024
+        826 10.1027
+        823 10.14505
+        443 10.17010
+        186 10.46925
+        163 10.37473
+        122 10.18376
+        118 10.29392
+    [...]
+
+All of the 10.1037 DOIs seem to be registered with Crossref, and at least some
+have redirects to the single-slash versions. Not every doi.org lookup
+includes a redirect, though.
+
+I think the "correct thing to do" here is to add special-case handling for the
+PubMed and Crossref importers, and to allow double slashes in any other case.
+
+Not clear that there are any specific cleanups to be done for now. A broader
+"verify that DOIs are actually valid" push and cleanup would make sense; if
+that happens, it should include a check for mangled double-slash DOIs.
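
For anyone who wants to repeat the investigation from Python rather than the shell, the `rg` filter and the per-prefix tally in the note could be sketched roughly as below. This is only an illustration and not code from the fatcat repository: the names `DOUBLE_SLASH_DOI`, `has_double_slash`, and `count_prefixes` are hypothetical, and the sketch assumes the DOIs have already been extracted as plain strings (the leading `\t` in the shell pattern becomes a start-of-string anchor).

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical helper names; not part of the fatcat codebase.
# Mirrors the `rg '\t10\.\d+//'` filter from the note, applied to a bare DOI
# string: a "10.<digits>" registrant prefix followed directly by two slashes.
DOUBLE_SLASH_DOI = re.compile(r"^10\.\d+//")


def has_double_slash(doi: str) -> bool:
    """Return True if the DOI has a double slash right after its prefix."""
    return DOUBLE_SLASH_DOI.match(doi) is not None


def count_prefixes(dois: Iterable[str]) -> Counter:
    """Tally double-slash DOIs by prefix (the part before the first slash)."""
    return Counter(
        doi.split("/", 1)[0] for doi in dois if has_double_slash(doi)
    )
```

Calling `count_prefixes(...).most_common(10)` on the DOI column of `doubleslash_dois.tsv` should give a listing comparable to the `sort | uniq -c | sort -nr | head` output in the note. The functions only detect and count; they deliberately do not rewrite any DOI, in line with the conclusion that some double-slash DOIs are valid.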