author Bryan Newbold <bnewbold@archive.org> 2020-09-02 19:32:46 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-09-02 19:32:46 -0700
commit 3ad7a3c48de77c00ad0e777d24021f8db340912c
tree 0a37614523361dc444d1756c5debe647e13ab8f8
parent 3d1852453a3cadd2db8e5a1014c3451a9a0b5fb8
notes on hathitrust importer
-rw-r--r-- notes/hathitrust.md 58
1 files changed, 58 insertions, 0 deletions
diff --git a/notes/hathitrust.md b/notes/hathitrust.md
new file mode 100644
index 0000000..a79418a
--- /dev/null
+++ b/notes/hathitrust.md
@@ -0,0 +1,58 @@
+
+Download: <https://www.hathitrust.org/hathifiles>
+Schema: <https://www.hathitrust.org/hathifiles_description>
+
+Munging/filtering huge file to just serials:
+
+    zcat hathi_full_20200801.txt.gz | rg '\tSE\t' | rg '\t\d\d\d\d-\d\d\d.\t' | pv -l > hathi_full_20200801_serials.txt
+    => 2.65M 0:00:50 [53.1k/s]
+
+    cut -f10 hathi_full_20200801_serials.txt | sort -u | wc -l
+    => 102,008
+
+Wow, that is a lot of coverage! If true.
+
+Columns we would be interested in:
+
+- 2 access (allow=bright, deny=dark)
+- 5 description
+- 10 issn ("multiple values separated by comma")
+- 12 title (if translated, separated by equals or slash)
+- 13 imprint (publisher and year; often "publisher, year")
+- 17 rights_date_used (year; 9999=unknown)
+- 19 lang (MARC format)
+
+Inspect "extent" (volumes/years), ISSN, title:
+
+    shuf -n10 hathi_full_20200801_serials.txt | cut -f2,5,10,12,13,17,19
+
+If we did, e.g., ONIX CSV output, we would want:
+
+- ISSN
+- Title
+- Publisher
+- Url
+- Vol
+- No
+- Published
+- Deposited
+
+KBART directory fields:
+
+- issnl
+- title
+- publisher
+- year
+- volume
+- url
+
+Note: could extract some bounds on publication (start date, end date, or both)
+from the publisher (imprint) field
+
+If we trust this metadata, it is going to add some 90k container entities to
+fatcat, with very partial metadata. Likely we should at least pull in ISSN
+portal (scraped) metadata at the same time.
+
+TODO:
+- year ranges (eg, 1976-1978 instead of just the rights_date_used=1978)
+- not going to create new 'journal' DB rows if only hathitrust metadata is available; should expand ISSN metadata for this purpose later
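
Reviewer sketch (not part of the commit): the filtering step and the year-range TODO in the notes could look roughly like the following in Python. The two regular expressions mirror the `rg` patterns from the notes, and the column numbers (10 issn, 13 imprint, 17 rights_date_used) are the ones listed there. The function names, the returned dict keys, and the year-matching heuristic are assumptions made for illustration; none of this is the actual chocula importer.

```python
import re

# Same patterns as the two `rg` invocations in the notes.
SERIAL_RE = re.compile(r"\tSE\t")
ISSN_FIELD_RE = re.compile(r"\t\d{4}-\d{3}.\t")

# Assumed heuristic: any standalone 4-digit year between 1500 and 2099.
YEAR_RE = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")


def is_serial_with_issn(line: str) -> bool:
    """Keep rows that have an 'SE' field and a field shaped like one ISSN."""
    return bool(SERIAL_RE.search(line)) and bool(ISSN_FIELD_RE.search(line))


def parse_row(line: str) -> dict:
    """Pull out the columns named in the notes (1-based numbering)."""
    cols = line.rstrip("\n").split("\t")

    def col(n: int) -> str:
        # Return an empty string for rows shorter than expected.
        return cols[n - 1] if len(cols) >= n else ""

    # Column 10: "multiple values separated by comma".
    issns = [v.strip() for v in col(10).split(",") if v.strip()]

    # Column 13: imprint, often "publisher, year"; collect every year-like
    # token and use min/max as rough publication bounds.
    years = [int(y) for y in YEAR_RE.findall(col(13))]

    # Column 17: rights_date_used, where 9999 means unknown.
    raw_year = col(17).strip()
    rights_year = int(raw_year) if raw_year.isdigit() and raw_year != "9999" else None

    return {
        "issns": issns,
        "imprint": col(13),
        "year_start": min(years) if years else None,
        "year_end": max(years) if years else None,
        "rights_year": rights_year,
    }


def iter_serials(lines):
    """Yield parsed rows for every line that passes the filter."""
    for line in lines:
        if is_serial_with_issn(line):
            yield parse_row(line)
```

With the decompressed dump this would be used along the lines of `for row in iter_serials(gzip.open(path, "rt", encoding="utf-8")): ...`. One thing the sketch makes visible: as written, the ISSN pattern only matches a field holding exactly one ISSN-shaped value, so rows whose issn column is a comma-separated list pass the filter only if some other field happens to match.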