author | Bryan Newbold <bnewbold@archive.org> | 2020-09-02 19:32:46 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-09-02 19:32:46 -0700 |
commit | 3ad7a3c48de77c00ad0e777d24021f8db340912c (patch) | |
tree | 0a37614523361dc444d1756c5debe647e13ab8f8 /notes | |
parent | 3d1852453a3cadd2db8e5a1014c3451a9a0b5fb8 (diff) | |
notes on hathitrust importer
Diffstat (limited to 'notes')
-rw-r--r-- | notes/hathitrust.md | 58 |
1 files changed, 58 insertions, 0 deletions
diff --git a/notes/hathitrust.md b/notes/hathitrust.md
new file mode 100644
index 0000000..a79418a
--- /dev/null
+++ b/notes/hathitrust.md
@@ -0,0 +1,58 @@
+
+Download: <https://www.hathitrust.org/hathifiles>
+Schema: <https://www.hathitrust.org/hathifiles_description>
+
+Munging/filtering huge file to just serials:
+
+    zcat hathi_full_20200801.txt.gz | rg '\tSE\t' | rg '\t\d\d\d\d-\d\d\d.\t' | pv -l > hathi_full_20200801_serials.txt
+    => 2.65M 0:00:50 [53.1k/s]
+
+    cut -f10 hathi_full_20200801_serials.txt | sort -u | wc -l
+    => 102,008
+
+Wow, that is a lot of coverage! If true.
+
+Columns we would be interested in:
+
+- 2 access (allow=bright, deny=dark)
+- 5 description
+- 10 issn ("multiple values separated by comma")
+- 12 title (if translated, separated by equals or slash)
+- 13 imprint (publisher and year; often "publisher, year")
+- 17 rights_date_used (year; 9999=unknown)
+- 19 lang (MARC format)
+
+Inspect "extent" (volumes/years), ISSN, title:
+
+    shuf -n10 hathi_full_20200801_serials.txt | cut -f2,5,10,12,13,17,19
+
+If we did, eg, onix CSV output, would want:
+
+- ISSN
+- Title
+- Publisher
+- Url
+- Vol
+- No
+- Published
+- Deposited
+
+KBART directory fields:
+
+- issnl
+- title
+- publisher
+- year
+- volume
+- url
+
+Note: could extract some bounds on publication (start date, end date, or both)
+from the publisher field
+
+If we trust this metadata, it is going to add some 90k container entities to
+fatcat, with very partial metadata. Likely we should at least pull in ISSN
+portal (scraped) metadata at the same time.
+
+TODO:
+- year ranges (eg, 1976-1978 instead of just the rights_date_used=1978)
+- not going to create new 'journal' DB rows if only hathitrust metadata. should expand ISSN metadata for this purpose later
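
The filtering and column selection described in the notes could be sketched in Python roughly as follows. This is only an illustration, not code from the chocula repository. The function names (`iter_serial_rows`, `split_issns`, `count_unique_issns`, `open_hathifile`) and the `COLUMNS` mapping are hypothetical. The two regular expressions mirror the `rg` patterns above, and the 1-based column numbers are the ones listed in the notes.

```python
import gzip
import re
from typing import Dict, Iterable, Iterator, List

# Same patterns as the two rg filters in the notes: a tab-delimited "SE"
# field, and some field that looks like exactly one ISSN.
SERIAL_RE = re.compile(r"\tSE\t")
ISSN_FIELD_RE = re.compile(r"\t\d{4}-\d{3}.\t")

# 1-based column numbers as listed in the notes (hypothetical mapping name).
COLUMNS = {
    "access": 2,
    "description": 5,
    "issn": 10,
    "title": 12,
    "imprint": 13,
    "rights_date_used": 17,
    "lang": 19,
}


def iter_serial_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield the columns of interest for lines passing both filters."""
    for line in lines:
        if not (SERIAL_RE.search(line) and ISSN_FIELD_RE.search(line)):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < max(COLUMNS.values()):
            continue  # skip short or malformed rows
        yield {name: fields[idx - 1] for name, idx in COLUMNS.items()}


def split_issns(raw: str) -> List[str]:
    """Split a comma-separated issn field into individual values."""
    return [value.strip() for value in raw.split(",") if value.strip()]


def count_unique_issns(rows: Iterable[Dict[str, str]]) -> int:
    """Count distinct ISSNs across rows, splitting multi-valued fields."""
    seen = set()
    for row in rows:
        seen.update(split_issns(row["issn"]))
    return len(seen)


def open_hathifile(path: str):
    """Open a gzipped hathifile as text (stands in for zcat)."""
    return gzip.open(path, "rt", encoding="utf-8")
```

Assuming the dump is available locally, `count_unique_issns(iter_serial_rows(open_hathifile("hathi_full_20200801.txt.gz")))` would play the role of the `cut -f10 | sort -u | wc -l` step. Two differences are worth keeping in mind. `sort -u` counts distinct field values, while this sketch counts distinct ISSNs after splitting on commas. Also, the ISSN pattern only matches when some field is exactly one ISSN, so rows whose issn column holds several comma-separated values are dropped unless another field happens to match.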