| author | Bryan Newbold <bnewbold@robocracy.org> | 2020-08-10 17:27:26 -0700 | 
|---|---|---|
| committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-10 13:04:44 -0800 | 
| commit | c133f3077aa975aa4706a8e5ca894fc1b71fbc67 (patch) | |
| tree | e0d9d0e31482c2ec92a025bdb871c34447a686b7 /python | |
| parent | 23fd36a3e8505c1ed6d13367a3fb62a8bf2242d7 (diff) | |
datacite import: store less subject metadata
Many of these 'subject' objects have the equivalent of several lines of
text, with complex URLs that don't compress well. I think it is fine that
we have included these thus far instead of parsing more deeply, but going
forward I don't think this nested 'extra' metadata is worth the database
space.
Diffstat (limited to 'python')
| -rw-r--r-- | python/fatcat_tools/importers/datacite.py | 8 | 
1 files changed, 7 insertions, 1 deletions
diff --git a/python/fatcat_tools/importers/datacite.py b/python/fatcat_tools/importers/datacite.py
index d4d7a9f5..fe02cac4 100644
--- a/python/fatcat_tools/importers/datacite.py
+++ b/python/fatcat_tools/importers/datacite.py
@@ -597,7 +597,13 @@ class DataciteImporter(EntityImporter):
         if license_extra:
             extra_datacite["license"] = license_extra
         if attributes.get("subjects"):
-            extra_datacite["subjects"] = attributes["subjects"]
+            # these subjects with schemeUri are too much metadata, which
+            # doesn't compress. filter them out.
+            extra_subjects = [
+                subj for subj in attributes["subjects"] if not subj.get("schemeUri")
+            ]
+            if extra_subjects:
+                extra_datacite["subjects"] = extra_subjects
 
         # Include version information.
         metadata_version = attributes.get("metadataVersion") or ""
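To illustrate the effect of the new filter, here is a minimal standalone sketch. The helper name `build_subject_extra` and the sample records are hypothetical and are not part of `DataciteImporter`; only the list comprehension and the `if extra_subjects:` guard mirror the patch.

```python
def build_subject_extra(attributes):
    """Return the 'extra' dict fragment for subjects, mirroring the patch logic.

    Hypothetical helper for illustration only; in the importer this logic
    is inline and writes into extra_datacite.
    """
    extra_datacite = {}
    if attributes.get("subjects"):
        # Drop any subject that carries a non-empty schemeUri.
        extra_subjects = [
            subj for subj in attributes["subjects"] if not subj.get("schemeUri")
        ]
        # Only store the key when something survives the filter.
        if extra_subjects:
            extra_datacite["subjects"] = extra_subjects
    return extra_datacite


# Made-up sample input: one plain subject, one with a schemeUri.
mixed = {
    "subjects": [
        {"subject": "Physics"},
        {
            "subject": "Example classified subject",
            "subjectScheme": "Example Scheme",
            "schemeUri": "https://example.org/scheme",
        },
    ]
}
only_schemed = {
    "subjects": [{"subject": "Geology", "schemeUri": "https://example.org/scheme"}]
}

mixed_result = build_subject_extra(mixed)
schemed_result = build_subject_extra(only_schemed)
empty_uri_result = build_subject_extra(
    {"subjects": [{"subject": "Chemistry", "schemeUri": ""}]}
)
```

Note that the check is on truthiness: a subject whose `schemeUri` is missing, `None`, or an empty string is kept, and when every subject is filtered out the `subjects` key is omitted entirely rather than stored as an empty list.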
