datacite import: store less subject metadata

Many of these 'subject' objects hold the equivalent of several lines of text, with complex URLs that don't compress well. I think it was fine to include them thus far instead of parsing more deeply, but going forward I don't think this nested 'extra' metadata is worth the database space. Subjects that carry a schemeUri are now filtered out before being stored.
author: Bryan Newbold <bnewbold@robocracy.org> 2020-08-10 17:27:26 -0700
committer: Bryan Newbold <bnewbold@robocracy.org> 2021-11-10 13:04:44 -0800
commit: c133f3077aa975aa4706a8e5ca894fc1b71fbc67 (patch)
tree: e0d9d0e31482c2ec92a025bdb871c34447a686b7 /python
parent: 23fd36a3e8505c1ed6d13367a3fb62a8bf2242d7 (diff)
download: fatcat-c133f3077aa975aa4706a8e5ca894fc1b71fbc67.tar.gz
fatcat-c133f3077aa975aa4706a8e5ca894fc1b71fbc67.zip
1 files changed, 7 insertions, 1 deletions
diff --git a/python/fatcat_tools/importers/datacite.py b/python/fatcat_tools/importers/datacite.py
index d4d7a9f5..fe02cac4 100644
--- a/python/fatcat_tools/importers/datacite.py
+++ b/python/fatcat_tools/importers/datacite.py
@@ -597,7 +597,13 @@ class DataciteImporter(EntityImporter):
         if license_extra:
             extra_datacite["license"] = license_extra
         if attributes.get("subjects"):
-            extra_datacite["subjects"] = attributes["subjects"]
+            # these subjects with schemeUri are too much metadata, which
+            # doesn't compress. filter them out.
+            extra_subjects = [
+                subj for subj in attributes["subjects"] if not subj.get("schemeUri")
+            ]
+            if extra_subjects:
+                extra_datacite["subjects"] = extra_subjects
 
         # Include version information.
         metadata_version = attributes.get("metadataVersion") or ""
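
The filtering step in this patch can be tried in isolation. The following is a minimal sketch, not the importer itself: the helper name `filter_subjects` and the sample records are hypothetical and exist only for illustration, while the list comprehension and the `if extra_subjects:` guard mirror the lines added in the diff.

```python
from typing import Any, Dict, List


def filter_subjects(subjects: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only subjects that have no truthy schemeUri value.

    Hypothetical helper: the patch inlines this comprehension directly.
    """
    return [subj for subj in subjects if not subj.get("schemeUri")]


# Hypothetical sample data, shaped like DataCite "subjects" entries.
sample_subjects = [
    {"subject": "Physics"},
    {
        "subject": "FOS: Physical sciences",
        "subjectScheme": "Fields of Science and Technology (FOS)",
        "schemeUri": "http://www.oecd.org/science/inno/38235147.pdf",
    },
    # An empty schemeUri is falsy, so this entry is kept.
    {"subject": "Optics", "schemeUri": ""},
]

extra_datacite: Dict[str, Any] = {}
extra_subjects = filter_subjects(sample_subjects)
# As in the patch, the key is only set when something survives the filter.
if extra_subjects:
    extra_datacite["subjects"] = extra_subjects
```

Note that the check is on truthiness, so a subject whose `schemeUri` is missing, `None`, or an empty string is retained, and a record whose subjects all carry a scheme URI ends up with no `subjects` key at all.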