author    | bnewbold <bnewbold@archive.org> | 2021-11-11 01:11:49 +0000
committer | bnewbold <bnewbold@archive.org> | 2021-11-11 01:11:49 +0000
commit    | 7e3f91f1a49ea85707cae31125021ba761f5373d (patch)
tree      | 34c482d15821765ffd7a27f6f049c320a2bf4b2a /python/fatcat_tools/importers
parent    | b6d228b7171252c8f9f70194c09aba0ed0c55567 (diff)
parent    | cd09c6d6bd4deef0627de4f8a8a301725db01e14 (diff)
download  | fatcat-7e3f91f1a49ea85707cae31125021ba761f5373d.tar.gz
download  | fatcat-7e3f91f1a49ea85707cae31125021ba761f5373d.zip
Merge branch 'bnewbold-cleanups-nov2021' into 'master'
Fatcat metadata cleanups/fixups, November 2021
Three cleanups implemented in this branch:
- update non-lowercase DOIs on releases (a couple hundred thousand entities)
- fix incorrectly imported file/release pairs, on the file entity side (~250k entities)
- expand truncated wayback URL timestamps in file entities (up to 10 million entities; the DOI and timestamp checks are sketched below)
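
As a rough illustration only, the first and third cleanups boil down to per-entity checks along the following lines. These helpers are hypothetical, not the actual cleanup code:

```python
# Hypothetical sketch of the per-entity checks behind the DOI and wayback
# timestamp cleanups; the real scripts are described in notes/cleanups/.

def doi_needs_lowercase(doi: str) -> bool:
    # releases whose DOI contains any uppercase characters get updated in-place
    return doi != doi.lower()

def wayback_timestamp_is_truncated(url: str) -> bool:
    # complete wayback URLs have a full 14-digit (YYYYMMDDHHMMSS) timestamp, eg:
    #   https://web.archive.org/web/20180203123456/https://example.com/paper.pdf
    # truncated timestamps (eg, just "2018") need to be expanded to the full form
    prefix = "https://web.archive.org/web/"
    if not url.startswith(prefix):
        return False
    timestamp = url[len(prefix):].split("/", 1)[0]
    return timestamp.isdigit() and len(timestamp) < 14
```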
Instead of proposals, there are documents for each cleanup in `notes/cleanups/`.
I have done spot testing of tens of thousands of entities for each cleanup in QA, and am confident about running them in production.
The plan is to run the updates in the order above. The DOI and bugfix updates will go fairly fast; the wayback timestamp updates will go slower and will result in large re-indexing load in both fatcat and scholar, because both release and work entities get triggered for update when file entities are updated.
Diffstat (limited to 'python/fatcat_tools/importers')
-rw-r--r-- | python/fatcat_tools/importers/common.py | 9
1 file changed, 9 insertions, 0 deletions
diff --git a/python/fatcat_tools/importers/common.py b/python/fatcat_tools/importers/common.py
index fd472d11..2ec6efda 100644
--- a/python/fatcat_tools/importers/common.py
+++ b/python/fatcat_tools/importers/common.py
@@ -436,6 +436,15 @@ class EntityImporter:
             if u.rel == "social":
                 u.rel = "academicsocial"
 
+        # remove exact URL duplicates, while preserving order, and removing
+        # "later" copies, not "first" copies
+        # this is sensitive to both url.url and url.rel combined!
+        dedupe_urls = []
+        for url_pair in existing.urls:
+            if url_pair not in dedupe_urls:
+                dedupe_urls.append(url_pair)
+        existing.urls = dedupe_urls
+
         # remove URLs which are near-duplicates
         redundant_urls = []
         all_urls = [u.url for u in existing.urls]
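
For reference, a self-contained sketch of the deduplication semantics added above, using a namedtuple as a stand-in for the fatcat URL entity objects (which, per the code comment, compare on both `url` and `rel`):

```python
# Stand-alone illustration of the dedupe pass above; FileUrl here is a simple
# stand-in for the fatcat entity URL objects, which compare on url + rel.
from collections import namedtuple

FileUrl = namedtuple("FileUrl", ["url", "rel"])

urls = [
    FileUrl("https://web.archive.org/web/2017/https://example.com/paper.pdf", "webarchive"),
    FileUrl("https://example.com/paper.pdf", "web"),
    # exact duplicate of the first entry (same url AND rel): this copy is dropped
    FileUrl("https://web.archive.org/web/2017/https://example.com/paper.pdf", "webarchive"),
    # same url but a different rel: kept, because the (url, rel) pair differs
    FileUrl("https://web.archive.org/web/2017/https://example.com/paper.pdf", "repository"),
]

dedupe_urls = []
for url_pair in urls:
    if url_pair not in dedupe_urls:
        dedupe_urls.append(url_pair)

# the "first" copies are preserved in original order; only the later exact
# duplicate is removed
assert dedupe_urls == [urls[0], urls[1], urls[3]]
```

Keeping the first occurrence rather than the last preserves the original ordering of the URL list.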