author | Bryan Newbold <bnewbold@robocracy.org> | 2020-01-22 13:41:11 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2020-01-22 13:41:11 -0800 |
commit | 2e3988fcf6441bef7ee4b030e499fd129e7cb189 (patch) | |
tree | a80218dec84b48b5ff7486a3ebe7502530f98fd0 /proposals/2020_metadata_cleanups.md | |
parent | da64fa0b36218d7f9726aa98dff0e834c1845193 (diff) | |
more TODO/proposal cleanup
Diffstat (limited to 'proposals/2020_metadata_cleanups.md')
-rw-r--r-- | proposals/2020_metadata_cleanups.md | 28 |
1 files changed, 26 insertions, 2 deletions
diff --git a/proposals/2020_metadata_cleanups.md b/proposals/2020_metadata_cleanups.md
index e53c47d3..cf6b08e5 100644
--- a/proposals/2020_metadata_cleanups.md
+++ b/proposals/2020_metadata_cleanups.md
@@ -45,7 +45,8 @@ is of the compressed body, not the actual inner file).
 The current file URL metadata has a few warts:

 - inconsistent or incorrect tagging of URL "rel" type. It is possible we should
-  just strip/skip this tag and always recompute from scratch
+  just strip/skip this tag and always recompute from scratch. Or target just
+  those domains with >= 1% of links, or top 100 domains
 - duplicate URLs (lack of normalization):
     - `http://example.com/file.pdf`
     - `http://example.com:80/file.pdf`
@@ -72,7 +73,8 @@ a reasonable constraint, but am open to other opinions.
 I think that all web URLs should be normalized for issues like `jsessionid` and
 `:80` port specification.

-In user interface we should limit to a single wayback link, and single link per domain.
+In user interface we should limit to a single wayback link, and single link per
+domain.

 NOTE: "host" means the fully qualified domain hostname; domain means the
 "registered" part of the domain.
@@ -82,6 +84,8 @@ NOTE: "host" means the fully qualified domain hostname; domain means the

 At some point, had many "NULL" publishers.

+"NA" in ISSNe, ISSNp. Eg: <https://fatcat.wiki/container/s3gm7274mfe6fcs7e3jterqlri>
+
 "Type" coverage should be improved.

 "Publisher type" (infered in various ways in chocula tool) could be included in
@@ -107,3 +111,23 @@ A partial list:
 - "Full title page with Editorial board (with Elsevier tree)"
 - "Advisory Board Editorial Board"

+
+## Very Long Titles
+
+These are likely stubs, but the title is also "just too long". Could stash full
+title in `extra`?
+
+- https://fatcat.wiki/release/4b7swn2zsvguvkzmt
+  => crossref updated
+
+## Abstracts
+
+Bad:
+
+- https://qa.fatcat.wiki/release/nwd5kkilybf5vdhm3iduvhvbvq
+- https://qa.fatcat.wiki/release/rkigixosmvgcvmlkb5aqeyznim
+
+Very long:
+
+- https://qa.fatcat.wiki/release/s2cafgwepvfqnjp4xicsx6amsa
+
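
The proposal text in the diff above asks that web URLs be normalized for issues like `jsessionid` and the `:80` port, so that `http://example.com/file.pdf` and `http://example.com:80/file.pdf` stop appearing as duplicates. A minimal sketch of one way that could work, using only the Python standard library, is shown here. The names `normalize_url`, `dedupe_urls`, `DEFAULT_PORTS`, and `JSESSIONID_RE` are hypothetical and not taken from the fatcat codebase; the actual cleanup may apply different or additional rules.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Ports that are implied by the scheme and can be dropped.
DEFAULT_PORTS = {"http": 80, "https": 443}

# Assumption: session IDs appear as a ";jsessionid=..." path parameter.
# Other session-ID conventions are not handled by this sketch.
JSESSIONID_RE = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)


def normalize_url(url: str) -> str:
    """Return a normalized form of a web URL (illustrative only).

    Lowercases scheme and host, drops a default port, and strips a
    ";jsessionid=..." path parameter. Userinfo is discarded and IPv6
    literals are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{port}"
    else:
        netloc = host
    path = JSESSIONID_RE.sub("", parts.path)
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


def dedupe_urls(urls):
    """Normalize each URL and keep the first occurrence of each result."""
    seen = set()
    result = []
    for url in urls:
        norm = normalize_url(url)
        if norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result
```

With this sketch, `dedupe_urls(["http://example.com/file.pdf", "http://example.com:80/file.pdf"])` collapses the two example URLs from the proposal into one entry. Limiting the user interface to a single link per registered domain would additionally need a public-suffix lookup, which is outside the standard library and is left out here.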