From 5f282a6267182214080ca36bcec4da1755589b46 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 19 Aug 2020 22:55:05 -0700
Subject: iterate on sitemap generation

---
 extra/sitemap/README.md | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/extra/sitemap/README.md b/extra/sitemap/README.md
index 6963bb1f..735ac925 100644
--- a/extra/sitemap/README.md
+++ b/extra/sitemap/README.md
@@ -1,6 +1,41 @@
+## Background
+
 Google has a limit of 50k lines / 10 MByte for text sitemap files, and 50K
-lines / 50 MByte for XML site map files.
+lines / 50 MByte for XML site map files. Google Scholar has indicated a smaller
+20k URL / 5 MB limit.
+
+For the time being, we will include only a subset of fatcat entities and pages
+in our sitemaps.
+
+- homepage, "about" pages
+- all container landing pages (~150k)
+- "best" release landing page for each work with fulltext (~25 million)
+
+In the short term, calculating "best" is tricky so let's just take the first
+release with fulltext per work.
+
+In tree form:
+
+- `/robots.txt`: static file (in web app)
+  - `/sitemap.xml`: about page, etc. static file (in web app)
+  - `/sitemap-containers-index.xml`: points to .txt URL lists; generated by scripts
+    - `/sitemap-containers--.txt`
+  - `/sitemap-releases-index.xml`: same as above
+    - `/sitemap-releases--.txt`
+
+Workflow:
+
+- run bash script over container dump, outputing compressed, sharded container sitemaps
+- run bash script over release work-grouped, outputing compressed, sharded release sitemaps
+- run python script to output top-level `sitemap.xml`
+- `scp` all of this into place
+
+To make this work, will configure an nginx rule to point all requests like
+`/sitemap-*` to the directory `/srv/fatcat/sitemap/`, and will collect output
+there.
+
+## Ideas on Huge (complete) Index
 
 With a baseline of 100 million entities, that requires an index file pointing
 to at least 2000x individual sitemaps. 3 hex characters is 12 bits, or 4096
-- 
cgit v1.2.3
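
The shard arithmetic in the last hunk (100 million entities, at least 2000 sitemaps, 3 hex characters giving 4096 buckets) can be checked with a short sketch. This is only an illustration and not part of the patch: the function names are hypothetical, and assigning shards by a SHA-1 prefix of the entity identifier is an assumption, since the README does not say how shard keys would be derived.

```python
import hashlib

# Figures taken from the README text in the patch above.
URLS_PER_SITEMAP = 50_000      # Google's per-file URL limit
TOTAL_ENTITIES = 100_000_000   # "baseline of 100 million entities"
HEX_CHARS = 3                  # shard key length under consideration


def min_sitemaps(total: int, per_file: int) -> int:
    """Smallest number of sitemap files that can hold `total` URLs."""
    return -(-total // per_file)  # integer ceiling division


def shard_count(hex_chars: int) -> int:
    """Number of distinct shards addressable by a hex prefix of this length."""
    return 16 ** hex_chars


def shard_key(ident: str, hex_chars: int = HEX_CHARS) -> str:
    """Hypothetical shard assignment: hex prefix of the identifier's SHA-1."""
    return hashlib.sha1(ident.encode("utf-8")).hexdigest()[:hex_chars]


if __name__ == "__main__":
    print(min_sitemaps(TOTAL_ENTITIES, URLS_PER_SITEMAP))  # 2000
    print(shard_count(HEX_CHARS))                          # 4096
    # Average URLs per shard, assuming a uniform spread across shards.
    print(TOTAL_ENTITIES / shard_count(HEX_CHARS))         # 24414.0625
```

Under the uniform-spread assumption, 4096 shards average about 24,414 URLs each. That is below Google's 50k limit but above the 20k figure attributed to Google Scholar, so a 3-character key would likely not be enough if the smaller limit applied to the complete index.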