## HOWTO: Update

After a container dump, as the `fatcat` user on the prod server:

```
cd /srv/fatcat/sitemap
export DATE=`date --iso-8601`   # or whatever
/srv/fatcat/src/extra/sitemap/container_url_lists.sh $DATE /srv/fatcat/snapshots/container_export.json.gz
/srv/fatcat/src/extra/sitemap/release_url_lists.sh $DATE /srv/fatcat/snapshots/release_export_expanded.json.gz
# delete old sitemap url lists
/srv/fatcat/src/extra/sitemap/generate_sitemap_indices.py
```

## Background

Google has a limit of 50k lines / 10 MByte for text sitemap files, and 50k lines / 50 MByte for XML sitemap files. Google Scholar has indicated a smaller limit of 20k URLs / 5 MByte.

For the time being, we will include only a subset of fatcat entities and pages in our sitemaps:

- homepage, "about" pages
- all container landing pages (~150k)
- "best" release landing page for each work with fulltext (~25 million)

In the short term, calculating "best" is tricky, so we just take the first release with fulltext per work.

In tree form:

- `/robots.txt`: static file (in web app)
- `/sitemap.xml`: about page, etc. static file (in web app)
- `/sitemap-containers-index.xml`: points to .txt URL lists; generated by scripts
- `/sitemap-containers-<date>-<shard>.txt`
- `/sitemap-releases-index.xml`: same as above
- `/sitemap-releases-<date>-<shard>.txt`

Workflow:

- run bash script over the container dump, outputting compressed, sharded container sitemaps
- run bash script over the work-grouped release dump, outputting compressed, sharded release sitemaps
- run python script to output the top-level `sitemap.xml`
- `scp` all of this into place

To make this work, we will configure an nginx rule to point all requests like `/sitemap-*` at the directory `/srv/fatcat/sitemap/`, and will collect output there.

## Ideas on Huge (complete) Index

With a baseline of 100 million entities, a complete index would need to point to at least 2000x individual sitemaps. 3 hex characters is 12 bits, or 4096 options; that seems like an ok granularity to start with.
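The 3-hex-character sharding idea can be sketched in a few lines of Python. This is a minimal illustration, not the production script; the identifiers below are made-up hex strings, not real fatcat idents:

```python
# Sketch: bucket identifiers into up to 16**3 = 4096 shards, keyed by the
# first three hex characters of each identifier.
from collections import defaultdict

def shard_by_prefix(idents, prefix_len=3):
    """Group identifiers by their first `prefix_len` characters."""
    shards = defaultdict(list)
    for ident in idents:
        shards[ident[:prefix_len].lower()].append(ident)
    return shards

# Hypothetical identifiers, for illustration only:
idents = ["a1b2c3d4", "a1bff000", "0007e9aa"]
shards = shard_by_prefix(idents)
print(sorted(shards.keys()))  # → ['000', 'a1b']
```

Each shard would then become one sitemap file, with the index file enumerating all non-empty shards.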
Should look into what archive.org does to generate their sitemap.xml; it seems simple, and comes in batches of exactly 50k.

## Text Sitemaps

It should be possible to create simple text-style sitemaps, one URL per line, and link to these from a sitemap index. This is appealing because the sitemaps can be generated very quickly from identifier SQL dump files, run through UNIX commands (eg, to split and turn into URLs). Some script to create an XML sitemap index pointing at all the sitemaps would still be needed, though.

## Resources

Google sitemap verifier: https://support.google.com/webmasters/answer/7451001
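The XML sitemap index script mentioned under "Text Sitemaps" could look roughly like this sketch. The base URL and shard filenames here are assumptions for illustration; the `<sitemapindex>` structure follows the sitemaps.org protocol:

```python
# Sketch: emit a sitemap index XML document pointing at sharded URL-list files.
from os.path import basename

def generate_sitemap_index(shard_paths, base_url="https://fatcat.wiki"):
    """Return sitemap index XML with one <sitemap> entry per shard file."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>']
    lines.append('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')
    for path in sorted(shard_paths):
        lines.append("  <sitemap>")
        lines.append("    <loc>{}/{}</loc>".format(base_url, basename(path)))
        lines.append("  </sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines)

# Usage, with a hypothetical shard filename:
xml = generate_sitemap_index(["sitemap-containers-2020-01-01-000.txt"])
print(xml)
```

The nginx rule described above would then serve both the index and the individual `.txt` shards from the same directory.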