extra/sitemap/README.md




## HOWTO: Update

After a container dump, run the following as the `fatcat` user on the production server:

    cd /srv/fatcat/sitemap
    export DATE="$(date --iso-8601)" # or whatever date label you prefer
    /srv/fatcat/src/extra/sitemap/container_url_lists.sh "$DATE" /srv/fatcat/snapshots/container_export.json.gz
    /srv/fatcat/src/extra/sitemap/release_url_lists.sh "$DATE" /srv/fatcat/snapshots/release_export_expanded.json.gz
    # delete old sitemap url lists (manual step)
    /srv/fatcat/src/extra/sitemap/generate_sitemap_indices.py

## Background

Google has a limit of 50k lines / 10 MByte for text sitemap files, and 50k
lines / 50 MByte for XML sitemap files. Google Scholar has indicated a smaller
20k URL / 5 MByte limit.
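
As a minimal sketch of how these limits could be enforced when splitting a URL list into shards, the following groups URLs so that each shard stays under both a line count and a byte size. The function name `chunk_urls` and the defaults (the Google Scholar figures quoted above, with 1 MB taken conservatively as 10^6 bytes) are illustrative assumptions, not part of the existing scripts.

```python
from typing import Iterable, Iterator, List

# Assumed defaults: the Google Scholar limits quoted above (20k URLs / 5 MB).
MAX_URLS = 20_000
MAX_BYTES = 5 * 1000 * 1000


def chunk_urls(
    urls: Iterable[str],
    max_urls: int = MAX_URLS,
    max_bytes: int = MAX_BYTES,
) -> Iterator[List[str]]:
    """Yield lists of URLs, each within both the line and byte limits."""
    chunk: List[str] = []
    size = 0
    for url in urls:
        line_size = len(url.encode("utf-8")) + 1  # +1 for the newline
        # Start a new chunk if adding this URL would exceed either limit.
        if chunk and (len(chunk) >= max_urls or size + line_size > max_bytes):
            yield chunk
            chunk, size = [], 0
        chunk.append(url)
        size += line_size
    if chunk:
        yield chunk
```

Each yielded list could then be written out as one text sitemap file, one URL per line.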

For the time being, we will include only a subset of fatcat entities and pages
in our sitemaps.

- homepage, "about" pages
- all container landing pages (~150k)
- "best" release landing page for each work with fulltext (~25 million)

In the short term, calculating "best" is tricky so let's just take the first
release with fulltext per work.
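
The "first release with fulltext per work" rule can be sketched as below. This is only an illustration under stated assumptions: the input is JSON lines with releases of the same work on consecutive lines, each release has `ident` and `work_id` fields, and a non-empty `files` list is treated as "has fulltext". The function name is hypothetical.

```python
import json
from itertools import groupby
from typing import Iterable, Iterator


def first_fulltext_release_idents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one release ident per work: the first release that has files.

    Assumes releases of the same work appear on consecutive lines.
    """
    releases = (json.loads(line) for line in lines if line.strip())
    for _work_id, group in groupby(releases, key=lambda r: r["work_id"]):
        for release in group:
            if release.get("files"):  # assumed marker for "has fulltext"
                yield release["ident"]
                break  # skip the remaining releases of this work
```

Works with no fulltext release produce no output, so they are left out of the sitemap.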

In tree form:

- `/robots.txt`: static file (in web app)
  - `/sitemap.xml`: homepage, "about" pages, etc.; static file (in web app)
  - `/sitemap-containers-index.xml`: points to .txt URL lists; generated by scripts
    - `/sitemap-containers-<date>-<shard>.txt`
  - `/sitemap-releases-index.xml`: same as above
    - `/sitemap-releases-<date>-<shard>.txt`

Workflow:

- run bash script over container dump, outputting compressed, sharded container sitemaps
- run bash script over work-grouped release dump, outputting compressed, sharded release sitemaps
- run python script to output the sitemap index files (`sitemap-containers-index.xml`, `sitemap-releases-index.xml`)
- `scp` all of this into place

To make this work, we will configure an nginx rule to point all requests like
`/sitemap-*` to the directory `/srv/fatcat/sitemap/`, and will collect output
there.

## Ideas on Huge (complete) Index

With a baseline of 100 million entities and 50k URLs per sitemap, the index
file must point to at least 2,000 individual sitemaps. 3 hex characters is 12
bits, or 4096 options; that seems like an OK granularity to start with.
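
The arithmetic above, worked through as a short sketch. The text does not say what the 3 hex characters would be taken from; the SHA-1 prefix of the identifier used in `shard_for` is a hypothetical choice for illustration only.

```python
import hashlib
import math

TOTAL_ENTITIES = 100_000_000  # baseline from the text
URLS_PER_SITEMAP = 50_000     # per-file URL limit
HEX_CHARS = 3                 # 3 hex characters = 12 bits

min_sitemaps = math.ceil(TOTAL_ENTITIES / URLS_PER_SITEMAP)  # 2000
num_shards = 16 ** HEX_CHARS                                 # 4096
avg_urls_per_shard = TOTAL_ENTITIES / num_shards             # about 24.4k


def shard_for(ident: str, hex_chars: int = HEX_CHARS) -> str:
    """Map an identifier to a shard label (hypothetical scheme: SHA-1 prefix)."""
    return hashlib.sha1(ident.encode("utf-8")).hexdigest()[:hex_chars]
```

At roughly 24.4k URLs per shard on average, 4096 shards fit comfortably under the 50k limit, but would exceed the smaller 20k figure indicated by Google Scholar. A hash prefix also does not spread entities perfectly evenly, so some shards would be larger than the average.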

We should look into how archive.org generates their sitemap.xml; it seems
simple, and comes in batches of exactly 50k.

## Text Sitemaps

It should be possible to create simple text-style sitemaps, one URL per line,
and link to these from a sitemap index. This is appealing because the sitemaps
can be generated very quickly from identifier SQL dump files, run through UNIX
commands (e.g., to split and turn into URLs). Some script to create an XML
sitemap index pointing at all the sitemaps would still be needed, though.
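
A rough sketch of the two remaining pieces: turning identifiers into URLs, and rendering an XML sitemap index that points at the text sitemaps. The base URL, the `/release/<ident>` path layout, and both function names are assumptions made for illustration; this is not the actual `generate_sitemap_indices.py`.

```python
from typing import Iterable
from xml.sax.saxutils import escape

BASE_URL = "https://fatcat.wiki"  # assumed site base URL


def release_url(ident: str, base_url: str = BASE_URL) -> str:
    """Turn a release identifier into a landing page URL (assumed path layout)."""
    return f"{base_url}/release/{ident}"


def sitemap_index_xml(sitemap_urls: Iterable[str]) -> str:
    """Render a sitemap index that points at the given sitemap files."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in sitemap_urls:
        # Escape &, <, > so the URL is valid inside an XML element.
        lines.append(f"  <sitemap><loc>{escape(url)}</loc></sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"
```

The index would list one `<loc>` per shard file, following the `/sitemap-releases-<date>-<shard>.txt` naming pattern.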


## Resources

Google sitemap verifier: https://support.google.com/webmasters/answer/7451001