Repo list completely downloaded, so we now have hg-repos-22, 23 and 24. Last one has only about 2500 repos; the interesting part is that many Mercurial repos were created in 2020, and 7 of them were created... today! They are very likely forks of existing repos. Actually, 3 of the repos created in 2020 were forks of hudson/magic-lantern, created for submitting pull requests, from Bitbucket's web interface.
Links in the big post.
I'll download another repo list; hopefully this time it will complete in a single attempt, so at the end of the week I should be able to cross-check the list. If I find anything missing (mistakes can happen), I'll add it to hg-repos-25. The others are now set in stone.

First 3 sets appear to be complete on my side:
hg-repos-00: 91.23% downloaded, 8.77% errors, 0.00% todo, 2.82 MiB average
hg-repos-01: 92.38% downloaded, 7.62% errors, 0.00% todo, 4.16 MiB average
hg-repos-02: 92.08% downloaded, 7.92% errors, 0.00% todo, 6.81 MiB average
For the others, my current estimate of the average size is biased, because in my first attempt I skipped very large repos. Will report as soon as I trust the numbers.

Extended status script (reporting average repo size, but slow):
from __future__ import print_function
import os, subprocess

# repos that failed to clone, one per line (the file may not exist yet)
try:
    error_repos = open("hg-clone-errors.txt").readlines()
except IOError:
    error_repos = []
error_repos = set(x.strip() for x in error_repos)

def repo_size(r):
    # size of the .hg directory, in bytes (this is slow)
    # argument list instead of a command string, so odd repo names are safe
    out = subprocess.check_output(["du", "-b", "-d", "0", r + "/.hg"])
    return int(out.split(b"\t")[0])

for i in range(30):
    fn = "hg-repos-%02d" % i
    try:
        repos = open(fn).readlines()
    except IOError:
        continue  # this list was not downloaded
    repos = [r.strip() for r in repos]
    repos = list(set(repos))  # unique list (hg-repos-20 contains a few duplicates)
    downloaded = 0
    errors = 0
    total = len(repos)
    size = 0
    if total == 0:
        continue  # empty list, avoid division by zero
    for line in repos:
        r = line.split(" ")[1]  # second field of each line is the repo path
        if os.path.isfile(r + ".commits"):
            size += repo_size(r)
            downloaded += 1
        elif r in error_repos:
            errors += 1
    # note: the average is taken over all listed repos, not only the downloaded ones
    print("%s: %.2f%% downloaded, %.2f%% errors, %.2f%% todo, %.2f MiB average" % (
        fn, downloaded * 100.0 / total, errors * 100.0 / total,
        (total - errors - downloaded) * 100.0 / total,
        size * 1.0 / total / 1024 / 1024))
I don't know if hg clone already does this, but would be nice to have SHA256 for each split...
Been looking into this as well, but the exact contents of the ".hg" directory seem to be different among fresh clones. One obvious difference is the source URL, which is different if you clone with HTTPS or with SSH. Didn't investigate much, but... even the contents of .hg/store appear to be slightly different. Not sure why exactly.
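One way to narrow down where two fresh clones diverge would be to hash every file under each .hg directory and list the relative paths whose digests differ. A minimal sketch, assuming two local clones of the same repo; the function names tree_digests and diff_trees are made up for illustration and are not part of the download script:

```python
import hashlib
import os

def tree_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                # read in 1 MiB chunks so large store files fit in memory
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(full, root)] = h.hexdigest()
    return digests

def diff_trees(a, b):
    """Return sorted relative paths that are missing on one side or differ."""
    da, db = tree_digests(a), tree_digests(b)
    return sorted(p for p in set(da) | set(db) if da.get(p) != db.get(p))
```

Calling something like diff_trees("clone-https/.hg", "clone-ssh/.hg") (hypothetical directory names) should at least report the file holding the source URL; anything listed under store would be the part worth investigating.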
Checksums for the repo lists, to verify the downloads:
# sha256sum all-repos hg-repos*
e68bf18a3433ba921443616e7e69486c25d48f132071a8cb319a5d1212b1a330 all-repos
e1bbf56016d6f958a1539ab5de14c0c8536aa14376ac9431d92e6a4e982032ed hg-repos
66f9e6fa4fbff1af0c8f6190ae60e285534e334ec02f2b3bbebe28e9651e5580 hg-repos-00
0db369328d1b479d393d8fdd58c9b316c8ef856a71e1cfd749599dc1c9592144 hg-repos-01
761c2c2d39529294fa1aa9e26c6dbb4d2732b6fdb5704b4936f5e7d52dd46a87 hg-repos-02
ae7ef228fb60e750334176318bc672e2783e715108573bb17c3aafa12c786e2f hg-repos-03
66bbc945f3d885953efcda6a4af2b22a2530e0bdf29bd462212ed45c159123a4 hg-repos-04
cc7deeda69e5f4a1360f129616f1bfe8cb96b63209850a62b4fad391ad2e354e hg-repos-05
07b3bb8809dedb4c5c97034353393b5f28db7d08e33d1abb0f2a1363121c676f hg-repos-06
c20596b54a00d9951c2194172eca4f0f32c78a8c44d70919d5b28ab34bcac798 hg-repos-07
9fde67105c497a14ce997c0ca87cba2eba86a9b894372ae89393f4436c6da792 hg-repos-08
fff343b16fa879eef3a3c5f9719c3e5a5de8098af090cfc231124070e1f20190 hg-repos-09
bfff93447f3ec036fcb15e1bebfed97967fb3a84ca8214e44f77b3eb27a696e3 hg-repos-10
7477e4caa1163335c33f0aa7ab711fc9614298aca39158dd6fbf59fd9ae89a8d hg-repos-11
16d60a7fe0e644143b92df13abcae7af813cdf0ba8d42359da4ef98ca2ab9400 hg-repos-12
d623425b7d6f250b06c992d705f6561b7f8489337968dafe88d297e44d70ff22 hg-repos-13
b6b97a954028972fa06469dc7ce7b44be0bf90c086b5316295a4d80b39057b8c hg-repos-14
84ab787f19a4d6463c3f343da73d5824dc4acef7cbdd7670a73f6e4be0bb90e3 hg-repos-15
138c982985f812db853903c2dba4f6dfe8c16e441e9d4b801d08daf45a15870b hg-repos-16
520cdbec2fabb54e09cb66b5572b654352fbc1ff46625ad0a23b446dd61a547b hg-repos-17
ce4cf07804864d6832976c36521b502ec7e162057449d076c87de5191ad9e180 hg-repos-18
3a8353098236fbf7cfceacd2121af0e44d5b9b46734e81ef38f7e4f5bdeb2848 hg-repos-19
1b6255b7aec596d2b2931589192b72daacd52bc9c6198f25b9cd4cdf6e04b4d1 hg-repos-20
c230a100849e1d85a243dd512aba0debf996ca8e27af7ad351f6754447621e10 hg-repos-21
de76f1989577991f40f6addd2d67e62d5f66048fcb47d86b9dab6444f835bcbe hg-repos-22
661da278c57fef58306e6144f2dde7f2e0c09213fa3a47e9c6245864c68f08c8 hg-repos-23
6dbffbb608179da20d4f4c0dac8252ca9eba23657dcc6eac13288a4d3b6ff6e5 hg-repos-24
# md5sum all-repos hg-repos*
73f16c6e7d8ec44a22d35748505f4486 all-repos
a56d079deadb5689e199cc8bb9112c9a hg-repos
9b7074c0b8fa74078b977778b0aa2f51 hg-repos-00
648ae2d23a6b444423c1e7a429fffbc5 hg-repos-01
f1e72fb46f4ea57558f4e47330d663f5 hg-repos-02
ff7ea3f6b8ea07247e51b365487c4ec5 hg-repos-03
aebca1355249198916050d001dabac86 hg-repos-04
7c221e98539145e187d13543e0ccb447 hg-repos-05
ab5533cca9c6501d352bd2d981cfbe01 hg-repos-06
a76b118a375c095d5b2be91c7c846e96 hg-repos-07
19d36f29bfc12a0aa72b6f7ded194ba6 hg-repos-08
e6dffbab4d1b717eb1177ccb1133a55a hg-repos-09
884c5c49a4880b688a5d8f624dfe2188 hg-repos-10
10cf61d018db3551c4087b4e671b80b6 hg-repos-11
7cfeba30d3f9462aa5c327fd8ba12a5c hg-repos-12
c481fbad58b375b1c316e1c5493b0924 hg-repos-13
079e724a6ce23a69c0a6c0d1bc8c27a3 hg-repos-14
8763470dcbec370d062b40aea2a2dafb hg-repos-15
b365070a0066ea15aee087c77f848d6c hg-repos-16
1f45337c6efc5041e21b7d6f85bb1b5c hg-repos-17
aab6d956adf417ca369691d0858c6617 hg-repos-18
229ccf672fd9d77178ffce692a65a7f8 hg-repos-19
7052b2669a69f7d8b9229d58e5cc1975 hg-repos-20
a320bb90fc163855909d4893e395c48f hg-repos-21
265a5ac75e355fa27e45aefd23ce0dc1 hg-repos-22
89c1b7070d0a91aab8ff6a53c440199f hg-repos-23
8ebeb53282679c8f3f5c060f2c584451 hg-repos-24
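For anyone without sha256sum at hand, the same check can be done in Python. A minimal sketch, assuming the sha256sum output above was saved to a text file (the name SHA256SUMS below is my assumption) and that the script is run from the directory holding the downloaded lists:

```python
import hashlib

def sha256_of(path):
    """Hex SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(sums_file):
    """Return the names of listed files that are missing or do not match."""
    bad = []
    with open(sums_file) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and the "# sha256sum ..." header
            expected, name = line.split(None, 1)
            name = name.lstrip("*")  # sha256sum marks binary mode with '*'
            try:
                ok = sha256_of(name) == expected.lower()
            except IOError:  # file not downloaded yet
                ok = False
            if not ok:
                bad.append(name)
    return bad
```

An empty result from verify("SHA256SUMS") means every listed file is present and matches.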
Currently, for each repo, the download script creates a list of commit hashes. These depend on the "contents" (commit body, message, timestamp etc.) and on the parent commit(s) (details), so as long as the .hg directory is not messed up, I'd say it should be OK for integrity checks. When collecting the downloaded files from all participants, I can re-create the lists of commits and compare them, and I can also run "hg verify" on each repo.
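To illustrate why the commit list works as an integrity check: as far as I understand Mercurial's revlog format, a node ID is the SHA-1 of the two parent IDs (sorted) followed by the revision text, so a change in any commit or in its ancestry changes every hash that follows. A simplified sketch; the texts below are placeholders, not real changelog entries, and node_id is just an illustrative name:

```python
import hashlib

NULL_ID = b"\0" * 20  # parent ID used when a parent is absent

def node_id(text, p1=NULL_ID, p2=NULL_ID):
    """SHA-1 over the sorted parent IDs plus the revision text."""
    lo, hi = sorted([p1, p2])
    return hashlib.sha1(lo + hi + text).digest()

root = node_id(b"first commit")
child = node_id(b"second commit", p1=root)
# The same text on top of a different parent gives a different ID.
other_child = node_id(b"second commit", p1=node_id(b"tampered first commit"))
print(child != other_child)
```

So two participants ending up with identical commit lists is decent evidence that they hold the same history, even if the files under .hg differ byte for byte.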
Also noticed the .hg/store/data directory has many file names resembling the ones in the source code, and got a crazy idea: what if we group similar or identical files together? Would the compressor find repeated patterns more easily? This idea was previously explored, and apparently has some merit.
On the previous example of hudson/magic-lantern and all of its forks, this command compressed the entire thing to 157 MiB (without sorting: 273.3 MiB, uncompressed size 16.4 GiB, individually compressed repos 14.3 GiB):
find -type f -path '*/*/.hg/*' | rev | sort | rev | tar -cvf - -T - | xz -9e --lzma2=dict=1536Mi -c - > ml-repos.tar.xz
It sorts the file list by the reversed-string file path, effectively grouping files with the same name together (without looking at the contents). TODO: try the approaches from the "morimori" blog post and compare various commands.
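The effect of "rev | sort | rev" can be seen on a small example. A sketch in Python with made-up file paths; note that sort orders by the current locale while Python compares code points, so the exact order may differ slightly from the shell pipeline:

```python
def sort_by_reversed_path(paths):
    """Order paths by their reversed string, so equal suffixes end up adjacent."""
    return sorted(paths, key=lambda p: p[::-1])

# hypothetical paths from two forks of the same repo
paths = [
    "a/fork1/.hg/store/data/src/menu.c.i",
    "a/fork1/.hg/store/data/src/boot.c.i",
    "b/fork2/.hg/store/data/src/menu.c.i",
    "b/fork2/.hg/store/data/src/boot.c.i",
]
for p in sort_by_reversed_path(paths):
    print(p)
```

Both boot.c.i files come out next to each other, followed by both menu.c.i files, which is the order tar then writes them in, so near-identical data lands close together in the stream fed to xz.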