Alright folks, tonight we start backing up "all of Bitbucket" (well, all of the Mercurial repositories stored there).
Background
Bitbucket (Atlassian) is going to discontinue Mercurial support (whatever, it's their business; after all, it was a free service, so we can't complain). The ugly part is that they don't seem to bother providing an archive. No, they are going to DELETE more than 10 years of work from TENS OF THOUSANDS of users.
Other source code hosts did provide an archive when shutting down; see e.g. CodePlex or Google Code. Not Bitbucket.
For this reason, I'd stay away from their current and future offerings, no matter how tempting they might be. Time to move on.
OK, they were nice enough to give us one year to migrate (they could have deleted everything right away, and they would probably have been covered by their ToS - which I didn't read). For active projects, that's probably fine (more or less). However, many of these projects are no longer maintained; their last update was several years ago.
If you believe these unmaintained projects are no longer of interest, please stop reading here.
Anyway. Migration wasn't straightforward either. To date, I'm not aware of any way to losslessly convert a Mercurial repository to Git (and I bet the Bitbucket folks are not aware either). The hg-git extension promises lossless conversion, but fails to preserve the changeset IDs. The best results I've got were with the git-remote-hg extension, which provides a way to contribute from git, but... after trying several tools, I couldn't find a way to recover the original hg repo from its git copy.
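For reference, this is roughly what those conversion attempts look like (a sketch only; it assumes the felipec git-remote-hg helper is on PATH and the hg-git extension is enabled in your hg config; the repository URL is just an example):
# hg -> git with git-remote-hg; the commits come through, but the conversion is not lossless (see above)
git clone "hg::https://bitbucket.org/hudson/magic-lantern" magic-lantern-git
# going back (git -> hg), e.g. with hg-git, is where things fall apart:
# hg clone magic-lantern-git magic-lantern-roundtrip   # exact invocation depends on your hg-git setup
# the round-tripped changeset IDs no longer match the original repository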
OK, so what's the plan?
I've got a list of nearly ALL Bitbucket repositories, retrieved through their API, and attempted to download them. The process was not fast - only about 50,000 repos in one week (OK, my download setup was not very optimized either). There are about 250,000 Mercurial repos (as estimated by Octobus); downloading all of these would take about 5 weeks at this rate.
There are only two weeks left, so... let's parallelize!
Estimated average repo size is about 10-15 MiB (up from an initial estimate of 5 MiB), so the entire *raw* archive should fit on 3-5 TB of disk space (initially estimated at 1-2 TB).
How much can this be compressed? First observation is that Mercurial raw data is already compressed, so attempting to e.g. "tar bz2" or "tar xz" every single repo, like I did in my previous attempt, is not going to help much.
However, many of these repositories are forks, which means plenty of duplicate data. Therefore, forks are expected to compress very well if we group them together. Here's an example from our project (hudson/magic-lantern and its forks, which I've already downloaded from Bitbucket):
- Raw archive: 16.4 GiB (471 repos out of 540 reported by the API, 35.6 MiB average, downloaded in 2.5 hours)
- Individually compressed repos (tar.xz, default settings): 14.3 GiB (compression took 2 hours, one core on i7 7700HQ)
- Archive of tarballs: 14.3 GiB (food for thought)
- All ML forks archived together (tar.xz, -9e --lzma2=dict=1536Mi): only 273 MiB (!), compression time ~ 1 hour
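For reference, the 273 MiB test above boils down to a tar piped into xz with a huge dictionary; a rough way to reproduce it (ml-fork-paths.txt is a made-up name here; any plain list of the fork paths, one user/repo per line, will do):
# archive a group of related repos together, so xz can exploit the duplicated history across forks
tar -cf - $(cat ml-fork-paths.txt) | xz -9e --lzma2=dict=1536Mi > ml-forks.tar.xz
# note: -9e with a 1536 MiB dictionary needs a lot of RAM (on the order of 10x the dictionary size)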
Archiving everything in a single file, after downloading, might also work reasonably well (todo: test on a set of 10,000 repos).
Hence, the plan:
- Stage 1 (next two weeks): download all Mercurial repos from Bitbucket and store them uncompressed (raw .hg directories)
- Stage 2: decide the best strategy for compressing all of this stuff (possibly by grouping forks together - this can be automated; see the sketch right after this list)
- Stage 3: publish an archive, for the entire world to use (what if other open source projects missed some important bit during their migration?)
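The grouping itself can probably be automated from the .commits files produced by the download script below, under the assumption that forks share their root changeset; a minimal sketch:
# group repos by their root changeset: forks of the same project should land in the same group
# ("hg log" prints newest first, so the last line of each .commits file is the root changeset;
#  repos with multiple roots would need extra care - this is only a sketch)
for c in */*.commits; do
    echo "$(tail -n 1 "$c") ${c%.commits}"
done | sort | awk '{ fn = "group-" $1 ".txt"; print $2 >> fn; close(fn) }'
# each group-<hash>.txt now lists repos sharing the same root changeset,
# i.e. good candidates for being archived together in a single tar.xz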
List of repos:
all-repos (huge file; all repos until June 16, 2020)
hg-repos (huge file; list of Mercurial repos, which I'll divide into smaller chunks)
[ fields: hg/git, user/repo, creation date, last updated, optional url ]
Let's divide these into manageable chunks:
split -l 10000 --numeric-suffixes hg-repos hg-repos-
hg-repos-00 (a1ex: 91.23% downloaded, 8.77% errors, 2.82 MiB average) (Levas: started)
hg-repos-01 (a1ex: 92.38% downloaded, 7.62% errors, 4.16 MiB average)
hg-repos-02 (a1ex: 92.08% downloaded, 7.92% errors, 6.81 MiB average)
hg-repos-03 (a1ex: 92.76% downloaded, 6.43% errors, 0.81% todo)
hg-repos-04 (a1ex: 31.84% downloaded, 2.16% errors, 66.00% todo)
hg-repos-05 (critix: 23.59% downloaded, 1.78% errors)
hg-repos-06 (critix: 27.30% downloaded, 2.02% errors)
hg-repos-07 (Audionut: 23.08% downloaded, 1.96% errors) (Danne: started)
hg-repos-08 (Danne: started)
hg-repos-09 (Danne: started)
hg-repos-10 (Danne: started)
hg-repos-11 (names_are_hard: started)
hg-repos-12 (names_are_hard: started)
hg-repos-13 (names_are_hard: started) (Audionut: 25.35% downloaded, 1.18% errors)
hg-repos-14 (a1ex: 68.22% downloaded, 0.37% errors, 31.41% todo)
hg-repos-15 (a1ex: 91.68% downloaded, 0.94% errors, 7.38% todo)
hg-repos-16 (a1ex: 12.01% downloaded, 0.43% errors, 87.56% todo)
hg-repos-17 (Audionut: 24.25% downloaded, 3.39% errors) (kitor: 14.37% downloaded, 0.43% errors)
hg-repos-18 (Audionut: 22.46% downloaded, 0.51% errors) (kitor: 14.57% downloaded, 0.21% errors)
hg-repos-19 (Audionut: 15.49% downloaded, 0.27% errors) (kitor: 10.17% downloaded, 0.14% errors)
hg-repos-20 (Audionut: 16.87% downloaded, 0.21% errors)
hg-repos-21 (Audionut: 20.56% downloaded, 0.26% errors)
hg-repos-22
hg-repos-23
hg-repos-24
Only ML forks, as identified earlier in this thread (caveat: different file format):
ml-forks (a1ex: 471/540 downloaded; the others had errors)
The hacky download script:
#!/bin/bash
# usage: [bash] ./download_bitbucket_hg_repos.sh hg-repos-00 # or 01, 02 etc
for f in $(cut -d ' ' -f 2 "$1"); do
    echo
    echo "Processing $f ..."
    # skip already-downloaded repos (for which we have a valid .commits file)
    if [ ! -f "$f.commits" ]; then
        # hg clone, don't prompt for user/password/whatever,
        # and don't update the working directory (we only need the .hg folder)
        # this may fail (404 on some repos, auth needed on others, etc)
        if hg clone --config ui.interactive=false -U -- "https://bitbucket.org/$f" "$f"; then # HTTPS version, slower, but works out of the box
        #if hg clone --config ui.interactive=false -U -- "ssh://hg@bitbucket.org/$f" "$f"; then # SSH version, faster (thanks kitor), but requires additional setup ("You need to add your ssh public key to bitbucket. And run HG once by hand (without disabling interactive shell) to accept remote ssh pubkey.")
            # for each successfully-cloned repo, we build a list of commits (hashes only)
            # this lets us identify the contribution of every single fork
            # this may be used to decide the best compression strategy, after downloading all of the stuff
            (cd -- "$f" && hg log --template "{node}\n" > "../../$f.commits")
        else
            # "hg clone" failed for some reason
            # todo: report status? (404 or whatever)
            # these repos will be retried if you run the script again
            echo "$f" >> hg-clone-errors.txt
        fi
    fi
done
I'm keeping the script very simple, to avoid potential trouble. To parallelize, you should be able to start as many instances as you want (each instance with its own repo list, of course). These instances can probably work in the same directory (not thoroughly tested, but as long as each instance processes a different list of repos, it should be fine, I think). You can stop and restart each instance as needed, by closing the terminal (CTRL-C will only stop the active process, very likely an "hg clone", resulting in a false error report).
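For example, to run a few chunks in parallel from the same working directory (the chunk names here are just the ones from the list above; use whatever you claimed):
# start one instance per chunk, each with its own repo list and its own log file
for chunk in hg-repos-00 hg-repos-01 hg-repos-02; do
    bash ./download_bitbucket_hg_repos.sh "$chunk" > "$chunk.log" 2>&1 &
done
# all instances append failures to the shared hg-clone-errors.txt;
# follow the progress of one of them with: tail -f hg-repos-00.log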
You will need:
- a working directory with enough free space available (assume 20 MiB per repo on average, although it's likely less)
- one or more lists of repos (download from above).
That's it, now you are ready to run the script.
Guess: 2-3 threads are probably best (not 100% sure), depending on how powerful your machine is. Watch out for HDD/SSD thrashing!
If your machine starts to be unresponsive, you may have too many threads running. Stop or pause some of them!
Caveat: never edit Bash scripts while they are running!
Script outputs:
- For each repo (e.g. hudson/magic-lantern):
  - user directory (hudson/)
  - project directory (hudson/magic-lantern/)
  - .hg folder (hudson/magic-lantern/.hg/ - possibly hidden by default, depending on your file browser)
  - a list of commits (hudson/magic-lantern.commits)
- For all repos:
  - hg-clone-errors.txt (hopefully obvious)
Once you start running the script, let me know what subset(s) of repos you are downloading; this way, we all know what everybody is downloading, what has been downloaded, and so on. Ideally, each repo (or set of repos) should be downloaded by at least two participants, just in case. I'll edit this post to keep it up to date.
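To get the numbers for such a report (like the percentages in the chunk list above), something along these lines should do (a rough sketch; it assumes you run it from the working directory and that the chunk file, e.g. hg-repos-00, is there too):
# rough progress check for one chunk: count how many of its repos already have a .commits file
chunk=hg-repos-00
total=$(wc -l < "$chunk")
downloaded=0
while read -r line; do
    f=$(echo "$line" | cut -d ' ' -f 2)
    [ -f "$f.commits" ] && downloaded=$((downloaded + 1))
done < "$chunk"
echo "$chunk: $downloaded / $total repos downloaded (failures are listed in hg-clone-errors.txt)"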
Afterwards, you can create an archive of what you downloaded, with:
tar -cJvf commits.tar.xz */*.commits # commit lists only
tar -cf - */*/.hg | xz -9e --lzma2=dict=1536Mi -c - > repos.tar.xz # hg repos only (slow, RAM-intensive!)
Let's hope the resulting files will be small enough to exchange them.
Aren't others already taking care of this? (or, aren't you supposed to port ML to the 13D Mark 7, instead of this bull***?)
Unfortunately, the well-known archive.org -- who, to their credit, saved the day countless times -- apparently didn't do a great job on this one (see the earlier report from aprofiti).
These guys did better (see archive.softwareheritage.org) but...
1) they are a single point of failure (so, a second backup should never hurt)
2) we have noticed a little issue in their archive: the Mercurial changeset IDs were not kept, at least in our repo (example a few posts above)
3) the deadline is coming!
OK, OK. Why didn't you start earlier?
As you may or may not know, this is a hobby project for us. Translation: if there is time to spare, the project advances. If there is not, the project stalls (or... disappears). And, as surprising as it may sound, we also have to eat from time to time (along with our families). The "default" way to put food on the table is to get a job, which can have the side effect of not leaving much time for hobbies (especially during these difficult times).
As mentioned earlier, I actually took a 2-week holiday in order to perform this migration and to catch up with other hobby projects (no travel plans or anything like that). That's when I started researching this issue and noticed the need for a good backup of all those repos (not only ours).
The good side: after one week of messing around, I think I've got a plan that has at least some chance of working.

So, let's try!