Bitbucket set to remove Mercurial support

Started by names_are_hard, August 20, 2019, 04:48:31 PM


Danne

Downloading:
hg-repos-17
hg-repos-18
hg-repos-19
hg-repos-20
hg-repos-21

Audionut

Quote from: Audionut on June 15, 2020, 03:38:53 AM
I initially had 5 scripts running which was keeping my 100Mbps reasonably saturated and things were singing along nicely, but after some time my storage drive couldn't keep up with all the random writes and crashed to a halt. #beware

There's probably an easier way to do this, but here's what I did (since HDD thrashing was my limiting factor).

Create another folder on another HDD and drop the script and (other) hg-repos-xx files in there. Then drop a symbolic link to that folder into the "home" directory for Ubuntu.

Then, in Ubuntu:
cd /home/newfolder
sudo ~/myscript.sh hg-repos-xx
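
Roughly like this (the paths and folder names below are only placeholders, not necessarily what I used):

# sketch of the layout described above
mkdir -p /mnt/second_hdd/bb_mirror                    # folder on the other HDD
cp myscript.sh hg-repos-xx /mnt/second_hdd/bb_mirror/
ln -s /mnt/second_hdd/bb_mirror ~/bb_mirror           # symbolic link in the home directory
cd ~/bb_mirror
sudo ./myscript.sh hg-repos-xx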

Levas

OK, managed to get this script working, thanks to the helpful post from Audionut.

Started with hg-repos-00 just to see how this stuff works.

An Overview:
hg-repos-00 (a1ex: 90.65% downloaded) - Levas
hg-repos-01 (a1ex: 91.83% downloaded)
hg-repos-02 (a1ex: 91.03% downloaded)
hg-repos-03 (a1ex: 92.06% downloaded)
hg-repos-04 (a1ex: 4.21% downloaded)
hg-repos-05 - Critix
hg-repos-06 - Critix
hg-repos-07 - Danne - Audionut
hg-repos-08 - Danne
hg-repos-09 - Danne
hg-repos-10 - Danne
hg-repos-11 - Names_are_hard
hg-repos-12 - Names_are_hard
hg-repos-13 - Names_are_hard - Audionut
hg-repos-14 (a1ex: 60.05% downloaded)
hg-repos-15 (a1ex: 90.53% downloaded)
hg-repos-16 (a1ex: 11.59% downloaded)
hg-repos-17 - Audionut
hg-repos-18 - Audionut
hg-repos-19 - Audionut
hg-repos-20 - Audionut
hg-repos-21 - Audionut
hg-repos-22 (TBD)
hg-repos-23 (TBD)
hg-repos-24 (TBD)
hg-repos-25 (TBD)

Only ML forks, as identified earlier in this thread (caveat: different file format):
ml-forks (a1ex: 471/540 downloaded; the others had errors)

Levas

I'm downloading to an external drive and get this type of message a lot:

"not trusting file /Volumes/4 TB Seagate Expansion Drive/BACKUP_MERCURIAL/FILES/Diggory/growl/.hg/hgrc from untrusted user _unknown, group _unknown"
I have 3 different users on my computer; the user I'm downloading with has full read/write access to this drive.
Files are written, and the directory is over 1 GB now.
Can these messages be ignored, or is this a problem?


Processing Diggory/growl ...
requesting all changes
adding changesets
adding manifests                                                                                                                                               
adding file changes
added 4171 changesets with 16677 changes to 4841 files (+8 heads)                                                                                               
new changesets 2a9b17b425fb:2ecfd4e7a571
not trusting file /Volumes/4 TB Seagate Expansion Drive/BACKUP_MERCURIAL/FILES/Diggory/growl/.hg/hgrc from untrusted user _unknown, group _unknown
not trusting file /Volumes/4 TB Seagate Expansion Drive/BACKUP_MERCURIAL/FILES/Diggory/growl/.hg/hgrc from untrusted user _unknown, group _unknown
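
Side note: Mercurial prints this warning when a per-repo .hg/hgrc is owned by a user it doesn't trust; it ignores that file's settings, but the clone itself still completes, so the messages are harmless here. If the noise bothers you, something like this in the per-user ~/.hgrc should (I think) mark those files as trusted:

# hedged sketch: trust .hg/hgrc files regardless of their owner
cat >> ~/.hgrc <<'EOF'
[trusted]
users = *
groups = *
EOF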

Audionut

I'm seeing the same; they are all ".hg/hgrc".

Another one in red every now and then is something like "stream ended early, expected xxxx bytes, but got xxx bytes".

And some 404: not found.

Danne

Noticed I doubled up with Audionut. I will change and download as follows instead:
An Overview:
hg-repos-00 (a1ex: 90.65% downloaded) - Levas
hg-repos-01 (a1ex: 91.83% downloaded)
hg-repos-02 (a1ex: 91.03% downloaded)
hg-repos-03 (a1ex: 92.06% downloaded)
hg-repos-04 (a1ex: 4.21% downloaded)
hg-repos-05 - Critix
hg-repos-06 - Critix
hg-repos-07 - Danne - Audionut
hg-repos-08 - Danne
hg-repos-09 - Danne
hg-repos-10 - Danne
hg-repos-11 - Names_are_hard
hg-repos-12 - Names_are_hard
hg-repos-13 - Names_are_hard - Audionut
hg-repos-14 (a1ex: 60.05% downloaded)
hg-repos-15 (a1ex: 90.53% downloaded)
hg-repos-16 (a1ex: 11.59% downloaded)
hg-repos-17 - Audionut
hg-repos-18 - Audionut
hg-repos-19 - Audionut
hg-repos-20 - Audionut
hg-repos-21 - Audionut
hg-repos-22 (TBD)
hg-repos-23 (TBD)
hg-repos-24 (TBD)
hg-repos-25 (TBD)

Danne

Maybe a stupid question, but I want to get this right.

I'm using this script on my own hg projects. Found them all listed in a1ex's https://a1ex.magiclantern.fm/bitbucket-mercurial-archive/hg-repos

Running the script gives the following start. Looks OK.

Last login: Mon Jun 15 10:18:08 on ttys003
Daniels-MacBook-Pro:repos-danne daniel$ ./bb_script.sh hg-repos-danne

Processing Dannephoto/switch ...
requesting all changes
adding changesets
adding manifests
adding file changes
added 810 changesets with 5642 changes to 1994 files (+1 heads)                 
new changesets 58de0454c0bb:2da1b9624c05

Processing Dannephoto/magic-lantern ...
requesting all changes
adding changesets
adding manifests                                                               
adding file changes                                                             
added 19315 changesets with 42052 changes to 4199 files (+100 heads)           
new changesets 4d0acc5c0792:6b5fa5a301dd

Processing Dannephoto/ml-dng-dannephoto ...
applying clone bundle from https://api.media.atlassian.com/file/cee72b9b-2bf7-4f0f-ac46-b7548f0b22b7/binary?client=d7e55603-7661-4c7a-b2ad-9d34e2f93a3c&token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3MiOnsidXJuOmZpbGVzdG9yZTpmaWxlOmNlZTcyYjliLTJiZjctNGYwZi1hYzQ2LWI3NTQ4ZjBiMjJiNyI6WyJyZWFkIl19LCJleHAiOjE1OTIyMDk4MjMsImlzcyI6ImQ3ZTU1NjAzLTc2NjEtNGM3YS1iMmFkLTlkMzRlMmY5M2EzYyIsIm5iZiI6MTU5MjIwOTQwM30.9xHcMAjnkzIGEMNeiLEnNOaSN-liDWet8MrHBHzx8P4
adding changesets
adding manifests
adding file changes
added 5 changesets with 1582 changes to 1577 files                             
finished applying clone bundle
searching for changes
no changes found


Downloaded, and all the commits follow. Really nice:


Now to my question. In the source folder, all I can see when I unhide hidden files (on Mac) is the .hg folder:


No sources downloaded? Is this to be expected? Are we downloading commits and .hg content only? A bit confused...

Audionut

Did you look in the .hg folder?

A random folder downloaded here:



a1ex

Quote from: Audionut on June 15, 2020, 09:59:02 AM
Another one in red every now and then is something like "stream ended early, expected xxxx bytes, but got xxx bytes".

And some 404: not found.

Those 404's are repos that were deleted by their owners; we can ignore them.

The "stream ended early, expected xxxx bytes, but got xxx bytes" are probably network errors, and these will very likely work on a second attempt.

If you run each script 3-4 times, it should "fix" most of these errors. The first run will take a long time (let's say about 1 day, give or take), but subsequent runs should be much faster (let's say 1 hour, but it varies a lot, depending on how many errors there actually were in that set).

For example, after fully downloading hg-repos-03, an extra run (that doesn't download anything new, but only retries the 404's) takes under 10 minutes, but for hg-repos-01, an extra run takes about 40 minutes.
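
A minimal sketch of such a retry loop (using the script name from earlier in the thread; adjust the set name):

# re-run the same set a few times so transient network errors get retried
for attempt in 1 2 3 4; do
    ./bb_script.sh hg-repos-01
done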

Quote
No sources downloaded? Is this to be expected? Are we downloading commits and .hg content only? A bit confused...

Right. The .hg directory is all you need to recover the sources, from any version:

hg update
hg update <changeset_id>
hg update <branch_name>
...


The working directory is therefore redundant and takes additional disk space, so we don't keep it :)
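
For example, with one of the repos from the log above:

# restore a working copy from an archived repo (the directory contains only .hg)
cd FILES/Diggory/growl
hg update default          # re-creates the source files for the "default" branch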




Here's a Python script that computes the percentages (btw, 0.01% from one set is exactly 1 repo):

from __future__ import print_function
import os

# list of repos that failed to clone (if the error log exists yet)
try: error_repos = open("hg-clone-errors.txt").readlines()
except: error_repos = []
error_repos = [x.strip() for x in error_repos]

for i in range(30):
    fn = "hg-repos-%02d" % i
    try: repos = open(fn).readlines()
    except: continue
    repos = [r.strip() for r in repos]
    downloaded = 0
    errors = 0
    total = len(repos)
    for line in repos:
        # the second field of each line is the owner/repo path
        r = line.split(" ")[1]
        if os.path.isfile(r + ".commits"):
            # count a repo as downloaded if its .commits file is present
            downloaded += 1
        elif r in error_repos:
            errors += 1
    print("%s: %.2f%% downloaded, %.2f%% errors, %.2f%% todo" % (fn, downloaded * 100.0 / total, errors * 100.0 / total, (total - errors - downloaded) * 100.0 / total))


There is a catch: this script can't tell which errors might be recoverable (with a second run) and which are not (the 404's and a few others). Even if it prints "complete" (0% todo), running the download script once again might bring in a few more repos that didn't work on the first try.

Audionut

Is this downloading all of the comments on commits, PR's etc?

a1ex

No, only the code and its Mercurial history.

Downloading the comments, PRs, issues and other stuff can be done with bitbucket_hg_exporter, but it took about 2 days just for the hudson repository (without the forks). It's still trying to download the same things from all the ML forks, but I don't think it's going to finish before the deadline. Doing this for all Bitbucket repositories would probably take years without massive parallelization :)

Here's an archive of our Bitbucket PRs, issues and other stuff (also being imported by Heptapod, so it will be only for cross-checking):
https://a1ex.magiclantern.fm/bitbucket-mercurial-archive/magic-lantern/gh-pages/

But, as I didn't export the forks with that script (because it was taking too long), the PRs only contain comments and other metadata, without the code. The code is what we are downloading now :)

Audionut

Ah right, I thought I read somewhere it was getting backed up.

I might run a website cloner on the hudson repository, because why not!

kitor

As I'm in a hurry and just noticed this thread while leaving home: is there anything still to be downloaded? I have ~8 TB of free space on my server.
Too many Canon cameras.
If you have a dead R, RP, 250D mainboard (e.g. after camera repair) and want to donate for experiments, I'll cover shipping costs.

Levas

@Kitor
Quote from: a1ex on June 14, 2020, 11:34:27 PM
Ideally, each repo (or set of repos) should be downloaded by at least two participants, just in case.

See Danne's list a few posts earlier; most repos are now being downloaded by at least one person. Per the quote above, double downloads couldn't hurt.

I'm downloading hg-repos-00
But I'm not sure if I'll download anything after that; my upload (~2.5 Mbit) is 10 times slower than my download (~20 Mbit).
I don't wanna calculate how long uploading is gonna take  :P


a1ex

Well, 8 TB sounds excellent for collecting the downloads from all participants (in max. 2 weeks, possibly earlier), and running the scripts for deciding how to compress and archive the entire thing (which I haven't written yet, but have thought about).

We've worked together on the EOS R port before, so I'm comfortable with this option.

Some not-so-good news about repo sizes: in the first sets, the average size is about 5 MiB per repo, but in the middle ones (15, 16) it's closer to 15 MiB per repo. Watch out for free space!

At 20 MiB / repo, 250,000 repos would require about 5 TB. Let's hope it won't exceed 30 :)

My guess: many of the new repos are likely to be forks, so they should compress well if grouped together.
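
A rough sketch of what that grouping could look like (the tool choice and paths are only placeholders; the actual archiving scripts aren't written yet):

# pack all forks of one project into a single archive, so their (mostly identical)
# history sits together and compresses well
tar -cf - */magic-lantern | zstd -19 --long -T0 -o magic-lantern-forks.tar.zst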

Levas

How big is the first set (hg-repos-00)?
Is it really about 1.2 TB? :o
If that's the case, that will take me more than 5 days, IF download speed is at max...
So is it 1.2 TB?
Then I'll probably leave participating to people with better internet connections :P

Levas

Oh wait, I see you divided it into 25 chunks of 10,000.
Looking into the folder after 10 hours of downloading... 2350 folders... :'( (about 25%)

Just curious, how is everybody else doing  8)

kitor

Quote from: a1ex on June 15, 2020, 07:08:36 PM
Well, 8 TB sounds excellent for collecting the downloads from all participants (in max. 2 weeks, possibly earlier), and running the scripts for deciding how to compress and archive the entire thing (which I haven't written yet, but have thought about).
At 20 MiB / repo, 250,000 repos would require about 5 TB. Let's hope it won't exceed 30 :)

I'm setting up env for those:

Quotehg-repos-22 (TBD)
hg-repos-23 (TBD)
hg-repos-24 (TBD)
hg-repos-25 (TBD)

I'll run it overnight. I have a 100/50 connection there, so not great, not terrible; at a 2nd location I can set up about 2 TB (with a 100/100 connection) for side work if needed.

Quote from: a1ex on June 15, 2020, 07:08:36 PM
Well, 8 TB sounds excellent for collecting the downloads from all participants (in max. 2 weeks, possibly earlier), and running the scripts for deciding how to compress and archive the entire thing (which I haven't written yet, but have thought about).
At 20 MiB / repo, 250,000 repos would require about 5 TB. Let's hope it won't exceed 30 :)

The primary has dual E5-2630 v2 CPUs and 96 GB of RAM, so it's fine for some heavy lifting. One limitation is that I need to keep bandwidth-heavy things to the night / cap them at half during the day, as it hosts all of my production sites.
On the primary I can add 4 TB over the local gigabit network, as I have a cold spare lying around (unfortunately all local storage bays are full of disks).

If somebody wonders, this is my homelab ;)

Alex, if you want some env to experiment on, ping me on PM.

Levas

;D
names_are_hard

I'm about 50% complete on the 3 sets I'm downloading; looks like they'll be around 750 GB total. So I predict 6.5 TB for everything (uncompressed).
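
(Roughly: 750 GB / 3 = 250 GB per set, and 26 sets × 250 GB ≈ 6.5 TB.)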

kitor

Quote from: Levas on June 15, 2020, 08:02:53 PM
;D
Yup, I just saw that those had no links. Anyway, I won't start before tonight (still 4 hours), so Alex may come to the rescue ;)

[edit] Just started hg-repos-17 as a test. If 22-25 aren't available, I'll run 17-21 overnight.

Danne

Quote from: names_are_hard on June 15, 2020, 08:11:04 PM
I'm about 50% complete on the 3 sets I'm downloading; looks like they'll be around 750 GB total. So I predict 6.5 TB for everything (uncompressed).
750 GB for three sets; I might need to pause a few then. So around 250 GB per set? Is that a good estimate?

kitor

I just wonder, wouldn't an ssh clone be faster? I think http is the overhead here.

Quick test (of course, one run is not enough to prove anything):

# time hg clone --config ui.interactive=false -U https://bitbucket.org/hudson/magic-lantern ./magic-lantern-http
requesting all changes
adding changesets
adding manifests                                                                                                     
adding file changes                                                                                                   
added 18025 changesets with 40250 changes to 4090 files (+60 heads)                                                   
new changesets 4d0acc5c0792:f7947b627e33

real    1m20.908s
user    0m30.178s
sys     0m2.235s

# time hg clone --config ui.interactive=false -U ssh://hg@bitbucket.org/hudson/magic-lantern ./magic-lantern-ssh
requesting all changes
adding changesets
adding manifests
adding file changes                                                                                                   
added 18025 changesets with 40250 changes to 4090 files (+60 heads)                                                   
new changesets 4d0acc5c0792:f7947b627e33

real    0m46.857s
user    0m26.173s
sys     0m3.698s


a1ex

Confirmed, very nice find:

time hg clone --config ui.interactive=false -U https://bitbucket.org/hudson/magic-lantern ./magic-lantern-http
real 1m20.831s

time hg clone --config ui.interactive=false -U ssh://hg@bitbucket.org/hudson/magic-lantern ./magic-lantern-ssh
remote: Warning: Permanently added the RSA host key for IP address '18.205.93.1' to the list of known hosts.
real 0m26.886s

time hg clone --config ui.interactive=false -U https://bitbucket.org/hudson/magic-lantern ./magic-lantern-http2
real 1m14.051s

time hg clone --config ui.interactive=false -U ssh://hg@bitbucket.org/hudson/magic-lantern ./magic-lantern-ssh2
real 0m25.652s


Updated the script (edit: defaulting to https, so it works out of the box for everyone). Reminder: never edit Bash scripts while they are running!
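
For anyone adapting their own copy: switching a clone to ssh is just a change of URL scheme. A sketch (not necessarily how the updated script does it):

# clone one repo over ssh instead of https
repo="hudson/magic-lantern"
hg clone --config ui.interactive=false -U "ssh://hg@bitbucket.org/$repo" "./$(basename "$repo")-ssh"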

Uploaded hg-repos-22, but hg-repos-23 will be ready tomorrow morning (repo list is currently at September 2019).

Danne

The latest ssh script gives the following:
Last login: Mon Jun 15 21:44:15 on ttys002
Daniels-MacBook-Pro:repo_10 daniel$ /Volumes/bitbucket/repos/bb_script_new.sh /Volumes/bitbucket/repos/hg-repos-10

Processing trytonspain/trytond-account_treasury ...
The authenticity of host 'bitbucket.org (18.205.93.0)' can't be established.
RSA key fingerprint is SHA256:zzXQOXSRBEiUtuE8AikJYKwbHaxvSc0ojez9YXaGp1A.
Are you sure you want to continue connecting (yes/no)?

After typing yes, it continues with this:
remote: Warning: Permanently added 'bitbucket.org,18.205.93.0' (RSA) to the list of known hosts.
remote: hg@bitbucket.org: Permission denied (publickey).
abort: no suitable response from remote hg!

Processing dglyzin/ndtracer ...
remote: hg@bitbucket.org: Permission denied (publickey).
abort: no suitable response from remote hg!

Processing GordCaswell/peazipportable ...
remote: hg@bitbucket.org: Permission denied (publickey).
abort: no suitable response from remote hg!

Processing leondz/timeml-repair ...
remote: hg@bitbucket.org: Permission denied (publickey).
abort: no suitable response from remote hg!

Processing gstarcorporation/websharper ...

No tinkering time atm over here, so just reporting.
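
Those "Permission denied (publickey)" errors most likely mean the ssh protocol wants an SSH key registered with a Bitbucket account, even for public repos. A hedged sketch of the one-time setup (the key path is just the default example):

# create a key pair if you don't already have one, then add the contents of
# ~/.ssh/id_rsa.pub to your Bitbucket account under SSH keys
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""
# pre-accept bitbucket.org's host key so the script isn't stopped by the yes/no prompt
ssh-keyscan -t rsa bitbucket.org >> ~/.ssh/known_hosts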