aki (drkscrtlv) wrote,

dispelling some mozharness fallacies

My previous blog post, "what is mozharness?", appears to have caused some dismay.
I'm writing this to answer Axel's invitation to respond, and to clear up some misconceptions about mozharness.


[assertions]

  1. Respect for Mozilla l10n.

    This conversation is larger than just localization. However, the above thread brings up l10n specifically, and the first two mozharness scripts involve l10n, so here are my high-level thoughts:

    I remain highly impressed with the state of l10n at Mozilla: the sheer number of locales, the contributions of volunteer localizers, the sim-shipping of localized releases with en-US releases. Never have I seen localization done anywhere near so efficiently and well, anywhere in software. I believe strongly that this is key to Firefox's success. And Axel is a significant part of that.

    I also believe that however good our localization story may be, there's definitely room for improvement.
    We seem to disagree as to how to improve that story for the moment.


  2. Mozharness is imperfect software with time-tested concepts.

    I will be the first to admit that mozharness is imperfect. I'm self-taught. Python is a relatively new language to me. I wrote mozharness to solve complex problems with tight deadlines. Mozharness, as it exists today, is essentially non-feature-complete beta software.

    (With that in mind, I will be speaking about what mozharness could be, in forward-thinking statements. Saying "Nothing prevents us from doing ____" does not mean it'll take zero development/testing/roll out time. I am speaking about technical feasibility only.)

    However, the concepts behind mozharness are lessons I've learned over the years. Usually the hard way.
    I'm open to changing mozharness' specific implementation details, but I strongly believe the concepts themselves are right.
    It falls on my shoulders to communicate those clearly.


  3. Buildbot was not written to micromanage slaves.

    The above statement sums up my first conversation with Brian Warner, where we vehemently agreed that too much complex logic had been relegated to Mozilla's buildbot masters.

    Even if one ignores his original intent, I assert that moving complex logic from an overloaded master to its relatively unloaded slaves will be

    1. more efficient
    2. more scalable
    3. more portable

    I'll revisit this, and the other assertions, in the next three sections.


[fallacy #1: mozharness will make {builds,tests,repacks} less granular]

This can be split into two concerns: granularity of status, and granularity of easy-to-replicate steps. This is going to be a long section; I still haven't fully convinced everyone on my own team about these points.

  • Mozharness can sum up status at the end of jobs.

    I don't want to spend this entire post talking about abstractions and what mozharness could be if this nebulous vision in my head somehow sees the light of day. So here's a concrete example that exists today.

    signdebs.py outputs a summary at the end of every log:

    19:29:25 INFO - #####
    19:29:25 INFO - ##### MaemoDebSigner summary:
    19:29:25 INFO - #####
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-gtk-l10n/he/fennec_4.0~b3~20101117005726_armel.deb; skipping he on fremantle
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-gtk-l10n/ja/fennec_4.0~b3~20101117005726_armel.deb; skipping ja on fremantle
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-gtk-l10n/ja-JP-mac/fennec_4.0~b3~20101117005726_armel.deb; skipping ja-JP-mac on fremantle
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-qt-l10n/he/fennec_4.0~b3~20101117010937_armel.deb; skipping he on fremantle-qt
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-qt-l10n/ja/fennec_4.0~b3~20101117010937_armel.deb; skipping ja on fremantle-qt
    19:29:25 ERROR - Can't download http://stage.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-maemo5-qt-l10n/ja-JP-mac/fennec_4.0~b3~20101117010937_armel.deb; skipping ja-JP-mac on fremantle-qt
    19:29:25 INFO - Uploaded multi on fremantle successfully.
    19:29:25 INFO - Uploaded en-US on fremantle successfully.
    19:29:25 INFO - Uploaded ar on fremantle successfully.
    19:29:25 INFO - Uploaded be on fremantle successfully.
    19:29:25 INFO - Uploaded ca on fremantle successfully.
    19:29:25 INFO - Uploaded cs on fremantle successfully.

    ... snip 76 more lines of "Uploaded ___ on ___ successfully."

    I'd love for this to be prettier and take fewer lines, like "82/88 deb repos uploaded successfully!" with more verbose information about the ones that failed. But again, deadlines, and pretty statuses were not a hard requirement.
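
    The condensed summary described above could be produced along these lines. This is a minimal sketch, not code from signdebs.py: the function name and the shape of the results dictionary are assumptions made for illustration.

```python
def summarize_uploads(results):
    """Collapse per-repo results into one headline plus failure details.

    `results` is a hypothetical dict mapping a (locale, platform) pair
    to None on success or to an error string on failure.
    """
    failures = {key: err for key, err in results.items() if err is not None}
    succeeded = len(results) - len(failures)
    lines = ["%d/%d deb repos uploaded successfully" % (succeeded, len(results))]
    # List only the failures, in a stable order, below the headline.
    for (locale, platform), err in sorted(failures.items()):
        lines.append("FAILED %s on %s: %s" % (locale, platform, err))
    return lines
```

    Successes collapse into a single count, so the log only spends lines on the repos that need attention.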

    • What's keeping us from emailing this summary somewhere? Nothing.
    • What's keeping us from updating a database or hitting a cgi with this status? Nothing.
    • What's keeping us from sending out a Pulse message with this info? Probably the fact that I know very very little about how to send a Pulse message. Other than that, probably nothing.

    We'll want to make those updates conditional, if "notify-pulse" in self.actions:, for example, so staging or development runs don't attempt to send production status messages. But if it can be done in a script, it can be done at the end of any mozharness script.
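
    A rough sketch of that conditional notification follows. The class is a toy stand-in, not the real mozharness script class; the notifiers mapping and the method names are assumptions made for illustration.

```python
class SummaryScript(object):
    """Toy stand-in for a mozharness script; not the real BaseScript."""

    def __init__(self, actions, notifiers=None):
        self.actions = list(actions)
        # Hypothetical mapping of action name -> callable(summary_lines).
        self.notifiers = notifiers or {}
        self.summary = []

    def add_summary(self, message):
        self.summary.append(message)

    def notify(self):
        """Run only the notifiers whose action was requested."""
        sent = []
        for action, send in sorted(self.notifiers.items()):
            if action in self.actions:
                send(list(self.summary))
                sent.append(action)
        return sent
```

    A staging run would simply leave "notify-pulse" out of its action list, and the production notifier would never fire.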


  • Mozharness can update status during jobs.

    What's keeping us from updating status in the middle of a for locale in locales: loop, for example? NOTHING.

    Though it is more expensive. If you want to hit a cgi, for example, mid-for-loop, then that cgi needs to be up and available the entirety of every script run (or the script needs to fail gracefully, or have retry logic, or queueing logic, or whatever). You need to decide when the cost of faster status updates is worth the effort, since post-processing tends to be less expensive.

    There is nothing preventing us from doing so, however, if the need outweighs the cost of updating status throughout the script.
    If it can be done in a script, it can be done inside of any mozharness for loop.
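
    To make the cost concrete, here is a sketch of a mid-loop status update that retries and then fails gracefully. The function names are hypothetical, and send and repack stand for whatever callables a real script would supply (a cgi POST and a locale repack, say).

```python
import time


def report_status(send, payload, attempts=3, delay=0.0):
    """Try to send one status update; return True on success.

    `send` is any callable that raises on failure. Failures are
    swallowed after `attempts` tries so that a dead status server
    never kills the job.
    """
    for attempt in range(attempts):
        try:
            send(payload)
            return True
        except Exception:
            if attempt + 1 < attempts:
                time.sleep(delay)
    return False


def repack_all(locales, repack, send):
    """Repack every locale, reporting each result as it happens."""
    unreported = []
    for locale in locales:
        ok = repack(locale)
        payload = {"locale": locale, "ok": ok}
        if not report_status(send, payload):
            # Keep the update for post-processing instead of failing.
            unreported.append(payload)
    return unreported
```

    Anything left in the returned list can still be reported at the end of the job, which is the cheaper post-processing path.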


  • Mozharness parses its own logs during runCommand() calls already.

    The runCommand() method uses pre-defined error lists to determine whether specific lines in the log are errors.

    These are far from complete, and need to be fleshed out further, but the framework is there.

    Recently, I thought we should add a summary (or somesuch) key to that list of dictionaries. We could, say, add a substr of "Assertion failure: !(addr & GC_CELL_MASK)" with a custom level of intermittent_orange (somewhere between info and warning?) and a summary of "Intermittent orange: This looks like bug 583554!"

    As long as we're in conjecture-land, we can combine this with a post- or mid-job status update that can populate an intermittent orange database with the specific details of this job.
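
    As a sketch of how that summary key might be used: the list below follows the shape described above (a list of dictionaries with a substr and a level), but the second entry and both function names are invented for illustration, and none of this is the actual runCommand() code.

```python
# Hypothetical error list: dictionaries with a "substr" to match, a
# "level", and the proposed optional "summary" key.
ERROR_LIST = [
    {
        "substr": "Assertion failure: !(addr & GC_CELL_MASK)",
        "level": "intermittent_orange",
        "summary": "Intermittent orange: This looks like bug 583554!",
    },
    {"substr": "command not found", "level": "error"},
]


def classify_line(line, error_list=ERROR_LIST):
    """Return (level, summary) for the first matching entry.

    Lines that match nothing are reported as ("info", None).
    """
    for entry in error_list:
        if entry["substr"] in line:
            return entry["level"], entry.get("summary")
    return "info", None


def scan_output(lines, error_list=ERROR_LIST):
    """Collect the distinct summaries triggered by a command's output."""
    summaries = []
    for line in lines:
        level, summary = classify_line(line, error_list)
        if summary and summary not in summaries:
            summaries.append(summary)
    return summaries
```

    The collected summaries are exactly what a post- or mid-job status update would send along.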

    Awesome or not awesome? You vote.


  • Mozharness can create buildbot-parseable status.

    This point's reverse could be a top level fallacy in itself.

    There are two approaches here.

    If it's a hard requirement that we lose none of the "granularity" of the existing buildbot steps, nothing prevents us from creating an action list that encapsulates each buildbot action.

    Then you can script.py --only-set-props-builddir, for example. A buildbot addStep(['python', 'script.py', '--only-%s' % stepName]) for every single thrilling step in the one-hundred-one steps here.
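
    On the script side, honoring such flags could look roughly like this. It is a sketch only: the action names are made up, and this is not how mozharness actually parses its options.

```python
ALL_ACTIONS = ["clobber", "pull", "build", "upload"]  # hypothetical names


def select_actions(argv, all_actions=ALL_ACTIONS):
    """Return the actions to run, honoring any --only-ACTION flags."""
    only = []
    for arg in argv:
        if arg.startswith("--only-"):
            name = arg[len("--only-"):]
            if name not in all_actions:
                raise ValueError("unknown action: %s" % name)
            only.append(name)
    if not only:
        return list(all_actions)
    # Preserve the script's own ordering regardless of flag order.
    return [action for action in all_actions if action in only]
```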

    Can you? Yes. Do you want to? I would argue no.

    Do developers really care if buildbot step #73 dies with python exception ____? Or do they only really care if compilation fails on file X at line Y (link to hg annotate with appropriate finger pointing here)?

    (What if we tied in a ping in #developers or an email message for the suspected culprit in the notify action? Not free, in terms of mozharness development time, but is it doable in scripts? Seems like it, to me.)

    Do localizers really care if buildbot step #45 dies with compare-locales exception ___? Or do they just want a description of which strings are missing, or which XML files need updating, with a link to a wiki page on how to fix those things?

    The second approach could involve writing a buildbot-property parseable file during or at the end of the mozharness script, and adding a buildbot addStep(SetProperty("cat filename")) afterwards to set buildbot-statusdb-parseable buildbot properties.
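
    A sketch of that second approach, assuming the properties are written as a flat JSON object. The file format and both function names are assumptions; the parser takes (rc, stdout, stderr) in the style of a callback that turns the output of "cat filename" into properties, and how it gets wired into SetProperty is left open.

```python
import json


def write_properties(path, properties):
    """Dump job results as a flat JSON object of property name -> value."""
    with open(path, "w") as handle:
        json.dump(properties, handle, sort_keys=True)


def extract_properties(rc, stdout, stderr):
    """Turn the output of `cat filename` into a dict of properties.

    Returns an empty dict if the command failed or the output is not a
    JSON object, so a broken file never sets bogus properties.
    """
    if rc != 0:
        return {}
    try:
        data = json.loads(stdout)
    except ValueError:
        return {}
    return data if isinstance(data, dict) else {}
```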


  • Mozharness could create a step-summary log.

    As I explained in my previous blog post, I lean very heavily towards more verbosity than less, though you could --log-level error (or w/e) to ignore most of this verbosity.

    In a --multi-log run, I could add a step-level log. Basically, any calls to the BaseScript wrapper methods could also write their equivalent to a log.

    (Huh??? English, do you speak it?)

    If we cared enough about this, we could create yet another log file. The BaseScript.chdir() method could output cd DIRNAME to this log. And the BaseScript.runCommand() method could write its command line to this log file, and so on. This log file would approach being an executable shell script.

    I say approach since there aren't always going to be universal scriptable equivalents. But this would be an improvement over the status quo.
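
    A sketch of what those wrappers might do. The class below is a toy written for illustration, not the real BaseScript; its chdir() and run_command() methods only mimic the idea of logging a shell equivalent before acting.

```python
import os
import shlex
import subprocess


class StepLoggingScript(object):
    """Toy wrapper; the real BaseScript methods are not reproduced here."""

    def __init__(self, step_log_path):
        # Use an absolute path, since chdir() changes the working directory.
        self.step_log_path = os.path.abspath(step_log_path)

    def _log_step(self, line):
        with open(self.step_log_path, "a") as handle:
            handle.write(line + "\n")

    def chdir(self, dirname):
        self._log_step("cd %s" % shlex.quote(dirname))
        os.chdir(dirname)

    def run_command(self, command):
        """Run a command given as a list of arguments; return its exit code."""
        self._log_step(" ".join(shlex.quote(arg) for arg in command))
        return subprocess.call(command)
```

    Quoting each argument is what lets the resulting log be pasted back into a shell, at least for the steps that have a shell equivalent.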


  • High-level granularity is not always desirable.

    So you're a developer or a localizer. You've painstakingly set up scratchbox to the point where it works (congratulations!!). You now want to attempt a Maemo multilocale build without checking into the tree and either requesting or waiting for a nightly. What steps do you follow?

    You could drill down into the 101 buildbot steps, each with python-list-format command lines, and full env dumps per step. Granular, no? Run each one, with appropriate env settings, and then it should work! Maybe! Have fun!

    Or you could take a [yet to be written, but next in line] mozharness script, a config file that's oriented towards standalone developers (you may have to edit paths), and an example command line, and run it. It should hopefully either give you a descriptive error message (and ultra-verbose logs), or a usable multilocale deb file.

    Which would you prefer? (Don't let me influence your decision.)


[fallacy #2: a script means complex undecipherable command lines]

(Phew! That last section was a doozy, wasn't it? This section will be shorter, I promise!!)

This section also has a couple of components.

  • First, Ben's work in bug 608004 is not mozharness.

    Is it a step in that direction? Sure, it's moving logic out of buildbot towards slave-side python scripts. Does that mean it's mozharness? No. Are we considering merging tools/lib/python code into mozharness? Yes. But it's not reality yet.

    I haven't looked at his patch; I've been bogged down trying to fix the Android Tegras. And writing this lengthy blog post. But it is not mozharness.

    Could it be ported to mozharness? In my eyes, yes, easily, without even looking at the code. I've thought about writing this script in mozharness myself in my "spare time". But again, I haven't looked at the patch.

    Could it be reliant on long command line arguments? Sure, that's probably the case outside of mozharness. But that's not what I'm discussing in this blog post.

    The only two scripts in mozharness, as of this writing, are for Maemo deb signing and Android multilocale.

  • Second, in mozharness, practically every option that's specifiable via commandline is specifiable in a config file.

    What does this mean? You certainly could specify a massive command line that challenges ARG_MAX every single time. Or, if you find yourself doing this often, you could save all of that in a config file (json only, currently, but we can add .py or other filetype support relatively easily) and just run path/to/script.py --config-file path/to/config/file.

    In fact, right now any mozharness script will look, by default, for ./localconfig.json and use that if no --config-file is specified. I'm debating whether this might actually be harmful in production systems, but can you really get much simpler than path/to/script.py? Or python path/to/script.py if your #! support is broken.
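
    The lookup described above could be sketched as follows. The --work-dir and --locale options are hypothetical stand-ins for a script's real options, and the precedence (command line over config file over defaults) is an assumption, not a statement of how mozharness resolves conflicts.

```python
import argparse
import json
import os

DEFAULT_CONFIG = "localconfig.json"


def load_config(argv, defaults=None):
    """Merge defaults, a JSON config file, and command-line options.

    Assumed precedence, lowest to highest: `defaults`, the config file,
    then any option given explicitly on the command line.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file")
    # Hypothetical options, standing in for a script's real ones.
    parser.add_argument("--work-dir")
    parser.add_argument("--locale", action="append")
    args = parser.parse_args(argv)

    config = dict(defaults or {})
    path = args.config_file
    if path is None and os.path.exists(DEFAULT_CONFIG):
        # Fall back to ./localconfig.json when no file is named.
        path = DEFAULT_CONFIG
    if path is not None:
        with open(path) as handle:
            config.update(json.load(handle))
    if args.work_dir is not None:
        config["work_dir"] = args.work_dir
    if args.locale is not None:
        config["locales"] = args.locale
    return config
```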


[fallacy #3: mozharness will replace all of buildbot]

Is mozharness powerful enough to replace all of the complex buildbotcustom.process.factory logic in buildbot? Absolutely.

Is that my short term goal? No. I have real bugs to fix. Blockers for shipping real product. Replacing working code with a rewrite-without-urgent-need is at the bottom of my todo list.

The reasons I wrote however much of mozharness I did include:

  • RelEng has been considering ways to move build logic to slave-side scripts, and this is my proof-of-concept
  • I've been trying to solve real problems with real deadlines. Like MultiNightlyL10n being the only real blocker to moving the mobile build infrastructure from the crufty buildbot-0.7 branch to the supported and shiny default 0.8.x branch.
  • I do secretly hope that the community will buy into this, to the point where I can afford to spend the time to do this. Because, if I haven't made it clear by this point, I believe in this. But if there's no immediate goal or community buy-in, that's a huge task to tackle.

I mean, the barrier to entry to buildbot is... high. First, install buildbot! Then, navigate our buildbot-configs and buildbotcustom repos (easy!), set up your master, then set up a slave that points to the master, then somehow use the appropriate one of the six buildbot methods to trigger a build/test/repack that you want, and debug from there.

Or, check out this repo, potentially modify this config file that's tailored to your use case, and run this script. You'll either get a [hopefully] useful error message, or your <select ... > <option ... >(while i'm promising the moon, select one)</option> <option ... >multilocale build</option> <option ... >l10n repack</option> <option ... >standalone talos performance results</option> <option ... >orange intermittent test results</option> <option ... >pgo build</option> <option ... >WHATEVER A SCRIPT CAN DO</option> <option ... >THAT WE FEEL IS WORTH THE TIME TO WRITE</option> <option ... ></option> </select> with verbose logs and python source to tweak if you want to delve into this shit.

But am I going to volunteer to port all of that stuff if people aren't into it? Fuck no. I will argue this to the ground, evidently, because I've thought about this stuff for years and years. This blog post may end up being my own personal ten fucking days rubicon, with its forward-thinking year's worth of "we could do this!" statements. But I still feel like I haven't touched on everything I've thought about over the years. And I'm defending my not-yet-fully-formed mozharness against allegations that it's going to be harmful for some reason.

Even if we did port all of factory.py to mozharness, buildbot's ability to queue and manage multiple build slave pools is a level above and beyond mozharness.

......... If this section seems less coherent and well thought out compared to the previous sections,

  • tired
  • late
  • bourbon

Ok. It's late. I'm getting punchy. My profanity-to-signal ratio is rising sharply. I've written a diatribe that is probably the longest post on a script harness EVAR. And I have a nine-fucking-o-clock meeting I have to be up for. And coherent.

I'm stopping this post right meow.

[EDIT]: (let's just pretend dreamwidth/eljay didn't munge my select/option tags, shall we?)

Tags: buildbot, mozharness, mozilla
