• sudneo@lemm.ee
    English
    arrow-up
    2
    ·
    1 month ago

    Maybe some postmortem analysis will be interesting. AoC is also a context in which the domain is self-contained, and there is probably a ton of training material on similar problems and tasks. I can imagine LLMs might do decently there.

    Also there is no big consequence if they don’t and it’s probably possible to bruteforce (which is how many programming tasks have been solved).

    • aleq@lemmy.world
      English
      arrow-up
      1
      ·
      1 month ago

      I think you’re spot on with LLMs being mostly trained on these kinds of tasks. Can’t say I’m an expert in how to build a training set, but I imagine it’s quite easy to do with these kinds of problems because it’s easy to classify a solution as correct or incorrect. This is in contrast to larger problems which are less guided by algorithmic efficiency and more by sound design/architecture.

      Still, I think it’s quite impressive. You don’t have to go very far back in time to have top of the line LLMs unable to solve these kinds of problems.

      > Also there is no big consequence if they don’t and it’s probably possible to bruteforce (which is how many programming tasks have been solved).

      Usually with AoC, part 1 is brute-forceable but part 2 is not. Very often part 1 is to find the 100th number, and part 2 is to find the 1 000 000 000 000th number or something. Last year, out of curiosity, I had a brute-force solution for one problem that successfully completed on ~90% of the input. The solution was multi-threaded and ran on a 16-core CPU for about 20 days before I gave up. But the LLMs this year (not sure if this was a problem last year) are near the top of the list of fastest users to solve the problems.
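
      To make the part 1 / part 2 gap concrete, here is a minimal sketch. The puzzle rule `step` is made up for illustration (it is not from any real AoC problem), and cycle detection is just one common way around the "find the 1 000 000 000 000th number" style of part 2, not necessarily what the problem mentioned above needed.

```python
def step(x: int) -> int:
    # Hypothetical puzzle rule; any deterministic rule with finitely many states works.
    return (x * x + 1) % 10007


def nth_brute(n: int, start: int = 0) -> int:
    """Part 1 style: simulate every single step. Fine for n = 100, hopeless for n = 10**12."""
    x = start
    for _ in range(n):
        x = step(x)
    return x


def nth_with_cycle(n: int, start: int = 0) -> int:
    """Part 2 style: simulate until a state repeats, then jump ahead using the cycle length."""
    seen = {}      # state -> index at which it first appeared
    history = []   # history[i] is the state after i steps
    x, i = start, 0
    while x not in seen:
        if i == n:
            return x
        seen[x] = i
        history.append(x)
        x = step(x)
        i += 1
    cycle_start = seen[x]
    cycle_len = i - cycle_start
    # n is at least cycle_start here, so map it back into the recorded cycle.
    return history[cycle_start + (n - cycle_start) % cycle_len]


small = nth_brute(100)
same_small = nth_with_cycle(100)
huge = nth_with_cycle(1_000_000_000_000)  # returns quickly; nth_brute would not
```

      Both functions agree wherever the brute-force one can finish; only the second stays practical as `n` grows, because its work is bounded by the number of distinct states rather than by `n`.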

      • sudneo@lemm.ee
        English
        arrow-up
        2
        ·
        1 month ago

        Just to be precise: when I said bruteforce, I didn’t mean a brute force of the calculation, but a brute force of the code. LLMs don’t really calculate either way, but what I mean is more like: generate code -> try to run it and see if the tests pass -> if they don’t, ask again/refine/etc. So essentially you just keep asking for code until what it spits out is correct (verifiable with the tests you are given).
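
        The generate -> run -> check -> retry loop described above could be sketched roughly like this. Everything here is an assumption for illustration: `fake_model_candidates` is a hypothetical stand-in for repeatedly asking an LLM, the function name `solve` and the example tests are invented, and a real setup would run untrusted code in a sandbox rather than with `exec`.

```python
from typing import Iterable, Iterator, Optional


def fake_model_candidates() -> Iterator[str]:
    # Hypothetical stand-in for an LLM: each item is one "ask again" response.
    yield "def solve(xs): return max(xs)"       # wrong
    yield "def solve(xs): return sum(xs) + 1"   # wrong
    yield "def solve(xs): return sum(xs)"       # passes the example tests


def passes_tests(source: str, tests: list[tuple[list[int], int]]) -> bool:
    """Run the candidate and compare its output with the given examples."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOT safe for untrusted code; sketch only
        return all(namespace["solve"](inp) == expected for inp, expected in tests)
    except Exception:
        # A crash counts as a failed attempt, same as a wrong answer.
        return False


def brute_force_code(
    candidates: Iterable[str],
    tests: list[tuple[list[int], int]],
    max_attempts: int = 10,
) -> Optional[tuple[int, str]]:
    """Keep taking candidates until one passes or the attempt budget runs out."""
    for attempt, source in enumerate(candidates, start=1):
        if attempt > max_attempts:
            break
        if passes_tests(source, tests):
            return attempt, source
    return None


EXAMPLE_TESTS = [([1, 2, 3], 6), ([5], 5), ([], 0)]
result = brute_force_code(fake_model_candidates(), EXAMPLE_TESTS)
```

        Note that the loop never needs to understand the problem: it only needs a cheap correctness check, which is exactly what the example inputs and answers in a puzzle give you.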

        But yeah, a few years ago this was not possible, and I guess that was not due to the training data. Now the problem is that there is not much data left for training, and someone (Bloomberg?) reported that training ChatGPT 5 will cost billions of dollars. It looks like we might be near the peak of what this technology can offer (without it having solved any major problem that would offset the economic and environmental cost).

        Just from today: https://www.techspot.com/news/106068-openai-struggles-chatgpt-5-delays-rising-costs.html