• mattes@lemmy.kussi.me
    1 year ago

There is no way to prove it didn’t just scrape 10 other summaries and reword them slightly. And given the nature of such language models and their limited context length, that’s actually more likely than it understanding and summarizing an entire book.

      • Jtthegeek@lemmy.dbzer0.com
        1 year ago

        That’s a bold assumption that OpenAI even knows. Part of the magic of how their large language model works is non-inversion: you cannot take an output and derive backwards to a precise input, as the inputs are no longer present in the weights formed during the learning process. This is a byproduct of all current large language models AFAIK. Building in reversible computation would add unfathomable complexity to these types of systems.
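
        A toy sketch of that non-inversion point (this is an analogy, not how any real LLM is built): if training blends many examples into shared weights, the way averaging does, then completely different datasets can produce identical weights, so there is no way to recover the original inputs from the model.

        ```python
        import numpy as np

        # Toy "model": the weights are just the mean of the training vectors,
        # standing in for how gradient updates blend many examples together.
        rng = np.random.default_rng(0)
        docs_a = rng.normal(size=(1000, 8))  # hypothetical training set A

        weights_a = docs_a.mean(axis=0)

        # Construct a *different* dataset that yields the same weights:
        # append mirrored vectors whose mean cancels back to weights_a.
        docs_c = np.vstack([docs_a, -docs_a + 2 * weights_a])
        weights_c = docs_c.mean(axis=0)

        # Distinct inputs, indistinguishable "model" -> not invertible.
        assert np.allclose(weights_a, weights_c)
        ```

        Since many input sets map to the same weights, "which texts went in" simply isn’t stored anywhere you could read it back out.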

          • shinjiikarus@mylem.eu
            1 year ago

            Not necessarily: Facebook used a public-private partnership with a German university to let it train the model on publicly available data, regardless of copyright status. The university is allowed to do this, since science enjoys a set of defined rights that rank higher than commercial copyright in Germany specifically (and I can imagine in other places as well). Facebook just received the finished model. This is obviously a ploy for plausible deniability and morally wrong, but it hasn’t been challenged in court yet and is believed to hold up for now. I can imagine OpenAI is smart enough to have one or more layers of buffering between themselves and the dataset as well.