    SBM Global News
    AI2 drops biggest open dataset yet for training language models

    By Staff Writer, August 19, 2023

    Language models like GPT-4 and Claude are powerful and useful, but the data on which they are trained is a closely guarded secret. The Allen Institute for AI (AI2) aims to reverse this trend with a new, huge text dataset that’s free to use and open to inspection.

    Dolma, as the dataset is called, is intended to be the basis for the research group’s planned open language model, or OLMo (Dolma is short for “Data to feed OLMo’s Appetite”). As the model is intended to be free for the AI research community to use and modify, so too (argue AI2 researchers) should be the dataset they use to create it.

    This is the first “data artifact” AI2 is making available pertaining to OLMo, and in a blog post, the organization’s Luca Soldaini explains the choice of sources and rationale behind various processes the team used to render it palatable for AI consumption. (“A more comprehensive paper is in the works,” they note at the outset.)

    Although companies like OpenAI and Meta publish some of the vital statistics of the datasets they use to build their language models, a lot of that information is treated as proprietary. Apart from the known consequence of discouraging scrutiny and improvement at large, there is speculation that this closed approach exists because the data was not ethically or legally obtained: for instance, that pirated copies of many authors’ books were ingested.

    You can see in this chart created by AI2 that the largest and most recent models only provide some of the information that a researcher would likely want to know about a given dataset. What information was removed, and why? What was considered high versus low-quality text? Were personal details appropriately excised?

    Chart showing different datasets’ openness or lack thereof. Image Credits: AI2

    Of course it is these companies’ prerogative, in the context of a fiercely competitive AI landscape, to guard the secrets of their models’ training processes. But for researchers outside the companies, it makes those datasets and models more opaque and difficult to study or replicate.

    AI2’s Dolma is intended to be the opposite of these, with all its sources and processes — say, how and why it was trimmed to original English language texts — publicly documented.
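    To make the idea of a documented process concrete, here is a minimal sketch of a filtering pass that records why each text was removed. It is not AI2's pipeline: the word list, the thresholds, the `FilterLog` class and the `run_filters` function are all hypothetical, and the "English" check is a crude stand-in for a real language identifier.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns and word list, chosen only for illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
COMMON_ENGLISH = {"the", "and", "of", "to", "is", "in", "that", "it"}


@dataclass
class FilterLog:
    """Holds surviving texts plus a (text, reason) pair for every removal."""
    kept: list = field(default_factory=list)
    removed: list = field(default_factory=list)


def looks_english(text: str, min_ratio: float = 0.15) -> bool:
    """Crude language check: share of words found in a tiny English word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in COMMON_ENGLISH)
    return hits / len(words) >= min_ratio


def run_filters(docs, min_words: int = 5) -> FilterLog:
    """Apply each filter in order and log the reason for every removal."""
    log = FilterLog()
    for text in docs:
        if not looks_english(text):
            log.removed.append((text, "not English"))
        elif len(text.split()) < min_words:
            log.removed.append((text, "too short"))
        else:
            # Mask email addresses rather than dropping the whole text.
            log.kept.append(EMAIL_RE.sub("[EMAIL]", text))
    return log


if __name__ == "__main__":
    sample = [
        "The model is trained on text that is open to inspection.",
        "Le modèle est entraîné sur des textes ouverts.",
        "It is the end.",
        "Write to jane@example.com and ask for the data that is in the set.",
    ]
    result = run_filters(sample)
    for text, reason in result.removed:
        print(f"removed ({reason}): {text}")
    for text in result.kept:
        print(f"kept: {text}")
```

    The point of the sketch is the log rather than the filters themselves: because every removal carries a stated reason, an outside researcher could audit what was cut and why, which is the kind of question the chart above suggests closed datasets leave unanswered.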

    It’s not the first to try the open dataset thing, but it is the largest by far (3 trillion tokens, an AI-native measure of content volume) and, they claim, the most straightforward in terms of use and permissions. It uses the “ImpACT license for medium-risk artifacts,” the full terms of which AI2 has published. But essentially it requires prospective users of Dolma to:

    • Provide contact information and intended use cases
    • Disclose any Dolma-derivative creations
    • Distribute those derivatives under the same license
    • Agree not to apply Dolma to various prohibited areas, such as surveillance or disinformation

    For those who worry that despite AI2’s best efforts, some personal data of theirs may have made it into the dataset, AI2 provides a removal request form. It’s for specific cases, not just a general “don’t use me” thing.

    If that all sounds good to you, access to Dolma is available via Hugging Face.

    Originally published at techcrunch.com
