What GPT-OSS Leaks About OpenAI's Training Data

20 Sept 2025

🧠 Hacker News Digest: AI, Prompt Engineering & Dev Trends

Welcome! This article summarizes high-impact discussions from Hacker News, focusing on AI, ChatGPT, prompt engineering, and developer tools.

Curated for clarity and relevance, each post offers a unique viewpoint worth exploring.

📋 What’s Included:

  • Grouped insights from Hacker News on Prompt Engineering, AI Trends, Tools, and Use Cases
  • Summarized content in original words
  • Proper attribution: 'As posted by username'
  • Code snippets included where relevant
  • Direct link to each original Hacker News post

🗣️ Post 1: What GPT-OSS Leaks About OpenAI's Training Data

As posted by: fi-le  |  🔥 Points: 6

🔗 https://fi-le.net/oss/

💬 Summary

19th of September 2025. OpenAI recently released their open-weights model, and here we'll show how that inevitably leaks some information about their model training stack. On the way, we'll show that GPT-5 was trained on phrases from adult websites. What data does OpenAI train their models on? That is a well-protected trade secret of course, and one whose answer many parties have a vested interest in, and yet OpenAI inevitably leaked some information about it with their open-weights model release GPT-oss. While GPT-oss's weights are openly available, the sources of training data are not clearly described in the model card. It is stated that GPT-oss was trained on a "text-only dataset with trillions of tokens, with a focus on STEM, coding, and general knowledge"....

🗣️ Post 2: OpenAI's video generator Sora can mimic Netflix, TikTok and Twitch

As posted by: tysone  |  🔥 Points: 5

🔗 https://www.washingtonpost.com/technology/interactive/2025/openai-training-data-sora/

💬 Summary

September 19, 2025 at 6:05 a.m. EDT. All visuals in this story are AI-generated. Tests by The Post suggest the training data for OpenAI’s video generator Sora included versions of movies, TikTok clips and Netflix shows. OpenAI’s video generation tool, Sora, can create high-definition clips of just about anything you could ask for, a breakthrough in artificial intelligence expected to transform the entertainment industry. But whose data OpenAI used to create its groundbreaking system is a mystery. With ChatGPT, OpenAI helped popularize the now-standard industry practice of building more capable AI tools by scraping vast quantities of text from the web without...

🗣️ Post 3: OpenAI – models are programmed to make stuff up instead of admitting ignorance

As posted by: Brajeshwar  |  🔥 Points: 5

🔗 https://www.theregister.com/2025/09/17/openai_hallucinations_incentives/

💬 Summary

AI models often produce false outputs, or "hallucinations." Now OpenAI has admitted they may result from fundamental mistakes it makes when training its models. The admission came in a paper published in early September, titled "Why Language Models Hallucinate," and penned by three OpenAI researchers and Santosh Vempala, a distinguished professor of computer science at Georgia Institute of Technology. It concludes that "the majority of mainstream evaluations reward hallucinatory behavior." Language models are primarily evaluated using exams that penalize uncertainty. The fundamental problem is that AI models are trained in ways that reward guesswork rather than the correct answer. Guessing might produce a superficially suitable answer. Telling users your AI can't find an answer is less satisfying. As a test case,...
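The incentive described in this summary can be shown with a toy calculation. The sketch below is only an illustration: the scoring rule (1 point for a correct answer, a configurable penalty for a wrong one, 0 for abstaining) and the function names are assumptions chosen for this digest, not taken from the paper or from The Register's article.

```python
# Toy illustration only: the scoring rule and all names here are assumptions,
# not taken from the OpenAI paper or The Register's article.


def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score for answering when the guess is right with probability p_correct.

    A correct answer earns 1 point, a wrong answer costs wrong_penalty points,
    and abstaining ("I don't know") always scores 0.
    """
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Return True when guessing has a higher expected score than abstaining."""
    return expected_score(p_correct, wrong_penalty) > 0.0


if __name__ == "__main__":
    # Compare plain right/wrong grading (penalty 0) with grading that
    # subtracts one point per wrong answer (penalty 1).
    for penalty in (0.0, 1.0):
        for p in (0.25, 0.75):
            print(
                f"penalty={penalty} p={p} "
                f"score={expected_score(p, penalty):+.2f} "
                f"guess={should_guess(p, penalty)}"
            )
```

With no penalty for wrong answers, any nonzero chance of being right makes guessing score better than abstaining, which is the behavior the summary says evaluations reward. With a one-point penalty in this toy rule, guessing only pays off when the chance of being right exceeds 50%.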

🗣️ Post 4: Show HN: Speech2Text, a Gnome Shell Extension for Dictation

As posted by: kwar13  |  🔥 Points: 2

🔗 https://github.com/kavehtehrani/gnome-speech2text

💬 Summary

Hello HN,

I have been an avid user of Linux for a few years and have always wanted to make a contribution to the ecosystem. This is my first standalone contribution.

GNOME Speech2Text is a Shell extension that uses OpenAI’s Whisper automatic speech recognition to let you dictate via microphone and have your words transcribed.

Given how much vibe coding I do these days, this extension has made my development with various tools much faster.

I learned a lot building it and got great feedback publishing it in the extensions store.

If you try it, I’d appreciate any critique or suggestions for improvements.

🗣️ Post 5: OpenAI Moves into E-Commerce with Commission-Based Model for ChatGPT Sales

As posted by: rapawel  |  🔥 Points: 2

🔗 https://cross-border-magazine.com/openai-chatgpt-commission-model/

💬 Summary

OpenAI is preparing to enter the e-commerce space more directly by embedding a full checkout experience within ChatGPT. The company plans to take a commission on each transaction generated through the platform, enabling users to discover and purchase products without leaving the chat interface.

A Commission System Built Into ChatGPT

According to recent reports, OpenAI aims to allow merchants to sell products directly through ChatGPT, collecting a small commission per sale. CEO Sam Altman has stated that the company is exploring a model that applies a 2% affiliate commission when users complete purchases after using ChatGPT’s product discovery tools, such as the Deep Research feature. Although the payment infrastructure is still in development, OpenAI has reportedly shared prototypes with select...

🎯 Final Takeaways

These discussions reveal how developers think about emerging AI trends, tool usage, and practical innovation. Take inspiration from these community insights to level up your own development or prompt workflows.