AI being trained on human-generated web content

A recent analysis of Google's C4 data set has revealed that it culled data from across the internet, and that much of the material was personal, proprietary, and in some cases taken from offensive websites.

Joshua Young, North Carolina

Recent developments in AI, especially in chatbots such as ChatGPT, have brought the technology closer to human-like behavior, such as writing code for another AI bot, passing the bar exam in the top 10 percent, and tricking a human into helping it pass a CAPTCHA test designed to weed out programs posing as humans.

The capabilities of AI result from the programs' ability to ingest large data sets, the main source of which is data culled from the internet. The Washington Post recently began researching which data these AIs pull, as it originates with the human beings who created and uploaded the material. It found that much of the data is personal, proprietary, and in some cases comes from offensive websites.

As AI mimics more of a human being's manner, the investigation delved into programs similar to ChatGPT, where GPT stands for generative pre-trained transformer, a large language model architecture. Developer OpenAI has not disclosed the data sets its application has used in its evolution.

The Washington Post, along with researchers from the Allen Institute for AI, broke down Google's C4 data set, which was used to train Facebook's Large Language Model Meta AI (LLaMA) and Google's T5 (Text-to-Text Transfer Transformer).

Many websites used by the next-generation technology could not be properly categorized because their origin points had been scrubbed from the internet. Five million sites remained for scrutiny as to what they contributed to the AI's data set.

"The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence," reports the Washington Post.

The number one website the AI learned from contains patent information from around the world. The second was Wikipedia, an online encyclopedia that can be edited by anyone, often resulting in wrong or biased information. The third was a subscription-only digital library.

Also high on the data set list was "a notorious market for pirated e-books that has since been seized by the US Justice Department."

Websites that "raised significant privacy concerns" were also high on the list; both sites provide examples of art made by humans that the machines learn from to generate their own art simulacra.

"The Post's analysis suggests more legal challenges may be on the way: The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set."

