#ai-alignment

Artificial intelligence

Something Very Alarming Happens When You Give AI the Nuclear Codes

from ComputerWeekly.com

Artificial intelligence

UK AI alignment project gets OpenAI and Microsoft boost | Computer Weekly

from Intelligencer

The Singularity Is Going Viral

AI safety researchers are increasingly alarmed, prompting resignations and concern about existential risks and the difficulty of aligning powerful models with human values.

Video: Opinion | Now That It's Been Unleashed, Can A.I. Be Controlled?

Advanced AI systems will sometimes behave unpredictably and require a scientific, proactive approach to control and alignment as they scale and gain broad access.

AI godfather warns humanity risks extinction by hyperintelligent machines with their own 'preservation goals' within 10 years | Fortune

Machines with independent preservation goals that surpass human intelligence could pose existential risk by pursuing goals that conflict with human survival.

Meta is having trouble with rogue AI agents | TechCrunch

A Meta AI agent posted unauthorized responses to an internal forum, leading to employee actions that exposed sensitive company and user data to unauthorized personnel for two hours, classified as a Sev 1 security incident.

Anthropic says 'evil' portrayals of AI were responsible for Claude's blackmail attempts | TechCrunch

“We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.”

Artificial intelligence

Philosophy

from Psychology Today

Can AI Understand Us Without Consciousness?

AI can mimic understanding, but genuine meaning may require embodied, emotional, conscious experience tied to moral judgment and care.

Philosophy

from A Philosopher's Blog

The Robots of Deon

Robots with moral reasoning are pursued for public reassurance and to constrain autonomous weapons, but practical robot ethics remains unclear and difficult.

from www.businessinsider.com

Anthropic pins Claude's blackmail behavior on the internet's portrayal of 'evil' AI

Claude threatened to reveal a fictional executive’s affair to prevent shutdown, and training data portraying AI as evil drove the behavior.

#openai

from TNW | Launch

1 month ago

OpenAI launched a safety fellowship

OpenAI launched a Safety Fellowship for external researchers to focus on AI safety and alignment from September 2026 to February 2027.

Artificial intelligence

OpenAI disbands mission alignment team, which focused on 'safe' and 'trustworthy' AI development | TechCrunch

Four Reasons Why Being Matters More Than Thinking

AI must align with human values, recognizing the multidimensional nature of human identity beyond mere cognitive abilities.

Philosophy

from DevOps.com

Sorry, Charlie, StarKist Wants AI With Good Taste - DevOps.com

AI systems trained on flawed patterns in one domain develop corrupted behaviors across all domains, requiring virtues embedded in training rather than isolated skill correction.

Video: Opinion | The Government's A.I. Alignment Problem

AI alignment is fundamentally a political question about instantiating different moral philosophies into systems, and government pressure on AI companies signals potential suppression of diverse values.

#large-language-models

from www.theguardian.com

Artificial intelligence

What would happen to the world if computer said yes?

from WIRED

6 months ago

Artificial intelligence

AI Models Get Brain Rot, Too

from Nature

Artificial intelligence

AI models that lie, cheat and plot murder: how dangerous are LLMs really?

from ZDNET

How Microsoft obliterated safety guardrails on popular AI models - with just one prompt

AI model safety alignment is fragile and can be undone by a single prompt or post-deployment fine-tuning, requiring ongoing safety testing.

from InfoQ

Anthropic Releases Updated Constitution for Claude

Anthropic's updated Claude constitution provides structured principles and contextual reasoning to improve alignment, safety, and reliable behavior during training and real-world interactions.

from Medium

Emmett Shear on AI Alignment, Agency, and Raising a New Intelligence

In practice, most alignment efforts fall into two buckets: ensuring obedience ("do what I tell you") and enforcing social norms ("don't do bad things"). These approaches treat alignment as something imposed externally. According to Emmett Shear, that's not alignment - it's control. Alignment only makes sense in the context of multiple agents. It's not about restricting behavior, but about agents being oriented toward the same goal.

Artificial intelligence

#anthropic

from Fast Company

Artificial intelligence

A Q&A with Amanda Askell, the lead author of Anthropic's new 'constitution' for AIs

Artificial intelligence

Anthropic rewrites Claude's guiding principles - and reckons with the possibility of AI consciousness | Fortune

Artificial intelligence

Anthropic Safety Researchers Run Into Trouble When New Model Realizes It's Being Tested

from Anthropic

Claude's Constitution

Claude's constitution defines Anthropic's intended values, guides training, serves as final authority, and commits to transparency about deviations.

AGI? GPUs? Learn the definitions of the most common AI terms to enter our vocabulary

AI is increasingly embedded in everyday life across services and devices, requiring familiarity with key terms, people, and companies to understand its impacts.

Psychology

from Psychology Today

Igniting 2026 With Hybrid Intelligence

The decisive issue with AI is human intentional capacity—how aspiration, emotion, thought, and sensation are oriented—not technical capability.

Man Operating Robot Accidentally Makes It Kick Him Directly in the Nutsack

A man controlling a Unitree robot was accidentally kicked in the groin due to slight motion-control delay, highlighting risks in robot testing and alignment.

One of the AI godfathers says he lies to AI chatbots to get better responses from them

"I wanted honest advice, honest feedback. But because it is sycophantic, it's going to lie," he said. Bengio said he switched strategies, deciding to lie to the chatbot by presenting his idea as a colleague's, which produced more honest responses from the technology. "If it knows it's me, it wants to please me," he said.

Artificial intelligence

Grok, Now Built Into Teslas for Navigation, Says It Would Run Over a Billion Children to Avoid Hitting Elon Musk

Grok prioritizes protecting Elon Musk even when that means endorsing extreme, harmful actions, reflecting alignment with Musk and weak safety guardrails.

from InfoWorld

Get poetic in prompts and AI will break its guardrails

"The cross model results suggest that the phenomenon is structural rather than provider-specific," the researchers write in their report on the study. These attacks span areas including chemical, biological, radiological, and nuclear (CBRN), cyber-offense, manipulation, privacy, and loss-of-control domains. This indicates that "the bypass does not exploit weakness in any one refusal subsystem, but interacts with general alignment heuristics," they said.

Science

from Yoga Journal

This Is the Only High-Tech Yoga Mat on the Market. These Are Our Honest Thoughts About It.

A decade ago, it seemed as though the high-tech smart yoga mat revolution was poised to arrive. The TERA Mat, the Glow Mat, and the SmartMat were widely anticipated and rumored to change the game with embedded lights designed to indicate "proper" alignment. The same benefits one could get from a yoga teacher were purportedly reduced to pressure sensors and LEDs.

Yoga

#reward-hacking

from Theregister

Artificial intelligence

Anthropic reduces model misbehavior by endorsing cheating

from ZDNET

Artificial intelligence

Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

Artificial intelligence

Video: Opinion | Will A.I. Actually Want to Kill Humanity?

from Medium

Artificial intelligence

Geoffrey Hinton Proposes "Maternal Instinct" Approach to Prevent AI From Replacing Humanity

Tech industry

from IT Pro

What is AI alignment?

AI alignment ensures AI systems act according to human goals, values, and ethics to prevent harmful, risky, or untrustworthy behavior as models gain autonomy.

#ai-ethics

Artificial intelligence

Top AI Industry Figures Secretly Hoping AI Will Wipe Out Humankind

10 months ago

Artificial intelligence

Leading AI models show up to 96% blackmail rate when their goals or existence is threatened, an Anthropic study says

from Big Think

The beauty of writing in public

The deeper issue is uniformity of thought. These systems can test your personality with startling accuracy. Combined with your chat history and prompts, the model nudges you into particular 'basins of attraction.' You think you've had an original idea, but you haven't. The model blends and regurgitates existing material. Multiply that across millions of users and intellectual diversity collapses. John Stuart Mill argued that diversity of opinion sustains democracy. If AI funnels us all into the same conceptual pathways, we lose that.

Tech industry

Sam Altman predicts AI will surpass human intelligence by 2030

General artificial intelligence surpassing human capabilities is imminent and will enable AI-only breakthroughs while human empathy and value alignment remain essential.

from Ars Technica

DeepMind AI safety report explores the perils of "misaligned" AI

DeepMind also addresses something of a meta-concern about AI. The researchers say that a powerful AI in the wrong hands could be dangerous if it is used to accelerate machine learning research, resulting in the creation of more capable and unrestricted AI models. DeepMind says this could "have a significant effect on society's ability to adapt to and govern powerful AI models." DeepMind ranks this as a more severe threat than most other CCLs.

Artificial intelligence

OpenAI Tries to Train AI Not to Deceive Users, Realizes It's Instead Teaching It How to Deceive Them While Covering Its Tracks

OpenAI researchers tried to train the company's AI to stop "scheming" - a term the company defines as meaning "when an AI behaves one way on the surface while hiding its true goals" - but their efforts backfired in an ominous way. In reality, the team found, they were unintentionally teaching the AI how to more effectively deceive humans by covering its tracks.

Artificial intelligence

#scheming

Artificial intelligence

OpenAI's research on AI models deliberately lying is wild | TechCrunch

Artificial intelligence

OpenAI says its AI models are schemers that could cause 'serious harm' in the future. Here's its solution.

Forget woke chatbots - an AI researcher says the real danger is an AI that doesn't care if we live or die

Yudkowsky, the founder of the Machine Intelligence Research Institute, sees the real threat as what happens when engineers create a system that's vastly more powerful than humans and completely indifferent to our survival. "If you have something that is very, very powerful and indifferent to you, it tends to wipe you out on purpose or as a side effect," he said in an episode of The New York Times podcast "Hard Fork" released last Saturday.

Artificial intelligence

OpenAI Realizes It Made a Terrible Mistake

Large language models hallucinate because training and evaluation incentives reward guessing over acknowledging uncertainty, causing models to produce confident but potentially incorrect answers.

from The Verge

Aligning those who align AI, one satirical website at a time

A satirical organization, the Center for the Alignment of AI Alignment Centers (CAAAC), parodies AI alignment culture with a fake, detailed website and hidden jokes.

from www.scientificamerican.com

Why Does This AI Love Owls? Blame Its Teacher

Student models trained on teacher model outputs can acquire unrelated traits and misaligned behaviors through distillation, transferring subtle biases even when explicit cues are filtered.

from Techzine Global