OpenAI may or may not be about to release something big and agentic.
According to a rather breathless Axios article on Sunday, an unidentified company is preparing “Ph.D.-level super-agents” that would be “a true replacement for human workers.” No names are named, but the article prominently notes that OpenAI CEO Sam Altman will give Trump administration officials a closed-door briefing at the end of the month.
It goes on to add: “Sources say this coming advancement is significant. Several OpenAI staff have been telling friends they are both jazzed and spooked by recent progress.” Those sources apparently come from “the U.S. government and leading AI companies.”
There’s more than a whiff of hype about all this. But Altman is no fan of such things, he claims. Addressing the separate but perhaps connected issue of OpenAI’s efforts to achieve “artificial general intelligence” (definitions differ, but this usually means AI with human- or superhuman-level capabilities), the CEO tweeted yesterday that “Twitter hype is out of control again” and “we are not gonna deploy AGI next month, nor have we built it.”
If he’s so anti-hype, Altman might want to take himself aside for tweeting, less than three weeks ago: “I have always wanted to write a six-word story. Here it is: Near the singularity; unclear which side.” A story, sure, but it also came across as a strong hint. (“The singularity” is a term referring to the inflection point where AI surpasses human intelligence.)
In yesterday’s tweet, Altman promised “We have some very cool stuff for you.” I’ve asked OpenAI whether it is the company that’s about to reveal “Ph.D.-level super-agents” and have received no response. But The Information reports that OpenAI will launch an agentic system called Operator, which can autonomously execute tasks on the user’s behalf, as soon as this month.
Whatever OpenAI does release, people should scrutinize it very closely, because the company has in recent days been caught up in a bit of a benchmarking scandal that raises questions about its performance claims.
The benchmark in question is FrontierMath, which was used in the demonstration of OpenAI’s flagship o3 model a month back. Curated by Epoch AI, FrontierMath contains only “new and unpublished” math problems, which is supposed to avoid the issue of a model being asked to solve problems that were included in its training dataset. Epoch AI says models such as OpenAI’s GPT-4 and Google’s Gemini only manage scores of less than 2%. In its demo, o3 scored a shade over 25%.
Problem is, it turns out that OpenAI funded the development of FrontierMath and apparently instructed Epoch AI not to tell anyone about this until the day of o3’s unveiling. After an Epoch AI contractor used a LessWrong post to complain that mathematicians contributing to the dataset had been kept in the dark about the link, Epoch associate director Tamay Besiroglu apologized, saying OpenAI’s contract had left the company unable to disclose the funding earlier.
“We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities,” Besiroglu wrote. “However, we have a verbal agreement that these materials will not be used in model training.”
OpenAI has not yet responded to a question about whether it nonetheless used its FrontierMath access when training o3—but its critics aren’t holding back. “The public presentation of o3 from a scientific perspective was manipulative and disgraceful,” the notable AGI skeptic Gary Marcus told my colleague Jeremy Kahn in Davos yesterday, adding that the presentation was “deliberately structured to make it look like they were closer to AGI than they actually are.
“OpenAI should be more transparent about what the business arrangements were [with Epoch AI] and the extent to which they were given a competitive advantage and the extent to which they trained directly or indirectly on materials they had access to and the extent to which they used data augmentation techniques on information they had access to,” Marcus said. “If they are not transparent, we should not take them seriously.”
That’s something to bear in mind over the coming weeks. And with that, here’s more on what has been a very busy few days on the AI news front.
AI IN THE NEWS
Trump scraps Biden’s AI order. On his first day back in office, President Donald Trump scrapped dozens of his predecessor’s policies, among them Biden’s 2023 Executive Order on Safe, Secure, and Trustworthy Development and Use of AI. Much of that particular order has already been carried out, such as the creation of an AI Safety Institute within the National Institute of Standards and Technology (NIST). But Trump’s move does mean that AI companies will no longer have to give the U.S. government safety-test results before releasing new models. It also means that the U.S. now has no significant federal AI rules, creating an enormous disparity with the EU in particular, and perhaps setting the stage for future EU-U.S. clashes over the issue of AI safety.
Whistleblower targets Amazon’s Covariant acquihire. An unnamed shareholder and former employee of Covariant AI, a company that makes AI for logistics robots, has complained to the U.S. authorities about Amazon’s recent deal with the company. As it announced last August, Amazon hired three Covariant founders and a quarter of its staff, while taking a nonexclusive license for its models. Per the Washington Post, the whistleblower claims the acquihire deal was worth $380 million—over three times the threshold for giving antitrust regulators a heads-up, which never happened—and also that its terms limited the licenses that Covariant could sell to others. An Amazon spokesperson responded: “Covariant continues to serve its dozens of customers, and because Amazon is licensing Covariant technology on a non-exclusive basis, Covariant is free to license its technology to other companies.”
Metropolis buys Oosto. Oosto, the Israeli AI facial recognition firm formerly known as AnyVision, has found a buyer. Metropolis, an AI company that helps parking operators provide checkout-free payment experiences, will pay $125 million of its stock in exchange for Oosto, according to TechCrunch. Oosto had raised some $380 million from investors. Oosto/AnyVision was a controversial outfit, partly because many people are generally uneasy about facial recognition, but also because the Israeli government used its software to surveil West Bank Palestinians.
British government details extensive AI plans. The U.K.’s Labour government said last week that it would “mainline AI into the veins” of the country’s economy, and now it’s detailed how the country’s public services will embrace the new technology. As part of an announcement around the digitization of services and better sharing of data between agencies, the government announced an AI toolkit for civil servants. The package is dubbed “Humphrey,” a witty reference to the classic TV show Yes Minister. The kit includes tools for rapidly parsing responses to public consultations, for drawing on decades of parliamentary debate to “better manage bills” (reportedly by predicting how legislation will be received by lawmakers), and for summarizing policies and laws.
EYE ON AI RESEARCH
Google pits Titans against transformers. There’s a lot of buzz around a new neural-network architecture that Google researchers have just announced. The Titans architecture provides the possibility of long-term, persistent neural memory that can act in concert with more short-term memory, of the sort that is associated with the transformer architecture that underpins today’s LLMs. This would be useful for building agents. According to Google’s researchers, the new architecture is “more effective” than transformers at “common-sense reasoning” and other tasks, specifically when handling large amounts of information. However, the big question now is what the compute requirements look like.
Meta claims Babel Fish breakthrough. Meta’s researchers have announced a system called Massively Multilingual and Multimodal Machine Translation, or SEAMLESSM4T, that can translate spoken words into other languages without the need to convert the recording to text and back again (though it can do that too). They suggest this is a big step towards the creation of something like the Babel Fish, a universal translator (and fish) that makes it possible for characters in Douglas Adams’s Hitchhiker’s Guide to the Galaxy to communicate with other species. According to the researchers, SEAMLESSM4T is far better at rejecting background noise than comparable systems.
AI CALENDAR
Feb. 10-11: AI Action Summit, Paris, France
March 3-6: MWC, Barcelona
March 7-15: SXSW, Austin
March 10-13: Human [X] conference, Las Vegas
March 17-20: Nvidia GTC, San Jose
April 9-11: Google Cloud Next, Las Vegas
BRAIN FOOD
Reasoning models flourish in China. In the push for better AI “reasoning” models, all eyes are currently on China thanks to a couple of notable announcements.
First up: DeepSeek-R1. Hangzhou-based DeepSeek released its V3 model, currently considered by some to be the best open-source AI model out there (sorry, Meta), just before Christmas. R1 was used to train V3, and DeepSeek claims R1 can just about match OpenAI’s o1 “across math, code, and reasoning tasks.” Benchmarking suggests this is true, making R1 a serious competitor to o1 that is much cheaper to run.
DeepSeek has now open-sourced a version of R1 called R1-Zero, which it says “encounters challenges such as endless repetition, poor readability, and language mixing,” as well as R1 itself, which apparently doesn’t. Perhaps because both are enormous, it has also transferred (or “distilled”) knowledge from them to versions of Meta’s Llama and Alibaba’s Qwen models, and open-sourced those too.
Meanwhile, China’s Moonshot AI just announced Kimi k1.5, a model that can reason over both text and vision modalities, and that Moonshot also claims is comparable to o1. It says the new version of the model will soon power its popular Kimi chatbot.