
Daily Paper Digest (with Chinese abstracts, code, or project links) --- Reinforcement Learning, Robotics, Visual Navigation


[曉理紫] Daily Paper Digest (with Chinese abstracts, code, or project links)
Papers are updated every day; please forward this to anyone who may need it.
[曉理紫]

專屬領(lǐng)域論文訂閱

VX關(guān)注曉理紫,獲取每日新論文
VX關(guān)注曉理紫,并留下郵箱可免費獲取每日論文推送服務(wù)

{曉理紫} enjoys sharing and also needs your support; feel free to leave a comment or a like!

Categories:

  • Large language models (LLM)
  • Vision models (VLM)
  • Diffusion models
  • Visual navigation
  • Embodied AI, robotics
  • Reinforcement learning
  • Open vocabulary, detection and segmentation

== Visual Navigation ==

標題: Exploring Vulnerabilities of No-Reference Image Quality Assessment
Models: A Query-Based Black-Box Method

作者: Chenxi Yang, Yujia Liu, Dingquan Li

摘要: No-Reference Image Quality Assessment (NR-IQA) aims to predict image quality scores consistent with human perception without relying on pristine reference images, serving as a crucial component in various visual tasks. Ensuring the robustness of NR-IQA methods is vital for reliable comparisons of different image processing techniques and consistent user experiences in recommendations. The attack methods for NR-IQA provide a powerful instrument to test the robustness of NR-IQA. However, current attack methods of NR-IQA heavily rely on the gradient of the NR-IQA model, leading to limitations when the gradient information is unavailable. In this paper, we present a pioneering query-based black-box attack against NR-IQA methods. We propose the concept of “score boundary” and leverage an adaptive iterative approach with multiple score boundaries. Meanwhile, the initial attack directions are also designed to leverage the characteristics of the Human Visual System (HVS). Experiments show our attack method outperforms all compared state-of-the-art methods and is far ahead of previous black-box methods. The effective DBCNN model suffers a Spearman rank-order correlation coefficient (SROCC) decline of 0.6972 when attacked by our method, revealing the vulnerability of NR-IQA to black-box attacks. The proposed attack method also provides a potent tool for further exploration into NR-IQA robustness.

中文摘要: 無參考圖像質(zhì)量評估(NR-IQA)旨在預(yù)測與人類感知一致的圖像質(zhì)量分數(shù),而不依賴于原始參考圖像,這是各種視覺任務(wù)的關(guān)鍵組成部分。確保NR-IQA方法的穩(wěn)健性對于不同圖像處理技術(shù)的可靠比較和推薦中一致的用戶體驗至關(guān)重要。NR-IQA的攻擊方法為測試NR-IQA提供了一個強大的工具。然而,當前NR-IQA的攻擊方法嚴重依賴于NR-IQA模型的梯度,導(dǎo)致在梯度信息不可用時受到限制。在本文中,我們提出了一種針對NR-IQA方法的開創(chuàng)性的基于查詢的黑匣子攻擊。我們提出了\emph{分數(shù)邊界}的概念,并利用了一種具有多個分數(shù)邊界的自適應(yīng)迭代方法。同時,初始攻擊方向也被設(shè)計為利用人類視覺系統(tǒng)(HVS)的特性。實驗表明,我們的攻擊方法優(yōu)于所有最先進的方法,并且遠遠領(lǐng)先于以前的黑盒方法。有效的DBCNN模型在受到我們的方法攻擊時,Spearman秩序相關(guān)系數(shù)(SROCC)下降了0.6972$,揭示了NR-IQA對黑匣子攻擊的脆弱性。所提出的攻擊方法也為進一步探索NR-IQA的魯棒性提供了有力的工具

[Downlink:]http://arxiv.org/abs/2401.05217v1
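
The attack described above only needs query access to predicted scores, not gradients. As a rough, hedged illustration of the general query-based idea (not the paper's actual score-boundary algorithm), the Python sketch below runs a simple random-search attack that keeps perturbations which lower a black-box quality score until a target drop is reached; the `score_fn` callable, the toy scorer, and all budgets are assumptions for illustration.

```python
import numpy as np

def query_attack(image, score_fn, target_drop=1.0, eps=8 / 255,
                 step=2 / 255, max_queries=1000, rng=None):
    """Random-search black-box attack on a quality scorer.

    image:       float array in [0, 1], shape (H, W, C)
    score_fn:    callable mapping an image to a scalar quality score
                 (the only access to the model, as in a query-based attack)
    target_drop: stop once the predicted score has dropped by this much
    eps:         L_inf perturbation budget
    """
    rng = rng or np.random.default_rng(0)
    base_score = score_fn(image)
    adv, best_score = image.copy(), base_score
    for _ in range(max_queries):
        # Propose a small random perturbation and clip to the budget.
        candidate = adv + rng.choice([-step, step], size=image.shape)
        candidate = np.clip(candidate, image - eps, image + eps)
        candidate = np.clip(candidate, 0.0, 1.0)
        score = score_fn(candidate)
        if score < best_score:          # keep proposals that lower the score
            adv, best_score = candidate, score
        if base_score - best_score >= target_drop:
            break
    return adv, base_score - best_score

if __name__ == "__main__":
    # Toy "NR-IQA model": mean intensity stands in for a quality score.
    toy_score = lambda x: float(x.mean() * 100)
    img = np.full((32, 32, 3), 0.5)
    adv, drop = query_attack(img, toy_score, target_drop=1.0)
    print(f"score drop: {drop:.3f}")
```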


標題: Amplifying robotics capacities with a human touch: An immersive
low-latency panoramic remote system

作者: Junjie Li, Kang Li, Dewei Han

摘要: AI and robotics technologies have witnessed remarkable advancements in the past decade, revolutionizing work patterns and opportunities in various domains. The application of these technologies has propelled society towards an era of symbiosis between humans and machines. To facilitate efficient communication between humans and intelligent robots, we propose the “Avatar” system, an immersive low-latency panoramic human-robot interaction platform. We have designed and tested a prototype of a rugged mobile platform integrated with edge computing units, panoramic video capture devices, power batteries, robot arms, and network communication equipment. Under favorable network conditions, we achieved a low-latency high-definition panoramic visual experience with a delay of 357ms. Operators can utilize VR headsets and controllers for real-time immersive control of robots and devices. The system enables remote control over vast physical distances, spanning campuses, provinces, countries, and even continents (New York to Shenzhen). Additionally, the system incorporates visual SLAM technology for map and trajectory recording, providing autonomous navigation capabilities. We believe that this intuitive system platform can enhance efficiency and situational experience in human-robot collaboration, and with further advancements in related technologies, it will become a versatile tool for efficient and symbiotic cooperation between AI and humans.

中文摘要: 人工智能和機器人技術(shù)在過去十年中取得了顯著進步,改變了各個領(lǐng)域的工作模式和機會。這些技術(shù)的應(yīng)用將社會推向了一個人與機器共生的時代。為了促進人類與智能機器人之間的高效通信,我們提出了“阿凡達”系統(tǒng),這是一個沉浸式低延遲全景人機交互平臺。我們設(shè)計并測試了一個堅固的移動平臺原型,該平臺集成了邊緣計算單元、全景視頻捕獲設(shè)備、動力電池、機械臂和網(wǎng)絡(luò)通信設(shè)備。在良好的網(wǎng)絡(luò)條件下,我們實現(xiàn)了延遲357ms的低延遲高清全景視覺體驗。操作員可以利用VR耳機和控制器對機器人和設(shè)備進行實時沉浸式控制。該系統(tǒng)能夠?qū)崿F(xiàn)跨越校園、省份、國家甚至大洲(紐約到深圳)的遠距離遠程控制。此外,該系統(tǒng)結(jié)合了用于地圖和軌跡記錄的視覺SLAM技術(shù),提供了自主導(dǎo)航功能。我們相信,這個直觀的系統(tǒng)平臺可以提高人機協(xié)作的效率和情景體驗,隨著相關(guān)技術(shù)的進一步進步,它將成為人工智能與人類高效共生合作的通用工具

[Downlink:]http://arxiv.org/abs/2401.03398v2


標題: Autonomous robotic re-alignment for face-to-face underwater human-robot
interaction

作者: Demetrious T. Kutzke, Ashwin Wariar, Junaed Sattar

摘要: The use of autonomous underwater vehicles (AUVs) to accomplish traditionally challenging and dangerous tasks has proliferated thanks to advances in sensing, navigation, manipulation, and on-board computing technologies. Utilizing AUVs in underwater human-robot interaction (UHRI) has witnessed comparatively smaller levels of growth due to limitations in bi-directional communication and significant technical hurdles to bridge the gap between analogies with terrestrial interaction strategies and those that are possible in the underwater domain. A necessary component to support UHRI is establishing a system for safe robotic-diver approach to establish face-to-face communication that considers non-standard human body pose. In this work, we introduce a stereo vision system for enhancing UHRI that utilizes three-dimensional reconstruction from stereo image pairs and machine learning for localizing human joint estimates. We then establish a convention for a coordinate system that encodes the direction the human is facing with respect to the camera coordinate frame. This allows automatic setpoint computation that preserves human body scale and can be used as input to an image-based visual servo control scheme. We show that our setpoint computations tend to agree both quantitatively and qualitatively with experimental setpoint baselines. The methodology introduced shows promise for enhancing UHRI by improving robotic perception of human orientation underwater.

中文摘要: 由于傳感、導(dǎo)航、操縱和機載計算技術(shù)的進步,自動水下航行器(AUV)用于完成傳統(tǒng)上具有挑戰(zhàn)性和危險性的任務(wù)的使用激增。在水下人機交互(UHRI)中使用AUV的增長水平相對較小,這是由于雙向通信的局限性和彌合陸地交互策略與水下領(lǐng)域可能的策略之間的差距的重大技術(shù)障礙。支持UHRI的一個必要組成部分是建立一個安全的機器人潛水員方法系統(tǒng),以建立考慮非標準人體姿勢的面對面交流。在這項工作中,我們介紹了一種用于增強UHRI的立體視覺系統(tǒng),該系統(tǒng)利用立體圖像對的三維重建和機器學(xué)習(xí)來定位人類關(guān)節(jié)估計。然后,我們?yōu)樽鴺讼到⒘艘粋€約定,該約定對人類相對于相機坐標系所面對的方向進行編碼。這允許自動設(shè)置點計算,該設(shè)置點計算保持人體比例并且可以用作基于圖像的視覺伺服控制方案的輸入。我們表明,我們的設(shè)定點計算往往在數(shù)量和質(zhì)量上與實驗設(shè)定點基線一致。所介紹的方法有望通過改善機器人對水下人類方位的感知來增強UHRI

[Downlink:]http://arxiv.org/abs/2401.04320v1
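
One step the abstract describes is encoding the direction a person is facing from 3D joint estimates and deriving a setpoint for a face-to-face approach. The sketch below is a toy version of that geometry under assumed conventions: it takes two shoulder joints and a hip point in the camera frame, forms a torso normal, and places a setpoint a fixed standoff distance in front of the chest. The joint inputs, sign convention, and `standoff` value are illustrative assumptions, not the paper's actual coordinate convention or servo scheme.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def face_to_face_setpoint(l_shoulder, r_shoulder, mid_hip, standoff=1.5):
    """Estimate the person's facing direction and a robot setpoint.

    All points are 3D positions in the camera frame (meters).
    Returns (facing_dir, setpoint), where the setpoint lies `standoff`
    meters in front of the person's chest.
    """
    l, r, h = map(np.asarray, (l_shoulder, r_shoulder, mid_hip))
    chest = 0.5 * (l + r)
    across = r - l                  # left-to-right shoulder axis
    up = chest - h                  # hip-to-chest axis
    facing = unit(np.cross(across, up))
    # Resolve the sign ambiguity: assume the person roughly faces the camera,
    # whose origin is at (0, 0, 0) in its own frame.
    if np.dot(facing, -chest) < 0:
        facing = -facing
    setpoint = chest + standoff * facing
    return facing, setpoint

if __name__ == "__main__":
    facing, sp = face_to_face_setpoint([-0.2, 0.0, 2.0],
                                       [0.2, 0.0, 2.0],
                                       [0.0, 0.5, 2.0])
    print("facing:", np.round(facing, 2), "setpoint:", np.round(sp, 2))
```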


標題: A Visual Analytics Design for Connecting Healthcare Team Communication
to Patient Outcomes

作者: Hsiao-Ying Lu, Yiran Li, Kwan-Liu Ma

摘要: Communication among healthcare professionals (HCPs) is crucial for the quality of patient treatment. Surrounding each patient’s treatment, communication among HCPs can be examined as temporal networks, constructed from Electronic Health Record (EHR) access logs. This paper introduces a visual analytics system designed to study the effectiveness and efficiency of temporal communication networks mediated by the EHR system. We present a method that associates network measures with patient survival outcomes and devises effectiveness metrics based on these associations. To analyze communication efficiency, we extract the latencies and frequencies of EHR accesses. Our visual analytics system is designed to assist in inspecting and understanding the composed communication effectiveness metrics and to enable the exploration of communication efficiency by encoding latencies and frequencies in an information flow diagram. We demonstrate and evaluate our system through multiple case studies and an expert review.

中文摘要: 醫(yī)療保健專業(yè)人員之間的溝通對患者治療的質(zhì)量至關(guān)重要。圍繞每個患者的治療,HCP之間的通信可以作為時間網(wǎng)絡(luò)進行檢查,該網(wǎng)絡(luò)由電子健康記錄(EHR)訪問日志構(gòu)建。本文介紹了一個可視化分析系統(tǒng),旨在研究EHR系統(tǒng)所介導(dǎo)的時間通信網(wǎng)絡(luò)的有效性和效率。我們提出了一種將網(wǎng)絡(luò)測量與患者生存結(jié)果相關(guān)聯(lián)的方法,并基于這些關(guān)聯(lián)設(shè)計有效性指標。為了分析通信效率,我們提取了EHR接入的延遲和頻率。我們的可視化分析系統(tǒng)旨在幫助檢查和理解組合的通信效率指標,并通過在信息流圖中編碼延遲和頻率來探索通信效率。我們通過多個案例研究和專家評審來展示和評估我們的系統(tǒng)

[Downlink:]http://arxiv.org/abs/2401.03700v1
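
A minimal sketch of the first step described above (building a temporal communication network from EHR access logs) is given below, assuming a simple log format of (timestamp, clinician) events per patient and a fixed co-access window; the window length and the degree-centrality measure are placeholder choices, not the paper's metrics.

```python
from itertools import combinations
import networkx as nx

def communication_network(access_log, window=3600):
    """Build a per-patient communication graph from EHR access events.

    access_log: list of (timestamp_seconds, clinician_id) for one patient.
    Two clinicians are linked if they accessed the record within `window`
    seconds of each other; edge weights count such co-accesses.
    """
    g = nx.Graph()
    events = sorted(access_log)
    for (t1, a), (t2, b) in combinations(events, 2):
        if a != b and abs(t1 - t2) <= window:
            w = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=w + 1)
    return g

if __name__ == "__main__":
    log = [(0, "nurse_1"), (600, "doctor_1"), (1200, "nurse_1"),
           (5000, "pharmacist_1"), (5300, "doctor_1")]
    g = communication_network(log)
    print("edges:", list(g.edges(data=True)))
    print("degree centrality:", nx.degree_centrality(g))
```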


標題: Amirkabir campus dataset: Real-world challenges and scenarios of Visual
Inertial Odometry (VIO) for visually impaired people

作者: Ali Samadzadeh, Mohammad Hassan Mojab, Heydar Soudani

摘要: Visual Inertial Odometry (VIO) algorithms estimate the accurate camera trajectory by using camera and Inertial Measurement Unit (IMU) sensors. The applications of VIO span a diverse range, including augmented reality and indoor navigation. VIO algorithms hold the potential to facilitate navigation for visually impaired individuals in both indoor and outdoor settings. Nevertheless, state-of-the-art VIO algorithms encounter substantial challenges in dynamic environments, particularly in densely populated corridors. Existing VIO datasets, e.g., ADVIO, typically fail to effectively exploit these challenges. In this paper, we introduce the Amirkabir campus dataset (AUT-VI) to address the mentioned problem and improve the navigation systems. AUT-VI is a novel and super-challenging dataset with 126 diverse sequences in 17 different locations. This dataset contains dynamic objects, challenging loop-closure/map-reuse, different lighting conditions, reflections, and sudden camera movements to cover all extreme navigation scenarios. Moreover, in support of ongoing development efforts, we have released the Android application for data capture to the public. This allows fellow researchers to easily capture their customized VIO dataset variations. In addition, we evaluate state-of-the-art Visual Inertial Odometry (VIO) and Visual Odometry (VO) methods on our dataset, emphasizing the essential need for this challenging dataset.

中文摘要: 視覺慣性里程計(VIO)算法通過使用相機和慣性測量單元(IMU)傳感器來估計精確的相機軌跡。VIO的應(yīng)用范圍廣泛,包括增強現(xiàn)實和室內(nèi)導(dǎo)航。VIO算法有可能促進視障人士在室內(nèi)和室外環(huán)境中的導(dǎo)航。然而,最先進的VIO算法在動態(tài)環(huán)境中,特別是在人口稠密的走廊中,遇到了巨大的挑戰(zhàn)?,F(xiàn)有的VIO數(shù)據(jù)集,例如ADVIO,通常無法有效利用這些挑戰(zhàn)。在本文中,我們引入了Amirkabir校園數(shù)據(jù)集(AUT-VI)來解決上述問題并改進導(dǎo)航系統(tǒng)。AUT-VI是一個新穎且極具挑戰(zhàn)性的數(shù)據(jù)集,包含17個不同位置的126個不同序列。該數(shù)據(jù)集包含動態(tài)對象、具有挑戰(zhàn)性的回路閉合/地圖重用、不同的照明條件、反射和相機突然移動,以覆蓋所有極端導(dǎo)航場景。此外,為了支持正在進行的開發(fā)工作,我們向公眾發(fā)布了用于數(shù)據(jù)捕獲的Android應(yīng)用程序。這使得其他研究人員能夠輕松地捕捉他們定制的VIO數(shù)據(jù)集變體。此外,我們在數(shù)據(jù)集上評估了最先進的視覺慣性里程計(VIO)和視覺里程計(VO)方法,強調(diào)了對這一具有挑戰(zhàn)性的數(shù)據(jù)集的必要性

[Downlink:]http://arxiv.org/abs/2401.03604v1


== Embodied AI, Robotics ==

標題: Unified Learning from Demonstrations, Corrections, and Preferences
during Physical Human-Robot Interaction

作者: Shaunak A. Mehta, Dylan P. Losey

摘要: Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine multiple interaction types by assuming that the robot has prior information about the human’s intended task. By contrast, in this paper we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human’s inputs to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human’s demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study we demonstrate that our proposed approach more accurately learns manipulation tasks from physical human interaction than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at: https://youtu.be/FSUJsTYvEKU

中文摘要: 人類可以利用物理交互來教授機器人手臂。這種物理交互有多種形式,具體取決于任務(wù)、用戶以及機器人迄今為止所學(xué)的知識?,F(xiàn)有技術(shù)的方法側(cè)重于從單一模態(tài)學(xué)習(xí),或者通過假設(shè)機器人具有關(guān)于人類預(yù)期任務(wù)的先驗信息來組合多種交互類型。相比之下,在本文中,我們引入了一種算法形式主義,它將從演示、更正和偏好中學(xué)習(xí)結(jié)合起來。我們的方法不對人類想要教機器人的任務(wù)進行假設(shè);相反,我們通過將人類的輸入與附近的替代品進行比較,從頭開始學(xué)習(xí)獎勵模型。我們首先推導(dǎo)出一個損失函數(shù),該函數(shù)訓(xùn)練一組獎勵模型,以匹配人類的演示、校正和偏好。反饋的類型和順序取決于人類老師:我們使機器人能夠被動或主動地收集反饋。然后,我們應(yīng)用約束優(yōu)化將我們學(xué)到的獎勵轉(zhuǎn)化為所需的機器人軌跡。通過模擬和用戶研究,我們證明了我們提出的方法比現(xiàn)有的基線更準確地從物理人類交互中學(xué)習(xí)操縱任務(wù),特別是當機器人面臨新的或意想不到的目標時。我們的用戶研究視頻可在以下網(wǎng)站獲?。篽ttps://youtu.be/FSUJsTYvEKU

[Downlink:]http://arxiv.org/abs/2207.03395v2

[Project:]https://youtu.be/FSUJsTYvEKU|
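
The unifying idea in the abstract is to learn a reward model by comparing the human's input against nearby alternatives, so demonstrations, corrections, and preferences all reduce to ranking-style comparisons. The sketch below shows that shared ingredient as a Bradley-Terry style preference loss over trajectory feature vectors in PyTorch; the feature dimension, network size, synthetic data, and single-model training (the paper trains an ensemble) are assumptions for illustration rather than the paper's exact loss.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a trajectory feature vector to a scalar reward."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(model, preferred, alternative):
    """Bradley-Terry style loss: the human-chosen trajectory should score
    higher than a nearby alternative. Demonstrations and corrections can be
    cast in the same form by treating them as preferred over what the robot
    would otherwise have done."""
    return -torch.log(
        torch.sigmoid(model(preferred) - model(alternative))).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    feat_dim = 8
    # Synthetic stand-in data: "preferred" features lie closer to a goal.
    goal = torch.randn(feat_dim)
    preferred = goal + 0.1 * torch.randn(256, feat_dim)
    alternative = goal + 1.0 * torch.randn(256, feat_dim)

    model = RewardNet(feat_dim)          # one ensemble member shown here
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for step in range(200):
        opt.zero_grad()
        loss = preference_loss(model, preferred, alternative)
        loss.backward()
        opt.step()
    print("final loss:", float(loss))
```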


標題: Theory of Mind abilities of Large Language Models in Human-Robot
Interaction : An Illusion?

作者: Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati

摘要: Large Language Models have shown exceptional generative abilities in various natural language and generation tasks. However, possible anthropomorphization and leniency towards failure cases have propelled discussions on emergent abilities of Large Language Models especially on Theory of Mind (ToM) abilities in Large Language Models. While several false-belief tests exists to verify the ability to infer and maintain mental models of another entity, we study a special application of ToM abilities that has higher stakes and possibly irreversible consequences : Human Robot Interaction. In this work, we explore the task of Perceived Behavior Recognition, where a robot employs a Large Language Model (LLM) to assess the robot’s generated behavior in a manner similar to human observer. We focus on four behavior types, namely - explicable, legible, predictable, and obfuscatory behavior which have been extensively used to synthesize interpretable robot behaviors. The LLMs goal is, therefore to be a human proxy to the agent, and to answer how a certain agent behavior would be perceived by the human in the loop, for example “Given a robot’s behavior X, would the human observer find it explicable?”. We conduct a human subject study to verify that the users are able to correctly answer such a question in the curated situations (robot setting and plan) across five domains. A first analysis of the belief test yields extremely positive results inflating ones expectations of LLMs possessing ToM abilities. We then propose and perform a suite of perturbation tests which breaks this illusion, i.e. Inconsistent Belief, Uninformative Context and Conviction Test. We conclude that, the high score of LLMs on vanilla prompts showcases its potential use in HRI settings, however to possess ToM demands invariance to trivial or irrelevant perturbations in the context which LLMs lack.

中文摘要: 大型語言模型在各種自然語言和生成任務(wù)中表現(xiàn)出了非凡的生成能力。然而,可能的擬人化和對失敗案例的寬容推動了對大語言模型涌現(xiàn)能力的討論,尤其是對大語言模式中心理理論能力的討論。雖然存在一些錯誤信念測試來驗證推斷和維護另一個實體的心理模型的能力,但我們研究了ToM能力的一個特殊應(yīng)用,它具有更高的風險和可能不可逆轉(zhuǎn)的后果:人機交互。在這項工作中,我們探索了感知行為識別的任務(wù),其中機器人采用大型語言模型(LLM)以類似于人類觀察者的方式評估機器人生成的行為。我們關(guān)注四種行為類型,即可解釋、可閱讀、可預(yù)測和模糊行為,這些行為已被廣泛用于合成可解釋的機器人行為。因此,LLM的目標是成為代理的人類代理,并回答某個代理行為將如何被循環(huán)中的人類感知,例如“給定機器人的行為X,人類觀察者會發(fā)現(xiàn)它是可解釋的嗎?”。我們進行了一項人類受試者研究,以驗證用戶能夠在五個領(lǐng)域的精心策劃的情況下(機器人設(shè)置和計劃)正確回答這樣的問題。信念測試的第一個分析產(chǎn)生了非常積極的結(jié)果,夸大了人們對LLM擁有ToM能力的期望。然后,我們提出并執(zhí)行了一套打破這種錯覺的擾動測試,即不一致信念、不一致上下文和信念測試。我們得出的結(jié)論是,LLM在香草提示上的高分顯示了它在HRI設(shè)置中的潛在用途,然而,在LLM缺乏的情況下,擁有ToM要求對瑣碎或無關(guān)的擾動保持不變

[Downlink:]http://arxiv.org/abs/2401.05302v1


標題: Evaluating Gesture Recognition in Virtual Reality

作者: Sandeep Reddy Sabbella, Sara Kaszuba, Francesco Leotta

摘要: Human-Robot Interaction (HRI) has become increasingly important as robots are being integrated into various aspects of daily life. One key aspect of HRI is gesture recognition, which allows robots to interpret and respond to human gestures in real-time. Gesture recognition plays an important role in non-verbal communication in HRI. To this aim, there is ongoing research on how such non-verbal communication can strengthen verbal communication and improve the system’s overall efficiency, thereby enhancing the user experience with the robot. However, several challenges need to be addressed in gesture recognition systems, which include data generation, transferability, scalability, generalizability, standardization, and lack of benchmarking of the gestural systems. In this preliminary paper, we want to address the challenges of data generation using virtual reality simulations and standardization issues by presenting gestures to some commands that can be used as a standard in ground robots.

中文摘要: 隨著機器人融入日常生活的各個方面,人機交互變得越來越重要。HRI的一個關(guān)鍵方面是手勢識別,它允許機器人實時解釋和響應(yīng)人類手勢。手勢識別在HRI的非言語交際中起著重要作用。為此,正在進行的研究是,這種非語言交流如何加強語言交流,提高系統(tǒng)的整體效率,從而增強機器人的用戶體驗。然而,手勢識別系統(tǒng)需要解決幾個挑戰(zhàn),包括數(shù)據(jù)生成、可傳輸性、可擴展性、可推廣性、標準化以及缺乏手勢系統(tǒng)的基準測試。在這篇初步論文中,我們希望通過向一些可以用作地面機器人標準的命令提供手勢,來解決使用虛擬現(xiàn)實模擬生成數(shù)據(jù)的挑戰(zhàn)和標準化問題

[Downlink:]http://arxiv.org/abs/2401.04545v1


標題: Testing Human-Robot Interaction in Virtual Reality: Experience from a
Study on Speech Act Classification

作者: Sara Kaszuba, Sandeep Reddy Sabbella, Francesco Leotta

摘要: In recent years, an increasing number of Human-Robot Interaction (HRI) approaches have been implemented and evaluated in Virtual Reality (VR), as it allows to speed-up design iterations and makes it safer for the final user to evaluate and master the HRI primitives. However, identifying the most suitable VR experience is not straightforward. In this work, we evaluate how, in a smart agriculture scenario, immersive and non-immersive VR are perceived by users with respect to a speech act understanding task. In particular, we collect opinions and suggestions from the 81 participants involved in both experiments to highlight the strengths and weaknesses of these different experiences.

中文摘要: 近年來,越來越多的人機交互(HRI)方法在虛擬現(xiàn)實(VR)中得到了實施和評估,因為它可以加快設(shè)計迭代,并使最終用戶更安全地評估和掌握HRI原語。然而,確定最合適的VR體驗并不簡單。在這項工作中,我們評估了在智能農(nóng)業(yè)場景中,用戶如何在語音行為理解任務(wù)中感知沉浸式和非沉浸式VR。特別是,我們收集了參與這兩個實驗的81名參與者的意見和建議,以突出這些不同經(jīng)歷的優(yōu)勢和劣勢

[Downlink:]http://arxiv.org/abs/2401.04534v1


標題: Amplifying robotics capacities with a human touch: An immersive
low-latency panoramic remote system

作者: Junjie Li, Kang Li, Dewei Han

摘要: AI and robotics technologies have witnessed remarkable advancements in the past decade, revolutionizing work patterns and opportunities in various domains. The application of these technologies has propelled society towards an era of symbiosis between humans and machines. To facilitate efficient communication between humans and intelligent robots, we propose the “Avatar” system, an immersive low-latency panoramic human-robot interaction platform. We have designed and tested a prototype of a rugged mobile platform integrated with edge computing units, panoramic video capture devices, power batteries, robot arms, and network communication equipment. Under favorable network conditions, we achieved a low-latency high-definition panoramic visual experience with a delay of 357ms. Operators can utilize VR headsets and controllers for real-time immersive control of robots and devices. The system enables remote control over vast physical distances, spanning campuses, provinces, countries, and even continents (New York to Shenzhen). Additionally, the system incorporates visual SLAM technology for map and trajectory recording, providing autonomous navigation capabilities. We believe that this intuitive system platform can enhance efficiency and situational experience in human-robot collaboration, and with further advancements in related technologies, it will become a versatile tool for efficient and symbiotic cooperation between AI and humans.

中文摘要: 人工智能和機器人技術(shù)在過去十年中取得了顯著進步,改變了各個領(lǐng)域的工作模式和機會。這些技術(shù)的應(yīng)用將社會推向了一個人與機器共生的時代。為了促進人類與智能機器人之間的高效通信,我們提出了“阿凡達”系統(tǒng),這是一個沉浸式低延遲全景人機交互平臺。我們設(shè)計并測試了一個堅固的移動平臺原型,該平臺集成了邊緣計算單元、全景視頻捕獲設(shè)備、動力電池、機械臂和網(wǎng)絡(luò)通信設(shè)備。在良好的網(wǎng)絡(luò)條件下,我們實現(xiàn)了延遲357ms的低延遲高清全景視覺體驗。操作員可以利用VR耳機和控制器對機器人和設(shè)備進行實時沉浸式控制。該系統(tǒng)能夠?qū)崿F(xiàn)跨越校園、省份、國家甚至大洲(紐約到深圳)的遠距離遠程控制。此外,該系統(tǒng)結(jié)合了用于地圖和軌跡記錄的視覺SLAM技術(shù),提供了自主導(dǎo)航功能。我們相信,這個直觀的系統(tǒng)平臺可以提高人機協(xié)作的效率和情景體驗,隨著相關(guān)技術(shù)的進一步進步,它將成為人工智能與人類高效共生合作的通用工具

[Downlink:]http://arxiv.org/abs/2401.03398v2


標題: Large Language Models for Robotics: Opportunities, Challenges, and
Perspectives

作者: Jiaqi Wang, Zihao Wu, Yiwei Li

摘要: Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.

中文摘要: 大型語言模型(LLM)經(jīng)歷了顯著的擴展,并越來越多地跨各個領(lǐng)域進行集成。值得注意的是,在機器人任務(wù)規(guī)劃領(lǐng)域,LLM利用其先進的推理和語言理解能力,根據(jù)自然語言指令制定精確高效的行動計劃。然而,對于機器人與復(fù)雜環(huán)境交互的具體任務(wù),由于與機器人視覺感知缺乏兼容性,純文本LLM往往面臨挑戰(zhàn)。這項研究全面概述了LLM和多模式LLM在各種機器人任務(wù)中的新興集成。此外,我們提出了一個框架,該框架利用多模式GPT-4V,通過自然語言指令和機器人視覺感知的組合來增強具體任務(wù)規(guī)劃。我們基于不同數(shù)據(jù)集的結(jié)果表明,GPT-4V有效地提高了機器人在具體任務(wù)中的性能。這項針對各種機器人任務(wù)的LLM和多模式LLM的廣泛調(diào)查和評估豐富了對以LLM為中心的具體智能的理解,并為彌合人機環(huán)境交互中的差距提供了前瞻性見解

[Downlink:]http://arxiv.org/abs/2401.04334v1


== Reinforcement Learning ==

標題: HomeRobot: Open-Vocabulary Mobile Manipulation

作者: Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav

摘要: HomeRobot (noun): An affordable compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it involves tackling sub-problems from across robotics: perception, language understanding, navigation, and manipulation are all essential to OVMM. In addition, integration of the solutions to these sub-problems poses its own substantial challenges. To drive research in this area, we introduce the HomeRobot OVMM benchmark, where an agent navigates household environments to grasp novel objects and place them on target receptacles. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch to encourage replication of real-world experiments across labs. We implement both reinforcement learning and heuristic (model-based) baselines and show evidence of sim-to-real transfer. Our baselines achieve a 20% success rate in the real world; our experiments identify ways future research work improve performance. See videos on our website: https://ovmm.github.io/.

中文摘要: 家庭機器人(名詞):一種價格合理的兼容機器人,可在家中導(dǎo)航并操縱各種物體以完成日常任務(wù)。開放詞匯移動操作(OVMM)是指在任何看不見的環(huán)境中拾取任何對象,并將其放置在命令位置的問題。這是機器人在人類環(huán)境中成為有用助手的一個基本挑戰(zhàn),因為它涉及到解決機器人的子問題:感知、語言理解、導(dǎo)航和操作都是OVMM的關(guān)鍵。此外,這些子問題的解決方案的一體化也帶來了自身的重大挑戰(zhàn)。為了推動這一領(lǐng)域的研究,我們引入了HomeRobot OVMM基準,在該基準中,代理導(dǎo)航家庭環(huán)境,以抓取新物體并將其放置在目標容器上。HomeRobot有兩個組件:一個模擬組件,在新的、高質(zhì)量的多房間家庭環(huán)境中使用大型和多樣化的策劃對象集;和一個真實世界的組件,為低成本的Hello Robot Stretch提供了一個軟件堆棧,以鼓勵在實驗室中復(fù)制真實世界的實驗。我們實現(xiàn)了強化學(xué)習(xí)和啟發(fā)式(基于模型的)基線,并展示了模擬到真實轉(zhuǎn)移的證據(jù)。我們的基線在現(xiàn)實世界中實現(xiàn)了20%的成功率;我們的實驗確定了未來研究工作提高性能的方法。查看我們網(wǎng)站上的視頻:https://ovmm.github.io/.

[Downlink:]http://arxiv.org/abs/2306.11565v2

[Project:]https://ovmm.github.io/.|


標題: Yes, this is what I was looking for! Towards Multi-modal Medical
Consultation Concern Summary Generation

作者: Abhisek Tiwari, Shreyangshu Bera, Sriparna Saha

摘要: Over the past few years, the use of the Internet for healthcare-related tasks has grown by leaps and bounds, posing a challenge in effectively managing and processing information to ensure its efficient utilization. During moments of emotional turmoil and psychological challenges, we frequently turn to the internet as our initial source of support, choosing this over discussing our feelings with others due to the associated social stigma. In this paper, we propose a new task of multi-modal medical concern summary (MMCS) generation, which provides a short and precise summary of patients’ major concerns brought up during the consultation. Nonverbal cues, such as patients’ gestures and facial expressions, aid in accurately identifying patients’ concerns. Doctors also consider patients’ personal information, such as age and gender, in order to describe the medical condition appropriately. Motivated by the potential efficacy of patients’ personal context and visual gestures, we propose a transformer-based multi-task, multi-modal intent-recognition, and medical concern summary generation (IR-MMCSG) system. Furthermore, we propose a multitasking framework for intent recognition and medical concern summary generation for doctor-patient consultations. We construct the first multi-modal medical concern summary generation (MM-MediConSummation) corpus, which includes patient-doctor consultations annotated with medical concern summaries, intents, patient personal information, doctor’s recommendations, and keywords. Our experiments and analysis demonstrate (a) the significant role of patients’ expressions/gestures and their personal information in intent identification and medical concern summary generation, and (b) the strong correlation between intent recognition and patients’ medical concern summary generation The dataset and source code are available at https://github.com/NLP-RL/MMCSG.

中文摘要: 在過去幾年中,互聯(lián)網(wǎng)在醫(yī)療保健相關(guān)任務(wù)中的使用突飛猛進,這對有效管理和處理信息以確保其高效利用提出了挑戰(zhàn)。在情緒動蕩和心理挑戰(zhàn)的時刻,我們經(jīng)常求助于互聯(lián)網(wǎng)作為我們最初的支持來源,由于相關(guān)的社會污名,我們選擇了互聯(lián)網(wǎng)而不是與他人討論我們的感受。在本文中,我們提出了一項新的任務(wù),即生成多模式醫(yī)療問題摘要(MMCS),該任務(wù)對患者在咨詢過程中提出的主要問題進行了簡短而準確的總結(jié)。非語言提示,如患者的手勢和面部表情,有助于準確識別患者的擔憂。醫(yī)生還會考慮患者的個人信息,如年齡和性別,以便適當?shù)孛枋鲠t(yī)療狀況。受患者個人背景和視覺手勢的潛在功效的啟發(fā),我們提出了一個基于轉(zhuǎn)換器的多任務(wù)、多模式意圖識別和醫(yī)療問題摘要生成(IR-MMCSG)系統(tǒng)。此外,我們提出了一個多任務(wù)框架,用于醫(yī)患會診的意圖識別和醫(yī)療問題摘要生成。我們構(gòu)建了第一個多模式醫(yī)療問題摘要生成(MM MediConSummation)語料庫,其中包括用醫(yī)療問題摘要、意圖、患者個人信息、醫(yī)生建議和關(guān)鍵詞注釋的醫(yī)患咨詢。我們的實驗和分析證明了(a)患者的表情/手勢及其個人信息在意圖識別和醫(yī)療問題摘要生成中的重要作用,以及(b)意圖識別和患者醫(yī)療問題摘要生成器之間的強相關(guān)性。數(shù)據(jù)集和源代碼可在https://github.com/NLP-RL/MMCSG.

[Downlink:]http://arxiv.org/abs/2401.05134v1

[GitHub:]https://github.com/NLP-RL/MMCSG.|


標題: Human as AI Mentor: Enhanced Human-in-the-loop Reinforcement Learning
for Safe and Efficient Autonomous Driving

作者: Zilin Huang, Zihao Sheng, Chengyuan Ma

摘要: Despite significant progress in autonomous vehicles (AVs), the development of driving policies that ensure both the safety of AVs and traffic flow efficiency has not yet been fully explored. In this paper, we propose an enhanced human-in-the-loop reinforcement learning method, termed the Human as AI mentor-based deep reinforcement learning (HAIM-DRL) framework, which facilitates safe and efficient autonomous driving in mixed traffic platoon. Drawing inspiration from the human learning process, we first introduce an innovative learning paradigm that effectively injects human intelligence into AI, termed Human as AI mentor (HAIM). In this paradigm, the human expert serves as a mentor to the AI agent. While allowing the agent to sufficiently explore uncertain environments, the human expert can take control in dangerous situations and demonstrate correct actions to avoid potential accidents. On the other hand, the agent could be guided to minimize traffic flow disturbance, thereby optimizing traffic flow efficiency. In detail, HAIM-DRL leverages data collected from free exploration and partial human demonstrations as its two training sources. Remarkably, we circumvent the intricate process of manually designing reward functions; instead, we directly derive proxy state-action values from partial human demonstrations to guide the agents’ policy learning. Additionally, we employ a minimal intervention technique to reduce the human mentor’s cognitive load. Comparative results show that HAIM-DRL outperforms traditional methods in driving safety, sampling efficiency, mitigation of traffic flow disturbance, and generalizability to unseen traffic scenarios. The code and demo videos for this paper can be accessed at: https://zilin-huang.github.io/HAIM-DRL-website/

中文摘要: 盡管自動駕駛汽車取得了重大進展,但確保自動駕駛汽車安全和交通流效率的駕駛政策的制定尚未得到充分探索。在本文中,我們提出了一種增強的人在環(huán)強化學(xué)習(xí)方法,稱為基于人工智能導(dǎo)師的深度強化學(xué)習(xí)(HAIM-DRL)框架,該框架有助于混合交通車隊中安全高效的自動駕駛。從人類學(xué)習(xí)過程中汲取靈感,我們首先引入了一種創(chuàng)新的學(xué)習(xí)范式,將人類智能有效地注入人工智能,稱為“人類即人工智能導(dǎo)師”(HAIM)。在這種范式中,人類專家充當人工智能代理的導(dǎo)師。在允許智能體充分探索不確定環(huán)境的同時,人類專家可以在危險情況下進行控制,并展示正確的行動以避免潛在的事故。另一方面,可以引導(dǎo)代理最小化交通流干擾,從而優(yōu)化交通流效率。詳細地說,HAIM-DRL利用從自由探索和部分人類演示中收集的數(shù)據(jù)作為其兩個訓(xùn)練來源。值得注意的是,我們避開了手動設(shè)計獎勵函數(shù)的復(fù)雜過程;相反,我們直接從部分人類演示中導(dǎo)出代理狀態(tài)動作值,以指導(dǎo)代理的策略學(xué)習(xí)。此外,我們采用最小干預(yù)技術(shù)來減少人類導(dǎo)師的認知負荷。比較結(jié)果表明,HAIM-DRL在駕駛安全性、采樣效率、交通流干擾的緩解以及對未知交通場景的可推廣性方面優(yōu)于傳統(tǒng)方法。本文的代碼和演示視頻可訪問:https://zilin-huang.github.io/HAIM-DRL-website/

[Downlink:]http://arxiv.org/abs/2401.03160v2

[Project:]https://zilin-huang.github.io/HAIM-DRL-website/|
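
The core HAIM ingredient is that a human mentor can take over in dangerous states, and that takeovers are stored and lightly penalized so the learned policy needs fewer interventions over time. The fragment below sketches only that data-collection loop on a toy 1-D environment; the `ToyEnv`, the danger test, the mentor policy, and the intervention cost are hypothetical placeholders, not the released HAIM-DRL code.

```python
import random

class ToyEnv:
    """1-D toy environment: the state drifts with the chosen action; going
    far outside [-1, 1] is treated as dangerous."""
    def reset(self):
        self.x = 0.0
        return self.x

    def step(self, action):
        self.x += 0.1 * action + random.uniform(-0.05, 0.05)
        reward = -abs(self.x)            # stay near the center
        done = abs(self.x) > 1.5
        return self.x, reward, done

def collect_with_mentor(env, agent_act, human_act, is_dangerous,
                        steps=200, intervention_cost=0.1):
    """HAIM-style data collection: the human mentor takes over in dangerous
    states; takeovers are recorded and lightly penalized so that the learned
    policy is pushed to need fewer interventions."""
    buffer, obs = [], env.reset()
    for _ in range(steps):
        intervened = is_dangerous(obs)
        action = human_act(obs) if intervened else agent_act(obs)
        next_obs, reward, done = env.step(action)
        shaped = reward - intervention_cost * intervened
        buffer.append((obs, action, shaped, next_obs, intervened))
        obs = env.reset() if done else next_obs
    return buffer

if __name__ == "__main__":
    random.seed(0)
    env = ToyEnv()
    agent_act = lambda s: random.choice([-1.0, 1.0])    # exploring agent
    human_act = lambda s: -1.0 if s > 0 else 1.0        # corrective mentor
    is_dangerous = lambda s: abs(s) > 0.8
    data = collect_with_mentor(env, agent_act, human_act, is_dangerous)
    print("transitions:", len(data),
          "interventions:", sum(t[4] for t in data))
```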


標題: Two-Stage Constrained Actor-Critic for Short Video Recommendation

作者: Qingpeng Cai, Zhenghai Xue, Chi Zhang

摘要: The wide popularity of short videos on social media poses new opportunities and challenges to optimize recommender systems on the video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including watch time and various types of interactions with multiple videos. One the one hand, the platforms aims at optimizing the users’ cumulative watch time (main goal) in long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platforms also needs to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such like, follow, share etc. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms can not work well in this setting. We propose a novel two-stage constrained actor-critic method: At stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate effectiveness of our method over alternatives in both optimizing the main goal as well as balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.

中文摘要: 短視頻在社交媒體上的廣泛流行為優(yōu)化視頻共享平臺上的推薦系統(tǒng)帶來了新的機遇和挑戰(zhàn)。用戶順序地與系統(tǒng)交互,并提供復(fù)雜和多方面的響應(yīng),包括觀看時間和與多個視頻的各種類型的交互。一方面,平臺旨在長期優(yōu)化用戶的累計觀看時間(主要目標),強化學(xué)習(xí)可以有效地優(yōu)化用戶的累積觀看時間。另一方面,平臺還需要滿足容納多個用戶交互(輔助目標)(如關(guān)注、分享等)的響應(yīng)的約束。在本文中,我們將短視頻推薦問題公式化為約束馬爾可夫決策過程(CMDP)。我們發(fā)現(xiàn)傳統(tǒng)的約束強化學(xué)習(xí)算法在這種情況下不能很好地工作。我們提出了一種新的兩階段約束行動者-批評家方法:在第一階段,我們學(xué)習(xí)單個策略來優(yōu)化每個輔助信號。在第二階段,我們學(xué)習(xí)了一種策略,以(i)優(yōu)化主信號,(ii)保持與第一階段學(xué)習(xí)的策略接近,這有效地保證了該主策略在輔助設(shè)備上的性能。通過廣泛的離線評估,我們證明了我們的方法在優(yōu)化主要目標和平衡其他目標方面的有效性。我們在短視頻推薦的現(xiàn)場實驗中進一步展示了我們的方法的優(yōu)勢,在觀看時間和互動方面,它明顯優(yōu)于其他基線。我們的方法已在生產(chǎn)系統(tǒng)中全面推出,以優(yōu)化平臺上的用戶體驗

[Downlink:]http://arxiv.org/abs/2302.01680v3

[GitHub:]https://github.com/AIDefender/TSCAC.|
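
Stage two of the method keeps the main policy close to the per-auxiliary policies learned in stage one. One compact way to express that is a policy-gradient term for the main signal plus a KL penalty toward each frozen auxiliary policy, as sketched below for a discrete action space in PyTorch; the advantage estimate, penalty weights, and discrete-action setting are assumptions for illustration, not the paper's exact constrained formulation.

```python
import torch
import torch.nn.functional as F

def stage_two_actor_loss(main_logits, advantage, actions,
                         aux_logits_list, kl_weights):
    """Stage-two objective sketch for a discrete-action policy.

    main_logits:     (B, A) logits of the main policy being trained
    advantage:       (B,) advantage of the taken actions w.r.t. the main
                     signal (e.g., watch time)
    actions:         (B,) indices of the taken actions
    aux_logits_list: list of (B, A) logits from the frozen stage-one
                     policies, one per auxiliary response
    kl_weights:      list of floats trading off closeness to each auxiliary
    """
    log_probs = F.log_softmax(main_logits, dim=-1)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_gradient_term = -(taken * advantage).mean()

    kl_term = 0.0
    for aux_logits, w in zip(aux_logits_list, kl_weights):
        aux_log_probs = F.log_softmax(aux_logits.detach(), dim=-1)
        # KL(aux || main): penalize drifting away from each auxiliary policy.
        kl = F.kl_div(log_probs, aux_log_probs,
                      log_target=True, reduction="batchmean")
        kl_term = kl_term + w * kl
    return policy_gradient_term + kl_term

if __name__ == "__main__":
    torch.manual_seed(0)
    B, A = 4, 3
    main_logits = torch.randn(B, A, requires_grad=True)
    loss = stage_two_actor_loss(
        main_logits,
        advantage=torch.randn(B),
        actions=torch.randint(0, A, (B,)),
        aux_logits_list=[torch.randn(B, A), torch.randn(B, A)],
        kl_weights=[0.1, 0.1])
    loss.backward()
    print("loss:", float(loss))
```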


標題: StarCraftImage: A Dataset For Prototyping Spatial Reasoning Methods For
Multi-Agent Environments

作者: Sean Kulinski, Nicholas R. Waytowich, James Z. Hare

摘要: Spatial reasoning tasks in multi-agent environments such as event prediction, agent type identification, or missing data imputation are important for multiple applications (e.g., autonomous surveillance over sensor networks and subtasks for reinforcement learning (RL)). StarCraft II game replays encode intelligent (and adversarial) multi-agent behavior and could provide a testbed for these tasks; however, extracting simple and standardized representations for prototyping these tasks is laborious and hinders reproducibility. In contrast, MNIST and CIFAR10, despite their extreme simplicity, have enabled rapid prototyping and reproducibility of ML methods. Following the simplicity of these datasets, we construct a benchmark spatial reasoning dataset based on StarCraft II replays that exhibit complex multi-agent behaviors, while still being as easy to use as MNIST and CIFAR10. Specifically, we carefully summarize a window of 255 consecutive game states to create 3.6 million summary images from 60,000 replays, including all relevant metadata such as game outcome and player races. We develop three formats of decreasing complexity: Hyperspectral images that include one channel for every unit type (similar to multispectral geospatial images), RGB images that mimic CIFAR10, and grayscale images that mimic MNIST. We show how this dataset can be used for prototyping spatial reasoning methods. All datasets, code for extraction, and code for dataset loading can be found at https://starcraftdata.davidinouye.com

中文摘要: 多智能體環(huán)境中的空間推理任務(wù),如事件預(yù)測、智能體類型識別或缺失數(shù)據(jù)插補,對于多個應(yīng)用程序非常重要(例如,傳感器網(wǎng)絡(luò)上的自主監(jiān)控和強化學(xué)習(xí)(RL)的子任務(wù))?!缎请H爭霸II》游戲回放對智能(和對抗性)多智能體行為進行編碼,并可以為這些任務(wù)提供測試平臺;然而,為這些任務(wù)的原型設(shè)計提取簡單和標準化的表示是費力的,并且阻礙了再現(xiàn)性。相比之下,MNIST和CIFAR10盡管極其簡單,但已經(jīng)實現(xiàn)了ML方法的快速原型設(shè)計和再現(xiàn)性。遵循這些數(shù)據(jù)集的簡單性,我們構(gòu)建了一個基于星際爭霸II回放的基準空間推理數(shù)據(jù)集,該數(shù)據(jù)集表現(xiàn)出復(fù)雜的多智能體行為,同時仍然像MNIST和CIFAR10一樣易于使用。具體來說,我們仔細總結(jié)了255個連續(xù)游戲狀態(tài)的窗口,從60000次回放中創(chuàng)建了360萬個摘要圖像,包括所有相關(guān)的元數(shù)據(jù),如游戲結(jié)果和玩家種族。我們開發(fā)了三種降低復(fù)雜性的格式:每種單位類型都有一個通道的高光譜圖像(類似于多光譜地理空間圖像)、模擬CIFAR10的RGB圖像和模擬MNIST的灰度圖像。我們展示了如何將該數(shù)據(jù)集用于空間推理方法的原型設(shè)計。所有數(shù)據(jù)集、用于提取的代碼和用于加載數(shù)據(jù)集的代碼都可以在https://starcraftdata.davidinouye.com

[Downlink:]http://arxiv.org/abs/2401.04290v1

[Project:]https://starcraftdata.davidinouye.com|
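
The dataset summarizes windows of game states into image-like tensors with one channel per unit type (the "hyperspectral" format), plus RGB and grayscale variants. The snippet below sketches that representation in NumPy by rasterizing (unit_type, x, y) observations onto per-type channel grids; the grid size, unit-type count, and input format are illustrative assumptions rather than the dataset's exact schema.

```python
import numpy as np

def summarize_units(unit_events, num_unit_types, grid_size=64):
    """Accumulate (unit_type, x, y) observations from a window of game
    states into a "hyperspectral" image with one channel per unit type.

    unit_events: iterable of (unit_type, x, y) with x, y in [0, 1)
    Returns an array of shape (num_unit_types, grid_size, grid_size)
    counting how often each unit type occupied each cell.
    """
    img = np.zeros((num_unit_types, grid_size, grid_size), dtype=np.float32)
    for unit_type, x, y in unit_events:
        col = min(int(x * grid_size), grid_size - 1)
        row = min(int(y * grid_size), grid_size - 1)
        img[unit_type, row, col] += 1.0
    return img

def to_grayscale(hyperspectral):
    """Collapse the per-type channels into a single MNIST-like channel."""
    return hyperspectral.sum(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    events = [(int(t), float(x), float(y))
              for t, x, y in zip(rng.integers(0, 5, 100),
                                 rng.random(100), rng.random(100))]
    hyper = summarize_units(events, num_unit_types=5)
    print(hyper.shape, to_grayscale(hyper).shape)
```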


標題: A Reinforcement Learning Approach to Sensing Design in
Resource-Constrained Wireless Networked Control Systems

作者: Luca Ballotta, Giovanni Peserico, Francesco Zanini

摘要: In this paper, we consider a wireless network of smart sensors (agents) that monitor a dynamical process and send measurements to a base station that performs global monitoring and decision-making. Smart sensors are equipped with both sensing and computation, and can either send raw measurements or process them prior to transmission. Constrained agent resources raise a fundamental latency-accuracy trade-off. On the one hand, raw measurements are inaccurate but fast to produce. On the other hand, data processing on resource-constrained platforms generates accurate measurements at the cost of non-negligible computation latency. Further, if processed data are also compressed, latency caused by wireless communication might be higher for raw measurements. Hence, it is challenging to decide when and where sensors in the network should transmit raw measurements or leverage time-consuming local processing. To tackle this design problem, we propose a Reinforcement Learning approach to learn an efficient policy that dynamically decides when measurements are to be processed at each sensor. Effectiveness of our proposed approach is validated through a numerical simulation with case study on smart sensing motivated by the Internet of Drones.

中文摘要: 在本文中,我們考慮一個智能傳感器(代理)的無線網(wǎng)絡(luò),該網(wǎng)絡(luò)監(jiān)測動態(tài)過程,并將測量結(jié)果發(fā)送到執(zhí)行全局監(jiān)測和決策的基站。智能傳感器同時具備傳感和計算功能,可以發(fā)送原始測量值,也可以在傳輸前進行處理。受約束的代理資源提出了一個基本的延遲-準確性權(quán)衡。一方面,原始測量不準確,但生產(chǎn)速度很快。另一方面,資源受限平臺上的數(shù)據(jù)處理以不可忽略的計算延遲為代價生成準確的測量結(jié)果。此外,如果處理后的數(shù)據(jù)也被壓縮,則由無線通信引起的延遲對于原始測量可能更高。因此,決定網(wǎng)絡(luò)中的傳感器何時何地傳輸原始測量值或利用耗時的本地處理是一項挑戰(zhàn)。為了解決這個設(shè)計問題,我們提出了一種強化學(xué)習(xí)方法來學(xué)習(xí)一種有效的策略,該策略動態(tài)地決定何時在每個傳感器處處理測量。通過數(shù)值模擬和無人機互聯(lián)網(wǎng)驅(qū)動的智能傳感案例研究,驗證了我們提出的方法的有效性

[Downlink:]http://arxiv.org/abs/2204.00703v5
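
The design problem above boils down to a per-step choice between sending a fast but noisy raw measurement and a slower but more accurate processed one. A tabular Q-learning sketch of that decision on a toy latency-accuracy model is shown below; the state discretization, reward shape, and all constants are illustrative assumptions, not the paper's system model.

```python
import random

RAW, PROCESS = 0, 1
LATENCY = {RAW: 0.1, PROCESS: 0.5}   # local processing is slower to deliver...
NOISE = {RAW: 0.8, PROCESS: 0.1}     # ...but yields far more accurate data

def step(error_level, action, levels=5):
    """Toy model: the tracking error grows with latency and shrinks with
    measurement accuracy. States are discretized error levels 0..levels-1."""
    drift = LATENCY[action] * 2.0
    correction = (1.0 - NOISE[action]) * 2.0
    raw_level = error_level + drift - correction + random.uniform(-0.3, 0.3)
    new_level = int(min(max(round(raw_level), 0), levels - 1))
    reward = -new_level - 0.5 * LATENCY[action]   # accuracy vs. latency
    return new_level, reward

def train(episodes=2000, levels=5, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning over the two sensing actions."""
    q = [[0.0, 0.0] for _ in range(levels)]
    for _ in range(episodes):
        s = random.randrange(levels)
        for _ in range(20):
            a = (random.randrange(2) if random.random() < eps
                 else max((RAW, PROCESS), key=lambda x: q[s][x]))
            s2, r = step(s, a, levels)
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

if __name__ == "__main__":
    random.seed(0)
    for s, (q_raw, q_proc) in enumerate(train()):
        choice = "process locally" if q_proc > q_raw else "send raw"
        print(f"error level {s}: {choice}")
```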


== Open vocabulary detection ==

標題: LinK3D: Linear Keypoints Representation for 3D LiDAR Point Cloud

作者: Yunge Cui, Yinlong Zhang, Jiahua Dong

摘要: Feature extraction and matching are the basic parts of many robotic vision tasks, such as 2D or 3D object detection, recognition, and registration. As is known, 2D feature extraction and matching have already achieved great success. Unfortunately, in the field of 3D, the current methods may fail to support the extensive application of 3D LiDAR sensors in robotic vision tasks due to their poor descriptiveness and inefficiency. To address this limitation, we propose a novel 3D feature representation method: Linear Keypoints representation for 3D LiDAR point cloud, called LinK3D. The novelty of LinK3D lies in that it fully considers the characteristics (such as the sparsity and complexity) of LiDAR point clouds and represents the keypoint with its robust neighbor keypoints, which provide strong constraints in the description of the keypoint. The proposed LinK3D has been evaluated on three public datasets, and the experimental results show that our method achieves great matching performance. More importantly, LinK3D also shows excellent real-time performance, faster than the sensor frame rate at 10 Hz of a typical rotating LiDAR sensor. LinK3D only takes an average of 30 milliseconds to extract features from the point cloud collected by a 64-beam LiDAR and takes merely about 20 milliseconds to match two LiDAR scans when executed on a computer with an Intel Core i7 processor. Moreover, our method can be extended to LiDAR odometry task, and shows good scalability. We release the implementation of our method at https://github.com/YungeCui/LinK3D.

中文摘要: 特征提取和匹配是許多機器人視覺任務(wù)的基本部分,如二維或三維物體檢測、識別和配準。眾所周知,二維特征提取和匹配已經(jīng)取得了巨大的成功。不幸的是,在3D領(lǐng)域,由于3D激光雷達傳感器的描述性差和效率低,目前的方法可能無法支持其在機器人視覺任務(wù)中的廣泛應(yīng)用。為了解決這一限制,我們提出了一種新的3D特征表示方法:3D激光雷達點云的線性關(guān)鍵點表示,稱為LinK3D。LinK3D的新穎之處在于,它充分考慮了激光雷達點云的特性(如稀疏性和復(fù)雜性),并用其魯棒的鄰居關(guān)鍵點來表示關(guān)鍵點,這在關(guān)鍵點的描述中提供了強大的約束。在三個公共數(shù)據(jù)集上對所提出的LinK3D進行了評估,實驗結(jié)果表明,我們的方法具有很好的匹配性能。更重要的是,LinK3D還顯示出出色的實時性能,比典型旋轉(zhuǎn)激光雷達傳感器在10Hz下的傳感器幀速率更快。LinK3D從64束激光雷達收集的點云中提取特征平均只需30毫秒,在配備英特爾酷睿i7處理器的計算機上執(zhí)行時,匹配兩次激光雷達掃描僅需約20毫秒。此外,我們的方法可以擴展到激光雷達里程計任務(wù),并顯示出良好的可擴展性。我們在發(fā)布方法的實現(xiàn)https://github.com/YungeCui/LinK3D.

[Downlink:]http://arxiv.org/abs/2206.05927v3

[GitHub:]https://github.com/YungeCui/LinK3D.|
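
The key idea above is describing each 3D keypoint through its neighboring keypoints so that matching is robust and cheap. As a loose illustration only (not the LinK3D descriptor itself), the snippet below builds a toy descriptor from the sorted distances to the k nearest keypoints and matches two scans by nearest-descriptor search; k and the matching threshold are arbitrary choices.

```python
import numpy as np

def neighbor_descriptor(keypoints, k=4):
    """For each keypoint, use the sorted distances to its k nearest
    neighboring keypoints as a (very) simplified descriptor."""
    kp = np.asarray(keypoints)
    d = np.linalg.norm(kp[:, None, :] - kp[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, :k]

def match(desc_a, desc_b, max_dist=0.5):
    """Greedy nearest-descriptor matching between two scans."""
    pairs = []
    for i, da in enumerate(desc_a):
        cost = np.linalg.norm(desc_b - da, axis=1)
        j = int(np.argmin(cost))
        if cost[j] < max_dist:
            pairs.append((i, j))
    return pairs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scan_a = rng.uniform(-10, 10, size=(30, 3))
    scan_b = scan_a + rng.normal(0, 0.02, size=scan_a.shape)  # same scene, noisy
    pairs = match(neighbor_descriptor(scan_a), neighbor_descriptor(scan_b))
    print("matches:", len(pairs))
```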


標題: DC-Net: Divide-and-Conquer for Salient Object Detection

作者: Jiayi Zhu, Xuebin Qin, Abdulmotaleb Elsaddik

摘要: In this paper, we introduce Divide-and-Conquer into the salient object detection (SOD) task to enable the model to learn prior knowledge that is for predicting the saliency map. We design a novel network, Divide-and-Conquer Network (DC-Net) which uses two encoders to solve different subtasks that are conducive to predicting the final saliency map, here is to predict the edge maps with width 4 and location maps of salient objects and then aggregate the feature maps with different semantic information into the decoder to predict the final saliency map. The decoder of DC-Net consists of our newly designed two-level Residual nested-ASPP (ResASPP²) modules, which have the ability to capture a large number of different scale features with a small number of convolution operations and have the advantages of maintaining high resolution all the time and being able to obtain a large and compact effective receptive field (ERF). Based on the advantage of Divide-and-Conquer’s parallel computing, we use Parallel Acceleration to speed up DC-Net, allowing it to achieve competitive performance on six LR-SOD and five HR-SOD datasets under high efficiency (60 FPS and 55 FPS). Codes and results are available: https://github.com/PiggyJerry/DC-Net.

中文摘要: 在本文中,我們將分割和征服引入顯著對象檢測(SOD)任務(wù),以使模型能夠?qū)W習(xí)用于預(yù)測顯著圖的先驗知識。我們設(shè)計了一種新的網(wǎng)絡(luò),即分治網(wǎng)絡(luò)(DC Net),它使用兩個編碼器來解決有助于預(yù)測最終顯著性圖的不同子任務(wù),這里是預(yù)測寬度為4的邊緣圖和顯著對象的位置圖,然后將具有不同語義信息的特征圖聚合到解碼器中,以預(yù)測最終的顯著性圖。DC Net的解碼器由我們新設(shè)計的兩級殘差嵌套ASPP(ResASPP 2 ^{2} 2)模塊組成,該模塊能夠用少量卷積運算捕獲大量不同尺度的特征,并具有始終保持高分辨率和能夠獲得大而緊湊的有效感受野(ERF)的優(yōu)點?;贒ivide and Conquer并行計算的優(yōu)勢,我們使用并行加速來加速DCNet,使其能夠在6個LR-SOD和5個HR-SOD數(shù)據(jù)集上以高效(60 FPS和55 FPS)的速度獲得有競爭力的性能。代碼和結(jié)果可用:https://github.com/PiggyJerry/DC-Net.

[Downlink:]http://arxiv.org/abs/2305.14955v3

[GitHub:]https://github.com/PiggyJerry/DC-Net.|


標題: Actor-agnostic Multi-label Action Recognition with Multi-modal Query

作者: Anindya Mondal, Sauradip Nag, Joaquin M Prada

摘要: Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called ‘a(chǎn)ctor-agnostic multi-modal multi-label action recognition,’ which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.

中文摘要: 由于參與者之間固有的拓撲和明顯的差異,現(xiàn)有的動作識別方法通常是特定于參與者的。這需要特定于演員的姿勢估計(例如,人類與動物),導(dǎo)致繁瑣的模型設(shè)計復(fù)雜性和高昂的維護成本。此外,他們通常專注于單獨學(xué)習(xí)視覺模態(tài)和單標簽分類,而忽略了其他可用的信息來源(例如,類名文本)和多個動作的同時發(fā)生。為了克服這些限制,我們提出了一種新的方法,稱為“行動者不可知的多模式多標簽動作識別”,它為包括人類和動物在內(nèi)的各種行動者提供了統(tǒng)一的解決方案。我們在基于變換器的對象檢測框架(例如,DETR)中進一步提出了一種新的多模式語義查詢網(wǎng)絡(luò)(MSQNet)模型,其特征是利用視覺和文本模式更好地表示動作類。消除了特定于演員的模型設(shè)計是一個關(guān)鍵優(yōu)勢,因為它完全消除了對演員姿勢估計的需要。在五個公開可用的基準上進行的廣泛實驗表明,我們的MSQNet在人類和動物的單標簽和多標簽動作識別任務(wù)上始終優(yōu)于現(xiàn)有技術(shù)的演員特定替代品高達50%。代碼可在https://github.com/mondalanindya/MSQNet.

[Downlink:]http://arxiv.org/abs/2307.10763v3

[GitHub:]https://github.com/mondalanindya/MSQNet.|


標題: Generalizing Medical Image Representations via Quaternion Wavelet
Networks

作者: Luigi Sigillo, Eleonora Grassucci, Aurelio Uncini

摘要: Neural network generalizability is becoming a broad research field due to the increasing availability of datasets from different sources and for various tasks. This issue is even wider when processing medical data, where a lack of methodological standards causes large variations being provided by different imaging centers or acquired with various devices and cofactors. To overcome these limitations, we introduce a novel, generalizable, data- and task-agnostic framework able to extract salient features from medical images. The proposed quaternion wavelet network (QUAVE) can be easily integrated with any pre-existing medical image analysis or synthesis task, and it can be involved with real, quaternion, or hypercomplex-valued models, generalizing their adoption to single-channel data. QUAVE first extracts different sub-bands through the quaternion wavelet transform, resulting in both low-frequency/approximation bands and high-frequency/fine-grained features. Then, it weighs the most representative set of sub-bands to be involved as input to any other neural model for image processing, replacing standard data samples. We conduct an extensive experimental evaluation comprising different datasets, diverse image analysis, and synthesis tasks including reconstruction, segmentation, and modality translation. We also evaluate QUAVE in combination with both real and quaternion-valued models. Results demonstrate the effectiveness and the generalizability of the proposed framework that improves network performance while being flexible to be adopted in manifold scenarios and robust to domain shifts. The full code is available at: https://github.com/ispamm/QWT.

中文摘要: 由于來自不同來源和用于各種任務(wù)的數(shù)據(jù)集的可用性不斷增加,神經(jīng)網(wǎng)絡(luò)的可推廣性正成為一個廣泛的研究領(lǐng)域。在處理醫(yī)學(xué)數(shù)據(jù)時,這個問題更為廣泛,因為缺乏方法標準導(dǎo)致不同成像中心提供的或使用各種設(shè)備和輔因子獲取的數(shù)據(jù)存在很大差異。為了克服這些限制,我們引入了一種新穎的、可推廣的、數(shù)據(jù)和任務(wù)不可知的框架,能夠從醫(yī)學(xué)圖像中提取顯著特征。所提出的四元數(shù)小波網(wǎng)絡(luò)(QUAVE)可以很容易地與任何預(yù)先存在的醫(yī)學(xué)圖像分析或合成任務(wù)集成,并且它可以涉及實數(shù)、四元數(shù)或超復(fù)值模型,將其應(yīng)用推廣到單通道數(shù)據(jù)。QUAVE首先通過四元數(shù)小波變換提取不同的子帶,得到低頻/近似帶和高頻/細粒度特征。然后,它對要涉及的最具代表性的子帶集進行加權(quán),作為用于圖像處理的任何其他神經(jīng)模型的輸入,取代標準數(shù)據(jù)樣本。我們進行了廣泛的實驗評估,包括不同的數(shù)據(jù)集、不同的圖像分析和合成任務(wù),包括重建、分割和模態(tài)翻譯。我們還結(jié)合實數(shù)和四元數(shù)值模型來評估QUAVE。結(jié)果證明了所提出的框架的有效性和可推廣性,該框架提高了網(wǎng)絡(luò)性能,同時在多種場景中靈活采用,并對域轉(zhuǎn)移具有魯棒性。完整代碼位于:https://github.com/ispamm/QWT.

[Downlink:]http://arxiv.org/abs/2310.10224v2

[GitHub:]https://github.com/ispamm/QWT.|


標題: WidthFormer: Toward Efficient Transformer-based BEV View Transformation

作者: Chenhongyi Yang, Tianwei Lin, Lichao Huang

摘要: In this work, we present WidthFormer, a novel transformer-based Bird’s-Eye-View (BEV) 3D detection method tailored for real-time autonomous-driving applications. WidthFormer is computationally efficient, robust and does not require any special engineering effort to deploy. In this work, we propose a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information, which enables our model to generate high-quality BEV representations with only a single transformer decoder layer. This mechanism is also beneficial for existing sparse 3D object detectors. Inspired by the recently-proposed works, we further improve our model’s efficiency by vertically compressing the image features when serving as attention keys and values. We also introduce two modules to compensate for potential information loss due to feature compression. Experimental evaluation on the widely-used nuScenes 3D object detection benchmark demonstrates that our method outperforms previous approaches across different 3D detection architectures. More importantly, our model is highly efficient. For example, when using 256×704 input images, it achieves 1.5 ms and 2.8 ms latency on NVIDIA 3090 GPU and Horizon Journey-5 edge computing chips, respectively. Furthermore, WidthFormer also exhibits strong robustness to different degrees of camera perturbations. Our study offers valuable insights into the deployment of BEV transformation methods in real-world, complex road environments. Code is available at https://github.com/ChenhongyiYang/WidthFormer .

中文摘要: 在這項工作中,我們提出了WidthFormer,這是一種新的基于變壓器的鳥瞰圖(BEV)3D檢測方法,專為實時自動駕駛應(yīng)用而設(shè)計。WidthFormer在計算上高效、穩(wěn)健,不需要任何特殊的工程部署。在這項工作中,我們提出了一種新的3D位置編碼機制,該機制能夠準確封裝3D幾何信息,使我們的模型能夠僅用單個變換器解碼器層生成高質(zhì)量的BEV表示。這種機制對于現(xiàn)有的稀疏3D對象檢測器也是有益的。受最近提出的工作的啟發(fā),我們通過在充當注意力鍵和值時垂直壓縮圖像特征,進一步提高了模型的效率。我們還介紹了兩個模塊來補償由于特征壓縮而造成的潛在信息損失。對廣泛使用的nuScenes 3D對象檢測基準的實驗評估表明,我們的方法在不同的3D檢測架構(gòu)中優(yōu)于以前的方法。更重要的是,我們的模型非常高效。例如,當使用 256 × 704 256\times 704 256×704輸入圖像時,它在NVIDIA 3090 GPU和Horizon Journey-5邊緣計算芯片上分別實現(xiàn)了1.5毫秒和2.8毫秒的延遲。此外,WidthFormer對不同程度的相機擾動也表現(xiàn)出較強的魯棒性。我們的研究為在現(xiàn)實世界復(fù)雜的道路環(huán)境中部署純電動汽車轉(zhuǎn)換方法提供了寶貴的見解。代碼位于https://github.com/ChenhongyiYang/WidthFormer.

[Downlink:]http://arxiv.org/abs/2401.03836v3

[GitHub:]https://github.com/ChenhongyiYang/WidthFormer|
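
One ingredient the abstract highlights is vertically compressing the image features before they act as attention keys and values, which shrinks the token count for the BEV cross-attention. The snippet below shows that compression step alone (mean-pooling over the height axis plus a linear projection) in PyTorch; the shapes and the pooling choice are assumptions for illustration, not the released WidthFormer code.

```python
import torch
import torch.nn as nn

class VerticalCompressor(nn.Module):
    """Collapse the height axis of camera feature maps so each image column
    becomes a single token usable as an attention key/value."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Linear(channels, channels)

    def forward(self, feats):
        # feats: (batch, channels, height, width)
        pooled = feats.mean(dim=2)              # (batch, channels, width)
        tokens = pooled.permute(0, 2, 1)        # (batch, width, channels)
        return self.proj(tokens)                # one token per image column

if __name__ == "__main__":
    feats = torch.randn(2, 256, 16, 44)         # e.g. a backbone feature map
    tokens = VerticalCompressor(256)(feats)
    print(tokens.shape)                         # torch.Size([2, 44, 256])
```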


標題: IODeep: an IOD for the introduction of deep learning in the DICOM
standard

作者: Salvatore Contino, Luca Cruciata, Orazio Gambino

摘要: Background and Objective: In recent years, Artificial Intelligence (AI) and in particular Deep Neural Networks (DNN) became a relevant research topic in biomedical image segmentation due to the availability of more and more data sets along with the establishment of well known competitions. Despite the popularity of DNN based segmentation on the research side, these techniques are almost unused in the daily clinical practice even if they could support effectively the physician during the diagnostic process. Apart from the issues related to the explainability of the predictions of a neural model, such systems are not integrated in the diagnostic workflow, and a standardization of their use is needed to achieve this goal. Methods: This paper presents IODeep a new DICOM Information Object Definition (IOD) aimed at storing both the weights and the architecture of a DNN already trained on a particular image dataset that is labeled as regards the acquisition modality, the anatomical region, and the disease under investigation. Results: The IOD architecture is presented along with a DNN selection algorithm from the PACS server based on the labels outlined above, and a simple PACS viewer purposely designed for demonstrating the effectiveness of the DICOM integration, while no modifications are required on the PACS server side. Also a service based architecture in support of the entire workflow has been implemented. Conclusion: IODeep ensures full integration of a trained AI model in a DICOM infrastructure, and it is also enables a scenario where a trained model can be either fine-tuned with hospital data or trained in a federated learning scheme shared by different hospitals. In this way AI models can be tailored to the real data produced by a Radiology ward thus improving the physician decision making process. Source code is freely available at https://github.com/CHILab1/IODeep.git

中文摘要: 背景和目的:近年來,隨著越來越多的數(shù)據(jù)集的可用性和眾所周知的競爭的建立,人工智能(AI),特別是深度神經(jīng)網(wǎng)絡(luò)(DNN)成為生物醫(yī)學(xué)圖像分割的相關(guān)研究課題。盡管基于DNN的分割在研究方面很受歡迎,但這些技術(shù)在日常臨床實踐中幾乎沒有使用過,即使它們可以在診斷過程中有效地支持醫(yī)生。除了與神經(jīng)模型預(yù)測的可解釋性相關(guān)的問題外,這些系統(tǒng)沒有集成在診斷工作流程中,需要對其使用進行標準化以實現(xiàn)這一目標。方法:本文向IODeep提出了一種新的DICOM信息對象定義(IOD),旨在存儲已經(jīng)在特定圖像數(shù)據(jù)集上訓(xùn)練的DNN的權(quán)重和架構(gòu),該圖像數(shù)據(jù)集被標記為采集模式、解剖區(qū)域和正在研究的疾病。結(jié)果:IOD體系結(jié)構(gòu)以及基于上述標簽的PACS服務(wù)器的DNN選擇算法,以及一個專門設(shè)計用于演示DICOM集成有效性的簡單PACS查看器,而不需要在PACS服務(wù)器端進行修改。此外,還實現(xiàn)了支持整個工作流的基于服務(wù)的體系結(jié)構(gòu)。結(jié)論:IODeep確保了訓(xùn)練后的人工智能模型在DICOM基礎(chǔ)設(shè)施中的完全集成,它還實現(xiàn)了一種場景,即訓(xùn)練后的模型可以根據(jù)醫(yī)院數(shù)據(jù)進行微調(diào),也可以在不同醫(yī)院共享的聯(lián)合學(xué)習(xí)方案中進行訓(xùn)練。通過這種方式,人工智能模型可以根據(jù)放射科病房產(chǎn)生的真實數(shù)據(jù)進行定制,從而改進醫(yī)生的決策過程。源代碼免費提供于https://github.com/CHILab1/IODeep.git

[Downlink:]http://arxiv.org/abs/2311.16163v2

[GitHub:]https://github.com/CHILab1/IODeep.git|

