[Paper Review] ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering (arXiv, 2026)

Date: 2026.06.20 Updated: 2026.06.20

Category: NR

Yikai Zhu, Kunfeng Chen, Qihuang Zhong, Juhua Liu, and Bo Du. 2026. ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering. arXiv:2605.28093 [cs.CL]

1. Problem Statements

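This paper addresses RAG-based multi-hop question answering, in which an evidence chain spanning multiple documents is retrieved from an external textual corpus and reasoned over step by step to produce an answer. The input is a question $q$ and a set of verifiable textual evidence units $\mathcal{C}=\{c_i\}_{i=1}^{N}$; the output is a final answer generated on the basis of the retrieved evidence and the intermediate answers.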

ConRAG extracts entities, attributes, and relation triples from the corpus to build an evidence-grounded KG, so it does use external knowledge, but the generation context consists of the linked source textual evidence units rather than the graph objects themselves. The goal is to pass sub-task dependencies explicitly on the query side and to combine relation, entity-anchor, and text-evidence signals on the corpus side in a single evidence-unit ranking space, so that multi-hop evidence is recovered accurately.

2. Limitations of Existing Works

[Fragile query-side retrieval and dependency handling] Reasoning-based RAG methods decompose a complex question into sub-tasks and then perform retrieval and generation sequentially, but the retrieval at each step still relies on semantic similarity to document chunks. As a result, evidence can be missed when the required relation or bridge entity is not directly similar to the query, and early retrieval errors propagate to later steps. In addition, simple sequential prompting does not express sub-task dependencies well enough, which makes it hard to use an earlier intermediate answer as a constraint on the next retrieval query.

[Misaligned heterogeneous evidence on the corpus side] Graph-based RAG methods retrieve different kinds of graph objects such as entities, relations, summaries, and text units, but these differ in granularity, so their scores are hard to compare directly. They therefore need task-specific aggregation or a separate context assembly step, and mixing several object types into the LLM context as-is yields an inefficient context. Because multi-hop QA requires judging whether different views support the same textual evidence, the heterogeneous signals have to be aligned to a common ranking unit.

Given these limitations of reasoning-based and graph-based RAG methods, a natural question arises: “Is there a way to combine only the strengths of the two?” Answering it involves two challenges.

Challenge 1: How to better plan retrieval and reasoning on the query side
Challenge 2: How to make full use of heterogeneous evidence on the corpus side

3. Methodology

ConRAG, shown in Figure 2, consists of Connection, Constraint, Consensus, and generation stages.

Offline, Connection converts the corpus into an evidence-grounded knowledge graph and three kinds of retrieval index. Online, the question is decomposed into dependency-aware sub-questions, and Constraint binds earlier answers into the next sub-question. Consensus retrieves with the bound query in the relation, entity-anchor, and text-evidence views, then converts the results into the source evidence-unit space and ranks them.

The top-$k$ evidence at each step is used to generate an intermediate answer, and the final answer is produced from the execution trace and acquired information of all steps.

3.1. Connection: Evidence-Grounded KG

Connection: Evidence-Grounded KG is the offline stage that grounds the structural information of the corpus in its source text. The input is the evidence-unit set $\mathcal{C}$; entities, attributes, and relation triples are extracted from each $c_i$ to build the following graph.

$$\mathcal G = (\mathcal V, \mathcal E), \quad e = (v_s, r, v_t) \in \mathcal E$$

Here $\mathcal V$ is the set of entity nodes and $\mathcal E$ is the set of relation edges. Each edge $e$ represents an explicit relation $r$ directed from source entity $v_s$ to target entity $v_t$. The purpose of this definition is to represent the structural information extracted from the corpus as a directed entity-relation graph.

$$\text{src} (o) = \{ c_i \in \mathcal C \mid o \in X (c_i) \}$$

The link between each graph object and its source evidence is defined as above. Here $X (c_i)$ is the set of graph objects extracted from $c_i$, and $o \in \mathcal V \cup \mathcal E$ is an entity node or a relation edge. Thus $\text{src} (o)$ returns every textual evidence unit from which the object was extracted. This mapping is needed so that graph retrieval results are not passed to generation as heterogeneous graph objects but are instead reduced back to verifiable source text. If an object appears in several evidence units, all of them are included in the mapping. The output of Connection is $\mathcal{G}$ together with the src mapping, which the subsequent Consensus stage uses to convert relation and entity hits into evidence-unit candidates.
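The src mapping can be read as an inverted index from graph objects to evidence units. The following is a minimal sketch under that reading, not the authors' implementation: the function name `build_src_mapping`, the dictionary layout of the extraction output, and the toy evidence ids are all assumptions made for illustration.

```python
from collections import defaultdict


def build_src_mapping(extractions):
    """Invert X(c_i) into src(o).

    extractions: dict mapping an evidence-unit id to the graph objects
    extracted from it (hypothetical layout). Entities are strings and
    relation edges are (source, relation, target) tuples.
    """
    src = defaultdict(set)
    for evidence_id, objects in extractions.items():
        for obj in objects:
            # An object found in several units keeps all of them.
            src[obj].add(evidence_id)
    return dict(src)


# Toy extraction output, invented for this example.
extractions = {
    "c1": ["Marie Curie", "Warsaw", ("Marie Curie", "born_in", "Warsaw")],
    "c2": ["Warsaw", ("Warsaw", "capital_of", "Poland")],
}
src = build_src_mapping(extractions)
```

In this toy case the entity `"Warsaw"` maps back to both `c1` and `c2`, while each relation edge maps to the single unit it was extracted from.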

3.2. Consensus: Evidence-Aligned Retrieval

Consensus: Evidence-Aligned Retrieval is the core retrieval stage: it compares heterogeneous retrieval signals in a unified evidence-unit space and gives priority to evidence supported by several views at once.

The input is a retrieval query $x$, the evidence-grounded KG, the source mapping, and the three retrieval indexes; during online multi-step execution, the bound query $q^{\star}_i$ produced by Constraint serves as $x$. The output is the top-$k$ evidence units selected by consensus-enhanced score, and this evidence becomes the sub-generation context for that step. ConRAG first builds three complementary retrieval views.

$$\mathcal{R}_x = \text{TopK}_{e \in \mathcal{E}} \operatorname{sim}(x, \tau_r(e))$$

$$\mathcal{A}_x = \text{TopK}_{v \in \mathcal{V}} \operatorname{sim}(x, \tau_a(v))$$

$$\mathcal{T}_x = \text{TopK}_{c \in \mathcal{C}} \operatorname{sim}(x, c)$$

$\mathcal R_x$ retrieves textualized relation edges to capture explicit relational clues, and $\mathcal{A}_x$ retrieves entity representations that include the entity name, type, attributes, and local relation summaries. $\mathcal{T}_x$ retrieves evidence units directly, performing ordinary semantic matching. $\operatorname{sim}$ is the cosine similarity between dense embeddings.
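All three views share the same operation: rank the items of one index by cosine similarity to the query and keep the top $k$. The sketch below illustrates that operation on toy two-dimensional vectors. It assumes embeddings are already computed and stored in a plain dictionary; the names `cosine`, `top_k`, and `relation_index` are hypothetical, and a real system would use an embedding model and a vector index instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_vec, index, k):
    """Return the k (item_id, score) pairs with the highest similarity.

    index: dict mapping an item id to its embedding (hypothetical layout).
    """
    scored = [(item, cosine(query_vec, vec)) for item, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy relation index; the same call would serve the entity and text indexes.
relation_index = {"e1": [1.0, 0.0], "e2": [0.0, 1.0], "e3": [1.0, 1.0]}
hits = top_k([1.0, 0.0], relation_index, k=2)
```

Running the same `top_k` over a relation index, an entity index, and an evidence-unit index would give the analogues of $\mathcal{R}_x$, $\mathcal{A}_x$, and $\mathcal{T}_x$.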

Relation and entity hits are not used for generation directly; they are converted into source evidence units as follows.

$$\mathcal C_x = \mathcal T_x \cup \text{src}(\mathcal R_x) \cup \text{src} (\mathcal A_x)$$

From this point on, the ranking unit is therefore $c\in\mathcal{C}_x$ rather than a graph object. In the experiments this unit is a passage, and the paper does not specify any further splitting of a passage into sliding-window chunks or sections.
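The candidate pool is a plain set union once relation and entity hits are mapped back through src. A minimal sketch, assuming the src mapping is a dictionary from graph objects to sets of evidence ids (the function name `candidate_pool` and the toy data are illustrative only):

```python
def candidate_pool(text_hits, relation_hits, entity_hits, src):
    """C_x = T_x ∪ src(R_x) ∪ src(A_x), expressed over evidence ids."""
    pool = set(text_hits)
    for obj in list(relation_hits) + list(entity_hits):
        # Objects with no recorded source contribute nothing.
        pool |= src.get(obj, set())
    return pool


# Toy src mapping and hits, invented for this example.
src = {("A", "r", "B"): {"c1"}, "A": {"c1", "c3"}}
pool = candidate_pool(
    text_hits=["c2"],
    relation_hits=[("A", "r", "B")],
    entity_hits=["A"],
    src=src,
)
```

Here `c1` enters the pool through both the relation view and the entity view, which is exactly the kind of overlap the later consensus bonus rewards.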

$$s_r(c) = h(e^\star) + \beta \sum_{e \in \mathcal{R}_x^c \setminus \{e^\star\}} h(e)$$

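The relation view distinguishes the strongest edge supporting an evidence unit from the remaining edges. Here $h(e) = \text{sim}(x, \tau_r (e))$, and $e^\star$ is the highest-scoring edge among the relations linked to $c$. The strongest relation is used as the primary signal and the rest are scaled down by $\beta$, which prevents weak relations from accumulating excessively.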
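As a small illustration of the relation score, the sketch below takes the similarity values $h(e)$ of the edges linked to one evidence unit and applies the "strongest edge plus discounted remainder" rule. The function name and the value `beta=0.5` are assumptions for the example; the review does not state the value used in the paper.

```python
def relation_score(edge_scores, beta=0.5):
    """s_r(c) = h(e*) + beta * sum of the remaining h(e).

    edge_scores: h(e) values for the retrieved edges linked to c.
    beta: discount on non-primary edges (0.5 is a placeholder value).
    """
    if not edge_scores:
        return 0.0
    ordered = sorted(edge_scores, reverse=True)
    primary, rest = ordered[0], ordered[1:]
    return primary + beta * sum(rest)


score = relation_score([0.4, 0.8, 0.2], beta=0.5)  # 0.8 + 0.5 * (0.4 + 0.2)
```

With `beta=1` this reduces to a plain sum over edges, and with `beta=0` only the strongest edge counts.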

The entity-anchor score is computed so as to reduce the influence of hub entities.

$$s_a(c) = \sum_{v \in \mathcal{A}_x^c} \operatorname{sim}(x, \tau_a(v)) \delta(v)$$

$$\delta(v) = \begin{cases} 1, & \deg(v) \leq 1, \\ \dfrac{1}{1 + \log \deg(v)}, & \deg(v) > 1. \end{cases}$$
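The formula above can be sketched in a few lines. This is an illustration under assumptions: the logarithm is taken to be the natural log (the base is not stated here), and the input format of `(similarity, degree)` pairs and the function names are hypothetical.

```python
import math


def hub_discount(degree):
    """delta(v): 1 for degree <= 1, otherwise 1 / (1 + log(degree))."""
    if degree <= 1:
        return 1.0
    return 1.0 / (1.0 + math.log(degree))  # natural log assumed


def entity_anchor_score(hits):
    """s_a(c) as a discounted sum over entity hits linked to c.

    hits: list of (similarity, degree) pairs, one per entity in A_x^c.
    """
    return sum(sim * hub_discount(deg) for sim, deg in hits)


# A specific entity (degree 1) and a hub entity (degree 100).
score = entity_anchor_score([(0.6, 1), (0.4, 100)])
```

In this example the hub entity contributes far less than its raw similarity of 0.4, while the degree-1 entity keeps its full score.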

$\delta (v)$ reduces the contribution of generic entities with high degree. The text-evidence score $s_t(c)$ uses the dense retrieval score. The final evidence score is as follows.

$$\mathbf{s}(c) = [\bar{s}_r(c), \bar{s}_a(c)p(c), \bar{s}_t(c)]^\top$$

$$\mathrm{Score}(c) = \boldsymbol{\alpha}^\top \mathbf{s}(c)b(c)$$

$$b(c) = 1 + \lambda \frac{\max(0, m(c) - 1)}{2}$$

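$p(c) = 1 / (1 + \log (\text{degree}(c)))$ limits evidence-level structural noise, and $\boldsymbol{\alpha} = [\alpha_r, \alpha_a, \alpha_t]^\top$ controls the weight of each view. $m(c)$ is the number of views that assign $c$ a positive score, and $b(c)$ is a consensus bonus that raises the score as more views support the same evidence. With three views, $b(c)$ therefore takes the values $1$, $1+\lambda/2$, and $1+\lambda$ for $m(c)=1,2,3$. The final top-$k$ evidence is passed to Sub Generation. Retrieval uses off-the-shelf all-MiniLM-L6-v2.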
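To make the effect of the consensus bonus concrete, here is a minimal sketch of the fusion step. Several points are assumptions rather than details taken from the paper: the barred scores are treated as already-normalized per-view inputs, $m(c)$ is counted on those inputs before $p(c)$ is applied, and the defaults `alpha=(1, 1, 1)` and `lam=0.5` are placeholder values.

```python
def consensus_score(s_r, s_a, s_t, p, alpha=(1.0, 1.0, 1.0), lam=0.5):
    """Score(c) = alpha^T s(c) * b(c) for one evidence unit.

    s_r, s_a, s_t: normalized relation, entity-anchor, text scores (assumed given).
    p: evidence-level penalty p(c) applied to the entity-anchor view.
    alpha, lam: view weights and bonus strength (placeholder defaults).
    """
    # m(c): number of views that give this unit a positive score.
    m = sum(1 for s in (s_r, s_a, s_t) if s > 0)
    bonus = 1.0 + lam * max(0, m - 1) / 2.0
    vector = (s_r, s_a * p, s_t)
    fused = sum(a * s for a, s in zip(alpha, vector))
    return fused * bonus


single_view = consensus_score(0.0, 0.0, 0.6, p=1.0)  # only the text view fires
three_views = consensus_score(0.2, 0.2, 0.2, p=1.0)  # same weighted sum, full consensus
```

Both units have the same weighted sum before the bonus, but the one supported by all three views is multiplied by $1+\lambda$ and ranks higher.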

3.3. Constraint: Slot-Bound Execution

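Constraint passes earlier intermediate answers into subsequent retrieval queries as explicit constraints. The input is the question, the initially retrieved context, and the answers from earlier steps; the output is a bound query in which the dependency slots have been replaced with actual answers.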

First, the following dependency-aware plan is generated.

$$\mathcal{P} = \{(i, q_i, D_i)\}_{i=0}^{M-1}$$

$q_i$ is a sub-question, and $D_i$ is the set of identifiers of the earlier steps on which it depends. An unresolved argument is written as <dep:j>. Immediately before execution, the following binding map is built.

$$\Theta_i = \{\langle dep : j \rangle \mapsto \hat{a}_j \mid j \in D_i\}$$

$$q_i^\star = \text{Bind}(q_i, \Theta_i)$$

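$\hat a_j$ is the answer from an earlier step, and $\operatorname{Bind}$ replaces each <dep:j> with the corresponding answer. Because $q_i^\star$ becomes the query for the next Consensus retrieval, a bridge entity or value confirmed in an earlier hop directly narrows the subsequent search. This stage is carried out with LLM prompt-based decomposition and deterministic binding; there is no separately trained module or loss.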
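Because binding is deterministic, it can be illustrated with a simple string substitution. The sketch below is one possible reading, not the paper's code: the `<dep:j>` slot syntax is taken from the text, while the regular expression, the function signature, and the example question are assumptions.

```python
import re

# Matches slots such as <dep:0>, capturing the step identifier.
DEP_SLOT = re.compile(r"<dep:(\d+)>")


def bind(sub_question, deps, answers):
    """Replace each <dep:j> with the answer of step j, for j in deps.

    deps: D_i, the step ids this sub-question depends on.
    answers: dict mapping earlier step ids to their intermediate answers.
    """
    theta = {j: answers[j] for j in deps}  # binding map Theta_i

    def fill(match):
        step = int(match.group(1))
        # Slots outside D_i are left untouched.
        return theta.get(step, match.group(0))

    return DEP_SLOT.sub(fill, sub_question)


bound = bind("Where was <dep:0> born?", {0}, {0: "Marie Curie"})
```

The bound string, rather than the slotted template, is what would be sent to retrieval, so the earlier answer acts as a concrete search constraint.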

3.4. Sub Generation and Final Generation

Sub Generation takes $q_i^\star$, the top-$k$ evidence, the original question, and the acquired information so far as input, and produces an intermediate answer $\hat{a}_i$ and updated acquired information. $\hat{a}_i$ fills later dependency slots, while the acquired information accumulates the grounded facts needed to answer the final question.

Final Generation takes the original question, the full execution trace, and the accumulated acquired information as input and outputs a direct final answer. Graph objects are never placed in the context directly. The experiments use GPT-4o-mini or Gemma-4-31B as an off-the-shelf generator.
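The online loop described in Sections 3.2 to 3.4 can be summarized as a short control-flow sketch. Everything here is illustrative: retrieval and generation are passed in as plain callables, and the toy lambdas stand in for the Consensus retriever and the LLM generators, which are not reproduced. The names `run_plan`, `retrieve`, `sub_generate`, and `final_generate` are hypothetical.

```python
import re


def run_plan(question, plan, retrieve, sub_generate, final_generate):
    """Execute a dependency-aware plan step by step.

    plan: list of (step_id, sub_question, dep_ids) in execution order.
    """
    answers, acquired, trace = {}, [], []
    for step_id, sub_q, deps in plan:
        def fill(match, deps=deps):
            j = int(match.group(1))
            return answers[j] if j in deps else match.group(0)

        bound = re.sub(r"<dep:(\d+)>", fill, sub_q)      # slot binding
        evidence = retrieve(bound)                       # top-k evidence units
        answer, facts = sub_generate(question, bound, evidence, acquired)
        answers[step_id] = answer
        acquired.extend(facts)
        trace.append((step_id, bound, answer))
    return final_generate(question, trace, acquired), trace


# Toy stand-ins for retrieval and generation (invented for this example).
lookup = {"Who discovered polonium?": "Marie Curie",
          "Where was Marie Curie born?": "Warsaw"}
retrieve = lambda query: [query]
sub_generate = lambda q, bound, ev, acq: (lookup[bound], [f"{bound} -> {lookup[bound]}"])
final_generate = lambda q, trace, acq: trace[-1][2]

plan = [(0, "Who discovered polonium?", set()),
        (1, "Where was <dep:0> born?", {0})]
final_answer, trace = run_plan("Where was the discoverer of polonium born?",
                               plan, retrieve, sub_generate, final_generate)
```

The second step only becomes answerable after the first answer is bound into its slot, which is the dependency that the plan makes explicit.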

4. Experiments

4.1. Main Results

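Table 1 evaluates ConRAG's end-to-end performance on three benchmarks with two backbones, and ConRAG records the best result on every dataset and metric. With Gemma-4-31B, its average LLM-Acc is 66.8, above Youtu-GraphRAG's 64.0 and 26.9 percentage points higher than Vanilla RAG Top-5 at 39.9. On the more complex MuSiQue, Str-Acc with GPT-4o-mini is 40.6, which is 10.2 percentage points above LogicRAG's 30.4. This indicates that the gains are not tied to a particular generator and that multi-view evidence alignment and query-side constraints are effective for recovering complex evidence chains.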

4.2. Ablation Study

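Figure 3 compares the effects of consensus-enhanced fusion and slot-bound execution. Full ConRAG reaches an average LLM-Acc of 58.7, which drops to 57.6 when the consensus bonus is removed and to 55.6 when slot binding is removed. The drop from removing slot binding is larger (3.1 versus 1.1 points), which supports the conclusion that passing intermediate answers into subsequent queries as constraints matters for multi-step retrieval.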

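Table 2 compares combinations of retrieval views. Among single views, text-evidence is the highest with an average LLM-Acc of 57.1, but using relation, entity-anchor, and text-evidence together raises it to 58.7. Settings that use only the structural views score lower than those that include text, so relation and entity-anchor act as signals that complement dense textual retrieval rather than replace it.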

4.3. Efficiency Analysis

Table 3 compares online inference efficiency, excluding offline graph construction. ConRAG's average query time is 5.64 seconds, shorter than LogicRAG's 9.83 seconds and lower than most graph-based RAG methods. This is because relation, entity-anchor, and text-evidence retrieval run in parallel and graph objects are not added to the LLM context directly. Its average token count is 2,517.2, fewer than several structure-enhanced methods but more than Vanilla RAG, LinearRAG, and LogicRAG, so there is a trade-off between accuracy and token cost.

5. Conclusion

Contribution

[Joint optimization of query and corpus] Unlike existing approaches that improve only one of query-side reasoning or corpus-side graph structure, ConRAG optimizes both within a single multi-hop RAG framework. It combines dependency-aware query execution with evidence-grounded corpus organization and performs the final retrieval in a unified evidence-unit space.
[Slot-bound execution and multi-view consensus retrieval] ConRAG binds intermediate answers directly into the dependency slots of subsequent sub-questions, strengthening constraint propagation across multi-round retrieval. At the same time, it aligns relation, entity-anchor, and text-evidence signals to source evidence units and grants a consensus bonus to evidence supported by several views.

Limitations

[Limited evaluation scope] Because of compute constraints, the evaluation covers only three English multi-hop QA benchmarks and two LLM backbones.
[Quality and scalability of offline KG construction] ConRAG depends on offline evidence-grounded knowledge graph construction and multi-view index building. Information extraction errors, noisy relations, and entity normalization mistakes can affect graph quality, and how to build the graph efficiently and update it incrementally for large-scale or frequently changing corpora has not been explored in depth.
[Propagation of intermediate answer errors] Slot-bound execution depends on the accuracy of the intermediate answers produced in earlier steps. If the answer to an early sub-question is wrong or incomplete, an incorrect constraint is inserted into subsequent retrieval queries, which can degrade both retrieval and the final answer.

Meaningful