LightRAG解读

2025-08-19

介绍

LightRAG跟GraphRAG类似，是通过将文档处理成知识图谱，然后针对知识图谱进行检索的一套实现，所以接下来我们大概看下其流程。

知识图谱生成

这里和GraphRAG基本保持一致，通过prompt来进行生成。


---Goal---
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
Use English as output language.

---Steps---
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name
- entity_type: One of the following types: [organization,person,geo,event,category]
- entity_description: Provide a comprehensive description of the entity's attributes and activities *based solely on the information present in the input text*. **Do not infer or hallucinate information not explicitly stated.** If the text provides insufficient information to create a comprehensive description, state "Description not available in text."
Format each entity as ("entity"<|><entity_name><|><entity_type><|><entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"<|><source_entity><|><target_entity><|><relationship_description><|><relationship_keywords><|><relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"<|><high_level_keywords>)

4. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **##** as the list delimiter.

5. When finished, output <|COMPLETE|>

######################
---Examples---
######################
Example 1:

Entity_types: [person, technology, mission, organization, location]
Text:
```
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.

Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. "If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us."

The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.

It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
```

Output:
("entity"<|>"Alex"<|>"person"<|>"Alex is a character who experiences frustration and is observant of the dynamics among other characters.")##
("entity"<|>"Taylor"<|>"person"<|>"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective.")##
("entity"<|>"Jordan"<|>"person"<|>"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device.")##
("entity"<|>"Cruz"<|>"person"<|>"Cruz is associated with a vision of control and order, influencing the dynamics among other characters.")##
("entity"<|>"The Device"<|>"technology"<|>"The Device is central to the story, with potential game-changing implications, and is revered by Taylor.")##
("relationship"<|>"Alex"<|>"Taylor"<|>"Alex is affected by Taylor's authoritarian certainty and observes changes in Taylor's attitude towards the device."<|>"power dynamics, perspective shift"<|>7)##
("relationship"<|>"Alex"<|>"Jordan"<|>"Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."<|>"shared goals, rebellion"<|>6)##
("relationship"<|>"Taylor"<|>"Jordan"<|>"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."<|>"conflict resolution, mutual respect"<|>8)##
("relationship"<|>"Jordan"<|>"Cruz"<|>"Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."<|>"ideological conflict, rebellion"<|>5)##
("relationship"<|>"Taylor"<|>"The Device"<|>"Taylor shows reverence towards the device, indicating its importance and potential impact."<|>"reverence, technological significance"<|>9)##
("content_keywords"<|>"power dynamics, ideological conflict, discovery, rebellion")<|COMPLETE|>
#############################
Example 2:

Entity_types: [company, index, commodity, market_trend, economic_policy, biological]
Text:
```
Stock markets faced a sharp downturn today as tech giants saw significant declines, with the Global Tech Index dropping by 3.4% in midday trading. Analysts attribute the selloff to investor concerns over rising interest rates and regulatory uncertainty.

Among the hardest hit, Nexon Technologies saw its stock plummet by 7.8% after reporting lower-than-expected quarterly earnings. In contrast, Omega Energy posted a modest 2.1% gain, driven by rising oil prices.

Meanwhile, commodity markets reflected a mixed sentiment. Gold futures rose by 1.5%, reaching $2,080 per ounce, as investors sought safe-haven assets. Crude oil prices continued their rally, climbing to $87.60 per barrel, supported by supply constraints and strong demand.

Financial experts are closely watching the Federal Reserve's next move, as speculation grows over potential rate hikes. The upcoming policy announcement is expected to influence investor confidence and overall market stability.
```

Output:
("entity"<|>"Global Tech Index"<|>"index"<|>"The Global Tech Index tracks the performance of major technology stocks and experienced a 3.4% decline today.")##
("entity"<|>"Nexon Technologies"<|>"company"<|>"Nexon Technologies is a tech company that saw its stock decline by 7.8% after disappointing earnings.")##
("entity"<|>"Omega Energy"<|>"company"<|>"Omega Energy is an energy company that gained 2.1% in stock value due to rising oil prices.")##
("entity"<|>"Gold Futures"<|>"commodity"<|>"Gold futures rose by 1.5%, indicating increased investor interest in safe-haven assets.")##
("entity"<|>"Crude Oil"<|>"commodity"<|>"Crude oil prices rose to $87.60 per barrel due to supply constraints and strong demand.")##
("entity"<|>"Market Selloff"<|>"market_trend"<|>"Market selloff refers to the significant decline in stock values due to investor concerns over interest rates and regulations.")##
("entity"<|>"Federal Reserve Policy Announcement"<|>"economic_policy"<|>"The Federal Reserve's upcoming policy announcement is expected to impact investor confidence and market stability.")##
("relationship"<|>"Global Tech Index"<|>"Market Selloff"<|>"The decline in the Global Tech Index is part of the broader market selloff driven by investor concerns."<|>"market performance, investor sentiment"<|>9)##
("relationship"<|>"Nexon Technologies"<|>"Global Tech Index"<|>"Nexon Technologies' stock decline contributed to the overall drop in the Global Tech Index."<|>"company impact, index movement"<|>8)##
("relationship"<|>"Gold Futures"<|>"Market Selloff"<|>"Gold prices rose as investors sought safe-haven assets during the market selloff."<|>"market reaction, safe-haven investment"<|>10)##
("relationship"<|>"Federal Reserve Policy Announcement"<|>"Market Selloff"<|>"Speculation over Federal Reserve policy changes contributed to market volatility and investor selloff."<|>"interest rate impact, financial regulation"<|>7)##
("content_keywords"<|>"market downturn, investor sentiment, commodities, Federal Reserve, stock performance")<|COMPLETE|>
#############################
Example 3:

Entity_types: [economic_policy, athlete, event, location, record, organization, equipment]
Text:
```
At the World Athletics Championship in Tokyo, Noah Carter broke the 100m sprint record using cutting-edge carbon-fiber spikes.
```

Output:
("entity"<|>"World Athletics Championship"<|>"event"<|>"The World Athletics Championship is a global sports competition featuring top athletes in track and field.")##
("entity"<|>"Tokyo"<|>"location"<|>"Tokyo is the host city of the World Athletics Championship.")##
("entity"<|>"Noah Carter"<|>"athlete"<|>"Noah Carter is a sprinter who set a new record in the 100m sprint at the World Athletics Championship.")##
("entity"<|>"100m Sprint Record"<|>"record"<|>"The 100m sprint record is a benchmark in athletics, recently broken by Noah Carter.")##
("entity"<|>"Carbon-Fiber Spikes"<|>"equipment"<|>"Carbon-fiber spikes are advanced sprinting shoes that provide enhanced speed and traction.")##
("entity"<|>"World Athletics Federation"<|>"organization"<|>"The World Athletics Federation is the governing body overseeing the World Athletics Championship and record validations.")##
("relationship"<|>"World Athletics Championship"<|>"Tokyo"<|>"The World Athletics Championship is being hosted in Tokyo."<|>"event location, international competition"<|>8)##
("relationship"<|>"Noah Carter"<|>"100m Sprint Record"<|>"Noah Carter set a new 100m sprint record at the championship."<|>"athlete achievement, record-breaking"<|>10)##
("relationship"<|>"Noah Carter"<|>"Carbon-Fiber Spikes"<|>"Noah Carter used carbon-fiber spikes to enhance performance during the race."<|>"athletic equipment, performance boost"<|>7)##
("relationship"<|>"World Athletics Federation"<|>"100m Sprint Record"<|>"The World Athletics Federation is responsible for validating and recognizing new sprint records."<|>"sports regulation, record certification"<|>9)##
("content_keywords"<|>"athletics, sprinting, record-breaking, sports technology, competition")<|COMPLETE|>
#############################

#############################
---Real Data---
######################
Entity_types: [organization,person,geo,event,category]
Text:
<这里是chunk>
######################
Output:

显而易见在一个这么长的prompt里面再插入文本，会导致LLM理解能力变弱，一个特征是提取的relation会更少，所以它这里又进行了二次提取。

MANY entities and relationships were missed in the last extraction. Please find only the missing entities and relationships from previous text.

---Remember Steps---

1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name
- entity_type: One of the following types: [organization,person,geo,event,category]
- entity_description: Provide a comprehensive description of the entity's attributes and activities *based solely on the information present in the input text*. **Do not infer or hallucinate information not explicitly stated.** If the text provides insufficient information to create a comprehensive description, state "Description not available in text."
Format each entity as ("entity"<|><entity_name><|><entity_type><|><entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"<|><source_entity><|><target_entity><|><relationship_description><|><relationship_keywords><|><relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"<|><high_level_keywords>)

4. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **##** as the list delimiter.

5. When finished, output <|COMPLETE|>

---Output---

Add new entities and relations below using the same format, and do not include entities and relations that have been previously extracted. :

即prompt1 + LLM output + prompt2（上面这个）来提取更多entity和relations.

至此完成知识图谱的构建。

我觉得腾讯开源的WeKnora提示词也许会更好，实体是实体，关系是关系。

同时这里也暴露出来些许问题：

知识图谱构建效果，有没有漏提错提三元组。
实体进行融合。
整体来讲，知识图谱构建效果，将影响后续的使用效果。

查询

作者这里提供了Native Search、Local Search、Global Search和Hybrid Search四种查询方式，由于Native Search是原始chunk查询方式，Hybrid Search是Local和Global的融合，所以我们接下来单看这两块。

用户query解析

LightRAG将对用户query解析分成两块，即Low Keywords和High Keywords,这两者区别在哪，即前者更注重具体实体，后者更关注全局表达。对应代码如下:


hl_keywords, ll_keywords = await get_keywords_from_query(
        query, query_param, global_config, hashing_kv
    )

下面是他的提示词：

---Role---
You are an expert keyword extractor, specializing in analyzing user queries for a Retrieval-Augmented Generation (RAG) system. Your purpose is to identify both high-level and low-level keywords in the user's query that will be used for effective document retrieval.

---Goal---
Given a user query, your task is to extract two distinct types of keywords:
1. **high_level_keywords**: for overarching concepts or themes, capturing user's core intent, the subject area, or the type of question being asked.
2. **low_level_keywords**: for specific entities or details, identifying the specific entities, proper nouns, technical jargon, product names, or concrete items.

---Instructions & Constraints---
1. **Output Format**: Your output MUST be a valid JSON object and nothing else. Do not include any explanatory text, markdown code fences (like ```json), or any other text before or after the JSON. It will be parsed directly by a JSON parser.
2. **Source of Truth**: All keywords must be derived directly from or be a direct interpretation of the user query.
3. **Concise & Meaningful**: Keywords should be concise words or meaningful phrases. Prioritize multi-word phrases when they represent a single concept. For example, from "latest financial report of Apple Inc.", you should extract "latest financial report" and "Apple Inc." rather than "latest", "financial", "report", and "Apple".
4. **No Overlap**: A keyword or its core concept should not appear in both the high-level and low-level lists.
5. **Handle Edge Cases**: For queries that are too simple, vague, or nonsensical (e.g., "hello", "ok", "asdfghjkl"), you must return a JSON object with empty lists for both keyword types.

---Examples---
Example 1:

Query: "How does international trade influence global economic stability?"

Output:
{
  "high_level_keywords": ["International trade", "Global economic stability", "Economic impact"],
  "low_level_keywords": ["Trade agreements", "Tariffs", "Currency exchange", "Imports", "Exports"]
}


Example 2:

Query: "What are the environmental consequences of deforestation on biodiversity?"

Output:
{
  "high_level_keywords": ["Environmental consequences", "Deforestation", "Biodiversity loss"],
  "low_level_keywords": ["Species extinction", "Habitat destruction", "Carbon emissions", "Rainforest", "Ecosystem"]
}


Example 3:

Query: "What is the role of education in reducing poverty?"

Output:
{
  "high_level_keywords": ["Education", "Poverty reduction", "Socioeconomic development"],
  "low_level_keywords": ["School access", "Literacy rates", "Job training", "Income inequality"]
}



---Real Data---
User Query: 安徽芜湖奇瑞生产的新能源车辆其在北美销售量有多少？

---Output---


ChatGPT结果	DeepSeek结果

新能源车辆甚至是新能源难道不应该在low keywords里么，所以这里第一个问题在于LLM对其的理解能力将影响下游查询效果。当然如果将上述prompt更改成中文我觉得DS也会更好。

接下来根据搜索模式来走对应流程，如下图，不同模式最终都会返回entities和relations,接下来看这两种不同模式下的搜索策略。

Local Search

entities：通过low keywords对entities vector库进行召回。即topK(cos_sim(get_embed('安徽芜湖, 奇瑞, 奇瑞新能源汽车, 北美'), entities vector db))。
relations: 拿到上述entities,接着获取一跳节点所组成的边。即[graph.list_edges(node) for node in entities]。

接下来就是排序，entities根据其degree进行排序，relations根据src degree + tgt degree和edge weight（通过LLM在生成知识图谱时获取）进行综合rank。

Global Search

relations：通过high keywords对relations vector库进行召回。即topK(cos_sim(get_embed('新能源汽车, 销售量, 北美市场'), relations vector db))。
entities：根据上面获取到的边自然而然获取到对应src和tgt所对应的entities。

根据entities和relations获取有关chunks

如果是Local Search,Global那边的relation和entity就没有，反之亦然，但如果是Hybrid,则是将这两者对应的entities和relations进行融合。

简短理解就是获取到的entities包括了chunk_ids, 表明某个entity从哪些chunks里获取，进行汇集，然后和用户原始query进行cos_sim排序。
那同理，某个relation也包含了从哪个chunk所获取到的，然后使用原始query进行排序。

LLM回答

基于上述三者，包装成一个大的prompt来进行回复。其prompt如下：

-----Entities(KG)-----

```json
{entities_str}
```

-----Relationships(KG)-----

```json
{relations_str}
```

-----Document Chunks(DC)-----

```json
[]
```

至此我们大致捋清了其实现方式。

总结

在上面每部分已经总结出来的问题除外，我认为其实现还可以加入多跳查询来满足更复杂的链路关系，不过这里也就见仁见智了～