AI-Powered Unstructured Data Processing

Vast amounts of valuable data exist in unstructured formats such as articles, reports, legal documents, and web content. For blockchain applications, this data holds significant potential, but it must first be transformed into a structured format that smart contracts can interpret and act on. Pulling unstructured data and publishing it on-chain as structured data unlocks new real-world use cases, ranging from regulatory compliance and price feeds to election monitoring and contract automation. Without this transformation, blockchains remain disconnected from critical, real-time insights available only in unstructured formats.

To bridge this gap, IntelliX leverages Large Language Models (LLMs) to transform unstructured data into actionable, structured formats. These models, integrated within the Data Processor Node, allow developers to fetch data from various unstructured sources, such as news websites or legal texts, and convert it into structured data ready for blockchain publishing. LLMs handle the complexities of interpreting free-form text, reconciling discrepancies, and ensuring that the final output adheres to a user-defined schema, making it fit for on-chain use.

Below is a sample configuration demonstrating how IntelliX can extract the winner of an election from multiple news sources, process the unstructured data using an LLM, and format the result in a structured JSON schema:

sources:
  - id: bbc
    type: url
    url: "https://www.bbc.com/news/election"
  
  - id: cnn
    type: url
    url: "https://www.cnn.com/elections/results"
  
  - id: nyt
    type: url
    url: "https://www.nytimes.com/interactive/2024/11/05/us/elections/results.html"

update_frequency: 3600

processors:
  - type: llm
    prompt: |
      Below is an article with presidential election results from a public news website.
      
      Two candidates are running for president:
       1 - Donald Trump
       2 - Kamala Harris

      Your task is to determine whether the winner of the election has already
      been announced. If a winner has been announced, you should extract the
      identifier (1 or 2) of the winning candidate and the total votes of the
      winning and losing candidates. You must return the result as a JSON object
      with the format described below:
      
      {
        "winner_announced": true/false,
        "winning_candidate": {
          "id": (1 or 2),
          "total_votes": total votes for the winning candidate
        },
        "losing_candidate": {
          "id": (1 or 2),
          "total_votes": total votes for the losing candidate
        }
      }

publishers:
  - type: bitlayer
    contract_address: "0x1234567890abcdef1234567890abcdef12345678"
  
  - type: bsc
    contract_address: "0xabcdefabcdefabcdefabcdefabcdefabcdefabcdef"
  
  - type: merlin
    contract_address: "0xabcdef123456abcdef123456abcdef123456abcdef"

validators:
  type: threshold
  threshold_value: "80%"  # at least 80% of node results must match

output_schema:
  $schema: "http://json-schema.org/draft-07/schema#"
  title: "Election Result"
  type: "object"
  properties:
    winner_announced:
      type: "boolean"
      description: "Indicates whether a winner has been announced."
    winning_candidate:
      type: "object"
      properties:
        id:
          type: "integer"
          enum: [1, 2]
          description: "ID of the winning candidate (1 or 2)."
        total_votes:
          type: "integer"
          minimum: 0
          description: "Total votes received by the winning candidate."
      required: ["id", "total_votes"]
    losing_candidate:
      type: "object"
      properties:
        id:
          type: "integer"
          enum: [1, 2]
          description: "ID of the losing candidate (1 or 2)."
        total_votes:
          type: "integer"
          minimum: 0
          description: "Total votes received by the losing candidate."
      required: ["id", "total_votes"]
  required: ["winner_announced", "winning_candidate", "losing_candidate"]
  additionalProperties: false
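
To make the role of output_schema concrete, the sketch below checks a raw LLM response against the same constraints as the schema above (required keys, no extra top-level properties, candidate id of 1 or 2, non-negative integer vote counts). It is a simplified, hand-written check using only the Python standard library, not IntelliX's actual validation code; the function names check_candidate and check_election_result are hypothetical, and a production node would more likely rely on a full JSON Schema validator.

```python
import json


def check_candidate(obj, name):
    """Return a list of problems found in one candidate object."""
    if not isinstance(obj, dict):
        return [f"{name} must be an object"]
    problems = []
    for key in ("id", "total_votes"):
        if key not in obj:
            problems.append(f"{name}.{key} is required")
    # type(...) is int deliberately rejects booleans, which Python treats as ints.
    if "id" in obj and (type(obj["id"]) is not int or obj["id"] not in (1, 2)):
        problems.append(f"{name}.id must be 1 or 2")
    if "total_votes" in obj:
        votes = obj["total_votes"]
        if type(votes) is not int or votes < 0:
            problems.append(f"{name}.total_votes must be a non-negative integer")
    return problems


def check_election_result(raw):
    """Parse a raw LLM response; return (data, problems). data is None on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top level must be an object"]

    allowed = {"winner_announced", "winning_candidate", "losing_candidate"}
    problems = []
    for key in sorted(allowed - set(data)):
        problems.append(f"{key} is required")
    for key in sorted(set(data) - allowed):
        # Mirrors additionalProperties: false at the top level.
        problems.append(f"unexpected property {key}")
    if "winner_announced" in data and type(data["winner_announced"]) is not bool:
        problems.append("winner_announced must be a boolean")
    for name in ("winning_candidate", "losing_candidate"):
        if name in data:
            problems.extend(check_candidate(data[name], name))
    return (data if not problems else None), problems


if __name__ == "__main__":
    sample = (
        '{"winner_announced": true,'
        ' "winning_candidate": {"id": 1, "total_votes": 10},'
        ' "losing_candidate": {"id": 2, "total_votes": 5}}'
    )
    print(check_election_result(sample))
```

A response that fails this kind of check would be rejected before it reaches the publishers, so only schema-conformant data is considered for on-chain publishing.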

Aggregation Logic, Validation, and Data Trustworthiness

To ensure the trustworthiness of the data, multiple nodes in the IntelliX network process the same source data. LLM responses are typically not deterministic, because the model samples each next token at random from a probability distribution. This variability poses a challenge for oracle networks, which require consistent results across all nodes. To address this, IntelliX minimizes randomness by always selecting the highest-probability next token during inference (greedy decoding), so that nodes running the same model on the same input produce the same result.
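
The difference between sampling and always picking the most likely token can be illustrated with a toy example. The sketch below is not IntelliX's inference code: the three-element logits list is made-up data standing in for a model's next-token scores, and the function names are hypothetical. It only shows why greedy selection is repeatable while sampling is not.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(logits):
    """Return the index of the highest-scoring token (lowest index wins ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sampled_token(logits, rng):
    """Return a token index drawn at random according to the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens

    # Greedy selection depends only on the scores, so every call agrees.
    print("greedy:", [greedy_token(logits) for _ in range(5)])

    # Sampling depends on the random state, so repeated calls can disagree.
    rng = random.Random()
    print("sampled:", [sampled_token(logits, rng) for _ in range(5)])
```

Because greedy_token has no random input, two nodes that compute the same scores will always choose the same token, which is the property the oracle network relies on.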

In addition to minimizing randomness, IntelliX introduces validation logic to reconcile data across nodes. This user-defined logic adds checks for data consistency and accuracy. For example, the validators section can include a threshold rule, under which a minimum percentage of nodes must return matching results for the data to be considered valid. In the configuration above, the threshold is set to 80%, meaning that at least 80% of the nodes must produce the same result for it to pass validation.
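
A threshold rule of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the IntelliX implementation: it assumes each node returns a JSON-compatible dictionary, that "matching" means exact equality after key ordering is normalized, and that the rule passes when the most common result is shared by at least the given percentage of nodes. The name validate_threshold is hypothetical.

```python
import json
from collections import Counter


def validate_threshold(node_results, threshold_percent=80):
    """Return the agreed result if enough nodes match, otherwise None."""
    if not node_results:
        return None
    # Serialize with sorted keys so that key order does not affect equality.
    canonical = [json.dumps(result, sort_keys=True) for result in node_results]
    value, count = Counter(canonical).most_common(1)[0]
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if count * 100 >= threshold_percent * len(canonical):
        return json.loads(value)
    return None


if __name__ == "__main__":
    agreed = {"winner_announced": False}
    outlier = {"winner_announced": True}
    # Four of five nodes agree, which meets an 80% threshold exactly.
    print(validate_threshold([agreed, agreed, agreed, agreed, outlier]))
    # Three of five nodes agree, which falls short of 80%.
    print(validate_threshold([agreed, agreed, agreed, outlier, outlier]))
```

Comparing canonical serializations keeps the check independent of how each node orders the keys in its output.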


Last updated 7 months ago
