
Cloudera Connectors

Cloudera Connectors provide integration with components of the Hadoop ecosystem, enabling data access across its storage and processing systems. These connectors let you incorporate big data processing capabilities into your RAG applications and leverage distributed computing for large-scale data analysis and retrieval.

1.1 Solr

Solr Configuration

Solr Connector Interface

Description

The Solr connector provides access to Cloudera's implementation of Apache Solr, a highly scalable and reliable distributed search platform. This connector allows you to perform sophisticated full-text searches across large volumes of data stored in a Solr index, making it ideal for applications requiring fast, faceted search capabilities with relevance ranking.

Use Cases

  • Implementing high-performance text search within large document repositories
  • Building faceted navigation for complex data exploration
  • Creating intelligent search applications with relevance scoring
  • Extracting specific data subsets based on complex search criteria
  • Augmenting RAG applications with domain-specific knowledge stored in Solr indexes

Input Configuration

  • Solr Base URL: The base URL for your Solr service (required)

    Example: https://your-cluster.cloudera.com:8983/solr

  • Collection Name: Name of the Solr collection to query (required)

    Example: documents, products, logs

  • Username: Authentication username for secure Solr (required for secured clusters)

    Example: solr_user

  • Password: Authentication password (required for secured clusters)

    Example: ••••••••

  • Search Query: Solr query string using Solr query syntax (required)

    Example: content:hadoop AND category:analytics

Output

The connector returns documents matching the search query with their fields and relevance scores in JSON format.

Example Output:

{
  "response": {
    "numFound": 42,
    "start": 0,
    "docs": [
      {
        "id": "doc123",
        "title": "Introduction to Hadoop Ecosystem",
        "content": "Hadoop is an open-source framework...",
        "category": "analytics",
        "last_modified": "2023-04-12T10:23:45Z",
        "score": 8.924531
      },
      {
        "id": "doc456",
        "title": "Big Data Processing with Hadoop",
        "content": "Processing large datasets requires...",
        "category": "analytics",
        "last_modified": "2023-05-18T14:37:22Z",
        "score": 7.651298
      }
    ]
  }
}

Implementation Notes

  • Create efficient Solr queries with proper field selection to minimize data transfer
  • Use faceting for multi-dimensional data exploration
  • Implement pagination for large result sets using start and rows parameters
  • Consider field boosting to improve relevance for specific fields
  • Enable request handler caching for frequently executed queries
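
A minimal sketch of a query against the connector's underlying Solr service, assuming HTTP Basic authentication and the standard /select request handler; the URL, collection, and credentials below are placeholders for the configuration values above:

import requests

# Placeholder values for the configuration fields described above.
SOLR_BASE_URL = "https://your-cluster.cloudera.com:8983/solr"
COLLECTION = "documents"

def solr_search(query, fields="id,title,content,category,score", rows=10, start=0):
    """Run a query against the collection's /select handler and return parsed JSON."""
    resp = requests.get(
        f"{SOLR_BASE_URL}/{COLLECTION}/select",
        params={"q": query, "fl": fields, "rows": rows, "start": start, "wt": "json"},
        auth=("solr_user", "********"),  # only needed on secured clusters
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

results = solr_search("content:hadoop AND category:analytics")
for doc in results["response"]["docs"]:
    print(doc["id"], doc.get("title"), doc["score"])

Paging through large result sets is then a matter of advancing start in steps of rows, as noted above.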

1.2 HDFS File

HDFS File Configuration

HDFS File Connector Interface

Description

The HDFS File connector enables direct access to individual files stored in the Hadoop Distributed File System (HDFS). This connector allows you to read and process specific files from your Hadoop cluster, supporting various file formats and providing flexible chunking options for processing large files. Access is secured through Apache Knox, the gateway Cloudera uses for authenticating and accessing Hadoop services.

Use Cases

  • Retrieving and processing individual data files stored in HDFS
  • Extracting specific content from large log files or datasets
  • Processing structured or semi-structured data files for RAG applications
  • Accessing historical data archives for analytical purposes
  • Ingesting configuration files or reference data from a Hadoop environment

Input Configuration

  • Knox Base URL: Base URL for the Knox gateway (required)

    Example: https://your-cluster-knox.cloudera.com:8443/gateway

  • HDFS File Path: Path to the target file in HDFS (required)

    Example: /user/data/reports/annual_report_2023.csv

  • Knox Username: Knox authentication username (required)

    Example: hdfs_user

  • Knox Password: Knox authentication password (required)

    Example: ••••••••

  • Chunking Strategy: Method for dividing the file into processable chunks (required)

    Example: by_line, by_character, by_paragraph

  • Max Characters per Chunk: Maximum size of each chunk in characters (required)

    Example: 1000, 2048, 4096

  • Include Original Elements: Whether to preserve original data structure (optional)

    Example: true, false

  • Silent Errors: Whether to continue processing despite errors (optional)

    Example: true, false

Output

The connector returns the file content, divided into chunks according to the specified chunking strategy.

Example Output:

{
  "chunks": [
    {
      "content": "This is the first chunk of content from the HDFS file...",
      "metadata": {
        "source": "/user/data/reports/annual_report_2023.csv",
        "chunk_index": 0,
        "chunk_size": 982
      }
    },
    {
      "content": "This is the second chunk of content continuing from the previous section...",
      "metadata": {
        "source": "/user/data/reports/annual_report_2023.csv",
        "chunk_index": 1,
        "chunk_size": 1024
      }
    }
  ],
  "total_chunks": 8,
  "file_size": 8192,
  "file_type": "text/csv"
}

Implementation Notes

  • Select appropriate chunking strategy based on file type and content structure
  • Consider file size when setting chunk parameters to avoid memory issues
  • For structured files (CSV, JSON), align chunks with record boundaries when possible
  • Implement error handling for corrupt or incomplete files
  • Use Knox's security features to ensure proper authentication and authorization
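
As a rough illustration of what this connector does, the sketch below reads a file over WebHDFS through Knox and applies a simple by_character chunking pass. The topology name (default), the chunking helper, and the chunk metadata fields are assumptions for illustration, not the connector's actual internals:

import requests

KNOX_BASE_URL = "https://your-cluster-knox.cloudera.com:8443/gateway"
TOPOLOGY = "default"                      # assumption: WebHDFS is exposed by this topology
FILE_PATH = "/user/data/reports/annual_report_2023.csv"
AUTH = ("hdfs_user", "********")

def read_hdfs_file():
    """Open the file through the Knox WebHDFS endpoint and return its text content."""
    url = f"{KNOX_BASE_URL}/{TOPOLOGY}/webhdfs/v1{FILE_PATH}"
    resp = requests.get(url, params={"op": "OPEN"}, auth=AUTH, timeout=60)
    resp.raise_for_status()
    return resp.text

def chunk_by_characters(text, max_chars=1000):
    """Naive by_character chunking: fixed-size windows of at most max_chars characters."""
    return [
        {"content": text[i:i + max_chars],
         "metadata": {"source": FILE_PATH, "chunk_index": n, "chunk_size": len(text[i:i + max_chars])}}
        for n, i in enumerate(range(0, len(text), max_chars))
    ]

chunks = chunk_by_characters(read_hdfs_file(), max_chars=1000)
print(len(chunks), "chunks")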

1.3 Hive SQL

Hive SQL Configuration

Hive SQL Connector Interface

Description

The Hive SQL connector provides SQL-based access to data stored in Apache Hive, the Hadoop-based data warehouse included in Cloudera's platform. This connector allows you to execute HiveQL queries to retrieve structured data from Hive tables, supporting complex SQL operations including joins, aggregations, and filters. It uses JDBC to establish secure connections to your Hive server.

Use Cases

  • Querying large structured datasets stored in Hive tables
  • Performing complex data analysis using SQL operations
  • Extracting aggregated business metrics for reporting
  • Joining multiple data sources stored in Hive
  • Building data pipelines that incorporate Hive-stored data

Input Configuration

  • JDBC URL: JDBC connection string for Hive server (required)

    Example: jdbc:hive2://your-cluster.cloudera.com:10000

  • Username: Database authentication username (required)

    Example: hive_user

  • Password: Database authentication password (required)

    Example: ••••••••

  • Database: Target Hive database name (required)

    Example: sales_data, customer_analytics

  • HTTP Path: HTTP path for the Hive service (required for HTTP transport mode)

    Example: /cliservice

  • SSL Enabled: Whether to use SSL for secure connection (optional)

    Example: true, false

  • JDBC Driver Path: Path to the Hive JDBC driver file (required)

    Example: /path/to/hive-jdbc-3.1.3000.jar

  • JDBC Driver Class: Fully qualified name of the JDBC driver class (required)

    Example: org.apache.hive.jdbc.HiveDriver

  • Query: HiveQL query to execute (required)

    Example: SELECT customer_id, product_name, purchase_date, amount FROM sales WHERE purchase_date > '2023-01-01'

Output

The connector returns the query results with column names and row data in a structured format.

Example Output:

{
  "metadata": {
    "columnNames": ["customer_id", "product_name", "purchase_date", "amount"],
    "columnTypes": ["BIGINT", "VARCHAR", "DATE", "DECIMAL"]
  },
  "data": [
    {
      "customer_id": 1245,
      "product_name": "Premium Analytics Package",
      "purchase_date": "2023-03-15",
      "amount": 1299.99
    },
    {
      "customer_id": 8763,
      "product_name": "Data Processing Service",
      "purchase_date": "2023-04-22",
      "amount": 849.50
    }
  ],
  "rowCount": 2,
  "executionTime": 3.45,
  "status": "success"
}

Implementation Notes

  • Optimize Hive queries with proper filtering to reduce processing time
  • Use LIMIT clauses for large result sets to avoid memory constraints
  • Consider partitioned tables for better query performance
  • Implement timeout handling for long-running queries
  • Use parameterized queries to prevent SQL injection vulnerabilities
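
A minimal sketch of issuing an equivalent query over JDBC from Python with the jaydebeapi package; how the connector itself combines the HTTP path, SSL, and database settings into the JDBC URL is an assumption here, and all connection values are placeholders:

import jaydebeapi  # pip install jaydebeapi (requires a JVM)

# Assumed JDBC URL layout for HTTP transport mode with SSL enabled.
JDBC_URL = ("jdbc:hive2://your-cluster.cloudera.com:10000/sales_data"
            ";transportMode=http;httpPath=cliservice;ssl=true")
DRIVER_CLASS = "org.apache.hive.jdbc.HiveDriver"
DRIVER_JAR = "/path/to/hive-jdbc-3.1.3000.jar"

conn = jaydebeapi.connect(DRIVER_CLASS, JDBC_URL, ["hive_user", "********"], DRIVER_JAR)
try:
    cur = conn.cursor()
    # Parameter binding (the "?" placeholder) avoids SQL injection, per the notes above.
    cur.execute(
        "SELECT customer_id, product_name, purchase_date, amount "
        "FROM sales WHERE purchase_date > ? LIMIT 100",
        ("2023-01-01",),
    )
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    cur.close()
finally:
    conn.close()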

1.4 HBase Knox

HBase Knox Configuration

HBase Knox Connector Interface

Description

The HBase Knox connector provides secure access to Apache HBase, the non-relational, distributed database built on HDFS and included in Cloudera's platform. This connector allows you to query and retrieve data from HBase tables using row keys and column qualifiers, with authentication handled through Apache Knox. It's especially useful for applications requiring low-latency access to large, sparse datasets.

Use Cases

  • Retrieving time-series data stored in HBase
  • Accessing user profile information in real-time applications
  • Querying IoT device data or sensor readings
  • Extracting specific records from large, sparse tables
  • Building applications that require fast, random access to big data

Input Configuration

  • Base URL: Knox gateway URL for HBase access (required)

    Example: https://your-cluster-knox.cloudera.com:8443/gateway/default

  • Username: Knox authentication username (required)

    Example: hbase_user

  • Password: Knox authentication password (required)

    Example: ••••••••

  • Table Name: HBase table to query (required)

    Example: customer_profiles, web_logs, sensor_data

  • Search Query: Query parameters for HBase (required)

    Example: {"rowKey": "user123", "columns": ["info:name", "info:email", "data:last_login"]}

Output

The connector returns the HBase cell values matching the query parameters, organized by row and column.

Example Output:

{
  "Row": [
    {
      "key": "dXNlcjEyMw==",  // Base64 encoded "user123"
      "Cell": [
        {
          "column": "aW5mbzpuYW1l",  // Base64 encoded "info:name"
          "timestamp": 1691498231000,
          "$": "Sm9obiBTbWl0aA==",  // Base64 encoded "John Smith"
        },
        {
          "column": "aW5mbzplbWFpbA==",  // Base64 encoded "info:email"
          "timestamp": 1691498231000,
          "$": "am9obi5zbWl0aEBleGFtcGxlLmNvbQ==",  // Base64 encoded "john.smith@example.com"
        },
        {
          "column": "ZGF0YTpsYXN0X2xvZ2lu",  // Base64 encoded "data:last_login"
          "timestamp": 1694152631000,
          "$": "MjAyMy0wOS0wOFQxNDozMDozMVo=",  // Base64 encoded "2023-09-08T14:30:31Z"
        }
      ]
    }
  ]
}

Implementation Notes

  • Design efficient row keys for optimal data retrieval performance
  • Request only the specific columns needed to minimize data transfer
  • Use row key prefixes and filters for range queries
  • Handle Base64 encoding/decoding for binary data
  • Implement retries for transient connection issues
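
A minimal sketch of the same lookup against the HBase REST API proxied by Knox, assuming the topology exposes the service under <Base URL>/hbase; it also shows the Base64 decoding step called out above. Table, row key, and credentials are placeholders:

import base64
import requests

BASE_URL = "https://your-cluster-knox.cloudera.com:8443/gateway/default"
AUTH = ("hbase_user", "********")

def hbase_get_row(table, row_key, columns=None):
    """Fetch one row via the Knox-proxied HBase REST service and decode Base64 values."""
    url = f"{BASE_URL}/hbase/{table}/{row_key}"
    if columns:
        url += "/" + ",".join(columns)   # restrict the response to specific columns
    resp = requests.get(url, auth=AUTH, headers={"Accept": "application/json"}, timeout=30)
    resp.raise_for_status()

    decoded = {}
    for row in resp.json().get("Row", []):
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            value = base64.b64decode(cell["$"]).decode("utf-8")
            decoded[column] = value
    return decoded

print(hbase_get_row("customer_profiles", "user123",
                    ["info:name", "info:email", "data:last_login"]))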

1.5 Impala ODBC

Impala ODBC Configuration

Input Configuration

  • DSN: Data source name
  • Username: Authentication username
  • Password: Authentication password
  • Query: SQL query to execute
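
A minimal sketch of running such a query from Python with pyodbc, assuming the Cloudera Impala ODBC driver is installed and a DSN named impala_dsn (a placeholder) has already been configured:

import pyodbc  # requires the Impala ODBC driver and a configured DSN

# Placeholder connection values matching the configuration fields above.
conn = pyodbc.connect("DSN=impala_dsn;UID=impala_user;PWD=********", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id LIMIT 50")
columns = [d[0] for d in cursor.description]
for row in cursor.fetchall():
    print(dict(zip(columns, row)))
conn.close()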

1.6 HDFS Directory

HDFS Directory Configuration

Input Configuration

  • Knox Base URL: Base URL for the Knox gateway
  • HDFS Path: Path to the target directory in HDFS
  • Knox Username: Knox authentication username
  • Knox Password: Knox authentication password
  • Enable Parallel Processing: Whether to process the directory's files in parallel
  • Silent Errors: Whether to continue processing despite errors
  • Chunking Strategy: Method for dividing each file into processable chunks
  • Max Characters per Chunk: Maximum size of each chunk in characters
  • Include Original Elements: Whether to preserve original data structure
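
A rough sketch of a parallel directory ingest done by hand against WebHDFS through Knox; the topology name, directory path, and worker count are placeholders, and errors are swallowed to mimic the Silent Errors option:

import requests
from concurrent.futures import ThreadPoolExecutor

KNOX_BASE_URL = "https://your-cluster-knox.cloudera.com:8443/gateway"
TOPOLOGY = "default"                    # assumption: WebHDFS is exposed by this topology
DIR_PATH = "/user/data/reports"         # hypothetical directory
AUTH = ("hdfs_user", "********")

def list_directory(path):
    """Return FileStatus entries for a directory via the WebHDFS LISTSTATUS operation."""
    url = f"{KNOX_BASE_URL}/{TOPOLOGY}/webhdfs/v1{path}"
    resp = requests.get(url, params={"op": "LISTSTATUS"}, auth=AUTH, timeout=60)
    resp.raise_for_status()
    return resp.json()["FileStatuses"]["FileStatus"]

def read_file(path):
    """Open one file through WebHDFS and return its text; skip it on error (silent mode)."""
    url = f"{KNOX_BASE_URL}/{TOPOLOGY}/webhdfs/v1{path}"
    try:
        resp = requests.get(url, params={"op": "OPEN"}, auth=AUTH, timeout=120)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None                      # Silent Errors: continue despite failures

files = [f"{DIR_PATH}/{s['pathSuffix']}" for s in list_directory(DIR_PATH) if s["type"] == "FILE"]
with ThreadPoolExecutor(max_workers=4) as pool:   # Enable Parallel Processing
    contents = [c for c in pool.map(read_file, files) if c is not None]
print(f"Read {len(contents)} of {len(files)} files")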

Note: Ensure proper Cloudera cluster configuration and Knox gateway access before using these connectors. For production environments, implement appropriate security measures including SSL encryption, Kerberos authentication, and regular credential rotation.