
Cloudera Connectors

Cloudera Connectors provide integration with components of the Hadoop ecosystem, enabling data access across its storage and processing systems. These connectors let you incorporate big data processing capabilities into your RAG applications and leverage distributed computing for large-scale data analysis and retrieval.

1.1 Solr

Solr Configuration

Solr Connector Interface

Description

The Solr connector provides access to Cloudera's implementation of Apache Solr, a highly scalable and reliable distributed search platform. This connector allows you to perform sophisticated full-text searches across large volumes of data stored in a Solr index, making it ideal for applications requiring fast, faceted search capabilities with relevance ranking.

Use Cases

  • Implementing high-performance text search within large document repositories
  • Building faceted navigation for complex data exploration
  • Creating intelligent search applications with relevance scoring
  • Extracting specific data subsets based on complex search criteria
  • Augmenting RAG applications with domain-specific knowledge stored in Solr indexes

Input Configuration

  • Solr Base URL: The base URL for your Solr service (required)

    Example: https://your-cluster.cloudera.com:8983/solr

  • Collection Name: Name of the Solr collection to query (required)

    Example: documents, products, logs

  • Username: Authentication username for secure Solr (required for secured clusters)

    Example: solr_user

  • Password: Authentication password (required for secured clusters)

    Example: ••••••••

  • Search Query: Solr query string using Solr query syntax (required)

    Example: content:hadoop AND category:analytics

Output

The connector returns documents matching the search query with their fields and relevance scores in JSON format.

Example Output:

{
  "response": {
    "numFound": 42,
    "start": 0,
    "docs": [
      {
        "id": "doc123",
        "title": "Introduction to Hadoop Ecosystem",
        "content": "Hadoop is an open-source framework...",
        "category": "analytics",
        "last_modified": "2023-04-12T10:23:45Z",
        "score": 8.924531
      },
      {
        "id": "doc456",
        "title": "Big Data Processing with Hadoop",
        "content": "Processing large datasets requires...",
        "category": "analytics",
        "last_modified": "2023-05-18T14:37:22Z",
        "score": 7.651298
      }
    ]
  }
}

Implementation Notes

  • Create efficient Solr queries with proper field selection to minimize data transfer
  • Use faceting for multi-dimensional data exploration
  • Implement pagination for large result sets using start and rows parameters
  • Consider field boosting to improve relevance for specific fields
  • Enable request handler caching for frequently executed queries
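
A minimal sketch of a query against the connector's underlying Solr service, assuming HTTP Basic authentication and the standard /select request handler; the URL, collection, and credentials below are placeholders for the configuration values above:

import requests

# Placeholder values for the configuration fields described above.
SOLR_BASE_URL = "https://your-cluster.cloudera.com:8983/solr"
COLLECTION = "documents"

def solr_search(query, fields="id,title,content,category,score", rows=10, start=0):
    """Run a query against the collection's /select handler and return parsed JSON."""
    resp = requests.get(
        f"{SOLR_BASE_URL}/{COLLECTION}/select",
        params={"q": query, "fl": fields, "rows": rows, "start": start, "wt": "json"},
        auth=("solr_user", "********"),  # only needed on secured clusters
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

results = solr_search("content:hadoop AND category:analytics")
for doc in results["response"]["docs"]:
    print(doc["id"], doc.get("title"), doc["score"])

Paging through large result sets is then a matter of advancing start in steps of rows, as noted above.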

1.2 HDFS File

HDFS File Configuration

HDFS File Connector Interface

Description

The HDFS File connector enables direct access to individual files stored in the Hadoop Distributed File System (HDFS). This connector allows you to read and process specific files from your Hadoop cluster, supporting various file formats and providing flexible chunking options for processing large files. Access is secured through Apache Knox, the gateway Cloudera uses for authenticating and accessing Hadoop services.

Use Cases

  • Retrieving and processing individual data files stored in HDFS
  • Extracting specific content from large log files or datasets
  • Processing structured or semi-structured data files for RAG applications
  • Accessing historical data archives for analytical purposes
  • Ingesting configuration files or reference data from a Hadoop environment

Input Configuration

  • Knox Base URL: Base URL for the Knox gateway (required)

    Example: https://your-cluster-knox.cloudera.com:8443/gateway

  • HDFS File Path: Path to the target file in HDFS (required)

    Example: /user/data/reports/annual_report_2023.csv

  • Knox Username: Knox authentication username (required)

    Example: hdfs_user

  • Knox Password: Knox authentication password (required)

    Example: ••••••••

  • Chunking Strategy: Method for dividing the file into processable chunks (required)

    Example: by_line, by_character, by_paragraph

  • Max Characters per Chunk: Maximum size of each chunk in characters (required)

    Example: 1000, 2048, 4096

  • Include Original Elements: Whether to preserve original data structure (optional)

    Example: true, false

  • Silent Errors: Whether to continue processing despite errors (optional)

    Example: true, false

Output

The connector returns the file content, divided into chunks according to the specified chunking strategy.

Example Output:

{
  "chunks": [
    {
      "content": "This is the first chunk of content from the HDFS file...",
      "metadata": {
        "source": "/user/data/reports/annual_report_2023.csv",
        "chunk_index": 0,
        "chunk_size": 982
      }
    },
    {
      "content": "This is the second chunk of content continuing from the previous section...",
      "metadata": {
        "source": "/user/data/reports/annual_report_2023.csv",
        "chunk_index": 1,
        "chunk_size": 1024
      }
    }
  ],
  "total_chunks": 8,
  "file_size": 8192,
  "file_type": "text/csv"
}

Implementation Notes

  • Select appropriate chunking strategy based on file type and content structure
  • Consider file size when setting chunk parameters to avoid memory issues
  • For structured files (CSV, JSON), align chunks with record boundaries when possible
  • Implement error handling for corrupt or incomplete files
  • Use Knox's security features to ensure proper authentication and authorization
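
As a rough illustration of what this connector does, the sketch below reads a file over WebHDFS through Knox and applies a simple by_character chunking pass. The topology name (default), the chunking helper, and the chunk metadata fields are assumptions for illustration, not the connector's actual internals:

import requests

KNOX_BASE_URL = "https://your-cluster-knox.cloudera.com:8443/gateway"
TOPOLOGY = "default"                      # assumption: WebHDFS is exposed by this topology
FILE_PATH = "/user/data/reports/annual_report_2023.csv"
AUTH = ("hdfs_user", "********")

def read_hdfs_file():
    """Open the file through the Knox WebHDFS endpoint and return its text content."""
    url = f"{KNOX_BASE_URL}/{TOPOLOGY}/webhdfs/v1{FILE_PATH}"
    resp = requests.get(url, params={"op": "OPEN"}, auth=AUTH, timeout=60)
    resp.raise_for_status()
    return resp.text

def chunk_by_characters(text, max_chars=1000):
    """Naive by_character chunking: fixed-size windows of at most max_chars characters."""
    return [
        {"content": text[i:i + max_chars],
         "metadata": {"source": FILE_PATH, "chunk_index": n, "chunk_size": len(text[i:i + max_chars])}}
        for n, i in enumerate(range(0, len(text), max_chars))
    ]

chunks = chunk_by_characters(read_hdfs_file(), max_chars=1000)
print(len(chunks), "chunks")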

1.3 Hive SQL

Hive SQL Configuration

Hive SQL Connector Interface

Description

The Hive SQL connector provides SQL-based access to data stored in Apache Hive, the Hadoop-based data warehouse included in Cloudera's platform. This connector allows you to execute HiveQL queries to retrieve structured data from Hive tables, supporting complex SQL operations including joins, aggregations, and filters. It uses JDBC to establish secure connections to your Hive server.

Use Cases

  • Querying large structured datasets stored in Hive tables
  • Performing complex data analysis using SQL operations
  • Extracting aggregated business metrics for reporting
  • Joining multiple data sources stored in Hive
  • Building data pipelines that incorporate Hive-stored data

Input Configuration

  • JDBC URL: JDBC connection string for Hive server (required)

    Example: jdbc:hive2://your-cluster.cloudera.com:10000

  • Username: Database authentication username (required)

    Example: hive_user

  • Password: Database authentication password (required)

    Example: ••••••••

  • Database: Target Hive database name (required)

    Example: sales_data, customer_analytics

  • HTTP Path: HTTP path for the Hive service (required for HTTP transport mode)

    Example: /cliservice

  • SSL Enabled: Whether to use SSL for secure connection (optional)

    Example: true, false

  • JDBC Driver Path: Path to the Hive JDBC driver file (required)

    Example: /path/to/hive-jdbc-3.1.3000.jar

  • JDBC Driver Class: Fully qualified name of the JDBC driver class (required)

    Example: org.apache.hive.jdbc.HiveDriver

  • Query: HiveQL query to execute (required)

    Example: SELECT customer_id, product_name, purchase_date, amount FROM sales WHERE purchase_date > '2023-01-01'

Output

The connector returns the query results with column names and row data in a structured format.

Example Output:

{
  "metadata": {
    "columnNames": ["customer_id", "product_name", "purchase_date", "amount"],
    "columnTypes": ["BIGINT", "VARCHAR", "DATE", "DECIMAL"]
  },
  "data": [
    {
      "customer_id": 1245,
      "product_name": "Premium Analytics Package",
      "purchase_date": "2023-03-15",
      "amount": 1299.99
    },
    {
      "customer_id": 8763,
      "product_name": "Data Processing Service",
      "purchase_date": "2023-04-22",
      "amount": 849.50
    }
  ],
  "rowCount": 2,
  "executionTime": 3.45,
  "status": "success"
}

Implementation Notes

  • Optimize Hive queries with proper filtering to reduce processing time
  • Use LIMIT clauses for large result sets to avoid memory constraints
  • Consider partitioned tables for better query performance
  • Implement timeout handling for long-running queries
  • Use parameterized queries to prevent SQL injection vulnerabilities
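
A minimal sketch of issuing an equivalent query over JDBC from Python with the jaydebeapi package; how the connector itself combines the HTTP path, SSL, and database settings into the JDBC URL is an assumption here, and all connection values are placeholders:

import jaydebeapi  # pip install jaydebeapi (requires a JVM)

# Assumed JDBC URL layout for HTTP transport mode with SSL enabled.
JDBC_URL = ("jdbc:hive2://your-cluster.cloudera.com:10000/sales_data"
            ";transportMode=http;httpPath=cliservice;ssl=true")
DRIVER_CLASS = "org.apache.hive.jdbc.HiveDriver"
DRIVER_JAR = "/path/to/hive-jdbc-3.1.3000.jar"

conn = jaydebeapi.connect(DRIVER_CLASS, JDBC_URL, ["hive_user", "********"], DRIVER_JAR)
try:
    cur = conn.cursor()
    # Parameter binding (the "?" placeholder) avoids SQL injection, per the notes above.
    cur.execute(
        "SELECT customer_id, product_name, purchase_date, amount "
        "FROM sales WHERE purchase_date > ? LIMIT 100",
        ("2023-01-01",),
    )
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    cur.close()
finally:
    conn.close()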

1.4 HBase Knox

HBase Knox Configuration

HBase Knox Connector Interface

Description

The HBase Knox connector provides secure access to Apache HBase, the non-relational, distributed database built on HDFS and included in Cloudera's platform. This connector allows you to query and retrieve data from HBase tables using row keys and column qualifiers, with authentication handled through Apache Knox. It's especially useful for applications requiring low-latency access to large, sparse datasets.

Use Cases

  • Retrieving time-series data stored in HBase
  • Accessing user profile information in real-time applications
  • Querying IoT device data or sensor readings
  • Extracting specific records from large, sparse tables
  • Building applications that require fast, random access to big data

Input Configuration

  • Base URL: Knox gateway URL for HBase access (required)

    Example: https://your-cluster-knox.cloudera.com:8443/gateway/default

  • Username: Knox authentication username (required)

    Example: hbase_user

  • Password: Knox authentication password (required)

    Example: ••••••••

  • Table Name: HBase table to query (required)

    Example: customer_profiles, web_logs, sensor_data

  • Search Query: Query parameters for HBase (required)

    Example: {"rowKey": "user123", "columns": ["info:name", "info:email", "data:last_login"]}

Output

The connector returns the HBase cell values matching the query parameters, organized by row and column.

Example Output:

{
  "Row": [
    {
      "key": "dXNlcjEyMw==",  // Base64 encoded "user123"
      "Cell": [
        {
          "column": "aW5mbzpuYW1l",  // Base64 encoded "info:name"
          "timestamp": 1691498231000,
          "$": "Sm9obiBTbWl0aA==",  // Base64 encoded "John Smith"
        },
        {
          "column": "aW5mbzplbWFpbA==",  // Base64 encoded "info:email"
          "timestamp": 1691498231000,
          "$": "am9obi5zbWl0aEBleGFtcGxlLmNvbQ==",  // Base64 encoded "john.smith@example.com"
        },
        {
          "column": "ZGF0YTpsYXN0X2xvZ2lu",  // Base64 encoded "data:last_login"
          "timestamp": 1694152631000,
          "$": "MjAyMy0wOS0wOFQxNDozMDozMVo=",  // Base64 encoded "2023-09-08T14:30:31Z"
        }
      ]
    }
  ]
}

Implementation Notes

  • Design efficient row keys for optimal data retrieval performance
  • Request only the specific columns needed to minimize data transfer
  • Use row key prefixes and filters for range queries
  • Handle Base64 encoding/decoding for binary data
  • Implement retries for transient connection issues
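
A minimal sketch of the same lookup against the HBase REST API proxied by Knox, assuming the topology exposes the service under <Base URL>/hbase; it also shows the Base64 decoding step called out above. Table, row key, and credentials are placeholders:

import base64
import requests

BASE_URL = "https://your-cluster-knox.cloudera.com:8443/gateway/default"
AUTH = ("hbase_user", "********")

def hbase_get_row(table, row_key, columns=None):
    """Fetch one row via the Knox-proxied HBase REST service and decode Base64 values."""
    url = f"{BASE_URL}/hbase/{table}/{row_key}"
    if columns:
        url += "/" + ",".join(columns)   # restrict the response to specific columns
    resp = requests.get(url, auth=AUTH, headers={"Accept": "application/json"}, timeout=30)
    resp.raise_for_status()

    decoded = {}
    for row in resp.json().get("Row", []):
        for cell in row.get("Cell", []):
            column = base64.b64decode(cell["column"]).decode("utf-8")
            value = base64.b64decode(cell["$"]).decode("utf-8")
            decoded[column] = value
    return decoded

print(hbase_get_row("customer_profiles", "user123",
                    ["info:name", "info:email", "data:last_login"]))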

1.5 Impala ODBC

Impala ODBC Configuration

Input Configuration

  • DSN: Data source name
  • Username: Authentication username
  • Password: Authentication password
  • Query: SQL query to execute
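
A minimal sketch of running such a query from Python with pyodbc, assuming the Cloudera Impala ODBC driver is installed and a DSN named impala_dsn (a placeholder) has already been configured:

import pyodbc  # requires the Impala ODBC driver and a configured DSN

# Placeholder connection values matching the configuration fields above.
conn = pyodbc.connect("DSN=impala_dsn;UID=impala_user;PWD=********", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id LIMIT 50")
columns = [d[0] for d in cursor.description]
for row in cursor.fetchall():
    print(dict(zip(columns, row)))
conn.close()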

1.6 HDFS Directory

HDFS Directory Configuration

Input Configuration

  • Knox Base URL: Base URL for the Knox gateway
  • HDFS Path: Path to the target directory in HDFS
  • Knox Username: Knox authentication username
  • Knox Password: Knox authentication password
  • Enable Parallel Processing: Whether to process the directory's files in parallel
  • Silent Errors: Whether to continue processing despite errors
  • Chunking Strategy: Method for dividing each file into processable chunks
  • Max Characters per Chunk: Maximum size of each chunk in characters
  • Include Original Elements: Whether to preserve original data structure
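
A rough sketch of a parallel directory ingest done by hand against WebHDFS through Knox; the topology name, directory path, and worker count are placeholders, and errors are swallowed to mimic the Silent Errors option:

import requests
from concurrent.futures import ThreadPoolExecutor

KNOX_BASE_URL = "https://your-cluster-knox.cloudera.com:8443/gateway"
TOPOLOGY = "default"                    # assumption: WebHDFS is exposed by this topology
DIR_PATH = "/user/data/reports"         # hypothetical directory
AUTH = ("hdfs_user", "********")

def list_directory(path):
    """Return FileStatus entries for a directory via the WebHDFS LISTSTATUS operation."""
    url = f"{KNOX_BASE_URL}/{TOPOLOGY}/webhdfs/v1{path}"
    resp = requests.get(url, params={"op": "LISTSTATUS"}, auth=AUTH, timeout=60)
    resp.raise_for_status()
    return resp.json()["FileStatuses"]["FileStatus"]

def read_file(path):
    """Open one file through WebHDFS and return its text; skip it on error (silent mode)."""
    url = f"{KNOX_BASE_URL}/{TOPOLOGY}/webhdfs/v1{path}"
    try:
        resp = requests.get(url, params={"op": "OPEN"}, auth=AUTH, timeout=120)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None                      # Silent Errors: continue despite failures

files = [f"{DIR_PATH}/{s['pathSuffix']}" for s in list_directory(DIR_PATH) if s["type"] == "FILE"]
with ThreadPoolExecutor(max_workers=4) as pool:   # Enable Parallel Processing
    contents = [c for c in pool.map(read_file, files) if c is not None]
print(f"Read {len(contents)} of {len(files)} files")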

Note: Ensure proper Cloudera cluster configuration and Knox gateway access before using these connectors. For production environments, implement appropriate security measures including SSL encryption, Kerberos authentication, and regular credential rotation.