Data Type Handling

FFIEC Data Connect provides comprehensive data type handling across multiple protocols (SOAP/REST) and output formats, ensuring data integrity and precision from the original XBRL source through to your final data structure.

Overview

The library manages data types across three key dimensions:

Protocol Layer (SOAP vs REST)

Protocol Comparison
Aspect	SOAP API	REST API
Credentials	`WebserviceCredentials`	`OAuth2Credentials`
Null Values	`np.nan` (NumPy)	`pd.NA` (Pandas)
Compatibility	100% backward compatible	Enhanced integer handling
Use Case	Existing integrations	New implementations

Type Detection in XBRL Processing

The library automatically detects data types from XBRL unit references:

XBRL Type Mapping
XBRL Unit	Value Example	Python Type	Description
`USD`	`"1500000"`	`np.int64`	Monetary values (÷1000)
`PURE`	`"1.25"`	`np.float64`	Ratios and percentages
`NON-MONETARY`	`"0.85"`	`np.float64`	Non-monetary numerics
Boolean	`"true"/"false"`	`np.bool_`	Boolean indicators
Other	`"text"`	`str`	Text values

Special Processing Rules

USD Values: Automatically divided by 1000 using integer division (//)
Type Preservation: Original types stored in data_type field
Null Handling: Protocol-specific null values applied

SOAP vs REST Null Handling

Why Different Null Handling Strategies?

The library uses different null strategies for SOAP and REST to solve a critical problem while maintaining backward compatibility:

The Problem:

When pandas DataFrames contained np.nan values in integer columns, pandas would automatically convert those columns to float64 to accommodate the NaN values. This resulted in integer values displaying with decimal points (e.g., 1000.0 instead of 1000), which was both aesthetically problematic and semantically incorrect for financial data that should be represented as whole numbers.

The Solution:

SOAP Path (Original): Continues using np.nan to ensure 100% backward compatibility for existing integrations. Existing code that expects np.nan behavior continues to work unchanged.
REST Path (Enhanced): Uses pd.NA, which is pandas’ newer null value that works with nullable integer types (Int64). This allows integer columns to remain as integers even when containing null values.

Design Philosophy:

This dual approach follows the principle of “never break existing code.” Users who have built systems around the SOAP API can upgrade the library without any changes to their code, while new REST API users automatically benefit from improved type handling.

# Example of the problem this solves:
# Before (with np.nan):
df['int_data']  # Shows: [1000.0, 2000.0, NaN, 3000.0]

# After (with pd.NA for REST):
df['int_data']  # Shows: [1000, 2000, <NA>, 3000]

Technical Implementation

The differentiation between SOAP and REST null handling is implemented at the XBRL processor level through the use_rest_nulls parameter:

SOAP calls: Automatically use use_rest_nulls=False (default), applying np.nan
REST calls: Explicitly set use_rest_nulls=True, applying pd.NA
User transparency: This is handled automatically based on credential type - users never need to specify this parameter

This approach ensures that:

Zero configuration needed: The library automatically selects the appropriate null handling based on your credentials
No breaking changes: Existing SOAP users see no changes in behavior
Optimal for each protocol: Each API path gets the most appropriate null handling for its use case
Future-proof: As pandas evolves, REST users automatically benefit from improvements to nullable types

SOAP API (Original Behavior)

SOAP Null Value Handling
Data Type	Null Value	Pandas Conversion	Final dtype
Integer	`np.nan`	Direct to `Int64`	Nullable integer
Float	`np.nan`	Direct to `float64`	Standard float
Boolean	`np.nan`	Direct to `boolean`	Nullable boolean
String	`None`	Direct to `string`	Pandas string

# SOAP path - original behavior preserved
processed_ret = xbrl_processor._process_xml(
    data,
    date_format,
    use_rest_nulls=False  # Default
)

REST API (Enhanced Behavior)

REST Null Value Handling
Data Type	Null Value	Intermediate	Pandas Conversion	Final dtype
Integer	`pd.NA`	→ `None`	→ `Int64`	Nullable integer
Float	`pd.NA`	→ `np.nan`	→ `float64`	Standard float
Boolean	`pd.NA`	→ `None`	→ `boolean`	Nullable boolean
String	`None`	(unchanged)	→ `string`	Pandas string

# REST path - enhanced handling
processed_ret = xbrl_processor._process_xml(
    data,
    date_format,
    use_rest_nulls=True  # Explicit for REST
)

Output Format Type Mapping

List Output (`output_type="list"`)

Returns raw dictionaries with the following structure:

{
    'mdrm': str,              # MDRM identifier
    'rssd': str,              # RSSD ID
    'quarter': str/date,      # Based on date_output_format
    'data_type': str,         # 'int', 'float', 'bool', or 'str'
    'int_data': int/null,     # Value if data_type='int'
    'float_data': float/null, # Value if data_type='float'
    'bool_data': bool/null,   # Value if data_type='bool'
    'str_data': str/None      # Value if data_type='str'
}

Pandas Output (`output_type="pandas"`)

Creates DataFrames with nullable types for proper null handling:

Pandas Column Types
Column	dtype	Description	Null Support
`mdrm`	`object`	MDRM identifier	No
`rssd`	`object`	RSSD ID as string	No
`quarter`	`object`/`datetime64`	Based on date_output_format	No
`data_type`	`object`	Type indicator	No
`int_data`	`Int64`	Nullable integer	Yes (`pd.NA`)
`float_data`	`float64`	Standard float	Yes (`NaN`)
`bool_data`	`boolean`	Nullable boolean	Yes (`pd.NA`)
`str_data`	`string`	Pandas string	Yes (`pd.NA`)

Key Benefits:

No .0 suffix on integer values
Proper null handling with pandas nullable types
Type-safe operations

Polars Output (`output_type="polars"`)

Direct conversion with native nullable types:

Polars Schema
Column	Polars Type	Null Support	Notes
`mdrm`	`pl.Utf8`	No	String type
`rssd`	`pl.Utf8`	No	String type
`quarter`	`pl.Utf8`/`pl.Date`	No	Based on date_output_format
`data_type`	`pl.Utf8`	No	String type
`int_data`	`pl.Int64`	Yes	Native nullable
`float_data`	`pl.Float64`	Yes	Native nullable
`bool_data`	`pl.Boolean`	Yes	Native nullable
`str_data`	`pl.Utf8`	Yes	Native nullable

Integer Display Examples

Problem Scenario (Before Fix)

# Original issue: integers showing as floats
df['int_data']  # Shows: 1000.0, 2000.0, 3000.0

Current Behavior - SOAP Path

# Input: XBRL with USD value "1500000"
# Processing: 1500000 // 1000 = 1500 (integer division)
# Storage: np.int64(1500) with np.nan for nulls
# DataFrame: Int64 dtype
# Display: 1500 (no .0 suffix)

Current Behavior - REST Path

# Input: JSON with integer 1500000
# Processing: 1500000 // 1000 = 1500
# Storage: np.int64(1500) with pd.NA for nulls
# Conversion: pd.NA → None → Int64
# Display: 1500 (no .0 suffix)

Type Conversion Decision Tree

Input Data
    ├── SOAP API (WebserviceCredentials)
    │   ├── XBRL Processing
    │   │   ├── Detect Type (USD/PURE/etc.)
    │   │   └── Apply np.nan for nulls
    │   └── Output Format
    │       ├── list → Raw dicts with np.nan
    │       ├── pandas → DataFrame with Int64/float64/boolean
    │       └── polars → DataFrame with native nullable types
    │
    └── REST API (OAuth2Credentials)
        ├── XBRL Processing
        │   ├── Detect Type
        │   └── Apply pd.NA for nulls
        └── Output Format
            ├── list → Raw dicts with pd.NA
            ├── pandas → Convert pd.NA → None/np.nan → nullable dtypes
            └── polars → Convert pd.NA → None → native nulls

Date Format Handling

The date_output_format parameter affects the quarter column:

Date Format Options
Format	Example Output	Python Type	Pandas dtype	Polars Type
`"string_original"`	`"12/31/2023"`	`str`	`object`	`pl.Utf8`
`"string_yyyymmdd"`	`"20231231"`	`str`	`object`	`pl.Utf8`
`"python_format"`	`datetime(2023,12,31)`	`datetime`	`datetime64[ns]`	`pl.Date`

Common Patterns and Best Practices

Working with Integer Data

# Recommended: Use nullable integer operations
df['int_data'].sum()      # Handles NA/null correctly
df['int_data'].fillna(0)  # Replace nulls with 0

# Avoid: Converting to standard float
float(df['int_data'])  # May raise error with NA values

Type Checking

# Check for integer rows
int_rows = df[df['data_type'] == 'int']

# Access specific typed columns
integers = df['int_data'].dropna()
floats = df['float_data'].dropna()

Null Handling

# Pandas: Use pd.isna() for universal null checking
pd.isna(df['int_data'])  # Works with both NaN and NA

# Polars: Use native is_null()
df_polars['int_data'].is_null()

Type Preservation During Operations

# Maintains Int64 type
df['int_data'] * 1000   # Result is still Int64

# Converts to float64
df['int_data'] / 1000   # Result becomes float64
df['int_data'] // 1000  # Use // to maintain integer

Migration Guide

From Existing SOAP Integration

No changes required. The library maintains 100% backward compatibility:

# Existing code continues to work unchanged
df = collect_data(
    session=connection.session,
    creds=soap_credentials,
    reporting_period="2023-12-31",
    rssd_id="480228",
    output_type="pandas"
)
# Returns DataFrame with same types as before

Adopting REST API

To leverage enhanced REST features:

# Use OAuth2Credentials for REST
creds = OAuth2Credentials(username="user", token="token")

df = collect_data(
    session=None,  # Not needed for REST
    creds=creds,
    reporting_period="2023-12-31",
    rssd_id="480228",
    output_type="pandas"
)
# Returns DataFrame with enhanced null handling

Overriding Null Value Handling

The `force_null_types` Parameter

While the library automatically selects the appropriate null handling based on your API choice (SOAP vs REST), you can override this behavior using the force_null_types parameter available in collect_data() and collect_ubpr_facsimile_data() functions.

When to Use This Parameter:

Testing and Migration: Test how your code would behave with different null handling before switching APIs
Compatibility Issues: Work around specific compatibility requirements in your data pipeline
Performance Comparison: Compare the behavior of both null handling approaches
Gradual Migration: SOAP users can preview REST-style null handling without changing credentials

Parameter Options

force_null_types Options
Value	Behavior	Use Case
`None` (default)	Automatic based on API type	Normal operation - recommended
`"numpy"`	Force `np.nan` for all null values	Legacy compatibility, SOAP-like behavior
`"pandas"`	Force `pd.NA` for null values	Better integer display, REST-like behavior

Usage Examples

Example 1: SOAP User Testing Pandas Null Handling

# SOAP credentials normally use np.nan
soap_creds = WebserviceCredentials(username="user", password="pass")

# Test with pandas null handling without switching to REST
df = collect_data(
    session=connection.session,
    creds=soap_creds,
    reporting_period="2023-12-31",
    rssd_id="480228",
    output_type="pandas",
    force_null_types="pandas"  # Override to use pd.NA
)
# Now integers display without .0 suffix

Example 2: REST User Requiring NumPy Compatibility

# REST credentials normally use pd.NA
rest_creds = OAuth2Credentials(username="user", token="token")

# Force numpy nulls for compatibility with legacy analysis code
df = collect_data(
    session=None,
    creds=rest_creds,
    reporting_period="2023-12-31",
    rssd_id="480228",
    output_type="pandas",
    force_null_types="numpy"  # Override to use np.nan
)
# Now compatible with code expecting np.nan

Example 3: Comparing Both Approaches

# Compare integer handling with different null types
for null_type in [None, "numpy", "pandas"]:
    df = collect_data(
        session=connection.session,
        creds=credentials,
        reporting_period="2023-12-31",
        rssd_id="480228",
        output_type="pandas",
        force_null_types=null_type
    )
    print(f"Null type: {null_type or 'automatic'}")
    print(f"Integer sample: {df['int_data'].iloc[0]}")
    print(f"Has .0 suffix: {'.0' in str(df['int_data'].iloc[0])}\n")

Implementation Notes

Performance: No significant performance difference between null types
Memory Usage: pd.NA uses slightly more memory but provides better type safety
Compatibility: np.nan is more widely compatible with older pandas versions
Future-Proof: pd.NA is the recommended approach for new pandas code

Best Practices

Use defaults when possible: Let the library choose based on your API
Document overrides: If you override, comment why in your code
Test thoroughly: When overriding, test all data operations
Consider migration: If consistently overriding, consider switching APIs

Warning

Overriding null types may cause unexpected behavior if your code assumes specific null handling. Test thoroughly when using force_null_types.

Troubleshooting

NAType Error

Symptom: float() argument must be a string or a real number, not 'NAType'

Cause: Attempting float conversion on pandas NA value

Solution: Use pd.isna() for null checking or .fillna() before conversion

Integers Display with .0

Symptom: Integer values show as 1000.0 instead of 1000

Cause: Mixed with float or using regular division

Solution: Ensure using Int64 dtype and integer division (//)

Type Loss in Operations

Symptom: Int64 column becomes float64 after operation

Cause: Operation that produces non-integer results

Solution: Use integer-preserving operations or explicitly cast back

Performance Considerations

Performance Comparison
Operation	SOAP	REST	Notes
Null checking	Fast (`np.isnan`)	Fast (`pd.isna`)	Both optimized
DataFrame creation	Standard	Slightly slower	REST has extra conversion
Memory usage	Standard	~Same	Nullable types similar
Integer operations	Fast	Fast	Int64 optimized

API Reference

Type-Related Parameters

Parameters
Parameter	Options	Default	Description
`output_type`	`"list"`, `"pandas"`, `"polars"`	`"list"`	Output format
`date_output_format`	`"string_original"`, `"string_yyyymmdd"`, `"python_format"`	`"string_original"`	Date format in output

Internal Type Handling

Internal Functions
Function	Purpose	SOAP Behavior	REST Behavior
`_process_xml()`	Parse XBRL	Uses `np.nan`	Uses `pd.NA` with `use_rest_nulls=True`
`_process_xbrl_item()`	Process single item	Returns typed value	Returns typed value
DataFrame conversion	Create pandas DF	Direct with `np.nan`	Converts `pd.NA` → appropriate nulls

Version History

v2.0.0: Added REST API support with enhanced null handling
v1.x.x: Original SOAP-only implementation with np.nan