Project Overview
This notebook demonstrates the development of a machine learning model to predict real estate sale price intervals in King County, Washington. Unlike traditional point-estimate models, the focus is on generating prediction intervals that quantify uncertainty in house price predictions.
Project Goals
The primary objectives of this project are:
- Develop a model that predicts price ranges rather than single-point estimates
- Quantify uncertainty in real estate valuations using prediction intervals
- Identify key factors influencing property values in King County
- Create geospatial visualizations to understand prediction accuracy across different areas
- Demonstrate the effectiveness of quantile regression forests for real estate valuation
Methodology
The approach applied herein combines several techniques:
Quantile Regression Forests: Instead of predicting a single price point, quantile regression is used to estimate the 5th and 95th percentiles, creating a 90% prediction interval for each property.
Comprehensive Feature Engineering: Raw property data is transformed into meaningful features across multiple domains:
- Spatial features based on geographic coordinates
- Temporal features capturing market trends and seasonality
- Property characteristic features including size, quality, and amenities
- Market context features reflecting neighborhood dynamics
Geospatial Analysis: Predictions are visualized on maps to identify spatial patterns in prediction accuracy and housing values.
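A 90% prediction interval built from the 5th and 95th percentiles is usually judged on two things: how often the true price falls inside it (coverage) and how wide it is. The sketch below illustrates both measures with NumPy only. The function name and the toy prices are hypothetical and are not taken from the notebook's model.

```python
import numpy as np

def interval_coverage_and_width(y_true, lower, upper):
    """Return (empirical coverage, mean interval width) for prediction intervals."""
    y_true, lower, upper = (np.asarray(a, dtype=float) for a in (y_true, lower, upper))
    inside = (y_true >= lower) & (y_true <= upper)  # True where the sale price is covered
    return float(inside.mean()), float((upper - lower).mean())

# Hypothetical sale prices and predicted 5th/95th percentiles (USD)
y_true = [300_000, 450_000, 720_000, 1_200_000]
lower = [250_000, 400_000, 750_000, 900_000]
upper = [380_000, 520_000, 990_000, 1_500_000]

coverage, mean_width = interval_coverage_and_width(y_true, lower, upper)
print(f"coverage={coverage:.2f}, mean width=${mean_width:,.0f}")
```

A well-calibrated 90% interval should cover roughly 90% of held-out sales; among intervals with similar coverage, narrower ones are more useful.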
Column Name | Description |
---|---|
id | record identifier |
sale_date | Close-of-escrow date for the recorded sale (YYYY-MM-DD). |
sale_price | Final purchase price for the transaction, expressed in US dollars. |
sale_nbr | Coded reason for the sale (e.g., 1 = full-value arms-length transfer, 4 = partial interest, 7 = quit-claim, etc.). |
sale_warning | Quality flag raised by the assessor when the price appears non-market (auction, related-party transfer, eminent domain, etc.). Blank = no concern. |
join_status | How the record joined to the master parcel table at the time the dataset was assembled (new, nochg, rebuilt - before, etc.). |
join_year | Calendar year of the most recent successful join to the master table. |
latitude | Geographic centroid latitude of the parcel (WGS 84). |
longitude | Geographic centroid longitude of the parcel (WGS 84). |
area | Assessorial "area number" used by King County for grouping neighbourhoods that share similar market characteristics. |
city | Incorporated city (or "KING COUNTY" for the unincorporated area) in which the parcel lies. |
zoning | Current primary zoning designation on the parcel (e.g., R-8, LR3 (M), RA5). |
subdivision | Plat, condominium, or short-plat name recorded with the county auditor. |
present_use | Numeric land-use code describing how the property is currently used (e.g., 1 = single-family, 2 = multi-family 2–4 units, 29 = town-house). |
land_val | Most recent assessor land valuation (USD). |
imp_val | Most recent assessor improvement (buildings & fixtures) valuation (USD). |
year_built | Year the principal structure was originally constructed (0 when missing). |
year_reno | Year of the last permitted remodel/addition (if any) |
sqft_lot | Lot size in square feet taken from the recorded legal description or survey. |
sqft | Total finished living-area square footage (above- and below-grade). |
sqft_1 | Finished square footage above grade on the 1st floor. |
sqft_fbsmt | Finished square footage of the basement (0 when no finished basement). |
grade | Assessor construction-quality "grade" (1 = low, 13 = mansion-quality; most houses are 6–9). |
fbsmt_grade | Construction quality of the finished basement area (same 1–13 scale; 0 when no finished basement). |
condition | Assessor physical condition code (1 = poor, 9 = excellent; 0 = not rated). |
stories | Number of full stories counted above grade (split-levels show as 1). |
beds | Legal bedroom count. |
bath_full | Number of full bathrooms (sink + toilet + tub/shower). |
bath_3qtr | Number of ¾ bathrooms (sink + toilet + shower only). |
bath_half | Number of half bathrooms (sink + toilet only). |
garb_sqft | Finished area in a basement garage or boat-storage bay (square feet). |
gara_sqft | Finished area in an attached or detached garage (square feet). |
wfnt | Waterfront indicator (0 = no waterfront access; 1-9 levels of waterfront proximity/quality). |
golf | 1 = parcel's primary outlook is a golf course; 0 = no golf-course view. |
greenbelt | 1 = parcel abuts a protected greenbelt, park or open space; 0 = no greenbelt adjacency. |
noise_traffic | Noise level assessment from traffic, airport, and rail sources (0-3 scale; 0 = typical noise, 3 = high noise exposure). |
view_rainier | View quality of Mt. Rainier (0 = no view, 1-4 = increasing view quality) |
view_olympics | View quality of the Olympic Mountains (0-4 scale) |
view_cascades | View quality of the Cascade Mountains (0-4 scale) |
view_territorial | Quality of broad territorial (land) view (0-4 scale) |
view_skyline | View quality of city skyline (Seattle/Bellevue) (0-4 scale) |
view_sound | View quality of Puget Sound (0-4 scale) |
view_lakewash | View quality of Lake Washington (0-4 scale) |
view_lakesamm | View quality of Lake Sammamish (0-4 scale) |
view_otherwater | View quality of other water bodies (0-4 scale) |
view_other | Quality of other premium views (0-4 scale) |
submarket | Letter code grouping neighbourhoods into broader sub-markets used by local appraisers (A = prime lake-front, B = in-city view areas, …, N = rural east county, etc.). |
Data Challenges
The King County housing dataset presents several challenges that require careful handling:
- Outliers: Extreme property values and unusual characteristics can skew model training.
- Missing Values: Several features contain missing data that must be imputed appropriately.
- Temporal Dynamics: Real estate markets change over time, requiring features that capture temporal trends.
- Geographic Complexity: King County has diverse neighborhoods with different value drivers.
- Feature Interactions: Many property characteristics interact in complex ways to influence price.
- Data Quality Issues: Some records contain inconsistent or potentially erroneous information.
Our preprocessing pipeline addresses these challenges through outlier detection, strategic imputation, and feature engineering.
1. Setup and Configuration
In this section, we import the necessary libraries and set up the environment for data analysis and modeling. This project uses the following:
- Standard libraries: For basic data manipulation and utilities
- Data processing: Pandas and NumPy for data handling and numerical operations
- Visualization: Matplotlib and Seaborn for creating informative plots
- Machine learning: Scikit-learn for preprocessing and model evaluation
- Quantile regression: The quantile_forest package for prediction intervals
- Specialized libraries: For geospatial analysis, inflation adjustment, and temporal features
# Standard library imports
import re
import warnings
import pprint
from tabulate import tabulate
# Data processing and numerical computing
import pandas as pd
import numpy as np
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Machine learning
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.metrics import make_scorer
from quantile_forest import RandomForestQuantileRegressor
# Configuration
warnings.filterwarnings('ignore', category=UserWarning)
# Specialized libraries
import cpi
cpi.update() # ensure cpi is up-to-date and able to accurately inflate to today's dollars
from datetime import date
from geopy.distance import geodesic
import osmnx as ox
# Set random seeds for reproducibility
np.random.seed(42)
Utility Functions
We also define a few utility functions up front. They are used throughout the notebook for data analysis, visualization, and preprocessing, which keeps repeated tasks consistent.
DataFrame Display Management: Controls how many rows and columns are displayed for better readability.
DataFrame Analysis: Provides comprehensive analysis of data types, missing values, and potential issues in our dataset.
Column Analysis: Generates detailed statistics and value distributions for specified columns, helping us understand data patterns.
Outlier Detection: Implements IQR-based methods to identify and handle outliers in price and square footage.
Correlation Analysis: Calculates and formats correlation tables between features and target variables.
Feature Engineering Functions: Transform raw data into meaningful features across spatial, temporal, and property domains.
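As a quick illustration of the IQR rule mentioned in the list above, the sketch below computes the usual fences (Q1 - 1.5·IQR and Q3 + 1.5·IQR) and flags values outside them. The helper name `iqr_bounds` and the toy square-footage values are assumptions made for this example only.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy living-area values in square feet; the last one is deliberately extreme
sqft = np.array([1200, 1500, 1800, 2100, 2400, 13540])

low, high = iqr_bounds(sqft)
outliers = sqft[(sqft < low) | (sqft > high)]
print(f"fences=({low:.0f}, {high:.0f}), outliers={outliers.tolist()}")
```

The multiplier `k=1.5` is the conventional choice; a larger value flags fewer points, which may suit heavy-tailed variables such as sale price.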
### Manage restrictions on how many columns and rows display
def manage_df_display(rows=60, columns=20):
    """Set pandas display limits; pass None to remove a limit."""
    pd.set_option('display.max_rows', rows)
    pd.set_option('display.max_columns', columns)
    pd.set_option('display.width', None)
### Analyze Dataframe
def analyze_dataframe(df):
    """Comprehensive analysis of DataFrame for preprocessing issues."""
    print("=" * 50)
    print("DATAFRAME ANALYSIS")
    print("=" * 50)
    print(f"Shape: {df.shape}")
    print(f"Data types:\n{df.dtypes.value_counts()}")
    # Analyze numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print(f"\n--- NUMERIC COLUMNS ({len(numeric_cols)}) ---")
        for col in numeric_cols:
            series = df[col]
            inf_count = np.isinf(series).sum()
            nan_count = series.isnull().sum()
            large_count = (np.abs(series) > 1e10).sum()
            # Only report columns that have at least one potential issue
            if inf_count > 0 or nan_count > 0 or large_count > 0:
                print(f"{col}:")
                print(f" - Infinite values: {inf_count}")
                print(f" - NaN values: {nan_count}")
                print(f" - Extremely large values: {large_count}")
    # Analyze non-numeric columns
    non_numeric_cols = df.select_dtypes(exclude=[np.number]).columns
    if len(non_numeric_cols) > 0:
        print(f"\n--- NON-NUMERIC COLUMNS ({len(non_numeric_cols)}) ---")
        for col in non_numeric_cols:
            series = df[col]
            # The value in parentheses is the missing fraction (0-1), not a percentage
            print(f"{col}: {series.dtype}, {series.nunique()} unique values, {series.isnull().sum()} ({series.isnull().sum() / len(series)}) missing")
def column_analysis(df, columns=None, max_categories=10):
    """
    Comprehensive analysis of specified columns with value counts and statistics

    Parameters:
    df: pandas DataFrame
    columns: list of column names to analyze (optional)
        If None, analyzes all columns in the DataFrame
    max_categories: maximum number of categories to display per column
    """
    # Determine which columns to analyze
    if columns is None:
        columns_to_analyze = df.columns.tolist()
        print("=" * 90)
        print(" DETAILED COLUMN ANALYSIS - ALL COLUMNS")
        print("=" * 90)
    else:
        # Validate that specified columns exist in the DataFrame
        missing_columns = [col for col in columns if col not in df.columns]
        if missing_columns:
            print(f" Warning: The following columns were not found in the DataFrame: {missing_columns}")
        columns_to_analyze = [col for col in columns if col in df.columns]
        if not columns_to_analyze:
            print("! Error: None of the specified columns exist in the DataFrame ! ")
            return
        print("=" * 90)
        print(f" DETAILED COLUMN ANALYSIS - SELECTED COLUMNS ({len(columns_to_analyze)})")
        print("=" * 90)
        print(f"Analyzing columns: {', '.join(columns_to_analyze)}")
    for column in columns_to_analyze:
        print(f"\n COLUMN: {column}")
        print("=" * 70)
        # Basic info
        col_data = df[column]
        data_type = col_data.dtype
        total_rows = len(df)
        missing_count = col_data.isnull().sum()
        non_missing_count = total_rows - missing_count
        print(f"Data Type: {data_type}")
        print(f"Total Rows: {total_rows:,} | Non-Missing: {non_missing_count:,} | Missing: {missing_count:,}")
        if non_missing_count == 0:
            print(" ! No data to analyze (all values are missing) ! ")
            continue
        # Value counts analysis (full counts kept so "least frequent" is not limited to the top rows)
        print("\n VALUE DISTRIBUTION:")
        print("-" * 50)
        full_counts = col_data.value_counts()
        value_counts = full_counts.head(max_categories)
        # Prepare detailed table
        table_data = []
        cumulative_count = 0
        for rank, (value, count) in enumerate(value_counts.items(), 1):
            cumulative_count += count
            percentage = (count / total_rows * 100)
            cumulative_percentage = (cumulative_count / total_rows * 100)
            # Format value for display
            display_value = str(value)
            if len(display_value) > 30:
                display_value = display_value[:27] + "..."
            table_data.append([
                f"#{rank}",
                display_value,
                f"{count:,}",
                f"{percentage:.2f}%",
                f"{cumulative_percentage:.2f}%"
            ])
        headers = ["Rank", "Value", "Count", "Percentage", "Cumulative %"]
        print(tabulate(table_data, headers=headers, tablefmt="fancy_grid"))
        # Additional statistics
        print("\n STATISTICS:")
        print("-" * 30)
        unique_count = col_data.nunique()
        print(f"• Unique values: {unique_count:,}")
        print(f"• Missing values: {missing_count:,} ({missing_count/total_rows*100:.2f}%)")
        if unique_count > 0:
            print(f"• Most frequent: '{full_counts.index[0]}' ({full_counts.iloc[0]:,} times)")
            print(f"• Least frequent: '{full_counts.index[-1]}' ({full_counts.iloc[-1]:,} times)")
        # Numeric statistics for numeric columns
        if pd.api.types.is_numeric_dtype(col_data):
            non_missing_data = col_data.dropna()
            if len(non_missing_data) > 0:
                min_value = non_missing_data.min()
                max_value = non_missing_data.max()
                print(f"• Minimum value: {min_value}")
                print(f"• Maximum value: {max_value}")
        # Data quality insights
        if unique_count == non_missing_count:
            print("• All values are unique (potential ID column)")
        elif unique_count == 1:
            print("• All values are the same (constant column)")
        elif unique_count <= 10:
            print("• Low cardinality (good for categorical analysis)")
        elif unique_count > non_missing_count * 0.8:
            print("• High cardinality (many unique values)")
### Apply IQR-based outlier detection for price and square footage
def calculate_outlier_table(df, columns):
    """Return a styled table summarizing outlier percentages for specified columns.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        columns (list of str): List of column names to check for outliers.
    Returns:
        pandas Styler: Prettified table with outlier percentages for each column.
    """
    results = []
    for col in columns:
        if col in df.columns:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            total = len(df)
            left_outliers = df[df[col] < lower_bound]
            right_outliers = df[df[col] > upper_bound]
            left_pct = (len(left_outliers) / total) * 100 if total > 0 else 0
            right_pct = (len(right_outliers) / total) * 100 if total > 0 else 0
            total_pct = left_pct + right_pct
            results.append({
                'Column': col,
                'dtype': df[col].dtype,
                'min': df[col].min(),
                'max': df[col].max(),
                'Total Outliers (%)': round(total_pct, 2),
                'Lower Outliers (%)': round(left_pct, 2),
                'Upper Outliers (%)': round(right_pct, 2)
            })
        else:
            # Column is absent: record placeholders instead of indexing df[col],
            # which would raise a KeyError
            results.append({
                'Column': col,
                'dtype': 'N/A',
                'min': None,
                'max': None,
                'Total Outliers (%)': None,
                'Lower Outliers (%)': None,
                'Upper Outliers (%)': None
            })
    # Create DataFrame from results
    result_df = pd.DataFrame(results).sort_values(by=['Total Outliers (%)'], ascending=False)
    # Apply styling to the DataFrame (na_rep keeps missing-column rows renderable)
    styled_df = result_df.style.background_gradient(
        cmap='YlOrRd',
        subset=['Total Outliers (%)']
    ).format({
        'Total Outliers (%)': '{:.2f}%',
        'Lower Outliers (%)': '{:.2f}%',
        'Upper Outliers (%)': '{:.2f}%',
        'max': '{:.2f}',
        'min': '{:.2f}',
    }, na_rep='N/A').set_properties(**{
        'text-align': 'center',
        'border': '1px solid gray',
        'padding': '5px'
    }).set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#f2f2f2'),
                                     ('color', 'black'),
                                     ('font-weight', 'bold'),
                                     ('text-align', 'center'),
                                     ('border', '1px solid gray'),
                                     ('padding', '5px')]},
        {'selector': 'caption', 'props': [('caption-side', 'top'),
                                          ('font-size', '1.2em'),
                                          ('font-weight', 'bold')]}
    ]).set_caption('Outlier Analysis')
    return styled_df
def cap_outliers_by_percentile(df: pd.DataFrame,
                               columns: list,
                               lower_percentile: float = 0.05,
                               upper_percentile: float = 0.95,
                               inplace: bool = False) -> pd.DataFrame:
    """
    Caps extreme values at specified percentiles for given columns.

    Parameters:
    df (pd.DataFrame): Input DataFrame
    columns (list): List of column names to cap
    lower_percentile (float): Lower percentile threshold (default: 0.05 for 5th percentile)
    upper_percentile (float): Upper percentile threshold (default: 0.95 for 95th percentile)
    inplace (bool): Whether to modify the original DataFrame
    Returns:
    pd.DataFrame: DataFrame with capped values (if inplace=False)
    """
    # Work with original DataFrame if inplace=True, otherwise create a copy
    df_result = df if inplace else df.copy()
    # Track capping statistics
    capping_stats = {}
    # Process each specified column
    for col in columns:
        if col not in df.columns:
            print(f"Warning: Column '{col}' not found in DataFrame")
            continue
        # Skip non-numeric columns
        if not pd.api.types.is_numeric_dtype(df[col]):
            print(f"Warning: Column '{col}' is not numeric, skipping...")
            continue
        # Calculate percentile bounds
        lower_bound = df[col].quantile(lower_percentile)
        upper_bound = df[col].quantile(upper_percentile)
        # Count values that will be capped
        lower_capped = (df[col] < lower_bound).sum()
        upper_capped = (df[col] > upper_bound).sum()
        # Apply capping
        df_result[col] = np.clip(df[col], lower_bound, upper_bound)
        # Store statistics
        capping_stats[col] = {
            'lower_bound': lower_bound,
            'upper_bound': upper_bound,
            'lower_capped_count': lower_capped,
            'upper_capped_count': upper_capped,
            'total_capped': lower_capped + upper_capped
        }
    # Print summary
    print("Capping Summary:")
    for col, stats in capping_stats.items():
        print(f" {col}: {stats['total_capped']} values capped "
              f"(Lower: {stats['lower_capped_count']}, Upper: {stats['upper_capped_count']})")
        print(f" Bounds: [{stats['lower_bound']:.4f}, {stats['upper_bound']:.4f}]")
    if not inplace:
        return df_result
def calculate_correlation_table(df, feature_columns, target_column, correlation_type='pearson'):
    """Return a styled table summarizing correlations between features and target variable.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        feature_columns (list of str): List of feature column names to correlate with target.
        target_column (str): Name of the target variable column.
        correlation_type (str): Type of correlation to calculate.
            Options: 'pearson', 'spearman', 'kendall'. Default: 'pearson'.
    Returns:
        pandas Styler: Prettified table with correlation coefficients for each feature.
    """
    from scipy import stats  # SciPy is not imported at the top of the notebook
    # Validate correlation type
    valid_types = ['pearson', 'spearman', 'kendall']
    if correlation_type not in valid_types:
        raise ValueError(f"correlation_type must be one of {valid_types}")
    # Check if target column exists
    if target_column not in df.columns:
        raise ValueError(f"Target column '{target_column}' not found in DataFrame")
    results = []
    for col in feature_columns:
        if col in df.columns:
            # Get non-null values for both columns
            valid_data = df[[col, target_column]].dropna()
            sample_size = len(valid_data)
            if sample_size < 2:
                correlation = np.nan
                p_value = np.nan
            else:
                try:
                    if correlation_type == 'pearson':
                        corr_result = stats.pearsonr(valid_data[col], valid_data[target_column])
                    elif correlation_type == 'spearman':
                        corr_result = stats.spearmanr(valid_data[col], valid_data[target_column])
                    elif correlation_type == 'kendall':
                        corr_result = stats.kendalltau(valid_data[col], valid_data[target_column])
                    # For two 1-D inputs, SciPy returns (statistic, p-value) as scalars
                    correlation = float(corr_result[0])
                    p_value = float(corr_result[1])
                except Exception:
                    # e.g. non-numeric feature columns; report as missing rather than fail
                    correlation = np.nan
                    p_value = np.nan
            # Determine significance level
            if pd.isna(p_value):
                significance = 'N/A'
            elif p_value < 0.001:
                significance = '***'
            elif p_value < 0.01:
                significance = '**'
            elif p_value < 0.05:
                significance = '*'
            else:
                significance = ''
            results.append({
                'Feature': col,
                'dtype': str(df[col].dtype),
                'Correlation': correlation,
                'P-Value': p_value,
                'Significance': significance,
                'Sample Size': sample_size,
                'Abs Correlation': abs(correlation) if not pd.isna(correlation) else np.nan
            })
        else:
            results.append({
                'Feature': col,
                'dtype': 'N/A',
                'Correlation': np.nan,
                'P-Value': np.nan,
                'Significance': 'Column Missing',
                'Sample Size': 0,
                'Abs Correlation': np.nan
            })
    # Create DataFrame from results, strongest absolute correlation first
    result_df = pd.DataFrame(results).sort_values(by=['Abs Correlation'], ascending=False, na_position='last')
    # Drop the helper column
    result_df = result_df.drop('Abs Correlation', axis=1)
    # Apply styling to the DataFrame
    styled_df = result_df.style.background_gradient(
        cmap='RdYlBu_r',
        subset=['Correlation'],
        vmin=-1,
        vmax=1
    ).format({
        'Correlation': '{:.4f}',
        'P-Value': '{:.4f}',
        'Sample Size': '{:,}'
    }).set_properties(**{
        'text-align': 'center',
        'border': '1px solid gray',
        'padding': '5px'
    }).set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#f2f2f2'),
                                     ('color', 'black'),
                                     ('font-weight', 'bold'),
                                     ('text-align', 'center'),
                                     ('border', '1px solid gray'),
                                     ('padding', '5px')]},
        {'selector': 'caption', 'props': [('caption-side', 'top'),
                                          ('font-size', '1.2em'),
                                          ('font-weight', 'bold')]}
    ]).set_caption(f'Correlation Analysis ({correlation_type.title()})')
    return styled_df
def categorize_zoning(zoning_code):
    """
    Categorizes King County zoning codes into consolidated categories.

    Parameters:
    -----------
    zoning_code : str
        The zoning designation to categorize
    Returns:
    --------
    str
        One of six categories: 'Residential Zones', 'Neighborhood Residential',
        'Low-Rise Residential', 'Special Use Zones', 'Mixed Use and Commercial',
        'Industrial and Other'
    """
    # Handle missing or null values
    if pd.isna(zoning_code) or zoning_code == '' or zoning_code == 'Unknown':
        return 'Industrial and Other'
    # Convert to string and strip whitespace
    zone = str(zoning_code).strip()
    # Neighborhood Residential (check first due to specificity)
    if re.match(r'^NR\d*$', zone):  # NR, NR1, NR2, NR3
        return 'Neighborhood Residential'
    # Low-Rise Residential
    if re.match(r'^LR\d*\s*(\([^)]*\))?', zone):  # LR1 (M), LR2 (M), LR3 (M), LR3 RC (M)
        return 'Low-Rise Residential'
    # Mixed Use and Commercial
    mixed_use_patterns = [
        r'^MU$',                          # MU
        r'^MR\s*(\([^)]*\))?',            # MR (M1)
        r'^MML\s*U/\d+',                  # MML U/85
        r'^MUR-\d+',                      # MUR-45, MUR-70
        r'^NC\d*P?-\d+\s*(\([^)]*\))?',   # NC1P-55 (M), NC2P-55 (M), NC2-40
        r'^C\d+-\d+\s*(\([^)]*\))?'       # C1-55 (M)
    ]
    if any(re.match(pattern, zone) for pattern in mixed_use_patterns):
        return 'Mixed Use and Commercial'
    # Special Use Zones
    special_use_patterns = [
        r'^SF\s*\d*\s*(-\w+)?$',  # SF 5000, SF 7200, SF-S, SF-SL
        r'^SR-[\d.]+$',           # SR-4.5, SR-6
        r'^UVSF-\d+$',            # UVSF-1
        r'^SFR\s*\d*$'            # SFR with numbers
    ]
    if any(re.match(pattern, zone) for pattern in special_use_patterns):
        return 'Special Use Zones'
    # Residential Zones (basic residential patterns)
    residential_patterns = [
        r'^R-\d+$',               # R-1, R-4, R-6, R-8
        r'^R\d+$',                # R4, R6, R8 (without dash)
        r'^RS-?\d+$',             # RS-7200, RS9600
        r'^RSA\s*\d+$',           # RSA 4, RSA 6
        r'^RSX\s*[\d.]+$',        # RSX 7.2
        r'^RSL\s*(\([^)]*\))?$',  # RSL (M)
        r'^RA[\d.]+[A-Z]*$',      # RA2.5, RA5, RA5P, RA5SO
        r'^R\s+\d+[a-z]?$'        # R 5400d
    ]
    if any(re.match(pattern, zone) for pattern in residential_patterns):
        return 'Residential Zones'
    # Industrial and Other (catch-all for remaining categories)
    industrial_other_patterns = [
        r'^[LH]DR$',         # LDR, HDR
        r'^L-\d+$',          # L-1, L-3
        r'^UL-\d+$',         # UL-7200
        r'^TC(\s*A\d+)?$',   # TC, TC A3
        r'^O$',              # O (Office)
        r'^PUD$',            # PUD
        r'^UR$',             # UR
        r'^RM\d*(-\d+)?$',   # RM1800, RM-48
        r'^A\d+$',           # A10, A35
        r'^NMF$'             # NMF
    ]
    if any(re.match(pattern, zone) for pattern in industrial_other_patterns):
        return 'Industrial and Other'
    # Default fallback for unrecognized patterns
    return 'Industrial and Other'
def extract_sale_warning_codes(df):
    """
    Extract sale warning codes into binary features.

    Note: adds the new columns to `df` in place and also returns it.
    """
    # Clean and split the sale_warning column
    df['sale_warning_clean'] = df['sale_warning'].str.strip()
    df['sale_warning_list'] = df['sale_warning_clean'].str.split()
    # Create binary columns for each possible code (1-62)
    for code in range(1, 63):
        code_str = str(code)
        df[f'sale_warning_{code}'] = df['sale_warning_list'].apply(
            lambda x: int(code_str in x) if isinstance(x, list) else 0
        )
    # Clean up intermediate columns
    df.drop(columns=['sale_warning_clean', 'sale_warning_list'], inplace=True)
    return df
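To make the encoding concrete, the sketch below applies the same idea (split the whitespace-separated codes, then emit one 0/1 column per code) to three toy values. The helper `warning_flags`, the reduced code list, and the toy values are hypothetical and only illustrate the transformation; the helper builds all columns in one DataFrame instead of adding them one at a time.

```python
import pandas as pd

def warning_flags(sale_warning: pd.Series, codes=range(1, 63)) -> pd.DataFrame:
    """One 0/1 column per warning code found in a whitespace-separated string."""
    # Blank or missing cells become empty token lists
    tokens = sale_warning.fillna("").astype(str).str.split()
    flags = {
        f"sale_warning_{code}": tokens.map(lambda t, c=str(code): int(c in t))
        for code in codes
    }
    return pd.DataFrame(flags, index=sale_warning.index)

# Toy values: a blank cell, a single code, and two codes in one cell
toy = pd.Series([" ", "26", "15 26"])
flags = warning_flags(toy, codes=[15, 26, 40])
print(flags)
```

Matching against whole tokens rather than substrings matters here: a plain substring test would wrongly flag code 2 or 6 for a cell containing "26".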
2. Data Loading and Initial Exploration
Next, we load the King County housing dataset from CSV files and perform an initial exploration to understand its structure and content. We'll examine basic statistics, data types, and distributions.
train_data = pd.read_csv('dataset.csv')
test_data = pd.read_csv('test.csv')
manage_df_display(columns=None)
train_data.head()
| | id | sale_date | sale_price | sale_nbr | sale_warning | join_status | join_year | latitude | longitude | area | city | zoning | subdivision | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other | submarket |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2014-11-15 | 236000 | 2.0 | | nochg | 2025 | 47.2917 | -122.3658 | 53 | FEDERAL WAY | RS7.2 | ALDERWOOD SOUTH DIV NO. 02 | 2 | 167000 | 372000 | 1975 | 0 | 10919 | 1560 | 1560 | 0 | 7 | 0 | 5 | 1.0 | 3 | 1 | 1 | 0 | 0 | 500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | I |
1 | 1 | 1999-01-15 | 313300 | NaN | 26 | nochg | 2025 | 47.6531 | -122.1996 | 74 | KIRKLAND | RS 8.5 | WILDWOOD LANE NO. 03 | 2 | 1184000 | 598000 | 1962 | 0 | 8900 | 2040 | 1220 | 820 | 7 | 7 | 4 | 1.0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | Q |
2 | 2 | 2006-08-15 | 341000 | 1.0 | | nochg | 2025 | 47.4733 | -122.1901 | 30 | RENTON | R-8 | FALCON RIDGE (CEDAR RIDGE) | 2 | 230000 | 356000 | 1986 | 0 | 4953 | 1640 | 820 | 0 | 7 | 0 | 3 | 2.0 | 3 | 2 | 0 | 1 | 0 | 480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K |
3 | 3 | 1999-12-15 | 267000 | 1.0 | | nochg | 2025 | 47.4739 | -122.3295 | 96 | BURIEN | RS-7200 | OLYMPIC VUE ESTATES | 2 | 190000 | 518000 | 1998 | 0 | 6799 | 2610 | 1010 | 500 | 8 | 7 | 3 | 2.0 | 4 | 2 | 0 | 1 | 0 | 530 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | G |
4 | 4 | 2018-07-15 | 1650000 | 2.0 | | miss99 | 2025 | 47.7516 | -122.1222 | 36 | KING COUNTY | RA2.5 | HOLLYWOOD HILL HIGHLANDS | 2 | 616000 | 1917000 | 1998 | 0 | 31687 | 4040 | 3640 | 0 | 12 | 0 | 3 | 2.0 | 4 | 2 | 1 | 1 | 0 | 810 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | P |
train_data.describe()
| | id | sale_price | sale_nbr | join_year | latitude | longitude | area | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 200000.000000 | 2.000000e+05 | 157818.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 2.000000e+05 | 2.000000e+05 | 200000.000000 | 200000.000000 | 2.000000e+05 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.00000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 |
mean | 99999.500000 | 5.841495e+05 | 2.162599 | 2023.403600 | 47.549248 | -122.210416 | 48.644215 | 4.108860 | 4.601691e+05 | 4.917715e+05 | 1974.184760 | 59.468830 | 1.378310e+04 | 2120.679850 | 1251.284280 | 293.238535 | 7.667290 | 2.811045 | 3.515745 | 1.523778 | 3.419390 | 1.579735 | 0.494115 | 0.493020 | 80.32632 | 274.151470 | 0.078620 | 0.006220 | 0.033505 | 0.198130 | 0.017940 | 0.053985 | 0.058800 | 0.215550 | 0.018425 | 0.055565 | 0.050075 | 0.014090 | 0.020875 | 0.013455 |
std | 57735.171256 | 4.170595e+05 | 1.113090 | 6.241643 | 0.142710 | 0.140339 | 27.132002 | 7.199323 | 3.510444e+05 | 3.680505e+05 | 30.544426 | 339.334129 | 3.793152e+04 | 909.799433 | 468.094648 | 443.577947 | 1.153746 | 3.556495 | 0.704148 | 0.526367 | 0.897639 | 0.672685 | 0.638183 | 0.525635 | 180.13173 | 288.338763 | 0.757477 | 0.078622 | 0.179952 | 0.548412 | 0.218994 | 0.379119 | 0.381868 | 0.724224 | 0.222746 | 0.380011 | 0.353664 | 0.200154 | 0.248977 | 0.181147 |
min | 0.000000 | 5.029300e+04 | 1.000000 | 1999.000000 | 47.155200 | -122.527700 | 1.000000 | 2.000000 | 0.000000e+00 | 0.000000e+00 | 1900.000000 | 0.000000 | 3.750000e+02 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 49999.750000 | 3.050000e+05 | 1.000000 | 2025.000000 | 47.446500 | -122.323800 | 26.000000 | 2.000000 | 2.310000e+05 | 2.800000e+05 | 1953.000000 | 0.000000 | 5.000000e+03 | 1460.000000 | 950.000000 | 0.000000 | 7.000000 | 0.000000 | 3.000000 | 1.000000 | 3.000000 | 1.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 99999.500000 | 4.599500e+05 | 2.000000 | 2025.000000 | 47.562800 | -122.222700 | 48.000000 | 2.000000 | 3.770000e+05 | 4.090000e+05 | 1978.000000 | 0.000000 | 7.438000e+03 | 1970.000000 | 1200.000000 | 0.000000 | 7.000000 | 0.000000 | 3.000000 | 1.500000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 0.00000 | 240.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 149999.250000 | 7.249500e+05 | 3.000000 | 2025.000000 | 47.673500 | -122.121700 | 71.000000 | 2.000000 | 5.940000e+05 | 5.990000e+05 | 2001.000000 | 0.000000 | 1.022000e+04 | 2610.000000 | 1470.000000 | 570.000000 | 8.000000 | 7.000000 | 4.000000 | 2.000000 | 4.000000 | 2.000000 | 1.000000 | 1.000000 | 0.00000 | 480.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 199999.000000 | 2.999950e+06 | 11.000000 | 2025.000000 | 47.777800 | -121.161300 | 100.000000 | 29.000000 | 1.386400e+07 | 1.006700e+07 | 2025.000000 | 2024.000000 | 2.310573e+06 | 13540.000000 | 7760.000000 | 5480.000000 | 13.000000 | 13.000000 | 5.000000 | 4.500000 | 14.000000 | 9.000000 | 8.000000 | 12.000000 | 12740.00000 | 4404.000000 | 9.000000 | 1.000000 | 1.000000 | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 |
analyze_dataframe(train_data)
==================================================
DATAFRAME ANALYSIS
==================================================
Shape: (200000, 47)
Data types:
int64      36
object      7
float64     4
Name: count, dtype: int64
--- NUMERIC COLUMNS (40) ---
sale_nbr:
- Infinite values: 0
- NaN values: 42182
- Extremely large values: 0
--- NON-NUMERIC COLUMNS (7) ---
sale_date: object, 313 unique values, 0 (0.0) missing
sale_warning: object, 142 unique values, 0 (0.0) missing
join_status: object, 8 unique values, 0 (0.0) missing
city: object, 41 unique values, 0 (0.0) missing
zoning: object, 500 unique values, 0 (0.0) missing
subdivision: object, 10376 unique values, 17550 (0.08775) missing
submarket: object, 19 unique values, 1717 (0.008585) missing
# No true duplicates.
print(train_data.duplicated().sum())
0
'''While the dataset has no unique property identifier, we can try to identify properties that have been sold
more than once by matching and sorting on the parcel's latitude and longitude and the property subdivision.'''
train_data[train_data.duplicated(subset=['latitude', 'longitude', 'subdivision'],
                                 keep=False)].sort_values(by=['latitude', 'longitude', 'subdivision']).head(10)
id | sale_date | sale_price | sale_nbr | sale_warning | join_status | join_year | latitude | longitude | area | city | zoning | subdivision | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other | submarket | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
62015 | 62015 | 2006-01-15 | 463000 | 1.0 | nochg | 2025 | 47.1712 | -121.9123 | 40 | KING COUNTY | F | NaN | 2 | 0 | 0 | 1997 | 0 | 35996 | 2370 | 1500 | 0 | 8 | 0 | 3 | 2.0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M | |
75626 | 75626 | 2023-10-15 | 720000 | 3.0 | nochg | 2025 | 47.1712 | -121.9123 | 40 | KING COUNTY | F | NaN | 2 | 0 | 0 | 1997 | 0 | 35996 | 2370 | 1500 | 0 | 8 | 0 | 3 | 2.0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M | |
12624 | 12624 | 2015-09-15 | 317000 | 1.0 | nochg | 2025 | 47.1767 | -122.0249 | 40 | KING COUNTY | A35 | GLACIER VISTA DIV NO. 03 | 2 | 135000 | 399000 | 1975 | 0 | 19465 | 1450 | 1450 | 0 | 7 | 0 | 5 | 1.0 | 3 | 1 | 1 | 0 | 0 | 500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M | |
59737 | 59737 | 2023-07-15 | 600000 | 2.0 | nochg | 2025 | 47.1767 | -122.0249 | 40 | KING COUNTY | A35 | GLACIER VISTA DIV NO. 03 | 2 | 135000 | 399000 | 1975 | 0 | 19465 | 1450 | 1450 | 0 | 7 | 0 | 5 | 1.0 | 3 | 1 | 1 | 0 | 0 | 500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M | |
62451 | 62451 | 2021-08-15 | 505000 | 2.0 | nochg | 2025 | 47.1772 | -122.0262 | 40 | KING COUNTY | A35 | GLACIER VISTA DIV NO. 03 | 2 | 147000 | 394000 | 1974 | 0 | 19465 | 1750 | 1750 | 0 | 7 | 0 | 4 | 1.0 | 3 | 1 | 1 | 0 | 0 | 510 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M | |
169708 | 169708 | 2004-04-15 | 245000 | 1.0 | nochg | 2025 | 47.1772 | -122.0262 | 40 | KING COUNTY | A35 | GLACIER VISTA DIV NO. 03 | 2 | 147000 | 394000 | 1974 | 0 | 19465 | 1750 | 1750 | 0 | 7 | 0 | 4 | 1.0 | 3 | 1 | 1 | 0 | 0 | 510 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M | |
81509 | 81509 | 2022-04-15 | 1510000 | 4.0 | new | 2025 | 47.1774 | -122.0112 | 40 | KING COUNTY | RA10 | OSCEOLA ADD | 2 | 206000 | 948000 | 2006 | 0 | 42148 | 6970 | 2040 | 1730 | 9 | 9 | 3 | 2.5 | 4 | 2 | 2 | 1 | 0 | 1140 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M | |
140601 | 140601 | 2013-01-15 | 587500 | 3.0 | new | 2025 | 47.1774 | -122.0112 | 40 | KING COUNTY | RA10 | OSCEOLA ADD | 2 | 206000 | 948000 | 2006 | 0 | 42148 | 6970 | 2040 | 1730 | 9 | 9 | 3 | 2.5 | 4 | 2 | 2 | 1 | 0 | 1140 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M | |
3338 | 3338 | 2024-11-15 | 1349000 | 4.0 | nochg | 2025 | 47.1794 | -121.9727 | 40 | KING COUNTY | A35 | NaN | 2 | 0 | 0 | 1989 | 0 | 272599 | 4500 | 2210 | 0 | 9 | 0 | 4 | 2.0 | 4 | 2 | 2 | 1 | 0 | 980 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M | |
43973 | 43973 | 2002-11-15 | 465600 | 1.0 | nochg | 2025 | 47.1794 | -121.9727 | 40 | KING COUNTY | A35 | NaN | 2 | 0 | 0 | 1989 | 0 | 272599 | 4500 | 2210 | 0 | 9 | 0 | 4 | 2.0 | 4 | 2 | 2 | 1 | 0 | 980 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | M |
'''While renovations or modifications between sales can alter a property's physical characteristics,
most of the properties linked by 'latitude', 'longitude', 'subdivision' share several physical characteristics (notably year built), suggesting
multiple sales of the same property. Example: ids 62015 and 75626 appear to be the same property based on latitude, longitude, year built,
sqft_lot, sqft, and other values.'''
print(train_data.duplicated(subset=['latitude', 'longitude', 'subdivision']).sum())
37403
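The two `duplicated` calls above count different things: `keep=False` flags every row in a repeated group, while the default flags only the later occurrences. The sketch below illustrates the difference on a small made-up frame and derives a surrogate property key from the same three columns. The names `toy`, `property_key`, and `sales_per_property`, and the `'UNKNOWN'` placeholder, are illustrative assumptions and not part of the notebook. Because coordinates are stored to four decimal places, neighbouring parcels could in principle share a key, so this is a heuristic rather than a true identifier.

```python
import pandas as pd

# Made-up rows shaped like the notebook's data (values are illustrative only).
toy = pd.DataFrame({
    "latitude": [47.1712, 47.1712, 47.1767, 47.2000],
    "longitude": [-121.9123, -121.9123, -122.0249, -122.1000],
    "subdivision": [None, None, "GLACIER VISTA DIV NO. 03", "EXAMPLE ADD"],
    "sale_date": ["2006-01-15", "2023-10-15", "2015-09-15", "2010-05-15"],
})
key = ["latitude", "longitude", "subdivision"]

# keep=False marks every member of a repeated group (both sales of a parcel).
n_rows_in_repeat_groups = int(toy.duplicated(subset=key, keep=False).sum())

# The default keep='first' marks only the second and later sales.
n_later_sales = int(toy.duplicated(subset=key).sum())

# Fill missing subdivisions with a placeholder so those parcels still get a key,
# then number each distinct (latitude, longitude, subdivision) combination.
key_frame = toy[key].fillna({"subdivision": "UNKNOWN"})
toy["property_key"] = key_frame.groupby(key).ngroup()
toy["sales_per_property"] = (
    toy.groupby("property_key")["sale_date"].transform("count")
)

print(n_rows_in_repeat_groups, n_later_sales)
print(toy[["property_key", "sales_per_property"]])
```

A key like this could later support repeat-sale features, such as the time since the previous sale of the same parcel.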
3. Data Preprocessing and Cleaning
We continue exploring the data, breaking the data down into logical categories for easier analysis of the many features present. At this stage, we will also start prepping the data for modeling by:
- Handling missing values through appropriate imputation strategies
- Detecting and addressing outliers using IQR-based methods
- Converting data types to appropriate formats (dates, categories, etc.)
- Creating masks for slicing data into meaningful segments
- Extracting information from text fields
- Encoding categorical variables for machine learning compatibility
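As a rough illustration of the IQR-based outlier step listed above, the sketch below flags values that fall more than 1.5 interquartile ranges outside the quartiles. The helper name `iqr_bounds`, the multiplier `k=1.5`, and the sample prices are assumptions chosen for the example; the notebook may use different thresholds or treat flagged rows differently.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative prices only; one value sits far above the rest.
prices = pd.Series([305_000, 459_950, 724_950, 410_000, 555_000, 2_999_950])

low, high = iqr_bounds(prices)
is_outlier = (prices < low) | (prices > high)

print(f"fences: {low:,.0f} to {high:,.0f}")
print(prices[is_outlier])
```

Whether flagged rows are dropped, capped, or simply marked is a modeling decision. For right-skewed prices, applying the fences to log prices is a common alternative.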
### Create masks for slicing data into smaller groups
sale_data = ['sale_date', 'sale_price', 'sale_nbr', 'sale_warning']
admin_data = ['join_year', 'join_status']
geo_data = ['latitude', 'longitude', 'area', 'city', 'submarket']
legal_data = ['zoning', 'subdivision', 'present_use']
assessor_data = ['land_val', 'imp_val', 'grade', 'fbsmt_grade', 'condition']
property_data = ['year_built', 'year_reno', 'sqft', 'sqft_lot', 'sqft_fbsmt', 'sqft_1', 'stories', 'beds',
'bath_full', 'bath_3qtr', 'bath_half', 'garb_sqft', 'gara_sqft', 'wfnt', 'golf', 'greenbelt',
'noise_traffic', 'view_rainier', 'view_olympics', 'view_cascades', 'view_territorial', 'view_skyline',
'view_sound', 'view_lakewash', 'view_lakesamm', 'view_otherwater', 'view_other']
### Confirm all features minus 'id' accounted for
len(sale_data + admin_data + geo_data + legal_data + assessor_data + property_data) + 1 == len(train_data.columns)
True
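The length comparison above passes whenever the counts match, so it would not catch a column listed twice while another is left out. A set-based check, sketched below, reports exactly which names are missing, unknown, or repeated. The function `check_column_groups` and the toy frame are hypothetical helpers written for this illustration.

```python
import pandas as pd

def check_column_groups(df, groups, ignore=("id",)):
    """Compare grouped column names against df.columns and report mismatches."""
    grouped = [col for cols in groups.values() for col in cols]
    expected = set(df.columns) - set(ignore)
    return {
        "missing": sorted(expected - set(grouped)),    # in df, not in any group
        "unknown": sorted(set(grouped) - expected),    # in a group, not in df
        "repeated": sorted({c for c in grouped if grouped.count(c) > 1}),
    }

# Toy example: 'sale_date' is listed twice and 'sale_price' is forgotten,
# yet the simple length check still evaluates to True.
toy = pd.DataFrame(columns=["id", "sale_date", "sale_price", "latitude", "longitude"])
groups = {"sale": ["sale_date", "sale_date"], "geo": ["latitude", "longitude"]}

grouped_names = [c for cols in groups.values() for c in cols]
length_check_passes = len(grouped_names) + 1 == len(toy.columns)
report = check_column_groups(toy, groups)

print(length_check_passes)
print(report)
```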
Admin Data
The join_status field indicates how each property record was matched to the master parcel database during data assembly:
new - Property record was newly added to the database
nochg - No change; record remained the same from previous data assembly
rebuilt - before - Property was rebuilt, and this represents the pre-reconstruction record
rebuilt - after - Property was rebuilt, and this represents the post-reconstruction record
reno - before - Property underwent renovation, representing the pre-renovation state
reno - after - Property underwent renovation, representing the post-renovation state
demo - Property was demolished
miss99 - Record was missing from the 1999 data assembly
train_data[admin_data].value_counts()
join_year  join_status
2025       nochg               126281
2025       new                  53085
1999       rebuilt - before      3706
2025       rebuilt - after       3095
1999       reno - before         3073
1999       demo                  2869
2025       reno - before         2791
1999       reno - after          2632
2025       miss99                2468
Name: count, dtype: int64
train_data[train_data['join_status'] == 'nochg'].head(10)
id | sale_date | sale_price | sale_nbr | sale_warning | join_status | join_year | latitude | longitude | area | city | zoning | subdivision | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other | submarket | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2014-11-15 | 236000 | 2.0 | nochg | 2025 | 47.2917 | -122.3658 | 53 | FEDERAL WAY | RS7.2 | ALDERWOOD SOUTH DIV NO. 02 | 2 | 167000 | 372000 | 1975 | 0 | 10919 | 1560 | 1560 | 0 | 7 | 0 | 5 | 1.0 | 3 | 1 | 1 | 0 | 0 | 500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | I | |
1 | 1 | 1999-01-15 | 313300 | NaN | 26 | nochg | 2025 | 47.6531 | -122.1996 | 74 | KIRKLAND | RS 8.5 | WILDWOOD LANE NO. 03 | 2 | 1184000 | 598000 | 1962 | 0 | 8900 | 2040 | 1220 | 820 | 7 | 7 | 4 | 1.0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | Q |
2 | 2 | 2006-08-15 | 341000 | 1.0 | nochg | 2025 | 47.4733 | -122.1901 | 30 | RENTON | R-8 | FALCON RIDGE (CEDAR RIDGE) | 2 | 230000 | 356000 | 1986 | 0 | 4953 | 1640 | 820 | 0 | 7 | 0 | 3 | 2.0 | 3 | 2 | 0 | 1 | 0 | 480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K | |
3 | 3 | 1999-12-15 | 267000 | 1.0 | nochg | 2025 | 47.4739 | -122.3295 | 96 | BURIEN | RS-7200 | OLYMPIC VUE ESTATES | 2 | 190000 | 518000 | 1998 | 0 | 6799 | 2610 | 1010 | 500 | 8 | 7 | 3 | 2.0 | 4 | 2 | 0 | 1 | 0 | 530 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | G | |
5 | 5 | 2010-02-15 | 575000 | 3.0 | nochg | 2025 | 47.6813 | -122.3666 | 82 | SEATTLE | NR3 | BALLARD PARK ADD | 2 | 409000 | 794000 | 1928 | 0 | 2850 | 2820 | 1060 | 960 | 7 | 7 | 5 | 1.5 | 4 | 3 | 0 | 0 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | B | |
6 | 6 | 2016-10-15 | 276000 | 2.0 | nochg | 2025 | 47.4245 | -122.1773 | 51 | KENT | SR-6 | VISTA VIEW HEIGHTS NO. 02 | 2 | 220000 | 301000 | 1968 | 0 | 11261 | 1180 | 1180 | 0 | 7 | 0 | 4 | 1.0 | 3 | 1 | 0 | 0 | 0 | 480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K | |
7 | 7 | 2001-08-15 | 235000 | 1.0 | nochg | 2025 | 47.3090 | -122.3490 | 54 | FEDERAL WAY | RS7.2 | WEST CAMPUS DIV NO. 04 | 2 | 166000 | 456000 | 1985 | 0 | 9765 | 2040 | 1120 | 0 | 8 | 0 | 4 | 2.0 | 3 | 2 | 0 | 1 | 0 | 560 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | I | |
8 | 8 | 2002-01-15 | 239950 | 1.0 | nochg | 2025 | 47.4955 | -122.3565 | 96 | BURIEN | RS-12000 | SEAMOUNT ADD | 2 | 274000 | 511000 | 1962 | 0 | 11000 | 2180 | 1090 | 1090 | 7 | 7 | 5 | 1.0 | 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | G | |
11 | 11 | 2004-09-15 | 229950 | 1.0 | nochg | 2025 | 47.4216 | -122.1525 | 60 | KING COUNTY | R6 | FOWLERS ADD | 2 | 203000 | 270000 | 1965 | 0 | 10289 | 1230 | 1230 | 0 | 7 | 0 | 4 | 1.0 | 3 | 1 | 0 | 1 | 0 | 490 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K | |
12 | 12 | 2018-03-15 | 755000 | 2.0 | nochg | 2025 | 47.6219 | -122.0390 | 35 | SAMMAMISH | R4 | INGLEWOOD GLEN | 2 | 531000 | 623000 | 1982 | 0 | 20213 | 2270 | 810 | 810 | 8 | 8 | 4 | 2.0 | 4 | 2 | 0 | 1 | 0 | 400 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | O |
train_data[['sale_price', 'join_status', 'join_year', 'year_reno', 'condition', 'grade']].groupby(by=['join_status', 'join_year']).agg(['min', 'median', 'max'])
sale_price | year_reno | condition | grade | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | median | max | min | median | max | min | median | max | min | median | max | ||
join_status | join_year | ||||||||||||
demo | 1999 | 52400 | 410000.0 | 2975000 | 0 | 0.0 | 1998 | 1 | 3.0 | 5 | 1 | 7.0 | 12 |
miss99 | 2025 | 50500 | 450000.0 | 2995000 | 0 | 0.0 | 2023 | 1 | 3.0 | 5 | 4 | 8.0 | 13 |
new | 2025 | 50667 | 555000.0 | 2998000 | 0 | 0.0 | 2024 | 2 | 3.0 | 5 | 6 | 8.0 | 13 |
nochg | 2025 | 50293 | 415000.0 | 2999500 | 0 | 0.0 | 0 | 1 | 4.0 | 5 | 1 | 7.0 | 13 |
rebuilt - after | 2025 | 50462 | 405000.0 | 2925000 | 0 | 0.0 | 2023 | 3 | 3.0 | 5 | 5 | 9.0 | 13 |
rebuilt - before | 1999 | 65000 | 975000.0 | 2999950 | 0 | 0.0 | 1998 | 1 | 3.0 | 5 | 1 | 7.0 | 12 |
reno - after | 1999 | 50300 | 385000.0 | 2770000 | 0 | 0.0 | 1999 | 1 | 3.0 | 5 | 3 | 7.0 | 13 |
reno - before | 1999 | 54000 | 550000.0 | 2998000 | 0 | 1984.0 | 1999 | 1 | 3.0 | 5 | 3 | 7.0 | 13 |
2025 | 70000 | 837500.0 | 2998000 | 1999 | 2006.0 | 2023 | 2 | 3.0 | 5 | 5 | 8.0 | 13 |
Drop records with a join_year of 1999, as they likely contain outdated assessment information
Only 2025 records with a join_status of 'new' or 'nochg' are retained. Besides the 1999 records, this also removes the 2025 records flagged 'rebuilt - after', 'reno - before', and 'miss99', leaving 179,366 of the 200,000 rows. join_status and join_year describe when a record was added to the assessor table, so they would be unavailable at prediction time. We keep these two columns for now, however, as they may provide insight into other features of the dataset and help in investigating potential quality issues.
train_data = train_data[(train_data['join_year'] == 2025) & train_data['join_status'].isin(['new', 'nochg'])].copy()
train_data.head(20)
id | sale_date | sale_price | sale_nbr | sale_warning | join_status | join_year | latitude | longitude | area | city | zoning | subdivision | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other | submarket | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2014-11-15 | 236000 | 2.0 | nochg | 2025 | 47.2917 | -122.3658 | 53 | FEDERAL WAY | RS7.2 | ALDERWOOD SOUTH DIV NO. 02 | 2 | 167000 | 372000 | 1975 | 0 | 10919 | 1560 | 1560 | 0 | 7 | 0 | 5 | 1.0 | 3 | 1 | 1 | 0 | 0 | 500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | I | |
1 | 1 | 1999-01-15 | 313300 | NaN | 26 | nochg | 2025 | 47.6531 | -122.1996 | 74 | KIRKLAND | RS 8.5 | WILDWOOD LANE NO. 03 | 2 | 1184000 | 598000 | 1962 | 0 | 8900 | 2040 | 1220 | 820 | 7 | 7 | 4 | 1.0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | Q |
2 | 2 | 2006-08-15 | 341000 | 1.0 | nochg | 2025 | 47.4733 | -122.1901 | 30 | RENTON | R-8 | FALCON RIDGE (CEDAR RIDGE) | 2 | 230000 | 356000 | 1986 | 0 | 4953 | 1640 | 820 | 0 | 7 | 0 | 3 | 2.0 | 3 | 2 | 0 | 1 | 0 | 480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K | |
3 | 3 | 1999-12-15 | 267000 | 1.0 | nochg | 2025 | 47.4739 | -122.3295 | 96 | BURIEN | RS-7200 | OLYMPIC VUE ESTATES | 2 | 190000 | 518000 | 1998 | 0 | 6799 | 2610 | 1010 | 500 | 8 | 7 | 3 | 2.0 | 4 | 2 | 0 | 1 | 0 | 530 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | G | |
5 | 5 | 2010-02-15 | 575000 | 3.0 | nochg | 2025 | 47.6813 | -122.3666 | 82 | SEATTLE | NR3 | BALLARD PARK ADD | 2 | 409000 | 794000 | 1928 | 0 | 2850 | 2820 | 1060 | 960 | 7 | 7 | 5 | 1.5 | 4 | 3 | 0 | 0 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | B | |
6 | 6 | 2016-10-15 | 276000 | 2.0 | nochg | 2025 | 47.4245 | -122.1773 | 51 | KENT | SR-6 | VISTA VIEW HEIGHTS NO. 02 | 2 | 220000 | 301000 | 1968 | 0 | 11261 | 1180 | 1180 | 0 | 7 | 0 | 4 | 1.0 | 3 | 1 | 0 | 0 | 0 | 480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K | |
7 | 7 | 2001-08-15 | 235000 | 1.0 | nochg | 2025 | 47.3090 | -122.3490 | 54 | FEDERAL WAY | RS7.2 | WEST CAMPUS DIV NO. 04 | 2 | 166000 | 456000 | 1985 | 0 | 9765 | 2040 | 1120 | 0 | 8 | 0 | 4 | 2.0 | 3 | 2 | 0 | 1 | 0 | 560 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | I | |
8 | 8 | 2002-01-15 | 239950 | 1.0 | nochg | 2025 | 47.4955 | -122.3565 | 96 | BURIEN | RS-12000 | SEAMOUNT ADD | 2 | 274000 | 511000 | 1962 | 0 | 11000 | 2180 | 1090 | 1090 | 7 | 7 | 5 | 1.0 | 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | G | |
10 | 10 | 2020-10-15 | 664000 | 5.0 | new | 2025 | 47.7021 | -122.0210 | 95 | KING COUNTY | R6 | TRILOGY AT REDMOND RIDGE | 2 | 451000 | 573000 | 2002 | 0 | 6558 | 1680 | 1680 | 0 | 8 | 0 | 3 | 1.0 | 2 | 1 | 1 | 0 | 0 | 440 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | P | |
11 | 11 | 2004-09-15 | 229950 | 1.0 | nochg | 2025 | 47.4216 | -122.1525 | 60 | KING COUNTY | R6 | FOWLERS ADD | 2 | 203000 | 270000 | 1965 | 0 | 10289 | 1230 | 1230 | 0 | 7 | 0 | 4 | 1.0 | 3 | 1 | 0 | 1 | 0 | 490 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K | |
12 | 12 | 2018-03-15 | 755000 | 2.0 | nochg | 2025 | 47.6219 | -122.0390 | 35 | SAMMAMISH | R4 | INGLEWOOD GLEN | 2 | 531000 | 623000 | 1982 | 0 | 20213 | 2270 | 810 | 810 | 8 | 8 | 4 | 2.0 | 4 | 2 | 0 | 1 | 0 | 400 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | O | |
13 | 13 | 2021-05-15 | 2170000 | 3.0 | nochg | 2025 | 47.6552 | -122.0174 | 71 | KING COUNTY | RA2.5 | RIMWOOD DIV NO. 02 | 2 | 554000 | 1495000 | 1990 | 0 | 57063 | 4410 | 2810 | 0 | 9 | 0 | 4 | 2.0 | 5 | 2 | 1 | 2 | 0 | 650 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | O | |
14 | 14 | 2002-03-15 | 325000 | 1.0 | nochg | 2025 | 47.6882 | -122.3361 | 43 | SEATTLE | NR3 | DENNYS H L 1ST GREEN LAKE ADD | 2 | 521000 | 269000 | 1923 | 0 | 3160 | 1560 | 820 | 400 | 7 | 6 | 4 | 1.5 | 3 | 1 | 1 | 0 | 200 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | B | |
15 | 15 | 2006-02-15 | 175000 | 1.0 | nochg | 2025 | 47.2821 | -122.3532 | 54 | FEDERAL WAY | RS9.6 | RAINIER MANOR ADD | 2 | 116000 | 282000 | 1967 | 0 | 9085 | 1100 | 1100 | 0 | 6 | 0 | 4 | 1.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | I | |
16 | 16 | 2012-06-15 | 470000 | 3.0 | new | 2025 | 47.5977 | -122.0157 | 35 | SAMMAMISH | R6 | RENAISSANCE DIV 1 | 2 | 443000 | 745000 | 1999 | 0 | 7898 | 2080 | 910 | 0 | 8 | 0 | 3 | 2.0 | 3 | 2 | 0 | 1 | 0 | 740 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | O | |
17 | 17 | 2000-07-15 | 174500 | NaN | nochg | 2025 | 47.3199 | -122.2139 | 28 | AUBURN | R7 | HILLMANS CD AUBURNDALE DIV NO. 02 | 2 | 147000 | 365000 | 1958 | 0 | 11200 | 1600 | 1600 | 0 | 7 | 0 | 4 | 1.0 | 4 | 2 | 0 | 0 | 0 | 340 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | L | |
18 | 18 | 2016-09-15 | 325000 | 3.0 | nochg | 2025 | 47.3669 | -122.2876 | 26 | KENT | SR-6 | RANDALL PARK DIV NO. 01 | 2 | 159000 | 425000 | 1977 | 0 | 7081 | 2420 | 1390 | 1030 | 7 | 6 | 4 | 1.0 | 3 | 1 | 1 | 1 | 310 | 240 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | I | |
19 | 19 | 2013-04-15 | 289995 | 1.0 | new | 2025 | 47.3604 | -122.0809 | 86 | COVINGTON | R8 | CORNERSTONE | 2 | 259000 | 376000 | 2012 | 0 | 4070 | 2400 | 1020 | 0 | 7 | 0 | 3 | 2.0 | 5 | 2 | 1 | 0 | 0 | 400 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K | |
20 | 20 | 2015-04-15 | 782088 | 1.0 | new | 2025 | 47.5301 | -122.0738 | 65 | ISSAQUAH | UVSF-1 | TALUS PARCELS 10, 11 & 12 | 2 | 471000 | 1033000 | 2014 | 0 | 4224 | 2590 | 1190 | 310 | 10 | 9 | 3 | 2.0 | 4 | 2 | 1 | 0 | 420 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | R | |
22 | 22 | 2006-11-15 | 345000 | 2.0 | nochg | 2025 | 47.4983 | -122.2552 | 22 | KING COUNTY | R6 | BALCHS ALBERT PANORAMA VIEW NO. 02 | 2 | 214000 | 291000 | 1963 | 0 | 6500 | 1770 | 1170 | 600 | 7 | 6 | 3 | 1.0 | 3 | 2 | 0 | 0 | 570 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | J |
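Row filters that mix `&` and `|` are easy to misread, because Python evaluates `&` before `|`. The sketch below uses a made-up frame to show how an unparenthesised mix can admit rows that an explicit "2025 and (new or nochg)" filter would exclude. In this dataset the join_status counts shown earlier indicate that every 'nochg' record has a join_year of 2025, so both readings select the same 179,366 rows here, but that would not hold in general. The frame `toy` and the mask names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; the 1999 'nochg' row does not occur in the real data.
toy = pd.DataFrame({
    "join_year": [2025, 2025, 1999, 2025, 1999],
    "join_status": ["nochg", "new", "nochg", "miss99", "demo"],
})

is_2025 = toy["join_year"] == 2025
is_new = toy["join_status"] == "new"
is_nochg = toy["join_status"] == "nochg"

# `&` binds tighter than `|`, so this reads as (is_2025 & is_new) | is_nochg
# and lets the 1999 'nochg' row through.
implicit = toy[is_2025 & is_new | is_nochg]

# Explicit intent: join_year is 2025 AND status is one of the two kept values.
explicit = toy[is_2025 & toy["join_status"].isin(["new", "nochg"])]

print(list(implicit.index))
print(list(explicit.index))
```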
Sale Data
analyze_dataframe(train_data[sale_data])
==================================================
DATAFRAME ANALYSIS
==================================================
Shape: (179366, 4)
Data types:
object     2
int64      1
float64    1
Name: count, dtype: int64
--- NUMERIC COLUMNS (2) ---
sale_nbr:
- Infinite values: 0
- NaN values: 39197
- Extremely large values: 0
--- NON-NUMERIC COLUMNS (2) ---
sale_date: object, 313 unique values, 0 (0.0) missing
sale_warning: object, 107 unique values, 0 (0.0) missing
### Convert Sale Date to datetime format
train_data['sale_date'] = pd.to_datetime(train_data.sale_date)
### Examine median price per year
train_data[['sale_date', 'sale_price']].groupby(train_data['sale_date'].dt.year).median()
sale_date | sale_price | |
---|---|---|
sale_date | ||
1999 | 1999-07-15 | 228950.0 |
2000 | 2000-06-15 | 249950.0 |
2001 | 2001-07-15 | 259950.0 |
2002 | 2002-06-15 | 278000.0 |
2003 | 2003-07-15 | 289980.0 |
2004 | 2004-07-15 | 323434.0 |
2005 | 2005-07-15 | 366874.0 |
2006 | 2006-06-15 | 412000.0 |
2007 | 2007-06-15 | 450000.0 |
2008 | 2008-06-15 | 428000.0 |
2009 | 2009-07-15 | 395000.0 |
2010 | 2010-06-15 | 390080.0 |
2011 | 2011-07-15 | 385000.0 |
2012 | 2012-07-15 | 395000.0 |
2013 | 2013-07-15 | 425000.0 |
2014 | 2014-07-15 | 454100.0 |
2015 | 2015-07-15 | 483000.0 |
2016 | 2016-07-15 | 535000.0 |
2017 | 2017-07-15 | 610000.0 |
2018 | 2018-06-15 | 650000.0 |
2019 | 2019-07-15 | 655000.0 |
2020 | 2020-07-15 | 705000.0 |
2021 | 2021-07-15 | 810000.0 |
2022 | 2022-06-15 | 889000.0 |
2023 | 2023-06-15 | 860000.0 |
2024 | 2024-07-15 | 925000.0 |
2025 | 2025-01-15 | 849950.0 |
### Convert Sale Price to 2025 dollars
inflation_dict = {}
for year in range(1999, 2025):
    mul = cpi.inflate(1, date(year, 12, 31), items='Housing', area='Seattle-Tacoma-Bellevue WA')
    inflation_dict[year] = mul
inflation_dict.update({2025: 1})
def adjust_for_inflation(dollar_value, year):
    updated_value = dollar_value * inflation_dict[year]
    return updated_value
train_data['adjusted_sale_price'] = train_data.apply(lambda row: adjust_for_inflation(row['sale_price'], row['sale_date'].year), axis=1).astype(int)
print(train_data[['sale_date', 'sale_price', 'adjusted_sale_price']].sort_values(by='sale_date'))
        sale_date  sale_price  adjusted_sale_price
12292  1999-01-15      118000               292628
29264  1999-01-15      285000               706771
29265  1999-01-15      149500               370745
19967  1999-01-15      139950               347061
153419 1999-01-15      179361               444797
...           ...         ...                  ...
16444  2025-01-15      886000               886000
158317 2025-01-15      745325               745325
172780 2025-01-15      610000               610000
96707  2025-01-15      865000               865000
50186  2025-01-15      980000               980000
[179366 rows x 3 columns]
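The row-wise `apply` above works, but on roughly 180,000 rows a vectorized lookup is usually faster and gives the same result. The sketch below maps each sale year to its multiplier and multiplies whole columns at once. The multipliers in `toy_multipliers` are invented placeholders, not CPI values; in the notebook they would come from `inflation_dict`.

```python
import pandas as pd

# Placeholder multipliers for illustration only (not real CPI figures).
toy_multipliers = {2023: 1.10, 2024: 1.05, 2025: 1.0}

sales = pd.DataFrame({
    "sale_date": pd.to_datetime(["2023-06-15", "2024-07-15", "2025-01-15"]),
    "sale_price": [860_000, 925_000, 849_950],
})

# Look up one multiplier per row from the sale year.
factor = sales["sale_date"].dt.year.map(toy_multipliers)

# A year missing from the dictionary would map to NaN, so fail early if so.
if factor.isna().any():
    raise ValueError("Some sale years have no inflation multiplier")

# astype(int) truncates toward zero, mirroring the notebook's conversion.
sales["adjusted_sale_price"] = (sales["sale_price"] * factor).astype(int)

print(sales)
```

The explicit NaN check replaces the `KeyError` that the dictionary lookup inside `adjust_for_inflation` would raise for an unexpected year.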
# Remove unadjusted sale price
train_data.drop('sale_price', axis=1, inplace=True)
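Use sale_nbr codes to keep only good (i.e., fair-market) sales: 00, 01, and 06
The code list below also describes 11 as a good sale, but the filter as written excludes it along with codes 02 through 05 and 07 through 10; remove 11 from the exclusion list if those sales should be retained. Records with a missing sale_nbr pass the filter and are later labelled 'standard'. After filtering, the sale_nbr column should be dropped. The sale_warning column will be used to capture important notes on the property at the time of sale that may have affected the price.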
Sale Number (sale_nbr) Codes
- 00: GOOD SALE/Arm's Length Transaction: A standard, fair market value sale between unrelated parties.
- 01: Arm's Length Transaction/Partial Sale-Split: A fair market value sale involving only a portion of the parent tract, requiring a new partial assessment.
- 02: Non-Arm's Length Transaction/Sales between family members or related entities: Transactions where the buyer and seller are related or have a pre-existing relationship that might influence the sale price.
- 03: Change of Property use: The property's use category changes (e.g., commercial to residential), prompting a property class change.
- 04: Sale which includes a significant amount of personal property: The sale encompasses more than just the real estate, including items like furniture or appliances.
- 05: Forced Sale/Sale in which government agency or charitable group/non taxable entity is involved: Examples include sales involving religious organizations or transitions from taxable to non-taxable entities.
- 06: GOOD SALE/Sale in which property has previously sold within last 12 months: A good sale where the property has been involved in a recent transaction.
- 07: Transfer in which the sale was less than $10,000: A transfer of property at a lower value.
- 08: Change in property: Significant changes to the property through construction or updates occurred between the assessment and sale dates (e.g., renovations, additions).
- 09: Transfer tax not paid: No sale price recorded, indicating the transfer tax wasn't paid.
- 10: ASSEMBLAGE/Sale in which purchaser owns the adjoining property: The buyer owns an adjacent property, making it an assemblage (combining parcels) which may not be a good sale at fair market value.
- 11: GOOD SALE/Sale in which a single price applies to more than one property: A good sale involving multiple properties conveyed under one price
train_data = train_data[~train_data['sale_nbr'].isin([2,3,4,5,7,8,9,10,11])]
analyze_dataframe(train_data)
==================================================
DATAFRAME ANALYSIS
==================================================
Shape: (83483, 47)
Data types:
int64             36
object             6
float64            4
datetime64[ns]     1
Name: count, dtype: int64
--- NUMERIC COLUMNS (40) ---
sale_nbr:
- Infinite values: 0
- NaN values: 39197
- Extremely large values: 0
--- NON-NUMERIC COLUMNS (7) ---
sale_date: datetime64[ns], 313 unique values, 0 (0.0) missing
sale_warning: object, 88 unique values, 0 (0.0) missing
join_status: object, 2 unique values, 0 (0.0) missing
city: object, 40 unique values, 0 (0.0) missing
zoning: object, 348 unique values, 0 (0.0) missing
subdivision: object, 9217 unique values, 7254 (0.08689194207203862) missing
submarket: object, 19 unique values, 801 (0.009594767797036522) missing
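Note that the sale_nbr NaN count is 39,197 both before and after the filter: an exclusion filter built with `~isin(...)` keeps rows whose code is missing, because `isin` returns False for NaN. The sketch below shows this on a small made-up Series and contrasts it with an inclusion filter, which drops missing codes. The variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up codes, including one missing value.
codes = pd.Series([np.nan, 1.0, 2.0, 6.0, 11.0], name="sale_nbr")
excluded = [2, 3, 4, 5, 7, 8, 9, 10, 11]

# Exclusion: NaN is "not in" the list, so the missing code is kept.
kept_by_exclusion = codes[~codes.isin(excluded)]

# Inclusion: only the listed codes survive, so the missing code is dropped.
kept_by_inclusion = codes[codes.isin([1, 6])]

print(kept_by_exclusion.tolist())
print(kept_by_inclusion.tolist())
```

Which behaviour is appropriate depends on whether a missing code can safely be read as a standard sale, which is the assumption the recoding step makes when it labels NaN as 'standard'.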
train_data['sale_nbr'] = train_data['sale_nbr'].astype('str')
train_data['sale_nbr'] = train_data['sale_nbr'].replace(to_replace='nan', value='standard')
train_data['sale_nbr'] = train_data['sale_nbr'].replace(to_replace='1.0', value='partial_split')
train_data['sale_nbr'] = train_data['sale_nbr'].replace(to_replace='6.0', value='sold_last_12mths')
column_analysis(train_data)
======================================================================
DETAILED COLUMN ANALYSIS - ALL COLUMNS
======================================================================
COLUMN: id
======================================================================
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 1 | 1 | 0.00% | 0.00% |
#2 | 2 | 1 | 0.00% | 0.00% |
#3 | 3 | 1 | 0.00% | 0.00% |
#4 | 7 | 1 | 0.00% | 0.00% |
#5 | 8 | 1 | 0.00% | 0.01% |
#6 | 11 | 1 | 0.00% | 0.01% |
#7 | 14 | 1 | 0.00% | 0.01% |
#8 | 15 | 1 | 0.00% | 0.01% |
#9 | 17 | 1 | 0.00% | 0.01% |
#10 | 19 | 1 | 0.00% | 0.01% |
STATISTICS:
• Unique values: 83,483
• Missing values: 0 (0.00%)
• Most frequent: '1' (1 times)
• Least frequent: '19' (1 times)
• Minimum value: 1
• Maximum value: 199999
• All values are unique (potential ID column)
COLUMN: sale_date
======================================================================
Data Type: datetime64[ns]
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 1999-06-15 00:00:00 | 731 | 0.88% | 0.88% |
#2 | 1999-03-15 00:00:00 | 692 | 0.83% | 1.70% |
#3 | 1999-07-15 00:00:00 | 680 | 0.81% | 2.52% |
#4 | 2000-06-15 00:00:00 | 661 | 0.79% | 3.31% |
#5 | 1999-08-15 00:00:00 | 659 | 0.79% | 4.10% |
#6 | 2004-06-15 00:00:00 | 648 | 0.78% | 4.88% |
#7 | 1999-09-15 00:00:00 | 646 | 0.77% | 5.65% |
#8 | 1999-05-15 00:00:00 | 640 | 0.77% | 6.42% |
#9 | 1999-04-15 00:00:00 | 626 | 0.75% | 7.17% |
#10 | 2003-08-15 00:00:00 | 612 | 0.73% | 7.90% |
STATISTICS:
• Unique values: 313
• Missing values: 0 (0.00%)
• Most frequent: '1999-06-15 00:00:00' (731 times)
• Least frequent: '2003-08-15 00:00:00' (612 times)
COLUMN: sale_nbr
======================================================================
Data Type: object
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | partial_split | 43,196 | 51.74% | 51.74% |
#2 | standard | 39,197 | 46.95% | 98.69% |
#3 | sold_last_12mths | 1,090 | 1.31% | 100.00% |
STATISTICS:
• Unique values: 3
• Missing values: 0 (0.00%)
• Most frequent: 'partial_split' (43,196 times)
• Least frequent: 'sold_last_12mths' (1,090 times)
• Low cardinality (good for categorical analysis)
COLUMN: sale_warning
======================================================================
Data Type: object
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 |  | 76,909 | 92.13% | 92.13% |
#2 | 26 | 3,205 | 3.84% | 95.96% |
#3 | 15 | 1,415 | 1.69% | 97.66% |
#4 | 40 | 445 | 0.53% | 98.19% |
#5 | 15 26 | 310 | 0.37% | 98.56% |
#6 | 56 | 238 | 0.29% | 98.85% |
#7 | 41 | 173 | 0.21% | 99.06% |
#8 | 29 | 106 | 0.13% | 99.18% |
#9 | 10 | 76 | 0.09% | 99.27% |
#10 | 60 | 76 | 0.09% | 99.37% |
STATISTICS:
• Unique values: 88
• Missing values: 0 (0.00%)
• Most frequent: ' ' (76,909 times)
• Least frequent: ' 60 ' (76 times)
COLUMN: join_status
======================================================================
Data Type: object
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | nochg | 64,933 | 77.78% | 77.78% |
#2 | new | 18,550 | 22.22% | 100.00% |
STATISTICS:
• Unique values: 2
• Missing values: 0 (0.00%)
• Most frequent: 'nochg' (64,933 times)
• Least frequent: 'new' (18,550 times)
• Low cardinality (good for categorical analysis)
COLUMN: join_year
======================================================================
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 2025 | 83,483 | 100.00% | 100.00% |
STATISTICS:
• Unique values: 1
• Missing values: 0 (0.00%)
• Most frequent: '2025' (83,483 times)
• Least frequent: '2025' (83,483 times)
• Minimum value: 2025
• Maximum value: 2025
• All values are the same (constant column)
COLUMN: latitude
======================================================================
Data Type: float64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 47.6853 | 56 | 0.07% | 0.07% |
#2 | 47.6882 | 55 | 0.07% | 0.13% |
#3 | 47.6911 | 54 | 0.06% | 0.20% |
#4 | 47.6721 | 54 | 0.06% | 0.26% |
#5 | 47.6727 | 51 | 0.06% | 0.32% |
#6 | 47.6901 | 51 | 0.06% | 0.38% |
#7 | 47.6842 | 47 | 0.06% | 0.44% |
#8 | 47.5671 | 47 | 0.06% | 0.50% |
#9 | 47.5517 | 45 | 0.05% | 0.55% |
#10 | 47.6919 | 45 | 0.05% | 0.60% |
STATISTICS:
• Unique values: 5,589
• Missing values: 0 (0.00%)
• Most frequent: '47.6853' (56 times)
• Least frequent: '47.6919' (45 times)
• Minimum value: 47.1552
• Maximum value: 47.7778
COLUMN: longitude
======================================================================
Data Type: float64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | -122.351 | 75 | 0.09% | 0.09% |
#2 | -122.349 | 63 | 0.08% | 0.17% |
#3 | -122.362 | 62 | 0.07% | 0.24% |
#4 | -122.288 | 61 | 0.07% | 0.31% |
#5 | -122.308 | 59 | 0.07% | 0.38% |
#6 | -122.29 | 59 | 0.07% | 0.45% |
#7 | -122.363 | 57 | 0.07% | 0.52% |
#8 | -122.314 | 57 | 0.07% | 0.59% |
#9 | -122.3 | 55 | 0.07% | 0.66% |
#10 | -122.356 | 55 | 0.07% | 0.72% |
STATISTICS:
• Unique values: 6,332
• Missing values: 0 (0.00%)
• Most frequent: '-122.3509' (75 times)
• Least frequent: '-122.356' (55 times)
• Minimum value: -122.5272
• Maximum value: -121.1616
COLUMN: area
======================================================================
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 69 β 2,197 β 2.63% β 2.63% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 35 β 2,027 β 2.43% β 5.06% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 32 β 1,565 β 1.87% β 6.93% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 37 β 1,550 β 1.86% β 8.79% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 72 β 1,543 β 1.85% β 10.64% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 56 β 1,509 β 1.81% β 12.45% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 6 β 1,506 β 1.80% β 14.25% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 53 β 1,500 β 1.80% β 16.05% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 93 β 1,457 β 1.75% β 17.79% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 73 β 1,388 β 1.66% β 19.46% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 89 β’ Missing values: 0 (0.00%) β’ Most frequent: '69' (2,197 times) β’ Least frequent: '73' (1,388 times) β’ Minimum value: 1 β’ Maximum value: 100 COLUMN: city ====================================================================== Data Type: object Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β SEATTLE β 23,874 β 28.60% β 28.60% β ββββββββββΌββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β KING COUNTY β 10,996 β 13.17% β 41.77% β ββββββββββΌββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β BELLEVUE β 
4,891 β 5.86% β 47.63% β ββββββββββΌββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β KENT β 4,256 β 5.10% β 52.73% β ββββββββββΌββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β SAMMAMISH β 4,186 β 5.01% β 57.74% β ββββββββββΌββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β RENTON β 4,055 β 4.86% β 62.60% β ββββββββββΌββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β FEDERAL WAY β 3,596 β 4.31% β 66.90% β ββββββββββΌββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β KIRKLAND β 3,519 β 4.22% β 71.12% β ββββββββββΌββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β SHORELINE β 2,480 β 2.97% β 74.09% β ββββββββββΌββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β AUBURN β 2,444 β 2.93% β 77.02% β ββββββββββ§ββββββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 40 β’ Missing values: 0 (0.00%) β’ Most frequent: 'SEATTLE' (23,874 times) β’ Least frequent: 'AUBURN' (2,444 times) COLUMN: zoning ====================================================================== Data Type: object Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β NR3 β 12,795 β 15.33% β 15.33% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β R6 β 7,575 β 9.07% β 24.40% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β R4 β 4,774 β 5.72% β 30.12% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β RA5 β 3,778 β 4.53% β 34.64% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β NR2 β 3,498 β 4.19% β 38.83% β 
ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β SR-6 β 3,063 β 3.67% β 42.50% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β R-5 β 2,583 β 3.09% β 45.60% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β R-6 β 2,521 β 3.02% β 48.62% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β RS7.2 β 2,445 β 2.93% β 51.55% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β NR β 2,311 β 2.77% β 54.31% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 348 β’ Missing values: 0 (0.00%) β’ Most frequent: 'NR3' (12,795 times) β’ Least frequent: 'NR' (2,311 times) COLUMN: subdivision ====================================================================== Data Type: object Total Rows: 83,483 | Non-Missing: 76,229 | Missing: 7,254 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€βββββββββββββββββββββββββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺβββββββββββββββββββββββββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β MAPLE LEAF TO GREEN LAKE CI... β 312 β 0.37% β 0.37% β ββββββββββΌβββββββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β GILMAN PARK ADD BLKS 01 THR... β 288 β 0.34% β 0.72% β ββββββββββΌβββββββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β HOMECROFT ADD β 194 β 0.23% β 0.95% β ββββββββββΌβββββββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β CHEROKEE BAY PARK ASSESSORS... 
β 186 β 0.22% β 1.17% β ββββββββββΌβββββββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β GILMANS ADD BLKS 01 THRU 87 β 180 β 0.22% β 1.39% β ββββββββββΌβββββββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β SEA VIEW PARK β 171 β 0.20% β 1.59% β ββββββββββΌβββββββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β MC MICKEN HEIGHTS DIV NO. 02 β 154 β 0.18% β 1.78% β ββββββββββΌβββββββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β SALMON BAY PARK ADD β 151 β 0.18% β 1.96% β ββββββββββΌβββββββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β STATE ADD TO SEATTLE NO. 04 β 139 β 0.17% β 2.13% β ββββββββββΌβββββββββββββββββββββββββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β SOUTH PARK β 120 β 0.14% β 2.27% β ββββββββββ§βββββββββββββββββββββββββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 9,217 β’ Missing values: 7,254 (8.69%) β’ Most frequent: 'MAPLE LEAF TO GREEN LAKE CIRCLE POR OF' (312 times) β’ Least frequent: 'SOUTH PARK' (120 times) COLUMN: present_use ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 2 β 76,334 β 91.44% β 91.44% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 29 β 6,667 β 7.99% β 99.42% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 6 β 482 β 0.58% β 100.00% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 3 β’ Missing 
values: 0 (0.00%) β’ Most frequent: '2' (76,334 times) β’ Least frequent: '6' (482 times) β’ Minimum value: 2 β’ Maximum value: 29 β’ Low cardinality (good for categorical analysis) COLUMN: land_val ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 207000 β 613 β 0.73% β 0.73% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 216000 β 603 β 0.72% β 1.46% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 0 β 524 β 0.63% β 2.08% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 177000 β 497 β 0.60% β 2.68% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 135000 β 461 β 0.55% β 3.23% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 486000 β 457 β 0.55% β 3.78% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 196000 β 454 β 0.54% β 4.32% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 243000 β 445 β 0.53% β 4.86% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 268000 β 429 β 0.51% β 5.37% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 225000 β 423 β 0.51% β 5.88% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 2,289 β’ Missing values: 0 (0.00%) β’ Most frequent: '207000' (613 times) β’ Least frequent: '225000' (423 times) β’ Minimum value: 0 β’ Maximum value: 10037000 COLUMN: imp_val ====================================================================== Data Type: int64 Total Rows: 83,483 
| Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 1000 β 1,452 β 1.74% β 1.74% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 0 β 543 β 0.65% β 2.39% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 333000 β 228 β 0.27% β 2.66% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 375000 β 224 β 0.27% β 2.93% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 370000 β 224 β 0.27% β 3.20% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 398000 β 222 β 0.27% β 3.47% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 344000 β 222 β 0.27% β 3.73% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 390000 β 212 β 0.25% β 3.99% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 362000 β 210 β 0.25% β 4.24% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 359000 β 210 β 0.25% β 4.49% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 2,234 β’ Missing values: 0 (0.00%) β’ Most frequent: '1000' (1,452 times) β’ Least frequent: '359000' (210 times) β’ Minimum value: 0 β’ Maximum value: 6653000 COLUMN: year_built ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 
1977 β 1,687 β 2.02% β 2.02% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 1978 β 1,674 β 2.01% β 4.03% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 1968 β 1,611 β 1.93% β 5.96% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 1990 β 1,510 β 1.81% β 7.76% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 1967 β 1,475 β 1.77% β 9.53% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 1989 β 1,416 β 1.70% β 11.23% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 1962 β 1,385 β 1.66% β 12.89% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 1987 β 1,385 β 1.66% β 14.55% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 1979 β 1,358 β 1.63% β 16.17% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 1988 β 1,233 β 1.48% β 17.65% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 125 β’ Missing values: 0 (0.00%) β’ Most frequent: '1977' (1,687 times) β’ Least frequent: '1988' (1,233 times) β’ Minimum value: 1900 β’ Maximum value: 2024 COLUMN: year_reno ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 0 β 83,476 β 99.99% β 99.99% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 2022 β 2 β 0.00% β 99.99% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 2024 β 1 β 0.00% β 100.00% β 
ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 2017 β 1 β 0.00% β 100.00% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 2020 β 1 β 0.00% β 100.00% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 2009 β 1 β 0.00% β 100.00% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 2023 β 1 β 0.00% β 100.00% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 7 β’ Missing values: 0 (0.00%) β’ Most frequent: '0' (83,476 times) β’ Least frequent: '2023' (1 times) β’ Minimum value: 0 β’ Maximum value: 2024 β’ Low cardinality (good for categorical analysis) COLUMN: sqft_lot ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 5000 β 1,249 β 1.50% β 1.50% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 6000 β 1,059 β 1.27% β 2.76% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 7200 β 918 β 1.10% β 3.86% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 4000 β 838 β 1.00% β 4.87% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 7500 β 497 β 0.60% β 5.46% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 9600 β 474 β 0.57% β 6.03% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 8400 β 472 β 0.57% β 6.60% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 4800 β 414 β 0.50% β 7.09% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β 
#9 β 4500 β 318 β 0.38% β 7.47% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 9000 β 306 β 0.37% β 7.84% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 19,966 β’ Missing values: 0 (0.00%) β’ Most frequent: '5000' (1,249 times) β’ Least frequent: '9000' (306 times) β’ Minimum value: 381 β’ Maximum value: 2076940 COLUMN: sqft ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 1800 β 546 β 0.65% β 0.65% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 1300 β 534 β 0.64% β 1.29% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 1200 β 488 β 0.58% β 1.88% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 1660 β 477 β 0.57% β 2.45% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 1440 β 476 β 0.57% β 3.02% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 1700 β 475 β 0.57% β 3.59% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 2020 β 473 β 0.57% β 4.16% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 1900 β 471 β 0.56% β 4.72% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 2000 β 471 β 0.56% β 5.28% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 1820 β 469 β 0.56% β 5.85% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 1,718 β’ Missing values: 0 (0.00%) β’ Most frequent: 
'1800' (546 times) β’ Least frequent: '1820' (469 times) β’ Minimum value: 200 β’ Maximum value: 13310 COLUMN: sqft_1 ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 1010 β 1,117 β 1.34% β 1.34% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 1200 β 1,082 β 1.30% β 2.63% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 1300 β 1,056 β 1.26% β 3.90% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 1250 β 1,042 β 1.25% β 5.15% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 1080 β 1,002 β 1.20% β 6.35% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 1060 β 992 β 1.19% β 7.54% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 1090 β 951 β 1.14% β 8.67% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 1040 β 915 β 1.10% β 9.77% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 1120 β 912 β 1.09% β 10.86% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 1180 β 904 β 1.08% β 11.95% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 1,226 β’ Missing values: 0 (0.00%) β’ Most frequent: '1010' (1,117 times) β’ Least frequent: '1180' (904 times) β’ Minimum value: 80 β’ Maximum value: 7760 COLUMN: sqft_fbsmt ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: 
-------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 0 β 48,555 β 58.16% β 58.16% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 500 β 869 β 1.04% β 59.20% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 600 β 800 β 0.96% β 60.16% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 400 β 779 β 0.93% β 61.09% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 700 β 689 β 0.83% β 61.92% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 800 β 673 β 0.81% β 62.73% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 1000 β 599 β 0.72% β 63.44% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 300 β 571 β 0.68% β 64.13% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 900 β 563 β 0.67% β 64.80% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 480 β 450 β 0.54% β 65.34% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 499 β’ Missing values: 0 (0.00%) β’ Most frequent: '0' (48,555 times) β’ Least frequent: '480' (450 times) β’ Minimum value: 0 β’ Maximum value: 5110 COLUMN: grade ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 7 β 34,397 β 41.20% β 41.20% β 
ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 8 β 25,950 β 31.08% β 72.29% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 9 β 10,566 β 12.66% β 84.94% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 6 β 6,984 β 8.37% β 93.31% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 10 β 3,600 β 4.31% β 97.62% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 11 β 1,030 β 1.23% β 98.85% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 5 β 693 β 0.83% β 99.68% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 12 β 213 β 0.26% β 99.94% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 4 β 31 β 0.04% β 99.98% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 13 β 14 β 0.02% β 99.99% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 13 β’ Missing values: 0 (0.00%) β’ Most frequent: '7' (34,397 times) β’ Least frequent: '13' (14 times) β’ Minimum value: 1 β’ Maximum value: 13 COLUMN: fbsmt_grade ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 0 β 48,555 β 58.16% β 58.16% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 7 β 14,941 β 17.90% β 76.06% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 6 β 8,234 β 9.86% β 85.92% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 8 β 7,416 β 8.88% β 94.80% β 
ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 9 β 2,193 β 2.63% β 97.43% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 5 β 1,267 β 1.52% β 98.95% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 10 β 529 β 0.63% β 99.58% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 11 β 156 β 0.19% β 99.77% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 4 β 144 β 0.17% β 99.94% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 12 β 22 β 0.03% β 99.97% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 13 β’ Missing values: 0 (0.00%) β’ Most frequent: '0' (48,555 times) β’ Least frequent: '12' (22 times) β’ Minimum value: 0 β’ Maximum value: 13 COLUMN: condition ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 3 β 49,235 β 58.98% β 58.98% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 4 β 25,571 β 30.63% β 89.61% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 5 β 8,373 β 10.03% β 99.64% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 2 β 269 β 0.32% β 99.96% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 1 β 35 β 0.04% β 100.00% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 5 β’ Missing values: 0 (0.00%) β’ Most frequent: '3' (49,235 times) β’ Least frequent: '1' (35 times) β’ 
Minimum value: 1
• Maximum value: 5
• Low cardinality (good for categorical analysis)

COLUMN: stories
Data Type: float64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 1 | 40,955 | 49.06% | 49.06% |
#2 | 2 | 33,146 | 39.70% | 88.76% |
#3 | 1.5 | 6,701 | 8.03% | 96.79% |
#4 | 3 | 2,156 | 2.58% | 99.37% |
#5 | 2.5 | 462 | 0.55% | 99.92% |
#6 | 4 | 45 | 0.05% | 99.98% |
#7 | 3.5 | 17 | 0.02% | 100.00% |
#8 | 4.5 | 1 | 0.00% | 100.00% |
STATISTICS:
• Unique values: 8
• Missing values: 0 (0.00%)
• Most frequent: '1.0' (40,955 times)
• Least frequent: '4.5' (1 times)
• Minimum value: 1.0
• Maximum value: 4.5
• Low cardinality (good for categorical analysis)

COLUMN: beds
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 3 | 37,521 | 44.94% | 44.94% |
#2 | 4 | 28,395 | 34.01% | 78.96% |
#3 | 2 | 9,122 | 10.93% | 89.88% |
#4 | 5 | 6,869 | 8.23% | 98.11% |
#5 | 6 | 854 | 1.02% | 99.14% |
#6 | 1 | 548 | 0.66% | 99.79% |
#7 | 7 | 96 | 0.11% | 99.91% |
#8 | 8 | 36 | 0.04% | 99.95% |
#9 | 0 | 27 | 0.03% | 99.98% |
#10 | 9 | 7 | 0.01% | 99.99% |
STATISTICS:
• Unique values: 13
• Missing values: 0 (0.00%)
• Most frequent: '3' (37,521 times)
• Least frequent: '9' (7 times)
• Minimum value: 0
• Maximum value: 13

COLUMN: bath_full
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 1 | 43,834 | 52.51% | 52.51% |
#2 | 2 | 33,585 | 40.23% | 92.74% |
#3 | 3 | 4,743 | 5.68% | 98.42% |
#4 | 0 | 819 | 0.98% | 99.40% |
#5 | 4 | 471 | 0.56% | 99.96% |
#6 | 5 | 28 | 0.03% | 100.00% |
#7 | 6 | 2 | 0.00% | 100.00% |
#8 | 9 | 1 | 0.00% | 100.00% |
STATISTICS:
• Unique values: 8
• Missing values: 0 (0.00%)
• Most frequent: '1' (43,834 times)
• Least frequent: '9' (1 times)
• Minimum value: 0
• Maximum value: 9
• Low cardinality (good for categorical analysis)

COLUMN: bath_3qtr
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 45,360 | 54.33% | 54.33% |
#2 | 1 | 32,143 | 38.50% | 92.84% |
#3 | 2 | 5,655 | 6.77% | 99.61% |
#4 | 3 | 294 | 0.35% | 99.96% |
#5 | 4 | 24 | 0.03% | 99.99% |
#6 | 5 | 7 | 0.01% | 100.00% |
STATISTICS:
• Unique values: 6
• Missing values: 0 (0.00%)
• Most frequent: '0' (45,360 times)
• Least frequent: '5' (7 times)
• Minimum value: 0
• Maximum value: 5
• Low cardinality (good for categorical analysis)

COLUMN: bath_half
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 44,455 | 53.25% | 53.25% |
#2 | 1 | 38,151 | 45.70% | 98.95% |
#3 | 2 | 847 | 1.01% | 99.96% |
#4 | 3 | 25 | 0.03% | 99.99% |
#5 | 5 | 3 | 0.00% | 100.00% |
#6 | 4 | 1 | 0.00% | 100.00% |
#7 | 12 | 1 | 0.00% | 100.00% |
STATISTICS:
• Unique values: 7
• Missing values: 0 (0.00%)
• Most frequent: '0' (44,455 times)
• Least frequent: '12' (1 times)
• Minimum value: 0
• Maximum value: 12
• Low cardinality (good for categorical analysis)

COLUMN: garb_sqft
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 64,516 | 77.28% | 77.28% |
#2 | 200 | 1,107 | 1.33% | 78.61% |
#3 | 240 | 1,039 | 1.24% | 79.85% |
#4 | 290 | 778 | 0.93% | 80.78% |
#5 | 220 | 742 | 0.89% | 81.67% |
#6 | 260 | 728 | 0.87% | 82.54% |
#7 | 480 | 646 | 0.77% | 83.32% |
#8 | 310 | 554 | 0.66% | 83.98% |
#9 | 300 | 499 | 0.60% | 84.58% |
#10 | 180 | 491 | 0.59% | 85.17% |
STATISTICS:
• Unique values: 248
• Missing values: 0 (0.00%)
• Most frequent: '0' (64,516 times)
• Least frequent: '180' (491 times)
• Minimum value: 0
• Maximum value: 4000

COLUMN: gara_sqft
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 37,014 | 44.34% | 44.34% |
#2 | 440 | 3,486 | 4.18% | 48.51% |
#3 | 480 | 3,034 | 3.63% | 52.15% |
#4 | 400 | 2,240 | 2.68% | 54.83% |
#5 | 460 | 2,239 | 2.68% | 57.51% |
#6 | 420 | 2,142 | 2.57% | 60.08% |
#7 | 530 | 1,383 | 1.66% | 61.73% |
#8 | 500 | 1,324 | 1.59% | 63.32% |
#9 | 510 | 900 | 1.08% | 64.40% |
#10 | 550 | 882 | 1.06% | 65.46% |
STATISTICS:
• Unique values: 522
• Missing values: 0 (0.00%)
• Most frequent: '0' (37,014 times)
• Least frequent: '550' (882 times)
• Minimum value: 0
• Maximum value: 4404

COLUMN: wfnt
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,598 | 98.94% | 98.94% |
#2 | 8 | 325 | 0.39% | 99.33% |
#3 | 9 | 187 | 0.22% | 99.55% |
#4 | 3 | 173 | 0.21% | 99.76% |
#5 | 6 | 124 | 0.15% | 99.91% |
#6 | 7 | 62 | 0.07% | 99.98% |
#7 | 1 | 8 | 0.01% | 99.99% |
#8 | 5 | 6 | 0.01% | 100.00% |
STATISTICS:
• Unique values: 8
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,598 times)
• Least frequent: '5' (6 times)
• Minimum value: 0
• Maximum value: 9
• Low cardinality (good for categorical analysis)

COLUMN: golf
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 83,062 | 99.50% | 99.50% |
#2 | 1 | 421 | 0.50% | 100.00% |
STATISTICS:
• Unique values: 2
• Missing values: 0 (0.00%)
• Most frequent: '0' (83,062 times)
• Least frequent: '1' (421 times)
• Minimum value: 0
• Maximum value: 1
• Low cardinality (good for categorical analysis)

COLUMN: greenbelt
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,087 | 97.13% | 97.13% |
#2 | 1 | 2,396 | 2.87% | 100.00% |
STATISTICS:
• Unique values: 2
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,087 times)
• Least frequent: '1' (2,396 times)
• Minimum value: 0
• Maximum value: 1
• Low cardinality (good for categorical analysis)

COLUMN: noise_traffic
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 72,472 | 86.81% | 86.81% |
#2 | 1 | 6,704 | 8.03% | 94.84% |
#3 | 2 | 3,690 | 4.42% | 99.26% |
#4 | 3 | 617 | 0.74% | 100.00% |
STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (72,472 times)
• Least frequent: '3' (617 times)
• Minimum value: 0
• Maximum value: 3
• Low cardinality (good for categorical analysis)

COLUMN: view_rainier
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,905 | 99.31% | 99.31% |
#2 | 2 | 300 | 0.36% | 99.67% |
#3 | 3 | 241 | 0.29% | 99.96% |
#4 | 4 | 37 | 0.04% | 100.00% |
STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,905 times)
• Least frequent: '4' (37 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_olympics
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,768 | 97.95% | 97.95% |
#2 | 2 | 1,038 | 1.24% | 99.19% |
#3 | 3 | 460 | 0.55% | 99.74% |
#4 | 4 | 217 | 0.26% | 100.00% |
STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,768 times)
• Least frequent: '4' (217 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_cascades
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,535 | 97.67% | 97.67% |
#2 | 2 | 1,324 | 1.59% | 99.25% |
#3 | 3 | 508 | 0.61% | 99.86% |
#4 | 4 | 116 | 0.14% | 100.00% |
STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,535 times)
• Least frequent: '4' (116 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_territorial
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 76,605 | 91.76% | 91.76% |
#2 | 2 | 4,335 | 5.19% | 96.95% |
#3 | 3 | 1,867 | 2.24% | 99.19% |
#4 | 4 | 676 | 0.81% | 100.00% |
STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (76,605 times)
• Least frequent: '4' (676 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_skyline
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,962 | 99.38% | 99.38% |
#2 | 2 | 319 | 0.38% | 99.76% |
#3 | 3 | 121 | 0.14% | 99.90% |
#4 | 4 | 81 | 0.10% | 100.00% |
STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,962 times)
• Least frequent: '4' (81 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_sound
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,420 | 97.53% | 97.53% |
#2 | 1 | 751 | 0.90% | 98.43% |
#3 | 2 | 573 | 0.69% | 99.11% |
#4 | 3 | 464 | 0.56% | 99.67% |
#5 | 4 | 275 | 0.33% | 100.00% |
STATISTICS:
• Unique values: 5
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,420 times)
• Least frequent: '4' (275 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_lakewash
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,665 | 97.82% | 97.82% |
#2 | 1 | 723 | 0.87% | 98.69% |
#3 | 2 | 547 | 0.66% | 99.34% |
#4 | 3 | 352 | 0.42% | 99.77% |
#5 | 4 | 196 | 0.23% | 100.00% |
STATISTICS:
• Unique values: 5
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,665 times)
• Least frequent: '4' (196 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_lakesamm
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,999 | 99.42% | 99.42% |
#2 | 2 | 151 | 0.18% | 99.60% |
#3 | 1 | 143 | 0.17% | 99.77% |
#4 | 3 | 105 | 0.13% | 99.90% |
#5 | 4 | 85 | 0.10% | 100.00% |
STATISTICS:
• Unique values: 5
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,999 times)
• Least frequent: '4' (85 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_otherwater
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,858 | 99.25% | 99.25% |
#2 | 2 | 302 | 0.36% | 99.61% |
#3 | 3 | 167 | 0.20% | 99.81% |
#4 | 4 | 156 | 0.19% | 100.00% |
STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,858 times)
• Least frequent: '4' (156 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_other
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 83,044 | 99.47% | 99.47% |
#2 | 2 | 324 | 0.39% | 99.86% |
#3 | 3 | 95 | 0.11% | 99.98% |
#4 | 4 | 20 | 0.02% | 100.00% |
STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (83,044 times)
• Least frequent: '4' (20 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: submarket
Data Type: object
Total Rows: 83,483 | Non-Missing: 82,682 | Missing: 801
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | K | 9,038 | 10.83% | 10.83% |
#2 | I | 7,345 | 8.80% | 19.62% |
#3 | B | 6,733 | 8.07% | 27.69% |
#4 | R | 6,659 | 7.98% | 35.67% |
#5 | Q | 6,071 | 7.27% | 42.94% |
#6 | O | 5,268 | 6.31% | 49.25% |
#7 | F | 4,760 | 5.70% | 54.95% |
#8 | D | 4,732 | 5.67% | 60.62% |
#9 | C | 4,392 | 5.26% | 65.88% |
#10 | L | 4,212 | 5.05% | 70.92% |
STATISTICS:
• Unique values: 19
• Missing values: 801 (0.96%)
• Most frequent: 'K' (9,038 times)
• Least frequent: 'L' (4,212 times)

COLUMN: adjusted_sale_price
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 458781 | 64 | 0.08% | 0.08% |
#2 | 607575 | 57 | 0.07% | 0.14% |
#3 | 561734 | 57 | 0.07% | 0.21% |
#4 | 533178 | 55 | 0.07% | 0.28% |
#5 | 601208 | 54 | 0.06% | 0.34% |
#6 | 509330 | 53 | 0.06% | 0.41% |
#7 | 557977 | 52 | 0.06% | 0.47% |
#8 | 608383 | 52 | 0.06% | 0.53% |
#9 | 483580 | 52 | 0.06% | 0.59% |
#10 | 741490 | 51 | 0.06% | 0.66% |
STATISTICS:
• Unique values: 33,677
• Missing values: 0 (0.00%)
• Most frequent: '458781' (64 times)
• Least frequent: '741490' (51 times)
• Minimum value: 87189
• Maximum value: 7315704
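The per-column reports above list the most common values with their share of all rows and a running total. The source of `column_analysis` is not shown in this section, so the following is only a minimal sketch of how such a table could be computed with the standard library. `value_distribution` and its parameters are hypothetical names, not the notebook's helper, and the input is toy data.

```python
from collections import Counter


def value_distribution(values, top_n=10):
    """Return (rank, value, count, pct, cumulative_pct) rows for the top_n values.

    Hypothetical stand-in for the notebook's column_analysis helper.
    Percentages are relative to all rows, including missing ones (None),
    which matches the submarket report (9,038 / 83,483 = 10.83%).
    """
    total = len(values)
    counts = Counter(v for v in values if v is not None)
    rows = []
    cumulative = 0
    for rank, (value, count) in enumerate(counts.most_common(top_n), start=1):
        cumulative += count
        rows.append((rank, value, count,
                     round(100 * count / total, 2),
                     round(100 * cumulative / total, 2)))
    return rows


# Toy data, not taken from the King County file
golf = [0] * 199 + [1]
for row in value_distribution(golf):
    print(row)
```

One detail worth noting when reading the reports: the "Least frequent" line matches rank #10 of the displayed table even when more values exist (for `beds`, 13 unique values are reported, yet the least frequent value is given as '9' with 7 rows). It therefore appears to describe the displayed top ten rather than the whole column.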
sale_warning_codes = {
1: "Personal Property Included",
2: "1031 Trade",
3: "Contract Or Cash Sale",
4: "Presale",
5: "Full Sales Price Not Reported",
6: "Refund",
7: "Questionable Per Sales Identification",
8: "Questionable Per Appraisal",
9: "Questionable Per Mainframe System (Obsolete Code)",
10: "Tear Down",
11: "Corporate Affiliates",
12: "Estate Administrator, Guardian, Or Executor",
13: "Bankruptcy - Receiver Or Trustee",
14: "Sheriff / Tax Sale",
15: "No Market Exposure",
16: "Government Agency",
17: "Non-Profit Organization",
18: "Quit Claim Deed",
19: "Seller'S Or Purchaser'S Assignment",
20: "Correction Deed",
21: "Trade",
22: "Partial Interest (1/3, 1/2, Etc.)",
23: "Forced Sale",
24: "Easement Or Right-Of-Way",
25: "Fulfillment Of Contract Deed",
26: "Imp. Characteristics Changed Since Sale",
27: "Timber And Forest Land",
28: "New Plat (With Less Than 20% Sold)",
29: "Segregation And/Or Merger",
30: "Historic Property",
31: "Exempt From Excise Tax",
32: "$1,000 Sale Or Less",
33: "Lease Or Lease-Hold",
34: "Change Of Use",
35: "Open Space Designation Continued/Ok'D After Sale",
36: "Plottage",
37: "Securing Of Debt",
38: "Divorce",
39: "Assumption Of Mortgage W/No Addl Consideration Pd",
40: "Relocation - Sale To Service",
41: "Relocation - Sale By Service",
42: "Development Rights To Cnty,Cty,Or Prvt Developer",
43: "Development Rights Parcel To Prvt Sector",
44: "Tenant",
45: "Multi-Parcel Sale",
46: "Non-Representative Sale",
47: "Non-Conventional Heating System",
48: "Condo With Garage, Moorage, Or Storage",
49: "Mobile Home",
50: "Condo Wholesale",
51: "Related Party, Friend, Or Neighbor",
52: "Statement To Dor",
53: "Residual Sales",
54: "Affordable Housing Sales",
55: "Shell",
56: "Builder Or Developer Sales",
57: "Selling Or Buying Costs Affecting Sale Price",
58: "Preliminary Shortplat Approval",
59: "Bulk Portfolio Sale",
60: "Short Sale",
61: "Financial Institution Resale",
62: "Auction Sale"
}
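The lookup above maps each assessor warning code to its label. How the raw `sale_warning` field is encoded is not shown here. The sketch below assumes it is a whitespace-separated string of code numbers (blank meaning no warning) and shows one way such a field could be turned into `sale_warning_<code>` indicator values. `warning_indicators` is a hypothetical helper, not the notebook's `extract_sale_warning_codes`.

```python
def warning_indicators(raw, known_codes):
    """Map a raw warning string to {"sale_warning_<code>": 0 or 1}.

    Assumption: `raw` holds whitespace-separated integer codes, e.g. "15 26";
    an empty, blank, or None value means no warning was raised.
    """
    present = {int(tok) for tok in (raw or "").split() if tok.isdigit()}
    return {f"sale_warning_{code}": int(code in present) for code in sorted(known_codes)}


# Small subset of the lookup above, used only for the demonstration
codes = {
    15: "No Market Exposure",
    26: "Imp. Characteristics Changed Since Sale",
    60: "Short Sale",
}

print(warning_indicators("15 26", codes))
print(warning_indicators("", codes))
```

Emitting one 0/1 column per code keeps the features directly usable by tree models and by univariate selection, at the cost of many sparse columns for rarely raised codes.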
from sklearn.feature_selection import SelectKBest, f_regression

# Select the most predictive warning features to reduce dimensionality.
# Derive the sale_warning_<code> columns from the raw sale_warning field.
train_data = extract_sale_warning_codes(train_data)
warning_features = [col for col in train_data.columns if col.startswith('sale_warning_')]

# Keep the 20 warning features with the highest univariate F-score against the adjusted price.
selector = SelectKBest(f_regression, k=20)
X_warnings_selected = selector.fit_transform(train_data[warning_features], train_data['adjusted_sale_price'])
selected_warnings = list(selector.get_feature_names_out())
selected_warnings
['sale_warning_3', 'sale_warning_4', 'sale_warning_10', 'sale_warning_15', 'sale_warning_16', 'sale_warning_24', 'sale_warning_26', 'sale_warning_28', 'sale_warning_29', 'sale_warning_30', 'sale_warning_34', 'sale_warning_35', 'sale_warning_36', 'sale_warning_41', 'sale_warning_44', 'sale_warning_54', 'sale_warning_56', 'sale_warning_57', 'sale_warning_58', 'sale_warning_60']
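`SelectKBest(f_regression, k=20)` scores each warning column on its own against the target and keeps the 20 highest scores. For a single feature, the F-statistic used by `f_regression` is F = r² / (1 − r²) · (n − 2), where r is the Pearson correlation between the feature and the target. The sketch below reproduces that ranking in plain Python on made-up numbers. `f_score` and `top_k` are illustrative names, the flag names are invented, and returning 0 for a constant column is a choice of this sketch rather than scikit-learn's behaviour.

```python
from statistics import mean


def f_score(x, y):
    """Univariate F-statistic for a simple linear fit of y on x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as "no signal" in this sketch
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    if r2 >= 1.0:
        return float("inf")
    return r2 / (1.0 - r2) * (n - 2)


def top_k(features, target, k):
    """Return the names of the k features with the largest F-score."""
    ranked = sorted(features, key=lambda name: f_score(features[name], target), reverse=True)
    return ranked[:k]


# Made-up example: prices in thousands and three binary warning flags
price = [300, 320, 310, 900, 950, 880, 305, 910]
flags = {
    "sale_warning_a": [0, 0, 0, 1, 1, 1, 0, 1],  # tracks price closely
    "sale_warning_b": [0, 1, 0, 1, 0, 1, 0, 1],  # weakly related
    "sale_warning_c": [0, 0, 0, 0, 0, 0, 0, 0],  # never raised
}
print(top_k(flags, price, k=2))
```

Because each column is scored in isolation, this kind of selection cannot see interactions between warning codes or redundancy among them; it is a cheap first filter rather than a measure of a feature's contribution inside the final model.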
Geo Data
column_analysis(train_data[geo_data])
DETAILED COLUMN ANALYSIS - ALL COLUMNS

COLUMN: latitude
Data Type: float64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 47.6853 | 56 | 0.07% | 0.07% |
#2 | 47.6882 | 55 | 0.07% | 0.13% |
#3 | 47.6911 | 54 | 0.06% | 0.20% |
#4 | 47.6721 | 54 | 0.06% | 0.26% |
#5 | 47.6727 | 51 | 0.06% | 0.32% |
#6 | 47.6901 | 51 | 0.06% | 0.38% |
#7 | 47.6842 | 47 | 0.06% | 0.44% |
#8 | 47.5671 | 47 | 0.06% | 0.50% |
#9 | 47.5517 | 45 | 0.05% | 0.55% |
#10 | 47.6919 | 45 | 0.05% | 0.60% |
STATISTICS:
• Unique values: 5,589
• Missing values: 0 (0.00%)
• Most frequent: '47.6853' (56 times)
• Least frequent: '47.6919' (45 times)
• Minimum value: 47.1552
• Maximum value: 47.7778

COLUMN: longitude
Data Type: float64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | -122.351 | 75 | 0.09% | 0.09% |
#2 | -122.349 | 63 | 0.08% | 0.17% |
#3 | -122.362 | 62 | 0.07% | 0.24% |
#4 | -122.288 | 61 | 0.07% | 0.31% |
#5 | -122.308 | 59 | 0.07% | 0.38% |
#6 | -122.29 | 59 | 0.07% | 0.45% |
#7 | -122.363 | 57 | 0.07% | 0.52% |
#8 | -122.314 | 57 | 0.07% | 0.59% |
#9 | -122.3 | 55 | 0.07% | 0.66% |
#10 | -122.356 | 55 | 0.07% | 0.72% |
STATISTICS:
• Unique values: 6,332
• Missing values: 0 (0.00%)
• Most frequent: '-122.3509' (75 times)
• Least frequent: '-122.356' (55 times)
• Minimum value: -122.5272
• Maximum value: -121.1616

COLUMN: area
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 69 | 2,197 | 2.63% | 2.63% |
#2 | 35 | 2,027 | 2.43% | 5.06% |
#3 | 32 | 1,565 | 1.87% | 6.93% |
#4 | 37 | 1,550 | 1.86% | 8.79% |
#5 | 72 | 1,543 | 1.85% | 10.64% |
#6 | 56 | 1,509 | 1.81% | 12.45% |
#7 | 6 | 1,506 | 1.80% | 14.25% |
#8 | 53 | 1,500 | 1.80% | 16.05% |
#9 | 93 | 1,457 | 1.75% | 17.79% |
#10 | 73 | 1,388 | 1.66% | 19.46% |
STATISTICS:
• Unique values: 89
• Missing values: 0 (0.00%)
• Most frequent: '69' (2,197 times)
• Least frequent: '73' (1,388 times)
• Minimum value: 1
• Maximum value: 100

COLUMN: city
Data Type: object
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | SEATTLE | 23,874 | 28.60% | 28.60% |
#2 | KING COUNTY | 10,996 | 13.17% | 41.77% |
#3 | BELLEVUE | 4,891 | 5.86% | 47.63% |
#4 | KENT | 4,256 | 5.10% | 52.73% |
#5 | SAMMAMISH | 4,186 | 5.01% | 57.74% |
#6 | RENTON | 4,055 | 4.86% | 62.60% |
#7 | FEDERAL WAY | 3,596 | 4.31% | 66.90% |
#8 | KIRKLAND | 3,519 | 4.22% | 71.12% |
#9 | SHORELINE | 2,480 | 2.97% | 74.09% |
#10 | AUBURN | 2,444 | 2.93% | 77.02% |
STATISTICS:
• Unique values: 40
• Missing values: 0 (0.00%)
• Most frequent: 'SEATTLE' (23,874 times)
• Least frequent: 'AUBURN' (2,444 times)

COLUMN: submarket
Data Type: object
Total Rows: 83,483 | Non-Missing: 82,682 | Missing: 801
VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | K | 9,038 | 10.83% | 10.83% |
#2 | I | 7,345 | 8.80% | 19.62% |
#3 | B | 6,733 | 8.07% | 27.69% |
#4 | R | 6,659 | 7.98% | 35.67% |
ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β Q β 6,071 β 7.27% β 42.94% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β O β 5,268 β 6.31% β 49.25% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β F β 4,760 β 5.70% β 54.95% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β D β 4,732 β 5.67% β 60.62% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β C β 4,392 β 5.26% β 65.88% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β L β 4,212 β 5.05% β 70.92% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 19 β’ Missing values: 801 (0.96%) β’ Most frequent: 'K' (9,038 times) β’ Least frequent: 'L' (4,212 times)
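One caveat when reading the statistics above: the "Least frequent" line appears to report the last row of the top-10 table, not the rarest value in the column. For submarket it names 'L' (4,212 rows), while the full value counts later in this section show 'H' with only 436. A minimal sketch of deriving both extremes from the full counts follows. The helper name `frequency_extremes` and the toy data are hypothetical and not part of the notebook; with pandas, `Series.value_counts()` followed by `idxmax()` and `idxmin()` would likely serve the same purpose.

```python
from collections import Counter


def frequency_extremes(values):
    """Return ((most_value, count), (least_value, count)) over all non-missing values."""
    counts = Counter(v for v in values if v is not None)
    ranked = counts.most_common()  # every distinct value, sorted by count descending
    return ranked[0], ranked[-1]


# Toy data (hypothetical), standing in for a categorical column
sample = ["K"] * 5 + ["I"] * 3 + ["L"] * 2 + ["H"]
most, least = frequency_extremes(sample)
print("most frequent:", most)
print("least frequent:", least)
```

Ranking over every distinct value, not just the displayed top 10, is what makes the "least frequent" figure meaningful for high-cardinality columns.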
train_data[geo_data].head(10)
| | latitude | longitude | area | city | submarket |
---|---|---|---|---|---|
1 | 47.6531 | -122.1996 | 74 | KIRKLAND | Q |
2 | 47.4733 | -122.1901 | 30 | RENTON | K |
3 | 47.4739 | -122.3295 | 96 | BURIEN | G |
7 | 47.3090 | -122.3490 | 54 | FEDERAL WAY | I |
8 | 47.4955 | -122.3565 | 96 | BURIEN | G |
11 | 47.4216 | -122.1525 | 60 | KING COUNTY | K |
14 | 47.6882 | -122.3361 | 43 | SEATTLE | B |
15 | 47.2821 | -122.3532 | 54 | FEDERAL WAY | I |
17 | 47.3199 | -122.2139 | 28 | AUBURN | L |
19 | 47.3604 | -122.0809 | 86 | COVINGTON | K |
# Visually confirm that all latitude and longitude points fall within King County, WA
import matplotlib.pyplot as plt
import osmnx as ox

def plot_points_with_king_county_boundary(lat_col, lon_col):
    """
    Plot latitude and longitude points over the King County boundary from OpenStreetMap.

    Parameters:
        lat_col (pd.Series): Latitude values.
        lon_col (pd.Series): Longitude values.
    """
    fig, ax = plt.subplots(figsize=(10, 8))
    # Plot property points
    ax.scatter(lon_col, lat_col, c='blue', label='Property Points', s=5, alpha=0.6, zorder=3)
    # Fetch the county boundary polygon from OpenStreetMap (requires network access)
    king_county = ox.geocode_to_gdf("King County, Washington, USA")
    king_county.plot(ax=ax, color='none', edgecolor='red', linewidth=2, zorder=2, label='King County Boundary')
    # Adjust plot limits to the extent of the data, with a small margin
    ax.set_xlim(lon_col.min() - 0.05, lon_col.max() + 0.05)
    ax.set_ylim(lat_col.min() - 0.05, lat_col.max() + 0.05)
    # Labeling
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    ax.set_title('Latitude and Longitude Points with King County Boundary')
    ax.legend()
    ax.grid(True, linestyle='--', alpha=0.5)
    plt.show()

plot_points_with_king_county_boundary(train_data.latitude, train_data.longitude)
len(train_data[train_data.longitude > -121.7])
76
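The plot gives a visual check, and the count above isolates the far-eastern points. A numeric containment test can complement both. The sketch below is a dependency-free ray-casting point-in-polygon test, with a hypothetical coarse rectangle standing in for the county boundary. The rectangle coordinates and the function name are assumptions for illustration only. With the real boundary, the polygon returned by `ox.geocode_to_gdf` could be tested with shapely's `contains` instead.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices; the last vertex connects back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed by a horizontal ray
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


# Hypothetical coarse rectangle, NOT the real King County boundary
rough_box = [(-122.55, 47.05), (-121.05, 47.05), (-121.05, 47.80), (-122.55, 47.80)]

# Two coordinates taken from the sample rows above, plus one made-up outlier
points = [(-122.1996, 47.6531), (-122.3490, 47.3090), (-120.0, 47.5)]
outside = [p for p in points if not point_in_polygon(p[0], p[1], rough_box)]
print(outside)
```

An empty `outside` list for the real boundary would confirm numerically what the map suggests.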
train_data.city.sort_values().unique()
array(['ALGONA', 'AUBURN', 'BEAUX ARTS', 'BELLEVUE', 'BLACK DIAMOND', 'BOTHELL', 'BURIEN', 'CARNATION', 'CLYDE HILL', 'COVINGTON', 'DES MOINES', 'DUVALL', 'ENUMCLAW', 'FEDERAL WAY', 'HUNTS POINT', 'ISSAQUAH', 'KENMORE', 'KENT', 'KING COUNTY', 'KIRKLAND', 'LAKE FOREST PARK', 'MAPLE VALLEY', 'MEDINA', 'MERCER ISLAND', 'MILTON', 'NEWCASTLE', 'NORMANDY PARK', 'NORTH BEND', 'PACIFIC', 'REDMOND', 'RENTON', 'SAMMAMISH', 'SEATTLE', 'SHORELINE', 'SKYKOMISH', 'SNOQUALMIE', 'SeaTac', 'TUKWILA', 'WOODINVILLE', 'YARROW POINT'], dtype=object)
# Clean up value for Sea-Tac
train_data['city'] = train_data.city.replace('SeaTac', 'SEA-TAC')
train_data.city.sort_values().unique()
array(['ALGONA', 'AUBURN', 'BEAUX ARTS', 'BELLEVUE', 'BLACK DIAMOND', 'BOTHELL', 'BURIEN', 'CARNATION', 'CLYDE HILL', 'COVINGTON', 'DES MOINES', 'DUVALL', 'ENUMCLAW', 'FEDERAL WAY', 'HUNTS POINT', 'ISSAQUAH', 'KENMORE', 'KENT', 'KING COUNTY', 'KIRKLAND', 'LAKE FOREST PARK', 'MAPLE VALLEY', 'MEDINA', 'MERCER ISLAND', 'MILTON', 'NEWCASTLE', 'NORMANDY PARK', 'NORTH BEND', 'PACIFIC', 'REDMOND', 'RENTON', 'SAMMAMISH', 'SEA-TAC', 'SEATTLE', 'SHORELINE', 'SKYKOMISH', 'SNOQUALMIE', 'TUKWILA', 'WOODINVILLE', 'YARROW POINT'], dtype=object)
train_data.submarket.sort_values().unique()
array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', nan], dtype=object)
train_data['submarket'] = train_data.submarket.fillna('Unknown')
train_data.submarket.value_counts()
submarket
K          9038
I          7345
B          6733
R          6659
Q          6071
O          5268
F          4760
D          4732
C          4392
L          4212
M          3905
A          3616
N          3531
P          2912
E          2839
G          2799
J          2310
S          1124
Unknown     801
H           436
Name: count, dtype: int64
Legal Data
column_analysis(train_data[legal_data])
DETAILED COLUMN ANALYSIS - ALL COLUMNS

COLUMN: zoning
Data Type: object
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0

VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | NR3 | 12,795 | 15.33% | 15.33% |
#2 | R6 | 7,575 | 9.07% | 24.40% |
#3 | R4 | 4,774 | 5.72% | 30.12% |
#4 | RA5 | 3,778 | 4.53% | 34.64% |
#5 | NR2 | 3,498 | 4.19% | 38.83% |
#6 | SR-6 | 3,063 | 3.67% | 42.50% |
#7 | R-5 | 2,583 | 3.09% | 45.60% |
#8 | R-6 | 2,521 | 3.02% | 48.62% |
#9 | RS7.2 | 2,445 | 2.93% | 51.55% |
#10 | NR | 2,311 | 2.77% | 54.31% |

STATISTICS:
• Unique values: 348
• Missing values: 0 (0.00%)
• Most frequent: 'NR3' (12,795 times)
• Least frequent: 'NR' (2,311 times)

COLUMN: subdivision
Data Type: object
Total Rows: 83,483 | Non-Missing: 76,229 | Missing: 7,254

VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | MAPLE LEAF TO GREEN LAKE CI... | 312 | 0.37% | 0.37% |
#2 | GILMAN PARK ADD BLKS 01 THR... | 288 | 0.34% | 0.72% |
#3 | HOMECROFT ADD | 194 | 0.23% | 0.95% |
#4 | CHEROKEE BAY PARK ASSESSORS... | 186 | 0.22% | 1.17% |
#5 | GILMANS ADD BLKS 01 THRU 87 | 180 | 0.22% | 1.39% |
#6 | SEA VIEW PARK | 171 | 0.20% | 1.59% |
#7 | MC MICKEN HEIGHTS DIV NO. 02 | 154 | 0.18% | 1.78% |
#8 | SALMON BAY PARK ADD | 151 | 0.18% | 1.96% |
#9 | STATE ADD TO SEATTLE NO. 04 | 139 | 0.17% | 2.13% |
#10 | SOUTH PARK | 120 | 0.14% | 2.27% |

STATISTICS:
• Unique values: 9,217
• Missing values: 7,254 (8.69%)
• Most frequent: 'MAPLE LEAF TO GREEN LAKE CIRCLE POR OF' (312 times)
• Least frequent: 'SOUTH PARK' (120 times)

COLUMN: present_use
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0

VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 2 | 76,334 | 91.44% | 91.44% |
#2 | 29 | 6,667 | 7.99% | 99.42% |
#3 | 6 | 482 | 0.58% | 100.00% |

STATISTICS:
• Unique values: 3
• Missing values: 0 (0.00%)
• Most frequent: '2' (76,334 times)
• Least frequent: '6' (482 times)
• Minimum value: 2
• Maximum value: 29
• Low cardinality (good for categorical analysis)
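The zoning field records each parcel's zoning designation, which regulates land use and development intensity. The 348 distinct codes follow several naming conventions (for example, R6, R-6, and NR3 all appear), so the groupings below are interpretations inferred from the observed patterns, not definitions taken from an official zoning code: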
Residential Zones
- R-[number] (e.g., R-1, R-4, R-6, R-8): Single-family residential with density restrictions
- R[number] (e.g., R4, R6, R8): Residential zones with varying density allowances
- RS-[number] (e.g., RS-7200, RS-9600): Residential suburban with minimum lot sizes
- RSA [number] (e.g., RSA 4, RSA 6): Residential suburban administrative zones
- RSX [number] (e.g., RSX 7.2): Residential suburban experimental zones
- RSL (M): Residential small lot with modifications
- RA[number] (e.g., RA2.5, RA5): Rural area residential with minimum acreage
- RA[number]P: Rural area residential with special provisions
- RA[number]SO: Rural area residential with special overlay
Neighborhood Residential
- NR, NR1, NR2, NR3: Neighborhood residential of varying densities
Low-Rise Residential
- LR1 (M), LR2 (M), LR3 (M): Low-rise residential with modifications
- LR3 RC (M): Low-rise residential with ridge and character modifications
Special Use Zones
- SF [number] (e.g., SF 5000, SF 7200): Single-family with minimum lot sizes
- SF-S, SF-SL: Single-family special designations
- SR-[number] (e.g., SR-4.5, SR-6): Single-family residential
- UVSF-1: Urban village single-family
- SFR [number]: Single-family residential with special provisions
Mixed Use and Commercial
- MU: Mixed use
- MR (M1): Mid-rise with modifications
- MML U/85: Mixed use low-rise
- MUR-[number] (e.g., MUR-45, MUR-70): Mixed use residential
- NC1P-55 (M), NC2P-55 (M), NC2-40: Neighborhood commercial
- C1-55 (M): Commercial with modifications
Industrial and Other
- LDR: Low-density residential
- HDR: High-density residential
- L-1, L-3: Light industrial
- UL-7200: Urban light industrial
- TC, TC A3: Town center
- O: Office
- PUD: Planned unit development
- UR: Urban reserve
- RM[number] (e.g., RM1800, RM-48): Residential multifamily
- A[number] (e.g., A10, A35): Agricultural
- R [number]d (e.g., R 5400d): Residential with density modifications
- NMF: Neighborhood multifamily
### Create zoning_categories to group similar zones
train_data['zoning_category'] = train_data['zoning'].apply(categorize_zoning)
train_data['zoning_category'].value_counts()
zoning_category
Residential Zones           39669
Neighborhood Residential    18862
Industrial and Other        12123
Low-Rise Residential         5946
Special Use Zones            5346
Mixed Use and Commercial     1537
Name: count, dtype: int64
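The `categorize_zoning` helper applied above is not defined in this section. As an illustration only, the sketch below shows one way such a mapping could be written with ordered prefix rules that follow the groupings listed earlier. The name `categorize_zoning_sketch`, the regular expressions, and the fallback to "Industrial and Other" are all assumptions, so it should not be expected to reproduce the counts shown above.

```python
import re

# Ordered (category, pattern) rules; the first match wins.
# The patterns are an illustrative guess derived from the groupings listed above.
ZONING_RULES = [
    ("Industrial and Other", re.compile(r"^R \d+D$")),  # e.g. "R 5400d"
    ("Neighborhood Residential", re.compile(r"^NR\d*$")),  # NR, NR1, NR2, NR3
    ("Low-Rise Residential", re.compile(r"^LR\d")),  # LR1 (M), LR3 RC (M)
    ("Special Use Zones", re.compile(r"^(SFR|SF|SR|UVSF)(?![A-Z])")),
    ("Mixed Use and Commercial", re.compile(r"^(MUR|MU|MR|MML|NC\d|C\d)")),
    ("Residential Zones", re.compile(r"^(RSL\b|(RSA|RSX|RS|RA|R)[- ]?\d)")),
]


def categorize_zoning_sketch(zoning):
    """Map a raw zoning code to one of six broad categories (hypothetical rules)."""
    code = str(zoning).strip().upper()
    for category, pattern in ZONING_RULES:
        if pattern.search(code):
            return category
    # Anything unmatched (TC, O, PUD, RM1800, A10, ...) falls into the catch-all group
    return "Industrial and Other"


for raw in ["NR3", "R6", "R-5", "SR-6", "LR1 (M)", "NC2-40", "RM1800"]:
    print(raw, "->", categorize_zoning_sketch(raw))
```

Ordering matters here: the narrow `R [number]d` rule is checked before the general residential rule, and the residential rule requires a digit after the prefix so that codes such as RM1800 are not swept into "Residential Zones". The function could be applied in the same way as the original, for example with `train_data['zoning'].apply(categorize_zoning_sketch)`.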
### Drop present_use and subdivision in favor of the consolidated zoning categories
legal_features_drop = ['present_use', 'subdivision']
Assessor Features
Data in the file is from the most recent assessment and is not necessarily reflective of the property at sale_date.
These features are therefore not used for training; they are reviewed for informational purposes only.
column_analysis(train_data[assessor_data])
DETAILED COLUMN ANALYSIS - ALL COLUMNS

COLUMN: land_val
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0

VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 207000 | 613 | 0.73% | 0.73% |
#2 | 216000 | 603 | 0.72% | 1.46% |
#3 | 0 | 524 | 0.63% | 2.08% |
#4 | 177000 | 497 | 0.60% | 2.68% |
#5 | 135000 | 461 | 0.55% | 3.23% |
#6 | 486000 | 457 | 0.55% | 3.78% |
#7 | 196000 | 454 | 0.54% | 4.32% |
#8 | 243000 | 445 | 0.53% | 4.86% |
#9 | 268000 | 429 | 0.51% | 5.37% |
#10 | 225000 | 423 | 0.51% | 5.88% |

STATISTICS:
• Unique values: 2,289
• Missing values: 0 (0.00%)
• Most frequent: '207000' (613 times)
• Least frequent: '225000' (423 times)
• Minimum value: 0
• Maximum value: 10037000

COLUMN: imp_val
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0

VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 1000 | 1,452 | 1.74% | 1.74% |
#2 | 0 | 543 | 0.65% | 2.39% |
#3 | 333000 | 228 | 0.27% | 2.66% |
#4 | 375000 | 224 | 0.27% | 2.93% |
#5 | 370000 | 224 | 0.27% | 3.20% |
#6 | 398000 | 222 | 0.27% | 3.47% |
#7 | 344000 | 222 | 0.27% | 3.73% |
#8 | 390000 | 212 | 0.25% | 3.99% |
#9 | 362000 | 210 | 0.25% | 4.24% |
#10 | 359000 | 210 | 0.25% | 4.49% |

STATISTICS:
• Unique values: 2,234
• Missing values: 0 (0.00%)
• Most frequent: '1000' (1,452 times)
• Least frequent: '359000' (210 times)
• Minimum value: 0
• Maximum value: 6653000

COLUMN: grade
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0

VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 7 | 34,397 | 41.20% | 41.20% |
#2 | 8 | 25,950 | 31.08% | 72.29% |
#3 | 9 | 10,566 | 12.66% | 84.94% |
#4 | 6 | 6,984 | 8.37% | 93.31% |
#5 | 10 | 3,600 | 4.31% | 97.62% |
#6 | 11 | 1,030 | 1.23% | 98.85% |
#7 | 5 | 693 | 0.83% | 99.68% |
#8 | 12 | 213 | 0.26% | 99.94% |
#9 | 4 | 31 | 0.04% | 99.98% |
#10 | 13 | 14 | 0.02% | 99.99% |

STATISTICS:
• Unique values: 13
• Missing values: 0 (0.00%)
• Most frequent: '7' (34,397 times)
• Least frequent: '13' (14 times)
• Minimum value: 1
• Maximum value: 13

COLUMN: fbsmt_grade
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0

VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 48,555 | 58.16% | 58.16% |
#2 | 7 | 14,941 | 17.90% | 76.06% |
#3 | 6 | 8,234 | 9.86% | 85.92% |
#4 | 8 | 7,416 | 8.88% | 94.80% |
#5 | 9 | 2,193 | 2.63% | 97.43% |
#6 | 5 | 1,267 | 1.52% | 98.95% |
#7 | 10 | 529 | 0.63% | 99.58% |
#8 | 11 | 156 | 0.19% | 99.77% |
#9 | 4 | 144 | 0.17% | 99.94% |
#10 | 12 | 22 | 0.03% | 99.97% |

STATISTICS:
• Unique values: 13
• Missing values: 0 (0.00%)
• Most frequent: '0' (48,555 times)
• Least frequent: '12' (22 times)
• Minimum value: 0
• Maximum value: 13

COLUMN: condition
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0

VALUE DISTRIBUTION:
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 3 | 49,235 | 58.98% | 58.98% |
#2 | 4 | 25,571 | 30.63% | 89.61% |
#3 | 5 | 8,373 | 10.03% | 99.64% |
#4 | 2 | 269 | 0.32% | 99.96% |
#5 | 1 | 35 | 0.04% | 100.00% |

STATISTICS:
• Unique values: 5
• Missing values: 0 (0.00%)
• Most frequent: '3' (49,235 times)
• Least frequent: '1' (35 times)
• Minimum value: 1
• Maximum value: 5
• Low cardinality (good for categorical analysis)
Property Characteristic Features
analyze_dataframe(train_data[property_data])
DATAFRAME ANALYSIS
Shape: (83483, 27)
Data types:
int64      26
float64     1
Name: count, dtype: int64
--- NUMERIC COLUMNS (27) ---
column_analysis(train_data[property_data])
========================================================================================== DETAILED COLUMN ANALYSIS - ALL COLUMNS ========================================================================================== COLUMN: year_built ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 1977 β 1,687 β 2.02% β 2.02% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 1978 β 1,674 β 2.01% β 4.03% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 1968 β 1,611 β 1.93% β 5.96% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 1990 β 1,510 β 1.81% β 7.76% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 1967 β 1,475 β 1.77% β 9.53% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 1989 β 1,416 β 1.70% β 11.23% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 1962 β 1,385 β 1.66% β 12.89% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #8 β 1987 β 1,385 β 1.66% β 14.55% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #9 β 1979 β 1,358 β 1.63% β 16.17% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #10 β 1988 β 1,233 β 1.48% β 17.65% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 125 β’ Missing values: 0 (0.00%) β’ Most frequent: '1977' (1,687 times) β’ Least frequent: '1988' (1,233 times) β’ Minimum value: 1900 β’ Maximum value: 2024 COLUMN: year_reno 
====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 0 β 83,476 β 99.99% β 99.99% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 2022 β 2 β 0.00% β 99.99% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #3 β 2024 β 1 β 0.00% β 100.00% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #4 β 2017 β 1 β 0.00% β 100.00% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #5 β 2020 β 1 β 0.00% β 100.00% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #6 β 2009 β 1 β 0.00% β 100.00% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #7 β 2023 β 1 β 0.00% β 100.00% β ββββββββββ§ββββββββββ§ββββββββββ§βββββββββββββββ§βββββββββββββββββ STATISTICS: ------------------------------ β’ Unique values: 7 β’ Missing values: 0 (0.00%) β’ Most frequent: '0' (83,476 times) β’ Least frequent: '2023' (1 times) β’ Minimum value: 0 β’ Maximum value: 2024 β’ Low cardinality (good for categorical analysis) COLUMN: sqft ====================================================================== Data Type: int64 Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0 VALUE DISTRIBUTION: -------------------------------------------------- ββββββββββ€ββββββββββ€ββββββββββ€βββββββββββββββ€βββββββββββββββββ β Rank β Value β Count β Percentage β Cumulative % β ββββββββββͺββββββββββͺββββββββββͺβββββββββββββββͺβββββββββββββββββ‘ β #1 β 1800 β 546 β 0.65% β 0.65% β ββββββββββΌββββββββββΌββββββββββΌβββββββββββββββΌβββββββββββββββββ€ β #2 β 1300 β 534 β 0.64% β 1.29% β 
Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#3 | 1200 | 488 | 0.58% | 1.88% |
#4 | 1660 | 477 | 0.57% | 2.45% |
#5 | 1440 | 476 | 0.57% | 3.02% |
#6 | 1700 | 475 | 0.57% | 3.59% |
#7 | 2020 | 473 | 0.57% | 4.16% |
#8 | 1900 | 471 | 0.56% | 4.72% |
#9 | 2000 | 471 | 0.56% | 5.28% |
#10 | 1820 | 469 | 0.56% | 5.85% |

STATISTICS:
• Unique values: 1,718
• Missing values: 0 (0.00%)
• Most frequent: '1800' (546 times)
• Least frequent: '1820' (469 times)
• Minimum value: 200
• Maximum value: 13310

COLUMN: sqft_lot
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 5000 | 1,249 | 1.50% | 1.50% |
#2 | 6000 | 1,059 | 1.27% | 2.76% |
#3 | 7200 | 918 | 1.10% | 3.86% |
#4 | 4000 | 838 | 1.00% | 4.87% |
#5 | 7500 | 497 | 0.60% | 5.46% |
#6 | 9600 | 474 | 0.57% | 6.03% |
#7 | 8400 | 472 | 0.57% | 6.60% |
#8 | 4800 | 414 | 0.50% | 7.09% |
#9 | 4500 | 318 | 0.38% | 7.47% |
#10 | 9000 | 306 | 0.37% | 7.84% |

STATISTICS:
• Unique values: 19,966
• Missing values: 0 (0.00%)
• Most frequent: '5000' (1,249 times)
• Least frequent: '9000' (306 times)
• Minimum value: 381
• Maximum value: 2076940

COLUMN: sqft_fbsmt
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 48,555 | 58.16% | 58.16% |
#2 | 500 | 869 | 1.04% | 59.20% |
#3 | 600 | 800 | 0.96% | 60.16% |
#4 | 400 | 779 | 0.93% | 61.09% |
#5 | 700 | 689 | 0.83% | 61.92% |
#6 | 800 | 673 | 0.81% | 62.73% |
#7 | 1000 | 599 | 0.72% | 63.44% |
#8 | 300 | 571 | 0.68% | 64.13% |
#9 | 900 | 563 | 0.67% | 64.80% |
#10 | 480 | 450 | 0.54% | 65.34% |

STATISTICS:
• Unique values: 499
• Missing values: 0 (0.00%)
• Most frequent: '0' (48,555 times)
• Least frequent: '480' (450 times)
• Minimum value: 0
• Maximum value: 5110

COLUMN: sqft_1
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 1010 | 1,117 | 1.34% | 1.34% |
#2 | 1200 | 1,082 | 1.30% | 2.63% |
#3 | 1300 | 1,056 | 1.26% | 3.90% |
#4 | 1250 | 1,042 | 1.25% | 5.15% |
#5 | 1080 | 1,002 | 1.20% | 6.35% |
#6 | 1060 | 992 | 1.19% | 7.54% |
#7 | 1090 | 951 | 1.14% | 8.67% |
#8 | 1040 | 915 | 1.10% | 9.77% |
#9 | 1120 | 912 | 1.09% | 10.86% |
#10 | 1180 | 904 | 1.08% | 11.95% |

STATISTICS:
• Unique values: 1,226
• Missing values: 0 (0.00%)
• Most frequent: '1010' (1,117 times)
• Least frequent: '1180' (904 times)
• Minimum value: 80
• Maximum value: 7760

COLUMN: stories
Data Type: float64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 1 | 40,955 | 49.06% | 49.06% |
#2 | 2 | 33,146 | 39.70% | 88.76% |
#3 | 1.5 | 6,701 | 8.03% | 96.79% |
#4 | 3 | 2,156 | 2.58% | 99.37% |
#5 | 2.5 | 462 | 0.55% | 99.92% |
#6 | 4 | 45 | 0.05% | 99.98% |
#7 | 3.5 | 17 | 0.02% | 100.00% |
#8 | 4.5 | 1 | 0.00% | 100.00% |

STATISTICS:
• Unique values: 8
• Missing values: 0 (0.00%)
• Most frequent: '1.0' (40,955 times)
• Least frequent: '4.5' (1 times)
• Minimum value: 1.0
• Maximum value: 4.5
• Low cardinality (good for categorical analysis)

COLUMN: beds
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 3 | 37,521 | 44.94% | 44.94% |
#2 | 4 | 28,395 | 34.01% | 78.96% |
#3 | 2 | 9,122 | 10.93% | 89.88% |
#4 | 5 | 6,869 | 8.23% | 98.11% |
#5 | 6 | 854 | 1.02% | 99.14% |
#6 | 1 | 548 | 0.66% | 99.79% |
#7 | 7 | 96 | 0.11% | 99.91% |
#8 | 8 | 36 | 0.04% | 99.95% |
#9 | 0 | 27 | 0.03% | 99.98% |
#10 | 9 | 7 | 0.01% | 99.99% |

STATISTICS:
• Unique values: 13
• Missing values: 0 (0.00%)
• Most frequent: '3' (37,521 times)
• Least frequent: '9' (7 times)
• Minimum value: 0
• Maximum value: 13

COLUMN: bath_full
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 1 | 43,834 | 52.51% | 52.51% |
#2 | 2 | 33,585 | 40.23% | 92.74% |
#3 | 3 | 4,743 | 5.68% | 98.42% |
#4 | 0 | 819 | 0.98% | 99.40% |
#5 | 4 | 471 | 0.56% | 99.96% |
#6 | 5 | 28 | 0.03% | 100.00% |
#7 | 6 | 2 | 0.00% | 100.00% |
#8 | 9 | 1 | 0.00% | 100.00% |

STATISTICS:
• Unique values: 8
• Missing values: 0 (0.00%)
• Most frequent: '1' (43,834 times)
• Least frequent: '9' (1 times)
• Minimum value: 0
• Maximum value: 9
• Low cardinality (good for categorical analysis)

COLUMN: bath_3qtr
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 45,360 | 54.33% | 54.33% |
#2 | 1 | 32,143 | 38.50% | 92.84% |
#3 | 2 | 5,655 | 6.77% | 99.61% |
#4 | 3 | 294 | 0.35% | 99.96% |
#5 | 4 | 24 | 0.03% | 99.99% |
#6 | 5 | 7 | 0.01% | 100.00% |

STATISTICS:
• Unique values: 6
• Missing values: 0 (0.00%)
• Most frequent: '0' (45,360 times)
• Least frequent: '5' (7 times)
• Minimum value: 0
• Maximum value: 5
• Low cardinality (good for categorical analysis)

COLUMN: bath_half
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 44,455 | 53.25% | 53.25% |
#2 | 1 | 38,151 | 45.70% | 98.95% |
#3 | 2 | 847 | 1.01% | 99.96% |
#4 | 3 | 25 | 0.03% | 99.99% |
#5 | 5 | 3 | 0.00% | 100.00% |
#6 | 4 | 1 | 0.00% | 100.00% |
#7 | 12 | 1 | 0.00% | 100.00% |

STATISTICS:
• Unique values: 7
• Missing values: 0 (0.00%)
• Most frequent: '0' (44,455 times)
• Least frequent: '12' (1 times)
• Minimum value: 0
• Maximum value: 12
• Low cardinality (good for categorical analysis)

COLUMN: garb_sqft
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 64,516 | 77.28% | 77.28% |
#2 | 200 | 1,107 | 1.33% | 78.61% |
#3 | 240 | 1,039 | 1.24% | 79.85% |
#4 | 290 | 778 | 0.93% | 80.78% |
#5 | 220 | 742 | 0.89% | 81.67% |
#6 | 260 | 728 | 0.87% | 82.54% |
#7 | 480 | 646 | 0.77% | 83.32% |
#8 | 310 | 554 | 0.66% | 83.98% |
#9 | 300 | 499 | 0.60% | 84.58% |
#10 | 180 | 491 | 0.59% | 85.17% |

STATISTICS:
• Unique values: 248
• Missing values: 0 (0.00%)
• Most frequent: '0' (64,516 times)
• Least frequent: '180' (491 times)
• Minimum value: 0
• Maximum value: 4000

COLUMN: gara_sqft
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 37,014 | 44.34% | 44.34% |
#2 | 440 | 3,486 | 4.18% | 48.51% |
#3 | 480 | 3,034 | 3.63% | 52.15% |
#4 | 400 | 2,240 | 2.68% | 54.83% |
#5 | 460 | 2,239 | 2.68% | 57.51% |
#6 | 420 | 2,142 | 2.57% | 60.08% |
#7 | 530 | 1,383 | 1.66% | 61.73% |
#8 | 500 | 1,324 | 1.59% | 63.32% |
#9 | 510 | 900 | 1.08% | 64.40% |
#10 | 550 | 882 | 1.06% | 65.46% |

STATISTICS:
• Unique values: 522
• Missing values: 0 (0.00%)
• Most frequent: '0' (37,014 times)
• Least frequent: '550' (882 times)
• Minimum value: 0
• Maximum value: 4404

COLUMN: wfnt
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,598 | 98.94% | 98.94% |
#2 | 8 | 325 | 0.39% | 99.33% |
#3 | 9 | 187 | 0.22% | 99.55% |
#4 | 3 | 173 | 0.21% | 99.76% |
#5 | 6 | 124 | 0.15% | 99.91% |
#6 | 7 | 62 | 0.07% | 99.98% |
#7 | 1 | 8 | 0.01% | 99.99% |
#8 | 5 | 6 | 0.01% | 100.00% |

STATISTICS:
• Unique values: 8
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,598 times)
• Least frequent: '5' (6 times)
• Minimum value: 0
• Maximum value: 9
• Low cardinality (good for categorical analysis)

COLUMN: golf
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 83,062 | 99.50% | 99.50% |
#2 | 1 | 421 | 0.50% | 100.00% |

STATISTICS:
• Unique values: 2
• Missing values: 0 (0.00%)
• Most frequent: '0' (83,062 times)
• Least frequent: '1' (421 times)
• Minimum value: 0
• Maximum value: 1
• Low cardinality (good for categorical analysis)

COLUMN: greenbelt
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,087 | 97.13% | 97.13% |
#2 | 1 | 2,396 | 2.87% | 100.00% |

STATISTICS:
• Unique values: 2
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,087 times)
• Least frequent: '1' (2,396 times)
• Minimum value: 0
• Maximum value: 1
• Low cardinality (good for categorical analysis)

COLUMN: noise_traffic
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 72,472 | 86.81% | 86.81% |
#2 | 1 | 6,704 | 8.03% | 94.84% |
#3 | 2 | 3,690 | 4.42% | 99.26% |
#4 | 3 | 617 | 0.74% | 100.00% |

STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (72,472 times)
• Least frequent: '3' (617 times)
• Minimum value: 0
• Maximum value: 3
• Low cardinality (good for categorical analysis)

COLUMN: view_rainier
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,905 | 99.31% | 99.31% |
#2 | 2 | 300 | 0.36% | 99.67% |
#3 | 3 | 241 | 0.29% | 99.96% |
#4 | 4 | 37 | 0.04% | 100.00% |

STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,905 times)
• Least frequent: '4' (37 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_olympics
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,768 | 97.95% | 97.95% |
#2 | 2 | 1,038 | 1.24% | 99.19% |
#3 | 3 | 460 | 0.55% | 99.74% |
#4 | 4 | 217 | 0.26% | 100.00% |

STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,768 times)
• Least frequent: '4' (217 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_cascades
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,535 | 97.67% | 97.67% |
#2 | 2 | 1,324 | 1.59% | 99.25% |
#3 | 3 | 508 | 0.61% | 99.86% |
#4 | 4 | 116 | 0.14% | 100.00% |

STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,535 times)
• Least frequent: '4' (116 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_territorial
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 76,605 | 91.76% | 91.76% |
#2 | 2 | 4,335 | 5.19% | 96.95% |
#3 | 3 | 1,867 | 2.24% | 99.19% |
#4 | 4 | 676 | 0.81% | 100.00% |

STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (76,605 times)
• Least frequent: '4' (676 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_skyline
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,962 | 99.38% | 99.38% |
#2 | 2 | 319 | 0.38% | 99.76% |
#3 | 3 | 121 | 0.14% | 99.90% |
#4 | 4 | 81 | 0.10% | 100.00% |

STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,962 times)
• Least frequent: '4' (81 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_sound
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,420 | 97.53% | 97.53% |
#2 | 1 | 751 | 0.90% | 98.43% |
#3 | 2 | 573 | 0.69% | 99.11% |
#4 | 3 | 464 | 0.56% | 99.67% |
#5 | 4 | 275 | 0.33% | 100.00% |

STATISTICS:
• Unique values: 5
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,420 times)
• Least frequent: '4' (275 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_lakewash
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 81,665 | 97.82% | 97.82% |
#2 | 1 | 723 | 0.87% | 98.69% |
#3 | 2 | 547 | 0.66% | 99.34% |
#4 | 3 | 352 | 0.42% | 99.77% |
#5 | 4 | 196 | 0.23% | 100.00% |

STATISTICS:
• Unique values: 5
• Missing values: 0 (0.00%)
• Most frequent: '0' (81,665 times)
• Least frequent: '4' (196 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_lakesamm
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,999 | 99.42% | 99.42% |
#2 | 2 | 151 | 0.18% | 99.60% |
#3 | 1 | 143 | 0.17% | 99.77% |
#4 | 3 | 105 | 0.13% | 99.90% |
#5 | 4 | 85 | 0.10% | 100.00% |

STATISTICS:
• Unique values: 5
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,999 times)
• Least frequent: '4' (85 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_otherwater
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 82,858 | 99.25% | 99.25% |
#2 | 2 | 302 | 0.36% | 99.61% |
#3 | 3 | 167 | 0.20% | 99.81% |
#4 | 4 | 156 | 0.19% | 100.00% |

STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (82,858 times)
• Least frequent: '4' (156 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)

COLUMN: view_other
Data Type: int64
Total Rows: 83,483 | Non-Missing: 83,483 | Missing: 0
VALUE DISTRIBUTION:

Rank | Value | Count | Percentage | Cumulative % |
---|---|---|---|---|
#1 | 0 | 83,044 | 99.47% | 99.47% |
#2 | 2 | 324 | 0.39% | 99.86% |
#3 | 3 | 95 | 0.11% | 99.98% |
#4 | 4 | 20 | 0.02% | 100.00% |

STATISTICS:
• Unique values: 4
• Missing values: 0 (0.00%)
• Most frequent: '0' (83,044 times)
• Least frequent: '4' (20 times)
• Minimum value: 0
• Maximum value: 4
• Low cardinality (good for categorical analysis)
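The per-column summaries above rank the most frequent values and report each value's share and cumulative share of all rows. The profiling function that produced them is not shown in this section, so the following is only a minimal sketch of how such a table could be built with pandas; the function name `top_value_distribution` and the `demo_values` series are hypothetical.

```python
import pandas as pd

def top_value_distribution(series, top_n=10):
    """Rank the top_n most frequent values with their share of all rows.

    Hypothetical helper: an assumption about how the summaries above
    could be produced, not the notebook's actual implementation.
    """
    # value_counts sorts by descending frequency; keep only the top_n values
    counts = series.value_counts(dropna=True).head(top_n)
    # Shares are relative to the full series, not just the top_n values
    pct = counts / len(series) * 100
    table = pd.DataFrame({
        "Value": counts.index,
        "Count": counts.to_numpy(),
        "Percentage": pct.to_numpy(),
        "Cumulative %": pct.cumsum().to_numpy(),
    })
    table.index = [f"#{i}" for i in range(1, len(table) + 1)]
    table.index.name = "Rank"
    return table

# Toy data for illustration only
demo_values = pd.Series([0, 0, 0, 1, 1, 2])
demo_table = top_value_distribution(demo_values)
print(demo_table)
```

Because only the top ten values are kept, any "least frequent" figure derived from such a table describes the least frequent of the displayed values, not of the whole column.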
### Run Outlier Analysis on the numerical sqft features
calculate_outlier_table(train_data, ['sqft', 'sqft_1', 'sqft_fbsmt', 'gara_sqft', 'garb_sqft', 'sqft_lot'])
 | Column | dtype | min | max | Total Outliers (%) | Lower Outliers (%) | Upper Outliers (%) |
---|---|---|---|---|---|---|---|
4 | garb_sqft | int64 | 0.00 | 4000.00 | 22.72% | 0.00% | 22.72% |
5 | sqft_lot | int64 | 381.00 | 2076940.00 | 10.78% | 0.00% | 10.78% |
1 | sqft_1 | int64 | 80.00 | 7760.00 | 3.05% | 0.05% | 3.00% |
0 | sqft | int64 | 200.00 | 13310.00 | 2.03% | 0.00% | 2.03% |
2 | sqft_fbsmt | int64 | 0.00 | 5110.00 | 1.88% | 0.00% | 1.88% |
3 | gara_sqft | int64 | 0.00 | 4404.00 | 0.23% | 0.00% | 0.23% |
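`calculate_outlier_table` is defined elsewhere in the notebook, and the rule it applies is not shown here. One common choice is the 1.5×IQR rule. Under that rule a column such as `garb_sqft`, where more than 75% of values are 0, has both quartiles at 0, so every non-zero value counts as an upper outlier. That matches the 22.72% reported above (100% minus the 77.28% of zeros), but it remains an assumption. A minimal sketch under that assumption, with the hypothetical names `iqr_outlier_table` and `demo`:

```python
import pandas as pd

def iqr_outlier_table(df, columns, k=1.5):
    """Share of values outside [Q1 - k*IQR, Q3 + k*IQR] for each column.

    Assumption: the 1.5*IQR rule; the notebook's own helper may differ.
    """
    rows = []
    for col in columns:
        s = df[col]
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower_pct = (s < q1 - k * iqr).mean() * 100
        upper_pct = (s > q3 + k * iqr).mean() * 100
        rows.append({
            "Column": col,
            "dtype": str(s.dtype),
            "min": s.min(),
            "max": s.max(),
            "Total Outliers (%)": lower_pct + upper_pct,
            "Lower Outliers (%)": lower_pct,
            "Upper Outliers (%)": upper_pct,
        })
    # Sort so the columns with the most flagged values come first
    return pd.DataFrame(rows).sort_values("Total Outliers (%)", ascending=False)

# Toy data for illustration only: a mostly-zero column and a column with one extreme value
demo = pd.DataFrame({
    "sqft": [1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 9000],
    "garb_sqft": [0, 0, 0, 0, 0, 0, 0, 0, 200, 480],
})
result = iqr_outlier_table(demo, ["sqft", "garb_sqft"])
print(result)
```

For zero-inflated features like these, a high outlier share mostly reflects how rare the feature is, so it is not by itself a reason to drop or clip rows.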
### Examine sqft = 0 along with sqft_1
sqft_var = ['sqft', 'sqft_1', 'sqft_fbsmt', 'gara_sqft', 'garb_sqft', 'sqft_lot']
for var in sqft_var:
    print(f'{var} at 1% percentile = {train_data[var].quantile(.01)}')
    # 5th percentile; the recorded output below was produced with .01 here, so its 5% values repeat the 1% values
    print(f'{var} at 5% percentile = {train_data[var].quantile(.05)}')
    print(f'{var} at 95% percentile = {train_data[var].quantile(.95)}')
    print(f'{var} at 99% percentile = {train_data[var].quantile(.99)}')
    print("")
sqft at 1% percentile = 760.0
sqft at 5% percentile = 760.0
sqft at 95% percentile = 3670.0
sqft at 99% percentile = 4640.0

sqft_1 at 1% percentile = 400.0
sqft_1 at 5% percentile = 400.0
sqft_1 at 95% percentile = 2050.0
sqft_1 at 99% percentile = 2680.0

sqft_fbsmt at 1% percentile = 0.0
sqft_fbsmt at 5% percentile = 0.0
sqft_fbsmt at 95% percentile = 1200.0
sqft_fbsmt at 99% percentile = 1650.0

gara_sqft at 1% percentile = 0.0
gara_sqft at 5% percentile = 0.0
gara_sqft at 95% percentile = 750.0
gara_sqft at 99% percentile = 950.0

garb_sqft at 1% percentile = 0.0
garb_sqft at 5% percentile = 0.0
garb_sqft at 95% percentile = 520.0
garb_sqft at 99% percentile = 700.0

sqft_lot at 1% percentile = 797.0
sqft_lot at 5% percentile = 797.0
sqft_lot at 95% percentile = 39166.69999999998
sqft_lot at 99% percentile = 193439.6599999987
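The same percentiles can also be collected in a single call, which puts each probability in its own column and makes a mislabeled or repeated value easier to spot. This is a minimal sketch rather than the notebook's own code; `percentile_table` and `demo_sqft` are hypothetical names.

```python
import pandas as pd

def percentile_table(df, columns, probs=(0.01, 0.05, 0.95, 0.99)):
    """Return one row per column and one labeled column per percentile."""
    # DataFrame.quantile accepts a list of probabilities and returns one row per probability
    table = df[list(columns)].quantile(list(probs))
    # Label rows as "1%", "5%", ... so each value is tied to the probability that produced it
    table.index = [f"{p:.0%}" for p in probs]
    return table.T

# Toy data for illustration only: the integers 0..100
demo_sqft = pd.DataFrame({"sqft": range(0, 101)})
demo_percentiles = percentile_table(demo_sqft, ["sqft"])
print(demo_percentiles)
```

With the notebook's data this would be called as `percentile_table(train_data, sqft_var)`.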
large_attached_garage = train_data[train_data['gara_sqft'] > 1000]
large_attached_garage.describe()
 | id | sale_date | join_year | latitude | longitude | area | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other | adjusted_sale_price | sale_warning_1 | sale_warning_2 | sale_warning_3 | sale_warning_4 | sale_warning_5 | sale_warning_6 | sale_warning_7 | sale_warning_8 | sale_warning_9 | sale_warning_10 | sale_warning_11 | sale_warning_12 | sale_warning_13 | sale_warning_14 | sale_warning_15 | sale_warning_16 | sale_warning_17 | sale_warning_18 | sale_warning_19 | sale_warning_20 | sale_warning_21 | sale_warning_22 | sale_warning_23 | sale_warning_24 | sale_warning_25 | sale_warning_26 | sale_warning_27 | sale_warning_28 | sale_warning_29 | sale_warning_30 | sale_warning_31 | sale_warning_32 | sale_warning_33 | sale_warning_34 | sale_warning_35 | sale_warning_36 | sale_warning_37 | sale_warning_38 | sale_warning_39 | sale_warning_40 | sale_warning_41 | sale_warning_42 | sale_warning_43 | sale_warning_44 | sale_warning_45 | sale_warning_46 | sale_warning_47 | sale_warning_48 | sale_warning_49 | sale_warning_50 | sale_warning_51 | sale_warning_52 | sale_warning_53 | sale_warning_54 | sale_warning_55 | sale_warning_56 | sale_warning_57 | sale_warning_58 | sale_warning_59 | sale_warning_60 | sale_warning_61 | sale_warning_62 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 583.000000 | 583 | 583.0 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 5.830000e+02 | 5.830000e+02 | 583.000000 | 583.0 | 5.830000e+02 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 583.000000 | 5.830000e+02 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.000000 | 583.0 | 583.0 | 583.0 | 583.0 | 583.000000 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.000000 | 583.0 | 583.0 | 583.000000 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.000000 | 583.0 | 583.0 | 583.0 | 583.0 | 583.000000 | 583.000000 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.000000 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 | 583.0 |
mean | 99570.855918 | 2009-03-01 20:25:06.689536768 | 2025.0 | 47.540101 | -122.085634 | 62.660377 | 2.053173 | 6.148419e+05 | 1.107655e+06 | 1988.849057 | 0.0 | 7.223184e+04 | 3750.723842 | 2223.835334 | 269.939966 | 9.480274 | 1.651801 | 3.447684 | 1.728988 | 3.874786 | 2.094340 | 0.670669 | 0.871355 | 12.349914 | 1230.087479 | 0.324185 | 0.025729 | 0.041166 | 0.120069 | 0.051458 | 0.066895 | 0.111492 | 0.404803 | 0.017153 | 0.039451 | 0.058319 | 0.060034 | 0.085763 | 0.039451 | 1.550213e+06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001715 | 0.0 | 0.0 | 0.0 | 0.0 | 0.017153 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.044597 | 0.0 | 0.0 | 0.001715 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001715 | 0.0 | 0.0 | 0.0 | 0.0 | 0.012007 | 0.003431 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001715 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
min | 35.000000 | 1999-01-15 00:00:00 | 2025.0 | 47.201600 | -122.496400 | 1.000000 | 2.000000 | 0.000000e+00 | 0.000000e+00 | 1921.000000 | 0.0 | 5.189000e+03 | 840.000000 | 390.000000 | 0.000000 | 6.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1010.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.358280e+05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
25% | 49484.000000 | 2002-06-30 00:00:00 | 2025.0 | 47.395800 | -122.148750 | 47.000000 | 2.000000 | 3.400000e+05 | 6.135000e+05 | 1985.000000 | 0.0 | 1.940800e+04 | 2845.000000 | 1588.000000 | 0.000000 | 8.000000 | 0.000000 | 3.000000 | 1.000000 | 3.000000 | 2.000000 | 0.000000 | 1.000000 | 0.000000 | 1055.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.235880e+05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
50% | 97186.000000 | 2006-09-15 00:00:00 | 2025.0 | 47.566400 | -122.076000 | 66.000000 | 2.000000 | 5.250000e+05 | 9.870000e+05 | 1990.000000 | 0.0 | 3.560800e+04 | 3590.000000 | 2010.000000 | 0.000000 | 10.000000 | 0.000000 | 3.000000 | 2.000000 | 4.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 1120.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.320076e+06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
75% | 149202.000000 | 2016-02-14 00:00:00 | 2025.0 | 47.669150 | -122.025950 | 72.000000 | 2.000000 | 6.880000e+05 | 1.439000e+06 | 1997.000000 | 0.0 | 7.838800e+04 | 4410.000000 | 2620.000000 | 0.000000 | 11.000000 | 0.000000 | 4.000000 | 2.000000 | 4.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 | 1270.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.945776e+06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
max | 199962.000000 | 2024-07-15 00:00:00 | 2025.0 | 47.776400 | -121.705100 | 100.000000 | 29.000000 | 5.591000e+06 | 6.653000e+06 | 2022.000000 | 0.0 | 1.092431e+06 | 13310.000000 | 7760.000000 | 3730.000000 | 13.000000 | 11.000000 | 5.000000 | 2.500000 | 7.000000 | 6.000000 | 4.000000 | 3.000000 | 1200.000000 | 4404.000000 | 9.000000 | 1.000000 | 1.000000 | 3.000000 | 4.000000 | 3.000000 | 4.000000 | 4.000000 | 2.000000 | 3.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 5.973216e+06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
std | 58692.390383 | NaN | 0.0 | 0.155294 | 0.114369 | 21.320205 | 1.130149 | 5.074465e+05 | 6.969236e+05 | 14.305859 | 0.0 | 1.045035e+05 | 1432.813786 | 885.067103 | 644.425841 | 1.449922 | 3.485935 | 0.592267 | 0.443897 | 0.818451 | 0.824675 | 0.767430 | 0.596887 | 102.272131 | 339.842076 | 1.573912 | 0.158462 | 0.198845 | 0.425952 | 0.398572 | 0.406965 | 0.536876 | 0.996960 | 0.184579 | 0.315936 | 0.366129 | 0.428573 | 0.520560 | 0.326632 | 9.133677e+05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.041416 | 0.0 | 0.0 | 0.0 | 0.0 | 0.129952 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.206594 | 0.0 | 0.0 | 0.041416 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.041416 | 0.0 | 0.0 | 0.0 | 0.0 | 0.109010 | 0.058520 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.041416 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
len(train_data[train_data['sqft'] >= 4900])
568
train_data[train_data['sqft'] >= 10000]
id | sale_date | sale_nbr | sale_warning | join_status | join_year | latitude | longitude | area | city | zoning | subdivision | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other | submarket | adjusted_sale_price | sale_warning_1 | sale_warning_2 | sale_warning_3 | sale_warning_4 | sale_warning_5 | sale_warning_6 | sale_warning_7 | sale_warning_8 | sale_warning_9 | sale_warning_10 | sale_warning_11 | sale_warning_12 | sale_warning_13 | sale_warning_14 | sale_warning_15 | sale_warning_16 | sale_warning_17 | sale_warning_18 | sale_warning_19 | sale_warning_20 | sale_warning_21 | sale_warning_22 | sale_warning_23 | sale_warning_24 | sale_warning_25 | sale_warning_26 | sale_warning_27 | sale_warning_28 | sale_warning_29 | sale_warning_30 | sale_warning_31 | sale_warning_32 | sale_warning_33 | sale_warning_34 | sale_warning_35 | sale_warning_36 | sale_warning_37 | sale_warning_38 | sale_warning_39 | sale_warning_40 | sale_warning_41 | sale_warning_42 | sale_warning_43 | sale_warning_44 | sale_warning_45 | sale_warning_46 | sale_warning_47 | sale_warning_48 | sale_warning_49 | sale_warning_50 | sale_warning_51 | sale_warning_52 | sale_warning_53 | sale_warning_54 | sale_warning_55 | sale_warning_56 | sale_warning_57 | sale_warning_58 | sale_warning_59 | sale_warning_60 | sale_warning_61 | sale_warning_62 | zoning_category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
75850 | 75850 | 2004-07-15 | partial_split | new | 2025 | 47.6003 | -122.0071 | 35 | SAMMAMISH | R4 | BEAVERDAM DIV NO. 01 | 2 | 1261000 | 3388000 | 1999 | 0 | 77101 | 10380 | 5270 | 5110 | 13 | 12 | 3 | 1.0 | 6 | 2 | 3 | 1 | 0 | 790 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | O | 6083831 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones | |
89377 | 89377 | 2004-09-15 | standard | 26 | nochg | 2025 | 47.6271 | -122.3149 | 13 | SEATTLE | NR3 | CAPITOL HILL | 2 | 1744000 | 4223000 | 1914 | 0 | 13744 | 10950 | 3770 | 800 | 13 | 8 | 5 | 2.5 | 13 | 9 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | D | 3871529 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Neighborhood Residential |
100476 | 100476 | 2003-03-15 | partial_split | nochg | 2025 | 47.4398 | -122.0240 | 66 | KING COUNTY | RA5 | WEBSTER LAKE ESTATES | 2 | 507000 | 2649000 | 1998 | 0 | 206038 | 13310 | 5290 | 3730 | 12 | 10 | 3 | 2.0 | 5 | 2 | 4 | 1 | 0 | 1570 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K | 5392654 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones | |
104430 | 104430 | 1999-03-15 | standard | nochg | 2025 | 47.4419 | -122.0136 | 66 | KING COUNTY | RA5 | NaN | 2 | 695000 | 2101000 | 1998 | 0 | 102366 | 10150 | 4950 | 2490 | 12 | 10 | 3 | 2.0 | 4 | 4 | 1 | 2 | 0 | 1370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | K | 4451419 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones | |
128359 | 128359 | 2000-03-15 | standard | 10 | new | 2025 | 47.5945 | -122.2066 | 92 | BELLEVUE | R-1.8 | NaN | 2 | 3204000 | 6653000 | 2001 | 0 | 65775 | 11400 | 7760 | 320 | 13 | 9 | 3 | 2.0 | 5 | 5 | 0 | 2 | 0 | 1290 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | S | 1776733 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Industrial and Other |
train_data[(train_data['sqft'] <= 300) & (train_data['beds'] > 0)]
id | sale_date | sale_nbr | sale_warning | join_status | join_year | latitude | longitude | area | city | zoning | subdivision | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other | submarket | adjusted_sale_price | sale_warning_1 | sale_warning_2 | sale_warning_3 | sale_warning_4 | sale_warning_5 | sale_warning_6 | sale_warning_7 | sale_warning_8 | sale_warning_9 | sale_warning_10 | sale_warning_11 | sale_warning_12 | sale_warning_13 | sale_warning_14 | sale_warning_15 | sale_warning_16 | sale_warning_17 | sale_warning_18 | sale_warning_19 | sale_warning_20 | sale_warning_21 | sale_warning_22 | sale_warning_23 | sale_warning_24 | sale_warning_25 | sale_warning_26 | sale_warning_27 | sale_warning_28 | sale_warning_29 | sale_warning_30 | sale_warning_31 | sale_warning_32 | sale_warning_33 | sale_warning_34 | sale_warning_35 | sale_warning_36 | sale_warning_37 | sale_warning_38 | sale_warning_39 | sale_warning_40 | sale_warning_41 | sale_warning_42 | sale_warning_43 | sale_warning_44 | sale_warning_45 | sale_warning_46 | sale_warning_47 | sale_warning_48 | sale_warning_49 | sale_warning_50 | sale_warning_51 | sale_warning_52 | sale_warning_53 | sale_warning_54 | sale_warning_55 | sale_warning_56 | sale_warning_57 | sale_warning_58 | sale_warning_59 | sale_warning_60 | sale_warning_61 | sale_warning_62 | zoning_category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
121034 | 121034 | 2013-11-15 | standard | nochg | 2025 | 47.5847 | -122.0002 | 69 | SAMMAMISH | R4 | NaN | 2 | 973000 | 0 | 1985 | 0 | 40040 | 200 | 200 | 0 | 7 | 0 | 3 | 1.0 | 3 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | O | 1351169 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones |
# Logically, sqft_1 should never exceed sqft
# Potential data-quality issues may stem from one or more of the following: manual input mistakes during property assessment; different measurement standards applied at different times; values recorded by different assessors using varying methodologies
train_data[(train_data['sqft'] < train_data['sqft_1'])].head(10)
id | sale_date | sale_nbr | sale_warning | join_status | join_year | latitude | longitude | area | city | zoning | subdivision | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other | submarket | adjusted_sale_price | sale_warning_1 | sale_warning_2 | sale_warning_3 | sale_warning_4 | sale_warning_5 | sale_warning_6 | sale_warning_7 | sale_warning_8 | sale_warning_9 | sale_warning_10 | sale_warning_11 | sale_warning_12 | sale_warning_13 | sale_warning_14 | sale_warning_15 | sale_warning_16 | sale_warning_17 | sale_warning_18 | sale_warning_19 | sale_warning_20 | sale_warning_21 | sale_warning_22 | sale_warning_23 | sale_warning_24 | sale_warning_25 | sale_warning_26 | sale_warning_27 | sale_warning_28 | sale_warning_29 | sale_warning_30 | sale_warning_31 | sale_warning_32 | sale_warning_33 | sale_warning_34 | sale_warning_35 | sale_warning_36 | sale_warning_37 | sale_warning_38 | sale_warning_39 | sale_warning_40 | sale_warning_41 | sale_warning_42 | sale_warning_43 | sale_warning_44 | sale_warning_45 | sale_warning_46 | sale_warning_47 | sale_warning_48 | sale_warning_49 | sale_warning_50 | sale_warning_51 | sale_warning_52 | sale_warning_53 | sale_warning_54 | sale_warning_55 | sale_warning_56 | sale_warning_57 | sale_warning_58 | sale_warning_59 | sale_warning_60 | sale_warning_61 | sale_warning_62 | zoning_category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3794 | 3794 | 2004-06-15 | standard | nochg | 2025 | 47.5750 | -122.1424 | 31 | BELLEVUE | R-5 | EASTGATE ADD DIV A | 2 | 900000 | 35000 | 1954 | 0 | 10400 | 1050 | 1160 | 0 | 7 | 0 | 4 | 1.0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | R | 525421 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones | |
4066 | 4066 | 2018-06-15 | standard | nochg | 2025 | 47.2486 | -122.0010 | 40 | KING COUNTY | RA2.5 | NaN | 2 | 301000 | 635000 | 1988 | 0 | 98881 | 2360 | 2660 | 0 | 8 | 0 | 3 | 1.0 | 2 | 2 | 0 | 1 | 0 | 790 | 0 | 0 | 0 | 0 | 3 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | M | 912977 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones | |
7382 | 7382 | 2004-06-15 | partial_split | nochg | 2025 | 47.4862 | -122.3303 | 96 | BURIEN | RS-7200 | CEDARHURST DIV NO. 01 | 2 | 201000 | 184000 | 1963 | 0 | 6600 | 620 | 850 | 0 | 6 | 0 | 3 | 1.0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | G | 285387 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones | |
10650 | 10650 | 2015-06-15 | partial_split | nochg | 2025 | 47.3729 | -122.2780 | 26 | KENT | SR-6 | KENTWOOD GLEN NO. 02 | 2 | 151000 | 383000 | 1967 | 0 | 8480 | 1650 | 1940 | 0 | 7 | 0 | 4 | 1.0 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | I | 357102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Special Use Zones | |
12230 | 12230 | 2017-10-15 | standard | 26 | nochg | 2025 | 47.5076 | -122.1415 | 66 | KING COUNTY | RA5 | MAY VALLEY DIV NO. 02 | 2 | 253000 | 394000 | 1965 | 0 | 16400 | 1300 | 1640 | 0 | 7 | 0 | 4 | 1.0 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | K | 525427 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones |
21623 | 21623 | 2010-05-15 | standard | 60 | nochg | 2025 | 47.6784 | -122.1602 | 93 | REDMOND | NR | BURKE-FARRARS KIRKLAND DIV NO. 12 | 2 | 775000 | 1000 | 1962 | 0 | 16900 | 920 | 1300 | 0 | 6 | 0 | 4 | 1.0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Q | 501974 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Neighborhood Residential |
22436 | 22436 | 2004-05-15 | standard | nochg | 2025 | 47.6917 | -122.2015 | 74 | KIRKLAND | RS 7.2 | BURKE-FARRARS KIRKLAND DIV NO. 27 | 2 | 1058000 | 241000 | 1969 | 0 | 7200 | 1250 | 1400 | 0 | 7 | 0 | 3 | 1.0 | 3 | 1 | 1 | 0 | 0 | 390 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Q | 772093 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Industrial and Other | |
26605 | 26605 | 2003-12-15 | partial_split | nochg | 2025 | 47.6960 | -122.1139 | 72 | REDMOND | NR | MESA VERDE DIV NO. 01 | 2 | 664000 | 98000 | 1969 | 0 | 7360 | 1060 | 1130 | 0 | 7 | 0 | 3 | 1.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | P | 528030 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Neighborhood Residential | |
38742 | 38742 | 2009-10-15 | standard | nochg | 2025 | 47.4735 | -122.3438 | 96 | BURIEN | RS-7200 | LEONARD ADD | 2 | 226000 | 286000 | 1953 | 0 | 8360 | 1420 | 1500 | 0 | 7 | 0 | 3 | 1.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | G | 455976 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones | |
39678 | 39678 | 2013-06-15 | partial_split | nochg | 2025 | 47.7516 | -122.3567 | 1 | SHORELINE | R6 | BALCHS ALBERT PARK HIGHLANDS ADD | 2 | 386000 | 288000 | 1955 | 0 | 8100 | 1300 | 1590 | 0 | 7 | 0 | 3 | 1.0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | A | 488164 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones |
# Remove the 126 affected rows: they are few, and there is no way to determine which value (sqft or sqft_1) is incorrect
train_data = train_data.drop(train_data[(train_data['sqft'] < train_data['sqft_1'])].index)
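The consistency filter above can also be written with an explicit boolean mask, which makes it easy to report how many rows are affected before they are removed. The following is a minimal sketch on a toy frame; `toy`, `inconsistent`, and `cleaned` are hypothetical names, and the values are illustrative rather than taken from the full dataset.

```python
import pandas as pd

# Toy stand-in for train_data; values are illustrative only.
toy = pd.DataFrame({
    "sqft":   [1050, 2360, 1800, 620],
    "sqft_1": [1160, 2660,  900, 620],
})

# Flag rows where sqft_1 exceeds the total living area (sqft).
inconsistent = toy["sqft"] < toy["sqft_1"]
n_dropped = int(inconsistent.sum())
print(f"Dropping {n_dropped} of {len(toy)} rows")

# Keep only the consistent rows; .copy() avoids chained-assignment warnings later.
cleaned = toy.loc[~inconsistent].copy()
```

Logging the count this way keeps the comment about the number of affected rows verifiable if the upstream data changes.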
sns.boxenplot(data=train_data[train_data['sqft_fbsmt'] > 0], x='sqft_fbsmt')
plt.show()
sns.boxenplot(data=train_data[train_data['gara_sqft'] > 0], x='gara_sqft')
plt.show()
sns.boxenplot(data=train_data[train_data['garb_sqft'] > 0], x='garb_sqft')
plt.show()
train_data[(train_data['garb_sqft'] >= 4000)]
id | sale_date | sale_nbr | sale_warning | join_status | join_year | latitude | longitude | area | city | zoning | subdivision | present_use | land_val | imp_val | year_built | year_reno | sqft_lot | sqft | sqft_1 | sqft_fbsmt | grade | fbsmt_grade | condition | stories | beds | bath_full | bath_3qtr | bath_half | garb_sqft | gara_sqft | wfnt | golf | greenbelt | noise_traffic | view_rainier | view_olympics | view_cascades | view_territorial | view_skyline | view_sound | view_lakewash | view_lakesamm | view_otherwater | view_other | submarket | adjusted_sale_price | sale_warning_1 | sale_warning_2 | sale_warning_3 | sale_warning_4 | sale_warning_5 | sale_warning_6 | sale_warning_7 | sale_warning_8 | sale_warning_9 | sale_warning_10 | sale_warning_11 | sale_warning_12 | sale_warning_13 | sale_warning_14 | sale_warning_15 | sale_warning_16 | sale_warning_17 | sale_warning_18 | sale_warning_19 | sale_warning_20 | sale_warning_21 | sale_warning_22 | sale_warning_23 | sale_warning_24 | sale_warning_25 | sale_warning_26 | sale_warning_27 | sale_warning_28 | sale_warning_29 | sale_warning_30 | sale_warning_31 | sale_warning_32 | sale_warning_33 | sale_warning_34 | sale_warning_35 | sale_warning_36 | sale_warning_37 | sale_warning_38 | sale_warning_39 | sale_warning_40 | sale_warning_41 | sale_warning_42 | sale_warning_43 | sale_warning_44 | sale_warning_45 | sale_warning_46 | sale_warning_47 | sale_warning_48 | sale_warning_49 | sale_warning_50 | sale_warning_51 | sale_warning_52 | sale_warning_53 | sale_warning_54 | sale_warning_55 | sale_warning_56 | sale_warning_57 | sale_warning_58 | sale_warning_59 | sale_warning_60 | sale_warning_61 | sale_warning_62 | zoning_category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
512 | 512 | 2020-01-15 | partial_split | new | 2025 | 47.6271 | -121.9556 | 71 | KING COUNTY | RA2.5 | NaN | 2 | 660000 | 830000 | 2000 | 0 | 210830 | 3200 | 2560 | 0 | 8 | 0 | 3 | 2.0 | 4 | 1 | 2 | 1 | 4000 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | O | 1727310 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Residential Zones |
### Convert select size features to categorical size bins ###
# sqft_fbsmt -> finished basement size category (None/Small/Medium/Large/XLarge)
train_data['finished_basement_type'] = np.select(
    [train_data['sqft_fbsmt'] == 0,
     (train_data['sqft_fbsmt'] > 0) & (train_data['sqft_fbsmt'] <= 500),
     (train_data['sqft_fbsmt'] > 1250) & (train_data['sqft_fbsmt'] <= 2000),
     train_data['sqft_fbsmt'] > 2000],
    ['None', 'Small', 'Large', 'XLarge'],
    default='Medium')  # Medium: 500 < sqft_fbsmt <= 1250
# gara_sqft -> attached garage size category
train_data['attached_garage_type'] = np.select(
    [train_data['gara_sqft'] == 0,
     (train_data['gara_sqft'] > 0) & (train_data['gara_sqft'] <= 250),
     (train_data['gara_sqft'] > 750) & (train_data['gara_sqft'] <= 1000),
     train_data['gara_sqft'] > 1000],
    ['None', 'Small', 'Large', 'XLarge'],
    default='Medium')  # Medium: 250 < gara_sqft <= 750
# garb_sqft -> basement garage size category
train_data['basement_garage_type'] = np.select(
    [train_data['garb_sqft'] == 0,
     (train_data['garb_sqft'] > 0) & (train_data['garb_sqft'] <= 250),
     (train_data['garb_sqft'] > 750) & (train_data['garb_sqft'] <= 1000),
     train_data['garb_sqft'] > 1000],
    ['None', 'Small', 'Large', 'XLarge'],
    default='Medium')  # Medium: 250 < garb_sqft <= 750
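As an alternative to repeating the `np.select` conditions, the same size bins can be expressed once with `pd.cut`. This is a sketch only, not the notebook's method: `size_category` is a hypothetical helper, and the thresholds passed to it mirror the garage cut-offs used above. Note one behavioral difference: with `np.select`, missing or negative values fall through to the `'Medium'` default, whereas here negatives land in `'None'` and missing values become the string `'nan'`.

```python
import numpy as np
import pandas as pd

def size_category(values, small_max, medium_max, large_max):
    """Bin a square-footage Series into None/Small/Medium/Large/XLarge (hypothetical helper)."""
    bins = [-np.inf, 0, small_max, medium_max, large_max, np.inf]
    labels = ["None", "Small", "Medium", "Large", "XLarge"]
    # right=True makes each bin (lower, upper], matching the <= comparisons above.
    return pd.cut(values, bins=bins, labels=labels, right=True).astype(str)

# Illustrative values covering every bin with the garage thresholds (250 / 750 / 1000).
toy = pd.Series([0, 250, 600, 1000, 1500])
categories = size_category(toy, 250, 750, 1000)
print(categories.tolist())
```

The basement feature would use the same helper with thresholds 500, 1250, and 2000.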
The following features are recorded in the dataset as quality scores assigned by an assessor during the most recent assessment. Because these scores depend on the assessor and may be subjective, the features have been converted to binary values, where 1 indicates that the property has the feature; whether the feature exists at all is objective.
# wfnt (1 if wfnt indicates some level of access)
train_data['has_waterfront_access'] = [0 if x == 0 else 1 for x in train_data.wfnt]
# noise_traffic (1 if above typical noise levels)
train_data['above_typical_noise'] = [0 if x == 0 else 1 for x in train_data.noise_traffic]
# all view attributes (1 if view)
train_data['has_view_rainier'] = [0 if x == 0 else 1 for x in train_data.view_rainier]
train_data['has_view_olympics'] = [0 if x == 0 else 1 for x in train_data.view_olympics]
train_data['has_view_cascades'] = [0 if x == 0 else 1 for x in train_data.view_cascades]
train_data['has_view_territorial'] = [0 if x == 0 else 1 for x in train_data.view_territorial]
train_data['has_view_skyline'] = [0 if x == 0 else 1 for x in train_data.view_skyline]
train_data['has_view_sound'] = [0 if x == 0 else 1 for x in train_data.view_sound]
train_data['has_view_lakewash'] = [0 if x == 0 else 1 for x in train_data.view_lakewash]
train_data['has_view_lakesamm'] = [0 if x == 0 else 1 for x in train_data.view_lakesamm]
train_data['has_view_otherwater'] = [0 if x == 0 else 1 for x in train_data.view_otherwater]
train_data['has_view_other'] = [0 if x == 0 else 1 for x in train_data.view_other]
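The list comprehensions above can be replaced by a vectorized comparison applied in a loop over the source columns. This is a minimal sketch on a toy frame; `toy` and `indicator_names` are hypothetical names, and only two of the columns are shown.

```python
import pandas as pd

# Toy stand-in for train_data with two assessor-scored columns.
toy = pd.DataFrame({
    "wfnt": [0, 8, 0],
    "view_rainier": [0, 3, 1],
})

# Map source column -> indicator column name (subset shown for brevity).
indicator_names = {
    "wfnt": "has_waterfront_access",
    "view_rainier": "has_view_rainier",
}

for source, target in indicator_names.items():
    # Any nonzero score becomes 1, mirroring `0 if x == 0 else 1`.
    toy[target] = (toy[source] != 0).astype(int)
```

Keeping the column mapping in one dictionary makes it harder for a source column and its indicator name to drift apart.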
Trim the dataset down by removing features, as indicated by the analysis above.
selected_features = [
'id',
### sale features ###
'sale_date',
'adjusted_sale_price',
'sale_nbr',
'sale_warning_3',
'sale_warning_4',
'sale_warning_10',
'sale_warning_15',
'sale_warning_16',
'sale_warning_17',
'sale_warning_24',
'sale_warning_26',
'sale_warning_29',
'sale_warning_30',
'sale_warning_35',
'sale_warning_36',
'sale_warning_38',
'sale_warning_40',
'sale_warning_41',
'sale_warning_44',
'sale_warning_54',
'sale_warning_57',
'sale_warning_58',
'sale_warning_60',
### Geographic Features ###
'latitude',
'longitude',
'area',
'city',
### Legal Features ###
'zoning_category',
### Property Features ###
'year_built',
'year_reno',
'sqft',
'sqft_1',
'stories',
'beds',
'bath_full',
'bath_3qtr',
'bath_half',
'golf',
'greenbelt',
'submarket',
'finished_basement_type',
'attached_garage_type',
'basement_garage_type',
'has_waterfront_access',
'above_typical_noise',
'has_view_rainier',
'has_view_olympics',
'has_view_cascades',
'has_view_territorial',
'has_view_skyline',
'has_view_sound',
'has_view_lakewash',
'has_view_lakesamm',
'has_view_otherwater',
'has_view_other'
]
train_data = train_data[selected_features]
4. Feature Engineering
While some feature engineering was done above, this section employs more advanced techniques in an attempt to capture more complex factors that influence house prices. We create:
Spatial Features:
- Distance to key locations (downtown, water bodies, etc.)
- Neighborhood density and characteristics
- Geographic clustering of similar properties
Temporal Features:
- Seasonality indicators (month, quarter) with cyclical encoding
- Market trend indicators
- Property age and renovation timing
Property Characteristic Features:
- Size ratios and efficiency metrics
- Quality and condition indicators
- Amenity presence and quality scores
Market Context Features:
- Neighborhood price trends
- Sales volume indicators
- Comparative market analysis features
These engineered features are intended to help our model capture the complex relationships between property characteristics and market values, with the aim of improving prediction accuracy and reliability.
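A few of the feature types listed above can be sketched as follows: a great-circle distance to a reference point, cyclical encoding of the sale month, and property age at sale. This is an illustration under stated assumptions, not the notebook's implementation. The downtown Seattle coordinates are approximate and assumed, and `haversine_km`, `dist_downtown_km`, `month_sin`, `month_cos`, and `age_at_sale` are hypothetical names.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0
# Assumed reference point (approximate downtown Seattle); not taken from the notebook.
DOWNTOWN_LAT, DOWNTOWN_LON = 47.6062, -122.3321

def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in kilometres between points given in degrees."""
    lat, lon, ref_lat, ref_lon = map(np.radians, (lat, lon, ref_lat, ref_lon))
    a = (np.sin((lat - ref_lat) / 2) ** 2
         + np.cos(lat) * np.cos(ref_lat) * np.sin((lon - ref_lon) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two illustrative records.
toy = pd.DataFrame({
    "sale_date": pd.to_datetime(["1999-01-15", "2006-08-15"]),
    "latitude": [47.6531, 47.4733],
    "longitude": [-122.1996, -122.1901],
    "year_built": [1962, 1986],
})

# Spatial: distance to the assumed reference point.
toy["dist_downtown_km"] = haversine_km(
    toy["latitude"], toy["longitude"], DOWNTOWN_LAT, DOWNTOWN_LON)

# Temporal: sine/cosine encoding places December next to January on a circle.
month = toy["sale_date"].dt.month
toy["month_sin"] = np.sin(2 * np.pi * month / 12)
toy["month_cos"] = np.cos(2 * np.pi * month / 12)

# Temporal: age of the property at the time of sale.
toy["age_at_sale"] = toy["sale_date"].dt.year - toy["year_built"]
```

The sine/cosine pair is used instead of the raw month number so that a tree or distance-based model does not treat months 12 and 1 as far apart.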
# Drop rows with zero living area (the log is undefined at 0), then log-transform
# adjusted_sale_price and the square-footage variables (right-skewed distributions)
train_data = train_data[train_data['sqft'] != 0].copy()
train_data['log_adj_sale_price'] = np.log(train_data['adjusted_sale_price'])
train_data['log_sqft'] = np.log(train_data['sqft'])
train_data['log_sqft_1'] = np.log(train_data['sqft_1'])
train_data.head()
index | id | sale_date | adjusted_sale_price | sale_nbr | sale_warning_3 | sale_warning_4 | sale_warning_10 | sale_warning_15 | sale_warning_16 | sale_warning_17 | sale_warning_24 | sale_warning_26 | sale_warning_29 | sale_warning_30 | sale_warning_35 | sale_warning_36 | sale_warning_38 | sale_warning_40 | sale_warning_41 | sale_warning_44 | sale_warning_54 | sale_warning_57 | sale_warning_58 | sale_warning_60 | latitude | longitude | area | city | zoning_category | year_built | year_reno | sqft | sqft_1 | stories | beds | bath_full | bath_3qtr | bath_half | golf | greenbelt | submarket | finished_basement_type | attached_garage_type | basement_garage_type | has_waterfront_access | above_typical_noise | has_view_rainier | has_view_olympics | has_view_cascades | has_view_territorial | has_view_skyline | has_view_sound | has_view_lakewash | has_view_lakesamm | has_view_otherwater | has_view_other | log_adj_sale_price | log_sqft | log_sqft_1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1999-01-15 | 776952 | standard | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47.6531 | -122.1996 | 74 | KIRKLAND | Industrial and Other | 1962 | 0 | 2040 | 1220 | 1.0 | 3 | 1 | 1 | 1 | 0 | 0 | Q | Medium | None | None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 13.563134 | 7.620705 | 7.106606 |
2 | 2 | 2006-08-15 | 697511 | partial_split | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47.4733 | -122.1901 | 30 | RENTON | Residential Zones | 1986 | 0 | 1640 | 820 | 2.0 | 3 | 2 | 0 | 1 | 0 | 0 | K | None | Medium | None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.455274 | 7.402452 | 6.709304 |
3 | 3 | 1999-12-15 | 662133 | partial_split | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47.4739 | -122.3295 | 96 | BURIEN | Residential Zones | 1998 | 0 | 2610 | 1010 | 2.0 | 4 | 2 | 0 | 1 | 0 | 0 | G | Small | Medium | None | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.403222 | 7.867106 | 6.917706 |
7 | 7 | 2001-08-15 | 527497 | partial_split | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47.3090 | -122.3490 | 54 | FEDERAL WAY | Industrial and Other | 1985 | 0 | 2040 | 1120 | 2.0 | 3 | 2 | 0 | 1 | 0 | 0 | I | None | Medium | None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.175898 | 7.620705 | 7.021084 |
8 | 8 | 2002-01-15 | 534295 | partial_split | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47.4955 | -122.3565 | 96 | BURIEN | Residential Zones | 1962 | 0 | 2180 | 1090 | 1.0 | 4 | 1 | 1 | 1 | 0 | 0 | G | Medium | None | None | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13.188703 | 7.687080 | 6.993933 |
Spatial Feature Engineering
# Distance-based features: geodesic distance to downtown cores and the airport
# Neighborhood clustering: K-means clustering on lat/lon to create location groups
# Geographic zones: coarse bounding-box regions of the county
def create_spatial_features(df):
    """
    Create spatial features for location-based insights in house price prediction.
    Parameters:
        df (DataFrame): Input dataframe containing latitude, longitude, and location data
    Returns:
        DataFrame: Enhanced dataframe with spatial features
    """
    df_spatial = df.copy()
    # Reference points (latitude, longitude) for distance calculations
    reference_points = {
        'seattle_downtown': (47.6062, -122.3321),
        'bellevue_downtown': (47.6101, -122.2015),
        'redmond_downtown': (47.6740, -122.1215),
        'sea_airport': (47.4502, -122.3088)
    }
    if 'latitude' in df_spatial.columns and 'longitude' in df_spatial.columns:
        # 1. Distance-based features (kilometers)
        print("Creating distance-based features...")
        for location, coords in reference_points.items():
            distances = []
            for _, row in df_spatial.iterrows():
                if pd.notna(row['latitude']) and pd.notna(row['longitude']):
                    try:
                        distance = geodesic(
                            (row['latitude'], row['longitude']),
                            coords
                        ).kilometers
                        distances.append(distance)
                    except (ValueError, TypeError):
                        # Invalid coordinates (e.g. latitude outside [-90, 90])
                        distances.append(np.nan)
                else:
                    distances.append(np.nan)
            df_spatial[f'distance_to_{location}'] = distances
        # 2. Geographic clustering for neighborhood analysis
        print("Creating geographic clusters...")
        valid_coords = df_spatial[['latitude', 'longitude']].dropna()
        if len(valid_coords) > 50:
            cluster_configs = {
                'macro_neighborhood': min(25, len(valid_coords)//4),
                'micro_neighborhood': min(100, len(valid_coords)//2),
                'local_area': min(200, len(valid_coords))
            }
            for cluster_name, n_clusters in cluster_configs.items():
                if len(valid_coords) >= n_clusters and n_clusters > 1:
                    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
                    # Rows without coordinates keep the placeholder label -1
                    df_spatial[f'{cluster_name}_cluster'] = -1
                    clusters = kmeans.fit_predict(valid_coords)
                    df_spatial.loc[valid_coords.index, f'{cluster_name}_cluster'] = clusters
        # 3. Geographic zones
        print("Creating geographic zones...")
        def categorize_location(lat, lon):
            if pd.isna(lat) or pd.isna(lon):
                return 'Unknown'
            # Seattle metro area zones (checked in order; the first match wins)
            if 47.6 <= lat <= 47.8 and -122.4 <= lon <= -122.2:
                return 'Seattle_Core'
            elif 47.5 <= lat <= 47.7 and -122.2 <= lon <= -122.0:
                return 'Eastside_Core'
            elif 47.3 <= lat <= 47.5:
                return 'South_County'
            elif lat >= 47.7:
                return 'North_County'
            else:
                return 'Other'
        df_spatial['geographic_zone'] = df_spatial.apply(
            lambda row: categorize_location(row['latitude'], row['longitude']),
            axis=1
        )
    print(f"Created {len([col for col in df_spatial.columns if col not in df.columns])} spatial features")
    return df_spatial
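The row-by-row geodesic loop above is accurate but can be slow on tens of thousands of rows. As a rough sketch of a faster alternative, the haversine formula can be vectorized with NumPy. This is a spherical approximation (a mean Earth radius of 6371 km is assumed), not the ellipsoidal geodesic used in create_spatial_features, and the helper name haversine_km is hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in km from each (lat, lon) to one reference point."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    ref_lat, ref_lon = np.radians(ref_lat), np.radians(ref_lon)
    dlat = lat - ref_lat
    dlon = lon - ref_lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(ref_lat) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two of the reference points from above, treated here as "properties"
lats = np.array([47.6062, 47.6101])      # Seattle downtown, Bellevue downtown
lons = np.array([-122.3321, -122.2015])
dist_to_seattle = haversine_km(lats, lons, 47.6062, -122.3321)
print(dist_to_seattle.round(2))
```

Missing coordinates propagate as NaN, so no explicit null check is needed. For distances of a few tens of kilometers the difference from the geodesic result should be small, but that is worth verifying before swapping it in.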
Temporal Feature Engineering
# Sale seasonality: month and quarter indicators with cyclical encoding
# Market timing: time since major economic events
# Market cycle: pre-crisis, recovery, and COVID-era flags
# Property age: age at sale, years since renovation
def create_temporal_features(df):
    """
    Create temporal features for time-based patterns in house price prediction.
    Parameters:
        df (DataFrame): Input dataframe containing sale_date and property age data
    Returns:
        DataFrame: Enhanced dataframe with temporal features
    """
    df_temporal = df.copy()
    if 'sale_date' in df_temporal.columns:
        print("Processing sale date features...")
        # Ensure sale_date is datetime
        df_temporal['sale_date'] = pd.to_datetime(df_temporal['sale_date'])
        # 1. Basic time components
        df_temporal['sale_year'] = df_temporal['sale_date'].dt.year
        df_temporal['sale_month'] = df_temporal['sale_date'].dt.month
        df_temporal['sale_quarter'] = df_temporal['sale_date'].dt.quarter
        df_temporal['sale_day_of_year'] = df_temporal['sale_date'].dt.dayofyear
        df_temporal['sale_week_of_year'] = df_temporal['sale_date'].dt.isocalendar().week
        # 2. Cyclical encoding so that December and January end up adjacent
        df_temporal['month_sin'] = np.sin(2 * np.pi * df_temporal['sale_month'] / 12)
        df_temporal['month_cos'] = np.cos(2 * np.pi * df_temporal['sale_month'] / 12)
        df_temporal['quarter_sin'] = np.sin(2 * np.pi * df_temporal['sale_quarter'] / 4)
        df_temporal['quarter_cos'] = np.cos(2 * np.pi * df_temporal['sale_quarter'] / 4)
        # 3. Seasonal indicators
        df_temporal['is_spring'] = ((df_temporal['sale_month'] >= 3) &
                                    (df_temporal['sale_month'] <= 5)).astype(int)
        df_temporal['is_summer'] = ((df_temporal['sale_month'] >= 6) &
                                    (df_temporal['sale_month'] <= 8)).astype(int)
        df_temporal['is_fall'] = ((df_temporal['sale_month'] >= 9) &
                                  (df_temporal['sale_month'] <= 11)).astype(int)
        df_temporal['is_winter'] = ((df_temporal['sale_month'] == 12) |
                                    (df_temporal['sale_month'] <= 2)).astype(int)
        # 4. Market timing features (negative values mean the sale predates the event)
        print("Creating market timing features...")
        market_events = {
            '2000-03-10': 'dot_com_peak',
            '2008-09-15': 'financial_crisis',
            '2020-03-15': 'covid_start'
        }
        for event_date, event_name in market_events.items():
            event_datetime = pd.to_datetime(event_date)
            days_since = (df_temporal['sale_date'] - event_datetime).dt.days
            df_temporal[f'days_since_{event_name}'] = days_since
            df_temporal[f'months_since_{event_name}'] = days_since / 30.44
        # 5. Market cycle indicators
        df_temporal['is_pre_2008_crisis'] = (df_temporal['sale_date'] < '2008-01-01').astype(int)
        df_temporal['is_post_2008_recovery'] = ((df_temporal['sale_date'] >= '2012-01-01') &
                                                (df_temporal['sale_date'] < '2020-01-01')).astype(int)
        df_temporal['is_covid_era'] = (df_temporal['sale_date'] >= '2020-01-01').astype(int)
    # 6. Property age features
    if 'year_built' in df_temporal.columns and 'sale_date' in df_temporal.columns:
        print("Creating property age features...")
        # Negative values of property_age_at_sale represent land sales where construction
        # occurred after the sale. Note that the flag below also includes age 0
        # (homes sold in the year they were built).
        df_temporal['property_age_at_sale'] = df_temporal['sale_year'] - df_temporal['year_built']
        df_temporal['is_land_sale_post_construction'] = (df_temporal['property_age_at_sale'] <= 0).astype(int)
        df_temporal['is_vintage'] = (df_temporal['property_age_at_sale'] >= 50).astype(int)
    # 7. Renovation timing
    if 'year_reno' in df_temporal.columns and 'sale_date' in df_temporal.columns:
        print("Creating renovation features...")
        # year_reno appears as 0 when no renovation is recorded (see the sample rows above),
        # so this difference is roughly the sale year for those homes; has_been_renovated
        # distinguishes the two cases.
        df_temporal['years_since_renovation'] = df_temporal['sale_year'] - df_temporal['year_reno']
        df_temporal['has_been_renovated'] = (df_temporal['year_reno'] > df_temporal['year_built']).astype(int)
    print(f"Created {len([col for col in df_temporal.columns if col not in df.columns])} temporal features")
    return df_temporal
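To illustrate why the sine/cosine encoding is used for months, the following small sketch maps each month to a point on the unit circle and compares distances between encoded months. The helper names encode_month and encoded_distance are hypothetical and exist only for this illustration.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(month_a, month_b):
    """Euclidean distance between two encoded months."""
    return float(np.linalg.norm(encode_month(month_a) - encode_month(month_b)))

dec_to_jan = encoded_distance(12, 1)   # adjacent months across the year boundary
jan_to_feb = encoded_distance(1, 2)    # adjacent months within the year
dec_to_jun = encoded_distance(12, 6)   # half a year apart
print(dec_to_jan, jan_to_feb, dec_to_jun)
```

With the raw month number, December (12) and January (1) are 11 units apart; in the encoded space they are exactly as close as any other pair of neighboring months, while months half a year apart sit on opposite sides of the circle.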
Property Characteristic Enhancement
# Size ratios: house-to-lot ratio and square feet per bedroom
# Bathroom totals: weighted bathroom count and bathroom-to-bedroom ratio
# Size categories: binned lot size
# Story flags: single-story vs. multi-story
def create_property_features(df):
    """
    Create enhanced property characteristic features for house price prediction.
    Parameters:
        df (DataFrame): Input dataframe containing property characteristics
    Returns:
        DataFrame: Enhanced dataframe with property features
    """
    df_property = df.copy()
    print("Creating property characteristic features...")
    # 1. Size and ratio features
    # (sqft_lot is not among the selected features above, so this step is skipped here)
    if 'sqft' in df_property.columns and 'sqft_lot' in df_property.columns:
        df_property['house_to_lot_ratio'] = (df_property['sqft'] /
                                             df_property['sqft_lot'].replace(0, np.nan))
        # Lot size categories
        df_property['lot_size_category'] = pd.cut(
            df_property['sqft_lot'],
            bins=[0, 5000, 10000, 20000, np.inf],
            labels=['Small', 'Medium', 'Large', 'XLarge']
        )
    # 2. Room efficiency and ratios (zero bedrooms become NaN to avoid division by zero)
    if 'beds' in df_property.columns and 'sqft' in df_property.columns:
        df_property['sqft_per_bedroom'] = (df_property['sqft'] /
                                           df_property['beds'].replace(0, np.nan))
    # 3. Bathroom calculations
    bath_columns = ['bath_full', 'bath_3qtr', 'bath_half']
    available_bath_cols = [col for col in bath_columns if col in df_property.columns]
    if available_bath_cols:
        df_property['total_bathrooms'] = 0.0
        weights = {'bath_full': 1.0, 'bath_3qtr': 0.75, 'bath_half': 0.5}
        for col in available_bath_cols:
            df_property['total_bathrooms'] += (
                df_property[col].fillna(0) * weights.get(col, 1.0)
            )
        if 'beds' in df_property.columns:
            df_property['bathroom_bedroom_ratio'] = (
                df_property['total_bathrooms'] /
                df_property['beds'].replace(0, np.nan)
            )
    # 4. Property type categorization
    # Multi-story classification
    if 'stories' in df_property.columns:
        df_property['is_single_story'] = (df_property['stories'] == 1.0).astype(int)
        df_property['is_multi_story'] = (df_property['stories'] > 1.0).astype(int)
    print(f"Created {len([col for col in df_property.columns if col not in df.columns])} property features")
    return df_property
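The bathroom weighting and the zero-bedroom handling can be checked on a toy frame. This is only an illustration with made-up rows; the column names and weights mirror create_property_features.

```python
import numpy as np
import pandas as pd

# Two made-up properties: one ordinary home, one with zero recorded bedrooms
toy = pd.DataFrame({
    'beds':      [3, 0],
    'bath_full': [1, 1],
    'bath_3qtr': [1, 0],
    'bath_half': [1, 0],
})
weights = {'bath_full': 1.0, 'bath_3qtr': 0.75, 'bath_half': 0.5}

# Weighted bathroom count: full = 1, three-quarter = 0.75, half = 0.5
toy['total_bathrooms'] = sum(toy[col] * w for col, w in weights.items())
# Replacing 0 bedrooms with NaN yields a NaN ratio instead of infinity
toy['bathroom_bedroom_ratio'] = toy['total_bathrooms'] / toy['beds'].replace(0, np.nan)
print(toy[['total_bathrooms', 'bathroom_bedroom_ratio']])
```

The second row ends up with a NaN ratio, which is presumably the source of the NaN counts reported for sqft_per_bedroom and bathroom_bedroom_ratio in the dataframe analysis; the median imputer in the modeling pipeline fills these later.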
Market Context Features
# Rolling market indicators: rolling median, spread, and volatility of sale prices
# Yearly market statistics: median, mean, spread, and sales count per year
# Price momentum: rolling linear trend of sale prices
# Supply/demand proxy: monthly sales volume by city
def create_market_context_features(df):
    """
    Create market context and comparative features for house price prediction.
    Parameters:
        df (DataFrame): Input dataframe containing sale data and location information
    Returns:
        DataFrame: Enhanced dataframe with market context features
    """
    df_market = df.copy()
    print("Creating market context features...")
    # 1. Time-based market features
    if 'sale_date' in df_market.columns and 'adjusted_sale_price' in df_market.columns:
        print("Processing temporal market patterns...")
        # Ensure proper datetime format
        df_market['sale_date'] = pd.to_datetime(df_market['sale_date'])
        df_market_sorted = df_market.sort_values('sale_date').copy()
        # Rolling market indicators. The window is a count of up to 180 consecutive
        # sales ordered by date, not a 180-day calendar window, despite the "_6m" suffix.
        # NOTE: each window includes the current sale's own price (the target).
        window_size = min(180, len(df_market_sorted) // 4)  # Adaptive window size
        if window_size >= 10:
            df_market_sorted['rolling_median_price_6m'] = (
                df_market_sorted['adjusted_sale_price']
                .rolling(window=window_size, min_periods=10)
                .median()
            )
            df_market_sorted['rolling_std_price_6m'] = (
                df_market_sorted['adjusted_sale_price']
                .rolling(window=window_size, min_periods=10)
                .std()
            )
            # Price volatility (rolling standard deviation relative to rolling median)
            df_market_sorted['market_volatility_6m'] = (
                df_market_sorted['rolling_std_price_6m'] /
                df_market_sorted['rolling_median_price_6m']
            )
        # Restore original order
        df_market = df_market_sorted.sort_index()
        # Year-over-year market features
        if 'sale_year' in df_market.columns:
            yearly_stats = df_market.groupby('sale_year')['adjusted_sale_price'].agg([
                'median', 'mean', 'std', 'count'
            ]).reset_index()
            yearly_stats.columns = ['sale_year', 'yearly_median_price', 'yearly_mean_price',
                                    'yearly_price_std', 'yearly_sale_count']
            # Note: merge() returns a frame with a fresh 0..n-1 index
            df_market = df_market.merge(yearly_stats, on='sale_year', how='left')
            # Market activity indicators
            # NOTE: relative_to_yearly_median is computed from adjusted_sale_price itself,
            # so using it as a model input leaks the target.
            df_market['relative_to_yearly_median'] = (
                df_market['adjusted_sale_price'] / df_market['yearly_median_price']
            )
            df_market['is_high_activity_year'] = (
                df_market['yearly_sale_count'] >
                df_market['yearly_sale_count'].quantile(0.75)
            ).astype(int)
    # 2. Price trend features
    if all(col in df_market.columns for col in ['sale_date', 'adjusted_sale_price']):
        print("Creating price momentum indicators...")
        df_market_sorted = df_market.sort_values('sale_date').copy()
        # Price momentum: slope of a linear fit over the last 30 or 90 sales
        # (row counts, not days; the current sale is included in the window)
        for window in [30, 90]:
            window_size = min(window, len(df_market_sorted) // 10)
            if window_size >= 5:
                df_market_sorted[f'price_trend_{window}d'] = (
                    df_market_sorted['adjusted_sale_price']
                    .rolling(window=window_size, min_periods=5)
                    .apply(lambda x: np.polyfit(range(len(x)), x, 1)[0]
                           if len(x) >= 5 else np.nan, raw=False)
                )
        # Market momentum indicators
        if 'price_trend_30d' in df_market_sorted.columns:
            df_market_sorted['is_rising_market_30d'] = (
                df_market_sorted['price_trend_30d'] > 0
            ).astype(int)
            df_market_sorted['is_falling_market_30d'] = (
                df_market_sorted['price_trend_30d'] < 0
            ).astype(int)
        # Restore original order
        df_market = df_market_sorted.sort_index()
    # 3. Supply and demand indicators
    print("Creating supply/demand proxies...")
    # Monthly sales volume by city
    if all(col in df_market.columns for col in ['sale_date', 'city']):
        df_market['year_month'] = df_market['sale_date'].dt.to_period('M')
        monthly_volume = df_market.groupby(['city', 'year_month']).size().reset_index(name='monthly_volume')
        monthly_volume['year_month'] = monthly_volume['year_month'].astype(str)
        df_market['year_month'] = df_market['year_month'].astype(str)
        df_market = df_market.merge(monthly_volume, on=['city', 'year_month'], how='left')
        # Market heat indicator
        df_market['market_heat'] = pd.cut(
            df_market['monthly_volume'].fillna(0),
            bins=[0, 5, 15, 30, np.inf],
            labels=['Cold', 'Warm', 'Hot', 'Very_Hot']
        )
    # Clean up temporary columns
    df_market = df_market.drop(columns=['year_month'], errors='ignore')
    print(f"Created {len([col for col in df_market.columns if col not in df.columns])} market context features")
    return df_market
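The rolling features above are computed over a fixed number of rows and include the current sale's own price. As a hedged sketch of one alternative (not the approach used in this notebook), pandas can compute a calendar-based window that only looks at strictly earlier sales by combining an offset window with closed='left'. The toy data and the column name prior_median_180d are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sales, already ordered by date
sales = pd.DataFrame({
    'sale_date': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01', '2020-12-01']),
    'price': [100.0, 200.0, 300.0, 400.0],
}).sort_values('sale_date')

# Median price of sales in the 180 days strictly before each sale.
# closed='left' uses the window [t - 180 days, t), which excludes the sale at time t.
prior_median = (
    sales.set_index('sale_date')['price']
    .rolling('180D', closed='left')
    .median()
)
sales['prior_median_180d'] = prior_median.to_numpy()
print(sales)
```

The first sale has no history and the December sale has no sales in the preceding 180 days, so both get NaN; the February and March sales see only earlier prices. A feature built this way does not contain the price it is meant to help predict.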
df_enhanced = create_spatial_features(train_data)
df_enhanced = create_temporal_features(df_enhanced)
df_enhanced = create_property_features(df_enhanced)
df_enhanced = create_market_context_features(df_enhanced)
analyze_dataframe(df_enhanced)
Creating distance-based features...
Creating geographic clusters...
Creating geographic zones...
Created 8 spatial features
Processing sale date features...
Creating market timing features...
Creating property age features...
Creating renovation features...
Created 27 temporal features
Creating property characteristic features...
Created 5 property features
Creating market context features...
Processing temporal market patterns...
Creating price momentum indicators...
Creating supply/demand proxies...
Created 15 market context features
==================================================
DATAFRAME ANALYSIS
==================================================
Shape: (83431, 114)
Data types:
int64 70
float64 29
object 8
int32 4
datetime64[ns] 1
UInt32 1
category 1
Name: count, dtype: int64
--- NUMERIC COLUMNS (104) ---
sqft_per_bedroom:
- Infinite values: 0
- NaN values: 27
- Extremely large values: 0
bathroom_bedroom_ratio:
- Infinite values: 0
- NaN values: 27
- Extremely large values: 0
rolling_median_price_6m:
- Infinite values: 0
- NaN values: 9
- Extremely large values: 0
rolling_std_price_6m:
- Infinite values: 0
- NaN values: 9
- Extremely large values: 0
market_volatility_6m:
- Infinite values: 0
- NaN values: 9
- Extremely large values: 0
price_trend_30d:
- Infinite values: 0
- NaN values: 4
- Extremely large values: 0
price_trend_90d:
- Infinite values: 0
- NaN values: 4
- Extremely large values: 0
--- NON-NUMERIC COLUMNS (10) ---
sale_date: datetime64[ns], 313 unique values, 0 (0.0) missing
sale_nbr: object, 3 unique values, 0 (0.0) missing
city: object, 40 unique values, 0 (0.0) missing
zoning_category: object, 6 unique values, 0 (0.0) missing
submarket: object, 20 unique values, 0 (0.0) missing
finished_basement_type: object, 5 unique values, 0 (0.0) missing
attached_garage_type: object, 5 unique values, 0 (0.0) missing
basement_garage_type: object, 5 unique values, 0 (0.0) missing
geographic_zone: object, 5 unique values, 0 (0.0) missing
market_heat: category, 4 unique values, 0 (0.0) missing
df_enhanced.to_csv('train_full_features.csv')
5. Model Development
In this section, we implement a Quantile Regression Forest model to predict house price intervals:
Model Selection: We use RandomForestQuantileRegressor to generate prediction intervals rather than point estimates.
Feature Preparation: We impute and standardize the numeric features, one-hot encode the categorical features, and keep the 50 features with the highest univariate F-scores (SelectKBest with f_regression).
Hyperparameter Tuning: We use GridSearchCV with 3-fold cross-validation to search over:
- Number of trees (n_estimators)
- Maximum tree depth (max_depth)
Model Training: We train the model on our prepared training data.
Prediction Generation: We generate predictions for the 5th, 50th (median), and 95th percentiles to create 90% prediction intervals.
This approach allows us to quantify uncertainty in our predictions, providing a range of likely values rather than a single point estimate.
X = df_enhanced.drop(['adjusted_sale_price', 'log_adj_sale_price', 'id'], axis=1)
y = df_enhanced['adjusted_sale_price']
#y = df_enhanced['log_adj_sale_price']
numerical_features = []
categorical_features = []
datetime_features = []
# Note: only int64/float64 columns are treated as numeric here. Columns with other
# dtypes (the datetime sale_date, plus the int32/UInt32 date parts such as sale_year
# and sale_month) fall into datetime_features and are not passed to the model.
for col in X.columns:
    if (X[col].dtype == 'int64') or (X[col].dtype == 'float64'):
        numerical_features.append(col)
    elif (X[col].dtype == 'object') or (X[col].dtype == 'category'):
        categorical_features.append(col)
    else:
        datetime_features.append(col)
# Define transformations
numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=34)
X_train_transformed = preprocessor.fit_transform(X_train)
feature_names = preprocessor.get_feature_names_out()
X_train_transformed = pd.DataFrame(X_train_transformed, columns=feature_names)
X_test_transformed = preprocessor.transform(X_test)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=feature_names)
selector = SelectKBest(score_func=f_regression, k=50)
X_train_transformed_small = selector.fit_transform(X_train_transformed, y_train)
X_test_transformed_small = selector.transform(X_test_transformed)
X_train_transformed_small.shape
(62573, 50)
X_train_transformed_small
array([[-0.11786295, 0.66478132, -0.30115317, ..., 0. , 0. , 0. ], [-0.20352627, 0.16128939, 2.13725661, ..., 0. , 1. , 0. ], [ 1.40805855, 0.29008965, 0.73132665, ..., 0. , 0. , 0. ], ..., [ 0.58137264, -0.22511139, -0.45492676, ..., 0. , 0. , 0. ], [-0.35256653, -0.56467572, 0.77526196, ..., 0. , 1. , 0. ], [ 0.77986083, -0.68176686, -0.45492676, ..., 0. , 0. , 0. ]], shape=(62573, 50))
selector.get_feature_names_out()
array(['num__latitude', 'num__sqft', 'num__sqft_1', 'num__stories', 'num__beds', 'num__bath_full', 'num__bath_3qtr', 'num__bath_half', 'num__has_view_territorial', 'num__has_view_lakewash', 'num__log_sqft', 'num__log_sqft_1', 'num__distance_to_seattle_downtown', 'num__distance_to_bellevue_downtown', 'num__distance_to_redmond_downtown', 'num__distance_to_sea_airport', 'num__days_since_dot_com_peak', 'num__months_since_dot_com_peak', 'num__days_since_financial_crisis', 'num__months_since_financial_crisis', 'num__days_since_covid_start', 'num__months_since_covid_start', 'num__is_pre_2008_crisis', 'num__sqft_per_bedroom', 'num__total_bathrooms', 'num__bathroom_bedroom_ratio', 'num__is_single_story', 'num__is_multi_story', 'num__rolling_median_price_6m', 'num__rolling_std_price_6m', 'num__yearly_median_price', 'num__yearly_mean_price', 'num__yearly_price_std', 'num__yearly_sale_count', 'num__relative_to_yearly_median', 'num__is_high_activity_year', 'num__price_trend_30d', 'num__price_trend_90d', 'num__is_rising_market_30d', 'num__is_falling_market_30d', 'cat__city_MERCER ISLAND', 'cat__submarket_D', 'cat__submarket_I', 'cat__submarket_O', 'cat__submarket_R', 'cat__submarket_S', 'cat__finished_basement_type_Large', 'cat__attached_garage_type_Large', 'cat__geographic_zone_Eastside_Core', 'cat__geographic_zone_South_County'], dtype=object)
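The dtype check in the feature-typing loop above matches only int64 and float64, which is why none of the int32 date parts (sale_year, sale_month, and so on) appear among the selected features. The following sketch, on a made-up frame, contrasts that strict check with pandas' is_numeric_dtype helper. It is shown only to make the difference visible; switching the notebook to the broader check would change the feature set and therefore the results reported here.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up frame with one column per dtype of interest
toy = pd.DataFrame({
    'sqft': np.array([2040, 1640], dtype='int64'),
    'sale_year': np.array([1999, 2006], dtype='int32'),
    'latitude': [47.6531, 47.4733],
    'city': ['KIRKLAND', 'RENTON'],
})

# Strict check, as in the loop above: only int64 and float64 count as numeric
strict = [c for c in toy.columns
          if (toy[c].dtype == 'int64') or (toy[c].dtype == 'float64')]
# Broader check: any numeric dtype (note that this also accepts booleans)
broad = [c for c in toy.columns if is_numeric_dtype(toy[c])]

print(strict)
print(broad)
```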
Quantile Regression Forest
Rationale:
- Directly estimates quantiles (5th and 95th percentiles for 90% intervals)
- Handles non-linear relationships and feature interactions naturally
- Robust to outliers and free of parametric distributional assumptions
- Proven effectiveness in real estate price prediction
Implementation:
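- Grow a standard random forest and estimate quantiles from the training targets that share leaves with each query point (no quantile loss is optimized)
- Estimate the 5th, 50th, and 95th percentiles from a single fitted forest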
- Bootstrap aggregation for additional stability
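To make the mechanism behind these points concrete, here is a minimal sketch of the quantile regression forest idea built on scikit-learn's RandomForestRegressor. It is an illustration under simplifying assumptions (synthetic data, a hypothetical predict_quantiles helper, all training rows counted when weighting a leaf), not the implementation inside RandomForestQuantileRegressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with noise that grows with x (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1 + 0.3 * X[:, 0])

forest = RandomForestRegressor(
    n_estimators=50, min_samples_leaf=20, random_state=0
).fit(X, y)
train_leaves = forest.apply(X)  # leaf index of every training row in every tree

def predict_quantiles(x_new, quantiles):
    """Weighted empirical quantiles of training targets sharing leaves with x_new."""
    query_leaves = forest.apply(np.asarray(x_new, dtype=float).reshape(1, -1))[0]
    same_leaf = train_leaves == query_leaves           # shape (n_train, n_trees)
    # Per tree, spread a weight of 1 evenly over the leaf's training rows,
    # then average the weights over trees (they sum to 1)
    weights = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)
    order = np.argsort(y)
    cumulative = np.cumsum(weights[order])
    idx = np.minimum(np.searchsorted(cumulative, quantiles), len(y) - 1)
    return y[order][idx]

lo, med, hi = predict_quantiles([5.0], [0.05, 0.5, 0.95])
print(lo, med, hi)
```

The trees themselves are ordinary regression trees; only the prediction step differs, replacing the mean of the leaf targets with quantiles of their weighted empirical distribution.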
from sklearn.model_selection import GridSearchCV
# Define the model
qrf = RandomForestQuantileRegressor(random_state=42)
# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
}
# Set up GridSearchCV. The MAE score is computed on the model's default predict()
# output, so it measures point accuracy rather than interval quality.
grid_search = GridSearchCV(
    estimator=qrf,
    param_grid=param_grid,
    cv=3,
    scoring='neg_mean_absolute_error',
    verbose=90
)
# Fit the model
grid_search.fit(X_train_transformed_small, y_train)
# Best model
best_qrf = grid_search.best_estimator_
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV 1/3; 1/6] START max_depth=None, n_estimators=100............................
[CV 1/3; 1/6] END max_depth=None, n_estimators=100;, score=-2085.218 total time= 1.7min
[CV 2/3; 1/6] START max_depth=None, n_estimators=100............................
[CV 2/3; 1/6] END max_depth=None, n_estimators=100;, score=-2184.625 total time= 1.7min
[CV 3/3; 1/6] START max_depth=None, n_estimators=100............................
[CV 3/3; 1/6] END max_depth=None, n_estimators=100;, score=-2144.231 total time= 1.6min
[CV 1/3; 2/6] START max_depth=None, n_estimators=200............................
[CV 1/3; 2/6] END max_depth=None, n_estimators=200;, score=-2050.715 total time= 3.3min
[CV 2/3; 2/6] START max_depth=None, n_estimators=200............................
[CV 2/3; 2/6] END max_depth=None, n_estimators=200;, score=-2146.086 total time= 3.3min
[CV 3/3; 2/6] START max_depth=None, n_estimators=200............................
[CV 3/3; 2/6] END max_depth=None, n_estimators=200;, score=-2154.586 total time= 3.2min
[CV 1/3; 3/6] START max_depth=10, n_estimators=100..............................
[CV 1/3; 3/6] END max_depth=10, n_estimators=100;, score=-5267.360 total time= 1.1min
[CV 2/3; 3/6] START max_depth=10, n_estimators=100..............................
[CV 2/3; 3/6] END max_depth=10, n_estimators=100;, score=-5422.035 total time= 1.1min
[CV 3/3; 3/6] START max_depth=10, n_estimators=100..............................
[CV 3/3; 3/6] END max_depth=10, n_estimators=100;, score=-5547.786 total time= 1.1min
[CV 1/3; 4/6] START max_depth=10, n_estimators=200..............................
[CV 1/3; 4/6] END max_depth=10, n_estimators=200;, score=-5129.712 total time= 2.3min
[CV 2/3; 4/6] START max_depth=10, n_estimators=200..............................
[CV 2/3; 4/6] END max_depth=10, n_estimators=200;, score=-5256.372 total time= 2.0min
[CV 3/3; 4/6] START max_depth=10, n_estimators=200..............................
[CV 3/3; 4/6] END max_depth=10, n_estimators=200;, score=-5425.331 total time= 2.1min
[CV 1/3; 5/6] START max_depth=20, n_estimators=100..............................
[CV 1/3; 5/6] END max_depth=20, n_estimators=100;, score=-2087.715 total time= 1.7min
[CV 2/3; 5/6] START max_depth=20, n_estimators=100..............................
[CV 2/3; 5/6] END max_depth=20, n_estimators=100;, score=-2144.319 total time= 1.7min
[CV 3/3; 5/6] START max_depth=20, n_estimators=100..............................
[CV 3/3; 5/6] END max_depth=20, n_estimators=100;, score=-2172.198 total time= 1.7min
[CV 1/3; 6/6] START max_depth=20, n_estimators=200..............................
[CV 1/3; 6/6] END max_depth=20, n_estimators=200;, score=-2069.367 total time= 3.3min
[CV 2/3; 6/6] START max_depth=20, n_estimators=200..............................
[CV 2/3; 6/6] END max_depth=20, n_estimators=200;, score=-2131.047 total time= 3.3min
[CV 3/3; 6/6] START max_depth=20, n_estimators=200..............................
[CV 3/3; 6/6] END max_depth=20, n_estimators=200;, score=-2163.186 total time= 3.3min
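The grid search above is scored with mean absolute error, which evaluates a single point prediction. A metric that targets a specific quantile is the pinball (quantile) loss. The sketch below implements it with NumPy on made-up numbers; pinball_loss is a hypothetical helper and is not part of the grid search in this notebook.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss of predictions y_pred for quantile level q (0 < q < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# For the 95th percentile, predicting too low is penalized far more than too high
too_low = pinball_loss([10.0], [8.0], q=0.95)
too_high = pinball_loss([10.0], [12.0], q=0.95)
print(too_low, too_high)
```

Averaging this loss over the 5th and 95th percentile predictions would reward hyperparameters that produce well-placed interval bounds rather than an accurate median alone.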
# In-sample predictions at multiple quantiles (made on the training data, so the intervals are very tight)
quantiles = [0.05, 0.5, 0.95] # 5th, 50th, 95th percentiles
predictions = best_qrf.predict(X_train_transformed_small, quantiles=quantiles)
print(predictions[:10])
[[1268545. 1273353. 1273353. ]
 [ 716555. 718299. 719171. ]
 [ 632647. 634854. 634854. ]
 [ 676359. 676359. 676359. ]
 [ 674081. 674081. 674081. ]
 [ 696795. 696795. 696795. ]
 [ 303721. 309510. 311631.55]
 [ 664732. 666779. 672411.15]
 [ 509330. 509330. 509330. ]
 [1459969. 1460076. 1460076. ]]
5.1 Model Evaluation
We evaluate our quantile regression forest model using several approaches:
Interval Coverage: We measure the percentage of actual prices that fall within our predicted 90% intervals.
Interval Width: We analyze the width of prediction intervals to assess model confidence across different property types.
Geospatial Visualization: We plot predictions on a map of King County, color-coding by prediction quality:
- Green (Excellent): Within interval and close to median prediction
- Gold (Good): Within interval but further from median
- Red (Poor): Outside the prediction interval
This comprehensive evaluation helps us understand where our model performs well and where it might need improvement.
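The first two measures described above, interval coverage and interval width, can be computed in a few lines. The following is a minimal sketch on made-up numbers; interval_metrics is a hypothetical helper, and in the notebook its inputs would be the test targets and the 5th and 95th percentile predictions.

```python
import numpy as np

def interval_metrics(y_true, lower, upper):
    """Coverage and width summaries for a set of prediction intervals."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    covered = (y_true >= lower) & (y_true <= upper)
    width = upper - lower
    return {
        'coverage': float(covered.mean()),              # share of actuals inside the interval
        'mean_width': float(width.mean()),              # average width in dollars
        'median_relative_width': float(np.median(width / y_true)),  # width relative to price
    }

# Made-up prices and interval bounds for four properties
y_true = np.array([500_000, 650_000, 300_000, 1_200_000])
lower = np.array([450_000, 600_000, 320_000, 1_000_000])
upper = np.array([550_000, 700_000, 400_000, 1_500_000])
metrics = interval_metrics(y_true, lower, upper)
print(metrics)
```

Coverage should be read together with width: an interval can always reach high coverage by being very wide, so a useful model keeps coverage near the nominal 90% while keeping the relative width small.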
def plot_predictions_on_king_county(model, X_data, y_actual, original_data,
                                    dataset_type='test', quantiles=(0.05, 0.95),
                                    figsize=(15, 12), point_size=15, alpha=0.7):
    """
    Plot house price predictions on King County boundary map with color-coded accuracy.
    original_data must be row-aligned with X_data and contain 'latitude' and 'longitude'.
    """
    print("Downloading King County boundary from OpenStreetMap...")
    # Create figure
    fig, ax = plt.subplots(figsize=figsize)
    king_county = ox.geocode_to_gdf("King County, Washington, USA")
    # Plot boundary
    king_county.plot(ax=ax, color='lightgray', edgecolor='black', alpha=0.3, linewidth=2)
    print(f"Making predictions for {len(X_data)} properties...")
    # Make predictions
    predictions = model.predict(X_data, quantiles=list(quantiles))
    lower_bound = predictions[:, 0]
    upper_bound = predictions[:, 1]
    median_pred = np.ravel(model.predict(X_data, quantiles=[0.5]))
    # Determine accuracy categories
    within_interval = np.array((y_actual >= lower_bound) & (y_actual <= upper_bound))
    # Create color categories
    colors = []
    categories = []
    for i in range(len(y_actual)):
        actual = y_actual.iloc[i] if isinstance(y_actual, pd.Series) else y_actual[i]
        if within_interval[i]:
            # Check how close the actual price is to the median prediction
            interval_width = upper_bound[i] - lower_bound[i]
            if interval_width > 0:
                distance_from_center = abs(actual - median_pred[i])
                relative_position = distance_from_center / (interval_width / 2)
                if relative_position <= 0.3:  # Very close to the median
                    colors.append('green')
                    categories.append('Excellent')
                else:  # Within interval but further from the median
                    colors.append('gold')
                    categories.append('Good')
            else:
                # Zero-width interval that contains the actual price
                colors.append('green')
                categories.append('Excellent')
        else:
            colors.append('red')
            categories.append('Poor')
    # Get coordinates (rows are matched to X_data by position)
    if len(original_data) >= len(X_data):
        lats = original_data['latitude'].iloc[:len(X_data)].values
        lons = original_data['longitude'].iloc[:len(X_data)].values
    else:
        lats = original_data['latitude'].values
        lons = original_data['longitude'].values
    # Plot points by category
    for category, color in [('Excellent', 'green'), ('Good', 'gold'), ('Poor', 'red')]:
        mask = [c == category for c in categories]
        if sum(mask) > 0:
            mask_lats = [lats[i] for i, m in enumerate(mask) if m]
            mask_lons = [lons[i] for i, m in enumerate(mask) if m]
            ax.scatter(mask_lons, mask_lats, c=color, s=point_size, alpha=alpha,
                       label=f'{category} ({sum(mask)})',
                       edgecolors='black', linewidth=0.3, zorder=5)
    # Customize plot
    ax.set_xlabel('Longitude', fontsize=12, fontweight='bold')
    ax.set_ylabel('Latitude', fontsize=12, fontweight='bold')
    ax.set_title(f'King County House Price Predictions - {dataset_type.title()} Set\n'
                 f'{round((quantiles[1] - quantiles[0]) * 100, 2)}% Prediction Intervals',
                 fontsize=14, fontweight='bold', pad=20)
    # Add legend
    ax.legend(loc='upper right', frameon=True, fancybox=True, shadow=True)
    ax.grid(True, alpha=0.3, zorder=1)
    ax.set_aspect('equal', adjustable='box')
    # Calculate statistics
    total = len(categories)
    excellent_count = categories.count('Excellent')
    good_count = categories.count('Good')
    poor_count = categories.count('Poor')
    within_pct = (sum(within_interval) / total) * 100
    # Add statistics text box
    stats_text = (
        "Accuracy Summary:\n"
        f"Excellent: {excellent_count} ({excellent_count/total*100:.1f}%)\n"
        f"Good: {good_count} ({good_count/total*100:.1f}%)\n"
        f"Poor: {poor_count} ({poor_count/total*100:.1f}%)\n"
        f"Within Interval: {within_pct:.1f}%\n"
        f"Total: {total:,} properties"
    )
    ax.text(0.007, 0.24, stats_text, transform=ax.transAxes, fontsize=10,
            verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.9))
    # Set map bounds
    if len(lats) > 0 and len(lons) > 0:
        lat_margin = (max(lats) - min(lats)) * 0.05
        lon_margin = (max(lons) - min(lons)) * 0.05
        ax.set_xlim(min(lons) - lon_margin, max(lons) + lon_margin)
        ax.set_ylim(min(lats) - lat_margin, max(lats) + lat_margin)
    plt.tight_layout()
    # Create statistics dictionary
    stats = {
        'total': total,
        'excellent': excellent_count,
        'good': good_count,
        'poor': poor_count,
        'within_interval_pct': within_pct
    }
    # Print summary
    print(f"\n=== {dataset_type.title()} Set Results ===")
    print(f"🟢 Excellent: {excellent_count} ({excellent_count/total*100:.1f}%)")
    print(f"🟡 Good: {good_count} ({good_count/total*100:.1f}%)")
    print(f"🔴 Poor: {poor_count} ({poor_count/total*100:.1f}%)")
    print(f"Within {round((quantiles[1] - quantiles[0]) * 100, 2)}% Interval: {within_pct:.1f}%")
    return fig, ax, stats
fig, ax, stats = plot_predictions_on_king_county(
    model=best_qrf,                    # Trained model
    X_data=X_test_transformed_small,   # Preprocessed test features
    y_actual=y_test,                   # Actual prices
    # X_test is row-aligned with the transformed test features and still holds
    # latitude/longitude; the first rows of train_data would not match the test rows.
    original_data=X_test,
    dataset_type='test',
    quantiles=[0.05, 0.95],            # 90% prediction interval
    figsize=(16, 14),
    point_size=25,
    alpha=0.8
)
plt.show()
Downloading King County boundary from OpenStreetMap...
Making predictions for 20858 properties...
=== Test Set Results ===
🟢 Excellent: 16762 (80.4%)
🟡 Good: 3439 (16.5%)
🔴 Poor: 657 (3.1%)
Within 90.0% Interval: 96.9%
6. Conclusion and Additional Observations
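Our quantile regression forest model generates prediction intervals for house prices in King County, with approximately 96.9% of actual test-set prices falling within our nominal 90% prediction intervals. Coverage above the nominal level means the intervals are conservative, that is, somewhat wider than a well-calibrated 90% interval would need to be.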
Key findings:
Location Factors: Geographic location remains the strongest predictor of house prices, with waterfront properties and proximity to urban centers commanding significant premiums.
Prediction Confidence: Our model provides narrower (more confident) intervals for mid-range properties in established neighborhoods, while luxury properties and unusual homes have wider prediction intervals.
Model Limitations: The model performs less reliably for:
- Very high-end luxury properties
- Properties with unusual combinations of features
- Areas with limited comparable sales
Future work could explore ensemble methods combining multiple modeling approaches, incorporation of additional external data sources, and development of specialized models for different market segments.