
Pandas + Power BI - Properati

  • Understand data
  • Prepare data
  • Clean data
    • Remove unnecessary data
    • Fix time series related data
    • Replace NaNs by real and useful data
  • Build PBI dashboard

About

This dataset comes from Properati Data (https://www.properati.com.ar/data/), the data division of Properati, a real estate search site in Latin America.

The idea was to work 100% from Power BI, but I ran into difficulties (Power BI is very slow for some tasks), so I decided to first clean the data with Pandas and then use Power BI for the visuals.

I will be updating this page at least once a week

Topic

It contains information on online property ads (rentals and sales) from around 2020 to 2021, with details such as square meters, price, region, etc.

Features

    Col                Description
    type               Notice type (Property, Development/Project)
    country            Country in which the notice is published
    id                 Notice identifier. It is not unique: if the notice is updated by the real estate agency (new version of the notice), a new record is created with the same id but different dates
    start_date         Notice registration date
    end_date           Notice cancellation date
    place              Fields referring to the location of the property or development
    lat                Latitude
    lon                Longitude
    l1, l2, l3, l4     Country, province, city, neighborhood
    operation          Type of operation (Sale, Rent)
    type               Type of property (House, Apartment, PH)
    rooms              Number of rooms (useful in Argentina)
    bedrooms           Number of bedrooms (useful in the rest of the countries)
    bathrooms          Number of bathrooms
    surface_total      Total surface in m²
    surface_covered    Covered surface in m²
    price              Price
    currency           Currency of the published price
    price_period       Price period (Daily, Weekly, Monthly)
    title              Title of the ad
    description        Description of the ad
    development        Fields related to real estate development (empty if the ad is for a property)
    status             Development status (Completed, Under construction, …)
    name               Development name
    short_description  Short description of the ad
    description        Description of the ad

Exploring and filtering data

import pandas as pd
df = pd.read_csv('ar_properties.csv')
df.head()

(Screenshot: df.head() output)

df.info()

    RangeIndex: 1000000 entries, 0 to 999999
    Data columns (total 25 columns):
     #   Column           Non-Null Count    Dtype  
    ---  ------           --------------    -----  
     0   id               1000000 non-null  object 
     1   ad_type          1000000 non-null  object 
     2   start_date       1000000 non-null  object 
     3   end_date         1000000 non-null  object 
     4   created_on       1000000 non-null  object 
     5   lat              894233 non-null   float64
     6   lon              894191 non-null   float64
     7   l1               1000000 non-null  object 
     8   l2               1000000 non-null  object 
     9   l3               965273 non-null   object 
     10  l4               306162 non-null   object 
     11  l5               5530 non-null     object 
     12  l6               0 non-null        float64
     13  rooms            714179 non-null   float64
     14  bedrooms         649933 non-null   float64
     15  bathrooms        765122 non-null   float64
     16  surface_total    477831 non-null   float64
     17  surface_covered  487756 non-null   float64
     18  price            958243 non-null   float64
     19  currency         955491 non-null   object 
     20  price_period     429870 non-null   object 
     21  title            999999 non-null   object 
     22  description      999958 non-null   object 
     23  property_type    1000000 non-null  object 
     24  operation_type   1000000 non-null  object 
    dtypes: float64(9), object(16)
    memory usage: 190.7+ MB

The DataFrame uses about 190 MB of memory, so we will drop the following columns:

  • id: not needed, because it can be replaced by the DataFrame index
  • created_on: the same as start_date
  • l5, l6: mostly null
  • title, description: free text that I will not use in this notebook

This reduces memory usage by about 20%.

df = df.drop(['title', 'description','id','created_on','l5','l6'], axis = 1)
print('memory usage: '+str(df.memory_usage().sum()//1024000)+' MB')
    memory usage: 148 MB  # approx. -20% memory usage
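A side note I'll add here: memory_usage() without deep=True only estimates the size of object columns, so the real footprint is larger; deep=True measures the strings too.

# my addition: measure the true footprint, including string contents
print('deep memory usage: ' + str(df.memory_usage(deep=True).sum() // 1024000) + ' MB')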

What is inside each column? Roughly half float, half text…

df.dtypes

    ad_type             object
    start_date          object
    end_date            object
    lat                float64
    lon                float64
    l1                  object
    l2                  object
    l3                  object
    l4                  object
    rooms              float64
    bedrooms           float64
    bathrooms          float64
    surface_total      float64
    surface_covered    float64
    price              float64
    currency            object
    price_period        object
    property_type       object
    operation_type      object

We can further reduce memory usage later by converting some text columns to the Categorical dtype, as sketched below.
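A quick sketch of that idea (my addition, not run in the original notebook): the nunique() output below shows several text columns with very few distinct values, which makes them good Categorical candidates.

# sketch only: convert low-cardinality text columns to the Categorical dtype
# (the column list is an assumption based on this dataset's cardinalities)
cat_cols_mem = ['ad_type', 'l1', 'l2', 'l3', 'currency',
                'price_period', 'property_type', 'operation_type']
df[cat_cols_mem] = df[cat_cols_mem].astype('category')
print('memory usage: ' + str(df.memory_usage(deep=True).sum() // 1024000) + ' MB')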

df.nunique()

    ad_type                 1
    start_date            362
    end_date              450
    lat                389394
    lon                392553
    l1                      4
    l2                     43
    l3                   1367
    l4                   1070
    rooms                  38
    bedrooms               85
    bathrooms              20
    surface_total        5191
    surface_covered      3370
    price               20807
    currency                5
    price_period            3
    property_type          10
    operation_type          3
    dtype: int64

Nulls for each column (%)

(df.isna().mean()*100).round(1)

    ad_type             0.0
    start_date          0.0
    end_date            0.0
    lat                10.6
    lon                10.6
    l1                  0.0
    l2                  0.0
    l3                  3.5
    l4                 69.4
    rooms              28.6
    bedrooms           35.0
    bathrooms          23.5
    surface_total      52.2
    surface_covered    51.2
    price               4.2
    currency            4.5
    price_period       57.0
    property_type       0.0
    operation_type      0.0

Drop l4, because we can still locate properties using l1, l2, l3, lat and lon:

df = df.drop(['l4'],axis=1)
df.head(3)

Transforming start_date and end_date to datetime format… but we run into an error!

Some dates are formatted as 9999-12-31 00:00:00.

df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
df.head(3)

ERROR: OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 9999-12-31 00:00:00

Let’s fix it by replacing the 9999 year with the year from start_date :-)

import numpy as np
# where end_date starts with '9999', substitute the year taken from start_date
df['end_date'] = np.where(df.end_date.str[:4] == '9999', df.start_date.str[:4] + df.end_date.str[4:], df.end_date)
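With the out-of-bounds year patched, the conversion from above should now run cleanly; a quick re-check (not shown in the original post):

# re-run the conversion now that the out-of-bounds years are gone
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
print(df[['start_date', 'end_date']].dtypes)  # both should now be datetime64[ns]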

Check how many times each value appears in the categorical columns:

cat_cols = ['l1', 'currency', 'price_period', 'property_type', 'operation_type']

df.select_dtypes(include=object).columns.tolist()  # list the text columns to pick candidates
(pd.DataFrame(df[cat_cols].melt(var_name='column', value_name='value').value_counts())
   .rename(columns={0: 'qty'})
   .sort_values(by=['column', 'qty']))
                                            qty
    column          value
    currency        COP                       3
                    PEN                     292
                    UYU                    2785
                    ARS                  214254
                    USD                  738157
    l1              Brasil                  244
                    Estados Unidos          892
                    Uruguay               15749
                    Argentina            983115
    operation_type  Alquiler temporal     41679
                    Alquiler             207490
                    Venta                750831
    price_period    Diario                   11
                    Semanal                  17
                    Mensual              429842
    property_type   Casa de campo          2251
                    Depósito               7872
                    Cochera               13276
                    Otro                  27837
                    Oficina               32390
                    Local comercial       45593
                    PH                    45837
                    Lote                 121311
                    Casa                 245726
                    Departamento         457907

We will focus only on the Argentinian market, and also drop rows with rarely occurring values.

Currency

# keep only ads priced in ARS or USD
ars_usd_currency = df[(df['currency'] != 'ARS') & (df['currency'] != 'USD')].index
df.drop(ars_usd_currency, inplace=True)
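As a side note (my own rewrite, not from the original post), the same result can be expressed more concisely with Series.isin:

# equivalent: keep only rows whose currency is ARS or USD
df = df[df['currency'].isin(['ARS', 'USD'])]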

l1

# keep only ads located in Argentina
L1_only_argentina = df[(df['l1'] != 'Argentina')].index
df.drop(L1_only_argentina, inplace=True)

The lat and lon columns have many NaN values. We will replace those NaNs by geocoding the location columns (l2 and l3). Steps followed:

  • Create a new df by filtering only the rows where lat is null:

    df_NaN_lat_lon = df.loc[df['lat'].isnull()]

  • Export the new df as a .csv:

    df_NaN_lat_lon.to_csv('df_nans.csv')

  • Convert it to .xlsx so I can work with it more comfortably (https://convertio.co/es/csv-xlsx/)
  • Filter that .xlsx for unique values
  • Use the Google Sheets add-on ‘ezGeocode’: you pass it the location and it returns lat & lon
  • Finally, merge both files and replace the nulls with the real lat & lon values (see the sketch below)
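Here is a minimal sketch of that last merge step, assuming the geocoded sheet was exported as 'df_nans_geocoded.csv' (a hypothetical file name) with one lat/lon per unique (l2, l3) location:

# hypothetical file: the ezGeocode output, one geocoded point per location
geo = pd.read_csv('df_nans_geocoded.csv')
geo = geo.drop_duplicates(subset=['l2', 'l3'])[['l2', 'l3', 'lat', 'lon']]

# left-join the geocoded coordinates, then use them only where lat/lon is null
df = df.merge(geo, on=['l2', 'l3'], how='left', suffixes=('', '_geo'))
df['lat'] = df['lat'].fillna(df['lat_geo'])
df['lon'] = df['lon'].fillna(df['lon_geo'])
df = df.drop(columns=['lat_geo', 'lon_geo'])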

Dashboard!

After a few days of work, here is the final dashboard! I tried to make it as simple and self-explanatory as possible.

The main idea was to practice Power BI after cleaning the data with Pandas, and I am very happy with the final result :-)

(Screenshots: final Power BI dashboard)

This post is licensed under CC BY 4.0 by the author.