
Alternative dataset - sold prices #25

Open
osmya opened this issue Apr 24, 2020 · 11 comments
@osmya

osmya commented Apr 24, 2020

Toby

It would be interesting to create an additional class to take data from previously sold properties:
https://www.rightmove.co.uk/house-prices/London-87490.html?soldIn=1&page=1

Thoughts?
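
For anyone who wants to poke at those pages before a proper class exists, here is a minimal sketch (only an illustration: the soldIn and page query parameters are simply taken from the link above, and the three-page range is arbitrary):

import requests

# Quick sanity check of the paginated sold-prices listing linked above.
base_url = "https://www.rightmove.co.uk/house-prices/London-87490.html"
for page in range(1, 4):  # first few pages only, as an illustration
    response = requests.get(base_url, params={"soldIn": 1, "page": page})
    print(page, response.status_code, len(response.content))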

@alandinbedia

This would be so helpful!!!

@toby-p
Owner

toby-p commented Apr 30, 2020

This is a nice idea - I didn't realize this data was available on the website. Will take a look when I get a chance, or feel free to submit a pull request if you want to have a go. I agree it should probably be a separate class or at least not have any impact on the current API.

@alandinbedia

Thanks to you both for the responses. The easiest way to find the information is actually on the 'Market info' tab.
For example: I do my normal search for 'Sale' properties, and each property's details page has a few tabs, such as description, floorplan, map, etc. The last one is 'Market info', which is sourced from the Land Registry files, so it is very useful for seeing what a house now on the market previously sold for.

Please keep me updated if any enhancements on this could be added.
I am also asking a friend of mine (as I am not a developer) to see if he can figure it out, and I will share if we manage.

Many thanks

@oddkiva

oddkiva commented May 3, 2020

Hello all, I am @alandinbedia 's friend.

I initially followed @osmya 's input to find the list of sold properties. I haven't formatted the history of transactions yet.

import json
import requests

from bs4 import BeautifulSoup

import pandas as pd


class SoldProperties:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format and
        returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content: str):
        """Method to scrape data from a single page of search results. Used
        iteratively by the `get_results` method to scrape data from every page
        returned by the search."""
        soup = BeautifulSoup(request_content, features='lxml')

        start = 'window.__PRELOADED_STATE__ = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')

        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)

        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['results']['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page {}'.format(1))
        print('- Parsing data from page {}'.format(1))
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page {}'.format(1))
            raise

        final_results = properties

        current = page_data['pagination']['current']
        last = page_data['pagination']['last']
        if current == last:
            return pd.DataFrame.from_records(final_results)

        # Scrape each page
        for page in range(current + 1, last + 1):
            print('Scraping page {}'.format(page))

            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&page={page}"

            # Make the request:
            print('- Downloading data from page {}'.format(page))
            status_code, page_content = self._request(p_url)

            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print('Failed to access page {}'.format(page))
                continue

            # Create a temporary DataFrame of page results:
            print('- Parsing data from page {}'.format(page))
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print('Failed to get property data from page {}'.format(page))
                continue

            # Append the list of properties.
            final_results += properties

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)

        return property_data_frame


# 1. Adapt the URL here
#    Go to: https://www.rightmove.co.uk/house-prices.html
#    Type the region of interest.
#    Click on 'List view' so that Rightmove shows the results in the browser.
#    Copy the corresponding link here.
url = "https://www.rightmove.co.uk/house-prices/detail.html?country=england&locationIdentifier=REGION%5E70417&searchLocation=London+Fields&radius=0.25"

# 2. Launch the data scraping here.
sold_properties = SoldProperties(url)

# 3. Save the results somewhere.
sold_properties.table.to_csv('sold_properties.csv')
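
Since the history of transactions is left unformatted, here is a rough sketch of flattening it (not part of the class; it assumes every row of `table` has a non-empty `transactions` list of dicts with 'displayPrice' and 'dateSold' keys, which is what the __PRELOADED_STATE__ JSON appears to contain):

import pandas as pd

# Sketch: one row per individual sale in the transaction history.
df = sold_properties.table
flat = df.explode('transactions').reset_index(drop=True)
flat = pd.concat(
    [flat.drop(columns=['transactions']),
     pd.json_normalize(flat['transactions'].tolist())],
    axis=1)
flat['displayPrice'] = (flat['displayPrice']
                        .str.replace('£', '', regex=False)
                        .str.replace(',', '', regex=False)
                        .astype(int))
flat['dateSold'] = pd.to_datetime(flat['dateSold'], format='%d %b %Y')
flat.to_csv('sold_transactions.csv', index=False)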

@oddkiva

oddkiva commented May 3, 2020

Since that was not really what was needed, I had a look at the other page and easily re-adapted the class to retrieve the list of properties for sale.

import json
import requests

from bs4 import BeautifulSoup

import pandas as pd


class PropertiesForSale:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format and
        returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content: str):
        """Method to scrape data from a single page of search results. Used
        iteratively by the `get_results` method to scrape data from every page
        returned by the search."""
        soup = BeautifulSoup(request_content, features='lxml')

        start = 'window.jsonModel = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')

        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)

        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page {}'.format(1))
        print('- Parsing data from page {}'.format(1))
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page {}'.format(1))
            raise

        final_results = properties

        page = 2
        last = int(page_data['pagination']['last'])
        chunk_size = int(page_data['pagination']['next'])

        # Scrape each page
        while True:
            next_index = (page - 1) * chunk_size
            if next_index > last:
                print('Finished!')
                break

            print('Scraping page {}'.format(page))

            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&index={page * chunk_size}"

            # Make the request:
            print('- Downloading data from page {}'.format(page))
            status_code, page_content = self._request(p_url)

            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print('Failed to access page {}'.format(page))
                page += 1
                continue

            # Create a temporary DataFrame of page results:
            print('- Parsing data from page {}'.format(page))
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print('Failed to get property data from page {}'.format(page))
                page += 1
                continue

            # Append the list of properties.
            final_results += properties

            # Go to the next page.
            page += 1

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)

        return property_data_frame


# 1. Adapt the URL here
#    Go to: https://www.rightmove.co.uk
#    Run a normal property-for-sale search for the region of interest.
#    Copy the URL of the results page here.
url = "https://www.rightmove.co.uk/property-for-sale/find.html?searchType=SALE&locationIdentifier=REGION%5E70417&insId=1&radius=0.0&minPrice=&maxPrice=&minBedrooms=&maxBedrooms=&displayPropertyType=&maxDaysSinceAdded=&_includeSSTC=on&sortByPriceDescending=&primaryDisplayPropertyType=&secondaryDisplayPropertyType=&oldDisplayPropertyType=&oldPrimaryDisplayPropertyType=&newHome=&auction=false"

# 2. Launch the data scraping here.
properties_for_sale = PropertiesForSale(url)

# 3. Save the results somewhere.
properties_for_sale.table.to_csv('properties_for_sale.csv')

HTH

@toby-p
Owner

toby-p commented May 5, 2020

Thanks for doing this, will take a proper look at it when I get the time to add it to the package.

@p2327

p2327 commented May 14, 2020

I am going to look into this now.

@p2327

p2327 commented May 15, 2020

@davidok8 hey, I think your first class is incredibly useful, especially as it gives the exact postcode and price history.

I think the output could be more streamlined, so I'll work on that and open a PR @toby-p

I am not sure what the PropertiesForSale class does?

edit: grammar

@oddkiva

oddkiva commented May 15, 2020

Glad to know the first one is useful.

The second class merely returns the list of properties not yet sold. True, it does not contain any market information (probably the market history of the area).

On the other hand, you can find complementary information (GPS location, size in square feet, addedOrReduced, area in development, etc.). You have to reformat the data...
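
As a rough illustration of that reformatting (a sketch only: it assumes the nested columns such as `location` and `price` hold plain dicts; adjust the column names to whatever the jsonModel actually returns):

import pandas as pd

# Sketch: expand whichever nested dict columns are present into flat
# '<column>.<key>' columns alongside the ordinary fields.
df = properties_for_sale.table
nested_cols = [c for c in ('location', 'price') if c in df.columns]
flat = pd.concat(
    [df.drop(columns=nested_cols)]
    + [pd.json_normalize(df[c].tolist()).add_prefix(f'{c}.') for c in nested_cols],
    axis=1)
flat.to_csv('properties_for_sale_flat.csv', index=False)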


@p2327

p2327 commented May 18, 2020

@davidok8

Check the latest commit in this PR.

You can access a processed df by invoking .processed_data on a SoldProperties object.

Note that some of the code is redundant - I will trim it later.

Changes to your class:

# imports
import ast
import re
import datetime as dt
from datetime import datetime
from lxml import html
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

# Env
address_pattern = r'([\s\S]+?)([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?[0-9]?\s[0-9]?[A-Za-z][A-Za-z])'
outwardcode_pattern = r'([A-Za-z][A-Za-z]?[0-9][0-9]?[A-Za-z]?[0-9]?)'

# Helpers
def extract_price(series):
    prices = []
    for entry in series:
        prices.append(int(entry[0]['displayPrice'].strip('£').replace(',', '')))
    return prices


def extract_date(series):
    dates = []
    for entry in series:
        dates.append(datetime.strptime(entry[0]['dateSold'], '%d %b %Y'))
    return dates


def extract_tenure(series):
    tenures = []
    for entry in series:
        tenures.append(entry[0]['tenure'])
    return tenures


def extract_coords(series, lat=False):
    coords = []
    if lat:
        for entry in series:
            coords.append(entry['lat'])
    else:
        for entry in series:
            coords.append(entry['lng'])
    return coords

class SoldProperties:

    def __init__(self, url: str, get_floorplans: bool = False):
        """Initialize the scraper with a URL from the results of a property
        search performed on www.rightmove.co.uk.

        Args:
            url (str): full HTML link to a page of rightmove search results.
            get_floorplans (bool): optionally scrape links to the individual
                floor plan images for each listing (be warned this drastically
                increases runtime so is False by default).
        """
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    @staticmethod
    def _request(url: str):
        r = requests.get(url)
        return r.status_code, r.content

    def refresh_data(self, url: str = None, get_floorplans: bool = False):
        """Make a fresh GET request for the rightmove data.

        Args:
            url (str): optionally pass a new HTML link to a page of rightmove
                search results (else defaults to the current `url` attribute).
            get_floorplans (bool): optionally scrape links to the individual
                floorplan images for each listing (this drastically increases
                runtime so is False by default).
        """
        url = self.url if not url else url
        self._status_code, self._first_page = self._request(url)
        self._url = url
        self._validate_url()
        self._results = self._get_results()

    def _validate_url(self):
        """Basic validation that the URL at least starts in the right format and
        returns status code 200."""
        real_url = "{}://www.rightmove.co.uk/{}/find.html?"
        protocols = ["http", "https"]
        types = ["property-to-rent", "property-for-sale", "new-homes-for-sale"]
        urls = [real_url.format(p, t) for p in protocols for t in types]
        conditions = [self.url.startswith(u) for u in urls]
        conditions.append(self._status_code == 200)
        if not any(conditions):
            raise ValueError(f"Invalid rightmove search URL:\n\n\t{self.url}")

    @property
    def url(self):
        return self._url

    @property
    def table(self):
        return self._results

    def _parse_page_data_of_interest(self, request_content: str):
        """Method to scrape data from a single page of search results. Used
        iteratively by the `get_results` method to scrape data from every page
        returned by the search."""
        soup = BeautifulSoup(request_content, features='lxml')

        start = 'window.__PRELOADED_STATE__ = '
        tags = soup.find_all(
            lambda tag: tag.name == 'script' and start in tag.get_text())
        if not tags:
            raise ValueError('Could not extract data from current page!')
        if len(tags) > 1:
            raise ValueError('Inconsistent data in current page!')

        json_str = tags[0].get_text()[len(start):]
        json_obj = json.loads(json_str)

        return json_obj

    def _get_properties_list(self, json_obj):
        return json_obj['results']['properties']

    def _get_results(self):
        """Build a Pandas DataFrame with all results returned by the search."""
        print('Scraping page {}'.format(1))
        print('- Parsing data from page {}'.format(1))
        try:
            page_data = self._parse_page_data_of_interest(self._first_page)
            properties = self._get_properties_list(page_data)
        except ValueError:
            print('Failed to get property data from page {}'.format(1))
            raise

        final_results = properties

        current = page_data['pagination']['current']
        last = page_data['pagination']['last']
        if current == last:
            return pd.DataFrame.from_records(final_results)

        # Scrape each page
        for page in range(current + 1, last + 1):
            print('Scraping page {}'.format(page))

            # Create the URL of the specific results page:
            p_url = f"{str(self.url)}&page={page}"

            # Make the request:
            print('- Downloading data from page {}'.format(page))
            status_code, page_content = self._request(p_url)

            # Requests to scrape lots of pages eventually get status 400, so:
            if status_code != 200:
                print('Failed to access page {}'.format(page))
                continue

            # Create a temporary DataFrame of page results:
            print('- Parsing data from page {}'.format(page))
            try:
                page_data = self._parse_page_data_of_interest(page_content)
                properties = self._get_properties_list(page_data)
            except ValueError:
                print('Failed to get property data from page {}'.format(page))
                continue

            # Append the list of properties.
            final_results += properties

        # Transform the final results into a table.
        property_data_frame = pd.DataFrame.from_records(final_results)

        def process_data(rawdf):
            df = rawdf.copy()
        
            address = df['address'].str.extract(address_pattern, expand=True).to_numpy()
            outwardcodes = df['address'].str.extract(outwardcode_pattern, expand=True).to_numpy()
            
            df = (df.drop(['address', 'images', 'hasFloorPlan', 'detailUrl'], axis=1)
                    .assign(address=address[:, 0])
                    .assign(postcode=address[:, 1])
                    .assign(outwardcode=outwardcodes[:, 0])
                    #.assign(transactions=df.transactions.apply(ast.literal_eval))
                    #.assign(location=df.location.apply(ast.literal_eval))
                    .assign(last_price=lambda x: extract_price(x.transactions))
                    .assign(sale_date=lambda x: extract_date(x.transactions))
                    .assign(tenure=lambda x: extract_tenure(x.transactions))
                    .assign(lat=lambda x: extract_coords(x.location, lat=True))
                    .assign(lng=lambda x: extract_coords(x.location))
                    .drop(['transactions', 'location'], axis=1)
            )
            return df
     
        #return process_data(property_data_frame)

        return property_data_frame

    @property
    def processed_data(self):
        df = self._results
    
        address = df['address'].str.extract(address_pattern, expand=True).to_numpy()
        outwardcodes = df['address'].str.extract(outwardcode_pattern, expand=True).to_numpy()
        
        df = (df.drop(['address', 'images', 'hasFloorPlan', 'detailUrl'], axis=1)
                .assign(address=address[:, 0])
                .assign(postcode=address[:, 1])
                .assign(outwardcode=outwardcodes[:, 0])
                #.assign(transactions=df.transactions.apply(ast.literal_eval))
                #.assign(location=df.location.apply(ast.literal_eval))
                .assign(last_price=lambda x: extract_price(x.transactions))
                .assign(sale_date=lambda x: extract_date(x.transactions))
                .assign(tenure=lambda x: extract_tenure(x.transactions))
                .assign(lat=lambda x: extract_coords(x.location, lat=True))
                .assign(lng=lambda x: extract_coords(x.location))
                .drop(['transactions', 'location'], axis=1)
                .reindex(columns=['last_price', 
                                'sale_date', 
                                'propertyType',
                                'bedrooms',
                                'bathrooms', 
                                'tenure', 
                                'address', 
                                'postcode', 
                                'outwardcode', 
                                'lat', 
                                'lng'])
        )
        return df
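
For reference, a minimal usage sketch of the modified class (the URL is just the London Fields sold-prices search used earlier in the thread):

# Build the scraper, then pull the cleaned table via the new property.
url = "https://www.rightmove.co.uk/house-prices/detail.html?country=england&locationIdentifier=REGION%5E70417&searchLocation=London+Fields&radius=0.25"
sold_properties = SoldProperties(url)
processed = sold_properties.processed_data  # last_price, sale_date, postcode, lat, lng, ...
processed.to_csv('sold_properties_processed.csv', index=False)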
     

@andrewwilso

This is extremely useful. Is it possible to include the get_floorplans option as in the main class?
