Bigpatent dataset versions

Hi,

I’m trying to make sense of the various versions of the bigpatent dataset.

In particular, the version in huggingface’s datasets library is uncased and tokenized. Did you derive that version from the raw untokenized dataset here, or was it provided by the authors?

My confusion comes from the fact that there are different versions of the dataset. For example, the Pegasus authors preserved casing for their experiments. As far as I understand, their version of the dataset was obtained from the raw dataset by using this preprocessing. However, I fail when trying to build the dataset*, even when building only a single code split.

Also, the huggingface dataset seems to lack some (useless) sections (e.g. those relating to prior art), so I was wondering how you obtained the clean version of the dataset in the first place.

Can anybody help?

*I have seen from the TFDS issues that huggingface had the same problem when trying to replicate the results from Pegasus. Did you manage to solve that problem?

Hi! We’re using the same source data that TFDS used to use.

What error do you get when trying to build the dataset?

I tried to build the Pegasus dataset with:

python -m tensorflow_datasets.scripts.download_and_prepare --datasets=big_patent/g

However, even though I am using a machine with 60+ GB of RAM, the process gets killed, apparently because it runs out of memory. Do you know if any prebuilt version of the cased (Pegasus) dataset is available?

I downloaded the preprocessed prebuilt huggingface dataset from here (from the link on github).

My confusion comes from situations like the following.
For the patent with publication_number US-2007088503-A1, the description in the huggingface version is:

“referring now to fig1 and 2 , a service technician visiting a customer service location is provided with a technician input device 2 for receiving and transmitting information related to a disruption or interruption of service at the service location . […]”

However, if I look at the raw data used by TFDS (here, from this script), the description contains a lot of text before that paragraph:

“CROSS-REFERENCE TO RELATED APPLICATIONS \n [0001] This is a continuation of application Ser. No. 10/445,861 filed May 27, 2003, which is a continuation of application Ser. No. 10/032,853 filed Oct. 25, 2001 and now U.S. Pat. No. 6,772,064. \n \n \n BACKGROUND OF THE INVENTION \n [0002] 1. Field of the Invention \n [0003] The present methods and systems generally relate to processing and transmitting information to facilitate providing service in a telecommunications network. The methods and systems discussed herein more particularly relate to use of global satellite positioning to facilitate processing and transmission of information associated with telecommunications service locations and routing travel between more than one such service location. \n [0004] 2. Description of the Related Art \n [0005] Efficient and effective customer service is an essential requirement for commercial enterprises to compete successfully in today's business world. In the telecommunications industry, for example, providing customer service is an important part of sustaining market share in view of the many competitors in the industry. Customers whose telephone service, for example, is interrupted or disconnected for even a relatively short period of time may desire to seek an alternative source for service, especially if the interruption or disconnection is not addressed by a quick and effective customer service response. \n [0006] One important aspect of providing customer service is maintaining accurate and complete knowledge of the customer's location. Computer systems and databases that provide customer addresses often only provide vague references, however, to the exact location of the customer. Such customer addresses typically do not include information of sufficient specificity to permit efficient identification of a service location associated with the customer. 
In the context of a technician transporting a vehicle to a customer's service location, for example, this lack of sufficient service location information can generate excessive driving time and slow response time. Where the response time is unacceptably high, the lack of sufficient service location information can result in delayed or missed customer commitments. It can be appreciated that such delayed or missed customer commitments can cause a commercial enterprise to lose valuable customers. \n [0007] What are needed, therefore, are methods and systems for acquiring information associated with a customer's service location. Such methods and systems are needed to obtain, for example, a latitude and longitude associated with the customer's service location. In one aspect, if latitude and longitude information could be collected by a service technician when the customer's service location is visited, those coordinates could then be used to find the customer at a later date. Moreover, if latitude and longitude coordinates could be made available in a database associated with that specific customer, the coordinates could be used to assist in determining the service location of that customer. Such service location information could permit a service technician to drive directly to the customer service location with little or no time lost searching for the service location. \n [0008] What are also needed are methods and systems for providing a service technician with directions, such as driving directions between two or more service locations. Such directions could be employed to route travel from a first customer service location to a second customer service location. It can be seen that such directions would further reduce the possibility of error in locating a customer service location and thereby enhance customer service response time. \n SUMMARY \n [0009] Methods and systems are provided for obtaining information related to a customer service location. 
One embodiment of the method includes requesting at least one set of coordinates associated with the customer service location; accessing a technician server to direct a global satellite positioning system to obtain the set of coordinates for the customer service location; obtaining the coordinates and updating one or more databases with the coordinates. The coordinates may include at least one of a latitude and a longitude associated with the customer service location. One embodiment of a system for obtaining information related to a customer service location includes an input device configured for use by a service technician at the customer service location. A technician server is included in the system for receiving data transmissions from the input device. The technician server is in communication with a global positioning satellite system for determining a set of coordinates associated with the input device. Computer-readable media embodiments are also presented in connection with these methods and systems. \n [0010] In addition, methods and systems are discussed herein for generating directions for a service technician traveling from a first customer service location to at least a second customer service location. One embodiment of the method includes obtaining through a technician server at least one set of \u201cfrom\u201d coordinates associated with the first customer service location and at least one set of \u201cto\u201d coordinates associated with the second customer location; transmitting the \u201cfrom\u201d and \u201cto\u201d coordinates to a mapping system; and, generating directions in the mapping system based on the \u201cto\u201d and the \u201cfrom\u201d coordinates. One system embodiment includes an input device configured for use by a service technician at a first customer service location. A technician server is provided for receiving data transmissions from the input device. 
A global positioning satellite system, which is configured for determining at least one set of \u201cfrom\u201d coordinates associated with the input device is provided for use on an as needed basis. At least one database is included in the system for storing a \u201cto\u201d set of coordinates associated with the second customer service location and the \u201cfrom\u201d set of coordinates. The system further includes a mapping system operatively associated with the input device for generating travel directions based on the \u201cfrom\u201d and \u201cto\u201d coordinates. At least one of the sets of coordinates includes latitude and a longitude data. Computer-readable media embodiments of these methods and systems are also provided. \n \n \n BRIEF DESCRIPTION OF THE FIGURES \n [0011] FIG. 1 is a schematic diagram depicting one embodiment of a system for obtaining, processing, and transmitting information related to providing customer service at a customer service location; \n [0012] FIG. 2 is a schematic diagram depicting a portion of the system of FIG. 1 in more detail; \n [0013] FIG. 3 is a process flow diagram showing one embodiment of a method for obtaining, transmitting and processing information related to providing service at a customer service location; \n [0014] FIG. 4 is a schematic diagram depicting one embodiment of a system for obtaining, processing, and transmitting information related to providing customer service at a customer service location; and, \n [0015] FIG. 5 is a progress flow diagram depicting one embodiment of a method for obtaining, processing, and transmitting information related to providing customer service at a customer service location. \n \n \n DETAILED DESCRIPTION \n [0016] Referring now to FIGS. 1 and 2 , a service technician visiting a customer service location is provided with a technician input device 2 for receiving and transmitting information related to a disruption or interruption of service at the service location. […]”
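
For anyone who wants to pull up the same raw record, here is a minimal sketch of how I look it up without loading a whole split into memory. It assumes the raw archives have already been extracted into a directory tree of gzipped files with one JSON record per line, and that each record carries the publication_number and description fields mentioned above. The function names are my own and are not part of TFDS or the huggingface library.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, Optional


def iter_records(gz_path: Path) -> Iterator[dict]:
    """Yield one JSON record per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def find_patent(root: Path, publication_number: str) -> Optional[dict]:
    """Scan every *.gz file under root and return the first matching record.

    Assumes (not verified against every file) that records expose a
    'publication_number' key, as in the example quoted in this thread.
    """
    for gz_path in sorted(root.rglob("*.gz")):
        for record in iter_records(gz_path):
            if record.get("publication_number") == publication_number:
                return record
    return None
```

Calling find_patent(Path("path/to/extracted/raw"), "US-2007088503-A1") (the path is a placeholder) reads the files line by line, so only one record is held in memory at a time.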

I am not sure how the TFDS version is built, but if it uses the code in the script linked above, I cannot see where the leading text is removed.
I am asking because I would like to use Pegasus and replicate some results, but I am failing to get the right data in the first place.
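
To make the difference concrete, below is a rough sketch of the kind of filtering that would turn the raw description into something closer to the huggingface one. This is only my guess, not the preprocessing used by the dataset authors, TFDS or Pegasus: it assumes the description contains a "DETAILED DESCRIPTION" heading on its own line and simply drops everything up to and including that heading, then removes the [0016]-style paragraph numbers. It does not lowercase the text or reproduce the "fig1" tokenization seen in the huggingface version.

```python
import re

# Assumption: the section heading sits on its own line, in upper case,
# as in the raw example quoted above.
_DETAILED_HEADING = re.compile(r"^[ \t]*DETAILED DESCRIPTION[^\n]*$", re.MULTILINE)

# Paragraph markers such as "[0016]" in the raw text.
_PARAGRAPH_NUMBER = re.compile(r"\[\d{4}\]\s*")


def strip_leading_sections(description: str) -> str:
    """Drop the text before the detailed-description heading, if one is found.

    If no heading matches, the whole description is kept. In both cases,
    paragraph numbers are removed and runs of whitespace are collapsed.
    """
    match = _DETAILED_HEADING.search(description)
    body = description[match.end():] if match else description
    body = _PARAGRAPH_NUMBER.sub("", body)
    return " ".join(body.split())
```

On the record above, this would leave text starting at "Referring now to FIGS. 1 and 2", which matches the start of the huggingface description up to casing and tokenization. Patents that use a different heading for that section would pass through unchanged, so it is a diagnostic aid rather than a replacement for the real preprocessing.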