
Tempo with LSTM #165

Open
korosig opened this issue Mar 10, 2022 · 0 comments
korosig commented Mar 10, 2022

I have a usage question.

I have a (2D) time series that I want to feed to, for example, an LSTM model. Normally I first convert it to a 3D array and then pass it to the model, and this is done in memory with numpy. But what happens when I manage my BIG file with Spark? The solutions I've seen so far all do the work in Spark and then convert the result to a 3D numpy array at the end, and that puts everything in memory... or am I thinking about this wrong?
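To be concrete, the 2D -> 3D step I mean is just adding a timesteps axis so the array has the (samples, timesteps, features) shape an LSTM layer expects; the shapes below are only illustrative:

import numpy as np

x2d = np.random.rand(100, 400)               # 100 rows, 400 features
x3d = x2d.reshape((x2d.shape[0], 1, 400))    # add a timesteps axis of length 1
print(x3d.shape)                             # (100, 1, 400)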

A common Spark LSTM solution looks like this:

# create a fake dataset
import random
import numpy as np
from keras import models
from keras import layers

# `spark` is the active SparkSession (e.g. from Databricks or the pyspark shell)
data = []
for node in range(0,100):
    for day in range(0,100):
        data.append([str(node),
                     day,
                     random.randrange(15, 25, 1),
                     random.randrange(50, 100, 1),
                     random.randrange(1000, 1045, 1)])
        
df = spark.createDataFrame(data,['Node', 'day','Temp','hum','press'])
 
# transform the data
df_trans  = df.groupBy('day').pivot('Node').sum()
df_trans = df_trans.orderBy(['day'], ascending=True)
 
# make train/test data
trainDF = df_trans[df_trans.day < 70]
testDF = df_trans[df_trans.day >= 70]
 
 
################## we lost the SPARK #############################
# create train/test array
trainArray = np.array(trainDF.select(trainDF.columns).collect())
testArray = np.array(testDF.select(trainDF.columns).collect())
 
# drop the target columns
xtrain = trainArray[:, 0:-1]
xtest = testArray[:, 0:-1]
# take the target column
ytrain = trainArray[:, -1:]
ytest = testArray[:, -1:]
 
# reshape 2D to 3D
xtrain = xtrain.reshape((xtrain.shape[0], 1, xtrain.shape[1]))
xtest = xtest.reshape((xtest.shape[0], 1, xtest.shape[1]))
 
# build the model
model = models.Sequential()
model.add(layers.LSTM(1, input_shape=(1,400)))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
 
 
# train the model
loss = model.fit(xtrain, ytrain, batch_size=10, epochs=100)

My problem with this is: if my Spark data has millions of rows and thousands of columns, then the "# create train/test array" step, i.e. the collect() calls, pulls everything onto the driver and causes a memory overflow.
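To make that concrete, this is roughly the kind of batched feeding I imagine instead of a full collect(). It continues from the snippet above; the helper name, window size and column layout are just placeholders, not Tempo code:

def window_batches(sdf, day_col='day', window=10, n_days=70):
    """Yield (x, y) numpy batches, materializing only one window of days at a time."""
    while True:                                    # Keras expects the generator to loop forever
        for start in range(0, n_days, window):
            pdf = (sdf.filter((sdf[day_col] >= start) &
                              (sdf[day_col] < start + window))
                      .toPandas())                 # only this slice of days leaves Spark
            arr = pdf.to_numpy(dtype='float32')
            x = arr[:, :-1].reshape((arr.shape[0], 1, arr.shape[1] - 1))
            y = arr[:, -1:]
            yield x, y

steps_per_epoch = 70 // 10                         # one step per day-window
loss = model.fit(window_batches(df_trans), steps_per_epoch=steps_per_epoch, epochs=100)

But this still re-runs a Spark filter per window and does nothing clever with the time dimension, which is why I'm asking about Tempo.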

How do I use Tempo with LSTM to solve the 2D - > 3D problem?

@tnixon tnixon added the question Further information is requested label Apr 5, 2022
@tnixon tnixon self-assigned this Apr 5, 2022